WebSocket Protocol Explained

WebSocket is the backbone of most realtime web applications. But unless you've dug into RFC 6455 or inspected raw frames in Wireshark, you probably have only a vague mental model of how it actually works. This post fills in the gaps, from the HTTP upgrade handshake through frame anatomy, opcodes, fragmentation, ping/pong, and the close sequence.

RFC 6455 in a Nutshell

RFC 6455, published in 2011, is the specification that defines WebSocket as a protocol. Before it existed, developers approximated bidirectional communication over HTTP with workarounds such as long-polling, or by pairing one-way server push (for example SSE) with separate client requests. RFC 6455 standardized a proper full-duplex channel over a single TCP connection.

The spec defines three core things:

The opening handshake — how an HTTP connection is upgraded to a WebSocket connection.
The framing format — how messages are packaged into binary frames with a specific header structure.
The closing handshake — how either peer can initiate a graceful teardown.

Everything else — subprotocols, authentication, message routing — is left to the application layer.

The Opening Handshake

The connection starts as an ordinary HTTP/1.1 request. The client sends an Upgrade header signaling it wants to switch protocols:

GET /socket HTTP/1.1
Host: api.apinator.io
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

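The server responds with 101 Switching Protocols and a Sec-WebSocket-Accept value derived from the client's key: it appends the fixed GUID 258EAFA5-E914-47DA-95CA-C5AB0DC85B11 to the Sec-WebSocket-Key, takes the SHA-1 hash, and base64-encodes the result. A correct value shows the server actually understood the WebSocket handshake: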

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

After this exchange, the TCP connection stays open and both sides speak the WebSocket framing protocol. There is no more HTTP — the connection is now a raw binary channel.
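
As a rough sketch, the Sec-WebSocket-Accept value in the response above can be reproduced with a few lines of Python. The function name compute_accept is made up for illustration; the GUID is the fixed constant from RFC 6455.

```python
import base64
import hashlib

# Fixed GUID that RFC 6455 defines for the accept-key derivation.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(sec_websocket_key: str) -> str:
    """Derive Sec-WebSocket-Accept from a client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# The key from the request above yields the accept value from the response.
print(compute_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

A client should run the same computation and reject the connection if the server's header does not match.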

Frame Anatomy

Every WebSocket message is carried in one or more frames. Each frame has a compact binary header followed by the payload. Here is what a frame looks like laid out bit by bit:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len |    Extended payload length    |
|I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
|N|V|V|V|       |S|             |   (if payload len==126/127)   |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - -+
|     Extended payload length continued, if payload len == 127  |
+ - - - - - - - - - - - - - - -+-------------------------------+
|                               | Masking-key, if MASK set to 1 |
+-------------------------------+-------------------------------+
| Masking-key (continued)       |          Payload Data         |
+-------------------------------- - - - - - - - - - - - - - - -+
|                     Payload Data continued ...                |
+---------------------------------------------------------------+

Breaking down the fields:

FIN bit — set to 1 if this is the final (or only) frame in a message. Set to 0 when a message is fragmented across multiple frames.
RSV1, RSV2, RSV3 — reserved bits that must be 0 unless a negotiated extension uses them. The permessage-deflate compression extension uses RSV1 to signal that a message is compressed.
Opcode (4 bits) — tells the receiver what kind of frame this is.
MASK bit — set to 1 if the payload is masked. All frames sent from client to server must be masked. Frames from server to client must not be.
Payload length (7 bits) — if the value is 0–125, that is the payload length in bytes. If the value is 126, the next 2 bytes hold a 16-bit length. If the value is 127, the next 8 bytes hold a 64-bit length. This three-tier scheme keeps small frames compact.
Masking key (32 bits) — present only when the MASK bit is 1. The client generates a random 4-byte key and XORs each payload byte against it cyclically. This prevents certain proxy cache poisoning attacks, not eavesdropping.
Payload — the actual data, after unmasking if necessary.
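
To make the field layout concrete, here is a minimal sketch of a frame encoder and decoder in Python. The names Frame, apply_mask, encode_frame, and decode_frame are invented for this example, and the decoder assumes it is handed exactly one complete frame; a real implementation would also deal with partial reads, the RSV bits, and size limits.

```python
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    fin: bool
    opcode: int
    masked: bool
    payload: bytes


def apply_mask(data: bytes, key: bytes) -> bytes:
    """XOR each byte with the 4-byte key, cycling through the key."""
    return bytes(b ^ key[i % 4] for i, b in enumerate(data))


def encode_frame(opcode: int, payload: bytes, *, fin: bool = True,
                 mask_key: Optional[bytes] = None) -> bytes:
    """Serialize one frame. Pass a 4-byte mask_key for client-to-server frames."""
    first = (0x80 if fin else 0x00) | (opcode & 0x0F)  # RSV1-3 left at 0
    mask_bit = 0x80 if mask_key is not None else 0x00
    n = len(payload)
    if n <= 125:
        header = struct.pack("!BB", first, mask_bit | n)
    elif n <= 0xFFFF:
        header = struct.pack("!BBH", first, mask_bit | 126, n)  # 16-bit length
    else:
        header = struct.pack("!BBQ", first, mask_bit | 127, n)  # 64-bit length
    if mask_key is None:
        return header + payload
    return header + mask_key + apply_mask(payload, mask_key)


def decode_frame(buf: bytes) -> Frame:
    """Parse one complete frame from buf (no partial-read handling)."""
    first, second = buf[0], buf[1]
    length = second & 0x7F
    offset = 2
    if length == 126:
        (length,) = struct.unpack_from("!H", buf, offset)
        offset += 2
    elif length == 127:
        (length,) = struct.unpack_from("!Q", buf, offset)
        offset += 8
    masked = bool(second & 0x80)
    key = b""
    if masked:
        key = buf[offset:offset + 4]
        offset += 4
    payload = buf[offset:offset + length]
    if masked:
        payload = apply_mask(payload, key)  # XOR is its own inverse
    return Frame(bool(first & 0x80), first & 0x0F, masked, payload)
```

A client would call something like encode_frame(0x1, b"Hello", mask_key=os.urandom(4)), drawing a fresh random key for every frame, while a server would omit mask_key entirely.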

Opcodes Explained

The 4-bit opcode field identifies the frame type:

Opcode	Name	Description
`0x0`	Continuation	Carries the next fragment of a fragmented message
`0x1`	Text	A UTF-8 encoded text message (or first fragment)
`0x2`	Binary	A binary message (or first fragment)
`0x8`	Close	Initiates or acknowledges a connection close
`0x9`	Ping	Keepalive probe sent by either peer
`0xA`	Pong	Response to a ping

Opcodes 0x3 through 0x7 and 0xB through 0xF are reserved for future use.
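
In code, the table above maps naturally onto an enum. The sketch below uses hypothetical names (Opcode, is_control, classify) and relies on one detail of the spec: control opcodes occupy 0x8 through 0xF, so they all have the high bit of the 4-bit field set.

```python
from enum import IntEnum


class Opcode(IntEnum):
    CONTINUATION = 0x0
    TEXT = 0x1
    BINARY = 0x2
    CLOSE = 0x8
    PING = 0x9
    PONG = 0xA


def is_control(opcode: int) -> bool:
    """Control opcodes (0x8-0xF) have the high bit of the 4-bit field set."""
    return bool(opcode & 0x8)


def classify(opcode: int) -> str:
    """Return the opcode name, or 'reserved' for values the table leaves undefined."""
    try:
        return Opcode(opcode).name
    except ValueError:
        return "reserved"
```

An endpoint that receives a reserved opcode it has not negotiated should treat it as a protocol error rather than ignore it.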

Text vs Binary Frames

Text frames (0x1) must carry valid UTF-8. The receiver is required to validate this and close the connection with status code 1007 if it receives invalid UTF-8. This makes text frames safe for JSON and other text-based formats.
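
A receiver-side check for this might look like the following sketch. The function name and its (text, close_code) return shape are assumptions made for the example, not any library's API.

```python
from typing import Optional, Tuple

INVALID_PAYLOAD = 1007  # close status for invalid UTF-8 in a text message


def decode_text_message(payload: bytes) -> Tuple[Optional[str], Optional[int]]:
    """Return (text, None) for valid UTF-8, or (None, 1007) if validation fails."""
    try:
        return payload.decode("utf-8", errors="strict"), None
    except UnicodeDecodeError:
        return None, INVALID_PAYLOAD
```

Note that validation applies to the reassembled message: a multi-byte character may legitimately be split across two fragments, so checking each frame in isolation would reject valid input.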

Binary frames (0x2) carry raw bytes with no encoding constraint. The meaning is entirely application-defined. Common choices include:

MessagePack — compact binary JSON alternative
Protocol Buffers — Google's schema-driven binary format
CBOR — Concise Binary Object Representation
Custom formats — anything goes, as long as both sides agree

For most applications, JSON over text frames is the pragmatic default. If you need to optimize for payload size or serialization speed at scale, binary frames with a structured format are worth the added complexity.

Fragmentation

A single logical message can be split across multiple frames. This is useful when the sender does not know the total message size upfront — for example, when streaming data from a database or file.

The sequence works like this:

The first frame has FIN=0 and opcode 0x1 (or 0x2).
Each intermediate frame has FIN=0 and opcode 0x0 (continuation).
The final frame has FIN=1 and opcode 0x0.

Frame 1: FIN=0, opcode=0x1, payload="Hello, "
Frame 2: FIN=0, opcode=0x0, payload="world"
Frame 3: FIN=1, opcode=0x0, payload="!"

The receiver reassembles these into a single message: "Hello, world!". Control frames (ping, pong, close) can be interleaved between data fragments — they are never fragmented themselves.
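
The reassembly rules can be sketched as a small function over already-decoded frames. The (fin, opcode, payload) tuple shape and the name reassemble are assumptions for this example; control frames are simply skipped here, whereas a real endpoint would act on them as they arrive.

```python
from typing import Iterable, List, Optional, Tuple

CONTINUATION = 0x0


def reassemble(frames: Iterable[Tuple[bool, int, bytes]]) -> List[Tuple[int, bytes]]:
    """Join (fin, opcode, payload) data frames into (opcode, message) pairs."""
    messages: List[Tuple[int, bytes]] = []
    current_opcode: Optional[int] = None  # opcode of the message in progress
    parts: List[bytes] = []
    for fin, opcode, payload in frames:
        if opcode >= 0x8:
            continue  # ping/pong/close interleaved between fragments
        if opcode == CONTINUATION:
            if current_opcode is None:
                raise ValueError("continuation frame without a message in progress")
        else:
            if current_opcode is not None:
                raise ValueError("new data frame before previous message finished")
            current_opcode = opcode
        parts.append(payload)
        if fin:
            messages.append((current_opcode, b"".join(parts)))
            current_opcode, parts = None, []
    return messages


frames = [
    (False, 0x1, b"Hello, "),
    (True, 0x9, b""),          # a ping arriving mid-message
    (False, 0x0, b"world"),
    (True, 0x0, b"!"),
]
print(reassemble(frames))  # [(1, b'Hello, world!')]
```

The message type comes from the first frame only; continuation frames carry no type of their own.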

Most applications do not need to fragment manually. Sending discrete messages is simpler and sufficient in the vast majority of cases.

Ping and Pong

Either peer can send a ping frame at any time. The recipient must respond with a pong frame carrying the same payload data. This serves as a keepalive mechanism: it confirms the TCP connection is alive and the remote peer is still responsive.

Servers typically send pings on a fixed interval, commonly between 20 and 60 seconds, and close the connection if no pong arrives within a timeout window. This catches silently dead TCP connections where no RST or FIN ever reaches the endpoint (common on mobile networks and through certain NAT devices).

Clients can also send pings, but in practice most leave keepalive responsibility to the server. Libraries generally handle this automatically.
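
The server-side keepalive logic described above boils down to a small state machine. The sketch below is one possible shape, not how any particular library does it: the Keepalive class, its method names, and the 10-second timeout are all assumptions, and timestamps are passed in explicitly so the logic stays independent of any event loop.

```python
from typing import Optional


class Keepalive:
    """Decides when to ping and when to give up, given timestamps in seconds."""

    def __init__(self, now: float, interval: float = 30.0, timeout: float = 10.0) -> None:
        self.interval = interval          # gap between a pong and the next ping
        self.timeout = timeout            # how long to wait for a pong
        self.next_ping_at = now + interval
        self.awaiting_since: Optional[float] = None

    def on_pong(self, now: float) -> None:
        """Record a pong and schedule the next ping."""
        self.awaiting_since = None
        self.next_ping_at = now + self.interval

    def tick(self, now: float) -> str:
        """Return 'ping', 'close', or 'idle' for the current moment."""
        if self.awaiting_since is not None:
            if now - self.awaiting_since >= self.timeout:
                return "close"  # no pong within the timeout window
            return "idle"
        if now >= self.next_ping_at:
            self.awaiting_since = now
            return "ping"
        return "idle"
```

A connection handler would call tick() from a timer, send a ping frame when it returns "ping", call on_pong() when a pong arrives, and tear the connection down on "close".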

The Close Handshake

A clean WebSocket shutdown requires both sides to exchange close frames. The sequence:

One peer sends a close frame (opcode 0x8), optionally including a 2-byte status code and a UTF-8 reason string.
The other peer sends its own close frame in response.
After sending a close frame, each side stops sending new data frames.
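Once both close frames have been exchanged, the TCP connection is closed. RFC 6455 recommends that the server close it first, with the client waiting for the server to do so, regardless of which side initiated the close handshake.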

Common status codes:

Code	Meaning
`1000`	Normal closure — the purpose for which the connection was established has been fulfilled
`1001`	Going away — the server is shutting down or the browser tab is closing
`1006`	Abnormal closure — the connection was closed without a close frame (TCP-level drop)
`1007`	Invalid frame payload data — received invalid UTF-8 in a text frame
`1011`	Internal server error — the server terminated the connection due to an unexpected condition

Status code 1006 is notable because it is never sent in an actual close frame — it is synthesized by the client library to indicate that the connection dropped without a proper close sequence.
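
The close frame payload itself is simple: a 2-byte big-endian status code followed by an optional UTF-8 reason. Here is a hedged sketch of building and parsing it; the function names are hypothetical, and the set of codes that may never be sent (1005 and 1015 alongside 1006) follows RFC 6455's list of reserved codes.

```python
import struct
from typing import Optional, Tuple

# Codes reserved for local reporting; RFC 6455 forbids sending them on the wire.
NEVER_SENT = {1005, 1006, 1015}


def build_close_payload(code: int = 1000, reason: str = "") -> bytes:
    """Encode a close frame payload: 2-byte status code plus UTF-8 reason."""
    if code in NEVER_SENT:
        raise ValueError(f"status code {code} must not be sent in a close frame")
    payload = struct.pack("!H", code) + reason.encode("utf-8")
    if len(payload) > 125:
        raise ValueError("control frame payloads are limited to 125 bytes")
    return payload


def parse_close_payload(payload: bytes) -> Tuple[Optional[int], str]:
    """Decode a close payload into (code, reason); code is None if absent."""
    if not payload:
        return None, ""  # the peer sent a close frame with no status code
    if len(payload) == 1:
        raise ValueError("close payload cannot be a single byte")
    (code,) = struct.unpack_from("!H", payload)
    return code, payload[2:].decode("utf-8")
```

Because close is a control frame, the whole payload must fit in 125 bytes, which leaves at most 123 bytes for the reason string.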

Extensions: permessage-deflate

RFC 6455 defines an extension mechanism negotiated during the handshake. The most widely supported extension is permessage-deflate, specified separately in RFC 7692, which compresses the payload of each message using the DEFLATE algorithm (the same one behind gzip).

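The client offers the extension in its handshake request, and the server confirms it by including the extension in its 101 response: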

Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits

When active, the RSV1 bit is set on the first frame of each compressed message. The extension can reduce payload size dramatically for repetitive text data like JSON, sometimes by 60–80%. The tradeoff is CPU overhead on both sides. Whether it is worth enabling depends on your message patterns and infrastructure costs.
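
To get a feel for what the extension does to a payload, the per-message transform can be approximated with Python's zlib module. This is a sketch under assumptions: the function names are invented, and a fresh compressor is created per message, which corresponds to negotiating no context takeover. RFC 7692 has the sender use raw DEFLATE, finish with a sync flush, and strip the trailing 00 00 ff ff bytes; the receiver appends them back before inflating.

```python
import zlib

TAIL = b"\x00\x00\xff\xff"  # empty stored block emitted by a sync flush


def deflate_message(payload: bytes) -> bytes:
    """Compress one message with raw DEFLATE and strip the sync-flush tail."""
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    data = compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)
    if not data.endswith(TAIL):
        raise RuntimeError("unexpected DEFLATE output: missing sync-flush tail")
    return data[:-4]


def inflate_message(data: bytes) -> bytes:
    """Restore the tail and decompress one message."""
    decompressor = zlib.decompressobj(-15)  # negative wbits selects raw DEFLATE
    return decompressor.decompress(data + TAIL)


message = b'{"type": "tick", "symbol": "ABC", "price": 101.5}' * 20
compressed = deflate_message(message)
print(len(message), len(compressed))
```

Running this on your own representative messages is a quick way to estimate whether the savings justify the CPU cost.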

What This Means for Application Developers

Here is the practical upshot: you will almost never deal with any of this directly.

WebSocket client and server libraries handle framing, masking, fragmentation reassembly, ping/pong keepalive, and the close handshake on your behalf. When you call socket.send("hello") on a client, the library wraps that string in a properly structured text frame, sets the FIN bit, masks the payload with a fresh random key, and writes the bytes to the TCP socket. A server-side library does the same minus the masking. When the remote peer sends a ping, the library sends a pong automatically.

What you do need to understand is the abstraction boundary: you send and receive messages, not frames. A message can be text or binary. Messages arrive in order. The connection is full-duplex. If the connection drops, your library fires a close or error event and you reconnect.

Understanding the protocol underneath helps you reason about edge cases — why large binary payloads cost more CPU with compression enabled, why a clean shutdown matters for freeing server resources, why a connection can appear open at the application layer but be silently dead at the TCP layer. That knowledge becomes useful when you are debugging production issues or evaluating infrastructure.

Platforms like Apinator handle the protocol complexity at scale — connection management, keepalive, graceful drains during deploys — so your application code stays focused on messages.