In this post, we’ll look at how binary encodings of JSON speed up repeated queries, then dig into the details of a few real-world encodings and see how VARIANT will ultimately supplant JSON in databases. There’s surprisingly little discussion of the trade-offs these encodings make, even though their design choices have an outsized impact on our workloads. You’ll see how even a straightforward binary encoding can materially improve retrieval performance. By the end, we’ll have a clear mental model of the capabilities–and the limitations–of these encodings we’ve come to take for granted.

JSON Isn’t the Bottleneck–Parsing Is

Parsing JSON text is costly, and binary JSON can avoid a lot of that work. There’s no doubt about it. But just how expensive is it? Let’s put some numbers behind our intuition with a microbenchmark. We’ll be using simdjson, one of the fastest JSON parsers available, with its on-demand API hitting multi-GB/s throughput (workload- and machine-dependent, of course). Here are the results on my 2019 Intel MacBook Pro:

--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------
BM_simdjson_object_ondemand/1000/0/min_time:1.000/min_warmup_time:1.000         2267 ns         2265 ns       498638
BM_simdjson_object_ondemand/1000/250/min_time:1.000/min_warmup_time:1.000       3867 ns         3863 ns       367326
BM_simdjson_object_ondemand/1000/500/min_time:1.000/min_warmup_time:1.000       5414 ns         5409 ns       262413
BM_simdjson_object_ondemand/1000/750/min_time:1.000/min_warmup_time:1.000       7207 ns         7193 ns       197417
BM_simdjson_array_ondemand/1000/0/min_time:1.000/min_warmup_time:1.000          1089 ns         1087 ns      1285819
BM_simdjson_array_ondemand/1000/250/min_time:1.000/min_warmup_time:1.000        1941 ns         1926 ns       803453
BM_simdjson_array_ondemand/1000/500/min_time:1.000/min_warmup_time:1.000        2604 ns         2599 ns       500731
BM_simdjson_array_ondemand/1000/750/min_time:1.000/min_warmup_time:1.000        3428 ns         3413 ns       395744
BM_simdjson_twitter/min_time:1.000/min_warmup_time:1.000                      136784 ns       136620 ns        10839

Benchmark setup

This is intentionally minimal. We exclude data generation, I/O, and parser construction (simdjson recommends reusing the parser in long-running processes) and measure only parse + query. We use three workloads (a sketch of the object-lookup benchmark follows the list):

  1. Object lookup: a JSON object with 1000 sorted keys, each with an int value. Query different key names: "000", "250", "500", "750".

  2. Array lookup: a JSON array of 1000 integers. Query indices: 0, 250, 500, 750.

  3. Realistic workload: twitter.json (617KB), query: doc["statuses"].at(75)["user"]["name"].
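
Here’s a minimal sketch of the object-lookup benchmark, assuming Google Benchmark and simdjson’s exception-throwing On-Demand API. The key-generation helper is illustrative rather than the exact harness behind the numbers above:

#include <benchmark/benchmark.h>
#include <simdjson.h>
#include <cstdio>
#include <string>

// Build {"000":0,"001":1,...}: zero-padded keys, so lexicographic order is sorted.
static std::string make_object(int n) {
    std::string s = "{";
    char buf[16];
    for (int i = 0; i < n; i++) {
        std::snprintf(buf, sizeof buf, "\"%03d\":%d,", i, i);
        s += buf;
    }
    s.back() = '}';  // replace the trailing comma
    return s;
}

static void BM_object_lookup(benchmark::State& state) {
    simdjson::padded_string json(make_object(static_cast<int>(state.range(0))));
    char key[8];
    std::snprintf(key, sizeof key, "%03d", static_cast<int>(state.range(1)));
    simdjson::ondemand::parser parser;  // constructed once, outside the timed loop
    for (auto _ : state) {
        auto doc = parser.iterate(json);  // parse + query measured together
        int64_t v = doc[key];             // On-Demand scans fields until `key` matches
        benchmark::DoNotOptimize(v);
    }
}
BENCHMARK(BM_object_lookup)->Args({1000, 0})->Args({1000, 750});
BENCHMARK_MAIN();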

What the numbers reveal

A realistic workload like twitter.json takes ~136,784 ns to parse and answer a single query. That’s roughly 137 seconds for 1 million rows–enough time to make a cup of coffee! Any modern analytical database can process billions of rows in that amount of time. More interestingly, the synthetic cases show lookup time growing linearly for later keys/indices in both the object and the array: BM_simdjson_array_ondemand/1000/750 is about 3x slower than BM_simdjson_array_ondemand/1000/0 (1000 is the total length, 750 is the lookup index).

That seems wrong: a lookup in sorted keys should ideally take O(log N) time, and array access should be O(1). The real takeaway is that plain-text JSON has no random access–you can’t look up later elements without walking past the earlier ones, so you always end up decoding data you don’t even need. So yes, simdjson is fast (multi-GB/s parsing), but imagine how much faster things get when we don’t have to parse at all on every retrieval!

Designing a Minimal Binary Encoding

So how do we design a simple binary encoding for JSON? First, we need a way to represent every JSON value type. Per the official spec, JSON supports: string, number, boolean, null, array and object. A compact type tag can cover all of them:

enum class VarType : uint8_t {
    null        = 0,
    object      = 1,
    array       = 2,

    bool_true   = 3,
    bool_false  = 4,
    string      = 5,
    number      = 6
};

Next, we need a node format that can represent both tree structure (arrays/objects) and leaf data (strings/numbers/bools/null). Here’s a minimal structure:

typedef struct VarNode {
    // The first field is guaranteed to be `type` for quick inspection
    VarType type : 4;

    // For object/array: total elements
    // For string/number: byte length of payload
    // For true/false/null: unused (0)
    uint32_t length : 28;

    // Data layout following this header:
    // - object/array: [VarMetaEntry * length], then the child nodes and their
    //   payloads (for objects, each element is a key node followed by its value node)
    // - string: raw UTF-8 data (no null terminator)
    // - number: the original number string, for simplicity
} VarNode;

Finally, arrays and objects need an index that lets us jump directly to children without decoding everything before them. Each entry stores a relative offset to the child node:

// Metadata for array/object entries.
// - Array: O(1) access by index.
// - Object: metadata can be sorted by key for O(log N) lookup.
typedef struct VarMetaEntry {
    uint32_t value_ptr; // relative offset from this field to the target VarNode
} VarMetaEntry;

Visualizing the layout

A few example encodings to illustrate the design:

{"b": 200, "a": 100}  // (note: keys will be sorted!)
[VarNode type=object length=2][VarMetaEntry * 2: offsets for "a" and "b"]
[VarNode type=string length=1][data="a"][VarNode type=number length=3][data="100"]
[VarNode type=string length=1][data="b"][VarNode type=number length=3][data="200"]

[100,true,"string"]
[VarNode type=array length=3][VarMetaEntry * 3: offsets for each element]
[VarNode type=number length=3][data="100"]
[VarNode type=bool_true]
[VarNode type=string length=6][data="string"]

{"a": [100,true,"string"]}
[VarNode type=object length=1][VarMetaEntry * 1]
[VarNode type=string length=1][data="a"]
[VarNode type=array length=3]
[VarMetaEntry * 3]
[VarNode type=number length=3][data="100"]
[VarNode type=bool_true]
[VarNode type=string length=6][data="string"]

There are a lot of interesting aspects even in this simple design. For booleans, the value is encoded in the type tag itself. For null, the entire VarNode is just 0, which makes bitwise comparison trivial. Numbers are preserved in their original string form, so there’s no double-guessing (pun intended). Instead, we can decide what to cast to at query time.
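
Since the payload is kept as text, casting happens lazily. Here’s a sketch, assuming the 4-byte VarNode header is immediately followed by the number’s original text (std::from_chars for floating point needs a reasonably recent standard library):

#include <charconv>

// Decode a number node's payload on demand; the caller picks the target type.
static double as_double(const VarNode* n) {
    const char* p = reinterpret_cast<const char*>(n + 1);  // payload follows the header
    double out = 0;
    std::from_chars(p, p + n->length, out);
    return out;
}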

The index metadata is simply a list of relative offsets to element nodes. This allows O(1) random access into arrays, and O(log N) key lookup in objects via binary search (assuming keys are sorted). All structure and data are self-contained by design. This matters for workloads where you run multiple queries and repeatedly extract nested children: you can slice out a subtree by grabbing its start and end, without needing extra side tables. In other words, you can create a std::string_view-style view directly over an existing binary JSON document.
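
To make those access paths concrete, here’s a minimal lookup sketch over this layout. It assumes the bitfields pack VarNode into a 4-byte header, that the metadata entries sit immediately after it, and (for objects) that each value node directly follows its key’s payload, as in the examples above; bounds checking is elided:

#include <cstdint>
#include <string_view>

// Resolve metadata entry i to its child node via the relative offset.
static const VarNode* child_at(const VarNode* node, uint32_t i) {
    const auto* entries = reinterpret_cast<const VarMetaEntry*>(node + 1);
    const auto* base = reinterpret_cast<const uint8_t*>(&entries[i]);
    return reinterpret_cast<const VarNode*>(base + entries[i].value_ptr);
}

// O(1) array access: a single offset read, no matter the index.
static const VarNode* array_get(const VarNode* arr, uint32_t i) {
    return i < arr->length ? child_at(arr, i) : nullptr;
}

// View a string node's payload without copying.
static std::string_view text_of(const VarNode* s) {
    return {reinterpret_cast<const char*>(s + 1), s->length};
}

// Skip a leaf node's header and payload (valid for string/number nodes).
static const VarNode* after(const VarNode* leaf) {
    return reinterpret_cast<const VarNode*>(
        reinterpret_cast<const uint8_t*>(leaf + 1) + leaf->length);
}

// O(log N) object lookup: entries point at key nodes in sorted order.
static const VarNode* object_get(const VarNode* obj, std::string_view key) {
    uint32_t lo = 0, hi = obj->length;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        const VarNode* k = child_at(obj, mid);
        int cmp = text_of(k).compare(key);
        if (cmp == 0) return after(k);  // the value sits right after the key
        if (cmp < 0) lo = mid + 1;
        else hi = mid;
    }
    return nullptr;
}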

Putting the design to work

With this simple design, we can parse the JSON once and store the binary encoding for repeated retrieval. Measuring the same workload–this time counting only the lookups–produces:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
BM_binaryjson_object/1000/0/min_time:1.000/min_warmup_time:1.000         48.2 ns         48.0 ns     28791182
BM_binaryjson_object/1000/249/min_time:1.000/min_warmup_time:1.000       17.5 ns         17.4 ns     75834313
BM_binaryjson_object/1000/499/min_time:1.000/min_warmup_time:1.000       12.8 ns         12.8 ns    113965680
BM_binaryjson_object/1000/749/min_time:1.000/min_warmup_time:1.000       16.7 ns         16.6 ns     82031570
BM_binaryjson_array/1000/0/min_time:1.000/min_warmup_time:1.000          6.56 ns         6.55 ns    207271683
BM_binaryjson_array/1000/249/min_time:1.000/min_warmup_time:1.000        6.71 ns         6.69 ns    211780759
BM_binaryjson_array/1000/499/min_time:1.000/min_warmup_time:1.000        6.59 ns         6.58 ns    211003215
BM_binaryjson_array/1000/749/min_time:1.000/min_warmup_time:1.000        6.73 ns         6.71 ns    208599360
BM_binaryjson_twitter/min_time:1.000/min_warmup_time:1.000               58.3 ns         58.1 ns     23862443

Now the query over twitter.json takes only 58.3 ns, which works out to roughly 0.0583 seconds for 1 million rows! That’s a ~2,346x speedup compared to parsing with simdjson on every query. Array access is also constant-time whether you’re reading from the beginning of the array or near the end (subject to cache effects). For objects, binary search reduces key lookup time: hits near the midpoint of the sorted keys (“499”) resolve in the first comparison or two at 12.8 ns, while keys near the edges (like “000”) need the full ~10 comparisons and climb toward 48.2 ns.

Here, our brief journey designing a simple binary encoding ends. You could push it further with SIMD traversal, but it’s time to turn our attention to some real-world cases.

Binary JSON Isn’t One Format

We’ve seen the performance benefits of using binary JSON. But which binary JSON format should we adopt? You might think everyone should just use BSON and call it a day. But BSON comes with some notable trade-offs:

  • Arrays are encoded as objects with numeric string keys (“0”, “1”, …). This simplification adds significant storage overhead (see the byte layout after this list).
  • Object keys are stored as null-terminated C-strings rather than length-prefixed strings, so parsers must scan to determine key length.
  • There’s no random access to a field within an object or an index within an array.
  • Keys aren’t sorted, so two semantically identical JSON documents may not have a deterministic representation and can’t be easily compared in canonical form.
  • The encoding includes several deprecated legacy types (e.g., DBPointer, Symbol).
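
To make the first point concrete, here’s [1, 2, 3] encoded per the BSON spec–a document whose keys are the strings "0", "1", "2"–so each tiny int32 costs seven bytes, two of which go to its key:

[1, 2, 3]
1A 00 00 00              // total document length: 26 bytes
10 30 00 01 00 00 00     // type 0x10 (int32), key "0" + NUL, value 1
10 31 00 02 00 00 00     // type 0x10 (int32), key "1" + NUL, value 2
10 32 00 03 00 00 00     // type 0x10 (int32), key "2" + NUL, value 3
00                       // document terminator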

There are also formats like CBOR and MessagePack, but they don’t provide random access either; they’re optimized for compact transport and serialization, not for fast, repeated lookups.

What about other databases?

Postgres’ JSONB makes a different set of trade-offs. It supports random access for arrays and faster key lookups, but it’s engineered to play nicely with Postgres internals–especially TOAST. For example, its object/array index uses an offset-or-length scheme with a fixed stride: most entries store child lengths (small, repetitive values that compress well), while every fixed number of entries stores an absolute offset so random access stays cheap. Numbers are also parsed into Postgres’ numeric type system.
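
Here’s a simplified sketch of that idea, loosely modeled on Postgres’ getJsonbOffset (the real JEntry packs a has-offset flag and a 28-bit offset-or-length field into a single uint32):

#include <cstdint>

// entries[k] stores child k's length, except at stride anchors, where it
// stores an absolute end offset instead. Walking backwards and summing
// lengths until an anchor is hit costs at most one stride's worth of
// additions per lookup.
static uint32_t child_offset(const uint32_t* entries, const bool* has_off, int index) {
    uint32_t offset = 0;
    for (int k = index - 1; k >= 0; --k) {
        offset += entries[k];   // length of child k (or its absolute end offset)
        if (has_off[k]) break;  // absolute position recovered; stop summing
    }
    return offset;
}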

Meanwhile, YDB has a simple design that’s good enough for most cases, with a tradeoff: because strings and numbers live in a global table, each nested structure is no longer self-contained. You can’t just extract an element–you also need to consult the global table.

The design space: what are we optimizing for?

The point is that every binary encoding is optimized for different goals. Each is a collection of design choices:

Supported Workloads

What kind of queries are we serving? For example, if we mostly extract scalars ($.a.b.c), one simple idea is to flatten all keys and build an inverted index. The keys can still point to a minimal tree structure for verification in case of collision. But if we intend to support extracting a whole array, it’s easier to put all the elements close together and make subtree boundaries easier to compute.

Storage Footprint

Is saving disk space the primary concern? Then we can deduplicate all the string keys and values, put them in a table and reference them from the tree structure. How much indexing metadata can we afford? Also, if we know the maximum document size we need to support, we can use smaller offsets–e.g., 1 byte instead of 4 bytes (the Parquet VARIANT type has this optimization).

Serialization and Materialization Cost

Usually serialization happens once, so are we willing to spend more time up front for better encoding? If so, we can sort keys and build indexes. Materialization matters too because when you query and retrieve a subtree, you may need to materialize the result–or, ideally, provide a “view” into the subtree without new allocations (because runtime memory allocations are expensive and make latency less predictable).

Random Access to Fields

If we want to support O(1) array element access, we’ll need basic indexing. That index should also allow us to skip data without decoding everything that precedes it and make it easy to compute subtree boundaries. For example, if we run $.a[500] on an array of 1,000 elements, we should be able to determine its offset directly.

Key Ordering and Uniqueness

Do we allow duplicate keys? Sorting enables binary search over the keys. It also helps canonicalize the format, so that two JSON strings with the same values but different key order encode to the same binary JSON. This can be useful for quick bytewise comparison.

Data Type Handling

Modern JSON parsers can detect whether a numeric literal is an integer, double, or decimal. Storing values in typed form is more compact and avoids re-parsing at query time, at the cost of doing the conversion up front. Also, do we accept scalar values at the root? BSON assumes the root is a document, which saves it from storing a type byte for the root.

So the answer to which binary JSON encoding to adopt is—you guessed it—it depends on your workload. For analytical workloads with repeated retrieval, random access isn’t a nice-to-have; it’s the baseline. In our case, we landed on a simple encoding close to the one above, shaped by what we could execute efficiently inside our execution engine.

Beyond JSON: Parquet’s VARIANT Design

We’ve been focusing on JSON so far, but it’s really just one instance of semi-structured data. XML is another. The more general abstraction is VARIANT: a single type that can represent a wide range of semi-structured data. Think of it as a generic container–JSON with richer data types. Snowflake has supported VARIANT for years, but its design is closed source. For reference, we can look at Parquet’s VARIANT specification to see what a practical VARIANT encoding looks like. We won’t cover the entire spec here–just a few interesting parts.

First, here’s the value node layout:

           7                                  2 1          0
          +------------------------------------+------------+
value     |            value_header            | basic_type |  <-- value_metadata
          +------------------------------------+------------+
          |                                                 |
          :                   value_data                    :  <-- 0 or more bytes
          |                                                 |
          +-------------------------------------------------+

Source: Parquet’s VARIANT Encoding

Parquet VARIANT defines four “basic types”: primitive, short string, object and array. It includes 20 primitive types, which is a subset of Parquet’s logical types. There’s also an optimization for short strings (under 64 bytes): the string length is packed into the 6 bits of value_header, avoiding a separate length field (which saves a few bytes per string). For objects and arrays, value_header also indicates how wide the indexing offsets are (1 to 4 bytes each)–another small but useful optimization.
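
As a sketch, decoding value_metadata follows directly from the bit layout above (basic type in the low 2 bits; for short strings, value_header is a 6-bit length):

#include <cstdint>

// Basic types per the Parquet VARIANT spec.
enum class BasicType : uint8_t {
    Primitive   = 0,
    ShortString = 1,
    Object      = 2,
    Array       = 3
};

static BasicType basic_type(uint8_t value_metadata) {
    return static_cast<BasicType>(value_metadata & 0x03);  // low 2 bits
}

static uint8_t short_string_length(uint8_t value_metadata) {
    return value_metadata >> 2;  // value_header holds the 6-bit length (0-63)
}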

For arrays, the indexing scheme is straightforward: there are N + 1 offsets, with the last offset representing the end position of the array payload. Objects are slightly more involved. They use two offset lists (the first is technically a list of IDs): the first list contains key references into a string table stored in global metadata, and the second list points to the corresponding value nodes. It’s an interesting choice to deduplicate only keys, rather than all string values—reasonable, given keys typically have much lower cardinality than values. The key table may be sorted (optionally), though the practical benefits aren’t obvious.

Another subtle design choice is that the offset lists are ordered by key, but the value nodes themselves don’t have to be physically sorted. The benefit of this flexibility isn’t obvious: it can make it harder to compute the start and end boundaries of an individual object entry.

One motivation that does carry through is the same one we used earlier: aside from shared metadata, each nested VARIANT value is stored as a self-contained region. So, here’s a real-world binary encoding for semi-structured data: it has more types and knobs, but the overall structure is similar. To push performance further, Parquet VARIANT also supports shredding: partially or fully extracting JSON into columns to improve compression and enable predicate pushdown. The trade-off, of course, is having to maintain consistency between the binary documents and the shredded columns.

Conclusion

It’s 2026: the question of whether you should store JSON in a database is basically settled–the ecosystem has largely converged on binary representations: MongoDB has BSON, Postgres has JSONB (with SQL/JSON path operators), and systems like Snowflake treat semi-structured data as a first-class type via VARIANT. If you can afford the upfront serialization cost, binary JSON is usually the way to go for fast retrieval.

You should expect more from your data systems–they should be smart enough to encode and shred binary JSON when it makes sense. Lakehouses are heading the same way: Apache Iceberg supports VARIANT as of format v3. We started with JSON, but semi-structured data will go beyond JSON with VARIANT. Hopefully our discussion here helps you understand the capabilities and limitations of these binary encodings.

And like most posts these days, we have to mention LLMs at least once. Did you know LLMs also “speak JSON” when returning structured responses through APIs? It feels a bit absurd to ask a powerful natural-language model to squeeze itself back into JSON, but here we are. In fact, it’s an important stage in the pipeline—important enough that we now have TOON: a JSON encoding designed to be more token-efficient and accurate for LLMs, while still staying human-readable.

Semi-structured data is here to stay, and I’m all for it. So—how are you handling semi-structured data in your tooling? Let me know!