
Building a high performance JSON parser

eatonphil
29 replies
1d3h

The walkthrough is very nice: how to do this if you're going to do it.

If you're going for pure performance in a production environment you might take a look at Daniel Lemire's work: https://github.com/simdjson/simdjson. Or the MinIO port of it to Go: https://github.com/minio/simdjson-go.

vjerancrnjak
18 replies
1d3h

If your JSON always looks the same you can also do better than general JSON parsers.

diarrhea
11 replies
1d2h

I wonder: can fast, special-case JSON parsers be dynamically autogenerated from JSON Schemas?

Perhaps some macro-ridden Rust monstrosity that spits out specialised parsers at compile time, dynamically…

marginalia_nu
2 replies
22h27m

A fundamental problem with JSON parsing is that it has variable-length fields that don't encode their length; in a streaming scenario you basically need to keep resizing your buffer until the data fits. If the data is on disk and not streaming you may get away with reading ahead to find the end of the field first, but that's also not particularly fast.

Schemas can't fix that.

robocat
1 replies
17h41m

Only if you are using pointers/slices into the buffer as an optimisation.

Otherwise there is no need to keep a buffer of anything after it has been parsed.

marginalia_nu
0 replies
16h16m

I'm talking about during parsing.

Let's assume I send you a JSON object that is one very long string and nothing else. It's e.g. 1 GB in size. To know you need to allocate a 1GB buffer, you need to first scan it, and then copy it; or keep reallocating the same buffer until it fits.

It's an absurd case, but shorter strings face similar overhead.

dleeftink
2 replies
1d

Somewhat tangentially related, Fabian Iwand posted this regex prefix tree visualiser/generator last week [0], which may offer some inspiration for prototyping auto generated schemas.

atombender
0 replies
20h7m

You forgot to include the link?

minhazm
1 replies
1d2h

For JSON Schema specifically there are some tools like go-jsonschema[1] but I've never used them personally. But you can use something like ffjson[2] in Go to generate a static serialize/deserialize function based on a struct definition.

[1] https://github.com/omissis/go-jsonschema [2] https://github.com/pquerna/ffjson

atombender
0 replies
20h7m

Hey, go-jsonschema is my project. (Someone else just took over maintaining it, though.) It still relies on the standard Go parser; all it does is generate structs with the right types and tags.

mhh__
0 replies
20h24m

It's relatively common in D applications to use the compile-time capabilities to generate a parser at compile time.

galangalalgol
0 replies
1d1h

Doesn't the serde crate's JSON support do precisely this? It generates structs that have Option in all the right places and with all the right types anyway. Seems like the LLVM optimiser can probably do something useful with that even if the serde feature isn't using a priori knowledge out of the schema.

PartiallyTyped
0 replies
23h55m

Pydantic does that to some extent I think.

lylejantzi3rd
3 replies
1d2h

Andreas Fredriksson demonstrates exactly that in this video: https://vimeo.com/644068002

kaladin_1
1 replies
19h40m

I really enjoyed this video even though he lost me with the SIMD code.

gnuvince
0 replies
17h12m

I like this video because there's a lot of good actionable advice before he gets into SIMD code.

ykonstant
0 replies
1d

Very enjoyable video!

loeg
1 replies
1d2h

You might also move to something other than JSON if parsing it is a significant part of your workload.

haswell
0 replies
23h27m

Most of the times I’ve had to deal with JSON performance issues, it involved a 3rd party API and JSON was the only option.

If you’re building something net-new and know you’ll have these problems out the gate, something other than JSON might be feasible, but the moment some other system not in the closed loop needs to work with the data, you’re back to JSON and any associated perf issues.

Thaxll
6 replies
1d2h

The fastest JSON lib in Go is the one done by the company behind TikTok.

ken47
2 replies
1d1h

Fastest at what?

cannonpalms
1 replies
22h56m

> For all sizes of json and all scenarios of usage, Sonic performs best.

The repository has benchmarks

mananaysiempre
0 replies
20h45m

I’m not seeing simdjson in them though? I must be missing something because the Go port of it is explicitly mentioned in the motivation[1] (not the real thing, though).

[1] https://github.com/bytedance/sonic/blob/main/docs/INTRODUCTI...

pizzafeelsright
0 replies
23h36m

Excellent treat vector.

lionkor
1 replies
1d1h

simdjson has not been the fastest for a long long time

jzwinck
0 replies
1d

What is faster? According to https://github.com/kostya/benchmarks#json nothing is.

fooster
0 replies
1d2h

Last time I compared the performance of various json parsers the simd one turned out to be disappointingly slow.

jchw
19 replies
23h49m

Looks pretty good! Even though I've written far too many JSON parsers already in my career, it's really nice to have a reference for how to think about making a reasonable, fast JSON parser, going through each step individually.

That said, I will say one thing: you don't really need to have an explicit tokenizer for JSON. You can get rid of the concept of tokens and integrate parsing and tokenization entirely. This is what I usually do since it makes everything simpler. This is a lot harder to do with the rest of ECMAScript, where you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: the parameter list is mostly a subset of the grammar of a parenthesized expression. Comma is an operator, and for default values, equal is an operator. It isn't until the => does or does not appear that you know for sure!)
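
To make that concrete, here's a minimal sketch of the idea in Go (my own toy code, not the article's): the parser peeks at the next byte and dispatches directly, so there is never a token type or a token buffer. Strings skip escape handling and scalars come back as raw text purely for brevity; objects would follow the same pattern as arrays.

    package main

    import (
        "errors"
        "fmt"
        "strings"
    )

    type parser struct {
        buf []byte
        pos int
    }

    func (p *parser) skipWS() {
        for p.pos < len(p.buf) && strings.ContainsRune(" \t\n\r", rune(p.buf[p.pos])) {
            p.pos++
        }
    }

    // value dispatches on the next byte: no token objects, no separate pass.
    func (p *parser) value() (any, error) {
        p.skipWS()
        if p.pos >= len(p.buf) {
            return nil, errors.New("unexpected end of input")
        }
        switch p.buf[p.pos] {
        case '[':
            return p.array()
        case '"':
            return p.str()
        default: // numbers and true/false/null: raw text in this sketch
            start := p.pos
            for p.pos < len(p.buf) && !strings.ContainsRune(" \t\n\r,]}", rune(p.buf[p.pos])) {
                p.pos++
            }
            return string(p.buf[start:p.pos]), nil
        }
    }

    func (p *parser) str() (any, error) {
        p.pos++ // opening quote
        start := p.pos
        for p.pos < len(p.buf) && p.buf[p.pos] != '"' {
            p.pos++
        }
        if p.pos >= len(p.buf) {
            return nil, errors.New("unterminated string")
        }
        s := string(p.buf[start:p.pos])
        p.pos++ // closing quote
        return s, nil
    }

    func (p *parser) array() (any, error) {
        p.pos++ // '['
        var out []any
        for {
            p.skipWS()
            if p.pos < len(p.buf) && p.buf[p.pos] == ']' {
                p.pos++
                return out, nil
            }
            v, err := p.value()
            if err != nil {
                return nil, err
            }
            out = append(out, v)
            p.skipWS()
            if p.pos < len(p.buf) && p.buf[p.pos] == ',' {
                p.pos++
            }
        }
    }

    func main() {
        p := &parser{buf: []byte(`[1, "two", [true, null]]`)}
        v, err := p.value()
        fmt.Println(v, err)
    }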

coldtea
18 replies
23h13m

What line of work are you in that you've "written far too many JSON parsers already" in your career?!!!

craigching
7 replies
23h4m

Probably anywhere that requires parsing large JSON documents. Off the shelf JSON parsers are notoriously slow on large JSON documents.

zlg_codes
4 replies
10h45m

What on Earth are you storing in JSON that this sort of performance issue becomes an issue?

How big is 'large' here?

I built a simple CRUD inventory program to keep track of one's gaming backlog and progress, and the dumped JSON of my entire 500+ game statuses is under 60kB and can be imported in under a second on decade-old hardware.

I'm having difficulty picturing a JSON dataset big enough to slow down modern hardware. Maybe Gentoo's portage tree if it were JSON encoded?

nly
0 replies
7h29m

I've seen tens of millions of market data events from a single day of trading encoded in JSON and used in various post-trade pipelines.

mxmlnkn
0 replies
5h52m

Chrome trace format files also use JSON; they can become large and are a pain to work with.

Edwinr95
0 replies
7h41m

I've seen people dump and share entire databases in JSON format at my job....

EMM_386
0 replies
8h0m

> What on Earth are you storing in JSON that this sort of performance issue becomes an issue?

I've been in the industry for a while. I've probably left more than one client site muttering "I've seen some things ...".

If it can be done, it will be done. And often in a way that shouldn't have even been considered at all.

Many times, "it works" is all that is needed. Not exactly the pinnacle of software design. But hey, it does indeed "work"!

beached_whale
0 replies
17h28m

There are several that are into the GB/s of performance with various interfaces. Most are just trash for large documents and sit in the allocators far too long, but that isn't inherent to the problem.

ahoka
0 replies
20h40m

Not necessarily, for example Newtonsoft is fine with multiple hundreds of megabytes if you use it correctly. But of course it depends on how large we are talking about.

jchw
5 replies
22h36m

Reasons differ. C++ is a really hard place to be. It's gotten better, but if you can't tolerate exceptions, need code that is as obviously memory-safe as possible, or need to parse incrementally (think SAX style), off-the-shelf options like jsoncpp may not fit the bill.

Handling large documents is indeed another big one. It sort-of fits in the same category as being able to parse incrementally. That said, Go has a JSON scanner you can sort of use for incremental parsing, but in practice I've found it to be a lot slower, so for large documents it's a problem.
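
For anyone who hasn't used it: if the scanner being referred to is encoding/json's Decoder, the incremental interface is Token/More/Decode. A rough sketch of walking a large array element by element, which is the kind of incremental use being described (and which, as noted, tends to lose to a hand-rolled parser on speed):

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    type item struct {
        ID   int    `json:"id"`
        Name string `json:"name"`
    }

    func main() {
        r := strings.NewReader(`[{"id":1,"name":"a"},{"id":2,"name":"b"}]`)
        dec := json.NewDecoder(r)

        if _, err := dec.Token(); err != nil { // consume the opening '['
            panic(err)
        }
        for dec.More() { // decode one element at a time, never the whole array
            var it item
            if err := dec.Decode(&it); err != nil {
                panic(err)
            }
            fmt.Printf("%+v\n", it)
        }
        if _, err := dec.Token(); err != nil { // consume the closing ']'
            panic(err)
        }
    }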

I've done a couple in hobby projects too. One time I did a partial one in Win32-style C89 because I wanted one that didn't depend on libc.

beached_whale
2 replies
17h31m

The large-document case is often fixed by using mmap/VirtualAlloc of the file, but Boost.JSON has a streaming mode, is reasonably fast, and the license is good for pulling into anything. It's not the fastest, but it's faster than RapidJSON with the interface of nlohmann JSON. For most tasks, it does seem that most of the libraries taking a JSON-document approach are wasting a lot of time/memory to get to what we actually want: normal data structures, not a JSON document tree. If we pull that out and parse straight to the data structures there is a lot of win in performance and memory with less/no code, just mappings. That's how I approached it at least.

c-smile
1 replies
13h37m

> that most of the libraries taking a JSON document approach are wasting a lot of time/memory

I agree. That's the same situation as with XML/HTML. In many cases you don't really need to build a DOM or JSOM in memory. If your task is about deserializing some native structures.

This XML scanner of mine does not allocate any memory at all while parsing HTML/XML: https://www.codeproject.com/Articles/14076/Fast-and-Compact-...

It is even simpler than SAX parser.

beached_whale
0 replies
11h37m

For the interesting JSON of a significant size, an iterator/range interface that parses to concrete types works really well. Usually they are large arrays or JSONL-like things.
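
Translating that into Go terms (my sketch, not the poster's C++): an iterator that parses a JSONL / concatenated-JSON stream straight into a concrete type, one value per step, with no intermediate document tree. Assumes Go 1.23+ for range-over-func.

    package main

    import (
        "encoding/json"
        "errors"
        "fmt"
        "io"
        "iter"
        "strings"
    )

    type event struct {
        Symbol string  `json:"symbol"`
        Price  float64 `json:"price"`
    }

    // events yields concrete values straight off the stream (single use).
    func events(r io.Reader) iter.Seq2[event, error] {
        dec := json.NewDecoder(r)
        return func(yield func(event, error) bool) {
            for {
                var e event
                err := dec.Decode(&e)
                if errors.Is(err, io.EOF) {
                    return
                }
                if err != nil {
                    yield(e, err)
                    return
                }
                if !yield(e, nil) {
                    return
                }
            }
        }
    }

    func main() {
        // Two concatenated values; newline-separated JSONL works identically,
        // since Decode skips the whitespace between values.
        in := strings.NewReader(`{"symbol":"AAA","price":1.5} {"symbol":"BBB","price":2.25}`)
        for e, err := range events(in) {
            if err != nil {
                panic(err)
            }
            fmt.Printf("%+v\n", e)
        }
    }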

CyberDildonics
1 replies
2h33m

If you have files that are large enough that json is a problem, why use json in the first place? Why not use a binary format that will be more compact and easier to memory map?

httgp
0 replies
2h10m

Chances are they can’t control that; they’re perhaps provided by a vendor.

marcosdumay
1 replies
20h1m

I've seen "somebody doesn't agree with the standard and we must support it" way too many times, and I've written JSON parsers because of this. (And, of course, it's easy to get some difference with the JSON standard.)

I've had problems with handling streams like the OP on basically every programming language and data-encoding language pair that I've tried. It looks like nobody ever thinks about it (I do use chunking any time I can, but sometimes you can't).

There are probably lots and lots of reasons to write your own parser.

jbiggley
0 replies
19h42m

This reminds me of my favourite quote about standards.

>The wonderful thing about standards is that there are so many of them to choose from.

And, keeping with the theme, this quote may be from Grace Hopper, Andrew Tanenbaum, Patricia Seybold or Ken Olsen.

lgas
0 replies
20h21m

Someone misunderstood the JSONParserFactory somewhere along the line.

crabbone
0 replies
18h6m

I've written JSON parsers because in one instance we had to allow users to keep their formatting but also edit documents programmatically. At the time I couldn't find parsers that did that, but it was a while back.

In another instance, it was easier to parse into some application-specific structures, skipping the whole intermediate generic step (for performance reasons).

With JSON it's easier to convince your boss that you can actually write such a parser because the language is relatively simple (if you overlook botched definitions of basically every element...) So, for example, if the application that uses JSON is completely under your control, you may take advantage of stupid decisions made by JSON authors to simplify many things. More concretely, you can decide that there will never be more than X digits in numbers. That you will never use "null". Or that you will always put elements of the same type into "lists". Or that you will never repeat keys in "hash tables".

mgaunard
18 replies
1d

"It’s unrealistic to expect to have the entire input in memory" -- wrong for most applications

mannyv
6 replies
23h30m

If you're building a library you either need to explicitly call out your limits or do streaming.

I've pumped gigs of JSON data, so a streaming parser is appreciated. Plus streaming shows the author is better at engineering and is aware of the various use cases.

Memory is not cheap or free except in theory.

crabbone
3 replies
17h36m

Here people confidently keep repeating "streaming JSON". What do you mean by that? I'm genuinely curious.

Do you mean XML SAX-like interface? If so, how do you deal with repeated keys in "hash tables"? Do you first translate JSON into intermediate objects (i.e. arrays, hash-tables) and then transform them into application-specific structures, or do you try to skip the intermediate step?

I mean, streaming tokens is kind of worthless on its own. If you are going for a SAX-like interface, you want to be able to go all the way with streaming (i.e. no layer of the code that reads JSON "accumulates" data, especially not indefinitely, before it can be sent to the layer above).

chii
2 replies
16h49m

> If so, how do you deal with repeated keys in "hash tables"?

depending on the parser, behaviour might differ. But looking at https://stackoverflow.com/questions/21832701/does-json-synta... , it seems like the "best" option is to have 'last key wins' as the resolution.

This works fine under a SAX-like interface in a streaming JSON parser: your 'event handler' code will execute for a given key, and a second time for the duplicate.
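
A quick Go illustration of the difference (the map behaviour below is what encoding/json happens to do today; the JSON RFC leaves duplicate-key semantics undefined, and other parsers vary): a document-style parse quietly keeps a single value, while a token-level walk hands every occurrence to your handler, so the policy is up to the application.

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    func main() {
        const doc = `{"cmd": "rm -rf /usr", "cmd": "echo ok"}`

        // Document style: the later value overwrites the earlier one in the
        // map, i.e. "last key wins" (observed behaviour of encoding/json).
        var m map[string]string
        _ = json.Unmarshal([]byte(doc), &m)
        fmt.Println(m["cmd"])

        // Token/SAX style: the walk sees both keys and both values, so
        // rejecting, keeping the first, or keeping the last is your choice.
        dec := json.NewDecoder(strings.NewReader(doc))
        for {
            tok, err := dec.Token()
            if err != nil {
                break
            }
            fmt.Printf("token: %v\n", tok)
        }
    }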

crabbone
0 replies
4h22m

> This works fine

This is a very strange way of using the word "fine"... What if the value that lives in the key triggers some functionality in the application that should never happen due to the semantics you just botched by executing it?

Example:

    {
      "commands": {
        "bumblebee": "rm -rf /usr",
        "bumblebee": "echo 'I have done nothing wrong!'"
      }
    }
With the obvious way to interpret this...

So, you are saying that it's "fine" for an application to execute the first followed by second, even though the semantics of the above are that only the second one is the one that should have an effect?

Sorry, I have to disagree with your "works fine" assessment.

cowl
0 replies
9h25m

"Last key wins" is terrible advice and has serious security implications.

see https://bishopfox.com/blog/json-interoperability-vulnerabili... or https://www.cvedetails.com/cve/CVE-2017-12635/ for concrete examples where this treatment causes security issues.

RFC 7493 (https://datatracker.ietf.org/doc/html/rfc7493) defines a stricter format where duplicate keys are not allowed.

jjeaff
1 replies
21h26m

I guess it's all relative. Memory is significantly cheaper if you get it anywhere but on loan from a cloud provider.

mannyv
0 replies
19h56m

RAM is always expensive no matter where you get it from.

Would you rather do two hours of work or force thousands of people to buy more RAM because your library is a memory hog?

And on embedded systems RAM is at a premium. More RAM = more cost.

isuckatcoding
4 replies
1d

Yes but for applications where you need to do ETL style transformations on large datasets, streaming is an immensely useful strategy.

Sure you could argue go isn’t the right tool for the job but I don’t see why it can’t be done with the right optimizations like this effort.

dh2022
3 replies
23h48m

If performance is important why would you keep large datasets in JSON format?

querulous
0 replies
22h30m

sometimes it's not your data

isuckatcoding
0 replies
20h42m

Usually because the downstream service or store needs it

Maxion
0 replies
20h12m

Because you work at or for some bureaucratic MegaCorp, that does weird things with no real logic behind it other than clueless Dilbert managers making decisions based on LinkedIn blogs. Alternatively desperate IT consultants trying to get something to work with too low of a budget and/or no access to do things the right way.

Be glad you have JSON to parse, and not EDI, some custom delimited data format (with no or old documentation) - or *shudders* you work in the airline industry with SABRE.

ahoka
2 replies
19h31m

Most applications read JSONs from networks, where you have a stream. Buffering and fiddling with the whole request in memory increases latency by a lot, even if your JSON is smallish.

mgaunard
0 replies
19h6m

On a carefully built WebSocket server you would ensure your WebSocket messages all fit within a single MTU.

Rapzid
0 replies
18h46m

Most(most) JSON payloads are probably much smaller than many buffer sizes so just end up all in memory anyway.

e12e
1 replies
22h49m

If you can live with "fits on disk" mmap() is a viable option? Unless you truly need streaming (early handling of early data, like a stream of transactions/operations from a single JSON file?)
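
For what it's worth, on Unix-ish platforms the mmap route is only a few lines in Go. A sketch (error handling kept blunt, the file name is made up, and a portable wrapper package would be the nicer choice in real code):

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        f, err := os.Open("big.json") // hypothetical input file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        st, err := f.Stat()
        if err != nil {
            panic(err)
        }

        // Map the whole file read-only and hand the byte slice to the parser.
        data, err := syscall.Mmap(int(f.Fd()), 0, int(st.Size()),
            syscall.PROT_READ, syscall.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(data)

        var v any
        if err := json.Unmarshal(data, &v); err != nil {
            panic(err)
        }
        fmt.Println("parsed OK")
    }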

mannyv
0 replies
19h55m

In general, JSON comes over the network, so mmap won't really work unless you save to a file. But then you'll run out of disk space.

I mean, you have a 1k, 2k, 4k buffer. Why use more, because it's too much work?

suzzer99
13 replies
23h20m

Can someone explain to me why JSON can't have comments or trailing commas? I really hope the performance gains are worth it, because I've lost 100s of man-hours to those things, and had to resort to stuff like this in package.json:

  "IMPORTANT: do not run the scripts below this line, they are for CICD only": true,
coldtea
9 replies
23h10m

It can't have comments because it didn't originally have comments, so now it's too late. And it originally didn't have comments, because Douglas Crockford thought they could be abused for parsing instructions.

As for not having trailing commas, it's probably a less intentional bad design choice.

That said, if you want commas and comments, and control the parsers that will be used for your JSON, then use JSONC (JSON with comments). VSCode for example does that for its JSON configuration.

tubthumper8
4 replies
22h18m

Does JSONC have a specification or formal definition? People have suggested[1] using JSON5[2] instead for that reason

[1] https://github.com/microsoft/vscode/issues/100688

[2] https://spec.json5.org/

mananaysiempre
3 replies
20h36m

Unfortunately, JSON5 says keys can be ES5 IdentifierName[1]s, which means you must carry around Unicode tables. This makes it a non-option for small devices, for example. (I mean, not really, you technically could fit the necessary data and code in low single-digit kilobytes, but it feels stupid that you have to. Or you could just not do that but then it’s no longer JSON5 and what was the point of having a spec again?)

[1] https://es5.github.io/x7.html#x7.6

debugnik
1 replies
17h45m

Or take the SQLite route, if you don't need to reject strictly invalid JSON5. When they extended SQLite's JSON parser to support JSON5, they specifically relaxed the definition of unquoted keys (in a compatible way) to avoid Unicode tables[1]:

> Strict JSON5 requires that unquoted object keys must be ECMAScript 5.1 IdentifierNames. But large unicode tables and lots of code is required in order to determine whether or not a key is an ECMAScript 5.1 IdentifierName. For this reason, SQLite allows object keys to include any unicode characters greater than U+007f that are not whitespace characters. This relaxed definition of "identifier" greatly simplifies the implementation and allows the JSON parser to be smaller and run faster.

[1]: https://www.sqlite.org/json1.html
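
That relaxed rule is small enough to state directly. Here's my reading of the quoted docs as a Go predicate (an approximation for illustration, not SQLite's code; the ASCII portion in particular is an assumption on my part):

    package main

    import (
        "fmt"
        "unicode"
    )

    // relaxedKeyChar: ASCII letters, digits, '_' and '$' (assumed), plus any
    // character above U+007F that is not whitespace, per the quote above.
    func relaxedKeyChar(r rune) bool {
        switch {
        case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z',
            r >= '0' && r <= '9', r == '_', r == '$':
            return true
        case r > 0x7f && !unicode.IsSpace(r):
            return true
        }
        return false
    }

    func main() {
        for _, r := range []rune{'a', '$', 'é', '世', ' ', '\u3000'} {
            fmt.Printf("%q -> %v\n", r, relaxedKeyChar(r))
        }
    }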

lifthrasiir
0 replies
6h34m

Or take the XML route. Unicode offers several strategies to avoid continuously updating the Unicode table [1], which have been adopted by XML 1.1 and also later editions of XML 1.0. The actual syntax can be as simple as the following (based on XML, converted to ABNF as in RFC 8295):

    name = name-start-char *name-char
    name-start-char =
        %x41-5A / %x5F / %x61-7A / %xC0-D6 / %xD8-F6 / %xF8-2FF / %x370-37D /
        %x37F-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF / %x3001-D7FF /
        %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
    name-char = name-start-char / %x30-39 / %xB7 / %x300-36F / %x203F-2040

In my opinion, this approach is so effective that I believe every programming language willing to support Unicode identifiers but nothing more complex (e.g. case folding or normalization or confusable detection) should use this XML-based syntax. You don't even need to narrow it down, because Unicode explicitly avoided identifier characters outside of those ranges due to the very existence of XML identifiers!

[1] https://unicode.org/reports/tr31/#Immutable_Identifier_Synta...

tubthumper8
0 replies
13h58m

Yeah, definitely for small devices. For things like VS Code's configuration file format (the parent comment) or other such use cases, I don't see a problem

suzzer99
0 replies
17h10m

Yeah if you could get NPM to allow JSONC in package.json, that'd be great.

nigeltao
0 replies
16h20m

If you want commas and comments, another term to search for is JWCC, which literally stands for "JSON With Commas and Comments". HuJSON is another name for it.

mananaysiempre
0 replies
20h44m

Amusingly, it originally did have comments. Removing comments was the one change Crockford ever made to the spec[1].

[1] https://web.archive.org/web/20150105080225/https://plus.goog... (thank you Internet Archive for making Google’s social network somewhat accessible and less than useless)

explaininjs
0 replies
22h37m

JSONC also supports trailing commas. It is, in effect, "JSON with no downsides".

TOML/Yaml always drive me batty with all their obscure special syntax. Whereas it's almost impossible to look at a formatted blob of JSON and not have a very solid understanding of what it represents.

The one thing I might add is multiline strings with `'s, but even that is probably more trouble than it's worth, as you immediately start going down the path of "well let's also have syntax to strip the indentation from those strings, maybe we should add new syntax to support raw strings, ..."

shepherdjerred
0 replies
23h7m

I don't know the historic reason why it wasn't included in the original spec, but at this point it doesn't matter. JSON is entrenched and not going to change.

If you want comments, you can always use jsonc.

semiquaver
0 replies
23h3m

It’s not that it “can’t”, more like it “doesn’t”. Douglas Crockford prioritized simplicity when specifying JSON. Its BNF grammar famously fits on one side of a business card.

Other flavors of JSON that include support for comments and trailing commas exist, but they are reasonably called by different names. One of these is YAML (mostly a superset of JSON). To some extent the difficulties with YAML (like unquoted ‘no’ being a synonym for false) have vindicated Crockford’s priorities.

crabbone
0 replies
17h30m

There's no real reason for that. It just happened like that. These aren't the only bad decisions made by JSON author, and not the worst either.

What you can do is: write comments using pound sign, and rename your files to YAML. You will also get a benefit of a million ways of writing multiline strings -- very confusing, but sometimes useful.

arun-mani-j
7 replies
1d2h

I remember reading a SO question which asks for a C library to parse JSON. A comment was like - C developers won't use a library for JSON, they will write one themselves.

I don't know how "true" that comment is but I thought I should try to write a parser myself to get a feel :D

So I wrote one, in Python - https://arunmani.in/articles/silly-json-parser/

It was a delightful experience though, writing and testing to break your own code with a variety of inputs. :)

janmo
3 replies
1d1h

I wrote a small JSON parser in C myself which I called jsoncut. It just cuts out a certain part of a json file. I deal with large JSON files, but want only to extract and parse certain parts of it. All libraries I tried parse everything, use a lot of RAM and are slow.

Link here, if interested to have a look: https://github.com/rgex/jsoncut

vlovich123
2 replies
1d

The words you’re looking for are SAX-like JSON parser or streaming json parser. I don’t know if there’s any command line tools like the one you wrote that use it though to provide a jq-like interface.

janmo
1 replies
1d

I tried JQ and other command line tools, all were extremely slow and seemed to always parse the entire file.

My parser just reads the file byte by byte until it finds the target, then outputs the content. When that's done it stops reading the file, meaning that it can be extremely fast when the targeted information is at the beginning of the JSON file.

vlovich123
0 replies
18h53m

You're still describing a SAX parser (i.e. streaming). jq doesn't use a SAX parser because it's a multi-pass document editor at its core, hence why I said "jq-like" in terms of supporting a similar syntax for single-pass queries. If you used RapidJSON's SAX parser in the body of your custom code (returning false once you found what you're looking for), I'm pretty sure it would significantly outperform your custom hand-rolled code. Of course, your custom code is very small with no external dependencies and presumably fast enough, so tradeoffs.

xoac
0 replies
1d2h

Good for you but what does this have to do with the article?

masklinn
0 replies
1d1h

> I remember reading a SO question which asks for a C library to parse JSON. A comment was like - C developers won't use a library for JSON, they will write one themselves.

> I don't know how "true" that comment is

Either way it's a good way to get a pair of quadratic loops in your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

jjice
0 replies
1d

I guess there are only so many ways to write a JSON parser, because one I wrote on a train in Python looks very similar!

I thought it would be nice and simple, and it really was, even simpler than I expected. It's a fantastic spec if you need to throw one together yourself, without massive performance considerations.

visarga
6 replies
1d2h

Nowadays I am more interested in a "forgiving" JSON/YAML parser that would recover from LLM errors. Is there such a thing?

gurrasson
1 replies
1d

The jsonrepair tool https://github.com/josdejong/jsonrepair might interest you. It's tailored to fix JSON strings.

I've been looking into something similar for handling partial JSONs, where you only have the first n chars of a JSON. This is common with LLMs when outputs are streamed to reduce latency. If one knows the JSON schema ahead of time, then one can start processing the first fields before the remaining data has fully loaded. If you have to wait for the whole thing to load there is little point in streaming.

Was looking for a library that could do this parsing.

explaininjs
0 replies
1d

See my sibling comment :)

kevingadd
0 replies
1d2h

If the LLM did such a bad job that the syntax is wrong, do you really trust the data inside?

Forgiving parsers/lexers are common in compilers for languages like Rust or C# or TypeScript; you may want to investigate TypeScript in particular since it's applicable to JSON syntax. Maybe you could repurpose their parser.

explaininjs
0 replies
1d

Perhaps not quite what you're asking for, but along the same lines there's this "Incomplete JSON" parser, which takes a string of JSON as it's coming out of an LLM and parses it into as much data as it can get. Useful for building streaming UI's, for instance it is used on https://rexipie.com quite extensively.

https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...

Some example test cases:

    { input: '[{"a": 0, "b":', output: [{ a: 0 }] },
    { input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },

    { input: "[{},", output: [{}] },
    { input: "[{},1", output: [{}, 1] },
    { input: '[{},"', output: [{}, ""] },
    { input: '[{},"abc', output: [{}, "abc"] },

Work could be done to optimize it, for instance adding streaming support. But the cycles consumed either way are minimal for LLM-output-length-constrained JSON.

Fun fact: as best I can tell, GPT-4 is entirely unable to synthesize code to accomplish this task. Perhaps that will change as this implementation is made public, I do not know.

_dain_
0 replies
1d1h

halloween was last week

RichieAHB
0 replies
1d1h

I feel like trying to infer valid JSON from invalid JSON is a recipe for garbage. You’d probably be better off doing a second pass with the “JSON” through the LLM but, as the sibling commenter said, at this point even the good JSON may be garbage …

kevingadd
5 replies
1d3h

I'm surprised there's no way to say 'I really mean it, inline this function' for the stuff that didn't inline because it was too big.

The baseline whitespace count/search operation seems like it would be MUCH faster if you vectorized it with SIMD, but I can understand that being out of scope for the author.
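
For context, the scalar baseline under discussion is roughly the shape below (a guess at it, not the article's code): one table lookup and a branch per byte, which is exactly the work a SIMD classification pass would do 16 to 64 bytes at a time.

    package main

    import "fmt"

    // One entry per byte value; only the four JSON whitespace bytes are true.
    var isWS = [256]bool{' ': true, '\t': true, '\n': true, '\r': true}

    func countWhitespace(buf []byte) int {
        n := 0
        for _, b := range buf {
            if isWS[b] {
                n++
            }
        }
        return n
    }

    func main() {
        fmt.Println(countWhitespace([]byte(`{ "a": [1, 2, 3] }`)))
    }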

mgaunard
4 replies
1d1h

Of course you can force-inline.

cbarrick
3 replies
1d

Obviously you can manually inline functions. That's what happened in the article.

The comment is about having a directive or annotation to make the compiler inline the function for you, which Go does not have. IMO, the pre-inline code was cleaner. It's a shame that the compiler could not optimize it.

There was once a proposal for this, but it's really against Go's design as a language.

https://github.com/golang/go/issues/21536

mgaunard
2 replies
22h52m

You can in any systems programming language.

Go is mostly a toy language for cloud people.

cbarrick
1 replies
21h58m

> toy language

You may be surprised to hear that Go is used in a ton of large scale critical systems.

mgaunard
0 replies
20h4m

I don't consider cloud technology a critical system.

jensneuse
3 replies
1d

I've taken a very similar approach and built a GraphQL tokenizer and parser (amongst many other things) that's also zero memory allocations and quite fast. In case you'd like to check out the code: https://github.com/wundergraph/graphql-go-tools

markl42
1 replies
1d

How big of an issue is this for GQL servers where all queries are known ahead of time (allowlist) - i.e. you can cache/memoize the AST parsing and this is only a perf issue for a few minutes after the container starts up

Or does this bite us in other ways too?

jensneuse
0 replies
1d

I've been building GraphQL API gateways / routers for 5+ years now. It would be nice if trusted documents or persisted operations were the default, but the reality is that a lot of people want to open up their GraphQL to the public. For that reason we've built a fast parser, validator, normalizer and many other things to support these use cases.

romshark
0 replies
8h28m

You might also want to check out this abomination of mine: https://github.com/graph-guard/gqlscan

I've held a talk about this, unfortunately it wasn't recorded. I've tried to squeeze as much out of Go as I could and I went crazy doing that :D

forrestthewoods
2 replies
22h59m

> Any (useful) JSON decoder code cannot go faster that this.

That line feels like a troll. Cunningham’s Law in action.

You can definitely go faster than 2 Gb/sec. In a word, SIMD.

shoo
1 replies
19h47m

we could re-frame by distinguishing problem statements from implementations

Problem A: read a stream of bytes, parse it as JSON

Problem B: read a stream of bytes, count how many bytes match a JSON whitespace character

Problem B should require fewer resources* to solve than problem A. So in that sense problem B is a relaxation of problem A, and a highly efficient implementation of problem B should be able to process bytes much more efficiently than an "optimal" implementation of problem A.

So in this sense, we can probably all agree with the author that counting whitespace bytes is an easier problem than the full parsing problem.

We're agreed that the author's implementation (half a page of go code that fits on a talk slide) to solve problem B isn't the most efficient way to solve problem B.

I remember reading somewhere the advice that to set a really solid target for benchmarking, you should avoid measuring the performance of implementations and instead try to estimate a theoretical upper bound on performance, based on say a simplified model of how the hardware works and a simplification of the problem -- that hopefully still captures the essence of what the bottleneck is. Then you can compare any implementation to that (unreachable) theoretical upper bound, to get more of an idea of how much performance is still left on the table.

* for reasonably boring choices of target platform, e.g. amd64 + ram, not some hypothetical hardware platform with surprisingly fast dedicated support for JSON parsing and bad support for anything else.

forrestthewoods
0 replies
15h31m

Everything you said is totally reasonable. I'm a big fan of napkin math and theoretical upper bounds on performance.

simdjson (https://github.com/simdjson/simdjson) claims to fully parse JSON on the order of 3 GB/sec. Which is faster than OP's Go whitespace parsing! These tests are running on different hardware so it's not apples-to-apples.

The phrase "cannot go faster than this" is just begging for a "well ackshully". Which I hate to do. But the fact that there is an existence proof of Problem A running faster in C++ SIMD than OP's Probably B scalar Go is quite interesting and worth calling out imho. But I admit it doesn't change the rest of the post.

wslh
0 replies
18h23m

I remember this JSON benchmark page from RapidJSON [1].

[1] https://rapidjson.org/md_doc_performance.html

wood_spirit
0 replies
18h53m

My own lessons from writing fast JSON parsers involve a lot of language-specific things, but here are some generalisations:

Avoid heap allocations in tokenising. Have a tokeniser that is a function that returns a stack-allocated struct or an int64 token that is a packed field describing the start offset, length and type of the token.
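
As a sketch of that packed-token idea (the field widths below are arbitrary choices for illustration: 8 bits of kind, 32 bits of start offset, 24 bits of length), the tokeniser can return a plain uint64 instead of a heap object:

    package main

    import "fmt"

    type tokKind uint8

    const (
        tokString tokKind = iota + 1
        tokNumber
        tokPunct
    )

    // packTok squeezes kind, start offset and length into one uint64.
    func packTok(kind tokKind, start, length int) uint64 {
        return uint64(kind)<<56 | uint64(start)<<24 | uint64(length)&0xFFFFFF
    }

    func unpackTok(t uint64) (kind tokKind, start, length int) {
        return tokKind(t >> 56), int((t >> 24) & 0xFFFFFFFF), int(t & 0xFFFFFF)
    }

    func main() {
        buf := []byte(`{"name":"ada"}`)
        t := packTok(tokString, 2, 4) // the bytes `name` inside buf
        kind, start, length := unpackTok(t)
        fmt.Println(kind, string(buf[start:start+length]))
    }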

Avoid heap allocations in parsing: support a getString(key String) type interface for clients that want to chop up a buffer.

For deserialising to objects where you know the fields at compile time, generate a switch on key length before comparing string values.
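
And the key-length trick from the previous point, roughly (hand-written here; a code generator would emit the same shape from the struct definition), so most keys are rejected with an integer comparison before any string comparison happens:

    package main

    import "fmt"

    type user struct {
        ID    int
        Name  string
        Email string
    }

    // setField groups candidate keys by length; the string compare only runs
    // for keys that already matched on length.
    func setField(u *user, key string, val any) {
        switch len(key) {
        case 2:
            if key == "id" {
                u.ID = val.(int)
            }
        case 4:
            if key == "name" {
                u.Name = val.(string)
            }
        case 5:
            if key == "email" {
                u.Email = val.(string)
            }
        }
    }

    func main() {
        var u user
        setField(&u, "id", 7)
        setField(&u, "name", "ada")
        setField(&u, "email", "ada@example.com")
        fmt.Printf("%+v\n", u)
    }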

My experience in data pipelines that process lots of json is that choice of json library can be a 3-10x performance difference and that all the main parsers want to allocate objects.

If the classes you are serialising or deserialising are known at compile time then Jackson (Java) does a good job, but you can get a 2x boost with careful coding and profiling.

Whereas if you are parsing arbitrary JSON then all the mainstream parsers want to do lots of allocations that a more intrusive parser you write yourself can avoid, and you can make massive performance wins if you are processing thousands or millions of objects per second.

thomasvn
0 replies
30m

In what cases would an application need to regularly parse gigabytes of JSON? Wouldn't it be advantageous for the app to get that data into a DB?

rurban
0 replies
4h24m

This is a very poor and overly simplified text on writing basic JSON parsers, not touching any of the topics involved in writing actually fast JSON parsers, such as non-copying tokenizers (e.g. jsmn), word-wise tokenizers (simdjson) and fast numeric conversions (fast_double_parser et al.).

romshark
0 replies
9h44m

I've recently held a talk (https://youtu.be/a7VBbbcmxyQ?si=0fGVxfc4qmKMVCXk) about github.com/romshark/jscan that I've been working on. It's a performance-oriented JSON iterator / tokenizer you might want to take a look at if interested in high performance zero allocation JSON parsing in Go.

rexfuzzle
0 replies
1d3h

Great to see a shout out to Phil Pearl! Also worth looking at https://github.com/bytedance/sonic

peterohler
0 replies
1d3h

You might want to take a look at https://github.com/ohler55/ojg. It takes a different approach with a single pass parser. There are some performance benchmarks included on the README.md landing page.

nwpierce
0 replies
1d

Writing a json parser is definitely an educational experience. I wrote one this summer for my own purposes that is decently fast: https://github.com/nwpierce/jsb

ncruces
0 replies
22h25m

It's possible to improve over the standard library with better API design, but it's not really possible to do a fully streaming parser that doesn't half fill structures before finding an error and bailing out in the middle, which is another explicit design constraint for the standard library.

mannyv
0 replies
1d1h

These are always interesting to read because you get to see runtime quirks. I'm surprised there was so much function call overhead, for example. And it's interesting you can bypass range checking.

The most important thing, though, is the process: measure then optimize.

lamontcg
0 replies
23h56m

Wish I wasn't 4 or 5 uncompleted projects deep right now and had the time to rewrite a monkey parser using all these tricks.

isuckatcoding
0 replies
1d

This is fantastically useful.

Funny enough I stumbled upon your article just yesterday through google search.

hknmtt
0 replies
9h31m

what does this bring over goccy's json encoder?

hintymad
0 replies
22h20m

How is this compared to Daniel Lemire's simdjson? https://github.com/simdjson/simdjson

flaie
0 replies
8h45m

This was a very good read, and I did learn some nice tricks, thank you very much.

evmar
0 replies
23h24m

In n2[1] I needed a fast tokenizer and had the same "garbage factory" problem, which is basically that there's a set of constant tokens (like json.Delim in this post) and then strings which cause allocations.

I came up with what I think is a kind of neat solution, which is that the tokenizer is generic over some T and takes a function from byteslice to T and uses T in place of the strings. This way, when the caller has some more efficient representation available (like one that allocates less) it can provide one, but I can still unit test the tokenizer with the identity function for convenience.

In a sense this is like fusing the tokenizer with the parser at build time, but the generic allows layering the tokenizer such that it doesn't know about the parser's representation.

[1] https://github.com/evmar/n2
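
A rough Go-generics rendering of the same trick (my sketch, not n2's code): the tokenizer is parameterised over T and takes a func([]byte) T, so callers plug in their own representation and a unit test can pass a plain copy-to-string function.

    package main

    import "fmt"

    // token carries a toy classification plus a caller-chosen value type.
    type token[T any] struct {
        kind byte // 'w' = word, 'p' = punctuation
        val  T
    }

    func tokenize[T any](buf []byte, conv func([]byte) T) []token[T] {
        var toks []token[T]
        i := 0
        for i < len(buf) {
            switch c := buf[i]; {
            case c == ' ':
                i++
            case c >= 'a' && c <= 'z':
                j := i
                for j < len(buf) && buf[j] >= 'a' && buf[j] <= 'z' {
                    j++
                }
                toks = append(toks, token[T]{kind: 'w', val: conv(buf[i:j])})
                i = j
            default:
                toks = append(toks, token[T]{kind: 'p', val: conv(buf[i : i+1])})
                i++
            }
        }
        return toks
    }

    func main() {
        // "Identity"-style conversion for tests: just copy the bytes to a string.
        toks := tokenize([]byte("foo = bar"), func(b []byte) string { return string(b) })
        fmt.Println(toks)
    }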

cratermoon
0 replies
13h18m

My favorite bit about this is his reference to John Ousterhout's "Define errors out of existence": youtu.be/bmSAYlu0NcY?si=WjC1ouEN1ad2OWjp&t=1312

Note the distinct lack of:

    if err != nil {
crabbone
0 replies
17h45m

Maybe I overlooked something, but the author keeps repeating that they wrote a "streaming" parser, yet never explained what that actually means. In particular, they never explained how they deal with repeated keys in "hash tables". What does their parser do? Calls the "sink" code twice with the repeated key? Waits until the entire "hash table" is read and then calls the "sink" code?

In my mind, JSON is inherently inadequate for streaming because of hierarchical structure, no length known upfront and, most importantly, repeating keys. You could probably make a subset of JSON more streaming-friendly, but at this point, why bother? I mean, if the solution is to modify JSON, then a better solution would be something that's not JSON at all.

EdwardDiego
0 replies
9h29m

A person who helped me out a lot when I was learning to code wrote his own .NET JSON library because the MS provided one had a rough API and was quite slow.

His lib became the de facto JSON lib in .NET dev and naturally, MS head-hunted him.

Fast JSON is that important these days.

1vuio0pswjnm7
0 replies
19h30m

"But there is a better trick that we can use that is more space efficient than this table, and is sometimes called a computed goto."

From 1989:

https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...

"Indirection in the Goto field is a more powerful version of the computed Goto which appears in some languages. It allows a program to quickly perform a multi-way control branch based on an item of data."