By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), June 6, 2022 9:40 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on June 6, 2022 2:43 am wrote:
>
> Going by my intuition, doing parsing and conversion of decimal/hex literals
> into binary form in a single pass would save you more than you can lose because
> of efficiency of parsing itself, mostly due to improved L1D hit rates.
It probably depends hugely on what and why you parse things.
For example: maybe the actual numerical value is never used? Lazily avoiding all validation can be a huge win (not just parsing numbers - particularly floating point - but maybe you can skip doing things like full UTF8 validation until a value is actually used).
That sounds silly, but it's not entirely crazy: the parsing can easily be a prelude to looking for a certain thing, and 99% of the input may not matter, because the user is literally parsing the JSON just to look for one particular field or whatever. And scanning UTF8 input is really really easy if you don't need to worry about things like overlong sequences etc.
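As a sketch of the idea (hypothetical names, nothing lifted from a real parser): the tokenizer can just remember where the number text was, and only do the full conversion if somebody actually asks for the value.

    /* Sketch only: a token that remembers where the number text lives,
     * and converts it only when the value is actually needed. */
    #include <stdbool.h>
    #include <stdlib.h>

    struct lazy_num {
        const char *start;   /* points into the input buffer */
        size_t      len;
        bool        parsed;
        double      value;
    };

    static double lazy_num_get(struct lazy_num *n)
    {
        if (!n->parsed) {
            /* The slow, careful validation/conversion we deferred.
             * Assumes the input buffer is NUL-terminated; a real
             * version would bound the conversion by n->len. */
            n->value  = strtod(n->start, NULL);
            n->parsed = true;
        }
        return n->value;
    }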
And that can be a situation where parsing speed matters - think "grep for JSON data" kind of situations where you might want to act on or look up one entry in a big JSON file, but you're not going to do some big global operation on it.
So depending on circumstances, multiple passes may be the right thing to do, just because you may want to lazily evaluate things.
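A deliberately naive sketch of that kind of lookup (hypothetical code, ignores string escaping and nesting entirely):

    /* Sketch: find one field in a JSON blob without building any tree.
     * No escape handling, no validation of anything we skip over. */
    #include <stdio.h>
    #include <string.h>

    static const char *find_field(const char *json, const char *key)
    {
        char needle[64];
        snprintf(needle, sizeof(needle), "\"%s\"", key);

        const char *p = strstr(json, needle);
        if (!p)
            return NULL;
        p += strlen(needle);
        while (*p == ' ' || *p == '\t' || *p == ':')
            p++;
        return p;    /* caller parses just this one value */
    }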
In my experience (and again - I've never done JSON, I've mainly done parsing of C and XML), the truly nasty part of parsing is that you often end up wanting all this small metadata. So as you tokenize the stream, each token wants to have its position recorded for later, so that when you encounter errors in some (much later) phase, you can say "this line in this file".
Is that inherent in parsing? No. It has nothing directly to do with parsing per se. It's just bookkeeping that is hopefully never even used, but it turns out it's one of those "you have to do it for any kind of quality implementation" things. And it can be quite ridiculously expensive, because pretty much every token then needs to have that metadata associated with it.
That kind of metadata is easily much bigger than the token was in the stream (one-character token? Not unusual). And I'm not even talking about generating the parse tree itself, I'm literally talking about just remembering one single token on its own.
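Just to put a rough number on it, even a minimal made-up layout for that per-token position data dwarfs a one-character token:

    /* Sketch: the kind of position metadata a quality parser ends up
     * carrying for every token, just to report errors later. A one-byte
     * token like '{' or ';' in the input still pays all of this. */
    #include <stdint.h>

    struct token_pos {
        uint32_t file_index;   /* which file, for include-style setups */
        uint32_t line;
        uint32_t column;
        uint32_t length;
    };                         /* 16 bytes of metadata per 1-byte token */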
And hey, maybe you can avoid some of that bookkeeping in many situations. Maybe you don't even need error reporting, or maybe you have a slow-path fallback for error cases. So this kind of thing certainly isn't truly fundamental, but it's very common, and it can really be quite the big deal.
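One common shape for that slow-path fallback (again just a sketch, not from any particular parser): store only a byte offset per token, and walk the buffer counting newlines only when an error actually has to be reported.

    /* Sketch: don't store line/column at all - just a byte offset - and
     * recompute the position by rescanning the buffer on the error path,
     * which is hopefully never taken. */
    #include <stddef.h>

    static void offset_to_pos(const char *buf, size_t offset,
                              unsigned *line, unsigned *col)
    {
        unsigned l = 1, c = 1;
        for (size_t i = 0; i < offset; i++) {
            if (buf[i] == '\n') {
                l++;
                c = 1;
            } else {
                c++;
            }
        }
        *line = l;
        *col  = c;
    }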
Linus