By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), June 4, 2022 10:17 am
Room: Moderated Discussions
Eric Fink (eric.delete.delete@this.this.anon.com) on June 3, 2022 12:23 am wrote:
> Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on June 2, 2022 11:10 pm wrote:
>
> > Looking at Lemire's simdjson as an example of text processing, it seems those ops would also work on
> > RVV, no? https://github.com/simdjson/simdjson/blob/master/src/icelake/dom_parser_implementation.cpp
>
> You are right, text processing is probably a bad example, since text sequences are usually
> longish.
I have the exact reverse reaction.
Text sequences are usually quite short. The whole "I have gigabytes of JSON" seems a very artificial example.
Don't get me wrong: there are lots of situations where you have lots and lots of text. But in my experience there are not really all that many where you process it as a big block. You generally want to chunk it up into much smaller pieces anyway. A lot of text processing is done basically at line boundaries, and in many situations you want to avoid having to have everything in memory at the same time.
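(To make that concrete, here's a minimal sketch - mine, not anybody's production code - of the line-at-a-time pattern: the working set is one line, no matter how big the file is.)

#include <fstream>
#include <iostream>
#include <string>

// Process a big text file one line at a time: the working set is a
// single line, not the whole file, so "gigabytes of text" never
// actually sits in memory as one big block.
int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1]);
    std::string line;
    std::size_t lines = 0, bytes = 0;
    while (std::getline(in, line)) {
        ++lines;
        bytes += line.size();   // stand-in for real per-line work
    }
    std::cout << lines << " lines, " << bytes << " bytes\n";
}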
I haven't personally done JSON parsing, but I have worked with its older step-brother, XML, who was dropped on his head a few too many times (evil scientist voice: "Oh, look, let's make up a format that is bad for both humans and computers to parse! And then we'll sell it as a generic data exchange format! People are morons, they'll lap it up! Mwhahahahaaa!").
And the data files easily get large, and yes, parsing ends up being a pain, but you want to do it incrementally anyway, and if you care about performance you'll save the end result in some internal format.
In most text parsers I've seen, the technical act of parsing the stream itself is the least of the problems - building up the resulting data tree (or whatever) with allocations etc tends to be the biggest issue. Lots of small allocations, often lots of small data copies for said allocations.
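(As a toy illustration - my sketch, assuming a simple child/sibling node layout, nothing from simdjson - here's the difference between one heap allocation per node and carving nodes out of a pre-reserved arena. The "parsing" work is the same; only the allocation pattern changes.)

#include <cstddef>
#include <vector>

// Toy parse-tree node: tiny payload, child/sibling links.
struct Node {
    int value = 0;
    Node* first_child = nullptr;
    Node* next_sibling = nullptr;
};

// Naive builder: one heap allocation per node - this, not the text
// scanning, is where parser time tends to go.
Node* heap_node(int v) { return new Node{v}; }

// Arena builder: one big reservation up front, nodes carved out of it.
// Capacity must be reserved up front so the pointers stay stable.
struct Arena {
    std::vector<Node> pool;
    explicit Arena(std::size_t cap) { pool.reserve(cap); }
    Node* make(int v) {
        pool.push_back(Node{v});
        return &pool.back();
    }
};

int main() {
    Arena arena(1000);
    Node* root = arena.make(0);
    for (int i = 1; i < 1000; ++i) {
        Node* n = arena.make(i);
        n->next_sibling = root->first_child;
        root->first_child = n;
    }
}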
And in the bad cases, the expensive part can be things like doing a good job at floating point number parsing (which can be quite nasty to do right with proper rounding).
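(A hedged example of why: the obvious digit-accumulation loop rounds at every step, so it can end up off by an ulp or more, while C++17's std::from_chars is specified to give the correctly rounded result. The naive_parse function here is mine, handles no sign or exponent, and is only for illustration.)

#include <charconv>
#include <cstdio>
#include <cstring>

// Naive parsing: accumulate digits, then scale. Every multiply and
// the final divide each round, so the result can be off by an ulp or
// more. No sign/exponent handling - illustration only.
double naive_parse(const char* s) {
    double v = 0.0;
    for (; *s >= '0' && *s <= '9'; ++s)
        v = v * 10.0 + (*s - '0');
    if (*s == '.') {
        double scale = 1.0;
        for (++s; *s >= '0' && *s <= '9'; ++s) {
            v = v * 10.0 + (*s - '0');
            scale *= 10.0;
        }
        v /= scale;
    }
    return v;
}

int main() {
    const char* s = "0.1000000000000000055511151231257827";
    double exact = 0.0;
    std::from_chars(s, s + std::strlen(s), exact);  // correctly rounded
    std::printf("naive=%.17g from_chars=%.17g\n", naive_parse(s), exact);
}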
For example, I bet that this whole AVX512 JSON parsing mention came from the big splash the benchmark numbers made a couple of weeks ago. And I also bet that nobody on this list actually looked at the numbers. Those numbers are total garbage.
The simdjson JSON parsing code literally has a special mode for "don't allocate memory for the result" (look it up: '-H'), and their performance notes page literally tells people to use that flag (along with largepage allocations, which is at least a bit more reasonable) when reporting parsing benchmark numbers.
Read that paragraph above one more time: it's not just that the parsing benchmark doesn't do any useful work with the result - it's that it doesn't even save the results of said "parsing" in the first place, because it turns out that saving the result is more expensive than the "scan the text" part.
If a tree falls in a forest, and no one is around to hear it, does it make a sound?
And if you parse something but don't actually save a result, is it really parsing?
In other words: don't use the simdjson performance numbers as any kind of argument for SIMD processing. They are entirely meaningless.
Now, that said, accelerating text processing is valid. Doing things like string scans is very very common. But it needs to take small strings into account, and not be some silly "parse gigabytes of text mindlessly and without doing anything with it".
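(Something like this sketch - mine, SSE2 just as an example, and __builtin_ctz is GCC/Clang-specific - is what I mean: a plain scalar loop for short strings where the vector setup cost dominates, and 16-byte chunks only once the input is long enough to pay for them.)

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

// Scalar loop for short inputs, 16-byte SSE2 chunks for long ones,
// scalar again for the tail.
const char* find_byte(const char* s, std::size_t n, char c) {
    if (n < 16) {                        // small-string path: no vector setup
        for (std::size_t i = 0; i < n; ++i)
            if (s[i] == c) return s + i;
        return nullptr;
    }
    const __m128i needle = _mm_set1_epi8(c);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0)
            return s + i + __builtin_ctz(static_cast<unsigned>(mask));
    }
    for (; i < n; ++i)                   // tail bytes
        if (s[i] == c) return s + i;
    return nullptr;
}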
If you make a memcpy benchmark, and you only report largepage results for big aligned memory areas, and only report cold-cache numbers, your benchmark is worthless as a benchmark.
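(For contrast, a hedged sketch of a less cherry-picked memcpy benchmark: sweep sizes from cache-resident to RAM-sized and report each one, rather than a single flattering configuration. The inline asm barrier is GCC/Clang-specific and just keeps the compiler from eliding the copies.)

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // From comfortably cache-resident to clearly RAM-bound.
    for (std::size_t size : {64UL, 4096UL, 1UL << 20, 64UL << 20}) {
        std::vector<char> src(size, 1), dst(size);
        const long iters = size < (1UL << 20) ? 100000 : 100;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i) {
            std::memcpy(dst.data(), src.data(), size);
            // GCC/Clang-specific: stop the compiler eliding the copy.
            asm volatile("" : : "g"(dst.data()) : "memory");
        }
        double sec = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        std::printf("%10zu bytes: %7.2f GB/s\n",
                    size, double(size) * double(iters) / sec / 1e9);
    }
}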
It might still be interesting as a technology demonstration, of course. Which is exactly what I think that AVX512 JSON parsing thing is. Nothing less, nothing more.
Linus