By: Foo_ (foo.delete@this.nomail.com), June 6, 2022 2:19 am
Room: Moderated Discussions
Wilco (wilco.dijkstra.delete@this.ntlworld.com) on June 5, 2022 5:40 am wrote:
>
> I looked at an earlier version of simdjson and was able to get good speedups by removing unnecessary use of
> SIMD. It uses very complex SIMD processing to determine the start of every token and then write this out into
> a big array (if the average token size is small, this results in a huge expansion of the input, creates lots
> of cache misses and consumes a lot of memory bandwidth). Then in the 2nd pass a big switch statement reads those
> offsets, figures out what token it might be for the 2nd time, and finally performs the action for each token.
> Removing the unnecessary SIMD tokenization step and using traditional parsing gave a 10-15% speedup.
>
> There are certainly cases where SIMD can speed up parsing, for example skipping comments or #ifdef'd
> out text or UTF-8 processing, but you've got to use SIMD smartly and understand where it helps.
Yeah, it really depends on the shape of the input. If your JSON (or CSV, etc.) input has a lot of long string literals, then SIMD can be useful. If it has a lot of very short literals, then SIMD may be detrimental. Typical JSON (or CSV, etc.) data probably has mostly very short literals.
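To make that concrete, here is a rough sketch (my own code, not simdjson's; the function names and the padded-buffer assumption are mine) of two ways to find the closing quote of a string literal. The SSE2 version looks at 16 bytes per iteration, which only pays off once literals are long enough to amortize its per-chunk overhead; for a 2-3 byte literal the scalar loop is hard to beat.

#include <emmintrin.h>   // SSE2 intrinsics (GCC/Clang also needed for __builtin_ctz)

// Byte-at-a-time scan: cheapest option when literals are only a few bytes long.
// Escape sequences are ignored in both versions to keep the sketch short.
static const char* find_quote_scalar(const char* p, const char* end) {
    while (p < end && *p != '"') ++p;
    return p;
}

// 16-bytes-at-a-time scan: assumes the buffer is padded so reading up to 15
// bytes past 'end' is safe, as SIMD parsers commonly arrange.
static const char* find_quote_simd(const char* p, const char* end) {
    const __m128i quote = _mm_set1_epi8('"');
    while (p < end) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, quote));
        if (mask) {
            const char* hit = p + __builtin_ctz(mask);   // first '"' in this chunk
            return hit < end ? hit : end;
        }
        p += 16;
    }
    return end;
}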
If you are really serious about performance, you probably want to have two different code paths and choose between them after a short learning period on a given input.
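Something along these lines is what I have in mind; the sampling heuristic, the threshold and the function names are placeholders of mine, not anything taken from an existing parser:

#include <cstddef>
#include <string_view>

// Stubs standing in for the two real implementations.
static bool parse_scalar(std::string_view json) { return !json.empty(); }  // traditional byte-at-a-time parser
static bool parse_simd(std::string_view json)   { return !json.empty(); }  // SIMD-heavy parser

// Learning period: estimate the average quoted-literal length in the first
// 'sample' bytes. Long literals suggest the SIMD path will amortize its overhead.
static double avg_literal_length(std::string_view json, std::size_t sample) {
    std::size_t literals = 0, bytes_in_literals = 0;
    bool in_string = false;
    for (std::size_t i = 0; i < json.size() && i < sample; ++i) {
        char c = json[i];
        if (in_string && c == '\\') { ++i; ++bytes_in_literals; continue; }  // skip escape pair
        if (c == '"') { in_string = !in_string; literals += in_string; continue; }
        if (in_string) ++bytes_in_literals;
    }
    return literals ? double(bytes_in_literals) / literals : 0.0;
}

bool parse(std::string_view json) {
    constexpr std::size_t kSample = 4096;  // learn from the first 4 KiB
    constexpr double kThreshold = 16.0;    // guess: roughly one SIMD register per literal
    return avg_literal_length(json, kSample) >= kThreshold ? parse_simd(json)
                                                           : parse_scalar(json);
}

In practice you would probably want to re-check periodically rather than commit once, since the character of the data can change partway through a file.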