By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), June 4, 2022 8:00 pm
Room: Moderated Discussions
-.- (blarg.delete@this.mailinator.com) on June 4, 2022 5:56 pm wrote:
>
> Except having "gigabytes of JSON" was never necessary, and I'm willing
> to bet their benchmarks don't have gigabyte sized JSON files.
No, they talk about having files of multiple different sizes and iterating over them repeatedly.
Don't get me wrong - I suspect it's a very good JSON parser.
But to get their "gigabytes per second" thing, they do make it fairly clear that they are ignoring allocation costs. You may be right that this means the buffers were pre-allocated and then filled in, and that the benchmark just didn't time the preparatory part. I didn't actually look at the benchmark sources.
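To make the distinction concrete, here is a minimal sketch of how such a parse-only benchmark is typically structured. Everything here is hypothetical - parse_into() is a stand-in, not any particular library's API - the point is only that the scratch buffer is allocated once, outside the timed loop, so allocation never shows up in the headline number.

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for a real parser call: it just scans for structural characters and
// records them in pre-allocated scratch space. A real library would build a
// full parse result here; this exists only to make the sketch self-contained.
size_t parse_into(const std::string& json, std::vector<char>& scratch) {
    size_t n = 0;
    for (char c : json) {
        if (c == '{' || c == '}' || c == '[' || c == ']' || c == ',' || c == ':') {
            if (n < scratch.size()) scratch[n] = c;
            ++n;
        }
    }
    return n;
}

double benchmark_parse_only(const std::string& json, int iterations) {
    std::vector<char> scratch(json.size() * 2);  // pre-allocated, not timed

    auto start = std::chrono::steady_clock::now();
    size_t sink = 0;
    for (int i = 0; i < iterations; ++i)
        sink += parse_into(json, scratch);       // only the parse loop is timed
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    double gigabytes = double(json.size()) * iterations / 1e9;
    std::printf("parsed %zu structurals\n", sink); // keep the result live
    return gigabytes / seconds;                    // the headline "GB/s" figure
}

int main() {
    std::string json = R"({"name":"example","values":[1,2,3,4,5],"nested":{"ok":true}})";
    std::printf("parse-only throughput: %.2f GB/s\n",
                benchmark_parse_only(json, 1000000));
}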
But it's exactly like you say:
> I see it more like people touting FLOPS figures - there's a certain appeal to it, even if it's not
> representative of a lot of problems. That's not to say the functionality is useless though.
Right. It's basically theoretical throughput, and it makes the news, and yes, I understand why it happens. But it's basically ignoring the fact that in real life, there's a lot else that goes on when you parse input, and yes, if you write a JSON parsing library, you may not want to look at those other parts (they aren't really your thing - you're not writing a memory allocation library, after all).
And optimizing the code that you can actually improve and control, using a synthetic load that intentionally avoids everything outside your control, is not wrong - it's part of finding the bottlenecks in your code.
But then when the numbers get quoted as "gigabytes per second", and you have graphs that show 3x improvements (or whatever) over other libraries, I think people need to realize that those "gigabytes per second" were basically measured in a vacuum - once you take all the things that go on in a full parse into account, the advantage isn't nearly as noticeable.
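For contrast, here is the other half of the same hypothetical sketch: the same stand-in parse_into() from above, but timed the way a typical caller actually pays for it - the input is copied in and the scratch space is allocated fresh for every document, the way data tends to arrive from a file or a socket. The measured throughput drops accordingly, even though the parsing code itself hasn't changed.

// Same stand-in parser as above, but allocation and copying are now inside
// the timed region, approximating a full end-to-end parse per document.
double benchmark_full_path(const std::string& json, int iterations) {
    auto start = std::chrono::steady_clock::now();
    size_t sink = 0;
    for (int i = 0; i < iterations; ++i) {
        std::string copy = json;                     // input arrives as a fresh buffer
        std::vector<char> scratch(copy.size() * 2);  // per-document allocation
        sink += parse_into(copy, scratch);
    }
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    std::printf("parsed %zu structurals\n", sink);   // keep the result live
    return (double(json.size()) * iterations / 1e9) / seconds;
}

Calling both functions from the same main() and printing the two numbers side by side would show roughly how much of the headline figure survives the extra per-document work.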
And I agree, it's not at all unlike quoting peak FLOPS that will never be attained on real loads due to all the other things going on.
Linus