By: Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr), March 24, 2021 4:23 am
Room: Moderated Discussions
Nyan (nyan.delete@this.mailinator.com) on March 24, 2021 3:00 am wrote:
> Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr) on March 23, 2021 3:36 am wrote:
> > How do you explain then that average IPC can easily fall down below 1 on 100% CPU use,
> > like in "sudo perf stat /usr/bin/md5sum /usr/bin/md5sum" line "insn per cycle"?
> > It looks to me none of those hundreds of in-flight instructions are ready to execute,
> > probably instructions prerequisite not ready
>
> The md5sum executable is like 43KB on my system. For hashing 43KB, a lot of the time is probably spent on all
> sorts of cold cases: like disk I/O, context switching, branch misprediction, memory loading, overheads etc.
>
> With a much larger source, say `dd if=/dev/zero bs=1M count=512 | perf stat /usr/bin/md5sum`
> I get above 1 IPC as expected. Though it won't be much higher due to MD5 being a long dependency
> chain - this limitation likely also affects the above 43KB case as well.
>
> On my system, `perf stat /usr/bin/sha1sum /usr/bin/sha1sum` gets
> above 1 IPC, noting that SHA1 has much better ILP than MD5.

The problem is not that current processors cannot reach an IPC of 4 on specific workloads; it is that, most of the time, they do not even reach 1 on ordinary workloads.
Saying that you can hash a stream of zeros much faster is like telling word-processor users to type one single letter an enormous number of times to show how efficient their word processor is.
If you have 43 KB to hash, and the data is already loaded from disk because it is also the binary you are executing, disk I/O and memory loading should be negligible. You should not see much context switching on such a small workload, and if branch misprediction really is the problem, then there is little reason to keep a few hundred instructions in flight.
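To check that directly, here is a minimal sketch (assuming a typical Linux perf setup): warm the page cache first, so whatever IPC deficit remains cannot be blamed on disk I/O, and repeat the run to average out scheduling noise:

    # warm the page cache so disk I/O is out of the picture
    cat /usr/bin/md5sum > /dev/null
    # perf prints the "insn per cycle" line when both cycles and
    # instructions are counted; -r 10 averages over ten runs
    perf stat -r 10 -e cycles,instructions md5sum /usr/bin/md5sum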
If the input has to grow from 43 KB to 512 MB before the performance stabilises, then execution will nearly never stabilise on most software (unless you are mining crypto-currency).
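A quick way to see where it actually stabilises (the sizes below are illustrative, not measurements):

    # sweep input sizes and watch where "insn per cycle" levels off
    for mb in 1 8 64 512; do
        dd if=/dev/zero bs=1M count=$mb 2>/dev/null | \
            perf stat -e cycles,instructions md5sum
    done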
sha1sum was just chosen because it is easy to reproduce; most other small workloads (not accelerated by custom processor hardware) would do as well.
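For instance (a hedged sketch; any small binaries that happen to be installed will do):

    # any small, single-shot workload should show the same effect
    perf stat -e cycles,instructions sha1sum /usr/bin/sha1sum
    perf stat -e cycles,instructions gzip -c /usr/bin/gzip > /dev/null
    perf stat -e cycles,instructions wc -l /usr/bin/wc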