By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), May 7, 2013 2:07 pm
Room: Moderated Discussions
EduardoS (no.delete@this.spam.com) on May 7, 2013 1:41 pm wrote:
>
> Since AMD already published the Software Optimization Guide, you can make the comparison
> yourself, or just assume that usually Jaguar have a bit more than Silvermont, for example:
Umm. You missed the most important number. L2 size/latency.
Intel tends to do a good memory subsystem, and people seem to always dismiss that. The numbers I've seen for Jaguar are a 24 cycle load-use latency of the 512kB L2. David claims (I think) 14 cycles for the 1MB shared L2 on Silvermont. That looks like a big advantage for Silvermont.
Things like ROB sizes are almost unimportant compared to the really fundamental things. I'm a huge proponent of OoO, but it doesn't even need to be all that deep to get the low-hanging fruit. You need to have good caches: low latency and reasonable associativity. If those are bad, no amount of core improvements will ever make up for it (see the majority of the ARM cores), and if the caches are really good, you can make do with shallower queues.
(Side note: I'd like to see actual benchmark numbers for the Silvermont cache accesses. Sometimes CPU people quote the latency after the L1 miss (rather than the full load-to-use latency), sometimes they quote the "n-1" number of cycles, sometimes it turns out that pointer following adds another few cycles that they don't mention, yadda yadda, all just to make their numbers look better than they are.)
So if the Silvermont 14 cycles are for "14 cycles after the L1 miss", then that is much worse than if it's a true "14 cycle pointer-to-pointer chasing" latency. But I'm assuming it's true load-to-use latency for now, since that's in the same ballpark as what Intel did back in the Merom/Yonah days.
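For reference, the "pointer-to-pointer chasing" number mentioned above is usually measured with a dependent-load loop. The following is only a minimal sketch of that technique, not a calibrated benchmark: the 64-byte line size, the names `link_random_cycle`, `chase` and `ns_per_load`, and the use of `clock()` for timing are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64  /* assumed cache line size in bytes */

/* One node per (assumed) cache line, so every hop touches a new line. */
struct node {
    struct node *next;
    char pad[LINE - sizeof(struct node *)];
};

/* Link all n nodes into one cycle in shuffled order, so that simple
   stride prefetchers cannot predict the next address. */
static int link_random_cycle(struct node *nodes, size_t n, unsigned seed)
{
    assert(n >= 1);
    size_t *order = malloc(n * sizeof *order);
    if (!order)
        return -1;
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i];
        order[i] = order[j];
        order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        nodes[order[i]].next = &nodes[order[(i + 1) % n]];
    free(order);
    return 0;
}

/* Each load's address comes from the previous load, so the loop runs at
   one full load-to-use latency per step. */
static struct node *chase(struct node *p, size_t steps)
{
    while (steps--)
        p = p->next;
    return p;
}

/* Average nanoseconds per dependent load for a working set of `bytes`.
   Returns a negative value on bad arguments or allocation failure. */
static double ns_per_load(size_t bytes, size_t steps)
{
    static struct node *volatile sink;  /* keeps the chase from being optimized away */
    size_t n = bytes / sizeof(struct node);
    if (n == 0 || steps == 0)
        return -1.0;
    struct node *nodes = malloc(n * sizeof *nodes);
    if (!nodes)
        return -1.0;
    if (link_random_cycle(nodes, n, 1u) != 0) {
        free(nodes);
        return -1.0;
    }
    sink = chase(nodes, n);             /* warm-up pass over the whole set */
    clock_t t0 = clock();
    sink = chase(nodes, steps);
    clock_t t1 = clock();
    free(nodes);
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

Calling `ns_per_load` with a working set that fits in L1, then one that only fits in L2, and multiplying each result by the core clock in GHz gives approximate cycles per load at each level. The L2 figure obtained this way already includes the L1 lookup, which is exactly the difference between a full load-to-use number and a "cycles after the L1 miss" number. Real measurements would also need a pinned core, a fixed clock frequency and a finer timer than `clock()`.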
Linus