Article: AMD's Mobile Strategy
By: Ricardo B (ricardo.b.delete@this.xxxxx.xx), January 6, 2012 7:18 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton@gmail.com) on 1/6/12 wrote:
---------------------------
>
>Why would tag comparison have significantly greater
>latency than a random access to any point on the chip? The
>tag read should be faster than a full read (unless the
>SRAM chips are using the pipelining technique I mentioned
>to reduce latency), and a comparison for equality should
>take less than 1 ns. (I am guessing that the L3 latencies
>include the latency of three tag checks--L1 miss, L2 miss,
>L3 hit determination--as well as buffering [which may be
>necessary if different clocking can be used] and routing
>delay [which may be larger than for a single level of
>cache because placement of the lower levels would be biased
>to reduce their latency/power use].)
I've never designed a cache so I don't really know, but I'd say it's because the caches are N-way set associative, which means the tag lookup is a bit more complicated than a single comparison.
E.g., Sandy Bridge's L3 is 12-way associative.
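Roughly, a set-associative lookup has to read out all N tags for the set, compare each against the address tag, and then mux the hitting way's data out. Here's a minimal C sketch of that logic (the types and names are mine, purely illustrative; real hardware does all the comparisons in parallel rather than in a loop):

#include <stdint.h>
#include <stdbool.h>

#define NUM_WAYS 12   /* e.g. Sandy Bridge's L3 associativity */

/* Hypothetical tag store for one cache set. */
typedef struct {
    uint64_t tag[NUM_WAYS];
    bool     valid[NUM_WAYS];
} cache_set_t;

/* Check every way of the set against the address tag.  Hardware
   does all NUM_WAYS comparisons at once and then reduces them to
   a single hit/way answer, which adds muxing delay on top of the
   compares.  Returns the hitting way, or -1 on a miss. */
int lookup(const cache_set_t *set, uint64_t addr_tag)
{
    for (int way = 0; way < NUM_WAYS; way++) {
        if (set->valid[way] && set->tag[way] == addr_tag)
            return way;   /* hit */
    }
    return -1;            /* miss */
}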
Buffering and muxing all the signals around should also weigh in heavily.
In the ASIC I'm working on, I got delays as high as ~9 ns as the tool tried to fan a signal out to a bunch of places spread across ~7 mm, just from all the buffering.
>
>I vaguely recall reading that just going off chip can be
>absurdly expensive in terms of latency (and power).
It tends to be. But it also depends on your context and basis of comparison.
An I/O interface with 0.5 ns scale latencies (a 2 GHz clock period) is... very, very hard to do, if not impossible.
But an I/O interface with 10 ns scale latencies (a 100 MHz clock period) is doable.
As I mentioned, you can buy 36 Mbit SSRAM chips with < 5 ns latency (2.5 clocks at 550 MHz).
That is, the driving chip (e.g., an FPGA) sets the read request on its address/control pins and, 2.5 clocks later (< 5 ns), it can latch the data on its input pins.
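To put numbers on that: 2.5 clocks at 550 MHz works out to roughly 4.5 ns, comfortably under 5 ns. A trivial check in C (the figures are just the ones quoted above):

#include <stdio.h>

int main(void)
{
    double f_clk  = 550e6;           /* SSRAM clock, Hz        */
    double cycles = 2.5;             /* read latency in clocks */
    double t_read = cycles / f_clk;  /* latency in seconds     */

    printf("clock period: %.3f ns\n", 1e9 / f_clk);  /* ~1.818 ns */
    printf("read latency: %.3f ns\n", t_read * 1e9); /* ~4.545 ns */
    return 0;
}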
Of course, these chips use very simple parallel interfaces which require a very large number of pins and traces -- even more than DRAM interfaces.
As you move to narrower and narrower interfaces, latencies tend to suffer.