Article: AMD's Mobile Strategy
By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), January 6, 2012 2:04 pm
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 1/6/12 wrote:
---------------------------
>Ricardo B (ricardo.b@xxxxx.xxxx) on 1/6/12 wrote:
>---------------------------
>>Single-socket DRAM latency is in the 40-50 ns range
>>nowadays.
>>
>>10 ns or less seems quite feasible with SRAM through a
>>parallel interface.
>>
>
>10 ns is not feasible. 25 ns, maybe, but even that is hard.
How much of that is the off-chip penalty (even with fairly
tight integration) and how much is the access delay of the
memory itself? (Also, would there be any benefit in
variable latency? E.g., placing some memory closer to the
output interface, possibly using prediction to place the
likely critical chunk in the lowest-latency position. With
8-beat transmission bursts, some pipelining of the read to
hide latency might be possible. With DRAM, would it be
possible to read all ways of the first chunk while the tags
are being checked? Even if only some cache blocks could
have reduced latency, that would seem better than paying
the worst-case latency on every access. [Pipelined access
might also be useful with the chip connected to the
processor holding the tags and the first chunk or two while
adjacent cache chips provided the remaining chunks, though
that would presumably have power issues.] [Yes, I am
harping on an oldish idea of mine!])
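
To make the tag-check tradeoff concrete, here is a toy
timing model (all latencies are placeholder values I made
up, not measurements of any real part) comparing a serial
tag-then-data access with reading all ways of the critical
chunk while the tags are checked:

# Toy off-chip L4 hit-latency model. All numbers are
# assumed placeholders, not measurements of real hardware.
LINK_NS = 4.0   # one-way off-chip link traversal
TAG_NS = 3.0    # tag array read + compare
DATA_NS = 8.0   # data array access for one chunk
WAYS = 8        # set associativity

def serial_hit_ns() -> float:
    # Check tags first, then read only the matching way.
    return LINK_NS + TAG_NS + DATA_NS + LINK_NS

def parallel_hit_ns() -> float:
    # Read all ways of the critical chunk while tags are
    # checked; the tag result just steers a late select
    # (mux delay ignored here).
    return LINK_NS + max(TAG_NS, DATA_NS) + LINK_NS

print(f"serial:   {serial_hit_ns():.1f} ns")
print(f"parallel: {parallel_hit_ns():.1f} ns, "
      f"at roughly {WAYS}x the data-read energy")

With these made-up numbers the parallel read hides the tag
latency on every hit, at the cost of reading WAYS times as
much data; the prediction idea above is essentially a way
to get most of that saving without most of that energy.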
An L4 cache might also have advantages in bandwidth and in
avoiding bank conflicts in conventional DRAM. (As I
mentioned earlier, an L4 chip could also act as a pin
multiplier: it could present a lower-power interface to the
processor chip than even an on-board buffer chip, with
likely higher pin efficiency [and possibly lower pin area],
and it could potentially provide I/O and coherence links.)
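
As a rough illustration of the bank-conflict point, a
birthday-problem estimate (assuming addresses map uniformly
and at random to banks, which real access streams do not)
shows how the chance of a conflict falls as the bank count
grows:

# Probability that n concurrent requests collide in at
# least one bank, assuming uniform random bank mapping
# (an assumption; real address streams are not random).
def conflict_prob(banks: int, requests: int) -> float:
    p_no_conflict = 1.0
    for i in range(requests):
        p_no_conflict *= (banks - i) / banks
    return 1.0 - p_no_conflict

for banks in (8, 32, 128):
    p = conflict_prob(banks, 4)
    print(f"{banks:4d} banks: {p:.2f} conflict probability "
          f"for 4 concurrent requests")

An L4 with many banks (or one that simply filters requests
before they reach DRAM) would push that probability down,
which is the bandwidth/conflict advantage suggested above.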