By: juanrga (nospam.delete@this.juanrga.com), March 8, 2015 6:12 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on March 8, 2015 2:46 am wrote:
> Brett (ggtgp.delete@this.yahoo.com) on March 7, 2015 2:56 pm wrote:
> > juanrga (nospam.delete@this.juanrga.com) on March 6, 2015 6:22 pm wrote:
> > > I don't expect K12 to be significantly better than Vulcan.
> >
> > Is that per thread? Vulcan is 4 way threaded. Per die size, MIPS/watt?
> > Vulcan slides show 8 fetch, 4 issue, 60 entry scheduler, 6 ports, unknown completion.
> >
> > High end chips tend to be 6 decode, 6 issue, 192 entry scheduler, ~8 ports, 6 completion.
> >
> > The 4 issue sticks out like a sore thumb, I hope there is some opcode merging going on. I also hope
> > that scheduler is not shared four ways, it is too small to be competitive as is. The previous MIPS chip
> > with similar specs was tiny, you can put lots of them on a die, and win benchmarks that way.
> >
> > On the good side it does have three load/store ports, supporting 2 loads or 3
> > stores, which sounds backwards to me, most code has more loads than stores.
> > https://hpcuserforum.com/presentations/santafe2014/Broadcom%20Monday%20night.pdf
> >
> > I am really expecting K12 to be wider than Bulldozer.
>
> On paper, Vulcan sounds good. But too many details are missing:
>
> 1. Branch prediction details
Branching is improved. The XLP problem of unaligned branch targets is corrected. An improved branch predictor helps avoid resteers. The taken-branch penalty is eliminated...
> 2. Prefetching
> 3. Load/store buffering strategy
Vulcan takes advantage of ARM’s LDP (load pair) instruction by allowing each load/store unit to access 128 bits (16 bytes) per cycle, enough to fill two registers at once. The scheduler can issue store micro-ops to the store-data unit before the actual data is ready, as long as the address registers are available. Memory operations that miss the 32KB data cache are queued. Vulcan can hold 64 outstanding loads and 36 outstanding stores. Loads and stores can complete out of order because the CPU detects any address conflicts while the operations are pending.

Each CPU has a 256KB level-two cache. It can refill the L1 caches at 64 bytes per cycle. To improve the hit rate, all three caches are 8-way associative. A hardware prefetch unit predicts the next cache line on the basis of access patterns stored in a PC-indexed history table, then preloads it into the data cache.

Each load/store unit, as well as the instruction-fetch unit, has its own small TLB for address translation. These are backed by a large 2,048-entry TLB. The L2 TLB supports variable page sizes from 4KB all the way up to 16GB. The CPU includes a nested page-table walker to accelerate TLB-miss handling.
In short, load bandwidth is doubled compared to the MIPS core.
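To put a rough number on that, here is a back-of-the-envelope sketch in Python. It combines the two figures quoted in this thread (2 loads per cycle from the slides, 16 bytes per access from the description above) and assumes the 3GHz frequency target. The function name and constants are mine, and this is a theoretical per-core L1 peak, not a measured or vendor-stated number.

```python
# Rough peak L1 load bandwidth per core, using only figures quoted in the thread.
# Assumptions (not vendor-confirmed): both loads can issue every cycle, each at
# the full 16 bytes, and the core actually runs at the 3 GHz target.

LOADS_PER_CYCLE = 2      # "supporting 2 loads or 3 stores" per the slides
BYTES_PER_ACCESS = 16    # 128 bits per load/store unit per cycle
CLOCK_HZ = 3.0e9         # assumed: the 3 GHz frequency target


def peak_load_bandwidth(loads_per_cycle, bytes_per_access, clock_hz):
    """Return (bytes per cycle, bytes per second) for an ideal load stream."""
    bytes_per_cycle = loads_per_cycle * bytes_per_access
    return bytes_per_cycle, bytes_per_cycle * clock_hz


bytes_per_cycle, bytes_per_second = peak_load_bandwidth(
    LOADS_PER_CYCLE, BYTES_PER_ACCESS, CLOCK_HZ
)
print(f"{bytes_per_cycle} B/cycle, {bytes_per_second / 1e9:.0f} GB/s peak per core")
```

If the "doubled" claim holds, the implied figure for the earlier core would be half of this, but that is an inference from the claim, not a spec I have checked.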
>
> Not to mention system level issues. Are they claiming they will beat a Haswell Xeon E3?
> I could believe that. I am much more skeptical they can beat a Haswell EP or EX.
With 16 cores, 90% of Haswell IPC, and a frequency target of 3GHz, Vulcan would be competitive against some 16-core versions of Haswell E5/E7 (aka Haswell EP/EX).
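The arithmetic behind that estimate can be sketched as cores x relative IPC x clock. The 90% IPC ratio and the 3GHz target are the figures above; the Haswell clock is a placeholder you should replace with the sustained all-core clock of whichever 16-core part you compare against. The function name is hypothetical, and the model ignores SMT, turbo behaviour, memory bandwidth and uncore effects.

```python
# Naive aggregate-throughput ratio: cores * relative IPC * clock.
# Assumptions: 90% IPC ratio and 3 GHz come from the post; the Haswell clock
# below is a hypothetical placeholder, not a quoted spec.


def relative_throughput(cores_a, ipc_ratio_a_vs_b, ghz_a, cores_b, ghz_b):
    """Throughput of chip A divided by chip B under the naive model."""
    return (cores_a * ipc_ratio_a_vs_b * ghz_a) / (cores_b * ghz_b)


# Same clock on both sides: Vulcan lands at its IPC ratio, about 0.9x.
same_clock = relative_throughput(16, 0.9, 3.0, 16, 3.0)

# Hypothetical 16-core Haswell sustaining a lower all-core clock (placeholder).
HASWELL_GHZ = 2.3  # assumption for illustration only
lower_clock = relative_throughput(16, 0.9, 3.0, 16, HASWELL_GHZ)

print(f"same clock: {same_clock:.2f}x, vs {HASWELL_GHZ} GHz: {lower_clock:.2f}x")
```

Under this model the outcome depends almost entirely on the clock the Haswell part sustains, which is why I say "some" 16-core versions.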