By: dmcq, February 11, 2015 6:44 am
Michael S on February 10, 2015 2:57 pm wrote:
> Jouni Osmala on February 10, 2015 12:24 pm wrote:
> > > > I don't know what changed within ARM's application processor
> > > > group, but recently they have had a very high success
> > > > rate: A7, A12, A53, A17 are all pretty good.
> > >
> > > Focus on incremental improvement maybe? A8, A9 and A15 were all big jumps with significantly different
> > > micro-architectures than their predecessors. All of the ones you listed on the other hand have evolved
> > > from previous cores. They have better branch prediction, tighter coupling of the L1/L2 caches and
> > > other improvements which do not significantly affect the execution core yet provide very significant
> > > benefits, and especially so in integer codes. Most of those improvements can also be shared between
> > > the cores so that might also have helped out in focusing development efforts.
> >
> > Those small tweaks matter a lot. The difference between Nehalem and Cortex-A15 is mostly in
> > those tweaks. Cortex-A15 has a wider execution stage, but Nehalem is more tweaked and has a better memory
> > subsystem and maybe a better branch predictor, and fetches more instructions per cycle. By wider
> > core I mean Nehalem's 3 compute pipelines vs Cortex-A15's 6 compute pipelines, and both have the same
> > number of loads/stores per cycle. The A15 scheduler contains more operations than Nehalem's.
> You can't compare OoO cores based on a unified scheduler (Nehalem) with cores based on split schedulers
> (Cortex-A15) in such a simplistic manner. If you want to compare CA15 with x86, it would make much
> more sense to compare to another split-scheduler design like AMD K8. K8 can dispatch up to 9 uOps
> per clock, one more than CA15, but the real difference in dispatch width in favor
> of K8 is even bigger, because its dispatch ports are more universal. Most importantly, K8 is capable
> of dispatching up to 3 integer ALU/shift instructions per clock (matching fat Intel cores by this metric)
> or resolving up to 3 branches, while CA15 can only issue 2 integer ALU/shift instructions per clock and
> resolve 1 branch. So, both cores feature split schedulers, but K8 is "less split".
> For reference, CA15 OoO schedulers (clusters, in ARM terms) and dispatch rates per scheduler:
> 1. Simple ALU/shift, 2
> 2. Branch, 1
> 3. Neon/FPU, 2
> 4. Multiply, 1 (also handles integer divide)
> 5. LSU, 2 (1 load, 1 store)
> BTW, it's still not clear to me where the store data comes from. Does the store unit have 3 read ports (2 for address
> and one for data) into the register file and result queue, or does it somehow steal a read port from another EU?
> Also, on an unrelated note, when speaking about width you can't totally ignore the in-order
> front end and in-order retirement parts of the core, both of which on CA15 (3 simple
> uOps) are narrower than on both Nehalem (4 fused uOps) and K8 (3 macro-ops).
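
As a quick sanity check on the numbers quoted above, the per-cluster dispatch rates can be tallied in a few lines. This is only a sketch: the table is copied from the list in the quote, and the names CA15_DISPATCH, total_dispatch and max_per_clock are made up for illustration.

```python
# Per-scheduler (cluster) dispatch rates for Cortex-A15,
# copied from the list quoted above. Names are illustrative only.
CA15_DISPATCH = {
    "simple ALU/shift": 2,
    "branch": 1,
    "NEON/FPU": 2,
    "multiply": 1,  # also handles integer divide
    "LSU": 2,       # 1 load + 1 store
}


def total_dispatch(clusters):
    """Peak uOps dispatched per clock, summed over all clusters."""
    return sum(clusters.values())


def max_per_clock(clusters, kinds):
    """Peak dispatch per clock for uOps that can only go to the given clusters."""
    return sum(clusters[k] for k in kinds)


print(total_dispatch(CA15_DISPATCH))                       # 8
print(max_per_clock(CA15_DISPATCH, ["simple ALU/shift"]))  # 2
print(max_per_clock(CA15_DISPATCH, ["branch"]))            # 1
```

The total of 8 agrees with the quoted "up to 9 uOps per clock, one more than CA15" for K8, and the per-kind figures are the 2 ALU/shift and 1 branch per clock mentioned there.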

Well, for the ARM A15 a slide says that store operations are issued in order, and that they issue when the address registers are available, not the data. So I guess the issued operation really just generates the address, and the actual store is done separately.
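
To make that guess concrete, here is a toy model of it. It is not how the A15 is actually built: the one-address-per-cycle limit, the separate data capture step, and every name in it (StoreEntry, step, can_commit) are my assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StoreEntry:
    addr_regs: Tuple[str, ...]     # registers needed to form the address
    data_reg: str                  # register holding the value to store
    address: Optional[int] = None  # filled in when the store issues
    data: Optional[int] = None     # filled in separately, possibly later


def step(queue, ready):
    """Advance the toy store queue by one cycle.

    ready maps register names to values available this cycle.
    """
    # Address generation: in order, gated only on the address registers.
    # Only the oldest store without an address is considered (assumption:
    # at most one store address per cycle).
    oldest = next((e for e in queue if e.address is None), None)
    if oldest is not None and all(r in ready for r in oldest.addr_regs):
        oldest.address = sum(ready[r] for r in oldest.addr_regs)
    # Data capture: independent of issue order (assumption).
    for e in queue:
        if e.data is None and e.data_reg in ready:
            e.data = ready[e.data_reg]


def can_commit(entry):
    """A store can be written out only once both halves are present."""
    return entry.address is not None and entry.data is not None


q = [StoreEntry(("r1", "r2"), "r3")]
step(q, {"r1": 0x1000, "r2": 8})   # address ready, data not yet
print(hex(q[0].address), q[0].data, can_commit(q[0]))
step(q, {"r3": 42})                # data arrives a cycle later
print(q[0].data, can_commit(q[0]))
```

In this model the issue slot only needs the two address read ports, and the data is picked up whenever it shows up, which would be one possible answer to the read port question.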
