By: David Kanter (dkanter.delete@this.realworldtech.com), February 4, 2015 10:25 am
Room: Moderated Discussions
Exophase (exophase.delete@this.gmail.com) on February 4, 2015 8:31 am wrote:
> anon (anon.delete@this.anon.com) on February 4, 2015 6:19 am wrote:
> > Memory disambiguation also does not seem like it would improve
> > efficiency much. It increases the amount of speculation
> > that can be done, which can increase performance of course,
> > but improve perf/watt? I think IBM only implemented
> > this with POWER8, and they haven't been ones to shy away from micro architectural complexity.
> >
>
> Memory disambiguation with a simple predictor rarely incorrectly speculates. The store
> buffer has to be scanned to see if loads hit stores in flight, but most cores have been
> doing this anyway to implement load to store forwarding for ops that were otherwise started
> in-order (even the old Cortex-A8 does this, at least for the scalar part)
>
> The more execution width you have, the more important it becomes. The simple example is a
> loop with a body that loads things at the start and stores things at the end. Without memory
> disambiguation, separate iterations of that loop can't run in parallel. So maybe for A72 such
> a feature would go hand in hand with increased decode width, L/S units, ALUs, etc.
>
> AMD only started doing it with Bulldozer, Apple only started doing it with
> Cyclone, and even Intel only started with Core 2. I don't think any of that
> is an indication of the feature not being an efficiency improvement.
Memory disambiguation would be most useful with another load unit.
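To make the store buffer scan described above concrete, here is a toy sketch. It is not how any real core implements it: the names `Store` and `issue_load`, the `speculate` flag, and the whole-word address match (no partial overlaps, no byte masks) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Store:
    # Hypothetical in-flight store; addr is None until its address is computed.
    addr: Optional[int]
    value: int


def issue_load(load_addr: int, older_stores: List[Store],
               speculate: bool) -> Tuple[str, Optional[int]]:
    """Decide what a load does given the older stores still in flight.

    older_stores is in program order, oldest first. The scan runs youngest
    first so the most recent matching store supplies the data.
    """
    for st in reversed(older_stores):
        if st.addr is None:
            if not speculate:
                # No disambiguation: wait until the store address is known.
                return ("stall", None)
            # With disambiguation: predict "no alias" and keep scanning.
            continue
        if st.addr == load_addr:
            # Store-to-load forwarding from the store buffer.
            return ("forward", st.value)
    # No known older store matches, so read the data cache.
    return ("cache", None)
```

With `speculate=True`, a load that skipped an unresolved store would have to be rechecked once that store's address resolves and replayed on a match; the sketch leaves that recovery step out.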
> > I would say perhaps improved branch prediction, reorganized cache design, and improved hardware prefetching.
> >
>
> I think they'll add a second load (and possibly store)
> unit, which Cyclone, Denver, and even Cortex-A17 have.
I would do another load unit. I don't think it's very helpful to do 2 ST/clock, especially since it makes your store buffer a lot nastier to deal with.
Prefetching and branch prediction will probably improve.
And yes, hopefully they will fix their cache design... but I think a lot of that is tied to the physical design (PD) capabilities of clients (which is to say, not much).
> > I think the L2 cache might be brought in and be integrated with the core design as it is with other
> > high performance CPUs.
> > With a more modular and configurable L3 cache shared within the cluster.
>
> By integrated you mean a separate local smallish L2 cache for each core? Right now only Intel really
> does that with their non-Atom line, although other CPUs share larger L2 caches between two cores. Doesn't
> mean that ARM won't do this, but it'll mean increasing the minimum size of their clusters a lot if
> some L3 is required. And being able to do it without L3 could have some bad design repercussions (that
> I think the Bulldozer line suffers from). Maybe with 128KB L2 caches it won't be too bad.
> > The low associativity L1 and large shared modular L2 seems like a potential problem to me.
> >
>
> I agree, I always thought this could be a glass jaw for A15. A57 helps a little by increasing
> associativity of icache to 3-way. 2-way associative L1 dcache in this day seems like a strange
> choice, even AMD moved away from that. It does give them cheap LRU replacement at least.
2W associativity is idiotic, especially for anything that even smells like a server. I made this point rather extensively when I was visiting Cambridge (Peter do you remember? :) ).
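A rough way to see the 2-way problem is to count conflict misses in a toy LRU cache model. This is only a sketch under stated assumptions: `count_misses` is a hypothetical helper, and the 32 KB capacity, 64-byte lines, and the access pattern are chosen for illustration rather than measured on any real core.

```python
from collections import OrderedDict
from typing import Iterable


def count_misses(addresses: Iterable[int], num_sets: int, ways: int,
                 line_bytes: int = 64) -> int:
    """Count misses in a set-associative cache with true LRU replacement."""
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets
        tag = line // num_sets
        entries = sets[index]
        if tag in entries:
            entries.move_to_end(tag)          # hit: mark as most recent
        else:
            misses += 1
            if len(entries) >= ways:
                entries.popitem(last=False)   # evict least recently used
            entries[tag] = True
    return misses


# Three lines 16 KB apart, touched round-robin ten times.
trace = [0, 16 * 1024, 32 * 1024] * 10

# Assumed 32 KB cache: 256 sets x 2 ways, or 128 sets x 4 ways.
two_way = count_misses(trace, num_sets=256, ways=2)
four_way = count_misses(trace, num_sets=128, ways=4)
```

In this model all three lines land in the same set either way. With 2 ways, LRU evicts the line that is needed next, so all 30 accesses miss; with 4 ways only the 3 cold misses remain.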
Also, if they want another LD pipe, I think they will want wider decode.
David