By: Exophase (exophase.delete@this.gmail.com), February 4, 2015 1:33 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on February 4, 2015 11:25 am wrote:
> Memory disambiguation would be most useful with another load unit.
>
> I would do another load unit. I don't think it's very helpful to do 2 ST/clock,
> especially since it makes your store buffer a lot nastier to deal with.
>
I don't think it's a coincidence that Core 2 both increased width across the board and added memory disambiguation, while Intel didn't add a second load port until two uarch generations later (Sandy Bridge). I do of course agree, though, that disambiguation would be more effective with a second load port.
I have written lots of code with a load-to-ALU ratio higher than 1:2 across decent-sized chunks or loop bodies, so I do think a second load unit would help a lot. Even more so if they're adding a third ALU.
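To put rough numbers on the load-port argument, here is a minimal sketch of a throughput bound for a loop body. Everything in it is an assumption for illustration: the function name `min_cycles_per_iter`, the 4-load/6-ALU loop body, and the port counts are hypothetical and not taken from any ARM core. It ignores latencies, dependencies, stores and branches, and only asks which resource saturates first.

```python
from fractions import Fraction


def min_cycles_per_iter(loads, alu_ops, load_ports, alu_ports, issue_width):
    """Lower bound on cycles per iteration from port and width limits only.

    Hypothetical model: each load needs one load-port slot, each ALU op
    one ALU slot, and every instruction one issue slot.
    """
    return max(
        Fraction(loads, load_ports),            # load-port limit
        Fraction(alu_ops, alu_ports),           # ALU-port limit
        Fraction(loads + alu_ops, issue_width)  # front-end/issue limit
    )


# Hypothetical loop body: 4 loads and 6 ALU ops (load:ALU = 2:3, above 1:2).
for load_ports, alu_ports, width in [(1, 2, 3), (1, 3, 3), (2, 3, 3), (2, 3, 4)]:
    bound = min_cycles_per_iter(4, 6, load_ports, alu_ports, width)
    print(f"{load_ports} LD, {alu_ports} ALU, {width}-wide: {float(bound):.2f} cycles/iter")
```

In this toy model the single load port is the limit at 4 cycles per iteration, and a third ALU alone changes nothing. Adding the second load port moves the limit to the 3-wide issue stage (10/3 cycles), which is where the wider-decode question comes in.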
> Prefetching and branch prediction will probably improve.
>
From what I could gather in the TRMs, the prefetching to date seems to be based on observing access patterns in the cache (originally, just cache misses). I don't know if they've moved beyond this already, but if not they'd benefit from IP-hashed stream detection, i.e. tracking streams per load instruction rather than only per address pattern.
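As an illustration of what IP-hashed stream detection means, here is a toy stride detector indexed by a hash of the load's instruction address. It is only a sketch under stated assumptions: the class name `IPStridePrefetcher`, the table size, the hash (`pc >> 2` modulo the table size) and the two-hit confirmation threshold are all hypothetical and do not describe any ARM or Intel prefetcher.

```python
class IPStridePrefetcher:
    """Toy stride detector with one table entry per hashed load PC."""

    def __init__(self, entries=64, confirm=2):
        self.entries = entries
        self.confirm = confirm  # repeated strides needed before prefetching
        self.table = {}         # index -> [last_addr, last_stride, hits]

    def _index(self, pc):
        # Hypothetical hash: drop the low bits of a 4-byte-aligned PC.
        return (pc >> 2) % self.entries

    def access(self, pc, addr):
        """Record one load; return an address to prefetch, or None."""
        idx = self._index(pc)
        entry = self.table.get(idx)
        if entry is None:
            self.table[idx] = [addr, 0, 0]
            return None
        last_addr, last_stride, hits = entry
        stride = addr - last_addr
        # Count consecutive repeats of the same non-zero stride.
        hits = hits + 1 if (stride != 0 and stride == last_stride) else 0
        self.table[idx] = [addr, stride, hits]
        return addr + stride if hits >= self.confirm else None


# Two interleaved streams from two different (made-up) load PCs.
pf = IPStridePrefetcher()
for i in range(4):
    for pc, base, step in [(0x1004, 0x10000, 64), (0x2008, 0x80000, 256)]:
        target = pf.access(pc, base + step * i)
        if target is not None:
            print(f"pc={pc:#x}: prefetch {target:#x}")
```

Because each load PC gets its own entry, both strides are picked up even though the accesses are interleaved. Feeding the same trace through a single entry (one PC for everything) yields no prefetches in this toy, since the address deltas never repeat. That is not a fair model of real cache-side prefetchers, which typically track several streams at once.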
> And yes, hopefully they will fix their cache design...but I think a lot of that
> is tied to the PD capabilities of clients (which is to say, not much).
>
When you say fix it, are you referring to latencies, size, hierarchy arrangement, or what? Size and hierarchy arrangement are really pretty much the same as everyone else's in these segments, unless you count Broadwell-Y.
More and more clients will be using the POPs (Processor Optimization Packs) that ARM is making, so the baseline will at least be how good ARM and its partners are at this reference implementation.
> 2W associativity is idiotic, especially for anything that even smells like a server. I made
> this point rather extensively when I was visiting Cambridge (Peter do you remember? :) ).
>
I still don't think there was anything to this designed-for-servers point, but on the other hand maybe the original frequency targets were a little too much for the markets they did hit.
2-way is still plenty bad for a lot of non-server tasks.
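For a feel of why 2-way hurts outside servers too, here is a small sketch of conflict misses in a set-associative LRU cache. The parameters are assumptions chosen for illustration (a hypothetical 32 KB cache with 64-byte lines and three data streams whose addresses sit 16 KB apart), and the names `SetAssocCache` and `conflict_misses` are made up for this example.

```python
from collections import OrderedDict


class SetAssocCache:
    """Minimal set-associative cache model with LRU replacement."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        lines = self.sets[line % self.num_sets]
        if line in lines:
            lines.move_to_end(line)      # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # evict the least recently used line
        lines[line] = True
        return False


def conflict_misses(ways, streams=3, passes=10, size_bytes=32 * 1024):
    """Touch one line from each stream per pass; streams are 16 KB apart."""
    cache = SetAssocCache(size_bytes, ways)
    for _ in range(passes):
        for s in range(streams):
            cache.access(s * 16 * 1024)
    return cache.misses


print("2-way misses:", conflict_misses(ways=2))
print("4-way misses:", conflict_misses(ways=4))
```

With these made-up addresses all three lines land in the same set. Two ways cannot hold three lines under LRU, so every access misses, while four ways take only the three cold misses. A loop that reads two arrays and writes a third can hit this pattern with no server workload in sight.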
> Also, if they want another LD pipe, I think they will want wider decode.
Cortex-A17 seems to benefit a lot from the second load pipe while still not even being three-wide. I'd go as far as to say that even A53 would benefit a lot from a second load pipe, maybe even more than A17 does, but I don't know what that does to the area and power targets for a CPU like that.
Maybe they'll have wider decode only for 64-bit, or only for non-Thumb2 or something. I'm wondering how much stuff they could have really added while still consuming less power at the same clock speed; that's a much stricter requirement than just improving overall perf/W. I think they must have found some big things to optimize.
Oh, and another feature they might draw from the well is instruction fusion, particularly cmp + branch.
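As a rough picture of what cmp + branch fusion buys, here is a sketch of a decode-stage pass that merges a compare with the conditional branch directly after it. The tuple encoding of instructions and the function name `fuse_cmp_branch` are assumptions made up for this example. It says nothing about how any real core implements fusion or which pairs it would accept.

```python
def fuse_cmp_branch(instrs):
    """Merge each cmp with an immediately following conditional branch.

    Instructions are hypothetical tuples: (mnemonic, operand, ...).
    Conditional branches are assumed to be spelled "b.<cond>".
    """
    fused = []
    i = 0
    while i < len(instrs):
        op = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if op[0] == "cmp" and nxt is not None and nxt[0].startswith("b."):
            # One fused op carries the compare operands and the branch target.
            fused.append(("cmp+" + nxt[0],) + op[1:] + nxt[1:])
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused


# Hypothetical loop: sum 32-bit words until the pointer reaches the end.
loop = [
    ("ldr", "w2", "[x0]"),
    ("add", "w3", "w3", "w2"),
    ("add", "x0", "x0", "#4"),
    ("cmp", "x0", "x1"),
    ("b.ne", "loop"),
]
for op in fuse_cmp_branch(loop):
    print(op)
```

Here the five-instruction loop becomes four ops after fusion, so the loop overhead takes one fewer slot per iteration without any extra decode width.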