By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), September 26, 2009 6:38 am
Room: Moderated Discussions
Gabriele Svelto (gabriele.svelto@gmail.com) on 9/26/09 wrote:
---------------------------
>Skimming through the information available on the Cortex A9 I have noticed that
>there are quite a few peculiarities in the design. Here's a quick run-down to fuel discussion:
>
>- Out-of-order completion is mentioned though there is very little information
>about it. The only other mention I found was that an instruction 'releases the
>resources it is consuming early'. Maybe it means that an instruction can write back
>its result and free the renamed register before completion if it is safe to do so?
You can write the result to the register file as soon as it has been computed. That is a big advantage over in-order, as you don't have to pipeline (and forward) every result down to the final writeback stage. This is what you could call OoO completion. If the instruction is not speculative, you can also release the resources it uses early, since you know it will complete once it starts.
>- It seems that the LSU is skewed as they mention a single-cycle load-to-use penalty
>(same as on the Cortex A8). A hardware prefetcher is also mentioned but there is no information about it.
Single-cycle load-use sounds like a special forwarding case. OoO pipelines typically aren't skewed - it doesn't make sense to issue an instruction and have it do nothing for 1 or 2 cycles when you could issue an independent instruction instead. Unlike in-order, you gain nothing from skewing.
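To make the "issue an independent instruction instead" point concrete, here is a toy single-issue scheduler. It is only a sketch under loud assumptions: the `schedule` function, the 2-cycle load latency, and the unique-destination (already renamed) registers are all invented for illustration and say nothing about how the A9 actually issues.

```python
import math

def schedule(program, out_of_order):
    """Toy single-issue model. Each instruction is (dest, sources, latency).

    Assumes every dest register is unique (i.e. already renamed), so only
    true dependencies matter. Returns (cycles_to_issue_everything, issue_order).
    """
    # Results of not-yet-issued instructions are unavailable; live-ins are ready at 0.
    ready_at = {dest: math.inf for dest, _, _ in program}
    pending = list(program)
    cycle, order = 0, []
    while pending:
        # In-order may only look at the oldest instruction; OoO scans the whole window.
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            dest, srcs, latency = ins
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                ready_at[dest] = cycle + latency
                pending.remove(ins)
                order.append(dest)
                break  # single issue: at most one instruction per cycle
        cycle += 1
    return cycle, order

# Two load/use pairs; a latency of 2 models a one-cycle load-to-use penalty.
program = [
    ("r1", (), 2),        # ldr r1
    ("r2", ("r1",), 1),   # add r2, r1
    ("r3", (), 2),        # ldr r3
    ("r4", ("r3",), 1),   # add r4, r3
]
print(schedule(program, out_of_order=False))
print(schedule(program, out_of_order=True))
```

In this model the in-order run takes 6 cycles because it stalls behind each load, while the OoO run takes 4 by issuing the second load into the first load's shadow. That is the work a skewed pipeline would otherwise be hiding, so skewing on top of OoO buys nothing here.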
>- Some sort of load-store forwarding mechanism is mentioned in the white paper
>though it is not described how it works or what it does exactly. Maybe back-to-back
>load/store couples have a lower latency because the data is allowed to bypass some
>stages and go directly to the store queue?
Historically ldr-str latency has been zero on most ARMs - even mul-str is usually zero latency. The paper suggests ldr-str forwarding happens within the L1. That likely means zero latency and a power saving.
>- The L1s in Cortex A8 were PIPT, I believe this still holds true for the A9. This
>means that the TLB must be fairly small, a potentially significant disadvantage
>in more desktop-oriented workloads. I wonder if there is a second-level TLB in there.
The A8 uses 32-entry fully associative TLBs, but a TLB miss means two expensive L2 accesses. The A9 pipeline diagram mentions micro-TLBs, which implies they are small (likely 8 to 16 entries) and fall back to a larger main TLB on a miss (like ARM11). I hope the main TLB has at least 128 entries, ideally 256.
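A rough sketch of the micro-TLB plus main TLB arrangement described above. Everything specific here is an assumption for illustration: the class names, the 8/128 entry counts, LRU replacement, 4 KiB pages, and a plain dict standing in for the page-table walk. None of it is taken from ARM documentation.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4 KiB pages (assumption)

class LruTlb:
    """Fully associative TLB with LRU replacement (policy is an assumption)."""
    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()  # virtual page number -> physical page number

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)  # mark as most recently used
            return self.map[vpn]
        return None

    def insert(self, vpn, ppn):
        self.map[vpn] = ppn
        self.map.move_to_end(vpn)
        if len(self.map) > self.entries:
            self.map.popitem(last=False)  # evict least recently used

class TwoLevelTlb:
    def __init__(self, micro_entries=8, main_entries=128, page_table=None):
        self.micro = LruTlb(micro_entries)
        self.main = LruTlb(main_entries)
        self.page_table = page_table or {}  # stands in for the in-memory page tables
        self.stats = {"micro_hit": 0, "main_hit": 0, "walk": 0}

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        ppn = self.micro.lookup(vpn)
        if ppn is not None:
            self.stats["micro_hit"] += 1
        else:
            ppn = self.main.lookup(vpn)
            if ppn is not None:
                self.stats["main_hit"] += 1
            else:
                ppn = self.page_table[vpn]  # the expensive case: a table walk
                self.stats["walk"] += 1
                self.main.insert(vpn, ppn)
            self.micro.insert(vpn, ppn)  # refill the micro-TLB on any micro miss
        return (ppn << PAGE_SHIFT) | offset

# Cycle twice over 16 pages: too many for the micro-TLB, easy for the main TLB.
tlb = TwoLevelTlb(page_table={vpn: vpn + 100 for vpn in range(32)})
for _ in range(2):
    for vpn in range(16):
        tlb.translate(vpn << PAGE_SHIFT)
print(tlb.stats)
```

With these made-up sizes the second pass never hits the 8-entry micro-TLB, since LRU thrashes on a 16-page cycle, but every access is caught by the main TLB rather than a walk. That is why the size of the main TLB matters more for larger working sets.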
>- When configured with an L2 cache, the L2 is exclusive (yay for K7!), certainly
>a good thing for the smaller incarnations of the A9.
>
>- The A9 has fast loop mode for lower-power operation but in the diagrams available
>it is depicted as being before the decode stage. This is similar to Conroe/Penryn
>cores which is strange because it comes after the pre-decode stage, something that
>the A9 shouldn't have. I would have expected it after the decode stage (like on
>Nehalem) so I was wondering if its sole purpose is to shut off the L1 I-cache for power savings.
Doing it after decode would need a lot more area as decoded instructions can be significantly wider than 32 bits. Loops with branches are likely harder to deal with as well. Caching fetched data is easier and works well on Thumb-2 where an average instruction is just 21 bits. L1 fetch is one of the largest power consumers (27% on StrongARM), so avoiding it in most loops helps a lot.
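Some back-of-envelope arithmetic on the numbers above. The helper names, the 32-instruction loop, and the 64-bit decoded width are hypothetical values chosen only to show the shape of the trade-off. The 21-bit average is the figure quoted above, and the calculation assumes only 16-bit and 32-bit encodings.

```python
from fractions import Fraction

def narrow_fraction(avg_bits, narrow=16, wide=32):
    """Share of instructions that must use the narrow encoding to reach avg_bits.

    Solves narrow*f + wide*(1 - f) = avg_bits for f.
    """
    return Fraction(wide - avg_bits, wide - narrow)

def loop_buffer_bits(n_instructions, bits_per_instruction):
    """Storage needed to hold a loop body of n_instructions."""
    return n_instructions * bits_per_instruction

LOOP_LENGTH = 32    # hypothetical loop body size, in instructions
DECODED_BITS = 64   # hypothetical decoded width, just "wider than 32 bits"

print(narrow_fraction(21))                          # share of 16-bit encodings
print(loop_buffer_bits(LOOP_LENGTH, 21))            # buffering fetched Thumb-2 code
print(loop_buffer_bits(LOOP_LENGTH, DECODED_BITS))  # buffering decoded instructions
```

A 21-bit average corresponds to 11 in 16 instructions using the 16-bit encoding. Under the assumed 64-bit decoded width, the same loop would need roughly three times the storage after decode.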
>- Finally it seems that they put a lot of effort into making I/O operations as
>well as thread-related operations very fast. The ACP for example is absolutely brilliant:
>on current consoles it is normal practice to lock part of the L2 and write the
>GPU command buffer into it, then send it over using DMA to save an unnecessary read-write-read
>copy. Having this done transparently is simply excellent. It seems to me that they've
>done quite some work to make operations which usually disrupt an OoOE actually run
>very fast (peripheral I/O, TLS access and cache-to-cache transfers). On this topic
>I'd like to know more about the GIQ too because interrupt handling is an area which
>is often underestimated from a performance POV.
The ARM11 MPCore TRM may give some insights as to what the A9 will do. MPCore supports global distribution of interrupts as well as handling a specific interrupt on a particular core. You can even send a software interrupt to another core.
Wilco