By: David Kanter (dkanter.delete@this.realworldtech.com), November 23, 2010 5:00 pm
Room: Moderated Discussions
someone (someone@somewhere.com) on 11/23/10 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 11/23/10 wrote:
>---------------------------
>>
>>I don't believe that OOOE is inherently less power efficient than InO. OOOE lets
>>you overlap more cache misses, which can reduce the amount of time the process is stalled.
>>
>>DK
>
>You are talking about one narrow aspect of architectural
>performance figure of merit, not the full performance vs
>design cost metric or computational power efficiency.
>
>The OOOE vs IOE issue is highly workload dependent. For
>a given issue width and frequency the performance gain
>from OOOE vs in-order is as high as 30-50% for branchy
>scalar code with highly unstructured memory accesses to
>less than 5% for code dominated by predictable control
>flow and memory accesses. That is for implementations
>of non-EPIC ISAs at the same issue width and frequency.
>
>Published research into the benefit of OOOE for EPIC ISAs
>is limited and tends to focus on novel simplified dynamic
>scheduling schemes that aren't full classic OOOE.
>Of course OOOE is not free. It adds complexity and power
>consumption (dynamic and static). The power/area cost is
>20 to 40% depending on the issue width and degree of
>OOOE aggressiveness (window size, speculativity etc).
>Given a fixed amount of resources (silicon area, Watts),
>an in-order implementation can devote more transistors
>and Watts to other CPU functionality, more cache and/or
>higher clock frequency.
I'm not convinced that the power cost is that high, or that it necessarily scales directly with area (i.e. I think it costs more area than power if done right). OOOE lets you make implementation choices that can save power. For example:
1. Pseudo-ported (i.e. banked) caches
2. Partitioned bypass networks (like Intel does for x86)
3. Multiple cycle L1 cache latency
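To make point 1 concrete: a pseudo-ported cache gets most of the benefit of a true multi-ported array by splitting it into independently addressed single-ported banks. Two same-cycle accesses succeed only if they land in different banks; a conflict forces a replay, which an OOO scheduler can absorb far more gracefully than an in-order pipeline. A minimal sketch (illustrative Python; the bank count and line size are my assumptions, not any particular design):

```python
# Sketch of pseudo-porting: N independent single-ported banks emulate a
# multi-ported cache. Two accesses in one cycle both proceed only if they
# hit different banks; a bank conflict forces the second access to replay.

NUM_BANKS = 4     # assumed: 4 banks
LINE_BYTES = 64   # assumed: 64-byte cache lines

def bank_of(addr):
    """Select a bank from the low bits of the line address."""
    return (addr // LINE_BYTES) % NUM_BANKS

def issue_pair(addr_a, addr_b):
    """Decide which of two same-cycle accesses can proceed."""
    if bank_of(addr_a) != bank_of(addr_b):
        return ("proceed", "proceed")   # different banks: both go this cycle
    return ("proceed", "replay")        # conflict: second access retries later
```

For example, accesses to 0x000 and 0x040 hit banks 0 and 1 and both proceed, while 0x000 and 0x100 collide in bank 0. The power win is that each bank is a small single-ported array instead of one large multi-ported one.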
>Is OOOE worth the cost for general purpose MPUs (i.e.
>intended for a wide range of applications)? The answer
>is yes for most high performance implementations of
>CISC and RISC ISAs although the appearance of modern
>in-order processors like Atom and Power6 suggests the
>question isn't nearly as settled as some like to claim.
IBM learned from its mistake with POWER6 rather quickly. Now both POWER and IBM mainframes are OOOE. And I expect a next-generation Atom to be as well.
For highly regular workloads that are mostly working with arrays and matrices, you're better off with an in-order machine (like a GPU). But that's not the most critical workload for Itanium.
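The workload split described above can be sketched in a few lines (illustrative Python; the function names are mine). The first kernel has fully predictable control flow and strided accesses, so an in-order machine loses little; in the second, each load address depends on the previous load's value, which is exactly where OOOE's ability to overlap cache misses pays off:

```python
# Regular vs. irregular kernels from the discussion above.

def saxpy(a, x, y):
    # Regular: every access and branch is statically predictable,
    # so static (compiler) scheduling works well on an in-order core.
    return [a * xi + yi for xi, yi in zip(x, y)]

def chase(next_index, start, steps):
    # Irregular: each load address depends on the previous load's
    # result, serializing misses unless hardware can run ahead.
    i = start
    for _ in range(steps):
        i = next_index[i]
    return i
```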
>What about OOOE for implementations of EPIC ISAs? A
>comparison of McKinley vs EV6 vs Power4 suggests that
>what EPIC brings to the table combined with extra CPU
>resources not going OOOE buys makes the question a
>lot more debatable than with non-EPIC ISAs. My guess
>is Fort Collins looked carefully at OOOE but stayed with
>an in-order design for Poulson to maximize performance
>within its die size and power budget.
Itanium has a lot of nice features, but many of them are better implemented in the microarchitecture. The ALAT for instance - compiler hints are welcome...but a lot of aliasing issues are best handled dynamically to avoid code bloat, etc. I find register rotation kind of obnoxious, since it costs you a whole clock cycle in the pipeline...without providing real register renaming. Also, large register files are kind of nice...but I'm not sure they're ideal for multi-threading.
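To make the ALAT point concrete: an advanced load (ld.a) executes early and records its address in the ALAT; an intervening store to the same address invalidates the entry, and the later check (chk.a) branches to compiler-generated recovery code if the speculation failed. A toy model (illustrative Python, not full IA-64 semantics - no sizes, sets, or partial-overlap checks):

```python
# Toy model of IA-64 data speculation via the ALAT. The argument in the
# text is that this alias tracking can be done by OOO hardware instead,
# with no recovery-code bloat in the binary.

class ALAT:
    def __init__(self):
        self.entries = {}            # dest register -> tracked load address

    def advanced_load(self, reg, addr, memory):
        # ld.a: load early (above a possibly-aliasing store) and
        # remember the address so later stores can be checked against it.
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        memory[addr] = value
        # Any tracked load from this address is now stale.
        for reg, a in list(self.entries.items()):
            if a == addr:
                del self.entries[reg]

    def check(self, reg):
        # chk.a: True means the speculative value is still valid;
        # False means fall back to the compiler's recovery code.
        return reg in self.entries
```

A non-aliasing store leaves the entry alive; a store to the tracked address kills it and forces recovery - the "dynamic" half of the mechanism the compiler can't see statically.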
I also wonder how predication impacts branch predictor design...but that's a separate topic entirely.
Perhaps more importantly, I don't believe saving core die area matters for consumer workloads. Client systems won't use >4 cores for a long time...so shaving off 20% of the area doesn't really matter. For servers...it might be a different story with 16 cores...but the benefits of using the same core for client and server are pretty big.
DK