By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), June 16, 2022 1:13 pm
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on June 16, 2022 9:39 am wrote:
> hobold (hobold.delete@this.vectorizer.org) on June 16, 2022 5:12 am wrote:
[snip]
>> I was wondering for a while if maybe Apple designed a processor that can sometimes
>> execute two serially dependent instructions within one longer clock cycle.
>>
>> How much % of cycle time is latch overhead these days? What if instead of the usual beat
>> "latch work latch work latch" you built for "latch work work latch work work latch"?
>>
>> One probably wouldn't even try to make this work for arbitrary sequences of two dependent instructions.
>> But maybe a small subset of dependent pairs is statistically dominant enough to focus on?
>
> It sounds like you're suggesting something like Pentium
> 4's "double pumped ALU" (aka "Rapid Execution Engine")
The proposal was to remove latch overhead; width-pipelined (staggered) ALUs do not remove the latches between operations. A cascaded ALU does perform one operation after another within a cycle (plausibly without latches in between). I think Sun's SuperSPARC implemented something like this.
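To put rough numbers on the latch-overhead question, here is a minimal sketch. The function names and the picosecond figures are made up for illustration; they are not measurements of any real process or core. It simply compares "latch work latch work latch" against "latch work work latch" for one dependent pair.

```python
def time_per_dependent_pair(work_ps, latch_ps, cascaded):
    """Time (ps) to finish two serially dependent ALU operations.

    work_ps  : logic delay of one ALU operation (hypothetical value)
    latch_ps : latch/flop overhead paid once per pipeline stage (hypothetical value)
    cascaded : True  -> "latch work work latch" (one stage, one latch overhead)
               False -> "latch work latch work latch" (two stages, two overheads)
    """
    if cascaded:
        return 2 * work_ps + latch_ps
    return 2 * (work_ps + latch_ps)


def latch_saving_fraction(work_ps, latch_ps):
    """Fraction of the conventional two-stage time saved by cascading."""
    base = time_per_dependent_pair(work_ps, latch_ps, cascaded=False)
    casc = time_per_dependent_pair(work_ps, latch_ps, cascaded=True)
    return (base - casc) / base


if __name__ == "__main__":
    # Purely illustrative numbers: 200 ps of logic, 50 ps of latch overhead.
    print(time_per_dependent_pair(200, 50, cascaded=False))
    print(time_per_dependent_pair(200, 50, cascaded=True))
    print(latch_saving_fraction(200, 50))
```

The saving is at most one latch overhead per pair, so it only matters if latch overhead is a noticeable share of the cycle and dependent pairs are common.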
> Essentially the ALU was running at double the frequency. According to Intel that was done to make up for
> the Pentium 4's rather anemic IPC. Not sure how it would work on a CPU that already has very high IPC.
Cascaded ALUs can reduce routing overhead (and possibly latch overhead).
> Running at double the frequency (even if it could be managed to enable/disable that
> at a whim, or have only one core doing at a time, etc.) is likely to have negative
> consequences for efficiency, which is one of Apple's primary design goals.
Cascaded ALUs do not target higher frequency; the reduced routing complexity (and possibly a larger effective scheduling window from scheduling at coarser granularity) could provide greater energy efficiency if utilization of the feature were high.
(Even a width-pipelined ALU would not have to target higher frequency and might provide energy efficiency benefits. At the same frequency, a half-width adder would need less logic.)
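For anyone unfamiliar with width pipelining, a minimal sketch follows, assuming a 32-bit add split into two 16-bit halves. The widths and the function name are my assumptions for illustration, not a description of any particular core. The low half is produced in the first step and the high half, consuming the carry, in the second. A dependent add could start on the low half as soon as step 1 is done, but there is still a latch between the steps.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def staggered_add32(a, b):
    """32-bit wrapping add done as two 16-bit steps (width-pipelined sketch)."""
    # Step 1: add the low 16-bit halves; the carry-out is held for step 2.
    low_sum = (a & MASK16) + (b & MASK16)
    carry = low_sum >> 16
    low = low_sum & MASK16

    # Step 2: add the high 16-bit halves plus the carry from step 1.
    high = (((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry) & MASK16

    return (high << 16) | low


if __name__ == "__main__":
    # Carry propagates from the low half into the high half.
    print(hex(staggered_add32(0x0000FFFF, 0x00000001)))
```

Each step needs only a half-width adder, which is where the "less logic at the same frequency" point comes from.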