By: Anne O. Nymous (not.delete@this.real.address), February 3, 2023 1:35 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on February 3, 2023 11:35 am wrote:
> Anne O. Nymous (not.delete@this.real.address) on February 2, 2023 11:57 pm wrote:
> > —- (-.delete@this.redheron.com) on February 2, 2023 3:35 pm wrote:
> > > Mark Heath (none.delete@this.none.none) on February 1, 2023 3:45 pm wrote:
> > > > Freddie (freddie.delete@this.witherden.org) on February 1, 2023 8:54 am wrote:
> > > > > Also, I'll note that with static scheduling it is not necessarily true
> > > > > that the P cores will be held to the performance level of the E cores.
> > > > > So long as OMP_PROC_BIND (and related variables) are not set,
> > > > > the OS scheduler is free to move threads around. Hence,
> > > > > when the P cores finish (and go to sleep) the scheduler can shift a task over from the E cores to them.
> > > > >
> > > > > Contrived example. We have 32 work items on a system with
> > > > > 4P and 4E cores with no SMT. Let's assume a P core can do
> > > > > 2 items a second and an E core can do 1 item a second. With
> > > > > a static schedule each core gets 32 / 8 = 4 items.
> > > > >
> > > > > After t = 2 seconds the P cores are done, and the E cores have
> > > > > 2 items remaining. Scheduler sees this, and shifts the E core
> > > > > threads to the P cores. At t = 3, we're done (maybe sooner as the
> > > > > E cores being idle may allow the P cores to boost higher).
> > > > > In contrast, a system with 8E cores would take until t = 4 to finish.
> > > >
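Interjecting here with a minimal sketch of the contrived setup above, since it may be easier to follow as code. do_item is a hypothetical placeholder for one work item, not anything from Freddie's post; the point is just that schedule(static) fixes the 4-items-per-thread split up front, while leaving OMP_PROC_BIND unset lets the OS migrate the E-core threads onto idle P cores afterwards.

#include <stdio.h>
#include <omp.h>

/* Hypothetical stand-in for one work item; in the numbers above a P core
   gets through two of these per second and an E core through one. */
static void do_item(int i)
{
    volatile double x = 0.0;
    for (long k = 0; k < 10000000; k++)
        x += i * 1e-9;
    (void)x;
}

int main(void)
{
    const int nitems = 32;

    /* schedule(static) hands each of the 8 threads a fixed block of
       32 / 8 = 4 items up front, regardless of how fast its core is.
       With OMP_PROC_BIND unset, the OS may still migrate the E-core
       threads to P cores once the P cores finish their own blocks. */
    #pragma omp parallel for schedule(static) num_threads(8)
    for (int i = 0; i < nitems; i++)
        do_item(i);

    printf("all %d items done\n", nitems);
    return 0;
}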
> > > > Thank you for your interesting educational example. You’re right about static scheduling not being as
> > > > bad as I said because the OS scheduler can move threads from E cores to P cores. Suppose there were 48 work
> > > > items in your example. With a static schedule, each core would be assigned 48/8 = 6 work items. The P cores
> > > > would finish at t=3 and each E core would have 3 work items
> > > > left to do at that point. The OS scheduler would
> > > > move the E core threads to the P cores and each P core would complete those 3 items at t=4.5.
> > > >
> > > > Now suppose the programmer used OpenMP’s auto schedule policy and the OpenMP runtime was smart enough
> > > > to notice, after the E cores complete one work item, that the E cores are taking twice as long as the
> > > > P cores per work item. Since the auto schedule policy allows the OpenMP runtime to figure out the best
> > > > schedule, the runtime could, in theory, assign 8 work items to each P core and 4 work items to each E
> > > > core. In this case, the loop would complete at t=4 instead of t=4.5. Does it seem practical for an OpenMP
> > > > runtime to do this when the auto schedule policy is used? Is there any way for a programmer to manually
> > > > give the P cores twice as many iterations as the E cores so the loop completes at t=4?
> > > >
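Partly answering Mark's question above: I don't know of a standard way to declare an explicit 2:1 static split, but a dynamic schedule gets much the same effect without the runtime having to know which thread sits on which core. A minimal sketch, with do_item again a hypothetical placeholder:

#include <stdio.h>
#include <omp.h>

/* Placeholder work item, as in the earlier sketch. */
static void do_item(int i) { (void)i; }

int main(void)
{
    const int nitems = 48;

    /* With a dynamic schedule and chunk size 1, each thread pulls the
       next item off a shared counter as soon as it finishes its current
       one, so a core that is 2x faster simply ends up doing roughly twice
       as many items (about 8 per P core and 4 per E core here), and the
       loop finishes near t=4 in the toy numbers. The chunk size trades
       scheduling overhead against load balance. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < nitems; i++)
        do_item(i);

    printf("all %d items done\n", nitems);
    return 0;
}

An explicit 2:1 split could also be computed by hand from omp_get_thread_num(), but that only helps if you can guarantee which threads land on P cores, and as far as I know macOS gives you no way to pin threads to specific cores.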
> > > > Regarding Heikki Kultala’s comment: Apple’s hardware does not have SMT, so
> > > > splitting a physical P core into two virtual P cores is not possible as a way
> > > > of making the performance of all the threads in the system more uniform.
> > >
> > > Apple, so far anyway, don’t see E cores as throughput cores but as “helper” cores, like the dedicated
> > > cores on some other many-core designs (I think Fugaku does this). Apple is not substantially scaling
> > > up the E-core count as the SoC grows; overall it’s a very different design philosophy than Intel’s.
> > >
> > > So Apple’s answer to the OpenMP question would probably be to put the code
> > > on P only and let the E cores handle whatever OS/IO work arises as they naturally
> > > would; don’t bother trying to squeeze out an extra few percent using them.
> >
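If I read that right, "put the code on P only" in practice mostly means sizing the worker pool to the P-core count and not oversubscribing. A hedged sketch of what I have in mind; the hw.perflevel0.logicalcpu sysctl is, I believe, how macOS on Apple silicon reports the number of performance cores, but treat that as my assumption:

#include <stdio.h>
#include <sys/sysctl.h>
#include <omp.h>

int main(void)
{
    int pcores = 0;
    size_t len = sizeof(pcores);

    /* Ask macOS for the performance-core count; fall back to all
       logical processors if the sysctl is unavailable. */
    if (sysctlbyname("hw.perflevel0.logicalcpu", &pcores, &len, NULL, 0) != 0
        || pcores <= 0)
        pcores = omp_get_num_procs();

    /* Size the OpenMP team to the P-core count and leave the E cores
       free for whatever OS/IO work comes along. */
    omp_set_num_threads(pcores);

    #pragma omp parallel
    {
        #pragma omp single
        printf("using %d threads for the compute loop\n", omp_get_num_threads());
        /* ... parallel work goes here ... */
    }
    return 0;
}

This pins nothing, since as far as I know macOS offers no thread-to-core affinity; it just avoids creating more compute threads than there are P cores and lets the scheduler, which in my understanding already prefers P cores for busy threads at default QoS, do the rest.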
> > Interesting observation.
> > How much smaller are the E cores compared to the P cores, and how much less power do they draw? It might be
> > a question of, say, having 4 concurrent somewhat slower CPUs
> > versus 1.5 faster ones; for low-ish priority background
> > jobs higher concurrency might be more useful than higher ST speed, but I am merely speculating.
> >
>
> The numbers vary from design to design, but order of magnitude:
> - E cores are a quarter the size of P cores
> - E cores provide 1/4 to 1/3 the performance of P cores
> - E cores use (at peak power level, which may be misleading in terms of actual usage...)
> about 1/10th the power (so about 1/3 the energy, taking 3x as long, for a specific task)
>
> Essentially
> - Apple optimizes their E-cores for energy-delay product (ie balanced between fast and low energy)
> - ARM optimizes the E-cores for low area
> - Intel optimizes their E-cores for high performance/area
>
> Each is optimizing for a very different goal, so it's not surprising
> that the results are best used in very different ways.
>
> For Apple (at least for now...) it doesn't make sense to run things like OMP, or other highly-threaded code,
> on E-cores unless you are chasing that last few percent of performance AND know something about your task
> lengths and how they balance. Certainly it might be dumb to do this when the set of tasks is variable but
> fairly short, each of unknown length, and with faster tasks having to wait for slower tasks.
> Of course there are some trivial (frequently dick-measuring) workloads like Cinebench or Handbrake where this
> is not a risk because the tasks are so long-lived before dependencies that even the simplest OS scheduler
> will balance everything out OK. But this is not representative of less trivially parallelizable code.
>
> For Intel, on the other hand, E-cores represent some part of their performance future, with many kinda
> high-end designs of the sort targeting gamers dumping a substantial fraction of their area and performance
> into E-cores, and their ecosystem has a more difficult task trying to handle this...
>
> BTW the truly energy-optimized Apple cores are the Chinook cores which are basically very
> fancy ARM M cores speaking AArch64. These are used as controllers all over the chip (for
> the GPU, NPU, ISP, etc) but are, of course, irrelevant to developers outside Apple.
>
Thanks! Glad I asked, more food for thought.