Alternative Implementations

By: Kyle Siefring, July 13, 2020 6:02 pm
Room: Moderated Discussions
David Kanter on July 12, 2020 6:13 pm wrote:
> I did some analysis a while back that might be useful to share here.
> 8.0 mm2 SKL core
> 0.9 mm2 AVX512
> 2.0 mm2 1MB L2$
> 2.4 mm2 1.375MB L3$
> 0.4 mm2 Snoop filter
> 2.0 mm2 caching and home agent
> 2.2 mm2 FIVR, PLLs
> 0.4 mm2 Mystery block
> 18.0 mm2 Total SKL-SP tile
> So AVX512 is about 5% of the tile area, the tiles are 72% of the total area of SKL-SP.
> If you removed AVX512 you'd save 28mm2 for the whole chip, which would let you add at most 2 tiles.
> In that vein, it seems like a pretty reasonable trade-off.
> David
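
The arithmetic in the quoted breakdown can be checked with a short sketch. The per-block areas are copied from the quote; the tile count of 28 is an assumption (it is not stated in the post), so the whole-chip figure is only indicative.

```python
# Per-tile block areas in mm^2, copied from the quoted breakdown.
TILE_BLOCKS_MM2 = {
    "SKL core": 8.0,
    "AVX512": 0.9,
    "1MB L2$": 2.0,
    "1.375MB L3$": 2.4,
    "Snoop filter": 0.4,
    "Caching and home agent": 2.0,
    "FIVR, PLLs": 2.2,
    "Mystery block": 0.4,
}

QUOTED_TILE_TOTAL_MM2 = 18.0  # "Total SKL-SP tile" as quoted
ASSUMED_TILES = 28            # ASSUMPTION: tile count is not given in the post

# Share of one tile taken by AVX512.
avx512_share = TILE_BLOCKS_MM2["AVX512"] / QUOTED_TILE_TOTAL_MM2

# Area recovered across the chip if every tile dropped AVX512.
chip_saving_mm2 = TILE_BLOCKS_MM2["AVX512"] * ASSUMED_TILES

# How many extra tiles that recovered area could hold.
extra_tiles = chip_saving_mm2 / QUOTED_TILE_TOTAL_MM2

listed_sum_mm2 = sum(TILE_BLOCKS_MM2.values())

print(f"AVX512 share of tile: {avx512_share:.1%}")
print(f"Chip-wide saving:     {chip_saving_mm2:.1f} mm2")
print(f"Extra tiles:          {extra_tiles:.2f}")
print(f"Sum of listed blocks: {listed_sum_mm2:.1f} mm2")
```

Under that assumption the saving comes out near 25 mm2, or about 1.4 tiles, which is in line with the "at most 2 tiles" estimate. The listed blocks sum to 18.3 mm2, slightly above the quoted 18.0 mm2 total, presumably because of rounding.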

This mask is your mask, this mask is my mask, from databases to dsp, this mask belongs to you and me...

We all have to share the expense of those masks. My guess is that AVX-512 will become cheaper as time goes on. That said, maybe there should be a little less sharing. This is already happening now that AMD is competitive. You could make the case that GPUs are part of this too.

It would be interesting to see a CPU with low latencies for the 128-bit path and higher latencies for the 256-bit and 512-bit paths: same throughput, different latencies. On a smaller core, this could look like Knights Landing with fewer threads.

For reference (format: reciprocal throughput/latency, in cycles):

                 addps   mulps
Knights Landing  .5/6    .5/6
Haswell          1/3     .5/5
Broadwell        1/3     .5/3
Skylake          .5/4    .5/4

You can see that Skylake regressed on latency compared to Broadwell: 4 cycles versus 3 for both addps and mulps. Intel clearly didn't do this for giggles. Low latency isn't free.
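
A rough way to see what these latencies cost software: as a rule of thumb, an instruction with latency L and reciprocal throughput T needs about L/T independent operations in flight to run at full throughput. The sketch below applies that to the figures in the table. The function name and the rounding up are illustrative choices, not something taken from a vendor manual.

```python
import math

# (reciprocal throughput, latency) in cycles, copied from the table above.
TIMINGS = {
    "Knights Landing": {"addps": (0.5, 6), "mulps": (0.5, 6)},
    "Haswell":         {"addps": (1.0, 3), "mulps": (0.5, 5)},
    "Broadwell":       {"addps": (1.0, 3), "mulps": (0.5, 3)},
    "Skylake":         {"addps": (0.5, 4), "mulps": (0.5, 4)},
}


def chains_needed(recip_throughput, latency):
    """Independent operations that must be in flight to hide the latency.

    Rule-of-thumb estimate: latency divided by reciprocal throughput,
    rounded up to a whole number of operations.
    """
    return math.ceil(latency / recip_throughput)


# Estimate for every core/instruction pair in the table.
CHAINS = {
    core: {op: chains_needed(*timing) for op, timing in ops.items()}
    for core, ops in TIMINGS.items()
}

for core, ops in CHAINS.items():
    print(f"{core:16s} addps: {ops['addps']:2d}  mulps: {ops['mulps']:2d}")
```

By this estimate Skylake needs 8 independent addps or mulps in flight, against 3 and 6 on Broadwell, so code with few independent rows has a harder time reaching peak throughput.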

I work with video codecs. Higher latency would hurt for smaller vectors, but when you are working with larger vectors you tend to have more rows to work with. Latency might hurt a bit with pmaddwd and pmaddubsw, since they are used for horizontal adds and subtracts. Personally, I think Intel should add drop-in replacements for those.
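
To illustrate the horizontal add and subtract trick, here is a scalar model of pmaddwd and pmaddubsw (the x86 multiply-add instructions referred to above). It is a sketch of the lane arithmetic only, written with plain Python lists. It ignores the one pmaddwd corner case where both products in a pair are (-32768 * -32768) and the 32-bit sum wraps.

```python
def pmaddwd(a, b):
    """Model of pmaddwd: multiply signed 16-bit lanes, then add each
    adjacent pair of products into one 32-bit lane."""
    assert len(a) == len(b) and len(a) % 2 == 0
    assert all(-32768 <= x <= 32767 for x in a + b)
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]


def pmaddubsw(a_u8, b_s8):
    """Model of pmaddubsw: unsigned 8-bit lanes times signed 8-bit lanes,
    adjacent products added with signed saturation to 16 bits."""
    assert len(a_u8) == len(b_s8) and len(a_u8) % 2 == 0
    assert all(0 <= x <= 255 for x in a_u8)
    assert all(-128 <= x <= 127 for x in b_s8)
    out = []
    for i in range(0, len(a_u8), 2):
        s = a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1]
        out.append(max(-32768, min(32767, s)))  # saturate to int16
    return out


row = [1, 2, 3, 4, 5, 6, 7, 8]

# A multiplier of all ones turns the multiply-add into a pairwise horizontal add.
pair_sums = pmaddwd(row, [1] * len(row))

# Alternating +1/-1 turns it into a pairwise horizontal subtract.
pair_diffs = pmaddwd(row, [1, -1] * (len(row) // 2))

# Same idea on 8-bit pixels; large products saturate instead of wrapping.
pixel_sums = pmaddubsw([10, 20, 30, 40], [1, 1, 1, 1])
saturated = pmaddubsw([255, 255], [127, 127])

print(pair_sums, pair_diffs, pixel_sums, saturated)
```

Because the add is fused behind a multiply, each horizontal step pays the multiplier's latency even though no real multiplication is needed, which is why a cheaper drop-in replacement would be attractive.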

I have no idea how much supporting wide registers actually costs, or whether that cost can be reduced with a slower implementation.

As things slow down, another option could be to alternate what each generation is good at. I don't think this is possible for now, given the constant waves of vulnerabilities.