Article: Knights Landing Details
By: Nicolas Capens (nicolas.capens.delete@this.gmail.com), January 16, 2014 9:23 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on January 11, 2014 1:47 pm wrote:
> Nicolas Capens (nicolas.capens.delete@this.gmail.com) on January 11, 2014 12:24 am wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on January 10, 2014 4:06 pm wrote:
> > > Nicolas Capens (nicolas.capens.delete@this.gmail.com) on January 10, 2014 2:22 pm wrote:
> > > > David Kanter (dkanter.delete@this.realworldtech.com) on January 9, 2014 6:42 pm wrote:
> > > > > > > I'll further add that I'm willing to wager that I'm correct about the 2 load
> > > > > > > pipelines, and that Nicolas is wrong about a virtually addressed L0 cache.
> > > > > >
> > > > > > What exactly do you mean by 2 load pipelines? If you mean that both vector units would be
> > > > > > able to use a memory operand in the same cycle, then yes, based on Eric's code that seems
> > > > > > to be a necessity. However due to the duplicate loading of the same memory this can be provided
> > > > > > by a dual-ported L0 cache, while the L1 cache itself only requires a single load port.
> > > > >
> > > > > I mean that the KNL core has two AGUs, both VPUs can execute
> > > > > load+op every cycle and that the L1D cache can fill
> > > > > two hits per cycle. I do not believe the cache is dual ported, as I believe it will be heavily banked.
> > > >
> > > > As far as I know that counts as a dual-ported cache.
> > >
> > > No it doesn't.
> >
> > It's what my professor called it: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.107.3602
> >
> > What did your professor call the thingies that determine the
> > maximum number of simultaneous accesses instead of ports?
>
> I didn't learn computer architecture from a professor.
>
> A cache and an SRAM are totally different entities, e.g., http://courses.engr.illinois.edu/ece512/spring2006/Papers/Itanium/isscc_2002_4s.pdf
> is a good explanation of the differences here (incidentally the distinction between SRAM and caches is something
> you seem to gloss over repeatedly throughout this thread).
With all due respect, you don't have to lecture me on such a basic matter. I did learn computer architecture from a professor, as well as digital design, embedded systems design, etc.
> The problem here is that CS people are incredibly imprecise and haphazard about terminology (in
> comparison to mathematics). SRAMs and RFs have ports. The definition there is fairly precise.
It helps to distinguish between hardware designers and software developers instead of lumping them together as "CS people", even though some are both. To a software developer, even a low-level compiler engineer, a cache is mostly a concept and has a broad meaning; you can even have a software cache. For low-level software developers it's important how many memory operands you can have per cycle, but it's of practically no importance how that's achieved in hardware. Bank conflicts should be rare enough that a software developer doesn't have to care about them. To a hardware designer a cache isn't just a concept, it's a physical thing, and if it's a banked design that can suffer bank conflicts, that's of huge importance. Also, things evolve fast enough that the norm for implementing a multi-ported cache changes every few years. But conceptually it's still multi-ported.
So I really don't think the problem is "CS people" being very imprecise about terminology. There are just different contexts, and it appears you're mistaking one specific implementation for the definition of the terms themselves. For what it's worth, Stanford professors also consider multi-banking a way to implement a multi-ported cache: http://www.stanford.edu/class/ee282/handouts/lect.04.caches2.pdf
> I avoid using the same terminology on caches, or when I do, I try to make
> sure that I am precise (although I'm not successful all the time).
>
> What is a 'dual ported cache'?
>
> Is it a cache comprising a dual-ported tag and data array? That is the most common usage because
> most architects are thinking about circuits a fair bit. In fact, if you ask Intel's architects
> whether the caches are multi-ported, they will tell you 'no, they are multibanked'.
>
> Is it a cache that has two access 'ports'? Well maybe, but what are those 'ports' capable of? How do
> I know whether they are 2 R/W, R+W, 2R, or 2W? How do I know what the address constraints are? E.g.,
> is the underlying data array banked, while the tags are dual ported? What about the controller logic?
>
> There's a lot of confusing terminology here and it pays to be precise.
I'm afraid the confusion is all yours. Read ports are well defined, and even if you lacked the terminology, I think I've made myself perfectly clear from the start that the L1 cache may not require two reads per cycle to feed two vector units with (unique) data. The existence of an L0 cache with two read ports would potentially explain why the Intel compiler reads operands from memory again despite having the data in a register, and despite L1 accesses consuming considerable power. And again, I'm not claiming KNL has such an L0 cache. I'm merely observing that in your article you take a leap by saying KNL must have two load pipelines (which you've clarified to mean two L1 load ports, not something like an L0 cache to service recently loaded data), a conclusion drawn solely from x86 being a load-op ISA. There are viable alternatives, of which I'm suggesting just one.
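For what it's worth, here's the kind of quick check I have in mind when I talk about duplicate memory operands in Eric's listings. It's just a sketch: the regex, the lookback window and the file name ("knl_kernel.s") are placeholders for whatever AT&T-style compiler output you point it at, nothing more.

import re
from collections import deque

MEM_OPERAND = re.compile(r'-?\d*\(%r[a-z0-9]+(?:,%[a-z0-9]+,\d+)?\)')  # AT&T-style memory operands
WINDOW = 16                                # how many recent memory operands to remember

def count_duplicate_operands(lines):
    recent = deque(maxlen=WINDOW)
    total = dupes = 0
    for line in lines:
        for op in MEM_OPERAND.findall(line):
            total += 1
            if op in recent:
                dupes += 1                 # same operand read again instead of reusing a register
            recent.append(op)
    return total, dupes

with open("knl_kernel.s") as f:            # placeholder name for the compiler's assembly output
    total, dupes = count_duplicate_operands(f)
print(dupes, "of", total, "memory operands repeat a recent one")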
> > > > Anyway, a multi-ported multi-banked L1 cache is a reasonable possibility. I just don't see why it "must" be
> > > > the only possibility, especially with x86 being a load-op
> > > > architecture as the only explanation in your article.
> > >
> > > > Given that 1 byte/FLOP would suffice and that the code generated
> > > > by ICC has a lot of duplicate memory accesses,
> > > > it also seems reasonable to me that there's a single-ported L1 cache and a dual-ported L0 cache.
> > >
> > > First, 1B/FLOP isn't sufficient. Look at KNC, it can source 1x64B operand from memory
> > > and 2 from registers while performing 8 FMAs. That's 8B for 2 operations or 4B/FLOP.
> >
> > I was talking about single-precision FLOPS. And it's obvious that KNC had 2 bytes/SPFLOP since providing
> > any less would have required cutting that bus in half and sequencing things, which adds more complexity
> > overall.
>
> It turns to be the same thing, which is half an operand per FLOP.
Yes, but that's a consequence of the architecture. In contrast, Kepler has a SIMD array of 32 load units, which can fetch either 32- or 64-bit operands. So you have to take precision into account when making the comparison, and then you might as well use bytes/FLOP as the metric. GK110 has 0.33 bytes/SPFLOP and 2 bytes/DPFLOP. So with just one load port and two vector units, KNL would beat it by having 1 byte/SPFLOP and 2 bytes/DPFLOP. There's no need to go overboard and double that, especially with an L0 cache to handle recently loaded data. KNC did have twice the L1 bandwidth, but only because it had a single vector unit and providing a narrower load bus would have complicated things more than it saved. So again, that's not a reference point. Kepler has low SP bandwidth/FLOP, but that's more a consequence of high SP FLOPs than of low cache bandwidth, and it works out fine for graphics workloads.
1 byte/SPFLOP, which amounts to one load operand for every two instructions, is the right ratio for the vast majority of workloads running on an architecture with ~32 registers. It's just an inherent property of the code, determined by the data locality in the algorithms. If you need more than 1 byte/SPFLOP, you should probably change something about your algorithms (like merging loops to avoid writing temporary results and reading them back in the next one, or to average out memory-heavy and arithmetic-heavy code).
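To show where those numbers come from, here's the back-of-the-envelope arithmetic. The GK110 figures assume each of the 32 LD/ST units per SMX moves one 32-bit (or 64-bit) element per clock, and the KNL configuration is of course the hypothetical single-load-port, dual-VPU core I'm arguing for, not a known spec.

def bytes_per_flop(load_bytes_per_clock, fma_lanes):
    return load_bytes_per_clock / (fma_lanes * 2)    # an FMA counts as 2 FLOPs

# GK110 SMX: 32 LD/ST units, 192 SP / 64 DP FMA lanes
print(bytes_per_flop(32 * 4, 192))    # ~0.33 B/SPFLOP
print(bytes_per_flop(32 * 8, 64))     # 2.0  B/DPFLOP

# Hypothetical KNL core: ONE 64-byte load port feeding TWO 512-bit FMA units
print(bytes_per_flop(64, 2 * 16))     # 1.0 B/SPFLOP, i.e. one load per two FMA instructions
print(bytes_per_flop(64, 2 * 8))      # 2.0 B/DPFLOP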
> > So KNC isn't the gold standard here. Doubling the number of execution units does not mean the
> > number of memory load ports have to be doubled. Kepler has one load port per six FMA units. I'm not saying
> > that's the best ratio, but it's a strong indication that 2 bytes/SPFLOP is completely unnecessary.
>
> Funny, but Intel isn't trying to build Kepler. They are building an HPC-specific part. I'm asserting
> the fact that KNC is the closest shipping approximation to what Intel believes is a good fit for HPC.
If KNC is so perfect, then why didn't Intel win back the entire HPC market? I think it's a pipe dream that you can build anything that's head and shoulders above the competition. NVIDIA would have had no problem removing the texture units from Kepler and increasing the bytes/FLOP if that were the recipe for success in the HPC market. Instead, just like with their previous architectures, they decided to use the same chips for high-end graphics and HPC. Note that AMD doesn't wipe the floor with them either, despite GCN's higher bandwidth/FLOP.
The reality is that performance/Watt rules everything, and you don't win anything by providing more bandwidth than what the vast majority of the code uses because it burns power all the time.
> This should be obvious since that's what they invested millions of dollars in building.
> Notice how Intel specifically chose not to emulate Nvidia GPUs? Maybe it's because
> they think they have a different solution that they believe is a better fit?
Intel's only goal with MIC is to keep x86 relevant in the HPC market. They can't defeat NVIDIA through architecture alone. They even need their process advantage and software compatibility advantage just to have anything competitive and to keep the world from switching to ARM and CUDA, or HSA for that matter. So this isn't offense, it's defense.
> > > Haswell also sources 64B/cycle while performing 8 FMAs.
> >
> > Yes, and it has a third and fourth arithmetic execution port, both of which can't have a
> > memory source operand if the other two already utilize the two L1 read ports.
>
> > So that's
> > a 2:4 ratio, while I'm suggesting a 1:2 ratio for KNL, augmented with an L0 cache so you
> > can actually have a 1:1 ratio to lower register pressure while staying power efficient.
>
> Notice how I was talking about bytes per FLOP? I'm sure you noticed but those other
> arithmetic units don't perform FLOATING POINT OPS. So the ratio still works fine.
Haswell has four arithmetic execution ports instead of two, despite 'only' two load ports, because they expect to have a purpose for them! Those two extra ports can also require a load operand. If it were really necessary to have two load ports just to feed the two floating-point execution ports, you couldn't give any other instruction a memory operand, let alone two per cycle! But Intel has explicitly mentioned that the fourth execution unit was added to offload the vector ports. As a consequence they also added a third AGU to sustain two loads per cycle, but nothing exceeding that 1:2 ratio.
So I think bytes/FLOP is only meaningful when every execution port is the same. Kepler, for instance, can do a floating-point or integer operation on any of its execution units, but obviously it's going to be rare for all of them to be doing floating-point exclusively. Also note that while FMA capability slashes the bytes/FLOP figure in half, the number of FMA instructions in actual code is typically not that high (I recall a study saying 30% on average for floating-point-heavy algorithms). So all I'm saying is that you shouldn't take the bytes/FLOP metric too literally. If one of Kepler's six SP execution units were capable of integer operations only, that would increase the theoretical bytes/FLOP, but not result in a particularly better architecture.
> Moreover, even if you want to count them, it's still approximately the same.
> Haswell can perform 32 SPFLOP + 2 integer ops per clock. It can load 64B/clock.
> 64B / 34 ops = 1.88B/op. Notice how that's REALLY REALLY CLOSE to 2B/FLOP?
You can't count every SP operation individually and add that to the scalar operations. If an 8-bit integer instruction requires a memory source operand, it occupies an entire load port which could otherwise deliver 256 bits. In other words, it's not worth 1 byte out of your 64 bytes/clock of bandwidth, but 32 bytes. You have to think in terms of instructions per cycle, regardless of whether they're scalar or vector instructions.
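A toy example of what I mean, with a made-up instruction mix rather than anything measured:

import math

LOAD_PORTS = 2                     # Haswell: two load ports, 32 bytes wide each

# Made-up mix: two 256-bit FP ops and two narrow integer ops, each with a memory source operand
mem_operand_bytes = [32, 32, 1, 1]

slots = len(mem_operand_bytes)                    # every memory operand takes one port slot
cycles_for_loads = math.ceil(slots / LOAD_PORTS)  # 2 cycles of load capacity consumed
avg_bytes_per_op = sum(mem_operand_bytes) / len(mem_operand_bytes)

print(cycles_for_loads, avg_bytes_per_op)  # 2 and 16.5: averaging bytes per op hides the fact
                                           # that the 1-byte loads each burned a full 32-byte
                                           # port slot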
> > > So if you look at these designs, it's very clear that Intel believes that 4B/FLOP is the right
> > > amount of L1D bandwidth. Frankly, you can analyze compiler output all you want...but Intel
> > > invested millions in making the right decision and that's infinitely more convincing.
> >
> > So what do you think NVIDIA and AMD invested in making the right decision?
>
> NV and AMD are targeting different products with radically different workloads; they have to design products
> that are first and foremost good at DirectX. KNC and KNL don't run DirectX and therefore should be quite different.
> Also, Intel's resources are different than NV or AMD, so the solution will be different looking.
KNC is the remnant of Intel's failed attempt to create a discrete GPU. To have any success in the consumer market it had to be first and foremost a GPU; an x86-compatible GPGPU/HPC device would have just been a bonus. So what they considered the right decision for such a product back then shouldn't be very different from the decisions NVIDIA has to make to offer its architectures both as consumer GPUs and as HPC devices. However, instead of something closer to 0.33 bytes/SPFLOP, Larrabee and KNC could do 2 bytes/SPFLOP. So that should really make you question your faith in their ability to make the right decision back then. If anything, the failure of Larrabee, together with the lower bandwidth of GPUs and their relative success in the HPC market, should be an indication that they should do things differently next time. And again, with a single vector unit per core they didn't really have any other option, so stop considering it a 'right decision'. There was no actual decision, and they ended up with something that failed as a GPU and barely made it as an HPC device.
Note also that Larrabee didn't primarily fail because of a bad architecture for executing programmable code, but because of a number of other decisions, such as the lack of several fixed-function units, that made it significantly slower at running contemporary games. And by significant I mean enough for consumers to lose interest, which is actually a very small margin: the 'war' between NVIDIA and ATI has frequently been 'won' by only a 20% difference in bang for buck. Intel would have lost billions selling it in the price class of the GPUs it could actually compete with. Any additional value from being more programmable would not have paid off for many years, until Intel gained a dominant position and developers started to take notice. As an HPC device, only the differences in the execution cores (and cache hierarchy) matter, and they matter much less because the extra programmability itself has a lot of value.
My point is that just because NVIDIA and ATI got things right for consumer graphics devices doesn't mean they got it wrong for HPC. Intel has the advantage of x86 compatibility, which helped compensate for any wrong decisions. If KNL has two vector units per core, thus offering the option to save on expensive L1 bandwidth, there is no guarantee they'll go for 2 bytes/SPFLOP again just because x86 is a load-op ISA.
The stakes are higher now for getting it right, because GPUs have become a lot more programmable since the Larrabee days and Intel is now also competing with itself (people need a reason to upgrade from KNC, but they will also soon have the option of getting Xeons with AVX-512). So KNL has to offer the highest possible performance/Watt for parallel workloads. With an L1 access costing 40% more power than a register read, there's a lot to be gained from keeping the number of actual accesses low.
> > Looking at typical x86 assembly output teaches us a lot
> > of things. Not every instruction takes a memory source
> > operand. The ratio is rarely over 1:2 when the accesses are
> > unique, and having 32 registers helps a lot to ensure
> > that for AVX-512. However Eric's experiments show that the
> > Intel compiler for KNL isn't shy to use as many memory
> > operands as possible. And knowing how much an L1 access cost
> > in power, that doesn't make sense unless they have
> > 'something' to make it cheap for very locally reused data. I call that something an L0 cache.
>
> You are assuming that 'typical x86 assembly output' today is the same as what will ship in 2015. That's a
> bad assumption, unless you think that all of Intel's compiler people are going on vacation for a year.
This isn't a metric that varies wildly. An average MEM:ALU ratio of around 1:2 is very common across the board, because that's how many operations are performed on every input that isn't short-lived enough to be stored in a register. The Pentium had a 1:2 ratio, then they went to 1:3 with P6 and stayed there up until Sandy Bridge, which could squeeze in another load when no store is happening, until Haswell finally restored the ratio to 1:2. GPUs also have a 1:2 ratio at best. So what would make KNL so different that it needs a 1:1 ratio, giving it the ability to load data from memory on every single instruction, even though AVX-512F has 32 registers?
To me that only makes sense if the load ports are used in a novel way, such as with an L0 cache to efficiently reference recently loaded data again, thereby acting somewhat like a register cache. That is, you don't have to manually load data from memory into a register and reuse that; you can keep referencing the memory operand and the hardware will figure out when it can fetch it from a tiny buffer instead of the L1. This is the only way you can dramatically change the typical 1:2 ratio, which assumes optimizing for register reuse, without paying a hefty price.
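To be clear about what I'm picturing, here's a toy model of such an L0. This is entirely my speculation: the entry count, line size and LRU replacement are made-up parameters, purely for illustration, and coherence is hand-waved as "drop the line on a store or L1 eviction".

from collections import OrderedDict

class L0Cache:
    def __init__(self, entries=4, line_bytes=64):    # made-up sizing
        self.entries = entries
        self.line_bytes = line_bytes
        self.lines = OrderedDict()                   # line address -> present, kept in LRU order

    def load(self, addr):
        line = addr // self.line_bytes * self.line_bytes
        if line in self.lines:                       # hit: serviced without touching the L1
            self.lines.move_to_end(line)
            return "L0"
        self.lines[line] = True                      # miss: fill from L1, paying its cost once
        if len(self.lines) > self.entries:
            self.lines.popitem(last=False)           # evict the least recently used line
        return "L1"

    def invalidate(self, addr):                      # a store or L1 eviction just drops the line
        line = addr // self.line_bytes * self.line_bytes
        self.lines.pop(line, None)

l0 = L0Cache()
# The same operand referenced by two load-op instructions in consecutive cycles:
print(l0.load(0x1000), l0.load(0x1000))              # -> L1 L0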
> > > It won't be. L1 caches are fine for KNL and the code you're
> > > seeing is probably the result of an immature compiler.
> >
> > The Intel compiler has been around for a long time, and AVX-512 is a very minor variation on KNC's
> > 512-bit ISA extension. Also as someone pointed out elsewhere you need a somewhat representative
> > compiler to generate traces that can be used during the design of the new architecture. And lastly
> > this behavior isn't coming from some complicated experimental optimization. This is part of basic
> > register allocation and for decades the overall goal has been to minimize the number of memory operands.
> > To suddenly see that basic rule reversed cannot possibly be something Intel has overlooked. It's
> > a deliberate choice and we have to find the explanation in the micro-architecture.
>
> Or maybe it's just a compiler that still has a year worth of tuning to be done?
First of all, this isn't a tiny tweak. There's a massive difference in how memory operands are used between the code generated for AVX2 and the code generated for AVX-512F. They need a somewhat representative compiler for designing the hardware, so they can't let this pass and make the change later. Secondly, it's straightforward for compiler engineers to switch back and forth between these two approaches. It's not a year's worth of work.
> > > Caches might cost power, but the real problem is large scale coherency. The L1D is fine.
> >
> > The paper above says it takes 40% more power. That's a huge deal. Even if large scale coherency
> > is a bigger problem, you can't ignore this. 40% is a lot of opportunity for doing something a little
> > different when doubling the number of execution units, like, an L0 cache perhaps? Even if the L0
> > cache itself costs 10% of power per access, that's a huge saving over a second L1 access.
>
> Uh, the paper says that reading 512b from the L1D takes 40% more power than reading a register. First, that's
> pretty similar power consumption, and definitely not worth the complexity of an additional structure. Especially
> considering that there's also a store buffer running around - it would be nice if they modeled that.
You've got to be kidding. 40% is a huge difference. It definitely leaves room for a small structure like an L0 cache to help reduce it. Many substantial architectural changes offer a benefit that's only in the single-digit percentages.
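Rough arithmetic, using the paper's 1.4x figure for a 512-bit L1 read relative to a register read, and my earlier hypothetical of an L0 hit costing 10% of an L1 access (nothing here is measured):

E_REG = 1.0               # 512-bit register file read (baseline, per the paper)
E_L1  = 1.4 * E_REG       # 512-bit L1D read, per the paper
E_L0  = 0.1 * E_L1        # hypothetical tiny L0 hit at 10% of an L1 access

two_l1_reads = 2 * E_L1           # both VPUs source an operand from the L1
l1_plus_l0   = E_L1 + E_L0        # the second operand is caught by the L0
print(1 - l1_plus_l0 / two_l1_reads)   # ~0.45: nearly half the load energy saved in
                                       # cycles where the L0 catches the reuse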
> I just don't see enough of a gain, especially since the structure you described
> adds overhead to register file reads, which is a terrible plan.
It doesn't add overhead to register file reads at all. It helps reduce register pressure.
> > > > > I'm willing to wager money that there is no L0 as you have described it.
> > > >
> > > > Gamblers wage a lot of money even though they know they have less than a 50% chance of winning
> > > > but choose to ignore it. So your money doesn't mean anything to me even though you say this
> > > > must be the only option for KNL. I won't wager anything because to me it's a coin on it's
> > > > side. I think a dual-ported L1 is not the only option. That doesn't mean I think the other
> > > > option is more likely. And neither does that doubt mean I think it's any less likely.
> > >
> > > What I hear is that you aren't very confident of the L0 being the right
> > > solution, whereas I'm highly confident that it is the wrong solution.
> >
> > There's a difference between it being the wrong solution and two L1 read ports being the right
> > solution. The L0 cache is just one way Intel might avoid a 40% increase in power consumption,
> > but the most likely one I've been able to come up with so far based on the compiler output.
> > That you think that output is "probably" the result of an immature compiler, which it clearly
> > isn't, is a much more interesting expression of doubt. So what's your other explanation?
>
> I'm confident there is no L0 cache. I never said that the L1D
> is dual ported, I said that it supports 2x64B reads/clock.
Then it needs two read ports. Even if it's implemented with multiple banks with a single read port each, it adds complexity to receive two addresses, select banks, and deal with conflicts, and you're also consuming twice the SRAM lookup power on every dual access in the same cycle. With a tiny L0 cache you could avoid all that and even reduce register file pressure.
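As a side note on the banking alternative: under a simple independence assumption (two uniformly distributed accesses per cycle, which says nothing about KNL's actual banking, and strided access patterns can collide far more often), the expected conflict rate is just 1/B:

for banks in (4, 8, 16):
    print(banks, "banks ->", 1 / banks)   # chance both accesses pick the same bank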
> You're just going to have to accept that given the current set of facts today, there is no way you
> will convince me you are correct. I think the L0 cache is a terrible idea because it is a marginal
> gain, very brittle, and adds complexity to an area that is rife with critical paths already.
40% is not a marginal gain; there's nothing brittle about it, since it can easily be kept coherent with the L1; and it can fit in the LSU without affecting any critical paths. All your reasons for thinking it's a terrible idea are plain wrong.
> When the facts change, I might change my conclusion. Until then, I'm quite confident
> of where I stand. And I'm still open to that wager about the existence of the L0...
I'm not trying to convince you that I'm correct about the L0 cache. I'm trying to convince you that a dual-ported L1 is not a given due to x86 being a load-op ISA. Other x86 architectures, as well as GPUs, have a lower MEM:ALU ratio and they're doing fine. But a single L1 load port wouldn't explain the code Eric has posted. So I proposed the L0 cache as something which would explain all the given "facts" at once. Whether that's the solution Intel ended up implementing, I don't know. It's just one of the viable possibilities besides assuming it "must have" two L1 read ports.
Register file read port requirements | Eric Bron | 2014/01/11 12:12 PM |
Register file read port requirements | Michael S | 2014/01/11 12:36 PM |
Register file read port requirements | Eric Bron | 2014/01/11 12:51 PM |
Register file read port requirements | Patrick Chase | 2014/01/13 02:27 PM |
Register file read port requirements | Eric Bron | 2014/01/13 04:24 PM |
Register file read port requirements | Patrick Chase | 2014/01/13 06:02 PM |
Register file read port requirements | Eric Bron | 2014/01/14 04:50 AM |
Register file read port requirements | Michael S | 2014/01/14 11:36 AM |
Register file read port requirements | Eric Bron nli | 2014/01/14 01:04 PM |
Register file read port requirements | Patrick Chase | 2014/01/13 02:17 PM |
Register file read port requirements | Michael S | 2014/01/15 04:27 AM |
Register file read port requirements | Eric Bron | 2014/01/11 11:28 AM |
Register file read port requirements | Michael S | 2014/01/11 12:07 PM |
Register file read port requirements | Patrick Chase | 2014/01/13 02:40 PM |
Register file read port requirements | Patrick Chase | 2014/01/13 02:34 PM |
Register file read port requirements | Ricardo B | 2014/01/11 12:55 PM |
Register file read port requirements | Eric Bron | 2014/01/11 01:17 PM |
Register file read port requirements | Ricardo B | 2014/01/11 02:36 PM |
Register file read port requirements | Eric Bron | 2014/01/11 02:42 PM |
Register file read port requirements | Ricardo B | 2014/01/11 03:20 PM |
Register file read port requirements | Eric Bron | 2014/01/11 03:26 PM |
Register file read port requirements | Michael S | 2014/01/11 04:07 PM |
Register file read port requirements | Ricardo B | 2014/01/11 04:38 PM |
Register file read port requirements | Michael S | 2014/01/11 04:49 PM |
Register file read port requirements | Eric Bron | 2014/01/11 03:39 PM |
Register file read port requirements | Eric Bron | 2014/01/11 03:41 PM |
Register file read port requirements | Ricardo B | 2014/01/11 04:30 PM |
Register file read port requirements | Nicolas Capens | 2014/01/11 12:09 PM |
Knights Landing L/S bandwidth | anon | 2014/01/05 06:55 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/05 07:30 AM |
Knights Landing L/S bandwidth | anon | 2014/01/06 01:07 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 02:38 AM |
Knights Landing L/S bandwidth | anon | 2014/01/06 04:01 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 04:44 AM |
Knights Landing L/S bandwidth | anon | 2014/01/06 05:39 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 06:00 AM |
Knights Landing L/S bandwidth | anon | 2014/01/06 06:44 AM |
Knights Landing L/S bandwidth | Michael S | 2014/01/06 08:54 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 10:11 AM |
Knights Landing L/S bandwidth | Michael S | 2014/01/06 10:14 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 11:37 AM |
Knights Landing L/S bandwidth | Ricardo B | 2014/01/08 06:25 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/08 08:36 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/08 08:41 AM |
KNC code generator with EVEX back-end? | Michael S | 2014/01/08 09:43 AM |
KNC code generator with EVEX back-end? | Exophase | 2014/01/08 10:00 AM |
KNC code generator with EVEX back-end? | Ricardo B | 2014/01/08 11:39 AM |
KNC code generator with EVEX back-end? | Eric Bron | 2014/01/08 12:15 PM |
KNC code generator with EVEX back-end? | Exophase | 2014/01/08 01:17 PM |
KNC code generator with EVEX back-end? | Ricardo B | 2014/01/08 02:06 PM |
KNC code generator with EVEX back-end? | Exophase | 2014/01/08 02:24 PM |
KNC code generator with EVEX back-end? | Eric Bron | 2014/01/08 02:38 PM |
KNC code generator with EVEX back-end? | Michael S | 2014/01/08 01:54 PM |
KNC code generator with EVEX back-end? | Eric Bron | 2014/01/08 10:25 AM |
KNC code generator with EVEX back-end? | Eric Bron | 2014/01/08 10:35 AM |
KNC code generator with EVEX back-end? | Michael S | 2014/01/08 11:07 AM |
KNC code generator with EVEX back-end? | Eric Bron | 2014/01/08 11:24 AM |
KNC code generator with EVEX back-end? | Michael S | 2014/01/08 11:43 AM |
KNC code generator with EVEX back-end? | Eric Bron | 2014/01/08 01:23 PM |
KNC code generator with EVEX back-end? | Eric Bron | 2014/01/08 10:43 AM |
AVX2 code much different than AVX-512 | Eric Bron | 2014/01/08 08:52 AM |
evil question | hobold | 2014/01/08 10:22 AM |
evil question | Eric Bron | 2014/01/08 10:27 AM |
evil question | hobold | 2014/01/08 02:33 PM |
evil question | Michael S | 2014/01/08 02:37 PM |
stupid question (was: evil question) | hobold | 2014/01/09 05:41 AM |
stupid question (was: evil question) | Eric Bron | 2014/01/09 05:52 AM |
stupid question (was: evil question) | Michael S | 2014/01/09 08:00 AM |
stupid question (was: evil question) | Michael S | 2014/01/09 08:12 AM |
stupid question (was: evil question) | Eric Bron | 2014/01/09 10:47 AM |
stupid question (was: evil question) | Michael S | 2014/01/09 11:48 AM |
more decisive (hopefully) test case | Michael S | 2014/01/09 12:01 PM |
more decisive (hopefully) test case | Eric Bron | 2014/01/09 12:08 PM |
more decisive (hopefully) test case | Michael S | 2014/01/09 12:24 PM |
more decisive (hopefully) test case | Eric Bron | 2014/01/09 12:27 PM |
more decisive (hopefully) test case | Michael S | 2014/01/09 12:33 PM |
AVX2 | Eric Bron | 2014/01/09 12:14 PM |
AVX2 | Michael S | 2014/01/09 12:30 PM |
AVX2 | Eric Bron | 2014/01/09 12:40 PM |
another try | Michael S | 2014/01/09 03:02 PM |
another try | Eric Bron | 2014/01/09 03:33 PM |
another try | Michael S | 2014/01/09 04:20 PM |
another try - ignore misformated mess above | Michael S | 2014/01/09 04:24 PM |
another try - ignore misformated mess above | Gabriele Svelto | 2014/01/10 01:01 AM |
another try - ignore misformated mess above | Eric Bron | 2014/01/10 03:05 AM |
another try - ignore misformated mess above | Michael S | 2014/01/11 10:23 AM |
another try - ignore misformated mess above | Eric Bron | 2014/01/11 11:08 AM |
another try - ignore misformated mess above | Michael S | 2014/01/11 12:09 PM |
another try - ignore misformated mess above | Michael S | 2014/01/11 12:12 PM |
another try - ignore misformated mess above | Eric Bron | 2014/01/11 12:24 PM |
another try - ignore misformated mess above | Michael S | 2014/01/11 01:24 PM |
another try - ignore misformated mess above | Eric Bron | 2014/01/11 02:11 PM |
another try - ignore misformated mess above | Michael S | 2014/01/11 02:18 PM |
another try - ignore misformated mess above | Eric Bron | 2014/01/11 02:27 PM |
another try - ignore misformated mess above | Michael S | 2014/01/11 02:29 PM |
another try - ignore misformated mess above | Eric Bron | 2014/01/11 02:46 PM |
another try - ignore misformated mess above | Eric Bron | 2014/01/11 02:46 PM |
another try - ignore misformated mess above | Michael S | 2014/01/11 03:28 PM |
another try - ignore misformated mess above | Eric Bron | 2014/01/11 02:17 PM |
another try - ignore misformated mess above | Michael S | 2014/01/11 02:24 PM |
KNC version | Michael S | 2014/01/11 05:19 PM |
KNC version | Eric Bron nli | 2014/01/12 02:59 AM |
KNC version | Gabriele Svelto | 2014/01/12 09:06 AM |
evil question | Eric Bron | 2014/01/08 02:41 PM |
Knights Landing L/S bandwidth | Patrick Chase | 2014/01/05 11:20 PM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 02:45 AM |
Knights Landing L/S bandwidth | anon | 2014/01/06 04:12 AM |
Knights Landing L/S bandwidth | Michael S | 2014/01/06 04:17 AM |
Knights Landing L/S bandwidth | anon | 2014/01/06 05:20 AM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/04 05:34 PM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/04 05:44 PM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/05 12:25 PM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/05 01:50 PM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/05 03:34 PM |
Might even help with gather | Nicolas Capens | 2014/01/05 03:40 PM |
What is an L0 cache? | David Kanter | 2014/01/05 10:44 PM |
What is an L0 cache? | anon | 2014/01/06 05:57 AM |
What is an L0 cache? | Nicolas Capens | 2014/01/06 12:57 PM |
What is an L0 cache? | anon | 2014/01/06 02:18 PM |
Knights Landing L/S bandwidth | David Kanter | 2014/01/04 10:58 AM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/04 04:24 PM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/04 04:46 PM |
Knights Landing L/S bandwidth | Konrad Schwarz | 2014/01/08 12:48 AM |
Knights Landing L/S bandwidth | Michael S | 2014/01/08 02:45 AM |
Knights Landing L/S bandwidth | David Kanter | 2014/01/05 01:44 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/05 03:55 AM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/05 12:18 PM |
Knights Landing L/S bandwidth | Maynard Handley | 2014/01/05 11:33 PM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 04:02 AM |
Knights Landing L/S bandwidth | Michael S | 2014/01/06 04:23 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 04:35 AM |
Knights Landing L/S bandwidth | Michael S | 2014/01/06 05:20 AM |
Knights Landing L/S bandwidth | Michael S | 2014/01/06 05:32 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 05:36 AM |
Knights Landing L/S bandwidth | Michael S | 2014/01/06 06:00 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 06:07 AM |
Knights Landing L/S bandwidth | Eric Bron | 2014/01/06 06:14 AM |
edits | Eric Bron | 2014/01/06 06:22 AM |
optimized version | Eric Bron | 2014/01/06 06:35 AM |
yet more optimized version | Eric Bron | 2014/01/06 06:42 AM |
latest version for today | Eric Bron | 2014/01/06 06:51 AM |
Probably just L2 bandwith limited | Nicolas Capens | 2014/01/06 11:48 AM |
yet more optimized version | Maynard Handley | 2014/01/06 06:54 PM |
optimized version | Maynard Handley | 2014/01/06 06:52 PM |
optimized version | Michael S | 2014/01/07 10:42 AM |
optimized version | Nicolas Capens | 2014/01/07 12:36 PM |
optimized version | Michael S | 2014/01/07 03:41 PM |
optimized version | Nicolas Capens | 2014/01/07 10:52 PM |
optimized version | Michael S | 2014/01/08 02:10 AM |
optimized version | Eric Bron | 2014/01/07 02:34 PM |
optimized version | Michael S | 2014/01/07 03:18 PM |
optimized version | Eric Bron | 2014/01/07 03:30 PM |
optimized version | Eric Bron | 2014/01/07 03:33 PM |
optimized version | Michael S | 2014/01/07 03:57 PM |
optimized version | Maynard Handley | 2014/01/07 06:50 PM |
optimized version | Michael S | 2014/01/08 02:39 AM |
Knights Landing L/S bandwidth | Maynard Handley | 2014/01/06 06:47 PM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/06 09:18 AM |
Knights Landing L/S bandwidth | Maynard Handley | 2014/01/06 06:56 PM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/07 12:18 PM |
Knights Landing L/S bandwidth | NoSpammer | 2014/01/05 01:15 PM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/05 03:06 PM |
Knights Landing L/S bandwidth | NoSpammer | 2014/01/06 04:20 AM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/06 11:54 AM |
Knights Landing L/S bandwidth | NoSpammer | 2014/01/06 01:24 PM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/06 09:15 PM |
Knights Landing L/S bandwidth | NoSpammer | 2014/01/07 03:58 AM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/07 03:18 PM |
Knights Landing L/S bandwidth | NoSpammer | 2014/01/08 01:38 PM |
Knights Landing L/S bandwidth | Nicolas Capens | 2014/01/08 11:14 PM |
AVX512F question | Michael S | 2014/01/06 10:18 AM |
AVX512F question | Nicolas Capens | 2014/01/06 12:01 PM |
Knights Landing - time for obituary? | Michael S | 2018/07/31 03:00 PM |
Knights Landing - time for obituary? | Adrian | 2018/07/31 09:24 PM |
Knights Landing - time for obituary? | SoftwareEngineer | 2018/08/01 02:15 AM |
auto-vectorization is a dead end | Michael S | 2018/08/01 03:48 AM |
Auto-vectorization of random C is a dead end | Mark Roulo | 2018/08/01 11:07 AM |
Auto-vectorization of random C is a dead end | Passing Through | 2018/08/01 01:35 PM |
Auto-vectorization of random C is a dead end | David Kanter | 2018/08/01 10:44 PM |
Auto-vectorization of random C is a dead end | Passing Through | 2018/08/02 01:51 AM |
Auto-vectorization of random C is a dead end | SoftwareEngineer | 2018/08/02 01:19 AM |
Auto-vectorization of random C is a dead end | Mark Roulo | 2018/08/02 09:50 AM |
Auto-vectorization of random C is a dead end | Michael S | 2018/08/02 12:11 PM |
Auto-vectorization of random C is a dead end | j | 2018/08/02 11:37 PM |
Auto-vectorization of random C is a dead end | Michael S | 2018/08/03 03:50 AM |
Auto-vectorization of random C is a dead end | rwessel | 2018/08/03 11:06 PM |
Auto-vectorization of random C is a dead end | Ricardo B | 2018/08/03 04:20 AM |
Auto-vectorization of random C is a dead end | Michael S | 2018/08/03 05:37 AM |
Auto-vectorization of random C is a dead end | Ricardo B | 2018/08/03 11:22 AM |
Auto-vectorization of random C is a dead end | Travis | 2018/08/03 07:58 PM |
Potential way to autovectorization in the future. | Jouni Osmala | 2018/08/03 10:22 PM |
Potential way to autovectorization in the future. | Jukka Larja | 2018/08/04 04:03 AM |
Potential way to autovectorization in the future. | Passing Through | 2018/08/04 06:47 AM |
Potential way to autovectorization in the future. | Travis | 2018/08/04 01:50 PM |
Potential way to autovectorization in the future. | Michael S | 2018/08/04 02:33 PM |
Potential way to autovectorization in the future. | Travis | 2018/08/04 02:48 PM |
Potential way to autovectorization in the future. | Passing Through | 2018/08/04 02:58 PM |
Skylake server/client AVX PRF speculation | Jeff S. | 2018/08/04 05:42 PM |
Skylake server/client AVX PRF speculation | anonymou5 | 2018/08/04 06:21 PM |
Skylake server/client AVX PRF speculation | Jeff S. | 2018/08/04 06:38 PM |
Skylake server/client AVX PRF speculation | anonymou5 | 2018/08/04 07:45 PM |
Skylake server/client AVX PRF speculation | Jeff S. | 2018/08/04 08:08 PM |
Skylake server/client AVX PRF speculation | anonymou5 | 2018/08/04 08:18 PM |
Skylake server/client AVX PRF speculation | Nomad | 2018/08/05 11:10 PM |
Skylake server/client AVX PRF speculation | anonymou5 | 2018/08/06 12:14 PM |
Skylake server/client AVX PRF speculation | Travis | 2018/08/06 08:43 PM |
Skylake server/client AVX PRF speculation | Travis | 2018/08/06 08:39 PM |
Auto-vectorization of random C is a dead end | Brett | 2018/08/04 01:55 PM |
Auto-vectorization of random C is a dead end | Travis | 2018/08/04 02:38 PM |
Auto-vectorization of random C is a dead end | Passing Through | 2018/08/04 03:00 PM |
New record for shortest post by Ireland - AI crashed? (NT) | Travis | 2018/08/04 03:34 PM |
New record for shortest post by Ireland - AI crashed? | Passing Through | 2018/08/04 04:12 PM |
New record for shortest post by Ireland - AI crashed? | anonymou5 | 2018/08/04 06:00 PM |
New record for shortest post by Ireland - AI crashed? | Brett | 2018/08/04 06:40 PM |
New record for shortest post by Ireland - AI crashed? | anonymou5 | 2018/08/04 07:38 PM |
Auto-vectorization of random C is a dead end | noko | 2018/08/04 09:46 PM |
The story of ispc (a 12 entry blog series) | Simon Farnsworth | 2018/08/01 03:50 AM |
the 1st link is empty (NT) | Michael S | 2018/08/01 04:05 AM |
the 1st link is empty | Simon Farnsworth | 2018/08/01 06:42 AM |
Interesting read, thanks! (NT) | SoftwareEngineer | 2018/08/01 06:57 AM |
Amazing read | Laurent | 2018/08/01 09:00 AM |
Amazing read | Passing Through | 2018/08/01 01:13 PM |
Amazing read | Doug S | 2018/08/01 02:30 PM |
Amazing read | Passing Through | 2018/08/01 02:49 PM |
ISPC vs OpenCL? | j | 2018/08/02 11:41 PM |
ISPC vs OpenCL? | coppcie | 2018/08/03 03:55 AM |
ISPC vs OpenCL? | Passing Through | 2018/08/03 04:07 AM |
Go away | Forum Reader | 2018/08/03 08:11 AM |
ISPC vs OpenCL? | Gian-Carlo Pascutto | 2018/09/11 06:50 AM |
ISPC vs OpenCL? | SoftwareEngineer | 2018/08/03 04:18 AM |
Knights Landing - time for obituary? | Kevin G | 2018/08/01 07:14 AM |
Knights Landing - time for obituary? | SoftwareEngineer | 2018/08/01 07:29 AM |
Knights Landing - time for obituary? | Passing Through | 2018/08/01 07:38 AM |
Knights Landing - time for obituary? | Eric Bron | 2018/08/02 06:57 AM |
Knights Landing - time for obituary? | Passing Through | 2018/08/02 12:29 PM |
Knights Landing - time for obituary? | Eric Bron | 2018/08/02 01:49 PM |
Knights Landing - time for obituary? | Passing Through | 2018/08/02 02:17 PM |
chess algorithms vs, low level optimizations | Eric Bron | 2018/08/02 07:15 AM |
AlphaZero vs Stockfish | Michael S | 2018/08/02 07:55 AM |
AlphaZero vs Stockfish | Eric Bron | 2018/08/02 08:24 AM |
AlphaZero vs Stockfish | Michael S | 2018/08/02 09:01 AM |
AlphaZero vs Stockfish | Eric Bron | 2018/08/02 09:11 AM |
Leela 4th vs all others | Eric Bron nli | 2018/09/11 03:40 AM |
AlphaZero vs Stockfish | Gian-Carlo Pascutto | 2018/09/11 06:31 AM |
AlphaZero vs Stockfish | Eric Bron | 2018/09/11 09:26 AM |
AlphaZero vs Stockfish | Eric Bron | 2018/09/11 09:58 AM |
AlphaZero vs Stockfish | Per Hesselgren | 2018/12/31 10:04 AM |
Leela Chess Zero | Per Hesselgren | 2018/12/31 12:00 PM |
AlphaZero vs Stockfish (on Xeon) | Per Hesselgren | 2018/12/31 09:59 AM |
C/C++ and vector/parallel/distributed | RichardC | 2018/08/02 05:50 AM |
Knights Landing - time for obituary? | Passing Through | 2018/08/01 07:52 AM |
Knights Landing - time for obituary? | Kevin G | 2018/08/01 02:03 PM |
Knights Landing - time for obituary? | Passing Through | 2018/08/01 02:33 PM |
Knights Landing - time for obituary? | Kevin G | 2018/08/01 08:26 AM |
Knights Landing - time for obituary? | Kevin G | 2018/08/01 08:26 AM |
Knights Landing - time for obituary? | juanrga | 2018/08/01 02:26 PM |
Knights Landing - time for obituary? | hobel | 2018/08/02 05:46 AM |
Knights Landing - time for obituary? | juanrga | 2018/07/31 11:25 PM |
Right, time for obituary for whole LRB lineage | AM | 2018/08/02 11:46 AM |
Right, time for obituary for whole LRB lineage | Adrian | 2018/08/02 11:46 PM |
LRBNI, AVX512, etc... | Michael S | 2018/08/03 05:23 AM |
Right, time for obituary for whole LRB lineage | juanrga | 2018/08/03 04:11 AM |