By: Nicolas Capens (nicolas.capens.delete@this.gmail.com), February 3, 2011 10:26 pm
Room: Moderated Discussions
Hi David,
>Nicholas,
>
>Due to the length, I need to trim a fair number of comments, especially those related
>to SW rasterization techniques (and respond in a separate post).
Fair enough.
>>>David Kanter (dkanter@realworldtech.com) on 1/27/11 wrote:
>>>---------------------------
>>>That shows nothing about compression, that merely tells about the change in performance
>>>due to larger textures. It's also largely about an older game that isn't designed for 2560x1600.
>>
>>The change in performance is the whole point.
>
>No it's not. You made a specific claim about texture compression. Comparing large
>vs. small textures says NOTHING about absolute importance of compression, but only of relative compression.
I'm sorry but I'm still having a hard time understanding what you're trying to say here. What exactly do you mean by large vs. small textures? Just to be clear again; at Ultra High settings, Doom 3 uses the exact same textures, only uncompressed. They're the same dimensions, just bigger in storage size. So how is that saying nothing about absolute importance of compression? It's exactly the situation for a software renderer: It uncompresses the texture entirely so it can be sampled like any other texture.
Doom 3 is the only game I currently know of which allows testing the absolute difference in performance when using uncompressed textures of the same dimensions. And while it's old, it should be more texture bandwidth limited than today's games with their low TEX:ALU ratios, so I still think this result is very relevant to my claim: "Hardware support for compressed textures is not critical to the viability of software rendering."
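To put rough numbers on the storage side (a sketch; the actual formats Doom 3 uses vary per texture, but DXT5's 1 byte per texel against RGBA8's 4 gives the common 4:1 ratio):

```python
# Storage for a full mip chain, uncompressed RGBA8 (4 bytes/texel)
# versus DXT5 (1 byte/texel). Block padding at the tiny mip levels
# is ignored for simplicity.
def texture_bytes(width, height, bytes_per_texel):
    total = 0.0
    while True:
        total += width * height * bytes_per_texel
        if width == 1 and height == 1:
            break
        width, height = max(width // 2, 1), max(height // 2, 1)
    return total

uncompressed = texture_bytes(512, 512, 4.0)  # RGBA8
compressed = texture_bytes(512, 512, 1.0)    # DXT5, same dimensions
print(uncompressed / compressed)  # 4.0: same texels, 4x the bytes
```

Same dimensions, same sampling code path, only the bytes moved differ: which is exactly the situation a software renderer faces after decompressing up front.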
>>Doom 3's uncompressed textures are equal in dimensions to the compressed ones.
>
>I didn't say dimensions, I said size and bandwidth. Again, you claimed that texture
>compression does not save meaningful amounts of bandwidth. I see no proof of that claim.
No, I claimed it only helps by about 10%. And I was obviously talking about absolute performance since the previous sentence was the claim about viability.
And it only helps that little simply because (a) applications are not bottlenecked by bandwidth all the time, and (b) a significant portion of the bandwidth is also used for other purposes.
>>And I've tested 3DMark06 with SwiftShader
>3DMark06 is not reflective of modern games. It's 5 years old now!
May I ask what your expectation is for modern games if a stress test like 3DMark06 doesn't show any sign of being bandwidth limited? Do you honestly expect a sudden difference in texturing bandwidth so significant it makes this result meaningless?
>> while forcing the mipmap LOD down one
>>level (which is equivalent to reducing the texture bandwidth by a factor 4), and
>>the SM3.0 score went from 250 to 249. Yes, due to some statistical variance the
>>score was actually lower. If texture bandwidth was of great significance, you'd expect a much higher score.
>Yes, but by reducing the texture size you have screwed up the compute:bandwidth
>ratio. Compressing textures will always improve that ratio...simply using smaller textures will preserve that ratio.
>
>By using smaller textures you changed the workload substantially.
I didn't touch the compute workload at all. I merely made it sample from a smaller mipmap level, reducing the bandwidth footprint per sample by a factor of 4, which is about the same benefit hardware gets from compressed textures.
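The arithmetic behind that factor of 4: every mip level halves both dimensions, so biasing the LOD down one level quarters the texel data touched per sample. A trivial sketch:

```python
# One extra mip level = half the width and half the height, so a
# quarter of the texels: roughly the saving of 4:1 texture compression.
def mip_texels(width, height, level):
    return max(width >> level, 1) * max(height >> level, 1)

base = mip_texels(1024, 1024, 0)
biased = mip_texels(1024, 1024, 1)
print(base // biased)  # 4
```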
>>>You're claiming that for #2 the difference is 10% and I don't see any real evidence
>>>of that. Compression should be vastly more effective.
>>
>>Texture compression rates of 1:2 and 1:4 are very common, but that doesn't translate
>>into big performance improvements. Most of the time there's sufficient bandwidth
>>headroom to allow uncompressed textures without an impact on performance.
>
>And what about power? If I can transfer 2X or 4X less bytes, I can use a smaller (or slower) memory interface.
You can't use a smaller (or slower) memory interface without affecting all other applications. So that's just not an option. The memory subsystem is what it is so you might as well make use of it. It has ample bandwidth and large caches, which compensate for the lack of dedicated texture decompression hardware.
Also, while the additional transfers indeed take some power, dedicated texture decompression hardware takes power too.
>>And even
>>in bandwidth limited situations, there's already a large chunk of it used by color,
>>z and geometry. So the performance won't drop by much.
>
>I don't believe any numbers you've provided on this. The simulator results you
>posted were very old and the author of the simulator basically indicated that they weren't valid or useful.
>
>[snip]
>
>>>I happen to know the author of that study in question. The data is INCREDIBLY
>>>OLD. It's from a simulator that did not use any Z or color compression, so the results cannot be taken seriously.
>>
>>Yes, it's from a simulator called ATTILA. And for the record it did use z compression to generate that graph.
>
>I am familiar with the tool and with the author. The data in question came from
>an old version of Attila, and as per my discussion with Victor Moya (the author)...the data is pretty much useless.
Ok, then please show me recent data you think is more relevant. Regardless of the age of the data I used I think it's very reasonable to conclude that applications aren't bandwidth limited all the time and a significant portion of the bandwidth is used by data other than texels. It's up to you to disprove this now.
>>>I think you underestimate the cost of adding pins to your memory controller.
>>
>>I'm not suggesting adding extra pins to make software rendering viable. It's already viable bandwidth-wise.
>
>You're suggesting getting rid of texture compression. That will increase bandwidth usage.
Yes it will, but it's insignificant since often there's some bandwidth headroom left, and other data consumes bandwidth too, so the absolute increase in bandwidth is far less than the 1:2 or 1:4 texture compression ratio. And finally the large caches compensate for it as well.
>>>Actually for the mobile space it does have to come close to dedicated hardware. Battery life matters, a lot.
>>
>>For laptops we see the graphics solution range from IGPs to lower clocked high-end
>>desktop GPUs. So while battery life is important, it doesn't mean the TDP has to
>>be the lowest possible. It just has to be acceptable. A cheaper IGP which consumes
>>more power is likely to sell better than a more expensive efficient one.
>
>Power consumption != TDP.
>
>And I strongly suspect that even a cheap IGP is going to be more efficient (power-wise)
>than software rendering on the CPU.
Quite possibly. And indeed the battery life depends on the average power consumption and not TDP. It is likely to include a lot of idle time or simple things like scrolling through a web page. But note again that not so long ago this was all handled adequately by the CPU and DMA operations. Nowadays CPUs have much better performance per Watt, so I'm not worried about battery life during light work.
>>Also note
>>that today's GPUs have far more features than the average consumer will really use,
>>meaning they are less energy efficient than they could have been. But the TDP is
>>still acceptable for a good battery life.
>
>No offense, but you don't seem to understand the difference between TDP and dynamic
>power consumption. They are only loosely related. Both are important, but I suspect
>SW rendering falls flat on its face for the latter.
There can be some difference in power consumption, but it's not an order of magnitude or so. And my point was that even a GPU consumes more power than what is strictly necessary for the task, but that's acceptable when you get other things in return. TDP is the wrong word to use in this context since it only determines the worst case battery life and not the average, but what I meant is that a somewhat higher power consumption doesn't mean it's not viable. I've owned laptops with an idle battery life of two hours and of ten hours. You may think I unquestionably prefer the latter, but the two-hour one was much cheaper and more powerful.
>>>>Most likely the bandwidth will just steadily keep increasing, helping all (high
>>>>bandwidth) applications equally. DDR3 is standard now accross the entire market,
>>>>and it's evolving toward higher frequencies and lower voltage. Next up is DDR4,
>>>>and if necessary the number of memory lanes can be increased.
>>>
>>>More memory lanes substantially increases cost, which is something that everyone wants to avoid.
>>
>>They'll only avoid it till it's the cheapest solution. Just like dual-channel and
>>DDR3 became standard after some time, things are still evolving toward higher bandwidth
>>technologies. Five years ago the bandwidth achieved by today's budget CPUs was unthinkable.
>
>Actually it was quite predictable. We're still using the same basic memory architecture - dual channel DDRx.
No it wasn't quite predictable. Even in late 2007 it wasn't clear what the benefit of dual-channel was: http://www.tomshardware.com/reviews/PARALLEL-PROCESSING,1705-11.html
It's the de facto standard now, even for mobile and budget systems, but it still took some steady evolution to get to this point. Nowadays triple channel is pretty high-end and quad channel is coming too, but in due time it will be affordable for more market segments. 256-bit memory buses have been pretty common on graphics cards for many years now. So while extra pins are expensive, they're not prohibitively expensive once the need arises.
Anyway, we're nowhere near that point yet. And by that time the caches will be even more massive so it's unlikely that graphics will require anything in addition to the steady evolution.
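For reference, peak theoretical bandwidth scales linearly with channel count; a quick sketch using DDR3-1333 as an example rate:

```python
# Peak theoretical bandwidth: channels x megatransfers/s x bytes per
# transfer (a 64-bit channel moves 8 bytes per transfer).
def peak_gb_per_s(channels, megatransfers, bus_bits=64):
    return channels * megatransfers * (bus_bits // 8) / 1000.0

print(peak_gb_per_s(2, 1333))  # dual-channel DDR3-1333: ~21.3 GB/s
print(peak_gb_per_s(4, 1333))  # quad-channel at the same rate: ~42.7 GB/s
```

So going from dual to quad channel alone doubles peak bandwidth before any increase in transfer rate; the pin cost is the only real brake on it.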
>>>>And again CPU technology is not at a standstill. With T-RAM just around the corner
>>>>we're looking at 20+ MB of cache for mainstream CPUs in the not too distant future.
>>>
>>>T-RAM is not just around the corner.
>>
>>This news item suggests otherwise: http://www.businesswire.com/portal/site/home/permalink/?ndmViewId=news_view&newsId=20090518005181
>
>Please stop making me repeat myself. How about we make a gentleman's wager?
>
>You're apparently confident that TRAM will be shipping in products soon. I'm confident it won't.
>
>So let's define what soon means, and then we can step back and see who's right and who is wrong.
>
>For me, soon means a year.
The news sites say T-RAM is being prepared for GlobalFoundries's 32 and 22 nm nodes. Since it's unlikely to be used by Bulldozer, it will probably slip to the 22 nm revision.
So my soon means two years.
>>But even if it does take longer, it doesn't really matter to the long-term viability
>>of software rendering. There will be a breakthrough at some point and it will advance
>>the convergence by making dedicated decompression hardware totally unnecessary (if
>>it even has any relevance left today).
>
>It totally matters. If you are expecting a magical 2-4X increase in cache density
>that is iso-process, then you might as well just give up. And yes, it seems to
>me that much of what you are claiming is predicated on a magical increase in cache density.
It's not a necessity. Before the end of the decade a cache size of 256 MB will be perfectly common even without any technology breakthrough. Of course a breakthrough would help to rapidly expand the viability of software rendering to other markets, but it's not a precondition for viability in the initial market.
>>>>And while the expectations for 'adequate' graphics go up as well, it's only a slow
>>>>moving target. First we saw the appearance of IGPs as an adequate solution for a
>>>>large portion of the market, and now things are evolving in favor of software rendering.
>>>
>>>I think if you look at the improvement in IGPs, that's a very FAST improving target.
>>
>>The hardware isn't the target.
>
>Yes it is. To be competitive, your solution needs to have comparable power and
>adequate performance to a hardware solution.
It depends on your definition of comparable. Looking at the demise of dedicated audio processing again, it's clear that some difference in power efficiency is acceptable. The cost reduction is well worth it.
For argument's sake, let's say the successor to Sandy Bridge is identical except for the fact that the IGP consumes only one tenth of the power, at a 5$ cost increase. Does that mean nobody will buy Sandy Bridge any more? Of course not.
So again, the power consumption of other solutions is not the target. If the power consumption is within the consumer's expectation and it costs less, it will sell.
>>>Swiftshader is swiftshader - other SW rendering systems work differently. They
>>>may (or may not) see similar benefits.
>>
>>Again, why does that make SwiftShader's results only "meaningful" for SwiftShader?
>
>That's easy. Because other SW renders may work *differently* from swiftshader.
>If they work *differently*, they will have *different* performance and *different*
>bottlenecks and *different* performance gains from the changes in Sandy Bridge.
>
>Do you see a theme in the above statement?
No offence, but I only see handwaving. Exactly how do you imagine another software renderer working differently, in a way that makes the 30% performance increase at 55% bandwidth not meaningful?
Note that the shaders have a fixed number of arithmetic operations and data accesses. There's not all that much you could do "differently" to execute them as fast as possible without wasting cycles or bandwidth.
What you're saying is equivalent to saying that the results for one video codec implementation are meaningless to other implementations of the same codec or similar codecs. Yes they'll differ, but unless they perform completely differently they'll have closely resembling characteristics and the benchmark results of one implementation on a new CPU will be meaningful to the other implementations as well.
Anyway, I'd love to be proven wrong and learn about a totally different software rendering approach...
>Let me give you a hypothetical example here. Say GCC runs 20% faster on Sandy
>Bridge than Nehalem. Now what does that tell you about the performance gain for LLVM or MSVC on Sandy Bridge?
It tells me that LLVM and MSVC will see similar performance increases. Each of them manipulates small data elements (string characters), does lots of pointer chasing, contains lots of branches, etc. If you're compiling the same source code and applying the same optimizations, the performance gain on Sandy Bridge will be comparable. If not, it means one of them is not using an optimal implementation in the first place. After switching to a better implementation, the performance gain will be more in line with the other compilers.
Just look at the problem space. You have similar input and the same CPU architecture to begin with. You can't have two optimal implementations with vastly different characteristics.
In particular for the case of software rendering I do not know of any different but still efficient approach for which SwiftShader's results are meaningless.
>>Anyway, to meet you in the middle I downclocked my i7-920's memory from 1600 MHz
>>to 960 MHz, and the 3DMark06 SM3.0 score went from 250 to 247. So once again, reducing
>>the bandwidth to 60% has no significant impact on performance.
>
>Alright...now we are getting somewhere, thanks for taking the effort to look into
>this! So you definitely have shown that for 3DMark06, there is not a big bandwidth bottleneck.
>
>However, I care about modern games. Could you try and run something like Crysis or Civ 5 or 3D Mark Vantage?
Crysis at 1680x1050, 4xMSAA, High detail, 64-bit: 0.89 FPS versus 0.85 FPS at 60% bandwidth.
Now, before you say this is evidence that bandwidth is slightly more critical with modern games, I've also tested without AA and got 1.23 FPS versus 1.22 FPS. So clearly it would be more useful to focus on color compression than texture compression.
Also, these framerates are obviously unbearable but I've also tested my laptop's Quadro FX 380M and only got 0.77 FPS. And even at low detail playing Crysis on a previous generation IGP doesn't seem like a great experience: http://www.youtube.com/watch?v=qwrNP0CgPa8
Note once again that SwiftShader isn't using AVX yet, let alone gather/scatter...
>>>Reducing the number of loads and stores isn't really relevant. It's the number
>>>of operations that matters. If you are gathering 16 different addresses, you are
>>>really doing 16 different load operations.
>>
>>Not with Larrabee's implementation. It only takes as many uops as the number of
>>cache lines that are needed to collect all elements, per load unit.
>
>Yes, that's exactly what I said. It's the access pattern that's a problem.
It's not. Texture sampling can use a blocking technique (also called tiling or address swizzling) to ensure that texels have high locality in both directions. Lookup tables, ROP operations, vertex attribute fetch, interpolant setup, etc. all have high locality.
Since Sandy Bridge already has two load units, the worst case for gathering eight 32-bit elements would be a mere 4 cycles, which is already way better than the extract/insert emulation. The typical throughput will be much closer to 1 every clock cycle though.
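To make the blocking idea concrete, here's a minimal sketch of one common swizzling scheme (Morton or Z-order; real renderers choose their own block sizes):

```python
# Morton (Z-order) addressing interleaves the bits of x and y, so
# texels that are close in 2D stay close in memory; a 2x2 bilinear
# footprint then usually lands in one or two cache lines, not four.
def morton_encode(x, y, bits=16):
    addr = 0
    for i in range(bits):
        addr |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        addr |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return addr

# The four texels of a 2x2 footprint map to consecutive addresses:
print([morton_encode(x, y) for y in range(2) for x in range(2)])  # [0, 1, 2, 3]
```

With a linear (scanline) layout the same 2x2 footprint straddles two rows that are a whole pitch apart; swizzling is what makes the "one or two cache lines per gather" case the common one.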
>>>No, that's totally different. With unified shaders you can easily use more or
>>>less geometry...until you run out of physical shaders to execute on. You should
>>>look at Kayvon's work on micro-polygon rendering.
>>
>>Easily? I think you're seriously underestimating the complexity of adapting your
>>software to the hardware. Checking whether you're vertex or pixel processing limited
>>wasn't feasible in actual games ten years ago, and it still isn't.
>
>Sure you can, there are profiling tools for that. Nvidia and ATI both have them.
That only gives you global results. Different shaders can still be heavily bottlenecked by different resources. For instance a post filter effect is likely limited by texel fetch while procedural shaders or complex SH lighting is likely compute limited. There's only so much the developer can do about that.
A better hardware architecture gets rid of the dedicated texel addressing and filtering units. This frees up space for more shader units and more generic load/store units, lessening the bottlenecks for most workloads. Only a very well balanced shader could be slightly worse off due to spending some computing resources on filtering. But that's the exception. This is very similar to the unification of the vertex and pixel pipelines: the average workload is better off with a unified architecture.
Note that some GPUs already use the shader units for part of the texel address calculations. And they might perform FP32 filtering in the shader units as well. So mark my words: the days of dedicated texture sampling are numbered.
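To illustrate what filtering in the shader units amounts to: bilinear filtering is just three lerps, i.e. plain multiply-add work. A sketch on scalars (a real shader would do this per channel, on vectors):

```python
# Bilinear filtering is three linear interpolations: exactly the kind
# of generic multiply-add work that maps onto shader ALUs once the
# dedicated filter hardware is dropped.
def lerp(a, b, t):
    return a * (1.0 - t) + b * t

def bilinear(t00, t10, t01, t11, fx, fy):
    top = lerp(t00, t10, fx)        # blend along x on the top row
    bottom = lerp(t01, t11, fx)     # blend along x on the bottom row
    return lerp(top, bottom, fy)    # blend the two rows along y

print(bilinear(0.0, 1.0, 0.0, 1.0, 0.25, 0.5))  # 0.25
```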
>>>>It's clear that telling software developers what (not) to do doesn't result in
>>>>a successful next generation hardware architecture. With non-unified architectures,
>>>>there were numerous applications which were vertex processing limited, and numerous
>>>>ones which were pixel processing limited. And even those in the middle have a fluctuating workload.
>>>
>>>Yes, except a unified shader architecture doesn't really preclude that many options.
>>
>>That's what I'm saying.
>
>Um no. You said that programmable shaders are too limited, and you want programmable rasterization and texturing.
Yes, but making these things programmable is equivalent to unification, as I've shown above.
Texturing is close to becoming programmable/unified. Alpha blending is also a perfect candidate for being computed in the shader cores. Note that DX10 already made the alpha test stage programmable.
Rasterization is a bigger challenge so it will probably take longer to become programmable, but it's already actively being researched and the results are "encouraging": http://research.microsoft.com/en-us/um/people/cloop/eg2010.pdf
>I am not convinced that those two should be programmable, and I'm not convinced
>that FF hardware is really that restrictive.
With all due respect, that's because you lack a vision of what would be possible with a fully programmable architecture. Don't get me wrong, I'm not claiming I know everything it would be used for. The only thing I'm sure of is that it would be pretty exciting. There's a lot of GPGPU research which remains unsuccessful on mainstream hardware, and there's lots of talk about ray tracing, Reyes, micropolygons, and many other rendering techniques. A lot of new techniques still need to see the light of day. But we first need the flexible hardware to enable them.
If you're still not convinced, please recall that John Carmack was practically ridiculed for wanting floating-point shaders. Most people were not convinced it would be necessary, or that integer operations were too restrictive. It's pretty obvious nowadays that these people were wrong. And while back then Carmack only envisioned Shader Model 2.0, we're now at Shader Model 5.0 and developers continue to push the limits of the hardware's capabilities. OpenCL is only in its infancy.
>>>First, you do need to count the increases in resolution. Pixels have definitely
>>>increased over time.
>>
>>Yes the resolution has increased but everything else scaled accordingly. More pixels
>>doesn't mean a higher benefit from texture compression. In fact TEX:ALU is going down,
>>meaning pixel shaders are more compute limited than bandwidth limited.
>
>I'm skeptical. Available bandwidth always grows more slowly than compute...
Which is fine. As the computational complexity of software increases, it uses more temporary data. The cache hierarchy can keep up with the cores' bandwidth needs for storing temporary results. End results go to RAM. But since the resolution only increases slowly, the current steady evolution in RAM bandwidth suffices.
Note that before pixel shaders, color operations were essentially performed by the ROP units, and you needed multiple passes to create various effects. Shaders store the intermediate results locally, in registers. The bandwidth needs went down (though this obviously allowed other things to consume the freed up bandwidth). Nowadays the GPU doesn't just have massive register files, it also has caches. They're still tiny in comparison to the register files and the computing power, but they'll need to grow to accommodate the working sets of increasingly complex software. The new data is uncompressed, so there's no way around using caches.
For CPUs the cache size per FLOP is going down. CPUs and GPUs are converging on every front. Of course if it increases again due to a breakthrough in density that's obviously not a bad thing. So let me rephrase that as the die area per FLOP spent on cache is converging.
>>>>In ten more years caches could be around 256 MB, and that's without taking revolutionary
>>>>new technologies like T-RAM into account. So it's really hard to imagine that this
>>>>won't suffice to compensate for the texture bandwidth needs of the low-end graphics market.
>>>
>>>Because you are imagining that the low-end market stays put. It won't.
>>
>>I didn't say it stays put. I said it's a slow moving target. Evidence of this is
>>the ever growing gap between high-end and low-end graphics hardware. IGPs were born
>>out of the demand for really cheap but adequate 3D graphics support. They cover the majority of the market:
>>http://unity3d.com/webplayer/hwstats/pages/web-2011Q1-gfxvendor.html
>>
>>This massive market must obviously have a further division in price and performance
>>expectations. Some people want a more powerful CPU for the same price by sacrificing
>>a bit of graphics performance, while others simply want a cheaper system that isn't
>>aimed at serious gaming. As the CPU performance continues to increase exponentially,
>>and things like gather/scatter can make a drastic difference in graphics efficiency,
>>software rendering can satisfy more and more people's expectations, even if those
>>expectations themselves increase slowly.
>
>In essence, what you are saying is that some people would be fine with lower performance
>graphics. That's something I agree with.
>
>I just don't know what the relative performance of SW rendering is to dedicated
>hardware, and how that curve will change over time.
It started to converge ever since the end of the MHz race and they started to focus on performance per Watt. GPUs are essentially limited by performance per Watt as well. But they're pretty much out of architectural options to increase it. They have to rely on new semiconductor processes. CPUs benefit from that equally, but at the same time they started to keep the clock frequency moderate, increased the number of cores and increased the vector width. This has massively increased the performance per Watt, beyond what the advancement in process technology offers. And they still have FMA up their sleeves.
As for translating this into software rendering performance, it's unfortunately held back by the lack of gather/scatter. But I've shown you plenty of data now which shows that the IGP is going to go the way of the dodo.
>My sense is that hardware is probably getting relatively faster, given the attention Intel is paying.
You have to take into account how much of that is due to increasing the area and/or power consumption. It's easy to mistake a performance increase as an architectural advancement. For several years GPUs were able to increase their ALU density dramatically, but now they're hitting walls. Instead their best hope now is to increase the utilization of these ALUs for complex workloads, using unification and extending the memory hierarchy.
Back to Intel's IGPs, their increased performance is due to additional Execution Units and higher clock frequency. It means that from a cost and power consumption point of view they're not improving relative to software rendering.
>>>That's because it won't.
>>
>>It will. The only strengths the GPU has left are all components based on the ability
>>to load/store lots of data in parallel. The CPU cores already achieve higher GFLOPS
>>than the IGP
>
>That's true today, but I suspect it won't be true in the future.
Do you have any indication to back up this suspicion? CPUs still have FMA to come, and I see no indication that the CPU core count is stagnating. Ever more applications see some benefit from quad-core, and Bulldozer will default to eight cores. Haswell is claimed to default to eight (beefier) cores as well.
Note that the non-K models of Sandy Bridge have half the EUs. So the average IGP has a lot of catching up to do to exceed the CPU's GFLOPS. Against an 8-core Haswell CPU it would have to become 8 times more powerful to exceed it. Fat chance.
It's far more interesting to give the CPU cores gather/scatter support and ditch that sorry IGP.
>Also remember
>that you have to share those FLOP/s with other tasks.
If you replace the IGP with CPU cores you get about the same amount of FLOPs in return. Furthermore, with an IGP every game is GPU limited. As unification proves, sharing FLOPs is not a bad thing. Even if the other tasks need lots of FLOPs, it's typically only temporary. For instance to compute physics. At other times the CPU is sitting idle waiting for the IGP to finish.
By unifying the whole thing the bottlenecks are gone and you get closer interaction. Currently a lot of GPGPU applications are unsuccessful because of the round-trip latency. You need to send a massive amount of data-parallel work to the GPU to compensate for that. That doesn't always work out so well. A lot of applications want to process small amounts of work instead and get the results back fast to interact with them.
Software rendering lets you break free from the legacy graphics pipeline and perform only exactly the operations you want. For instance when doing a post filter effect you don't need perspective correction or mipmap LOD calculations. You can already do this with a compute shader, but then again a compute shader IS software rendering...
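As a concrete (hypothetical) illustration of that freedom, a post filter in a software renderer reduces to a plain loop over pixels:

```python
# A 3-tap horizontal box blur over one scanline: no perspective
# correction, no mipmap LOD math, just the reads and writes the
# effect actually needs.
def box_blur_row(row):
    out = []
    for x in range(len(row)):
        left = row[max(x - 1, 0)]               # clamp at the left edge
        right = row[min(x + 1, len(row) - 1)]   # clamp at the right edge
        out.append((left + row[x] + right) / 3.0)
    return out

print(box_blur_row([0.0, 3.0, 6.0]))  # [1.0, 3.0, 5.0]
```

Every instruction here does useful work for the effect; the fixed-function pipeline would spend extra per-pixel setup on stages this filter never needs.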
>>You
>>can either ditch the IGP to make things cheaper, or replace it with additional CPU
>>cores so you get a really powerful processor for any workload.
>
>I think to achieve IGP level of performance, using an IGP is the most efficient in terms of power and area.
Yes, but you have to look at the total picture. The IGP is only used intensively during gaming. At other times its transistors are largely a waste of die space (money). Having more CPU performance per dollar can be of greater importance to the consumer. So just like the most cost effective solution for audio processing is a software codec, we're close to the point where for some markets an IGP-less system offers the best balance.
>>Quoting t-ram.com: "T-RAM Semiconductor has successfully developed the Thyristor-RAM
>>technology from concept to production-readiness. Our Thyristor-RAM technology has
>>been successfully implemented on both Bulk and SOI CMOS. "
>>
>>Sounds like production ready to me.
>
>Not even close.
What makes you so sure? Yes, Z-RAM appears to be a failure, but I'd expect them to think twice before making statements about jointly developing it for the 32 and 22 nm nodes. Since the work on Bulldozer started many years ago and they probably didn't want to take any risks, it's highly unlikely to use 32 nm T-RAM, but it could appear in a 22 nm refresh in about two years' time.
So do you have any other information aside from the irrelevant comparison to Z-RAM to claim T-RAM is not even close to production?
>>>>2nd gen Z-RAM
>>>
>>>Doesn't work at all.
>>
>>Maybe not as cache memory, but it's hopeful as a DRAM replacement: http://www.z-ram.com/en/pdf/Z-RAM_LV_and_bulk_PR_Final_for_press.pdf
>>
>
>Just a moment ago, you were suggesting it as a cache replacement. Now you suddenly
>are back-tracking? And nobody really wants a proprietary DRAM replacement.
Yes, I wasn't aware that 2nd gen. Z-RAM is not considered an SRAM replacement. Either way the point was that cache technology is not at a standstill, and one less candidate for improving the density doesn't put a lid on it.
>>>It's possible, but they will need to become more competitive from an energy perspective with fixed function stuff.
>>
>>There's not a lot of fixed-function stuff left. The majority of the GPU's die space
>>consists of programmable or generic components.
>>
>>And I've shown before that the CPU's FLOPS/Watt is in the same league as GPUs:
>>- Core i7-2820QM: 150 GFLOPS / 45 Watt (more with Turbo Boost)
>>- GeForce GT 420: 134.4 GFLOPS / 50 Watt
>
>The GT 420 is ancient. A better comparison would be the GT 440, which is 96 shaders,
>1.6GHz and 65W. That's ~300 GFLOP/s for 65W, or a roughly 2X advantage.
Ancient? Both the GT 420 and 440 use a GF108 chip.
And my calculator says it's only a 50% advantage. But that's without taking into account that Sandy Bridge's TDP includes the IGP and includes Turbo Mode. So that advantage likely goes up in smoke.
FMA will double the GFLOP/s score for a minimal increase in transistor count. Add to this powerhouse gather/scatter support and NVIDIA has a serious problem.
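For the record, the perf/Watt arithmetic with the figures from this exchange (the clocks and shader counts are as quoted above; treat them as assumptions):

```python
# GFLOPS/W from the numbers in this thread: the GT 440's edge works
# out to roughly 40%, not 2x, and the CPU's 45 W TDP still includes
# the IGP and Turbo headroom.
gt440_gflops = 96 * 1.6 * 2    # 96 shaders x 1.6 GHz x 2 flops/cycle = 307.2
cpu_gflops = 150.0             # the mobile quad-core figure quoted above
gpu_per_watt = gt440_gflops / 65.0   # ~4.7 GFLOPS/W
cpu_per_watt = cpu_gflops / 45.0     # ~3.3 GFLOPS/W
print(round(gpu_per_watt / cpu_per_watt, 2))  # 1.42
```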
>>Obviously software rendering requires a bit more arithmetic power to implement
>>the remaining fixed-function functionality, but programmable shaders take the bulk.
>>
>>So there's no lack of energy efficiency. The CPU simply can't utilize its computing power effectively
>
>GPUs are definitely more power efficient than CPUs.
At effective graphics performance, yes, but that's easy enough to fix with FMA, integer AVX, and gather/scatter.
Advances in power gating ensure that CPUs don't waste power on idle parts. Also the Physical Register File of Sandy Bridge significantly reduces the power consumption of the out-of-order execution logic.
So today's CPUs are pretty lean and mean and they're only going to get better. But they don't have to match the effective power efficiency of GPUs to make software rendering viable.
>>All applications that contain loops can benefit from >gather/scatter. That's all applications.
>
>If that's true, then what % performance increase could we expect to see in SPECint?
A 15% reduction in memory accesses and 28% of scalar instructions converted into vector instructions: http://personals.ac.upc.edu/mpajuelo/papers/ISCA02.pdf. This paper uses a dynamic technique, but with a bit of assistance from the programmer (like the use of the 'restrict' keyword) the same results can be achieved statically.
Without gather/scatter a large number of loops can't be vectorized: http://hpc.cs.tsinghua.edu.cn/research/cluster/SPEC2006Characterization/auto_para.html
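To make the claim concrete, here is a minimal sketch (my own illustrative example, not taken from the paper) of the kind of loop that defeats vectorization today: the load address depends on data rather than on the loop counter.

```cpp
#include <array>
#include <cstddef>

// A loop a compiler cannot vectorize without hardware gather: the
// address of each load depends on runtime data (idx[i]), so there is
// no single vector load that fetches all eight elements at once.
void lookup_scalar(const float* table, const int* idx, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = table[idx[i]];   // data-dependent address: a gather
}

// What a single 8-wide gather instruction would compute, modeled here
// in plain scalar code for clarity.
std::array<float, 8> gather8(const float* base, const std::array<int, 8>& idx) {
    std::array<float, 8> v{};
    for (int lane = 0; lane < 8; ++lane)
        v[lane] = base[idx[lane]];
    return v;
}
```

With gather support, the entire body of `lookup_scalar` maps to one vector instruction per eight iterations; without it, the compiler must keep the loop scalar or emulate the gather element by element.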
>>With gather/scatter support every scalar operation would >have a parallel equivalent.
>>So any loop with independent iterations can be >parallelized and execute up to 8 times faster.
>
>That's assuming there is no control flow divergence.
Simple flow control is still worth vectorizing.
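As a sketch of why (again an illustrative example of mine, not anyone's production code): a simple branch maps cleanly onto predication, where every lane computes the compare and a mask selects the result, which is what AVX's vcmpps/vblendvps pair does.

```cpp
#include <cstddef>

// Scalar loop with simple, divergent control flow.
void clamp_scalar(float* a, std::size_t n, float limit) {
    for (std::size_t i = 0; i < n; ++i)
        if (a[i] > limit)
            a[i] = limit;
}

// The same loop in branch-free, 8-lane "vector" form, modeled in
// scalar code: both outcomes are available and a per-lane mask picks
// one (the vcmpps + vblendvps pattern), so no branch divergence.
void clamp_vectorized(float* a, std::size_t n, float limit) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (int lane = 0; lane < 8; ++lane) {        // one vector op
            bool mask = a[i + lane] > limit;          // vcmpps
            a[i + lane] = mask ? limit : a[i + lane]; // vblendvps
        }
    for (; i < n; ++i)                                // scalar tail
        if (a[i] > limit) a[i] = limit;
}
```

The masked form does some wasted work on lanes where the condition is false, but for simple flow control that overhead is far smaller than the 8x win from vectorizing.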
>>And I don't think the hardware cost is that high. All you >need is a bit of logic
>>to check which elements are located in the same cache >line, and four byte shift
>>units per 128-bit load units instead of one, to collect >the individual elements.
>>Note that logic for sequentially accessing the cache lines >is already largely in
>>place to support load operations which straddle a cache >line boundary.
>
>You are saying that because you don't design hardware. What you are suggesting
>is in fact, quite complicated and large.
I don't currently design hardware but I have a masters degree in computer engineering, with a minor in embedded systems. I've read 'Digital Integrated Circuits - A Design Perspective' by Rabaey et al. front-to-back so by all means please elaborate on just how complicated and large it would be.
Also please tell me how Larrabee can have 512-bit wide gather/scatter support for each of its tiny cores while a pair of 128-bit gather/scatter units would be quite complicated and large.
Unless of course by quite complicated and large you meant about as complicated and large as texel fetching logic. Sure, it's definitely not trivial to design and the area is not insignificant. But it seems well worth it given that it will allow the vectorization of code which previously wasn't vectorizable.
>>>Really? Have you heard of Vertica? They do an awful lot >>of lossless compression of data in memory.
>>
>>No, I hadn't heard about them before. Could you point me >to some document where
>>they detail how they added hardware support for compressed >memory transfers to reduce bandwidth?
>
>They don't need hardware to do lossless compression. They have a clever column
>oriented database. Check vertica.com. One of their big performance gains is from reducing memory (and disk) bandwidth.
You were previously talking about the importance of dedicated texture decompression hardware. Now you're telling me about Vertica and how they don't need hardware to do lossless compression...
I don't see the relevance of Vertica to this discussion, unless you're actually trying to say dedicated hardware isn't that important after all.
Indeed there are also lossy and lossless techniques to reduce memory bandwidth which can be implemented in software. That includes textures. It's currently not worth the cycles though, as software rendering isn't bandwidth limited.
>>>Many applications use adjacent values.
>>
>>Yes, and many applications also use non-adjacent values.
>>
>>If a loop contains just one load or store at an address >which isn't consecutive,
>>it can't be vectorized (unless you want to resort to >serially extracting/inserting
>>addresses and values). So even if the majority of values >are adjacent, it doesn't
>>take a lot of non-adjacent data to cripple the performance.
>
>You can still vectorize it, you just need to have a bunch of scalar loads/stores
>to deal with the non-adjacent addresses.
That's exactly what I said (note the "unless"). You don't generally want to do that though. Note that you need two instructions (extract and insert) to emulate a single scalar load. So you risk making things slower than the scalar code. See slide 44 here: http://sc.tamu.edu/help/softwareDocs/intel/tutorial/compiler_1.pdf
It would help to have an instruction which takes a vector element as address offset (e.g. "mov ymm0.3, dword ptr [rax+ymm1.3]"), but to really tackle Amdahl's Law we need gather/scatter support which in the ideal case takes a single cycle.
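Modeling that hypothetical instruction in scalar code (the `ymm0.3` addressing syntax above is my invented notation; no such x86 instruction exists today) makes the cost comparison easy to see:

```cpp
// Model of the suggested halfway instruction: load one 32-bit element
// into a chosen vector lane, using the same lane of an index vector as
// the address offset ("mov ymm0.3, dword ptr [rax+ymm1.3]" above).
// Hypothetical -- shown here as plain scalar code.
void lane_load(float dst[8], const float* base, const int idx[8], int lane) {
    dst[lane] = base[idx[lane]];
}

// Emulating a full 8-wide gather with it still costs 8 instructions,
// which already beats the ~18-instruction extract/insert sequence; a
// true gather instruction would collapse all of this into one.
void gather_emulated(float dst[8], const float* base, const int idx[8]) {
    for (int lane = 0; lane < 8; ++lane)
        lane_load(dst, base, idx, lane);
}
```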
>>>>Why? It only accesses the cache lines it needs. If all >elements are from the same
>>>>cache line, it's as fast as accessing a single element.
>>>
>>>And exactly as fast as using AVX! i.e. no improvement >>and more complexity/power.
>>
>>No. The addresses are unknown at compile time. So the only >>option with AVX1 is
>>to sequentially extract each address from the address >vector, and insert the read
>>element into the result vector. This takes 18 instructions.
>
>>With gather support it would be just one instruction. >Assuming it gets split into
>>two 128-bit gather uops, the maximum throughput is 1 every >cycle and the minimal throughput is 1 every 4 cycles.
>
>>>>But even in the worst case
>>>>it can't generate more misses or consume more bandwidth.
>>>
>>>It sure can. Now instead of having 1-2 TLB accesses per cycle, you get 16. How
>>>many TLB copies do you want? How many misses in flight do you want to support?
>>
>>You're still not getting it. It only accesses one cache >line per cycle. It simply
>>has to check which elements are within the same cache >line, and perform a single
>>TLB access for all of these elements. Checking whether the >addresses land on the
>>same cache line doesn't require full translation of each >address.
>
>That's quite complicated hardware, and you can't afford to have that on the critical
>path for any of your normal loads. So now you need a fairly separate load/store pipeline for scatter/gather.
I don't think any significant additions are needed on the critical path itself. It just requires four byte shift units instead of one, and they operate in parallel as well. Computing which elements go where can be done up front, before entering the critical path for normal loads. It would be perfectly acceptable for scatter/gather to have a higher latency, if necessary.
It's definitely an engineering challenge, but so far I can't think of anything which would jeopardize the feasibility.
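The coalescing logic in question can be sketched in a few lines (an illustrative model of mine, with a simple cost model of one cycle per distinct 64-byte cache line touched):

```cpp
#include <cstdint>

// Two addresses fall in the same 64-byte cache line iff their upper
// bits match -- a simple equality test, no full per-element address
// translation required.
inline bool same_line(std::uintptr_t a, std::uintptr_t b) {
    return (a >> 6) == (b >> 6);
}

// Cost model for an 8-element gather: count the distinct cache lines
// the addresses touch. Best case 1 line (one cycle), worst case 8.
int lines_touched(const std::uintptr_t addr[8]) {
    std::uintptr_t seen[8];
    int n = 0;
    for (int i = 0; i < 8; ++i) {
        std::uintptr_t line = addr[i] >> 6;
        bool found = false;
        for (int j = 0; j < n; ++j)
            if (seen[j] == line) found = true;
        if (!found) seen[n++] = line;
    }
    return n;
}
```

This grouping step is exactly the part that can be computed up front, before the gather enters the normal load pipeline.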
>>Nothing other than graphics runs better on the IGP. As >I've mentioned before, GPGPU
>>is only succesful using high-end hardware.
>
>Today...unclear what tomorrow holds.
He who predicts the present is always right.
Anyway, yes, GPUs are highly likely to become better at GPGPU applications. But it will require improving the efficiency of executing workloads which differ from graphics. That means less graphics-specific fixed-function hardware, more programmability, more unification, superscalar scheduling, concurrent kernels, larger caches, etc. Everything points in the direction of the GPU becoming more CPU-like, which means at some point it makes no sense at all to keep things heterogeneous.
>>So the CPU is better than the IGP at absolutely everything >else. That makes it
>>really tempting to have a closer look at what it would >take to make it adequately efficient at graphics as well.
>>
>>The answer: gather/scatter.
>
>It would also need a 2X improvement in FLOP/w and /mm2, possibly more.
Adding FMA support and tossing out the IGP would roughly triple the compute density. Performance per Watt is already excellent, as the comparison against the GeForce GT 420 and 440 shows.
>>Multi-core, 256-bit vectors, Hyper-Threading, software >pipelining... the CPU is
>>already a throughput device! It's just being held back by >the lack of parallel load/store
>>support. It's the one missing part to let all those GFLOPS >come to full fruition.
>
>You keep on repeating this as if it were true, but it's not. I agree that lack
>of scatter/gather is an issue. But a more fundamental issue is that throughput
>optimized cores (e.g. shader arrays) are simply more efficient for compute rich
>workloads. You can't really get around that.
You keep on repeating that throughput optimized cores are "simply" more efficient. I've given you dozens of detailed arguments why despite that, software rendering is the future, while you're just handwaving based on the prejudice that CPUs are weak and power hungry.
Yes GPUs are throughput optimized so evidently they are more efficient for compute rich workloads, but they fall flat on their face when running out of registers or if the working set doesn't fit in the cache or if the code is too divergent or if the work batches are too small, etc. CPUs cope much more gracefully with increasing software complexity.
So it's getting less relevant just how efficient GPUs are at compute rich workloads. Nobody cares if they can run Max Payne at 1000 FPS. What matters in the long run is the newer workloads, which are less data parallel and less coherent.
Now, GPUs obviously still dictate the pace at which application developers diverge from compute rich workloads. So we're not going to see, for example, ray-traced games tomorrow. But GPUs do still suck at these kinds of workloads, and no amount of additional shader cores is going to help. They'll need to evolve in the direction of CPU architectures to enable new workloads.
It also means that CPUs don't have to become as compute optimized as today's GPUs for software rendering to take over. Although they'll still drastically improve at it with FMA and gather/scatter, they have plenty of other valuable features to become the dominant architecture for any workload.
>>What specialized hardware would that be? I've already shown that texture compression
>>hardly makes a difference,
>
>No, you cited extremely old data from a simulator, where even the author of the
>simulator thinks the data is not useful.
No, I only gave that simulator data to clarify the results from actual experiments. It doesn't have to be very accurate to be useful. Regardless of what the exact bandwidth usage looks like today, there will be plenty of headroom and it's not just consumed by texturing.
If you want to debunk that, I suggest you show me recent data for which this isn't true, or tell me exactly why the author thinks the old data isn't useful and how it affects the validity of your dedicated texture decompression hardware importance claim.
>>and sampling and filtering is becoming programmable anyway.
>>Gather/scatter speeds up just about every other pipeline >stage as well.
>
>Except it doesn't benefit many workloads, and it costs a lot of area and power.
>So you want to disable it on the many workloads where it does not help.
>>>I totally agree that scatter/gather is a great capability to have. But what's
>>>the cost in die area, power and complexity? Not just to the core, but also the memory controller, etc.
>>
>>Larrabee has wider vectors and smaller cores, but features gather/scatter support.
>>So I don't think it takes a lot of die space either way. It doesn't require any
>>changes to the memory controller, just the load/store units. I'm not entirely sure
>>but collecting four elements from a cache line can probably largely make use of
>>the existing network to extract one (unaligned) value. And checking which addresses
>>land on the same cache line is a very simple equality test >of the upper bits.
>
>I think you have no or minimal experience designing hardware, so I'm not really
>inclined to take your word for it...especially compared against the expertise of
>the thousands of CPU designers at places like Intel, AMD and IBM.
Let me get this straight... You assume I have no knowledge of hardware design, without pointing out any flaw in my reasoning, and you're more inclined to turn towards experienced CPU designers such as those working at Intel, who added gather/scatter to Larrabee, as an indication why gather/scatter for the CPU isn't feasible?
>Scatter/gather is expensive and that's why it isn't done.
All things weren't done the day before they were done. You can't conclude from that that gather/scatter is (too) expensive.
Lots of things don't happen simply out of poor judgement. For instance some of the SSE instructions are just late fixups of old incomplete extensions. It doesn't mean they were expensive to add the first time around.
Gather/scatter is a significant deviation from the well-known scalar load/store unit. It's very alien to CPU designers and it requires considerable R&D even if the end result isn't necessarily expensive. Also, CPU designers are often clueless about the software applications. They just benchmark current software and try to come up with the next idea for executing it faster. But without gather/scatter support, lots of loops are not vectorized, or the software developers compute things very differently. For instance computing an exponential function with scalar code is best done using some lookup tables, but with vector code you currently need to resort to long polynomials. This is then wrongfully interpreted as a need for more arithmetic performance. Another example is converting AoS data into SoA data for SIMD processing. This currently requires lots of shuffle operations between registers, so CPU designers are inclined to add more and faster shuffle units. But with scatter/gather there wouldn't be any need to shuffle data across registers.
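The AoS-to-SoA case can be made concrete with a small sketch (my own example; with gather hardware the loop body below is a single stride-4 gather, without it compilers emit long cross-register shuffle/unpack sequences):

```cpp
#include <cstddef>

struct Vec4 { float x, y, z, w; };   // array-of-structures layout

// SoA extraction of the x components. Viewed through the flat float
// array, this is a gather with index vector {0, 4, 8, 12, ...} -- one
// instruction per 8 vertices with hardware gather, versus a pile of
// shuffles and unpacks without it.
void extract_x(const Vec4* v, float* xs, std::size_t n) {
    const float* flat = reinterpret_cast<const float*>(v);
    for (std::size_t i = 0; i < n; ++i)
        xs[i] = flat[4 * i];   // stride-4 gather
}
```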
Cheers,
Nicolas
>>Doom 3's uncompressed textures are equal in dimensions to >the compressed ones.
>
>I didn't say dimensions, I said size and bandwidth. Again, you claimed that texture
>compression does not save meaningful amounts of bandwidth. I see no proof of that claim.
No, I claimed it only helps by about 10%. And I was obviously talking about absolute performance, since the previous sentence was the claim about viability.
And it only helps that little simply because (a) applications are not bottlenecked by bandwidth all the time, and (b) a significant portion of the bandwidth is used for other purposes.
>>And I've tested 3DMark06 with SwiftShader
>3DMark06 is not reflective of modern games. It's 5 years old now!
May I ask what your expectation is for modern games if a stress test like 3DMark06 doesn't show any sign of being bandwidth limited? Do you honestly expect a sudden difference in texturing bandwidth so significant it makes this result meaningless?
>> while forcing >the mipmap LOD down one
>>level (which is equivalent to reducing the texture bandwidth by a factor 4), and
>>the SM3.0 score went from 250 to 249. Yes, due to some >statistical variance the
>>score was actually lower. If texture bandwidth was of >great significance, you'd expect a much higher score.
>Yes, but by reducing the texture size you have screwed up the compute:bandwidth
>ratio. Compressing textures will always improve that ratio...simply using smaller textures will preserve that ratio.
>
>By using smaller textures you changed the workload substantially.
I didn't touch the compute workload at all. I merely made it sample from a smaller mipmap level, reducing the per-sample bandwidth footprint by a factor of 4, which is about the same benefit hardware gets from compressed textures.
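The arithmetic behind that factor of 4 is simple enough to write down (illustrative helper of mine): dropping one mip level halves each dimension, so the texel footprint quarters.

```cpp
#include <cstddef>

// Bytes of texel data at a given mip level: each level down halves
// both dimensions, so the footprint (and hence texture bandwidth per
// sample) drops by 4x per level -- comparable to the 1:4 ratio of
// block texture compression.
std::size_t mip_bytes(std::size_t w, std::size_t h, std::size_t bytes_per_texel, int lod) {
    return (w >> lod) * (h >> lod) * bytes_per_texel;
}
```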
>>>You're claiming that for #2 the difference is 10% and I don't see any real evidence
>>>of that. Compression should be vastly more effective.
>>
>>Texture compression rates of 1:2 and 1:4 are very common, >but that doesn't translate
>>into big performance improvements. Most of the time >there's sufficient bandwidth
>>headroom to allow uncompressed textures without an impact >on performance.
>
>And what about power? If I can transfer 2X or 4X less bytes, I can use a smaller (or slower) memory interface.
You can't use a smaller (or slower) memory interface without affecting all other applications. So that's just not an option. The memory subsystem is what it is so you might as well make use of it. It has ample bandwidth and large caches, which compensate for the lack of dedicated texture decompression hardware.
Also, while the additional transfers indeed take some power, dedicated texture decompression hardware takes power too.
>>And even
>>in bandwidth limited situations, there's already a large >chunk of it used by color,
>>z and geometry. So the performance won't drop by much.
>
>I don't believe any numbers you've provided on this. The simulator results you
>posted were very old and the author of the simulator basically indicated that they weren't valid or useful.
>
>[snip]
>
>>>I happen to know the author of that study in question. The data is INCREDIBLY
>>>OLD. It's from a simulator that did not use any Z or color compression, so the results cannot be taken seriously.
>>
>>Yes, it's from a simulator called ATTILA. And for the >record it did use z compression to generate that graph.
>
>I am familiar with the tool and with the author. The data in question came from
>an old version of Attila, and as per my discussion with Victor Moya (the author)...the data is pretty much useless.
Ok, then please show me recent data you think is more relevant. Regardless of the age of the data I used I think it's very reasonable to conclude that applications aren't bandwidth limited all the time and a significant portion of the bandwidth is used by data other than texels. It's up to you to disprove this now.
>>>I think you underestimate the cost of adding pins to your >>memory controller.
>>
>>I'm not suggesting adding extra pins to make software >rendering viable. It's already viable bandwidth-wise.
>
>You're suggesting getting rid of texture compression. That will increase bandwidth usage.
Yes it will, but it's insignificant: there's often some bandwidth headroom left, and other data consumes bandwidth too, so the absolute increase is far smaller than the 1:2 or 1:4 texture compression ratio suggests. And finally the large caches compensate for it as well.
>>>Actually for the mobile space it does have to come close to dedicated hardware. Battery life matters, a lot.
>>
>>For laptops we see the graphics solution range from IGPs >to lower clocked high-end
>>desktop GPUs. So while battery life is important, it doesn't mean the TDP has to
>>be the lowest possible. It just has to be acceptable. A cheaper IGP which consumes
>>more power is likely to sell better than a more expensive >efficient one.
>
>Power consumption != TDP.
>
>And I strongly suspect that even a cheap IGP is going to be more efficient (power-wise)
>than software rendering on the CPU.
Quite possibly. And indeed battery life depends on average power consumption, not TDP. Typical use includes a lot of idle time and simple things like scrolling through a web page. But note again that not so long ago this was all handled adequately by the CPU and DMA operations. Nowadays CPUs have much better performance per Watt, so I'm not worried about battery life during light work.
>>Also note
>>that today's GPUs have far more features than the average >consumer will really use,
>>meaning they are less energy efficient than they could >have been. But the TDP is
>>still acceptable for a good battery life.
>
>No offense, but you don't seem to understand the difference between TDP and dynamic
>power consumption. They are only loosely related. Both are important, but I suspect
>SW rendering falls flat on its face for the latter.
There can be some difference in power consumption, but it's not an order of magnitude. And my point was that even a GPU consumes more power than strictly necessary for the task, but that's acceptable when you get other things in return. TDP is the wrong term to use in this context, since it only determines worst-case battery life rather than average battery life, but what I meant is that somewhat higher power consumption doesn't make it unviable. I've owned a laptop with two hours of idle battery life and one with ten. You might think I'd unquestionably prefer the latter, but the two-hour one was much cheaper and more powerful.
>>>>Most likely the bandwidth will just steadily keep increasing, helping all (high
>>>>bandwidth) applications equally. DDR3 is standard now accross the entire market,
>>>>and it's evolving toward higher frequencies and lower voltage. Next up is DDR4,
>>>>and if necessary the number of memory lanes can be >increased.
>>>
>>>More memory lanes substantially increases cost, which is >>something that everyone wants to avoid.
>>
>>They'll only avoid it till it's the cheapest solution. >Just like dual-channel and
>>DDR3 became standard after some time, things are still >evolving toward higher bandwidth
>>technologies. Five years ago the bandwidth achieved by >today's budget CPUs was unthinkable.
>
>Actually it was quite predictable. We're still using the same basic memory architecture - dual channel DDRx.
No, it wasn't predictable at all. Even in late 2007 it wasn't clear what the benefit of dual-channel was: http://www.tomshardware.com/reviews/PARALLEL-PROCESSING,1705-11.html
It's the de facto standard now, even for mobile and budget systems, but it took steady evolution to get to this point. Nowadays triple channel is pretty high-end and quad channel is coming too, but in due time they'll be affordable for more market segments. 256-bit memory buses have been common on graphics cards for many years now. So while extra pins are expensive, they're not prohibitively expensive when the need arises.
Anyway, we're nowhere near that point yet. And by that time the caches will be even more massive so it's unlikely that graphics will require anything in addition to the steady evolution.
>>>>And again CPU technology is not at a standstill. With >T-RAM just around the corner
>>>>we're looking at 20+ MB of cache for mainstream CPUs in >the not too distant future.
>>>
>>>T-RAM is not just around the corner.
>>
>>This news item suggests otherwise: http://www.businesswire.com/portal/site/home/permalink/?ndmViewId=news_view&newsId=20090518005181
>
>Please stop making me repeat myself. How about we make a gentleman's wager?
>
>You're apparently confident that TRAM will be shipping in products soon. I'm confident it won't.
>
>So let's define what soon means, and then we can step back and see who's right and who is wrong.
>
>For me, soon means a year.
The news sites say T-RAM is being prepared for GlobalFoundries's 32 and 22 nm nodes. Since it's unlikely to be used by Bulldozer, it will probably slip to the 22 nm revision.
So my soon means two years.
>>But even if it does take longer, it doesn't really matter to the long-term viability
>>of software rendering. There will be a breakthrough at some point and it will advance
>>the convergence by making dedicated decompression hardware totally unnecessary (if
>>it even has any relevance left today).
>
>It totally matters. If you are expecting a magical 2-4X increase in cache density
>that is iso-process, then you might as well just give up. And yes, it seems to
>me that much of what you are claiming is predicated on a magical increase in cache density.
It's not a necessity. Before the end of the decade, a cache size of 256 MB will be perfectly common even without any technology breakthrough. Of course a breakthrough would help rapidly expand the viability of software rendering to other markets, but the initial market doesn't depend on it.
>>>>And while the expectations for 'adequate' graphics go up as well, it's only a slow
>>>>moving target. First we saw the appearance of IGPs as an adequate solution for a
>>>>large portion of the market, and now things are evolving >in favor of software rendering.
>>>
>>>I think if you look at the improvement in IGPs, that's a >>very FAST improving target.
>>
>>The hardware isn't the target.
>
>Yes it is. To be competitive, your solution needs to have comparable power and
>adequate performance to a hardware solution.
It depends on your definition of comparable. Looking at the demise of dedicated audio processing again, it's clear that some difference in power efficiency is acceptable. The cost reduction is well worth it.
For argument's sake, let's say the successor to Sandy Bridge is identical except that the IGP consumes only one tenth of the power, at a $5 cost increase. Does that mean nobody will buy Sandy Bridge any more? Of course not.
So again, the power consumption of other solutions is not the target. If the power consumption is within the consumer's expectation and it costs less, it will sell.
>>>Swiftshader is swiftshader - other SW rendering systems >work differently. They
>>>may (or may not) see similar benefits.
>>
>>Again, why does that make SwiftShader's results only "meaningful" for SwiftShader?
>
>That's easy. Because other SW renders may work *differently* from swiftshader.
>If they work *differently*, they will have *different* performance and *different*
>bottlenecks and *different* performance gains from the changes in Sandy Bridge.
>
>Do you see a theme in the above statement?
No offence, but I only see handwaving. Exactly how do you imagine another software renderer working differently, in a way that makes the 30% performance increase at 55% bandwidth not meaningful?
Note that the shaders have a fixed number of arithmetic operations and data accesses. There's not all that much you could do "differently" to execute them as fast as possible without wasting cycles or bandwidth.
What you're saying is equivalent to saying that the results for one video codec implementation are meaningless to other implementations of the same codec or similar codecs. Yes they'll differ, but unless they perform completely differently they'll have closely resembling characteristics and the benchmark results of one implementation on a new CPU will be meaningful to the other implementations as well.
Anyway, I'd love to be proven wrong and learn about a totally different software rendering approach...
>Let me give you a hypothetical example here. Say GCC runs 20% faster on Sandy
>Bridge than Nehalem. Now what does that tell you about the performance gain for LLVM or MSVC on Sandy Bridge?
It tells me that LLVM and MSVC will see similar performance increases. Each of them manipulates small data elements (string characters), does lots of pointer chasing, contains lots of branches, etc. If you're compiling the same source code and applying the same optimizations, the performance gain on Sandy Bridge will be comparable. If not, it means one of them is not using an optimal implementation in the first place. After switching to a better implementation, the performance gain will be more in line with the other compilers.
Just look at the problem space. You have similar input and the same CPU architecture to begin with. You can't have two optimal implementations with vastly different characteristics.
In particular for the case of software rendering I do not know of any different but still efficient approach for which SwiftShader's results are meaningless.
>>Anyway, to meet you in the middle I downclocked my >i7-920's memory from 1600 MHz
>>to 960 MHz, and the 3DMark06 SM3.0 score went from 250 to >247. So once again, reducing
>>the bandwidth to 60% has no significant impact on >performance.
>
>Alright...now we are getting somewhere, thanks for taking the effort to look into
>this! So you definitely have shown that for 3DMark06, there is not a big bandwidth bottleneck.
>
>However, I care about modern games. Could you try and run something like Crysis or Civ 5 or 3D Mark Vantage?
Crysis at 1680x1050, 4xMSAA, High detail, 64-bit: 0.89 FPS versus 0.85 FPS at 60% bandwidth.
Now, before you say this is evidence that bandwidth is slightly more critical in modern games, I've also tested without AA and got 1.23 FPS versus 1.22 FPS. So clearly it would be more useful to focus on color compression than on texture compression.
Also, these framerates are obviously unbearable but I've also tested my laptop's Quadro FX 380M and only got 0.77 FPS. And even at low detail playing Crysis on a previous generation IGP doesn't seem like a great experience: http://www.youtube.com/watch?v=qwrNP0CgPa8
Note once again that SwiftShader isn't using AVX yet, let alone gather/scatter...
>>>Reducing the number of loads and stores isn't really relevant. It's the number
>>>of operations that matters. If you are gathering 16 different addresses, you are
>>>really doing 16 different load operations.
>>
>>Not with Larrabee's implementation. It only takes as many >uops as the number of
>>cache lines that are needed to collect all elements, per >load unit.
>
>Yes, that's exactly what I said. It's the access pattern that's a problem.
It's not. Texture sampling can use a blocking technique (also called tiling or address swizzling) to ensure that texels have a high locality in both directions. Lookup tables, ROP operations, vertex attribute fetch, interpolant setup, etc. it all has a high locality.
Since Sandy Bridge already has two load units, the worst case for gathering eight 32-bit elements would be a mere 4 cycles, which is already way better than the extract/insert emulation. The typical throughput will be much closer to 1 every clock cycle though.
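One common way to implement the blocking/swizzling mentioned above is Morton (Z-order) addressing, which interleaves the bits of the x and y coordinates so texels that are close in 2D stay close in memory (a standard technique; whether any particular renderer uses exactly this layout is my assumption):

```cpp
#include <cstdint>

// Morton-order address swizzle: interleave the low 16 bits of x and y
// so a 2x2 filter footprint almost always lands in one or two cache
// lines, maximizing gather coalescing during texture sampling.
std::uint32_t morton2d(std::uint32_t x, std::uint32_t y) {
    auto spread = [](std::uint32_t v) {   // insert a 0 bit between each bit of v
        v &= 0xFFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}
```

With this layout, neighboring texels (x, y), (x+1, y), (x, y+1), (x+1, y+1) map to nearby addresses, which is exactly the locality a coalescing gather unit exploits.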
>>>No, that's totally different. With unified shaders you can easily use more or
>>>less geometry...until you run out of physical shaders to execute on. You should
>>>look at Kayvon's work on micro-polygon rendering.
>>
>>Easily? I think you're seriously underestimating the >complexity of adapting your
>>software to the hardware. Checking whether you're vertex >or pixel processing limited
>>wasn't feasible in actual games ten years ago, and it >still isn't.
>
>Sure you can, there are profiling tools for that. Nvidia and ATI both have them.
That only gives you global results. Different shaders can still be heavily bottlenecked by different resources. For instance a post filter effect is likely limited by texel fetch while procedural shaders or complex SH lighting is likely compute limited. There's only so much the developer can do about that.
A better hardware architecture is to get rid of the dedicated texel addressing units and filtering. This frees up space for more shader units and more generic load/store units, lessening the bottlenecks for most workloads. Only a very well balanced shader could be slightly worse off due to taking some computing resources for filtering. But that's the exception. This is very similar to the unification of the vertex and pixel pipelines. The average workload is better off with a unified architecture.
Note that some GPUs already use the shader units for part of the texel address calculations, and they might perform FP32 filtering in the shader units as well. So mark my words: the days of dedicated texture sampling are numbered.
>>>>It's clear that telling software developers what (not) to >do doesn't result in
>>>>a succesful next generation hardware architecture. With >non-unified architectures,
>>>>there were numerous applications which were vertex >processing limited, and numerous
>>>>ones which were pixel processing limited. And even those >in the middle have a fluctuating workload.
>>>
>>>Yes, except a unified shader architecture doesn't really preclude that many options.
>>
>>That's what I'm saying.
>
>Um no. You said that programmable shaders are too limited, and you want programmable rasterization and texturing.
Yes, but making these things programmable is equivalent to unification, as I've shown above.
Texturing is close to becoming programmable/unified. Alpha blending is also a perfect candidate for being computed in the shader cores. Note that DX10 already made the alpha test stage programmable.
Rasterization is a bigger challenge so it will probably take longer to become programmable, but it's already actively being researched and the results are "encouraging": http://research.microsoft.com/en-us/um/people/cloop/eg2010.pdf
>I am not convinced that those two should be programmable, and I'm not convinced
>that FF hardware is really that restrictive.
With all due respect, that's because you lack vision of what would be possible with a fully programmable architecture. Don't get me wrong, I'm not claiming I know everything it would be used for. The only thing I'm sure of is that it would be pretty exciting. There's a lot of GPGPU research which remains unsuccessful on mainstream hardware, and there's lots of talk about ray tracing, Reyes, micropolygons, and many other rendering techniques. Many new techniques still need to see the light of day. But we first need the flexible hardware to enable them.
If you're still not convinced, please recall that John Carmack was practically ridiculed for wanting floating-point shaders. Most people were not convinced they would be necessary, or that integer operations were too restrictive. It's pretty obvious nowadays that these people were wrong. And while back then Carmack only envisioned Shader Model 2.0, we're now at Shader Model 5.0 and developers continue to push the limits of the hardware's capabilities. OpenCL is only in its infancy.
>>>First, you do need to count the increases in resolution. Pixels have definitely
>>>increased over time.
>>
>>Yes the resolution has increased but everything else scaled accordingly. More pixels
>>doesn't mean higher benfit from texture compression. In fact TEX:ALU is going down,
>>meaning pixel shaders are more compute limited than >bandwidth limited.
>
>I'm skeptical. Available bandwidth always grows more slowly than compute...
Which is fine. As the computational complexity of software increases, it uses more temporary data. The cache hierarchy can keep up with the cores' bandwidth needs for storing temporary results; only the end results go to RAM. And since the resolution increases only slowly, the current steady evolution in RAM bandwidth suffices.
Note that before pixel shaders, color operations were essentially performed by the ROP units, and you needed multiple passes to create various effects. Shaders store the intermediate results locally, in registers. The bandwidth needs went down (though the freed-up bandwidth was obviously then consumed by other things). Nowadays the GPU doesn't just have massive register files, it also has caches. They're still tiny in comparison to the register files and the computing power, but they'll need to grow to accommodate the working sets of increasingly complex software. This new data is uncompressed, so there's no way around using caches.
For CPUs, the cache size per FLOP is going down, so CPUs and GPUs are converging on this front too. Of course, if it increases again due to a breakthrough in density, that's obviously not a bad thing. So let me rephrase: the die area per FLOP spent on cache is converging.
>>>>In ten more years caches could be around 256 MB, and >that's without taking revolutionary
>>>>new technologies like T-RAM into account. So it's really >hard to imagine that this
>>>>won't suffice to compensate for the texture bandwidth >needs of the low-end graphics market.
>>>
>>>Because you are imagining that the low-end market stays put. It won't.
>>
>>I didn't say it stays put. I said it's a slow moving target. Evidence of this is
>>the ever growing gap between high-end and low-end graphics hardware. IGPs were born
>>out of the demand for really cheap but adequate 3D graphics support. They cover the majority of the market:
>>http://unity3d.com/webplayer/hwstats/pages/web-2011Q1-gfxvendor.html
>>
>>This massive market must obviously have a further division >in price and performance
>>expectations. Some people want a more powerful CPU for the >same price by sacrificing
>>a bit of graphics performance, while others simply want a >cheaper system that isn't
>>aimed at serious gaming. As the CPU performance continues >to increase exponentially,
>>and things like gather/scatter can make a drastic >difference in graphics efficiency,
>>software rendering can satisfy more and more people's expectations, even if those
>>expectations themselves incease slowly.
>
>In essence, what you are saying is that some people would be fine with lower performance
>graphics. That's something I agree with.
>
>I just don't know what the relative performance of SW rendering is to dedicated
>hardware, and how that curve will change over time.
They started to converge at the end of the MHz race, when the focus shifted to performance per Watt. GPUs are essentially limited by performance per Watt as well, but they're pretty much out of architectural options to increase it; they have to rely on new semiconductor processes. CPUs benefit from those equally, but at the same time they kept the clock frequency moderate, increased the number of cores, and increased the vector width. This has massively increased performance per Watt, beyond what the advancement in process technology offers. And they still have FMA up their sleeve.
As for translating this into software rendering performance, it's unfortunately held back by the lack of gather/scatter. But I've shown you plenty of data now which shows that the IGP is going to go the way of the dodo.
>My sense is that hardware is probably getting relatively faster, given the attention Intel is paying.
You have to take into account how much of that is due to increasing the area and/or power consumption. It's easy to mistake a performance increase as an architectural advancement. For several years GPUs were able to increase their ALU density dramatically, but now they're hitting walls. Instead their best hope now is to increase the utilization of these ALUs for complex workloads, using unification and extending the memory hierarchy.
Back to Intel's IGPs, their increased performance is due to additional Execution Units and higher clock frequency. It means that from a cost and power consumption point of view they're not improving relative to software rendering.
>>>That's because it won't.
>>
>>It will. The only strengths the GPU has left are all >components based on the ability
>>to load/store lots of data in parallel. The CPU cores >already achieve higher GFLOPS
>>than the IGP
>
>That's true today, but I suspect it won't be true in the future.
Do you have any indication to back up this suspicion? CPUs still have FMA to come, and I see no indication that the CPU core count is stagnating. Ever more applications see some benefit from quad-core, and Bulldozer will default to eight cores. Haswell is claimed to default to eight (beefier) cores as well.
Note that the non-K models of Sandy Bridge have half the EUs. So the average IGP has a lot of catching up to do to exceed the CPU's GFLOPS. Against an 8-core Haswell CPU it would have to become 8 times more powerful to exceed it. Fat chance.
It's far more interesting to give the CPU cores gather/scatter support and ditch that sorry IGP.
>Also remember
>that you have to share those FLOP/s with other tasks.
If you replace the IGP with CPU cores you get about the same amount of FLOPS in return. Furthermore, with an IGP every game is GPU limited. As unification proves, sharing FLOPS is not a bad thing. Even if the other tasks need lots of FLOPS, it's typically only temporary, for instance when computing physics. At other times the CPU is sitting idle waiting for the IGP to finish.
By unifying the whole thing the bottlenecks are gone and you get closer interaction. Currently a lot of GPGPU applications are unsuccessful because of the round-trip latency. You need to send a massive amount of data-parallel work to the GPU to compensate for it, and that doesn't always work out so well. A lot of applications instead want to process small amounts of work and get the results back fast to interact with them.
Software rendering lets you break free from the legacy graphics pipeline and perform only exactly the operations you want. For instance, when doing a post filter effect you don't need perspective correction or mipmap LOD calculations. You can already do this with a compute shader, but then again, a compute shader IS software rendering...
>>You
>>can either ditch the IGP to make things cheaper, or >replace it with additional CPU
>>cores so you get a really powerful processor for any >workload.
>
>I think to achieve IGP level of performance, using an IGP is the most efficient in terms of power and area.
Yes, but you have to look at the whole picture. The IGP is only used intensively during gaming; at other times its transistors are largely a waste of die space (money). Having more CPU performance per dollar can be of greater importance to the consumer. So just like the most cost-effective solution for audio processing is a software codec, we're close to the point where, for some markets, an IGP-less system offers the best balance.
>>Quoting t-ram.com: "T-RAM Semiconductor has successfully developed the Thyristor-RAM
>>technology from concept to production-readiness. Our Thyristor-RAM technology has
>>been successfully implemented on both Bulk and SOI CMOS. "
>>
>>Sounds like production ready to me.
>
>Not even close.
What makes you so sure? Yes, Z-RAM appears to be a failure but I'd expect them to think twice before making statements like jointly developing it for application in 32 and 22 nm nodes. Since the work on Bulldozer started many years ago and they probably didn't want to take any risks it's highly unlikely to use 32 nm T-RAM, but it could appear in a 22 nm refresh in about two years time.
So do you have any other information aside from the irrelevant comparison to Z-RAM to claim T-RAM is not even close to production?
>>>>2nd gen Z-RAM
>>>
>>>Doesn't work at all.
>>
>>Maybe not as cache memory, but it's hopeful as a DRAM replacement: http://www.z-ram.com/en/pdf/Z-RAM_LV_and_bulk_PR_Final_for_press.pdf
>>
>
>Just a moment ago, you were suggesting it as a cache replacement. Now you suddenly
>are back-tracking? And nobody really wants a proprietary DRAM replacement.
Yes, I wasn't aware that 2nd gen. Z-RAM is not considered an SRAM replacement. Either way, the point was that cache technology is not at a standstill, and one less candidate for improving density doesn't put a lid on it.
>>>It's possible, but they will need to become more competitive from an energy perspective with fixed function stuff.
>>
>>There's not a lot of fixed-function stuff left. The >majority of the GPU's die space
>>consist of programmable or generic components.
>>
>>And I've shown before that the CPUs FLOPS/Watt is in the >same league as GPUs:
>>- Core i7-2820QM: 150 GFLOPS / 45 Watt (more with Turbo Boost)
>>- GeForce GT 420: 134.4 GFLOPS / 50 Watt
>
>The GT 420 is ancient. A better comparison would be the GT 440, which is 96 shaders,
>1.6GHz and 65W. That's ~300 GFLOP/s for 65W, or a roughly 2X advantage.
Ancient? Both the GT 420 and 440 use a GF108 chip.
And my calculator says that's only about a 50% advantage in GFLOPS per Watt, not 2X. That's even without taking into account that Sandy Bridge's TDP includes the IGP and Turbo Mode. So that advantage likely goes up in smoke.
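For what it's worth, the arithmetic behind that figure can be checked with the peak numbers quoted in this thread (a back-of-the-envelope sketch using vendor peak ratings, not measured performance):

```python
# Peak figures as quoted in the discussion (marketing peaks, not benchmarks).
gt440_gflops = 96 * 1.6e9 * 2 / 1e9   # 96 shaders x 1.6 GHz x 2 flops (MAD)
gt440_watt = 65
cpu_gflops, cpu_watt = 150, 45        # quad-core Sandy Bridge with AVX

gpu_per_watt = gt440_gflops / gt440_watt   # ~4.7 GFLOPS/W
cpu_per_watt = cpu_gflops / cpu_watt       # ~3.3 GFLOPS/W
advantage = gpu_per_watt / cpu_per_watt - 1  # ~0.42: roughly 40-50%, not 2X
```

The 2X figure only appears if you compare raw GFLOPS while ignoring the Watts; per Watt the gap is well under 50%, before even accounting for the IGP and Turbo headroom inside the CPU's TDP.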
FMA will double the GFLOP/s score for a minimal increase in transistor count. Add to this powerhouse gather/scatter support and NVIDIA has a serious problem.
>>Obviously software rendering requires a bit more >arithmetic power to implement
>>the remaining fixed-function functionality, but >programmable shaders take the bulk.
>>
>>So there's no lack of energy efficiency. The CPU simply >can't utilize its computing power effectively
>
>GPUs are definitely more power efficient than CPUs.
At effective graphics performance, yes, but that's easy enough to fix with FMA, integer AVX, and gather/scatter.
Advances in power gating technology ensure that CPUs don't waste power on idle components. Also, the Physical Register File of Sandy Bridge significantly reduces the power consumption of the out-of-order execution logic.
So today's CPUs are pretty lean and mean and they're only going to get better. But they don't have to match the effective power efficiency of GPUs to make software rendering viable.
>>All applications that contain loops can benefit from >gather/scatter. That's all applications.
>
>If that's true, then what % performance increase could we expect to see in SPECint?
A 15% reduction in memory accesses and 28% of scalar instructions converted into vector instructions: http://personals.ac.upc.edu/mpajuelo/papers/ISCA02.pdf. This paper uses a dynamic technique, but with a bit of assistance from the programmer (like the use of the 'restrict' keyword) the same results can be achieved statically.
Without gather/scatter a large number of loops can't be vectorized: http://hpc.cs.tsinghua.edu.cn/research/cluster/SPEC2006Characterization/auto_para.html
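As a toy illustration of the kind of loop that defeats auto-vectorization without gather, here's a sketch in Python (my own example; plain lists stand in for vector registers, and a list comprehension models the single gather instruction):

```python
# A loop an auto-vectorizer must give up on without gather support:
# each iteration loads a[b[i]], an address known only at run time.
a = [float(i * i) for i in range(64)]
b = [3, 17, 4, 60, 8, 21, 0, 55]   # non-consecutive, data-dependent indices
c = [10.0] * 8

# Scalar form: eight separate loads from arbitrary addresses.
out_scalar = [a[b[i]] + c[i] for i in range(8)]

# With gather, the same loop body becomes one vector load plus one
# vector add over all eight lanes:
gathered = [a[j] for j in b]       # models a single gather instruction
out_vector = [x + y for x, y in zip(gathered, c)]
```

The addition was always vectorizable; it's the indexed load `a[b[i]]` that forces the whole loop to stay scalar unless the hardware can gather.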
>>With sather/scatter support every scalar operation would >have a parallel equivalent.
>>So any loop with independent iterations can be >parallelized and execute up to 8 times faster.
>
>That's assuming there is no control flow divergence.
Loops with simple flow control are still worth vectorizing.
>>And I don't think the hardware cost is that high. All you >need is a bit of logic
>>to check which elements are located in the same cache >line, and four byte shift
>>units per 128-bit load units instead of one, to collect >the individual elements.
>>Note that logic for sequentially accessing the cache lines >is already largely in
>>place to support load operations which straddle a cache >line boundary.
>
>You are saying that because you don't design hardware. What you are suggesting
>is in fact, quite complicated and large.
I don't currently design hardware, but I have a master's degree in computer engineering with a minor in embedded systems, and I've read 'Digital Integrated Circuits - A Design Perspective' by Rabaey et al. front to back. So by all means, please elaborate on just how complicated and large it would be.
Also please tell me how Larrabee can have 512-bit wide gather/scatter support for each of its tiny cores while a pair of 128-bit gather/scatter units would be quite complicated and large.
Unless of course by quite complicated and large you meant about as complicated and large as texel fetching logic. Sure, it's definitely not trivial to design and the area is not insignificant. But it seems well worth it given that it will allow the vectorization of code which previously wasn't vectorizable.
>>>Really? Have you heard of Vertica? They do an awful lot >>of lossless compression of data in memory.
>>
>>No, I hadn't heard about them before. Could you point me >to some document where
>>they detail how they added hardware support for compressed >memory transfers to reduce bandwidth?
>
>They don't need hardware to do lossless compression. They have a clever column
>oriented database. Check vertica.com. One of their big performance gains is from reducing memory (and disk) bandwidth.
You were previously talking about the importance of dedicated texture decompression hardware. Now you're telling me about Vertica and how they don't need hardware to do lossless compression...
I don't see the relevance of Vertica to this discussion, unless you're actually trying to say dedicated hardware isn't that important after all.
Indeed there are also lossy and lossless techniques to reduce memory bandwidth which can be implemented in software. That includes textures. It's currently not worth the cycles though, as software rendering isn't bandwidth limited.
>>>Many applications use adjacent values.
>>
>>Yes, and many applications also use non-adjacent values.
>>
>>If a loop contains just one load or store at an address >which isn't consecutive,
>>it can't be vectorized (unless you want to resort to >serially extracting/inserting
>>addresses and values). So even if the majority of values >are adjacent, it doesn't
>>take a lot of non-adjacent data to cripple the performance.
>
>You can still vectorize it, you just need to have a bunch of scalar loads/stores
>to deal with the non-adjacent addresses.
That's exactly what I said (note the "unless"). You don't generally want to do that though. Note that you need two instructions (extract and insert) to emulate a single scalar load. So you risk making things slower than the scalar code. See slide 44 here: http://sc.tamu.edu/help/softwareDocs/intel/tutorial/compiler_1.pdf
It would help to have an instruction which takes a vector element as address offset (e.g. "mov ymm0.3, dword ptr [rax+ymm1.3]"), but to really tackle Amdahl's Law we need gather/scatter support which in the ideal case takes a single cycle.
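A simple instruction-count model makes the gap explicit. The breakdown below is one plausible accounting of the 18-instruction figure quoted earlier in this thread (my own accounting, not from Intel's slides): two instructions per element, plus a couple to split and merge the 256-bit halves.

```python
def emulation_instructions(n_elements):
    """AVX1 emulation cost: per element, one extract of the address and
    one insert of the loaded value, plus roughly two instructions to
    split and merge the 256-bit vector halves. One plausible accounting
    of the '18 instructions' for an 8-element gather quoted above."""
    return 2 * n_elements + 2

def native_gather_instructions(n_elements):
    """With hardware support: a single instruction, cracked into uops
    by the load units."""
    return 1
```

An 18-to-1 reduction in instructions is exactly the kind of difference that decides whether vectorizing a loop with indexed loads is a win or a loss.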
>>>>Why? It only accesses the cache lines it needs. If all >elements are from the same
>>>>cache line, it's as fast as accessing a single element.
>>>
>>>And exactly as fast as using AVX! i.e. no improvement >>and more complexity/power.
>>
>>No. The addresses are unknown at compile time. So the only >>option with AVX1 is
>>to sequentially extract each address from the address >vector, and insert the read
>>element into the result vector. This takes 18 instructions.
>
>>With gather support it would be just one instruction. >Assuming it gets split into
>>two 128-bit gather uops, the maximum throughput is 1 every >cycle and the minimal throughput is 1 every 4 cycles.
>
>>>>But even in the worst case
>>>>it can't generate more misses or consume more bandwidth.
>>>
>>>It sure can. Now instead of having 1-2 TLB accesses per cycle, you get 16. How
>>>many TLB copies do you want? How many misses in flight do you want to support?
>>
>>You're still not getting it. It only accesses one cache >line per cycle. It simply
>>has to check which elements are within the same cache >line, and perform a single
>>TLB access for all of these elements. Checking whether the >addresses land on the
>>same cache line doesn't require full translation of each >address.
>
>That's quite complicated hardware, and you can't afford to have that on the critical
>path for any of your normal loads. So now you need a fairly separate load/store pipeline for scatter/gather.
I don't think any significant additions are needed on the critical path itself. It just requires four byte-shift units instead of one, and they operate in parallel as well. Computing which elements go where can be done up front, before entering the critical path for normal loads. It would be perfectly acceptable for scatter/gather to have a higher latency, if necessary.
It's definitely an engineering challenge, but so far I can't think of anything which would jeopardize the feasibility.
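The coalescing logic I have in mind can be sketched in a few lines (a toy model of my proposal, with assumed parameters: 64-byte cache lines and the two load units of Sandy Bridge; addresses sharing a line are detected by comparing upper bits, and each load unit serves one line per cycle):

```python
CACHE_LINE = 64

def gather_cycles(addresses, load_units=2):
    """Model the proposed gather throughput: addresses with equal upper
    bits (addr >> 6) share a cache line and need only one access; each
    load unit serves one distinct line per cycle."""
    lines = {a >> 6 for a in addresses}   # cheap upper-bit equality test
    return -(-len(lines) // load_units)   # ceil(distinct lines / units)

# Best case: all eight 32-bit elements in one line -> 1 cycle.
# Worst case: eight distinct lines -> 4 cycles with two load units.
```

This reproduces the figures from earlier in the post: a minimum of 1 cycle, a worst case of 4, and no more cache or TLB traffic than the scalar loop would generate, since each distinct line is still touched exactly once.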
>>Nothing other than graphics runs better on the IGP. As >I've mentioned before, GPGPU
>>is only succesful using high-end hardware.
>
>Today...unclear what tomorrow holds.
He who predicts the present is always right.
Anyway, yes, GPUs are highly likely to become better at GPGPU applications. But that will require improving their efficiency on workloads which differ from graphics. That means less graphics-specific fixed-function hardware, more programmability, more unification, superscalar scheduling, concurrent kernels, larger caches, etc. Everything is pointing in the direction of the GPU becoming more CPU-like, which means at some point it makes no sense at all to keep things heterogeneous.
>>So the CPU is better than the IGP at absolutely everything >else. That makes it
>>really tempting to have a closer look at what it would >take to make it adequately efficient at graphics as well.
>>
>>The answer: gather/scatter.
>
>It would also need a 2X improvement in FLOP/w and /mm2, possibly more.
Adding FMA support and tossing out the IGP would roughly triple the compute density. Performance per Watt is already excellent, as the comparison against the GeForce GT 420 and GT 440 shows.
>>Multi-core, 256-bit vectors, Hyper-Threading, software >pipelining... the CPU is
>>already a throughput device! It's just being held back by >the lack of parallel load/store
>>support. It's the one missing part to let all those GFLOPS >come to full fruition.
>
>You keep on repeating this as if it were true, but it's not. I agree that lack
>of scatter/gather is an issue. But a more fundamental issue is that throughput
>optimized cores (e.g. shader arrays) are simply more efficient for compute rich
>workloads. You can't really get around that.
You keep on repeating that throughput optimized cores are "simply" more efficient. I've given you dozens of detailed arguments why despite that, software rendering is the future, while you're just handwaving based on the prejudice that CPUs are weak and power hungry.
Yes GPUs are throughput optimized so evidently they are more efficient for compute rich workloads, but they fall flat on their face when running out of registers or if the working set doesn't fit in the cache or if the code is too divergent or if the work batches are too small, etc. CPUs cope much more gracefully with increasing software complexity.
So it's getting less relevant just how efficient GPUs are at compute rich workloads. Nobody cares if they can run Max Payne at 1000 FPS. What matters in the long run is the newer workloads, which are less data parallel and less coherent.
Now, GPUs obviously still dictate the pace at which application developers diverge from compute-rich workloads, so we're not going to see, for example, ray-traced games tomorrow. But GPUs do still suck at these kinds of workloads, and no amount of additional shader cores is going to help. They'll need to evolve in the direction of CPU architectures to enable new workloads.
It also means that CPUs don't have to become as compute optimized as today's GPUs for software rendering to take over. Although they'll still drastically improve at it with FMA and gather/scatter, they have plenty of other valuable features to become the dominant architecture for any workload.
>>What specialized hardware would that be? I've already shown that texture compression
>>hardly makes a difference,
>
>No, you cited extremely old data from a simulator, where even the author of the
>simulator thinks the data is not useful.
No, I only gave that simulator data to clarify the results from actual experiments. It doesn't have to be very accurate to be useful. Regardless of what the exact bandwidth usage looks like today, there will be plenty of headroom and it's not just consumed by texturing.
If you want to debunk that, I suggest you show me recent data for which this isn't true, or tell me exactly why the author thinks the old data isn't useful and how it affects the validity of your dedicated texture decompression hardware importance claim.
>>and sampling and filtering is becoming programmable anyway.
>>Gather/scatter speeds up just about every other pipeline >stage as well.
>
>Except it doesn't benefit many workloads, and it costs a lot of area and power.
>So you want to disable it on the many workloads where it does not help.
>>>I totally agree that scatter/gather is a great capability to have. But what's
>>>the cost in die area, power and complexity? Not just to the core, but also the memory controller, etc.
>>
>>Larrabee has wider vectors and smaller cores, but features gather/scatter support.
>>So I don't think it takes a lot of die space either way. It doesn't require any
>>changes to the memory controller, just the load/store units. I'm not entirely sure
>>but collecting four elements from a cache line can probably largely make use of
>>the existing network to extract one (unaligned) value. And checking which addresses
>>land on the same cache line is a very simple equality test >of the upper bits.
>
>I think you have no or minimal experience designing hardware, so I'm not really
>inclined to take your word for it...especially compared against the expertise of
>the thousands of CPU designers at places like Intel, AMD and IBM.
Let me get this straight... You assume I have no knowledge of hardware design, without pointing out any flaw in my reasoning, and you're more inclined to turn towards experienced CPU designers such as those working at Intel, who added gather/scatter to Larrabee, as an indication why gather/scatter for the CPU isn't feasible?
>Scatter/gather is expensive and that's why it isn't done.
All things weren't done the day before they were done. You can't conclude from that that gather/scatter is (too) expensive.
Lots of things don't happen simply out of poor judgement. For instance some of the SSE instructions are just late fixups of old incomplete extensions. It doesn't mean they were expensive to add the first time around.
Gather/scatter is a significant deviation from the well-known scalar load/store unit. It's very alien to CPU designers, and it requires considerable R&D even if the end result isn't necessarily expensive. Also, CPU designers are often clueless about software applications: they just benchmark current software and try to come up with the next idea for executing it faster. But without gather/scatter support, lots of loops are not vectorized, or software developers compute things very differently. For instance, computing an exponential function with scalar code is best done using lookup tables, but with vector code you currently need to resort to long polynomials. This is then wrongfully interpreted as a need for more arithmetic performance. Another example is converting AoS data into SoA data for SIMD processing. This currently requires lots of shuffle operations between registers, so CPU designers are inclined to add more and faster shuffle units. But with scatter/gather there wouldn't be any need to shuffle data across registers.
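The AoS-to-SoA example can be sketched as follows (a toy model I made up for illustration: eight 4-component vertices, where each SoA lane is produced by what would be a single strided gather in hardware, instead of a cascade of register shuffles):

```python
# AoS: x,y,z,w interleaved per vertex. SIMD wants SoA: all x's together.
aos = [float(v) for v in range(32)]   # 8 vertices x 4 components

# With gather support, extracting one SoA lane is a single strided
# gather (stride 4 elements); without it, the same transpose costs a
# tree of register-to-register shuffles.
xs = [aos[i * 4 + 0] for i in range(8)]   # models gather, stride 4, offset 0
ys = [aos[i * 4 + 1] for i in range(8)]   # models gather, stride 4, offset 1
```

Four such gathers replace the whole shuffle network, which is why faster shuffle units treat the symptom rather than the cause.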
Cheers,
Nicolas
Sandy Bridge CPU article online | mpx | 2010/09/30 01:50 PM |
wii lesson | Michael S | 2010/09/30 02:12 PM |
wii lesson | Dan Downs | 2010/09/30 03:33 PM |
wii lesson | Kevin G | 2010/10/01 12:27 AM |
wii lesson | Rohit | 2010/10/01 07:53 AM |
wii lesson | Kevin G | 2010/10/02 03:30 AM |
wii lesson | mpx | 2010/10/01 09:02 AM |
wii lesson | IntelUser2000 | 2010/10/01 09:31 AM |
GPUs and games | David Kanter | 2010/09/30 08:17 PM |
GPUs and games | hobold | 2010/10/01 05:27 AM |
GPUs and games | anonymous | 2010/10/01 06:35 AM |
GPUs and games | Gabriele Svelto | 2010/10/01 09:07 AM |
GPUs and games | Linus Torvalds | 2010/10/01 10:41 AM |
GPUs and games | Anon | 2010/10/01 11:23 AM |
Can Intel do *this* ??? | Mark Roulo | 2010/10/03 03:17 PM |
Can Intel do *this* ??? | Anon | 2010/10/03 03:29 PM |
Can Intel do *this* ??? | Mark Roulo | 2010/10/03 03:55 PM |
Can Intel do *this* ??? | Anon | 2010/10/03 05:45 PM |
Can Intel do *this* ??? | Ian Ameline | 2010/10/03 10:35 PM |
Graphics, IGPs, and Cache | Joe | 2010/10/10 09:51 AM |
Graphics, IGPs, and Cache | Anon | 2010/10/10 10:18 PM |
Graphics, IGPs, and Cache | Rohit | 2010/10/11 06:14 AM |
Graphics, IGPs, and Cache | hobold | 2010/10/11 06:43 AM |
Maybe the IGPU doesn't load into the L3 | Mark Roulo | 2010/10/11 08:05 AM |
Graphics, IGPs, and Cache | David Kanter | 2010/10/11 09:01 AM |
Can Intel do *this* ??? | Gabriele Svelto | 2010/10/04 12:31 AM |
Kanter's Law. | Ian Ameline | 2010/10/01 02:05 PM |
Kanter's Law. | David Kanter | 2010/10/01 02:18 PM |
Kanter's Law. | Ian Ameline | 2010/10/01 02:33 PM |
Kanter's Law. | Kevin G | 2010/10/01 04:19 PM |
Kanter's Law. | IntelUser2000 | 2010/10/01 10:36 PM |
Kanter's Law. | Kevin G | 2010/10/02 03:15 AM |
Kanter's Law. | IntelUser2000 | 2010/10/02 02:35 PM |
Wii vs pc's | Rohit | 2010/10/01 07:34 PM |
Wii vs pc's | Gabriele Svelto | 2010/10/01 11:54 PM |
GPUs and games | mpx | 2010/10/02 11:30 AM |
GPUs and games | Foo_ | 2010/10/02 04:03 PM |
GPUs and games | mpx | 2010/10/03 11:29 AM |
GPUs and games | Foo_ | 2010/10/03 01:52 PM |
GPUs and games | mpx | 2010/10/03 03:29 PM |
GPUs and games | Anon | 2010/10/03 03:49 PM |
GPUs and games | mpx | 2010/10/04 11:42 AM |
GPUs and games | MS | 2010/10/04 02:51 PM |
GPUs and games | Anon | 2010/10/04 08:29 PM |
persistence of vision | hobold | 2010/10/04 11:47 PM |
GPUs and games | mpx | 2010/10/05 12:51 AM |
GPUs and games | MS | 2010/10/05 06:49 AM |
GPUs and games | Jack | 2010/10/05 11:17 AM |
GPUs and games | MS | 2010/10/05 05:19 PM |
GPUs and games | Jack | 2010/10/05 11:11 AM |
GPUs and games | mpx | 2010/10/05 12:51 PM |
GPUs and games | David Kanter | 2010/10/06 09:04 AM |
GPUs and games | jack | 2010/10/06 09:34 PM |
GPUs and games | Linus Torvalds | 2010/10/05 07:29 AM |
GPUs and games | Foo_ | 2010/10/04 04:49 AM |
GPUs and games | Jeremiah | 2010/10/08 10:58 AM |
GPUs and games | MS | 2010/10/08 01:37 PM |
GPUs and games | Salvatore De Dominicis | 2010/10/04 01:41 AM |
GPUs and games | Kevin G | 2010/10/05 02:13 PM |
GPUs and games | mpx | 2010/10/03 11:36 AM |
GPUs and games | David Kanter | 2010/10/04 07:08 AM |
GPUs and games | Kevin G | 2010/10/04 10:38 AM |
Sandy Bridge CPU article online | NEON cortex | 2010/11/17 09:19 PM |
Sandy Bridge CPU article online | Ian Ameline | 2010/09/30 12:06 PM |
Sandy Bridge CPU article online | rwessel | 2010/09/30 02:29 PM |
Sandy Bridge CPU article online | Michael S | 2010/09/30 03:06 PM |
Sandy Bridge CPU article online | rwessel | 2010/09/30 06:55 PM |
Sandy Bridge CPU article online | David Hess | 2010/10/01 03:53 AM |
Sandy Bridge CPU article online | rwessel | 2010/10/01 08:30 AM |
Sandy Bridge CPU article online | David Hess | 2010/10/01 09:31 AM |
Sandy Bridge CPU article online | rwessel | 2010/10/01 10:56 AM |
Sandy Bridge CPU article online | David Hess | 2010/10/01 08:28 PM |
Sandy Bridge CPU article online | Ricardo B | 2010/10/02 05:38 AM |
Sandy Bridge CPU article online | David Hess | 2010/10/02 06:59 PM |
which bus more wasteful | Michael S | 2010/10/02 10:38 AM |
which bus more wasteful | rwessel | 2010/10/02 07:15 PM |
Sandy Bridge CPU article online | Ricardo B | 2010/10/01 10:08 AM |
Sandy Bridge CPU article online | David Hess | 2010/10/01 08:31 PM |
Sandy Bridge CPU article online | Andi Kleen | 2010/10/01 11:55 AM |
Sandy Bridge CPU article online | David Hess | 2010/10/01 08:32 PM |
Sandy Bridge CPU article online | kdg | 2010/10/01 11:26 AM |
Sandy Bridge CPU article online | Anon | 2010/10/01 11:33 AM |
Analog display out? | David Kanter | 2010/10/01 01:05 PM |
Analog display out? | mpx | 2010/10/02 11:46 AM |
Analog display out? | Anon | 2010/10/03 03:26 PM |
Digital is expensive! | David Kanter | 2010/10/03 06:36 PM |
Digital is expensive! | Anon | 2010/10/03 08:07 PM |
Digital is expensive! | David Kanter | 2010/10/03 10:02 PM |
Digital is expensive! | Steve Underwood | 2010/10/04 03:52 AM |
Digital is expensive! | David Kanter | 2010/10/04 07:03 AM |
Digital is expensive! | anonymous | 2010/10/04 07:11 AM |
Digital is not very expensive! | Steve Underwood | 2010/10/04 06:08 PM |
Digital is not very expensive! | Anon | 2010/10/04 08:33 PM |
Digital is not very expensive! | Steve Underwood | 2010/10/04 11:03 PM |
Digital is not very expensive! | mpx | 2010/10/05 01:10 PM |
Digital is not very expensive! | Gabriele Svelto | 2010/10/05 12:24 AM |
Digital is expensive! | jal142 | 2010/10/04 11:46 AM |
Digital is expensive! | mpx | 2010/10/04 01:04 AM |
Digital is expensive! | Gabriele Svelto | 2010/10/04 03:28 AM |
Digital is expensive! | Mark Christiansen | 2010/10/04 03:12 PM |
Analog display out? | slacker | 2010/10/03 06:44 PM |
Analog display out? | Anon | 2010/10/03 08:05 PM |
Analog display out? | Steve Underwood | 2010/10/04 03:48 AM |
Sandy Bridge CPU article online | David Hess | 2010/10/01 08:37 PM |
Sandy Bridge CPU article online | slacker | 2010/10/02 02:53 PM |
Sandy Bridge CPU article online | David Hess | 2010/10/02 06:49 PM |
memory bandwith | Max | 2010/09/30 12:19 PM |
memory bandwith | Anon | 2010/10/01 11:28 AM |
memory bandwith | Jack | 2010/10/01 07:45 PM |
memory bandwith | Anon | 2010/10/03 03:19 PM |
Sandy Bridge CPU article online | PiedPiper | 2010/09/30 07:05 PM |
Sandy Bridge CPU article online | Matt Sayler | 2010/09/29 04:38 PM |
Sandy Bridge CPU article online | Jack | 2010/09/29 09:39 PM |
Sandy Bridge CPU article online | mpx | 2010/09/30 12:24 AM |
Sandy Bridge CPU article online | passer | 2010/09/30 03:15 AM |
Sandy Bridge CPU article online | mpx | 2010/09/30 03:47 AM |
Sandy Bridge CPU article online | passer | 2010/09/30 04:25 AM |
SB and web browsing | Rohit | 2010/09/30 06:47 AM |
SB and web browsing | David Hess | 2010/09/30 07:10 AM |
SB and web browsing | MS | 2010/09/30 10:21 AM |
SB and web browsing | passer | 2010/09/30 10:26 AM |
SB and web browsing | MS | 2010/10/02 06:41 PM |
SB and web browsing | Rohit | 2010/10/01 08:02 AM |
Sandy Bridge CPU article online | David Kanter | 2010/09/30 08:35 AM |
Sandy Bridge CPU article online | Jack | 2010/09/30 10:40 PM |
processor evolution | hobold | 2010/09/29 02:16 PM |
processor evolution | Foo_ | 2010/09/30 06:10 AM |
processor evolution | Jack | 2010/09/30 07:07 PM |
3D gaming as GPGPU app | hobold | 2010/10/01 04:59 AM |
3D gaming as GPGPU app | Jack | 2010/10/01 07:39 PM |
processor evolution | hobold | 2010/10/01 04:35 AM |
processor evolution | David Kanter | 2010/10/01 10:02 AM |
processor evolution | Anon | 2010/10/01 11:46 AM |
Display | David Kanter | 2010/10/01 01:26 PM |
Display | Rohit | 2010/10/02 02:56 AM |
Display | Linus Torvalds | 2010/10/02 07:40 AM |
Display | rwessel | 2010/10/02 08:58 AM |
Display | sJ | 2010/10/02 10:28 PM |
Display | rwessel | 2010/10/03 08:38 AM |
Display | Anon | 2010/10/03 03:06 PM |
Display tech and compute are different | David Kanter | 2010/10/03 06:33 PM |
Display tech and compute are different | Anon | 2010/10/03 08:16 PM |
Display tech and compute are different | David Kanter | 2010/10/03 10:00 PM |
Display tech and compute are different | hobold | 2010/10/04 01:40 AM |
Display | ? | 2010/10/03 03:02 AM |
Display | Linus Torvalds | 2010/10/03 10:18 AM |
Display | Richard Cownie | 2010/10/03 11:12 AM |
Display | Linus Torvalds | 2010/10/03 12:16 PM |
Display | slacker | 2010/10/03 07:35 PM |
current V12 engines with >6.0 displacement | anonymous | 2010/10/04 07:06 AM |
current V12 engines with >6.0 displacement | Ricardo B | 2010/10/04 11:44 AM |
current V12 engines with >6.0 displacement | anonymous | 2010/10/04 02:59 PM |
current V12 engines with >6.0 displacement | Ricardo B | 2010/10/04 03:13 PM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/04 08:58 PM |
current V12 engines with >6.0 displacement | slacker | 2010/10/05 01:39 AM |
current V12 engines with >6.0 displacement | MS | 2010/10/05 06:57 AM |
current V12 engines with >6.0 displacement | Ricardo B | 2010/10/05 01:20 PM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/05 09:26 PM |
current V12 engines with >6.0 displacement | slacker | 2010/10/06 05:39 AM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/06 01:22 PM |
current V12 engines with >6.0 displacement | Ricardo B | 2010/10/06 03:07 PM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/06 03:56 PM |
current V12 engines with >6.0 displacement | rwessel | 2010/10/06 03:30 PM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/06 03:53 PM |
current V12 engines with >6.0 displacement | Anonymous | 2010/10/07 01:32 PM |
current V12 engines with >6.0 displacement | rwessel | 2010/10/07 07:54 PM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/07 09:02 PM |
Top Gear is awful, and Jeremy Clarkson cannot drive. | slacker | 2010/10/06 07:20 PM |
Top Gear is awful, and Jeremy Clarkson cannot drive. | Ricardo B | 2010/10/07 01:32 AM |
Top Gear is awful, and Jeremy Clarkson cannot drive. | slacker | 2010/10/07 08:15 AM |
Top Gear is awful, and Jeremy Clarkson cannot drive. | Ricardo B | 2010/10/07 10:51 AM |
current V12 engines with >6.0 displacement | anon | 2010/10/06 05:03 PM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/06 06:26 PM |
current V12 engines with >6.0 displacement | anon | 2010/10/06 11:15 PM |
current V12 engines with >6.0 displacement | Howard Chu | 2010/10/07 02:16 PM |
current V12 engines with >6.0 displacement | Anon | 2010/10/05 10:31 PM |
current V12 engines with >6.0 displacement | slacker | 2010/10/06 05:55 AM |
current V12 engines with >6.0 displacement | Ricardo B | 2010/10/06 06:15 AM |
current V12 engines with >6.0 displacement | slacker | 2010/10/06 06:34 AM |
I wonder is there any tech area that this forum doesn't have an opinion on (NT) | Rob Thorpe | 2010/10/06 10:11 AM |
Cunieform tablets | David Kanter | 2010/10/06 12:57 PM |
Cunieform tablets | Linus Torvalds | 2010/10/06 01:06 PM |
Ouch...maybe I should hire a new editor (NT) | David Kanter | 2010/10/06 04:38 PM |
Cunieform tablets | rwessel | 2010/10/06 03:41 PM |
Cunieform tablets | seni | 2010/10/07 10:56 AM |
Cunieform tablets | Howard Chu | 2010/10/07 01:44 PM |
current V12 engines with >6.0 displacement | Anonymous | 2010/10/06 06:10 PM |
current V12 engines with >6.0 displacement | anonymous | 2010/10/06 10:44 PM |
current V12 engines with >6.0 displacement | slacker | 2010/10/07 07:55 AM |
current V12 engines with >6.0 displacement | anonymous | 2010/10/07 08:51 AM |
current V12 engines with >6.0 displacement | slacker | 2010/10/07 07:38 PM |
current V12 engines with >6.0 displacement | anonymous | 2010/10/07 08:33 PM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/07 09:04 PM |
Practical vehicles for commuting | Rob Thorpe | 2010/10/08 05:50 AM |
Practical vehicles for commuting | Gabriele Svelto | 2010/10/08 06:05 AM |
Practical vehicles for commuting | Rob Thorpe | 2010/10/08 06:21 AM |
Practical vehicles for commuting | j | 2010/10/08 02:20 PM |
Practical vehicles for commuting | Rob Thorpe | 2010/12/09 07:00 AM |
current V12 engines with >6.0 displacement | anonymous | 2010/10/08 10:14 AM |
current V12 engines with >6.0 displacement | Anonymous | 2010/10/07 01:23 PM |
current V12 engines with >6.0 displacement | anon | 2010/10/07 04:08 PM |
current V12 engines with >6.0 displacement | anonymous | 2010/10/07 05:41 PM |
current V12 engines with >6.0 displacement | slacker | 2010/10/07 08:05 PM |
current V12 engines with >6.0 displacement | anonymous | 2010/10/07 08:52 PM |
current V12 engines with >6.0 displacement | Anonymous | 2010/10/08 07:52 PM |
current V12 engines with >6.0 displacement | anon | 2010/10/06 11:28 PM |
current V12 engines with >6.0 displacement | Aaron Spink | 2010/10/07 12:37 AM |
current V12 engines with >6.0 displacement | Ricardo B | 2010/10/07 01:37 AM |
current V12 engines with >6.0 displacement | slacker | 2010/10/05 02:02 AM |
Display | Linus Torvalds | 2010/10/04 10:39 AM |
Display | Gabriele Svelto | 2010/10/05 12:34 AM |
Display | Richard Cownie | 2010/10/04 06:22 AM |
Display | anon | 2010/10/04 09:22 PM |
Display | Richard Cownie | 2010/10/05 06:42 AM |
Display | mpx | 2010/10/03 11:55 AM |
Display | rcf | 2010/10/03 01:12 PM |
Display | mpx | 2010/10/03 02:36 PM |
Display | rcf | 2010/10/03 05:36 PM |
Display | Ricardo B | 2010/10/04 02:50 PM |
Display | gallier2 | 2010/10/05 03:44 AM |
Display | David Hess | 2010/10/05 05:21 AM |
Display | gallier2 | 2010/10/05 08:21 AM |
Display | David Hess | 2010/10/03 11:21 PM |
Display | rcf | 2010/10/04 08:06 AM |
Display | David Kanter | 2010/10/03 01:54 PM |
Alternative integration | Paul A. Clayton | 2010/10/06 08:51 AM |
Display | slacker | 2010/10/03 07:26 PM |
Display & marketing & analogies | ? | 2010/10/04 02:33 AM |
Display & marketing & analogies | kdg | 2010/10/04 06:00 AM |
Display | Kevin G | 2010/10/02 09:49 AM |
Display | Anon | 2010/10/03 03:43 PM |
Sandy Bridge CPU article online | David Kanter | 2010/09/29 03:17 PM |
Sandy Bridge CPU article online | Jack | 2010/09/28 06:27 AM |
Sandy Bridge CPU article online | IntelUser2000 | 2010/09/28 03:07 AM |
Sandy Bridge CPU article online | mpx | 2010/09/28 12:34 PM |
Sandy Bridge CPU article online | Aaron Spink | 2010/09/28 01:28 PM |
Sandy Bridge CPU article online | JoshW | 2010/09/28 02:13 PM |
Sandy Bridge CPU article online | mpx | 2010/09/28 02:54 PM |
Sandy Bridge CPU article online | Foo_ | 2010/09/29 01:19 AM |
Sandy Bridge CPU article online | mpx | 2010/09/29 03:06 AM |
Sandy Bridge CPU article online | JS | 2010/09/29 03:42 AM |
Sandy Bridge CPU article online | mpx | 2010/09/29 04:03 AM |
Sandy Bridge CPU article online | Foo_ | 2010/09/29 05:55 AM |
Sandy Bridge CPU article online | ajensen | 2010/09/28 12:19 AM |
Sandy Bridge CPU article online | Ian Ollmann | 2010/09/28 04:52 PM |
Sandy Bridge CPU article online | a reader | 2010/09/28 05:05 PM |
Sandy Bridge CPU article online | ajensen | 2010/09/28 11:35 PM |
Updated: Sandy Bridge CPU article | David Kanter | 2010/10/01 05:11 AM |
Updated: Sandy Bridge CPU article | anon | 2011/01/07 09:55 PM |
Updated: Sandy Bridge CPU article | Eric Bron | 2011/01/08 03:29 AM |
Updated: Sandy Bridge CPU article | anon | 2011/01/11 11:24 PM |
Updated: Sandy Bridge CPU article | anon | 2011/01/15 11:21 AM |
David Kanter can you shed some light? Re Updated: Sandy Bridge CPU article | anon | 2011/01/16 11:22 PM |
David Kanter can you shed some light? Re Updated: Sandy Bridge CPU article | anonymous | 2011/01/17 02:04 AM |
David Kanter can you shed some light? Re Updated: Sandy Bridge CPU article | anon | 2011/01/17 07:12 AM |
I can try.... | David Kanter | 2011/01/18 03:54 PM |
I can try.... | anon | 2011/01/18 08:07 PM |
I can try.... | David Kanter | 2011/01/18 11:24 PM |
I can try.... | anon | 2011/01/19 07:51 AM |
Wider fetch than execute makes sense | Paul A. Clayton | 2011/01/19 08:53 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/04 07:29 AM |
Sandy Bridge CPU article online | Seni | 2011/01/04 09:07 PM |
Sandy Bridge CPU article online | hobold | 2011/01/04 11:26 PM |
Sandy Bridge CPU article online | Michael S | 2011/01/05 02:01 AM |
software assist exceptions | hobold | 2011/01/05 04:36 PM |
Sandy Bridge CPU article online | Michael S | 2011/01/05 01:58 AM |
Sandy Bridge CPU article online | anon | 2011/01/05 04:51 AM |
Sandy Bridge CPU article online | Seni | 2011/01/05 08:53 AM |
Sandy Bridge CPU article online | Michael S | 2011/01/05 09:03 AM |
Sandy Bridge CPU article online | anon | 2011/01/05 04:14 PM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/05 04:50 AM |
Sandy Bridge CPU article online | Gabriele Svelto | 2011/01/05 05:00 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/05 07:26 AM |
Sandy Bridge CPU article online | Gabriele Svelto | 2011/01/05 07:50 AM |
Sandy Bridge CPU article online | Michael S | 2011/01/05 08:39 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/05 03:50 PM |
permuting vector elements | hobold | 2011/01/05 05:03 PM |
permuting vector elements | Nicolas Capens | 2011/01/05 06:01 PM |
permuting vector elements | Nicolas Capens | 2011/01/06 08:27 AM |
Sandy Bridge CPU article online | Gabriele Svelto | 2011/01/11 11:33 AM |
Sandy Bridge CPU article online | EduardoS | 2011/01/11 01:51 PM |
Sandy Bridge CPU article online | hobold | 2011/01/11 02:11 PM |
Sandy Bridge CPU article online | David Kanter | 2011/01/11 06:07 PM |
Sandy Bridge CPU article online | Michael S | 2011/01/12 03:25 AM |
Sandy Bridge CPU article online | hobold | 2011/01/12 05:03 PM |
Sandy Bridge CPU article online | David Kanter | 2011/01/12 11:27 PM |
Sandy Bridge CPU article online | Eric Bron | 2011/01/13 02:38 AM |
Sandy Bridge CPU article online | Michael S | 2011/01/13 03:32 AM |
Sandy Bridge CPU article online | hobold | 2011/01/13 01:53 PM |
What happened to VPERMIL2PS? | Michael S | 2011/01/13 03:46 AM |
What happened to VPERMIL2PS? | Eric Bron | 2011/01/13 06:46 AM |
Lower cost permute | Paul A. Clayton | 2011/01/13 12:11 PM |
Sandy Bridge CPU article online | anon | 2011/01/25 06:31 PM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/12 06:34 PM |
Sandy Bridge CPU article online | Gabriele Svelto | 2011/01/13 07:38 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/15 09:47 PM |
Sandy Bridge CPU article online | Gabriele Svelto | 2011/01/16 03:13 AM |
And just to make a further example | Gabriele Svelto | 2011/01/16 04:24 AM |
Sandy Bridge CPU article online | mpx | 2011/01/16 01:27 PM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/25 02:56 PM |
Sandy Bridge CPU article online | David Kanter | 2011/01/25 04:11 PM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/26 08:49 AM |
Sandy Bridge CPU article online | EduardoS | 2011/01/26 04:35 PM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/27 02:51 AM |
Sandy Bridge CPU article online | EduardoS | 2011/01/27 02:40 PM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/28 03:24 AM |
Sandy Bridge CPU article online | Eric Bron | 2011/01/28 03:49 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/30 02:11 PM |
Sandy Bridge CPU article online | Eric Bron | 2011/01/31 03:43 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/02/01 04:02 AM |
Sandy Bridge CPU article online | Eric Bron | 2011/02/01 04:28 AM |
Sandy Bridge CPU article online | Eric Bron | 2011/02/01 04:43 AM |
Sandy Bridge CPU article online | EduardoS | 2011/01/28 07:14 PM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/02/01 02:58 AM |
Sandy Bridge CPU article online | EduardoS | 2011/02/01 02:36 PM |
Sandy Bridge CPU article online | anon | 2011/02/01 04:56 PM |
Sandy Bridge CPU article online | EduardoS | 2011/02/01 09:17 PM |
Sandy Bridge CPU article online | anon | 2011/02/01 10:13 PM |
Sandy Bridge CPU article online | Eric Bron | 2011/02/02 04:08 AM |
Sandy Bridge CPU article online | Eric Bron | 2011/02/02 04:26 AM |
Sandy Bridge CPU article online | kalmaegi | 2011/02/01 09:29 AM |
SW Rasterization | David Kanter | 2011/01/27 05:18 PM |
Lower pin count memory | iz | 2011/01/27 09:19 PM |
Lower pin count memory | David Kanter | 2011/01/27 09:25 PM |
Lower pin count memory | iz | 2011/01/27 11:31 PM |
Lower pin count memory | David Kanter | 2011/01/27 11:52 PM |
Lower pin count memory | iz | 2011/01/28 12:28 AM |
Lower pin count memory | David Kanter | 2011/01/28 01:05 AM |
Lower pin count memory | iz | 2011/01/28 03:55 AM |
Lower pin count memory | David Hess | 2011/01/28 01:15 PM |
Lower pin count memory | David Kanter | 2011/01/28 01:57 PM |
Lower pin count memory | iz | 2011/01/28 05:20 PM |
Two years later | ForgotPants | 2013/10/26 11:33 AM |
Two years later | anon | 2013/10/26 11:36 AM |
Two years later | Exophase | 2013/10/26 12:56 PM |
Two years later | David Hess | 2013/10/26 05:05 PM |
Herz is totally the thing you DON*T care. | Jouni Osmala | 2013/10/27 01:48 AM |
Herz is totally the thing you DON*T care. | EduardoS | 2013/10/27 07:00 AM |
Herz is totally the thing you DON*T care. | Michael S | 2013/10/27 07:45 AM |
Two years later | someone | 2013/10/28 07:21 AM |
Lower pin count memory | Martin Høyer Kristiansen | 2011/01/28 01:41 AM |
Lower pin count memory | iz | 2011/01/28 03:07 AM |
Lower pin count memory | Darrell Coker | 2011/01/27 10:39 PM |
Lower pin count memory | iz | 2011/01/28 12:20 AM |
Lower pin count memory | Darrell Coker | 2011/01/28 06:07 PM |
Lower pin count memory | iz | 2011/01/28 11:57 PM |
Lower pin count memory | Darrell Coker | 2011/01/29 02:21 AM |
Lower pin count memory | iz | 2011/01/31 10:28 PM |
SW Rasterization | Nicolas Capens | 2011/02/02 08:48 AM |
SW Rasterization | Eric Bron | 2011/02/02 09:37 AM |
SW Rasterization | Nicolas Capens | 2011/02/02 04:35 PM |
SW Rasterization | Eric Bron | 2011/02/02 05:11 PM |
SW Rasterization | Eric Bron | 2011/02/03 02:13 AM |
SW Rasterization | Nicolas Capens | 2011/02/04 07:57 AM |
SW Rasterization | Eric Bron | 2011/02/04 08:50 AM |
erratum | Eric Bron | 2011/02/04 08:58 AM |
SW Rasterization | Nicolas Capens | 2011/02/04 05:25 PM |
SW Rasterization | David Kanter | 2011/02/04 05:33 PM |
SW Rasterization | anon | 2011/02/04 06:04 PM |
SW Rasterization | Nicolas Capens | 2011/02/05 03:39 PM |
SW Rasterization | David Kanter | 2011/02/05 05:07 PM |
SW Rasterization | Nicolas Capens | 2011/02/05 11:39 PM |
SW Rasterization | Eric Bron | 2011/02/04 10:55 AM |
Comments pt 1 | David Kanter | 2011/02/02 01:08 PM |
Comments pt 1 | Eric Bron | 2011/02/02 03:16 PM |
Comments pt 1 | Gabriele Svelto | 2011/02/03 01:37 AM |
Comments pt 1 | Eric Bron | 2011/02/03 02:36 AM |
Comments pt 1 | Nicolas Capens | 2011/02/03 11:08 PM |
Comments pt 1 | Nicolas Capens | 2011/02/03 10:26 PM |
Comments pt 1 | Eric Bron | 2011/02/04 03:33 AM |
Comments pt 1 | Nicolas Capens | 2011/02/04 05:24 AM |
example code | Eric Bron | 2011/02/04 04:51 AM |
example code | Nicolas Capens | 2011/02/04 08:24 AM |
example code | Eric Bron | 2011/02/04 08:36 AM |
example code | Nicolas Capens | 2011/02/05 11:43 PM |
Comments pt 1 | Rohit | 2011/02/04 12:43 PM |
Comments pt 1 | Nicolas Capens | 2011/02/04 05:05 PM |
Comments pt 1 | David Kanter | 2011/02/04 05:36 PM |
Comments pt 1 | Nicolas Capens | 2011/02/05 02:45 PM |
Comments pt 1 | Eric Bron | 2011/02/05 04:13 PM |
Comments pt 1 | Nicolas Capens | 2011/02/05 11:52 PM |
Comments pt 1 | Eric Bron | 2011/02/06 01:31 AM |
Comments pt 1 | Nicolas Capens | 2011/02/06 04:06 PM |
Comments pt 1 | Eric Bron | 2011/02/07 03:12 AM |
The need for gather/scatter support | Nicolas Capens | 2011/02/10 10:07 AM |
The need for gather/scatter support | Eric Bron | 2011/02/11 03:11 AM |
Gather/scatter performance data | Nicolas Capens | 2011/02/13 03:39 AM |
Gather/scatter performance data | Eric Bron | 2011/02/13 07:46 AM |
Gather/scatter performance data | Nicolas Capens | 2011/02/14 07:48 AM |
Gather/scatter performance data | Eric Bron | 2011/02/14 09:32 AM |
Gather/scatter performance data | Eric Bron | 2011/02/14 10:07 AM |
Gather/scatter performance data | Eric Bron | 2011/02/13 09:00 AM |
Gather/scatter performance data | Nicolas Capens | 2011/02/14 07:49 AM |
Gather/scatter performance data | Eric Bron | 2011/02/15 02:23 AM |
Gather/scatter performance data | Eric Bron | 2011/02/13 05:06 PM |
Gather/scatter performance data | Nicolas Capens | 2011/02/14 07:52 AM |
Gather/scatter performance data | Eric Bron | 2011/02/14 09:43 AM |
SW Rasterization - a long way off | Rohit | 2011/02/02 01:17 PM |
SW Rasterization - a long way off | Nicolas Capens | 2011/02/04 03:59 AM |
CPU only rendering - a long way off | Rohit | 2011/02/04 11:52 AM |
CPU only rendering - a long way off | Nicolas Capens | 2011/02/04 07:15 PM |
CPU only rendering - a long way off | Rohit | 2011/02/05 02:00 AM |
CPU only rendering - a long way off | Nicolas Capens | 2011/02/05 09:45 PM |
CPU only rendering - a long way off | David Kanter | 2011/02/06 09:51 PM |
CPU only rendering - a long way off | Gian-Carlo Pascutto | 2011/02/07 12:22 AM |
Encryption | David Kanter | 2011/02/07 01:18 AM |
Encryption | Nicolas Capens | 2011/02/07 07:51 AM |
Encryption | David Kanter | 2011/02/07 11:50 AM |
Encryption | Nicolas Capens | 2011/02/08 10:26 AM |
CPUs are latency optimized | David Kanter | 2011/02/08 11:38 AM |
efficient compiler on an efficient GPU real today. | sJ | 2011/02/08 11:29 PM |
CPUs are latency optimized | Nicolas Capens | 2011/02/09 09:49 PM |
CPUs are latency optimized | Eric Bron | 2011/02/10 12:49 AM |
CPUs are latency optimized | Antti-Ville Tuunainen | 2011/02/10 06:16 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/10 07:04 AM |
CPUs are latency optimized | Eric Bron | 2011/02/10 07:48 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/10 01:31 PM |
CPUs are latency optimized | Eric Bron | 2011/02/11 02:43 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/11 07:31 AM |
CPUs are latency optimized | EduardoS | 2011/02/10 05:29 PM |
CPUs are latency optimized | Anon | 2011/02/10 06:40 PM |
CPUs are latency optimized | David Kanter | 2011/02/10 08:33 PM |
CPUs are latency optimized | EduardoS | 2011/02/11 02:18 PM |
CPUs are latency optimized | Nicolas Capens | 2011/02/11 05:56 AM |
CPUs are latency optimized | Rohit | 2011/02/11 07:33 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/14 02:19 AM |
CPUs are latency optimized | Eric Bron | 2011/02/14 03:23 AM |
CPUs are latency optimized | EduardoS | 2011/02/14 01:11 PM |
CPUs are latency optimized | David Kanter | 2011/02/11 02:45 PM |
CPUs are latency optimized | Nicolas Capens | 2011/02/15 05:22 AM |
CPUs are latency optimized | David Kanter | 2011/02/15 12:47 PM |
CPUs are latency optimized | Nicolas Capens | 2011/02/15 07:10 PM |
Have fun | David Kanter | 2011/02/15 10:04 PM |
Have fun | Nicolas Capens | 2011/02/17 03:59 AM |
Have fun | Brett | 2011/02/17 12:56 PM |
Have fun | Nicolas Capens | 2011/02/19 04:53 PM |
Have fun | Brett | 2011/02/20 06:08 PM |
Have fun | Brett | 2011/02/20 07:13 PM |
On-die storage to fight Amdahl | Nicolas Capens | 2011/02/23 05:37 PM |
On-die storage to fight Amdahl | Brett | 2011/02/23 09:59 PM |
On-die storage to fight Amdahl | Brett | 2011/02/23 10:08 PM |
On-die storage to fight Amdahl | Nicolas Capens | 2011/02/24 07:42 PM |
On-die storage to fight Amdahl | Rohit | 2011/02/25 11:02 PM |
On-die storage to fight Amdahl | Nicolas Capens | 2011/03/09 06:53 PM |
On-die storage to fight Amdahl | Rohit | 2011/03/10 08:02 AM |
NVIDIA using tile based rendering? | Nathan Monson | 2011/03/11 07:58 PM |
NVIDIA using tile based rendering? | Rohit | 2011/03/12 04:29 AM |
NVIDIA using tile based rendering? | Nathan Monson | 2011/03/12 11:05 AM |
NVIDIA using tile based rendering? | Rohit | 2011/03/12 11:16 AM |
On-die storage to fight Amdahl | Brett | 2011/02/26 02:10 AM |