By: Nicolas Capens (nicolas.capens.delete@this.gmail.com), February 9, 2011 8:49 pm
Room: Moderated Discussions
Hi David,
>>>>David Kanter (dkanter@realworldtech.com) on 2/7/11 wrote:
>>>>---------------------------
>>>You are talking about eliminating throughput cores. You claim to have a basic
>>>understanding of hardware, so it should be readily apparent how CPU cores and throughput
>>>cores (e.g. Niagara, GPU shaders) differ.
>>
>>Throughput-oriented is a term which originates from server systems, long before
>>graphics chips were even called GPUs! It merely means a focus on data rate, potentially
>>at the cost of latency. Any multiprocessor system, is a throughput oriented system.
>>Clock frequency and ILP are latency-oriented, while DLP and TLP are throughput-oriented.
>>
>>Today's x86 CPUs exploit DLP and TLP (SIMD, multi-core and Hyper-Threading) and
>>have a less aggressive clocking than several years ago.
>
>Actually the clock speeds are about the same 3-4GHz.
Indeed. You appear to have missed the word "aggressive". At 90 nm, 3.8 GHz was really pushing it; the Pentium 4 even had an integer core running at twice that frequency. It required very short pipeline stages and low-voltage-swing circuits, which take a lot of extra transistors. So at 32 nm, 3.8 GHz is a breeze.
>>So they are definitely throughput-oriented
>>architectures already, and would become more efficient with the addition of gather/scatter
>>support. Parallel load/store is the main thing setting them apart from GPUs.
>
>That's simply false. There are many more differences in terms of circuit design
>and the latency of individual instructions.
Sure, but those things don't make it any less throughput-oriented.
>>GPUs are, in the words of NVIDIA's chief scientist, "aggressively throughput-oriented
>>processors". Note though that GF104 features superscalar execution, intended to
>>lower the latency. And aside from reducing bandwidth, caches also reduce latency.
>>So GPUs are forced to become less aggressive at using throughput-oriented techniques,
>>because reducing latency somewhat also reduces the amount of on-chip storage you
>>need. It's a balancing act, because obviously reducing latency costs transistors as well.
>
>The two architectures are leagues apart in terms of latency? What is the latency
>of a dependent chain of adds in a GPU? What about a CPU? What is the branch coherence? What is the memory latency?
The latency of a multiply-add on GF114 is 18 cycles I believe. On Sandy Bridge it would take 8 cycles. That's roughly a 2x difference, or 4x in absolute time.
Note though that on GT200 it was 24 cycles, so there's some convergence taking place.
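A quick back-of-the-envelope check of those ratios. The cycle counts are the ones quoted above; the clock frequencies are ballpark assumptions on my part (~1.6 GHz shader clock for GF114, ~3.4 GHz for Sandy Bridge):

```python
# Dependent multiply-add chain latency; cycle counts from the discussion above,
# clock frequencies are assumed ballpark figures, not measurements.
gpu_cycles, gpu_ghz = 18, 1.6   # GF114, assumed ~1.6 GHz shader clock
cpu_cycles, cpu_ghz = 8, 3.4    # Sandy Bridge, assumed ~3.4 GHz

gpu_ns = gpu_cycles / gpu_ghz   # ~11.3 ns
cpu_ns = cpu_cycles / cpu_ghz   # ~2.4 ns

print(round(gpu_cycles / cpu_cycles, 2))   # 2.25 (cycle ratio, "roughly 2x")
print(round(gpu_ns / cpu_ns, 2))           # 4.78 (absolute ratio, "roughly 4x")
```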
Furthermore, the CPU's latency figure relies on operand forwarding to bypass the register file. On a GPU the latency includes accessing the register file, which for AMD's architecture takes half of the total latency (based on what you wrote about AMD Cayman here: http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=5). What this means is that CPUs are not pushing their ALUs to perform 4 times faster than a GPU ALU. Note also that an AMD shader core is a fairly complex VLIW core which can do more than a MAD, while the CPU's multiply + add latency is just the scalar latency of a multiply followed by an add; the result leaves the pipeline as soon as it is ready. So even in terms of absolute latency, CPUs are not pushing their execution units.
Last but not least, note that using a forwarding network is a design choice which increases the area of the execution pipeline but decreases the register file size! In conclusion, CPUs are indeed latency optimized, but that doesn't mean they can't also be throughput-oriented architectures.
>The memory latency makes this especially obvious, as there is a huge difference.
There's indeed a big difference, but it has to be compensated using additional on-die storage, which itself lowers compute density.
So again, CPUs can achieve practically the same compute density, while at the same time being latency optimized too. Note that with FMA, the CPU cores would have a 2x compute density *advantage* over the IGP. Of course with software rendering you need that margin to implement the fixed-function operations, but in combination with the advantage of not being bottlenecked by any dedicated components, it's clear that software rendering is viable.
And once more the conclusion is that modern CPUs are compute-oriented. The GPU is still "aggressively" compute-oriented, but that's slowly changing, as this aggressiveness lowers effective efficiency for complex workloads.
>>>Even to someone without circuit design
>>>expertise, it should be blinding obvious - the clock speeds are about a factor of 2-4X different.
>>
>>Then why did NVIDIA decide to put its aggressively throughput-oriented cores into
>>a higher clock domain? The GeForce GTX 560 Ti has a shader clock of 1645 MHz, while
>>the Radeon HD 6950 has a clock of 800 MHz. Does that mean NVIDIA's architecture is not throughput-oriented?
>
>1.5GHz is still half the speed or less of a modern CPU. AMD has GPUs that run
>at 900MHz. And those speeds have stayed constant for the last several generations (since about 65nm).
You missed the point. There's a design space for compute oriented devices, which includes a wide range of clock frequencies (3.2 GHz for Cell BE at 90 nm). So you can't conclude that modern x86 CPUs aren't compute oriented, based on their clock frequency (or latency).
Unless of course you want to imply that NVIDIA's GPUs are any less of a GPU due to their higher clock frequency?
Seriously, whether you like it or not, modern CPUs are compute-oriented. And they're about to become more compute-oriented with FMA, and more efficient at it with gather/scatter, while GPUs become less compute-oriented (and in AMD's case, less efficient at it due to bottlenecks causing low utilization, unless they change course).
>>Clock speed alone isn't an indication of being throughput-oriented or not. It's
>>a design decision which doesn't have to compromise *effective* throughput, as proven
>>by NVIDIA vs. AMD. Another example is Cell BE, which clocks at 3.2 GHz but even
>>at 90 nm was considered strongly throughput-oriented. x86 CPUs have come a long way
>>since the single-core Pentium 4 days, and they're not about to stop increasing throughput
>>efficiency (AVX, Bulldozer, FMA, etc.).
>
>Listen, no matter what you say, CPUs are still optimized for latency. It's blatantly
>obvious if you have ever written code for a CPU. It's also blatantly obvious that
>GPUs are not optimized for latency. This means different circuits, different architectures,
>pipeline arrangements, etc. etc. and most importantly a different style of programming.
I never denied CPUs are latency optimized! But they're no longer "aggressive" about it (i.e. they don't sacrifice everything else for the sake of latency reduction). The MHz-race is over. It has made way for a focus on effective performance/Watt, leading to a healthy balance between latency and throughput optimization.
Yes the design decisions are still different from GPUs, but it's converging, and even today it's already well within the design space of throughput-oriented architectures. Except for the "aggressive" varieties, it's orthogonal to being latency-oriented.
And no, it doesn't result in a different style of programming. SwiftShader takes shaders (GPU programming style) and compiles them into a sequence of vector instructions. It's abstracted from the application developer, in the same way the GPU's driver and hardware abstract the parallelism. The only thing missing to make the translation from a throughput-oriented programming language into explicit vector instructions more effective is gather/scatter. But this happens within the abstraction layer; CPUs already support the programming model without it.
>Consider the performance degradation of running a latency sensitive workload (e.g.
>SPECint) on a GPU vs. a CPU. It's going to be huge.
Still not proving that CPUs aren't throughput-oriented.
>>As a matter of fact the CPU's clock frequency has remained nearly constant. The
>>i7-2600K has a 3.8 GHz Turbo Mode (but under multi-threaded workloads it's not
>>that high), the same as the 2004 Pentium 4 Prescott. In the same time period, NVIDIA's
>>GPUs have increased their clock frequency by over a factor 3. There's no indication
>>this is going to change. For NVIDIA to conquer the HPC market, it needs to continue
>>investing into latency reduction. To prevent an excessive growth in die size, it
>>needs to increase the clock frequency. GF100 had some thermal issues, but they got
>>that under control with GF110, which has more cores enabled and even higher clocks.
>
>Again, frequency is just one aspect. There are so many other things to consider.
>Pipeline depth, result forwarding, dependent instruction latency, etc. etc.
Yep, covered all of that. Still the conclusion is that aside from being latency optimized, CPUs are also throughput optimized. One doesn't cancel the other as long as neither is pursued aggressively.
>>So while it's "blinding obvious" that there's a clock frequency difference today,
>>it's also "blinding obvious" they're on a collision course. Gather/scatter support
>>is still several years out, so by that time they'll have converged even closer and gather/scatter is the keystone.
>
>Scatter/gather is helpful, but it will not make CPUs as efficient as GPUs. It
>doesn't help make the ALUs lower power. It doesn't help make the front-end lower power.
Higher utilization does improve power consumption of the rest of the pipeline. Idle logic consumes power as well.
Furthermore, you have to take into account that gather/scatter involves only the load/store units, while a sequence of extract/insert instructions occupies both the load/store and ALU pipelines. With gather/scatter those pipelines become available for more useful work, improving throughput beyond what would be achieved if gather/scatter were an ALU operation!
Emulating gather/scatter takes about 2.25 instructions per element. That's a lot of data moving through the pipelines just to exchange one element. Gather/scatter needs little more than a small network to move the elements where you want them. I'm sure that's more power efficient than 2.25 instructions passing through the entire pipeline (which involves the shuffle units as well). And on top of that it frees up an ALU. You could use this ALU to achieve higher throughput (which itself consumes more power), or you could simplify the execution units (e.g. toss out the duplicate blend unit) and use the freed-up cycle to achieve the same total throughput, keeping power consumption the same.
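To make the emulation overhead concrete, here is a sketch (plain Python standing in for vector code) of what a gather does versus how it must be emulated today with per-element extract/insert steps. The per-element instruction count in the sketch is a rough stand-in for the ~2.25 figure above:

```python
def gather(memory, base, indices):
    """What a hardware gather does: one instruction fills all lanes."""
    return [memory[base + i] for i in indices]

def emulated_gather(memory, base, indices):
    """Software emulation: every lane needs its own extract, scalar
    load, and insert step, each occupying a pipeline slot."""
    result = []
    instructions = 0
    for lane in range(len(indices)):
        idx = indices[lane]          # extract the index from the vector register
        value = memory[base + idx]   # scalar load
        result.append(value)         # insert the element into the result vector
        instructions += 3            # rough count; amortized it's ~2.25/element
    return result, instructions

mem = list(range(100))
print(gather(mem, 10, [3, 0, 7, 1]))              # [13, 10, 17, 11]
print(emulated_gather(mem, 10, [3, 0, 7, 1])[1])  # 12 instruction slots for 4 lanes
```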
And it doesn't stop there. People are already complaining that Sandy Bridge's dual load units can't load two 256-bit words, making AVX code more L1 cache bandwidth limited than SSE code. This problem is going to get worse with FMA, which is why FMA support was delayed until Haswell (i.e. when the transistor budget becomes available and applications using AVX are around). Haswell will double the width of all load/store units. But since dual 128-bit gather/scatter suffices, a large part of the logic can overlap. This means that even when you're not using gather/scatter, the logic isn't a waste of transistors.
So any way you look at it, gather/scatter is an absolute improvement in power efficiency for throughput computing.
>Look at what CPUs have that GPUs do not:
>Bypass networks
>Branch predictors
>Prefetchers
>Out-of-order execution
>Low latency instructions
>
>etc. etc.
>
>The differences at the circuit level are just as big.
I've already shown that being throughput-oriented is a far more global property than any of those things could change.
Regarding prefetching, Fermi already has CUDA prefetch instructions, which have proven useful: http://comparch.gatech.edu/hparch/nvidia_kickoff_2010_kim.pdf. However, it takes additional instructions to calculate the fetch address and issue the prefetch. Here's an entire paper on the subject: http://arch.ece.gatech.edu/pub/asplos15.pdf. But note that successful prefetching makes a latency bound workload compute bound again. So it's better not to waste these cycles and use a small, conservative hardware prefetcher instead, which generically and automatically helps everything that can benefit from prefetching. Simple stream prefetchers have been around in CPUs for ages, so I'm sure it's a tiny amount of logic (in comparison to other solutions).
With increasing RAM latencies and increasing workload complexity, including for graphics, it's only a matter of time before GPUs feature speculative prefetching logic. Prefetching makes a throughput-oriented architecture a *more* successful throughput-oriented architecture.
>>>You cannot simply tack on scatter/gather to a latency optimized CPU core and expect
>>>it to look like a throughput core in terms of power efficiency. At least, there
>>>is definitely a lack of evidence for any such claims. Moreover, you need to preserve
>>>the power efficiency for workloads that cannot be vectorized.
>>
>>An architecture which balances latency and theoretical throughput can still achieve
>>high effective throughput. It's how NVIDIA managed to outperform AMD with only half the FLOP density.
>
>That's so wrong it isn't even funny. That's because of AMD's use of static scheduling
>for their VLIW, and because Nvidia is much more optimized for scalar memory accesses.
>Has nothing to do with latency vs. throughput.
Hold your horses. The scheduling efficiency of the VLIW5 core was 3.4 out of 5 operations on average (http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950/4). So the move to VLIW4 made it very efficient. Also note that instead of one fat ALU taking the role of SFU, three ALUs can now together operate as an SFU. This unification means there's even better scheduling opportunity.
So it would be pretty ridiculous if low scalar memory access performance alone was responsible for lowering the efficiency to 50% of that of NVIDIA's architecture. You'd think someone would have noticed that and fixed it by now.
The reality is that AMD's architecture often can't use its full computing capacity. It's not a well balanced architecture for executing a wide range of workloads, but it compensates for that with raw computing power while executing tasks that don't get bottlenecked. Fighting Amdahl's Law by using a massive amount of tiny cores works for a while, but eventually it stops scaling.
Don't get me wrong, it's an excellent GPU for contemporary games. But as NVIDIA proves, an architecture can be both good at games and other workloads, with half of the peak computing power.
>>The way things converge, tacking on gather/scatter support does put the GPU within
>>striking distance, starting with the IGP. For someone not
>>time, a balanced homogenous architecture is the most cost effective solution for all his processing needs.
>
>I don't believe that a homogeneous architecture is optimal at all, and you have
>yet to show that in any meaningful way. In fact, you have admitted that it is sub-optimal
>for power consumption...which means that as long as graphics consumes a non-trivial
>amount of power, that an IGP will be a superior solution. If there is a day when
>graphics is merely 1-2% of all cycles, then perhaps it might happen...but I don't see that ever happening.
I've shown you that homogeneous architectures can work, through the example of vertex and pixel shading unification. If there were a fixed ratio in the workload, keeping them separate would have been more efficient, but because that's not the case, all desktop and laptop GPUs have moved to unified shaders.
Now, while gaming with an IGP, you're practically always going to be GPU limited. There might be some peaks in the CPU usage for physics and AI and such, but much of the time the CPU is waiting on the IGP. While not gaming, which is going to be the majority of the time for people who choose a system with an IGP, it's the IGP that sits idle. In particular, it's of no use during CPU intensive tasks. Also, they're both using the same floating-point and integer operations. So it's clearly worth unifying them (by which I mean ditching the IGP and adding gather/scatter to the CPU).
That would give us a 6-core mainstream CPU. And with FMA already on the roadmap we're looking at 650 GFLOPS of parallel processing power within grasp. That's nothing to sneeze at; comparable to a GF106 in GFLOPS, transistors and TDP. So the global properties of a compute-oriented architecture are all there. Only the lack of gather/scatter support would hold it back, which is easy enough to fix.
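The 650 GFLOPS figure can be reconstructed, under assumptions of my own (two 256-bit FMA units per core and a ~3.4 GHz sustained clock; neither is confirmed above):

```python
cores = 6          # hypothetical 6-core mainstream part
fma_units = 2      # assumed: two 256-bit FMA units per core
lanes = 8          # 256-bit AVX = 8 single-precision lanes
flops_per_fma = 2  # a fused multiply-add counts as 2 FLOPs
clock_ghz = 3.4    # assumed sustained all-core clock

gflops = cores * fma_units * lanes * flops_per_fma * clock_ghz
print(gflops)  # ~652.8, i.e. "roughly 650 GFLOPS"
```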
>>Note that widening the vectors amortizes the cost of things like out-of-order execution.
>>At the same time, AMD has reduced its VLIW width from 5 to 4, in order to achieve
>>higher efficiency.
>
>So what?
Widening the vectors makes the CPU more efficient at compute-oriented workloads. Performance/transistor increases.
AMD reduced the VLIW width but not the front end or register file. This results in a lower computing density. Cypress XT achieves a theoretical 2.7 TFLOPS with 2.15 billion transistors, while Cayman XT requires 2.65 billion transistors to do the same. The effective throughput increased by 10%, but performance/transistor still went down.
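Those density numbers can be checked with a quick calculation (expressing the ratio, as elsewhere in this thread, as GFLOPS per million transistors):

```python
# Theoretical peak GFLOPS divided by transistor count in millions.
cypress_xt = 2700 / 2150   # ~1.26 GFLOPS per Mtransistor
cayman_xt  = 2700 / 2650   # ~1.02 GFLOPS per Mtransistor

print(f"density change: {cayman_xt / cypress_xt - 1:+.0%}")  # about -19%
```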
Convergence. That's what.
>>GPUs also introduced concurrent kernel execution and scalar execution,
>>and have growing register files and caches. So they're investing more transistors
>>into latency reduction and programmability than raw FLOPS. GF110 has a 0.52 FLOP/transistor
>>ratio. With G92b that was still a 0.94 ratio.
>
>Register file size is increased for better throughput...the registers per vector lane have been decreasing.
That's not what you wrote here: http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=5
I'm assuming you're talking about GF100 then. Note that GT200 doubled the register count, without touching the execution core. GF100 was able to reduce that again, but at the cost of increased scheduling complexity. Quoting your article on GF100: "Again, each pipeline is still scalar, but there are now two for added throughput. Despite the notion that GPU cores are simpler than CPU cores, the schedulers have to tackle considerable complexity."
This proves two of my points. First, GPUs either lose computing density by adding more registers, or they lose it by adding complex scheduling. Second, it indicates that the use of latency-oriented techniques doesn't make a modern CPU any less of a compute-oriented architecture.
>What you are saying is obvious - GPUs are becoming more programmable. But the
>reality is that they are not even remotely optimized for latency. Where do you compile your code? On a CPU or a GPU?
They're not very latency optimized yet, but they are becoming more so. They've got no choice, since even graphics is running out of easily exploitable DLP. And running some of the parallel workloads the CPU is currently better at requires reducing latency as well.
Note that it wouldn't take a whole lot of logic to make the GPU vastly more latency optimized. They've already got generic caches, but lack speculative prefetch, which is tiny in comparison to the cache itself. And they've already got superscalar execution, but lack reordering. A small instruction window would allow hiding some of the latency within the same strand, which in turn allows reducing the register file and reduces cache contention.
So you see, the design space allows making them much more latency optimized without making them much less throughput optimized, if at all. I still won't use the GPU for compilation any time soon, but there's a whole range of GPGPU applications which would be vastly more successful on a more latency optimized GPU. And NVIDIA's push into the HPC market forces them to come up with efficient techniques to balance efficiency on complex workloads.
CPUs do trade some potential throughput for latency, but note that they've come a *very* long way since the Pentium 4 days. By adding FMA and gather/scatter support they retain their efficiency at latency-bound workloads, but become even more optimized for throughput workloads, including running graphics adequately enough to make the IGP redundant.
>The bottom line is that to achieve optimal throughput you must sacrifice latency
>(look at the memory subsystem), and vica versa. You can refuse to believe this,
>but it's simply true. While GPUs may become more programmable, this is all relative
>to an architecture that started with 0 programmability. The gap between GPUs and
>CPUs may shrink, but it will never disappear and the efficiency differences will always be sizable.
Saying something is "simply true" doesn't make it true. I've given you plenty of evidence of the contrary already. Refusing to optimize for latency is not something that works for long, due to Amdahl's Law. It may cost theoretical throughput, but unaggressive latency optimization improves effective throughput. Only the latter is relevant.
Regarding programmability versus efficiency differences, you're forgetting that pixel processing was still largely fixed-function when vertex processing became programmable. Yet it evolved into full unification!
The gap will not stop shrinking, because of the simple fact that you can't get high utilization out of a million cores for any realistic workload in consumer applications. So there is no other option but to *increasingly* focus more on latency optimization.
>>It's easy to see where your preconceptions come from though. NV40 had a 0.24 ratio,
>>which G92b increased fourfold in a few years' time. But you got fooled into
>>thinking that this is a trend which can be sustained. Widening a component only
>>increases the overall throughput density of that component till it reaches 50% of
>>the die area. And the components themselves get fatter to increase programmability
>>as well, and the rest of the architecture needs to support the same throughput.
>
>Of course it's easy to see where my preconceptions come from. It's reality.
Keep telling yourself that.
[snip]
>>>Let's take another example. Icera makes a very cool SDR. However, to meet the
>>>performance and power efficiency requirements, they use a custom designed chip to
>>>run the SDR. So, the 'dedicated hardware' is used by many different radio protocols,
>>>in exactly the same way that GPU shaders are used by many different shader types.
>>It's still dedicated hardware though.
>>
>>Does Icera's SDR support IEEE-754? I guess not, so *this* is irrelevant.
>
>What does IEEE-754 have to do with latency?
Who's talking about latency? This part of the discussion was about dedicated versus generic/unified hardware.
I mentioned IEEE-754 because, among other criteria, it only makes sense to unify things when they use the same generic operations (i.e. floating-point and/or integer math). There's significant opportunity for dedicated hardware in an SDR since it's a very well-defined workload. You can't compare that to graphics, which uses the same IEEE-754 operations the CPU supports, and has a varying workload.
Also, dedicated SDR hardware belongs in the ultra-mobile market. This market has very different characteristics. Power consumption is of the utmost importance, while cost is much less of an issue because the required logic is tiny anyway.
>>It's nothing personal, but face it, you're running out of arguments and start handwaving
>>and reaching for absurd examples which I'm easily able to debunk.
>
>Only because you totally fail to understand and refuse to acknowledge reality.
No, I do understand and acknowledge the reality of a dedicated SDR for the ultra-mobile market.
What you're failing to understand and acknowledge is that its characteristics have no immediate relevance to the markets I'm making software rendering claims about.
>>>In case you haven't noticed, modern CPUs are filled with idle silicon. Floating
>>>point units, AES crypto blocks, virtualization support, real mode support, etc. Many of these were added recently.
>>
>>Floating-point is useful to graphics, so this isn't an argument against software rendering.
>>
>>As for AES, virtualization, real mode, etc. they certainly don't "fill" the CPU
>>with idle silicon.
>
>Microcode?
Tiny, and not idle.
>>Unless you can prove me otherwise, AES doesn't take die space
>>proportional to the GPU's ROPs, texture samplers or rasterizers.
>
>The ROPs are used for general purpose workloads, as are the texture sampling units.
>Where do you think loads and stores are executed? And atomic operations? The
>rasterizer is not useful for general software, but how much power does it consume? How much area?
Fine, then let's compare AES to anti-aliasing or anisotropic filtering.
On the one hand you're trying to tell me dedicated hardware is an absolute necessity but on the other hand CPUs are not allowed to spend a tiny bit of die space on things like AES?
Don't mistake my claim about software rendering on a homogeneous architecture, in specific markets, for a claim that all dedicated hardware should be banned.
>>And like I said
>>before, fast AES support is important for generic encrypted disk and network access,
>>and gather/scatter speeds up software AES so the dedicated >hardware can be removed.
>
>You said that, but you're wrong. You cannot remove it for compatibility reasons, and also for security reasons.
Wrong. AES-NI has its own CPUID bit; software has to check for it before using it, so there's no compatibility issue. And any such security attack is utterly impractical. But even the paranoid aren't out of options without AES-NI hardware support: the AESSE implementation keeps everything in registers, so it's not susceptible to cold-boot or cache-timing attacks.
>>VT-x and real mode are even supported by Atom cores, so it's doubtful this takes
>>any noticeable die space on a desktop chip, and it's obviously indispensable for the software that makes use of it.
>
>Why is virtualization support in hardware? VMware was doing fine with their binary
>translation. Maybe it was added to improve performance and efficiency!!!! Just like rasterizers!
First of all, you're talking about totally different pieces of dedicated hardware. You can't conclude from the potential need for virtualization hardware that dedicated rasterizers are a necessity for graphics.
That said, hardware virtualization may not be needed at all: http://www.vmware.com/pdf/asplos235_adams.pdf. Obviously it's not an efficiency improvement when performance is much lower. Once again though, it's behind a CPUID bit, so if they ever felt it wasn't worth it, they could leave it out. So far it looks like they intend to keep it, but you can't conclude anything about graphics hardware from this.
>>Besides, like I said before GPUs also have lots of programmability features which
>>may or may not be used. For instance it's doubtful I'll ever use my GeForce GTX
>>460's double-precision computing capabilities. But that's fine, it's relatively
>>small and it's not worth designing a separate chip for the people who do use it.
>>
>>So I have nothing against dedicated hardware in general, but like I said it has
>>to offer a high enough efficiency advantage, weighed against its utilization. The
>>problem with some of the GPU's dedicated hardware is that even during its key application,
>>graphics, it's often either a significant bottleneck or mostly idle.
>
>You have said that, but frankly, you've said a lot of things that are simply wrong.
>
>How about you provide some hard data on modern high performance GPUs (e.g. most
>recent generation from NV or AMD) on the utilization of the rasterizer. They have
>performance profilers, so it shouldn't be too hard. Then you can find out how much
>power the rasterizers use, and we can compare it to the power consumption of SW
>rendering. Then you will have actually a marginal understanding of the relative efficiency.
>
>And I'm fairly certain that you will find that comparison to be very unattractive for SW rendering.
For software rendering, rasterization and gradient setup combined take on average 1.4% of CPU time in the Crysis benchmark. That's all the data you need from me. The rest of the claim is yours, so you prove it.
>>Unifying vertex
>>and pixel processing removed the bottleneck between them and increased utilization.
>>Texture sampling is useless to generic computing and having too few texture units
>>is a bottleneck to graphics, while the importance of FP32 texture filtering increases,
>>so it makes lots of sense to start doing the filtering in shader units and have
>>more generic gather/scatter units. And support for micropolygons would require substantial
>>hardware to sustain the peak throughput, but it's again idle during other workloads
>>and even for graphics its full capacity isn't used all the time. Make it smaller,
>>and it's a bottleneck when drawing micropolygons. Again unification seems like the better option here to me.
>
>You haven't even quantified the gains from utilization at all for rendering, or the cost in terms of power consumption.
What you're asking for here is probably worth a doctoral dissertation. So you're going to have to wait for more detailed data than what I've already provided, or come up with it yourself. In the meantime, I've given you plenty of arguments to make it at the very least plausible that software rendering can make the IGP redundant once gather/scatter support is added.
Face it. You haven't presented a single piece of compelling evidence to the contrary. You started with the preconception that hardware rendering is an order of magnitude more efficient, but that clearly crumbled as you had to look for deeper differences, which don't affect the global efficiency nearly as much, and you felt the need to come up with ever more contrived examples from markets other than the relevant one. Seriously, this entire discussion has only made me more confident in what I do. Thanks for that.
>>>>What you're also forgetting is that the software evolves as well. In 2001 people
>>>>were really excited about pixel shader 1.1. Today, a desktop GPU with only pixel
>>>>shader 1.1 support would be totally ridiculous, regardless of how power efficient
>>>>it is. I've said it before; we don't need more pixels, we need more exciting ones.
>>>>Which means increasing generic programmability.
>>>
>>>So let the shaders evolve, and stay separate.
>>
>>I sincerely hope you're not being serious. There's no way GPU manufacturers will un-unify their architectures.
>
>Please read what I wrote, carefully and think about it. "Stay separate" implies
>they are already separate. What are they separate from? You seem to assume I'm
>talking about the vertex/pixel/geo shaders being separate from one another, but that's hardly clear.
>
>What was meant is that the shaders should stay separate from the CPU (which is the state today, even in IGPs).
I misunderstood that, but ironically it's not all that different from asking GPU manufacturers to un-unify vertex and pixel shaders. You're not acknowledging the motivations and advantages behind that unification.
So tell me, why should vertex and pixel shaders stay unified while unifying the CPU and IGP would be a bad idea?
>>>Every single fact that I've seen tends to suggest that software rendering is a demonstrably bad idea.
>>
>>You haven't demonstrated anything.
>
>Sure I have. CPUs are not optimized for throughput and have roughly 4X lower performance
>efficiency. In fact, in some cases that's a vast understatement.
No, you have not demonstrated that CPUs are not optimized for throughput. You demonstrated they are optimized for latency, and wrongly implied from that they can't be optimized for throughput.
>A Tesla has roughly 2.2 GFLOP/s per W (DP). A high performance Westmere has roughly
>0.75 GFLOP/s per W. Cayman is roughly 2.7 GFLOP/s per, although a real workstation
>card would be lower, probably around 2.5 GFLOP/s per W.
>
>So the reality is that the performance per watt is much worse on CPUs than GPU,
>by a factor of 3-4. So to achieve the same throughput, the power consumption would
>be 3-4X higher. So...um...CPUs aren't throughput optimized.
Westmere doesn't even have AVX, and FMA will double the GFLOPS rating again. Furthermore, as NVIDIA demonstrates, Cayman's effective throughput is only about half its theoretical throughput. So there's your 3-4X smashed to pieces. And while GPUs also make use of their fixed-function hardware during graphics, you're neglecting that they're no longer able to scale their effective throughput aggressively.
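To make that arithmetic concrete, here's a rough peak-GFLOPS sketch in Python, using the usual cores × clock × SIMD-lanes × FLOPs-per-lane formula. The core count, clock and 95 W TDP are round illustrative numbers, not measured figures for any specific part:

```python
# Peak FLOPS = cores * clock (GHz) * SIMD lanes * FLOPs per lane per cycle.
def peak_gflops(cores, ghz, simd_lanes, flops_per_lane):
    return cores * ghz * simd_lanes * flops_per_lane

# SSE: 4 single-precision lanes, separate mul + add units = 2 FLOPs/lane/cycle
sse = peak_gflops(4, 3.0, 4, 2)   # 96 GFLOPS
# AVX: vector width doubles to 8 lanes
avx = peak_gflops(4, 3.0, 8, 2)   # 192 GFLOPS
# FMA: fused multiply-add doubles the FLOPs per lane again
fma = peak_gflops(4, 3.0, 8, 4)   # 384 GFLOPS

tdp = 95.0  # assumed TDP in watts
print(sse / tdp, avx / tdp, fma / tdp)  # ~1.0 -> ~2.0 -> ~4.0 GFLOPS/W
```

Effective throughput is lower than peak on both sides of course, but the scaling factors are what matter: AVX and FMA each double the peak, which is exactly what erases the claimed 3-4X per-watt gap.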
>>And "tends to suggest" coming from someone who's clearly basing things on prejudice
>>is just more handwaving. I've proven you WRONG about the necessity for dedicated
>>texture decompression, using real data.
>
>You have no real data. You had bad data from an old simulator that the author
>of the simulator thought was BS. Garbage in, garbage out.
I do have real data, but I might have forgotten to mention it in this thread (I did mention it in two other posts): Crysis at High detail at 1680x1050 performs 22 million compressed texture accesses per frame. Assuming no magnification and no texture reuse, using uncompressed textures instead would have cost only about 4 GB/s of extra bandwidth at 60 FPS. But no IGP runs Crysis at these settings at 60 FPS (not even my GTX 460 does), so in practice it's well below that, and in reality there is some magnification and quite a bit of texture reuse in the foliage.
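For anyone who wants to check the arithmetic, here it is in Python. It assumes each access fetches one 4-byte RGBA8 texel when uncompressed, versus 8:1 DXT1 compression; both are simplifying assumptions, and pessimistic ones given texture reuse:

```python
accesses_per_frame = 22e6        # compressed texture accesses per frame (Crysis, High, 1680x1050)
fps = 60
uncompressed_bytes = 4.0         # one RGBA8 texel per access (assumption)
dxt1_bytes = 4.0 / 8             # DXT1 compresses RGBA8 roughly 8:1 -> 0.5 bytes/texel

extra_gb_per_s = accesses_per_frame * fps * (uncompressed_bytes - dxt1_bytes) / 1e9
print(round(extra_gb_per_s, 1))  # ~4.6 GB/s worst case, before any texture reuse
```

So even the no-reuse, 60 FPS upper bound lands in the "only a few GB/s" range, which is the point.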
[snip]
>>And finally I've shown that an IGP does cost
>>quite a bit and is worthless for non-graphics applications.
>
>That you definitely haven't shown. And IGPs are useful for the same general purpose
>applications that a GPU is. Fusion parts will have OpenCL and compute shader. So will Ivy Bridge.
Then show me one real-life example of an application using the IGP for something other than graphics, and achieving an advantage over using properly optimized AVX code (I'll settle for SSE if you must).
>>Yes, GPUs are evolving too, toward a more CPU-like architecture! I've proven that many times now.
>
>Yes and the relative gap in performance is still HUGE.
Is it? Again, show me this huge performance gap for anything other than graphics running on the IGP. Also, the gap for SwiftShader is only 5x. It's not using AVX yet, there's no FMA, and no gather/scatter. Are you still comfortable claiming the gap will be huge with these three throughput-oriented technologies in place?
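For the record, the multiplication behind that claim is simple. Treating AVX and FMA as ideal 2x factors is optimistic (real code never scales perfectly with vector width), but it shows why I'm not worried:

```python
gap = 5.0   # current SwiftShader vs. GPU gap
avx = 2.0   # 128-bit SSE -> 256-bit AVX, ideal scaling (assumption)
fma = 2.0   # fused multiply-add doubles peak FLOPS (assumption)

remaining = gap / (avx * fma)
print(remaining)  # 1.25 -- and that's before counting gather/scatter gains
```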
[snip]
>The bottom line is that while it's true that GPUs and CPUs are evolving towards
>one another, that says nothing about how vast the distance between the two is.
>The reality is that there is roughly a 4X gap in performance efficiency between
>GPUs and CPUs on many throughput workloads, and the gap is even larger on latency sensitive workloads.
There's a 4X gap today, but not for long. And no, the gap isn't larger on latency-sensitive workloads; there's software pipelining to deal with that. There's nothing in terms of hiding latency that a GPU can do and a CPU can't.
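Since software pipelining keeps coming up, here's a minimal sketch of the transformation, in Python purely for illustration. A real renderer would issue SIMD loads or prefetches in the "fetch" slot; the point is only the loop restructuring, which overlaps a long-latency fetch with independent compute:

```python
# Software pipelining: overlap the fetch for iteration i+1 with the
# compute of iteration i, so a long-latency load no longer stalls the
# ALUs. Plain list indexing stands in for the actual memory access.
def pipelined_sum_of_squares(data):
    if not data:
        return 0
    total = 0
    current = data[0]               # prologue: first fetch
    for i in range(1, len(data)):
        nxt = data[i]               # fetch for the NEXT iteration...
        total += current * current  # ...overlaps with this iteration's compute
        current = nxt
    total += current * current      # epilogue: drain the pipeline
    return total

assert pipelined_sum_of_squares([1, 2, 3, 4]) == 30
```

The result is identical to the naive loop; the compiler (or JIT, in SwiftShader's case) just schedules the memory access early enough that its latency is hidden behind useful work.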
>Throughput means more than just scatter/gather although it is one key aspect.
>But to simply throughput down to scatter/gather is pure ignorance and naivete, and
>shows an acute lack of understanding of the substantial differences in circuit design, microarchitecture and software.
With all due respect: you missed the fact that the 4X throughput gap will soon be gone, yet you're telling me I have an acute lack of understanding of the substantial differences? Please.
Take care,
Nicolas
>>>>David Kanter (dkanter@realworldtech.com) on 2/7/11 wrote:
>>>>---------------------------
>>>You are talking about eliminating throughput cores. You claim to have a basic
>>>understanding of hardware, so it should be readily apparent how CPU cores and throughput
>>>cores (e.g. Niagara, GPU shaders) differ.
>>
>>Throughput-oriented is a term which originates from server systems, long before
>>graphics chips were even called GPUs! It merely means a focus on data rate, potentially
>>at the cost of latency. Any multiprocessor system, is a throughput oriented system.
>>Clock frequency and ILP are latency-oriented, while DLP and TLP are throughput-oriented.
>>
>>Today's x86 CPUs exploit DLP and TLP (SIMD, multi-core and Hyper-Threading) and
>>have a less aggressive clocking than several years ago.
>
>Actually the clock speeds are about the same 3-4GHz.
Indeed. You appear to have missed the word "aggressive". At 90 nm, 3.8 GHz was really pushing it. Actually, the Pentium 4 had an integer core at twice the frequency. It required very short pipeline stages and Low-Voltage Swing circuits, which takes a lot of extra transistors. So at 32 nm, 3.8 GHz is a breeze.
>>So they are definitely throughput-oriented
>>architectures already and would become more efficient at with the addition of gather/scatter
>>support. Parallel load/store is the main thing setting them apart from GPUs.
>
>That's simply false. There are many more differences in terms of circuit design
>and the latency of individual instructions.
Sure, but those things don't make it not thoughput-oriented.
>>GPUs are, in the words of NVIDIA's chief scientists, "aggressively throughput-oriented
>>processors". Note though that GF104 features superscalar >execution, intended to
>>lower the latency. And aside from reducing bandwidth, caches also reduce latency.
>>So GPUs are forces to become less aggressive at using throughput-oriented techniques,
>>because reducing latency somewhat also reduces the amount of on-chip storage you
>>need. It's a balancing act, because obviously reducing >latency costs transistors as well.
>
>The two architectures are leagues apart in terms of latency? What is the latency
>of a dependent chain of adds in a GPU? What about a CPU? What is the branch coherence? What is the memory latency?
The latency of a multiply-add on GF114 is 18 cycles I believe. On Sandy Bridge it would take 8 cycles. That's roughly a 2x difference, or 4x in absolute time.
Note though that on GT200 it was 24 cycles, so there's some convergence taking place.
Furthermore, the latency of the CPU is based on making use of argument forwarding, to bypass the register file. On a GPU the latency includes accessing the register file, which for AMD's architecture takes half of the total latency (base on what you wrote about AMD Cayman here: http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=5. What this means is that CPUs are not pushing their ALUs to perform 4 times faster than a GPU ALU. Note also that an AMD shader core is a fairly complex VLIW core which can do more than a MAD, while the CPU's multiply+add latency is based on just the scalar latency for a multiply and an add. It leaves the pipeline as soon as the result is ready. So even in terms of absolute latency, CPUs are not pushing their execution units.
Last but not least, note that using a forwarding network is a design choice which increases the area of the execution pipeline, but decreases the register file size! In conclusion, CPUs are indeed latency optimized, but that doesn't mean it can't be a throughput oriented architecture.
>The memory latency makes this especially obvious, as there is a huge difference.
There's indeed a big difference, but it has to be compensated using additional on-die storage, which itself lowers compute density.
So again, CPUs can achieve practically the same compute density, while at the same time being latency optimized too. Note that with FMA, the CPU cores would have a 2x compute density *advantage* over the IGP. Of course with software rendering you need that to implement the fixed-funtion operations, but in combination with the advantage of not being bottlenecked by any dedicated components it's clear that software rendering is viable.
And once more the conclusion is that modern CPUs are compute oriented. The GPU is still "aggressively" compute oriented, but that's slowly changing as this aggressiveness lowers effective efficiency for complex workloads.
>>>Even to someone without circuit design
>>>expertise, it should be blinding obvious - the clock speeds are about a factor of 2-4X different.
>>
>>Then why did NVIDIA decide to put its aggressively throughput-oriented cores into
>>a higher clock domain? The GeForce GTX 560 Ti has a shader clock of 1645 MHz, while
>>the Radeon HD 6950 has a clock of 800 MHz. Does that mean >NVIDIA's architecture is not throughput-oriented?
>
>1.5GHz is still half the speed or less of a modern CPU. AMD has GPUs that run
>at 900MHz. And those speeds have stayed constant for the last several generations (since about 65nm).
You missed the point. There's a design space for compute oriented devices, which includes a wide range of clock frequencies (3.2 GHz for Cell BE at 90 nm). So you can't conclude that modern x86 CPUs aren't compute oriented, based on their clock frequency (or latency).
Unless of course you want to imply that NVIDIA's GPUs are any less of a GPU due to their higher clock frequency?
Seriously, whether you like it or not modern CPUs are compute-oriented. And they're about to become more compute-oriented with FMA, and more efficient at it with gather/scatter, while GPUs become less compute-oriented (and in the case of AMD, less effient at it due to bottlenecks causing low utilization, unless they change course).
>>Clock speed alone isn't an indication of being throughput-oriented or not. It's
>>a design decision which doesn't have to compromise *effective* throughput, as proven
>>by NVIDIA vs. AMD. Another example is Cell BE, which clocks at 3.2 GHz but even
>>at 90 nm was considered strongly thoughput-oriented. x86 CPUs have come a long way
>>since the single-core Pentium 4 days, and they're not about to stop increasing throughput
>>efficiency (AVX, Bulldozer, FMA, etc.).
>
>Listen, no matter what you say, CPUs are still optimized for latency. It's blatantly
>obvious if you have ever written code for a CPU. It's also blatantly obvious that
>GPUs are not optimized for latency. This means different circuits, different architectures,
>pipeline arrangements, etc. etc. and most importantly a different style of programming.
I never denied CPUs are latency optimized! But they're no longer "aggressive" about it (i.e. they don't sacrifice everything else for the sake of latency reduction). The MHz-race is over. It has made way for a focus on effective performance/Watt, leading to a healthy balance between latency and throughput optimization.
Yes the design decisions are still different from GPUs, but it's converging, and even today it's already well within the design space of throughput-oriented architectures. Except for the "aggressive" varieties, it's orthogonal to being latency-oriented.
And no, it doesn't result in a different style of programming. SwiftShader takes shaders (GPU programming style) and compiles them into a sequence of vector instructions. It's abstracted from the application developer, in the same way the GPUs driver and hardware abstract the parallelism. The only thing that's missing to make the translation from a throughput-oriented programming language into explicit vector instructions more effective, is gather/scatter. But this happens within the abstraction layer. CPUs already support the programming model without it.
>Consider the performance degradation of running a latency sensitive workload (e.g.
>SPECint) on a GPU vs. a CPU. It's going to be huge.
Still not proving that CPUs aren't throughput-oriented.
>>As a matter of fact the CPU's clock frequency has remained nearly constant. The
>>i7-2600K has a 3.8 GHz Turbo Mode ( but under multi-threaded workloads it's not
>>that high), the same as the 2004 Pentium 4 Prescott. In the same time period, NVIDIA's
>>GPUs have increased their clock frequency by over a factor 3. There's no indication
>>this is going to change. For NVIDIA to conquer the HPC market, it needs to continue
>>investing into latency reduction. To prevent an excessive growth in die size, it
>>needs to increase the clock frequency. GF100 had some thermal issues, but they got
>>that under control with GF110, which has more cores >enabled and even higher clocks.
>
>Again, frequency is just one aspect. There are so many other things to consider.
>Pipeline depth, result forwarding, dependent instruction latency, etc. etc.
Yep, covered all of that. Still the conclusion is that aside from being latency optimized, CPUs are also thoughput optimized. One doesn't cancel the other if they're not aggressive about it.
>>So while it's "blinding obvious" that there's a clock frequency difference today,
>>it's also "blinding obvious" they're on a collision >course. Gather/scatter support
>>is still several years out, so by that time they'll have >converged even closer and gather/scatter is the keystone.
>
>Scatter/gather is helpful, but it will not make CPUs as efficient as GPUs. It
>doesn't help make the ALUs lower power. It doesn't help make the front-end lower power.
Higher utilization does improve power consumption of the rest of the pipeline. Idle logic consumes power as well.
Furthermore, you have to take into account that gather/scatter involves only the load/store units, while a sequence of extract/insert instructions involves both the load/store units and ALUs pipelines. With gather/scatter these pipelines become available for more useful work, improving throughput beyond what would be achieved if gather/scatter was an ALU operation!
Emulating gather/scatter involves 2.25 instructions per element. That's a lot of data moving though the pipelines just to exchange one element. Gather/scatter mainly only needs a small network to move the elements where you want them, and nothing else. I'm sure this is more power efficient than the 2.25 instructions that need to pass through the entire pipeline (which involves shuffle units as well). And on top of that it frees up an ALU. You could use this ALU to achieve higher thoughput (which itself consumes more power), or you could simplify the execution units (e.g. toss out the duplicate blend unit) and use the freed up cycle to achieve the same total throughput, keeping power consumption the same.
And it doesn't stop there. There's already people complaining about the fact that Sandy Bridge's dual load units can't load two 256-bit words, making AVX code more L1 cache bandwidth limited than SSE code. This is a problem which is going to get worse with FMA. Which is why FMA support was delayed till Haswell (i.e. when the transistor budget becomes available and applications which use AVX are around). Haswell will double the width of all load/store units. But since dual 128-bit gather/scatter suffices, a large part of the logic can overlap. This means that even when you're not using gather/scatter, the logic isn't a waste of transistors.
So any way you look at it, gather/scatter is an absolute improvement in power efficiency for throughput computing.
>Look at what CPUs have that GPUs do not:
>Bypass networks
>Branch predictors
>Prefetchers
>Out-of-order execution
>Low latency instructions
>
>etc. etc.
>
>The differences at the circuit level are just as big.
I've already shown that being throughput-oriented is a far more global property than any of those things could change.
Regarding prefetching, Fermi already has CUDA prefetch instructions, which are proven to be useful: http://comparch.gatech.edu/hparch/nvidia_kickoff_2010_kim.pdf. However, it takes additional instructions to calculate the fetch address and issue the prefetch. Here's an entire paper on the subject: http://arch.ece.gatech.edu/pub/asplos15.pdf. But note that successful prefetching makes a latency bound workload compute bound again. So it's better not to waste these cycles and use a small, conservative hardware prefetcher instead, which generically and automatically helps everything that could benefit from prefetch. Simple stream prefetchers have been around in CPUs for ages, so I'm sure it's a tiny amount of logic (in comparison to other solutions).
With increasing RAM latencies and increasing workload complexity, including for graphics, it's only a matter of time before GPUs feature speculative prefetching logic. Prefetching makes a throughput-oriented architecture a *more* succesful throughput-oriented architecture.
>>>You cannot simply tack on scatter/gather to a latency optimized CPU core and expect
>>>it to look like a throughput core in terms of power efficiency. At least, there
>>>is definitely a lack of evidence for any such claims. Moreover, you need to preserve
>>>the power efficiency for workloads that cannot be vectorized.
>>
>>An architecture which balances latency and theoretical throughput, can still achieve
>>high effective thoughput. It's how NVIDIA achieved to >outperform AMD with only half the FLOP density.
>
>That's so wrong it isn't even funny. That's because of AMD's use of static scheduling
>for their VLIW, and because Nvidia is much more optimized for scalar memory accesses.
>Has nothing to do with latency vs. throughput.
Hold your horses. The scheduling efficiency of the VLIW5 core was 3.4 operations (http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950/4). So the move to VLIW4 made it very efficient. Also note that instead of one fat ALU taking the role of SFU, three ALUs can together operate as an SFU. This unification means there's even better scheduling opportunity.
So it would be pretty ridiculous if low scalar memory access performance alone was responsible for lowering the efficiency to 50% of that of NVIDIA's architecture. You'd think someone would have noticed that and fixed it by now.
The reality is that AMD's architecture often can't use its full computing capacity. It's not a well balanced architecture for executing a wide range of workloads, but it compensates for that with raw computing power while executing tasks that don't get bottlenecked. Fighting Amdahl's Law by using a massive amount of tiny cores works for a while, but eventually it stops scaling.
Don't get me wrong, it's an excellent GPU for contemporary games. But as NVIDIA proves, an architecture can be both good at games and other workloads, with half of the peak computing power.
>>The way things converge, tacking on gather/scatter support >does put the GPU within
>>striking distance, starting with the IGP. For someone not
>
>I don't believe that a homogeneous architecture is optimal at all, and you have
>yet to show that in any meaningful way. In fact, you have admitted that it is sub-optimal
>for power consumption...which means that as long as graphics consumes a non-trivial
>amount of power, that an IGP will be a superior solution. If there is a day when
>graphics is merely 1-2% of all cycles, then perhaps it might happen...but I don't see that ever happening.
I've shown you that homogeneous architectures can work, through the example of vertex and pixel shading unification. If there was a fixed ratio in the workload, keeping them separate whould have been more efficient, but because that's not the case, all desktop and laptop GPUs have moved to unified shaders.
Now, while gaming with an IGP, you're practically always going to be GPU limited. There might be some peaks in the CPU usage for physics and AI and such, but much of the time the CPU is waiting on the IGP. While not gaming, which is going to be the majority of the time for people who chose a system with an IGP, it's the IGP that sits idle. In particular, it's of no use during CPU intensive tasks. Also, they're both using the same floating-point and integer operations. So it's clearly worth unifying them (by wich I mean ditching the IGP and adding gather/scatter to the CPU).
That would give us a 6-core mainstream CPU. And with FMA already on the roadmap we're looking at 650 GFLOPS of parallel processing power within grasp. That's nothing to sneeze at; comparable to a GF106 in GFLOPS, transistors and TDP. So the global properties of a compute-oriented architecture are all there. Only the lack of gather/scatter support would hold it back, which is easy enough to fix.
>>Note that widening the vectors amortizes the cost of >things like out-of-order execution.
>>At the same time, AMD has reduced its VLIW width from 5 to >4, in order to achieve
>>higher efficiency.
>
>So what?
Widening the vectors makes the CPU more efficient at compute-oriented workloads. Performance/transistor increases.
AMD reduced the VLIW width but not the front end or register file. This results in a lower computing density. Cypress XT achieves a theoretical 2.7 TFLOPS with 2.15 billion transistors, while Cayman XT requires 2.65 billion transistors to do the same. The effective thoughput increased by 10%, but still the performance/transistor went down.
Convergence. That's what.
>>GPUs also introduced concurrent kernel execution and >scalar execution,
>>and have growing register files and caches. So they're >investing more transistors
>>into latency reduction and programmability than raw FLOPS. >GF110 has a 0.52 FLOP/transitor
>>ratio. With G92b that was still a 0.94 ratio.
>
>Register file size is increased for better throughput...the registers per vector lane have been decreasing.
That's not what you wrote here: http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=5
I'm assuming you're talking about GF100 then. Note that GT200 doubled the register count, without touching the execution core. GF100 was able to reduce that again, but at the cost of increased scheduling complexity. Quoting your article on GF100: "Again, each pipeline is still scalar, but there are now two for added throughput. Despite the notion that GPU cores are simpler than CPU cores, the schedulers have to tackle considerable complexity."
This proves two of my points. The first one is that GPUs either lose computing density by adding more registers, or they lose computing density by adding complex scheduling. And secondly, it indicates that the use latency-oriented techniques doesn't make a modern CPU less of a comput-oriented architecture.
>What you are saying is obvious - GPUs are becoming more programmable. But the
>reality is that they are not even remotely optimized for latency. Where do you compile your code? On a CPU or a GPU?
They're not very optimized for latency yet, but they are becoming more optimized for it. They've got no choice since even graphics is running out of easy to exploit DLP. And to run some of the parallel workloads the CPU is currently better at requires reducing lantecy as well.
Note that it wouldn't take a whole lot of logic to make the GPU vastly more latency optimized. They've already got generic caches, but lack speculative prefetch, which is tiny in comparison to the cache itself. And they've already got superscalar execution, but they lack reordering. A small instruction window would allow to hide some of the latency within the same strand, which allows reducing the register file and reduces cache contention.
So you see, the design space allows to make them much more latency optimized without making them much less thoughput optimized, if at all. I still won't use the GPU for compilation any time soon, but there's a whole range of GPGPU applications which would be vastly more successful on a more latency optimized GPU. And NVIDIA's push for the HPC market forces them to come up with efficient techniques to balance the efficiency at complex workloads.
CPUs do trade some potential thoughput for latency, but note that they've come a *very* long way since the Pentium 4 days. By adding FMA and gather/scatter support they retain their efficiency at latency-bound workloads, but become even optimized for thoughput workloads, including adequately running graphics to make the IGP redundant.
>The bottom line is that to achieve optimal throughput you must sacrifice latency
>(look at the memory subsystem), and vica versa. You can refuse to believe this,
>but it's simply true. While GPUs may become more programmable, this is all relative
>to an architecture that started with 0 programmability. The gap between GPUs and
>CPUs may shrink, but it will never disappear and the efficiency differences will always be sizable.
Saying something is "simply true", doesn't make it true. I've given you lots of proof of the contrary already. Refusing to optimize for latency, is not something that works for long due to Amdahl's Law. It may cost theoretical thoughput but unagressive latency optimization improves effective throughput. Only the latter is relevant.
Regarding programmability versus efficiency differences, you're forgetting that pixel processing was still largely fixed-function when vertex processing became programmable. Yet it evolved into full unification!
The gap will not stop shrinking, because of the simple fact that you can't get high utilization out of a million cores for any realistic workload in consumer applications. So there is no other option but to *increasingly* focus more on latency optimization.
>>It's easy to see where your preconceptions come from though. NV40 had a 0.24 ratio,
>>which G92b increased by a fourfold in a few years time. But you got fooled into
>>thinking that this is a trend which can be sustained. Widening a component only
>>increases the overall throughput desity of that component till it reaches 50% of
>>the die area. And the components themselves get fatter to increase programmability
>>as well, and the rest of the architecture needs to support >the same throughput.
>
>Of course it's easy to see where my preconceptions come from. It's reality.
Keep telling yourself that.
[snip]
>>>Let's take another example. Icera makes a very cool SDR. However, to meet the
>>>performance and power efficiency requirements, they use a custom designed chip to
>>>run the SDR. So, the 'dedicated hardware' is used by many different radio protocols,
>>>in exactly the same way that GPU shaders are used by many different shader types.
>>It's still dedicated hardware though.
>>
>>Does Icera's SDR support IEEE-754? I guess not, so *this* >is irrelevant.
>
>What does IEEE-754 have to do with latency?
Who's talking about latency? This part of the discussion was about dedicated versus generic/unified hardware.
I mentioned IEEE-754 because among other criteria it only makes sense to unify things when you're using the same generic operations (i.e. floating-point and/or integer math). There's singificant opportunity for using dedicated hardware for an SDR since it's a very defined workload. You can't compare that to graphics, which uses the same IEEE-754 operations as supported by the CPU, and there's a varying workload.
Also, dedicated SDR hardware belongs in the ultra-mobile market. This market has very different characteristics. Power consumption is of the utmost importance, while cost is much less of an issue because the required logic is tiny anyway.
>>It's nothing personal, but face it, you're running out of >arguments and start handwaving
>>and reaching for absurd examples which I'm easily able to >debunk.
>
>Only because you totally fail to understand and refuse to acknowledge reality.
No, I do understand and acknowledge the reality of a dedicted SDR for the ultra-mobile market.
What you're failing to understand and acknowledge though is that its charachteristics have no immediate relevance to the markets I'm making software rendering claims about.
>>>In case you haven't noticed, modern CPUs are filled with >>idle silicon. Floating
>>>point units, AES crypto blocks, virtualization support, >>real mode support, etc. Many of these were added recently.
>>
>>Floating-point is useful to graphics, so this isn't an argument against software rendering.
>>
>>As for AES, virtualization, real mode, etc. they certainly don't "fill" the CPU
>>with idle silicon.
>
>Microcode?
Tiny, and not idle.
>>Unless you can prove me otherwise, AES doesn't take die space
>>proportional to the GPU's, ROPs, texture samplers or >rasterizers.
>
>The ROPs are used for general purpose workloads, as are the texture sampling units.
>Where do you think loads and stores are executed? And atomic operations? The
>rasterizer is not useful for general software, but how much power does it consume? How much area?
Fine, then lets compare AES to anti-aliasing or anisotropic filtering.
On the one hand you're trying to tell me dedicated hardware is an absolute necessity but on the other hand CPUs are not allowed to spend a tiny bit of die space on things like AES?
Don't mistake my claim about software rendering on a homogenous architecture, in specific markets, for a claim that all dedicated hardware should be banned.
>>And like I said
>>before, fast AES support is important for generic encrypted disk and network access,
>>and gather/scatter speeds up software AES so the dedicated >hardware can be removed.
>
>You said that, but you're wrong. You cannot remove it for compatibility reasons, and also for security reasons.
Wrong. AES-NI has its own CPUID bit. Software has to check support for it before using it. So there's no compatibility issue. And any security attach is utterly impractical. But even for the paranoia it doesn't mean you're out of options without AES-NI hardware support. The AESSE implementation only uses registers so it's not succeptible to cold-boot or cache-timing attacks.
>>VT-x and real mode are even supported by Atom cores, so >it's doubtful this takes
>>any noticable die space on a desktop chip, and it's >obviously indispensable for the software that make use of >it.
>
>Why is virtualization support in hardware? VMware was doing fine with their binary
>translation. Maybe it was added to improve performance and efficiency!!!! Just like rasterizers!
First of all, you're talking about totally different pieces of dedicated hardware. You can't conclude from the potential need for virtualization hardware that dedicated rasterizers are a necessity for graphics.
That said, virtualization may not be needed at all: http://www.vmware.com/pdf/asplos235_adams.pdf. Obviously it's not an efficiency improvement when performance is much lower. Once again though, it's a CPUID bit. So if they ever felt like it's not worth it, they can leave it out. So far it looks like they intend on keeping it, but you can't conclude anything about graphics hardware from this.
>>Besides, like I said before GPUs also have lots of programmability features which
>>may or may not be used. For instance it's doubtful I'll ever use my GeForce GTX
>>460's double-precision computing capabilities. But that's fine, it's relatively
>>small and it's not worth designing a separate chip for the people who do use it.
>>
>>So I have nothing against dedicated hardware in general, but like I said it has
>>to offer a high enough efficiency advantage, weighed against its utilization. The
>>problem with some of the GPU's dedicated hardware is that even during it's key application,
>>graphics, it's often either a significant bottleneck or >mostly idle.
>
>You have said that, but frankly, you've said a lot of things that are simply wrong.
>
>How about you provide some hard data on modern high performance GPUs (e.g. most
>recent generation from NV or AMD) on the utilization of the rasterizer. They have
>performance profilers, so it shouldn't be too hard. Then you can find out how much
>power the rasterizers use, and we can compare it to the power consumption of SW
>rendering. Then you will have actually a marginal understanding of the relative efficiency.
>
>And I'm fairly certain that you will find that comparison to be very unattractive for SW rendering.
For software rendering, rasterization and gradient setup combined takes on average 1.4% of CPU time running the Crysis benchmark. That's all the data you need from me. The rest of the claim is yours, so you prove it.
>>Unifying vertex
>>and pixel processing removed the bottleneck between them and increased utilization.
>>Texture sampling is useless to generic computing and having too few texture units
>>is a bottleneck to graphics, while the importance of FP32 texture filtering increases,
>>so it makes lots of sense to start doing the filtering in shader units and have
>>more generic gather/scatter units. And support for micropolygons would require substantial
>>hardware to sustain the peak throughput, but it's again idle during other workloads
>>and even for graphics its full capacity isn't used all the time. Make it smaller,
>>and it's a bottleneck when drawing micropolygons. Again, unification seems like the better option here to me.
>
>You haven't even quantified the gains from utilization at all for rendering, or the cost in terms of power consumption.
What you're asking for here is probably worth a doctoral dissertation. So you're going to have to wait for more detailed data than what I've already provided, or come up with it yourself. In the meantime, I've given you plenty of arguments to make it at the very least plausible for software rendering to make the IGP redundant once gather/scatter support is added.
Face it: you haven't presented a single piece of compelling evidence to the contrary. You started with the preconception that hardware rendering is an order of magnitude more efficient, but that clearly crumbled as you had to look for deeper differences, which don't affect the overall efficiency nearly as much, and you felt the need to come up with ever more contrived examples from markets other than the one that's relevant. Seriously, this entire discussion has only made me more confident in what I do. Thanks for that.
>>>>What you're also forgetting is that the software evolves as well. In 2001 people
>>>>were really excited about pixel shader 1.1. Today, a desktop GPU with only pixel
>>>>shader 1.1 support would be totally ridiculous, regardless of how power efficient
>>>>it is. I've said it before; we don't need more pixels, we >need more exciting ones.
>>>>Which means increasing generic programmability.
>>>
>>>So let the shaders evolve, and stay separate.
>>
>>I sincerely hope you're not being serious. There's no way GPU manufacturers will un-unify their architectures.
>
>Please read what I wrote, carefully and think about it. "Stay separate" implies
>they are already separate. What are they separate from? You seem to assume I'm
>talking about the vertex/pixel/geo shaders being separate from one another, but that's hardly clear.
>
>What was meant is that the shaders should stay separate from the CPU (which is the state today, even in IGPs).
I misunderstood that, but ironically it's not all that different from asking GPU manufacturers to un-unify vertex and pixel shaders. You're not acknowledging the motivations and advantages behind that unification.
So tell me, why should vertex and pixel shaders stay unified while unifying the CPU and IGP would be a bad idea?
>>>Every single fact that I've seen tends to suggest that software rendering is a demonstrably bad idea.
>>
>>You haven't demonstrated anything.
>
>Sure I have. CPUs are not optimized for throughput and have roughly 4X lower performance
>efficiency. In fact, in some cases that's a vast understatement.
No, you have not demonstrated that CPUs are not optimized for throughput. You demonstrated that they are optimized for latency, and wrongly inferred from that that they can't also be optimized for throughput.
>A Tesla has roughly 2.2 GFLOP/s per W (DP). A high performance Westmere has roughly
>0.75 GFLOP/s per W. Cayman is roughly 2.7 GFLOP/s per, although a real workstation
>card would be lower, probably around 2.5 GFLOP/s per W.
>
>So the reality is that the performance per watt is much worse on CPUs than GPU,
>by a factor of 3-4. So to achieve the same throughput, the power consumption would
>be 3-4X higher. So...um...CPUs aren't throughput optimized.
Westmere doesn't even have AVX, and FMA doubles the GFLOPS rating again. Furthermore, as NVIDIA itself has shown, Cayman's effective throughput is only half its theoretical throughput. So there's your 3-4X smashed to pieces. And while GPUs make use of their fixed-function hardware as well during graphics, you're neglecting that they're no longer able to scale their effective throughput aggressively.
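To spell out the peak-FLOPS arithmetic behind that claim, here's a back-of-the-envelope sketch. The core count and clock are illustrative assumptions, not measured figures; the point is only the ratios:

```python
# Back-of-the-envelope peak single-precision FLOPS arithmetic.
# cores and ghz are illustrative assumptions for a quad-core CPU.

def peak_gflops(cores, ghz, simd_width, flops_per_lane_per_cycle):
    return cores * ghz * simd_width * flops_per_lane_per_cycle

cores, ghz = 4, 3.5
sse = peak_gflops(cores, ghz, 4, 2)  # 128-bit SIMD, separate mul + add ports
avx = peak_gflops(cores, ghz, 8, 2)  # 256-bit AVX doubles the vector width
fma = peak_gflops(cores, ghz, 8, 4)  # FMA doubles the FLOPS per lane again
```

So AVX alone doubles the peak over 128-bit SSE, and FMA doubles it once more: a 4x increase in theoretical throughput at the same core count and clock, which is exactly what closes a 3-4x efficiency gap on paper.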
>>And "tends to suggest" coming from someone who's clearly basing things on prejudice
>>is just more handwaving. I've proven you WRONG about the necessity for dedicated
>>texture decompression, using real data.
>
>You have no real data. You had bad data from an old simulator that the author
>of the simulator thought was BS. Garbage in, garbage out.
I do have real data, but I might have forgotten to mention it in this thread (I mentioned it in two other posts though): Crysis at High detail at 1680x1050 performs 22 million compressed texture accesses per frame. Assuming no magnification and no texture reuse, this means that using uncompressed textures instead would have cost only about 4 GB/s of extra bandwidth at 60 FPS. But no IGP runs Crysis at these settings at 60 FPS (not even my GTX 460 does), so it's far less than that in practice, and in reality there is some magnification and quite a bit of texture reuse in the foliage.
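For anyone who wants to check the arithmetic: assuming 32-bit uncompressed texels against DXT1's 4 bits per texel (both assumptions on my part, matching common usage), the worst-case extra bandwidth works out as follows:

```python
# Bandwidth arithmetic for the compressed-texture claim above.
# Assumptions: 32-bit RGBA uncompressed texels, DXT1 at 4 bits per texel,
# no magnification and no texture reuse (a worst case for uncompressed).

accesses_per_frame = 22_000_000   # measured figure quoted above
fps                = 60
bytes_uncompressed = 4            # 32-bit texel
bytes_dxt1         = 0.5          # 4 bits per texel

extra_bytes_per_sec = accesses_per_frame * fps * (bytes_uncompressed - bytes_dxt1)
extra_gb_per_sec = extra_bytes_per_sec / 1e9  # ~4.6 GB/s worst case
```

That lands in the same few-GB/s ballpark as the figure quoted above, and the real number is lower still once texture reuse and sub-60-FPS frame rates are taken into account.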
[snip]
>>And finally I've shown that an IGP does cost
>>quite a bit and is worthless for non-graphics applications.
>
>That you definitely haven't shown. And IGPs are useful for the same general purpose
>applications that a GPU is. Fusion parts will have OpenCL and compute shader. So will Ivy Bridge.
Then show me one real-life example of an application using the IGP for something other than graphics, and achieving an advantage over using properly optimized AVX code (I'll settle for SSE if you must).
>>Yes, GPUs are evolving too, toward a more CPU-like architecture! I've proven that many times now.
>
>Yes and the relative gap in performance is still HUGE.
Is it? Again, show me this huge performance gap for anything other than graphics running on the IGP. Also, the gap for SwiftShader is only 5x, and it's not using AVX yet, there's no FMA, and no gather/scatter. Are you still comfortable claiming the gap will be huge with those three throughput-oriented technologies in place?
[snip]
>The bottom line is that while it's true that GPUs and CPUs are evolving towards
>one another, that says nothing about how vast the distance between the two is.
>The reality is that there is roughly a 4X gap in performance efficiency between
>GPUs and CPUs on many throughput workloads, and the gap is even larger on latency sensitive workloads.
There's a 4X gap today, but not for long. And no, the gap isn't larger on latency-sensitive workloads; there's software pipelining to deal with that. There's nothing a GPU can do in terms of latency that a CPU can't.
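To illustrate what I mean by software pipelining: you issue the next iteration's long-latency load before computing on the current one, so the load's latency overlaps with useful work instead of stalling. A structural sketch (in real code the "load" is a memory access and the overlap comes from out-of-order execution and prefetching):

```python
# Structural sketch of software pipelining. load() stands in for a
# long-latency memory access, compute() for ALU work; the point is the
# scheduling shape, not Python-level performance.

def load(data, i):
    return data[i]           # placeholder for a long-latency load

def compute(x):
    return x * x + 1         # placeholder for arithmetic work

def pipelined(data):
    if not data:
        return []
    out = []
    nxt = load(data, 0)              # prologue: issue the first load
    for i in range(len(data)):
        cur = nxt
        if i + 1 < len(data):
            nxt = load(data, i + 1)  # issue the next load early, so its
                                     # latency hides under compute(cur)
        out.append(compute(cur))
    return out
```

The result is identical to the naive load-then-compute loop; only the schedule changes, which is why latency hiding is a software technique available to CPUs just as much as the hardware multithreading GPUs rely on.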
>Throughput means more than just scatter/gather although it is one key aspect.
>But to simply throughput down to scatter/gather is pure ignorance and naivete, and
>shows an acute lack of understanding of the substantial differences in circuit design, microarchitecture and software.
With all due respect, you missed the fact that the 4X throughput gap will soon be gone, and you're telling me I have an acute lack of understanding of the substantial differences? Please.
Take care,
Nicolas
Sandy Bridge CPU article online | Nicolas Capens | 2011/02/01 01:58 AM |
Sandy Bridge CPU article online | EduardoS | 2011/02/01 01:36 PM |
Sandy Bridge CPU article online | anon | 2011/02/01 03:56 PM |
Sandy Bridge CPU article online | EduardoS | 2011/02/01 08:17 PM |
Sandy Bridge CPU article online | anon | 2011/02/01 09:13 PM |
Sandy Bridge CPU article online | Eric Bron | 2011/02/02 03:08 AM |
Sandy Bridge CPU article online | Eric Bron | 2011/02/02 03:26 AM |
Sandy Bridge CPU article online | kalmaegi | 2011/02/01 08:29 AM |
SW Rasterization | David Kanter | 2011/01/27 04:18 PM |
Lower pin count memory | iz | 2011/01/27 08:19 PM |
Lower pin count memory | David Kanter | 2011/01/27 08:25 PM |
Lower pin count memory | iz | 2011/01/27 10:31 PM |
Lower pin count memory | David Kanter | 2011/01/27 10:52 PM |
Lower pin count memory | iz | 2011/01/27 11:28 PM |
Lower pin count memory | David Kanter | 2011/01/28 12:05 AM |
Lower pin count memory | iz | 2011/01/28 02:55 AM |
Lower pin count memory | David Hess | 2011/01/28 12:15 PM |
Lower pin count memory | David Kanter | 2011/01/28 12:57 PM |
Lower pin count memory | iz | 2011/01/28 04:20 PM |
Two years later | ForgotPants | 2013/10/26 10:33 AM |
Two years later | anon | 2013/10/26 10:36 AM |
Two years later | Exophase | 2013/10/26 11:56 AM |
Two years later | David Hess | 2013/10/26 04:05 PM |
Herz is totally the thing you DON*T care. | Jouni Osmala | 2013/10/27 12:48 AM |
Herz is totally the thing you DON*T care. | EduardoS | 2013/10/27 06:00 AM |
Herz is totally the thing you DON*T care. | Michael S | 2013/10/27 06:45 AM |
Two years later | someone | 2013/10/28 06:21 AM |
Lower pin count memory | Martin Høyer Kristiansen | 2011/01/28 12:41 AM |
Lower pin count memory | iz | 2011/01/28 02:07 AM |
Lower pin count memory | Darrell Coker | 2011/01/27 09:39 PM |
Lower pin count memory | iz | 2011/01/27 11:20 PM |
Lower pin count memory | Darrell Coker | 2011/01/28 05:07 PM |
Lower pin count memory | iz | 2011/01/28 10:57 PM |
Lower pin count memory | Darrell Coker | 2011/01/29 01:21 AM |
Lower pin count memory | iz | 2011/01/31 09:28 PM |
SW Rasterization | Nicolas Capens | 2011/02/02 07:48 AM |
SW Rasterization | Eric Bron | 2011/02/02 08:37 AM |
SW Rasterization | Nicolas Capens | 2011/02/02 03:35 PM |
SW Rasterization | Eric Bron | 2011/02/02 04:11 PM |
SW Rasterization | Eric Bron | 2011/02/03 01:13 AM |
SW Rasterization | Nicolas Capens | 2011/02/04 06:57 AM |
SW Rasterization | Eric Bron | 2011/02/04 07:50 AM |
erratum | Eric Bron | 2011/02/04 07:58 AM |
SW Rasterization | Nicolas Capens | 2011/02/04 04:25 PM |
SW Rasterization | David Kanter | 2011/02/04 04:33 PM |
SW Rasterization | anon | 2011/02/04 05:04 PM |
SW Rasterization | Nicolas Capens | 2011/02/05 02:39 PM |
SW Rasterization | David Kanter | 2011/02/05 04:07 PM |
SW Rasterization | Nicolas Capens | 2011/02/05 10:39 PM |
SW Rasterization | Eric Bron | 2011/02/04 09:55 AM |
Comments pt 1 | David Kanter | 2011/02/02 12:08 PM |
Comments pt 1 | Eric Bron | 2011/02/02 02:16 PM |
Comments pt 1 | Gabriele Svelto | 2011/02/03 12:37 AM |
Comments pt 1 | Eric Bron | 2011/02/03 01:36 AM |
Comments pt 1 | Nicolas Capens | 2011/02/03 10:08 PM |
Comments pt 1 | Nicolas Capens | 2011/02/03 09:26 PM |
Comments pt 1 | Eric Bron | 2011/02/04 02:33 AM |
Comments pt 1 | Nicolas Capens | 2011/02/04 04:24 AM |
example code | Eric Bron | 2011/02/04 03:51 AM |
example code | Nicolas Capens | 2011/02/04 07:24 AM |
example code | Eric Bron | 2011/02/04 07:36 AM |
example code | Nicolas Capens | 2011/02/05 10:43 PM |
Comments pt 1 | Rohit | 2011/02/04 11:43 AM |
Comments pt 1 | Nicolas Capens | 2011/02/04 04:05 PM |
Comments pt 1 | David Kanter | 2011/02/04 04:36 PM |
Comments pt 1 | Nicolas Capens | 2011/02/05 01:45 PM |
Comments pt 1 | Eric Bron | 2011/02/05 03:13 PM |
Comments pt 1 | Nicolas Capens | 2011/02/05 10:52 PM |
Comments pt 1 | Eric Bron | 2011/02/06 12:31 AM |
Comments pt 1 | Nicolas Capens | 2011/02/06 03:06 PM |
Comments pt 1 | Eric Bron | 2011/02/07 02:12 AM |
The need for gather/scatter support | Nicolas Capens | 2011/02/10 09:07 AM |
The need for gather/scatter support | Eric Bron | 2011/02/11 02:11 AM |
Gather/scatter performance data | Nicolas Capens | 2011/02/13 02:39 AM |
Gather/scatter performance data | Eric Bron | 2011/02/13 06:46 AM |
Gather/scatter performance data | Nicolas Capens | 2011/02/14 06:48 AM |
Gather/scatter performance data | Eric Bron | 2011/02/14 08:32 AM |
Gather/scatter performance data | Eric Bron | 2011/02/14 09:07 AM |
Gather/scatter performance data | Eric Bron | 2011/02/13 08:00 AM |
Gather/scatter performance data | Nicolas Capens | 2011/02/14 06:49 AM |
Gather/scatter performance data | Eric Bron | 2011/02/15 01:23 AM |
Gather/scatter performance data | Eric Bron | 2011/02/13 04:06 PM |
Gather/scatter performance data | Nicolas Capens | 2011/02/14 06:52 AM |
Gather/scatter performance data | Eric Bron | 2011/02/14 08:43 AM |
SW Rasterization - a long way off | Rohit | 2011/02/02 12:17 PM |
SW Rasterization - a long way off | Nicolas Capens | 2011/02/04 02:59 AM |
CPU only rendering - a long way off | Rohit | 2011/02/04 10:52 AM |
CPU only rendering - a long way off | Nicolas Capens | 2011/02/04 06:15 PM |
CPU only rendering - a long way off | Rohit | 2011/02/05 01:00 AM |
CPU only rendering - a long way off | Nicolas Capens | 2011/02/05 08:45 PM |
CPU only rendering - a long way off | David Kanter | 2011/02/06 08:51 PM |
CPU only rendering - a long way off | Gian-Carlo Pascutto | 2011/02/06 11:22 PM |
Encryption | David Kanter | 2011/02/07 12:18 AM |
Encryption | Nicolas Capens | 2011/02/07 06:51 AM |
Encryption | David Kanter | 2011/02/07 10:50 AM |
Encryption | Nicolas Capens | 2011/02/08 09:26 AM |
CPUs are latency optimized | David Kanter | 2011/02/08 10:38 AM |
efficient compiler on an efficient GPU real today. | sJ | 2011/02/08 10:29 PM |
CPUs are latency optimized | Nicolas Capens | 2011/02/09 08:49 PM |
CPUs are latency optimized | Eric Bron | 2011/02/09 11:49 PM |
CPUs are latency optimized | Antti-Ville Tuunainen | 2011/02/10 05:16 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/10 06:04 AM |
CPUs are latency optimized | Eric Bron | 2011/02/10 06:48 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/10 12:31 PM |
CPUs are latency optimized | Eric Bron | 2011/02/11 01:43 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/11 06:31 AM |
CPUs are latency optimized | EduardoS | 2011/02/10 04:29 PM |
CPUs are latency optimized | Anon | 2011/02/10 05:40 PM |
CPUs are latency optimized | David Kanter | 2011/02/10 07:33 PM |
CPUs are latency optimized | EduardoS | 2011/02/11 01:18 PM |
CPUs are latency optimized | Nicolas Capens | 2011/02/11 04:56 AM |
CPUs are latency optimized | Rohit | 2011/02/11 06:33 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/14 01:19 AM |
CPUs are latency optimized | Eric Bron | 2011/02/14 02:23 AM |
CPUs are latency optimized | EduardoS | 2011/02/14 12:11 PM |
CPUs are latency optimized | David Kanter | 2011/02/11 01:45 PM |
CPUs are latency optimized | Nicolas Capens | 2011/02/15 04:22 AM |
CPUs are latency optimized | David Kanter | 2011/02/15 11:47 AM |
CPUs are latency optimized | Nicolas Capens | 2011/02/15 06:10 PM |
Have fun | David Kanter | 2011/02/15 09:04 PM |
Have fun | Nicolas Capens | 2011/02/17 02:59 AM |
Have fun | Brett | 2011/02/17 11:56 AM |
Have fun | Nicolas Capens | 2011/02/19 03:53 PM |
Have fun | Brett | 2011/02/20 05:08 PM |
Have fun | Brett | 2011/02/20 06:13 PM |
On-die storage to fight Amdahl | Nicolas Capens | 2011/02/23 04:37 PM |
On-die storage to fight Amdahl | Brett | 2011/02/23 08:59 PM |
On-die storage to fight Amdahl | Brett | 2011/02/23 09:08 PM |
On-die storage to fight Amdahl | Nicolas Capens | 2011/02/24 06:42 PM |
On-die storage to fight Amdahl | Rohit | 2011/02/25 10:02 PM |
On-die storage to fight Amdahl | Nicolas Capens | 2011/03/09 05:53 PM |
On-die storage to fight Amdahl | Rohit | 2011/03/10 07:02 AM |
NVIDIA using tile based rendering? | Nathan Monson | 2011/03/11 06:58 PM |
NVIDIA using tile based rendering? | Rohit | 2011/03/12 03:29 AM |
NVIDIA using tile based rendering? | Nathan Monson | 2011/03/12 10:05 AM |
NVIDIA using tile based rendering? | Rohit | 2011/03/12 10:16 AM |
On-die storage to fight Amdahl | Brett | 2011/02/26 01:10 AM |
On-die storage to fight Amdahl | Nathan Monson | 2011/02/26 12:51 PM |
On-die storage to fight Amdahl | Brett | 2011/02/26 03:40 PM |
Convergence is inevitable | Nicolas Capens | 2011/03/09 07:22 PM |
Convergence is inevitable | Brett | 2011/03/09 09:59 PM |
Convergence is inevitable | Antti-Ville Tuunainen | 2011/03/10 02:34 PM |
Convergence is inevitable | Brett | 2011/03/10 08:39 PM |
Procedural texturing? | David Kanter | 2011/03/11 12:32 AM |
Procedural texturing? | hobold | 2011/03/11 02:59 AM |
Procedural texturing? | Dan Downs | 2011/03/11 08:28 AM |
Procedural texturing? | Mark Roulo | 2011/03/11 01:58 PM |
Procedural texturing? | Anon | 2011/03/11 05:11 PM |
Procedural texturing? | Nathan Monson | 2011/03/11 06:30 PM |
Procedural texturing? | Brett | 2011/03/15 06:45 AM |
Procedural texturing? | Seni | 2011/03/15 09:13 AM |
Procedural texturing? | Brett | 2011/03/15 10:45 AM |
Procedural texturing? | Seni | 2011/03/15 01:09 PM |
Procedural texturing? | Brett | 2011/03/11 09:02 PM |
Procedural texturing? | Brett | 2011/03/11 08:34 PM |
Procedural texturing? | Eric Bron | 2011/03/12 02:37 AM |
Convergence is inevitable | Jouni Osmala | 2011/03/09 10:28 PM |
Convergence is inevitable | Brett | 2011/04/05 04:08 PM |
Convergence is inevitable | Nicolas Capens | 2011/04/07 04:23 AM |
Convergence is inevitable | none | 2011/04/07 06:03 AM |
Convergence is inevitable | Nicolas Capens | 2011/04/07 09:34 AM |
Convergence is inevitable | anon | 2011/04/07 01:15 PM |
Convergence is inevitable | none | 2011/04/08 12:57 AM |
Convergence is inevitable | Brett | 2011/04/07 07:04 PM |
Convergence is inevitable | none | 2011/04/08 01:14 AM |
Gather implementation | David Kanter | 2011/04/08 11:01 AM |
RAM Latency | David Hess | 2011/04/07 07:22 AM |
RAM Latency | Brett | 2011/04/07 06:20 PM |
RAM Latency | Nicolas Capens | 2011/04/07 09:18 PM |
RAM Latency | Brett | 2011/04/08 04:33 AM |
RAM Latency | Nicolas Capens | 2011/04/10 01:23 PM |
RAM Latency | Rohit | 2011/04/08 05:57 AM |
RAM Latency | Nicolas Capens | 2011/04/10 12:23 PM |
RAM Latency | David Kanter | 2011/04/10 01:27 PM |
RAM Latency | Rohit | 2011/04/11 05:17 AM |
Convergence is inevitable | Eric Bron | 2011/04/07 08:46 AM |
Convergence is inevitable | Nicolas Capens | 2011/04/07 08:50 PM |
Convergence is inevitable | Eric Bron | 2011/04/07 11:39 PM |
Flaws in PowerVR | Rohit | 2011/02/25 10:21 PM |
Flaws in PowerVR | Brett | 2011/02/25 11:37 PM |
Flaws in PowerVR | Paul | 2011/02/26 04:17 AM |
Have fun | David Kanter | 2011/02/18 11:52 AM |
Have fun | Michael S | 2011/02/19 11:12 AM |
Have fun | David Kanter | 2011/02/19 02:26 PM |
Have fun | Michael S | 2011/02/19 03:43 PM |
Have fun | anon | 2011/02/19 04:02 PM |
Have fun | Michael S | 2011/02/19 04:56 PM |
Have fun | anon | 2011/02/20 02:50 PM |
Have fun | EduardoS | 2011/02/20 01:44 PM |
Linear vs non-linear | EduardoS | 2011/02/20 01:55 PM |
Have fun | Michael S | 2011/02/20 03:19 PM |
Have fun | EduardoS | 2011/02/20 04:51 PM |
Have fun | Nicolas Capens | 2011/02/21 10:12 AM |
Have fun | Michael S | 2011/02/21 11:38 AM |
Have fun | Eric Bron | 2011/02/21 01:10 PM |
Have fun | Eric Bron | 2011/02/21 01:39 PM |
Have fun | Michael S | 2011/02/21 05:13 PM |
Have fun | Eric Bron | 2011/02/21 11:43 PM |
Have fun | Michael S | 2011/02/22 12:47 AM |
Have fun | Eric Bron | 2011/02/22 01:10 AM |
Have fun | Michael S | 2011/02/22 10:37 AM |
Have fun | anon | 2011/02/22 12:38 PM |
Have fun | EduardoS | 2011/02/22 02:49 PM |
Gather/scatter efficiency | Nicolas Capens | 2011/02/23 05:37 PM |
Gather/scatter efficiency | anonymous | 2011/02/23 05:51 PM |
Gather/scatter efficiency | Nicolas Capens | 2011/02/24 05:57 PM |
Gather/scatter efficiency | anonymous | 2011/02/24 06:16 PM |
Gather/scatter efficiency | Michael S | 2011/02/25 06:45 AM |
Gather implementation | David Kanter | 2011/02/25 04:34 PM |
Gather implementation | Michael S | 2011/02/26 09:40 AM |
Gather implementation | anon | 2011/02/26 10:52 AM |
Gather implementation | Michael S | 2011/02/26 11:16 AM |
Gather implementation | anon | 2011/02/26 10:22 PM |
Gather implementation | Michael S | 2011/02/27 06:23 AM |
Gather/scatter efficiency | Nicolas Capens | 2011/02/28 02:14 PM |
Consider yourself ignored | David Kanter | 2011/02/22 12:05 AM |
one more anti-FMA flame. By me. | Michael S | 2011/02/16 06:40 AM |
one more anti-FMA flame. By me. | Eric Bron | 2011/02/16 07:30 AM |
one more anti-FMA flame. By me. | Eric Bron | 2011/02/16 08:15 AM |
one more anti-FMA flame. By me. | Nicolas Capens | 2011/02/17 05:27 AM |
anti-FMA != anti-throughput or anti-SG | Michael S | 2011/02/17 06:42 AM |
anti-FMA != anti-throughput or anti-SG | Nicolas Capens | 2011/02/17 04:46 PM |
Tarantula paper | Paul A. Clayton | 2011/02/17 11:38 PM |
Tarantula paper | Nicolas Capens | 2011/02/19 04:19 PM |
anti-FMA != anti-throughput or anti-SG | Eric Bron | 2011/02/18 12:48 AM |
anti-FMA != anti-throughput or anti-SG | Nicolas Capens | 2011/02/20 02:46 PM |
anti-FMA != anti-throughput or anti-SG | Michael S | 2011/02/20 04:00 PM |
anti-FMA != anti-throughput or anti-SG | Nicolas Capens | 2011/02/23 03:05 AM |
Software pipelining on x86 | David Kanter | 2011/02/23 04:04 AM |
Software pipelining on x86 | JS | 2011/02/23 04:25 AM |
Software pipelining on x86 | Salvatore De Dominicis | 2011/02/23 07:37 AM |
Software pipelining on x86 | Jouni Osmala | 2011/02/23 08:10 AM |
Software pipelining on x86 | LeeMiller | 2011/02/23 09:07 PM |
Software pipelining on x86 | Nicolas Capens | 2011/02/24 02:17 PM |
Software pipelining on x86 | anonymous | 2011/02/24 06:04 PM |
Software pipelining on x86 | Nicolas Capens | 2011/02/28 08:27 AM |
Software pipelining on x86 | Antti-Ville Tuunainen | 2011/03/02 03:31 AM |
Software pipelining on x86 | Megol | 2011/03/02 11:55 AM |
Software pipelining on x86 | Geert Bosch | 2011/03/03 06:58 AM |
FMA benefits and latency predictions | David Kanter | 2011/02/25 04:14 PM |
FMA benefits and latency predictions | Antti-Ville Tuunainen | 2011/02/26 09:43 AM |
FMA benefits and latency predictions | Matt Waldhauer | 2011/02/27 05:42 AM |
FMA benefits and latency predictions | Nicolas Capens | 2011/03/09 05:11 PM |
FMA benefits and latency predictions | Rohit | 2011/03/10 07:11 AM |
FMA benefits and latency predictions | Eric Bron | 2011/03/10 08:30 AM |
anti-FMA != anti-throughput or anti-SG | Michael S | 2011/02/23 04:19 AM |
anti-FMA != anti-throughput or anti-SG | Nicolas Capens | 2011/02/23 06:50 AM |
anti-FMA != anti-throughput or anti-SG | Michael S | 2011/02/23 09:37 AM |
FMA and beyond | Nicolas Capens | 2011/02/24 03:47 PM |
detour on terminology | hobold | 2011/02/24 06:08 PM |
detour on terminology | Nicolas Capens | 2011/02/28 01:24 PM |
detour on terminology | Eric Bron | 2011/03/01 01:38 AM |
detour on terminology | Michael S | 2011/03/01 04:03 AM |
detour on terminology | Eric Bron | 2011/03/01 04:39 AM |
detour on terminology | Michael S | 2011/03/01 07:33 AM |
detour on terminology | Eric Bron | 2011/03/01 08:34 AM |
erratum | Eric Bron | 2011/03/01 08:54 AM |
detour on terminology | Nicolas Capens | 2011/03/10 07:39 AM |
detour on terminology | Eric Bron | 2011/03/10 08:50 AM |
anti-FMA != anti-throughput or anti-SG | Nicolas Capens | 2011/02/23 05:12 AM |
anti-FMA != anti-throughput or anti-SG | David Kanter | 2011/02/20 10:25 PM |
anti-FMA != anti-throughput or anti-SG | David Kanter | 2011/02/17 05:51 PM |
Tarantula vector unit well-integrated | Paul A. Clayton | 2011/02/17 11:38 PM |
anti-FMA != anti-throughput or anti-SG | Megol | 2011/02/19 01:17 PM |
anti-FMA != anti-throughput or anti-SG | David Kanter | 2011/02/20 01:09 AM |
anti-FMA != anti-throughput or anti-SG | Megol | 2011/02/20 08:55 AM |
anti-FMA != anti-throughput or anti-SG | David Kanter | 2011/02/20 12:39 PM |
anti-FMA != anti-throughput or anti-SG | EduardoS | 2011/02/20 01:35 PM |
anti-FMA != anti-throughput or anti-SG | Megol | 2011/02/21 07:12 AM |
anti-FMA != anti-throughput or anti-SG | anon | 2011/02/17 09:44 PM |
anti-FMA != anti-throughput or anti-SG | Michael S | 2011/02/18 05:20 AM |
one more anti-FMA flame. By me. | Eric Bron | 2011/02/17 07:24 AM |
thanks | Michael S | 2011/02/17 03:56 PM |
CPUs are latency optimized | EduardoS | 2011/02/15 12:24 PM |
SwiftShader SNB test | Eric Bron | 2011/02/15 02:46 PM |
SwiftShader NHM test | Eric Bron | 2011/02/15 03:50 PM |
SwiftShader SNB test | Nicolas Capens | 2011/02/16 11:06 PM |
SwiftShader SNB test | Eric Bron | 2011/02/17 12:21 AM |
SwiftShader SNB test | Eric Bron | 2011/02/22 09:32 AM |
SwiftShader SNB test 2nd run | Eric Bron | 2011/02/22 09:51 AM |
SwiftShader SNB test 2nd run | Nicolas Capens | 2011/02/23 01:14 PM |
SwiftShader SNB test 2nd run | Eric Bron | 2011/02/23 01:42 PM |
Win7SP1 out but no AVX hype? | Michael S | 2011/02/24 02:14 AM |
Win7SP1 out but no AVX hype? | Eric Bron | 2011/02/24 02:39 AM |
CPUs are latency optimized | Eric Bron | 2011/02/15 07:02 AM |
CPUs are latency optimized | EduardoS | 2011/02/11 02:40 PM |
CPU only rendering - not a long way off | Nicolas Capens | 2011/02/07 05:45 AM |
CPU only rendering - not a long way off | David Kanter | 2011/02/07 11:09 AM |
CPU only rendering - not a long way off | anonymous | 2011/02/07 09:25 PM |
Sandy Bridge IGP EUs | David Kanter | 2011/02/07 10:22 PM |
Sandy Bridge IGP EUs | Hannes | 2011/02/08 04:59 AM |
SW Rasterization - Why? | Seni | 2011/02/02 01:53 PM |
Market reasons to ditch the IGP | Nicolas Capens | 2011/02/10 02:12 PM |
Market reasons to ditch the IGP | Seni | 2011/02/11 04:42 AM |
Market reasons to ditch the IGP | Nicolas Capens | 2011/02/16 03:29 AM |
Market reasons to ditch the IGP | Seni | 2011/02/16 12:39 PM |
An excellent post! | David Kanter | 2011/02/16 02:18 PM |
CPUs clock higher | Moritz | 2011/02/17 07:06 AM |
Market reasons to ditch the IGP | Nicolas Capens | 2011/02/18 05:22 PM |
Market reasons to ditch the IGP | IntelUser2000 | 2011/02/18 06:20 PM |
Market reasons to ditch the IGP | Nicolas Capens | 2011/02/21 01:42 PM |
Bad data (repeated) | David Kanter | 2011/02/21 11:21 PM |
Bad data (repeated) | none | 2011/02/22 02:04 AM |
13W or 8W? | Foo_ | 2011/02/22 05:00 AM |
13W or 8W? | Linus Torvalds | 2011/02/22 07:58 AM |
13W or 8W? | David Kanter | 2011/02/22 10:33 AM |
13W or 8W? | Mark Christiansen | 2011/02/22 01:47 PM |
Bigger picture | Nicolas Capens | 2011/02/24 05:33 PM |
Bigger picture | Nicolas Capens | 2011/02/24 07:06 PM |
20+ Watt | Nicolas Capens | 2011/02/24 07:18 PM |
<20W | David Kanter | 2011/02/25 12:13 PM |
>20W | Nicolas Capens | 2011/03/08 06:34 PM |
IGP is 3X more efficient | David Kanter | 2011/03/08 09:53 PM |
IGP is 3X more efficient | Eric Bron | 2011/03/09 01:44 AM |
>20W | Eric Bron | 2011/03/09 02:48 AM |
Specious data and claims are still specious | David Kanter | 2011/02/25 01:38 AM |
IGP power consumption, LRB samplers | Nicolas Capens | 2011/03/08 05:24 PM |
IGP power consumption, LRB samplers | EduardoS | 2011/03/08 05:52 PM |
IGP power consumption, LRB samplers | Rohit | 2011/03/09 06:42 AM |
Market reasons to ditch the IGP | none | 2011/02/22 01:58 AM |
Market reasons to ditch the IGP | Nicolas Capens | 2011/02/24 05:43 PM |
Market reasons to ditch the IGP | slacker | 2011/02/22 01:32 PM |
Market reasons to ditch the IGP | Seni | 2011/02/18 08:51 PM |
Correction - 28 comparators, not 36. (NT) | Seni | 2011/02/18 09:03 PM |
Market reasons to ditch the IGP | Gabriele Svelto | 2011/02/19 12:49 AM |
Market reasons to ditch the IGP | Seni | 2011/02/19 10:59 AM |
Market reasons to ditch the IGP | Exophase | 2011/02/20 09:43 AM |
Market reasons to ditch the IGP | EduardoS | 2011/02/19 09:13 AM |
Market reasons to ditch the IGP | Seni | 2011/02/19 10:46 AM |
The next revolution | Nicolas Capens | 2011/02/22 02:33 AM |
The next revolution | Gabriele Svelto | 2011/02/22 08:15 AM |
The next revolution | Eric Bron | 2011/02/22 08:48 AM |
The next revolution | Nicolas Capens | 2011/02/23 06:39 PM |
The next revolution | Gabriele Svelto | 2011/02/23 11:43 PM |
GPGPU content creation (or lack of it) | Nicolas Capens | 2011/02/28 06:39 AM |
GPGPU content creation (or lack of it) | The market begs to differ | 2011/03/01 05:32 AM |
GPGPU content creation (or lack of it) | Nicolas Capens | 2011/03/09 08:14 PM |
GPGPU content creation (or lack of it) | Gabriele Svelto | 2011/03/10 12:01 AM |
The market begs to differ | Gabriele Svelto | 2011/03/01 05:33 AM |
The next revolution | Anon | 2011/02/24 01:15 AM |
The next revolution | Nicolas Capens | 2011/02/28 01:34 PM |
The next revolution | Seni | 2011/02/22 01:02 PM |
The next revolution | Gabriele Svelto | 2011/02/23 05:27 AM |
The next revolution | Seni | 2011/02/23 08:03 AM |
The next revolution | Nicolas Capens | 2011/02/24 05:11 AM |
The next revolution | Seni | 2011/02/24 07:45 PM |
IGP sampler count | Nicolas Capens | 2011/03/03 04:19 AM |
Latency and throughput optimized cores | Nicolas Capens | 2011/03/07 02:28 PM |
The real reason no IGP /CPU converge. | Jouni Osmala | 2011/03/07 10:34 PM |
Still converging | Nicolas Capens | 2011/03/13 02:08 PM |
Homogeneous CPU advantages | Nicolas Capens | 2011/03/07 11:12 PM |
Homogeneous CPU advantages | Seni | 2011/03/08 08:23 AM |
Homogeneous CPU advantages | David Kanter | 2011/03/08 10:16 AM |
Homogeneous CPU advantages | Brett | 2011/03/09 02:37 AM |
Homogeneous CPU advantages | Jouni Osmala | 2011/03/08 11:27 PM |
SW Rasterization | firsttimeposter | 2011/02/03 10:18 PM |
SW Rasterization | Nicolas Capens | 2011/02/04 03:48 AM |
SW Rasterization | Eric Bron | 2011/02/04 04:14 AM |
SW Rasterization | Nicolas Capens | 2011/02/04 07:36 AM |
SW Rasterization | Eric Bron | 2011/02/04 07:42 AM |
Sandy Bridge CPU article online | Eric Bron | 2011/01/26 02:23 AM |
Sandy Bridge CPU article online | Gabriele Svelto | 2011/02/04 03:31 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/02/05 07:46 PM |
Sandy Bridge CPU article online | Gabriele Svelto | 2011/02/06 05:20 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/02/06 05:07 PM |
Sandy Bridge CPU article online | arch.comp | 2011/01/06 09:58 PM |
Sandy Bridge CPU article online | Seni | 2011/01/07 09:25 AM |
Sandy Bridge CPU article online | Michael S | 2011/01/05 03:28 AM |
Sandy Bridge CPU article online | Nicolas Capens | 2011/01/05 05:06 AM |
permuting vector elements (yet again) | hobold | 2011/01/05 04:15 PM |
permuting vector elements (yet again) | Nicolas Capens | 2011/01/06 05:11 AM |
Sandy Bridge CPU article online | Eric Bron | 2011/01/05 11:46 AM |
wow ...! | hobold | 2011/01/05 04:19 PM |
wow ...! | Nicolas Capens | 2011/01/05 05:11 PM |
wow ...! | Eric Bron | 2011/01/05 09:46 PM |
compress LUT | Eric Bron | 2011/01/05 10:05 PM |
wow ...! | Michael S | 2011/01/06 01:25 AM |
wow ...! | Nicolas Capens | 2011/01/06 05:26 AM |
wow ...! | Eric Bron | 2011/01/06 08:08 AM |
wow ...! | Nicolas Capens | 2011/01/07 06:19 AM |
wow ...! | Steve Underwood | 2011/01/07 09:53 PM |
saturation | hobold | 2011/01/08 09:25 AM |
saturation | Steve Underwood | 2011/01/08 11:38 AM |
saturation | Michael S | 2011/01/08 12:05 PM |
128 bit floats | Brett | 2011/01/08 12:39 PM |
128 bit floats | Michael S | 2011/01/08 01:10 PM |
128 bit floats | Anil Maliyekkel | 2011/01/08 02:46 PM |
128 bit floats | Kevin G | 2011/02/27 10:15 AM |
128 bit floats | hobold | 2011/02/27 03:42 PM |
128 bit floats | Ian Ollmann | 2011/02/28 03:56 PM |
OpenCL FP accuracy | hobold | 2011/03/01 05:45 AM |
OpenCL FP accuracy | anon | 2011/03/01 07:03 PM |
OpenCL FP accuracy | hobold | 2011/03/02 02:53 AM |
OpenCL FP accuracy | Eric Bron | 2011/03/02 06:10 AM |
pet project | hobold | 2011/03/02 08:22 AM |
pet project | Anon | 2011/03/02 08:10 PM |
pet project | hobold | 2011/03/03 03:57 AM |
pet project | Eric Bron | 2011/03/03 01:29 AM |
pet project | hobold | 2011/03/03 04:14 AM |
pet project | Eric Bron | 2011/03/03 02:10 PM |
pet project | hobold | 2011/03/03 03:04 PM |
OpenCL and AMD | Vincent Diepeveen | 2011/03/07 12:44 PM |
OpenCL and AMD | Eric Bron | 2011/03/08 01:05 AM |
OpenCL and AMD | Vincent Diepeveen | 2011/03/08 07:27 AM |
128 bit floats | Michael S | 2011/02/27 03:46 PM |
128 bit floats | Anil Maliyekkel | 2011/02/27 05:14 PM |
saturation | Steve Underwood | 2011/01/17 03:42 AM |
wow ...! | hobold | 2011/01/06 04:05 PM |
Ring | Moritz | 2011/01/20 09:51 PM |
Ring | Antti-Ville Tuunainen | 2011/01/21 11:25 AM |
Ring | Moritz | 2011/01/23 12:38 AM |
Ring | Michael S | 2011/01/23 03:04 AM |
So fast | Moritz | 2011/01/23 06:57 AM |
So fast | David Kanter | 2011/01/23 09:05 AM |
Sandy Bridge CPU (L1D cache) | Gordon Ward | 2011/09/09 01:47 AM |
Sandy Bridge CPU (L1D cache) | David Kanter | 2011/09/09 03:19 PM |
Sandy Bridge CPU (L1D cache) | EduardoS | 2011/09/09 07:53 PM |
Sandy Bridge CPU (L1D cache) | Paul A. Clayton | 2011/09/10 04:12 AM |
Sandy Bridge CPU (L1D cache) | Michael S | 2011/09/10 08:41 AM |
Sandy Bridge CPU (L1D cache) | EduardoS | 2011/09/10 10:17 AM |
Address Ports on Sandy Bridge Scheduler | Victor | 2011/10/16 05:40 AM |
Address Ports on Sandy Bridge Scheduler | EduardoS | 2011/10/16 06:45 PM |
Address Ports on Sandy Bridge Scheduler | Megol | 2011/10/17 08:20 AM |
Address Ports on Sandy Bridge Scheduler | Victor | 2011/10/18 04:34 PM |
Benefits of early scheduling | Paul A. Clayton | 2011/10/18 05:53 PM |
Benefits of early scheduling | Victor | 2011/10/19 04:58 PM |
Consistency and invalidation ordering | Paul A. Clayton | 2011/10/20 03:43 AM |
Address Ports on Sandy Bridge Scheduler | John Upcroft | 2011/10/21 03:16 PM |
Address Ports on Sandy Bridge Scheduler | David Kanter | 2011/10/22 09:49 AM |
Address Ports on Sandy Bridge Scheduler | John Upcroft | 2011/10/26 12:24 PM |
Store TLB look-up at commit? | Paul A. Clayton | 2011/10/26 07:30 PM |
Store TLB look-up at commit? | Richard Scott | 2011/10/26 08:40 PM |
Just a guess | Paul A. Clayton | 2011/10/27 12:54 PM |