The Root of the Problem?
The raging controversy over which benchmarks are valid or ‘real world’ and which are not, along with charges of bias and corruption, indicates that the industry has yet to identify a universally accepted method for measuring the relative performance of components and systems. Perhaps it is a pipe dream to think this can be accomplished, but if we are even to begin finding the answer, we need to clearly define the goals and guidelines – and strictly adhere to them. Among these should be a set of clear definitions, such as what constitutes real world vs. synthetic testing, and what role component benchmarks play in evaluating systems. We also need as much information as possible about the benchmarks themselves, without compromising their integrity, so that meaningful conclusions can be drawn about how the results pertain to any specific usage.
Most professionals and hobbyists realize that there are several different categories of benchmarks that provide different types of information. The generally accepted categories are system vs. component, and application vs. synthetic. It seems apparent that the definitions of these are sufficiently vague to create a great deal of confusion, even amongst those who are looked upon as ‘experts’ in benchmarking.
The problem with system-level benchmarks is that measuring the performance difference attributable to a single component requires a great deal of work and analysis, which is generally not cost effective in a business environment. Component-level benchmarks are good for testing the capabilities of a specific component, but relating the results to what users will actually see in a full system is very difficult and, again, can be time consuming and costly. Some believe component-level benchmarks are better than system-level ones, while others believe the opposite. I believe that component-level benchmarks are useful for identifying specific differences between components, which can then give more insight into the results of system-level benchmarks.
Application benchmarks are obviously ideal for comparing components and systems, but only for that specific application. Unless one can accurately profile other commonly used applications (the cost factor again), the results cannot be used to estimate performance for anything beyond that application. Synthetic benchmarks can provide information about how a particular algorithm might perform, or even about which component or system is best suited to certain types of code. This is actually more useful for software developers than for end users, however. Getting back to the SysMark 2001 issue, the SSE patch for the Athlon XP could be considered a synthetic benchmark: it shows developers what the effects of SSE are in such an application, but provides absolutely no information about how the application will run on an Athlon XP in the real world. As stated previously, however, without knowing why that specific application weighs so heavily on the final score, the results of the entire benchmark should be questioned.
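To make the distinction concrete, here is a toy synthetic micro-benchmark, a minimal sketch and not anything from the article or from SysMark itself. The function names, workload size, and repetition count are all illustrative assumptions. It times two routines that compute the same result by different code paths, which is exactly the kind of information a synthetic test yields: it tells a developer which implementation of an algorithm is faster on a given machine, while saying nothing about how a full application would behave.

```python
# Toy synthetic micro-benchmark (illustrative only): times two equivalent
# summation routines to isolate a single code path, the way a synthetic
# benchmark isolates one algorithm rather than whole-application behavior.
import timeit

def sum_loop(n):
    # Explicit loop: pays interpreter overhead on every iteration.
    total = 0
    for i in range(n):
        total += i
    return total

def sum_builtin(n):
    # Built-in sum over a range: the same arithmetic via optimized C code.
    return sum(range(n))

if __name__ == "__main__":
    n = 100_000
    # Both must produce identical results; only the timing differs.
    assert sum_loop(n) == sum_builtin(n)
    t_loop = timeit.timeit(lambda: sum_loop(n), number=50)
    t_builtin = timeit.timeit(lambda: sum_builtin(n), number=50)
    print(f"loop:    {t_loop:.4f}s")
    print(f"builtin: {t_builtin:.4f}s")
```

The relative timings characterize the two code paths on this particular interpreter and CPU; like any synthetic result, they cannot be extrapolated to the performance of a real application that spends most of its time elsewhere.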