Design and Methodology
The Winstone benchmarks use pre-recorded scripts to execute the same keystrokes and data entry in real applications for every run. This ensures that every run performs exactly the same sequence of events, and provides a measurement fairly indicative of what a user might experience using the same applications. These benchmarks have gone through several revisions over the years to better measure the ‘real world’ performance of a system.
In early versions, each application was run sequentially and given a score based upon the time it took to complete the script. After all scripts were completed, an aggregate score was calculated by weighting each application by its market share. The resulting report showed not only the aggregate ‘weighted’ score, but the actual scores of the individual applications as well. These individual scores allowed a knowledgeable person to swap components and determine the effects of such changes by looking at how individual applications responded.
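The weighting scheme described above can be sketched as follows. This is a minimal illustration, not PC Magazine's actual formula: the application names, times, and market-share weights are made up, and the real weights were proprietary.

```python
# Hypothetical sketch of an early-Winstone-style aggregate score:
# each application is scored relative to a reference machine, then
# the per-application scores are combined by market-share weight.

def weighted_score(times, weights, baseline):
    """Score each app as baseline_time / measured_time (higher is
    faster), then sum the scores weighted by market share."""
    total = 0.0
    for app, measured in times.items():
        relative = baseline[app] / measured   # >1.0 means faster than reference
        total += weights[app] * relative
    return total

# All numbers below are invented for illustration.
times    = {"word": 42.0, "sheet": 35.0, "db": 58.0}   # seconds on this system
baseline = {"word": 50.0, "sheet": 40.0, "db": 60.0}   # seconds on reference system
weights  = {"word": 0.5,  "sheet": 0.3,  "db": 0.2}    # assumed market share

print(round(weighted_score(times, weights, baseline), 3))  # prints 1.145
```

Because the per-application scores feed directly into the total, publishing them alongside the aggregate (as the early versions did) is what let readers see how individual applications responded to a component swap.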
At some point, it became apparent that most users were buying complete suites of applications (such as the MS Office suite), so the benchmarks were changed to reflect this. The most popular suites were identified, and scripts were developed to perform what was considered a ‘typical’ set of tasks. Since market share data was considered secret, and any reasonably intelligent person could derive the percentages fairly closely from the individual scores, those scores were dropped in favor of ‘suite’ scores. The current versions (BWS 2001 and CCWS 2001) of these benchmarks simply use the most popular applications within the market segments being focused upon, based on market share (sales).
With Winstone 98, some multi-tasking was added, which was then expanded upon in later versions. Today, both Business Winstone and Content Creation Winstone open multiple applications and switch back and forth between them in an attempt to emulate a real-world workflow. With multiple applications running simultaneously, it is difficult to discern what portion of the resource usage should be attributed to which application, which is why only aggregate scores are reported now. Though techniques exist in the high-end world to measure such things, the cost to develop such technology for the PC would likely be prohibitive. Timing each application individually would require hooks into the operating system to detect whenever a task switch occurs. With the advent of technologies such as multi-threading (e.g., Hyper-Threading), even this technique wouldn’t work. Suffice it to say that measuring individual applications in a multi-tasking environment on the PC does not appear feasible at this time.
The most recent modifications to the benchmark have been to eliminate idle time (i.e., the system waiting for user input) and to focus primarily upon the features that cause users to wait (‘hot spots’). For more details on what hot spots are, and on the general design of these benchmarks, see the Benchmark Insider articles.
It makes sense to measure the time the user is waiting rather than the time the computer is waiting, since the latter is really a measure of user performance rather than hardware/software performance. Though many argue that users just don’t need any more processing power in office applications, the fact remains that some functions will still cause the user to wait. This is particularly true when opening and saving large files, querying and sorting records, spell checking, searching and replacing text, and so on. It is precisely these types of operations that the Business and Content Creation Winstone benchmarks perform. It should be intuitively obvious that these operations are more dependent upon I/O performance than upon either memory or CPU, as has been discussed in other articles on this site.
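The principle of timing only the wait, not the idle or ‘think’ time, can be illustrated with a minimal sketch. This is not PC Magazine's harness; `fake_save` is a made-up stand-in for an expensive operation such as saving a large file.

```python
# A minimal sketch of hot-spot timing: wrap only the blocking
# operation that makes the user wait, so idle/think time between
# operations never enters the measurement.

import time

def time_hot_spot(operation, *args):
    """Return elapsed wall-clock seconds for one blocking operation."""
    start = time.perf_counter()
    operation(*args)
    return time.perf_counter() - start

def fake_save(n):
    # Stand-in workload for something like saving a large file.
    sum(i * i for i in range(n))

elapsed = time_hot_spot(fake_save, 100_000)
print(f"hot spot took {elapsed:.4f} s")
```

Summing such per-operation times across a script, while excluding the gaps where the script (or a real user) would merely be thinking, is the essence of the idle-time-elimination change described above.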
One could argue that most users perform these particular operations infrequently, which seems to be the basis for the belief that office applications don’t require more power. On the other hand, the greatest impact on productivity occurs when response time for any operation exceeds about one second. This is the point at which most people begin to get impatient or distracted. There have been many studies on the effect of response time on productivity, such as this IBM study. At about three or four seconds, users turn their focus to other things. What this means is that operations that cause users to wait more than one second constitute a much larger portion of overall ‘user time’ than their frequency might suggest. In a business environment, this translates into dollars, particularly when tens, hundreds, or even thousands of users are involved.
So, the only real issue is whether the applications in the benchmark reflect those used by readers of the publication running the benchmarks, and whether the features focused upon are the most commonly used. This should be the question asked of every benchmark being run. All too often, it seems that benchmarks are run simply because “everybody else runs them” or sometimes even because nobody else does. These are exactly the wrong reasons to run a benchmark.