Comparison Between Winstone and SYSmark 2001 Methodologies
Both the Winstone benchmarks and SYSmark 2001 attempt to measure application response time. In addition, Winstone attempts to measure application startup time (to some degree) and task switching time. It is unclear whether SYSmark 2001 measures application startup time, and it appears that task switching time is not measured. The Winstone benchmarks start a timer, perform a series of actions (starting an application, entering data and performing application-specific tasks), and then stop the timer at the end of the sequence. SYSmark 2001, according to the FAQ and White Paper on the BAPCo website, measures only the response to the application-specific tasks. The timer does not appear to be running when tasks are swapped into the foreground or when applications are started. Admittedly, the information provided does not directly address this question, so my conclusion is based only on the wording of the documents mentioned.
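The difference between the two timing scopes can be sketched in a few lines. This is only an illustration of the distinction as described above, not how either benchmark is actually implemented: the phase names, the `kind` labels and all durations are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str       # hypothetical description of the scripted step
    kind: str       # "startup", "input", "switch" or "task" (labels assumed here)
    seconds: float  # illustrative duration, not a measured value


def whole_sequence_time(phases):
    """Timer runs from the first action to the last (the Winstone-style scope)."""
    return sum(p.seconds for p in phases)


def task_only_time(phases):
    """Timer runs only during application-specific tasks (the scope the
    SYSmark 2001 documents appear to describe)."""
    return sum(p.seconds for p in phases if p.kind == "task")


script = [
    Phase("launch word processor", "startup", 2.0),
    Phase("type a paragraph", "input", 1.5),
    Phase("run spell check", "task", 0.5),
    Phase("switch to spreadsheet", "switch", 0.25),
    Phase("recalculate sheet", "task", 1.0),
]

print(whole_sequence_time(script))  # 5.25
print(task_only_time(script))       # 1.5
```

With these invented numbers, the task-only scope sees 1.5 of the 5.25 seconds the user actually sits through; startup and task switching never reach the score.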
In the responses received regarding the evaluation of SYSmark 2001, one of the most commonly cited reasons why the benchmark is a more accurate measure of ‘real world’ performance than Winstone was the elimination of ‘user time’. The Winstone benchmarks include the keystrokes used to respond to application requests as part of the performance measurement. This tends to ‘dilute’ the measured performance of the system, because the system is essentially idle while the keystrokes are entered. Some mentioned that the keystrokes are ‘unnaturally fast’, so that no real user could ever enter the data so quickly, but this should be a positive, not a negative. The faster the data is entered, the less time the system spends waiting for data. Therefore, if a benchmark is going to include the user keystrokes, they should be entered as quickly as possible. I do agree, however, that ideally a benchmark should eliminate the user wait time entirely, which is what SYSmark 2001 claims to do.
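The dilution effect is simple arithmetic, and a small sketch makes it concrete. The function name and all timings below are hypothetical, chosen only to show the trend: the same fixed input time is added to both systems' totals, so it pulls the measured ratio toward 1.

```python
def measured_speedup(slow_work, fast_work, input_time):
    """Ratio of total timed durations when the same keystroke time is
    included in both runs. All arguments are in seconds."""
    return (slow_work + input_time) / (fast_work + input_time)


# Assume one system needs 4 s of real work and another needs 2 s (truly 2x faster).
print(measured_speedup(4.0, 2.0, 6.0))  # 1.25 with slow, human-paced input
print(measured_speedup(4.0, 2.0, 0.5))  # 1.8 with very fast scripted input
print(measured_speedup(4.0, 2.0, 0.0))  # 2.0 with input time excluded
```

Under these assumptions, the faster the scripted keystrokes, the closer the reported difference comes to the true one, and removing input time from the measurement recovers it exactly.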
The other issue mentioned frequently was the ‘think time’ that I believe causes the benchmark to be skewed. Those who defended the practice claimed that it is an integral part of the user experience (which is exactly what the BAPCo FAQ says). Unless I misunderstand what BAPCo means by ‘think time’, I tend to disagree for several reasons. When I am switching between applications, the time spent waiting for the requested application to become ‘active’ is very definitely a part of my experience. Depending upon what else is happening, this might be anywhere from instantaneous to several seconds. If this time is not measured, the benchmark gives an unrealistic view of overall system performance from the user's perspective. My interpretation of ‘think time’, as written in the BAPCo documents, is that when the system responds the user will spend a few moments thinking about what to do next. My experience, as well as published studies, has shown that this is simply not the case.
In the article on the Winstone benchmarks, I referenced an IBM study done in 1982, which shows that productivity decreases as application response time increases. I believe that the key phrase from the IBM article is found in the second paragraph, where it states “In fact, at one time it was thought that a relatively slow response, up to two seconds, was acceptable because the person was thinking about the next task. Research on rapid response time now indicates that this earlier theory is not borne out by the facts: productivity increases in more than direct proportion to a decrease in response time.” When you look at the graphs presented, you can see that as system response time decreases, the user response time decreases as well, even when system response time is well below a half-second. Note that in the IBM article, user response time is defined as “… the time span between the moment a user receives a complete reply to one command and enters the next command. People often refer to this as think time.” It is my interpretation that this includes the time spent entering the data before hitting the enter key to submit the next request, so it may not reflect the user thinking before performing the next action.
It is important to recognize that this study was based on an application called CICS, which would not need to be started by the user, and would be dedicated to the terminal (i.e., no application switching). Therefore, this study was focused entirely upon transaction processing within a single application. I believe a task switching scenario would result in similar findings, where delays caused by system overhead when swapping to a new application would cause a slowdown in user response time.
Consider any activity in which you switch between two or more different tasks. Generally, by the time you physically begin the new task, you have already thought about what you are going to do, so your activity on it begins immediately. When monitoring my own system activity, this is exactly how I work. In some cases, there is almost no ‘think time’ at all, particularly if I am simply checking on the progress of a background task. If the task switch takes more than a fraction of a second, my concentration will be affected and I will likely not be ‘ready’ when the new task becomes active. Therefore, it is my contention that ‘think time’ is only appropriate when the system is already slow. Furthermore, it should not be the role of the benchmark to make assumptions about how the user will perform; it should only measure how the system performs.
My interpretation of the BAPCo ‘think time’ implementation is that a user will spend some time thinking after the system responds before actually performing another action. The results of the IBM study seem to show that the faster the system responds the less time the user will actually spend ‘thinking’, with possibly no real think time at all.
To sum this up, I believe that the BAPCo methodology is better than the Winstone methodology when it comes to user input. However, I believe that simply turning off the timer during the user input phase is all that is necessary, and that the addition of ‘think time’ skews the benchmark results. Unless and until some compelling evidence is provided to prove that task switching and application startup time are included in the SYSmark 2001 measurements, and that adding in user think time is necessary (or that my interpretation of this is incorrect), I still have to stick with my assessment that SYSmark 2001 is an application performance tool, not a system performance tool. The Winstone benchmarks do have some limitations; however, it appears to me at this time that their methodology is more appropriate for measuring overall system performance.
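The approach suggested above, stopping the timer only while input is being entered, could look roughly like the following. This is a minimal sketch and not taken from either benchmark; the `PausableTimer` class and the fake clock values are hypothetical, and the clock is injected so the example is repeatable.

```python
import time


class PausableTimer:
    """Accumulates elapsed time only while running (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock       # injectable so a fake clock can be used
        self._elapsed = 0.0
        self._started_at = None   # None means the timer is paused

    def resume(self):
        if self._started_at is None:
            self._started_at = self._clock()

    def pause(self):
        if self._started_at is not None:
            self._elapsed += self._clock() - self._started_at
            self._started_at = None

    @property
    def elapsed(self):
        if self._started_at is None:
            return self._elapsed
        return self._elapsed + (self._clock() - self._started_at)


# Fake clock readings in seconds, invented for the example.
ticks = iter([0.0, 2.0, 5.0, 6.0])
timer = PausableTimer(clock=lambda: next(ticks))

timer.resume()  # t=0.0: application startup begins, timer running
timer.pause()   # t=2.0: startup done, scripted keystrokes begin
timer.resume()  # t=5.0: input finished, application task begins
timer.pause()   # t=6.0: task complete

print(timer.elapsed)  # 3.0: startup and task are counted, the 3 s of input is not
```

Startup and task switching stay inside the timed region, the keystrokes fall outside it, and no artificial think time is added.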