Design and Methodology
While BAPCo publishes few details about how the benchmarks are designed, it does provide a White Paper and a SYSmark 2001 FAQ to answer basic questions and give an overview of the design. The suite uses pre-recorded scripts to execute the same keystrokes with the same data on every run, ensuring that each run performs exactly the same sequence of events. The intent is to emulate what a ‘typical user’ might do throughout the day, and the workload is broken up into Office Productivity applications and Internet Content Creation applications.
In SYSmark 2000 and prior suites, each application was run sequentially and the execution time recorded, along with a ‘score’ based upon the time. After all individual scores were determined, an aggregate score was calculated using an undisclosed method. The resulting report showed not only the aggregate score, but the actual scores of the individual applications as well. These individual scores allowed a tester to swap components and determine the effects of such changes by looking at how individual applications responded, but provided no indication of how the system would perform in a multi-tasking environment.
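The scoring scheme described above can be sketched roughly as follows. The actual aggregation method is undisclosed, so this sketch assumes a geometric mean (a common choice for combining benchmark ratios); the application names, reference times, and the convention that a run matching the reference time scores 100 are all invented for illustration.

```python
from statistics import geometric_mean  # Python 3.8+

# Hypothetical reference (baseline) times for each application, in seconds.
REFERENCE = {"word": 120.0, "excel": 90.0, "photoshop": 300.0}

def app_score(reference: float, measured: float) -> float:
    # A run faster than the reference scores above 100, a slower run below.
    return 100.0 * reference / measured

def aggregate(measured_times: dict) -> tuple:
    """Score each application individually, then combine the scores.

    SYSmark 2000 reported both the individual scores and the aggregate;
    the real combining method is undisclosed, geometric mean is assumed.
    """
    scores = {app: app_score(REFERENCE[app], t)
              for app, t in measured_times.items()}
    return scores, geometric_mean(scores.values())

# Example: one app faster than reference, one equal, one slower.
scores, overall = aggregate({"word": 100.0, "excel": 90.0, "photoshop": 360.0})
```

The point of reporting the individual scores, as the text notes, is that a tester can swap a component and see exactly which applications responded to the change, something the aggregate number alone cannot show.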
SYSmark 2001 now opens multiple applications and switches back and forth between them in an attempt to emulate a real-world workload. According to the SYSmark 2001 FAQ, “SYSmark 2001 operates in a multitasking environment where many applications execute concurrently. In this scenario the performance of individual applications cannot be captured as they operate concurrently with other applications. Hence the individual application scores are not shown.” What I find most interesting, however, is that the measurement methodology doesn’t appear to be consistent with the stated intent of determining performance in a multi-tasking environment.
According to the White Paper, the fundamental unit of measurement is ‘Response Time’, defined as “the latency experienced between the submission of a request by the user and the completion of the processing of that request by the application”. This is done to eliminate ‘user think time’ from the measurement, whether that is the entry of data, movement of the mouse pointer or some other user delay. If that were the end of it, it would be desirable, since we want to measure system performance, not user performance. However, later in the document we are told that up to one second of ‘think time’ is added between operations to create a more realistic test. It goes on to say “Operating system behavior is more realistic when application interaction has think times (just like a real user) as the OS can devote itself to other book keeping activities (like memory management, scheduling etc).” In other words, the background tasks that constitute the overhead of multitasking will not be measured! This method of measurement seems to belie the claim that individual application measurements cannot be captured; more likely, it would simply be additional work and overhead to maintain a running total of the times. Furthermore, the swapping/paging activity performed by the operating system apparently will not be measured either. We can verify whether this is the case a bit later in the ‘Profiling’ section.
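The measurement gap argued above can be made concrete with a small sketch. This is not BAPCo’s code; the function names and the timing loop are assumptions, but the structure follows the White Paper’s description: only the interval between submitting a request and the application completing it is timed, while the inserted think time, and any OS housekeeping, paging, or scheduling work that occurs during it, never enters the total.

```python
import random
import time

def run_operation(op) -> None:
    """Stand-in for replaying one scripted operation (a keystroke,
    a filter, a save). Only this part is timed."""
    op()

def benchmark(operations, max_think: float = 1.0) -> float:
    """Sum per-operation response times, excluding inserted think time.

    The sleep models the up-to-one-second 'think time' the White Paper
    describes. The stopwatch is stopped while it runs, so background
    activity the OS performs during that window is invisible to the
    reported result.
    """
    total_response = 0.0
    for op in operations:
        start = time.perf_counter()
        run_operation(op)
        total_response += time.perf_counter() - start  # response time only
        time.sleep(random.uniform(0.0, max_think))     # think time: unmeasured
    return total_response
```

Note that the loop necessarily accumulates a running per-operation total, which is what makes the FAQ’s claim, that individual application times cannot be captured, hard to credit.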