[738 words] Benchmarking interactive systems

By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), February 21, 2022 8:51 am
Room: Moderated Discussions
While it seems somewhat straightforward to test a server's interactive performance by connecting a request-generating computer (generating a realistic request stream or a limit-testing stream of requests would be somewhat challenging, but there is probably enough experience to do this reasonably well), simulating a human's responses seems more challenging.
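As a rough illustration of the two kinds of request streams, here is a sketch in Python (the function names and parameters are my own invention, not any real benchmark's API): an open-loop stream issues requests on its own schedule regardless of server speed, while a limit-testing closed-loop stream issues each request as soon as the previous response returns.

```python
# Sketch of two request-stream generators (illustrative only).
import random

def open_loop_stream(mean_interarrival_s, duration_s):
    """Realistic stream: requests arrive on a Poisson process,
    independent of how fast the server responds."""
    t = 0.0
    while True:
        t += random.expovariate(1.0 / mean_interarrival_s)
        if t >= duration_s:
            return
        yield t  # timestamp at which a request should be issued

def closed_loop_stream(n_requests):
    """Limit-testing stream: issue request i as soon as the
    response to request i-1 returns (back to back)."""
    for i in range(n_requests):
        yield i
```

The open-loop form is usually considered more representative of real users, since real users do not politely slow down when the server does.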

For an ordinary desktop computer, one might be able to rig a testing device that attaches to a monitor output and two USB inputs. Keyboard input would seem to be relatively straightforward to simulate (though with autosuggestion, typing rhythm and user focus/decisions might have a significant effect — on the other hand, even just the visual feedback of glyphs appearing seems to affect my typing); mouse movement seems likely to be more challenging, especially with cursor visibility/memory of location. "Reading" a screen also seems somewhat challenging even with scripted events (so the tester knows exactly what to seek) and fixed item forms.
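A keystroke-timing model for such a rig might look something like the following sketch (purely illustrative; a real rig would feed these timings into a USB gadget, and the delay constants here are guesses, not measured human data):

```python
# Toy model of "human" typing rhythm for replayed keyboard input.
import random

def keystroke_schedule(text, base_delay_s=0.12, jitter_s=0.04, seed=None):
    """Return (time, char) pairs with noisy inter-key delays,
    roughly mimicking a human typist rather than a script that
    injects all characters instantaneously."""
    rng = random.Random(seed)
    t = 0.0
    schedule = []
    for ch in text:
        # Clamp so delays stay positive and strictly ordered.
        t += max(0.01, rng.gauss(base_delay_s, jitter_s))
        schedule.append((t, ch))
    return schedule
```

Modeling the feedback loop the post mentions — typing rhythm changing in response to glyphs appearing (or not appearing) — would require conditioning these delays on observed screen state, which is exactly the hard part.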

For a touch-screen interface, the testing rig would seem to require much more mechanical action and use a camera rather than linking directly to a monitor output. Tablets and phones would also be more difficult to benchmark for power consumption as wired power is unlikely to be the common use case of interest. (Energy use benchmarking of systems seems difficult since ordinary differences in "the same" hardware can make results less representative.)

Accurately simulating human responsiveness is probably difficult, but I suspect there has been enough research that a simple model could be used for useful benchmark results. From the little I have read, there seem to be two delay thresholds: an "instantaneous" update for an expected event generates a rapid response; a slow update — where the user is waiting but perhaps not really conscious of it — has more "thinking" time; and a very slow update — where the user is more conscious of waiting, focus is diminished, and agitation is greater — can require significant "wake up" time (aside from being unpleasant). (One might note a third threshold where the response becomes "I will check later" or "ping me". This is not so bad if the delay is expected — a 'very slow update' seems worse to me, as one cannot reasonably do much in a few seconds and variability in delay is more apparent, whereas a constant three-second delay might allow for taking a deep calming breath. A random long delay with non-extreme probability is a different matter: a very fast or very slow update one time in ten thousand seems likely to be perceived as a glitch and to be ignored.)
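A simple model of that kind might be a piecewise function mapping system response latency to simulated user "think time" before the next action. The threshold and think-time values below are illustrative guesses, not numbers from the HCI literature:

```python
# Toy two-threshold model of user think time vs. response latency.
INSTANT_S = 0.1   # below this, the update feels instantaneous
SLOW_S    = 1.0   # above this, the user is conscious of waiting

def user_think_time(latency_s):
    """Simulated delay before the user's next input."""
    if latency_s <= INSTANT_S:
        return 0.3                    # rapid, pre-planned response
    elif latency_s <= SLOW_S:
        return 0.8                    # some re-orientation needed
    else:
        return 1.5 + 0.2 * latency_s  # "wake up" penalty grows with wait
```

Even a crude model like this changes what a benchmark rewards: a system that shaves latency below the "instantaneous" threshold shortens the whole interaction loop, not just its own response.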

Current benchmarking practice seems to be running back-to-back scripts. This seems unlikely to represent user experience and would penalize systems that provide incremental updates, which allow a user to evaluate a result before completion. Backtracking from ordinary user mistakes and UI-encouraged mistakes (like inputting into the wrong window due to focus stealing) will not be represented by a simple scripted benchmark; assigning values to bad results caused by the UI (e.g., "I saved over a file") might be satisfying, but generating reasonable general values seems difficult. I think SPEC CPU merely disqualifies inaccurate results (with some flexibility for FP, where order of evaluation and such might reasonably produce different results — penalizing for using single-rounding or double-rounding FMADD would seem particularly problematic), but that is intended, I think, simply to filter out bad compiler optimizations (though it doubtless has false negatives — where a language-invalid optimization is used but the result is close enough — and false positives — where a language-valid optimization is used but the result deviates too much from expectation).
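That style of output validation — accept small FP deviations, reject genuinely wrong answers — can be sketched with a relative/absolute tolerance check (the tolerance values here are made up, not SPEC's actual rules):

```python
# Sketch of tolerance-based result validation for FP outputs.
import math

def results_match(expected, actual, rel_tol=1e-4, abs_tol=1e-9):
    """Accept small deviations (e.g. from reassociation or a
    fused multiply-add) while rejecting wrong results."""
    if len(expected) != len(actual):
        return False
    return all(math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
               for e, a in zip(expected, actual))
```

The false negative/positive problem the post mentions falls straight out of this: the tolerance is a proxy for "valid optimization", and no single threshold separates the two cleanly.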

Doing all of this with support for different operating environments seems especially challenging. Using a common GUI library would doubtless bias results due to differing shim thickness/optimization, but it would also exclude UI effectiveness that is part of the system software. Even with such constraints, I think it would be interesting to have a more objective evaluation of different systems for personal computer use.

Going beyond benchmarking to system evaluation — where bottlenecks are identified and choices for improvement are made more clear — seems even more challenging.

This is not an area about which I have read much and I certainly have not done any research, but the concepts seem interesting enough that I hoped some Real World Technologies readers could use this post as a launchpad for insightful discussion about benchmarking and user experience.