TL;DR: qualitative data collected in a usability experiment seems to contradict the quantitative results of the SUS questionnaire. How can this discrepancy be reconciled?
The following experiment is conducted to evaluate the usability of a web interface:
- Observe participants as they think aloud while using the interface to accomplish 8 tasks (the task order is randomized; this takes around 30 minutes)
- Give them a SUS form to fill out (scoring is sketched after this list)
- After they complete the survey, ask several follow-up questions to get more feedback (another 30 minutes)
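For context, this is how a single SUS response is typically scored: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is multiplied by 2.5 to land on a 0–100 scale. A minimal sketch (the example responses below are hypothetical, not data from this study):

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten 1-5 Likert responses.

    Odd-numbered items (1-indexed) contribute (response - 1);
    even-numbered items contribute (5 - response); the sum is scaled by 2.5.
    """
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Hypothetical example response, not data from this study:
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```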
So far, the experiment has been conducted with 5 participants, after which the UI was adjusted to address the issues found. A second round of 5 participants was then invited to go through the same steps.
Another round with at least 5 participants is planned (to obtain a sufficiently large sample). The current results are summarized below:
You can see that the v2 score is lower than v1.
These findings are puzzling, because:
- the qualitative feedback I got from participants was more positive in v2
- the changes between v1 and v2 were not ground-breaking, e.g.:
- added tooltips to widgets
- increased the contrast to make the active tab more prominent
- changed wording to avoid technical jargon
- shortened text
- nevertheless, these tweaks did polish the "rough edges" of v1, as it was clear from the observations that there was less friction while participants used the site
In other words, the changes were small incremental steps that should have yielded small improvements. The qualitative results match the expectations, while the quantitative data do not.
Since the overall average of 69 falls in line with the average SUS score of 68, it seems that nothing unusual has happened and we're testing "just an average interface". However, I am not sure how to reconcile the fact that the numbers contradict the human feedback.
Nielsen says that qualitative feedback is more valuable and that numbers can lead you astray. On the other hand, Sauro says that they do report SUS scores based on a sample of 5 users (and, looking at the history of sample sizes, concludes that a minimum of 5 is reasonable).
At the same time, a t-test says that the differences between the scores of v1 and v2 are not statistically significant.
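For illustration, a minimal sketch of how such a two-sample t-test could be run on the per-participant SUS scores (the arrays below are hypothetical values, not the actual v1/v2 data):

```python
from scipy import stats

# Hypothetical per-participant SUS scores; substitute the real v1/v2 values.
v1_scores = [72.5, 65.0, 77.5, 70.0, 67.5]
v2_scores = [62.5, 70.0, 60.0, 72.5, 65.0]

# Welch's t-test (does not assume equal variances between the two groups).
t_stat, p_value = stats.ttest_ind(v1_scores, v2_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```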
How could one make sense of these results?
Thank you all for your comments, answers, and time. Although there is only one accepted answer, all the input is helpful. It enabled me to take a sober look at the data and lower the "jumptoconclusionness" factor.
A note for future archaeologists: the question was edited to include details and statistics mentioned in the comments. It might help to look at the edit history to see the starting point and understand how it ended up like this.
Answer
How can this discrepancy be reconciled?
You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p-values, etc.?)
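The descriptive part is straightforward to compute for each version's scores; a minimal sketch for one group (again using hypothetical SUS scores, not the study's data):

```python
import math
from statistics import mean, stdev

from scipy import stats

scores = [72.5, 65.0, 77.5, 70.0, 67.5]  # hypothetical v1 SUS scores
n = len(scores)
m = mean(scores)
sd = stdev(scores)            # sample standard deviation
sem = sd / math.sqrt(n)       # standard error of the mean

# 95% confidence interval using the t distribution (appropriate for small samples).
t_crit = stats.t.ppf(0.975, df=n - 1)
margin = t_crit * sem
print(f"mean = {m:.1f}, SD = {sd:.1f}, "
      f"95% CI = ({m - margin:.1f}, {m + margin:.1f})")
```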
Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.
In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.
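As one possible illustration, the assignment of participants to designs and task orders can be randomized up front and kept away from the session facilitator; a minimal sketch under assumed placeholder names (design labels, participant IDs, and the fixed seed are all illustrative):

```python
import random

designs = ["A", "B", "C"]                      # placeholder design labels
tasks = list(range(1, 9))                      # the 8 tasks
participants = [f"P{i:02d}" for i in range(1, 16)]

random.seed(42)  # fixed seed so the allocation can be audited and reproduced

# Balanced random assignment: each design gets the same number of participants.
assignments = designs * (len(participants) // len(designs))
random.shuffle(assignments)

allocation = {}
for participant, design in zip(participants, assignments):
    task_order = random.sample(tasks, k=len(tasks))  # randomized task order
    allocation[participant] = {"design": design, "task_order": task_order}

# The person running each session would see only the participant ID, not the
# design label, which helps preserve blinding.
```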
How could one make sense of these results?
You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.
The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?
Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)
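To get a feel for how large n would need to be, a standard power calculation shows the required group size growing rapidly as the expected effect shrinks; a minimal sketch using statsmodels (the Cohen's d values are illustrative assumptions, not measured effects):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Illustrative effect sizes: large, medium, small, and tiny.
for effect_size in (0.8, 0.5, 0.2, 0.05):
    n_per_group = analysis.solve_power(effect_size=effect_size,
                                       alpha=0.05, power=0.8)
    print(f"d = {effect_size}: ~{n_per_group:.0f} participants per group")
```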
What to do... ?
Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.
Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as they are a source of bias that can lead you astray. Redesign experiments to reduce bias.
... it seems the confidence intervals are so wide that the intermediate results I got should not be a cause for concern.
That is as "expected".