A given project requires SUS scores for the current version of a piece of software, and for the next version once it is ready, so that quantitative results can demonstrate any usability improvements.
If you're not familiar with SUS, it's the System Usability Scale.
Should I use a completely different user sample for subsequent tests?
Is there evidence that users giving repeated feedback on iterations of a system show bias compared to "fresh" users?
Answer
Pro
There are analytical advantages to re-using users. If each user tries both the current and the revised software, then you can run your inferential statistics on the difference in each user’s SUS scores (e.g., with a paired t-test). Such a repeated-measures (a.k.a. within-subject) experimental design lets you detect smaller changes in the average SUS score than a “fresh” sample for each version would, for the same number of people trying each version and the same level of significance.
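To make that concrete, here’s a minimal sketch in Python (assuming scipy is available; the SUS scores below are made-up numbers, purely for illustration). It runs the paired test on each user’s two scores alongside the independent-samples test you’d be limited to with fresh samples, so you can see how the pairing buys you sensitivity.

```python
# A minimal sketch of the repeated-measures analysis.
# The SUS scores below are hypothetical, for illustration only.
from scipy import stats

# Each position is one user: their SUS score on the current version
# and their SUS score on the revised version.
sus_current = [62.5, 70.0, 55.0, 67.5, 72.5, 60.0, 65.0, 57.5]
sus_revised = [70.0, 72.5, 65.0, 75.0, 77.5, 62.5, 72.5, 65.0]

# Repeated-measures comparison: a paired t-test on the per-user differences.
# Pairing removes between-user variance, which is why it can detect smaller
# average changes than an independent-samples test on two separate groups
# of the same size.
paired = stats.ttest_rel(sus_revised, sus_current)
independent = stats.ttest_ind(sus_revised, sus_current)

print(f"Paired t-test:      t = {paired.statistic:.2f}, p = {paired.pvalue:.4f}")
print(f"Independent t-test: t = {independent.statistic:.2f}, p = {independent.pvalue:.4f}")
```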
Users may remember how they rated the first version they saw, but that isn't a problem. If they honestly think the second version was better, they'll remember what they marked before, and raise it a bit. If they think the second version was worse, they'll drop it a bit. That's what you want them to do.
Big Con
However, if all users first try the current version and then try the revised version, you are confounding exposure with the version. Unless the revised version is a complete redesign that bears no resemblance to the current version, you can expect users will on average perform better on the revised version simply because they’ve learned some of the things the two versions have in common, making the second time through easier. Users may “mistake” this easier second try at the task for the revised version being more usable by design, and thus rate the revised version better.
I don’t know of any studies showing that SUS is or isn’t responsive to exposure, but it hardly matters. Even if there were a study showing no effect of exposure for a particular software application, I’d still be suspicious that your software application with the changes you make could be susceptible to an exposure effect. In experimental methodology, confounds are generally considered to be a fatal flaw.
Solution
The solution to the exposure problem is to "counter-balance" with a cross-over design: randomly select exactly half the users in your sample to do the current version first, and have the other half do the revised version first. Now any exposure effects cancel out when you subtract each user's current-version SUS score from his/her revised-version SUS score*. Obviously, you need a completed revised version before you can do this, and it doesn’t work so well with repeated iterations. However, if this is a summative usability test of the final product (something SUS is very good for), then you can do it.
*You can also compare the SUS scores on the first run against the second run and see if there’s a difference. In that analysis, the effects of the version cancel out, and you’ll see if, in fact, SUS is responsive to mere exposure to your software, thereby answering your question.
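Here’s a rough sketch of how the counter-balanced design and both analyses might look in code. The user IDs and SUS scores are hypothetical placeholders (generated randomly just so the script runs), and scipy is assumed for the tests.

```python
# A sketch of the counter-balanced (cross-over) design and the two analyses
# described above. All data here are hypothetical placeholders.
import random
from scipy import stats

users = [f"user_{i:02d}" for i in range(1, 17)]

# Counter-balancing: randomly assign exactly half the sample to each order.
random.shuffle(users)
half = len(users) // 2
order = {u: ("current_first" if i < half else "revised_first")
         for i, u in enumerate(users)}

# Suppose each user's two SUS scores are recorded after both sessions as
# scores[user] = {"current": ..., "revised": ...}.
# (Filled with made-up numbers here just to make the sketch runnable.)
scores = {u: {"current": random.gauss(65, 8), "revised": random.gauss(70, 8)}
          for u in users}

# Analysis 1: the version effect. Because the order is counter-balanced,
# any exposure effect cancels out of these per-user differences.
version_diffs = [scores[u]["revised"] - scores[u]["current"] for u in users]
print("Version effect:", stats.ttest_1samp(version_diffs, 0.0))

# Analysis 2 (the footnote): the exposure effect. Comparing first runs
# against second runs cancels the version effect instead, so a reliable
# difference here would mean SUS responds to mere exposure.
first_run = [scores[u]["current"] if order[u] == "current_first"
             else scores[u]["revised"] for u in users]
second_run = [scores[u]["revised"] if order[u] == "current_first"
              else scores[u]["current"] for u in users]
print("Exposure effect:", stats.ttest_rel(second_run, first_run))
```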