Unreliable observations?
A common way to test usability is to ask a participant to complete a task (such as "submit a support ticket on this website") and evaluate their experience.
Some common metrics to evaluate participant experience on a task include:
Success rate (success/fail/partial success)
Time to completion (in minutes or seconds)
Level of difficulty (easy/medium/difficult)
Lostness (derived from the number of screens visited during the task compared with the most direct path; see the sketch below)
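To make the lostness idea concrete, here is a minimal sketch of one common formulation (Smith, 1996), which compares the screens a participant actually visited to the most direct path. The function name and the example numbers are illustrative, not data from a real study:

```r
# A minimal sketch of one common lostness formulation (Smith, 1996).
# The argument names are illustrative:
#   n_unique: number of unique screens the participant visited
#   n_total:  total number of screens visited, counting revisits
#   n_min:    minimum number of screens needed to complete the task
lostness <- function(n_unique, n_total, n_min) {
  sqrt((n_unique / n_total - 1)^2 + (n_min / n_unique - 1)^2)
}

# A score of 0 means the participant took the most direct path; scores
# closer to 1 suggest they wandered. For example, 10 screen views across
# 7 unique screens on a task that needs only 4 screens:
lostness(n_unique = 7, n_total = 10, n_min = 4)  # ~0.52
```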
One challenge with these metrics is knowing WHY a task was difficult or failed.
Success or failure in a task can be due to many factors, such as:
System bugs
Unclear instructions
(Un)familiarity with a product or service
Workarounds
Another challenge is reliability.
With 1 person rating tasks, the ratings may reflect that person's bias. With 2 or more raters, there can be bias as well as disagreement between raters.
Combining nuance and reliability: A case study
The study that presented this challenge
We conducted a usability study of a service portal that federal employees use to request equipment and repairs.
To learn more about how difficult common tasks were in the portal for users, we asked our participants to complete the following tasks:
Requesting copies
Mailing printed materials
Ordering dual monitors
Requesting a ceiling light repair
Checking the status of a submitted request
How to evaluate tasks?
We wanted a rubric that evaluated task success as well as the types of errors, barriers, and feedback that users might have in the process.
For this study, we used a 5-point rubric that distinguished between task failure due to cognitive load and task failure due to a fatal system bug.
Future iterations of our rubric will also differentiate between errors and bugs.
Measuring reliability
We incorporated an interrater reliability (IRR) statistic into our analysis, which is a way to measure agreement between 2 or more raters on the same task with the same rubric.
To do this, 3 researchers independently rated each task, and then I used the irr package in R to calculate Cohen's Kappa (an IRR statistic for ratings on a nominal scale).
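As a minimal sketch of what that calculation can look like with the irr package (the ratings below are invented for illustration and the column names are arbitrary; this is not our actual data or analysis script):

```r
library(irr)

# Illustrative ratings only: one row per participant task,
# one column per researcher's score on the 5-point rubric.
ratings <- data.frame(
  rater_1 = c(3, 4, 3, 4, 3),
  rater_2 = c(3, 4, 4, 4, 3),
  rater_3 = c(3, 3, 4, 4, 3)
)

# Simple percent agreement across all raters
agree(ratings)

# Cohen's kappa compares one pair of raters at a time
kappa2(ratings[, c("rater_1", "rater_2")])

# With three or more raters on a nominal scale, Fleiss' kappa (or Light's
# kappa, the average of all pairwise Cohen's kappas) is a common choice
kappam.fleiss(ratings)
kappam.light(ratings)
```

Whichever statistic fits your design, the input is simply a table with one row per rated task and one column per rater.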
If you're curious about common types of IRR statistics, I created a flow chart for researchers to reference.
Empirically-backed suggestions for implementation
Our results showed that all tasks presented barriers or required workarounds for our participants.
All of our task ratings were in the following two categories:
(3) Task is completed by the user with minimal difficulty or obstacles (can be an error, bug, or some confusion). User is able to recover and complete the task; may have feedback or input on ways to improve the task.
(4) Task is difficult for the user; due to cognitive load, confusion, or frustration (but not an error or bug), the user finds an alternative way to complete the task.
These results, along with the qualitative feedback from participants, provided us with quantitatively-backed recommendations (for example, removing redundancies or supporting users through errors) to improve user experience on a service portal used by over 6000 federal employees.
We also learned about reliable rating as a team
The IRR measurements told our research team how much agreement we had when assessing the tasks participants attempted. This meant that any cases of low agreement needed to be re-rated or arbitrated by an additional rater. Without the IRR measurements, we would not have known how faithfully our team was using our rubric, which could have skewed the recommendations we made to improve the service portal.
See this article for more information about agreement thresholds.
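As one common point of reference (the article above may recommend different cut points), the Landis and Koch (1977) benchmarks are often used to read kappa values. Here is a small helper sketch using those benchmarks; the function name is made up for illustration:

```r
# A rough helper based on the Landis & Koch (1977) benchmarks, one common
# rule of thumb for interpreting kappa values.
interpret_kappa <- function(kappa) {
  cut(kappa,
      breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1),
      labels = c("poor", "slight", "fair", "moderate",
                 "substantial", "almost perfect"))
}

interpret_kappa(c(0.35, 0.72, 0.85))  # fair, substantial, almost perfect
```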
Want to learn more?
I gave a talk on this topic to a group of federal employees and contractors
I made an Excel template to calculate different IRR statistics
I also have an R syntax flowchart for IRR statistics