Unreliable observations?
A common way to test usability is to ask a participant to complete a task (such as "submit a support ticket on this website") and evaluate their experience.
Some common metrics to evaluate participant experience on a task include:
Success rate (success/fail/partial success)
Time to completion (in minutes or seconds)
Level of difficulty (easy/medium/difficult)
Lostness (derived from the number of screens visited during the task compared with the most direct path; see the sketch below)
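To make the lostness idea concrete, here is a minimal sketch of one common formulation (Smith, 1996), which compares the screens a participant actually visited to the most direct path. The function name and the example numbers are illustrative, not data from a real study:

```r
# A minimal sketch of one common lostness formulation (Smith, 1996).
# The argument names are illustrative:
#   n_unique: number of unique screens the participant visited
#   n_total:  total number of screens visited, counting revisits
#   n_min:    minimum number of screens needed to complete the task
lostness <- function(n_unique, n_total, n_min) {
  sqrt((n_unique / n_total - 1)^2 + (n_min / n_unique - 1)^2)
}

# A score of 0 means the participant took the most direct path; scores
# closer to 1 suggest they wandered. For example, 10 screen views across
# 7 unique screens on a task that needs only 4 screens:
lostness(n_unique = 7, n_total = 10, n_min = 4)  # ~0.52
```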
One challenge with these metrics is knowing WHY a task was difficult or failed.
Success or failure in a task can be due to many factors, such as:
System bugs
Unclear instructions
(Un)familiarity with a product or service
Workarounds
Another challenge is reliability.
With 1 person rating tasks, the ratings may reflect that person's bias. With 2 or more raters, there can be bias as well as disagreement between raters.
Combining nuance and reliability: A case study
The study that presented this challenge
We conducted a usability study of a service portal that federal employees use to request equipment and repairs.
To learn more about how difficult common tasks were in the portal for users, we asked our participants to complete the following tasks:
Requesting copies
Mailing printed materials
Ordering dual monitors
Requesting a ceiling light repair
Checking the status of a submitted request
How to evaluate tasks?
We wanted a rubric that evaluated task success as well as the types of errors, barriers, and feedback that users might have in the process.
For this study, we used a 5-point rubric that distinguished between task failure due to cognitive load and task failure due to a fatal system bug.
Future iterations of our rubric will also differentiate between errors and bugs.
Measuring reliability
We incorporated an interrater reliability (IRR) statistic into our analysis, which is a way to measure agreement between 2 or more raters on the same task with the same rubric.
To do this, 3 researchers independently rated each task, and then I used the irr package in R to calculate Cohen's Kappa (an IRR statistic for ratings on a nominal scale).
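As a minimal sketch of what that calculation can look like with the irr package (the ratings below are invented for illustration and the column names are arbitrary; this is not our actual data or analysis script):

```r
library(irr)

# Illustrative ratings only: one row per participant task,
# one column per researcher's score on the 5-point rubric.
ratings <- data.frame(
  rater_1 = c(3, 4, 3, 4, 3),
  rater_2 = c(3, 4, 4, 4, 3),
  rater_3 = c(3, 3, 4, 4, 3)
)

# Simple percent agreement across all raters
agree(ratings)

# Cohen's kappa compares one pair of raters at a time
kappa2(ratings[, c("rater_1", "rater_2")])

# With three or more raters on a nominal scale, Fleiss' kappa (or Light's
# kappa, the average of all pairwise Cohen's kappas) is a common choice
kappam.fleiss(ratings)
kappam.light(ratings)
```

Whichever statistic fits your design, the input is simply a table with one row per rated task and one column per rater.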
If you're curious about common types of IRR statistics, I created a flow chart for researchers to reference.
Empirically-backed suggestions for implementation
Our results showed that all tasks presented barriers or required workarounds for our participants.
All of our task ratings were in the following two categories:
(3) Task is completed by the user with minimal difficulty or obstacles (can be an error, bug, or some confusion). User is able to recover and complete the task; may have feedback or input on ways to improve the task.
(4) Task is difficult for the user; due to cognitive load, confusion, or frustration (but not an error or bug), the user finds an alternative way to complete the task.
These results, along with the qualitative feedback from participants, provided us with quantitatively-backed recommendations (for example, removing redundancies or supporting users through errors) to improve user experience on a service portal used by over 6000 federal employees.
We also learned about reliable rating as a team
The IRR measurements told our research team how much agreement we had when assessing the tasks participants attempted. This meant that any cases of low agreement needed to be re-rated or arbitrated by an additional rater. Without the IRR measurements, we would not have known how faithfully our team was using our rubric, which could have skewed the recommendations we made to improve the service portal.
See this article for more information about agreement thresholds.
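As one common point of reference (the article above may recommend different cut points), the Landis and Koch (1977) benchmarks are often used to read kappa values. Here is a small helper sketch using those benchmarks; the function name is made up for illustration:

```r
# A rough helper based on the Landis & Koch (1977) benchmarks, one common
# rule of thumb for interpreting kappa values.
interpret_kappa <- function(kappa) {
  cut(kappa,
      breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1),
      labels = c("poor", "slight", "fair", "moderate",
                 "substantial", "almost perfect"))
}

interpret_kappa(c(0.35, 0.72, 0.85))  # fair, substantial, almost perfect
```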
Want to learn more?
I gave a talk on this topic to a group of federal employees and contractors
I made an Excel template to calculate different IRR statistics
I also have an R syntax flowchart for IRR statistics