Is it enough to test with 5 people? 🤔

"What sample size should I take for my user research?" - This is probably the most frequently asked question, especially when starting out.

It's especially true for qualitative research, where you're always worried that you won't get enough feedback to make the right decisions. And this is where one number in particular keeps coming up: the famous number 5!

Supposedly, 5 users are enough to learn everything we need to learn 🤔

Yet it's not that simple: 5 is not a magic number that you can pull out in any situation and under any conditions, and all experts agree on that. But this myth dies hard, and the community is struggling to get rid of it.

Today, let's separate fact from fiction and try to find a satisfactory answer to our question! To do so, we'll have to go into the mathematical foundations of the theory - if you're allergic to maths, don't panic, skip to the "To summarise" section ;)

The origins of the myth

In 1993, Jakob Nielsen (co-founder of the famous Nielsen Norman Group) and Thomas Landauer published "A mathematical model of the finding of usability problems".

This research paper was summarised in 2000 in the blog post "Why You Only Need to Test with 5 Users", which went on to become a worldwide phenomenon - and is in many ways considered to be the origin of the myth!

The conclusion of his article: 5 users can discover 85% of the problems of an interface.

Jakob Nielsen in one of his videos on this subject

Understanding the original paper

As with any scientific text, understanding the conclusion requires keeping in mind the assumptions and context on which it is based.

Framework and definitions

In their paper, Nielsen and Landauer focus solely on user testing and heuristic evaluations, the purpose of which they say is to compile as exhaustive a list as possible of usability problems that occur on an interface.

A new problem = an item added to the list.

There is never any question of severity, frequency or impact on user needs.

Moreover, these two methods are categorised as debugging methods.

Objective

The authors seek to find a way of optimising the costs of a usability study, by optimising the ratio of the number of problems reported to the number of tests performed.

This financial aspect is very important: they dedicate 2 pages out of 7 of their paper to it. Since 1989, Nielsen has been campaigning for Discount Usability, an iterative and "quick and dirty" approach to user research, at a time when research methods were generally subject to the statistical rigour inherited from academic research. In his view, the loss in reliability of the data collected is largely offset by the flexibility that a shorter, more iterative format brings - a notion not widely accepted by his peers in 1993.

Findings

Nielsen and Landauer rely on the law of diminishing returns. The idea is that if you test with two people, some of the problems they come up with are likely to be the same. If you test with a third person, ditto, they will probably bring up problems that have already been partly detected by either tester 1, tester 2 or both.

Therefore, the more testers you have, the more often you come across problems that have already been encountered, and the fewer new problems you discover.

They will therefore try to determine at what point it is no longer profitable to continue testing, because the number of problems reported will be too low in relation to the investment (both financial and in human time) that a test requires.

Mathematical model

Heads up, this is where we get to the heart of the matter! If you don't want to go into this level of detail, you can jump straight to our conclusions! I promise, we won't tell anyone 😉

This law of diminishing returns can be modelled according to this formula:

P = N(1-(1-L)^n)

where:

P is the number of problems found
N is the total number of problems in the interface
n is the number of users tested
L is the average proportion of problems found by a single user (if on average each user finds 20% of the problems on an interface, and misses 80%, then L=0.2).

Since L is an average, to increase the reliability of the model, the sample of users for testing must be more or less homogeneous, i.e. they must be representative of the same population.
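For readers wondering where this formula comes from, here is one way to arrive at it. This is only a sketch, and it assumes that every user detects any given problem with the same probability L, independently of the other users:

```latex
\begin{align*}
\Pr(\text{a given problem is missed by one user}) &= 1 - L \\
\Pr(\text{it is missed by all } n \text{ users}) &= (1 - L)^n \\
\Pr(\text{it is found by at least one user}) &= 1 - (1 - L)^n \\
P &= N \left(1 - (1 - L)^n\right)
\end{align*}
```

The last line simply multiplies the chance of finding any single problem by the total number of problems N, which gives the expected number of problems found.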

Result

The authors empirically suggest that L = 0.31 (31%), which gives the following graph:

Graphical rendering of the mathematical model

Indeed, according to these results, with n = 5, about 85% of the problems of an interface are found. Moreover, we can see that beyond 5, each additional tester reveals few new problems (only +5% with tester 6, +3% with tester 7, etc.).

In reality, the 1993 paper does not insist so much on the figure of 5, but rather on the model itself, which according to the authors should serve as a basis to help development teams predict the right number of tests to advance their project while optimising the ROI of user research.

It is mainly Nielsen who pushes the number 5, through his various articles, notably those published in 2000, 2009 and 2012, in which he explicitly states that testing with 5 users is enough.
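The percentages quoted in this section can be checked with a few lines of Python. This is only a sketch of the formula as stated in this article, using L = 0.31; the function names are ours, not the authors'.

```python
def proportion_found(n_users: int, L: float = 0.31) -> float:
    """Share of all problems found after n_users tests: 1 - (1 - L)^n."""
    return 1 - (1 - L) ** n_users


def marginal_gain(n_users: int, L: float = 0.31) -> float:
    """Extra share of problems contributed by tester number n_users."""
    return proportion_found(n_users, L) - proportion_found(n_users - 1, L)


# Print the cumulative share and the gain brought by testers 1 to 8.
for n in range(1, 9):
    print(f"tester {n}: cumulative {proportion_found(n):.1%}, "
          f"gain {marginal_gain(n):+.1%}")
```

With L = 0.31, the fifth tester brings the cumulative share to a little over 84% (the "85%" of the article, rounded), and testers 6 and 7 add roughly 5 and 3 points respectively.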

Problems

You may be thinking: "But then, I don't see the problem! The mathematical basis behind these statements seems solid!"

Yes and no! There are in fact two categories of problems that make this statement - which has become an axiom of the user research world over time - a myth:

those related to an over-simplification or misinterpretation of this theory
those linked to the very limits of Nielsen and Landauer's model

Problems of oversimplification

Remember your mathematics lessons. You must have been reprimanded by your teacher for trying to apply a theorem without first checking whether the conditions of the theorem apply.

Well, this is often what happens when you interpret the statement: testing with 5 users is enough!

Here are some common mistakes that we see:

1. Applying the 5-tester rule to the wrong methods

This mistake occurs when you forget that Nielsen talks about finding 85% of the problems in an interface during user testing.

The approach he is interested in is both evaluative and qualitative, and his results cannot be applied to either quantitative or exploratory methods.

Therefore, one should avoid:

❌ conducting interviews with only 5 people: in interviews, the objective is to understand the experience, motivations and needs of a population. The nature of the data is different, and covers much broader notions than "problems with an interface". It therefore makes sense that you can't cover an exploratory topic with just 5 people. Nielsen Norman Group itself says that interviews are not usability tests, and that five interviews are not enough. But how many are? There is no general rule, except that you should stop when you feel you are no longer learning anything new. This can happen after 10 interviews, just as it can happen after 50.

❌ running unmoderated tests with only 5 people: the objective and measures are not the same, so the rule of 5 does not apply. Where Nielsen's objective in qualitative testing is to obtain a list of problems, the objective in quantitative testing is rather to know how many users encounter a given problem, on what scale and with what impact, through measures of duration, frequency, etc. These measures need to be statistically reliable, so Nielsen's own recommendation is to go to 20 testers, or even 40, depending on the margin of error you are willing to accept.

❌ running questionnaires with only 5 people: even if this seems obvious to most, let's spell it out anyway! With a questionnaire, the aim is often to be able to extrapolate the results obtained from a sample to a larger population. The statistical significance of this sample then becomes essential. Fortunately, there are many online calculators that can help you determine the right sample size.
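To give a rough idea of what such sample size calculators compute, here is a sketch based on the classic sample-size formula for estimating a proportion (often attributed to Cochran), with an optional finite-population correction. The defaults (95% confidence via z = 1.96, worst-case proportion p = 0.5) are common conventions that we assume here; they are not recommendations from Nielsen, and the function name is ours.

```python
import math
from typing import Optional


def survey_sample_size(margin_of_error: float,
                       population: Optional[int] = None,
                       z: float = 1.96,
                       p: float = 0.5) -> int:
    """Respondents needed to estimate a proportion within +/- margin_of_error.

    z = 1.96 corresponds to a 95% confidence level (assumed default).
    p = 0.5 is the most conservative guess for the true proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        # Finite-population correction: small populations need fewer answers.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


print(survey_sample_size(0.05))                    # very large population
print(survey_sample_size(0.05, population=2000))   # user base of 2,000 people
```

Under these assumptions, a 5% margin of error calls for several hundred respondents, which is a very different order of magnitude from 5.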

🤔 Note: we have been talking about research methods applied to interfaces since the beginning. Does Nielsen's theory also apply to testing physical objects or services? Can we find 85% of the ergonomic problems of a seat, or of the friction encountered by a traveller at the airport, with 5 participants? Unfortunately, at the time of writing this article, we have not found any answers to this particular question! To be continued...

2. Testing with just any 5 people

Nielsen mentions this only fleetingly in his research paper and in his article, probably because it is a basic tenet of user research that he did not feel it necessary to repeat - however, he probably should have!

For user research to be valid (whatever the method), the users must be representative of the target of the product being tested.

So when we say "testing with 5 users is enough!" we mean actual users (or at least potential users), not "just anybody".

If you test your job search application with your parents who have been retired for 10 years, you can be sure that you won't get results that are as comprehensive or as relevant as if you were testing with people who are looking for a job right now.

💡 Finding users representative of your target audience sounds easier said than done? Let us help you! At Tandemz, this is our specialty! Don't believe us? Take a look at some of the past recruitments we've done!

3. Test with 5 users with very different profiles

Nielsen's model only really works if the users who test are representative of the same target.

Note: he doesn't explain exactly why in his text, but it must be because in his formula an important element is the variable L, the average proportion of problems encountered by a user. For this average to make sense, users must have comparable profiles and behaviour. Otherwise, it is like saying that a piece of fruit weighs on average 500g, without distinguishing apples from watermelons!

This point is often neglected, and it is not uncommon to see studies carried out with 5 people, each of whom should represent an entire target. In reality, if you have several targets (for example, an end-user target and an administrator target, or even a young target and an older target), you would have to test with 5 people from each target to really apply Nielsen's recommendations.

4. Test with 5 people and call it a day

As mentioned above, Nielsen's primary objective was to maximise the ROI of usability testing. Indeed, in the 1990s, he sought to push companies that had neither the means nor the time to do research to do it anyway - even if it meant being less rigorous about traditional methods. But on one condition: iterate!

In this same vein, Nielsen has always recommended doing multiple iterations of 5 tests. In fact, in a context where interfaces evolve very quickly, it becomes useless to draw up a very exhaustive list of usability problems for a product at once: it is likely that the development team will not be able to resolve it in its entirety anyway, making this list rapidly obsolete as the interface evolves. It is thus necessary to favour a continuous but less exhaustive monitoring, to be able to help the development teams' choices over time.

💡 Iterate quickly and easily with our Tandemz credit offers! Buying prepaid credits allows you to save money and accelerate your research cycles, while avoiding the multiplication of billing round trips. For more information, contact us!

So from the outset, the statement "5 users can discover 85% of the problems with an interface" comes with its share of asterisks, "ifs" and "buts", which means it is not universally applicable. Unfortunately, and as is all too often the case, the community has mostly retained a simplistic and incorrect version of the model.

But even if all these nuances and parameters are taken into account, is the Nielsen model really 100% reliable?

The limits of the model

Is L = 0.31 really universal?

As a reminder, Nielsen argues empirically that, on average, a user will find 31% of the problems in an interface. It is from this hypothesis applied to his model that the statement "5 users find 85% of the problems of an interface" is derived.

If in fact L=0.2 (so one user finds 20% of the problems), the figure for finding 85% of the problems increases from 5 to 9. That's almost double! And with 5 people, only 67% of the problems are found.
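To see how sensitive the result is to L, here is a small sketch that searches for the smallest number of testers reaching a given share of problems under the formula from the paper. The function names and the list of L values are illustrative assumptions.

```python
def proportion_found(n_users: int, L: float) -> float:
    """Share of all problems found after n_users tests: 1 - (1 - L)^n."""
    return 1 - (1 - L) ** n_users


def users_needed(target: float, L: float) -> int:
    """Smallest n such that 1 - (1 - L)^n >= target."""
    if not (0 < L < 1 and 0 < target < 1):
        raise ValueError("L and target must both be strictly between 0 and 1")
    n = 1
    while proportion_found(n, L) < target:
        n += 1
    return n


# Compare what 5 testers find, and how many testers 85% requires, as L varies.
for L in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"L = {L:.1f}: 5 users find {proportion_found(5, L):.0%}, "
          f"{users_needed(0.85, L)} users needed for 85%")
```

For L = 0.2 this reproduces the figures above: about 67% of the problems with 5 testers, and 9 testers to pass the 85% mark.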

However, this variable L actually depends on many factors:

the type of users (are they novices or rather used to this interface?)
the complexity of the interface being tested, and the scope of the test tasks
the level of iteration of the interface (indeed, if an interface has already gone through iterations of testing and improvement, theoretically this means that the most obvious problems have already been identified and fixed - only the more subtle and therefore less detectable problems remain)
the level of the assessor

So there is no reason to think that L=0.31 is really generalizable.

This means that in order to use Nielsen's formula in a truly reliable way, it would be necessary to be able to calculate L.

Except that, to calculate L, we need to know the total number of problems in the interface! A number that we don't know a priori, since the whole point of the usability testing process is to discover them.

Empirically, it doesn't quite work

Several studies that followed Nielsen and Landauer's work had the approach of testing a site with a certain number of users, and seeing what proportion of problems they would have really found if they had stopped at 5. Surprisingly, they rarely found 85%!

The one we found most interesting (as it would take too long to summarise them all) is the following:

In 2003, Faulkner published a research paper in which she conducted tests with 60 users. Then, using software, she drew 100 random sets for each of the set sizes 5, 10, 15 and 20 participants, to simulate what would have happened if she had only tested with these participants.

For the sets of 5, she notes that depending on which users happened to be in the set, she would have found between 55% and 100% of the problems - which is a huge variance! It should be noted, however, that on average, across the 100 sets of 5 users, she did find 85% of the problems with the interface.

For the sets of 10 users, it was rather between 82% and 100%, with an average of 95%.

And so on:

All the graphs for the sets of 5, 10, 15 and 20 testers

This study corroborates the Nielsen and Landauer model, but only on average! In a real field study, you only run one set of tests, so unfortunately you cannot count on landing on that average.
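The resampling idea is easy to reproduce on synthetic data. The sketch below does not use Faulkner's data: it invents 60 simulated users and 40 problems with made-up detection probabilities, then draws 100 random sets of each size and reports the minimum, mean and maximum share of problems found. All names and parameters are hypothetical.

```python
import random
from statistics import mean


def simulate_findings(num_users, detect_probs, rng):
    """Return one set of found problem indices per simulated user."""
    return [
        {i for i, p in enumerate(detect_probs) if rng.random() < p}
        for _ in range(num_users)
    ]


def resample_coverage(findings, set_size, num_sets, rng):
    """Share of the known problems found by each of num_sets random user sets."""
    known = set().union(*findings)  # problems found by the full panel
    if not known:
        raise ValueError("no problems were found by any user")
    coverages = []
    for _ in range(num_sets):
        subset = rng.sample(findings, set_size)  # users drawn without replacement
        coverages.append(len(set().union(*subset)) / len(known))
    return coverages


rng = random.Random(42)  # fixed seed so the run is repeatable
# Made-up detection probabilities: some problems are obvious, others subtle.
detect_probs = [rng.uniform(0.05, 0.6) for _ in range(40)]
findings = simulate_findings(60, detect_probs, rng)

for size in (5, 10, 15, 20):
    cov = resample_coverage(findings, size, 100, rng)
    print(f"sets of {size:2d}: min {min(cov):.0%}, "
          f"mean {mean(cov):.0%}, max {max(cov):.0%}")
```

The exact numbers depend entirely on the invented probabilities, so they should not be read as a replication. What the sketch illustrates is the gap between the mean and the minimum for small sets, which is the point of the study.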

The reality is therefore rather the following: with 5 users, you will find somewhere between 55% and 100% of the problems in your interface! It is up to you to decide whether this range is acceptable or not.

What about the severity of the problems?

Nielsen and Landauer's model does not consider the severity of a problem at all. Discovering 85% of the problems does not mean discovering 85% of the most serious or blocking problems!

One might think that the more blocking a problem is, the more obvious it is, and therefore the faster it will be seen. This was Virzi's theory in 1990 - and Nielsen and Landauer relied heavily on his work to create their model.

However, it has since been theorised that there is in fact no correlation between severity and discoverability of a problem.

The question of severity is therefore rather the other way round: if 5 users discover 85% of the problems, and there is a non-zero probability that among the remaining 15%, there are serious problems, is it really acceptable to stop there?

An example could be products with medical or navigation applications, where the slightest design or development flaw can potentially endanger the well-being or even the lives of their users.
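To put a number on this risk, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that serious problems are neither easier nor harder to detect than the others (the same L applies to them) and that problems are detected independently of one another. The number of serious problems is a hypothetical input.

```python
def prob_serious_problem_missed(n_users: int,
                                num_serious: int,
                                L: float = 0.31) -> float:
    """Probability that at least one of num_serious problems escapes all testers."""
    p_one_found = 1 - (1 - L) ** n_users   # chance a given problem is found
    return 1 - p_one_found ** num_serious  # chance that not all of them are found


for n in (5, 10, 15):
    risk = prob_serious_problem_missed(n, num_serious=3)
    print(f"{n} users, 3 serious problems: "
          f"{risk:.0%} chance of missing at least one")
```

Under these assumptions, with L = 0.31 and three serious problems in the interface, five testers leave roughly a 40% chance that at least one of them goes unnoticed.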

To summarise

The statement "Testing with 5 people is enough" is not completely wrong - it is just very imprecise, and above all, it comes with many limitations and conditions that are easy to forget!

The full statement should rather be: Testing with 5 users will find between 55% and 100% of the problems in an interface.

But beware, for this statement to be really valid, you need to:

✅ Place yourself in the context of usability testing of an interface.
❌ Do not try to apply it to exploratory interviews or quantitative methods!
✅ Test with 5 users in your product's target audience.
❌ With people your product is not targeted at, the rate of discovered problems is even lower.
✅ Test with 5 users with similar profiles and behaviors.
❌ With 5 users who are too different, the range of problems encountered becomes downright random.
✅ Ensure that each user finds about 30% of the problems in an interface.
❌ In practice, this figure is not only very difficult to verify, it is also highly variable. And the fewer problems each user finds, the more users you need for your tests!
✅ Set aside the notions of frequency or criticality of problems.
❌ After 5 tests, you may still have missed blocking or serious problems.

If not 5, then how many?

Unfortunately, the only real answer is: it depends!

Yes, there are so many factors that influence the quality, relevance and completeness of feedback that, in reality, the only way to know if you've done enough is... to have done enough to realise it! As with interviews, when you don't get any new feedback, you can probably consider that you've done enough.

And that's the problem: the lack of a definitive answer really makes you want to subscribe to this simple and memorable "5 users are enough" theory - and that's why the myth has spread so quickly!

But in the end, is this the right question to ask?

Nielsen's model is primarily a model for optimising the ROI of research. It was never really about how many people find how many problems, but rather about accepting that:

the more you test, the more it will cost you, both in time and money - and the cost of each newly discovered problem is not constant but grows roughly exponentially, because of the redundancy you will have during testing
the less you test, the less you will understand what needs to be improved in the product, and the more likely you are to miss problems that may turn out to be serious.

The trick is to find the right balance between research effort and completeness of results - and to accept the consequences of your choice.
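This trade-off can be sketched numerically. Assuming a fixed cost per test session and the formula from the paper, the cost of each newly discovered problem rises with every additional tester. The session cost and the total number of problems below are hypothetical values chosen for illustration, and the function names are ours.

```python
def expected_new_problems(n_users: int, total_problems: int, L: float) -> float:
    """Expected number of problems first discovered by tester number n_users."""
    # Difference between the model's totals for n and n - 1 testers.
    return total_problems * L * (1 - L) ** (n_users - 1)


def cost_per_new_problem(n_users: int, total_problems: int,
                         L: float, cost_per_session: float) -> float:
    """Cost of tester number n_users divided by the new problems they reveal."""
    return cost_per_session / expected_new_problems(n_users, total_problems, L)


TOTAL_PROBLEMS = 40        # hypothetical number of problems in the interface
DISCOVERY_RATE = 0.31      # L, the value suggested in the 1993 paper
COST_PER_SESSION = 300.0   # hypothetical cost of one test session

for n in (1, 3, 5, 8, 10, 15):
    cost = cost_per_new_problem(n, TOTAL_PROBLEMS, DISCOVERY_RATE, COST_PER_SESSION)
    print(f"tester {n:2d}: about {cost:,.0f} per newly discovered problem")
```

The absolute figures mean nothing outside this toy setup. The point is the shape of the curve: each extra session costs the same, but buys fewer and fewer new findings.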

Our advice? First define your priority and the level of acceptable risk associated with it! Ask yourself the following questions:

Can a problem with your product or user journey have a serious impact on your users? If yes, then prioritise completeness.
If not, do you have the budget and time to test with 15 people? 10 people? 5 people? If no, then stick to what you can test, and try to iterate as often as possible to compensate.
If yes, and these people report a lot of problems, will you have the development capacity to fix them quickly before your next test cycle? Then adjust the number of people targeted according to your development capacity.

📰 The idea that 5 users are enough is one of the most common mistakes in UX research, but there are many others! Discover them in our CPO podcast, "The most common mistakes in UX research", to learn how to recognize and avoid them!
