A lot of our analysis and reporting services work includes the calculation of mean scores. Most questionnaires have rating scales where respondents rate various attributes or statements from agree strongly to disagree strongly, or on some other scale. For customer satisfaction surveys, employee surveys, mystery shopping surveys and the like, almost all of the questionnaire may be made up of rating scales.
Let’s look at individual mean scores first
One of the most common things to do with these rating scales is to apply a scoring system so that mean scores can be calculated. For example, +2 for agree strongly down to -2 for disagree strongly, or 5 to 1, or 100 to 0. In practice, it is the same thing. Let's say you score the scale from +2 to -2, with agree strongly as +2, agree slightly as +1, neither agree nor disagree as 0, disagree slightly as -1 and disagree strongly as -2. Now, if you get a score of 0.0, what does it mean? At one extreme, it could mean that everyone neither agrees nor disagrees; at the other extreme, it could mean that half of the respondents agree strongly and half disagree strongly. Or, of course, somewhere in between. Depending on which it is, users of that data might adopt quite different marketing strategies. The data might be telling you something very different about your brand.
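To make the point concrete, here is a minimal Python sketch with made-up responses: two completely different sets of answers that both produce a mean of 0.0 on the +2 to -2 scoring.

```python
from statistics import mean

# Scoring: agree strongly = +2, agree slightly = +1, neither = 0,
# disagree slightly = -1, disagree strongly = -2
everyone_neutral = [0] * 10        # all "neither agree nor disagree"
polarised = [2] * 5 + [-2] * 5     # half agree strongly, half disagree strongly

# Both sets produce exactly the same mean score, hiding very different opinions
print(mean(everyone_neutral), mean(polarised))   # 0 0
```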
What can you do to show the distribution of values?
A standard deviation can be used to show how close responses are to the mean. A low standard deviation indicates that responses cluster near the mean, whereas a high standard deviation shows that answers are more scattered. The problem is that a numerical value may not be meaningful to the user of the data. Therefore, it is common to show a "top two box" and a "bottom two box" figure, i.e. the percentage of respondents who agree and the percentage who disagree. In many cases, it may be necessary to show "top box" data separately from "top two box" data. In a competitive market, it might be important that respondents strongly agree rather than just agree, for example. Similarly, in sensitive markets like car safety or health, agree slightly might not be good enough.
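A sketch of these measures in Python, using invented responses on a 5-to-1 scale:

```python
from statistics import pstdev

# Hypothetical ratings on a 5-point scale (5 = agree strongly ... 1 = disagree strongly)
responses = [5, 5, 4, 4, 3, 3, 2, 5, 4, 1]

std_dev = pstdev(responses)  # population standard deviation

# Box scores: the share of respondents at the top or bottom of the scale
top_box    = 100 * sum(r == 5 for r in responses) / len(responses)
top_two    = 100 * sum(r >= 4 for r in responses) / len(responses)
bottom_two = 100 * sum(r <= 2 for r in responses) / len(responses)

print(f"std dev {std_dev:.2f}, top box {top_box:.0f}%, "
      f"top two box {top_two:.0f}%, bottom two box {bottom_two:.0f}%")
```

A percentage like "60% top two box" is usually far more meaningful to a report reader than "standard deviation 1.28".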
Why you need to be careful looking at a mean of mean scores
The next calculation that is often carried out is an overall calculation across some or all of the rating scales. Even greater care is needed here. Let's say you have asked respondents to agree/disagree with five positive statements. It seems logical to get an overall rating by adding the five mean scores together and dividing by five. This, however, is a dangerous thing to do thoughtlessly and something that, I would suggest, is carried out far too often with insufficient thought.
Problem 1: Do the same people answer each rating scale?
If you have 500 people in your survey and all 500 people answer each rating statement, the number answering each statement is not a problem. However, it is not uncommon to have one or more statements that are not applicable to some respondents, or a statement for which a high number of respondents don't have an opinion. Such examples might be: "How did you find the flat bed in first class on this flight?", "How easy was it to install the new operating system on your computer?", "How did you rate the French fries with your meal?" In all those cases, only some respondents would answer the question, because they did not fly first class, or didn't install the operating system upgrade, or didn't have French fries with their meal. This means that some mean scores may be calculated on a different base size. When you average the mean scores of each statement, a mean score with a low base will have as much weight or influence as every other statement. Not only that, in the example of the first class passengers, they might get looked after exceptionally well and produce an unusually high score. Already, calculating a mean of mean scores may be dangerous.
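A small Python sketch of the flat-bed effect (the numbers are invented): one statement answered by only two well-looked-after first-class passengers pulls the mean of means well above what the answers as a whole support.

```python
from statistics import mean

# Hypothetical bases: 10 passengers rate check-in, but only the 2 who flew
# first class rate the flat bed - and they were looked after exceptionally well
check_in = [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]   # base 10, mean 3.5
flat_bed = [5, 5]                           # base 2,  mean 5.0

# The statement with a base of 2 gets equal weight to the one with a base of 10
mean_of_means = mean([mean(check_in), mean(flat_bed)])  # 4.25

# Pooling every individual answer tells a different story
pooled_mean = mean(check_in + flat_bed)                 # 45 / 12 = 3.75

print(mean_of_means, pooled_mean)
```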
Problem 2: Are some of the statements asking the same thing?
The next problem often arises unwittingly. Let's take a slightly extreme example. Say there are three rating statements about a fast food outlet. The first statement asks if the food is tasty, the second asks if the food is good value for money and the third asks if the food is nice to eat. In this rather blatant example, 'tasty' and 'nice to eat' are almost certainly closely linked and will produce similar if not identical results. Therefore, if you were to calculate a mean of mean scores, you would effectively be giving 'tasty/nice to eat' twice as much weight or influence as 'good value for money'.
Problem 3: Beware of small bases
It's easy to forget that some analyses, reports or online calculations may rest on small bases. There is a tendency to believe whatever a report or reporting tool tells us. If a figure is calculated from only 10 people, that data may be very volatile and skewed by a small number of respondents. The user of your data may not think that way, and may assume that the same validity can be attached to small amounts of data as to the full data set.
How to deal with these problems
The first important thing is to be aware of the problems. However, where results are being distributed to a wide audience or as online dashboards, it might be hard to put health warnings on the data you are reporting that make sense to the recipients.
Solution 1: Try a responses-based mean score for subtotals or overall calculations
One simple test is to produce the overall figure in two ways: as an average of the calculated mean scores, and also based on responses. In other words, if you have five statements, add up all the scores given across the five statements and divide by the number of answers given. This can produce a very different result from the mean of mean scores.
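A sketch of the two calculations side by side; the statement data is invented, and each inner list holds the answers one statement received:

```python
from statistics import mean

def mean_of_means(statement_scores):
    # Each statement's mean gets equal weight, whatever its base size
    return mean(mean(scores) for scores in statement_scores)

def responses_based_mean(statement_scores):
    # Every individual answer gets equal weight instead
    answers = [s for scores in statement_scores for s in scores]
    return sum(answers) / len(answers)

# Five hypothetical statements answered by different numbers of respondents
five_statements = [[5, 4, 4], [3, 3], [4], [2, 3, 3, 4], [5, 5]]

print(round(mean_of_means(five_statements), 2))        # 3.87
print(round(responses_based_mean(five_statements), 2)) # 3.75
```

When the two figures diverge like this, it is a sign that uneven base sizes are distorting the mean of means.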
Solution 2: Filter results
Consider removing any statement that is biasing your results and showing it separately. In the earlier example of the comfort of the flat bed on a first class flight, you might exclude that data from any overall calculation and report it on its own.
Solution 3: Run correlation analyses
To deal with the problem of statements that are similar, like the example of 'tasty' and 'nice to eat', you could run a correlation analysis or a factor analysis. This will show you which statements reflect the same information and may be biasing your overall mean calculations.
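As a sketch, a plain Pearson correlation (computed by hand here so nothing beyond the standard library is needed) on invented ratings flags the near-duplicate pair:

```python
from statistics import mean, pstdev

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length rating lists
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical ratings from six respondents
tasty       = [5, 4, 5, 3, 4, 2]
nice_to_eat = [5, 4, 4, 3, 4, 2]
value       = [2, 5, 3, 4, 1, 5]

# 'tasty' and 'nice to eat' correlate strongly - they measure much the same thing,
# so including both double-counts that dimension in any overall mean
print(round(pearson(tasty, nice_to_eat), 2))  # 0.94
print(round(pearson(tasty, value), 2))        # -0.59
```

Statements that correlate very highly with each other are candidates for dropping or combining before any overall score is calculated.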
If you still need proof…
Let's say there are 4 people in a department and they rate four aspects of their job satisfaction on a 5-point scale. Respondent 1 scores them as 5, 5, 5, 1; respondent 2 scores 5, 1, 5, 1; respondent 3 scores 4, 3, 2, but has no opinion on the fourth statement; respondent 4 scores 5, 5, but has no opinion on the 3rd and 4th statements. The mean scores you would get are:
Statement 1: 4.75
Statement 2: 3.5
Statement 3: 4.0
Statement 4: 1.0
The mean of these means is 3.31
Now, if you add all of the scores together, the total is 47; divide that by the 13 answers given and you get 3.62.
You might say that the difference between 3.31 and 3.62 is not much, but it's almost a 10% difference. That could be someone's bonus or promotion, or a decision to expand an outlet or not.
I've also seen scores calculated on a respondent basis. This can be even more dangerous, especially where respondents have answered only some of the statements. In this example, that method would produce a mean of respondent mean scores of 3.75.
The calculations for all of these are below:
                 Statement 1   Statement 2   Statement 3   Statement 4   Total   Mean
Respondent 1          5             5             5             1          16    4.00
Respondent 2          5             1             5             1          12    3.00
Respondent 3          4             3             2             -           9    3.00
Respondent 4          5             5             -             -          10    5.00
Total                19            14            12             2
Mean               4.75          3.50          4.00          1.00

Mean of statement means:      3.31   = (4.75 + 3.50 + 4.00 + 1.00) / 4
Sum of all scores given:      47
Number of scores given:       13
Mean of statements answered:  3.62   = 47 / 13
Mean of respondent means:     3.75   = (4.00 + 3.00 + 3.00 + 5.00) / 4
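The whole worked example can be reproduced in a few lines of Python, with None marking a "no opinion" answer:

```python
from statistics import mean

# The four respondents' scores on the four statements; None = no opinion
scores = [
    [5, 5, 5, 1],        # respondent 1
    [5, 1, 5, 1],        # respondent 2
    [4, 3, 2, None],     # respondent 3
    [5, 5, None, None],  # respondent 4
]

# Mean per statement, ignoring missing answers
statement_means = [
    mean(row[i] for row in scores if row[i] is not None) for i in range(4)
]
print([float(m) for m in statement_means])   # [4.75, 3.5, 4.0, 1.0]

print(round(mean(statement_means), 2))       # 3.31  mean of statement means

answers = [s for row in scores for s in row if s is not None]
print(round(sum(answers) / len(answers), 2)) # 3.62  responses-based mean (47 / 13)

respondent_means = [mean(s for s in row if s is not None) for row in scores]
print(round(mean(respondent_means), 2))      # 3.75  mean of respondent means
```

Three defensible-looking calculations, three different answers: 3.31, 3.62 and 3.75.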
Mean scores are good but need care
Mean scores can give you a quick picture of what the data is telling you. It may only be occasionally that the figures hide something more important, but it's really not something to ignore.