By Dennis Sherwood
In the summer of 2020, the Covid-19 outbreak led to the cancellation in the UK of all school exams. What happened next has been described as a scandal, a disaster and a fiasco, and has been reported extensively elsewhere. In the quest to discover what went wrong, the Education Select Committee of the UK House of Commons held a number of hearings in September and October, one of which included this exchange between Committee Member Ian Mearns and the Minister for Schools, Nick Gibb:
Q1129 Ian Mearns: But by Ofqual’s own admission, the grading process has some significant inaccuracies within it. I think Ofqual’s own figures are that one in four grades is inaccurate by one grade one way or the other, 25%. Given that level of existing inaccuracy and all these other things being fed in, how much confidence can parents and young people themselves have in the overall system?
Nick Gibb: I think they can. Exams are the fairest system, and it goes to huge efforts to ensure that the marking is as accurate as possible. There are people whose careers are devoted to this very issue about making sure there is consistency between different types of exam and different exam boards. That expertise is very well accommodated in Ofqual and in exam boards, and that is their bread and butter.
Ofqual is the body that regulates and oversees school exams in England, and at first sight, that all seems quite reasonable. But read it again.
Ian Mearns says “the grading process has some significant inaccuracies”, to which Nick Gibb replies by referring to the “huge efforts to ensure that the marking is as accurate as possible”.
To me, this is a classic example of talking at cross purposes, of not listening, of the absence of mutual understanding, of perhaps deliberate obfuscation. The question is about GRADING, but the reply is about MARKING. These are two very different processes; furthermore, the reply is assuming that the problem with GRADES is attributable to a problem with MARKS.
Many people confuse marking and grading, thinking they are the same. Yet marking and grading are quite distinct:
- marking is currently carried out in England by human examiners, who assign marks to (for the most part) essay-style questions;
- grading is a subsequent process in which a policy (such as “all scripts marked from 62 to 68 marks inclusive are awarded grade B”) is applied to determine the grade to appear on the candidate’s certificate.
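To make the distinction concrete, here is a minimal sketch of a grading policy as a simple lookup applied after marking is complete. Only the rule "62 to 68 marks inclusive is grade B" comes from the example above; the other boundaries, and the names `BOUNDARIES` and `grade_for`, are invented purely for illustration.

```python
from bisect import bisect_right

# Each entry is (lowest mark for the grade, grade).
# ASSUMPTION: only the B band (62 to 68 inclusive) comes from the text;
# every other boundary here is hypothetical.
BOUNDARIES = [(0, "U"), (48, "D"), (55, "C"), (62, "B"), (69, "A")]


def grade_for(mark: int) -> str:
    """Apply the grading policy: map a single mark to a grade."""
    if mark < 0:
        raise ValueError("mark must be non-negative")
    lowest_marks = [low for low, _ in BOUNDARIES]
    # Find the last band whose lowest mark is <= the given mark.
    index = bisect_right(lowest_marks, mark) - 1
    return BOUNDARIES[index][1]


for mark in (61, 62, 68, 69):
    print(mark, grade_for(mark))
```

Note that the policy takes one number as input and has no knowledge of how that number was arrived at: marking happens first, grading second.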
The question asked by Ian Mearns is based on the results of some Ofqual research in which whole cohorts of scripts in each of 14 subjects were marked twice – firstly by an ‘ordinary’ examiner, and then by a ‘senior’ examiner, whose mark, and the corresponding grade, are deemed by Ofqual to be ‘definitive’. The grades resulting from each of the two marks were then compared. The overall result was that, on average, across the entire cohort for all 14 subjects, about 75% of the ‘ordinary’ examiners’ grades were found to be the same as those awarded by the ‘senior’ examiners, and about 25% were different. Since grades awarded by the ‘senior’ examiners are ‘definitive’, those 25% of grades awarded by the ‘ordinary’ examiners must have been ‘non-definitive’ – or, in simpler language, wrong. That’s the background to the question.
That about 25% of awarded grades are wrong – or, rather better, ‘unreliable’ in that they would be changed if the scripts were to be re-marked by a senior examiner – is well-established, and acknowledged by Ofqual. The BIG QUESTION is “why?”.
In his reply, Schools Minister Nick Gibb implied that grades are unreliable because marking is inaccurate. “Marking” of course can be neither accurate nor inaccurate; rather, what Nick Gibb is saying is “because human examiners make mistakes”. Perhaps they are not complying with the mark scheme; perhaps they are just sloppy; perhaps the quality control process is failing.
It is certainly possible that some examiners might indeed be sloppy, and that the quality control process might not detect this. But can ‘sloppiness’ explain the scale of the problem?
Every year, about 6 million exam grades are awarded in England, of which about 25% – that’s about 1.5 million – are ‘wrong’. If each of these erroneous grades is attributable to ‘inaccurate marking’ – or, rather, examiner error – then the number of such errors must be at least 1.5 million. And I say “at least” because, in addition, there are all the marking errors that are within a single grade width and so would not result in a grade change. Is it credible that so many examiners could be so negligent, year after year?
On these grounds alone, I just can’t believe that ‘inaccurate’ marking comes anywhere close to explaining the observed unreliability of grades – especially since there is, to me, a much more plausible explanation.
Rather than attributing the grading problem to ‘inaccurate marking’, let me suggest that this is the inevitable consequence of marking ‘fuzziness’ – my word for the fact that different, equally conscientious, and equally qualified, examiners can give the same script different marks, say, 64 or 66. In giving their marks, neither examiner has been negligent or sloppy, nor has there been a failure of quality control. The two marks simply reflect legitimate differences in professional academic judgement.
If both marks are within the same grade width, the student is awarded the same grade by both examiners. But if the B/A grade boundary is at 65, then the first mark results in grade B, and the second, grade A. Which grade is ‘definitive’? We don’t know. All we do know is that the grade recorded on the certificate must be unreliable, for there is a possibility that it might be changed if the script were to be re-marked.
This is illustrated in the figure. For each of three scripts, the ‘whiskers’ represent the range of marks that could legitimately be given by a fully qualified examiner. The grade B awarded to Candidate 1 is reliable, as is the grade A awarded to Candidate 3. The grade awarded to Candidate 2, however, is unreliable, for the range of legitimate marks straddles the B/A grade boundary, and the grade that appears on the candidate’s certificate is in essence the result of the lottery of which examiner happens to mark the script first, and on which side of the grade boundary that mark happens to lie.
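The same idea can be sketched in a few lines of code. The B/A boundary at 65 is taken from the example above, but the convention that a mark of exactly 65 earns grade A, the whisker ranges for the three candidates, and the function names are all assumptions made for illustration; they are not read off the figure.

```python
BOUNDARY_B_A = 65  # ASSUMPTION: marks of 65 and above earn A, lower marks earn B


def grade(mark: int) -> str:
    """Grade for a single, precise mark near the B/A boundary."""
    return "A" if mark >= BOUNDARY_B_A else "B"


def possible_grades(lowest: int, highest: int) -> set:
    """All grades reachable from any legitimate mark in the fuzzy range."""
    return {grade(mark) for mark in range(lowest, highest + 1)}


def is_reliable(lowest: int, highest: int) -> bool:
    """A grade is reliable only if every legitimate mark gives the same grade."""
    return len(possible_grades(lowest, highest)) == 1


# HYPOTHETICAL whiskers: (lowest legitimate mark, highest legitimate mark).
candidates = {
    "Candidate 1": (58, 62),
    "Candidate 2": (63, 67),
    "Candidate 3": (68, 72),
}

for name, (lowest, highest) in candidates.items():
    status = "reliable" if is_reliable(lowest, highest) else "unreliable"
    print(name, sorted(possible_grades(lowest, highest)), status)
```

With these illustrative ranges, only the script whose whiskers cross 65 can end up with either grade, which is the situation of the script marked 64 by one examiner and 66 by another.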
Any script whose fuzzy mark straddles a grade boundary will result in an unreliable grade. Some subjects – such as English Language and History – are especially fuzzy, and the incidence of unreliability will be even higher when grade widths are relatively narrow. So fuzziness could indeed be the explanation of those 1.5 million unreliable grades.
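A rough sketch can show why narrower grade widths and greater fuzziness both push unreliability up. It rests on strong simplifying assumptions that are mine, not Ofqual's: marks are whole numbers spread evenly from 0 to 100, every grade has the same width, and fuzziness is a symmetric range of plus or minus a fixed number of marks. The function name and parameters are hypothetical.

```python
def unreliable_share(grade_width: int, fuzziness: int, max_mark: int = 100) -> float:
    """Fraction of marks whose fuzzy range straddles at least one grade boundary.

    ASSUMPTIONS (illustrative only): equal grade widths, one script at each
    whole mark from 0 to max_mark, and a legitimate range of
    mark - fuzziness to mark + fuzziness for every script.
    """
    # Each boundary is the lowest mark that earns the higher grade.
    boundaries = range(grade_width, max_mark + 1, grade_width)

    def straddles(mark: int) -> bool:
        # The range holds marks on both sides of a boundary b when
        # its lowest mark is below b and its highest mark is at least b.
        return any(mark - fuzziness < b <= mark + fuzziness for b in boundaries)

    marks = range(max_mark + 1)
    return sum(straddles(mark) for mark in marks) / len(marks)


for width in (5, 10, 20):
    for fuzz in (1, 2, 3):
        share = unreliable_share(width, fuzz)
        print(f"grade width {width:2d}, fuzziness +/-{fuzz}: {share:.0%}")
```

Under these assumptions the share of unreliable grades rises as grades get narrower or as the subject gets fuzzier, and falls to zero only when there is no fuzziness at all. The actual percentages depend entirely on the invented inputs and should not be read as estimates for any real exam.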
To my mind, the fact that 25% of exam grades in England are unreliable is therefore much more likely to be attributable to fuzziness than to ‘inaccurate marking’. And the identification of the right root cause is important, for different causes have different solutions.
If the cause is indeed ‘inaccurate marking’, then the solutions are about tighter marking schemes, better training for examiners, and stricter quality control; if the cause is ‘fuzziness’, then the solution is to devise a wiser policy for determining grades from necessarily fuzzy marks.
So I fundamentally disagree with the Schools Minister, Nick Gibb. Although problems with marking are bound to be present and don’t help, I think that the fundamental, and dominant, cause of the unreliability of grades is fuzziness – if only because even if marking were ‘perfect’, and there were no marking errors at all, some degree of fuzziness must inevitably still be present, and the possibility that the same script might legitimately be marked, say, 64 and 66 by two different examiners must remain.
I appreciate, of course, that the whole process of student assessment should be reformed. But whilst examinations play any role at all, the very least that we should expect is that the final record – in many cases, the grades as shown on the certificates – should be reliable and trustworthy. But in England, they surely are not: as already noted, on average, around 1 grade in 4 is wrong. And wrong because the policy for determining the grade is deeply flawed, for it assumes that the script is associated with a single, precise, mark – say, 64 – and fails to recognise that this mark is necessarily fuzzy.
So let me conclude this blog with a challenge.
Given that the marks given to a script will always be fuzzy, how many different policy solutions can you suggest which will result in the award of a reliable grade – where ‘reliability’ implies a very high probability that the originally-awarded grade would be confirmed, and not changed, as the result of a fair re-mark by any other examiner, and not just a ‘senior’ one?
Dennis Sherwood is a consultant on organisational creativity and innovation, and is an active campaigner in England to solve the problem of unreliable school exam grades.
1 thought on “Why are exam grades unreliable?”
Thank you Dennis. You set out the problem very clearly. It should be required reading for decision-makers within the educational system.