Why better teaching doesn’t always mean better grades (Thanks, Ofqual)
How ‘comparable outcomes’ keeps standards stable – and your students’ scores stuck.
You know that feeling when you’ve pulled out all the stops - rewritten the curriculum, drilled exam structures, embedded retrieval practice, modelled until your marker dried up - and results barely budge?
We’re constantly told that students do better now because teaching is better. The Department for Education often links rising results or narrowing gaps to improvements in teaching, curriculum, or policy, even while comparable outcomes are holding grade distributions steady.1 Similarly, Ofsted inspection reports routinely attribute positive inspection outcomes or performance data to improved teaching.2
This reads as if outcomes are moving upwards in absolute terms, even if grades aren’t. Meanwhile, Ofqual says, cheerfully: “Standards remain stable over time.” How can both of these be true? Ofqual themselves have said, “Teachers and schools may be getting better at preparing students. That’s a good thing. But our job is to make sure that a grade means the same standard over time.”3
Welcome to the wonderful world of comparable outcomes, Ofqual’s favoured method of reconciling improvement in education with… well, keeping things exactly the same over time.
The illusion of progress
Here’s the trick. Ofqual doesn’t directly reward better teaching with higher grades. Instead, they use a statistical model to ensure that the proportion of top grades stays roughly the same each year, no matter how brilliant your lessons were.
Comparable outcomes rests on the principle that a grade 7 in GCSE Chemistry in 2025 should mean the same as a grade 7 in 2023. To protect public confidence, the distribution of grades is fixed based on what students got in the past.
So how do they do it?
The algorithm behind the curtain
Every summer, Ofqual looks at the prior attainment of the cohort - usually their KS2 SATs or GCSEs in other subjects - and calculates what proportion of them should get each grade if standards are stable.
Then, they adjust the grade boundaries so that the resulting distribution matches those predictions. That’s right: grade boundaries aren’t decided by what students “ought” to know, or by what the examiners think shows mastery. They’re decided by where the numbers fall.
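To make the mechanism concrete, here is a toy sketch of percentile-based boundary setting. This is not Ofqual’s actual model - the function name, the cohort data, and the predicted shares are all invented for illustration - but it shows the core move: the boundaries are chosen to hit a predicted distribution, not a fixed mark.

```python
# Illustrative sketch only - NOT Ofqual's real model. All names,
# numbers, and the cohort data below are invented for illustration.
import numpy as np

def boundaries_from_predictions(raw_marks, predicted_shares):
    """Return the lowest raw mark needed for each grade.

    predicted_shares maps grade -> predicted fraction of the cohort
    achieving that grade OR ABOVE (cumulative), e.g. {9: 0.05, ...}.
    """
    marks = np.sort(raw_marks)[::-1]              # highest marks first
    n = len(marks)
    cuts = {}
    for grade, share in sorted(predicted_shares.items(), reverse=True):
        k = max(1, round(share * n))              # students at this grade or above
        cuts[grade] = marks[k - 1]                # mark of the last qualifying student
    return cuts

rng = np.random.default_rng(0)
cohort = rng.normal(55, 12, size=10_000).clip(0, 100)  # fictional raw marks
cuts = boundaries_from_predictions(cohort, {9: 0.05, 7: 0.20, 4: 0.70})
```

Notice that nothing in this sketch asks what a grade 7 answer looks like: the cut points fall wherever the ranked marks happen to put them.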
This is all very well, but what happens when, due to extraordinary circumstances, there is no KS2 data to compare? The students sitting GCSEs this year didn’t take their Key Stage 2 SATs. They were in Year 5 when Covid hit, and SATs were scrapped in 2020. That means Ofqual doesn’t have the usual baseline data it relies on to predict what this cohort ‘should’ achieve.
So how do they apply comparable outcomes without prior attainment? Simple: they fudge it. Instead of modelling predictions based on each student’s SATs, they use last year’s results as a benchmark. That is, they assume this year’s cohort is broadly similar to the 2024 cohort, and award roughly the same proportion of each grade. This keeps grade inflation in check, but it’s built on a statistical shrug. No account is taken of whether this group is stronger, weaker, or simply different.4
In the absence of hard data, Ofqual turns to human judgement. Senior examiners review real scripts at the grade boundaries and compare them to equivalent scripts from previous years. If this year’s borderline grade 4 answers look better than last year’s, exam boards can make a case to Ofqual that boundaries should be adjusted and Ofqual might - just might - lift the boundary. If not, it stays put.
In some subjects, like English and maths, Ofqual can also draw on data from the National Reference Test, a sample assessment taken each spring by a random group of Year 11s. It gives a general sense of whether students this year are performing better or worse than before. But it’s limited in scope and has no bearing on individual schools or students or on outcomes in other subjects.
So, in 2025, “comparable outcomes” has become a bit of a house of cards. The anchoring assumptions are thinner. The predictions are looser. And yet the system rolls on, confident in its ability to deliver fairness through a mixture of precedent, professional judgement, and inertia.
But what if teaching actually has improved?5
In theory, better teaching leads to students getting more raw marks. So the whole bell curve shifts to the right. And if that shift is larger than expected, Ofqual will review the scripts and may shift the grade boundaries up a bit.
In practice? Don’t hold your breath. For any real change to register, it has to be:
Sustained across multiple cohorts
Evident across the whole system
Confirmed by qualitative review from senior examiners
Not explainable by gaming, improved exam technique, or easier papers
It’s not impossible, but meaningful change would be glacial.
In the short term, most improvements in teaching are absorbed by the system and hidden under the curve. You may be teaching better, your students may be learning more, but the grade distribution remains comparably the same.
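That absorption effect can be shown in a toy simulation (again with invented numbers, not Ofqual’s model): if every student’s raw mark rises but grades are awarded by rank against fixed cumulative shares, nobody’s grade moves.

```python
# Toy simulation - NOT Ofqual's real model. Shares and marks invented.
import numpy as np

SHARES = {9: 0.05, 7: 0.20, 4: 0.70}   # cumulative share at grade or above

def award(marks):
    """Assign grades by rank so the cumulative SHARES are hit exactly."""
    pct = marks.argsort().argsort() / (len(marks) - 1)  # 0 = worst, 1 = best
    grades = np.full(len(marks), 1)                     # everyone starts at grade 1
    for grade, share in sorted(SHARES.items()):         # 4, then 7, then 9
        grades[pct >= 1 - share] = grade                # overwrite upward
    return grades

rng = np.random.default_rng(1)
year_one = rng.normal(55, 12, size=10_000)
year_two = year_one + 8              # everyone improves by 8 raw marks
same = (award(year_one) == award(year_two)).all()
```

Because a uniform improvement leaves every student’s rank unchanged, `same` comes out true: eight extra raw marks per student, and not a single grade changes.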
So what’s the point?
From a systems perspective, it sort of makes sense. Nobody wants grade inflation, and system stakeholders benefit from stability. Universities, employers, and parents need to trust that a grade 6 means what it always meant.
Grade consistency gives the illusion of objectivity. It makes it easier to believe the system is fair, that your child’s grade means something fixed, reliable, deserved. In reality, what counts as a grade 4 or a grade 7 is not a law of nature, it’s a statistical construct, shaped by policy decisions, historical precedent, and political mood.
In reality, no one really needs stable grades. Certainly not students. Not necessarily universities. And not most employers in any robust, operational sense. What is needed is public confidence. Stability in grading helps the system feel fair, ordered, trustworthy. But that’s not the same as fairness. It’s more like branding. A system that stays the same, even if what’s underneath it changes.
But let’s be honest: this flattening of grades can feel demoralising for students, teachers and schools chasing “improved results” with increasingly marginal gains.
The big lie
What’s most galling is the sleight of hand in the messaging. The powers that be get to say both, “Standards are fixed. A grade means the same over time,” and “Results are improving thanks to better teaching” as if this weren’t a contradiction in terms.
It’s not quite gaslighting but it is misleading. Improvement (or decline) may be real, but it tends to be smoothed, averaged, and statistically neutralised before we ever see it on a results sheet.
So next time someone tells you grades are going up because teaching is getting better, feel free to smile politely, then go back to wondering how your department is meant to keep improving outcomes when the ceiling never lifts.
The hidden cost of comparable outcomes is that it makes progress invisible. Teaching may genuinely improve. Students might work harder, write better, know more. But unless the whole cohort shifts significantly (and unless Ofqual agrees that shift is real) grade boundaries barely budge. You can run faster, jump higher, deliver the best lessons of your life… and still stand still.
And the consequences of that are corrosive.
First, we get curriculum narrowing. When grades are rationed, the incentive is to teach what’s measured, not what matters. Substance gives way to strategy. Essay plans replace ideas. Teaching becomes a performance: rehearsed, reductive, relentless.
Schools chase marginal gains, not real learning. More mock exams, more intervention sessions, more data drops. The point isn’t to teach better, but to out-score the school down the road.
Meanwhile, demoralisation sets in. Teachers might do everything right, teach well, plan carefully, mark thoughtfully, and still see flatlining results. The message? Do more. Be better. Close the gap. But don’t expect to see it in the numbers.
Worst of all, the model quietly punishes schools working in the toughest contexts. If your intake has weaker prior attainment, your chances of hitting national benchmarks are slimmer, regardless of how effective your teaching is. Your improvement may be real, but it may not be recognised. You’re swimming against a statistical tide.
All of this adds up to a system that prizes stability over honesty. A system that prefers smooth lines to messy truths. One that measures success, not by how far students come, but by how well they stack up against their statistical clones.
So yes, comparable outcomes offers control and maintains public trust, but it also flattens genuine improvement into noise, disguises inequity as rigour, and quietly tells teachers that no matter how good they are, only so many of their pupils will be allowed to succeed.
1. For instance, former Schools Standards Minister Nick Gibb said in a press release in 2018, “Academic standards are rising in our schools thanks to our reforms and the hard work of teachers, with 1.9 million more children in good or outstanding schools than in 2010.”
2. For example, “Achievement across the school has improved significantly over the last year as a result of good teaching.” Or “Outstanding leadership and management ensure that the quality of teaching and students’ achievement are continually improving.”
3. Phil Beach, ‘GCSE marking and grading,’ The Ofqual Blog (2015) https://ofqual.blog.gov.uk/2015/08/05/gcse-marking-and-grading/
4. I’m about to talk about National Reference Tests. Be patient :)
5. Or gotten worse?
Even if grade distributions are normalised nationally, can’t a school, being a small bubble of data in the mix, still raise its relative achievement overall and be justifiably proud of this?