David Didau: The Learning Spy

Five principles of effective assessment

Specify what counts, teach it, check it, fix it: assessment that stops sorting will improve teaching.

David Didau
Sep 21, 2025

Assessment is the bridge between teaching and learning.

Dylan Wiliam, Formative assessment and the regulation of learning

Too often, assessment is treated as a post-mortem: we peer through the microscope of question-level analysis to work out what went wrong. Instead, we should view it as the lifeblood of teaching. If we treat it as an afterthought, the curriculum tends to drift toward whatever is easiest to mark or most flattering on a spreadsheet. If instead we understand that assessment is the first step in effective curriculum design, everything sharpens. When we begin by specifying what will be assessed, we increase the likelihood that those things will be taught and learned. Too often we congratulate ourselves for elegant tests that sort the most confident from the most confused, then mistake this sorting for rigour. All this approach proves is prior advantage. The work of effective instruction is to build assessments that faithfully mirror what has been taught, to use them constantly to find out what is known and what is not, and to act on what we find. This is not a call for softer standards. Rather, mastery assessment means working to create a realistic possibility for all students to achieve the same high standards.

Related post: Using assessment to improve the curriculum (David Didau, Apr 17)

1. Specify what counts, at item level

Clarity about what we want students to learn and how we will assess this learning is the most effective way to improve a curriculum. Vague aims invite vague teaching; precision anchors instruction. By swapping broad concepts for teachable, granular details we are most likely to ensure students can do what we need them to be able to do. In English, broad aims such as ‘analysis’ or ‘evaluation’ tell no one what to do.

  • Bad item: “Find a metaphor in this paragraph.”

  • Better item: “In ‘A river of headlights poured down the hill,’ identify the implied tenor, the vehicle and the ground, then explain in two sentences how this metaphor shapes the mood of the scene.”

  • Model response: Tenor: the traffic on the road. Vehicle: a river. Ground: continuous, flowing movement and brightness, suggesting a relentless, impersonal flow that dwarfs individuals and creates a numbed, slightly ominous mood.

Now the decisions are visible: teachers are clear on what needs to be taught and practised, and students are clear on what they need to do to be successful.

In maths, blanket aims like “secure with …” are empty signals unless the task names the representation, the method to apply, the working to show, and the check for reasonableness. If you do not specify represent–choose–execute–check, you are sampling luck and prior coaching, not taught method.

  • Bad item: “Add these fractions: 3/8 + 5/12.”

  • Better item: “Add 3/8 and 5/12. Show the lowest common denominator, the conversion for each fraction, the sum in simplest form, and a one-line estimate to justify your answer.”

  • Model response: “LCD 24; 3/8 = 9/24; 5/12 = 10/24; sum 19/24; estimate: both are a bit under a half, so just under one whole fits.”

Feedback now writes itself because the steps are explicit: if a student jumps to 8/20, you can point to the missing LCD decision rather than vaguely gesturing at “careless errors.”
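
It may help to spell out where 8/20 comes from (my gloss, not part of the original item): it is exactly what adding numerators and denominators directly produces, so the error points straight at the skipped common-denominator decision.

\[
\frac{3}{8} + \frac{5}{12} \neq \frac{3+5}{8+12} = \frac{8}{20}, \qquad
\frac{3}{8} + \frac{5}{12} = \frac{9}{24} + \frac{10}{24} = \frac{19}{24}
\]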

In science, prompts that say “evaluate” without specifying the variables, the law that links them, and the criteria for judgement reduce assessment to guesswork.

  • Bad item: “How could you improve this experiment?”

  • Better item: “In an investigation of light intensity on the rate of photosynthesis using pondweed, identify the independent, dependent and two control variables, predict what happens when distance to the lamp halves, and justify the prediction using the inverse square law.”

  • Model response: “IV: light intensity via distance to lamp. DV: bubbles per minute. Controls: species, temperature, carbon dioxide. Prediction: halving distance roughly quadruples intensity, so rate increases; justification: intensity ∝ 1/d².”

Now reliability is not confused with accuracy, and the mark scheme rewards taught content, not generalised savvy.
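
The quantitative step in that justification can be made explicit. As a minimal sketch (the inverse square relationship is the one the item names; the working is mine):

\[
I \propto \frac{1}{d^2} \quad\Rightarrow\quad \frac{I(d/2)}{I(d)} = \frac{d^2}{(d/2)^2} = 4
\]

so halving the lamp distance roughly quadruples the light intensity, which is why the predicted rate rises.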

History often drifts into performance, and imprecision invites poorly thought-out responses.

  • Bad item: “What caused WWI?”

  • Better item: “Using Source A and Source B, argue which mattered more in the July Crisis: the alliance system or the naval arms race. State a ranking, quote once from each source, and justify your decision in no more than five sentences.”

  • Model response: The alliance system mattered more than the naval arms race in the July Crisis. Source A shows how binding commitments turned a Balkan quarrel continental, noting that Germany felt “encircled” and Russia’s mobilisation plans forced neighbours to “answer mobilisation with mobilisation.” Source B concedes the dreadnought race sharpened rivalry, as Britain pursued a “two-power standard,” but ships at sea did not compel immediate war in the way treaty obligations did. Alliances created automaticity: once one state moved, partners were dragged in by design, whereas naval competition heightened suspicion without dictating timelines. Therefore the ranking is clear: alliance system first, naval arms race second.

Modern languages often drift into performance, and imprecision invites templated waffle.

  • Bad item: “Describe your weekend.”

  • Better item: “Write 60 words describing last weekend including two perfect tense verbs, one opinion, one time phrase and one connective, then underline subject–verb agreements.”

  • Model response: Le week-end dernier, [je suis] allé au parc avec ma sœur et [j’ai] mangé une glace. Ensuite, [nous avons] regardé un film chez moi parce que j’étais fatigué. À mon avis, c’était super, mais il a commencé à pleuvoir et [nous avons] dû rentrer tôt. Après, [j’ai] écrit à mes amis et [j’ai] fait mes devoirs, ce n’était pas amusant.

In the end, precise assessment design is curriculum design. When we decide exactly what counts and how it will be judged, teaching tightens around those decisions, practice targets the fragile steps, feedback names the next move, and success becomes routine rather than accidental. Clarity at item level turns vague aims into teachable routines, improves validity by sampling what was taught, and stabilises reliability because the moves are unambiguous. The result is a curriculum that is easier to teach well and fairer to learn from, where full marks are a live possibility for everyone and grades reflect taught knowledge rather than prior advantage.

2. Do not assess what hasn’t been taught

If an assessment samples what has not yet been taught, success records background advantage and failure is blamed on children rather than on design. If a student answers an untaught item, they have done so despite your lack of instruction. If they cannot, it is because your instruction did not prepare them. This is an insistence on validity: the construct you claim to measure must be the construct you actually teach and sample, otherwise you are measuring noise and mistaking it for rigour. “To validate a test is to determine the degree to which it measures the construct it purports to measure.” (Cronbach & Meehl, 1955)

In English, unseen should not (but too often does) mean untaught. Asking students to compare two poems without first teaching a comparison routine simply converts cultural capital into marks.

  • Bad item: “Compare how the poets use imagery in Poem A and Poem B.”

  • Better item: “Compare Poem A and Poem B on [theme]. Write five sentences: a single comparative claim; one short quotation from Poem A naming the technique and its effect; one from Poem B doing the same; a sentence that weighs the effects using however or similarly; a final sentence that links back to your claim.”

Now the performance draws on taught moves rather than prior familiarity with poetic discourse. The same applies to grammar.

  • Bad item: “Improve this sentence.”

  • Better item: “Combine these two clauses into one sentence using a non-finite opener, then justify your choice in one sentence.”

In the second example you are sampling a taught technique, not vague stylistic taste.

In maths, context is not a free pass to smuggle in untaught ideas. A ratio problem that relies on an unintroduced unit conversion or a buried piece of time arithmetic is not “real world”, it is invalid.

  • Bad item: “A train leaves at 11:17 and arrives at 13:02 after stopping three times for seven minutes. What fraction of the journey was spent moving?”

  • Better item: “Given journey time and total stoppage time, find the fraction spent moving using the taught represent–choose–execute–check routine. Show a bar model, select the subtraction, execute the calculation, and estimate to justify plausibility.”

The construct is proportional reasoning, not the ability to decode a trick.

In science, novelty should live in the application of a taught law, not in a thicket of unfamiliar apparatus.

  • Bad item: “Explain why the measured acceleration was lower than expected in this trolley experiment.”

  • Better item: “Using Newton’s second law as taught, identify one systematic and one random source of error in the given trolley setup, explain the direction of each effect on the measured acceleration, and propose a fix that addresses the cause.”

You are testing named, taught ideas: law, error type, causal mechanism, remedy. Likewise with calculations:

  • Bad item: “Calculate gravitational potential energy for this object” when students have never been shown the units.

  • Better item: “Use E_p = mgh with the provided g and h, show substitution with units, and check reasonableness against a rounded mass.”

Here, you are sampling the method you taught, not the confidence to guess at conventions.
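
As a hedged illustration of what the substitution-with-units step looks like, with made-up values (m = 1.8 kg, h = 5 m and g = 9.8 N/kg are my assumptions, not figures from the item):

\[
E_p = mgh = 1.8\,\text{kg} \times 9.8\,\text{N/kg} \times 5\,\text{m} = 88.2\,\text{J}
\]

and the reasonableness check rounds to 2 kg and 10 N/kg to give roughly 100 J, close enough to confirm the answer is plausible.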

In geography, broad “what causes…?” prompts reward prior background and produce scattergun lists rather than disciplined geographical reasoning.

  • Bad item: “What causes flooding?”

  • Better item: “Using the taught claim–evidence–explain–source routine, rank antecedent rainfall and urban impermeable surfaces as causes of flooding in the [River ___] catchment using Figure 1 (48-hour rainfall totals) and Figure 2 (land use map). State a clear ranking, cite one precise figure from each source with units, explain the mechanism linking each cause to increased discharge, and evaluate each source for scale, date and reliability in no more than five sentences.”

The construct is geographical reasoning with data, not prior narrative. If you care about source quality, teach students to name who collected the data, when it was collected, the spatial resolution and the main limitation, then demand that explicitly.

In computing, asking for a list comprehension when you only taught for-loops is selection by stealth.

  • Bad item: “Return the squares of all even numbers in a list in one line.”

  • Better item: “Using the taught for-loop pattern, iterate through a list, test evenness with modulo, append squares to a new list, and print the result.”

If you want to assess abstraction, teach it, then write the item so the abstraction is the move being sampled.
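
A minimal sketch of the difference in Python (the variable names and the sample list are my own illustration, not from the post): the loop below is the taught routine the better item samples; the commented one-liner is the untaught abstraction the bad item demands.

```python
# Taught pattern: iterate with a for-loop, test evenness with modulo,
# append squares to a new list, then print the result.
numbers = [3, 4, 7, 10, 15, 22]

squares_of_evens = []
for n in numbers:
    if n % 2 == 0:                      # modulo test for evenness
        squares_of_evens.append(n ** 2)

print(squares_of_evens)                 # [16, 100, 484]

# The one-line list comprehension the "bad item" expects, untaught:
# print([n ** 2 for n in numbers if n % 2 == 0])
```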

The practical discipline is simple. Blueprint your assessment against what has been explicitly taught, including the methods, representations and checks. For each item, be able to point to the lesson where the enabling decision was modelled and practised. Where you want transfer, make the transfer near at first, and only widen the gap once the routine is stable. Surprise belongs in theatre. In classrooms it mostly measures what students brought from home. Equity means teaching the knowledge and methods you intend to assess, then sampling those directly.

3. Make 100 per cent possible

If you claim to teach a mastery curriculum, assessments must be designed so every student could, in principle, score full marks. The difficulty should live in the curriculum sequence, not in gratuitous curveballs. A test should act as a mirror, not a trapdoor. “The degree of learning is a function of time actually spent on learning relative to the time needed to learn.” (Carroll, 1963) Assessments should sample the full breadth of what has been taught, keep the cognitive steps legible, and avoid items that hinge on tacit knowledge you never made explicit.
