What the biggest education experiment ever really tells us
Inside Project Follow Through: why methods matter, why implementation fails, and why Direct Instruction still divides the profession
In education, boldness is rarely matched by rigour, but occasionally the attempt is made. Back in the heady days of Lyndon Johnson’s War on Poverty, American policymakers launched what remains the largest, most ambitious educational experiment ever attempted: Project Follow Through. Its goal was deceptively simple: find out how best to teach disadvantaged children. For once, ideology was (at least officially) set aside in favour of data. Competing models of early childhood education would be implemented across dozens of sites, with uniform assessments and direct comparisons. The result? Well, it depends who you ask.
Follow Through enrolled over 20,000 children in the 1970s, with each participating school adopting one of several sponsored models over a four-year period. These models ranged from highly structured, skills-based methods (notably Direct Instruction) to more progressive, affective or “whole child” approaches.
Evaluation was handled by Abt Associates, who oversaw a national data collection effort involving standardised achievement tests (reading, arithmetic, spelling, vocabulary), cognitive assessments, and affective measures like self-concept and locus of control. A battery of well-intentioned psychometrics if ever there was one.
But of course, the cracks were there from the start:
Sites weren’t randomly assigned.
Implementation fidelity was unknown.
Politics shaped site selection.
Local control and community preferences complicated standardisation.
Still, for all these flaws, Follow Through produced an unprecedented body of evidence.
Abt’s analysis, led by Richard B. Anderson, produced cautious but striking results:
Most Follow Through sites failed to bring students to national norms. Only 19% of cohorts met national averages on even one academic measure.
The majority of outcomes were null — about 68% of the time, Follow Through students performed neither better nor worse than their non-participating peers.
In other words: for every occasion on which a Follow Through intervention raised achievement, there were at least five on which it made no difference, and negative outcomes occurred nearly twice as often as positive ones.
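To see how those ratios follow from the headline figures, here is a minimal back-of-envelope sketch in Python. The 68% null share is the reported figure; the assumption that negative outcomes were roughly twice as common as positive ones is inferred from the pattern described above, not a number quoted directly from the evaluation.

```python
# Back-of-envelope check on the Follow Through outcome split.
null_share = 0.68                      # reported: no significant difference

# Assumption (inferred, not quoted): negative outcomes occurred roughly
# twice as often as positive ones, so the remaining 32% splits 1:2.
positive_share = (1 - null_share) / 3  # ~11%
negative_share = 2 * positive_share    # ~21%

print(f"positive: {positive_share:.0%}, negative: {negative_share:.0%}")
print(f"null outcomes per positive one: {null_share / positive_share:.1f}")
```

On those assumptions, null results outnumber positive ones by more than six to one, which is consistent with the “at least five times” claim above.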
The clearest comparative data comes from Figure 5 of the Abt evaluation, which tracks outcomes across three domains: basic academic skills, cognitive-conceptual skills, and affective measures like self-concept. The picture that emerges is remarkably consistent.
On basic academic skills - reading, spelling, arithmetic, vocabulary - Direct Instruction stands alone. Its average effect size (~0.6 SD) was substantially higher than that of any other model, with most implementation sites - although, crucially, not all - showing clear positive gains. No other model approached this level of consistency. Some, like Behavior Analysis, demonstrated modest positive effects, but the majority clustered around zero or dipped into negative territory. Several site implementations of the alternative models performed markedly worse than no intervention at all. In short: Direct Instruction reliably delivered academic gains where others faltered.
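To put that ~0.6 SD figure in concrete terms, here is a quick sketch that converts the effect size into percentile movement, assuming (simplistically) that test scores are normally distributed:

```python
from statistics import NormalDist

# Translate a 0.6 SD effect size into percentile movement, assuming
# normally distributed test scores (a simplification).
effect_size = 0.6
percentile = NormalDist().cdf(effect_size) * 100
print(f"An average student moves from percentile 50 "
      f"to roughly percentile {percentile:.0f}.")  # ~73
```

An effect of that size shifts a typical student from the middle of the distribution to roughly the 73rd percentile, a large gain by the standards of education research.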
In the cognitive-conceptual domain - more complex problem-solving and higher-order reasoning - no model emerged as dominant. The effects here were weaker across the board, with most models producing highly variable outcomes. Even Direct Instruction, while slightly positive, saw diminishing returns as the focus shifted from basic skill acquisition to more abstract reasoning. Tellingly, the programs specifically designed to improve cognitive-conceptual skills performed worst of all. The very interventions most committed to fostering independent thinking and reasoning often delivered the least in measurable gains.
The affective results confounded common assumptions. Direct Instruction, often criticised for being rigid and uninspiring, actually produced some of the most positive outcomes on self-concept and achievement responsibility. By contrast, models designed specifically to enhance self-esteem tended to deliver flat or even slightly negative results. This suggests that academic success, rather than being driven by self-esteem, is more likely to cultivate it: the better your academic performance, the better you feel about your ability to perform academically.
Crucially, across all three domains, the variation within models - between different implementation sites - was often greater than the variation between models themselves. The same program could succeed spectacularly in one school and fail miserably in another. Implementation fidelity, teacher quality, training, and local conditions played a decisive role in determining whether any given model worked at a particular site.
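A toy simulation makes the point concrete. The numbers below are entirely invented (three hypothetical models, ten sites each) and simply show how site-to-site scatter within a model can dwarf the differences between model averages:

```python
import random

# Entirely synthetic illustration of within-model vs between-model
# variation; none of these numbers come from the Follow Through data.
random.seed(1)

model_means = {"Model A": 0.5, "Model B": 0.1, "Model C": -0.1}
site_sd = 0.4  # hypothetical spread of effect sizes across sites

site_effects = {name: [random.gauss(mean, site_sd) for _ in range(10)]
                for name, mean in model_means.items()}

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

between = variance(list(model_means.values()))
within = sum(variance(v) for v in site_effects.values()) / len(site_effects)

print(f"between-model variance: {between:.2f}")        # ~0.06
print(f"average within-model variance: {within:.2f}")  # typically larger
```

With these invented figures, the scatter among sites running the same model exceeds the gap between the models themselves, which is precisely the pattern the evaluation observed.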
Taken together, these findings point to an uncomfortable but robust conclusion: models that focus directly on building academic competence deliver the most reliable benefits, including for pupils’ self-concept. But success is fragile; even the strongest models require careful implementation. Without it, they are just as vulnerable to failure as their less effective counterparts.
Direct Instruction
The clearest finding concerns Direct Instruction. As the graph below illustrates, this model produced consistently higher rates of positive outcomes in reading, spelling, arithmetic, and vocabulary than any of its competitors. Its emphasis on tightly sequenced, highly explicit teaching delivered reliable academic gains where other approaches struggled. In contrast to the developmentalist hope that self-esteem would drive achievement, the pattern observed was the reverse: academic success appeared to foster improvements in self-concept.
Yet even here, context mattered. The effectiveness of Direct Instruction was not uniform. Some sites saw strong gains; others, much weaker results. The same instructional package, in different hands and settings, could yield widely varying outcomes. Implementation matters almost as much as programme design, and perhaps more. However, where other programmes produced a small number of notable successes but otherwise performed poorly, the reverse was true of DI.
What the critics said
Naturally, not everyone agreed. The evaluation itself became its own mini-war of ideology.
House et al. (1978) argued that Abt’s methods were deeply flawed: sloppy statistics, poor model classification, weak measurement. They found no convincing evidence that any model, Direct Instruction included, outperformed others.
Bereiter and Kurland (1981) strongly disagreed. Their reanalysis, carefully stripping out site mismatch noise, reaffirmed substantial advantages for Direct Instruction in basic skills.
Becker & Gersten (1982) provided follow-up studies showing that Direct Instruction students maintained advantages into 5th and 6th grade, particularly in reading and problem-solving.
And standing somewhat apart, Cathy Watkins (1997) put her finger on what remains, to this day, the most uncomfortable conclusion: even when effective methods are identified, the educational establishment resists them. Institutional inertia, ideological commitments, teacher preparation models, and economic self-interest all combined to ensure that Direct Instruction, despite its empirical support, would remain a marginalised curiosity rather than mainstream policy.
The hard truth
What emerges from Follow Through is not a tidy “what works” answer, but something much more troubling: Implementation trumps intention.
No model - not even Direct Instruction - succeeded everywhere.
Where it was implemented well, Direct Instruction worked spectacularly. Where it wasn’t, it floundered like everything else.
The default state of large-scale educational reform is noise, drift, and regression to the mean.
Poverty is remarkably resistant to shallow interventions.
What Follow Through ultimately revealed was not so much which method “works” but how hard it is to deliver any method consistently at scale.
Why this still matters
We still hear the call to scale what works. But Follow Through, if it tells us anything, shows how naive that is. The problem has never been lack of promising methods. The problem is reliably translating method into practice, site after site, year after year, under varying local conditions.
In truth, education policy would do well to stop asking “what works?” and start asking:
Under what conditions?
With what training?
At what cost?
How will fidelity be maintained?
What is the actual causal pathway between intervention and long-term success?
The lesson from Follow Through is not that Direct Instruction is a magic bullet. It isn’t. Such a thing could never exist: the most perfectly designed instructional programme won’t work if it’s implemented half-heartedly by teachers who don’t believe in it. However, it is the best bullet we’ve yet manufactured. We simply lack the will, institutional structures, and professional culture to load it into the gun.¹
¹ Apologies for the egregious gun metaphor; I’m afraid I got a bit carried away.