Kelsey Piper is Right about Education Research
But not for the reasons she mentioned
Hi! This is Scholastic Alchemy, a twice-weekly blog where I write about education and related topics. Mondays usually see me posting a selection of education links with some commentary about each, and Wednesday posts are typically a deep dive into an education topic of my choosing. If Scholastic Alchemy had a thesis, I suppose it would go a little like this: we keep trying to transmute educational gold from lead, it keeps not working, and we keep on trying. My goal here is to talk about curriculum, instruction, policy, public opinion, and other topics in order to explain why I think we keep failing to produce this magical educational gold. If you find that at all interesting, please consider a paid subscription here, or at the parallel publishing spot on Beehiiv. (Some folks hate the ‘stack, I get it.) That said, all posts are going to remain free for the foreseeable future. Thanks for reading!
Allow me a bit of a diversion before I address the larger point of today’s post.
While I was penning the reads to start your week post over the weekend, I mentioned that I was going to spend today’s post ranting about educational research. That’s still kind of the case, but I’ve sat on the draft and rewritten it a bit because I think that what I wanted to say has changed over the last few days. (That’s also why I’m hitting publish so late Wednesday night.) Initially, I’d drafted something about how academic incentives around research are kind of screwed up and how oftentimes it feels like you advance your career by moving away from the classroom. It’s not a complaint unique to the field of education. Every discipline has this tension, and professors are under a lot of pressure to prioritize research over their own teaching obligations. What bothers me, though, is the discordance between prestige and impact. Having a meaningful impact on classrooms, schools, or the education system is at best a secondary concern if you’re trying to advance your career and land those tenure-track jobs.
I still remember when our program organized a kind of send-off and final “here’s the real world” program for all the people who’d successfully defended dissertations that year and were, mostly, headed to their first academic jobs. At the front of the room, a highly accomplished professor, endowed chair, and chief editor of one of the top journals, who had held various leadership roles in research organizations and professional bodies, explained what really makes and breaks academic careers. “Don’t be too focused on teachers and schools. Do your research but write for the academic audience, for your peers,” she explained.
In education, I learned, there are three tiers of academic writing. The highest tier is theoretical work: developing models of how some facet of education works. The results of your studies are all fine and dandy, but what really moves a career forward is using those studies to create something with explanatory power that other researchers in the field might adopt and move forward with. There are only a few journals that even publish this kind of stuff, and getting published in one of these journals a handful of times should be the goal of every early career scholar. It’s how you gain prominence in the field and what tenure committees will pay attention to. Below that is a tier for publishing findings under other academics’ frameworks and theories. This is the publish-or-perish layer, and you should have two to three articles in this tier annually as a productive early career scholar. This tier also serves as a place to build up a research base that you can use to develop theories or models to go write about in a top-tier journal. The bottom layer is practice-focused publications and books. The audience for these publications is primarily teachers or teacher-educators. They’re practice-focused because they are primarily about the actual work of teaching or learning to teach. These, we were told, rarely “moved the needle” in our careers, and we should really focus on publishing elsewhere if we wanted careers in higher education. We should wait until we were established and had the time and freedom to devote ourselves to supporting kids or teachers in classrooms. It should wait until later.
Of course, this same accomplished professor had also lamented many times over the years that university-based researchers and scholars have limited impact on the actual classroom. It’s all run, you see, by economists and lawyers and politicians. This all bothered me because you can’t make this complaint while also perpetuating a system whereby the lowest-status work is also the work most directly connected to classrooms. What better tool to have an impact than the training and preparation of future educators? What better tool to have an impact than the development of classroom practices? It’s enough to drive someone mad. That complaint, though, isn’t a super interesting read on its own, and real-world relevance is a problem that exists throughout higher education, although it is perhaps most galling in colleges of education — to me at least.
This is why I linked Ferlazzo’s article in EdWeek. Deep inside, everyone in education knows that “Ivory Tower Syndrome” is a problem but challenging it may mean early career scholars struggle to get established. People do it, but it takes a special kind of person and the right university environment that will support your work in school and teacher ed settings. Ferlazzo frames his decision to work as a substitute teacher as something of an obligation.
The pressure of a tenure-track job at a major research university can be intense. Given the demands of research, teaching, and service, I was hesitant to take on anything new. But in 2021, shortly after submitting my file for tenure and promotion, I applied to be a substitute teacher in a local school district. It is easily one of the best decisions of my career.
When I began teaching courses at the college level more than 15 years ago, I committed to two key principles.
First, I am committed to ensuring every strategy, technique, and intervention is consistent with the best available scientific evidence. Teaching teachers to use methods with clear evidence of effectiveness gives them the best possible chance of success with their students.
Second, I am committed to never teaching students anything that I would not do myself. It is this second commitment that led me back into schools as a substitute teacher.
I do want to note, however, that he’d submitted his application for tenure, so he’s moving beyond the early career scholar stage! Anyway, the big point here is that it’s rare for professors at colleges of education to devote meaningful time and attention to the regular functioning of K-12 classrooms beyond whatever specific thing they’re researching. The outcomes of that research are too rarely cycled back into the classrooms where the research took place. Rather, researchers build their own academic careers on the backs of students and teachers and schools. It’s not a new problem or one that is particularly hidden. I think pretty much everyone knows it, but it has proven intractable.
Education Research is “Bad”
The reason I wanted to spend the first 1000 words or so of this post on my chief complaint about education research is to establish that I, in no way, want to position myself as someone defending all of education research from its critics. If anything, criticism should make our research programs stronger, but I think that only happens if criticism is well informed and understands the challenges researchers face. Recently, Kelsey Piper wrote at The Argument that Education Research is Weak and Sloppy. Central to her point was an attack on the work of education researcher and policy advocate, Jo Boaler. When researchers requested to know which schools were involved in a study of hers so they could reanalyze her data, she refused on the grounds that the schools had to remain anonymous. Those researchers were able to reverse engineer some of her research to identify the schools and found that she inappropriately compared top-quartile students at one school with middle-quartile students elsewhere. This is highly suspect analysis and seems intended to overstate the performance of students using Boaler’s curriculum. Because Boaler used anonymity to shield her research from criticism, Piper seems to have latched onto anonymity as a core reason education research is so “sloppy.”
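To see why that quartile comparison is so suspect, here’s a toy simulation with made-up numbers (mine, not Boaler’s data): draw two schools from the exact same score distribution, then compare one school’s top quartile against the other school’s middle. A large “gap” appears purely by construction.

```python
# Toy illustration (hypothetical numbers, not Boaler's data): comparing a
# top quartile at one school with the middle of another manufactures an
# apparent gap even when the two schools are statistically identical.
import random
import statistics

rng = random.Random(1)
school_a = sorted(rng.gauss(500, 100) for _ in range(400))  # identical
school_b = sorted(rng.gauss(500, 100) for _ in range(400))  # distributions

top_quartile_a = school_a[300:]    # top 25% of school A
middle_half_b = school_b[100:300]  # middle 50% of school B

gap = statistics.mean(top_quartile_a) - statistics.mean(middle_half_b)
print(f"apparent advantage for school A: {gap:.0f} points")
```

With identical normal distributions, the “advantage” comes out to roughly a full standard deviation on this scale. Selection on the outcome does all the work; no curriculum required.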
The fact that it’s normal to report on a school “confidentially,” without naming it, makes journals reluctant to require data sharing, and researchers almost never want to go to the extra effort to share their work if it’s not required or strongly encouraged.
It’s easy to say as an outsider to the field, but I think the idea of reporting on a school’s results “confidentially” just needs to go. Individual student results should be confidential, of course, but key information about school performance should be, and already is, public.
The norm that you can conceal at which school you conducted an intervention makes life easier for people who are committing fraud: no one can easily catch the fraud by calling up the school to ask if the data is accurate, or look at how it compares to other, publicly available data about the performance of the same students.
I get that, in this case, knowing which schools were involved up front might have prevented Boaler’s results from getting published, but I’m not sure ending anonymity would do much overall to improve the quality of education research. It’s also just one of the several dozen objections the critics Piper cites raise about her work, but the others don’t get much mention from Piper. If anything, getting rid of anonymity may make it harder to recruit schools and students to participate in research. Although Piper mentions other ways of improving research, such as preregistration, sharing code and data, and getting adequate funding so studies can cover lots and lots of schools and students (increasing statistical power), she really only devotes discussion to the problems of anonymity. She wants people “calling up the school to ask if the data is accurate, or look at how it compares to other, publicly available data about the performance of the same students.” Having done this kind of thing before, I can assure you and Piper that the school has no idea if the data is accurate. Schools are usually getting results the same way everyone else is: by reading the eventual paper. While there are some exceptions to this, such as action research, they’re not that common. What is needed is some after-the-fact way for researchers and policymakers to track down a study’s data, and that may or may not include information identifying the schools involved.
Given everything else that could be going on in education research, I think Piper is over-indexing on anonymity because Boaler was able to convince San Francisco to delay algebra until ninth grade citywide. For whatever reason, this has become a kind of metonymy for all the problems with education, and Piper is writing inside that bubble a bit. A great story for a journalist interested in this kind of thing might be figuring out how this one professor with some heterodox ideas about math education was able to persuade an entire city’s school system to do what she wanted. The quality of her research may turn out to be the least important part of the story. Piper, though, isn’t interested. She has other fish to fry.
Thinking about why education research is “bad” requires thinking about the nature of doing research in schools. It’s absolutely true, as Piper reports, that education journals have not adopted transparency pledges or reporting requirements of the sort that psychology journals have. In some cases this is because there just aren’t enough studies of the type that could even be rated according to the TOP Factor. For example, if you observed classrooms that were implementing some kind of new curriculum and wrote up an observational study of the practices of teachers and the responses of students — you know, because it’s important to know if a curriculum was implemented with fidelity — how would you ensure that “a party independent from the researchers verified that reported results reproduce using the same data and following the same computational procedures?” Your data are notes. What computation is being done? If you’re not doing a statistical analysis of some kind of quantified data, then this kind of rating system doesn’t make sense. If you’re a journal that regularly publishes work that’s not exclusively quantitative (there are a lot of mixed methods studies in education), you’re probably not going to use TOP Factor. But this is kind of the problem with Piper’s article and many criticisms of education research. They’re too far removed from the work being done to give a good critique. They see the problem but only through a glass, darkly.
That said, Kelsey is right that education research doesn’t meet the standards of, say, clinical psychology or more quantitative fields like economics. She identifies a deeper problem, but it’s one she doesn’t really offer a way to address, beyond asking economists to study homeschoolers.
Studies that are comprehensive and well-designed enough to find meaningful results are generally large and expensive. If every researcher is trying to prove the merits of their own quixotic curriculum by convincing one school at a time to enroll in a pilot and try it, we’ll get what we currently have, which is a huge number of fairly low-quality studies of individual one-off interventions — none of which constitute convincing evidence because they simply don’t have enough statistical power.
Instead, it would be much better to have fewer, much higher-quality studies which look at the rollout of a new policy across a district or across many districts, conducted and analyzed according to the (much higher) standards for careful work from disciplines like economics.
I’m not sure Piper knows this, but there is an entire field of education economics. It has journals (more than one). She has a weird idea that researchers are all out there trying to research their own “quixotic curriculum,” but in reality very few academic researchers are writing curriculum at all. There was actually a whole movement in curriculum studies in the mid-20th century called “the reconceptualization,” and it happened in part because curriculum writing moved out of universities and into the hands of publishers and government. A bit more journalism may be in order for Piper, but The Argument saw fit to publish and here we are. I don’t get the sense that Piper has a great grasp of what researchers actually do or of the challenges involved in education research. For example, the corollary to all the high-stakes testing of the last quarter century is the availability of large quantitative data sets that allow for analysis by, among others, economists. The thing she asked for already happened! None of that high-profile research illuminated some perfect set of best curricula or practices that are guaranteed to be effective.
Education Research is Hard
One thing I’m fond of saying about education research is that it has to take place under conditions that would be intolerable to researchers in other fields. Part of the challenge is shared with other social science research and with humanities research. A portion of what’s involved in studying education will always be quantitative. Getting some test scores and running a regression can tell you what happened in terms of changes in test scores, but it doesn’t tell teachers much about what they should be doing day-to-day in the classroom. It doesn’t tell you anything about students’ behaviors or cognition. It can’t tell you if the materials are bad or the teacher isn’t equipped to deliver instruction adequately. Some education research will always necessarily involve qualitative work, too. Someone needs to observe classrooms, talk to kids, examine materials. You can make that kind of work empirical, in the sense that you have established practices and procedures, make your notes and records available, and have written up qualitative coding, but that’s not really what Piper is talking about adopting from the world of economics and it won’t qualify as the kind of reproducibility that she’s asking for in her article.
Conducting the kind of research Piper wants is, as she says, expensive. I would also add that it’s labor intensive and requires multiple layers of consent and agreement. One does not simply walk into a school and declare herself to be doing research. A lot of attention is placed on the IRB process and a sense of over-protectiveness toward keeping kids and schools anonymous when it may not be necessary, but this is only one layer of policy. Oftentimes school and student anonymity is a requirement of the grants that pay for big research projects. School districts and schools themselves also usually want to be anonymous. Parents want their kids anonymous and want the schools to be anonymous. All of these same requirements also apply to consent. The school district must consent to research happening in any of its schools or classrooms. The schools themselves must consent. Teachers and students also have to all provide consent. At any point in this process, they can also pull out of the study. I have personally lost access to schools where I had already secured agreement to conduct research because someone at the district decided they now had a rule that researchers from out-of-state institutions cannot do research in their district.
These are conditions that most other research fields don’t have to deal with. It’s very common to have attrition of participants in studies but that attrition is usually because of individual factors and doesn’t happen to an entire set of participants simultaneously. Economists just don’t face these challenges. Nobody is going to strip them of access to their research sites because, as far as I can tell, they typically don’t research at sites. When that happens to clinical trials it’s newsworthy and unethical. When it happens in education research, well, them’s the breaks.
Education research also includes a whole mess of confounding conditions that aren’t easy to eliminate. Everything from the non-random assignment of students to classes and schools, to teacher training, to fidelity of implementation is essentially a toss-up in education research. Increasing statistical power can help, but it’s no guarantee and, as is well documented, effect sizes dwindle. Doing these kinds of studies well means having lots of training for participating teachers, doing audits and observations of the actual interventions, and ensuring that participating students are all actually showing up to school and participating. This is resource-intensive stuff and takes time and money and people.
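To make the statistical power point concrete, here’s a quick back-of-the-envelope simulation (my own sketch, not drawn from any study Piper cites): with a plausibly small true effect around d = 0.2, a study with one classroom’s worth of students per arm will miss the effect most of the time, and detecting it reliably takes hundreds of students per arm.

```python
# Rough power simulation: how often does a two-arm study with a true effect
# of d = 0.2 reach p < 0.05? (Illustrative sketch; ignores clustering by
# school, attrition, and the other confounds discussed above, all of which
# make real-world power worse, not better.)
import random
import statistics

def one_trial(n_per_arm, effect_d, rng):
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    treated = [rng.gauss(effect_d, 1.0) for _ in range(n_per_arm)]
    pooled_sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
    se = pooled_sd * (2 / n_per_arm) ** 0.5
    t = (statistics.mean(treated) - statistics.mean(control)) / se
    return abs(t) > 1.96  # approximates p < 0.05 at these sample sizes

def power(n_per_arm, effect_d=0.2, trials=1000, seed=42):
    rng = random.Random(seed)
    return sum(one_trial(n_per_arm, effect_d, rng) for _ in range(trials)) / trials

for n in (30, 100, 500):
    print(f"n = {n:>3} per arm -> power ~ {power(n):.2f}")
```

Under these assumptions, power climbs from roughly one-in-ten at 30 students per arm to around 0.9 at 500 per arm, and stacking on the real-world confounds only drags those numbers down.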
While Piper looks at all this from the outside and sees weakness and sloppiness, from the inside it looks like working within larger systemic constraints. None of that dooms us to a world of bad education research, but we do have to consider that decisions in education need to be made faster and with more urgency than well-done randomized controlled trials can deliver. Let’s look at some examples.
Flexible Phonics RCT
A 2024 study funded by a charity in the UK looked at over 3,000 students in 123 schools. The study, a randomized controlled trial, evaluated the efficacy of a program called Flexible Phonics.
Flexible Phonics approaches teach children to add another step after they have blended phonemes, to recognise whether they have successfully identified a word or if they need to use alternate strategies to do so. This ‘set-for-variability’ approach could enable children to read unfamiliar exception words independently (words that break phonic rules, such as ‘the’, ‘two’, or ‘above’).
The trial was properly registered in the UK and subject to ethics reviews as well as third party review. Notably, even in the UK, the schools are kept anonymous. It seems that it’s not just IRBs at American universities run amok!
Anyway, it turns out, Flexible Phonics made things worse.
Pupils who participated in Flexible Phonics made the equivalent of one month less progress, on average, in early word recognition than pupils who did not receive the programme.
A negative finding is not on its own a bad thing. It’s kind of exactly what Piper wants: evidence of efficacy (or a lack thereof). I want to draw your attention to an interesting bit, though. In the section where they detail their analysis, they mention a particular subgroup they perform a separate analysis on.
Subgroup analysis was conducted to examine whether the effect of the intervention differed among three different groups of pupils: FSM pupils, low-ability pupils, and pupils at schools that were not participating in the Nuffield Early Language Intervention (NELI).
Later they explain how this played out in their findings.
To assess whether the impact of Flexible Phonics differed depending on whether the school had any pupils participating in NELI, further analysis was carried out using an interaction between whether any pupils at the school took part in NELI and whether the school was part of the intervention group. This is reported in Table 18. The 95% credibility intervals spanned zero and so it is uncertain whether Flexible Phonics was more or less effective in schools where some pupils participated in NELI. While the 95% credibility intervals reported in Table 17 also spanned zero, the analysis provided marginal evidence that Flexible Phonics was more effective in schools which participated in NELI as the lower bound was very close to zero. This suggests that perhaps other schools in the intervention group would have benefited from the additional catch-up support offered by NELI. Had this been available, it is possible that the Flexible Phonics programme would have been more effective.
So, you may be thinking, “Wait, I thought they were investigating Flexible Phonics. Why are they also talking about the efficacy of the NELI program?” Because them’s the breaks! They set up a randomized controlled trial of a literacy program only to find that some of their sample included kids receiving a separate literacy intervention from some other initiative. In any other field this just wouldn’t happen. You wouldn’t have people receiving two interventions for the same thing. The entire point of an RCT is that this kind of thing isn’t supposed to happen, but in education, this crap happens all the time. Nobody in this study is being fraudulent or trying to deceive anyone, but the findings are less useful because of the confounding NELI intervention. Despite being the “gold standard,” there’s just nothing here that tells me much about what kind of intervention I’d like. Maybe Flexible Phonics pairs well with NELI. Maybe NELI is the bee’s knees all on its own and Flexible Phonics is crap. We don’t know from this study, even though it does pretty much everything Piper wants education research to do.
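For readers unfamiliar with the “interaction” analysis the report describes, here’s a minimal sketch of the logic, with entirely made-up numbers rather than the trial’s data (and a plain normal-approximation interval where the report used Bayesian credibility intervals, though the spans-zero reasoning is the same): estimate the treatment effect separately in NELI and non-NELI schools, then ask whether the difference between those two effects is distinguishable from zero.

```python
# Minimal subgroup-interaction sketch (hypothetical data, not the Flexible
# Phonics trial): does the treatment effect differ between NELI and
# non-NELI schools? The interaction is the difference of the two subgroup
# effects; its interval tells us whether that difference is distinguishable
# from zero.
import random
import statistics

def mean_se(xs):
    return statistics.mean(xs), (statistics.variance(xs) / len(xs)) ** 0.5

rng = random.Random(0)
# Simulated pupil scores: the treatment helps a little in NELI schools only.
groups = {
    ("treat", "neli"):    [rng.gauss(0.10, 1) for _ in range(200)],
    ("control", "neli"):  [rng.gauss(0.00, 1) for _ in range(200)],
    ("treat", "other"):   [rng.gauss(-0.05, 1) for _ in range(200)],
    ("control", "other"): [rng.gauss(0.00, 1) for _ in range(200)],
}

def subgroup_effect(subgroup):
    m1, se1 = mean_se(groups[("treat", subgroup)])
    m0, se0 = mean_se(groups[("control", subgroup)])
    return m1 - m0, (se1 ** 2 + se0 ** 2) ** 0.5

eff_neli, se_neli = subgroup_effect("neli")
eff_other, se_other = subgroup_effect("other")
interaction = eff_neli - eff_other
se_int = (se_neli ** 2 + se_other ** 2) ** 0.5
lo, hi = interaction - 1.96 * se_int, interaction + 1.96 * se_int
print(f"interaction = {interaction:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

With subgroups this size relative to the noise, an interval on the interaction will typically straddle zero even when a real (small) interaction was baked in, which is exactly the frustrating “it is uncertain” conclusion the evaluators report.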
Instructional Coherence Intervention in Tennessee
Michelle Caracappa brings us a great write-up of a study conducted in Tennessee to evaluate several tiers of literacy interventions. Here’s how the report from the Tennessee study describes the intervention.
In the fall of 2024, four elementary schools in Knox County Schools (KCS) piloted a new approach to supporting students who are academically behind. For over a decade, most Tennessee schools — including these four — have used intervention-specific materials when providing academic support to students outside of Tier I settings. However, for the 2024-25 academic year, the four schools are working to align the materials used in small-group settings to those used in core instruction — an approach known as instructional coherence — for kindergarten through third-grade literacy.
and
The KCS pilot builds on the work of a previous cross-district network that tracked student literacy growth for first through third graders during the 2022-23 school year, based on the kind of academic interventions provided to students. Some students only received Tier I instruction — the core instruction that all students receive — while others received additional supports like high-dosage tutoring (HDT) or Tier II and III intervention.
To translate this a bit: these schools in Knox County would evaluate kids’ literacy, and then some kids would get regular instruction (Tier 1), some would get high-dosage tutoring, some would get 30 minutes daily of literacy instruction in groups of five (Tier 2), and the lowest performers would get 45 minutes of literacy instruction in groups of three (Tier 3). As noted above, the main thing they studied here was instructional coherence, which seeks to ensure that all aspects of instruction rely on the same high-quality materials and are delivered effectively, rather than a scattershot approach where kids in intervention groups get a radically different curriculum than their peers. How’d that go?
Data from that network suggest that, regardless of performance level, students grew more when they received no additional support whatsoever compared to students placed in the most intense Tier II or III intervention, which meets three to five times per week. Students who grew the most were, in fact, the students placed in high-dosage tutoring, a less intensive intervention focused on support using Tier I curriculum and meeting two to three times per week.
(emphasis added)
Students grew more when they received no additional support. Caracappa makes a point about all the effort that went into building an intervention system that, effectively, made things worse.
It’s worth sitting with that finding for a minute, as it reflects a reality that is intolerable but perhaps not anomalous in many schools and districts. When I read that sentence I think about all of the effort that must have gone into providing intensive intervention instruction — the minutes spent administering assessments, analyzing data, creating groups, developing schedules, making staffing plans, and, at long last, delivering instruction. And yet, despite all of that effort — the time, the resources, the blood, the sweat, the tears — students on the receiving end grew less than had they never been pulled for additional small group support at all. It’s an unacceptable outcome for kids, and also for the adults who serve them, as confronting that reality is no doubt a recipe for demoralization.
One takeaway from the study is that they relied on universal screeners to make placement decisions for students.
In KCS’s prior intervention model, the district relied entirely on universal screeners to make intervention decisions for K-3 students. While they were able to identify students at risk with this approach, they were not able to concretely identify why students were struggling or how to support them. As a result, students were not grouped for intervention according to their specific needs, and interventionists rarely had sufficient data to determine what to teach to address students’ gaps.
This seems like a pretty big component that could have undermined the efficacy of the Tier 2 and 3 interventions.
You know what else? None of this is a finalized study. What they’re doing in Tennessee is all pilot work and they’re working towards an RCT. (Note: they say the results of that RCT will be available in fall 2025 but I couldn’t find anything. Perhaps it was delayed for some reason?) So, we again have conditions that are less than stellar from the perspective of Piper and her need for economics-like research. This isn’t an RCT and the researchers changed the intervention when they started to see kids failing to improve. It’s closer to what we call “action research” than to the replicable “scientific” stuff we’re told education needs more of.
Yet, unlike the RCT above that did everything right and yielded no really useful conclusions, the preliminary findings out of Knox County, Tennessee offer actionable insights any school could learn from right now. Caracappa connects this with two other research reports out of Tennessee and some work published elsewhere to make a good set of recommendations, including a checklist teachers can use to ask about their own school’s screeners, materials, and intervention setup.
It’s Scholastic Alchemy
I’ve come across as critical of Kelsey Piper, I’m sure, but I hope it at least reads as criticism meant to contextualize and inform rather than tear down. As I’ve said before, I like Piper’s writing and think she’s genuinely advocating for public schools to be better. Unlike many critics of schools, she’s not zero-sum and doesn’t seem to think that good schools are inherently scarce. I will always take the time to read what she writes about education, and it is often great food for thought. Indeed, I think she’s right that education research fails to meet the standards of many other academic disciplines. In some ways, this is because the incentive system is all messed up and drives researchers into publishing for an audience of other researchers, assuming someone else somewhere else will translate their findings into something useful for schools. In other ways, though, where Piper sees weakness and sloppiness, I see the challenge of working within a living, breathing system that is operating and changing as you study it. As an educational researcher, you’re never really in control of your research setting, and that means even the most carefully designed research can go off the rails. Oftentimes federal, state, or district education policies will shift in the middle of a study. A district might purchase a new curriculum, throwing your study into chaos because the control treatment has changed. It’s not supposed to!
You can eliminate these problems by removing yourself from the school context. Maybe you recruit kids and study them in a lab. Maybe you evaluate homeschoolers. Maybe you study college kids and argue the findings generalize. These are all approaches that have been used (except maybe homeschoolers?), but the problem is that the findings don’t really generalize. Schools aren’t labs and they aren’t homes. Teachers mostly aren’t laboratorians and they mostly aren’t their students’ families. The studies that are designed most effectively from a research-quality standpoint can be the studies that tell us the least, while some half-aborted pilot study provides deep insights, if only we’d listen.
To put a cap on the scholastic alchemy of it all, let’s return to something I wrote about the Science of Reading last fall.
We just don’t have a strong evidence base for massive whole-school phonics overhauls or even for whole-class phonics instruction. When it comes to teaching phonics to whole classes of children, we have one RCT study that “has been the subject of some substantive methodological critique” and we have one intervention study that “yielded a modest and nonsignificant effect of the intervention.”
It’s true! Despite all the claims you may hear on podcasts or on social media, the kinds of studies that show SoR approaches are effective are usually studies of kids with disabilities and of kids receiving instruction in clinical settings. There aren’t many studies of SoR approaches in general education environments during whole-class instruction. That’s right: even the much-vaunted SoR wouldn’t rise to the standard of evidence that Piper calls for. We’re just supposed to assume that what works for dyslexic kids and autistic kids is effective reading instruction for neurotypical kids. We’re just supposed to assume that what works in small groups or in literacy clinics delivered by speech and language pathologists is just as effective delivered to a whole classroom of kids by a teacher. People don’t know this! They think the science of reading is some massive scientific consensus with reams of supporting studies. It’s not!

I don’t know what iWumbo is, but I had this saved in my meme pictures folder and it’s one of my go-to quotes. It originates with baseball great Oscar Gamble.
But you know what, it’s good that we’re not waiting for the RCTs. It’s good that schools nationwide are once again embracing teaching phonics and aligning themselves to the science of reading. Despite my complaints, I do think that it’s a necessary change that will yield some benefit. That’s the alchemy of it all! We want high-quality evidence, but waiting for that evidence means kids not learning. We’re told academic researchers are too slow, too stuck on institutional bloat like IRBs or keeping schools anonymous, but the requested research modalities are even slower and take even longer. Since Piper has been writing so much about Mississippi, let’s use them as an example. If Mississippi had waited for RCTs or for the economists to swoop in and deliver educational research excellence, there would have been no Mississippi Miracle. Piper is, in effect, advocating against her own policy advocacy. If that ain’t scholastic alchemy, I don’t know what is. Mississippi had to make choices with imperfect information and a limited research base, and they appear to have made a good call. In education, perhaps more than in most other disciplines, uncertainty is the only thing that’s certain. You can rail against it all you want, but that won’t change what it is.