Kelsey Piper is Right about Education Research
But not for the reasons she mentioned
Hi! This is Scholastic Alchemy, a twice-weekly blog where I write about education and related topics. Mondays usually see me posting a selection of education links with some commentary about each, and Wednesday posts are typically a deep dive into an education topic of my choosing. If Scholastic Alchemy had a thesis, I suppose it would go a little like this: we keep trying to transmute educational gold from lead, it keeps not working, and we keep on trying. My goal here is to talk about curriculum, instruction, policy, public opinion, and other topics in order to explain why I think we keep failing to produce this magical educational gold. If you find that at all interesting, please consider a paid subscription here, or at the parallel publishing spot on Beehiiv. (Some folks hate the ‘stack, I get it.) That said, all posts are going to remain free for the foreseeable future. Thanks for reading!
Allow me a bit of a diversion before I address the larger point of today’s post.
While I was penning the reads to start your week post over the weekend, I mentioned that I was going to spend today’s post ranting about educational research. That’s still kind of the case, but I’ve sat on the draft and rewritten it a bit because I think that what I wanted to say has changed over the last few days. (That’s also why I’m hitting publish so late Wednesday night.) Initially, I’d drafted something about how academic incentives around research are kind of screwed up and how oftentimes it feels like you advance your career by moving away from the classroom. It’s not a complaint unique to the field of education. Every discipline has this tension, and professors are under a lot of pressure to prioritize research over their own teaching obligations. What bothers me, though, is the discordance between prestige and impact. Having a meaningful impact on classrooms, schools, or the education system is at best a secondary concern if you’re trying to advance your career and land those tenure-track jobs.
I still remember when our program organized a kind of send-off and final “here’s the real world” program for all the people who’d successfully defended dissertations that year and were, mostly, headed to their first academic jobs. At the front of the room, a highly accomplished professor, endowed chair, and chief editor of one of the top journals, who had held various leadership roles in research organizations and professional bodies, explained what really makes and breaks academic careers. “Don’t be too focused on teachers and schools. Do your research but write for the academic audience, for your peers,” she explained.
In education, I learned, there are three tiers of academic writing. The highest tier is theoretical work: developing models of how some facet of education works. The results of your studies are all fine and dandy, but what really moves a career forward is using those studies to create something with explanatory power that other researchers in the field might adopt and move forward with. There are only a few journals that even publish this kind of stuff, and getting published in one of these journals a handful of times should be the goal of every early career scholar. It’s how you gain prominence in the field and what tenure committees will pay attention to. Below that is a tier for publishing findings under other academics’ frameworks and theories. This is the publish-or-perish layer, and you should have two to three articles in this tier annually as a productive early career scholar. This tier also serves as a place to build up a research base that you can use to develop theories or models to go write about in a top-tier journal. The bottom layer is practice-focused publications and books. The audience for these publications is primarily teachers or teacher-educators. They’re practice-focused because they are primarily about the actual work of teaching or learning to teach. These, we were told, rarely “moved the needle” in our careers, and we should really focus on publishing elsewhere if we wanted careers in higher education. We should wait until we were established and had the time and freedom to devote ourselves to supporting kids or teachers in classrooms. It should wait until later.
Of course, this same accomplished professor had also lamented many times over the years that university-based researchers and scholars have limited impact on the actual classroom. It’s all run, you see, by economists and lawyers and politicians. This all bothered me because you can’t make this complaint while also perpetuating a system whereby the lowest-status work is also the work most directly connected to classrooms. What better tool to have an impact than the training and preparation of future educators? What better tool to have an impact than the development of classroom practices? It’s enough to drive someone mad. That complaint, though, isn’t a super interesting read on its own, and real-world relevance is a problem that exists throughout higher education, although it is perhaps most galling in colleges of education — to me at least.
This is why I linked Ferlazzo’s article in EdWeek. Deep inside, everyone in education knows that “Ivory Tower Syndrome” is a problem but challenging it may mean early career scholars struggle to get established. People do it, but it takes a special kind of person and the right university environment that will support your work in school and teacher ed settings. Ferlazzo frames his decision to work as a substitute teacher as something of an obligation.
The pressure of a tenure-track job at a major research university can be intense. Given the demands of research, teaching, and service, I was hesitant to take on anything new. But in 2021, shortly after submitting my file for tenure and promotion, I applied to be a substitute teacher in a local school district. It is easily one of the best decisions of my career.
When I began teaching courses at the college level more than 15 years ago, I committed to two key principles.
First, I am committed to ensuring every strategy, technique, and intervention is consistent with the best available scientific evidence. Teaching teachers to use methods with clear evidence of effectiveness gives them the best possible chance of success with their students.
Second, I am committed to never teaching students anything that I would not do myself. It is this second commitment that led me back into schools as a substitute teacher.
I do want to note, however, that he’d submitted his application for tenure, so he’s moving beyond the early career scholar stage! Anyway, the big point here is that it’s rare for professors at colleges of education to devote meaningful time and attention to the regular functioning of K-12 classrooms beyond whatever specific thing they’re researching. The outcomes of that research are too rarely cycled back into the classrooms where the research took place. Rather, researchers build their own academic careers on the backs of students and teachers and schools. It’s not a new problem or one that is particularly hidden. I think pretty much everyone knows it, but it has proven intractable.
Education Research is “Bad”
The reason I wanted to spend the first 1000 words or so of this post on my chief complaint about education research is to establish that I, in no way, want to position myself as someone defending all of education research from its critics. If anything, criticism should make our research programs stronger, but I think that only happens if criticism is well informed and understands the challenges researchers face. Recently, Kelsey Piper wrote at The Argument that Education Research is Weak and Sloppy. Central to her point was an attack on the work of education researcher and policy advocate, Jo Boaler. When researchers requested to know which schools were involved in a study of hers so they could reanalyze her data, she refused on the grounds that the schools had to remain anonymous. Those researchers were able to reverse engineer some of her research to identify the schools and found that she inappropriately compared top-quartile students at one school with middle-quartile students elsewhere. This is highly suspect analysis and seems intended to overstate the performance of students using Boaler’s curriculum. Because Boaler used anonymity to shield her research from criticism, Piper seems to have latched onto anonymity as a core reason education research is so “sloppy.”
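To see why that quartile comparison is so suspect, here’s a toy simulation with made-up numbers (mine, not Boaler’s data): draw two schools from the exact same score distribution, then compare one school’s top quartile against the other school’s middle. A large “gap” appears purely by construction.

```python
# Toy illustration (hypothetical numbers, not Boaler's data): comparing a
# top quartile at one school with the middle of another manufactures an
# apparent gap even when the two schools are statistically identical.
import random
import statistics

rng = random.Random(1)
school_a = sorted(rng.gauss(500, 100) for _ in range(400))  # identical
school_b = sorted(rng.gauss(500, 100) for _ in range(400))  # distributions

top_quartile_a = school_a[300:]    # top 25% of school A
middle_half_b = school_b[100:300]  # middle 50% of school B

gap = statistics.mean(top_quartile_a) - statistics.mean(middle_half_b)
print(f"apparent advantage for school A: {gap:.0f} points")
```

With identical normal distributions, the “advantage” comes out to roughly a full standard deviation on this scale. Selection on the outcome does all the work; no curriculum required.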
The fact that it’s normal to report on a school “confidentially,” without naming it, makes journals reluctant to require data sharing, and researchers almost never want to go to the extra effort to share their work if it’s not required or strongly encouraged.
It’s easy to say as an outsider to the field, but I think the idea of reporting on a school’s results “confidentially” just needs to go. Individual student results should be confidential, of course, but key information about school performance should be, and already is, public.
The norm that you can conceal at which school you conducted an intervention makes life easier for people who are committing fraud: no one can easily catch the fraud by calling up the school to ask if the data is accurate, or look at how it compares to other, publicly available data about the performance of the same students.
I get that, in this case, knowing which schools were involved up front might have prevented Boaler’s results from getting published, but I’m not sure ending anonymity would do much overall to improve the quality of education research. It’s also just one of the several dozen objections the critics Piper cites raise about her work, but the others don’t get much mention from Piper. If anything, getting rid of anonymity may make it harder to recruit schools and students to participate in research. Although Piper mentions other ways of improving research, such as preregistration, sharing code and data, and getting adequate funding so studies can cover lots and lots of schools and students (increasing statistical power), she really only devotes discussion to the problems of anonymity. She wants people “calling up the school to ask if the data is accurate, or look at how it compares to other, publicly available data about the performance of the same students.” Having done this kind of thing before, I can assure you and Piper that the school has no idea if the data is accurate. Schools are usually getting results the same way everyone else is: by reading the eventual paper. While there are some exceptions to this, such as action research, they’re not that common. What is needed is some after-the-fact way for researchers and policymakers to track down a study’s data, and that may or may not include information identifying the schools involved.
Given everything else that could be going on in education research, I think Piper is over-indexing on anonymity because Boaler was able to convince San Francisco to delay algebra until ninth grade citywide. For whatever reason, this has become a kind of metonymy for all the problems with education, and Piper is writing inside that bubble a bit. A great story for a journalist interested in this kind of thing might be figuring out how this one professor with some heterodox ideas about math education was able to persuade an entire city’s school system to do what she wanted. The quality of her research may turn out to be the least important part of the story. Piper, though, isn’t interested. She has other fish to fry.
Thinking about why education research is “bad” requires thinking about the nature of doing research in schools. It’s absolutely true, as Piper reports, that education journals have not adopted transparency pledges or reporting requirements of the sort that psychology journals have. In some cases this is because there just aren’t enough studies of the type that could even be rated according to the TOP Factor. For example, if you observed classrooms that were implementing some kind of new curriculum and wrote up an observational study of the practices of teachers and the responses of students — you know, because it’s important to know if a curriculum was implemented with fidelity — how would you ensure that “a party independent from the researchers verified that reported results reproduce using the same data and following the same computational procedures?” Your data are notes. What computation is being done? If you’re not doing a statistical analysis of some kind of quantified data, then this kind of rating system doesn’t make sense. If you’re a journal that regularly publishes work that’s not exclusively quantitative (there are a lot of mixed methods studies in education), you’re probably not going to use TOP Factor. But this is kind of the problem with Piper’s article and many criticisms of education research. They’re too far removed from the work being done to give a good critique. They see the problem but only through a glass, darkly.
That said, Kelsey is right that education research doesn’t meet the standards of, say, clinical psychology or more quantitative fields like economics. She identifies a deeper problem, but it’s one she doesn’t really offer a way to address, beyond asking economists to study homeschoolers.
Studies that are comprehensive and well-designed enough to find meaningful results are generally large and expensive. If every researcher is trying to prove the merits of their own quixotic curriculum by convincing one school at a time to enroll in a pilot and try it, we’ll get what we currently have, which is a huge number of fairly low-quality studies of individual one-off interventions — none of which constitute convincing evidence because they simply don’t have enough statistical power.
Instead, it would be much better to have fewer, much higher-quality studies which look at the rollout of a new policy across a district or across many districts, conducted and analyzed according to the (much higher) standards for careful work from disciplines like economics.
I’m not sure Piper knows this, but there is an entire field of education economics. It has journals (more than one). She has a weird idea that researchers are all out there trying to research their own “quixotic curriculum,” but in reality very few academic researchers are writing curriculum at all. There was actually a whole movement in curriculum studies in the mid-20th century called “the reconceptualization,” and it happened in part because curriculum writing moved out of universities and into the hands of publishers and government. A bit more journalism may be in order for Piper, but The Argument saw fit to publish and here we are. I don’t get the sense that Piper has a great grasp of what researchers actually do or of the challenges involved in education research. For example, the corollary to all the high-stakes testing of the last quarter century is the availability of large quantitative data sets that allow for analysis by, among others, economists. The thing she asked for already happened! None of that high-profile research illuminated some perfect set of best curricula or practices that are guaranteed to be effective.
Education Research is Hard
One thing I’m fond of saying about education research is that it has to take place under conditions that would be intolerable to researchers in other fields. Part of the challenge is shared with other social science research and with humanities research. A portion of what’s involved in studying education will always be quantitative. Getting some test scores and running a regression can tell you what happened in terms of changes in test scores, but it doesn’t tell teachers much about what they should be doing day-to-day in the classroom. It doesn’t tell you anything about students’ behaviors or cognition. It can’t tell you if the materials are bad or the teacher isn’t equipped to deliver instruction adequately. Some education research will always necessarily involve qualitative work, too. Someone needs to observe classrooms, talk to kids, examine materials. You can make that kind of work empirical, in the sense that you have established practices and procedures, make your notes and records available, and have written up qualitative coding, but that’s not really what Piper is talking about adopting from the world of economics and it won’t qualify as the kind of reproducibility that she’s asking for in her article.
Conducting the kind of research Piper wants is, as she says, expensive. I would also add that it’s labor intensive and requires multiple layers of consent and agreement. One does not simply walk into a school and declare herself to be doing research. A lot of attention is placed on the IRB process and a sense of over-protectiveness toward keeping kids and schools anonymous when it may not be necessary, but this is only one layer of policy. Oftentimes school and student anonymity is a requirement of the grants that pay for big research projects. School districts and schools themselves also usually want to be anonymous. Parents want their kids anonymous and want the schools to be anonymous. All of these same requirements also apply to consent. The school district must consent to research happening in any of its schools or classrooms. The schools themselves must consent. Teachers and students also have to all provide consent. At any point in this process, they can also pull out of the study. I have personally lost access to schools where I had already secured agreement to conduct research because someone at the district decided they now had a rule that researchers from out-of-state institutions cannot do research in their district.
These are conditions that most other research fields don’t have to deal with. It’s very common to have attrition of participants in studies but that attrition is usually because of individual factors and doesn’t happen to an entire set of participants simultaneously. Economists just don’t face these challenges. Nobody is going to strip them of access to their research sites because, as far as I can tell, they typically don’t research at sites. When that happens to clinical trials it’s newsworthy and unethical. When it happens in education research, well, them’s the breaks.
Education research also includes a whole mess of confounding conditions that aren’t easy to eliminate. Everything from the non-random assignment of students to classes and schools, to teacher training, to fidelity of implementation is essentially a toss-up in education research. Increasing statistical power can help, but it’s no guarantee and, as is well documented, effect sizes dwindle. Doing these kinds of studies well means having lots of training for participating teachers, doing audits and observations of the actual interventions, and ensuring that participating students are all actually showing up to school and participating. This is resource-intensive stuff and takes time and money and people.
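To make the statistical power point concrete, here’s a quick back-of-the-envelope simulation (my own sketch, not drawn from any study Piper cites): with a plausibly small true effect around d = 0.2, a study with one classroom’s worth of students per arm will miss the effect most of the time, and detecting it reliably takes hundreds of students per arm.

```python
# Rough power simulation: how often does a two-arm study with a true effect
# of d = 0.2 reach p < 0.05? (Illustrative sketch; ignores clustering by
# school, attrition, and the other confounds discussed above, all of which
# make real-world power worse, not better.)
import random
import statistics

def one_trial(n_per_arm, effect_d, rng):
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    treated = [rng.gauss(effect_d, 1.0) for _ in range(n_per_arm)]
    pooled_sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
    se = pooled_sd * (2 / n_per_arm) ** 0.5
    t = (statistics.mean(treated) - statistics.mean(control)) / se
    return abs(t) > 1.96  # approximates p < 0.05 at these sample sizes

def power(n_per_arm, effect_d=0.2, trials=1000, seed=42):
    rng = random.Random(seed)
    return sum(one_trial(n_per_arm, effect_d, rng) for _ in range(trials)) / trials

for n in (30, 100, 500):
    print(f"n = {n:>3} per arm -> power ~ {power(n):.2f}")
```

Under these assumptions, power climbs from roughly one-in-ten at 30 students per arm to around 0.9 at 500 per arm, and stacking on the real-world confounds only drags those numbers down.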
While Piper looks at all this from the outside and sees weakness and sloppiness, from the inside it looks like working within larger systemic constraints. None of that dooms us to a world of bad education research, but we do have to consider that decisions in education need to be made faster and with more urgency than well-done randomized controlled trials can deliver. Let’s look at some examples.
Flexible Phonics RCT
A 2024 study funded by a charity in the UK looked at over 3,000 students in 123 schools. The study, a randomized controlled trial, evaluated the efficacy of a program called Flexible Phonics.
Flexible Phonics approaches teach children to add another step after they have blended phonemes, to recognise whether they have successfully identified a word or if they need to use alternate strategies to do so. This ‘set-for-variability’ approach could enable children to read unfamiliar exception words independently (words that break phonic rules, such as ‘the’, ‘two’, or ‘above’).
The trial was properly registered in the UK and subject to ethics reviews as well as third party review. Notably, even in the UK, the schools are kept anonymous. It seems that it’s not just IRBs at American universities run amok!
Anyway, it turns out, Flexible Phonics made things worse.
Pupils who participated in Flexible Phonics made the equivalent of one month less progress, on average, in early word recognition than pupils who did not receive the programme.
A negative finding is not on its own a bad thing. It’s kind of exactly what Piper wants: evidence of efficacy (or a lack thereof). I want to draw your attention to an interesting bit, though. In the section where they detail their analysis, they mention a particular subgroup they perform a separate analysis on.
Subgroup analysis was conducted to examine whether the effect of the intervention differed among three different groups of pupils: FSM pupils, low-ability pupils, and pupils at schools that were not participating in the Nuffield Early Language Intervention (NELI).
Later they explain how this played out in their findings.
To assess whether the impact of Flexible Phonics differed depending on whether the school had any pupils participating in NELI, further analysis was carried out using an interaction between whether any pupils at the school took part in NELI and whether the school was part of the intervention group. This is reported in Table 18. The 95% credibility intervals spanned zero and so it is uncertain whether Flexible Phonics was more or less effective in schools where some pupils participated in NELI. While the 95% credibility intervals reported in Table 17 also spanned zero, the analysis provided marginal evidence that Flexible Phonics was more effective in schools which participated in NELI as the lower bound was very close to zero. This suggests that perhaps other schools in the intervention group would have benefited from the additional catch-up support offered by NELI. Had this been available, it is possible that the Flexible Phonics programme would have been more effective.
So, you may be thinking, “Wait, I thought they were investigating Flexible Phonics. Why are they also talking about the efficacy of the NELI program?” Because them’s the breaks! They set up a randomized controlled trial of a literacy program only to find that some of their sample included kids receiving a separate literacy intervention from some other initiative. In any other field this just wouldn’t happen. You wouldn’t have people receiving two interventions for the same thing. The entire point of an RCT is that this kind of thing isn’t supposed to happen, but in education, this crap happens all the time. Nobody in this study is being fraudulent or trying to deceive anyone, but the findings are less useful because of the confounding NELI intervention. Despite being the “gold standard,” there’s just nothing here that tells me much about what kind of intervention I’d like. Maybe Flexible Phonics pairs well with NELI. Maybe NELI is the bee’s knees all on its own and Flexible Phonics is crap. We don’t know from this study, even though it does pretty much everything Piper wants education research to do.
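For readers unfamiliar with the “interaction” analysis the report describes, here’s a minimal sketch of the logic, with entirely made-up numbers rather than the trial’s data (and a plain normal-approximation interval where the report used Bayesian credibility intervals, though the spans-zero reasoning is the same): estimate the treatment effect separately in NELI and non-NELI schools, then ask whether the difference between those two effects is distinguishable from zero.

```python
# Minimal subgroup-interaction sketch (hypothetical data, not the Flexible
# Phonics trial): does the treatment effect differ between NELI and
# non-NELI schools? The interaction is the difference of the two subgroup
# effects; its interval tells us whether that difference is distinguishable
# from zero.
import random
import statistics

def mean_se(xs):
    return statistics.mean(xs), (statistics.variance(xs) / len(xs)) ** 0.5

rng = random.Random(0)
# Simulated pupil scores: the treatment helps a little in NELI schools only.
groups = {
    ("treat", "neli"):    [rng.gauss(0.10, 1) for _ in range(200)],
    ("control", "neli"):  [rng.gauss(0.00, 1) for _ in range(200)],
    ("treat", "other"):   [rng.gauss(-0.05, 1) for _ in range(200)],
    ("control", "other"): [rng.gauss(0.00, 1) for _ in range(200)],
}

def subgroup_effect(subgroup):
    m1, se1 = mean_se(groups[("treat", subgroup)])
    m0, se0 = mean_se(groups[("control", subgroup)])
    return m1 - m0, (se1 ** 2 + se0 ** 2) ** 0.5

eff_neli, se_neli = subgroup_effect("neli")
eff_other, se_other = subgroup_effect("other")
interaction = eff_neli - eff_other
se_int = (se_neli ** 2 + se_other ** 2) ** 0.5
lo, hi = interaction - 1.96 * se_int, interaction + 1.96 * se_int
print(f"interaction = {interaction:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

With subgroups this size relative to the noise, an interval on the interaction will typically straddle zero even when a real (small) interaction was baked in, which is exactly the frustrating “it is uncertain” conclusion the evaluators report.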
Instructional Coherence Intervention in Tennessee
Michelle Caracappa brings us a great write-up of a study conducted in Tennessee to evaluate several tiers of literacy interventions. Here’s how the report from the Tennessee study describes the intervention.
In the fall of 2024, four elementary schools in Knox County Schools (KCS) piloted a new approach to supporting students who are academically behind. For over a decade, most Tennessee schools — including these four — have used intervention-specific materials when providing academic support to students outside of Tier I settings. However, for the 2024-25 academic year, the four schools are working to align the materials used in small-group settings to those used in core instruction — an approach known as instructional coherence — for kindergarten through third-grade literacy.
and
The KCS pilot builds on the work of a previous cross-district network that tracked student literacy growth for first through third graders during the 2022-23 school year, based on the kind of academic interventions provided to students. Some students only received Tier I instruction — the core instruction that all students receive — while others received additional supports like high-dosage tutoring (HDT) or Tier II and III intervention.
To translate this a bit: these schools in Knox County would evaluate kids’ literacy, and then some kids would get regular instruction (Tier 1), some would get high-dosage tutoring, some would get 30 minutes daily of literacy instruction in groups of five (Tier 2), and the lowest performers would get 45 minutes of literacy instruction in groups of three (Tier 3). As noted above, the main thing they studied here was instructional coherence, which seeks to ensure that all aspects of instruction rely on the same high-quality materials and are delivered effectively, rather than a scattershot approach where kids in intervention groups get a radically different curriculum than their peers. How’d that go?
Data from that network suggest that, regardless of performance level, students grew more when they received no additional support whatsoever compared to students placed in the most intense Tier II or III intervention, which meets three to five times per week. Students who grew the most were, in fact, the students placed in high-dosage tutoring, a less intensive intervention focused on support using Tier I curriculum and meeting two to three times per week.
(emphasis added)
Students grew more when they received no additional support. Caracappa makes a point about all the effort that went into building an intervention system that, effectively, made things worse.
It’s worth sitting with that finding for a minute, as it reflects a reality that is intolerable but perhaps not anomalous in many schools and districts. When I read that sentence I think about all of the effort that must have gone into providing intensive intervention instruction — the minutes spent administering assessments, analyzing data, creating groups, developing schedules, making staffing plans, and, at long last, delivering instruction. And yet, despite all of that effort — the time, the resources, the blood, the sweat, the tears — students on the receiving end grew less than had they never been pulled for additional small group support at all. It’s an unacceptable outcome for kids, and also for the adults who serve them, as confronting that reality is no doubt a recipe for demoralization.
One takeaway from the study is that they relied on universal screeners to make placement decisions for students.
In KCS’s prior intervention model, the district relied entirely on universal screeners to make intervention decisions for K-3 students. While they were able to identify students at risk with this approach, they were not able to concretely identify why students were struggling or how to support them. As a result, students were not grouped for intervention according to their specific needs, and interventionists rarely had sufficient data to determine what to teach to address students’ gaps.
This seems like a pretty big component that could have undermined the efficacy of the Tier 2 and 3 interventions.
You know what else? None of this is a finalized study. What they’re doing in Tennessee is all pilot work and they’re working towards an RCT. (Note: they say the results of that RCT will be available in fall 2025 but I couldn’t find anything. Perhaps it was delayed for some reason?) So, we again have conditions that are less than stellar from the perspective of Piper and her need for economics-like research. This isn’t an RCT and the researchers changed the intervention when they started to see kids failing to improve. It’s closer to what we call “action research” than to the replicable “scientific” stuff we’re told education needs more of.
Yet, unlike the RCT above that did everything right and yielded no really useful conclusions, the preliminary findings out of Knox County, Tennessee offer actionable insights any school could learn from right now. Caracappa connects this with two other research reports out of Tennessee and some work published elsewhere to make a good set of recommendations, including a checklist teachers can use to ask about their own school’s screeners, materials, and intervention setup.
It’s Scholastic Alchemy
I’ve come across as critical of Kelsey Piper, I’m sure, but I hope it at least reads as criticism meant to contextualize and inform rather than tear down. As I’ve said before, I like Piper’s writing and think she’s genuinely advocating for public schools to be better. Unlike many critics of schools, she’s not zero-sum and doesn’t seem to think that good schools are inherently scarce. I will always take the time to read what she writes about education, and it is often great food for thought. Indeed, I think she’s right that education research fails to meet the standards of many other academic disciplines. In some ways, this is because the incentive system is all messed up and drives researchers into publishing for an audience of other researchers, assuming someone else somewhere else will translate their findings into something useful for schools. In other ways, though, where Piper sees weakness and sloppiness, I see the challenge of working within a living, breathing system that is operating and changing as you study it. As an educational researcher, you’re never really in control of your research setting, and that means even the most carefully designed research can go off the rails. Oftentimes federal, state, or district education policies will shift in the middle of a study. A district might purchase a new curriculum, throwing your study into chaos because the control treatment has changed. It’s not supposed to!
You can eliminate these problems by removing yourself from the school context. Maybe you recruit kids and study them in a lab. Maybe you evaluate homeschoolers. Maybe you study college kids and argue the findings generalize. These are all approaches that have been used (except maybe homeschoolers?), but the problem is that the findings don’t really generalize. Schools aren’t labs and they aren’t homes. Teachers mostly aren’t laboratorians and they mostly aren’t their students’ families. The studies that are designed most effectively from a research-quality standpoint can be the studies that tell us the least, while some half-aborted pilot study provides deep insights, if only we’d listen.
To put a cap on the scholastic alchemy of it all, let’s return to something I wrote about the Science of Reading last fall.
We just don’t have a strong evidence base for massive whole-school phonics overhauls or even for whole-class phonics instruction. When it comes to teaching phonics to whole classes of children, we have one RCT study that “has been the subject of some substantive methodological critique” and we have one intervention study that “yielded a modest and nonsignificant effect of the intervention.”
It’s true! Despite all the claims you may hear on podcasts or on social media, the kinds of studies that show SoR approaches are effective are usually studies of kids with disabilities and of kids receiving instruction in clinical settings. There aren’t many studies of SoR approaches in general education environments during whole-class instruction. That’s right: even the much-vaunted SoR wouldn’t rise to the standard of evidence that Piper calls for. We’re just supposed to assume that what works for dyslexic kids and autistic kids is effective reading instruction for neurotypical kids. We’re just supposed to assume that what works in small groups or in literacy clinics delivered by speech and language pathologists is just as effective delivered to a whole classroom of kids by a teacher. People don’t know this! They think the science of reading is some massive scientific consensus with reams of supporting studies. It’s not!

I don’t know what iWumbo is, but I had this saved in my meme pictures folder and it’s one of my go-to quotes. It originates with baseball great Oscar Gamble.
But you know what, it’s good that we’re not waiting for the RCTs. It’s good that schools nationwide are once again embracing teaching phonics and aligning themselves to the science of reading. Despite my complaints, I do think that it’s a necessary change that will yield some benefit. That’s the alchemy of it all! We want high-quality evidence, but waiting for that evidence means kids not learning. We’re told academic researchers are too slow, too stuck on institutional bloat like IRBs or keeping schools anonymous, but the requested research modalities are even slower and take even longer. Since Piper has been writing so much about Mississippi, let’s use them as an example. If Mississippi had waited for RCTs or for the economists to swoop in and deliver educational research excellence, there would have been no Mississippi Miracle. Piper is, in effect, advocating against her own policy advocacy. If that ain’t scholastic alchemy, I don’t know what is. Mississippi had to make choices with imperfect information and a limited research base, and they appear to have made a good call. In education, perhaps more than in most other disciplines, uncertainty is the only thing that’s certain. You can rail against it all you want, but that won’t change what it is.