Education Reads to Start Your Week, April 20th
Accountability Never Died, Khanmigo Postmortem, Ed Professor Substitute Teaches
Hi! This is Scholastic Alchemy, a twice-weekly blog where I write about education and related topics. Mondays usually see me posting a selection of education links and some commentary about each, and Wednesday posts are typically a deep dive into an education topic of my choosing. If Scholastic Alchemy had a thesis, I suppose it would go a little like this: We keep trying to induce educational gold from lead, it keeps not working, and we keep on trying anyway. My goal here is to talk about curriculum, instruction, policy, public opinion, and other topics in order to explain why I think we keep failing to produce this magical educational gold. If you find that at all interesting, please consider a paid subscription here, or at the parallel publishing spot on Beehiiv. (Some folks hate the ‘stack, I get it.) That said, all posts are going to remain free for the foreseeable future. Thanks for reading!
Accountability Never Died
Freddie deBoer tries to set the record straight on the NCLB/ESSA era. He is writing in response to the argument made by some liberal commentators that No Child Left Behind was successful. Freddie makes three main points. I think readers of Scholastic Alchemy will find several of these awfully familiar. First, he says that the accountability era never ended.
This continuity matters for the empirical question of whether accountability “worked.” The basic architecture - annual census-level testing, achievement-gap reporting, consequences for low performers, evaluation regimes tied to scores - has been running continuously for almost twenty-five years. So, can flat long-term standardized test trend lines, persistent (and in some subjects widening) racial achievement gaps, and continued middling U.S. performance on PISA honestly be blamed on a premature retreat from “accountability”? What does that even really mean? The ed reformers claim that the accountability project was abandoned before it could deliver what it was intended to, but that’s a claim the statutes, state rating systems, and evaluation policies do not support. A more candid reading of the record is that accountability was sustained long enough and consistently enough to be judged on its own terms - and on those terms, the theory of action has not been vindicated. Because while federal policy can have big impacts on fairness and quality of life for teachers and schools, there’s simply no reason to think that it has consistent or meaningful effects on test scores.
Second, Freddie argues that the causal claims about NCLB’s impacts are “contentious at best.”
NAEP scores in math and reading were already rising before NCLB was signed in January 2002. The 1990s had seen consistent gains, particularly among Black and Hispanic students. Why those gains were occurring is subject to as much debate as all the rest of this stuff; it will not surprise you to hear that I suspect that the remarkable decline in concentrated poverty during that decade played a large role. Whatever the causes, to credit NCLB with gains that began a decade prior doesn’t make much sense. More fundamentally, there is no counterfactual: every American public school was subject to NCLB simultaneously, so there is no control group against which to measure its effect. Without one, the claim that NCLB caused gains is simply not falsifiable in any rigorous sense, which means it’s a story imposed on a trend line rather than a scientific claim…
Third, he pointedly wonders how education policy in the US, especially policy enacted in small subsets of the US, could have global education impacts. He looks to PISA and finds declines in student performance across wealthy Western nations that would not have been subject to changes in American educational policy.
The PISA declines visible in American math and reading scores over the 2003–2022 period aren’t remotely anomalous; they’re part of a near-universal pattern among wealthy, developed democracies. In particular, the Netherlands, Finland, Belgium, Canada, and Australia - that is, countries with many economic and social similarities but radically different curriculum philosophies, funding structures, pedagogical traditions, etc - all show trajectories strikingly similar to that of the United States. (In fact Finland, long held up as the gold standard of education reform and frequently invoked as a rebuke to American approaches, has seen some of the steepest reading declines in the developed world.) If policy and pedagogy were the primary drivers of American underperformance, one would expect American trends to diverge from those of peer nations, to look distinctively bad in ways that track distinctively American choices. Instead, what the data show is convergence: a broad, shared downward drift across the developed world that almost certainly reflects forces operating above the level of any individual nation’s classroom policy. Pinning these trends on American policy choices, without accounting for why virtually identical trends appear in countries that made very different choices, is not serious analysis.
In particular, Freddie seems fed up with complaints that implied middle school algebra policies in San Francisco were somehow responsible for nationwide or even global education trends.
Were there some bad educational ideas batted around in the 2010s? Sure. Ending algebra in 8th grade out of specious equity concerns, for example, was misguided in profoundly predictable ways. (The affluent kids just get algebra instruction outside of public schools, whether in private schools or with private tutors, rendering the whole thing a farce.) But there’s a reason that San Francisco is so often singled out for this bad policy: the number of districts who enacted it was tiny. I can’t get straight numbers, but I believe that the number of districts that eliminated the option for 8th graders to take algebra is in the single digits; there are more than 13,000 public school districts in the United States. Even San Francisco has rolled this policy back. Yes, participation rates are down, but that’s largely due to changes in standards, particularly Common Core sequencing - yes, the de facto national curricular standards beloved by the accountability people. This is one of the weird things about this whole debate, the way that the rhetoric of a loud fringe and the actions of a tiny number of outlier schools and districts are mistaken for actual meaningful pedagogical and policy change.
As I commented on a different article in The Argument, it’s “wild to think that a single school district’s math curriculum is the reason Trump is president, but here we are.” And yet, this is what they seem to earnestly believe. Were it not for San Francisco’s ~10-year dalliance with delaying algebra, Silicon Valley would not have swung to the right and embraced Trump. Freddie reminds us that to some extent we’re reading today’s ideological battles onto past events where, even in the recent past, those frameworks may not have existed or been correct.
I’ll also point out that I have been making some similar arguments here at Scholastic Alchemy. I wrote that School Accountability Never Died. I pointed out that NCLB was not a great success and pushed back against the idea that it was Democrats or “the left” who were solely responsible for returning accountability to the states. Indeed, it seems like commentators are practically trying to get the facts of ~2010s education policy wrong on purpose. So, you know, read all those at your leisure.
Khanmigo Postmortem
There are a bunch of takes flying about Khan Academy’s semi-scrapping of their AI chatbot Khanmigo but the one you should read in full is Dan Meyer’s RIP Khanmigo. A few key bits.
On a recent webinar, Kristen DiCerbo indicated that student usage of Khanmigo was not what Khan Academy wanted. “I will tell you, we see more ‘IDK IDK,’” she said, “more passive kinds of interaction than we would like.” Critics suggested that the differences between Khanmigo and human tutors were vast, with the chatbot unable to draw on a relationship with students, unable to initiate or end conversations with the sensitivity possessed by even quite average human tutors.
At this point, with Khanmigo already struggling to meet Sal Khan’s expectations, critics began to scrutinize Khan Academy’s efficacy research. Laurence Holt dubbed its deficiencies “The 5 Percent Problem,” so called because Khan Academy’s strongest effect sizes (0.26 standard deviations above the mean in a 2022 study, for example) were achieved only after excluding 95% of the study population. In a more recent study, Khan Academy lowered their inclusion threshold significantly, avoiding the 5 Percent Problem, but also watching their effect size boil away to nearly nothing.
Sal Khan has blamed teachers for low student Khanmigo usage, saying that teachers “need to figure out ways to engage them more” with Khanmigo. In Chalkbeat, DiCerbo blamed students, saying, “Students aren’t great at asking questions well,” which will come as a surprise to anyone who has ever known a small child. Yes, it’s possible that some of the most inquisitive beings on earth aren’t all that great at asking questions, but it seems more likely that chatbots like Khanmigo aren’t all that great at inviting, understanding, or answering those questions.
Allow me to tie a few of these things together. First, we have to remember that motivating students to do some kind of work is hard. Our model for motivating students usually sees good academic outcomes as something downstream from being motivated. The opposite may be just as likely: kids who perform well academically become motivated (or engaged; the two terms are often used interchangeably even though they’re not the same thing). If kids aren’t getting the support they need to be academically successful, they won’t be motivated to try, say, using a chatbot. Moreover, because they lack some kind of important foundational knowledge, they may not even be able to engage with the chatbot in a way that could leverage the bot’s capabilities. That’s why we get things like kids saying “IDK IDK” to the bot or being more “passive” than the bot’s designers expected. They lack enough knowledge about the math they’re working on to chat intelligibly with the chatbot. There may also be language and disability issues complicating matters. I’ve also, more crudely, made the point that kids in AI-rich classrooms are probably going to abuse that AI and use it for their own entertainment rather than focus on academics.
Second, we have to recognize that tutoring, while effective, is not going to deliver unprecedented performance growth. A lot of the public’s expectations of tutoring are filtered through pretty bad readings of Benjamin Bloom’s work on tutoring back in the 1980s. What you hear is that tutoring done right should deliver a “2-sigma” improvement in student performance; that is, tutored students will score two standard deviations above the mean on a given measure. The thing is, he called this the 2-sigma problem. It’s a problem because, as Bloom pointed out, this is more or less a theoretical maximum under ideal conditions, not the kind of gain to be expected in any real-world environment. While one challenge he listed gets all the attention, the difficulty of scaling in-person tutoring, he also listed others. In particular, he pointed to the “Home Environment Process” and “Peer Group” as factors that have a large impact on student performance and would need to be “solved” in order to resolve the 2-sigma problem. Matt Barnum discussed Bloom and the over-interpretation of his “results” at length back in 2018, after a different edTech venture fell apart.
The applicability of these studies today is an open question. Combined, the studies focus on just three schools and a few hundred students. And since this was done more than 30 years ago, things like what traditional instruction looks like may have substantially changed.
The papers include little information about those final tests, but it appears they were designed by the researchers, unlike a traditional standardized test. Researcher-created assessments on subjects that are totally new to students — like cartography and probability, in this case — tend to see students make the largest gains.
Bloom’s work also doesn’t focus on technology-based tutoring, a point personalized learning advocates usually acknowledge. “If it supports anything, it supports one-on-one human tutoring,” Riley said.
I’m on the record as saying that I think AI-based tutoring will deliver something like 0.25 standard deviations in effect size, because it’s a combination of tutoring (which delivers about 0.4 SD) and computer-aided instruction (which delivers about 0.05 SD). That estimate may actually have been too high. Using Khan Academy appears to have a roughly 0.085 SD effect size for kids who use it 30 minutes a week. That would put it in the ballpark of other interventions like providing children with rewards for learning, after-school programs, and summer programs, all of which are among the lowest effect sizes for interventions.
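If you want intuition for what these effect sizes actually buy you, here’s a rough back-of-the-envelope sketch (mine, not Khan Academy’s or Bloom’s): assuming normally distributed scores, an effect size in standard deviations translates to the percentile rank of the average treated student within the untreated distribution.

```python
from statistics import NormalDist

# Illustrative only. The effect sizes below are the figures quoted in
# the text (tutoring ~0.4 SD, computer-aided instruction ~0.05 SD,
# Khan Academy at 30 min/week ~0.085 SD, Bloom's 2-sigma ceiling).
def percentile_shift(effect_size_sd: float) -> float:
    """Percentile rank of the average treated student relative to the
    untreated distribution, assuming normally distributed scores."""
    return NormalDist().cdf(effect_size_sd) * 100

for label, d in [("tutoring", 0.40),
                 ("computer-aided instruction", 0.05),
                 ("Khan Academy, 30 min/week", 0.085),
                 ("Bloom's 2-sigma ceiling", 2.0)]:
    print(f"{label}: {d} SD -> {percentile_shift(d):.0f}th percentile")
```

Run it and the contrast is stark: a 0.085 SD intervention moves the average kid from the 50th to roughly the 53rd percentile, while Bloom’s theoretical 2-sigma tutor would move them to about the 98th. The gap between what’s promised and what edtech delivers is the whole story.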
But hey, Khan is going to create a MOOC, basically. Maybe it’ll work this time?
Ed Professor Who Substitute Teaches
I was planning on saving Larry Ferlazzo’s EdWeek essay for my Wednesday post, but go ahead and read it now. The quick version is that this professor of education works as a substitute teacher a handful of times a year to stay grounded in the reality of classrooms.
Second, subbing helps keep “Ivory Tower Syndrome” at bay. Ivory Tower Syndrome is the kitschy term for when faculty members become disconnected from the reality of life beyond campus. Serving as a substitute teacher keeps me grounded in a way that cannot be achieved in any other way. I face the same challenges with disruptions and off-task behavior. I face the same pressures of time management, engagement, and differentiation. Though I only get a small taste of what it is like to face these issues day after day, I become more connected to the challenges teachers and students face in schools every day.
I’ll be writing more about it (and a few other articles with similar topics) this week in an essay where I gripe about higher education’s misaligned incentives and how they can skew our perspective of classrooms.
Have a good week. Thanks for reading.