One More Thing About Testing

Testing is only as good as the tests

Welcome to Scholastic Alchemy! I’m James and I write mostly about education. I find it fascinating and at the same time maddening. Scholastic Alchemy is my attempt to make sense of and explain the perpetual oddities around education, as well as to share my thoughts on related topics. On Wednesdays I post a long-ish dive into a topic of my choosing. On Fridays I post some links I’ve encountered that week and some commentary about what I’m sharing. Scholastic Alchemy will remain free for the foreseeable future but if you like my work and want to support me, please consider a paid subscription. If you have objections to Substack as a platform, I maintain a parallel version using BeeHiiv and you can subscribe there.

Today’s post is a bit quicker and is meant as a smaller, but important, addendum to my last two posts about public opinion, standardization, and testing. The subtitle really says it all: testing is only as good as the tests. Typically, when people complain about the standardized tests administered by the states, the complaints are about validity. That is, people who dislike these tests think they are not accurate measurements of students’ learning and abilities. We often hear that standardized tests are racist, are biased toward high-SES kids, or are not holistic enough for kids’ true skills and talents to show through. These are seen as left-coded issues, so a lot of the time we just assume that the left is the primary opponent of standardization and testing.

However, if you’ve read my last two posts, as well as the May 13th post, you will recognize that one thing I’m persnickety about is the political alignment of people who dislike testing and standardization. We hear, including from major publications like the NYT, that there’s a kind of “both sides” problem for standardization and standardized tests where the left and right hate them in equal measure. I think this is false. While some on the left have opposed standardization and testing, if you look at who was opting out of tests, where opt-out was most prominent, what polling around issues like testing looks like, and what political activities surround all these issues, it’s pretty clear that people from the center-right and right, conservatives, are most opposed. As I noted in the last post, plenty of people, especially on the right, don’t think standardized tests capture meaningful information about their kids’ learning as individuals. This is a validity concern! Just because the first things that come to mind when opposing tests are race, income, and disability doesn’t mean that the public’s concerns are also focused on those things. There are other validity issues that are worth considering and, just as important, we have to contend with the perception that standardized tests aren’t valid.

That said, sometimes you have someone who is liberal also criticizing standardized tests and pointing to their limitations. Economist Raj Chetty, for example, thinks testing is too flawed to give policymakers and researchers all the useful information that they need.

Furthermore, if you look at test score data, which is the basis for most prior theories about differences in ability, the fact that black kids when they’re in school tend to score lower on standardized tests than white kids, that actually is true for both black boys and for black girls to the same extent. In contrast when you look at earnings there are dramatic gender differences.

And so that suggests that these tests are actually not really capturing in a very accurate way differences in ability as they matter for long-term outcomes, which casts doubt on that whole body of evidence. So, based on that type of reasoning, we really think this is not about differences in ability. One final piece of evidence that echoes that is if you look at kids who move to different areas, areas where we see better outcomes for black kids, you see that they do much better themselves, which again demonstrates that environment seems to be important. This is not about immutable factors like differences in ability.

[emphasis added]

Chetty is saying that, as a researcher, he doesn’t think he can use test score data to generate meaningful long-term information about students’ outcomes. What’s really important here, though, is that he’s not only making a point about validity. The main thrust is that he can’t get data of the quality he needs from testing, which is a validity concern, but let’s look at the other piece of evidence he includes. He remarks that kids who move into a different environment have higher scores. This is, I think, a reliability problem. Reliability is when a measure can reproduce results consistently because it takes place under the same circumstances. As Chetty points out, when you change circumstances, kids get different scores! Our testing apparatus lacks reliability because it cannot reproduce the same conditions.

It’s not just about environmental changes or did a kid get enough sleep or eat enough breakfast or whatever. The tests themselves change, sometimes often, and that makes results incomparable between the two versions of the test. South Carolina offers us a perfect example of the reliability problem here. If you’re following some recent press releases or checking local media, you may have heard that South Carolina’s students are improving their reading scores (math is a different story). Good for them! The South Carolina Daily Gazette remarks:

…reading scores reached all-time highs, according to state standardized testing data released Tuesday. State Superintendent Ellen Weaver and teachers’ advocates have attributed much of the improvement in reading scores to a shift in how students learn to read.

That’s the highest overall with proficient scores in what’s officially called “English language arts” since students began taking SC READY tests 10 years ago, according to department data.

South Carolina’s Department of Education seemed pretty excited in its press release, too.

Thanks to the dedication of South Carolina’s educators—now equipped with powerful tools through clear and consistent training and high-quality instructional materials—student achievement continues to climb, with momentum building in classrooms across the state. More South Carolina students are reading on grade level since SC READY was first administered in 2015-16.

If you’re a lawmaker or someone looking to implement educational policy in a different state, you might think South Carolina is a success story that can be replicated. After all, there was a “shift in how students learn to read” and we’re ostensibly in the middle of a literacy crisis.

Thing is, it’s not the same test. The test South Carolina’s kids took in the 2024-2025 school year was a new test based on new standards that were enacted by the legislature in 2023. It is, therefore, inappropriate to compare scores on this test with the old test. There is no reliability here. We cannot tell from this test whether or not the “shift in how students learn to read” was actually responsible for the higher scores because the students were tested on different things. We do not know whether scores “grew” because this is the first time the new test has been used. What if it’s just an easier test?
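
To make the “easier test” worry concrete, here is a toy sketch in Python. Everything in it is hypothetical: the ten students, their skill numbers, the 5-point difficulty shift, and the cut score of 60 are all made up for illustration and have nothing to do with actual SC READY data or how that test is scored. The only point is that the share of “proficient” students can jump even when the kids’ underlying reading skill has not changed at all.

```python
# Hypothetical illustration: the same students, two test forms of different difficulty.
# All numbers are invented for this example; none come from SC READY or any real test.

def proficiency_rate(scores, cut_score):
    """Return the share of students scoring at or above the cut score."""
    proficient = sum(1 for s in scores if s >= cut_score)
    return proficient / len(scores)

# Underlying reading skill for ten imaginary students (identical in both years).
skill = [38, 45, 52, 55, 58, 61, 64, 70, 77, 85]

# Old test: a harder form, modeled (as an assumption) as 5 points lower per student.
old_scores = [s - 5 for s in skill]

# New test: an easier form, modeled (as an assumption) as 5 points higher per student.
new_scores = [s + 5 for s in skill]

CUT = 60  # assumed "proficient" threshold, kept the same on both forms

old_rate = proficiency_rate(old_scores, CUT)
new_rate = proficiency_rate(new_scores, CUT)

print(f"Old test: {old_rate:.0%} proficient")
print(f"New test: {new_rate:.0%} proficient")
```

In this made-up case the proficiency rate goes from 30% to 70% with zero change in what the students can do. I am not claiming that is what happened in South Carolina. The point is that, from the scores alone, we can’t rule it out.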

I don’t mean to beat up on South Carolina here, but they just released scores and I was looking into it. All the states do this all the time. New York has re-jiggered its Regents exams every couple of years and even makes a point to remind people that different series of tests can’t be compared. When Common Core came and then a few years later went, states often created new state exams on both ends. As I pointed out last week, Texas is redoing its state tests.

Whether or not our systems of standardized tests put on and administered by the states are valid sometimes feels like a matter of perspective. There are criticisms of many flavors and, I think, the polling gives us a good indicator of how parents and the general public feel about them: generally supportive, but wanting less emphasis on testing in the curriculum and more local control over testing and accountability. What is under-reported is that these testing regimes not only lack validity, they lack reliability. We simply cannot expect these tests to produce consistent results across time and over many settings. What’s more, we under-index the extent to which the test scores being reported are from the same test as prior years. In many cases, they aren’t. I suppose I could add yet another wrinkle: tests aren’t the same from state to state. South Carolina is happy that about 60% of its students can read on grade level but we don’t know how they would score on Mississippi’s exams or Georgia’s exams or California’s exams. The tests most commonly given don’t allow us to compare at the most meaningful unit of analysis, the state level. (I think this is the most meaningful because states control education policy and there are vast differences in what states do.)

You might read all of this and think that I am opposed to standardized testing. I am not. I think standardized tests are valuable tools but I think our approach to them is flawed and counterproductive. You know, this dance we do every year, the thing we call testing, it is supposed to be an apparatus that gives us valuable information. Teachers are supposed to get assessment data they can use to improve their teaching (hahahaha). Districts and states are supposed to get insight into which schools are over- and underperforming so that they can make adjustments to policy. Nationally, we’re supposed to have an understanding of “is our kids learning” and of the overall efficacy of schooling in the US. But it doesn’t really work out that way. Our tests don’t always give us useful information about our students in a timely manner, nor do they give us the kind of consistency across time that we need. The NAEP is supposed to fulfill some of this role but its reduced budget and delays mean we don’t get as much detailed reporting as we’d like. For example, the 12th grade NAEP scores don’t come broken down by state. We deserve a testing system that is both valid and reliable. We have neither but we continue to make proclamations and policy based on interpretations of test scores that are not empirically sound. It is, after all, Scholastic Alchemy.

Thanks for reading!