On Data and Decisions in Education

Data is alive and can make you do strange things

Welcome to Scholastic Alchemy! I’m James and I write mostly about education. I find it fascinating and at the same time maddening. Scholastic Alchemy is my attempt to make sense of and explain the perpetual oddities around education, as well as to share my thoughts on related topics. On Wednesdays I post a long-ish dive into a topic of my choosing. On Fridays I post some links I’ve encountered that week and some commentary about what I’m sharing. Scholastic Alchemy will remain free for the foreseeable future but if you like my work and want to support me, please consider a paid subscription. If you have objections to Substack as a platform, I maintain a parallel version using BeeHiiv and you can subscribe there.

Making data

One of the stories I like to tell about the limits of high-stakes testing is really about the limits of data’s ability to tell us useful things. In another life, I worked with special needs high school students in Georgia. Each year, just like every other kid in the state, these students would take the EOCT — End of Course Test — as required by No Child Left Behind. The tests were usually given in April over the course of a week, and the whole school would transform into a testing center with limits on who could go where, which rooms were open to staff, specialized rules for attendance, and so on. One year, the tests became electronic, and students would rotate through computer labs class by class, taking the EOCT for each course. This created a huge throughput bottleneck because there were, if I remember correctly, only five computer labs plus the school’s library that were set up to allow students to access the tests. This was long enough ago that most schools did not have a one-to-one program, so it was the labs or paper. That meant about 2,500 students had one week to use about 175 computers anywhere from two to four times per student, depending on their grade and courses. Beyond that, the online tests were only accessible during their specified two-hour testing window and would lock out students as soon as the time ended.
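To get a feel for how tight that squeeze was, here is a rough back-of-the-envelope sketch. The student count, computer count, tests per student, and two-hour window come from my recollection above; the number of testing windows per school day is an assumption added purely for illustration, and the real schedule presumably packed in more sessions than this crude sketch allows. Still, it gives a sense of the arithmetic.

# Back-of-the-envelope look at the testing bottleneck (illustrative only).
# Known from the story: ~2,500 students, ~175 computers, 2-4 EOCTs each,
# two-hour locked testing windows, one testing week.
# Assumed for illustration: three two-hour windows per school day.

students = 2_500
computers = 175
tests_per_student_low, tests_per_student_high = 2, 4
windows_per_day = 3   # assumption, not from the story
school_days = 5

sessions_needed_low = students * tests_per_student_low
sessions_needed_high = students * tests_per_student_high
seat_sessions_available = computers * windows_per_day * school_days

print(f"Test sessions needed: {sessions_needed_low:,} to {sessions_needed_high:,}")
print(f"Seat-sessions available that week: {seat_sessions_available:,}")
# Under these assumptions the week is heavily oversubscribed, which is one way
# to see why there was no slack left for accommodations.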

This system of testing was not designed for the students I was working with. For example, one of the most common accommodations for students with disabilities is extra time to take the test. In previous years, students needing this kind of accommodation could use all their time consecutively. Once the test moved to the electronic model, the students had to return at a later time to use their legally mandated extra time, and a district official had to unlock each student’s test individually. As this problem became apparent, special needs students’ testing times were all moved to the end of the week because the school could not take computers out of rotation in order to accommodate students with disabilities. At the end of the week, there turned out not to be enough time for all of the special needs students to take all of their exams. They would have to continue testing the following week. Needless to say, this entire process badly stressed the students, and there were many tears, outbursts, and other impediments to taking the tests. For many, this was their first time taking a test on a computer, which only added to the stress. Some kids just didn’t come back the next week despite frantic calls to parents explaining that the EOCT scores were a requirement for completing the course. If a kid didn’t take the EOCT, she technically would not finish the class or earn a grade.

Toward the end of the year, score reports came back for all of the students except those who were doing all of this extended testing. Their scores were still being tabulated, and we’d get them at the beginning of the next school year. When the school later touted the huge improvement in its test scores, I understood that this was a score report that did not include students with disabilities, and I was not surprised that the revised scores released the following school year were not, in fact, improved; they were worse than the year before. This, as you might expect, is where “the data” begin to play a role.

“The data,” you see, told us that students with disabilities at this school were struggling at an unprecedented rate and needed far more support than in previous years. Worse still, the school failed to make Adequate Yearly Progress for the first time ever, meaning the state was now going to be watching this school closely. If the school did not return to making Adequate Yearly Progress within three years, it could be defunded, closed, have all of its staff fired, be turned into a charter school, or face any number of other punishments. The failure was blamed entirely on the special education teachers and their students, whose abysmal performance “caused” the drop in the school’s scores.

Sadly, the aberrations of moving to a computer-based test for the very first time were not considered meaningful, nor could they excuse the school from being placed on the list of possible failures to be eventually closed. The district placed the whole school on a performance improvement plan that included mandatory sessions before and after school for a week to analyze “the data” from the tests and develop plans to improve in the areas where students struggled. The thing is, state law also prohibited teachers from seeing the test or directly reviewing students’ answers or scores. This was to prevent teachers from knowing what content was on the test and then teaching to the test. Nevertheless, the expectation was that teachers should spend the upcoming year teaching to the test because passing that test was what would save the school. It was a maddening dilemma.

“The data” we received came in the form of 3"x5" cardstock readouts explaining each student’s performance on each subject test they took. We could not take these cards from the office in which they were stored, nor could we copy or record what was on the cards, for purposes of data security. At the top of the card was the student’s testing ID — you could not put the students’ names or school ID numbers on the card for some reason, which meant someone had to be physically present with a list of student test IDs, and teachers had to speak with that person to correlate their students with the cards. Students’ performance was broken down across whatever domains were on the test. For example, the 9th grade ELA domains were Reading and Vocabulary, Texts, and Language. Each of these three domains was scored with between one and three ‘+’ signs. ‘+++’ meant that a student was passing that domain. ‘+’ and ‘++’ meant a student was not passing that domain. This was odd, actually, because the EOCTs were not scored on a range of 1-3 but on a range of 1-4. It was not clear how the 1-3 scores on these cards related to the 1-4 scores of the actual test, and nobody present could explain it. Finally, the bottom of the cards included recommendations for how to help the students improve their scores. In ELA all of the suggestions were books considered appropriate to that student’s Lexile level and grade. Need help with language? Read these three books. Need help with “texts”? Read these three books. Need help with “reading and vocabulary”? Read these three books.

Teachers broke out by subject and grade level and discussed these results with each other. They discovered every single kid was recommended the same three books: Of Mice and Men, The Old Man and the Sea, and the one nonfiction choice of Richard Preston’s The Hot Zone. Our best guess was that the district or whatever vendor implemented this system could not recommend books below grade level and that these three books were simply the three lowest Lexile level books permitted in 9th grade. Even so, these suggestions were probably not going to help. Students with disabilities manifest a wide variety of symptoms and needs. In several cases these students were simply never going to learn to read because, for example, they’d experienced a traumatic brain injury as a child. Many other students could and did learn to read but required significant assistance with sounding out letters and decoding words. Others needed help with comprehension or with maintaining focus while reading for any duration of time. Disabilities are complex.

On top of all that, the school already collected reams and reams of data about each of these kids. That data collection is required by law and is reviewed annually by a committee that analyzes student performance, sets new goals for each student with a disability, and determines their accommodations and supports for the year. None of that data counted for the purposes of school accountability. And that’s the point of sharing this story. It didn’t matter that these students probably had “more data” than anyone else in the school about their academic capabilities. It didn’t matter that the shift to electronic testing proved highly disruptive and clearly wasn’t planned with students with disabilities in mind. The effects of botched testing didn’t count. Special education teachers had to make decisions to “save the school” based solely on the data that did count, but they were refused access to that data and instead received a weird interpretation of it that offered little value. Still, the school dutifully ordered several classroom sets of The Hot Zone. The other books they already had in stock.

We aren’t getting better at making data

A few years back, Heather Hill, a researcher at the Harvard Graduate School of Education, surveyed the literature on teachers getting together and analyzing student data, just like the teachers I worked with back in the day. The theory is sound, right? Teachers coming together to make decisions around curriculum, interventions, and whatever else might be useful for helping students improve. With the explosion of testing data available after NCLB, there was more information than ever to inform teachers. It would appear, however, that more information has not translated into better outcomes. Much as my vignette above illustrates, “the data” is often too limited and offers little in the way of constructive support.

Rigorous empirical research doesn’t support this practice. In the past two decades, researchers have tested 10 different data-study programs in hundreds of schools for impacts on student outcomes in math, English/language arts, and sometimes science. Of 23 student outcomes examined by these studies, only three were statistically significant. Of these three, two were positive, and one negative. In the other 20 cases, analyses suggest no beneficial impacts on students. Thus, on average, the practice seems not to improve student performance.

Hill argues that teachers end up doing the “wrong” thing with data. Rather than revamp their curriculum, they make plans to go back and reteach the content, usually in the same way as before. This is reminiscent of how we were told to teach the same three books that were already on hand, already approved, and technically appropriate for the students’ age and grade. No matter the students’ shortcomings, the recommended solution was just doing the same old thing. Indeed, much of Hill’s description of what happens at these data meetings sounds very familiar, even years later. For example, from her observations of data meetings by ELA teachers:

Teachers reported on each student, celebrating learning gains or giving reasons for poor performance—a bad week at home, students’ failure to study, or poor test-taking skills. Occasionally, other teachers chimed in with advice about how to help a student over a reading trouble spot—for instance, helping students develop reading fluency by breaking down words or sorting words by long or short vowel sounds. But this focus on instruction proved fleeting, more about suggesting short-term tasks or activities than improving instruction as a whole.

Common goals for improving reading instruction, such as how to ask more complex questions or encourage students to use more evidence in their explanations, did not surface in these meetings. Rather, teachers focused on students’ progress or lack of it.

Importantly, she isn’t blaming teachers here. What she calls for is more time for teachers to collaborate on planning and improving instruction instead of focusing on standardized assessment data. Sadly, two years later, the Hechinger Report caught up with Hill, and little seemed to have changed.

Data speak

I want to bring this to a close with a few relevant observations. First, with regard to schools, data are not neutral or objective. Tests are rarely just someone trying to get a good read on what students know and can do. The tests I witnessed were a complex combination of assumptions and political priorities. They were tests to hold teachers and schools accountable, a political project. They were tests designed for mainstream students, discounting the needs of students with disabilities (among others). The data they produced was not very useful, nor did it help form a more complete picture when considered alongside the other data gathered about those same students. In fact, there was no effort to form a complete picture by combining data sets. Yet the data from these tests carried with it a requirement to act, in ways dictated by some vision somewhere of what kids with these scores needed academically, without regard for what those students needed developmentally, psychologically, or emotionally. Teachers and administrators, meanwhile, had to go through ritualized bureaucracy to retrieve the data and then pantomime the creation of curricular interventions, even though these were simply handed down on a little piece of cardstock. None of this was meaningful for students, but all of it was required.

Second, and I think this follows quite naturally from recognizing the above, is that data has a kind of agency now. That is, the data is no longer a static statistical description but a force that carries with it its own imperatives to act and behave. I think we’re quite used to the role algorithms play in our lives outside of school. Social media recommends things to entertain us, to encourage us to buy products, or to promote particular ideas. Certainly, there is nobody left who thinks social media is simply neutral or without any agenda. The same goes for our nascent AI products that leverage massive amounts of data to produce generative outputs for various purposes and needs. Indeed, the lack of neutrality is so great that many LLM and generative image/video models have to build in explicit guardrails against certain kinds of violent, sexually exploitative, or politically radical content. Nobody thinks those guardrails are anything but a response to political and social requirements of some kind (whether we like them or not, we get what they’re doing). And yet, somehow, we ignore that this is the case for data in schools, too.

One of my favorite articles is actually starting to show its age now, but it makes an important point that should be obvious yet is not. Ben Williamson’s Learning in the Platform Society takes a look at ClassDojo.

ClassDojo is a commercial platform for tracking students’ behaviour data in classrooms and a social media network for connecting teachers, students, and parents. The hybridization of for-profit platforms with a key public institution of society raises significant issues. ClassDojo is designed to influence how school leaders and teachers make decisions, how schools connect with parents, and how teachers act to change students’ behaviour.

The editor’s overview gives a good synopsis of these influences:

As a result, ClassDojo is forming and shaping the discourses and practices of classrooms and public education. In his contribution to this special issue, Williamson carefully excavates all of the various actors, forces, and entities, both human and non-human, that make up the sociotechnical assemblage of ClassDojo. He shows how the technology is allied with the psychology community, working through evaluative mechanisms of concepts such as grit, perseverance, and mindset that are set into the platform. He also shows how this platform is well positioned to respond to the US policy demands of ESSA that require states to include at least one ‘non-academic’ measure of learning as an outcome for accountability.

Collecting data, such as data about “grit, perseverance, and mindset,” carries with it the requirement to act on that data: to make each student’s data conform to the expectations of the assessment (in this case the continual assessment of an app), of the policy apparatus (ESSA), and of the chosen research tradition providing the evidence (basically just Angela Duckworth). It doesn’t matter, really, that grit is questionable as a psychological concept. Nor should we care that much of what makes the thing we call grit possible is simply a stable home, loving parents, and decent nutrition. As one critic puts it, “The notion that kids in poverty can overcome hunger, lack of medical care, homelessness, and trauma by buckling down and persisting was always stupid and heartless.”

Scholastic Alchemy

This is all, of course, Scholastic Alchemy. The intermingling of many influences, from politics to education research to EdTech, slams right into the reality of kids in schools and their many varieties. The outcomes end up being quite alchemical in that they do not produce the scholastic gold expected of them. No, despite the many arcane data rituals the teachers I worked with endured, what drove the improvement in the following year’s scores was probably that we had laptops brought in so kids could take the tests in better accordance with their disability accommodations, in environments they were comfortable with, and in a format we did more to prepare them to encounter. Although we implemented the curricular changes required by the data, many students with disabilities gained little to nothing from reading The Hot Zone because their disabilities were of a nature that limited their ability to read at all, much less comprehend high school texts. We are now starting to suspect that pushing for grit is going to backfire no matter how much the data suggested we work on improving students’ grit. We have new data now, I guess, and it says to do something else. So we will.

Thanks for reading!