Boats of Gravy: No, Not Every Student Can Succeed

In any discussion about education, you can be pretty darn sure that somebody’s going to say some version of the following: “We need to make sure that every student can succeed.” Even our most recent federal education law, passed in 2015, is called the Every Student Succeeds Act. And the one that it replaced was the infamous No Child Left Behind, which is pretty much the same platitude expressed in different words. “All children can learn.” Or “every kid can be successful.” So on and so forth. They’re cliché, yeah, but these phrases are still a pretty good, succinct summary of the whole point of compulsory public education (which is, after all, a fairly recent invention, historically speaking - and one that not everyone is completely sold on yet.)

But these inspiring stock phrases tend to obscure an important question: what if it’s not possible for every student to succeed?

I’m not talking about natural ability or work ethic or the far-reaching effects of poverty and trauma. (Those are important questions, too, but beyond the scope of this post.) What I want to talk about are the ways that we measure success. Because we often measure and define success in relative terms: you succeed by doing better than other people. And in such a system, it is literally impossible for everyone to succeed. Trying to get everyone to succeed when success is understood as relative makes about as much sense as - to paraphrase Alfie Kohn, who said he was paraphrasing Deborah Meier - telling your whole class that they should all be in the front half of the line.

I think this conception of success as relative is ubiquitous in education. Sometimes it is explicit, sometimes implicit. Probably the most obvious example that we can discuss is a norm-referenced test. In a norm-referenced test, each participant receives a score that really only tells us how he/she performed compared to others who took the test, not anything absolute. IQ tests are famously norm-referenced. The average is always set at 100. So if we imagine a national movement to “raise IQ scores” - it wouldn’t matter what methods were used, it wouldn’t even matter if the whole country got smarter in the process, the movement would be doomed to fail from the get-go. The average would always remain 100 - because that’s what 100 means in the context of an IQ test.

Same darn thing.

Though it does have a complicated and interesting history, the SAT is also basically a norm-referenced tests. Which made perfect sense when it was primarily used for individual college admissions. One student’s relative ability to read, write, and do mathematics is absolutely relevant in college admissions decisions, since after all, the whole point of admissions is to compare that student to other students. But now some states (such as New Hampshire, my own state) have chosen to use the SAT as the mandatory standardized test for high school students. If they are going to try to show growth or progress in SAT scores, it’s going to be rather challenging. Not impossible, because there are still students in forty-nine other states that NH high-schoolers would be up against. But it’s a zero-sum game. In order for the scores of New Hampshire students to improve, it has to be at the expense of students in other states. And then imagine if every state adopted the SAT as it’s high-school standardized test. Then there would be no possibility for growth or improvement, and we’d get to see a million articles shared on Facebook all about how scores were stagnating and students weren’t getting any better at reading or writing or doing math (when they very well could be).

Hopefully, though, we never really reach the point of such absurdity. I don’t think we are in danger of it - not when it comes to tests that are explicitly, openly norm-referenced. But I also believe there is a much more subtle and invisible form of normativity that pervades many measures of success, which therefore makes them subject to the same potential problems. And that is my real point here: that we often understand success in relative terms without even realizing that we are doing so.

For instance, let’s think about classroom grades. (Full disclosure: I did my entire Master’s research colloquium on the detrimental impact grades have upon learning, so that’s where my bias lies. (It's over a hundred pages long with three appendices and I’d be thrilled if anyone besides my advisor ever wanted to read it.) And yeah, I still give grades in my class. Because I am required to and it will be a long time before that ever could change. Just like how your local Marxist bought his untouched-beyond-page-ten copy of Capital at a Barnes and Noble.) Now, I don’t think anyone who has spent more than an hour in a public school would ever seriously claim that grades are completely objective. We do, however, have a sense that grades tell us something about what a particular student can and can’t do, that they communicate something absolute. But my suspicion is that grades are largely relative.

I think most people who work in schools - as well as most people who live in our culture - have really taken to heart and internalized the notion of the bell curve. It is incredibly intuitive. I’d bet even people who have never heard the terms “bell curve” or “normalized distribution” believe in the general concept. We all sort of expect that, in any group of students, there will be a few who excel and a few who do poorly, while the majority will fall somewhere in the middle. And because that is our expectation, we sort of semi-unconsciously tweak things so they do happen that way. I mean, imagine for a second that you’re a math teacher and you see that every one of your students has just failed your latest test. What are you going to do? Probably conclude that the test was too hard and make the next one easier. Likewise, what if every student gets a 100? Test must have been too easy. Then take a look at your grades at the end of the semester. If everyone’s got an A, your class is too easy; if everyone’s failing, your class is too hard. If a few have high A’s, a few failed, and the majority fall in the middle, you’ll probably take that as a sign that the class is, as Goldilocks might say, "just right."

"It's, like, all about how the workers need to
control the means of production and stuff, bro."

(There are, of course, a few teachers out there that take pride in the fact that a lot of students fail their class. And there are others who hand out good grades far too easily to placate parents. But I think the majority of teachers, and the majority of what we would consider good teachers - assign grades in a way that conforms more or less to the bell-curve. And this may be less because students “naturally” fall into that pattern than because we unconsciously impose that pattern upon populations of students.)

Certainly, many districts that have moved to standards-based grading - which everyone is supposed to be doing in the near future, or actually were supposed to have already done, like, three years ago - have attempted to do so in a way that makes grading less about comparing a student to his or her peers and more about comparing him or her to an objective, stated standard. Unfortunately, I can’t really say much about how that works in practice. I’ve worked in three school districts since 2014 and, in all three, traditional (A-F; 0-100) grades were still being used, with a dim, atmospheric sense that we were “moving in the direction” of standards-based grading.

But it is also worth noting how vague some of the standards are. For instance, one of the Common Core State Standards for Writing - one of the standards that I am expected to “teach to” - expects an eighth grader to “write arguments to support claims with clear reasons and relevant evidence.” Well, there’s a lot of room for judgment in there. What counts as a clear reason? What counts as relevant evidence? And how much evidence has to be included in order for the argument to be sufficient? (All the writing standards are like this. Which is not necessarily a bad thing. Assessing writing is always going to be subjective, which is fine as long as we embrace that instead of pretending that we are being objective. Which comes back to one of the many reasons why I would support using narrative reports for assessment instead of grades but oh well.)

Furthermore, every school whose standards-based grading system I have actually seen has included categories for “exceeds the standard” and “partially meets the standard” as well as the expected groups of “meets” and “does not meet.” So that’s four categories that students can fall into. Seems awfully easy to apply the bell-curve model to that framework - a few students per class or grade will exceed the standard, a few will not meet it at all, and most will fall somewhere in the middle. So there is certainly a possibility that grading in a standards-based system is still, essentially, a process of comparing one student to other students - that a student’s grade is really a determination of relative ability.

(By the way, the fact that each teacher is engaging in this process only with regard to his or her classes, in his or her district, means that you really can’t compare one student’s grades at one school to those at another school - even if they’re ostensibly learning the same content based on the same standards. I have seen this firsthand: a paper that I would probably give a B in the district where I work right now would have likely gotten an A at my previous school - the overall performance of the students, for a variety of reasons, was just generally lower there.)

Which, I suppose, is why standardized tests exist. Most standardized tests are not norm-referenced and they are graded by people (or computers, but we’ll get there in a second) who have no connection to the students, no reason to give them anything other than an objective score. Or so it would seem. I’d like to look first at standardized assessments of writing - for a couple of reasons. The first is that my job is to teach writing, so these are the tests that reflect on me; the second is that I am currently reading a book about writing assessments, so it’s all fresh in my mind; the third is that writing tests are unique in that they can be graded by either computers or human beings, and that this significantly and substantially affects their scoring. (You can grade multiple-choice tests either by computer or by hand, but they are scored the same either way.)

First let’s tackle computers. (Just not literally.) I think it’s pretty easy to imagine the problems that can arise when computer programs are in charge of grading writing. For instance, research has shown that they can be fooled by big words, convoluted sentences, and lengthy but ultimately substance-less responses. (Some of this research is summarized here.) Because they can’t pay attention to the things that actually matter in writing - coherence, audience awareness, voice - they presumably focus on the superficial features of the writing instead. But we really don’t know for sure, and that’s one of the main problems here. The testing companies who use computers to score the writing sections of their standardized tests tend to be pretty protective of them, so that we don’t really know how they work. That means we don’t have any real way of knowing whether they are valid and reliable. All we can do is trust that the corporations who make millions off of these tests really have our students’ best interests at heart.

Now for humans. In theory, having humans score writing should be a million times better than using computers. And it is - in some ways. Humans are capable of making decisions about whether a piece of writing uses “relevant evidence” or contains “clear claims,” at least. But the way that the people who do the grading are expected and trained to do so leads to other sorts of problems. A lot of it is explained pretty well in these articles (and other similar ones.) In the community of test-scorers, there is a strong emphasis on consensus. Which is putting it pretty lightly. Anyone who does not fall in line and score writing pretty much the same as others do is let go from the position. There is no room for difference of opinion; there are no conversations about why you gave a particular score. It’s like a jury which demands a unanimous verdict but, instead of permitting long sessions of deliberation, simply dismisses anyone who doesn’t vote wth the majority.

So if you’re a writing-scorer and you have a vested interest in continuing to be one, in keeping your job - you are going to want to make sure that you give every piece the score that other people would give it. It’s an exercise in conformity. It’s more like playing Family Feud than playing Jeopardy - the goal is not to find the best answer, but just the most common answer. (Which is one of several reasons why Family Feud is the dumbest game show.) And if everybody is doing that, what are they all going to default to? The lowest common denominator. The bell curve. A few really high scores, a few really low scores, and a lot that fall somewhere in the middle. That is the safest route. And besides, I really think it is so deeply embedded in our minds that we almost can’t help but think in terms of it.

"If we evolved from monkeys, why we still got monkeys?"

So when we talk about human beings grading writing for standardized tests, we’re not really talking about human beings as human beings. That would mean that they had the freedom to be subjective, to offer and defend their own opinions (even if they were unpopular.) What we’re really talking about here are human beings who are trying their best (who are expected to try their best) to be computers. Hell, one person I’ve met who worked as a test-scorer for one of the big testing companies has told me that they refer to the process of making sure you agree with the majority of other test scorers as being “calibrated.” So if it turns out that humans are slightly worse at being computers than the computers are - well, who’s really surprised by that?

As for the tests that make sure each piece of writing is scored by a computer and a human being - I guess the question is: where does the buck stop? When there’s a discrepancy, who do they go with: the computer or the human? But then I suppose it doesn’t really matter that much. If you go with the computer, you go with a score that is objective, but could very well be completely unreliable and invalid; if you side with the human, you are getting a score that is based on conformity and probably relative.

So my point is that some aspects of standardized testing are subject to the same problems as an explicitly norm-referenced test like an IQ test. A movement to improve the writing scores of all students nationwide would likely be similarly doomed. (A probably obvious point: this is not the same as a movement to improve the writing of all students nationwide.) If the people grading these tests are falling back on the idea of a bell curve, of a normal distribution of scores - then those scores are never going to improve. Can you imagine being a test-scorer for and trying to say that every single essay that fell into your pile belonged in the highest category? That’d take some serious guts, that’s for sure.

Of course, it is true that the people who grade the writing of standardized tests have more at their disposal than just the idea of a bell curve. They have samples of writing that correspond to each of the score categories as well. But I don’t think this actually invalidates my point - it simply expands the scope of it. Though again we can’t actually know for sure, I think it’s a reasonable assumption that they don’t use the same samples year after year. If they don’t change them every year, I bet they change them every couple of years or so. And what would happen if there was a massive improvement in the quality of writing that was coming in? I imagine the writing samples would change to reflect that, so that what was once the top category would now be the second-from-the top, and so on. (Yes, this is all speculation, but I think it is makes sense.) It would be a gradual process, of course, but eventually the bar for success would be raised.

And that is how this same principle affects not just assessments of writing - which, admittedly, is inherently more subjective than other subjects - but all standardized tests. It’s just like what classroom teachers do. When everyone is doing well, we take that as a sign that we need to make the tests a bit harder.

For instance, during the 2013-2014 school year, 77% of sixth grade students in New Hampshire scored proficient or above on the NECAP Reading test. 70% scored proficient or above in math. That sounds pretty darn good.

But then the following school year, we switched to using the Smarter Balanced test. That brought the numbers down to 57% for reading and 46% for math who were proficient or above (or, in SBAC parlance, “meet or exceed the achievement level.”) Now, these are completely different tests. But I don’t think it’s a coincidence that the proficiency rates are lower. It’s hard to even imagine anyone implementing a test that led to higher proficiency rates; they’d be accused of “dumbing down education” and (without fail) “giving everyone a trophy.” But the way it did happen, we got to be subjected to a slew of articles about how abysmal students were these days at reading and math, and how they used to be so much better before [insert whatever innocuous thing you feel like demonizing here. Hip-hop or iPhones or Gogurts that are too easy to open or whatever].

And yet the simplest conclusion - Occam’s razor and all that - is: it was a harder test.

And that’s going to keep happening. We are going to keep adjusting all of our assessments

The worst-designed meme of all time?

- standardized or otherwise - so that they correspond to where the students actually are and what they can actually do. We are going to keep trying to make it so that they fall into a normal distribution because we implicitly believe that that’s the way the world naturally is. And maybe there’s nothing wrong with that. But we’ve got to recognize that we are perpetually constructing a system in which it is actually impossible for every student to succeed.

I think that this way of thinking about success is basically an instance of the fallacy of composition. It’s possible for each member of a group to do something, so we tend to assume that it is possible for every member of a group to do the same thing. To go back to that intentionally absurd “front half of the line” example: yes, of course it is possible for any individual student who is in the back half of the line to move up. But it’s not possible for them all to move up. If they all tried to move up, it would change the structure of the line, which is the very system within which they were trying to move. And the same principle is true in many real-world cases.

For instance, at my school, students are divided into an A group and a B group (except for in my class, but that’s a whole separate issue.) And it is certainly possible for an individual student from the “B group" to improve so much, or to show so much determination and grit, that he or she is moved up to the “A group.” It has happened before. But you can’t move all of them up - no matter how much they all improve - or else you no longer have an A group and a B group at all. So if insist upon measuring success in relative terms, then we can’t also insist upon every student being successful. It’s just plain intellectually dishonest. We've got to pick one or the other.

This wouldn’t be an issue if it was confined to such benign things as middle-school A groups and B groups - but the same fallacious logic seems to be almost ubiquitous in our conversations about education. There is all the assessment related stuff that I have already discussed, of course. But sometimes it goes even further than that. For example, in the book The Teacher Wars, Dana Goldstein mentions a 2006 paper that estimated that “firing the bottom 25 percent of first-year teachers annually . . . could create $200 billion to $500 billion in economic growth for the country, by enabling poor children to earn higher test scores and go on to obtain better jobs.”

Really? Every poor student will be able to obtain a better job? How would that even be possible? It’s not as if low-skill, low-wage jobs would just disappear. Nor would we need more doctors and lawyers just because more people were educated enough to pursue those careers. What we’d end up would probably be a bunch of highly educated people working in the food service industry. (Shout out to my 2015-self, who had a Master’s degree and worked at Subway.) But it’s the fallacy of composition again. Yes, any individual can receive a better education and therefore obtain a “better job.” The principle even works if you think about a small enough group - the students at one school, for instance, or even at the schools of one city. But when you try to apply it to all students across the country, which is what those authors did, it falls apart.

The same goes for the plans of people like Bernie Sanders (and many others on the left) who want, through various methods, to make it so that more young people can go to college. I don’t disagree with the idea, necessarily, but I think we need to not kid ourselves: the value of a college degree is mostly relative. The reason I can get a higher-paying job than many others my age is because I have a degree, not because of what I actually learned or did in college. (Which, by the way, is not the same as saying that what I did in college has no value. It has a heck of a lot of value to me. Just not to anyone who would sign my paychecks.) So to the extent that the value is relative, the same principle applies. We’d end up with degree inflation.

Like the bad guy from The Incredibles says: “If everyone’s super, no one is.” (Because he’s defining super as a relative term, of course.)

Perhaps the best solution to this is to try to stop defining success as relative. Instead of setting up competitive, zero-sum frameworks (like A groups and B groups, like norm-referenced tests, or writing scored with the assumption of a bell curve) - we could try to set up systems where it is possible for everyone to succeed. Maybe we are moving in that direction. Maybe that is where standards-based grading is headed (though I don’t think it is quite there yet). But until then, we’ve got to at least change our platitudes to match the reality.

So I’m looking forward to the upcoming “A Predetermined Percentage of Students Can Succeed Act."

Boats of Gravy

Saturday, September 9, 2017

No, Not Every Student Can Succeed

No comments:

Post a Comment