I work in academia. When we grade exams, the order of the exams on the stack is the order in which they were collected in the room (people can sit wherever they like). For grading, we are usually 5 people in a single room, and everyone grades a specific exercise for consistency. The exams are getting shuffled heavily, with everyone just grabbing stacks, looking for exams where "their" exercise was not yet graded, and taking them out. So basically, the order in which we grade exams can be considered random.
However, I also grade weekly exercise sheets during the semester, and these are committed into a repository, where each student has a folder that... begins with the first letter of their first name. Everyone I have ever worked with acknowledges that you have to shuffle the order in which you grade these submissions each week, for fairness. Several effects come into play: (1) you are usually less tired at the beginning, (2) your mood gets better during the last 2 sheets because you know you are done soon, (3, and crucially) at the beginning, you have not yet seen all the common errors / developed a "feeling" for them, and you might thus miss them in early submissions, but spot them immediately in later submissions.
Another alphabetic effect: In elementary school, my name was on top of the list of students in my class. I remember that I often had to do some special job simply because I was the first name on this list (for example, carry a group ticket when we visited some museum, keep track of something, be the first at something where nobody wanted to be the first, with everyone watching, be the first to be graded in PE, again with everyone watching, etc.). As a fairly shy kid, this already annoyed me in first grade.
My strategy was to, like you said, grade problem by problem. Then for each problem, first find all those who got full marks. Then group the others into piles based on what mistakes they made.
This ensures that everyone who made the same mistake(s) gets the same grade. It also tends to shuffle the order of the exams after every problem.
Obviously you don’t need this strategy for simple multiple choice questions, and it’s probably also not a great fit for long-form essays. But it worked great for technical short answer problems in CS and security.
This sounds like an organisational nightmare, to be honest. You'd be going through the pile of exams multiple times (at least twice), and what do you do if a single exam question has several common mistakes at once?
Also: if you're sorting into "mistakes piles" for single exercises, how can you parallelise marking of separate and independent questions?
Teach at a broke public university, and you never have to juggle huge teams of TAs.
Even at top-notch universities (public or private), when I talk to retired faculty, grading almost always comes up as a reason they don't want to teach anymore.
[Edit: not disagreeing with your point.]
Not only is it generally time intensive, you are also subject to lots of tiring back and forth with some students about their grades.
No grading is perfect, but there’s also some undercurrent of an attitude that students have paid to be there and are entitled to a certain grade.
Given that students have taken on hundreds of thousands of dollars in debt that they'll have to repay no matter what, and that a lot of jobs are completely out of reach these days without an academic degree (which, for fuck's sake, isn't remotely necessary for virtually any of the jobs that require it!), that's completely understandable.
Want to fix higher education? Bring the hammer down on companies abusing it as a proxy for legally discriminating against classes of society that are closely correlated with poor academic outcomes. Academic education should be reserved for the best of the best of our youth, and it should be fully paid for by the government, not simply another hurdle to pass to get a job that pays barely more than flipping burgers.
Would that my students were this engaged before the exam. Guess which students show up the most often for office hours? ... yeah, the ones that are getting the best grades.
If my students spent half as much time learning the subject as arguing with me about grades, they would be getting a higher grade than the one they are arguing for.
I think it is rational that students can feel entitled to that.
I also think that the vast majority of poorly paid, non-tenured professors and other teaching staff don't love being the targets of this harassment, since it's not their fault and largely out of their control, and it's not like they're getting the bulk of the tuition money. (That mostly goes to administrative expenses and sports programs.)
Heck, many adjunct faculty are paid below minimum wage and qualify for food stamps.
I do (I'm a mathematician). We are usually between 4 and 10 people marking an exam with anywhere between 50 and 600 participants.
Online tools like Gradescope make this a little less painful (but still painful), but sometimes it's what's needed, especially on problems that are a little open-ended.
Sibling comment already said so, but I want to emphasize - this requires two run-throughs (at least).
When I was grading homework, it took about 5 hours a week per class per run through. They didn't pay me enough to make sense for it to be 10 hours.
A second pass wouldn't necessarily take the same amount of time, especially if you note the issues/concerns on your first pass.
True, but the overhead is large. I graded intro linear algebra and intro calculus, so there were a lot of students - I think 150 or so - and most of the submissions were wrong.
Graders know that wrong homework takes much longer than correct homework to grade. It's correct? Full marks, move on. Is it wrong? Well, how wrong is it? Did they make a bad assumption, but followed it through to its conclusion? Did they forget a minus sign? Or is it complete hogwash?
So it might not be 10 hours, but still would be around 8 hours. And that's still too much.
When I was a TA at CMU, we used Gradescope https://www.gradescope.com/ for this. Every exam would be scanned and divided into problems (based on a predefined template - fixed page space for answers).
Then, each problem was assigned to a TA. Either there's a predefined rubric, or you create it as you go (-1 point for mistake X, half credit for mistake Y, etc.). There's a pretty slick interface where you just read the answer, and use keyboard shortcuts to apply the relevant deductions.
It still has the issue that every time you change the rubric, you'd need to go back and re-do previously-graded instances of that problem. But it was way faster and (equally important) less tiring.
There’s also open-source software that does the same job at TU Delft: https://zesje.tudelft.nl/
(disclaimer: I briefly worked on the software for my bachelor’s thesis)
For final exams, we used to mark across all sections of a course (so for 101-type courses, this can be hundreds to thousands of papers).
Get all the profs and TAs together and break into groups, each taking one problem or set of problems. Then you random-sample (each group takes a stack) to get a feel for the 'typical' errors; once that's done, you are a machine going through the stacks.
Every once in a while (not that often) you run into a novel error or approach, and the group discusses.
My CS school implemented OCR test sheets, with some exceptions, and equivalent strategies, such as test suites and benchmarks for programming assignments. This was done to avoid subjective grading, as it was a big issue even in well-intentioned cases.
Often you still get big problems, but the set of solutions is small: it's always three options plus a fourth option (none / all of the above). If you make a mistake, you score negative points. It's not perfect; sometimes the wording is ambiguous and it's unclear whether you need to tick the fourth catch-all option. But I found it better than the alternatives, since it removes most of the arbitrariness from the process, though it obviously has other issues.
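A minimal sketch of how such a negative-marking scheme might be scored. The point values (+1 / -1) and the per-question floor are assumptions for illustration; the school's actual rules may differ:

    # Negative-marking multiple-choice scorer (illustrative values only).
    def score_question(ticked: set[str], correct: set[str]) -> int:
        # +1 per correctly ticked option, -1 per wrongly ticked one.
        return len(ticked & correct) - len(ticked - correct)

    def score_exam(answers, key) -> int:
        # Floor each question at zero (an assumption; some schemes
        # allow negative totals to carry over).
        return sum(max(score_question(a, k), 0) for a, k in zip(answers, key))

    # Options A, B, C plus the catch-all D ("none / all of the above").
    key = [{"A", "C"}, {"D"}]
    answers = [{"A", "B"}, {"D"}]
    print(score_exam(answers, key))  # 0 + 1 = 1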
Regular exams often had wildly different grading standards for the same course depending on the class, and thus on the professor who was correcting exams. This was really annoying.
An even better strategy is to have the papers scanned by a double-sided scanner and graded by an AI grader.
Everything in this thread just randomizes who doesn't get graded fairly.
Is there a better solution? Expecting teachers to be perfect isn't one; since that's not possible, it's not a solution at all.
No, the solution is for the scoring to be handled by software that doesn't exist yet. Some things have easy, objective measures of correctness. STEM is mostly this way. Others, your humanities et al, are fairly subjective.
You could probably cover most of this with an LLM and access to a large body of graded material for a given course, provided said material was graded fairly. Generating that data would be time-consuming, as any given assignment would need to be graded by as many people as possible in order to find a fair average.
From there, it's a simple comparison between your sample work and the presented work. We're probably a decade away from this really being viable en masse, but it no doubt will happen, and for better or worse we'll likely end up with EDUAAS (education as a service).
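One plausible shape for that "compare against a graded corpus" idea, sketched here with sentence embeddings and nearest-neighbor averaging rather than an LLM. The model name, the corpus, and k are all assumptions; this is a sketch of the concept, not a viable grader:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

    def predict_grade(submission: str, graded_texts: list[str],
                      graded_scores: list[float], k: int = 5) -> float:
        # Embed the already-graded corpus and the new submission.
        corpus = model.encode(graded_texts, normalize_embeddings=True)
        query = model.encode([submission], normalize_embeddings=True)[0]
        sims = corpus @ query                 # cosine similarities
        nearest = np.argsort(sims)[-k:]       # k most similar graded answers
        # Predict the grade as the average of the nearest neighbors' grades.
        return float(np.mean([graded_scores[i] for i in nearest]))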
LLMs are not going to be a solution. LLMs have absolutely no concept of truth.
And not everything has an objective solution. Even those that do often have a process associated with them and factoring in that work/process is an important part of grading. Reducing that subjective grading process to only objective solutions being right is grossly reductive and disproportionately punishes students who have the process right and understand the material but make small errors. That's exactly what you don't want to do.
---
Instead the solution is to make sure each assignment gets multiple eyes on it and in a random order. Then to document biases and trends in biases so that the TAs and professors can be aware of them and mitigate them.
It's a process problem that can only be solved by a process solution. Replacing the graders with technology or reducing problems to a binary right/wrong will never ever solve this and in many cases will end up being more harmful than the biases they claim to solve would be.
The LLM can compile verbose prose down to a short summary. If the summaries of each chunk are consistent, then it’s at least structurally well written. Then you grade the summary itself.
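A rough sketch of that summarize-each-chunk pipeline with the OpenAI Python client; the model name, chunk size, and prompt are placeholders rather than a tested grading setup:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def summarize(chunk: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system",
                 "content": "Summarize this essay section in two sentences."},
                {"role": "user", "content": chunk},
            ],
        )
        return resp.choices[0].message.content

    def chunk_summaries(essay: str, chunk_size: int = 2000) -> list[str]:
        # Naive fixed-size chunking; a real pipeline would split on sections.
        chunks = [essay[i:i + chunk_size]
                  for i in range(0, len(essay), chunk_size)]
        return [summarize(c) for c in chunks]

The human grader then reads the concatenated summaries: if adjacent summaries contradict each other or fail to connect, the essay probably isn't structurally coherent.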
At that point you are grading the work of the LLM, not the student.
Yes, you can grade objectively.
Automatically unskew the results after grading based on this finding?
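Taken literally, that could mean fitting the drift of scores against grading position and subtracting it out. A minimal sketch, assuming the bias is a simple linear trend over grading order (real position effects may well be nonlinear):

    import numpy as np

    def detrend_scores(scores: list[float]) -> np.ndarray:
        # scores[i] is the score of the i-th paper in grading order.
        pos = np.arange(len(scores), dtype=float)
        slope, intercept = np.polyfit(pos, scores, deg=1)
        trend = slope * pos + intercept
        # Remove the positional trend but keep the overall mean.
        return np.asarray(scores) - trend + np.mean(scores)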
Probably it would be something like the following:
Have a group of N graders. And a parity of k. Let's say N is 6 and k is 2. Randomly shuffle the assignments and partition the assignments into N groups.
Each grader gets assigned k of the N groups such that they share at most 1 overlap with any other grader and each group is assigned to k people. The assignment orders are shuffled for each grader. They mark up and then grade the assignments.
Then, for each of the N groups, randomly shuffle the group and distribute its assignments equally among the N-k graders who did not mark it in the first round.
Now each grader reviews the assignment grades/markups (in random order) and assigns a grade based on the k grades/markups from the previous rounds along with a rationale for the grade assigned.
From there the student receives the final assigned grade, the rationale for the grade, and the k markups. If they have a complaint they can go to the professor (who then can also see the k initial grades along with everything else) to dispute the grade for the assignment.
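A minimal sketch of those two rounds in Python. It uses a circular construction (grader i takes groups i and i+1 mod N), which satisfies the at-most-one-overlap condition for k = 2; larger k would need a proper combinatorial design. All names and sizes here are invented:

    import random

    def round_one(papers, graders, k=2):
        N = len(graders)
        random.shuffle(papers)
        groups = [papers[g::N] for g in range(N)]   # N near-equal groups
        # Grader i marks up groups i, i+1, ..., i+k-1 (mod N).
        duty = {graders[i]: {(i + j) % N for j in range(k)} for i in range(N)}
        return groups, duty

    def round_two(groups, graders, duty):
        # Each group goes, shuffled, to the N-k graders who did NOT mark it.
        review = {g: [] for g in graders}
        for gid, group in enumerate(groups):
            reviewers = [g for g in graders if gid not in duty[g]]
            pile = group[:]
            random.shuffle(pile)
            for i, paper in enumerate(pile):
                review[reviewers[i % len(reviewers)]].append(paper)
        return review

    graders = [f"TA{i}" for i in range(6)]              # N = 6, k = 2
    papers = [f"student{s:03d}" for s in range(120)]
    groups, duty = round_one(papers, graders)
    reviews = round_two(groups, graders, duty)
    # Each TA marks up 120 * 2 / 6 = 40 papers and reviews 120 / 6 = 20.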
---
This way each TA only has to mark up (class size * k / N) assignments, and review (class size / N) assignments to assign a final grade (which should take far less time to do than the initial markups). On top of that every assignment has a guaranteed (k + 1) separate eyes on it. And then the professors can serve as an unbiased arbiter while retaining all the context from the process.
To take it an additional step further, the professors could sample a random subset of the assignments to verify the markup and grading is going properly.
And those reviews/grade adjustments can then be recorded (along with the final grade/rationales) to document how a given TA's grading deviates from the final reviewed grade or the grade the professor assigns. Likewise for a TA's final assigned grade deviating from the professor's. This would allow deviations to be mitigated over time and major deviations to be identified.
For a single assignment, yes. But at least randomization might mitigate the effect across a term.
I don't think this is fair. It's just a more randomly distributed unfairness, rather than unfairness tied to a deterministic factor (like the student's name).
'Fair' would be each student is assessed independently for the work they did, rather than their mark being impacted by how early or late they were marked.
It would be essentially impossible to have something "truly" fair for open-ended questions since humans are stateful.
Maybe this is a case that AI could actually do quite well.
Manually grade the answers and identify the classes of mistakes. Then hand the classes of mistakes to the AI and ask for it to determine which answers have which types of mistakes.
Once you've done that, you just need to associate a deduction for each type of mistake and do some simple math.
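The "simple math" at the end might look like this; the mistake classes and point values are invented for illustration:

    # Deduction table: mistake class -> points off (invented values).
    DEDUCTIONS = {
        "mixed_up_algorithms": 4,
        "wrong_complexity": 2,
        "incorrect_description": 3,
    }

    def score(max_points: int, mistakes: list[str]) -> int:
        # Everyone tagged with the same mistakes gets the same score.
        return max(max_points - sum(DEDUCTIONS[m] for m in mistakes), 0)

    print(score(10, ["wrong_complexity"]))                         # 8
    print(score(10, ["mixed_up_algorithms", "wrong_complexity"]))  # 4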
What do you mean, AI? You must be joking.
Imagine a question: compare the bubble sort and quicksort algorithms.
Some students might mix up the algorithms, some might give an incorrect computational complexity, and some might describe them incorrectly in some other way.
Manually grade some (or all) of the answers, noting the kinds of things students got wrong (e.g. the above criteria). Then feed into ChatGPT (or your favorite alternative) the answer + the categories of mistakes to expect.
Here's a simplified example: https://chat.openai.com/share/bf801e12-51d5-4255-9968-bbf91b...
There are many notions of "fairness", many of which are logically incompatible with each other.
In this example, I think it's kind of fair to give everyone an equal chance of being advantaged. You're not hurting anyone specifically.
I think an important difference is that when you shuffle them, the unfairness stops being correlated across multiple assignments, so the "aggregate" unfairness over the course of the semester is much lower.
Is that distinction worth making here? There’s no way to “assess independently” the work of each student without some amount of randomness. But I think that’s okay, because isn’t randomly distributed unfairness just… fairness?
Around the year 2000, I had an essay due that day that I had forgotten about, and about ten minutes of computer lab time before homeroom in the morning. I wrote an introduction and a conclusion, then filled the remainder with copy-pasted chunks of the introduction and conclusion. The thought being, at least I'd get a laugh. If anyone had read the thing, it would have been clear it was nonsense.
I received an 80% with no notes or markup.
I have been left wondering for the last 25 years how much student work is actually even reviewed.
I work in EdTech and every time we add a feature that requires manual teacher review of student work you will see that some teachers are VERY diligent while others never touch it.
I know a guy who copy/pasted a Wikipedia article, inline citations and all, submitted it for a sociology class, and got an A. No notes, nothing.
He “only cheated himself.” :-D
The point is to develop skills and knowledge, so I would agree. Do you disagree?
I agree, but we used to cringe at this saying when we were young, so it's funny to bring it back now.
There was this numerical calculus class at uni where the teacher forbade us to use calculators. So I just programmed the integral on mine, got the partial steps, and wrote random numbers to fill in the substeps. Got full marks :D The other case: everybody got to pass the class, but after vacation we found the stack of exams completely untouched under a desk. The teacher had a side business to run...
A teacher friend of mine always goes through his stack twice. Once to correct all mistakes and a second time to write down points. As you said, once you have seen all mistakes you know how "bad" of a mistake it actually is.
Crucially, this is not quite what the poster said. It’s not about stack ranking students against each other.
Say every paper makes the same subtle mistake, and you only notice it halfway through the pile. Unless you go back through them all, you’ll unfairly grade the later entries more harshly.
It's not, but it sort of has that effect, albeit indirectly, and definitely unfairly.
I think we’re talking about the same thing, but to clarify my meaning:
If you weigh the severity of students' mistakes (or successes, for that matter) in relation to each other rather than against an objective rubric, you're effectively stack ranking them whether you mean to or not.
I'm not a big fan of putting everything in the cloud, but one of the advantages of online grading systems is that it is easier to make this kind of adjustment. The workflow goes like this: make a rubric item for a specific kind of mistake (it takes a little experience to know which mistakes are likely one-off and which ones are likely to be repeated by other students), assign X points, and later if you decide there are worse mistakes, adjust the points and that gets applied to everyone.
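A toy sketch of why that retroactive adjustment is cheap in such tools: submissions hold references to shared rubric items and scores are computed on demand, so changing an item's point value re-grades everyone at once. This illustrates the workflow described above, not any particular tool's actual implementation:

    class RubricItem:
        def __init__(self, description: str, points_off: float):
            self.description = description
            self.points_off = points_off

    class Submission:
        def __init__(self, student: str):
            self.student = student
            self.items: list[RubricItem] = []

        def score(self, max_points: float) -> float:
            # Score is derived from the current rubric, never stored.
            return max_points - sum(i.points_off for i in self.items)

    sign_error = RubricItem("dropped a minus sign", 1)
    alice, bob = Submission("alice"), Submission("bob")
    alice.items.append(sign_error)
    bob.items.append(sign_error)

    sign_error.points_off = 2               # later: decide it's a worse mistake
    print(alice.score(10), bob.score(10))   # 8 8, applied to both at once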
This might come off as rude by accident, but I mean it genuinely, without malice. When I'm writing an essay to submit to my professor/teacher, I am asked to make multiple drafts to get a proper end result that is ready to submit. Understanding that educational staff are already often overworked, should I expect _less_ from the person I receive my education from? If you acknowledge that many of the grades I receive are actually not fair to me, and there's an attempt to randomize the order in which papers are graded, then many of the grades I received (whether high or low) were given partially (that is to say, the opposite of "impartially"). And there's a real concern, in your example where the submissions are committed to a repository that you need to shuffle, that my submission ends up in a similar position in the stack week after week, unless you're actually doing something to ensure my position in the stack is different between submissions. It's probably sufficient in many cases, but it doesn't guarantee randomness unless the algorithm that randomizes submissions takes previous stack orderings into account.
It’s simply human nature. Teachers can either lie to themselves and you about it, or mitigate it. What more could you possibly want from them as humans?
I somewhat assumed there would be commenters suggesting the human angle as a retort. That's why I prefaced with both "this is what the teacher expects of me" and "understanding that educational staff is already often overworked." It just seems to me that the current systems aren't sufficient, and acknowledging that is what leads people to improve those systems. The above commenter suggested what they do in academia as workarounds to what the study showed, and I'm saying even that is not sufficient.
It seems like you're agreeing with me, but jumping to their defense with "people are fallible." People are fallible, that's why we build systems to take human elements out of it. Recognizing where humanity has soured something is key to that.
I know it's not the point of your post, but I think it's worth pointing out that you're misunderstanding randomness (albeit in a very typical way). Although randomness is likely, eventually (over a lot of instances), going to be the most "fair" way to distribute where your submission is in the order, it does not guarantee that it will always be different, and in fact a "random" algorithm that took previous orderings into account would be provably less random than one that didn't.
It's also worth noting that randomization in a context like this is inherently an imperfect solution to a problem that generally can't be solved perfectly. If we find out that weird ordering biases exist, I think randomization is done on the assumption that many we don't know about could also exist, that there's no clear way to mitigate them completely, and then randomizing the order per-instance is just the best we can do to ensure it's fair (Which, again, won't be perfect. Perfect isn't available)
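A quick simulation of that point: an independent shuffle each week doesn't guarantee your submission moves around the stack, but every position is equally likely, so the expected position over the term is identical for everyone. The class of eight students here is invented:

    import random
    from collections import defaultdict

    students = list("ABCDEFGH")
    totals = defaultdict(int)
    weeks = 10_000

    for _ in range(weeks):
        order = students[:]
        random.shuffle(order)          # independent of all previous weeks
        for pos, s in enumerate(order):
            totals[s] += pos

    for s in students:
        # Every student's average position hovers around (8 - 1) / 2 = 3.5.
        print(s, totals[s] / weeks)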
When I saw the title, I thought that the higher concentration of Asian names starting with V, W, X, Y, Z would have led to higher grades at that end of the alphabet, and that this effect would have eclipsed anything else.
Anecdotally, the course I grade has this effect (just looking at the average score). I have been grading this course for the last 5 years (9-10 times). Last names with L-Z score slightly more than A-L.
Indian names often start with A, B, N. Chinese names also often start with C, F, L.
When I was a TA I always did a second pass to make sure everything was even. It’s not that hard.
It's hard when you are the only TA for 260 students who get 3 assignments per week, you must also hold office hours, and you aren't allowed to go over 27 hrs each week so the school isn't breaking federal laws.
We tried a lot of things. What eventually worked was ending grades. You mastered the material or you did not; perhaps a couple of students mastered it with high marks.
Obvs this takes an administration that is OK with that, which most aren't.
Having hired a lot of engineers, I can tell you that mastery of material is nothing close to a bimodal distribution.
We graded similarly, incidentally, when I was at U-of-M (lol). I don't think we ever sorted by name, so I don't know if we'd have a bias effect by name unless it's an implicit bias toward lexicographical esthetics. I won't deny that grading fatigue can have subjective effects. I always thought we did a pretty fair and objective job. I taught Computer Architecture, and we developed answer keys and grading scales before grading a single test. Of course, assigning partial credit always ended up being pretty subjective. Typically, though, people would err in the same ways, and so those would be subjectively identical. I never thought names factored into this much but, to be fair, no one ever collected data…
Finally, I guess I’ll admit that I'm probably very biased because my initials are A.B. and I've always gotten excellent grades, so… maybe maybe maybe
While this helps the students with names lower down the order, people who are graded later still suffer.
There are all sorts of good ways to avoid these biases. I use the same practice described above for paper exams, and grading order for, say, question 2 may be affected by the score on question 1, but it won't be affected by name or ID number.
If you use Canvas or Gradescope with the default settings, it’s almost impossible to avoid this sort of bias.
Worse yet, in Gradescope you’re strongly steered toward grading with a fixed “rubric” with specific points off for each of N pre-defined errors, allowing grading to be done by TAs with little more knowledge than the students themselves, resulting in scores which have little relationship to the quality of the student answer.
Have you ever thought about just passing out a set of grades at random to random individuals and seeing how that shakes out? Like totally random and unjustified grades: a D- for an A+ student, an A+ for failing students, etc. Just random chaos. Then score the final correctly and see the effect?
Or just having a Kafkaesque pass/fail grade with no feedback for each student, relative to their own performance over time, with an expected growth rate applied?
For grading essay assignments, and possibly also essay-style exams:
It is important to get a feel for the collective level of writing before grading essays individually, and it is important to avoid over-grading or under-grading essays at the beginning or end of a stream of papers. Therefore I used a three-stage grading process, with three colors of pens:
The first pass, with a red pen, is marking up single-point problems like misspellings and glaring usage errors. Also, of course, the general level of writing begins to seep in. This pass includes all papers, and it goes fairly quickly.
The second pass, with a green pen, mostly just marks in the margin where a (good) point is made or a conclusion is reached. This is to prepare for the next pass. Again, all papers are done in this pass.
The third pass, in blue pen, is where the quality of the writing is assessed and critiqued. Maybe some short notes in the margins, maybe just comments at the end of the essay.
When students get their papers back, there are some chuckles (or whatevs) when students see all the pretty colors. But after I explain the method and its rationale, the method is clear and understood (and also appreciated?).
(I suppose the cons outweighed the cons.)
Did you perceive any pros?
I suppose one way to do grades is to first read through all papers to get an idea of the levels of the students. Though you still have bias/nepotism and such then. Perhaps teamwork or a committee would work, or teachers swapping classes/schools?
I had a French teacher in high school who dropped a pen on the list of students, and whoever's name it landed on would get quizzed. People in the middle of the list (waves) were fried.
Plus, there is also the issue of certain last names being common in certain cultures, leading to skewed statistics.