Grading is subjective. Period.
And sometimes three minutes is all you get. I designed an entire graduate course around this idea.
In 2017 I was asked to be on a committee choosing PhD fellowships for international students in Quebec. The criteria were strict, and the applications were long (click here to scroll through a typical one). I ended up spending an entire weekend carefully reading the 50+ page applications of 20 students whose career prospects were literally in my hands.
Then we had our committee meeting, and I realized that my grades were very different from those of my colleagues. Yet, somehow, we all managed to agree on the top five candidates, who were the only ones to get the fellowship. Surprised by this easy consensus based on very noisy data, in 2018 I told my wife that I would spend exactly three minutes per application. She told me that was unfair to the students, who had probably spent three weeks (or maybe three months) putting that application together. I told her that they would never know (the perks of anonymous peer review).
I opened the applications at 7am on the day of the deadline and took three minutes to scroll through each one and assign it an approximate ranking. I submitted my scores at 8am, when the evaluation site closed, and then I waited for the committee meeting…
At the meeting we reached an easy consensus once again. The best applications rose to the top, and even though our scores were noisy, our recommendations were consistent. That is when I decided to do some statistics. I entered all the reviewers’ scores in a spreadsheet and calculated Pearson correlation coefficients to see how much reviewers agreed with each other. The analysis is here and the results are below:
On average, reviewers agreed with each other with a correlation coefficient around 0.5 (r² ≈ 0.25), which leaves about 75% of the variance unexplained. Not bad, not great, and on par with any decision-by-committee. The correlations are significant, yet not very strong. Surprisingly, my scores did not stand out in any way. In 2017 I had a strong correlation (r = 0.75) with Reviewer 3, whereas all the other pairwise correlations were between 0.35 and 0.55. In 2018, the two other reviewers did not agree with each other much (r = 0.3), whereas I managed to correlate with both of them slightly more (0.5 with each).
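For the curious, here is a minimal sketch of that kind of spreadsheet analysis, assuming the scores sit in a CSV with one row per application and one column per reviewer; the file and column names are hypothetical stand-ins, not the actual data from the linked analysis.

```python
import pandas as pd

# Pairwise Pearson correlations between reviewers' scores.
# Hypothetical file and column names: one row per application, one column per reviewer.
scores = pd.read_csv("fellowship_scores_2018.csv")
corr = scores[["reviewer_1", "reviewer_2", "reviewer_3"]].corr(method="pearson")
print(corr.round(2))

# r squared is the share of variance explained:
# r = 0.5 explains only 25%, leaving roughly 75% unexplained.
print((corr ** 2).round(2))
```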
All this to say that my 3-minute experiment went completely unnoticed. Nobody knew about it, nobody raised an eyebrow or questioned my judgement, and statistically my 2017 and 2018 contributions were indistinguishable. Yet my effort was reduced by an order of magnitude. I took that as a huge win. Whether I spent a single hour or an entire weekend on a review, the scores would remain noisy, the agreement between reviewers would stay low, and the best would rise to the top. I was particularly happy when I told this story to a group of students in a previous class, and one of them approached me to say that he had been ranked first in the 2018 competition. I looked up my score, and he was my top choice, even after a 3-minute review. He was also the best student in the class.
DOES THIS MAKE ME A BAD REVIEWER?
Not really sure about this one. In the years since, I have cut back on reviewing papers and grants, and the above analysis contributed to that decision. Why waste my time reviewing when the effort is not proportional to the outcome? However, there was one aspect of reviewing that I could never escape: grading the students in the classes I teach.
That is why in 2019 I started a course whose tagline is ‘3 minutes is all you get’. I told the students that they would give three 3-minute presentations over the course of the term, and that those scores would make up 30% of their grade. Because the class requires presentations on a broad range of biomedical technologies, I would invite experts to serve as the jury, give them a list of criteria (clarity, accuracy, and production values), and we would hold our own ‘Poly’s got talent’ grading sessions.
The class became popular, and I ended up with long waitlists. The course evaluations were very high in all categories except the section titled ‘Grading criteria are clear’, where the scores were very low. Clearly, the students did not understand how we put the scores together, but I thought this lack of clarity came with the territory. The experts were three different scientists; each took the criteria to mean slightly different things, and the same student could end up with a high score from one judge but a low score from another. It was understandable, but also disorienting for the class.
GRADING IS RANKING
Last year I finally had a realization! It wasn’t the scores that were disorienting, it was my attempt to be clear about the criteria. I had asked the judges to grade based on clarity, accuracy, and production values, but those three concepts are very vague. So in 2023 I told my judges to design their own criteria, and that the only thing I cared about was how they ranked the students. I also told the students that they would be ranked by the judges and that their grades would scale with their ranking (first place gets 10 points, second place gets 9.5, and so on).
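As a worked example, the rank-to-grade rule amounts to something like the sketch below; the half-point decrement matches the description above, while the floor at zero is my own assumption.

```python
# Rank-to-grade rule: first place gets 10 points, and each subsequent
# rank loses half a point. The floor at zero is an assumption on my part.
def grade_from_rank(rank: int) -> float:
    return max(10.0 - 0.5 * (rank - 1), 0.0)

print([grade_from_rank(r) for r in range(1, 6)])  # [10.0, 9.5, 9.0, 8.5, 8.0]
```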
The new ranking scheme led to several students dropping out of the class. They were uncomfortable with the idea of competing against their classmates and thought that grades should reflect knowledge, not competition. I told them that I understood their point of view, but that the criteria were now clear. I also told them that classes are one of the few places where ‘objective’ criteria are still the norm. Companies do not hire based on passing a threshold; they hire based on a ranking whose criteria are often inscrutable and opaque to the applicants. The ‘best’ candidates get the jobs, whatever ‘best’ means. A low ranking should not be taken personally, because no matter how good they are, somebody will think they are not good enough. And if they didn’t like this message, they could always take another class.
The students who stuck around gave me the highest teaching evaluations I have ever received.1 Somehow my methods clicked with this group. They got to hear from amazing speakers, including May Griffith, Danielle Levac, Julien Cohen-Adad, Stuart Buck, Adam Mastroianni, Matt Clancy, and Evelyn McLean. They asked hundreds of questions during class. They created brilliant presentations that got noticed on social media. They got internships based on their final projects. They wrote testimonials about the class (here, here), and four of them will return to class today to give a guest lecture for my course, which starts in one hour. They will only get 3 minutes; I will time them, rank them, and they will not take it personally! Hopefully that convinces this year’s class that ranking is not as bad as it sounds. We pick our jobs, our friends, and our partners by ranking, so what’s wrong with assigning grades that way?
Today, following my old students’ 3-minute lectures, a group of 20 new students will show their first 3-minute presentations. I will be their first judge. They are probably uncomfortable with my (lack of) criteria, and are wondering if they made the right decision to stay in a class that explicitly tells them they will be ranked by a noisy and unpredictable algorithm. Some of them will drop out, but some of them will understand that ‘noisy and unpredictable algorithm’ is the very definition of our reality.
I hope all of them will remember the lessons of this blog post!
P.S. If you like the idea behind my class, you can subscribe to our new YouTube channel or follow us on Twitter/X. Yes, social media activity is required and counts toward the students’ participation scores, and they will be ranked based on their stats.