Calibrating Assessments

Last year I wrote a post on how Oxford selects undergraduate students for computer science. Oxford undergraduate admissions are politically charged: the British public suspects that Oxbridge is biased towards privileged applicants.

This post is unpolitical though. I'm advocating a better way of solving a typical problem in academia: you have many candidates (think: papers, grant proposals, or Oxford applicants) and few assessors (think: members of program committees, assessment panels, or admission tutors), and the goal of the assessors is to select the best candidates. Since there are many candidates, each assessor can assess only some candidates. From a candidate's point of view: a paper may get 3 reviews, a grant proposal perhaps 5, and a shortlisted Oxford applicant may get 3 interviews. Each such review comes with a score that reflects the assessor's estimate of the candidate's quality. The problem is that different assessors will score the same candidate differently not only because they view the candidate differently but also because they view the scoring scale differently: some assessors are friendly, some are strict, some are not aware of grade inflation, and some assessors have a stronger pool of candidates than others. As a result, the chances of a candidate may be skewed by their assessors' (perhaps unconscious) generosity.

The standard way of dealing with the problem is to ignore it and hope that the scores will “average out”. That is, for each candidate, take the average score and hope that the overly strict reviews are balanced out by overly generous reviews. After all, what else can you do? Well, this is what this post is about.

One may throw more resources at the problem: increase the number of reviews per paper from 3 to 4, give borderline-promising candidates another interview, etc. Increasing the number of reviews per candidate works: more reviews lead to more “averaging out”. And it is hard to argue against, because the decisions are really important, aren't they? What I'm advocating is to improve the overall ranking without spending more time on assessments.

The crucial idea is: it is relatively easy to compare candidates, even when you are not familiar with the scoring scale. As an admission tutor, I ask different questions every year, so although I can rank applicants right after having interviewed them, it is much harder to assign them a score that is compatible with my colleagues' use of the scale (or even with my own use of the scale last year). Estimating the absolute quality of a candidate is hard. Therefore, I believe the overall ranking of all candidates should depend mainly on the assessors' comparisons between candidates.

One may consider asking each assessor for a ranked list of the candidates they have assessed, and then for each candidate sum up their ranks in these lists. That won't work well, for instance because some assessors will have a stronger field than others, so candidates who are assessed along with strong candidates will be at a disadvantage. A simple ranking also ignores the fact that some candidates are equally strong, something an assessor can express with a score but not with a rank.

Here is the solution I advocate: think of each review score s_{a,c}, where a is an assessor and c a candidate, as the sum of the generosity g_a of Assessor a and the quality q_c of Candidate c:

\begin{equation} s_{a,c} = g_a + q_c \tag{1} \end{equation}

A list of scores s_{a,c} is given, and the g_a and the q_c are unknown. We don't particularly care about the g_a. We care more about the q_c, but most crucially we care about the differences in q_c between candidates, because we are after the candidates with the highest quality. Each review contributes one equation of the form (1), but the number of unknowns g_a and q_c is generally smaller than the number of equations, so the system defined by (1) is overdetermined. Indeed, we can't expect to find a perfect solution: if two assessors assess the same two candidates but only one of them gives the two candidates equal scores, then the system (1) is not solvable. There is a principled way out of this dilemma: least squares. That is, find (g_a)_a and (q_c)_c such that the sum, over all reviews, of (s_{a,c} - g_a - q_c)^2 is minimized. There is a standard, efficient, linear-algebra-based way of computing this least-squares solution.
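To make this concrete, here is a minimal sketch in Python with NumPy; the review data, the indices, and the variable names are made up for illustration and are not from any real panel. Each review contributes one row of a linear system in the unknowns g_a and q_c, and numpy.linalg.lstsq computes the least-squares solution.

```python
import numpy as np

# Hypothetical example data: each review is (assessor index, candidate index, score).
reviews = [
    (0, 0, 7.0), (0, 1, 5.0),
    (1, 1, 8.0), (1, 2, 6.0),
    (2, 2, 9.0), (2, 3, 7.0), (2, 0, 10.0),
]
num_assessors, num_candidates = 3, 4

# One row per review, with a 1 in the column of the assessor's
# generosity g_a and a 1 in the column of the candidate's quality q_c.
A = np.zeros((len(reviews), num_assessors + num_candidates))
b = np.zeros(len(reviews))
for row, (a, c, score) in enumerate(reviews):
    A[row, a] = 1.0                   # coefficient of g_a
    A[row, num_assessors + c] = 1.0   # coefficient of q_c
    b[row] = score

# Least-squares solution; lstsq copes with the rank deficiency
# by returning the minimum-norm minimizer.
solution, *_ = np.linalg.lstsq(A, b, rcond=None)
g = solution[:num_assessors]          # generosities g_a
q = solution[num_assessors:]          # qualities q_c
print("qualities:", q)
```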

The system (1) is, on the other hand, underdetermined: adding 1 to each g_a and subtracting 1 from each q_c leads to the same right-hand sides and thus to the same error. This is harmless: we care about the candidates c with the highest q_c, not about how high those q_c may be. Normalizing the generosities, e.g., by setting \sum_a g_a = 0, in conjunction with a natural assumption about the connectedness of the assessor-candidate graph, makes the least-squares solution unique.
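Continuing the sketch above, the normalization can be applied after solving: shifting all g_a down by a constant and all q_c up by the same constant leaves every predicted score g_a + q_c, and hence the squared error, unchanged.

```python
# Shift so that the generosities sum to zero; the predicted scores
# g_a + q_c (and thus the squared error) are unaffected.
shift = g.mean()
g_normalized = g - shift
q_normalized = q + shift
assert abs(g_normalized.sum()) < 1e-9
```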

An appealing feature of this method is that candidates who find themselves assessed by a strict assessor are not at a disadvantage. For instance, consider a strict assessor a_1. There will be candidates c that are assessed both by a_1 and by more “normal” assessors, say a_2. We will have s_{a_1,c} < s_{a_2,c}, and such inequalities will depress g_{a_1}, reflecting the low generosity of a_1. The strictness of Assessor a_1 does not influence the computed quality q_c of the candidates assessed by a_1: if Assessor a_1 adds a constant to all their scores, then adding the same constant to g_{a_1} gives a solution with the same (least-)squared error. If we produce a ranking of the candidates based on the computed q_c, then the only way for an assessor to influence that ranking is through the differences between that assessor's scores. Even if all assessors of a candidate are strict, the candidate is not necessarily at a disadvantage.
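This invariance is easy to check numerically with the sketch above: making one assessor uniformly stricter shifts that assessor's generosity but leaves the differences between candidate qualities, and hence the ranking, unchanged (assuming the assessor-candidate graph is connected, as it is in the made-up data).

```python
# Make assessor 0 two points stricter and re-solve.
stricter = [(a, c, s - 2.0 if a == 0 else s) for (a, c, s) in reviews]
A2 = np.zeros((len(stricter), num_assessors + num_candidates))
b2 = np.zeros(len(stricter))
for row, (a, c, score) in enumerate(stricter):
    A2[row, a] = 1.0
    A2[row, num_assessors + c] = 1.0
    b2[row] = score
sol2, *_ = np.linalg.lstsq(A2, b2, rcond=None)
q2 = sol2[num_assessors:]

# Candidate quality differences (and thus the ranking) are unchanged.
print(np.allclose(q - q.mean(), q2 - q2.mean()))  # True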

A recent paper studies the method in the context of panel assessments. The paper includes experiments on randomly generated data. Of course, the experimental results depend on the exact assumptions and should be taken with a grain of salt. They suggest that, starting with 3 assessors per candidate and simple averaging (that is, for each candidate take the average of the 3 raw scores), moving to the calibrated method described above improves the quality of the assessment about as much as increasing the number of assessors from 3 to 6. Note that the former involves a least-squares computation, whereas the latter involves doubling the assessment resources. The authors also study a refined method where the assessors additionally quantify their confidence in their judgement. This further improves the overall assessment quality (a bit).
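One plausible way to incorporate such confidence values, sketched below under the assumption that they act as per-review weights in the least-squares objective (the paper's exact formulation may differ), is weighted least squares: confident reviews pull harder on the solution than unconfident ones. The sketch reuses A and b from the example above.

```python
# Hypothetical per-review confidence weights (larger = more confident).
weights = np.array([1.0, 0.5, 1.0, 1.0, 2.0, 1.0, 1.0])

# Weighted least squares: minimize sum_i w_i * (s_i - g_a - q_c)^2,
# i.e. ordinary least squares after scaling each row by sqrt(w_i).
sqrt_w = np.sqrt(weights)
sol_w, *_ = np.linalg.lstsq(A * sqrt_w[:, None], b * sqrt_w, rcond=None)
q_weighted = sol_w[num_assessors:]
```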

Curiously, the authors credit a 1940 paper on crop treatments for the discovery of the (basic) calibration method. But this method of calibration must be very old. My father, an engineer in whose mind geodesy is the mother of science, claims it goes back to Gauß. In fact, the method above applies whenever imperfect measuring devices (such as altimeters) are used to determine quantities (such as heights or rather height differences).

I am not aware of an assessment panel that actually uses the calibration method. I think one should use it. There may be a human obstacle: assessors may think that they use the scoring scale correctly and they may view calibration as doubting their assessment ability.

