§ Precision, Recall, and all that.
- Setting: we have some theorem goal g, a dataset of mathematical lemmas D, a set of actually useful lemmas A, and a set of predicted lemmas P.
- We want to provide a good measure of how "good" P is with respect to the ground truth A.
- We begin by defining precision, which is the fraction of P that was correct: ∣P∩A∣/∣P∣. Probabilistically, this can be seen as P(actual|predicted).
- Similarly, we define recall as the fraction of A that was correctly predicted: ∣P∩A∣/∣A∣. Probabilistically, this is P(predicted|actual).
- Let us now change the setting a little bit, where we swap the set P for a \emph{sequence} over the full universe of lemmas D.
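These set definitions are easy to check directly; a minimal Python sketch, with a small illustrative predicted set (the lemma names are made up for the example):

```python
# Illustrative example: A is the set of actually useful lemmas,
# P is the set of predicted lemmas.
A = {"a", "b", "c"}
P = {"a", "b", "x", "y"}

precision = len(P & A) / len(P)  # |P ∩ A| / |P|: fraction of predictions that are correct
recall = len(P & A) / len(A)     # |P ∩ A| / |A|: fraction of correct lemmas predicted

print(precision)  # 0.5
print(recall)     # 2/3 ≈ 0.667
```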
- We can get a set by truncating P at some threshold. So we will define precision@k to be the precision of the set P[:k].
- We note that recall as a function of k, recall@k, is non-decreasing. As we see more predictions, we can only find more actually useful things. Note that recall has a fixed denominator (the number of actually useful things).
- Since recall@k is non-decreasing, we can build an inverse, k@recall(r). For a given level of recall r, we map to the smallest k (the smallest number of items we need to take from the ranking) that achieves that level of recall.
- This lets us define a precision-recall function, where precision@recall(r) = precision@k(k@recall(r)).
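A minimal Python sketch of precision@k, recall@k, and the inverse k@recall, on a small illustrative ranking:

```python
# Illustrative ranking over the universe of lemmas; A holds the correct ones.
ranking = ["x", "a", "y", "b", "c", "z"]
A = {"a", "b", "c"}

def precision_at(k):
    # precision of the truncated prediction set ranking[:k]
    return sum(item in A for item in ranking[:k]) / k

def recall_at(k):
    # fraction of all correct lemmas found in the top k
    return sum(item in A for item in ranking[:k]) / len(A)

def k_at_recall(r):
    # smallest k achieving recall level r (well-defined since recall_at is non-decreasing)
    return next(k for k in range(1, len(ranking) + 1) if recall_at(k) >= r)

def precision_at_recall(r):
    return precision_at(k_at_recall(r))

print(k_at_recall(2/3))          # 4: we need the top-4 items to recover 2 of the 3 correct lemmas
print(precision_at_recall(2/3))  # 0.5
```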
§ Precision, Recall formulae
- Suppose for a given goal g, we have 3 correct premises a, b, c. The universe has premises a, b, c, x, y, z. Our model predicts premises in the ranking x, a, y, b, c, z. Then we summarize this as follows:
Rank | Val | Prec | Rec
1    | x   | 0/1  | 0/3
2    | a   | 1/2  | 1/3
3    | y   | 1/3  | 1/3
4    | b   | 2/4  | 2/3
5    | c   | 3/5  | 3/3
6    | z   | 3/6  | 3/3
- Note that recall is monotonically non-decreasing, while precision both increases (0 -> 1/2) and decreases (3/5 -> 3/6).
- We introduce an auxiliary function delta, δ(i) ≡ [the lemma at rank i is correct]. This lets us write the above quantities as follows:
- Let s(n) ≡ Σ_{i=1}^{n} δ(i).
- The total number of correct elements is s(N), where N is the total number of premises in the universe.
- The precision at k is given by p(k) ≡ s(k)/k. The recall at k is given by r(k) ≡ s(k)/s(N).
- Now note the discrete difference dr(k) = r(k) − r(k−1), which equals (s(k) − s(k−1))/s(N), which is δ(k)/s(N).
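A Python sketch of δ, s, p, and r on the worked example above, checking that the discrete difference dr(k) equals δ(k)/s(N) (ranks are 1-indexed, as in the table):

```python
ranking = ["x", "a", "y", "b", "c", "z"]
A = {"a", "b", "c"}
N = len(ranking)

def delta(i):
    # δ(i) = 1 iff the lemma at rank i is correct (1-indexed)
    return 1 if ranking[i - 1] in A else 0

def s(n):
    # s(n) = Σ_{i=1..n} δ(i): number of correct lemmas in the top n
    return sum(delta(i) for i in range(1, n + 1))

def p(k): return s(k) / k      # precision@k
def r(k): return s(k) / s(N)   # recall@k

# dr(k) = r(k) - r(k-1) = δ(k)/s(N)
for k in range(1, N + 1):
    assert abs((r(k) - r(k - 1)) - delta(k) / s(N)) < 1e-12
```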
§ Mean Average Precision
- The best the precision@recall function can be is a flat line with precision = 1 for all levels of recall.
- Deviation from this tells us how much worse our model is than the best model.
- So, let's compute the area under the precision@recall curve. This is going to be the average precision, ap ≡ ∫_{r=0}^{1} p(r) dr.
- Recall that the "best" precision will always have p(r) = 1. Thus the theoretical maximum of this value is ap = ∫_{r=0}^{1} 1 dr = 1. This gives us a good scale, where 0 ≤ ap ≤ 1.
- We use recall as a way to "standardize" across the size of the dataset.
- Let's change variables to k. So we want to change r into r(k).
- This will change the bounds of integration.
- The lower limit is given by 0=r(l) which solves for l=0.
- The upper limit is 1 = r(u), which solves for u = N (the size of the dataset).
- This also changes dr to dr(k)dk.
- In the discrete case, we set dk = 1, and dr(k) becomes r(k) − r(k−1). This is (Σ_{i=1}^{k} δ(i) − Σ_{i=1}^{k−1} δ(i))/s(N), which evaluates to δ(k)/s(N).
- This gives us the discrete calculation of ap: ap ≡ Σ_{k=1}^{N} p(k) δ(k)/s(N).
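A sketch of the discrete ap formula on the worked example (the helpers mirror the definitions above, shifted to 0-indexed lists):

```python
ranking = ["x", "a", "y", "b", "c", "z"]
A = {"a", "b", "c"}
N = len(ranking)

delta = [1 if item in A else 0 for item in ranking]  # δ(k), as a 0-indexed list
s = lambda n: sum(delta[:n])                         # s(n)
p = lambda k: s(k) / k                               # precision@k

# ap = Σ_{k=1..N} p(k) δ(k) / s(N): average the precision at each correct rank
ap = sum(p(k) * delta[k - 1] for k in range(1, N + 1)) / s(N)
print(ap)  # (1/2 + 2/4 + 3/5) / 3 = 8/15 ≈ 0.533
```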
§ Mean Average Precision at K.
- I believe this to be an incoherent concept. Recall that we chose to define average precision as the area under the precision-recall curve. This is a sensible quantity, because it's a normalized value (recall is between 0 and 1). We got the expression in terms of k via a change of variables. We did not start at k! We started from r and got to k.
- We can try to hack the expression for ap to artificially create ap@K. Let's try.
- First, to go from K to r, we find the number r_K such that the recall at rank K is r(K) = r_K.
- Next, we calculate ap@K ≡ ∫_0^{r_K} p(r) dr.
- We must now find new lower and upper bounds in terms of k.
- The lower bound needs us to find 0 = r(l), or l = 0.
- The upper bound needs us to find r_K = r(u), or u = K.
- We will next have to calculate dr(k) dk. Previously, we set dk = 1, and we calculated dr(k) ≡ r(k) − r(k−1). This gives us dr(k) ≡ δ(k)/s(N). But note that s(N) counts all correct documents, not just those in the top K. So let's use s(K) instead --- a hack!
- Combining these, we get the formula ap@K ≡ ∫_{k=0}^{K} p(r(k)) dr(k) = Σ_{k=1}^{K} p(k) δ(k)/s(K).
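A sketch of the resulting ap@K formula (with the s(K) normalizer) on the worked example, taking K = 4 as an illustrative cutoff:

```python
ranking = ["x", "a", "y", "b", "c", "z"]
A = {"a", "b", "c"}
K = 4  # illustrative cutoff

delta = [1 if item in A else 0 for item in ranking]  # δ(k), 0-indexed
s = lambda n: sum(delta[:n])
p = lambda k: s(k) / k

# ap@K = Σ_{k=1..K} p(k) δ(k) / s(K) --- note s(K), not s(N)
ap_at_K = sum(p(k) * delta[k - 1] for k in range(1, K + 1)) / s(K)
print(ap_at_K)  # (1/2 + 2/4) / 2 = 0.5
```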
- ap@K feels very unprincipled, and I don't feel that this carries mathematical weight.
- Is there an alternative derivation that sheds light on why this formula makes sense?
§ R-precision
- Recall that the recall is a nondecreasing function of k. The precision can vary any way it wants with respect to k.
- We will try to find a point where precision equals recall.
- Consider the equation p(K)=r(K). Using our previous formulation, this reduces to s(K)/K=s(K)/s(N). This of course gives us K=s(N).
- So, at the index K which is equal to the total number of correct lemmas, we will have the precision equal the recall.
- This value is called the R-precision: the precision p(R) at the index R = s(N), the total number of correct lemmas.
- Empirically, R-precision correlates well with mean average precision.
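A quick Python check of R-precision on the worked example, verifying that precision equals recall at K = s(N):

```python
ranking = ["x", "a", "y", "b", "c", "z"]
A = {"a", "b", "c"}
N = len(ranking)

delta = [1 if item in A else 0 for item in ranking]  # δ(k), 0-indexed
s = lambda n: sum(delta[:n])
p = lambda k: s(k) / k      # precision@k
r = lambda k: s(k) / s(N)   # recall@k

R = s(N)              # total number of correct lemmas: here, 3
assert p(R) == r(R)   # s(R)/R == s(R)/s(N), since R == s(N)
print(p(R))           # the R-precision on this example: 1/3
```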
§ F1 Score
- A sensible derivation is offered by Van Rijsbergen in his PhD thesis.
- First, a quick and dirty definition: F1 is the harmonic mean of precision and recall.
- This gives us 2/(1/p + 1/r), which works out to 2/[(tp + fp)/tp + (tp + fn)/tp].
- This simplifies to 2tp/(2tp + fp + fn).
- Now consider the quantity E := 1 − F. Think of E as error. This works out to (fp + fn)/(2tp + fp + fn).
- See that this quantity works out to the symmetric difference of the Actual and Predicted sets, divided by the sum of their sizes! |AΔP| = fp + fn, ∣A∣ = tp + fn, and ∣P∣ = tp + fp.
- Thus, the full expression for E becomes E ≡ ∣AΔP∣/(∣A∣ + ∣P∣), which is a genuinely sensible quantity!
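A Python sketch verifying the identity E = |AΔP|/(|A| + |P|) on a small illustrative pair of sets:

```python
A = {"a", "b", "c"}       # actual set
P = {"a", "b", "x", "y"}  # predicted set (illustrative)

tp = len(A & P)  # true positives: 2
fp = len(P - A)  # false positives: 2
fn = len(A - P)  # false negatives: 1

f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
E = 1 - f1                        # error: (fp + fn) / (2tp + fp + fn)

# symmetric difference identity: E == |A Δ P| / (|A| + |P|)
assert abs(E - len(A ^ P) / (len(A) + len(P))) < 1e-12
print(f1, E)  # 4/7 and 3/7
```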