§ Precision, Recall, and all that.
- Setting: we have some theorem goal g, a dataset of mathematical lemmas D, a set of actually useful lemmas A ⊆ D, and a set of predicted lemmas P ⊆ D.
- We want a good measure of how "good" P is with respect to the ground truth A.
- We begin by defining precision, the fraction of P that is correct: ∣P∩A∣/∣P∣. Probabilistically, this can be seen as P(actual|predicted).
- Similarly, we define recall as the fraction of A that was correctly predicted: ∣P∩A∣/∣A∣. Probabilistically, this is P(predicted|actual).
- Let us now change the setting a little bit: we swap the set P for a \emph{sequence} (a ranking) over the full universe of lemmas D.
- We can recover a set by truncating the ranking at some threshold. So we define precision@k to be the precision of the set P[:k].
- We note that recall as a function of k, recall@k, is non-decreasing: as we see more predictions, we can only find more actually useful things. Note that recall has a fixed denominator (the number of actually useful things).
- Since recall@k is non-decreasing, we can build an inverse, k@recall(r). For a given level of recall r, we map to the smallest k (the smallest number of items we need to take from the ranking) that achieves that level of recall.
- This lets us define a precision-recall function: precision@recall(r) = precision@k(k@recall(r)).
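As a minimal sketch, the ranked quantities above (precision@k, recall@k, k@recall, precision@recall) can be written as follows; the names are my own:

```python
def precision_at_k(ranking, A, k):
    """Precision of the truncated set ranking[:k]."""
    hits = sum(1 for x in ranking[:k] if x in A)
    return hits / k

def recall_at_k(ranking, A, k):
    """Recall of the truncated set ranking[:k]; fixed denominator |A|."""
    hits = sum(1 for x in ranking[:k] if x in A)
    return hits / len(A)

def k_at_recall(ranking, A, r):
    """Smallest k whose recall reaches r (valid since recall@k is non-decreasing)."""
    for k in range(1, len(ranking) + 1):
        if recall_at_k(ranking, A, k) >= r:
            return k
    raise ValueError("recall level never reached")

def precision_at_recall(ranking, A, r):
    return precision_at_k(ranking, A, k_at_recall(ranking, A, r))

A = {"a", "b", "c"}
ranking = ["x", "a", "y", "b", "c", "z"]
print(precision_at_recall(ranking, A, 2/3))  # 0.5: recall 2/3 is first reached at k=4
```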
§ Precision, Recall formulae
- Suppose for a given goal g, we have 3 correct premises a, b, c. The universe has premises a, b, c, x, y, z. Our model predicts premises in the ranking x, a, y, b, c, z. Then we summarize this as follows:
Rank | Val | Prec | Rec
1 | x | 0/1 | 0/3
2 | a | 1/2 | 1/3
3 | y | 1/3 | 1/3
4 | b | 2/4 | 2/3
5 | c | 3/5 | 3/3
6 | z | 3/6 | 3/3
- Note that recall is monotonically non-decreasing, while precision both increases (0 -> 1/2) and decreases (3/5 -> 3/6).
- We introduce an auxiliary indicator function δ, where δ(i) ≡ 1 if the lemma at rank i is correct and 0 otherwise. This lets us write the above quantities as follows:
- Let s(n) ≡ Σ_{i=1}^{n} δ(i).
- The total number of correct elements is s(N), where N is the total number of predictions (the size of the universe).
- The precision at k is given by p(k) ≡ s(k)/k. The recall at k is given by r(k) ≡ s(k)/s(N).
- Now note the discrete difference dr(k) = r(k) − r(k−1), which equals (s(k) − s(k−1))/s(N), which is δ(k)/s(N).
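The δ, s, p, r machinery can be checked against the worked table (1-indexed ranks, names illustrative):

```python
A = {"a", "b", "c"}
ranking = ["x", "a", "y", "b", "c", "z"]
N = len(ranking)

def delta(i):  # 1 if the lemma at rank i is correct, else 0
    return 1 if ranking[i - 1] in A else 0

def s(n):      # cumulative number of correct lemmas in the top n
    return sum(delta(i) for i in range(1, n + 1))

def p(k):      # precision@k = s(k)/k
    return s(k) / k

def r(k):      # recall@k = s(k)/s(N)
    return s(k) / s(N)

for k in range(1, N + 1):
    print(k, ranking[k - 1], p(k), r(k))
```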
§ Mean Average Precision
- The best the precision@recall function can be is a flat line with precision = 1 for all levels of recall.
- Deviation from this tells us how much worse our model is than the best model.
- So, let's compute the area under the precision@recall curve. This is the average precision, ap ≡ ∫_{r=0}^{1} p(r) dr.
- Recall that the "best" precision will always have p(r) = 1. Thus the theoretical maximum of this value is ap = ∫_{r=0}^{1} 1 dr = 1. This gives us a good scale, where 0 ≤ ap ≤ 1.
- We use recall as a way to "standardize" across the size of the dataset.
- Let's change variables to k. So we want to change r into r(k).
- This will change the bounds of integration.
- The lower limit is given by 0 = r(l), which solves to l = 0.
- The upper limit is given by 1 = r(u), which solves to u = N (the size of the dataset).
- This also changes dr to (dr(k)/dk) dk.
- In the discrete case, we set dk = 1, and dr(k) becomes r(k) − r(k−1). This is Σ_{i=1}^{k} δ(i)/s(N) − Σ_{i=1}^{k−1} δ(i)/s(N), which evaluates to δ(k)/s(N).
- This gives us the discrete calculation of ap: ap ≡ Σ_{k=1}^{N} p(k) δ(k)/s(N).
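The discrete formula ap ≡ Σ_{k=1}^{N} p(k) δ(k)/s(N) translates to a few lines of Python (a sketch; the helper names are my own):

```python
def average_precision(ranking, A):
    """ap = sum over k of p(k) * delta(k), divided by s(N)."""
    N = len(ranking)
    def delta(i): return 1 if ranking[i - 1] in A else 0
    def s(n): return sum(delta(i) for i in range(1, n + 1))
    def p(k): return s(k) / k
    return sum(p(k) * delta(k) for k in range(1, N + 1)) / s(N)

# For the worked example, ap = (1/2 + 2/4 + 3/5) / 3
print(average_precision(["x", "a", "y", "b", "c", "z"], {"a", "b", "c"}))
```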
§ Mean Average Precision at K.
- I believe this to be an incoherent concept. Recall that we chose to define average precision as the area under the precision-recall curve. This is a sensible quantity, because it's normalized (recall is between 0 and 1). We got the expression in terms of k via a change of variables. We did not start at k! We started from r and got to k.
- We can try to hack the expression for ap to artificially create ap@K. Let's try.
- First, to go from K to r, we find the number r_K such that the recall at K is r(K) = r_K.
- Next, we calculate ap@K ≡ ∫_{0}^{r_K} p(r) dr.
- We must now find new lower and upper bounds in terms of k.
- The lower bound needs us to solve 0 = r(l), giving l = 0.
- The upper bound needs us to solve r_K = r(u), giving u = K.
- We will next have to calculate (dr(k)/dk). Previously, we set dk = 1 and calculated dr(k) ≡ r(k) − r(k−1), which gives dr(k) ≡ δ(k)/s(N). But note that s(N) counts all correct documents, not just those in the top K. So let's use s(K) instead --- a hack!
- Combining these, we get the formula ap@K ≡ ∫_{0}^{r_K} p(r) dr = Σ_{k=1}^{K} p(k) δ(k)/s(K).
- ap@K feels very unprincipled, and I don't feel that it carries mathematical weight.
- Is there an alternative derivation that sheds light on why this formula makes sense?
§ R-precision
- Recall that the recall is a nondecreasing function of k. The precision can vary any way it wants with respect to k.
- We will try to find a point where precision equals recall.
- Consider the equation p(K) = r(K). Using our previous formulation, this reduces to s(K)/K = s(K)/s(N), which (assuming s(K) ≠ 0) gives us K = s(N).
- So, at the index K equal to the total number of correct lemmas, the precision equals the recall.
- This value is called the R-precision: the precision p(K) at the index K = s(N), the total number of correct lemmas.
- Empirically, R-precision correlates well with mean average precision.
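A sketch of R-precision under the setup above, where the total number of correct lemmas is ∣A∣ (function name is my own):

```python
def r_precision(ranking, A):
    """Precision at rank K = |A|, the total number of correct lemmas."""
    K = len(A)  # K = s(N) in the notes' notation
    hits = sum(1 for x in ranking[:K] if x in A)
    return hits / K

# Worked example: top 3 of the ranking is x, a, y -> 1 correct out of 3
print(r_precision(["x", "a", "y", "b", "c", "z"], {"a", "b", "c"}))
```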
§ F1 Score
- A sensible derivation is offered by Van Rijsbergen in his PhD thesis.
- First, a quick and dirty definition: F1 is the harmonic mean of precision and recall.
- This gives us 2/(1/p + 1/r), which works out to 2/[(tp + fp)/tp + (tp + fn)/tp].
- This simplifies to 2tp/(2tp + fp + fn).
- Now consider the quantity E := 1 - F. Think of E as error. This works out to (fp + fn)/(2tp + fp + fn).
- See that this is the size of the symmetric difference of the Actual and Predicted sets divided by the sum of their sizes! ∣AΔP∣ = fp + fn, ∣A∣ = tp + fn, and ∣P∣ = tp + fp.
- Thus, the full expression for E becomes E ≡ ∣AΔP∣/(∣A∣ + ∣P∣), which is a genuinely sensible quantity!
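The identity E = 1 − F1 = ∣AΔP∣/(∣A∣ + ∣P∣) can be verified directly with sets (names illustrative):

```python
def f1(A, P):
    """F1 = 2*tp / (2*tp + fp + fn), the harmonic mean of precision and recall."""
    tp = len(A & P)
    fp = len(P - A)
    fn = len(A - P)
    return 2 * tp / (2 * tp + fp + fn)

def error(A, P):
    """E = |A symmetric-difference P| / (|A| + |P|)."""
    return len(A ^ P) / (len(A) + len(P))

A = {"a", "b", "c"}
P = {"a", "x"}
# tp = 1, fp = 1, fn = 2: F1 = 2/5, E = 3/5, and F1 + E = 1
print(f1(A, P), error(A, P))
```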