Informedness and Markedness
Alternatives to Recall and Precision as Evaluation Measures
In a recent effort, instead of using recall and precision, we used informedness and markedness to measure the ability of security analysis tools to detect vulnerabilities in Android apps. Here is my simplified view of these measures.
Preliminaries
We are interested in evaluating how good a system M is at correctly identifying positive and negative examples in a given set of examples.
We start with a set of examples (objects) of which some are labelled as positive while the rest are labelled as negative. We refer to the examples labelled as positive as real positives (RP) and examples labelled as negative as real negatives (RN).
After M has been used to identify the examples as positive or negative, we refer to the examples identified (predicted) as positive as predicted positives (PP) and examples identified as negative as predicted negatives (PN).
Further, we refer to positive examples that are identified as positive by M as true positives (TP), positive examples that are identified as negative by M as false negatives (FN), negative examples that are identified as positive by M as false positives (FP), and negative examples that are identified as negative by M as true negatives (TN).
When we abuse the acronyms to denote the number of examples of each kind, the terms are related as follows: PP = TP + FP, PN = FN + TN, RP = TP + FN, and RN = FP + TN.
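For concreteness, here is a minimal Python sketch of these relationships (the function and variable names are mine, not part of any standard API):

```python
def aggregate_counts(tp, fp, fn, tn):
    """Derive the aggregate counts from the four basic confusion counts."""
    pp = tp + fp  # predicted positives
    pn = fn + tn  # predicted negatives
    rp = tp + fn  # real positives
    rn = fp + tn  # real negatives
    return pp, pn, rp, rn
```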
Recall and Precision
Recall is a measure of M’s ability to identify positives as positives, i.e., Recall = TP / RP. In other words, given a positive example, how certain are we that M will identify it as positive?
Precision is a measure of the trustworthiness of M’s positive predictions, i.e., Precision = TP / PP. In other words, when M makes a positive prediction, how certain are we that the prediction is correct?
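In code, the two measures look as follows (a minimal sketch; the names are mine, and division-by-zero guards are omitted for brevity):

```python
def recall(tp, fn):
    """Recall = TP / RP, where RP = TP + FN."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Precision = TP / PP, where PP = TP + FP."""
    return tp / (tp + fp)
```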
Observe that both recall and precision pretty much ignore negatives. To understand the implications of this, consider three cases.
- Evaluated with example set E1 consisting of RP=100 and RN=10, M1 yields TP=90, FN=10, and FP=10.
- Evaluated with example set E2 consisting of RP=100 and RN=100, M2 yields TP=90, FN=10, and FP=10.
- Evaluated with example set E3 consisting of RP=1000 and RN=100, M2 yields TP=900, FN=100, and FP=10.
In cases 1 and 2, recall and precision are 90/100=0.9. This suggests M1 and M2 perform equally well. However, this is not true because there is no evidence that M1 is able to identify negatives as negatives (TN=0 for M1). Further, the evidence about M2’s ability to identify negatives as negatives (TN=90 for M2) is not considered.
Both recall and precision ignored the ability of techniques to handle negatives.
In case 3, where the proportion of real positives to real negatives has changed, recall is unchanged (900/1000=0.9) while precision jumps to 900/910≈0.99.
All other things being unchanged, precision is affected by the prevalence of (real) positives in the example set.
Further, observe that the trustworthiness of M2's negative predictions (TN/PN) dropped from 90/100=0.9 in E2 to 90/190≈0.47 in E3.
Precision did not detect the change in the trustworthiness of negative predictions.
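The numbers above are easy to reproduce with a quick, self-contained check (the case labels and layout are mine):

```python
# Each case as (TP, FP, FN, TN); FN = RP - TP and TN = RN - FP.
cases = {
    "Case 1 (M1 on E1)": (90, 10, 10, 0),
    "Case 2 (M2 on E2)": (90, 10, 10, 90),
    "Case 3 (M2 on E3)": (900, 10, 100, 90),
}

for name, (tp, fp, fn, tn) in cases.items():
    recall = tp / (tp + fn)     # TP / RP
    precision = tp / (tp + fp)  # TP / PP
    neg_trust = tn / (fn + tn)  # TN / PN, trust in negative predictions
    print(f"{name}: recall={recall:.2f}, precision={precision:.2f}, "
          f"TN/PN={neg_trust:.2f}")
```

Recall and precision stay at 0.90 across cases 1 and 2 while TN/PN swings from 0.00 to 0.90; only the latter notices the difference.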
An obvious solution to address the above issue is to consider both positive and negative examples and predictions.
Informedness and Markedness
In “Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation”, Powers made the above observations about recall and precision and proposed informedness and markedness as their unbiased variants. Following are simplified definitions (aka my interpretation) of these terms.
Informedness is a measure of how informed system M is about positives and negatives, i.e., Informedness = TP/RP - FP/RN.
Markedness is a measure of trustworthiness of positive and negative predictions by system M, i.e., Markedness = TP/PP - FN/PN.
In the above definitions, observe that
- Informedness considers both real positives and real negatives. Likewise, Markedness considers both predicted positives and predicted negatives.
- Informedness is the counterpart of recall, and markedness is the counterpart of precision.
- While the values of recall and precision range from 0 through +1 (both inclusive), the values of informedness and markedness range from -1 through +1 (both inclusive).
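In code, the definitions mirror recall and precision (a minimal sketch; division-by-zero guards are again omitted for brevity):

```python
def informedness(tp, fp, fn, tn):
    """Informedness = TP/RP - FP/RN, where RP = TP + FN and RN = FP + TN."""
    return tp / (tp + fn) - fp / (fp + tn)

def markedness(tp, fp, fn, tn):
    """Markedness = TP/PP - FN/PN, where PP = TP + FP and PN = FN + TN."""
    return tp / (tp + fp) - fn / (fn + tn)
```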
As for the interpretation of values of informedness and markedness,
- Magnitude: For informedness, the absolute value is a measure of how informed M is about positives and negatives. The higher the absolute value, the higher the informedness. An absolute value of 1 implies M is fully informed about positives and negatives. For markedness, the absolute value is a measure of how trustworthy M’s predictions are. The higher the absolute value, the higher the trust. An absolute value of 1 implies M’s verdicts can be fully trusted.
- Polarity: For informedness, positive polarity implies M is informed to correctly identify positives and negatives while negative polarity implies M is informed to incorrectly identify positives and negatives. For markedness, positive polarity implies the trust is associated with M’s predictions being correct while negative polarity implies the trust is associated with M’s predictions being incorrect.
So, informedness of +1 implies every positive will be identified as positive (TP=RP) and every negative will be identified as negative (FP=0) while informedness of -1 implies every positive will be incorrectly identified as negative (TP=0) and every negative will be incorrectly identified as positive (FP=RN).
Similarly, markedness of +1 implies every positive prediction (TP=PP) and negative prediction (FN=0) will be correct while markedness of -1 implies every positive prediction (TP=0) and negative prediction (FN=PN) will be incorrect.
Observe that an informed yet incorrect system can be easily flipped into an informed and correct system (to the same extent as it was incorrect). Similarly, a marked yet incorrect system can be easily flipped into a marked and correct system (to the same extent as it was incorrect). Consequently, in terms of these measures, a bad system is one with measures close to 0, i.e., one that is no better than chance.
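Concretely, flipping M amounts to swapping TP with FN and FP with TN, which negates both measures. A small sketch with made-up counts, reusing the functions defined above:

```python
def flip(tp, fp, fn, tn):
    """Invert every verdict of M: positives become negatives and vice
    versa, which swaps TP with FN and FP with TN."""
    return fn, tn, tp, fp

tp, fp, fn, tn = 10, 90, 90, 10  # a badly misinformed system
print(f"{informedness(tp, fp, fn, tn):+.2f}")         # -0.80
print(f"{informedness(*flip(tp, fp, fn, tn)):+.2f}")  # +0.80
```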
What about the Previous Cases?
In case 1, both informedness and markedness are 90/100 – 10/10 = -0.1. While M1 is well informed about positives (90 TPs out of 100 RPs), it is misinformed about negatives (10 FPs out of 10 RNs). Likewise, while M1’s positive predictions can be well trusted (90 TPs out of 100 PPs), M1’s negative predictions cannot be trusted (10 FNs out of 10 PNs). Hence, M1’s combined behavior, considering both the examples and the predictions, is close to random but slightly worse, as it errs a little more often than it is correct. This is well captured by the -0.1 measures.
In case 2, both informedness and markedness are 90/100 – 10/100 = 0.8. Since M2 performs equally well on both positives and negatives present in a balanced example set, the values of these measures are intuitive.
In case 3, informedness is 900/1000 – 10/100 = 0.8 while markedness is 900/910 – 100/190 ≈ 0.46. Like recall, informedness is unaffected by the change in the prevalence of real positives. However, unlike precision, markedness decreases to reflect the drop in trustworthiness of M2’s negative predictions.
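These values can be verified with the same kind of quick check as before (self-contained; the case labels are mine):

```python
cases = {
    "Case 1 (M1 on E1)": (90, 10, 10, 0),  # TP, FP, FN, TN
    "Case 2 (M2 on E2)": (90, 10, 10, 90),
    "Case 3 (M2 on E3)": (900, 10, 100, 90),
}

for name, (tp, fp, fn, tn) in cases.items():
    inf = tp / (tp + fn) - fp / (fp + tn)   # TP/RP - FP/RN
    mark = tp / (tp + fp) - fn / (fn + tn)  # TP/PP - FN/PN
    print(f"{name}: informedness={inf:+.2f}, markedness={mark:+.2f}")
```

Unlike the recall and precision pairs, the pairs (-0.10, -0.10), (+0.80, +0.80), and (+0.80, +0.46) distinguish all three cases.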
Clearly, informedness and markedness are sensitive to how techniques handle negatives.
Summary
Since recall and precision are dependent only on TP, FP, FN, RP, and PP, they are unaffected by changes to TN, RN, and PN. In comparison, since informedness and markedness are dependent on all of these numbers, they are sensitive to any change to (the proportions of) these numbers. Hence, when both positives and negatives are of interest, informedness and markedness are more informative evaluation measures.
References
- Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation by David M W Powers, 2007.
- Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities? by Venkatesh-Prasad Ranganath and Joydeep Mitra, 2018.
P.S.: If you find any errors in this post, then please leave a comment.