Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation
@inproceedings{Powers2008EvaluationFP, title={Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness \& Correlation}, author={David M. W. Powers}, year={2008}, url={https://api.semanticscholar.org/CorpusID:55767944} }
Elegant connections are demonstrated between the concepts of Informedness, Markedness, Correlation and Significance, as well as their relationships with Recall and Precision, and the extension from the dichotomous case to the general multi-class case is outlined.
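As a rough illustration of the dichotomous case, the sketch below computes Recall, Precision, Informedness, Markedness and Matthews Correlation from a 2x2 contingency table. It assumes the standard definitions (Informedness = Recall + Inverse Recall - 1, Markedness = Precision + Inverse Precision - 1); the function name `dichotomous_measures` and the example counts are hypothetical and do not come from the paper.

```python
import math


def dichotomous_measures(tp, fp, fn, tn):
    """Compute evaluation measures from a 2x2 contingency table.

    Assumes every row and column total is non-zero; otherwise a
    ZeroDivisionError is raised.
    """
    recall = tp / (tp + fn)             # true positive rate
    inverse_recall = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)          # positive predictive value
    inverse_precision = tn / (tn + fn)  # negative predictive value

    informedness = recall + inverse_recall - 1
    markedness = precision + inverse_precision - 1

    # Matthews Correlation computed directly from the four counts.
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = numerator / denominator

    return {
        "recall": recall,
        "precision": precision,
        "informedness": informedness,
        "markedness": markedness,
        "correlation": correlation,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    measures = dichotomous_measures(tp=40, fp=10, fn=20, tn=30)
    for name, value in measures.items():
        print(f"{name}: {value:.4f}")
```

With these definitions, the squared correlation equals the product of Informedness and Markedness, which is one way to see the connection between the measures that the abstract refers to.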
953 Citations
The Problem with Kappa
- 2012
Computer Science, Linguistics
It is shown that deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa but leave Powers Kappa unchanged, and the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.
Complementarity, F-score, and NLP Evaluation
- 2016
Computer Science
A method for measuring complementarity for precision, recall and F-score, quantifying the difference between entity extraction approaches is presented.
EPP: interpretable score of model predictive power
- 2019
Computer Science
A new EPP rating system for predictive models is introduced, along with numerous advantages of this system: differences in EPP scores have a probabilistic interpretation, they can be used to assess the probability that one model will achieve better performance than another, and they can be directly compared between datasets.
Testing the Consistency of Performance Scores Reported for Binary Classification Problems
- 2024
Computer Science
Numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup are introduced and it is demonstrated how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields.
Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation
- 2023
Computer Science
Deep ROC analysis is proposed to measure performance in multiple groups of predicted risk (like calibration), or groups of true positive rate or false positive rate, and a new interpretation of AUC in whole or part is provided, as balanced average accuracy, relevant to individuals instead of pairs.
Interpretable Meta-Measure for Model Performance
- 2020
Computer Science, Mathematics
A new meta-measure for performance assessment named Elo-based Predictive Power (EPP) is introduced, which has probabilistic interpretation and can be directly compared between data sets.
The Problem of Cross-Validation: Averaging and Bias, Repetition and Significance
- 2012
Computer Science, Mathematics
This paper defines and explores a protocol that reduces the scale of repeated CV whilst providing a principled way to control the erosion of significance due to multiple testing.
The Tile: A 2D Map of Ranking Scores for Two-Class Classification
- 2024
Computer Science
A novel versatile tool that organizes an infinity of ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as the accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores is presented.
Evaluation Gaps in Machine Learning Practice
- 2022
Computer Science
The evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations are examined, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.
Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels
- 2025
Computer Science
This study examines the efficacy of Random Forest and XGBoost classifiers in conjunction with three upsampling techniques, namely SMOTE, ADASYN, and Gaussian noise upsampling (GNUS), across datasets with…
22 References
ROC 'n' Rule Learning: Towards a Better Understanding of Covering Algorithms
- 2005
Computer Science, Mathematics
This paper provides an analysis of the behavior of separate-and-conquer or covering rule learning algorithms by visualizing their evaluation metrics and their dynamics in coverage space, a variant of ROC space, and shows that most commonly used metrics are equivalent to one of two fundamental prototypes.
Diversity of decision-making models and the measurement of interrater agreement.
- 1987
Psychology
Several papers have appeared criticizing the kappa coefficient because of its tendency to fluctuate with sample base rates. The importance of these criticisms is difficult to evaluate because they…
Rule Evaluation Measures: A Unifying View
- 1999
Computer Science
This paper develops a unifying view on some of the existing measures for predictive and descriptive induction by means of contingency tables, and demonstrates that many rule evaluation measures developed for predictive knowledge discovery can be adapted to descriptive knowledge discovery tasks.
Calibration of p Values for Testing Precise Null Hypotheses
- 2001
Mathematics
P values are the most commonly used tool to measure evidence against a hypothesis or hypothesized model. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the…
Assessing Agreement on Classification Tasks: The Kappa Statistic
- 1996
Computer Science, Linguistics
This paper describes what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and argues that the field would be better off adopting techniques from content analysis.
Is Human Learning Rational?
- 1995
Psychology
It is argued that accurate judgements are an emergent property of an associationist learning process of the sort that has become common in adaptive network models of cognition and is the "means" to a normative or statistical "end".
A Coefficient of Agreement for Nominal Scales
- 1960
Psychology
Consider Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of…
Improved likelihood ratio tests for complete contingency tables
- 1976
Mathematics
Lawley (1956) describes how asymptotic likelihood ratio tests can in general be improved by multiplying the -2 log λ test statistic by a multiplier chosen so that the null distribution of the…
Approximating the Moments and Distribution of the Likelihood Ratio Statistic for Multinomial Goodness of Fit
- 1981
Mathematics
Approximations were derived for the mean and variance of G², the likelihood ratio statistic for testing goodness of fit in a k cell multinomial distribution. These approximate moments,…