• Corpus ID: 55767944

Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation

@inproceedings{Powers2008EvaluationFP,
  title={Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness \& Correlation},
  author={David M. W. Powers},
  year={2008},
  url={https://api.semanticscholar.org/CorpusID:55767944}
}
  • D. Powers
  • Published 2008
  • Computer Science, Mathematics
Elegant connections are demonstrated between the concepts of Informedness, Markedness, Correlation and Significance, as well as their relationships with Recall and Precision, and the extension from the dichotomous case to the general multi-class case is outlined.
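
As a minimal illustration of these relationships (plain Python; the function name and example counts are chosen here for illustration), the sketch below computes Informedness, Markedness and the Matthews correlation from a 2x2 confusion matrix. The paper shows that the correlation is the geometric mean of the other two.

import math

def dichotomous_metrics(tp, fp, fn, tn):
    # Informedness, Markedness and Matthews correlation from a 2x2 table.
    recall = tp / (tp + fn)            # true positive rate
    inv_recall = tn / (tn + fp)        # true negative rate
    precision = tp / (tp + fp)         # positive predictive value
    inv_precision = tn / (tn + fn)     # negative predictive value
    informedness = recall + inv_recall - 1
    markedness = precision + inv_precision - 1
    # MCC^2 = Informedness * Markedness, and all three share a sign.
    mcc = math.copysign(math.sqrt(abs(informedness * markedness)), informedness)
    return informedness, markedness, mcc

print(dichotomous_metrics(tp=90, fp=30, fn=10, tn=870))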

The Problem with Kappa

It is shown that deploying a system in a context whose skew is opposite to that of its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa, but leave Powers Kappa unchanged; the latter is thus most appropriate, whilst Matthews Correlation is recommended for comparison of behaviour.
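
A toy illustration of the prevalence sensitivity at issue (not the paper's exact construction): holding a classifier's true and false positive rates fixed while reversing the class skew changes Cohen's kappa, but leaves informedness (Powers' kappa) untouched.

def kappas_at_prevalence(prev, tpr, fpr):
    # Cohen's kappa and informedness for fixed per-class behaviour (tpr, fpr)
    # deployed at class prevalence prev.
    pos, neg = prev, 1 - prev
    po = pos * tpr + neg * (1 - fpr)   # observed accuracy
    q = pos * tpr + neg * fpr          # predicted-positive rate
    pe = pos * q + neg * (1 - q)       # chance agreement from the marginals
    cohen = (po - pe) / (1 - pe)
    informedness = tpr - fpr           # prevalence-invariant
    return cohen, informedness

for prev in (0.2, 0.8):                # opposite skews
    print(prev, kappas_at_prevalence(prev, tpr=0.9, fpr=0.2))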

Complementarity, F-score, and NLP Evaluation

A method for measuring complementarity for precision, recall and F-score is presented, quantifying the difference between entity extraction approaches.
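
The paper's exact formula is not reproduced here; the following hypothetical set-based sketch conveys the general idea, scoring the share of correctly extracted entities that only one of two extractors finds.

def complementarity(ents_a, ents_b, gold):
    # Hypothetical measure: fraction of gold entities found by exactly one system.
    hits_a, hits_b = ents_a & gold, ents_b & gold
    found = hits_a | hits_b
    return len(hits_a ^ hits_b) / len(found) if found else 0.0

gold = {"Paris", "EU", "Macron", "Airbus"}
print(complementarity({"Paris", "EU"}, {"EU", "Airbus", "Renault"}, gold))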

EPP: interpretable score of model predictive power

A new EPP rating system for predictive models is introduced, with numerous advantages: differences in EPP scores have a probabilistic interpretation, so the probability that one model will achieve better performance than another can be assessed, and scores can be directly compared between datasets.

Testing the Consistency of Performance Scores Reported for Binary Classification Problems

Numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup are introduced and it is demonstrated how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields.
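
In the spirit of those techniques, a brute-force sketch (an assumed setup, not the authors' implementation): check whether any integer confusion matrix on the stated class sizes can reproduce the reported, rounded scores.

def consistent(n_pos, n_neg, sens, spec, acc, tol=5e-5):
    # Scores reported to four decimals are consistent only if some integer
    # (tp, tn) pair reproduces all of them within rounding tolerance.
    for tp in range(n_pos + 1):
        if abs(tp / n_pos - sens) > tol:
            continue
        for tn in range(n_neg + 1):
            if abs(tn / n_neg - spec) > tol:
                continue
            if abs((tp + tn) / (n_pos + n_neg) - acc) <= tol:
                return True
    return False

print(consistent(100, 900, 0.9500, 0.8000, 0.8150))  # True  (tp=95, tn=720)
print(consistent(100, 900, 0.9500, 0.8000, 0.9000))  # False (accuracy impossible)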

Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation

Deep ROC analysis is proposed to measure performance in multiple groups of predicted risk (like calibration), or groups of true positive rate or false positive rate, and a new interpretation of AUC in whole or part is provided, as balanced average accuracy, relevant to individuals instead of pairs.
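
A rough sketch of the grouped view (assuming scikit-learn; the three equal-width bands are illustrative): normalised partial AUC within false positive rate bands, i.e. the average true positive rate in each band.

import numpy as np
from sklearn.metrics import roc_curve

y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])               # toy labels
s = np.array([.1, .4, .35, .8, .2, .9, .55, .6, .7, .3])   # toy scores
fpr, tpr, _ = roc_curve(y, s)
for lo, hi in [(0.0, 1/3), (1/3, 2/3), (2/3, 1.0)]:
    grid = np.linspace(lo, hi, 50)                 # FPR grid within the band
    t = np.interp(grid, fpr, tpr)
    pauc = ((t[1:] + t[:-1]) / 2 * np.diff(grid)).sum() / (hi - lo)
    print(f"FPR in [{lo:.2f}, {hi:.2f}]: normalised pAUC = {pauc:.3f}")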

Interpretable Meta-Measure for Model Performance

A new meta-measure for performance assessment named Elo-based Predictive Power (EPP) is introduced, which has probabilistic interpretation and can be directly compared between data sets.
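
Since EPP is Elo-like, a minimal sketch is a Bradley-Terry-style logistic fit (a generic estimation procedure, not the authors'): scores are chosen so that sigmoid(epp[i] - epp[j]) matches the observed win rates between models.

import numpy as np

def fit_epp(wins, n_models, lr=0.5, epochs=2000):
    # wins[i][j] = number of benchmark rounds in which model i beat model j.
    epp = np.zeros(n_models)
    for _ in range(epochs):
        grad = np.zeros(n_models)
        for i in range(n_models):
            for j in range(n_models):
                n = 0 if i == j else wins[i][j] + wins[j][i]
                if n == 0:
                    continue
                p = 1.0 / (1.0 + np.exp(epp[j] - epp[i]))
                grad[i] += wins[i][j] - n * p      # logistic log-likelihood gradient
        epp += lr * grad / max(1.0, np.abs(grad).max())
        epp -= epp.mean()                          # identifiable only up to a shift
    return epp

print(fit_epp([[0, 8, 9], [2, 0, 7], [1, 3, 0]], 3))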

The Problem of Cross-Validation: Averaging and Bias, Repetition and Significance

This paper defines and explores a protocol that reduces the scale of repeated CV whilst providing a principled way to control the erosion of significance due to multiple testing.
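
The paper's protocol is not reproduced here; as a generic illustration of the multiple-testing issue it addresses, the sketch below (with hypothetical per-fold score differences) Bonferroni-corrects a paired test across repetitions instead of pooling all folds as if they were independent.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, reps = 10, 3
alpha = 0.05 / reps                     # Bonferroni across repetitions

diffs = rng.normal(loc=0.01, scale=0.02, size=(reps, k))  # model A minus model B per fold
for r in range(reps):
    t, p = stats.ttest_1samp(diffs[r], 0.0)
    print(f"repetition {r}: t={t:.2f}, p={p:.3f}, significant={p < alpha}")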

The Tile: A 2D Map of Ranking Scores for Two-Class Classification

A novel, versatile tool is presented that organizes an infinity of ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores.
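
Several of the scores named belong to one parametric family; for instance the F-beta scores below, computed directly from confusion-matrix counts, trace a one-parameter slice of such a map.

def f_beta(tp, fp, fn, beta):
    # F-beta from counts; beta > 1 weights recall over precision.
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(tp=80, fp=30, fn=20, beta=beta), 3))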

Evaluation Gaps in Machine Learning Practice

The evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations are examined, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.

Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels

This study examines the efficacy of Random Forest and XGBoost classifiers in conjunction with three upsampling techniques, SMOTE, ADASYN, and Gaussian noise upsampling (GNUS), across datasets with…
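
A minimal sketch of this kind of comparison, assuming the imbalanced-learn and scikit-learn packages (GNUS has no off-the-shelf implementation there, so only SMOTE and ADASYN are shown):

from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
for name, sampler in [("none", None), ("SMOTE", SMOTE(random_state=0)),
                      ("ADASYN", ADASYN(random_state=0))]:
    Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(Xr, yr)
    print(name, round(f1_score(y_te, clf.predict(X_te)), 3))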

ROC 'n' Rule Learning – Towards a Better Understanding of Covering Algorithms

This paper provides an analysis of the behavior of separate-and-conquer or covering rule learning algorithms by visualizing their evaluation metrics and their dynamics in coverage space, a variant of ROC space, and shows that most commonly used metrics are equivalent to one of two fundamental prototypes.
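
Coverage space plots a rule by its covered negatives (x) and covered positives (y); the sketch below evaluates a few common covering heuristics at such points (illustrative numbers, not taken from the paper).

def precision_h(p, n):
    # Precision heuristic at a coverage-space point (n negatives, p positives covered).
    return p / (p + n)

def laplace_h(p, n):
    # Laplace estimate: precision smoothed towards 1/2.
    return (p + 1) / (p + n + 2)

def accuracy_h(p, n, P, N):
    # Accuracy heuristic; its isometrics in coverage space are lines of slope 1.
    return (p + (N - n)) / (P + N)

P, N = 50, 50                       # positives / negatives in the dataset
for p, n in [(20, 2), (40, 15)]:    # two candidate rules
    print((p, n), precision_h(p, n), laplace_h(p, n), accuracy_h(p, n, P, N))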

Diversity of decision-making models and the measurement of interrater agreement.

Several papers have appeared criticizing the kappa coefficient because of its tendency to fluctuate with sample base rates. The importance of these criticisms is difficult to evaluate because they…

Rule Evaluation Measures: A Unifying View

This paper develops a unifying view on some of the existing measures for predictive and descriptive induction by means of contingency tables, and demonstrates that many rule evaluation measures developed for predictive knowledge discovery can be adapted to descriptive knowledge discovery tasks.
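
Many such measures are simple functions of a single 2x2 contingency table; a sketch with standard textbook formulas (not a reproduction of the paper's full catalogue):

def rule_measures(b, B, P, N):
    # A rule covers B examples, b of them positive; the dataset has P positives
    # out of P + N examples.
    n_total = P + N
    confidence = b / B                                   # a.k.a. rule precision
    support = b / n_total
    lift = confidence / (P / n_total)
    wracc = (B / n_total) * (confidence - P / n_total)   # weighted relative accuracy
    return dict(confidence=confidence, support=support, lift=lift, wracc=wracc)

print(rule_measures(b=30, B=40, P=100, N=300))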

Calibration of p Values for Testing Precise Null Hypotheses

P values are the most commonly used tool to measure evidence against a hypothesis or hypothesized model. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the…
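
The calibration associated with this work maps a p value to a lower bound on the Bayes factor in favour of the null, -e * p * ln(p) for p < 1/e; a sketch:

import math

def bayes_factor_bound(p):
    # Lower bound on the Bayes factor for the null, valid for p < 1/e.
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.001):
    print(p, round(bayes_factor_bound(p), 4))   # 0.05 -> about 0.41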

Assessing Agreement on Classification Tasks: The Kappa Statistic

It is discussed what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and it is argued that the field would be better off adopting techniques from content analysis.

Is Human Learning Rational?

  • D. Shanks
  • Psychology
  • 1995
It is argued that accurate judgements are an emergent property of an associationist learning process of the sort that has become common in adaptive network models of cognition and is the "means" to a normative or statistical "end".

A Coefficient of Agreement for Nominal Scales

CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of…
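
The coefficient introduced here, kappa, corrects observed agreement for the agreement expected by chance from the raters' marginals; a sketch from a square agreement table:

import numpy as np

def cohen_kappa(table):
    # (observed agreement - chance agreement) / (1 - chance agreement)
    table = np.asarray(table, dtype=float)
    n = table.sum()
    po = np.trace(table) / n
    pe = (table.sum(axis=0) * table.sum(axis=1)).sum() / n ** 2
    return (po - pe) / (1 - pe)

# two raters assigning 100 items to three nominal categories
print(round(cohen_kappa([[25, 3, 2], [4, 30, 4], [1, 2, 29]]), 3))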

Improved likelihood ratio tests for complete contingency tables

SUMMARY Lawley (1956) describes how asymptotic likelihood ratio tests can in general be improved by multiplying the -2 log λ test statistic by a multiplier chosen so that the null distribution of the…
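
The simplest instance of such a multiplier is the k-cell goodness-of-fit case, where G = -2 log(lambda) is divided by q = 1 + (k + 1) / (6n); a sketch (the paper itself treats complete contingency tables):

import math

def g_statistic(observed, expected):
    # Likelihood ratio statistic G = 2 * sum O * ln(O / E).
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def corrected_g(observed, expected):
    # Divide by the multiplier q so the null distribution is closer to chi-square.
    k, n = len(observed), sum(observed)
    q = 1 + (k + 1) / (6 * n)
    return g_statistic(observed, expected) / q

obs, exp = [18, 25, 32, 25], [25.0] * 4
print(g_statistic(obs, exp), corrected_g(obs, exp))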

Approximating the Moments and Distribution of the Likelihood Ratio Statistic for Multinomial Goodness of Fit

Abstract Approximations were derived for the mean and variance of G², the likelihood ratio statistic for testing goodness of fit in a k-cell multinomial distribution. These approximate moments,…
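
A quick simulation makes the point concrete: for moderate n, the simulated moments of G² drift from the asymptotic chi-square values (mean k - 1, variance 2(k - 1)), which is what such approximations quantify.

import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 5, 40, 20000
p = np.full(k, 1 / k)
g2 = np.empty(trials)
for t in range(trials):
    obs = rng.multinomial(n, p)
    nz = obs > 0                                   # 0 * ln(0) terms contribute nothing
    g2[t] = 2 * (obs[nz] * np.log(obs[nz] / (n * p[nz]))).sum()
print(g2.mean(), g2.var())                         # compare with 4 and 8 for k = 5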