• Corpus ID: 55767944

Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation

@inproceedings{Powers2008EvaluationFP,
  title={Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness \& Correlation},
  author={David M. W. Powers},
  year={2008},
  url={https://api.semanticscholar.org/CorpusID:55767944}
}
  • D. Powers
  • Published 2008
  • Computer Science, Mathematics
Elegant connections are demonstrated between the concepts of Informedness, Markedness, Correlation and Significance, as well as their relationships with Recall and Precision, and the extension from the dichotomous case to the general multi-class case is outlined.
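
As a minimal illustration of these relationships (plain Python; the function name and example counts are chosen here for illustration), the sketch below computes Informedness, Markedness and the Matthews correlation from a 2x2 confusion matrix. The paper shows that the correlation is the geometric mean of the other two.

import math

def dichotomous_metrics(tp, fp, fn, tn):
    # Informedness, Markedness and Matthews correlation from a 2x2 table.
    recall = tp / (tp + fn)            # true positive rate
    inv_recall = tn / (tn + fp)        # true negative rate
    precision = tp / (tp + fp)         # positive predictive value
    inv_precision = tn / (tn + fn)     # negative predictive value
    informedness = recall + inv_recall - 1
    markedness = precision + inv_precision - 1
    # MCC^2 = Informedness * Markedness, and all three share a sign.
    mcc = math.copysign(math.sqrt(abs(informedness * markedness)), informedness)
    return informedness, markedness, mcc

print(dichotomous_metrics(tp=90, fp=30, fn=10, tn=870))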

The Problem with Kappa

It is shown that deploying a system in a context whose skew is opposite to that of its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa, but leave Powers Kappa unchanged; the latter is thus most appropriate, whilst Matthews Correlation is recommended for comparison of behaviour.
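
A toy illustration of the prevalence sensitivity at issue (not the paper's exact construction): holding a classifier's true and false positive rates fixed while reversing the class skew changes Cohen's kappa, but leaves informedness (Powers' kappa) untouched.

def kappas_at_prevalence(prev, tpr, fpr):
    # Cohen's kappa and informedness for fixed per-class behaviour (tpr, fpr)
    # deployed at class prevalence prev.
    pos, neg = prev, 1 - prev
    po = pos * tpr + neg * (1 - fpr)   # observed accuracy
    q = pos * tpr + neg * fpr          # predicted-positive rate
    pe = pos * q + neg * (1 - q)       # chance agreement from the marginals
    cohen = (po - pe) / (1 - pe)
    informedness = tpr - fpr           # prevalence-invariant
    return cohen, informedness

for prev in (0.2, 0.8):                # opposite skews
    print(prev, kappas_at_prevalence(prev, tpr=0.9, fpr=0.2))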

Complementarity, F-score, and NLP Evaluation

A method for measuring complementarity for precision, recall and F-score is presented, quantifying the difference between entity extraction approaches.
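
The paper's exact formula is not reproduced here; the following hypothetical set-based sketch conveys the general idea, scoring the share of correctly extracted entities that only one of two extractors finds.

def complementarity(ents_a, ents_b, gold):
    # Hypothetical measure: fraction of gold entities found by exactly one system.
    hits_a, hits_b = ents_a & gold, ents_b & gold
    found = hits_a | hits_b
    return len(hits_a ^ hits_b) / len(found) if found else 0.0

gold = {"Paris", "EU", "Macron", "Airbus"}
print(complementarity({"Paris", "EU"}, {"EU", "Airbus", "Renault"}, gold))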

EPP: interpretable score of model predictive power

A new EPP rating system for predictive models is introduced, with numerous advantages: differences in EPP scores have a probabilistic interpretation, so the probability that one model will achieve better performance than another can be assessed, and scores can be directly compared between datasets.

Testing the Consistency of Performance Scores Reported for Binary Classification Problems

Numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup are introduced and it is demonstrated how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields.
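
In the spirit of those techniques, a brute-force sketch (an assumed setup, not the authors' implementation): check whether any integer confusion matrix on the stated class sizes can reproduce the reported, rounded scores.

def consistent(n_pos, n_neg, sens, spec, acc, tol=5e-5):
    # Scores reported to four decimals are consistent only if some integer
    # (tp, tn) pair reproduces all of them within rounding tolerance.
    for tp in range(n_pos + 1):
        if abs(tp / n_pos - sens) > tol:
            continue
        for tn in range(n_neg + 1):
            if abs(tn / n_neg - spec) > tol:
                continue
            if abs((tp + tn) / (n_pos + n_neg) - acc) <= tol:
                return True
    return False

print(consistent(100, 900, 0.9500, 0.8000, 0.8150))  # True  (tp=95, tn=720)
print(consistent(100, 900, 0.9500, 0.8000, 0.9000))  # False (accuracy impossible)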

Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation

Deep ROC analysis is proposed to measure performance in multiple groups of predicted risk (like calibration), or groups of true positive rate or false positive rate, and a new interpretation of AUC in whole or part is provided, as balanced average accuracy, relevant to individuals instead of pairs.
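
A rough sketch of the grouped view (assuming scikit-learn; the three equal-width bands are illustrative): normalised partial AUC within false positive rate bands, i.e. the average true positive rate in each band.

import numpy as np
from sklearn.metrics import roc_curve

y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])               # toy labels
s = np.array([.1, .4, .35, .8, .2, .9, .55, .6, .7, .3])   # toy scores
fpr, tpr, _ = roc_curve(y, s)
for lo, hi in [(0.0, 1/3), (1/3, 2/3), (2/3, 1.0)]:
    grid = np.linspace(lo, hi, 50)                 # FPR grid within the band
    t = np.interp(grid, fpr, tpr)
    pauc = ((t[1:] + t[:-1]) / 2 * np.diff(grid)).sum() / (hi - lo)
    print(f"FPR in [{lo:.2f}, {hi:.2f}]: normalised pAUC = {pauc:.3f}")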

Interpretable Meta-Measure for Model Performance

A new meta-measure for performance assessment named Elo-based Predictive Power (EPP) is introduced, which has probabilistic interpretation and can be directly compared between data sets.
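
Since EPP is Elo-like, a minimal sketch is a Bradley-Terry-style logistic fit (a generic estimation procedure, not the authors'): scores are chosen so that sigmoid(epp[i] - epp[j]) matches the observed win rates between models.

import numpy as np

def fit_epp(wins, n_models, lr=0.5, epochs=2000):
    # wins[i][j] = number of benchmark rounds in which model i beat model j.
    epp = np.zeros(n_models)
    for _ in range(epochs):
        grad = np.zeros(n_models)
        for i in range(n_models):
            for j in range(n_models):
                n = 0 if i == j else wins[i][j] + wins[j][i]
                if n == 0:
                    continue
                p = 1.0 / (1.0 + np.exp(epp[j] - epp[i]))
                grad[i] += wins[i][j] - n * p      # logistic log-likelihood gradient
        epp += lr * grad / max(1.0, np.abs(grad).max())
        epp -= epp.mean()                          # identifiable only up to a shift
    return epp

print(fit_epp([[0, 8, 9], [2, 0, 7], [1, 3, 0]], 3))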

The Problem of Cross-Validation: Averaging and Bias, Repetition and Significance

This paper defines and explores a protocol that reduces the scale of repeated CV whilst providing a principled way to control the erosion of significance due to multiple testing.
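
The paper's protocol is not reproduced here; as a generic illustration of the multiple-testing issue it addresses, the sketch below (with hypothetical per-fold score differences) Bonferroni-corrects a paired test across repetitions instead of pooling all folds as if they were independent.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, reps = 10, 3
alpha = 0.05 / reps                     # Bonferroni across repetitions

diffs = rng.normal(loc=0.01, scale=0.02, size=(reps, k))  # model A minus model B per fold
for r in range(reps):
    t, p = stats.ttest_1samp(diffs[r], 0.0)
    print(f"repetition {r}: t={t:.2f}, p={p:.3f}, significant={p < alpha}")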

The Tile: A 2D Map of Ranking Scores for Two-Class Classification

A novel, versatile tool is presented that organizes an infinity of ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores.
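
Several of the scores named belong to one parametric family; for instance the F-beta scores below, computed directly from confusion-matrix counts, trace a one-parameter slice of such a map.

def f_beta(tp, fp, fn, beta):
    # F-beta from counts; beta > 1 weights recall over precision.
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(tp=80, fp=30, fn=20, beta=beta), 3))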

Evaluation Gaps in Machine Learning Practice

The evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations are examined, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.

Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels

This study examines the efficacy of Random Forest and XGBoost classifiers in conjunction with three upsampling techniques, SMOTE, ADASYN, and Gaussian noise upsampling (GNUS), across datasets with…
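
A minimal sketch of this kind of comparison, assuming the imbalanced-learn and scikit-learn packages (GNUS has no off-the-shelf implementation there, so only SMOTE and ADASYN are shown):

from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
for name, sampler in [("none", None), ("SMOTE", SMOTE(random_state=0)),
                      ("ADASYN", ADASYN(random_state=0))]:
    Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(Xr, yr)
    print(name, round(f1_score(y_te, clf.predict(X_te)), 3))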

ROC 'n' Rule Learning – Towards a Better Understanding of Covering Algorithms

This paper provides an analysis of the behavior of separate-and-conquer or covering rule learning algorithms by visualizing their evaluation metrics and their dynamics in coverage space, a variant of ROC space, and shows that most commonly used metrics are equivalent to one of two fundamental prototypes.
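
Coverage space plots a rule by its covered negatives (x) and covered positives (y); the sketch below evaluates a few common covering heuristics at such points (illustrative numbers, not taken from the paper).

def precision_h(p, n):
    # Precision heuristic at a coverage-space point (n negatives, p positives covered).
    return p / (p + n)

def laplace_h(p, n):
    # Laplace estimate: precision smoothed towards 1/2.
    return (p + 1) / (p + n + 2)

def accuracy_h(p, n, P, N):
    # Accuracy heuristic; its isometrics in coverage space are lines of slope 1.
    return (p + (N - n)) / (P + N)

P, N = 50, 50                       # positives / negatives in the dataset
for p, n in [(20, 2), (40, 15)]:    # two candidate rules
    print((p, n), precision_h(p, n), laplace_h(p, n), accuracy_h(p, n, P, N))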

Diversity of decision-making models and the measurement of interrater agreement.

Several papers have appeared criticizing the kappa coefficient because of its tendency to fluctuate with sample base rates. The importance of these criticisms is difficult to evaluate because they…

Rule Evaluation Measures: A Unifying View

This paper develops a unifying view on some of the existing measures for predictive and descriptive induction by means of contingency tables, and demonstrates that many rule evaluation measures developed for predictive knowledge discovery can be adapted to descriptive knowledge discovery tasks.
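
Many such measures are simple functions of a single 2x2 contingency table; a sketch with standard textbook formulas (not a reproduction of the paper's full catalogue):

def rule_measures(b, B, P, N):
    # A rule covers B examples, b of them positive; the dataset has P positives
    # out of P + N examples.
    n_total = P + N
    confidence = b / B                                   # a.k.a. rule precision
    support = b / n_total
    lift = confidence / (P / n_total)
    wracc = (B / n_total) * (confidence - P / n_total)   # weighted relative accuracy
    return dict(confidence=confidence, support=support, lift=lift, wracc=wracc)

print(rule_measures(b=30, B=40, P=100, N=300))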

Calibration of p Values for Testing Precise Null Hypotheses

P values are the most commonly used tool to measure evidence against a hypothesis or hypothesized model. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the…
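
The calibration associated with this work maps a p value to a lower bound on the Bayes factor in favour of the null, -e * p * ln(p) for p < 1/e; a sketch:

import math

def bayes_factor_bound(p):
    # Lower bound on the Bayes factor for the null, valid for p < 1/e.
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.001):
    print(p, round(bayes_factor_bound(p), 4))   # 0.05 -> about 0.41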

Assessing Agreement on Classification Tasks: The Kappa Statistic

It is discussed what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and it is argued that the field would be better off adopting techniques from content analysis.

Is Human Learning Rational?

  • D. Shanks
  • Psychology
  • 1995
It is argued that accurate judgements are an emergent property of an associationist learning process of the sort that has become common in adaptive network models of cognition and is the "means" to a normative or statistical "end".

A Coefficient of Agreement for Nominal Scales

CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of…
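
The coefficient introduced here, kappa, corrects observed agreement for the agreement expected by chance from the raters' marginals; a sketch from a square agreement table:

import numpy as np

def cohen_kappa(table):
    # (observed agreement - chance agreement) / (1 - chance agreement)
    table = np.asarray(table, dtype=float)
    n = table.sum()
    po = np.trace(table) / n
    pe = (table.sum(axis=0) * table.sum(axis=1)).sum() / n ** 2
    return (po - pe) / (1 - pe)

# two raters assigning 100 items to three nominal categories
print(round(cohen_kappa([[25, 3, 2], [4, 30, 4], [1, 2, 29]]), 3))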

Improved likelihood ratio tests for complete contingency tables

SUMMARY Lawley (1956) describes how asymptotic likelihood ratio tests can in general be improved by multiplying the -2 log λ test statistic by a multiplier chosen so that the null distribution of the…
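
The simplest instance of such a multiplier is the k-cell goodness-of-fit case, where G = -2 log(lambda) is divided by q = 1 + (k + 1) / (6n); a sketch (the paper itself treats complete contingency tables):

import math

def g_statistic(observed, expected):
    # Likelihood ratio statistic G = 2 * sum O * ln(O / E).
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def corrected_g(observed, expected):
    # Divide by the multiplier q so the null distribution is closer to chi-square.
    k, n = len(observed), sum(observed)
    q = 1 + (k + 1) / (6 * n)
    return g_statistic(observed, expected) / q

obs, exp = [18, 25, 32, 25], [25.0] * 4
print(g_statistic(obs, exp), corrected_g(obs, exp))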

Approximating the Moments and Distribution of the Likelihood Ratio Statistic for Multinomial Goodness of Fit

Abstract Approximations were derived for the mean and variance of G², the likelihood ratio statistic for testing goodness of fit in a k-cell multinomial distribution. These approximate moments,…
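
A quick simulation makes the point concrete: for moderate n, the simulated moments of G² drift from the asymptotic chi-square values (mean k - 1, variance 2(k - 1)), which is what such approximations quantify.

import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 5, 40, 20000
p = np.full(k, 1 / k)
g2 = np.empty(trials)
for t in range(trials):
    obs = rng.multinomial(n, p)
    nz = obs > 0                                   # 0 * ln(0) terms contribute nothing
    g2[t] = 2 * (obs[nz] * np.log(obs[nz] / (n * p[nz]))).sum()
print(g2.mean(), g2.var())                         # compare with 4 and 8 for k = 5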