Precision in Classification Metrics

Author

Alape Aniruddha

Precision is a classification metric that measures the accuracy of the positive predictions made by a model. It is defined as the ratio of true positives to the sum of true positives and false positives.

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

Precision is particularly useful in situations where the cost of false positives is high. For example, in medical diagnoses, where a false positive might lead to unnecessary treatments, precision becomes a crucial metric.

Calculating Precision for Binary Classification Problems


\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}

Where:

  • TP = True Positives
  • FP = False Positives
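To make the formula concrete, here is a minimal sketch that counts TP and FP directly from two label lists. The function name precision_from_labels and the sample labels are made up for illustration; the function also assumes that at least one positive prediction exists, since the ratio is undefined otherwise.

```python
def precision_from_labels(y_true, y_pred, pos_label=1):
    """Compute TP / (TP + FP) by counting over paired labels."""
    pairs = list(zip(y_true, y_pred))
    # True positive: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if p == pos_label and t == pos_label)
    # False positive: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if p == pos_label and t != pos_label)
    if tp + fp == 0:
        raise ValueError("no positive predictions; precision is undefined")
    return tp / (tp + fp)


# Illustrative labels (not from any real dataset).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

# Five positive predictions, three of them correct: 3 / (3 + 2)
print(precision_from_labels(y_true, y_pred))  # 0.6
```

Note that the positive sample the model missed (a false negative) does not appear anywhere in the calculation.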

Issues with using Precision

  • Precision ignores false negatives, so a model with high precision may still have low recall, and vice versa. There is often a trade-off between the two, and which one to prioritize depends on the specific problem and its requirements.
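The trade-off described above can be illustrated with two hypothetical classifiers scored on the same ground truth. This is only a sketch: the helper precision_recall and the "cautious" and "eager" predictions are invented for the example, and the helper assumes each classifier makes at least one positive prediction and that at least one positive sample exists.

```python
def precision_recall(y_true, y_pred, pos_label=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    return tp / (tp + fp), tp / (tp + fn)


y_true = [1, 1, 1, 1, 0, 0, 0, 0]

# A cautious model flags only the case it is most sure about.
cautious = [1, 0, 0, 0, 0, 0, 0, 0]
# An eager model flags every real positive, plus two negatives.
eager = [1, 1, 1, 1, 1, 1, 0, 0]

for name, y_pred in [("cautious", cautious), ("eager", eager)]:
    precision, recall = precision_recall(y_true, y_pred)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
# cautious: precision=1.00, recall=0.25
# eager: precision=0.67, recall=1.00
```

The cautious model never raises a false alarm but misses three of the four positives, while the eager model catches every positive at the cost of two false alarms.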

Where can Precision be used?

  • In fraud detection, when customer experience is the primary objective, precision matters because flagging a genuine transaction as fraudulent (a false positive) inconveniences the customer.

Precision in sklearn

precision_score is the function in scikit-learn used to calculate precision.

The parameters are as follows:

  • y_true: Ground truth (correct) labels.
  • y_pred: Predicted labels, as returned by a classifier.
  • labels (default None): The set of labels to include when average is not 'binary'.
  • pos_label (default 1): The label of the positive class.
  • average (default 'binary'): The averaging strategy for multiclass settings.
  • sample_weight (default None): Sample weights.

from sklearn.metrics import precision_score

y_true = [0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1]

precision_score(y_true, y_pred)
1.0
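As a small sketch of how the pos_label and average parameters change the result, the same labels can be scored in a few different ways. The variable names below are chosen for illustration only.

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1]

# Default: binary precision for the positive class (label 1).
p_pos = precision_score(y_true, y_pred)

# Treat label 0 as the positive class instead.
# Three samples are predicted as 0, and two of them are truly 0.
p_neg = precision_score(y_true, y_pred, pos_label=0)

# average=None returns one precision value per class, in label order.
per_class = precision_score(y_true, y_pred, average=None)

# average="macro" is the unweighted mean of the per-class values.
macro = precision_score(y_true, y_pred, average="macro")

print(f"{p_pos:.4f}")  # 1.0000
print(f"{p_neg:.4f}")  # 0.6667
print(f"{macro:.4f}")  # 0.8333
```

Here the model is perfectly precise on class 1 but less so on class 0, which the default binary setting does not show.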

Precision on a real-world dataset

We are going to calculate the precision of a logistic regression model on the breast_cancer dataset using sklearn.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
precision_score(y_test, y_pred)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
0.9666666666666667

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
[Figure: confusion matrix of the model's predictions on the test set]

In the above matrix, we can see that: TP = 87, TN = 51, FP = 3, FN = 2

\begin{aligned} \text{Precision} & = \frac{\text{TP}}{\text{TP} + \text{FP}} \\ & = \frac{87}{87 + 3} \\ & = \frac{87}{90} \\ & \approx 0.9667 \end{aligned}
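The same arithmetic can be checked in a few lines of Python. This sketch hard-codes the four counts read from the confusion matrix rather than recomputing them from the model, and it assumes scikit-learn's layout for binary problems, where rows are true labels and columns are predicted labels, i.e. [[TN, FP], [FN, TP]].

```python
# Counts copied from the confusion matrix, in [[TN, FP], [FN, TP]] layout.
cm = [[51, 3],
      [2, 87]]

(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)
print(round(precision, 4))  # 0.9667

# Sanity check: the four cells add up to the size of the test set.
print(tn + fp + fn + tp)  # 143
```

The rounded value agrees with the 0.9666666666666667 returned by precision_score.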