Ordinal Classification

Author

Vivek Sivaramakrishnan

Colab Link: Click here!

Ordinal Measurement

Let Y^* be the ordinal variable (with J classes) we want to predict. We define Y^* in terms of a latent continuous variable Y and cutpoints \alpha_0, \alpha_1, \ldots, \alpha_J as follows:

\begin{align*} Y^* = j \quad \text{if} \quad \alpha_{j-1} \le Y < \alpha_j \qquad (j = 1, \ldots, J) \end{align*}
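As a quick worked example (using the cutpoints from the synthetic data below): with \alpha = (-\infty, -2, -1, 2, \infty), a latent value Y = -1.5 falls in [\alpha_1, \alpha_2) = [-2, -1), so Y^* = 2.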

Regression Assumption

We model the latent variable Y as follows:

Y = \beta^\top X + \epsilon

where \beta is the coefficient vector we need to estimate, and \epsilon is a random error term, uncorrelated with X, with mean 0 and variance 1.

Synthetic Data Creation

We will consider an ordinal classification problem with 4 classes. We randomly sample 1000 points from a 2-dimensional standard normal distribution. We fix \beta = \begin{pmatrix} -1 \\ 1 \end{pmatrix}, and cutpoints \alpha_0 = -\infty, \alpha_1 = -2, \alpha_2 = -1, \alpha_3 = 2, \alpha_4 = \infty. (For simplicity, the noise term \epsilon is set to 0 here, so the latent variable is exactly Y = X\beta.)

import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(69)

X = np.random.normal(scale=1, size=(1000, 2))
beta = np.array([-1, 1])
cutpoints = np.array([-np.inf, -2, -1, 2, np.inf])

# Latent continuous variable (noiseless: epsilon = 0 in this demo)
Y = X@beta

def ordify(cutpoints):
  """Return a function that maps a latent value to its ordinal class."""

  def hlo(x):
    # Class j corresponds to the interval [cutpoints[j-1], cutpoints[j])
    for i in range(len(cutpoints)-1):
      if cutpoints[i] <= x < cutpoints[i+1]:
        return i+1

  return hlo

ordinate = ordify(cutpoints)
Y_ord = np.array([ordinate(i) for i in Y])

# X_train, X_test, y_train, y_test = train_test_split(X, Y_ord, test_size=0.33, random_state=42)

OvR approach

We will fit a logistic regression (LR) model using the one-vs-rest (OvR) approach, which simply treats our target as a nominal variable and trains 4 binary classifiers, one per class.

from sklearn.linear_model import LogisticRegression

OvR_LR = LogisticRegression(multi_class='ovr').fit(X, Y_ord)
OvR_LR.score(X, Y_ord)
0.797
print('Beta estimates and biases')
for i in range(len(OvR_LR.coef_)):
  # Normalize each classifier's weights so the first coefficient is -1,
  # making the directions comparable to the true beta = (-1, 1)
  x = -1 * OvR_LR.coef_[i][0]
  print(OvR_LR.coef_[i]/x, OvR_LR.intercept_[i]/x)
Beta estimates and biases
[-1.          1.02454387] 2.1044752205291597
[-1.          1.07017873] 2.088226413125316
[-1.          1.33242037] 1.5887705536936128
[-1.          0.97466832] -2.050381104615722

Can we do better?

By converting to a nominal formulation we lose information, namely the ordering present among the target classes. This shows up in the estimates above: the four classifiers disagree on the direction of \beta (the third, [-1, 1.33], is noticeably off from the true (-1, 1)).

Is there any way to capture this ordering without changing the underlying learning scheme (LR in this case)?

Ordinal Classification Scheme

We can construct K-1 new binary variables for each observed datapoint x_i (with corresponding class y_i) in the following manner (example with K = 4):

Condition            N_1   N_2   N_3
\text{if } y_i = 1    0     0     0
\text{if } y_i = 2    1     0     0
\text{if } y_i = 3    1     1     0
\text{if } y_i = 4    1     1     1

In other words, for k = 1, 2, \cdots, K-1, generate N_k = \mathbb{1}(y_i > k).

We then construct K-1 binary classification problems, where the target of the kth problem is N_k; as sketched below, these targets are easy to generate. We can then train K-1 classifiers using our preferred learning scheme.
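As a minimal sketch (y_toy below is an illustrative label vector), the derived targets are just thresholded copies of the labels:

import numpy as np

y_toy = np.array([1, 2, 3, 4, 2])   # ordinal labels, K = 4

# N_k = 1(y > k) for k = 1, 2, 3
N = {k: (y_toy > k).astype(int) for k in range(1, 4)}
print(N)
# {1: array([0, 1, 1, 1, 1]), 2: array([0, 0, 1, 1, 0]), 3: array([0, 0, 0, 1, 0])}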

How to predict?

For a validation datapoint x, use the trained models to calculate estimates \hat{N}_k \text{ for } k = 1, 2, \cdots, K-1. We then use these to calculate the probability of each class as follows:

Class   Probability
1       1 - \hat{N}_1
2       \hat{N}_1 - \hat{N}_2
i       \hat{N}_{i-1} - \hat{N}_i
K       \hat{N}_{K-1}

The first and last class probabilities come from a single classifier, whereas each of the others is the difference of the outputs of a pair of consecutive classifiers.

Note that \hat{N}_k is an estimate of \text{Pr}(y > k). (Since the K-1 classifiers are trained independently, nothing forces \hat{N}_{k-1} \ge \hat{N}_k, so an intermediate difference can occasionally be negative; taking the argmax for prediction still works.)
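For example (illustrative numbers), with K = 4 and estimates \hat{N} = (0.9, 0.4, 0.1), the reconstruction looks like this:

import numpy as np

N_hat = np.array([0.9, 0.4, 0.1])            # \hat{N}_1, \hat{N}_2, \hat{N}_3

probs = np.concatenate(([1 - N_hat[0]],          # class 1
                        N_hat[:-1] - N_hat[1:],  # classes 2 and 3
                        [N_hat[-1]]))            # class 4
print(probs, probs.sum())                    # [0.1 0.5 0.3 0.1] 1.0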

from sklearn.base import clone, BaseEstimator, ClassifierMixin
import numpy as np

class OrdinalClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, learner):
        self.learner = learner
        self.ordered_learners = dict()
        self.classes = []

    def fit(self, X, y):
        self.classes = np.sort(np.unique(y))
        assert self.classes.shape[0] >= 3, f'OrdinalClassifier needs at least 3 classes, only {self.classes.shape[0]} found'

        # One binary learner per threshold: learner k targets N_k = 1(y > classes[k])
        for i in range(self.classes.shape[0]-1):
            N_i = np.vectorize(int)(y > self.classes[i])
            self.ordered_learners[i] = clone(self.learner).fit(X, N_i)

        return self  # sklearn convention: fit returns self

    def predict(self, X):
        # Predict the class with the highest reconstructed probability
        return np.vectorize(lambda i: self.classes[i])(np.argmax(self.predict_proba(X), axis=1))

    def predict_proba(self, X):
        # Column k of predicted holds \hat{N}_k = Pr(y > classes[k]) for each sample
        predicted = [self.ordered_learners[k].predict_proba(X)[:, 1].reshape(-1, 1) for k in self.ordered_learners]

        first = 1 - predicted[0]                                                   # Pr(class 1)
        last = predicted[-1]                                                       # Pr(class K)
        middle = [predicted[i] - predicted[i+1] for i in range(len(predicted)-1)]  # Pr(classes 2..K-1)

        return np.hstack([first, *middle, last])
from sklearn.linear_model import LogisticRegression
model = OrdinalClassifier(LogisticRegression())
model.fit(X, Y_ord)

model.score(X, Y_ord)
0.975
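The 0.975 above is training accuracy. As a quick generalization check, one could reuse the train/test split imported earlier (a minimal sketch on the same data):

X_train, X_test, y_train, y_test = train_test_split(X, Y_ord, test_size=0.33, random_state=42)

held_out = OrdinalClassifier(LogisticRegression()).fit(X_train, y_train)
print(held_out.score(X_test, y_test))  # accuracy on unseen points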

One-Hot Encoding the Bias Term - Another approach

We construct a dataset similar to the above, but stack the binary problems into one and one-hot encode a bias column for each threshold. A single LR model with a shared \beta is then fit on the stacked data, which brings the estimates of \beta and the cutpoints closer to the true values (due to the larger effective sample size).
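Concretely, for our 4-class problem each original row x = (x_1, x_2) appears three times in the stacked design matrix, once per threshold, with a -1 in the corresponding bias column (a sketch of the layout; fit_intercept=False below, so the cutpoints are learned through these columns):

# Stacked rows for one datapoint x = (x1, x2), one per threshold j = 1, 2, 3:
#   [-1,  0,  0, x1, x2]   target: 1(y > 1)
#   [ 0, -1,  0, x1, x2]   target: 1(y > 2)
#   [ 0,  0, -1, x1, x2]   target: 1(y > 3)
# The learned weight vector is then (alpha_1, alpha_2, alpha_3, beta_1, beta_2),
# since the logit for threshold j is beta.x - alpha_j.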

def encoder(X, Y):

  segments = []
  segments_y = []
  n, J = len(X), max(Y)

  for i in range(1, J):

    # Copy of X with a -1 in the ith bias column: the LR weight on that
    # column becomes the cutpoint estimate alpha_i
    segment = np.hstack((np.zeros((n, J-1)), X))
    segment[:, i-1] = -1
    segments.append(segment)

    # Binary target for threshold i: 1(Y > i)
    segments_y.append(np.vectorize(int)(Y > i))

  # Stack all J-1 binary problems into one big dataset
  return (np.vstack(segments), np.hstack(segments_y))
X_e, y_e = encoder(X, Y_ord)
OE_LR = LogisticRegression(fit_intercept=False).fit(X_e, y_e)

print('Bias (first 3 terms) and beta (rest) estimates')
# Normalize so the first beta component is -1, matching the true beta's scale
print(OE_LR.coef_[0]/(-1*OE_LR.coef_[0][-2]))

from sklearn.metrics import accuracy_score

# Recover normalized cutpoint and beta estimates from the single weight vector
a1, a2, a3, b1, b2 = OE_LR.coef_[0]/(-1*OE_LR.coef_[0][-2])

beta_pred = np.array([b1, b2])
cutpoints_pred = np.array([-np.inf, a1, a2, a3, np.inf])

# Classify by thresholding the predicted latent variable at the estimated cutpoints
ordinate = ordify(cutpoints_pred)
y_pred = np.array([ordinate(i) for i in X@beta_pred])

print(accuracy_score(Y_ord, y_pred))
Bias (first 3 terms) and beta (rest) estimates
[-2.04249489 -1.03041969  2.02754915 -1.          1.01914656]
0.99
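The encoded model also yields class probabilities directly. Using the raw (unnormalized) weights, \text{Pr}(Y^* > j \mid x) = \sigma(\beta^\top x - \alpha_j), and consecutive differences recover the per-class probabilities (a sketch, reusing OE_LR from above):

from scipy.special import expit  # logistic sigmoid

a_raw, b_raw = OE_LR.coef_[0][:3], OE_LR.coef_[0][3:]
alphas = np.concatenate(([-np.inf], a_raw, [np.inf]))

# Pr(Y* > j | x) for j = 0..4; alpha_0 = -inf gives 1, alpha_4 = +inf gives 0
exceed = expit(X[0] @ b_raw - alphas)
probs = exceed[:-1] - exceed[1:]   # per-class probabilities; sums to 1
print(probs)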