Inductive Bias in Decision Trees and K-Nearest Neighbors

Author

Vivek Sivaramakrishnan

Slides: Click here!

Construct a Checkerboard Dataset

from matplotlib import pyplot as plt
import numpy as np

plt.rcParams['figure.figsize'] = (14, 6.3)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['lines.markersize'] = 4.2

X, y = [], []
d = 4  # blocks per side
n = 6  # points per block
for r, i in enumerate(np.linspace(0, 3, num=d*n)):
  for c, j in enumerate(np.linspace(0, 3, num=d*n)):
    X.append([i, j])
    # XOR of the row-block and column-block parities gives the checkerboard colouring
    y.append(int((r//n) % 2 != (c//n) % 2))

X = np.array(X)
X -= np.mean(X, axis=0)
y = np.array(y)

r = np.pi/4
# Orthogonal rotation matrix: X@rotate turns each (row-vector) point by 45 degrees
rotate = np.array([[np.cos(r), np.sin(r)], [-np.sin(r), np.cos(r)]])
X_rotated = X@rotate

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(X[:, 0][y==0], X[:, 1][y==0], color='red', label='y==0', alpha=0.5, edgecolor='k')
ax1.scatter(X[:, 0][y==1], X[:, 1][y==1], color='blue', label='y==1', alpha=0.5, edgecolor='k')
ax1.legend()
ax1.set_title('Chessboard')

ax2.scatter(X_rotated[:, 0][y==0], X_rotated[:, 1][y==0], color='red', label='y==0', alpha=0.5, edgecolor='k')
ax2.scatter(X_rotated[:, 0][y==1], X_rotated[:, 1][y==1], color='blue', label='y==1', alpha=0.5, edgecolor='k')
ax2.legend()
ax2.set_title('Chessboard (Rotated)')

Decision Trees

A popular representation for interpretable classifiers; even humans use them!

Example: I’ve just arrived at a restaurant. Should I stay (wait for a table) or go elsewhere?

One might use the following set of rules to make the decision:

[Decision tree for the restaurant example; source: ai.berkeley.edu]

Decision trees:

- Have a simple design
- Are interpretable
- Are easy to implement
- Perform well in practice

Note that each split is made on a single feature, which corresponds to a decision boundary parallel to a feature axis. This is an inductive bias.

Inductive bias

Anything that makes the algorithm learn one pattern instead of another.

Decision trees classify with a collection of step functions, but each step function uses only one feature/variable. Does this make the tree's performance sensitive to the nature (for instance, the orientation) of the dataset?
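As a minimal illustration, a depth-1 tree (a decision stump) commits to exactly one feature and one threshold, i.e. a single axis-parallel cut. A small sketch on toy data, reading sklearn's fitted-tree internals:

from sklearn.tree import DecisionTreeClassifier

# The toy classes separate cleanly on feature index 1 at a threshold of 4.0
stump = DecisionTreeClassifier(max_depth=1).fit([[0, 5], [1, 3], [2, 8], [3, 1]], [0, 1, 0, 1])
print(stump.tree_.feature[0], stump.tree_.threshold[0])  # -> 1 4.0: one feature, one cut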

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

DTree = DecisionTreeClassifier()
DTree.fit(X, y)

fig, (ax1, ax2) = plt.subplots(1, 2)

disp = DecisionBoundaryDisplay.from_estimator(DTree, X, response_method="predict", grid_resolution=1000, alpha=0.6, ax = ax1)
disp.ax_.scatter(X[:, 0][y==0], X[:, 1][y==0], color='red', label='y==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X[:, 0][y==1], X[:, 1][y==1], color='blue', label='y==1', edgecolor='k', alpha=0.5)
disp.ax_.set_xlabel(f'Tree Depth: {DTree.get_depth()}');

plot_tree(DTree, label='none', filled=True, feature_names=['F1', 'F2'], class_names=['Red', 'Blue'], node_ids=False, rounded=True, impurity=False, ax=ax2);

The 4x4 checkerboard dataset with alternating classes requires a tree of depth 7 to capture its structure.
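The fitted tree's complexity can be read off directly (a quick check; the leaf count is the number of axis-parallel cells the tree carved out):

print(f'depth={DTree.get_depth()}, leaves={DTree.get_n_leaves()}')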

But what will happen if we try to train a tree on the rotated variant of this dataset?

fig, (ax1, ax2) = plt.subplots(1, 2)

DTree_rotated = DecisionTreeClassifier()
DTree_rotated.fit(X_rotated, y)

disp = DecisionBoundaryDisplay.from_estimator(DTree_rotated, X_rotated, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax1)
disp.ax_.scatter(X_rotated[:, 0][y==0], X_rotated[:, 1][y==0], color='red', label='y==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X_rotated[:, 0][y==1], X_rotated[:, 1][y==1], color='blue', label='y==1', edgecolor='k', alpha=0.5)
disp.ax_.set_xlabel(f'(Overfit) Tree Depth: {DTree_rotated.get_depth()}')

DTree_rotated_constrained = DecisionTreeClassifier(max_depth=7)
DTree_rotated_constrained.fit(X_rotated, y)

disp = DecisionBoundaryDisplay.from_estimator(DTree_rotated_constrained, X_rotated, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax2)
disp.ax_.scatter(X_rotated[:, 0][y==0], X_rotated[:, 1][y==0], color='red', label='y==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X_rotated[:, 0][y==1], X_rotated[:, 1][y==1], color='blue', label='y==1', edgecolor='k', alpha=0.5)
disp.ax_.set_xlabel(f'(Constrained) Tree Depth: {DTree_rotated_constrained.get_depth()}');

Oh!

The model fails to capture the pattern that generated the dataset because it suffers from an inductive bias toward axis-parallel splits: the unconstrained tree overfits with a thicket of splits, while the depth-constrained tree underfits.
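To put a number on the failure, compare the training accuracies of the two depth-7 trees fitted above:

# Same depth budget, two orientations: axis-parallel cuts match only the first board
print('axis-aligned     | depth', DTree.get_depth(), '| train acc', DTree.score(X, y))
print('rotated (capped) | depth', DTree_rotated_constrained.get_depth(), '| train acc', DTree_rotated_constrained.score(X_rotated, y))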

KNN

Let us examine the performance of KNN (with n_neighbors=3) on both variants of the dataset.

from sklearn.neighbors import KNeighborsClassifier

fig, (ax1, ax2) = plt.subplots(1, 2)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

disp = DecisionBoundaryDisplay.from_estimator(knn, X, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax1)
disp.ax_.scatter(X[:, 0][y==0], X[:, 1][y==0], color='red', label='y==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X[:, 0][y==1], X[:, 1][y==1], color='blue', label='y==1', edgecolor='k', alpha=0.5)

knn_rotated = KNeighborsClassifier(n_neighbors=3)
knn_rotated.fit(X_rotated, y)

disp = DecisionBoundaryDisplay.from_estimator(knn_rotated, X_rotated, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax2)
disp.ax_.scatter(X_rotated[:, 0][y==0], X_rotated[:, 1][y==0], color='red', label='y==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X_rotated[:, 0][y==1], X_rotated[:, 1][y==1], color='blue', label='y==1', edgecolor='k', alpha=0.5)

The rotation does not affect the performance of KNN.
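This is expected: a rotation is an orthogonal map and preserves Euclidean distances, so every query keeps exactly the same neighbours in the rotated frame. A quick sanity check on random queries (agreement may fall a hair short of 1.0 on floating-point near-ties):

# ||q - x|| equals ||q@R - x@R|| for an orthogonal R, so the two fitted models
# should agree on queries once they are mapped into the rotated frame
rng = np.random.default_rng(0)
queries = rng.uniform(-1.5, 1.5, size=(1000, 2))
agree = np.mean(knn.predict(queries) == knn_rotated.predict(queries @ rotate))
print(f'prediction agreement: {agree:.3f}')  # expect ~1.000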

What is the inductive bias in KNN then?

Inductive Bias of KNN

To investigate, we construct the following dataset:


np.random.seed(0)

# Two classes separated only along feature 2 (means +/-0.25, std 0.1);
# feature 1 carries no class information but has a huge spread (std ~316)
class1_mean = [0, 0.25]
class1_cov = [[100000, 0], [0, 0.01]]
class1_data = np.random.multivariate_normal(class1_mean, class1_cov, 100)

class2_mean = [0, -0.25]
class2_cov = [[100000, 0], [0, 0.01]]
class2_data = np.random.multivariate_normal(class2_mean, class2_cov, 100)

plt.scatter(class1_data[:, 0], class1_data[:, 1], c='b', label='Class 1', edgecolor='k', alpha=0.5)
plt.scatter(class2_data[:, 0], class2_data[:, 1], c='r', label='Class 2', edgecolor='k', alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

from sklearn.preprocessing import StandardScaler

X = np.concatenate([class1_data, class2_data])
y = np.array([0 for _ in range(100)] + [1 for _ in range(100)])

# Standardize each feature to zero mean and unit variance; labels are unchanged
X_scaled = StandardScaler().fit_transform(X)
y_scaled = y

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn2 = KNeighborsClassifier(n_neighbors=3).fit(X_scaled, y)
fig, (ax1, ax2) = plt.subplots(1, 2)

disp = DecisionBoundaryDisplay.from_estimator(knn, X, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax1)
disp.ax_.scatter(X[:, 0][y==0], X[:, 1][y==0], color='red', label='y==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X[:, 0][y==1], X[:, 1][y==1], color='blue', label='y==1', edgecolor='k', alpha=0.5)
disp.ax_.set_xlabel(f'KNN | Score: {knn.score(X, y)}')

disp = DecisionBoundaryDisplay.from_estimator(knn2, X_scaled, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax2)
disp.ax_.scatter(X_scaled[:, 0][y_scaled==0], X_scaled[:, 1][y_scaled==0], color='red', label='y_scaled==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X_scaled[:, 0][y_scaled==1], X_scaled[:, 1][y_scaled==1], color='blue', label='y_scaled==1', edgecolor='k', alpha=0.5)
disp.ax_.set_xlabel(f'KNN-Scaled | Score: {knn2.score(X_scaled, y_scaled)}');

Oh!

We observe that feature scaling strongly affects performance on this dataset. This reveals the inductive bias of KNN:

The algorithm assumes that points belonging to the same class lie near one another (under the chosen distance metric), while points from different classes lie far apart.

Here, even though the separation along feature 2 is visually evident, feature 1's enormous scale dominates the Euclidean distance and makes that separation invisible to the unscaled KNN classifier; hence the unscaled model fails to capture the structure of the dataset, while the standardized one recovers it.
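The scale mismatch is easy to inspect directly (exact values depend on the random draw above):

# Feature 1's spread dwarfs feature 2's before scaling, so it dominates Euclidean
# distances; StandardScaler puts both features on an equal footing
print('raw std    :', X.std(axis=0))         # roughly [3.2e+02, 2.7e-01]
print('scaled std :', X_scaled.std(axis=0))  # [1., 1.]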

Perceptron

If, from context, we are confident that our dataset has an underlying linear separation, we can use the Perceptron algorithm.

from sklearn.linear_model import Perceptron

perceptron = Perceptron(alpha=0, max_iter=int(1e6), tol=None).fit(X, y)
perceptron_scaled = Perceptron(alpha=0, max_iter=int(1e6), tol=None).fit(X_scaled, y_scaled)

fig, (ax1, ax2) = plt.subplots(1, 2)

disp = DecisionBoundaryDisplay.from_estimator(perceptron, X, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax1)
disp.ax_.scatter(X[:, 0][y==0], X[:, 1][y==0], color='red', label='y==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X[:, 0][y==1], X[:, 1][y==1], color='blue', label='y==1', edgecolor='k', alpha=0.5)
disp.ax_.set_xlabel(f'Perceptron | Score: {perceptron.score(X, y)}')

disp = DecisionBoundaryDisplay.from_estimator(perceptron_scaled, X_scaled, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax2)
disp.ax_.scatter(X_scaled[:, 0][y_scaled==0], X_scaled[:, 1][y_scaled==0], color='red', label='y_scaled==0', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X_scaled[:, 0][y_scaled==1], X_scaled[:, 1][y_scaled==1], color='blue', label='y_scaled==1', edgecolor='k', alpha=0.5)
disp.ax_.set_xlabel(f'Perceptron-Scaled | Score: {perceptron_scaled.score(X_scaled, y_scaled)}');

The Perceptron algorithm:

- Is a linear classifier
- Has a simple update rule: on a mistake, add or subtract the datapoint (see the sketch below)
- Is guaranteed to converge only on linearly separable datasets with a non-zero margin (the radius-margin bound)
- Carries an inductive bias through its assumption about the underlying structure of the data
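A minimal from-scratch sketch of that update rule, assuming labels in {-1, +1} (a hypothetical perceptron_train helper for illustration, not the sklearn implementation used above):

def perceptron_train(X, y, max_epochs=1000):
  w = np.zeros(X.shape[1])  # start from the zero hyperplane
  b = 0.0
  for _ in range(max_epochs):
    mistakes = 0
    for xi, yi in zip(X, y):
      if yi * (xi @ w + b) <= 0:  # mistake: point on the wrong side (or on the boundary)
        w += yi * xi              # add/subtract the datapoint
        b += yi
        mistakes += 1
    if mistakes == 0:  # a full pass with no mistakes: converged
      break
  return w, b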

What about non-linear, say, quadratic separability? Consider the following dataset:

np.random.seed(1)

X_ = np.random.rand(100, 2)
X_ = (X_ - np.mean(X_))*10  # centre roughly at the origin and spread the points out

X, y = [], []

sign = lambda i: 2*int(i >= 0) - 1

# Quadratic separating surface: each point is labelled by the sign of sep(x)
sep = lambda x: x[1]**2 - 8*x[1]*x[0] + 2*x[0]**2

gamma = 0.5  # discard points within this margin of the surface

for p in X_:
  d = sep(p)
  if abs(d) >= gamma:
    X.append(p)
    y.append(sign(d))

X = np.array(X)
y = np.array(y)

fig, ax2 = plt.subplots(1, 1)
fig.suptitle(f'A non-linearly separable dataset with γ={gamma} margin')

ax2.scatter(X[:, 0][y==1], X[:, 1][y==1], color='green', label='Positive')
ax2.scatter(X[:, 0][y!=1], X[:, 1][y!=1], color='red', label='Negative')
ax2.legend(loc='lower right')

ax2.axvline(x=0, c='black')
ax2.axhline(y=0, c='black')

plt.show()

Being Mindful of the Bias

The above dataset has a separator corresponding to a second-order function of the features.

Transform the dataset and apply the Perceptron: we alter the inductive bias to our advantage!
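Concretely, PolynomialFeatures(2) lifts each point (x1, x2) to (1, x1, x2, x1², x1·x2, x2²), so a quadratic boundary in the original space becomes a linear one in the lifted space, which is exactly what the Perceptron can find. A quick look at the map on an arbitrary point:

from sklearn.preprocessing import PolynomialFeatures

# Degree-2 feature map: bias, linear, and quadratic terms
print(PolynomialFeatures(2).fit_transform([[2.0, 3.0]]))
# [[1. 2. 3. 4. 6. 9.]]  i.e. [1, x1, x2, x1^2, x1*x2, x2^2]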

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Perceptron

poly_perceptron = make_pipeline(PolynomialFeatures(2), Perceptron(alpha=0, max_iter=int(1e6), tol=None))
poly_perceptron.fit(X, y)

fig, ax1 = plt.subplots(1, 1)
disp = DecisionBoundaryDisplay.from_estimator(poly_perceptron, X, response_method="predict", grid_resolution=1000, alpha=0.6, ax=ax1)
disp.ax_.scatter(X[:, 0][y==-1], X[:, 1][y==-1], color='red', label='y==-1', edgecolor='k', alpha=0.5)
disp.ax_.scatter(X[:, 0][y==1], X[:, 1][y==1], color='blue', label='y==1', edgecolor='k', alpha=0.5)
disp.ax_.set_xlabel(f'Poly-Perceptron | Score: {poly_perceptron.score(X, y)}');