import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
n = 100
d = 25

# Dataset with 25 uncorrelated features
np.random.seed(7)
# X = np.random.normal(size=(n, d))
X, y = make_regression(n_samples=n, n_features=d, noise=1, random_state=2)
# Add 25 correlated features:
for j in range(d):
    a = np.random.normal(size=d+j)
    new_feature = X@np.vectorize(lambda i: int(i)*np.random.randint(-2, high=2))(a>1)
    X = np.c_[X, new_feature]

X = X.T
d = X.shape[0]  # update d to the total number of features (now 2*25 = 50) so later cells use matching dimensions

plt.matshow(np.corrcoef(X))
cb = plt.colorbar()
cb.ax.tick_params()
plt.title('Correlation Matrix');
Week 6: Ridge, Lasso Regression & Cross Validation Methods
Maximum-Likelihood Estimator for finding w
The closed form solution for the ML estimator to solve the linear regression problem with the given probabilistic model is the following:
\hat{\mathbf{w}}_{ML} = (XX^T)^†Xy
w_hat_ml = np.linalg.pinv(X @ X.T) @ X @ y
Goodness of ML Estimator
To quantify “goodness” of our estimator, we look at the Mean Squared Error.
Mean Squared Error: Let \mathbf{\hat{w}} be an estimator of an unknown parameter \mathbf{w}. Then we define
\text{MSE}(\mathbf{\hat{w}}) = \mathbb{E}[||\mathbf{\hat{w}} - \mathbf{w}||^2]
as the Mean Squared Error of our estimator.
Bias-Variance Decomposition of MSE
The above formula for mean squared error can be rewritten as
\text{MSE}(\mathbf{\hat{w}}) = \text{tr}(\text{Var}(\mathbf{\hat{w}})) + ||\text{Bias}(\mathbf{\hat{w}})||^2
where \text{tr}(\text{Var}(\mathbf{\hat{w}})) is the trace of the covariance matrix, and \text{Bias}(\mathbf{\hat{w}}) = \mathbb{E}[\mathbf{\hat{w}}] - \mathbf{w}
is the bias, i.e, the expected difference between the estimator \mathbf{\hat{w}} and true value \mathbf{w}.
Applying Bias-Variance decomposition on ML Estimator of w
The ML Estimator has zero bias; so we get
\begin{align*} \text{MSE}(\mathbf{\hat{w}}_{ML}) &= \text{tr}(\text{Var}(\mathbf{\hat{w}}_{ML})) \\ &= \sigma^2 \times \text{tr}((XX^T)^{-1}) \end{align*}
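As a quick sanity check of this formula, here is a small Monte Carlo sketch on a separate synthetic problem with a known w_true and noise level sigma (all names and values below are illustrative assumptions, not part of the dataset generated above): the empirical MSE of the ML estimator should come out close to \sigma^2 \, \text{tr}((XX^T)^{†}).

# Monte Carlo sanity check of MSE(w_hat_ML) = sigma^2 * tr((X X^T)^+),
# run on a small synthetic problem with a known w_true (hypothetical values).
rng = np.random.default_rng(0)
d_toy, n_toy, sigma = 5, 40, 0.5
X_toy = rng.normal(size=(d_toy, n_toy))   # features along rows, samples along columns
w_true = rng.normal(size=d_toy)

sq_errors = []
for _ in range(2000):
    y_toy = X_toy.T @ w_true + sigma * rng.normal(size=n_toy)
    w_hat = np.linalg.pinv(X_toy @ X_toy.T) @ X_toy @ y_toy
    sq_errors.append(np.sum((w_hat - w_true)**2))

print("Empirical MSE          :", np.mean(sq_errors))
print("sigma^2 * tr((XX^T)^+) :", sigma**2 * np.trace(np.linalg.pinv(X_toy @ X_toy.T)))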
Derivation of decomposition
We use the well-known formula relating the variance to the first and second moments, \mathbb{E}[Z^2] = (\mathbb{E}[Z])^2 + \text{Var}(Z), applied coordinate-wise to Z = \hat{w}_i - w_i (note that \text{Var}(\hat{w}_i - w_i) = \text{Var}(\hat{w}_i), since \mathbf{w} is a constant).
Therefore we have,
\begin{align*} \text{MSE}(\mathbf{\hat{w}}) &= \mathbb{E}[||\mathbf{\hat{w}} - \mathbf{w}||^2] = \sum_{i=1}^{d} \mathbb{E}[(\hat{w}_i - w_i)^2] \\ &= \sum_{i=1}^{d} \left( (\mathbb{E}[\hat{w}_i] - w_i)^2 + \text{Var}(\hat{w}_i) \right) \\ &= ||\text{Bias}(\mathbf{\hat{w}})||^2 + \text{tr}(\text{Var}(\mathbf{\hat{w}})) \end{align*}
Source: Wikipedia
In pursuit of a better estimator
We see that the goodness of the ML estimator depends on the noise variance \sigma^2 and on \text{tr}((XX^T)^{-1}). We would like to reduce this quantity.
The variance of the error is inherent to the experimental setup, and reducing it therefore depends on qualitative improvements of the instruments/peripherals used to collect data. So we cannot hope to reduce this quantity from a mathematical perspective.
The trace depends on (XX^T)^{-1}, which is (up to centering and scaling) the inverse covariance matrix of the data X. The covariance matrix captures how the features are related to each other, and this relationship affects the goodness of our estimator. We may try to tweak this value to reduce the trace, thereby increasing the goodness of our estimator.
Reducing the Trace value
The trace of a square matrix can also be written as the sum of its eigenvalues.
Let λ_1, λ_2, \dots, λ_d denote the eigenvalues of XX^T. Then \{ \frac{1}{λ_i} : i \in \{1, 2, \dots, d\} \} is the set of eigenvalues of (XX^T)^{-1} (assuming all λ_i > 0), and its trace is therefore:
\text{tr}((XX^T)^{-1}) = \sum_{i=1}^{d}\frac{1}{λ_i}
Now, we’d like to reduce this value. A way forward is to increase the eigenvalues of XX^T by adding λI:
\text{eigenvalues of } (XX^T + λI) = \{ λ_i + λ : i \in \{1, 2, \dots, d\} \}
which results in a smaller trace value (due to the increase in each denominator): \text{tr}((XX^T + λI)^{-1}) = \sum_{i=1}^{d}\frac{1}{λ_i + λ}
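On the dataset built above, the 25 added features are exact linear combinations of the original 25, so XX^T is rank-deficient and has (numerically) zero eigenvalues; adding λI caps every term of the sum at 1/λ. A quick check (λ = 10 is an arbitrary illustrative value):

# Eigenvalues of X X^T before and after adding lam*I. Because the extra features
# are linear combinations of the first 25, the smallest eigenvalues are ~0,
# so adding lam*I makes a big difference to the trace of the inverse.
lam = 10
eigvals = np.linalg.eigvalsh(X @ X.T)
print("smallest / largest eigenvalue of XX^T :", eigvals.min(), eigvals.max())
print("tr((XX^T + lam*I)^-1)                 :", np.sum(1 / (eigvals + lam)))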
However, this process reformulates the problem the estimator is trying to solve (trade-off) to the following:
\hat{\mathbf{w}}_{\lambda - ML} = (XX^T + λI)^{-1}Xy
This induces a non-zero bias in our estimator:
\begin{align*} \text{Bias}(\mathbf{\hat{w}}_{\lambda-ML}) & = \mathbb{E}[\mathbf{\hat{w}}_{\lambda-ML}] - \mathbf{w}\\ & = [(XX^T + λI)^{-1} - (XX^T)^†](XX^T)\mathbf{w} \end{align*}
The difference between the MSE of our estimators is then:
\begin{align*} \text{MSE}(\mathbf{\hat{w}}_{ML}) - \text{MSE}(\mathbf{\hat{w}}_{\lambda-ML}) &= \underbrace{\text{tr}(\text{Var}(\mathbf{\hat{w}}_{ML})) - \text{tr}(\text{Var}(\mathbf{\hat{w}}_{\lambda-ML}))}_{\text{(A)}} \; - \; \underbrace{||\text{Bias}(\mathbf{\hat{w}}_{\lambda-ML})||^2}_{\text{(B)}} \end{align*}
Note that both \text{(A)} and \text{(B)} are positive quantities, so the difference can be either positive or negative, depending on the parameter \lambda.
The existence theorem asserts that there exists some non-zero λ for which the MSE of \hat{\mathbf{w}}_{\lambda - ML} is lower than that of \hat{\mathbf{w}}_{ML}, i.e., for which \text{(A)} > \text{(B)}.
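To make the existence claim concrete, here is a hedged Monte Carlo sketch on a small, poorly conditioned toy problem with a known w_true (all names and values are illustrative assumptions, separate from the dataset above); for at least one of the non-zero λ values the empirical MSE should drop below that of the ML estimator:

# Empirical MSE of (X X^T + lam*I)^{-1} X y as a function of lam, on a toy problem
# where w_true is known. lam = 0 corresponds to the ML estimator.
rng = np.random.default_rng(1)
d_toy, n_toy, sigma = 20, 25, 3.0
X_toy = rng.normal(size=(d_toy, n_toy))
w_true = rng.normal(size=d_toy)

def empirical_mse(lam, n_trials=1000):
    errs = []
    for _ in range(n_trials):
        y_toy = X_toy.T @ w_true + sigma * rng.normal(size=n_toy)
        w_hat = np.linalg.inv(X_toy @ X_toy.T + lam * np.eye(d_toy)) @ X_toy @ y_toy
        errs.append(np.sum((w_hat - w_true)**2))
    return np.mean(errs)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(f"lambda = {lam:>6}: empirical MSE = {empirical_mse(lam):.3f}")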
Ridge Regression
The method of estimating w while reducing the effect that correlated features have on the ML estimator for linear regression is called Ridge Regression.
The objective of the ridge regression problem is given as follows:
\underset{w}{arg \; min} \sum_{i=1}^{n}(w^Tx_i-y_i)^2 + λ||w||^2_2
The λ||w||^2_2 term is called the regularizer, and λ is a hyperparameter which can be chosen through the cross-validation techniques described at the end of this notebook.
Closed-Form Solution for Ridge Estimator
It is the same as the aforementioned \hat{\mathbf{w}}_{\lambda - ML}:
\hat{\mathbf{w}}_{\lambda - ML} = (XX^T + λI)^{-1}Xy
It’s good to know that the inverse of (XX^T + λI) always exists, since:
1. XX^T is positive semi-definite
2. \lambda > 0
Hence (XX^T + λI) is a positive definite matrix (all its eigenvalues are strictly positive, so it is full rank), which guarantees that its inverse exists.
l = 10
w_hat_ridge = np.linalg.inv(X @ X.T + l * np.eye(d)) @ X @ y
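As a cross-check (a hedged comparison, since solver details differ slightly): scikit-learn's Ridge minimizes ||y - Xw||^2_2 + \alpha||w||^2_2, so fitting it with alpha=l, fit_intercept=False and samples as rows should reproduce the closed-form solution up to numerical tolerance.

from sklearn.linear_model import Ridge

# scikit-learn expects samples as rows, i.e. X.T here; fit_intercept=False matches
# the closed form above, which has no intercept term.
sk_ridge = Ridge(alpha=l, fit_intercept=False).fit(X.T, y)
print(np.allclose(sk_ridge.coef_, w_hat_ridge))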
Comparison of Ridge and Linear Regression Solutions
The \lambda ||w||^2_2 term in the objective function for Ridge regression penalizes weight vectors w with greater length. This pushes the ridge estimator closer to the origin, resulting in smaller values in its components.
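For instance, comparing the lengths of the two estimators computed in the cells above:

# The ridge estimator should have a smaller norm than the (minimum-norm) ML estimator.
print("||w_hat_ml||    :", np.linalg.norm(w_hat_ml))
print("||w_hat_ridge|| :", np.linalg.norm(w_hat_ridge))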
We show the effect of varying \lambda on the ridge coefficients below.
lambdas = np.logspace(-1, 10, 200)

from sklearn.linear_model import Ridge

coefs = []
for l in lambdas:
    w_hat_ridge = np.linalg.pinv(X @ X.T + l * np.eye(d)) @ X @ y
    coefs.append(w_hat_ridge)

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(lambdas, coefs)
ax.set_xscale("log")
# ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel("λ (log scale)")
plt.ylabel("weights")
plt.title("Ridge coefficients w.r.t λ")
plt.axis("tight")
plt.show()
Lasso Regression
The objective for the lasso regression problem is as follows:
\underset{w}{arg \; min} \sum_{i=1}^{n}(w^Tx_i-y_i)^2 + λ||w||_1
The only difference between Lasso and Ridge is that Lasso uses the L1 norm (Manhattan distance), in contrast to the L2 norm (Euclidean distance) used by Ridge regression. This difference adds sparsity to the Lasso estimator; i.e., some feature coefficients are pushed to exactly 0, and these features play no role in the final prediction.
In this sense, Lasso can be said to perform feature selection. Lasso also acts as a regularizer, in that it constrains the length of the estimator.
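As a quick illustration of this sparsity (alpha=100 is an arbitrary illustrative value; scikit-learn expects samples as rows, hence X.T):

from sklearn.linear_model import Lasso

# Count how many coefficients the Lasso solution sets exactly to zero.
lasso_demo = Lasso(alpha=100.0, max_iter=int(1e4)).fit(X.T, y)
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print(f"{n_zero} of {lasso_demo.coef_.size} Lasso coefficients are exactly zero")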
Solution for Lasso Regression
There exists no closed-form solution for Lasso, since the L1 term (equivalently, the corners of the L1 constraint region) is not differentiable everywhere.
Hence we may use gradient-descent-style methods, with the help of subgradients, to find the Lasso estimator.
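A minimal subgradient-descent sketch for this objective is shown below (illustrative only: the function name, iteration count and step-size schedule are assumptions, and the coordinate-descent solver used next is far more efficient in practice):

# Subgradient descent on sum_i (w^T x_i - y_i)^2 + lam * ||w||_1,
# with X of shape (d, n) as in the rest of this notebook.
def lasso_subgradient(X, y, lam, n_iters=5000):
    w = np.zeros(X.shape[0])
    L = 2 * np.linalg.eigvalsh(X @ X.T).max()   # Lipschitz constant of the smooth part
    for t in range(1, n_iters + 1):
        grad_smooth = 2 * X @ (X.T @ w - y)     # gradient of the squared-error term
        subgrad_l1 = lam * np.sign(w)           # a valid subgradient of lam * ||w||_1
        w = w - (grad_smooth + subgrad_l1) / (L * np.sqrt(t))
    return w

# e.g. w_sub = lasso_subgradient(X, y, lam=10.0)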
In this notebook we use scikit-learn’s implementation of Lasso to demonstrate the effect of the Lasso objective on our estimator.
import warnings
warnings.filterwarnings("ignore")

lambdas = np.logspace(-4, 3, 200)

from sklearn.linear_model import Lasso

coefs = []
for l in lambdas:
    lasso = Lasso(alpha=l, max_iter=int(1e3)).fit(X.T, y)
    coefs.append(lasso.coef_)

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(lambdas, coefs)
ax.set_xscale("log")
# ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel("λ (log scale)")
plt.ylabel("weights")
plt.title("Lasso coefficients w.r.t λ")
plt.axis("tight")
plt.show()
Best λ?
We discuss strategies that can be used to find λ which reduces the MSE for our linear regression problem.
Validation
- We partition our dataset (features and labels) into two parts: the train set and the validation set.
- We construct a set of candidate values for λ and find their corresponding estimators \hat{\mathbf{w}}_{\lambda - ML} using the train dataset.
- We then use these estimators on the validation dataset and find the MSE.
- We choose the λ that gives the least MSE.
candidate_lambdas = list(np.logspace(-9, 2, 1000))

X_train, X_val, y_train, y_val = X[:, :int(0.66*n)], X[:, int(0.66*n):], y[:int(0.66*n)], y[int(0.66*n):]

def validate(X_train, X_val, y_train, y_val, l):
    w_hat_l_ml = np.linalg.pinv(X_train @ X_train.T + l * np.eye(d)) @ X_train @ y_train
    l_mse = np.linalg.norm(X_val.T @ w_hat_l_ml - y_val)**2 / len(y_val)
    return l_mse

lst = [validate(X_train, X_val, y_train, y_val, l) for l in candidate_lambdas]

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(candidate_lambdas, lst)
ax.set_xscale("log")

best_lambda, best_loss = min(zip(candidate_lambdas, lst), key=lambda i: i[-1])
plt.axvline(x=best_lambda, color='red')

plt.xlabel("λ (log scale)")
plt.ylabel("Validation Loss")
plt.title("Validation")
plt.axis("tight");

print(f'Best Lambda: {best_lambda}')
Best Lambda: 0.0986265846131283
K-Fold Cross Validation
- Partition the dataset into K sets.
- For each candidate λ:
  - For i = 1 to K rounds:
    - Choose the ith partition as the validation set.
    - Consider the union of the remaining sets as the train set.
    - Follow the aforementioned Validation procedure to obtain the MSE for the chosen λ.
  - Return the average MSE over the K rounds.
- Choose the λ that gives the minimum average MSE.
candidate_lambdas = list(np.logspace(-9, 2, 1000))

def KFold(K, l):
    # Construct the K folds
    folds = []
    n = X.shape[1]//K
    for i in range(K):
        folds.append((X[:, i*n:(i+1)*n], y[i*n:(i+1)*n]))

    l_mse = np.array([])
    # Cross validate over every validation partition
    for i in range(K):
        X_val, y_val = folds[i]

        X_train, y_train = np.array([[] for _ in range(d)]), np.array([])
        for j in range(K):
            if i != j:
                X_j, y_j = folds[j]
                X_train = np.column_stack((X_train, X_j))
                y_train = np.concatenate((y_train, y_j))

        l_mse = np.append(l_mse, validate(X_train, X_val, y_train, y_val, l))

    return l_mse.mean()

lst = []
for l in candidate_lambdas:
    a = KFold(10, l)
    lst.append(a)
    # print(l, '|', round(a, 10))
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(candidate_lambdas, lst)
ax.set_xscale("log")

best_lambda, best_loss = min(zip(candidate_lambdas, lst), key=lambda i: i[-1])
plt.axvline(x=best_lambda, color='red')

plt.xlabel("λ (log scale)")
plt.ylabel("K-Fold Loss")
plt.title("K-Fold CV with K=10")
plt.axis("tight");

print(f'Best Lambda: {best_lambda}')
Best Lambda: 1.0399609139541203e-06
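For comparison, the same kind of 10-fold search can be written with scikit-learn's utilities; a hedged sketch (the folds, solver and intercept handling differ from the manual loop above, so the selected value will not match exactly):

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold as SKFold  # aliased to avoid clashing with KFold() above

# scikit-learn expects samples as rows (X.T); fit_intercept=False mirrors the
# closed-form ridge estimator used in validate().
param_grid = {"alpha": np.logspace(-9, 2, 100)}
search = GridSearchCV(Ridge(fit_intercept=False), param_grid,
                      scoring="neg_mean_squared_error",
                      cv=SKFold(n_splits=10, shuffle=False))
search.fit(X.T, y)
print("Best alpha (sklearn, 10-fold):", search.best_params_["alpha"])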
Leave-one-out Cross Validation
- Just K-Fold Validation, but with K set to the number of samples in our dataset.
- In other words, for n rounds, we choose just 1 element as the validation set, and the rest to be our train set.
- It is a computationally expensive procedure to perform, although it results in a nearly unbiased estimate of model performance.
candidate_lambdas = list(np.logspace(-9, 2, 1000))
lst = [KFold(100, l) for l in candidate_lambdas]

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(candidate_lambdas, lst)
ax.set_xscale("log")

best_lambda, best_loss = min(zip(candidate_lambdas, lst), key=lambda i: i[-1])
plt.axvline(x=best_lambda, color='red')

plt.xlabel("λ (log scale)")
plt.ylabel(f"K-Fold (K={n}) Loss")
plt.title("LOOCV")
plt.axis("tight");

print(f'Best Lambda: {best_lambda}')
Best Lambda: 5.659170163246243e-07
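Alternatively, scikit-learn's RidgeCV performs this leave-one-out selection efficiently; a hedged equivalent of the search above (fit_intercept=False to mirror the estimators in this notebook):

from sklearn.linear_model import RidgeCV

# With the default cv=None, RidgeCV uses an efficient leave-one-out scheme
# over the supplied alpha grid.
ridge_cv = RidgeCV(alphas=np.logspace(-9, 2, 1000), fit_intercept=False).fit(X.T, y)
print("RidgeCV best alpha:", ridge_cv.alpha_)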