Why should we study linear algebra in ML?
Housing dataset
\(1\) house
Attributes/Target | Values |
---|---|
latitude | 12.9 |
longitude | 80.2 |
age | 3 |
rooms | 2 |
area | 1000 |
distance | 3 |
price | 40 |
\[ \begin{bmatrix} 12.9\\ 80.2\\ 3\\ 2\\ 1000\\ 3\\ \end{bmatrix} \]
Vector
\(100\) houses?
\[ \begin{bmatrix} 12.9 & 80.2 & 3 & 2 & 1000 & 3\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 14.3 & 75.9 & 30 & 2 & 1200 & 5\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 20.8 & 90.5 & 1 & 3 & 1500 & 2 \end{bmatrix} \]
\(100 \times 6\) matrix
Each row is a house
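A minimal NumPy sketch of this representation (the attribute values are the illustrative numbers from the table above; the remaining 99 rows are placeholders):

```python
import numpy as np

# One house as a vector of 6 attributes:
# (latitude, longitude, age, rooms, area, distance)
x = np.array([12.9, 80.2, 3, 2, 1000, 3])
print(x.shape)  # (6,)

# 100 houses stacked row-wise into a matrix; random placeholders here
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 6))
X[0] = x        # first row is the house above
print(X.shape)  # (100, 6): each row is a house
```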
Given
latitude
longitude
age
num_of_rooms
area
distance_from_school
Predict
selling_price
Given
\[
\huge{x \in \mathbb{R}^{6}}
\]
Predict
\[
\huge{y \in \mathbb{R}}
\]
Given
\[
\huge{x \in \mathbb{R}^{n}}
\]
Feature-vector
Predict
\[
\huge{y \in \mathbb{R}}
\]
Label
\[
\huge{f: \mathbb{R}^{n} \rightarrow \mathbb{R}}
\]
\[
\huge{f(x)=y}
\]
Learning a model?
Data for \(m\) houses; each house is described by \(n\) features:
\[
X = \begin{bmatrix}
\cdots & x_1 & \cdots\\
& \vdots &\\
\cdots & x_m & \cdots\\
\end{bmatrix}
\]
\(m \times n\) data-matrix (feature matrix)
\[ y = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]
\(m \times 1\) label vector
\[
\large{\text{Selling-price} = 2 \times \text{Area} - 0.2 \times \text{Distance} + \text{Constant}}
\]
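As a sketch, this hypothetical pricing rule is just a linear function of two of the features (the coefficients 2 and -0.2 and the constant come from the line above; units are left unspecified):

```python
def selling_price(area, distance, constant=0.0):
    # Selling-price = 2 * Area - 0.2 * Distance + Constant
    return 2 * area - 0.2 * distance + constant

print(selling_price(area=1000, distance=3))  # 1999.4 + constant (here constant = 0)
```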
\[
\begin{aligned}
y &= \theta_0 + x_1 \theta_1 + x_2 \theta_2 + x_3\theta_3 + x_4 \theta_4 + x_5 \theta_5 + x_6 \theta_6 \\\\
&= \theta^T x
\end{aligned}
\]
\[
y = \theta^T x = \begin{bmatrix}
\theta_0 & \theta_1 & \theta_2 & \theta_3 & \theta_4 & \theta_5 & \theta_6
\end{bmatrix}\begin{bmatrix}
1\\
x_1\\
x_2\\
x_3\\
x_4\\
x_5\\
x_6
\end{bmatrix}
\]
\[
\huge{f(x) = \theta^T x}
\]
\(\large \theta\) is a vector of parameters (weights) of the model
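A minimal sketch of one prediction, assuming the convention above: \(\theta\) carries the intercept \(\theta_0\) in its first entry and the raw feature vector is prepended with a 1 (the weight values are hypothetical):

```python
import numpy as np

theta = np.array([5.0, 0.0, 0.0, -0.1, 1.5, 0.002, -0.2])  # hypothetical (theta_0, ..., theta_6)
x_raw = np.array([12.9, 80.2, 3, 2, 1000, 3])               # one house's features

x = np.concatenate(([1.0], x_raw))  # prepend 1 so theta_0 acts as the intercept
y_hat = theta @ x                   # theta^T x
print(y_hat)
```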
Equivalently, since \(\theta^T x\) is a scalar, \(y = \theta^T x = x^T \theta\):
\[
y = x^T \theta = \begin{bmatrix}
1 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6
\end{bmatrix} \begin{bmatrix}
\theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6
\end{bmatrix}
\]
\[
\begin{bmatrix}
y_{1}\\
\vdots\\
y_{100}
\end{bmatrix}= \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & x_{1,5} & x_{1,6}\\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\
1 & x_{100,1} & x_{100,2} & x_{100,3} & x_{100,4} & x_{100,5} & x_{100,6}\\
\end{bmatrix}\begin{bmatrix}
\theta_0\\
\theta_1\\
\theta_2\\
\theta_3\\
\theta_4\\
\theta_5\\
\theta_6
\end{bmatrix}
\]
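Predicting for all houses at once is then a single matrix-vector product; a sketch with placeholder data (`X_raw` stands for the \(100 \times 6\) feature matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.uniform(size=(100, 6))          # placeholder 100 x 6 feature matrix
theta = rng.normal(size=7)                  # hypothetical parameters, theta_0 first

X = np.hstack([np.ones((100, 1)), X_raw])   # prepend the column of ones -> 100 x 7
y_hat = X @ theta                           # X theta: one prediction per house
print(y_hat.shape)                          # (100,)
```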
\[
\huge{X \theta = y}
\]
Enter Linear Algebra
Start with the homogeneous system:
\[
\huge{X \theta = 0}
\]
If: \(X \theta_1 = 0\)
If: \(X \theta_2 = 0\)
Then: \(X(\alpha \theta_1 + \beta \theta_2) = \alpha X\theta_1 + \beta X\theta_2 = 0\), so the solutions of \(X \theta = 0\) are closed under linear combinations.
The set of all solutions of \(X \theta = 0\) is \_\_\_\_\_ .
The set of all solutions of \(X \theta = 0\) is \(N(X)\).
Find a basis for \(N(X)\).
\(N(X) = \text{span}(B)\)
\[
\huge{X \theta = 0}
\]
Gaussian elimination
\[
X = \begin{bmatrix}
1 & 0 & -1 & 0\\
2 & 1 & 0 & 1\\
3 & 1 & -1 & 1
\end{bmatrix}
\]
\[
\begin{bmatrix}
1 & 0 & -1 & 0\\
2 & 1 & 0 & 1\\
3 & 1 & -1 & 1
\end{bmatrix} \underrightarrow{R_2 \rightarrow R_2-2R_1}
\begin{bmatrix}
1 & 0 & -1 & 0\\
0 & 1 & 2 & 1\\
3 & 1 & -1 & 1
\end{bmatrix}
\]
\[
\begin{bmatrix}
1 & 0 & -1 & 0\\
0 & 1 & 2 & 1\\
3 & 1 & -1 & 1
\end{bmatrix} \underrightarrow{R_3 \rightarrow R_3-3R_1}
\begin{bmatrix}
1 & 0 & -1 & 0\\
0 & 1 & 2 & 1\\
0 & 1 & 2 & 1
\end{bmatrix}
\]
\[
\begin{bmatrix}
1 & 0 & -1 & 0\\
0 & 1 & 2 & 1\\
0 & 1 & 2 & 1
\end{bmatrix} \underrightarrow{R_3 \rightarrow R_3 - R_2}
\begin{bmatrix}
1 & 0 & -1 & 0\\
0 & 1 & 2 & 1\\
0 & 0 & 0 & 0
\end{bmatrix}
\]
The pivots (shown in blue below) are in columns \(1\) and \(2\): \(\theta_1\) and \(\theta_2\) are pivot variables, while \(\theta_3\) and \(\theta_4\) are free variables.
\[
\begin{bmatrix}
\color{blue}{1} & 0 & -1 & 0\\
0 & \color{blue}{1} & 2 & 1\\
0 & 0 & 0 & 0
\end{bmatrix}
\]
\[
\begin{bmatrix}
\color{blue}{1} & 0 & -1 & 0\\
0 & \color{blue}{1} & 2 & 1\\
0 & 0 & 0 & 0
\end{bmatrix}\begin{bmatrix}
\color{blue}{\theta_1}\\
\color{blue}{\theta_2}\\
\theta_3\\
\theta_4
\end{bmatrix} = \begin{bmatrix}
0\\
0\\
0\\
0
\end{bmatrix}
\]
\(B = \{ \}\)
For each free (independent) variable \(\theta_i\): set \(\theta_i = 1\), set the other free variables to \(0\), solve for the pivot variables, and add the resulting vector to \(B\).
\[
B = \left \{ \begin{bmatrix}1\\
-2\\
1\\
0\end{bmatrix}, \begin{bmatrix}0\\
-1\\
0\\
1\end{bmatrix} \right \}
\]
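A quick numerical check of this basis (a sketch; the columns of `B` are the two vectors above):

```python
import numpy as np

X = np.array([[1, 0, -1, 0],
              [2, 1,  0, 1],
              [3, 1, -1, 1]], dtype=float)

B = np.array([[ 1,  0],
              [-2, -1],
              [ 1,  0],
              [ 0,  1]], dtype=float)   # basis vectors as columns

print(X @ B)                        # all zeros: both columns lie in N(X)
print(np.linalg.matrix_rank(B))     # 2: the columns are linearly independent
print(np.linalg.matrix_rank(X))     # 2 pivots, so dim N(X) = 4 - 2 = 2, matching |B|
```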
When will \(X \theta = y\) have a solution?
\[
\begin{bmatrix}
\big\vert & & \big\vert\\
x_1 & \cdots & x_n\\
\big\vert & & \big\vert
\end{bmatrix} \begin{bmatrix}
\theta_1\\
\vdots\\
\theta_n
\end{bmatrix} = \begin{bmatrix}
y_1\\
\vdots\\
y_m
\end{bmatrix}
\]
\[
\theta_1 x_1 + \cdots + \theta_n x_n = y
\]
\[
y \in C(X)
\]
So \(X \theta = y\) has a solution exactly when \(y \in C(X)\).
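A sketch of this test using the rank criterion (appending \(y\) as an extra column leaves the rank unchanged exactly when \(y \in C(X)\)); the matrix reuses the small example from above:

```python
import numpy as np

X = np.array([[1, 0, -1, 0],
              [2, 1,  0, 1],
              [3, 1, -1, 1]], dtype=float)

def is_solvable(X, y):
    # y is in C(X) iff rank([X | y]) == rank(X)
    return np.linalg.matrix_rank(np.column_stack([X, y])) == np.linalg.matrix_rank(X)

print(is_solvable(X, X @ np.array([1.0, 2.0, 3.0, 4.0])))  # True: y was built from the columns
print(is_solvable(X, np.array([1.0, 0.0, 0.0])))           # False: y is outside C(X)
```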
Typical dataset: \(m \gg n\) (many more houses than features), so \(y\) rarely lies in the at-most-\(n\)-dimensional subspace \(C(X)\), and \(X \theta = y\) is generally unsolvable.
What does the \(\approx\) symbol mean?
\[
\hat{y} \approx y
\]
We measure how close \(\hat{y}\) is to \(y\) by the squared distance:
\[
||\hat{y} - y||^2 = (\hat{y}_1 - y_1)^2 + \cdots + (\hat{y}_m - y_m)^2
\]
\[
X\hat{\theta} \approx y
\]
\[
\hat{\theta} = \arg \min \limits_{\theta} ||X\theta - y||^2
\]
\[
\huge{L = ||X \theta - y||^2}
\]
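A sketch of evaluating this loss for a candidate \(\theta\) on placeholder data (the bias column is assumed to already be part of \(X\)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))      # placeholder data matrix (bias column included)
y = rng.normal(size=100)           # placeholder labels
theta = rng.normal(size=7)         # a candidate parameter vector

residual = X @ theta - y
L = residual @ residual            # ||X theta - y||^2
print(L, np.sum(residual ** 2))    # two equivalent ways of computing the same number
```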
\[
\begin{aligned}
L &= ||X \theta - y||^2\\\\
&= (X \theta - y)^T (X \theta - y)
\end{aligned}
\]
\[
\nabla_{\theta} L = \begin{bmatrix}
\cfrac{\partial L}{\partial \theta_1}\\
\vdots\\
\cfrac{\partial L}{\partial \theta_n}
\end{bmatrix} = 0
\]
\[
\nabla_{\theta} L = 2(X^TX)\theta - 2X^Ty = 0
\]
\[
\huge{(X^TX)\theta = X^Ty}
\]
If \(X^TX\) is invertible, then:
\[
\hat{\theta} = (X^TX)^{-1} X^Ty
\]
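A sketch of this closed form on synthetic data; `np.linalg.lstsq` is the more numerically stable route, and the two agree when \(X\) has full column rank:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))                        # full-column-rank data matrix
theta_true = rng.normal(size=7)
y = X @ theta_true + 0.1 * rng.normal(size=100)      # labels with a little noise

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solve (X^T X) theta = X^T y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least-squares solver

print(np.allclose(theta_hat, theta_lstsq))           # True
```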
When is \(X^TX\) invertible?
\[
N(X) = N(X^TX)
\]
Proof
If \(\theta \in N(X)\): \(X\theta = 0\), so \(X^TX\theta = X^T(X\theta) = 0\), hence \(\theta \in N(X^TX)\).
If \(\theta \in N(X^TX)\): \(X^TX\theta = 0\), so \(\theta^T X^T X \theta = \|X\theta\|^2 = 0\), hence \(X\theta = 0\) and \(\theta \in N(X)\).
If \(\text{rank}(X) = n\), \(X^TX\) is invertible
Proof
If \(\text{rank}(X) = n\): then \(N(X) = \{0\}\), so \(N(X^TX) = \{0\}\); the \(n \times n\) matrix \(X^TX\) has full rank and is therefore invertible.
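A sketch checking this condition numerically: the first matrix has full column rank, the second has linearly dependent columns, and in each case the rank test on \(X\) agrees with the rank of \(X^TX\):

```python
import numpy as np

X_full = np.random.default_rng(0).normal(size=(100, 7))  # rank 7 = n  -> X^T X invertible
X_def = np.array([[2.0, 6.0],
                  [1.0, 3.0]])                            # 2nd column = 3 * 1st -> rank 1 < n

for X in (X_full, X_def):
    n = X.shape[1]
    print(np.linalg.matrix_rank(X) == n,
          np.linalg.matrix_rank(X.T @ X) == n)            # the two checks always agree
```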
What does a linear regression model look like?
\[
(X^TX)\hat{\theta} = X^Ty
\]
\(X \hat{\theta}\) is an approximation for \(y\)
How are \(X \hat{\theta}\) and \(y\) related geometrically?
\[
X = \begin{bmatrix}
2 & 6\\
1 & 3
\end{bmatrix}, y = \begin{bmatrix}
3\\
4
\end{bmatrix}, X \hat{\theta} = \begin{bmatrix}
4\\
2
\end{bmatrix}
\]
\[
e = y - X \hat{\theta}
\]
\[
e \perp C(X)
\]
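A sketch verifying this on the example above; `np.linalg.lstsq` is used because this \(X^TX\) is singular, so the normal equations have no unique solution, but the projection \(X\hat{\theta}\) is still well defined:

```python
import numpy as np

X = np.array([[2.0, 6.0],
              [1.0, 3.0]])
y = np.array([3.0, 4.0])

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # one least-squares solution
y_hat = X @ theta_hat                              # projection of y onto C(X)
e = y - y_hat                                      # error vector

print(np.round(y_hat, 6))     # [4. 2.]
print(np.round(e, 6))         # [-1. 2.]
print(np.round(X.T @ e, 6))   # [0. 0.]: e is orthogonal to every column of X
```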
\[
X = \begin{bmatrix}
\vert & & \vert\\
x_1 & \cdots & x_n\\
\vert & & \vert
\end{bmatrix}
\]
\[
e \perp x_i
\]
\[
x_i^T e = 0
\]
\[
X^T e = 0
\]
\[
X^T(y - X\hat{\theta}) = 0
\]
\[
X^TX \hat{\theta} = X^T y
\]