MLF | Live Session | Week-3

Course Outline

Session Outline

Linear Algebra for ML

Linear Algebra for ML

Why should we study linear algebra in ML?

Linear Algebra for ML

Why should we study linear algebra in ML?



Data

Linear Algebra for ML

Housing dataset

Linear Algebra for ML

\(1\) house

Attributes / target values:

  • latitude: 12.9
  • longitude: 80.2
  • age: 3
  • rooms: 2
  • area: 1000
  • distance: 3
  • price: 40

Linear Algebra for ML

\(1\) house

Attributes / target values:

  • latitude: 12.9
  • longitude: 80.2
  • age: 3
  • rooms: 2
  • area: 1000
  • distance: 3
  • price: 40

\[ \begin{bmatrix} 12.9\\ 80.2\\ 3\\ 2\\ 1000\\ 3\\ \end{bmatrix} \]


Vector

Linear Algebra for ML

\(100\) houses?

Linear Algebra for ML

\(100\) houses

\[ \begin{bmatrix} 12.9 & 80.2 & 3 & 2 & 1000 & 3\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 14.3 & 75.9 & 30 & 2 & 1200 & 5\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 20.8 & 90.5 & 1 & 3 & 1500 & 2 \end{bmatrix} \]


\(100 \times 6\) matrix

Each row is a house
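
A minimal NumPy sketch of this representation; the three rows are the example houses shown in the matrix above:

```python
import numpy as np

# One house as a 6-dimensional vector:
# [latitude, longitude, age, rooms, area, distance]
house = np.array([12.9, 80.2, 3, 2, 1000, 3])
print(house.shape)   # (6,)

# A dataset: one row per house, one column per attribute
X = np.array([
    [12.9, 80.2,  3, 2, 1000, 3],
    [14.3, 75.9, 30, 2, 1200, 5],
    [20.8, 90.5,  1, 3, 1500, 2],
])
print(X.shape)       # (3, 6) -- with 100 houses this would be (100, 6)
```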

Regression

Given


latitude

longitude

age

num_of_rooms

area

distance_from_school

Predict




selling_price

Regression

Given



\[ \huge{x \in \mathbb{R}^{6}} \]

Predict



\[ \huge{y \in \mathbb{R}} \]

Regression

Given



\[ \huge{x \in \mathbb{R}^{n}} \]

Predict



\[ \huge{y \in \mathbb{R}} \]

Regression

Given



\[ \huge{x \in \mathbb{R}^{n}} \]


Feature-vector

Predict



\[ \huge{y \in \mathbb{R}} \]



Label

Model

\[ \huge{f: \mathbb{R}^{n}} \rightarrow \mathbb{R} \]


Model

\[ \huge{f: \mathbb{R}^{n}} \rightarrow \mathbb{R} \]

\[ \huge{f(x)=y} \]


Model

\[ \huge{f: \mathbb{R}^{n}} \rightarrow \mathbb{R} \]

\[ \huge{f(x)=y} \]



Learning a model?

Labeled Dataset

Data for \(m\) houses; each house is described by \(n\) features:



\[ X = \begin{bmatrix} \cdots & x_1 & \cdots\\ & \vdots &\\ \cdots & x_m & \cdots\\ \end{bmatrix} \]

\(m \times n\) data-matrix (feature matrix)

\[ y = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]



\(m \times 1\) label vector

Linear regression



\[ \large{\text{Selling-price} = 2 \times \text{Area} - 0.2 \times \text{Distance} + \text{Constant}} \]


Linear regression



\[ \begin{aligned} y &= \theta_0 + x_1 \theta_1 + x_2 \theta_2 + x_3\theta_3 + x_4 \theta_4 + x_5 \theta_5 + x_6 \theta_6 \\\\ \end{aligned} \]


Linear regression



\[ \begin{aligned} y &= \theta_0 + x_1 \theta_1 + x_2 \theta_2 + x_3\theta_3 + x_4 \theta_4 + x_5 \theta_5 + x_6 \theta_6 \\\\ &= \theta^T x \end{aligned} \]


Linear regression



\[ y = \theta^T x = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \theta_3 & \theta_4 & \theta_5 & \theta_6 \end{bmatrix}\begin{bmatrix} 1\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5\\ x_6 \end{bmatrix} \]


Linear regression



\[ \huge{f(x) = \theta^T x} \]



\(\large \theta\) is a vector of parameters (weights) of the model


Linear regression



\[ y = \theta^T x = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \theta_3 & \theta_4 & \theta_5 & \theta_6 \end{bmatrix}\begin{bmatrix} 1\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5\\ x_6 \end{bmatrix} \]


Linear regression



\[ y = x^T \theta = \begin{bmatrix} 1 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6 \end{bmatrix} \]


Linear regression



\[ \begin{bmatrix} y_{1}\\ \vdots\\ y_{100} \end{bmatrix}= \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & x_{1,5} & x_{1,6}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{100,1} & x_{100,2} & x_{100,3} & x_{100,4} & x_{100,5} & x_{100,6}\\ \end{bmatrix}\begin{bmatrix} \theta_0\\ \theta_1\\ \theta_2\\ \theta_3\\ \theta_4\\ \theta_5\\ \theta_6 \end{bmatrix} \]
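
A small NumPy sketch of this computation, using random stand-in data and placeholder parameters; the column of ones is prepended so that \(\theta_0\) acts as the constant term:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 6

X_raw = rng.random((m, n))                 # stand-in for the 100 x 6 housing data
X = np.hstack([np.ones((m, 1)), X_raw])    # prepend 1s: X is now 100 x 7

theta = rng.random(n + 1)                  # placeholder values for theta_0 ... theta_6
y_hat = X @ theta                          # all 100 predictions in one matrix-vector product
print(y_hat.shape)                         # (100,)
```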




Linear regression





\[ \huge{X \theta} = y \]


Linear regression





\[ \huge{X \theta} = y \]

Enter Linear Algebra


Setting



\[ \huge{X \theta} = y \]


\(X \theta = 0\)




\[ \huge{X \theta = 0} \]


\(X \theta = 0\)



If:

  • \(X \theta_1 = 0\)
  • \(X \theta_2 = 0\)

\(X \theta = 0\)



If:

  • \(X \theta_1 = 0\)
  • \(X \theta_2 = 0\)



Then:

  • \(X (\theta_1 + \theta_2) = X\theta_1 + X \theta_2 = 0\)
  • \(X(k \theta_1) = k(X \theta_1) = 0\)

\(X \theta = 0\)



The set of all solutions of \(X \theta = 0\) is \(\underline{\qquad\qquad}\).


\(X \theta = 0\)



The set of all solutions of \(X \theta = 0\) is \(N(X)\).


\(X \theta = 0\)



The set of all solutions of \(X \theta = 0\) is \(N(X)\).



Find a basis for \(N(X)\).


\(X \theta = 0\)



The set of all solutions of \(X \theta = 0\) is \(N(X)\).



Find a basis for \(N(X)\).



\(N(X) = \text{span}(B)\)


\(X \theta = 0\)




\[ \huge{X \theta = 0} \]


Gaussian elimination


Row-operations

Example




\[ X = \begin{bmatrix} 1 & 0 & -1 & 0\\ 2 & 1 & 0 & 1\\ 3 & 1 & -1 & 1 \end{bmatrix} \]


Example: Step-1




\[ \begin{bmatrix} 1 & 0 & -1 & 0\\ 2 & 1 & 0 & 1\\ 3 & 1 & -1 & 1 \end{bmatrix} \underrightarrow{R_2 \rightarrow R_2-2R_1} \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 3 & 1 & -1 & 1 \end{bmatrix} \]


Example: Step-2




\[ \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 3 & 1 & -1 & 1 \end{bmatrix} \underrightarrow{R_3 \rightarrow R_3-3R_1} \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 0 & 1 & 2 & 1 \end{bmatrix} \]


Example: Step-3




\[ \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 0 & 1 & 2 & 1 \end{bmatrix} \underrightarrow{R_3 \rightarrow R_3 - R_2} \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix} \]


Row-echelon matrix




\[ \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix} \]


Row-echelon matrix




\[ \begin{bmatrix} \color{blue}{1} & 0 & -1 & 0\\ 0 & \color{blue}{1} & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix} \]


Solve




\[ \begin{bmatrix} \color{blue}{1} & 0 & -1 & 0\\ 0 & \color{blue}{1} & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} \color{blue}{\theta_1}\\ \color{blue}{\theta_2}\\ \theta_3\\ \theta_4 \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix} \]


Algorithm

\(B = \{ \}\)

For each independent (free) variable \(\theta_i\):

  • Set \(\theta_i = 1\) and \(\theta_j = 0\) for every other independent variable \(j \neq i\)
  • Solve for the dependent variables
  • Add the resulting \(\theta\) to \(B\)


Solved



\[ \begin{bmatrix} \color{blue}{1} & 0 & -1 & 0\\ 0 & \color{blue}{1} & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} \color{blue}{\theta_1}\\ \color{blue}{\theta_2}\\ \theta_3\\ \theta_4 \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix} \]



\[ B = \left \{ \begin{bmatrix}1\\ -2\\ 1\\ 0\end{bmatrix}, \begin{bmatrix}0\\ -1\\ 0\\ 1\end{bmatrix} \right \} \]
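
As a quick check, the same basis can be recovered with SymPy's `nullspace`, which follows the same free-variable procedure (a verification sketch, not part of the derivation):

```python
import sympy as sp

X = sp.Matrix([
    [1, 0, -1, 0],
    [2, 1,  0, 1],
    [3, 1, -1, 1],
])

B = X.nullspace()        # basis of N(X)
for b in B:
    print(b.T)           # Matrix([[1, -2, 1, 0]]) and Matrix([[0, -1, 0, 1]])
    print((X * b).T)     # Matrix([[0, 0, 0]]) in both cases
```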

\(X \theta = y\)




When will \(X \theta = y\) have a solution?


\(X \theta = y\)




\[ \begin{bmatrix} \big\vert & & \big\vert\\ x_1 & \cdots & x_n\\ \big\vert & & \big\vert \end{bmatrix} \begin{bmatrix} \theta_1\\ \vdots\\ \theta_n \end{bmatrix} = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]

\(X \theta = y\)




\[ \begin{bmatrix} \big\vert & & \big\vert\\ x_1 & \cdots & x_n\\ \big\vert & & \big\vert \end{bmatrix} \begin{bmatrix} \theta_1\\ \vdots\\ \theta_n \end{bmatrix} = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]




\[ \theta_1 x_1 + \cdots + \theta_n x_n = y \]

\(X \theta = y\)




\[ \begin{bmatrix} \big\vert & & \big\vert\\ x_1 & \cdots & x_n\\ \big\vert & & \big\vert \end{bmatrix} \begin{bmatrix} \theta_1\\ \vdots\\ \theta_n \end{bmatrix} = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]




\[ \theta_1 x_1 + \cdots + \theta_n x_n = y \]

\[ y \in C(X) \]

\(X \theta = y\)




When will \(X \theta = y\) have a solution?



\(y \in C(X)\)


\(X \theta = y\)



  • \(X\) is the data-matrix
  • \(m \times n\)
  • \(m\) data-points, \(n\) features

\(X \theta = y\)



  • \(X\) is the data-matrix
  • \(m \times n\)
  • \(m\) data-points, \(n\) features


Typical dataset:

  • \(m = 10000\)
  • \(n = 10\)
  • Can \(10\) vectors span \(\mathbb{R}^{10000}\)?

\(X \theta = y\)



  • \(X\) is the data-matrix
  • \(m \times n\)
  • \(m\) data-points, \(n\) features


Typical dataset:

  • \(m = 10000\)
  • \(n = 10\)
  • Can \(10\) vectors span \(\mathbb{R}^{10000}\)?



\(X \theta = y\) is generally unsolvable.


\(X \theta \approx y\)


\(X \theta \approx y\)

What does the \(\approx\) symbol mean?

  • \(1.234 \approx 1\)
  • \(1.234 \approx 1.2\)
  • \(1.234 \approx 1.23\)


\(X \theta \approx y\)



\[ \hat{y} \approx y \]

\(X \theta \approx y\)



\[ \hat{y} \approx y \]



\[ ||\hat{y} - y|| \]

\(X \theta \approx y\)



\[ \hat{y} \approx y \]



\[ ||\hat{y} - y||^2 \]

\(X \theta \approx y\)



\[ \hat{y} \approx y \]



\[ ||\hat{y} - y||^2 = (\hat{y}_1 - y_1)^2 + \cdots + (\hat{y}_m - y_m)^2 \]

\(X \theta \approx y\)



\[ X\hat{\theta} \approx y \]



\[ \hat{\theta} = \arg \min \limits_{\theta} ||X\theta - y||^2 \]
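
In NumPy this minimizer can be computed with `np.linalg.lstsq`; a minimal sketch on synthetic random data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 7
X = rng.random((m, n))    # synthetic design matrix
y = rng.random(m)         # synthetic targets

theta_hat, residual, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat.shape)    # (7,) -- the theta minimizing ||X theta - y||^2
print(residual)           # the minimized value ||X theta_hat - y||^2
```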

Loss




\[ \huge{L = ||X \theta - y||^2} \]


Loss




\[ \begin{aligned} L &= ||X \theta - y||^2 \end{aligned} \]


Loss




\[ \begin{aligned} L &= ||X \theta - y||^2\\\\ &= (X \theta - y)^T (X \theta - y) \end{aligned} \]


Optimization




\[ \begin{aligned} L &= (X \theta - y)^T (X \theta - y) \end{aligned} \]

Optimization




\[ \begin{aligned} L &= (X \theta - y)^T (X \theta - y) \end{aligned} \]



\[ \nabla_{\theta} L = \begin{bmatrix} \cfrac{\partial L}{\partial \theta_1}\\ \vdots\\ \cfrac{\partial L}{\partial \theta_n} \end{bmatrix} = 0 \]

Optimization




\[ \begin{aligned} L &= (X \theta - y)^T (X \theta - y) \end{aligned} \]




\[ \nabla_{\theta} L = 2(X^TX)\theta - 2X^Ty = 0 \]
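
A quick numerical sanity check of this gradient formula against central finite differences, on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 4
X, y = rng.random((m, n)), rng.random(m)
theta = rng.random(n)

def loss(t):
    r = X @ t - y
    return r @ r                                   # ||X t - y||^2

grad_formula = 2 * (X.T @ X) @ theta - 2 * X.T @ y

eps = 1e-6                                         # central finite differences
grad_numeric = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad_formula, grad_numeric, atol=1e-4))   # True
```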

Normal Equations




\[ \huge{(X^TX)\theta = X^Ty} \]


Solution




If \(X^TX\) is invertible, then:


\[ \hat{\theta} = (X^TX)^{-1} X^Ty \]
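
A sketch of the closed form on random full-rank data, cross-checked against `np.linalg.lstsq`; the normal equations are solved as a linear system rather than by forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 7
X, y = rng.random((m, n)), rng.random(m)           # random X has rank n almost surely

theta_normal = np.linalg.solve(X.T @ X, X.T @ y)   # solve (X^T X) theta = X^T y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))      # True
```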

When is \(X^TX\) invertible?


\(X^TX\)

  • \(n \times n\) matrix
  • Symmetric

\(X^TX\)




\[ N(X) = N(X^TX) \]

\(X^TX\)




\[ N(X) = N(X^TX) \]

Proof

If \(\theta \in N(X)\):

  • \(X\theta =0\)
  • \(X^TX \theta = 0\)

\(X^TX\)




\[ N(X) = N(X^TX) \]

Proof

If \(\theta \in N(X)\):

  • \(X\theta =0\)
  • \(X^TX \theta = 0\)

If \(\theta \in N(X^T X)\):

  • \(X^TX \theta = 0\)
  • \(\theta^T X^TX \theta = 0\)
  • \((X \theta)^T (X \theta) = 0\)
  • \(X \theta = 0\)

\(X^TX\)




If \(\text{rank}(X) = n\), \(X^TX\) is invertible

Proof

If \(\text{rank}(X) = n\):

  • \(\text{nullity}(X) = 0\)
  • \(\text{nullity}(X^TX) = 0\)
  • \(\text{rank}(X^TX) = n\)
  • \(X^TX\) is full rank, hence invertible

Geometry




What does a linear regression model look like?


Data


Linear model


Errors


Projections

\[ (X^TX)\hat{\theta} = X^Ty \]

\(X \hat{\theta}\) is an approximation for \(y\)



How are \(X \hat{\theta}\) and \(y\) related geometrically?

Projections

\[ (X^TX)\hat{\theta} = X^Ty \]

\(X \hat{\theta}\) is an approximation for \(y\)



How are \(X \hat{\theta}\) and \(y\) related geometrically?




\[ X = \begin{bmatrix} 2 & 6\\ 1 & 3 \end{bmatrix}, y = \begin{bmatrix} 3\\ 4 \end{bmatrix}, X \hat{\theta} = \begin{bmatrix} 4\\ 2 \end{bmatrix} \]

Projections




\[ X = \begin{bmatrix} 2 & 6\\ 1 & 3 \end{bmatrix}, y = \begin{bmatrix} 3\\ 4 \end{bmatrix}, X \hat{\theta} = \begin{bmatrix} 4\\ 2 \end{bmatrix} \]

Projections




\[ X = \begin{bmatrix} 2 & 6\\ 1 & 3 \end{bmatrix}, y = \begin{bmatrix} 3\\ 4 \end{bmatrix}, X \hat{\theta} = \begin{bmatrix} 4\\ 2 \end{bmatrix} \]

Projections




\[ e = y - X \hat{\theta} \]

\[ e \perp C(X) \]
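
This orthogonality can be checked directly on the small example above; `np.linalg.lstsq` is used because the two columns of this \(X\) are linearly dependent, so \(X^TX\) is not invertible:

```python
import numpy as np

X = np.array([[2.0, 6.0],
              [1.0, 3.0]])
y = np.array([3.0, 4.0])

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ theta_hat
e = y - y_hat

print(y_hat)       # [4. 2.] -- the projection of y onto C(X)
print(X.T @ e)     # ~[0. 0.] -- e is orthogonal to every column of X
```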

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]


\[ x_i^T e = 0 \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]


\[ x_i^T e = 0 \]
\[ X^T e = 0 \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]


\[ x_i^T e = 0 \]
\[ X^T e = 0 \]
\[ X^T(y - X\hat{\theta}) = 0 \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]


\[ x_i^T e = 0 \]
\[ X^T e = 0 \]
\[ X^T(y - X\hat{\theta}) = 0 \]
\[ X^TX \hat{\theta} = X^T y \]

Summary