MLF | Live Session | Week-3

Course Outline

Session Outline

Linear Algebra for ML

Linear Algebra for ML

Why should we study linear algebra in ML?

Linear Algebra for ML

Why should we study linear algebra in ML?



Data

Linear Algebra for ML

Housing dataset

Linear Algebra for ML

\(1\) house

Attributes / target values:

  • latitude: 12.9
  • longitude: 80.2
  • age: 3
  • rooms: 2
  • area: 1000
  • distance: 3
  • price: 40

Linear Algebra for ML

\(1\) house

Attributes / target values:

  • latitude: 12.9
  • longitude: 80.2
  • age: 3
  • rooms: 2
  • area: 1000
  • distance: 3
  • price: 40

\[ \begin{bmatrix} 12.9\\ 80.2\\ 3\\ 2\\ 1000\\ 3\\ \end{bmatrix} \]


Vector

Linear Algebra for ML

\(100\) houses?

Linear Algebra for ML

\(100\) houses

\[ \begin{bmatrix} 12.9 & 80.2 & 3 & 2 & 1000 & 3\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 14.3 & 75.9 & 30 & 2 & 1200 & 5\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 20.8 & 90.5 & 1 & 3 & 1500 & 2 \end{bmatrix} \]


\(100 \times 6\) matrix

Each row is a house
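
A minimal NumPy sketch of this representation; the three rows are the example houses shown in the matrix above:

```python
import numpy as np

# One house as a 6-dimensional vector:
# [latitude, longitude, age, rooms, area, distance]
house = np.array([12.9, 80.2, 3, 2, 1000, 3])
print(house.shape)   # (6,)

# A dataset: one row per house, one column per attribute
X = np.array([
    [12.9, 80.2,  3, 2, 1000, 3],
    [14.3, 75.9, 30, 2, 1200, 5],
    [20.8, 90.5,  1, 3, 1500, 2],
])
print(X.shape)       # (3, 6) -- with 100 houses this would be (100, 6)
```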

Regression

Given


latitude

longitude

age

num_of_rooms

area

distance_from_school

Predict




selling_price

Regression

Given



\[ \huge{x \in \mathbb{R}^{6}} \]

Predict



\[ \huge{y \in \mathbb{R}} \]

Regression

Given



\[ \huge{x \in \mathbb{R}^{n}} \]

Predict



\[ \huge{y \in \mathbb{R}} \]

Regression

Given



\[ \huge{x \in \mathbb{R}^{n}} \]


Feature-vector

Predict



\[ \huge{y \in \mathbb{R}} \]



Label

Model

\[ \huge{f: \mathbb{R}^{n}} \rightarrow \mathbb{R} \]


Model

\[ \huge{f: \mathbb{R}^{n}} \rightarrow \mathbb{R} \]

\[ \huge{f(x)=y} \]


Model

\[ \huge{f: \mathbb{R}^{n}} \rightarrow \mathbb{R} \]

\[ \huge{f(x)=y} \]



Learning a model?

Labeled Dataset

Data for \(m\) houses; each house is described by \(n\) features:



\[ X = \begin{bmatrix} \cdots & x_1 & \cdots\\ & \vdots &\\ \cdots & x_m & \cdots\\ \end{bmatrix} \]

\(m \times n\) data-matrix (feature matrix)

\[ y = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]



\(m \times 1\) label vector

Linear regression



\[ \large{\text{Selling-price} = 2 \times \text{Area} - 0.2 \times \text{Distance} + \text{Constant}} \]


Linear regression



\[ \begin{aligned} y &= \theta_0 + x_1 \theta_1 + x_2 \theta_2 + x_3\theta_3 + x_4 \theta_4 + x_5 \theta_5 + x_6 \theta_6 \\\\ \end{aligned} \]


Linear regression



\[ \begin{aligned} y &= \theta_0 + x_1 \theta_1 + x_2 \theta_2 + x_3\theta_3 + x_4 \theta_4 + x_5 \theta_5 + x_6 \theta_6 \\\\ &= \theta^T x \end{aligned} \]


Linear regression



\[ y = \theta^T x = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \theta_3 & \theta_4 & \theta_5 & \theta_6 \end{bmatrix}\begin{bmatrix} 1\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5\\ x_6 \end{bmatrix} \]


Linear regression



\[ \huge{f(x) = \theta^T x} \]



\(\large \theta\) is a vector of parameters (weights) of the model


Linear regression



\[ y = \theta^T x = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \theta_3 & \theta_4 & \theta_5 & \theta_6 \end{bmatrix}\begin{bmatrix} 1\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5\\ x_6 \end{bmatrix} \]


Linear regression



\[ y = x^T \theta = \begin{bmatrix} 1 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6 \end{bmatrix} \]


Linear regression



\[ \begin{bmatrix} y_{1}\\ \vdots\\ y_{100} \end{bmatrix}= \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & x_{1,5} & x_{1,6}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{100,1} & x_{100,2} & x_{100,3} & x_{100,4} & x_{100,5} & x_{100,6}\\ \end{bmatrix}\begin{bmatrix} \theta_0\\ \theta_1\\ \theta_2\\ \theta_3\\ \theta_4\\ \theta_5\\ \theta_6 \end{bmatrix} \]
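
A small NumPy sketch of this computation, using random stand-in data and placeholder parameters; the column of ones is prepended so that \(\theta_0\) acts as the constant term:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 6

X_raw = rng.random((m, n))                 # stand-in for the 100 x 6 housing data
X = np.hstack([np.ones((m, 1)), X_raw])    # prepend 1s: X is now 100 x 7

theta = rng.random(n + 1)                  # placeholder values for theta_0 ... theta_6
y_hat = X @ theta                          # all 100 predictions in one matrix-vector product
print(y_hat.shape)                         # (100,)
```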




Linear regression





\[ \huge{X \theta} = y \]


Linear regression





\[ \huge{X \theta} = y \]

Enter Linear Algebra


Setting



\[ \huge{X \theta} = y \]


\(X \theta = 0\)




\[ \huge{X \theta = 0} \]


\(X \theta = 0\)



If:

  • \(X \theta_1 = 0\)
  • \(X \theta_2 = 0\)

\(X \theta = 0\)



If:

  • \(X \theta_1 = 0\)
  • \(X \theta_2 = 0\)



Then:

  • \(X (\theta_1 + \theta_2) = X\theta_1 + X \theta_2 = 0\)
  • \(X(k \theta_1) = k(X \theta_1) = 0\)

\(X \theta = 0\)



The set of all solutions of \(X \theta = 0\) is \(\underline{\qquad\qquad}\).


\(X \theta = 0\)



The set of all solutions of \(X \theta = 0\) is \(N(X)\).


\(X \theta = 0\)



The set of all solutions of \(X \theta = 0\) is \(N(X)\).



Find a basis for \(N(X)\).


\(X \theta = 0\)



The set of all solutions of \(X \theta = 0\) is \(N(X)\).



Find a basis for \(N(X)\).



\(N(X) = \text{span}(B)\)


\(X \theta = 0\)




\[ \huge{X \theta = 0} \]


Gaussian elimination


Row-operations

Example




\[ X = \begin{bmatrix} 1 & 0 & -1 & 0\\ 2 & 1 & 0 & 1\\ 3 & 1 & -1 & 1 \end{bmatrix} \]


Example: Step-1




\[ \begin{bmatrix} 1 & 0 & -1 & 0\\ 2 & 1 & 0 & 1\\ 3 & 1 & -1 & 1 \end{bmatrix} \underrightarrow{R_2 \rightarrow R_2-2R_1} \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 3 & 1 & -1 & 1 \end{bmatrix} \]


Example: Step-2




\[ \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 3 & 1 & -1 & 1 \end{bmatrix} \underrightarrow{R_3 \rightarrow R_3-3R_1} \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 0 & 1 & 2 & 1 \end{bmatrix} \]


Example: Step-3




\[ \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 0 & 1 & 2 & 1 \end{bmatrix} \underrightarrow{R_3 \rightarrow R_3 - R_2} \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix} \]


Row-echelon matrix




\[ \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix} \]


Row-echelon matrix




\[ \begin{bmatrix} \color{blue}{1} & 0 & -1 & 0\\ 0 & \color{blue}{1} & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix} \]


Solve




\[ \begin{bmatrix} \color{blue}{1} & 0 & -1 & 0\\ 0 & \color{blue}{1} & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} \color{blue}{\theta_1}\\ \color{blue}{\theta_2}\\ \theta_3\\ \theta_4 \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix} \]


Algorithm

\(B = \{ \}\)

For each independent (free) variable \(\theta_i\):

  • Set \(\theta_i = 1\) and \(\theta_j = 0\) for every other independent variable \(j \neq i\)
  • Solve for the dependent variables
  • Add the resulting \(\theta\) to \(B\)


Solved



\[ \begin{bmatrix} \color{blue}{1} & 0 & -1 & 0\\ 0 & \color{blue}{1} & 2 & 1\\ 0 & 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} \color{blue}{\theta_1}\\ \color{blue}{\theta_2}\\ \theta_3\\ \theta_4 \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix} \]



\[ B = \left \{ \begin{bmatrix}1\\ -2\\ 1\\ 0\end{bmatrix}, \begin{bmatrix}0\\ -1\\ 0\\ 1\end{bmatrix} \right \} \]
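
As a quick check, the same basis can be recovered with SymPy's `nullspace`, which follows the same free-variable procedure (a verification sketch, not part of the derivation):

```python
import sympy as sp

X = sp.Matrix([
    [1, 0, -1, 0],
    [2, 1,  0, 1],
    [3, 1, -1, 1],
])

B = X.nullspace()        # basis of N(X)
for b in B:
    print(b.T)           # Matrix([[1, -2, 1, 0]]) and Matrix([[0, -1, 0, 1]])
    print((X * b).T)     # Matrix([[0, 0, 0]]) in both cases
```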

\(X \theta = y\)




When will \(X \theta = y\) have a solution?


\(X \theta = y\)




\[ \begin{bmatrix} \big\vert & & \big\vert\\ x_1 & \cdots & x_n\\ \big\vert & & \big\vert \end{bmatrix} \begin{bmatrix} \theta_1\\ \vdots\\ \theta_n \end{bmatrix} = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]

\(X \theta = y\)




\[ \begin{bmatrix} \big\vert & & \big\vert\\ x_1 & \cdots & x_n\\ \big\vert & & \big\vert \end{bmatrix} \begin{bmatrix} \theta_1\\ \vdots\\ \theta_n \end{bmatrix} = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]




\[ \theta_1 x_1 + \cdots + \theta_n x_n = y \]

\(X \theta = y\)




\[ \begin{bmatrix} \big\vert & & \big\vert\\ x_1 & \cdots & x_n\\ \big\vert & & \big\vert \end{bmatrix} \begin{bmatrix} \theta_1\\ \vdots\\ \theta_n \end{bmatrix} = \begin{bmatrix} y_1\\ \vdots\\ y_m \end{bmatrix} \]




\[ \theta_1 x_1 + \cdots + \theta_n x_n = y \]

\[ y \in C(X) \]

\(X \theta = y\)




When will \(X \theta = y\) have a solution?



\(y \in C(X)\)


\(X \theta = y\)



  • \(X\) is the data-matrix
  • \(m \times n\)
  • \(m\) data-points, \(n\) features

\(X \theta = y\)



  • \(X\) is the data-matrix
  • \(m \times n\)
  • \(m\) data-points, \(n\) features


Typical dataset:

  • \(m = 10000\)
  • \(n = 10\)
  • Can \(10\) vectors span \(\mathbb{R}^{10000}\)?

\(X \theta = y\)



  • \(X\) is the data-matrix
  • \(m \times n\)
  • \(m\) data-points, \(n\) features


Typical dataset:

  • \(m = 10000\)
  • \(n = 10\)
  • Can \(10\) vectors span \(\mathbb{R}^{10000}\)?



\(X \theta = y\) is generally unsolvable.


\(X \theta \approx y\)


\(X \theta \approx y\)

What does the \(\approx\) symbol mean?

  • \(1.234 \approx 1\)
  • \(1.234 \approx 1.2\)
  • \(1.234 \approx 1.23\)


\(X \theta \approx y\)



\[ \hat{y} \approx y \]

\(X \theta \approx y\)



\[ \hat{y} \approx y \]



\[ ||\hat{y} - y|| \]

\(X \theta \approx y\)



\[ \hat{y} \approx y \]



\[ ||\hat{y} - y||^2 \]

\(X \theta \approx y\)



\[ \hat{y} \approx y \]



\[ ||\hat{y} - y||^2 = (\hat{y}_1 - y_1)^2 + \cdots + (\hat{y}_m - y_m)^2 \]

\(X \theta \approx y\)



\[ X\hat{\theta} \approx y \]



\[ \hat{\theta} = \arg \min \limits_{\theta} ||X\theta - y||^2 \]
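
In NumPy this minimizer can be computed with `np.linalg.lstsq`; a minimal sketch on synthetic random data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 7
X = rng.random((m, n))    # synthetic design matrix
y = rng.random(m)         # synthetic targets

theta_hat, residual, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat.shape)    # (7,) -- the theta minimizing ||X theta - y||^2
print(residual)           # the minimized value ||X theta_hat - y||^2
```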

Loss




\[ \huge{L = ||X \theta - y||^2} \]


Loss




\[ \begin{aligned} L &= ||X \theta - y||^2 \end{aligned} \]


Loss




\[ \begin{aligned} L &= ||X \theta - y||^2\\\\ &= (X \theta - y)^T (X \theta - y) \end{aligned} \]


Optimization




\[ \begin{aligned} L &= (X \theta - y)^T (X \theta - y) \end{aligned} \]

Optimization




\[ \begin{aligned} L &= (X \theta - y)^T (X \theta - y) \end{aligned} \]



\[ \nabla_{\theta} L = \begin{bmatrix} \cfrac{\partial L}{\partial \theta_1}\\ \vdots\\ \cfrac{\partial L}{\partial \theta_n} \end{bmatrix} = 0 \]

Optimization




\[ \begin{aligned} L &= (X \theta - y)^T (X \theta - y) \end{aligned} \]




\[ \nabla_{\theta} L = 2(X^TX)\theta - 2X^Ty = 0 \]
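
A quick numerical sanity check of this gradient formula against central finite differences, on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 4
X, y = rng.random((m, n)), rng.random(m)
theta = rng.random(n)

def loss(t):
    r = X @ t - y
    return r @ r                                   # ||X t - y||^2

grad_formula = 2 * (X.T @ X) @ theta - 2 * X.T @ y

eps = 1e-6                                         # central finite differences
grad_numeric = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad_formula, grad_numeric, atol=1e-4))   # True
```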

Normal Equations




\[ \huge{(X^TX)\theta = X^Ty} \]


Solution




If \(X^TX\) is invertible, then:


\[ \hat{\theta} = (X^TX)^{-1} X^Ty \]
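
A sketch of the closed form on random full-rank data, cross-checked against `np.linalg.lstsq`; the normal equations are solved as a linear system rather than by forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 7
X, y = rng.random((m, n)), rng.random(m)           # random X has rank n almost surely

theta_normal = np.linalg.solve(X.T @ X, X.T @ y)   # solve (X^T X) theta = X^T y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))      # True
```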

When is \(X^TX\) invertible?


\(X^TX\)

  • \(n \times n\) matrix
  • Symmetric

\(X^TX\)




\[ N(X) = N(X^TX) \]

\(X^TX\)




\[ N(X) = N(X^TX) \]

Proof

If \(\theta \in N(X)\):

  • \(X\theta =0\)
  • \(X^TX \theta = 0\)

\(X^TX\)




\[ N(X) = N(X^TX) \]

Proof

If \(\theta \in N(X)\):

  • \(X\theta =0\)
  • \(X^TX \theta = 0\)

If \(\theta \in N(X^T X)\):

  • \(X^TX \theta = 0\)
  • \(\theta^T X^TX \theta = 0\)
  • \((X \theta)^T (X \theta) = 0\)
  • \(X \theta = 0\)

\(X^TX\)




If \(\text{rank}(X) = n\), \(X^TX\) is invertible

Proof

If \(\text{rank}(X) = n\):

  • \(\text{nullity}(X) = 0\)
  • \(\text{nullity}(X^TX) = 0\)
  • \(\text{rank}(X^TX) = n\)
  • \(X^TX\) is full rank, hence invertible

Geometry




What does a linear regression model look like?


Data


Linear model


Errors


Projections

\[ (X^TX)\hat{\theta} = X^Ty \]

\(X \hat{\theta}\) is an approximation for \(y\)



How are \(X \hat{\theta}\) and \(y\) related geometrically?

Projections

\[ (X^TX)\hat{\theta} = X^Ty \]

\(X \hat{\theta}\) is an approximation for \(y\)



How are \(X \hat{\theta}\) and \(y\) related geometrically?




\[ X = \begin{bmatrix} 2 & 6\\ 1 & 3 \end{bmatrix}, y = \begin{bmatrix} 3\\ 4 \end{bmatrix}, X \hat{\theta} = \begin{bmatrix} 4\\ 2 \end{bmatrix} \]

Projections




\[ X = \begin{bmatrix} 2 & 6\\ 1 & 3 \end{bmatrix}, y = \begin{bmatrix} 3\\ 4 \end{bmatrix}, X \hat{\theta} = \begin{bmatrix} 4\\ 2 \end{bmatrix} \]

Projections




\[ X = \begin{bmatrix} 2 & 6\\ 1 & 3 \end{bmatrix}, y = \begin{bmatrix} 3\\ 4 \end{bmatrix}, X \hat{\theta} = \begin{bmatrix} 4\\ 2 \end{bmatrix} \]

Projections




\[ e = y - X \hat{\theta} \]

\[ e \perp C(X) \]
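
This orthogonality can be checked directly on the small example above; `np.linalg.lstsq` is used because the two columns of this \(X\) are linearly dependent, so \(X^TX\) is not invertible:

```python
import numpy as np

X = np.array([[2.0, 6.0],
              [1.0, 3.0]])
y = np.array([3.0, 4.0])

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ theta_hat
e = y - y_hat

print(y_hat)       # [4. 2.] -- the projection of y onto C(X)
print(X.T @ e)     # ~[0. 0.] -- e is orthogonal to every column of X
```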

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]


\[ x_i^T e = 0 \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]


\[ x_i^T e = 0 \]
\[ X^T e = 0 \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]


\[ x_i^T e = 0 \]
\[ X^T e = 0 \]
\[ X^T(y - X\hat{\theta}) = 0 \]

Projections



\[ X = \begin{bmatrix} \vert & & \vert\\ x_1 & \cdots & x_n\\ \vert & & \vert \end{bmatrix} \]

\[ e \perp x_i \]


\[ x_i^T e = 0 \]
\[ X^T e = 0 \]
\[ X^T(y - X\hat{\theta}) = 0 \]
\[ X^TX \hat{\theta} = X^T y \]

Summary