Sunday, 5 February 2017

Machine Learning Cheat Sheet Part 4 - Polynomial Regression and Normal Equation

1. Polynomial Regression

The hypothesis function does not have to be linear (a straight line) if a straight line does not fit the data well. We can change the behaviour or shape of the curve by making the hypothesis a quadratic, cubic or square-root function (or any other form).

We can also combine existing features into one. For example, combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_3 = x_1 \cdot x_2$.

If the hypothesis function is:

$ h_\theta (x) = \theta_0 + \theta_1 x_1 $

then we can create additional features based just on $x_1$ to get a quadratic function:

$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 $

or a cubic function:

$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 $

where new features are:

$ x_2 = x_1^2 $ and $ x_3 = x_1^3 $, so the function becomes:

$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 $

NOTE: if features are created this way, then feature scaling becomes very important, because the ranges of $x_1$, $x_1^2$ and $x_1^3$ differ dramatically.
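A minimal Octave sketch of building such features (the data values and variable names below are purely illustrative):

x1 = [1; 2; 3; 4; 5];                   % original feature

% new features x2 = x1.^2 and x3 = x1.^3
X_poly = [x1, x1.^2, x1.^3];

% feature scaling (mean normalisation) matters here because the three
% columns have very different ranges (1..5, 1..25, 1..125)
mu     = mean(X_poly);
sigma  = std(X_poly);
X_norm = (X_poly - mu) ./ sigma;        % relies on Octave's automatic broadcasting

% add the intercept column for theta_0 to get the final design matrix
X = [ones(size(X_norm, 1), 1), X_norm];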


2. Normal Equation

The Normal Equation is a method to solve for $\theta$ analytically, rather than iteratively with gradient descent. It minimises $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero, which gives the optimal $\theta$ in one step, without iteration.

$m$ - number of training examples
$n$ - number of features

The Normal Equation formula is:

$ \theta = (X^{T}X)^{-1}X^{T}y $
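A minimal Octave sketch of the formula in action (the numbers are made up for illustration; $X$ already contains the column of ones for the intercept):

X = [1 2104; 1 1416; 1 1534; 1 852];   % design matrix with intercept column
y = [460; 232; 315; 178];              % target values

% theta = (X'X)^(-1) X'y; pinv is used instead of inv (see the note below)
theta = pinv(X' * X) * X' * y;

% predictions for the training set
predictions = X * theta;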

Drawbacks:

1. The Normal Equation computes the matrix inverse $ (X^{T}X)^{-1} $, which has complexity $O(n^3)$ - this makes it slow when the number of features is very large. In practice, when $n$ exceeds 10,000 it might be a good idea to switch from the Normal Equation to iterative Gradient Descent.

2. $X^{T}X$ could be non-invertible (singular, degenerate). There are two ways to handle such cases:

1) In Octave, use pinv(X'*X)*X'*y rather than the inv function, as pinv can still compute a value of $\theta$ even when $X^{T}X$ is non-invertible.

2) Address the common causes of non-invertibility:

  • Redundant features that are closely related (i.e. they are linearly dependent), or
  • Too many features (e.g. $m\leqslant n$).

Possible solutions are:

  • Deleting a feature that is linearly dependent on another feature, or
  • Deleting some features, or using regularisation, when there are too many features.


Comparison of Gradient Descent and Normal Equation

Gradient Descent:
  • Need to choose $\alpha$
  • Might need feature scaling
  • Needs many iterations
  • Complexity $O(kn^2)$ for $k$ iterations
  • Works well even for $n>10,000$

Normal Equation:
  • No need to choose $\alpha$
  • No need for feature scaling
  • No need to iterate
  • Complexity $O(n^3)$ to calculate the inverse of $X^{T}X$
  • Works well for $n\leqslant10,000$
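A small Octave sketch of the contrast (made-up, already-scaled data; the learning rate and iteration count are illustrative choices) - gradient descent needs $\alpha$ and many iterations, the Normal Equation computes $\theta$ in one closed-form step, and the two results should agree to within numerical precision:

X = [ones(4, 1), [-0.5; -0.2; 0.1; 0.6]];   % intercept + one scaled feature
y = [1.0; 1.6; 2.3; 3.4];
m = length(y);

% Gradient Descent: choose alpha and iterate
alpha    = 0.1;
theta_gd = zeros(2, 1);
for iter = 1:1500
  theta_gd = theta_gd - (alpha / m) * (X' * (X * theta_gd - y));
end

% Normal Equation: single closed-form calculation, no alpha, no iterations
theta_ne = pinv(X' * X) * X' * y;

disp([theta_gd, theta_ne]);   % the two columns should be nearly identical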
