
Sunday, 5 February 2017

Machine Learning Cheat Sheet Part 4 - Polynomial Regression and Normal Equation

1. Polynomial Regression

The hypothesis function does not need to be linear (a straight line) if a straight line does not fit the data well. It's possible to change the behaviour of the curve of the hypothesis function by making it a quadratic, cubic or square root function (or any other form).

We can also combine existing features into one. For example, combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_3 = x_1 \cdot x_2$.
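A minimal Octave sketch of combining two features (the matrix values are illustrative, not from the original post):

```octave
% X is an m-by-2 matrix whose columns are the features x1 and x2
% (illustrative values).
X = [1 2; 3 4; 5 6];

% New feature x3 = x1 .* x2 (element-wise product), appended as a column.
x3 = X(:,1) .* X(:,2);
X  = [X x3];
```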

If the hypothesis function is:

$h_\theta(x) = \theta_0 + \theta_1 x_1$

then we can create additional features based just on x1 to get a quadratic function:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$

or the cubic function:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$

where new features are:

$x_2 = x_1^2$ and $x_3 = x_1^3$, so the function becomes:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$

NOTE: if features are chosen this way then feature scaling becomes very important, because the ranges grow with the power: if $x_1$ has range 1–1,000 then $x_1^2$ has range 1–1,000,000 and $x_1^3$ has range 1–$10^9$.
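A minimal Octave sketch of both steps, assuming a single original feature $x_1$ (the sample values are made up for illustration):

```octave
% x is an m-by-1 column of the original feature x1 (illustrative values).
x = [1; 50; 100; 500; 1000];

% Build polynomial features: x2 = x1.^2, x3 = x1.^3.
X = [x x.^2 x.^3];

% Without scaling, the columns span wildly different ranges
% (here up to 1e9), so apply mean normalisation per column.
mu    = mean(X);
sigma = std(X);
X_norm = (X - mu) ./ sigma;   % implicit broadcasting (Octave >= 3.6)
```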


2. Normal Equation

The Normal Equation is a method to solve for $\theta$ analytically rather than using iterations with gradient descent. The Normal Equation method minimises $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This approach finds the optimum $\theta$ without iteration.

m - number of training examples
n - number of features

The Normal Equation formula is:

$\theta = (X^T X)^{-1} X^T y$
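A minimal Octave sketch of this formula on synthetic data (the generating line $y = 4 + 3x$ and the noise are purely illustrative):

```octave
% Synthetic data: y = 4 + 3*x plus noise (illustrative).
m = 100;
x = 10 * rand(m, 1);
y = 4 + 3 * x + randn(m, 1);

% Design matrix with a leading column of ones for theta_0.
X = [ones(m, 1) x];

% Normal Equation: theta = (X'X)^(-1) X'y in one step, no iterations.
theta = pinv(X' * X) * X' * y;   % theta should come out near [4; 3]
```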

Drawbacks:

1. The Normal Equation calculates the matrix inverse $(X^T X)^{-1}$, which has complexity $O(n^3)$; this makes it slow when the number of features is very large. In practice, when $n$ exceeds 10,000 it might be a good idea to switch from the Normal Equation to iterations with Gradient Descent.

2. $X^T X$ could be non-invertible (singular, degenerate). There are two things to note for such cases:

1) In Octave, use pinv(X'*X)*X'*y rather than the inv function, as pinv may still be able to calculate a value of $\theta$ even when $X^T X$ is non-invertible (see the sketch after this list).

2) Common causes of non-invertibility are:

  • Redundant features that are closely related (i.e. they are linearly dependent), or
  • Too many features (e.g. $m \le n$).

Solutions to these problems could be:

  • Deleting a feature that is linearly dependent on another feature, or
  • Deleting some features, or using regularisation, when there are too many features.
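A minimal Octave sketch of the pinv workaround, with a deliberately redundant feature so that $X^T X$ is singular (the values are illustrative):

```octave
% Two linearly dependent features: the third column is just 2*x1,
% so X'X is singular and inv(X'X) would fail.
x1 = (1:5)';
X  = [ones(5,1) x1 2*x1];
y  = [2; 4; 6; 8; 10];

% pinv uses the Moore-Penrose pseudo-inverse and still
% returns a (minimum-norm) solution for theta.
theta = pinv(X' * X) * X' * y;
```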


Comparison of Gradient Descent and Normal Equation

| Gradient Descent             | Normal Equation                              |
| ---------------------------- | -------------------------------------------- |
| Need to choose $\alpha$      | No need to choose $\alpha$                   |
| Might need feature scaling   | No need for feature scaling                  |
| Needs many iterations        | No need to iterate                           |
| $O(kn^2)$                    | $O(n^3)$ to calculate the inverse of $X^T X$ |
| Works well for n > 10,000    | Works well for n ≤ 10,000                    |
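A minimal Octave sketch contrasting the two methods on the same data; the learning rate and iteration count are illustrative choices, not prescriptions:

```octave
% Same data for both methods (illustrative values): y = 1 + 2*x.
m = 50;
x = linspace(0, 5, m)';
y = 1 + 2 * x;
X = [ones(m, 1) x];

% Gradient Descent: repeat the vectorised update many times.
alpha = 0.05;                 % learning rate (must be chosen)
theta_gd = zeros(2, 1);
for iter = 1:2000
  theta_gd = theta_gd - (alpha / m) * (X' * (X * theta_gd - y));
end

% Normal Equation: one matrix computation, no alpha, no iterations.
theta_ne = pinv(X' * X) * X' * y;
```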
