The hypothesis function does not need to be linear (a straight line) if it does not fit the data well. It's possible to change the behaviour of the curve of the hypothesis function by making it a quadratic, cubic or square root function (or any other form).
We can also combine existing features into one. For example, combine x1 and x2 into a new feature x3 by taking x1×x2.
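As a quick sketch of combining two features, here is a hypothetical example where x1 is a lot's frontage and x2 its depth, and the combined feature x3 = x1×x2 is the lot area (the variable names and values are made up for illustration):

```python
# Hypothetical data: x1 = frontage, x2 = depth of each lot.
frontage = [10.0, 15.0, 20.0]
depth = [30.0, 40.0, 25.0]

# New feature x3 = x1 * x2 (lot area).
area = [f * d for f, d in zip(frontage, depth)]
print(area)  # [300.0, 600.0, 500.0]
```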
If the hypothesis function is:
hθ(x) = θ0 + θ1x1
then we can create additional features based just on x1 to get a quadratic function:
hθ(x) = θ0 + θ1x1 + θ2x1²
or the cubic function:
hθ(x) = θ0 + θ1x1 + θ2x1² + θ3x1³
where new features are:
x2 = x1² and x3 = x1³, so the function becomes:
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3
NOTE: if features are chosen this way then feature scaling becomes very important.
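A small sketch (with made-up values) of why scaling matters here: creating x2 = x1² and x3 = x1³ from a single feature x1 makes the feature ranges diverge by orders of magnitude.

```python
# Made-up values of x1; the derived features x1^2 and x1^3
# quickly span very different ranges (1e2 vs 1e4 vs 1e6),
# which is why feature scaling becomes very important.
xs = [1.0, 10.0, 100.0]
features = [(x, x ** 2, x ** 3) for x in xs]
print(features[2])  # (100.0, 10000.0, 1000000.0)
```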
2. Normal Equation
The Normal Equation is a method to solve for θ analytically rather than iteratively with gradient descent. It minimises J by explicitly taking its derivatives with respect to the θj's and setting them to zero. This approach finds the optimal θ without iteration.
m - number of training examples
n - number of features
The Normal Equation formula is:
θ = (XᵀX)⁻¹Xᵀy
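A minimal sketch of the formula in NumPy, on a tiny made-up dataset generated from y = 1 + 2x (the first column of X is the intercept term x0 = 1):

```python
import numpy as np

# Design matrix: first column is x0 = 1 (intercept), second is x1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])  # y = 1 + 2*x1

# theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # [1. 2.]
```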
Drawbacks:
1. The Normal Equation computes the matrix inverse (XᵀX)⁻¹, which has complexity O(n³). This makes it slow when the number of features is very large. In practice, once n exceeds 10,000 it might be a good idea to switch from the Normal Equation to iterative Gradient Descent.
2. XᵀX could be non-invertible (singular, degenerate). There are two things to note for such cases:
1) In Octave, use pinv(X'*X)*X'*y rather than the inv function, as pinv might still be able to calculate θ even if XᵀX is non-invertible.
2) Common causes for non-invertibility could be:
- Redundant features that are closely related (i.e. they are linearly dependent), or
- There are too many features (e.g. m ≤ n).
Solutions to these problems could be:
- Deleting a feature that is linearly dependent on another feature, or
- Deleting some features or using regularisation when there are too many features.
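A sketch of the pinv workaround in NumPy (np.linalg.pinv is analogous to Octave's pinv), with a deliberately redundant third feature that makes XᵀX singular; the values are made up for illustration:

```python
import numpy as np

# Third column is 2x the second, so the features are linearly
# dependent and X^T X is singular; np.linalg.inv would fail here.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([2.0, 3.0, 4.0])

# pinv returns a (minimum-norm) solution despite the singularity.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(X @ theta)  # predictions still match y
```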
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose α | No need to choose α |
| Might need feature scaling | No need for feature scaling |
| Needs many iterations | No need to iterate |
| O(kn²) | O(n³) to calculate the inverse of XᵀX |
| Works well for n > 10,000 | Works well for n ≤ 10,000 |