The hypothesis function does not need to be linear (a straight line) if it does not fit the data well. It's possible to change the behaviour of the curve of the hypothesis function by making it a quadratic, cubic or square root function (or any other form).
We can also combine existing features into one. For example, combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \times x_2$.
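As a minimal sketch in Octave (the column vectors `x1` and `x2` below are made-up placeholders), the combined feature is just the element-wise product of the two columns:

```octave
% hypothetical m x 1 feature columns
x1 = [1; 2; 3];
x2 = [4; 5; 6];

% new combined feature x3 = x1 .* x2
x3 = x1 .* x2;
```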
If the hypothesis function is:
$ h_\theta (x) = \theta_0 + \theta_1 x_1 $
then we can create additional features based just on $x_1$ to get a quadratic function:
$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 $
or the cubic function:
$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 $
where the new features are:
$ x_2 = x_1^2 $ and $ x_3 = x_1^3 $, so the function becomes:
$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 $
NOTE: if features are chosen this way then feature scaling becomes very important.
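As a sketch of both steps in Octave (the values in `x1` are made up for illustration), the cubic features can be built and then mean-normalised before adding the intercept column:

```octave
% x1 is a hypothetical m x 1 column of the original feature
x1 = [1; 2; 3; 4; 5];

% new features: x2 = x1.^2 and x3 = x1.^3
X_poly = [x1, x1.^2, x1.^3];

% feature scaling (mean normalisation) - important here because
% x1, x1^2 and x1^3 live on very different scales
mu    = mean(X_poly);
sigma = std(X_poly);
X_scaled = (X_poly - mu) ./ sigma;

% add the intercept column for theta_0
X = [ones(size(x1, 1), 1), X_scaled];
```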
2. Normal Equation
The Normal Equation is a method for solving for $\theta$ analytically rather than iteratively with gradient descent. It minimises $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This approach finds the optimum $\theta$ without iteration.
$m$ - number of training examples
$n$ - number of features
The Normal Equation formula is:
$ \theta = (X^{T}X)^{-1}X^Ty $
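A minimal Octave sketch of this formula, using a small made-up design matrix `X` (with a leading column of ones) and target vector `y`:

```octave
% hypothetical training data: X is m x (n+1), y is m x 1
X = [1 1; 1 2; 1 3; 1 4];
y = [2; 4; 6; 8];

% theta = (X'X)^(-1) X'y
% (in practice pinv is usually preferred over inv; see below)
theta = inv(X' * X) * X' * y;
```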
Drawbacks:
1. The Normal Equation computes the matrix inverse $ (X^{T}X)^{-1} $, which has complexity $O(n^3)$ - this makes it slow when the number of features is very large. In practice, when $n$ exceeds 10,000 it might be a good idea to switch from the Normal Equation to iterative Gradient Descent.
2. $X^{T}X$ could be non-invertible (singular, degenerate). There are two possible solutions for such cases:
1) In Octave, use `pinv(X'*X)*X'*y` rather than the `inv` function, as `pinv` might still be able to calculate $\theta$ even if $X^{T}X$ is non-invertible (see the sketch after this list).
2) Common causes of non-invertibility are:
- Redundant features that are closely related (i.e. they are linearly dependent), or
- There are too many features (e.g. $m\leqslant n$).
Solutions to these problems could be:
- Deleting a feature that is linearly dependent on another feature, or
- Deleting some features or using regularisation when there are too many features.
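To illustrate the redundant-feature case, here is a small Octave sketch (the data is made up): the third column of `X` is exactly twice the second, so $X^{T}X$ is singular, but `pinv` still returns a usable $\theta$.

```octave
% hypothetical design matrix whose third column is 2x the second
X = [1 1 2; 1 2 4; 1 3 6; 1 4 8];
y = [1; 2; 3; 4];

rank(X' * X)                 % 2 < 3, so X'X is singular
% inv(X' * X)                % would warn: matrix singular to machine precision

% pinv returns the minimum-norm least-squares solution
theta = pinv(X' * X) * X' * y;
```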
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Might need feature scaling | No need for feature scaling |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$ to calculate the inverse of $X^{T}X$ |
| Works well even for $n>10,000$ | Works well for $n\leqslant10,000$ |