Saturday, 4 February 2017

Machine Learning Cheat Sheet Part 3 - Linear Regression with Multiple Variables

1. Notation:

$ m $ - number of training examples.

$ n = \vert x^{(i)} \vert $ - number of features.

$ x^{(i)} $ - column vector of all the feature inputs of the $ i^{th} $ training example.

$ x_{j}^{(i)} $ - value of feature $ j $ in the $ i^{th} $ training example.

$ \alpha $ - learning rate.


2. Hypothesis:

$ h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n $

or
$ h_\theta(x) =  \left[ \begin{array}{cc} \theta_0 & \theta_1 & ... & \theta_n \end{array} \right] \times \left[ \begin{array}{cc} x_0 \\ x_1 \\ ... \\ x_n \end{array} \right] = \Theta^{T}X $

where $ x_0^{(i)} = 1 $ for $ (i \in 1, ..., m) $
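A minimal NumPy sketch of the vectorized hypothesis (the names `hypothesis`, `theta` and `X` are illustrative; `X` is assumed to be the $ m \times (n+1) $ design matrix that already contains the $ x_0 = 1 $ column):

    import numpy as np

    def hypothesis(theta, X):
        # theta: parameter vector of shape (n + 1,)
        # X: design matrix of shape (m, n + 1), first column all ones (x_0 = 1)
        # Computes h_theta(x) = theta^T x for every training example at once.
        return X @ theta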


3. Parameters:

$ \theta_0, \theta_1, ..., \theta_n $ or $ \Theta $


4. Cost function:

\[
J(\theta_0, \theta_1, ..., \theta_n) = J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
\]
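A hedged sketch of this cost function in NumPy (same assumed `X` and `theta` as above, plus a target vector `y` of shape $ (m,) $):

    import numpy as np

    def compute_cost(theta, X, y):
        # J(theta) = 1/(2m) * sum((h_theta(x^(i)) - y^(i))^2)
        m = len(y)
        errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i
        return (errors @ errors) / (2 * m)
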
5. Gradient descent:

repeat until convergence:
\[
\{ \\
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \times x_0^{(i)}
\\
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \times x_1^{(i)}
\\
\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \times x_2^{(i)}
\\
...
\\
\}
\]
or, in general, for every $ j \in 0, ..., n $ (all $ \theta_j $ are updated simultaneously):
\[
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \times x_j^{(i)}
\]
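A minimal sketch of batch gradient descent, using the vectorized form of the update above; the learning rate and iteration count defaults are illustrative only:

    import numpy as np

    def gradient_descent(X, y, alpha=0.01, num_iters=1500):
        # Updates all theta_j simultaneously in one vectorized step:
        # theta := theta - (alpha / m) * X^T (X theta - y)
        m, n_plus_1 = X.shape
        theta = np.zeros(n_plus_1)
        for _ in range(num_iters):
            errors = X @ theta - y                  # shape (m,)
            theta = theta - (alpha / m) * (X.T @ errors)
        return theta
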
6. Feature scaling:

Feature scaling is applied to the input (training) set. Features may have significant differences in their ranges of values. For example, $ x_1 $ may have values from 100 to 10,000 (e.g. the area of a house in $ feet^2 $) while $ x_2 $ may have values from 1 to 6 (e.g. the number of bedrooms). It is a good idea to scale the feature values so that:
\[
-1 \leqslant x_i \leqslant 1 \quad \textrm{or} \quad -0.5 \leqslant x_i \leqslant 0.5
\]
In practice, feature scaling is often combined with mean normalisation:
\[
x_i = \frac{x_i - \mu_i}{s_i}
\]
where $\mu_i$ is the mean of all the values of feature $x_i$ in the training set and $s_i$ is either the range of values $(\max - \min)$ of $x_i$ or its standard deviation. Note that dividing by the range and dividing by the standard deviation give different results.
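A sketch of mean normalisation for the raw feature columns (applied before the $ x_0 = 1 $ column is added; here $ s_i $ is taken to be the standard deviation):

    import numpy as np

    def normalize_features(X):
        # x_i := (x_i - mu_i) / s_i, computed per feature column.
        # mu and sigma are returned so the same scaling can be applied
        # to new examples at prediction time.
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        return (X - mu) / sigma, mu, sigma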


7. Debugging gradient descent and choosing the learning rate $\alpha$:

7.1 Value of $J(\theta)$ and number of iterations:

Make a plot of the value of the cost function $J(\theta)$ (y-axis) against the number of iterations of gradient descent (x-axis). If $J(\theta)$ ever increases, decrease $\alpha$ and try again.
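A small plotting sketch; `J_history` is assumed to be the list of cost values recorded once per iteration inside the gradient descent loop (e.g. by appending `compute_cost(theta, X, y)` after each update):

    import matplotlib.pyplot as plt

    def plot_cost_history(J_history):
        # J(theta) should decrease on every iteration; if the curve
        # ever goes up, the learning rate alpha is too large.
        plt.plot(range(1, len(J_history) + 1), J_history)
        plt.xlabel("Number of iterations")
        plt.ylabel(r"$J(\theta)$")
        plt.show()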

7.2 Automatic convergence test:

It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on each iteration. Declare convergence if $J(\theta)$ decreases by less than $\varepsilon$ in one iteration, where $\varepsilon$ is some small value such as $10^{-3}$. However, in practice it can be difficult to choose this threshold value.
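One possible sketch of such an automatic convergence test, folded into the descent loop (function name, $\varepsilon$ and iteration cap are illustrative assumptions):

    import numpy as np

    def gradient_descent_with_convergence(X, y, alpha=0.01,
                                          epsilon=1e-3, max_iters=10000):
        # Stop when J(theta) decreases by less than epsilon in one iteration.
        m = len(y)
        theta = np.zeros(X.shape[1])
        prev_cost = float("inf")
        for _ in range(max_iters):
            theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
            cost = ((X @ theta - y) ** 2).sum() / (2 * m)
            if prev_cost - cost < epsilon:
                break                   # declare convergence
            prev_cost = cost
        return theta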

7.3 Trade-off:

If $\alpha$ is too small, convergence can be slow.
If $\alpha$ is too large, $J(\theta)$ may not decrease on each iteration and gradient descent may not converge.

To find a good value, start with a small $\alpha$ and then increase it by $\approx3\times$ each time:
$0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...$
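A short sketch of this search, reusing the `gradient_descent` and `compute_cost` sketches above (an already-prepared design matrix `X` and target vector `y` are assumed):

    # Try learning rates roughly 3x apart and compare how low J(theta)
    # gets after a fixed number of iterations for each.
    candidate_alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
    for alpha in candidate_alphas:
        theta = gradient_descent(X, y, alpha=alpha, num_iters=400)
        print(f"alpha = {alpha:>5}: J(theta) = {compute_cost(theta, X, y):.4f}")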
