1. Logistic regression deals with data sets where $y$ can take only a small number of discrete values. For example, if $y\in \{0, 1\}$ then this is a binary classification problem in which $y$ can take only the two values 0 and 1.
For instance, in an email spam classifier $x^{(i)}$ could be some features of an email message and $y^{(i)}$ would be 1 if the message is spam and 0 otherwise. In this case 0 is called the negative class and 1 the positive class, and they are sometimes also denoted by the symbols "-" and "+". For a given $x^{(i)}$, the corresponding $y^{(i)}$ is called the label of that training example.
2. Hypothesis representation:
$h_\theta (x) = g(z)$
where
$g(z) = \frac{1}{1+e^{-z}}$ - Sigmoid or Logistic function
and
$z = \theta^{T}x$
Sigmoid function $g(z)$ maps any real number to (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.
In this case $h_\theta(x)$ gives the probability that the output is 1. For example, $h_\theta(x) = 0.7$ gives a probability of 70% that the output is 1 and a probability of 30% that it is 0.
$h_\theta(x) = P(y = 1 | x; \theta) = 1 - P(y = 0 | x; \theta)$
or
$P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1$
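As a minimal sketch (the helper names are my own, not from the course materials), the hypothesis could look like this in Octave:

```octave
% Sigmoid function: maps any real-valued input to the (0, 1) interval
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));  % element-wise, so z may be a scalar, vector or matrix
end

% Hypothesis for a single feature vector x (with x(1) = 1 as the bias term):
% h = sigmoid(theta' * x);
```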
3. Decision Boundary
In order to get classification with discrete values 0 and 1, we can translate the output of the hypothesis function as follows:
$h_\theta(x) \geqslant 0.5 \rightarrow y = 1$
$h_\theta(x) < 0.5 \rightarrow y = 0$
The sigmoid function $g(z)$ has the property that if its input is greater than or equal to zero, its output is greater than or equal to 0.5:
$g(z) \geqslant 0.5$
when $z \geqslant 0$
Remember that:
$z = 0, e^0 = 1\Rightarrow g(z) = \frac{1}{2}$
$z \rightarrow \infty, e^{-z} \rightarrow 0 \Rightarrow g(z) \rightarrow 1$
$z \rightarrow -\infty, e^{-z} \rightarrow \infty \Rightarrow g(z) \rightarrow 0$
So if $z = \theta^{T}x$, then it means that:
$h_\theta(x) = g(\theta^{T}x) \geqslant 0.5$
when $\theta^{T}x \geqslant 0$
From the statements above it's valid to say:
$\theta^{T}x \geqslant 0 \Rightarrow y = 1$
$\theta^{T}x < 0 \Rightarrow y = 0$
The decision boundary is the line that separates the area where $y = 0$ from the area where $y = 1$. It is created by the hypothesis function.
An input to the sigmoid function $g(z)$ (e.g. $\theta^{T}x$) doesn't need to be linear and could be a function that describes, say, a circle ($z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$) or any other shape that fits the data.
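As an illustration, assuming a design matrix $X$ whose first column is all ones, these rules give a one-line prediction in Octave (a sketch, not course code):

```octave
% y = 1 exactly when theta' * x >= 0, i.e. when sigmoid(X * theta) >= 0.5
p = (X * theta >= 0);  % vector of 0/1 predictions, one per training example
```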
4. Cost function
Training set with $m$ examples:
$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})\}$
$n$ features:
$x = \left[\begin{array}{c} x_0 \\ x_1 \\ ... \\ x_n \end{array} \right] \in \mathbb{R}^{n+1}$ where $x_0 = 1$ and $y \in \{0, 1\}$
Hypothesis with sigmoid:
$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}$
The cost function helps to select parameters $\theta$:
\[J(\theta) = \frac{1}{m}\sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})\]
$Cost(h_\theta(x), y) = -\log (h_\theta (x))$, if $y = 1$
$Cost(h_\theta(x), y) = -\log (1 - h_\theta (x))$, if $y = 0$
5. Simplified cost function:
We can combine the two conditional cases into one:
$Cost(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$
Notice that when $y$ is equal to 1, the second term $(1 - y)\log(1 - h_\theta(x))$ is zero and does not affect the result. If $y$ is equal to 0, the first term $y\log(h_\theta(x))$ is zero and does not affect the result.
The entire cost function will look like this:
\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))] \]
A vectorised representation is:
\[
h = g(X\theta) \\
J(\theta) = \frac{1}{m} \left( -y^{T}\log(h) - (1 - y)^{T} \log(1 - h) \right)
\]
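A hedged Octave sketch of this vectorised cost, assuming $X$ is the $m \times (n+1)$ design matrix, $y$ the $m \times 1$ label vector and sigmoid a helper as above:

```octave
h = sigmoid(X * theta);                                % m x 1 vector of predictions
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % scalar cost
```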
6. Gradient descent:
\[
Repeat\ \{ \\
\theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \\
\}
\]
A vectorised representation is:
\[ \theta = \theta - \frac{\alpha}{m} X^{T} (g(X\theta) - \overrightarrow{y}) \]
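Under the same assumptions, one vectorised update step can be sketched as:

```octave
grad  = (1 / m) * X' * (sigmoid(X * theta) - y);  % (n+1) x 1 gradient vector
theta = theta - alpha * grad;                     % simultaneous update of all theta_j
```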
7. Advanced Optimisation
Once the cost function $J(\theta)$ is defined, we need to find $\min_{\theta}J(\theta)$ in order to obtain optimal values of $\theta$ for our hypothesis. Normally, we need code that can compute:
- the cost function $J(\theta)$, and
- its partial derivatives $\frac{\partial}{\partial \theta_j}J(\theta) \ (for\ j = 0, 1, ..., n)$
where gradient descent is:
$Repeat\ \{ \\
\ \ \ \ \theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta) \\
\}$
Luckily, there are existing optimisation algorithms that work quite well. Octave (and MATLAB) algorithms such as "Conjugate Gradient", "BFGS" and "L-BFGS" provide a more sophisticated and faster way to optimise $\theta$ and can be used instead of gradient descent. The workflow is as follows:
1. Provide a function that evaluates the following two functions for a given input value $\theta$:
$J(\theta)$, and
$\frac{\partial}{\partial \theta_j}J(\theta)$
In Octave it may look like this:
```octave
function [jVal, gradient] = costFunction(theta)
  jVal = [... code to compute J(theta) ...];
  gradient = [... code to compute derivative of J(theta) ...];
end
```
2. Use the Octave optimisation function $fminunc()$ together with the $optimset()$ function, which creates an object containing the options to be passed to $fminunc()$. Give $fminunc()$ the cost function, an initial vector of $\theta$ values and the "options" object:
```octave
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
$optTheta$ would contain $\theta$ values that we are after.
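For logistic regression specifically, the placeholders could be filled in as below — a sketch under the same assumptions as earlier; in practice $X$ and $y$ would be passed in, e.g. via an anonymous function handle like @(t) costFunction(t, X, y):

```octave
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);                                            % number of examples
  h = sigmoid(X * theta);                                   % predictions
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % cost J(theta)
  gradient = (1 / m) * X' * (h - y);                        % vector of partial derivatives
end
```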
Advantages of using existing optimisation algorithms are:
- No need to manually pick $\alpha$
- Often faster than gradient descent
Disadvantage:
- Could be more complex for a given task than actually needed.
8. Multi-class Classification
This is an approach to classify the data when there are more than two categories. Instead of $y \in \{0, 1\}$, it could be $y \in \{0, 1, 2, ..., n\}$.
In this case we can divide the problem into $(n+1)$ binary classification problems and, in each of them, predict the probability that $y$ is a member of one particular class.
$y \in \{0, 1, 2, ..., n\}$
$h_{\theta}^{(0)}(x) = P(y = 0 | x; \theta)$
$h_{\theta}^{(1)}(x) = P(y = 1 | x; \theta)$
$h_{\theta}^{(2)}(x) = P(y = 2 | x; \theta)$
$...$
$h_{\theta}^{(n)}(x) = P(y = n | x; \theta)$
$prediction = \max_i(h_{\theta}^{(i)}(x))$
Basically, one class is selected and all the other classes are combined into a single second class. This is done repeatedly, applying binary logistic regression to each case. For prediction, we then use the hypothesis that returns the highest value.
To summarise:
1. Train a logistic regression classifier $h_{\theta}(x)$ for each class to predict the probability that $y = i$.
2. To make a prediction on a new $x$, pick the class $i$ that maximises $h_{\theta}^{(i)}(x)$.
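Assuming the trained parameters are stacked into a matrix all_theta with one row per class (my own layout for illustration), prediction can be sketched in Octave as:

```octave
probs = sigmoid(X * all_theta');   % m x num_labels matrix: column i holds h^(i)(x)
[maxProb, p] = max(probs, [], 2);  % per row, pick the class with the highest probability
```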
Sunday, 5 February 2017
Machine Learning Cheat Sheet Part 4 - Polynomial Regression and Normal Equation
1. Polynomial Regression
The hypothesis function does not need to be linear (a straight line) if it does not fit the data well. It's possible to change the behaviour of the curve of the hypothesis function by making it a quadratic, cubic or square root function (or any other form).
We can also combine existing features into one. For example, combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \times x_2$.
If the hypothesis function is:
$ h_\theta (x) = \theta_0 + \theta_1 x_1 $
then we can create additional features based just on $x_1$ to get a quadratic function:
$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 $
or the cubic function:
$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 $
where new features are:
$ x_2 = x_1^2 $ and $ x_3 = x_1^3 $, so the function becomes:
$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 $
NOTE: if features are chosen this way then feature scaling becomes very important.
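A short Octave sketch of building such features from a single column x1 (the variable names are my own):

```octave
m = length(x1);
X_poly = [ones(m, 1), x1, x1.^2, x1.^3];  % columns: x0, x1, x1^2, x1^3
% Feature scaling matters here: x1^3 can dwarf x1 by orders of magnitude.
```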
2. Normal Equation
The Normal Equation is a method for solving for $\Theta$ analytically rather than iterating with gradient descent. It minimises $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero, finding the optimum $\theta$ without iteration.
$m$ - number of training examples
$n$ - number of features
The Normal Equation formula is:
$ \Theta = (X^{T}X)^{-1}X^Ty $
Drawbacks:
1. The Normal Equation calculates the matrix inverse $ (X^{T}X)^{-1} $, which has complexity $O(n^3)$ - this makes it slow when there is a very large number of features. In practice, when $n$ exceeds 10,000 it might be a good idea to switch from the Normal Equation to iterative Gradient Descent.
2. $X^{T}X$ could be non-invertible (singular, degenerate). There are two possible remedies for such cases:
1) In Octave use $pinv(X'*X)*X'*y$ rather than the $inv$ function, as $pinv$ might still be able to calculate $\theta$ even if $X^{T}X$ is non-invertible.
2) Common causes for non-invertibility could be:
- Redundant features that are closely related (i.e. they are linearly dependent), or
- There are too many features (e.g. $m\leqslant n$).
Solutions to these problems could be:
- Deleting a feature that is linearly dependent on another feature
- Deleting some features, or using regularisation, when there are too many features.
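In Octave the Normal Equation is a one-liner (using pinv, as recommended above, so that a singular $X^{T}X$ is still handled):

```octave
theta = pinv(X' * X) * X' * y;  % optimal parameters, no iterations or alpha needed
```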
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Might need feature scaling | No need for feature scaling |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$ to calculate the inverse of $X^{T}X$ |
| Works well for $n > 10,000$ | Works well for $n \leqslant 10,000$ |
Saturday, 4 February 2017
Machine Learning Cheat Sheet Part 3 - Linear Regression with Multiple Variables
1. Notation:
$ m $ - number of training examples.
$ n = \vert x^{(i)} \vert $ - number of features.
$ x^{(i)} $ - column vector of all the feature inputs of the $ i^{th} $ training example.
$ x_{j}^{(i)} $ - value of feature $ j $ in the $ i^{th} $ training example.
$ \alpha $ - learning rate.
2. Hypothesis:
$ h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n $
or
$ h_\theta(x) = \left[ \begin{array}{cccc} \theta_0 & \theta_1 & ... & \theta_n \end{array} \right] \times \left[ \begin{array}{c} x_0 \\ x_1 \\ ... \\ x_n \end{array} \right] = \Theta^{T}x $
where $ x_0^{(i)} = 1 $ for $ (i \in 1, ..., m) $
3. Parameters:
$ \theta_0, \theta_1, ..., \theta_n $ or $ \Theta $
4. Cost function:
\[
J(\theta_0, \theta_1, ..., \theta_n) = J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
\]
5. Gradient descent:
repeat until convergence:
\[
\{ \\
\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \times x_0^{(i)} \\
\theta_1 = \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \times x_1^{(i)} \\
\theta_2 = \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \times x_2^{(i)} \\
... \\
\}
\]
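These updates can be vectorised; a sketch, assuming $X$ includes the leading column of ones and alpha and num_iters are chosen by the user:

```octave
for iter = 1:num_iters
  theta = theta - (alpha / m) * X' * (X * theta - y);  % updates all theta_j at once
end
```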
6. Feature scaling:
This is applied to the input (training) data set. Features may differ significantly in their values. For example, $ x_1 $ may take values from 100 to 10,000 (e.g. area of a house in $ feet^2 $) while $ x_2 $ may take values from 1 to 6 (e.g. number of bedrooms). It is a good idea to scale the feature values so that:
\[
-1 \leqslant x_i \leqslant 1 \\ or \\ -0.5 \leqslant x_i \leqslant 0.5
\]
The technique can combine feature scaling and mean normalisation:
\[
x_i = \frac{x_i - \mu_i}{s_i}
\]
where $\mu_i$ is the average of all the values of feature $x_i$ and $s_i$ is either the range of values $(\max - \min)$ of $x_i$ or their standard deviation. Note that dividing by the range and dividing by the standard deviation give different results.
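A minimal Octave sketch of this normalisation, applied to the raw feature columns before the $x_0 = 1$ column is added (here $s_i$ is taken to be the standard deviation):

```octave
mu     = mean(X);            % 1 x n vector of column means
sigma  = std(X);             % 1 x n vector of column standard deviations
X_norm = (X - mu) ./ sigma;  % broadcasting subtracts/divides column-wise
```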
7. Debugging gradient descent using learning rate $\alpha$:
7.1 Value of $J(\theta)$ and number of iterations:
Make a plot of values of cost function $J(\theta)$ (y-axis) over the number of iterations of gradient descent (x-axis). If $J(\theta)$ ever increases, then try to decrease $\alpha$ and repeat.
7.2 Automatic convergence test:
It has been proven that if learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on each iteration. Declare convergence if $J(\theta)$ decreases by less than $\varepsilon$ in one iteration, where $\varepsilon$ is some small value such as $10^{-3}$. However in practice it could be difficult to choose this threshold value.
7.3 Trade-off:
If $\alpha$ is too small, it could result in slow convergence.
If $\alpha$ is too large, $J(\theta)$ may not decrease on each iteration and thus may not converge.
Choose a small $\alpha$ and then increase it by $\approx3\times$ at each attempt:
$0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...$
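The plot from 7.1 can be produced with a few lines, assuming J_history was filled with the cost at every iteration of gradient descent:

```octave
plot(1:numel(J_history), J_history);
xlabel('Number of iterations');
ylabel('J(\theta)');  % should decrease on every iteration for a good alpha
```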
Friday, 3 February 2017
Machine Learning Cheat Sheet Part 2 - Linear Regression with One Variable
1. Training set: $ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ... (x^{(m)}, y^{(m)}) $
2. Hypothesis: $ h_\theta(x)=\theta_0+\theta_1x $
3. Parameters: $ \theta_0, \theta_1 $
4. Cost function: $J(\theta_0, \theta_1)$ uses parameters $\theta_0$ and $\theta_1$ to measure the difference between the hypothesis values $h_\theta(x)$ and the given values $y$ from the training examples $(x,y)$:
\[ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \]
5. Goal: minimise cost function
$ min_{\theta_0, \theta_1} J(\theta_0, \theta_1) $
6. Gradient descent algorithm (minimisation of cost function):
$ \alpha $ - learning rate
repeat until convergence:
\[
\{ \\
\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\
\theta_1 = \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\times x^{(i)} \\
\}
\]
(update $ \theta_0 $ and $ \theta_1 $ simultaneously!)
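A sketch of the simultaneous update using temporary variables (x and y are column vectors of the training data; the names are my own):

```octave
h      = theta0 + theta1 .* x;                        % current hypothesis values
temp0  = theta0 - alpha * (1 / m) * sum(h - y);
temp1  = theta1 - alpha * (1 / m) * sum((h - y) .* x);
theta0 = temp0;   % assign only after both are computed
theta1 = temp1;
```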
Thursday, 2 February 2017
Machine Learning Cheat Sheet Part 1 - Supervised and Unsupervised Machine Learning
1. Machine Learning definition:
Field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel (1959)
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Tom Mitchell (1998)
2. Supervised Machine Learning
Supervised machine learning is the task of inferring a function from labelled training data. The training data consist of a set of training examples. Each example is a pair consisting of an input (usually a vector) and a known output value. A supervised learning algorithm analyses the training data and produces an inferred function, which can be used for mapping new examples. In the optimal scenario, the algorithm correctly labels unseen instances.
The inferred function could represent:
- regression - predict continuous valued output, or
- classification or categorisation - predict discrete valued output, for example, 0 and 1.
3. Unsupervised Machine Learning
Unsupervised machine learning is the task of inferring a function to describe hidden structure from unlabelled data (see Cocktail Party Effect).
Wednesday, 1 February 2017
Machine Learning Cheat Sheet - Introduction
This is a summary of my notes from the Machine Learning course that I completed on Coursera. It was taught by Professor Andrew Ng, Associate Professor at Stanford University, Chief Scientist at Baidu and Chairman/Co-founder of Coursera. The notes include the main math formulas and some practical tips mentioned during the course. Covered topics are supervised and unsupervised learning, linear regression with one and with multiple variables, polynomial regression and the Normal Equation, and logistic regression.