Saturday, 18 February 2017

Machine Learning Cheat Sheet Part 5 - Logistic Regression (Classification)

1. Logistic regression deals with data sets where $y$ may have only a small number of discrete values. For example, if $y\in \{0, 1\}$ then this is a binary classification problem in which $y$ can take only two values 0 and 1.

For instance, in an email spam classifier $x^{(i)}$ could be the features of an email message and $y$ would be 1 if the message is spam and 0 otherwise. In this case 0 is the negative class and 1 the positive class; they are sometimes also denoted by the symbols "-" and "+". For a given $x^{(i)}$, the corresponding $y^{(i)}$ is called the label of that training example.


2. Hypothesis representation:

$h_\theta (x) = g(z)$

where

$g(z) = \frac{1}{1+e^{-z}}$ - Sigmoid or Logistic function

and

$z = \theta^{T}x$

The sigmoid function $g(z)$ maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.

In this case $h_\theta(x)$ gives the probability that the output is 1. For example, $h_\theta(x) = 0.7$ gives a probability of 70% that the output is 1 and a probability of 30% that it is 0.

$h_\theta(x) = P(y = 1 | x; \theta) = 1 - P(y = 0 | x; \theta)$

or

$P(y = 0 | x; \theta) + P(y = 1 | x; \theta) =  1$
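
As a quick illustration, a minimal Octave sketch of the sigmoid function could look like this (the name sigmoid is just an illustrative choice):

function g = sigmoid(z)
    % Element-wise sigmoid; works for scalars, vectors and matrices
    g = 1 ./ (1 + exp(-z));
end

For example, sigmoid(0) returns 0.5, and sigmoid([-10 0 10]) returns values close to 0, 0.5 and 1.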


3. Decision Boundary

In order to get classification with discrete values 0 and 1, we can translate the output of the hypothesis function as follows:

$h_\theta(x) \geqslant 0.5 \rightarrow y = 1$

$h_\theta(x) < 0.5 \rightarrow y = 0$


The sigmoid function $g(z)$ behaves in such a way that if its input is greater than or equal to zero, its output is greater than or equal to 0.5:

$g(z) \geqslant 0.5$

when $z \geqslant 0$


Remember that:

$z = 0:\ e^{0} = 1 \Rightarrow g(z) = \frac{1}{2}$

$z \rightarrow \infty:\ e^{-z} \rightarrow 0 \Rightarrow g(z) \rightarrow 1$

$z \rightarrow -\infty:\ e^{-z} \rightarrow \infty \Rightarrow g(z) \rightarrow 0$


So if $z = \theta^{T}x$, then it means that:

$h_\theta(x) = g(\theta^{T}x) \geqslant 0.5$

when $\theta^{T}x \geqslant 0$


From the statements above it's valid to say:

$\theta^{T}x \geqslant 0 \Rightarrow y = 1$

$\theta^{T}x < 0 \Rightarrow y = 0$


The decision boundary is the line (or, more generally, the curve) that separates the region where $y = 0$ from the region where $y = 1$. It is determined by the hypothesis function: it is the set of points where $\theta^{T}x = 0$.

An input to the sigmoid function $g(z)$ (e.g. $\theta^{T}x$) doesn't need to be linear and could be a function that describes, say, a circle ($z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$) or any other shape that fits the data.
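
As a small Octave sketch of this thresholding rule (assuming the sigmoid helper from above, a design matrix X whose first column is all ones, and a trained parameter vector theta):

% h >= 0.5 exactly when theta' * x >= 0
predictions = sigmoid(X * theta) >= 0.5;   % logical m x 1 vector of 0/1 labels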


4. Cost function

Training set with $m$ examples:

$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})\}$

$n$ features:

$x = \left[\begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_n \end{array} \right] \in \mathbb{R}^{n+1}$ where $x_0 = 1$ and $y \in \{0, 1\}$

Hypothesis with sigmoid:

$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}$

The cost function helps to select the parameters $\theta$:
\[J(\theta) = \frac{1}{m}\sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})\]
$Cost(h_\theta(x), y) = -\log (h_\theta (x))$, if $y = 1$

$Cost(h_\theta(x), y) = -\log (1 - h_\theta (x))$, if $y = 0$


5. Simplified cost function:

We can combine two conditional cases into one case:

$Cost(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$

Notice that when $y$ is equal to 1, the second term $-(1 - y)\log(1 - h_\theta(x))$ is zero and does not affect the result. If $y$ is equal to 0, the first term $-y\log(h_\theta(x))$ is zero and does not affect the result.


The entire cost function will look like this:
\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))] \]
A vectorised representation is:
\[
h = g(X\theta) \\
J(\theta) = \frac{1}{m}\left(-y^{T}\log(h) - (1 - y)^{T}\log(1 - h)\right)
\]
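A direct Octave translation of the vectorised cost (a sketch, assuming the sigmoid helper from above; X is the $m \times (n+1)$ design matrix and y the $m \times 1$ vector of labels):

m = length(y);                 % number of training examples
h = sigmoid(X * theta);        % m x 1 vector of predictions
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
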
6. Gradient descent:
\[
Repeat\ \{ \\
\theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \\
\}
\]
A vectorised representation is:
\[ \theta = \theta - \frac{\alpha}{m} X^{T} (g(X\theta) - \overrightarrow{y}) \]
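In Octave, one way to sketch this vectorised update (assuming alpha is the learning rate and num_iters a chosen number of iterations) is:

m = length(y);
for iter = 1:num_iters
    theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);
end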


7. Advanced Optimisation

Once the cost function $J(\theta)$ is defined, we need to minimise it, $\min_{\theta}J(\theta)$, in order to find the optimal values of $\theta$ for our hypothesis. In the usual scenario, we need code that can compute:

- the cost function $J(\theta)$, and

- the gradient $\frac{\partial}{\partial \theta_j}J(\theta) \ (for\ j = 0, 1, ..., n)$

where gradient descent is:

\[
Repeat\ \{ \\
\ \ \ \ \theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta) \\
\}
\]

Luckily, there are existing optimisation algorithms that work quite well. Octave (and MATLAB) algorithms such as "Conjugate Gradient", "BFGS" and "L-BFGS" provide a more sophisticated and faster way to optimise $\theta$ and can be used instead of gradient descent. The workflow is as follows:

1. Provide a function that evaluates the following two quantities for a given input value $\theta$:

$J(\theta)$, and
$\frac{\partial}{\partial \theta_j}J(\theta)$

In Octave it may look like this:

function [jVal, gradient] = costFunction(theta)
    jVal = [... code to compute J(theta) ...];
    gradient = [... code to compute the derivative of J(theta) ...];
end

2. Use the Octave optimisation function $fminunc()$ together with $optimset()$, which creates an object containing the options to be passed to $fminunc()$. Pass $fminunc()$ the cost function, an initial vector of $\theta$ values and the "options" object:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

$optTheta$ would contain the $\theta$ values that we are after.
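
Putting the pieces together, a logistic-regression costFunction that $fminunc()$ can consume might look like the sketch below. The extra X and y arguments are an assumption; they can be bound with an anonymous function such as @(t) costFunction(t, X, y) when calling $fminunc()$.

function [jVal, gradient] = costFunction(theta, X, y)
    % Cost and gradient of logistic regression for the current theta
    m = length(y);
    h = sigmoid(X * theta);
    jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
    gradient = (1 / m) * X' * (h - y);
end

With this version the call becomes [optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);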

Advantages of using existing optimisation algorithms are:
- No need to manually pick the learning rate $\alpha$
- Often faster than gradient descent

Disadvantage:
- Could be more complex than actually needed for a given task.


8. Multi-class Classification

This is an approach to classify the data when there are more than two categories. Instead of $y \in \{0, 1\}$, it could be $y \in \{0, 1, 2, ..., n\}$.

In this case we can divide the problem into $(n+1)$ binary classification problems and in each of them predict the probability that $y$ is a member of one particular class.

$y \in \{0, 1, 2, ..., n\}$

$h_{\theta}^{(0)}(x) = P(y = 0 | x; \theta)$

$h_{\theta}^{(1)}(x) = P(y = 1 | x; \theta)$

$h_{\theta}^{(2)}(x) = P(y = 2 | x; \theta)$

$h_{\theta}^{(n)}(x) = P(y = n | x; \theta)$

$prediction = \max_i\left(h_{\theta}^{(i)}(x)\right)$

Basically, we select one class and combine all the other classes into a single second class. This is done repeatedly, applying binary logistic regression to each case. For prediction we then use the hypothesis that returns the highest value.


To summarise:

1. Train a logistic regression classifier $h_{\theta}^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.

2. To make a prediction on a new $x$, pick the class $i$ that maximises $h_{\theta}^{(i)}(x)$, as in the sketch below.
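
A minimal Octave sketch of the one-vs-all prediction step, assuming all_theta is a matrix whose $i$-th row holds the parameters trained for class $i$ (the names all_theta and predictions are illustrative; Octave indices are 1-based, so this assumes classes numbered 1 to num_labels):

% all_theta: num_labels x (n+1), X: m x (n+1) with a leading column of ones
probabilities = sigmoid(X * all_theta');            % m x num_labels matrix of class probabilities
[maxProb, predictions] = max(probabilities, [], 2); % index of the most confident class per example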


