1. Logistic regression deals with data sets where $y$ can take only a small number of discrete values. For example, if $y\in \{0, 1\}$, we have a binary classification problem in which $y$ can take only the two values 0 and 1.
For instance, in an email spam classifier $x^{(i)}$ could be some features of an email message and $y$ would be 1 if the message is spam and 0 otherwise. Here 0 is the negative class and 1 is the positive class; they are sometimes also denoted by the symbols "-" and "+". For a given $x^{(i)}$, the corresponding $y^{(i)}$ is called the label of that training example.
2. Hypothesis representation:
$h_\theta (x) = g(z)$
where
$g(z) = \frac{1}{1+e^{-z}}$ - Sigmoid or Logistic function
and
$z = \theta^{T}x$
The sigmoid function $g(z)$ maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.
In this case $h_\theta(x)$ gives the probability that the output is 1. For example, $h_\theta(x) = 0.7$ means a 70% probability that the output is 1 and a 30% probability that it is 0.
$h_\theta(x) = P(y = 1 | x; \theta) = 1 - P(y = 0 | x; \theta)$
or
$P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1$
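As a minimal Octave sketch of these formulas (the helper name sigmoid and the variables theta and x are just illustrative, not part of the course code), the hypothesis for a single example could be computed as follows:

% Sigmoid (logistic) function: maps any real number to the (0, 1) interval
function g = sigmoid(z)
    g = 1 ./ (1 + exp(-z));
end

% x is a feature column vector with x0 = 1, theta is the parameter column vector
h = sigmoid(theta' * x);    % h = P(y = 1 | x; theta)

The element-wise operator ./ lets the same helper work on vectors and matrices, which is reused in the vectorised formulas below.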
3. Decision Boundary
In order to get a classification with the discrete values 0 and 1, we can interpret the output of the hypothesis function as follows:
$h_\theta(x) \geqslant 0.5 \rightarrow y = 1$
$h_\theta(x) < 0.5 \rightarrow y = 0$
The sigmoid function $g(z)$ behaves so that if its input is greater than or equal to zero, its output is greater than or equal to 0.5:
$g(z) \geqslant 0.5$
when $z \geqslant 0$
Remember that:
$z = 0, e^0 = 1\Rightarrow g(z) = \frac{1}{2}$
$z \rightarrow \infty, e^{-z} \rightarrow 0 \Rightarrow g(z) \rightarrow 1$
$z \rightarrow -\infty, e^{-z} \rightarrow \infty \Rightarrow g(z) \rightarrow 0$
So if $z = \theta^{T}x$, then it means that:
$h_\theta(x) = g(\theta^{T}x) \geqslant 0.5$
when $\theta^{T}x \geqslant 0$
From the statements above it's valid to say:
$\theta^{T}x \geqslant 0 \Rightarrow y = 1$
$\theta^{T}x < 0 \Rightarrow y = 0$
The decision boundary is the line that separates the region where y = 0 from the region where y = 1. It is created by the hypothesis function, namely by the set of points where $\theta^{T}x = 0$.
An input to the sigmoid function $g(z)$ (e.g. $\theta^{T}x$) doesn't need to be linear and could be a function that describes, say, a circle ($z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$) or any other shape that fits the data.
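For illustration, here is a small Octave sketch (with made-up parameter values) showing that the 0/1 prediction reduces to checking the sign of $\theta^{T}x$:

theta = [-3; 1; 1];               % illustrative parameters: boundary x1 + x2 = 3
x = [1; 2.5; 1.0];                % one example, with x0 = 1
prediction = (theta' * x >= 0);   % 1 if theta' * x >= 0 (i.e. h >= 0.5), otherwise 0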
4. Cost function
Training set with $m$ examples:
$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})\}$
$n$ features:
$x = \left[\begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_n \end{array} \right]$ where $x_0 = 1$ and $y \in \{0, 1\}$
Hypothesis with sigmoid:
$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}$
The cost function helps to select the parameters $\theta$:
\[J(\theta) = \frac{1}{m}\sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})\]
$Cost(h_\theta(x), y) = -\log (h_\theta (x))$, if $y = 1$
$Cost(h_\theta(x), y) = -\log (1 - h_\theta (x))$, if $y = 0$
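For example (using the natural logarithm), if $y = 1$ and the hypothesis is confident and right, say $h_\theta(x) = 0.9$, the cost is small: $-\log(0.9) \approx 0.105$. If it is confident and wrong, say $h_\theta(x) = 0.1$, the cost is large: $-\log(0.1) \approx 2.303$, and it grows without bound as $h_\theta(x) \rightarrow 0$.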
5. Simplified cost function:
We can combine the two conditional cases into a single expression:
$Cost(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$
Notice that when $y$ is equal to 1, the second term $-(1 - y)\log(1 - h_\theta(x))$ is zero and does not affect the result. If $y$ is equal to 0, then the first term $-y\log(h_\theta(x))$ is zero and does not affect the result.
The entire cost function will look like this:
\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))] \]
A vectorised representation is:
\[
h = g(X\theta) \\
J(\theta) = \frac{1}{m}\left(-y^{T}\log(h) - (1 - y)^{T} \log(1 - h)\right)
\]
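A direct Octave translation of the vectorised cost might look like the sketch below (assuming X is the $m \times (n+1)$ design matrix with a leading column of ones, y is the $m \times 1$ label vector, and sigmoid is the helper sketched in section 2):

m = length(y);                                         % number of training examples
h = sigmoid(X * theta);                                % m x 1 vector of predictions
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % scalar cost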
6. Gradient descent:
\[
Repeat\ \{ \\
\theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \\
\}
\]
A vectorised representation is:
\[ \theta = \theta - \frac{\alpha}{m} X^{T} (g(X\theta) - \overrightarrow{y}) \]
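In Octave the vectorised update could be iterated as in the sketch below (alpha and num_iters are illustrative values you would tune yourself):

alpha = 0.01;        % learning rate (illustrative value)
num_iters = 1000;    % number of iterations (illustrative value)
m = length(y);
for iter = 1:num_iters
    theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);
end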
7. Advanced Optimisation
Once the cost function $J(\theta)$ is defined, we need to find $\min_{\theta}J(\theta)$ in order to obtain optimal values of $\theta$ for our hypothesis. In general, we need code that can compute:
- cost function $J(\theta)$, and
- the partial derivatives $\frac{\partial}{\partial \theta_j}J(\theta)$ for $j = 0, 1, ..., n$
where gradient descent is:
$Repeat\ \{ \\
\ \ \ \ \theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta) \\
\}$
Luckily, there are existing optimisation algorithms that work quite well. Octave (and Matlab) algorithms such as "Conjugate Gradient", "BFGS" and "L-BFGS" provide a more sophisticated and often faster way to optimise $\theta$ and can be used instead of gradient descent. The workflow is as follows:
1. Provide a function that evaluates the following two functions for a given input value $\theta$:
$J(\theta)$, and
$\frac{\partial}{\partial \theta_j}J(\theta)$
In Octave it may look like this:
function [jVal, gradient] = costFunction(theta)
    jVal = [... code to compute J(theta) ...];
    gradient = [... code to compute the derivatives of J(theta) ...];
end
2. Use the Octave optimisation function $fminunc()$ together with the $optimset()$ function, which creates an object containing the options to be passed to $fminunc()$. Pass to $fminunc()$ a handle to the cost function, an initial vector of $\theta$ values and the "options" object:
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
$optTheta$ will contain the $\theta$ values that we are after.
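For logistic regression specifically, the costFunction skeleton could be filled in with the vectorised formulas from sections 5 and 6, for example as in the sketch below (the extra X and y arguments are an assumption of this sketch; they would be bound with an anonymous function such as @(t) costFunction(t, X, y) when calling $fminunc()$):

function [jVal, gradient] = costFunction(theta, X, y)
    m = length(y);
    h = sigmoid(X * theta);                                   % predictions
    jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % cost J(theta)
    gradient = (1 / m) * X' * (h - y);                        % partial derivatives
end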
Advantages of using existing optimisation algorithms are:
- No need to manually pick $\alpha$
- Often faster than gradient descent
Disadvantage:
- Can be more complex than is actually needed for a given task.
8. Multi-class Classification
This is an approach to classify the data when there are more than two categories. Instead of $y \in \{0, 1\}$, it could be $y \in \{0, 1, 2, ..., n\}$.
In this case we can divide the problem into $(n+1)$ binary classification problems, and in each of them predict the probability that $y$ is a member of one particular class.
$y \in \{0, 1, 2, ..., n\}$
$h_{\theta}^{(0)}(x) = P(y = 0 | x; \theta)$
$h_{\theta}^{(1)}(x) = P(y = 1 | x; \theta)$
$h_{\theta}^{(2)}(x) = P(y = 2 | x; \theta)$
$\cdots$
$h_{\theta}^{(n)}(x) = P(y = n | x; \theta)$
$prediction = \arg\max_{i}\ h_{\theta}^{(i)}(x)$
Basically, we select one class and combine all other classes into a single second class. This is done repeatedly, applying binary logistic regression to each case; for prediction we then use the hypothesis that returns the highest value.
To summarise:
1. Train a logistic regression classifier $h_{\theta}^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
2. To make a prediction on a new $x$, pick the class $i$ that maximises $h_{\theta}^{(i)}(x)$.
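As a final Octave sketch (illustrative only, using 1-based class labels for indexing convenience and reusing the costFunction from section 7), one-vs-all training and prediction could look like this:

num_labels = 3;                           % number of classes (illustrative value)
[m, n] = size(X);                         % X already includes the column of ones
all_theta = zeros(num_labels, n);
options = optimset('GradObj', 'on', 'MaxIter', 100);

for c = 1:num_labels
    initialTheta = zeros(n, 1);
    % Train classifier c with (y == c) as the positive class and everything else negative
    all_theta(c, :) = fminunc(@(t) costFunction(t, X, (y == c)), initialTheta, options)';
end

% Prediction: pick the class whose hypothesis returns the highest value
[~, prediction] = max(sigmoid(X * all_theta'), [], 2);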