1. Logistic regression deals with data sets where y can take only a small number of discrete values. For example, if y ∈ {0, 1}, this is a binary classification problem in which y takes only the two values 0 and 1.
For instance, in an email spam classifier x^(i) could be the features of an email message and y^(i) would be 1 if the message is spam and 0 otherwise. In this case 0 is called the negative class and 1 the positive class; they are sometimes also denoted by the symbols "−" and "+". For a given x^(i), the corresponding y^(i) is called the label of that training example.
2. Hypothesis representation:
h_θ(x) = g(z)
where
g(z) = 1 / (1 + e^(−z)) (the sigmoid, or logistic, function)
and
z = θ^T x
The sigmoid function g(z) maps any real number to the interval (0, 1), which makes it useful for turning an arbitrary real-valued function into one better suited for classification.
In this case h_θ(x) gives the probability that the output is 1. For example, h_θ(x) = 0.7 means there is a 70% probability that the output is 1 and a 30% probability that it is 0.
h_θ(x) = P(y=1 | x; θ) = 1 − P(y=0 | x; θ)
or
P(y=0 | x; θ) + P(y=1 | x; θ) = 1
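As a minimal Octave sketch (the helper name sigmoid is my own choice, not from the notes), the hypothesis can be computed like this:

function g = sigmoid(z)
  % element-wise sigmoid: works for scalars, vectors and matrices
  g = 1 ./ (1 + exp(-z));
end

% hypothesis for a single example x (column vector with x(1) = 1):
% h = sigmoid(theta' * x);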
3. Decision Boundary
In order to get a classification with discrete values 0 and 1, we can threshold the output of the hypothesis function as follows:
h_θ(x) ≥ 0.5 → y = 1
h_θ(x) < 0.5 → y = 0
The sigmoid function g(z) outputs a value greater than or equal to 0.5 whenever its input is greater than or equal to zero:
g(z) ≥ 0.5
when z ≥ 0
Remember that:
z = 0: e^0 = 1 ⇒ g(z) = 1/2
z → ∞: e^(−z) → 0 ⇒ g(z) → 1
z → −∞: e^(−z) → ∞ ⇒ g(z) → 0
So if z = θ^T x, then:
h_θ(x) = g(θ^T x) ≥ 0.5
when θ^T x ≥ 0
From the statements above it follows that:
θ^T x ≥ 0 ⇒ y = 1
θ^T x < 0 ⇒ y = 0
The decision boundary is the line that separates the region where y = 0 from the region where y = 1; it is determined by the hypothesis function.
The input to the sigmoid function g(z) (e.g. θ^T x) doesn't need to be linear; it could be a function that describes, say, a circle (z = θ_0 + θ_1·x_1^2 + θ_2·x_2^2) or any other shape that fits the data.
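For instance (the parameter values here are illustrative only, not from the notes): if θ = [−3; 1; 1], then θ^T x = −3 + x_1 + x_2, so y = 1 is predicted whenever x_1 + x_2 ≥ 3 and the decision boundary is the straight line x_1 + x_2 = 3.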
4. Cost function
Training set with m examples:
{(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}
n features:
x = [x_0; x_1; ...; x_n] (a column vector, with x_0 = 1) and y ∈ {0, 1}
Hypothesis with sigmoid:
h_θ(x) = 1 / (1 + e^(−θ^T x))
The cost function is used to select the parameters θ:
J(θ) = (1/m) ∑_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
Cost(h_θ(x), y) = −log(h_θ(x)), if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x)), if y = 0
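Intuitively, this choice penalises confident wrong predictions: if y = 1 and h_θ(x) → 1, the cost −log(h_θ(x)) → 0, while if y = 1 but h_θ(x) → 0, the cost grows without bound; the y = 0 case behaves symmetrically.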
5. Simplified cost function:
We can combine the two conditional cases into a single expression:
Cost(h_θ(x), y) = −y·log(h_θ(x)) − (1 − y)·log(1 − h_θ(x))
Notice that when y equals 1, the second term (1 − y)·log(1 − h_θ(x)) is zero and does not affect the result. When y equals 0, the first term −y·log(h_θ(x)) is zero and does not affect the result.
The entire cost function will look like this:
J(θ) = −(1/m) ∑_{i=1}^{m} [ y^(i)·log(h_θ(x^(i))) + (1 − y^(i))·log(1 − h_θ(x^(i))) ]
A vectorised representation is:
h = g(Xθ)
J(θ) = (1/m)·(−y^T·log(h) − (1 − y)^T·log(1 − h))
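A minimal Octave sketch of this vectorised cost (assuming X is the m×(n+1) design matrix with a leading column of ones, y is the m×1 label vector, and sigmoid is the helper defined earlier; the names are my own):

m = length(y);
h = sigmoid(X * theta);                                % m x 1 vector of predictions
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % scalar cost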
6. Gradient descent:
Repeat {
θ_j := θ_j − (α/m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_j^(i)
}
A vectorised representation is:
θ := θ − (α/m)·X^T·(g(Xθ) − y), where y is the m×1 vector of labels
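A rough Octave sketch of this update (alpha and num_iters are placeholder names I introduce here, not values from the notes):

m = length(y);
for iter = 1:num_iters
  % simultaneous update of every theta_j
  theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);
end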
7. Advanced Optimisation
Once the cost function J(θ) is defined, we need to minimise it over θ (min_θ J(θ)) in order to find optimal parameter values for our hypothesis. In the usual scenario, we need code that computes:
- the cost function J(θ), and
- its partial derivatives ∂J(θ)/∂θ_j (for j = 0, 1, ..., n)
where gradient descent is:
Repeat {
θ_j := θ_j − α·∂J(θ)/∂θ_j
}
Luckily, there are existing optimisation algorithms that work quite well. Octave (and MATLAB) algorithms such as "Conjugate Gradient", "BFGS" and "L-BFGS" provide a more sophisticated and often faster way to optimise θ and can be used instead of gradient descent. The workflow is as follows:
1. Provide a function that evaluates the following two quantities for a given input value θ:
J(θ), and
∂J(θ)/∂θ_j
In Octave it may look like this:
function [jVal, gradient] = costFunction(theta)
  jVal = [... code to compute J(theta) ...];
  gradient = [... code to compute the derivative of J(theta) ...];
end
2. Use the Octave optimisation function fminunc() together with optimset(), which creates an object containing the options to be passed to fminunc(). Pass to fminunc() a handle to the cost function, an initial vector of θ values and the "options" object:
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
optTheta will contain the θ values we are after.
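As a toy illustration of the interface (this cost function is made up purely to show the mechanics, it is not the logistic regression cost), minimising J(θ) = (θ_1 − 5)^2 + (θ_2 − 5)^2 would look like this:

function [jVal, gradient] = costFunction(theta)
  % toy cost: J(theta) = (theta(1) - 5)^2 + (theta(2) - 5)^2
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

With the call above, fminunc() drives optTheta to approximately [5; 5], the minimum of this toy cost.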
Advantages of using existing optimisation algorithms are:
- No need to manually pick the learning rate α
- Often faster than gradient descent
Disadvantage:
- They can be more complex than a given task actually needs.
8. Multi-class Classification
This is an approach for classifying data when there are more than two categories: instead of y ∈ {0, 1}, we have y ∈ {0, 1, 2, ..., n}.
In this case we can divide the problem into (n + 1) binary classification problems and in each of them predict the probability that y belongs to one particular class.
y ∈ {0, 1, 2, ..., n}
h_θ^(0)(x) = P(y = 0 | x; θ)
h_θ^(1)(x) = P(y = 1 | x; θ)
h_θ^(2)(x) = P(y = 2 | x; θ)
...
h_θ^(n)(x) = P(y = n | x; θ)
prediction = max_i h_θ^(i)(x)
Basically, one class is selected and all other classes are combined into a single second class. This is done repeatedly, applying binary logistic regression to each case, and then for prediction we use the hypothesis that returns the highest value.
To summarise:
1. Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i.
2. To make a prediction on a new x, pick the class i that maximises h_θ^(i)(x).
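A rough Octave sketch of the prediction step (all_theta is an assumed (n+1)×K matrix whose k-th column holds the parameters of the classifier trained for class k; the name is mine, not from the notes):

% x: (n+1) x 1 feature vector with x(1) = 1
probs = sigmoid(all_theta' * x);   % K x 1 vector, one probability per class
[maxProb, k] = max(probs);         % k is the 1-based index of the best class
prediction = k - 1;                % subtract 1 if classes are labelled 0..n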