MIT 15.097讲解、algorithms辅导、Python,c/c++,Java程序语言讲解 辅导Python编程|讲解数据库SQL
- 首页 >> Python编程 Support Vector Machines
MIT 15.097 Course Notes
Let’s start with some intuition about margins.
The margin of an example xi = “distance” from example to decision boundary
= yif(xi)
The margin is positive if the example is on the correct side of the decision boundary,
otherwise it’s negative.
Here’s the intuition for SVM’s:
• We want all examples to have large margins, want them to be as far from
decision boundary as possible.
• That way, the decision boundary is more “stable,” we are confident in all
decisions.
1
Most other algorithms (logistic regression, decision trees, perceptron) don’t generally
produce large margins. (AdaBoost generally produces large margins.)
As in logistic regression and AdaBoost, function f is linear,
m
f(x) = Xλ
(j)x
(j) + λ0.
j=1
Note that the intercept term can get swept into x by adding a 1 as the last
component of each x. Then f(x) would be just λ
Tx but for this lecture we’ll
keep the intercept term separately because SVM handles that term differently
than if you put the intercept as a separate feature. We classify x using sign(f(x)).
If xi has a large margin, we are confident that we classified it correctly. So we’re
essentially suggesting to use the margin yif(xi) to measure the confidence in our
prediction.
But there is a problem with using yif(xi) to measure confidence in prediction.
There is some arbitrariness about it.
How? What should we do about it?
2
SVM’s maximize the distance from the decision boundary to the nearest training
example – they maximize the minimum margin. There is a geometric perspective
too. I set the intercept to zero for this picture (so the decision boundary passes
through the origin):
The decision boundary are x’s where λ
Tx = 0. That means the unit vector for
λ must be perpendicular to those x’s that lie on the decision boundary.
Now that you have the intuition, we’ll put the intercept back, and we have to
translate the decision boundary, so it’s really the set of x’s where λ
Tx = λ0.
The margin of example i is denoted γi
:
B is the point on the decision boundary closest to the positive example xi
. B isλB = xi − γi||λ|| 23
since we moved −γi units along the unit vector to get from the example to B.
Since B lies on the decision boundary, it obeys λ
Tx + λ0 = 0, where x is B. (I
wrote the intercept there explicitly).
Note that here we normalized so we wouldn’t have the arbitrariness in the meaning
of the margin.
If the example is negative, the same calculation works, with a few sign flips (we’d
need to move γi units rather than −γi units).
So the “geometric” margin from the picture is the same as the “functional”
margin ˜ yif(xi).
Maximize the minimum margin
Support vector machines maximize the minimum margin. They would like to
have all examples being far from the decision boundary. So they’ll choose f this
way:
For any λ and λ0 that satisfy this, any positively scaled multiple satisfies them
too, so we can arbitrarily set ||λ||2 = 1/γ so that the right side is 1.
Now when we maximize γ, we’re maximizing γ = 1/ ||λ||2
.
Txi + λ0) − 1 ≥ 0 i = 1 . . . m (1)
(the 1/2 and square are just for convenience) which is the same as:
Writing the KKT conditions, starting with Lagrangian stationarity, where we
need to find the gradient wrt λ and the derivative wrt λ0:
m m
∇λL([λ, λ0], α) = λ −
Xαiyixi = 0 =⇒ λ =
Xαiyixi
.
i=1 i=1
m m
∂
L([λ, λ0], α) = αiyi = 0 = ∂λ0
−
i=1
⇒ αiyi = 0.
i=1
αi ≥ 0
X
∀i (dual feasibilit
X
y)
αi −yi(λ
Txi + λ0) + 1
= 0
T
∀i (complementary slackness)
−yi(λ xi + λ0) + 1 ≤ 0. (primal feasibility)
5
Using the KKT conditions, we can simplify the Lagrangian.
m m m
1
L([λ, λ0], α) = kλk
2 T
2 + λ
X(−αiyixi) +X(−αiyiλ0) + αi
2
i=1 i=1 i=1
(We just expanded terms. Now we’ll plug in the first KKT condition.)
X
m
1
= ||λ||2
2 − || ||2 λ 2 − λ0
X
m
(αiyi) + i
2
Xα
i=1 i=1
(Plug in the second KKT condition.)
Again using the first KKT condition, we can rewrite the first term.
Plugging back into the Lagrangian (2), which now only depends on α, and
putting in the second and third KKT conditions gives us the dual problem;
max
1 X T αi 0 i = 1 . . . m L(α) = αi − αiαkyiykx
We’ll use the last two KKT conditions in what follows,
P
for instance to get conditions
on λ0, but what we’ve already done is enough to define the dual problem
for α.
We can solve this dual problem. Either (i) we’d use a generic quadratic programming
solver, or (ii) use another algorithm, like SMO, which I will discuss
6
later. For now, assume we solved it. So we have α1∗, . . . , αm∗. We can use the
solution of the dual problem to get the solution of the primal problem. We can
plug α∗
into the first KKT condition to get, but we can see something cool in the process.
Support Vectors
Look at the complementary slackness KKT condition and the primal and dual
feasibility conditions:
The examples in the first category, for which the scaled margin is 1 and the
constraints are active are called support vectors. They are the closest to the
decision boundary.
7
Finish What We Were Doing Earlier
To get λ∗0, use the complementarity condition for any of the support vectors (in other words, use the fact that the unnormalized margin of the support vectors is one):
If you take a positive support vector, yi = 1,
Written another way, since the support vectors have the smallest margins,
So that’s the solution! Just to recap, to get the scoring function f
∗
for SVM,
you’d compute α∗
from the dual problem (3), plug it into (4) to get λ∗, plug that
into the equation above to get λ∗0, and that’s the solution to the primal problem,
and the coefficients for f∗.8
Support vectors
Image by MIT OpenCourseWare.
Because of the form of the solution:
it is possible that λ∗
is very fast to calculate.
Why is that? Think support vectors.
The Nonseparable Case
If there is no separating hyperplane,
there is no feasible solution to the problem we wrote above. Most real problems
are nonseparable.
Let’s fix our SVM so it can accommodate the nonseparable case. The new formulation
will penalize mistakes the farther they are from the decision boundary.
So we are allowed to make mistakes now, but we pay a price.
9
Let’s change our primal problem (1) to this new primal problem:
So the constraints allow some slack of size ξi
, but we pay a price for it in the
objective. That is, if yif(xi) ≥ 1 then ξi gets set to 0, penalty is 0. Otherwise,
if yif(xi) = 1 − ξi
, we pay price ξi
.
2 Parameter C trades off between the twin goals of making the λ
2
|| ||2
small (making
what-was-the-minimum-margin 1/ ||λ||2
large) and ensuring that most examples
2
have margin at least 1/ ||λ||2
.
Going on a Little Tangent
Rewrite the penalty another way:
If yif(xi) ≥ 1, zero penalty. Else, pay price ξi = 1 − yif(xi)
Third time’s the charm:
Pay price ξi = b1 − yif(xi)c+
where this notation bzc+ means take the maximum of z and 0.
Equation (5) becomes:
m
1
min ||λ||2
2 + C
Xb1 − yif(xi)
λ,λ0 2
i=1
c+
Does that look familiar?
10
The Dual for the Nonseparable Case
Form the Lagrangian of (5):
m m m
1
L(λ , α, r) =
2
||λ||2
, b, ξ T
2 + C
Xξi αi yi(λ xi + λ0) 1 + ξi riξi
where αi
’s and ri
’s are Lagrange multipliers (constrained to be ≥ 0). The dual
turns out to be (after some work)
i=1 αiyi = 0
So the only difference from the original problem’
P
s Lagrangian (3) is that 0 ≤ αi
was changed to 0 ≤ αi ≤ C. Neat!
Solving the dual problem with SMO
SMO (Sequential Minimal Optimization) is a type of coordinate ascent algorithm,
but adapted to SVM so that the solution always stays within the feasible
region.
Start with (6). Let’s say you want to hold α2, . . . , αm fixed and take a coordinate
step in the first direction. That is, change α1 to maximize the objective in (6).
Can we make any progress? Can we get a better feasible solution by doing this?
m Turns out, no. Look at the constraint in (6), i=1 αiyi = 0. This means:
So, since α2, . . . , αm are fixed, α1 is also fixed.
11
So, if we want to update any of the αi
’s, we need to update at least 2 of them
simultaneously to keep the solution feasible (i.e., to keep the constraints satis-
fied).
Start with a feasible vector α. Let’s update α1 and α2, holding α3, . . . , αm fixed.
What values of α1 and α2 are we allowed to choose?
Again, the constraint is: α1y1 + α2y2 = − i=3 αiyi =: ζ (fixed constant). Pm
We are only allowed to choose α1, α2 on the line, so when we pick α2, we get α1
automatically, from1α1 = (ζy1− α2y2)= y1(ζ − α2y2) (y1 = 1/y1 since y1 ∈ {+1, −1}).
Also, the other constraints in (6) say 0 ≤ α1, α2 ≤ C. So, α2 needs to be within
[L,H] on the figure (in order for α1 to stay within [0, C]), where we will always
have 0 ≤ L,H ≤ C. To do the coordinate ascent step, we will optimize the
objective over α2, keeping it within [L,H]. Intuitively, (6) becomes:
max
α1 + α2 + constants −XαiαkyiykxT
i xk where α1 = y1(ζ α2y2).
α2∈[L,H] 2i,k −
(7)
The objective is quadratic in α2. This means we can just set its derivative to 0
to optimize it and get α2 for the next iteration of SMO. If the optimal value is
12
outside of [L,H], just choose α2 to be either L or H for the next iteration.
For instance, if this is a plot of (7)’s objective (sometimes it doesn’t look like
this, sometimes it’s upside-down), then we’ll choose :
Note: there are heuristics to choose the order of αi
’s chosen to update.
13
MIT OpenCourseWare
http://ocw.mit.edu
15.097 Prediction: Machine Learning and Statistics
Spring 2012
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
MIT 15.097 Course Notes
Let’s start with some intuition about margins.
The margin of an example xi = “distance” from example to decision boundary
= yif(xi)
The margin is positive if the example is on the correct side of the decision boundary,
otherwise it’s negative.
Here’s the intuition for SVM’s:
• We want all examples to have large margins, want them to be as far from
decision boundary as possible.
• That way, the decision boundary is more “stable,” we are confident in all
decisions.
1
Most other algorithms (logistic regression, decision trees, perceptron) don’t generally
produce large margins. (AdaBoost generally produces large margins.)
As in logistic regression and AdaBoost, function f is linear,
m
f(x) = Xλ
(j)x
(j) + λ0.
j=1
Note that the intercept term can get swept into x by adding a 1 as the last
component of each x. Then f(x) would be just λ
Tx but for this lecture we’ll
keep the intercept term separately because SVM handles that term differently
than if you put the intercept as a separate feature. We classify x using sign(f(x)).
If xi has a large margin, we are confident that we classified it correctly. So we’re
essentially suggesting to use the margin yif(xi) to measure the confidence in our
prediction.
But there is a problem with using yif(xi) to measure confidence in prediction.
There is some arbitrariness about it.
How? What should we do about it?
2
SVM’s maximize the distance from the decision boundary to the nearest training
example – they maximize the minimum margin. There is a geometric perspective
too. I set the intercept to zero for this picture (so the decision boundary passes
through the origin):
The decision boundary are x’s where λ
Tx = 0. That means the unit vector for
λ must be perpendicular to those x’s that lie on the decision boundary.
Now that you have the intuition, we’ll put the intercept back, and we have to
translate the decision boundary, so it’s really the set of x’s where λ
Tx = λ0.
The margin of example i is denoted γi
:
B is the point on the decision boundary closest to the positive example xi
. B isλB = xi − γi||λ|| 23
since we moved −γi units along the unit vector to get from the example to B.
Since B lies on the decision boundary, it obeys λ
Tx + λ0 = 0, where x is B. (I
wrote the intercept there explicitly).
Note that here we normalized so we wouldn’t have the arbitrariness in the meaning
of the margin.
If the example is negative, the same calculation works, with a few sign flips (we’d
need to move γi units rather than −γi units).
So the “geometric” margin from the picture is the same as the “functional”
margin ˜ yif(xi).
Maximize the minimum margin
Support vector machines maximize the minimum margin. They would like to
have all examples being far from the decision boundary. So they’ll choose f this
way:
For any λ and λ0 that satisfy this, any positively scaled multiple satisfies them
too, so we can arbitrarily set ||λ||2 = 1/γ so that the right side is 1.
Now when we maximize γ, we’re maximizing γ = 1/ ||λ||2
.
Txi + λ0) − 1 ≥ 0 i = 1 . . . m (1)
(the 1/2 and square are just for convenience) which is the same as:
Writing the KKT conditions, starting with Lagrangian stationarity, where we
need to find the gradient wrt λ and the derivative wrt λ0:
m m
∇λL([λ, λ0], α) = λ −
Xαiyixi = 0 =⇒ λ =
Xαiyixi
.
i=1 i=1
m m
∂
L([λ, λ0], α) = αiyi = 0 = ∂λ0
−
i=1
⇒ αiyi = 0.
i=1
αi ≥ 0
X
∀i (dual feasibilit
X
y)
αi −yi(λ
Txi + λ0) + 1
= 0
T
∀i (complementary slackness)
−yi(λ xi + λ0) + 1 ≤ 0. (primal feasibility)
5
Using the KKT conditions, we can simplify the Lagrangian.
m m m
1
L([λ, λ0], α) = kλk
2 T
2 + λ
X(−αiyixi) +X(−αiyiλ0) + αi
2
i=1 i=1 i=1
(We just expanded terms. Now we’ll plug in the first KKT condition.)
X
m
1
= ||λ||2
2 − || ||2 λ 2 − λ0
X
m
(αiyi) + i
2
Xα
i=1 i=1
(Plug in the second KKT condition.)
Again using the first KKT condition, we can rewrite the first term.
Plugging back into the Lagrangian (2), which now only depends on α, and
putting in the second and third KKT conditions gives us the dual problem;
max
1 X T αi 0 i = 1 . . . m L(α) = αi − αiαkyiykx
We’ll use the last two KKT conditions in what follows,
P
for instance to get conditions
on λ0, but what we’ve already done is enough to define the dual problem
for α.
We can solve this dual problem. Either (i) we’d use a generic quadratic programming
solver, or (ii) use another algorithm, like SMO, which I will discuss
6
later. For now, assume we solved it. So we have α1∗, . . . , αm∗. We can use the
solution of the dual problem to get the solution of the primal problem. We can
plug α∗
into the first KKT condition to get, but we can see something cool in the process.
Support Vectors
Look at the complementary slackness KKT condition and the primal and dual
feasibility conditions:
The examples in the first category, for which the scaled margin is 1 and the
constraints are active are called support vectors. They are the closest to the
decision boundary.
7
Finish What We Were Doing Earlier
To get λ∗0, use the complementarity condition for any of the support vectors (in other words, use the fact that the unnormalized margin of the support vectors is one):
If you take a positive support vector, yi = 1,
Written another way, since the support vectors have the smallest margins,
So that’s the solution! Just to recap, to get the scoring function f
∗
for SVM,
you’d compute α∗
from the dual problem (3), plug it into (4) to get λ∗, plug that
into the equation above to get λ∗0, and that’s the solution to the primal problem,
and the coefficients for f∗.8
Support vectors
Image by MIT OpenCourseWare.
Because of the form of the solution:
it is possible that λ∗
is very fast to calculate.
Why is that? Think support vectors.
The Nonseparable Case
If there is no separating hyperplane,
there is no feasible solution to the problem we wrote above. Most real problems
are nonseparable.
Let’s fix our SVM so it can accommodate the nonseparable case. The new formulation
will penalize mistakes the farther they are from the decision boundary.
So we are allowed to make mistakes now, but we pay a price.
9
Let’s change our primal problem (1) to this new primal problem:
So the constraints allow some slack of size ξi
, but we pay a price for it in the
objective. That is, if yif(xi) ≥ 1 then ξi gets set to 0, penalty is 0. Otherwise,
if yif(xi) = 1 − ξi
, we pay price ξi
.
2 Parameter C trades off between the twin goals of making the λ
2
|| ||2
small (making
what-was-the-minimum-margin 1/ ||λ||2
large) and ensuring that most examples
2
have margin at least 1/ ||λ||2
.
Going on a Little Tangent
Rewrite the penalty another way:
If yif(xi) ≥ 1, zero penalty. Else, pay price ξi = 1 − yif(xi)
Third time’s the charm:
Pay price ξi = b1 − yif(xi)c+
where this notation bzc+ means take the maximum of z and 0.
Equation (5) becomes:
m
1
min ||λ||2
2 + C
Xb1 − yif(xi)
λ,λ0 2
i=1
c+
Does that look familiar?
10
The Dual for the Nonseparable Case
Form the Lagrangian of (5):
m m m
1
L(λ , α, r) =
2
||λ||2
, b, ξ T
2 + C
Xξi αi yi(λ xi + λ0) 1 + ξi riξi
where αi
’s and ri
’s are Lagrange multipliers (constrained to be ≥ 0). The dual
turns out to be (after some work)
i=1 αiyi = 0
So the only difference from the original problem’
P
s Lagrangian (3) is that 0 ≤ αi
was changed to 0 ≤ αi ≤ C. Neat!
Solving the dual problem with SMO
SMO (Sequential Minimal Optimization) is a type of coordinate ascent algorithm,
but adapted to SVM so that the solution always stays within the feasible
region.
Start with (6). Let’s say you want to hold α2, . . . , αm fixed and take a coordinate
step in the first direction. That is, change α1 to maximize the objective in (6).
Can we make any progress? Can we get a better feasible solution by doing this?
m Turns out, no. Look at the constraint in (6), i=1 αiyi = 0. This means:
So, since α2, . . . , αm are fixed, α1 is also fixed.
11
So, if we want to update any of the αi
’s, we need to update at least 2 of them
simultaneously to keep the solution feasible (i.e., to keep the constraints satis-
fied).
Start with a feasible vector α. Let’s update α1 and α2, holding α3, . . . , αm fixed.
What values of α1 and α2 are we allowed to choose?
Again, the constraint is: α1y1 + α2y2 = − i=3 αiyi =: ζ (fixed constant). Pm
We are only allowed to choose α1, α2 on the line, so when we pick α2, we get α1
automatically, from1α1 = (ζy1− α2y2)= y1(ζ − α2y2) (y1 = 1/y1 since y1 ∈ {+1, −1}).
Also, the other constraints in (6) say 0 ≤ α1, α2 ≤ C. So, α2 needs to be within
[L,H] on the figure (in order for α1 to stay within [0, C]), where we will always
have 0 ≤ L,H ≤ C. To do the coordinate ascent step, we will optimize the
objective over α2, keeping it within [L,H]. Intuitively, (6) becomes:
max
α1 + α2 + constants −XαiαkyiykxT
i xk where α1 = y1(ζ α2y2).
α2∈[L,H] 2i,k −
(7)
The objective is quadratic in α2. This means we can just set its derivative to 0
to optimize it and get α2 for the next iteration of SMO. If the optimal value is
12
outside of [L,H], just choose α2 to be either L or H for the next iteration.
For instance, if this is a plot of (7)’s objective (sometimes it doesn’t look like
this, sometimes it’s upside-down), then we’ll choose :
Note: there are heuristics to choose the order of αi
’s chosen to update.
13
MIT OpenCourseWare
http://ocw.mit.edu
15.097 Prediction: Machine Learning and Statistics
Spring 2012
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.