About a month ago I posted here on large scale SVM. The conclusion of that post was that linear SVM is a solved problem, mainly thanks to the Pegasos stochastic gradient descent algorithm.
Today I had the pleasure of meeting Shai Shalev-Shwartz, the author of Pegasos. I asked Shai if he could explain to me (namely, for dummies..) why Pegasos works so well. This is what I heard. Pegasos is a stochastic gradient descent method. Instead of computing the costly exact gradient over the whole data set, a random data point is selected and an approximate gradient is computed based solely on that point. The solution therefore advances in random directions; however, in expectation, those random directions lead to the exact global solution.
I asked Shai if he could provide a simple MATLAB code that demonstrates the essence of Pegasos. Here is the code I got from him:
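To see the "correct in expectation" claim concretely, here is a quick numpy check (my own sketch, not Shai's code, and in Python rather than MATLAB): the subgradient built from one uniformly sampled point agrees, on average over the sample, with the full-batch subgradient of the regularized hinge loss.

```python
import numpy as np

# Sketch (not Shai's code): for the regularized hinge loss
#   f(w) = lambda*w'w + (1/m) * sum_i max(0, 1 - y_i * x_i'w),
# the single-point subgradient
#   g_i = 2*lambda*w - 1[y_i * x_i'w < 1] * y_i * x_i
# equals the full subgradient in expectation over a uniform random i.

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
Y = np.sign(X[:, 0] + 0.1 * rng.normal(size=m))
w = rng.normal(size=d)
lam = 0.1

margins = Y * (X @ w)
active = (margins < 1).astype(float)      # points violating the margin

# Full subgradient of f at w
full_grad = 2 * lam * w - (active * Y) @ X / m

# Average of the m single-point subgradients (= expectation under uniform sampling)
per_point = 2 * lam * w[None, :] - (active * Y)[:, None] * X
avg_grad = per_point.mean(axis=0)

print(np.allclose(full_grad, avg_grad))   # the two coincide
```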
% w = pegasos(X,Y,lambda,nepochs)
% Solve the SVM optimization problem without kernels:
%   w = argmin lambda*w'*w + 1/m * sum(max(0,1-Y.*(X*w)))
% Input:
%   X - matrix of instances (each row is an instance)
%   Y - column vector of labels over {+1,-1}
%   lambda - scalar regularization parameter
%   nepochs - how many times to go over the training set
% Output:
%   w - column vector of weights
% Written by Shai Shalev-Shwartz, HUJI
function w = pegasos(X,Y,lambda,nepochs)
  [m,d] = size(X);
  w = zeros(d,1);
  t = 1;
  for i=1:nepochs          % iterations over the full data
    for tau=1:m            % pick a single data point
      if Y(tau)*X(tau,:)*w < 1
        % The data point is too close to the separator,
        % or it is on the wrong side of the separator:
        % take a gradient step towards the point.
        w = (1-1/t)*w + 1/(lambda*t)*Y(tau)*X(tau,:)';
      else
        w = (1-1/t)*w;     % no margin violation: only shrink w
      end
      t = t+1;             % increment the step counter
    end
  end
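For readers without MATLAB, here is a rough Python translation of the code above (my own port, not Shai's; note it sweeps the data in order, while the Pegasos analysis assumes uniformly random picks), tried on an easy toy problem:

```python
import numpy as np

def pegasos(X, Y, lam, nepochs):
    """Rough Python port of the MATLAB code above (my translation, may differ in details)."""
    m, d = X.shape
    w = np.zeros(d)
    t = 1
    for _ in range(nepochs):
        for tau in range(m):            # pick a single data point
            if Y[tau] * (X[tau] @ w) < 1:
                # margin violated: step towards the point
                w = (1 - 1/t) * w + Y[tau] * X[tau] / (lam * t)
            else:
                # no violation: only shrink w
                w = (1 - 1/t) * w
            t += 1
    return w

# Toy usage: two well-separated 2-D Gaussian clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 0.5, (50, 2)), rng.normal(-2, 0.5, (50, 2))])
Y = np.hstack([np.ones(50), -np.ones(50)])
w = pegasos(X, Y, lam=0.1, nepochs=10)
acc = np.mean(np.sign(X @ w) == Y)
print(acc)   # should be near-perfect on this easy problem
```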
You must agree with me that this is a very simple and elegant piece of code.
And here are my two stupid questions:
1) Why do we take a gradient step only when the margin is below the magic number 1, i.e. when Y(tau)*X(tau,:)*w < 1?
2) Why do we update w even when a gradient step is not needed?
Answers I got from Shai:
1) It means either the data point is close to the separator, or it is far from the separator but on the wrong side (i.e., y*w*x is a large negative number).
2) The actual distance from a point x to the separator is |w*x| / (||w|| * ||x||). So, to increase this number we want both |w*x| to be large and ||w|| to be small. That is why, even when we don't take a gradient step, we always shrink ||w||.
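Answer 1 can be checked numerically (my own toy example, not from Shai): with a fixed w, the condition y * (x'w) < 1 fires both for a correctly classified point inside the margin and for a badly misclassified point, but not for a safely classified one.

```python
import numpy as np

# Toy check of answer 1: the single test y * (x'w) < 1 covers both failure
# modes -- too close to the separator, or on the wrong side of it.
w = np.array([1.0, 0.0])
cases = [
    (np.array([0.3, 0.0]), +1),   # correct side, inside the margin -> update
    (np.array([-5.0, 0.0]), +1),  # wrong side, far away -> update
    (np.array([3.0, 0.0]), +1),   # correct side, outside the margin -> no update
]
results = [bool(y * (x @ w) < 1) for x, y in cases]
print(results)   # [True, True, False]
```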
Sunday, April 15, 2012
Wednesday, March 7, 2012
Large scale SVM (support vector machine)
Not long ago I had the pleasure of visiting the Toyota Technological Institute at Chicago.
I had an interesting meeting with Joseph Keshet. We discussed the best way to deploy large scale kernel SVM.
According to Joseph, linear SVM is a pretty much solved problem. State of the art solutions include Pegasos and SVMLight, see Joachims 2006.
Recently, Joseph has worked on large scale SVMs on two fronts: GPU and MPI.
The GPU solver, for kernelized (not linear) SVM, can be found here. This paper appeared in KDD 2011.
Here is an image depicting the nice speedup they got:
The second SVM activity by Joseph is quickly approximating the kernel matrix using a Taylor series expansion. This work is presented in their arxiv paper. Evaluating the kernel matrix is a major overhead in SVM implementations, especially because it is dense. Previously, I implemented a large scale SVM solver on up to 1024 cores of an IBM BlueGene supercomputer. About 90% of the time was spent on evaluating the kernel matrix. I wish I had some fancy techniques like the ones Joseph proposes for quickly approximating the kernel matrix...
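To make the overhead concrete, here is a small illustrative sketch (my own, not from the papers above): a dense RBF Gram matrix has m^2 entries, so its computation and storage grow quadratically with the number of training points.

```python
import numpy as np

# Illustration: the dense RBF kernel (Gram) matrix
#   K[i, j] = exp(-gamma * ||x_i - x_j||^2)
# costs O(m^2 * d) time and O(m^2) memory -- the bottleneck described above.
def rbf_kernel_matrix(X, gamma=0.5):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0))      # clamp tiny negative round-off

m, d = 1000, 10
X = np.random.default_rng(0).normal(size=(m, d))
K = rbf_kernel_matrix(X)
print(K.shape)   # (1000, 1000): a million entries for only 1000 points
```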

