as spam, so the number of positive examples is rather low.
One of the classic ways of solving the classification problem is to use an SVM (support vector machine). SVMLight is a popular implementation of an SVM solver.
Here is a short script I wrote for converting Hearst machine learning challenge data into SVMLight format (and also pegasos format).
%function for converting hearst data to svm light format
%Input: number - the file number. 1-5 Model files. 6 - validation.
%       doclick or doopen - one of them should be 1 and the other zero, depends on which target.
%Written by Danny Bickson, CMU, July 2011.
%This script converts hearst machine learning challenge data into SVMlight format
%namely: ...
% for example
%-1 3:15.4 4:18 19:32
%
function []=convert2svm(number,doclick, doopen)
assert(number>=1 && number<=6);
row_offset = [0 400000 800000 1200000 1600000 0];
rows=[400000 400000 400000 400000 185421 9956];
cols=274;
assert(~(doopen && doclick));
assert(doclick || doopen);
terms273 = {'Sun', 'Mon','Tue', 'Wed', 'Thu', 'Fri', 'Sat'};
ids = num2cell(1:length(terms273));
dict273 = reshape({terms273{:};ids{:}},2,[]);
dict273 = struct(dict273{:});
if (number == 6)
    fid=fopen('validation.csv','r');
    outid=fopen('validation.txt','w');
else
    fid=fopen(['Modeling_', num2str(number), '.csv'],'r');
    if (doclick)
        outid=fopen(['svm', num2str(number), '.txt'],'w');
    else
        outid=fopen(['2svm', num2str(number), '.txt'],'w');
    end
end
assert(outid~=-1);
title=textscan(fid, '%s', 273, 'delimiter', ','); % read title
title=title{1};
title{274} = 'date';% field no. 273 is mistakenly parsed into two fields in matlab because of a ","
% go over rows
tic
for j=1:rows(number)-1
    if (mod(j,500) == 0)
        disp(['row ', num2str(j)]);
        toc
    end
    a=textscan(fid, '%s', 274,'delimiter', ',');
    a=a{1};
    for i=1:cols
        if (i == 1|| i == 2) %handle target
            if ((doclick&&i==1) || (doopen&&i==2))
                if (number == 6)
                    fprintf(outid,'%d ', -1); %target is unknown, write -1 as a placeholder
                else
                    fprintf(outid,'%d ', (2*strcmp(a{i},'Y'))-1);
                end
            end
        elseif (~strcmp(a{i} ,''))%if feature is non zero
            val=a{i};
            if (i == 73) % translate field of the type A01, B03, J05, etc. quickly into a number
                val = val(1)*26+val(3);
            elseif (i==273)
                val = val(2:end); %remove quotation mark
                val = dict273.(val);
            elseif (i==274) % translate date into a number
                val = datenum(a{274});
            else
                if (length(val) == 1)
                    val = uint8(val);
                elseif (sum(isletter(val))==0) % string is all digits, translate to double
                    val = str2double(val);
                else
                    val = sum(uint8(val));%translate a string into a number, using sum of chars, can use more fancy methods here
                end
            end
            fprintf(outid, '%d:%f ', i-2, val); % remove two from field number since first two fields are targets
        end
    end
    fprintf(outid, '\n');
end
fclose(fid);
fclose(outid);
end

The script can actually be run in parallel on a multicore machine. To do so, execute the following in a Linux shell (ideally on a machine with 11 cores):
for i in `seq 1 1 6`
do
  matlab -r "convert2svm($i,1,0)" &
  matlab -r "convert2svm($i,0,1)" &
done

The resulting files are svm1.txt -> svm5.txt (using first target - open email), files 2svm1.txt -> 2svm5.txt (using second target - click email) and the validation.txt file. Next you can merge the files using the command
cat svm1.txt > total.txt
for i in `seq 2 1 5`
do
  cat svm$i.txt >> total.txt
done
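Before feeding the merged file to a solver, it may be worth a quick sanity check. The sketch below is not part of the original conversion script. The helper names label_counts and bad_lines are made up for illustration, and it assumes the output looks exactly as the MATLAB script writes it: a target of 1 or -1, followed by index:value pairs printed with %f. The first helper counts positive and negative examples, which shows how unbalanced the data is. The second counts lines that do not match the expected format, for example lines where a value came out as NaN.

```shell
#!/bin/sh
# Hypothetical helper: count positive and negative examples in an
# SVMlight-format file. The first token on each line is the target.
label_counts() {
    awk '$1 == "1"  { pos++ }
         $1 == "-1" { neg++ }
         END { printf "positive=%d negative=%d\n", pos, neg }' "$1"
}

# Hypothetical helper: print the number of lines that do not look like
# "<target> <index>:<value> <index>:<value> ...".
# Assumes targets are 1 or -1 and values are plain decimals (as printed by %f).
bad_lines() {
    awk '{
        ok = ($1 == "1" || $1 == "-1")
        for (k = 2; k <= NF; k++)
            if ($k !~ /^[0-9]+:-?[0-9]+(\.[0-9]+)?$/) ok = 0
        if (!ok) bad++
    } END { print bad + 0 }' "$1"
}
```

After sourcing these definitions, running label_counts total.txt and bad_lines total.txt on the merged file gives the class balance and the number of suspicious lines. A non-zero count from bad_lines suggests that some field was not translated into a number as intended.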