
Wednesday, December 19, 2012

3rd Generation Collaborative Filtering - Sparse Case [or] Can Matrix Factorization be used for Classification?

Following my gensgd implementation, which supports dense feature matrices, I was asked by my mega collaborator Justin Yan to implement a sparse version that supports the libsvm format.
sparse_gensgd is exactly the same algorithm (high dimensional matrix factorization), but for the sparse case. Perhaps a bit surprisingly, I will show below that the sparse_gensgd algorithm can be used for classification with very nice performance. As a case study I will discuss KDD CUP 2010.

Case study: ACM KDD CUP 2010

In this case study I will show you how you can get state-of-the-art performance from the GraphChi CF toolkit on a recent KDD CUP 2010 task. Here is text from the contest website describing the task:
This year's challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting. 
I have used the libsvm data repository, which converted the task into a binary classification problem. The problem is moderate in size: around 20M samples for training, 750K samples for testing, with 29M sparse features.

The winning team was NTU, and here is their winning paper. Here is a graph depicting their single-model improvement:
As you can see, prediction RMSE around 0.2815 is the best result obtained by a single model.

The data is sparse in the sense that any number of features can appear in one sample. For example, here are the first 3 lines of the data in libsvm format:

1 2:1 103:1 104:1 105:1 106:0.301 107:0.301 32913:1 32917:1 2990385:1 2990386:1 2990387:1 2990388:1 2990389:0.301 2990390:1 2990391:1 2990392:1 2990393:1 2990394:1 2990395:1 2990396:1 2990397:1 2990398:1 2990399:1 2990400:1 2990401:1
0 2:1 92:1 115:1 116:1 117:1 32913:1 32917:1 2990387:1 2990388:1 2990389:0.477 2990390:1 2990391:1 2990393:1 2990394:1 2990396:1 2990398:1 2990399:1 2990402:1 2990403:1 2990404:1 2990405:1 2990406:1
0 2:1 100:1 143:1 144:1 145:1 12235:1 32913:1 32917:1 2990387:1 2990388:1 2990389:0.477 2990390:1 2990391:1 2990393:1 2990394:1 2990396:1 2990398:1 2990399:1 2990407:1 2990408:1 2990409:1 2990410:1 2990411:1

The target is either 1 or 0; this is the value we would like to predict as part of the matrix factorization procedure. The rest of the features are integer ids; most of them are binary (1), but some are doubles (e.g. 0.477). Given the validation dataset, where only the features are given, we would like to predict the target - either 0 or 1.
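To make the format concrete, here is a minimal Python sketch (my own illustration, not part of GraphChi) that parses one such libsvm line:

def parse_libsvm_line(line):
    # The first token is the 0/1 target; the rest are id:value pairs.
    tokens = line.split()
    target = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        feature_id, value = token.split(":")
        features[int(feature_id)] = float(value)
    return target, features

target, features = parse_libsvm_line("1 2:1 103:1 106:0.301")
# -> target == 1.0, features == {2: 1.0, 103: 1.0, 106: 0.301}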


Now we run sparse_gensgd and we get:

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/sparse_gensgd --training=kddb --cutoff=0.5 --calc_error=1 --quiet=1 --gensgd_mult_dec=0.99999 --max_iter=100 --validation=kddb.t --gensgd_rate3=1e-4 --D=20 --gensgd_regw=1e-4 --gensgd_regv=1e-4 --gensgd_rate1=1e-4 --gensgd_rate2=1e-4 --gensgd_reg0=1e-3
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
   253.313) Iteration:   0 Training RMSE:   0.316999 Train err:   0.134567  Validation RMSE:   0.301573 Validation Err:    0.11255
   302.651) Iteration:   1 Training RMSE:   0.311927 Train err:   0.131657  Validation RMSE:   0.299295 Validation Err:   0.111737
   350.719) Iteration:   2 Training RMSE:   0.309526 Train err:   0.130117  Validation RMSE:   0.298312 Validation Err:   0.111191
   399.433) Iteration:   3 Training RMSE:   0.307839 Train err:   0.128916  Validation RMSE:   0.297752 Validation Err:   0.111072
...

   1598.31) Iteration:  27 Training RMSE:   0.293562 Train err:   0.117877  Validation RMSE:   0.288989 Validation Err:   0.109334
   1647.51) Iteration:  28 Training RMSE:   0.293217 Train err:   0.117608  Validation RMSE:   0.288871 Validation Err:   0.109252
   1696.24) Iteration:  29 Training RMSE:   0.292884 Train err:   0.117357  Validation RMSE:   0.288908 Validation Err:   0.109398
   1745.18) Iteration:  30 Training RMSE:   0.292555 Train err:   0.117106  Validation RMSE:   0.289125 Validation Err:   0.109435


As you can see, we got a nice validation RMSE of 0.289, while the best customized solution for this task gets an RMSE of 0.281. So for a general purpose solver the obtained performance is quite nice.

Some explanation about the run time parameters:
--training - training input file name
--validation - validation input file name
--D - width of the feature vectors
--calc_error - calculates the classification error
--cutoff - threshold value used for binary classification. Since the data contains 0/1 labels I set the cutoff threshold to 0.5 (see the sketch after this list)
--max_iter - number of iterations
--gensgd_rate1/2/3 - learning rates (gradient step sizes)
--gensgd_regw/v/0 - regularization rates
--quiet - runs in less verbose mode
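For intuition, here is a small Python sketch (my own, not GraphChi code) showing how a real-valued prediction is turned into a 0/1 decision with --cutoff, and how the two reported metrics differ:

import numpy as np

def rmse(predictions, labels):
    # Root mean squared error of the raw real-valued predictions.
    return float(np.sqrt(np.mean((np.asarray(predictions) - np.asarray(labels)) ** 2)))

def classification_error(predictions, labels, cutoff=0.5):
    # Fraction of samples where thresholding at `cutoff` disagrees with the 0/1 label.
    decisions = (np.asarray(predictions) >= cutoff).astype(int)
    return float(np.mean(decisions != np.asarray(labels)))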

Instructions:
1) Install GraphChi as instructed here (steps 1-3).
2) Download the datasets kddb (training) and kddb.t (validation) and put them in the root GraphChi folder. (Tip: use bunzip2 to decompress those files.)
3) Create a file named kddb\:info with the following two lines:
%%MatrixMarket matrix coordinate real general
1000 1000 19264097
4) Create a file named kddb.t\:info with the following two lines:
%%MatrixMarket matrix coordinate real general
1000 1000 748400
5) Run as instructed.






Friday, December 14, 2012

Collaborative filtering - 3rd generation - part 2

NOTE: This blog post is two years old. We have reimplemented this code as part of Graphlab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization are needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset. 
Anyone who wants to try it out should email me, and I will send you the exact same code in Python.

**********************************************************************************
A couple of days ago I wrote about new experimental software I am writing - what I call 3rd generation collaborative filtering software. I got a lot of interesting feedback from my readers, which helps improve the software. Previously I examined its performance on the KDD CUP 2012 dataset. Now I have tried it on completely different datasets and I am quite pleased with the results.

First dataset: Airline on time


Below I will explain how to deploy it on a different problem domain: airline on-time performance. It is a completely different dataset from a different domain, but the gensgd software can deal with it without any modification. I hope that these results, which show how flexible the software is, will encourage additional data scientists to try it out!

The airline on-time dataset has information about 10 years of flights in the US. The data for each year is a CSV file with the following format:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

The fields are rather self-explanatory. Each line represents a single flight, with information about the date, carrier, airports, etc.; the interesting fields are the varying information about flight duration.

And here are the first few lines:

2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA

Note: you can get the dataset using the commands:
curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 -o 2008.csv.bz2
bunzip2 2008.csv.bz2


First task: can we predict the total time the flight was in the air?


Well, for a matrix factorization method, it is not clear what the actual matrix is here. That is why it is useful to have flexible software. In my experiments I have chosen "UniqueCarrier" and "FlightNum" as the two fields which form the matrix, because they characterize each flight rather uniquely. Next we need to decide which field we want to predict. I have chosen ActualElapsedTime as the prediction target. Note that those fields are chosen on the fly, so you are more than welcome to choose others and see how good the prediction is in that case.
(Additional information about the meaning of each field is found here.)
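Before factorizing anything, it is useful to know the trivial baseline. Here is a short Python sketch (pandas assumed available; the column name is taken from the CSV header above) that computes the RMSE of always predicting the mean elapsed time:

import numpy as np
import pandas as pd

df = pd.read_csv("2008.csv")
elapsed = df["ActualElapsedTime"].dropna()
# RMSE of a constant predictor that always outputs the global mean
baseline_rmse = np.sqrt(((elapsed - elapsed.mean()) ** 2).mean())
print("mean: %.1f minutes, baseline RMSE: %.1f minutes" % (elapsed.mean(), baseline_rmse))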

First let's use traditional matrix factorization.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1  --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 0 : 

INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum

   7.58561) Iteration:   0 Training RMSE:    67.1094
   11.7177) Iteration:   1 Training RMSE:    64.6665
   15.8441) Iteration:   2 Training RMSE:    63.2155
   19.9971) Iteration:   3 Training RMSE:    59.0044
   24.0989) Iteration:   4 Training RMSE:    53.9083
   28.1962) Iteration:   5 Training RMSE:    50.2416
...
   77.6041) Iteration:  17 Training RMSE:    35.6409
   81.7165) Iteration:  18 Training RMSE:     35.505
   85.8197) Iteration:  19 Training RMSE:    35.4046
   89.9266) Iteration:  20 Training RMSE:    35.3288


We got an RMSE of 35.3 minutes on the predicted flight time, taking into account the carrier and flight number. That is rather bad... we are half an hour off track.

Next let's throw some temporal features into the computation: Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime and CRSArrTime. How do we do that? It is very easy! Just add the command line flag --features=1,2,3,4,5,6,7, namely the positions of the features in the input file (the Year field is constant in a single-year file, so it adds nothing). This is what we call temporal matrix factorization, or tensor factorization. To utilize it in one of the traditional methods, you would need to merge all 8 fields into one integer which encodes the time - which is of course a tedious task.
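For readers who want the idea rather than the C++ code: gensgd follows the libFM spirit, where each feature value gets a bias term and a latent vector, and the prediction sums the biases plus pairwise interactions between the active features' latent vectors. A rough Python sketch of such a prediction (my simplification, not the exact gensgd model):

import numpy as np

def fm_style_predict(w0, bias, latent, active_features):
    # w0: global bias; bias: feature -> scalar; latent: feature -> length-D vector.
    score = w0 + sum(bias[f] for f in active_features)
    # Pairwise interactions between all active features.
    for i, f in enumerate(active_features):
        for g in active_features[i + 1:]:
            score += float(np.dot(latent[f], latent[g]))
    return score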



bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --file_columns=28 --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --max_iter=100  --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --features=1,2,3,4,5,6,7 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 7 : 

INFO:     gensgd.cpp(main:1211): Selected feature:   1 : Month
INFO:     gensgd.cpp(main:1211): Selected feature:   2 : DayofMonth
INFO:     gensgd.cpp(main:1211): Selected feature:   3 : DayOfWeek
INFO:     gensgd.cpp(main:1211): Selected feature:   4 : DepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   5 : CRSDepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   6 : ArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:   7 : CRSArrTime

INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum


   21.8356) Iteration:   0 Training RMSE:    50.3144
   36.6782) Iteration:   1 Training RMSE:    40.4813
    51.425) Iteration:   2 Training RMSE:    36.0579
   66.4348) Iteration:   3 Training RMSE:    33.4226
...
   272.188) Iteration:  17 Training RMSE:    20.0103
   286.887) Iteration:  18 Training RMSE:    19.7198
   301.602) Iteration:  19 Training RMSE:    19.4597
   316.305) Iteration:  20 Training RMSE:    19.2147


With temporal information we now get an RMSE of 19.2 minutes, which is again not that good.

Now let's utilize the full power of gensgd: when the going gets tough - throw in some more features! Without even understanding what each feature means, I have thrown in almost everything...

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --file_columns=28 --max_iter=20 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 14 : 
INFO:     gensgd.cpp(main:1211): Selected feature:   1 : Month
INFO:     gensgd.cpp(main:1211): Selected feature:   2 : DayofMonth
INFO:     gensgd.cpp(main:1211): Selected feature:   3 : DayOfWeek
INFO:     gensgd.cpp(main:1211): Selected feature:   4 : DepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   5 : CRSDepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   6 : ArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:   7 : CRSArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:  12 : CRSElapsedTime
INFO:     gensgd.cpp(main:1211): Selected feature:  13 : AirTime
INFO:     gensgd.cpp(main:1211): Selected feature:  14 : ArrDelay
INFO:     gensgd.cpp(main:1211): Selected feature:  15 : DepDelay
INFO:     gensgd.cpp(main:1211): Selected feature:  16 : Origin
INFO:     gensgd.cpp(main:1211): Selected feature:  17 : Dest
INFO:     gensgd.cpp(main:1211): Selected feature:  18 : Distance
INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum
   36.2089) Iteration:   0 Training RMSE:    21.1476
   61.2802) Iteration:   1 Training RMSE:    10.1963
   86.3032) Iteration:   2 Training RMSE:    8.64215
   111.236) Iteration:   3 Training RMSE:    7.76054
   136.246) Iteration:   4 Training RMSE:    7.14308
   161.221) Iteration:   5 Training RMSE:     6.6629
...
   461.528) Iteration:  17 Training RMSE:    4.26991
    486.61) Iteration:  18 Training RMSE:    4.17239
   511.737) Iteration:  19 Training RMSE:    4.08084
   536.775) Iteration:  20 Training RMSE:    3.99414

Now we got down to a 4 minute average error. But we can continue the computation (run more iterations) and get even below a 2 minute error. Isn't that neat? The average flight time in 2008 is 127 minutes, so a 2 minute prediction error is not that bad.

Conclusion: traditional matrix/tensor factorization has some severe limitations when dealing with complex real world data. Additional techniques are needed to improve accuracy!

Second task: let's predict TaxiIn (the time the plane spends on the ground taxiing in after landing)

This task is slightly more difficult since, as you may imagine, there is much larger relative variation in taxi-in time than in flight time. But is predicting it harder to set up? No... we simply change to --val_pos=19, namely pointing the target at the TaxiIn field.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=19 --rehash=1  --file_columns=28 --gensgd_rate3=1e-3  --gensgd_mult_dec=0.9999 --max_iter=20  --file_columns=28 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --features=1,2,3,4,5,6,7,10,11,12,13,14,15,16,17,18 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[quiet] => [1]
INFO:     gensgd.cpp(main:1155): Total selected features: 16 : 
INFO:     gensgd.cpp(main:1158): Selected feature: 1
INFO:     gensgd.cpp(main:1158): Selected feature: 2
INFO:     gensgd.cpp(main:1158): Selected feature: 3
INFO:     gensgd.cpp(main:1158): Selected feature: 4
INFO:     gensgd.cpp(main:1158): Selected feature: 5
INFO:     gensgd.cpp(main:1158): Selected feature: 6
INFO:     gensgd.cpp(main:1158): Selected feature: 7
INFO:     gensgd.cpp(main:1158): Selected feature: 10
INFO:     gensgd.cpp(main:1158): Selected feature: 11
INFO:     gensgd.cpp(main:1158): Selected feature: 12
INFO:     gensgd.cpp(main:1158): Selected feature: 13
INFO:     gensgd.cpp(main:1158): Selected feature: 14
INFO:     gensgd.cpp(main:1158): Selected feature: 15
INFO:     gensgd.cpp(main:1158): Selected feature: 16
INFO:     gensgd.cpp(main:1158): Selected feature: 17
INFO:     gensgd.cpp(main:1158): Selected feature: 18
   1.56777) Iteration:   0 Training RMSE:    3.89207
   3.01777) Iteration:   1 Training RMSE:    3.64978
    4.5159) Iteration:   2 Training RMSE:    3.46472
    5.8659) Iteration:   3 Training RMSE:    3.30712
   7.26778) Iteration:   4 Training RMSE:    3.17225
    8.7159) Iteration:   5 Training RMSE:    3.06696
...
   23.6072) Iteration:  16 Training RMSE:    2.60147
   24.9789) Iteration:  17 Training RMSE:    2.57697
   26.3267) Iteration:  18 Training RMSE:    2.55768
   27.6967) Iteration:  19 Training RMSE:    2.54186
   29.0773) Iteration:  20 Training RMSE:    2.53113
We get an average RMSE of 2.5 minutes - which, relative to the typical taxi-in time, means that this task is actually more difficult than predicting air time.


Instructions:
0) Install GraphChi from mercurial using the instructions here.
1) Download the year 2008 from here.
2) Open the zip file using:
bunzip2 2008.csv.bz2
3) Create a matrix market format file, named 2008.csv:info with the following two lines:
%%MatrixMarket matrix coordinate real general
20 7130 1000000
4) Run the commands as instructed above.


Second dataset: Hearst machine learning challenge

A while ago Hearst provided data about email campaigns, and the task was to predict user reaction to emails (clicked / not clicked). The data has several million records about emails sent, with around 273 user features for each email. Here are some of the available fields:
CLICK_FLG,OPEN_FLG,ADDR_VER_CD,AQI,ASIAN_CD,AUTO_IN_MARKET,BIRD_QTY,BUYER_DM_BOOKS,BUYER_DM_COLLECT_SPC_FOOD,BUYER_DM_CRAFTS_HOBBI,BUYER_DM_FEMALE_ORIEN,BUYER_DM_GARDEN_FARM,BUYER_DM_GENERAL,BUYER_DM_GIFT_GADGET,BUYER_DM_MALE_ORIEN,BUYER_DM_UPSCALE,BUYER_MAG_CULINARY_INTERS,BUYER_MAG_FAMILY_GENERAL,BUYER_MAG_FEMALE_ORIENTED,BUYER_MAG_GARDEN_FARMING,BUYER_MAG_HEALTH_FITNESS,BUYER_MAG_MALE_SPORT_ORIENTED,BUYER_MAG_RELIGIOUS,CATS_QTY,CEN_2000_MATCH_LEVEL,CLUB_MEMBER_CD,COUNTRY_OF_ORIGIN,DECEASED_INDICATOR,DM_RESPONDER_HH,DM_RESPONDER_INDIV,DMR_CONTRIB_CAT_GENERAL,DMR_CONTRIB_CAT_HEALTH_INST,DMR_CONTRIB_CAT_POLITICAL,DMR_CONTRIB_CAT_RELIGIOUS,DMR_DO_IT_YOURSELFERS,DMR_MISCELLANEOUS,DMR_NEWS_FINANCIAL,DMR_ODD_ENDS,DMR_PHOTOGRAPHY,DMR_SWEEPSTAKES,DOG_QTY,DWELLING_TYPE,DWELLING_UNIT_SIZE,EST_LOAN_VALUE_RATIO,ETECH_GROUP,ETHNIC_GROUP_CODE,ETHNIC_INSIGHT_MTCH_FLG,ETHNICITY_DETAIL,EXPERIAN_INCOME_CD,EXPERIAN_INCOME_CD_V4,GNDR_OF_CHLDRN_0_3,GNDR_OF_CHLDRN_10_12,GNDR_OF_CHLDRN_13_18,GNDR_OF_CHLDRN_4_6,GNDR_OF_CHLDRN_7_9,HH_INCOME,HHLD_DM_PURC_CD,HOME_BUSINESS_IND,I1_BUSINESS_OWNER_FLG,I1_EXACT_AGE,I1_GNDR_CODE,I1_INDIV_HHLD_STATUS_CODE,INDIV_EDUCATION,INDIV_EDUCATION_CONF_LVL,INDIV_MARITAL_STATUS,INDIV_MARITAL_STATUS_CONF_LVL,INS_MATCH_TYPE,LANGUAGE,LENGTH_OF_RESIDENCE,MEDIAN_HOUSING_VALUE,MEDIAN_LEN_OF_RESIDENCE,MM_INCOME_CD,MOSAIC_HH,MULTI_BUYER_INDIV,NEW_CAR_MODEL,NUM_OF_ADULTS_IN_HHLD,NUMBER_OF_CHLDRN_18_OR_LESS,OCCUP_DETAIL,OCCUP_MIX_PCT,PCT_CHLDRN,PCT_DEROG_TRADES,PCT_HOUSEHOLDS_BLACK,PCT_OWNER_OCCUPIED,PCT_RENTER_OCCUPIED,PCT_TRADES_NOT_DEROG,PCT_WHITE,PHONE_TYPE_CD,PRES_OF_CHLDRN_0_3,PRES_OF_CHLDRN_10_12,PRES_OF_CHLDRN_13_18,PRES_OF_CHLDRN_4_6,PRES_OF_CHLDRN_7_9,PRESENCE_OF_CHLDRN,PRIM_FEM_EDUC_CD,PRIM_FEM_OCC_CD,PRIM_MALE_EDUC_CD,PRIM_MALE_OCC_CD,RECIPIENT_RELIABILITY_CD,RELIGION,SCS_MATCH_TYPE,TRW_INCOME_CD,TRW_INCOME_CD_V4,USED_CAR_CD,Y_OWNS_HOME,Y_PROBABLE_HOMEOWNER,Y_PROBABLE_RENTER,Y_RENTER,YRS_SCHOOLING_CD,Z_CREDIT_CARD

Field meanings and codes are described in detail here. You will need to register at the website to get access to the data.

And this is the first entry:
N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,,,,F,F,,,,,,,U,Y,,,,,,,17,69,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NORTH LAUDERDALE,330685141,FL,190815,,,,,,1036,Third Party - Merch,"Mon, 09/20/10 01:04 PM"


For this demo, I used the file Modeling_1.csv which is the first of 5 files, with 400K entries.

We would like to predict the zeroth entry (the click flag). I have taken columns 9 and 10 as the matrix from/to entries. The rest of the columns, up to column 40, are features. (While there are more features, the obtained solution is so accurate that the first 40 are enough.)

After about an hour of playing I got to the following formulation:

./toolkits/collaborative_filtering/gensgd --training=Modeling_1.csv --val_pos=0 --from_pos=9 --to_pos=10 --features=3,4,5,6,7,8,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40 --has_header_titles=1 --rehash=1 --file_columns=200 --rehash_value=1 --calc_error=1 --cutoff=0.5 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1255): Total selected features: 36 : 
INFO:     gensgd.cpp(main:1258): Selected feature:   3 : AQI
INFO:     gensgd.cpp(main:1258): Selected feature:   4 : ASIAN_CD
INFO:     gensgd.cpp(main:1258): Selected feature:   5 : AUTO_IN_MARKET
INFO:     gensgd.cpp(main:1258): Selected feature:   6 : BIRD_QTY
INFO:     gensgd.cpp(main:1258): Selected feature:   7 : BUYER_DM_BOOKS
INFO:     gensgd.cpp(main:1258): Selected feature:   8 : BUYER_DM_COLLECT_SPC_FOOD
INFO:     gensgd.cpp(main:1258): Selected feature:  11 : BUYER_DM_GARDEN_FARM
INFO:     gensgd.cpp(main:1258): Selected feature:  12 : BUYER_DM_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  13 : BUYER_DM_GIFT_GADGET
INFO:     gensgd.cpp(main:1258): Selected feature:  14 : BUYER_DM_MALE_ORIEN
INFO:     gensgd.cpp(main:1258): Selected feature:  15 : BUYER_DM_UPSCALE
INFO:     gensgd.cpp(main:1258): Selected feature:  16 : BUYER_MAG_CULINARY_INTERS
INFO:     gensgd.cpp(main:1258): Selected feature:  17 : BUYER_MAG_FAMILY_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  18 : BUYER_MAG_FEMALE_ORIENTED
INFO:     gensgd.cpp(main:1258): Selected feature:  19 : BUYER_MAG_GARDEN_FARMING
INFO:     gensgd.cpp(main:1258): Selected feature:  20 : BUYER_MAG_HEALTH_FITNESS
INFO:     gensgd.cpp(main:1258): Selected feature:  21 : BUYER_MAG_MALE_SPORT_ORIENTED
INFO:     gensgd.cpp(main:1258): Selected feature:  22 : BUYER_MAG_RELIGIOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  23 : CATS_QTY
INFO:     gensgd.cpp(main:1258): Selected feature:  24 : CEN_2000_MATCH_LEVEL
INFO:     gensgd.cpp(main:1258): Selected feature:  25 : CLUB_MEMBER_CD
INFO:     gensgd.cpp(main:1258): Selected feature:  26 : COUNTRY_OF_ORIGIN
INFO:     gensgd.cpp(main:1258): Selected feature:  27 : DECEASED_INDICATOR
INFO:     gensgd.cpp(main:1258): Selected feature:  28 : DM_RESPONDER_HH
INFO:     gensgd.cpp(main:1258): Selected feature:  29 : DM_RESPONDER_INDIV
INFO:     gensgd.cpp(main:1258): Selected feature:  30 : DMR_CONTRIB_CAT_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  31 : DMR_CONTRIB_CAT_HEALTH_INST
INFO:     gensgd.cpp(main:1258): Selected feature:  32 : DMR_CONTRIB_CAT_POLITICAL
INFO:     gensgd.cpp(main:1258): Selected feature:  33 : DMR_CONTRIB_CAT_RELIGIOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  34 : DMR_DO_IT_YOURSELFERS
INFO:     gensgd.cpp(main:1258): Selected feature:  35 : DMR_MISCELLANEOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  36 : DMR_NEWS_FINANCIAL
INFO:     gensgd.cpp(main:1258): Selected feature:  37 : DMR_ODD_ENDS
INFO:     gensgd.cpp(main:1258): Selected feature:  38 : DMR_PHOTOGRAPHY
INFO:     gensgd.cpp(main:1258): Selected feature:  39 : DMR_SWEEPSTAKES
INFO:     gensgd.cpp(main:1258): Selected feature:  40 : DOG_QTY
INFO:     gensgd.cpp(main:1259): Target variable   0 : CLICK_FLG
INFO:     gensgd.cpp(main:1260): From              9 : BUYER_DM_CRAFTS_HOBBI
INFO:     gensgd.cpp(main:1261): To               10 : BUYER_DM_FEMALE_ORIEN
   54.8829) Iteration:   0 Training RMSE: 0.00927502  Train err:      8e-05
   99.4742) Iteration:   1 Training RMSE: 0.00120904  Train err:          0
   143.852) Iteration:   2 Training RMSE: 0.000793143 Train err:          0
   188.523) Iteration:   3 Training RMSE: 0.000604034 Train err:          0
   233.188) Iteration:   4 Training RMSE: 0.000500067 Train err:          0


We got a very good classifier - starting from the second iteration there are no classification errors.

Some explanation about additional runtime flags not used in the previous examples (see also the sketch after this list):
1) --rehash_value=1 - since the target value is not numeric, I used rehash_value to translate Y/N into two numeric integer bins.
2) --cutoff=0.5 - after hashing the target Y/N we get two integers: 0 and 1. So I use 0.5 as a prediction threshold to decide between Y/N.
3) --file_columns=200 - I am looking only at the first 40 columns, so there is no need to parse all 273 columns. (You can play with this parameter at runtime.)
4) --has_header_titles=1 - the first line of the input file includes column titles
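To picture what --rehash / --rehash_value do, here is a toy Python sketch of the idea (the actual hashing inside gensgd may differ): each distinct string value is mapped to an integer id, so Y/N labels become usable numeric bins.

class Rehasher:
    # Maps each distinct string value to a consecutive integer id.
    def __init__(self):
        self.table = {}
    def __call__(self, value):
        return self.table.setdefault(value, len(self.table))

rehash_value = Rehasher()
assert rehash_value("N") == 0
assert rehash_value("Y") == 1
assert rehash_value("N") == 0  # repeated values map to the same id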

Instructions
1) Register to the hearst website.
2) Download the first data file Modeling_1.csv and put it in the main GraphChi folder.
3) Create a file named Modeling_1.csv:info and put the following two lines in it:
%%MatrixMarket matrix coordinate real general
11 13 400000
4) Run as instructed.

Tuesday, December 11, 2012

Collaborative Filtering - 3rd Generation [or] winning the kdd cup in 5 minutes!

NOTE: This blog post is two years old. We have reimplemented this code as part of Graphlab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization are needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset. 
Anyone who wants to try it out should email me, and I will send you the exact same code in Python.


After spending a few years writing collaborative filtering software with thousands of installations, and after talking to tens of companies and participating in KDD CUP twice,  I have started to develop some next generation collaborative filtering software. The software is very experimental at this point and I am looking for the help of my readers - universities and companies who would like to try it out.
[NOTE: I HAVE ADDED SOME UPDATES BELOW ON THURSDAY DEC 13]

The problem:

Most collaborative filtering methods (like ALS, SGD, bias-SGD, NMF, etc.) use the rating values for computing the matrix factorization. A few "fancier" methods (like tensor-ALS, time-SVD++, etc.) also utilize the temporal information to improve the quality of predictions. So basically we are limited to 2- or 3-dimensional factorization. Typically the utilized data is of the type:
[ user ] [ item ] [ rating ] 
or
[ user ] [ item ] [ time ] [ rating ] 

I am often asked how to approach problems where you have data of the type:

[ user ] [ item ] [ item category] [ purchase amount ] [ quantity ] [ user age ] [ zip code ] [ time ] [ date ] ... [ user rating ]

In other words, how do we utilize additional information we have about user features, item features, or even fancier features like user friendships, etc.? This problem is often encountered in practice, and in many cases papers are written about it using problem-specific constructions. See for example Koenigstein's paper. However, in practice, most users do not want to break their heads inventing novel algorithms; they want a readily accessible method that can take more features into account without much fine tuning.

The solution:

Following the great success of libFM, I thought about implementing a more general SGD method in GraphChi that can take a list of features into account.

A new SGD-based algorithm was developed with the following capabilities:
1) Support for string features ("John Smith bought The Matrix")
2) Support for dynamic selection of features at runtime
3) Support for multiple file formats with column permutations
4) Support for an unlimited number of features
5) Support for multiple ratings of the same item

Working example - KDD CUP 2012 - track1

As a concrete example, I will use the KDD CUP 2012 track1 data to demonstrate how easy it is to set up and try the new method.

Preliminaries:
0) Download track 1 data from here. Extract the zip file.
1) Download and install GraphChi using steps 1-3.

2a) In the root graphchi folder, create a file named rec_log_train.txt:info with the following lines:

%%MatrixMarket matrix coordinate real general
2500000 2500000 73209277


2b) link the file track1/rec_log_train.txt into the root graphchi folder:
cd graphchi
ln -s ../track1/rec_log_train.txt .

Let's look at the input file format:

<49|0>bickson@bigbro6:~/graphchi$ head rec_log_train.txt
2088948 1760350 -1 1318348785
2088948 1774722 -1 1318348785
2088948 786313 -1 1318348785
601635 1775029 -1 1318348785
601635 1902321 -1 1318348785
The input is of the format
[user] [item] [click] [timestamp]

Where click is either -1 (not clicked) or 1 (clicked).

First step: regular matrix factorization

Now let's run a quick matrix factorization using user, item and rating:
 ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt  --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --gensgd_rate0=0.1 --gensgd_rate1=0.1 --gensgd_rate2=0.1 --gensgd_regw=0.1 --gensgd_reg0=0.1 --from_pos=0 --to_pos=1 --val_pos=2

Explanation: --training is the input file. --val_pos=2 means that the rating is in column 2, --rehash=1 (when given) means we treat all fields as strings (and thus support string values), --limit_rating means we handle only the first million ratings (to speed up the demo), --max_iter is the number of SGD iterations, --minval and --maxval are the allowed rating range, and --quiet gives less verbose output. --calc_error displays the classification error (how many predictions were wrong). --file_columns=4 says that there are 4 columns in the input file.

And here is the output we get:



WARNING:  common.hpp(print_copyright:204): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com
[training] => [rec_log_train.txt]
[limit_rating] => [1000000]
[max_iter] => [100]
[gensgd_mult_dec] => [0.999999]
[minval] => [-1]
[maxval] => [1]
[quiet] => [1]
[calc_error] => [1]
[file_columns] => [4]
[gensgd_rate0] => [0.1]
[gensgd_rate1] => [0.1]
[gensgd_rate2] => [0.1]
[gensgd_regw] => [0.1]
[gensgd_reg0] => [0.1]
[from_pos] => [0]
[to_pos] => [1]
[val_pos] => [2]

 === REPORT FOR sharder() ===
[Timings]
edata_flush: 1.00698s (count: 265, min: 0.000625s, max: 0.005065, avg: 0.00379992s)
execute_sharding: 2.3855 s
finish_shard.sort: 0.648602s (count: 4, min: 0.156317s, max: 0.166634, avg: 0.162151s)
preprocessing: 1.72782 s
shard_final: 1.78102s (count: 4, min: 0.432858s, max: 0.454368, avg: 0.445255s)
[Other]
app: sharder
   31.7185) Iteration:   0 Training RMSE:   0.526537 Train err:  0.0010427
Step size 1 0.000387365  Step size 2 0.000780633  Step size 3 8.82609e-06
...
   295.691) Iteration:  99 Training RMSE:   0.206428 Train err: 0.000239218

We got a training RMSE of 0.206, and a training classification error of only about 0.02% (namely, 99.98% of the predictions are correct).

Second step: temporal matrix factorization

Now let's add the time bins (3rd column) into the computation as feature and run again. This is done using the --features=3 command line flag:


./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt  --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --gensgd_rate0=0.1 --gensgd_rate1=0.1 --gensgd_rate2=0.1 --gensgd_regw=0.1 --gensgd_reg0=0.1 --from_pos=0 --to_pos=1 --val_pos=2 --features=3 --gensgd_rate3=0.1 --gensgd_rate4=0.1
WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   3.17175) Iteration:   0 Training RMSE:   0.522901 Train err:   0.033788
...
    284.147) Iteration:  99 Training RMSE:  0.0275943 Train err:          0

By taking the time bins into consideration, we get an improvement from an RMSE of 0.206 to 0.027!
Furthermore, the classification error is down to zero.

Third step: let's throw in some user features!

Besides the rating data, we have some additional information about the users. The file user_profile.txt holds some properties of each user. Here are a few example lines:
100044 1899 1 5 831;55;198;8;450;7;39;5;111
100054 1987 2 6 0
100065 1989 1 57 0
100080 1986 1 31 113;41;44;48;91;96;42;79;92;35
100086 1986 1 129 0
100097 1981 1 75 0
100100 1984 1 47 71;51

The file has the following format:
[user ] [ year of birth ] [ gender ] [ number of tweets ] [ tag ids (area of interest) ]
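For illustration, here is a small Python sketch (my own; it assumes a tag field of '0' means no tags, as the examples above suggest) that parses one line of user_profile.txt:

def parse_user_profile(line):
    # [user] [year of birth] [gender] [number of tweets] [tag ids separated by ';']
    user, birth_year, gender, num_tweets, tags = line.split()
    tag_ids = [] if tags == "0" else [int(t) for t in tags.split(";")]
    return int(user), int(birth_year), int(gender), int(num_tweets), tag_ids

parse_user_profile("100044 1899 1 5 831;55;198;8;450;7;39;5;111")
# -> (100044, 1899, 1, 5, [831, 55, 198, 8, 450, 7, 39, 5, 111])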

Adding user features is simply done by the flag --user_file=user_profile.txt

./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --features=3 --last_item=1 --user_file=user_profile.txt
WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   2.02809) Iteration:   0 Training RMSE:          0 Train err:          0
   2.90718) Iteration:   1 Training RMSE:   0.511614 Train err:   0.022662
   3.74655) Iteration:   2 Training RMSE:    0.49371 Train err:   0.017136
   4.55983) Iteration:   3 Training RMSE:   0.479225 Train err:   0.015074
   5.40781) Iteration:   4 Training RMSE:   0.465404 Train err:   0.016538
   6.27764) Iteration:   5 Training RMSE:   0.451063 Train err:   0.015657
...
   77.5867) Iteration:  96 Training RMSE:  0.0177382 Train err:          0
   78.3384) Iteration:  97 Training RMSE:  0.0176325 Train err:          0
   79.0683) Iteration:  98 Training RMSE:  0.0174947 Train err:          0
   79.7872) Iteration:  99 Training RMSE:  0.0174152 Train err:          0

Overall we got another improvement, from an RMSE of 0.0276 down to 0.0174.

Step four: throw in some item features

In the KDD cup data, we are also given some item features, in the file item.txt

2335869 8.1.4.2 412042;974;85658;174033;974;9525;72246;39928;8895;30066;2245;1670;85658;174033;6977;6183;974;85658;174033;974;9525;72246;39928;8895;30066;2245;1670;85658;174033;6977;6183;974
1774844 1.8.3.6 31449;517124;45008;2796;79868;45008;202761;2796;101376;144894;31449;327552;133996;17409;2796;4986;2887;31449;6183;2796;79868;45008;13157;16541;2796;17027;2796;2896;4109;501517;2487;2184;9089;17979;9268;2796;79868;45008;202761;2796;101376;144894;31449;327552;133996;17409;2796;4986;2887;31449;6183;2796;79868;45008;13157;16541;2796;17027;2796;2896;4109;501517;2487;2184;9089;17979;9268

The format is:
[item id] [category] [list of keywords]


Let's throw in some item information into the algorithm. This is done using the --item_file parameter.



bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=0 --features=3 --last_item=1   --quiet=1 --user_file=user_profile.txt --item_file=item.txt --gensgd_rate5=1e-5 --calc_error=1 --file_columns=4

WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   2.23951) Iteration:   0 Training RMSE:          0 Train err:          0
   4.95858) Iteration:   1 Training RMSE:   0.527203 Train err:   0.022205
   7.54827) Iteration:   2 Training RMSE:   0.499881 Train err:   0.022271
    10.026) Iteration:   3 Training RMSE:   0.476596 Train err:   0.024138
   12.4976) Iteration:   4 Training RMSE:   0.454496 Train err:   0.016523
   14.9459) Iteration:   5 Training RMSE:   0.431336 Train err:   0.016406
...
    217.96) Iteration:  96 Training RMSE:  0.0127242 Train err:          0
   220.116) Iteration:  97 Training RMSE:  0.0126185 Train err:          0
   222.317) Iteration:  98 Training RMSE:  0.0125111 Train err:          0
   224.559) Iteration:  99 Training RMSE:  0.0123526 Train err:          0



We got some significant RMSE improvement - from 0.017 to 0.012.

Thursday, Dec 13 - An update

I am getting a lot of reader input about this blog post, which is excellent!

One question I got from Xavier Amatriain, manager of recommendations @ Netflix, is why I compute training error and not test error. Xavier is absolutely right; I was quite excited about the results, so I wanted to share them before I even had time to compute the test error. Anyway, I promise to do so in a couple of days. But I am quite sure that the model is quite accurate!

I got some interesting input from Tianqi Chen, author of the SVDFeature software:
I think one important thing that we may want to add is the support of classification loss( which is extremely easy for SGD ). Since now days RMSE optimization seems get a bit out of fashioned and most data are click-through data and the optimization target is ranking instead of RMSE. I think the feature selection part is quite interesting. Since adding junk feature in those feature-based factorization model will almost hamper the performance. However, directly replacing L1 constraint on weight will work worse than L2 regularization, so I am curious what trick you used :-)

I also got comments from my golden collaborator Justin Yan:
1. for SGD-FM, it is hard to turn the parameters like learning rate and MCMC based method is slow.
2. Recently I find another great model- online bayisian probit regression (adpredictor) which bing has used in their CTR prediction. this model is a online learning model it is very fast,and the result is better than Logistic regression, so I am thinking about borrowing some ideas from this model to improve LibFM to a online learning model.

The last kind of feedback I am getting is from companies who claim to have already solved this problem... I think that if the problem were already completely solved, I would not be getting so much feedback about it.
What do you think?

Next: next generation cf - part 2 - trying the software on airline on time data.

Monday, December 3, 2012

Collaborative filtering with GraphChi

A couple of weeks ago I covered GraphChi by Aapo Kyrola in my blog.
Here is a quick tutorial for trying out the GraphChi collaborative filtering toolbox that I wrote. Currently it supports ALS (alternating least squares), SGD (stochastic gradient descent), bias-SGD (biased stochastic gradient descent), SVD++, NMF (non-negative matrix factorization), SVD (restarted Lanczos, and one sided Lanczos), RBM (restricted Boltzmann machines), FM (factorization machines), time-SVD++ and CLiMF (collaborative less-is-more filtering). I am soon going to implement several more algorithms.

New: Join our GraphLab& GraphChi LinkedIn group

References

Here are papers which explain the algorithms in more detail:

  • Alternating Least Squares (ALS)
    Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. Proceedings of the 4th international conference on Algorithmic Aspects in Information and Management. Shanghai, China pp. 337-348, 2008.
    
  • Alternating Least Squares (ALS) - parallel coordinate descent (a.k.a. CCD++)
    H.-F. Yu, C.-J. Hsieh, S. Si, I. S. Dhillon, Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. IEEE International Conference on Data Mining(ICDM), December 2012. 
    Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 635-644.
    
  • Stochastic gradient descent (SGD)
     Matrix Factorization Techniques for Recommender Systems Yehuda Koren, Robert Bell, Chris Volinsky In IEEE Computer, Vol. 42, No. 8. (07 August 2009), pp. 30-37. 
    Takács, G, Pilászy, I., Németh, B. and Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research, 10, 623-656.
    
  • Bias stochastic gradient descent (Bias-SGD)
    Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. ACM SIGKDD 2008. Equation (5).
  • SVD++
    Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. ACM SIGKDD 2008. 
  • Weighted-ALS
    Collaborative Filtering for Implicit Feedback Datasets Hu, Y.; Koren, Y.; Volinsky, C. IEEE International Conference on Data Mining (ICDM 2008), IEEE (2008). 
  • Sparse-ALS
    Xi Chen, Yanjun Qi, Bing Bai, Qihang Lin and Jaime Carbonell. Sparse Latent Semantic Analysis. In SIAM International Conference on Data Mining (SDM), 2011. 
    D. Needell, J. A. Tropp CoSaMP: Iterative signal recovery from incomplete and inaccurate samples Applied and Computational Harmonic Analysis, Vol. 26, No. 3. (17 Apr 2008), pp. 301-321. 
  • NMF
    Lee, D.D., and Seung, H.S., (2001), 'Algorithms for Non-negative Matrix Factorization', Adv. Neural Info. Proc. Syst. 13, 556-562.
  • SVD (Restarted Lanczos & One sided Lanczos)
    V. Hernández, J. E. Román and A. Tomás. STR-8: Restarted Lanczos Bidiagonalization for the SVD in SLEPc.

  • tensor-ALS
    Tensor Decompositions, Alternating Least Squares and other Tales. P. Comon, X. Luciani and A. L. F. de Almeida. Special issue, Journal of Chemometrics. In memory of R. Harshman.
    August 16, 2009
  • Restricted Bolzman Machines (RBM)
    G. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. University of Toronto Tech report UTML TR 2010-003.
  • time-svd++
    Yehuda Koren. 2009. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '09). ACM, New York, NY, USA, 447-456. DOI=10.1145/1557019.1557072
  • libFM
    Steffen Rendle (2010): Factorization Machines, in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia.
    
  • PMF
    Salakhutdinov and Mnih, Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. in International Conference on Machine Learning, 2008.
    
  • CLiMF
     CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. Yue Shi, Martha Larson, Alexandros Karatzoglou, Nuria Oliver, Linas Baltrunas, Alan Hanjalic, Sixth ACM Conference on Recommender Systems, RecSys '12.
    

Target

The benefit of using GraphChi is that it requires only a single multicore machine and can scale up to very large models, since at no point is the data fully read into memory. In other words, GraphChi is very useful for machines with limited RAM, since it streams over the dataset. It is also possible to configure how much RAM to use during the run.

Here are some performance numbers:

The above graph shows 6 iterations of SGD (stochastic gradient descent) on the full Netflix data. Netflix has around 100M ratings, so the matrix has 100M non-zeros. The size of the decomposed matrix is about 480K users x 10K movies. I used a single multicore machine with 8 threads, where GraphChi memory consumption was limited to 800MB, using 8 cores. The factorized matrix has a width of D=20. In total it takes around 80 seconds for 6 iterations, which is around 14 seconds per iteration.

Preprocessing the matrix is done once and takes around 35 seconds.

The input to GraphChi ALS/SGD/bias-SGD is the sparse matrix A in sparse matrix market format. The output is two matrices U and V such that A ≈ U*V', where both U and V have a lower dimension D.
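For example, a single prediction can then be reconstructed from the two output files with a few lines of Python (a sketch; it assumes scipy is installed and that the file names follow the output convention described later in this post):

import numpy as np
from scipy.io import mmread

U = np.asarray(mmread("smallnetflix_mm_U.mm"))  # one row of D values per user
V = np.asarray(mmread("smallnetflix_mm_V.mm"))  # one row of D values per item
user, item = 0, 0
prediction = float(U[user].dot(V[item]))        # r_ui = p_u * q_i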

Running and setup instructions

Let's start with an example:

1) Download graphchi from Github using the instructions here.

2) Change directory to graphchi
    > cd graphchi

3) Install graphchi
    > bash install.sh

4a) For ALS/SGD/bias-SGD/SVD++/SVD Download Netflix synthetic sample file. The input is in sparse matrix market format.
    wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm 
  wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mme 

4b) For WALS Download netflix sample file including time:

  wget http://www.select.cs.cmu.edu/code/graphlab/datasets/time_smallnetflix
  wget http://www.select.cs.cmu.edu/code/graphlab/datasets/time_smallnetflixe

5) Run baseline methods on Netflix example:
    ./toolkits/collaborative_filtering/baseline --training=smallnetflix_mm --validation=smallnetflix_mm --minval=1 --maxval=5 --quiet=1  --algorithm=user_mean

5a) Run ALS on the Netflix example:
     ./toolkits/collaborative_filtering/als --training=smallnetflix_mm --validation=smallnetflix_mme --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1


   The first time, the input file will be preprocessed into an efficient binary representation on disk, and then 6 ALS iterations will be run.


5b) Run CCD++ on the Netflix example:
     ./toolkits/collaborative_filtering/als_coord --training=smallnetflix_mm --validation=smallnetflix_mme --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1


5c) Run SGD on the Netflix example:
  ./toolkits/collaborative_filtering/sgd --training=smallnetflix_mm --validation=smallnetflix_mme --sgd_lambda=1e-4 --sgd_gamma=1e-4 --minval=1 --maxval=5 --max_iter=6 --quiet=1

5d) Run bias-SGD on the Netflix example:
  ./toolkits/collaborative_filtering/biassgd --training=smallnetflix_mm --validation=smallnetflix_mme --biassgd_lambda=1e-4 --biassgd_gamma=1e-4 --minval=1 --maxval=5 --max_iter=6 --quiet=1

5e) Run SVD++ on the Netflix example:
  ./toolkits/collaborative_filtering/svdpp --training=smallnetflix_mm --validation=smallnetflix_mme --biassgd_lambda=1e-4 --biassgd_gamma=1e-4 --minval=1 --maxval=5 --max_iter=6 --quiet=1


5f) Run weighted-ALS on the Netflix time example:
  ./toolkits/collaborative_filtering/wals --training=time_smallnetflix --validation=time_smallnetflixe --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1

5g) Run NMF on the reverse Netflix example:
./toolkits/collaborative_filtering/nmf --training=reverse_netflix.mm --minval=1 --maxval=5 --max_iter=20 --quiet=1

5h) Run SVD and one sided SVD on the Netflix example:
./toolkits/collaborative_filtering/svd --training=smallnetflix_mm --nsv=3 --nv=10 --max_iter=5 --quiet=1 --tol=1e-1
./toolkits/collaborative_filtering/svd_onesided --training=smallnetflix_mm --nsv=3 --nv=10 --max_iter=5 --quiet=1 --tol=1e-1

5i) Run tensor-ALS on the Netflix time example:
  ./toolkits/collaborative_filtering/als_tensor --training=time_smallnetflix --validation=time_smallnetflixe --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --K=27 --quiet=1

5j) Run RBM on the smallnetflix data using the command:
./toolkits/collaborative_filtering/rbm --training=smallnetflix_mm --validation=smallnetflix_mme --minval=1 --maxval=5 --max_iter=6 --quiet=1

5k) Run time-svd++ on the time_smallnetflix data:
./toolkits/collaborative_filtering/timesvdpp --training=time_smallnetflix --validation=time_smallnetflixe --minval=1 --maxval=5 --max_iter=6 --quiet=1

5l) Run libFM on the time_smallnetflix data:
./toolkits/collaborative_filtering/libfm --training=time_smallnetflix --validation=time_smallnetflixe --minval=1 --maxval=5 --max_iter=6 --quiet=1


5m) Run PMF on the smallnetflix_mm data:
./toolkits/collaborative_filtering/pmf --training=smallnetflix_mm --quiet=1 --minval=1 --maxval=5 --max_iter=10 --pmf_burn_in=5

5n) Run Bias-SGD2 on the smallnetflix_mm data:

./toolkits/collaborative_filtering/biassgd2 --training=smallnetflix_mm --minval=1 --maxval=5 --validation=smallnetflix_mme --biassgd_gamma=1e-2 --biassgd_lambda=1e-2 --max_iter=10 --quiet=1 --loss=logistic --biassgd_step_dec=0.99999
./toolkits/collaborative_filtering/biassgd2 --training=smallnetflix_mm --minval=1 --maxval=5 --validation=smallnetflix_mme --biassgd_gamma=1e-2 --biassgd_lambda=1e-2 --max_iter=10 --quiet=1 --loss=abs --biassgd_step_dec=0.99999
./toolkits/collaborative_filtering/biassgd2 --training=smallnetflix_mm --minval=1 --maxval=5 --validation=smallnetflix_mme --biassgd_gamma=1e-2 --biassgd_lambda=1e-2 --max_iter=10 --quiet=1 --loss=square --biassgd_step_dec=0.99999

5o) Run CLiMF on the Netflix data:
./toolkits/collaborative_filtering/climf --training=smallnetflix_mm --validation=smallnetflix_mme --binary_relevance_thresh=4 --sgd_gamma=1e-6 --max_iter=6 --quiet=1 --sgd_step_dec=0.9999 --sgd_lambda=1e-6
6) View the output.

For ALS, CCD++, SGD, bias-SGD, WALS, SVD++ , RBM, CLiMF and NMF

Two files are created: filename_U.mm and filename_V.mm. The files store the matrices U and V in dense matrix market format.

head smallnetflix_mm_U.mm

%%MatrixMarket matrix array real general
95526 5
0.693370
1.740420
0.947675
1.328987
1.150084
1.399164
1.292951
0.300416

For tensor-ALS, time-SVD++

An additional output file named filename_T.mm is created. The prediction is computed as the tensor product of U_i, V_j and T_k (namely r_ijk = sum_l( u_il * v_jl * t_kl )).
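In Python terms, the tensor prediction is just an elementwise product summed over the latent dimension (a toy sketch):

import numpy as np

def tensor_predict(u_i, v_j, t_k):
    # r_ijk = sum_l( u_il * v_jl * t_kl )
    return float(np.sum(np.asarray(u_i) * np.asarray(v_j) * np.asarray(t_k)))

tensor_predict([1.0, 2.0], [0.5, 0.5], [1.0, 1.0])  # -> 1.5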

For bias-SGD, SVD++, time-SVD++

Three additional files are created: filename_U_bias.mm, filename_V_bias.mm and filename_global_mean.mm. The bias files include the bias for each user (U) and item (V). The global mean file includes the global mean of the ratings.

For SVD

For each singular vector a file named filename.U.XX is created, where XX is the number of the singular vector. The same holds for filename.V.XX. Additionally, a singular values file is also saved.





Algorithms

Here is a table summarizing the properties of the different algorithms in the collaborative filtering library:
Algorithm         | Method type | Comments
ALS               | ALS         |
ALS_COORD / CCD++ | ALS         | Uses parallel coordinate descent
Sparse-ALS        | ALS         | Sparse feature vectors (useful for classifying users/items together)
SGD               | SGD         |
bias-SGD          | SGD         |
bias-SGD2         | SGD         | Supports logistic loss and MAE
SVD               | Lanczos     |
One Sided SVD     | Lanczos     | For skewed matrices (with one dimension larger than the other)
NMF               |             | For positive matrices.
RBM               | SGD         | MCMC method
SVD++             | SGD         |
LIBFM             | SGD         |
PMF               | ALS         | MCMC method
time-SVD++        | SGD         | Supports time
BPTF              | MCMC        | Not implemented yet
BaseLine          |             |
WALS              | ALS         | Supports weights for each recommendation.
TENSOR-ALS        | ALS         | Tensor factorization.
GENSGD            | SGD         | Supports arbitrary string formats. Can be used for classification.
SPARSE_GENSGD     | SGD         | libsvm format.
CLiMF             | SGD         | Minimizes MRR (mean reciprocal rank)


Note: for tensor algorithms, you need to verify that you have both the rating and its time. Typically the exact time is binned into time bins (a few tens up to a few hundred). Too fine a granularity over the time bins slows down computation and does not improve prediction.
Using matrix market format, you need to specify each rating using 4 fields:
[user] [item] [time bin] [rating]



Common command line options (for all algorithms)




--training                  the training input file
--validation                the validation input file (optional). Validation is data with known ratings which is not used for training.
--test                      the test input file (optional). The test input file is used for computing predictions for a predefined list of user/item pairs.
--minval                    min allowed rating (optional). It is highly recommended to set this value since it improves prediction accuracy.
--maxval                    max allowed rating (optional). It is highly recommended to set this value since it improves prediction accuracy.
--max_iter                  number of iterations to run
--quiet                     run with fewer traces (optional, default = 0).
--halt_on_rmse_increase     (optional, default = 0). Stops execution when the validation error goes up. Runs at least the number of iterations specified in the flag. For example, --halt_on_rmse_increase=10 will run at least 10 iterations and then stop if the validation RMSE increases.
--load_factors_from_file    (optional, default = 0). This option allows two functionalities: instead of starting with a random state, you can start from any predefined state of the algorithm. This also allows running a few iterations, saving the results to disk for fault tolerance, and resuming later FROM THE SAME EXACT state.
--D                         width of the factorized matrix. Default is 20.
--R_output_format           save output in sparse matrix market format (compatible with R)

Baseline method

The baseline method is a simple and quick way of checking the accuracy of the predictions. The baseline method supports three operation modes:
  --algorithm=global_mean // assigns all recommendations to be the global rating mean.
  --algorithm=user_mean // assigns recommendations based on each user's mean value.
  --algorithm=item_mean // assigns recommendations based on each item's mean value.

To summarize, the baseline method assigns one of the three possible means as the recommendation result and computes the prediction error. Any other algorithm should give a better result than the baseline method, so it can be used as a sanity check for the deployed algorithms.
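For intuition, here is a toy Python sketch (my own, not the GraphChi code) of the three means over (user, item, rating) triples:

import numpy as np
from collections import defaultdict

def baseline_means(ratings):
    # ratings: iterable of (user, item, rating) triples.
    global_mean = float(np.mean([r for _, _, r in ratings]))
    per_user, per_item = defaultdict(list), defaultdict(list)
    for user, item, rating in ratings:
        per_user[user].append(rating)
        per_item[item].append(rating)
    user_mean = {u: float(np.mean(v)) for u, v in per_user.items()}
    item_mean = {i: float(np.mean(v)) for i, v in per_item.items()}
    return global_mean, user_mean, item_mean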
ALS (Alternating least squares)

Pros: Simple to use, not many command line arguments
Cons: intermediate accuracy, higher computational overhead

ALS is a simple yet powerful algorithm. In this model the prediction is computed as:
  r_ui = p_u * q_i
Where r_ui is the scalar rating of user u for item i, p_u is the user feature vector of size D, q_i is the item feature vector of size D, and the product is an inner (dot) product.
The output of ALS is two matrices: filename_U.mm and filename_V.mm. The matrix U holds the user feature vectors in its rows (each vector has exactly D columns). The matrix V holds the feature vector for each item (again with exactly D columns). In linear algebra notation, the rating matrix R ≈ UV'.


Below are ALS related command line options:




--lambda=XX - set the regularization. Regularization helps to prevent overfitting.



CCD++ (Alternating least squares, parallel coordinate descent)
Pros: Simple to use, not many command line arguments, faster than ALS
Cons: Slower convergence relative to ALS

In CCD++ the prediction is computed as:
  r_ui = p_u * q_i
Where r_ui is the scalar rating of user u for item i, p_u is the user feature vector of size D, q_i is the item feature vector of size D, and the product is an inner (dot) product.
The output of CCD++ is two matrices: filename_U.mm and filename_V.mm. The matrix U holds the user feature vectors in its rows (each vector has exactly D columns). The matrix V holds the feature vector for each item (again with exactly D columns). In linear algebra notation, the rating matrix R ≈ UV'.


Below are CCD++ related command line options:



--lambda=XX - set the regularization. Regularization helps to prevent overfitting.
Stochastic gradient descent (SGD)

Pros: fast method
Cons: need to tune the step size; more iterations are needed relative to ALS.

SGD is a simple gradient descent algorithm. Prediction in SGD is done as in ALS:
  r_ui = p_u * q_i
Where r_ui is the scalar rating of user u for item i, p_u is the user feature vector of size D, q_i is the item feature vector of size D, and the product is an inner (vector) product.
The output of SGD is two matrices: filename.U and filename.V. The matrix U holds the user feature vectors in its rows (each vector has exactly D columns). The matrix V holds the item feature vectors (again with exactly D columns). In linear algebra notation the rating matrix R ~ U*V'.

--lambda - regularization (optional). Default value 1e-3.
--gamma - gradient step size (optional). Default value 1e-3.
--sgd_step_dec - multiplicative step decrement (optional). Default is 0.9.

Bias-SGD 

Pros: fast method
Cons: need to tune step size

Bias-SGD is a simple gradient descent algorithm where, in addition to the feature vectors, we also compute item and user biases (how much their average rating differs from the global average).
Prediction in bias-SGD is done as follows:

r_ui = global_mean_rating + b_u + b_i + p_u * q_i

Where global_mean_rating is the global mean rating, b_u is the bias of user u, b_i is the bias of item i and p_u and q_i are feature vectors as in ALS. You can read more about bias-SGD in reference [N].

The output of bias-SGD consists of two matrices: filename.U and filename.V. The matrix U holds the user feature vectors in its rows (each vector has exactly D columns). The matrix V holds the item feature vectors (again with exactly D columns). Additionally, the output consists of two vectors: a bias for each user and a bias for each item. Last, the global mean rating is also given as output.

bias-SGD command line arguments:
--biassgd_lambda - regularization (optional). Default value 1e-3.
--biassgd_gamma - gradient step size (optional). Default value 1e-3.
--biassgd_step_dec - multiplicative gradient step decrement (optional). Default is 0.9.


Bias-SGD2 

Pros: fast method, supports logistic loss and MAE

Cons: need to tune step size. Need to supply both --minval and --maxval, the allowed range for recommendations.

Bias-SGD2 is a simple gradient descent algorithm where, in addition to the feature vectors, we also compute item and user biases (how much their average rating differs from the global average).
Prediction in bias-SGD2 is done in three steps:
r_ui = global_mean_rating + b_u + b_i + p_u * q_i
r_ui = 1 / (1 + exp(-r_ui))
r_ui = min_rating + r_ui * rating_range


Where global_mean_rating is the global mean rating, b_u is the bias of user u, b_i is the bias of item i and p_u and q_i are feature vectors as in ALS. You can read more about bias-SGD in reference [N].
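
Here is a minimal Matlab/Octave sketch (with toy values, assuming --minval=1 and --maxval=5) of this three-step prediction:

global_mean_rating = 3.6; b_u = 0.1; b_i = -0.2;       % toy biases
D = 20; p_u = 0.01*rand(1, D); q_i = 0.01*rand(1, D);  % toy feature vectors
min_rating = 1; max_rating = 5;
rating_range = max_rating - min_rating;
r_ui = global_mean_rating + b_u + b_i + p_u * q_i';    % linear score
r_ui = 1 / (1 + exp(-r_ui));                           % logistic squashing into (0,1)
r_ui = min_rating + r_ui * rating_range;               % map back into [minval, maxval]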

The output of bias-SGD2 consists of two matrices: filename.U and filename.V. The matrix U holds the user feature vectors in its rows (each vector has exactly D columns). The matrix V holds the item feature vectors (again with exactly D columns). Additionally, the output consists of two vectors: a bias for each user and a bias for each item. Last, the global mean rating is also given as output.

bias-SGD2 command line arguments:
--biassgd_lambda - regularization (optional). Default value 1e-3.
--biassgd_gamma - gradient step size (optional). Default value 1e-3.
--biassgd_step_dec - multiplicative gradient step decrement (optional). Default is 0.9.
--loss=square/abs/logistic

Koren’s SVD++ 

Pros: more accurate than SGD once tuned, relatively fast method
Cons: a lot of parameters to tune, subject to numerical errors when parameters are out of range.

Koren's SVD++ is slightly fancier than bias-SGD and gives somewhat better prediction results.

Prediction in Koren’s SVD++ algorithm is computed as follows:

r_ui = global_mean_rating + b_u + b_i + q_i * ( p_u + w_u )
Where r_ui is the scalar rating of user u for item i, global_mean_rating is the global mean rating, b_u is a scalar bias for user u, b_i is a scalar bias for item i, p_u is a feature vector of length D for user u, q_i is a feature vector of length D for item i, and w_u is an additional feature vector of length D (the weight) for user u. The product is an inner (vector) product.
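
Here is a minimal Matlab/Octave sketch (with toy values, not real GraphChi output) of this prediction rule, including how the p_u and w_u halves sit in one row of the output U matrix described below:

D = 20;
global_mean_rating = 3.6; b_u = 0.1; b_i = -0.2;  % toy biases
u_row = 0.01*rand(1, 2*D);                        % one row of matrix U: [p_u w_u]
p_u = u_row(1:D); w_u = u_row(D+1:end);
q_i = 0.01*rand(1, D);                            % one row of matrix V
r_ui = global_mean_rating + b_u + b_i + q_i * (p_u + w_u)';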

The output of Koren’s SVD++ consists of 5 output files:
global mean rating - includes the scalar global mean rating.
user_bias - includes a vector with the bias of each user
movie_bias - includes a vector with the bias of each movie
matrix U - includes in each row the feature vector p_u of size D followed by the weight vector w_u of size D, for a total width of 2D.
matrix V - includes in each row the item feature vector q_i of width D.

SVD++ command line arguments:
--svdpp_item_bias_step, --svdpp_user_bias_step, --svdpp_user_factor_step, --svdpp_user_factor2_step - gradient step size (optional). Default value 1e-4.
--svdpp_item_bias_reg, --svdpp_user_bias_reg, --svdpp_user_factor_reg, --svdpp_user_factor2_reg - regularization (optional). Default value 1e-4.
--svdpp_step_dec - multiplicative gradient step decrement (optional). Default is 0.9.

Weighted Alternating Least Squares (WALS)

Pros: allows weighting of ratings (the weight can be thought of as confidence in the rating), at almost the same computational cost as ALS.
Cons: worse modeling error relative to ALS

Weighted ALS is a simple extension of ALS where each user/item pair has an additional weight. In this sense, WALS is a tensor algorithm, since in addition to the rating it also maintains a weight for each rating. The algorithm is described in references [I, J].

Prediction in WALS is computed as follows:
r_ui = w_ui * p_u * q_i

The scalar rating r_ui for user u and item i is computed by multiplying the weight of the rating, w_ui, by the inner product p_u * q_i. Both p_u and q_i are feature vectors of size D.
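
A minimal Matlab/Octave sketch (toy values) of this weighted prediction:

D = 20;
w_ui = 0.8;                        % confidence weight attached to this rating
p_u = rand(1, D); q_i = rand(1, D);
r_ui = w_ui * (p_u * q_i');        % weight times the usual inner product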

Note: for weighted-ALS, the input file has 4 columns:
[user] [item] [weight] [rating]. See example file in section 5e).

--lambda - regularization

Alternating least squares with sparse factors
Pros: excellent for spectral clustering
Cons: less accurate linear model because of the sparsification step

This algorithm is based on ALS, but an additional sparsifying step is performed on either the user feature vectors, the item feature vectors or both. This algorithm is useful for spectral clustering: first the rating matrix is factorized into a product of one or two sparse matrices, and then clustering can be computed on the feature matrices to detect similar users or items.

The underlying algorithm which is used for sparsifying is CoSaMP. See reference [K1].

Below are the sparse-ALS related command line options:

Basic configuration:
--user_sparsity=XX - a number between 0.5 and 1 which defines how sparse the resulting user feature factor matrix is
--movie_sparsity=XX - a number between 0.5 and 1 which defines how sparse the resulting movie feature factor matrix is
--algorithm=XX - selects one of three run modes:
 SPARSE_USR_FACTOR = 1
 SPARSE_ITM_FACTOR = 2
 SPARSE_BOTH_FACTORS = 3



Prediction in sparse-ALS is computed as in ALS.


Example running sparse-ALS:

bickson@thrust:~/graphchi$ ./bin/sparse_als --training=smallnetflix_mm --user_sparsity=0.8 --movie_sparsity=0.8 --algorithm=3 --quiet=1 --max_iter=15
WARNING:  sparse_als.cpp(main:202): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[user_sparsity] => [0.8]
[movie_sparsity] => [0.8]
[algorithm] => [3]
[quiet] => [1]
[max_iter] => [15]
  0) Training RMSE:    1.11754  Validation RMSE:    3.82345
  1) Training RMSE:    3.75712  Validation RMSE:      3.241
  2) Training RMSE:    3.22943  Validation RMSE:    2.03961
  3) Training RMSE:    2.10314  Validation RMSE:    2.88369
  4) Training RMSE:    2.70826  Validation RMSE:    3.00748
  5) Training RMSE:    2.70374  Validation RMSE:    3.16669
  6) Training RMSE:    3.03717  Validation RMSE:     3.3131
  7) Training RMSE:    3.18988  Validation RMSE:    2.83234
  8) Training RMSE:    2.82192  Validation RMSE:    2.68066
  9) Training RMSE:    2.29236  Validation RMSE:    1.94994
 10) Training RMSE:    1.58655  Validation RMSE:    1.08408
 11) Training RMSE:     1.0062  Validation RMSE:    1.22961
 12) Training RMSE:    1.05143  Validation RMSE:     1.0448
 13) Training RMSE:   0.929382  Validation RMSE:    1.00319
 14) Training RMSE:   0.920154  Validation RMSE:   0.996426


tensor-ALS


Note: for tensor-ALS, the input file has 4 columns:
[user] [item] [time] [rating]. See example file in section 5b).

--lambda - regularization


Non-negative matrix factorization (NMF)

Non-negative matrix factorization (NMF) is based on Lee and Seung [reference H].
Prediction is computed as in ALS:
   r_ui = p_u * q_i

Namely, the scalar prediction r_ui for user u and item i is the inner product of the user feature vector p_u (of size D) with the item feature vector q_i (of size D). The only difference is that both p_u and q_i are constrained to nonnegative values.
The output of NMF is two matrices: filename.U and filename.V. The matrix U holds the user feature vectors in its rows (each vector has exactly D columns). The matrix V holds the item feature vectors (again with exactly D columns). In linear algebra notation the rating matrix R ~ U*V', with U>=0, V>=0.

NMF has no special command line arguments.

SVD


SVD is implemented using the restarted Lanczos algorithm.
The input is a sparse matrix market format input file.
The output is 3 files: one file containing the singular values, and two dense matrix market format files containing the matrices U and V.

Note: for larger models it is advised to use svd_onesided, since it significantly reduces memory consumption.

Here is an example Matrix Market input file for the matrix A2:

<235|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ cat A2
%%MatrixMarket matrix coordinate real general
3 4 12
1 1  0.8147236863931789
1 2  0.9133758561390194
1 3  0.2784982188670484
1 4  0.9648885351992765
2 1  0.9057919370756192
2 2  0.6323592462254095
2 3  0.5468815192049838
2 4  0.1576130816775483
3 1  0.1269868162935061
3 2  0.09754040499940952
3 3  0.9575068354342976
3 4  0.9705927817606157


Here is an example of running SVD:

bickson@thrust:~/graphchi$ ./bin/svd --training=A2 --nsv=4 --nv=4 --max_iter=4 --quiet=1
WARNING: svd.cpp(main:329): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com 
[training] => [A2]
[nsv] => [4]
[nv] => [4]
[max_iter] => [4]
[quiet] => [1]
Load matrix
set status to tol
Number of computed signular values 4
Singular value 0    2.16097     Error estimate: 2.35435e-16
Singular value 1    0.97902     Error estimate: 1.06832e-15
Singular value 2    0.554159    Error estimate: 1.56173e-15
Singular value 3    9.2673e-65  Error estimate: 6.51074e-16
Lanczos finished 7.68793


Listing the output files:

#> ls -lrt
-rw-r--r--  1 bickson bickson       2728 2012-09-20 01:57 graphchi_metrics.txt
-rw-r--r--  1 bickson bickson       2847 2012-09-20 01:57 graphchi_metrics.html
-rw-r--r--  1 bickson bickson        188 2012-09-20 01:57 A2.V.3
-rw-r--r--  1 bickson bickson        179 2012-09-20 01:57 A2.V.2
-rw-r--r--  1 bickson bickson        179 2012-09-20 01:57 A2.V.1
-rw-r--r--  1 bickson bickson        177 2012-09-20 01:57 A2.V.0
-rw-r--r--  1 bickson bickson        208 2012-09-20 01:57 A2.U.3
-rw-r--r--  1 bickson bickson        195 2012-09-20 01:57 A2.U.2
-rw-r--r--  1 bickson bickson        195 2012-09-20 01:57 A2.U.1
-rw-r--r--  1 bickson bickson        194 2012-09-20 01:57 A2.U.0
-rw-r--r--  1 bickson bickson        271 2012-09-20 01:57 A2.singular_values

Verifying solution accuracy in matlab

>>A2=mmread('A2');
>> full(A2)'

ans =

    0.8147    0.9058    0.1270
    0.9134    0.6324    0.0975
    0.2785    0.5469    0.9575
    0.9649    0.1576    0.9706

Now we read graphchi output using:
% read the top 3 singular values:

>> sigma=mmread('A2.singular_values');
>> sigma=sigma(1:3);

Read the top 3 vectors v:

>> v0=mmread('A2.V.0');
>> v1=mmread('A2.V.1');
>> v2=mmread('A2.V.2');

Read the top 3 vectors u:

>> u0=mmread('A2.U.0');
>> u1=mmread('A2.U.1');
>> u2=mmread('A2.U.2');

Compute an approximation to A2:
>> [u0 u1 u2] * diag(sigma) * [v0 v1 v2]'

ans =

    0.8147    0.9058    0.1270
    0.9134    0.6324    0.0975
    0.2785    0.5469    0.9575
    0.9649    0.1576    0.9706

As you can see we got an identical result to A2.

where

>>  [u0 u1 u2]

ans =

    0.5047    0.5481    0.2737
    0.4663    0.4726   -0.2139
    0.4414   -0.4878    0.7115
    0.5770   -0.4882   -0.6108

>> [v0 v1 v2]'

ans =

    0.7019    0.5018    0.5055
    0.2772    0.4613   -0.8428
   -0.6561    0.7317    0.1847

>> diag(sigma)

ans =

    2.1610         0         0
         0    0.9790         0
         0         0    0.5542

SVD command line arguments

Basic configuration:
--training - input file name (in sparse matrix market format)
--nv - number of inner steps in each iteration. Typically this number should be greater than the number of singular values you look for.
--nsv - number of singular values requested. Should typically be less than --nv.
--ortho_repeats - number of repeats of the orthogonalization step. Default is 1 (no repeats). Increase this number for higher accuracy but slower execution. The maximal allowed value is 3.
--max_iter - number of allowed restarts. The minimum is 2 = no restart.
--save_vectors=0 - disable saving the factorized matrices U and V to file. By default save_vectors=1.
--tol - convergence threshold. For large matrices set this number higher (for example 1e-1), while for small matrices you can set it to 1e-16. The smaller the convergence threshold, the slower the execution.


Understanding the error measure

Following Slepc, the error measure is computed as:
sqrt( ||A v_i - sigma(i) u_i||_2^2 + ||A^T u_i - sigma(i) v_i||_2^2 ) / sigma(i)

Namely, the deviation of the approximation sigma(i) u_i from A v_i, and vice versa.
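
Here is a minimal Matlab/Octave sketch of this error estimate, using Matlab's exact SVD on a toy matrix just to obtain a (u_i, sigma_i, v_i) triplet to plug in:

A = rand(5, 4);
[U, S, V] = svd(A);                              % exact factors, for illustration only
i = 1; u_i = U(:,i); v_i = V(:,i); sigma_i = S(i,i);
err = sqrt(norm(A*v_i - sigma_i*u_i)^2 + norm(A'*u_i - sigma_i*v_i)^2) / sigma_i

For an exact triplet, err is at machine precision, matching the tiny error estimates in the run above.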

Scalability

The code has currently been tested with up to 3.5 billion non-zeros on a 24-core machine. Each Lanczos iteration takes about 30 seconds.

Difference to Mahout

The Mahout SVD solver is implemented using the same Lanczos algorithm. However, there are several differences:
1) In Mahout there are no restarts, so the quality of the solution deteriorates very rapidly; after 5-10 iterations the solution is no longer accurate. Running without restarts can be reproduced in our solution using the --max_iter=2 flag.
2) In Mahout there is a single orthonormalization step in each iteration, while in our implementation there are two (after the computation of u_i and after v_i).
3) Mahout provides no error estimation, while we provide an approximated error for each singular value.
4) Our solution is typically around 100x faster than Mahout.



Notes about parameter tuning (in case not enough singular vectors have converged):

SVD has a few tunable parameters you need to play with:
1) --tol=XX - the tolerance. When not enough singular vectors converge to the desired tolerance, you can increase it, for example from 1e-4 to 1e-2, etc.
2) --nv=XX - this number should be larger than nsv. Typically you can try 20% more or even larger.
3) --nsv=XX - the number of desired singular vectors.
4) --max_iter=XX - the number of restarts. When the algorithm does not converge you can increase the number of restarts.

Restricted Boltzmann Machines (RBM)

The RBM algorithm is detailed in [Hinton's paper]. It is an MCMC method that works on binary data.
In other words, the ratings have to be binned into a discrete space. For example, for KDD CUP 2011, ratings between 0 and 100 can be binned into 10 bins: 0-10, 10-20 etc. rbm_scaling defines the factor by which the rating is divided for binning (in the example it is 10). rbm_bins defines how many bins there are in total. In this example we have 11 bins: 0,1,...,10.

Basic configuration:
--rbm_mult_step_dec=XX - multiplicative step decrement (should be 0.1 to 1, default is 0.9)
--rbm_alpha=XX - alpha parameter: gradient descent step size
--rbm_beta=XX - beta parameter: regularization
--rbm_scaling=XX - optional. Scale the rating by dividing it by the rbm_scaling constant. For example, for the KDD CUP data, ratings of 0..100 can be scaled to the bins 0,1,2,...,10 by setting rbm_scaling=10
--rbm_bins=XX - total number of binary bins used. For example, in the Netflix data where the ratings are 1,2,3,4,5, the number of bins is 6

Advanced topic: Understanding RBM output format and predicting values.  

If you would like to use the GraphChi output files for computing RBM predictions yourself, you should implement the RBM prediction function found here: https://github.com/GraphChi/graphchi-cpp/blob/master/toolkits/collaborative_filtering/rbm.cpp#L129-L154



Assume D is the feature vector length; the default is D=20.

Each user node has 3 fields: h, h0 and h1, each of them a vector of size D (20 by default).
Those vectors are appended to form a single vector, of size 60 by default.
The U matrix has one row per user (M rows), each of width 60.

Each movie node has 3 fields: ni (a double), bi, a vector of size rbm_bins (default 6), and w, a vector of size rbm_bins * D (120 by default).
In the output file, first the bi vector is written (size 6) and then w, for a total of 126 values per row.
The V matrix has one row per item (N rows), each of width 126.

Note that the prediction involves bi, h, w but does not involve h0, h1, ni. 
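
For reference, here is a hedged Matlab/Octave sketch of the standard conditional-RBM prediction rule (following the Salakhutdinov et al. formulation; the exact rule and the layout of w are assumptions here, so check the rbm.cpp link above before relying on it):

D = 20; bins = 6;
u_row = rand(1, 3*D);                      % one row of U: [h h0 h1]
v_row = rand(1, bins + bins*D);            % one row of V: [bi w]
h  = u_row(1:D);                           % only h is used for prediction
bi = v_row(1:bins);                        % per-bin item bias
w  = reshape(v_row(bins+1:end), D, bins);  % per-bin item weights (layout assumed)
score = exp(bi + h * w);                   % unnormalized probability of each bin
prob  = score / sum(score);
prediction = (0:bins-1) * prob';           % expected rating bin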

Koren’s time-SVD++

Pros: more accurate than SVD++
Cons: many parameters to tune, prone to numerical errors.

Koren’s time-SVD++ [Koren's paper above] also takes into account the temporal aspect of the ratings.
Prediction in time-SVD++ algorithm is computed as follows:
  r_uik = global_mean_rating + b_u + b_i + ptemp_u * q_i + x_u * z_k + pu_u * pt_i * q_k

The scalar rating r_uik (the rating of user u for item i at time bin k) equals the sum above. As in Koren’s SVD++, the rating includes the global mean rating and the user and item biases. The remaining terms involve feature vectors, all of length D: for the user we have ptemp_u, x_u and pu_u; for the item we have q_i and pt_i; and for the time bins we have z_k and q_k. The last term, pu_u * pt_i * q_k, is a three-way (tensor) product.
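
A minimal Matlab/Octave sketch (toy values; reading the last term as an elementwise three-way product is an assumption here) of this rule:

D = 20;
global_mean_rating = 3.6; b_u = 0.1; b_i = -0.2;                        % toy biases
ptemp_u = 0.01*rand(1,D); x_u = 0.01*rand(1,D); pu_u = 0.01*rand(1,D);  % user vectors
q_i = 0.01*rand(1,D); pt_i = 0.01*rand(1,D);                            % item vectors
z_k = 0.01*rand(1,D); q_k = 0.01*rand(1,D);                             % time-bin vectors
r_uik = global_mean_rating + b_u + b_i + ptemp_u*q_i' + x_u*z_k' ...
        + sum(pu_u .* pt_i .* q_k);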


Basic configuration:
--lrate=XX - learning rate
--beta - beta parameter (bias regularization)
--gamma - gamma parameter (feature vector regularization)
--lrate_mult_dec - multiplicative step decrement (0.1 to 1, default 0.9)
--D=X - feature vector width. Common values are 20 - 150.

Special Note: This is a tensor factorization algorithm. Please don’t forget to prepare a 4 column matrix market format file, with [user] [ item ] [ time ] [ rating ] in each row.
It is advised to delete intermediate files created by als_tensor, since they have a different format.

Factorization Machines (FM)

GraphChi's libFM implementation contains a subset of the full libFM
functionality, with only three feature types: user, item and time. Users are encouraged to check the original libFM library: http://www.libfm.org/ for an enhanced implementation. The libFM library by Steffen Rendle has a track record of strong performance in KDD CUP competitions and is a highly recommended collaborative filtering package.

Factorization machines is an SGD-type algorithm. It has two differences relative to bias-SGD:
1) It handles time information by adding feature vectors for each time bin.
2) It adds an additional feature for the last item rated by each user.
These differences are supposed to make it more accurate than bias-SGD.
Factorization machines are detailed in reference [P]. There are several variants; here the SGD variant is implemented (and not the ALS one).

Prediction in LIBFM is computed as follows:

r_ui = global_mean_rating + b_u + b_i + b_t + b_li + 0.5*sum((p_u + q_i + w_t + s_li).^2 - (p_u.^2 + q_i.^2 + w_t.^2 + s_li.^2))

Where global_mean_rating is the global mean rating, b_u is the bias of user u, b_i is the bias of item i, b_t is the bias of time t, b_li is the bias of the last item li; p_u is the feature vector of user u, q_i is the feature vector of item i, w_t is the feature vector of time t, and s_li is the feature vector of the last item li. All feature vectors have size D, as in ALS. .^2 is the elementwise power operation (as in Matlab).
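
A minimal Matlab/Octave sketch (toy values) of this prediction rule, with .^2 applied elementwise exactly as in the formula:

D = 20;
global_mean_rating = 3.6; b_u = 0.1; b_i = -0.2; b_t = 0.05; b_li = 0;  % toy biases
p_u = 0.01*rand(1,D); q_i = 0.01*rand(1,D);
w_t = 0.01*rand(1,D); s_li = 0.01*rand(1,D);
s = p_u + q_i + w_t + s_li;                     % sum of the four feature vectors
r_ui = global_mean_rating + b_u + b_i + b_t + b_li ...
       + 0.5 * sum(s.^2 - (p_u.^2 + q_i.^2 + w_t.^2 + s_li.^2));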

The output of LIBFM consists of three matrices: filename.Users, filename.Movies and filename.Times. The matrix Users holds the user feature vectors in each row. (Each vector has exactly D columns). The matrix Movies holds the feature vectors for each item (Each vector has again exactly D columns). The matrix Times holds the feature vectors for each time. Additionally, the output consists of four vectors: bias for each user, bias for each item, bias for each time bin and bias for each last item. Last, the global mean rating is also given as output.


Basic configuration:
--libfm_rate=XX - gradient descent step size
--libfm_regw=XX - gradient descent regularization for biases
--libfm_regv=XX - gradient descent regularization for feature vectors
--libfm_mult_dec=XX - multiplicative step decrease. Should be between 0.1 and 1. Default is 0.9.
--D=X - feature vector width. Common values are 20 - 150.

PMF

Pros: once tuned, better accuracy than ALS, since it involves extra sampling step
Cons: sensitive to numerical errors, needs fine tuning, does not work on every dataset, higher computational cost, higher prediction computational cost.

PMF and BPTF are two Markov Chain Monte Carlo (MCMC) sampling methods. They are based on ALS, but at each step a sample is drawn from the posterior probability to obtain the next state.
Prediction in PMF/BPTF is as in ALS, but instead of computing a single vector product of the current feature vectors, the products along the whole chain are computed and their average is taken.

More formally, the prediction rule of PMF is:
r_ui = [ p_u(1) * q_i(1) + p_u(2) * q_i(2) + ..  + p_u(l) * q_i(l) ] / l

Where l is the length of the chain.

Note: typically in MCMC methods, the first XX samples of the chain are thrown away, so p_u and q_i will start from XX and not from 1.

The prediction rule of BPTF adds a feature vector for each time bin, denoted w:
r_uik = [ p_u(1) * q_i(1) * w_k(1) + p_u(2) * q_i(2) * w_k(2) + ... + p_u(l) * q_i(l) * w_k(l) ] / l
Where the product is a tensor product, namely \sum_j p_uj * q_ij * w_kj.
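
Here is a minimal Matlab/Octave sketch (a toy chain of length l=10 with a burn-in of 5, not real samples) of the PMF averaging rule:

D = 20; l = 10; burn_in = 5;
P = rand(l, D); Q = rand(l, D);      % p_u and q_i as sampled at each chain step
preds = sum(P .* Q, 2);              % p_u(t) * q_i(t) for every sample t
r_ui = mean(preds(burn_in+1:end));   % average only the post-burn-in samples

For BPTF, each term would additionally be multiplied elementwise by the time-bin sample w_k(t) before summing.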

Basic configuration:
--pmf_burn_in=XX - throw away the first XX samples in the chain
--pmf_additional_output=1 - save as output all the samples in the chain (after the burn-in period). Each sample is composed of two feature vectors; each will be saved to its own file.

Example running PMF


Here we run 10 iterations of PMF, where the first 5 are discarded (pmf_burn_in) and the rest are used for computing the prediction:
bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/pmf --training=smallnetflix_mm --quiet=1 --minval=1 --maxval=5 --max_iter=10 --pmf_burn_in=5
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
   1.24716) Iteration:   0 Training RMSE:    1.56917  Validation RMSE:     2.4979 ratings_per_sec:          0
    2.5872) Iteration:   1 Training RMSE:    2.44993  Validation RMSE:    2.03815 ratings_per_sec: 1.16359e+06
   3.95615) Iteration:   2 Training RMSE:     1.7831  Validation RMSE:    1.26519 ratings_per_sec: 1.55628e+06
   5.33609) Iteration:   3 Training RMSE:    1.08493  Validation RMSE:    1.05008 ratings_per_sec: 1.76283e+06
   6.73702) Iteration:   4 Training RMSE:   0.939768  Validation RMSE:   0.993025 ratings_per_sec: 1.88536e+06
Finished burn-in period. starting to aggregate samples
   8.16872) Iteration:   5 Training RMSE:    0.88499  Validation RMSE:   0.978547 ratings_per_sec: 1.95767e+06
   9.54684) Iteration:   6 Training RMSE:   0.864345  Validation RMSE:   0.972835 ratings_per_sec: 2.01243e+06
   10.9789) Iteration:   7 Training RMSE:   0.837162  Validation RMSE:   0.948436 ratings_per_sec: 2.04756e+06
     12.43) Iteration:   8 Training RMSE:   0.823885  Validation RMSE:   0.939388 ratings_per_sec: 2.0749e+06
   13.8361) Iteration:   9 Training RMSE:   0.814482  Validation RMSE:    0.93436 ratings_per_sec: 2.10232e+06


As a sanity check, we now run 10 iterations where all 10 are discarded (pmf_burn_in=10):

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/pmf --training=smallnetflix_mm --quiet=1 --minval=1 --maxval=5 --max_iter=10 --pmf_burn_in=10
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
   1.18773) Iteration:   0 Training RMSE:    1.55811  Validation RMSE:     2.4997 ratings_per_sec:          0
   2.56929) Iteration:   1 Training RMSE:    2.45047  Validation RMSE:    2.07669 ratings_per_sec: 1.16566e+06
   3.94601) Iteration:   2 Training RMSE:    1.75645  Validation RMSE:    1.32239 ratings_per_sec: 1.56984e+06
   5.27586) Iteration:   3 Training RMSE:    1.11416  Validation RMSE:    1.04864 ratings_per_sec: 1.78811e+06
   6.69646) Iteration:   4 Training RMSE:   0.937365  Validation RMSE:   0.994412 ratings_per_sec: 1.89396e+06
   8.05631) Iteration:   5 Training RMSE:   0.886551  Validation RMSE:   0.978154 ratings_per_sec: 1.97636e+06
   9.44744) Iteration:   6 Training RMSE:   0.861688  Validation RMSE:   0.972389 ratings_per_sec: 2.03489e+06
   10.8185) Iteration:   7 Training RMSE:   0.846078  Validation RMSE:   0.972082 ratings_per_sec: 2.07996e+06
   12.2176) Iteration:   8 Training RMSE:   0.836964  Validation RMSE:   0.971611 ratings_per_sec: 2.1115e+06
   13.6285) Iteration:   9 Training RMSE:   0.829531  Validation RMSE:    0.96975 ratings_per_sec: 2.13407e+06

As you can see, the sampling step improves the validation RMSE from 0.969 to 0.934.

GenSGD

It is recommended to first read the detailed GenSGD case studies here:
http://bickson.blogspot.co.il/2012/12/collaborative-filtering-3rd-generation.html

Mandatory configuration:
--training - input file
--from_pos - column number of the feature which is used as "users" in the matrix factorization case. Column numbers start from zero.
--to_pos - column number of the feature which is used as "items" in the matrix factorization case. Column numbers start from zero.
--val_pos - column number of the value (the target variable) we would like to predict, for example the rating in the matrix factorization case. Column numbers start from zero.
--file_columns - number of features in the input (training) file. (Note that from_pos, to_pos and val_pos should be smaller than file_columns.)

Optional configuration:
--rehash=1 - if some or all of the feature fields are strings, you should use rehash=1 to translate them into numeric ids. If all the fields are numbers, use --rehash=0. Default is 0.
--D - latent feature vector width. Default is 20.
--calc_error=1 - when used for classification, calc_error treats the target as a binary value and counts how many validation/training instances are predicted wrongly. See cutoff.
--cutoff - when used for binary classification, cutoff is the threshold value where prediction > cutoff is positive and prediction <= cutoff is negative. Default is 0.
--user_file - file name of additional user properties (optional). Each line should start with the user id, followed by a list of features.
--item_file - file name of additional item properties (optional). Each line should start with the item id, followed by a list of features.
--limit_rating=X - for debugging: limit the number of rows read from the training file to X.

SGD tunable parameters:
--gensgd_rate1 - SGD step size for users (from_pos). Default 1e-2.
--gensgd_rate2 - SGD step size for items (to_pos). Default 1e-2.
--gensgd_rate3 - SGD step size of the rating features in the training file. Default 1e-2.
--gensgd_rate4 - SGD step size of the user/item features in the additional feature files. Default 1e-2.
--gensgd_mult_dec - SGD multiplicative step size decrement. Default 0.9.
--gensgd_regw - SGD bias regularization. Default 1e-3.
--gensgd_reg0 - SGD global mean regularization. Default 1e-1.
--gensgd_regv - SGD feature regularization. Default 1e-3.


Prediction computation in gensgd:

Prediction is computed as follows:
rating = global_mean + sum_f (bias_f) + 1/2 * sum((sum_f pvec_f).^2 - sum_f (pvec_f.^2))

Where f is an index going over all the factors involved, pvec_f is the feature vector of factor f, bias_f is the bias of factor f, and .^2 is an elementwise square. See equation (5) in the libFM paper. (Note that x_i and x_j are all equal to 1 in our implementation.)
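
A minimal Matlab/Octave sketch (toy values, with three active factors such as user, item and one extra feature) of this rule:

D = 20; F = 3;
global_mean = 3.6;
pvec = 0.01*rand(F, D);          % one feature vector per active factor f
bias = 0.01*rand(F, 1);          % one scalar bias per active factor f
s = sum(pvec, 1);                % sum_f pvec_f, a vector of length D
rating = global_mean + sum(bias) + 0.5 * sum(s.^2 - sum(pvec.^2, 1));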

Output of gensgd

The output of gensgd consists of the following files:
1) a matrix of size f x D, where f is the number of feature vectors used and D is the feature vector width. The generated filename is training_file_name + "_U.mm".
2) a vector of size f x 1, where f is the number of feature vectors, which holds the scalar bias of each feature vector. The generated filename is training_file_name + "_bias_U.mm".
3) the global mean. The generated filename is training_file_name + "_global_mean.mm".
4) a mapping file for each feature. For each feature (each column) there is a map between the feature string name and the integer id of this feature in arrays (1) and (2) above. The mapping files are generated only when using the --rehash=1 option. The generated file names are training_file_name + ".map." + feature_id.

Sparse_GenSGD

Sparse_GenSGD is documented in this case study: http://bickson.blogspot.co.il/2012/12/3rd-generation-collaborative-filtering.html

CLiMF

CLiMF was contributed by Mark Levy (last.fm). The CLiMF algorithm is described in the paper: CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. Yue Shi, Martha Larson, Alexandros Karatzoglou, Nuria Oliver, Linas Baltrunas, Alan Hanjalic, Sixth ACM Conference on Recommender Systems, RecSys '12.


CLiMF is a ranking method which optimizes MRR (mean reciprocal rank), an information retrieval measure for top-K recommenders. CLiMF is a variant of latent factor CF which optimizes a significantly different objective function from most methods: instead of trying to predict ratings, CLiMF aims to maximize the MRR of relevant items.

The MRR is the reciprocal rank of the first relevant item found when unseen items are sorted by score, i.e. the MRR is 1.0 if the item with the highest score is a relevant prediction, 0.5 if the first item is not relevant but the second is, and so on. By optimizing MRR rather than RMSE or similar measures, CLiMF naturally promotes diversity as well as accuracy in the recommendations generated.

CLiMF uses stochastic gradient ascent to maximize a smoothed lower bound on the actual MRR. It assumes binary relevance, as in friendship or follow relationships, but the GraphChi implementation lets you specify a relevance threshold for ratings, so you can run the algorithm on standard CF datasets and have the ratings automatically interpreted as binary preferences.

CLiMF-related command-line options:
 --binary_relevance_thresh=xx Consider the item liked/relevant if rating is at least this value [default: 0]
 --halt_on_mrr_decrease Halt if the training set objective (smoothed MRR) decreases [default: false]
 --num_ratings Consider this many top predicted items when computing actual MRR on validation set [default:10000]


 Here is an example on running CLiMF on Netflix data: 
./toolkits/collaborative_filtering/climf --training=smallnetflix_mm --validation=smallnetflix_mme --binary_relevance_thresh=4 --sgd_gamma=1e-6 --max_iter=6 --quiet=1 --sgd_step_dec=0.9999 --sgd_lambda=1e-6
 Training objective:-9.00068e+07 
 Validation MRR: 0.169322 
 Training objective:-9.00065e+07 
 Validation MRR: 0.171909 
 Training objective:-9.00062e+07 
 Validation MRR: 0.172372 
 Training objective:-9.0006e+07 
 Validation MRR: 0.172503 
 Training objective:-9.00057e+07 
 Validation MRR: 0.172544 
 Training objective:-9.00054e+07 
 Validation MRR: 0.172549

Prediction is computed in CLiMF as follows:

reciprocal_rank_ij = g( U_i' * V_j )

where g() is the logistic function, U_i is the feature vector of user i and V_j is the feature vector of item j. Both feature vectors are of size D.
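
A minimal Matlab/Octave sketch (toy values) of this scoring rule:

D = 20;
U_i = rand(1, D); V_j = rand(1, D);
g = @(x) 1 ./ (1 + exp(-x));      % logistic function
score_ij = g(U_i * V_j');         % higher score => item j ranked earlier for user i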

The output of CLiMF is two files: training_file_name_U.mm and training_file_name_V.mm.

Post processing of the output

Example 1: Load output in Matlab, for computing recommendations for ALS/SGD/NMF


a) run ALS:
./toolkits/collaborative_filtering/als --training=smallnetflix_mm --validation=smallnetflix_mme --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1

b) download the script mmread.m.
c) # matlab
    >> A=mmread('smallnetflix_mm');
    >> U=mmread('smallnetflix_mm_U.mm');
    >> V=mmread('smallnetflix_mm_V.mm');
    >> whos
  Name          Size                  Bytes  Class     Attributes

  A         95526x3561             52799104  double    sparse    
  U         95526x5                 3821040  double              
  V         3561x5                   142480  double          

d) compute prediction for user 8, movie 12:
   >> U(8,:)*V(12,:)'

e) compute the approximation error
     >> norm(A-U*V') % may be slow... depending on problem size

Example 2: Load output in Matlab, for verifying bias-SGD results

a) run the command line:
./toolkits/collaborative_filtering/biassgd --training=smallnetflix_mm --validation=smallnetflix_mme --biassgd_lambda=1e-4 --biassgd_gamma=1e-4 --minval=1 --maxval=5 --max_iter=6 --quiet=1 

b) download the script mmread.m.


c) # matlab
>> V=mmread('smallnetflix_mm_V.mm');                 % read item matrix V
>> U=mmread('smallnetflix_mm_U.mm');                 % read user matrix U
>> m=mmread('smallnetflix_mm_global_mean.mm');       % read global mean
>> bV=mmread('smallnetflix_mm_V_bias.mm');           % read item bias
>> bU=mmread('smallnetflix_mm_U_bias.mm');           % read user bias
>> pairs = load('pairs');                            % read user/item pairs
>> A=mmread('smallnetflix_mme');                     % read rating matrix
>>
>> rmse = 0;
>> for r=1:545177,                                   % run over each rating
                                                     % compute bias-SGD prediction
            % using the prediction rule:
            % prediction = global_mean + bias_user + bias_item + vector_user*vector_item
  pred = m + bU(pairs(r,1)) + bV(pairs(r,2)) + U(pairs(r,1),:)*V(pairs(r,2),:)';
  pred = min( 5, pred );                             % truncate prediction [1,5]
  pred = max( 1, pred );
  obs = A( pairs(r,1), pairs(r,2) );                 
  rmse = rmse + (pred - obs).^2;                     % accumulate (prediction - observation)^2
end
>>
>> sqrt( rmse/545177.0 )                             % print RMSE
ans =

   1.1239                                            % compare this training RMSE to the graphchi output

Computing top-K recommendations

For computing top K recommendations out of the computed linear model, use the rating/rating2 commands. The following algorithms are supported: ALS, sparse-ALS, NMF, SGD, WALS, SVD++, bias-SGD, CLiMF, SVD.
For ALS, sparse-ALS, NMF, SGD, CLiMF and SVD use the rating application.
For SVD++, bias-SGD and RBM use the rating2 application.

First you need to run one of the above methods (ALS, SGD, NMF etc.) . Next, compute recommended ratings as follows:

./toolkits/collaborative_filtering/rating --training=smallnetflix_mm --num_ratings=5 --quiet=1 --algorithm=als
WARNING:  common.hpp(print_copyright:128): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[num_rating] => [5]
[quiet] => [1]
Computing recommendations for user 1 at time: 0.827547
Computing recommendations for user 1001 at time: 0.871366
Computing recommendations for user 2001 at time: 0.908017
...
Computing recommendations for user 95001 at time: 2.15397
Rating output files (in matrix market format): smallnetflix_mm.ratings, smallnetflix_mm.ids 
Distance statistics: min 0 max 42.1831 avg 9.51098


The output of the rating utility is two files; the first one is the more useful:
1) filename.ids - includes recommended item ids for each user.
2) filename.ratings - includes scalar ratings of the top K items
bickson@thrust:~/graphchi/toolkits/collaborative_filtering$ head smallnetflix_mm.ids
%%MatrixMarket matrix array real general 
%This file contains item ids matching the ratings. In each row i the top K item ids for user i. (First column is user id, next are top recommendations for this user). 
95526 6 
1 3424 1141 1477 2151 2012 
2 2784 1900 516 1835 1098 
3 1428 3450 2284 2328 58 
4 209 1073 3285 60 1271 
5 132 1702 2575 1816 2284 
6 2787 1816 3024 2514 985
7 3078 375 168 2514 2460 
 ... 

bickson@thrust:~/graphchi/toolkits/collaborative_filtering$ head smallnetflix_mm.ratings
%%MatrixMarket matrix array real general
%This file contains user scalar rating. In each row i, K top scalar ratings of different items recommended for user i. 
95526 6 
1 7.726248219530e+00 7.321665743778e+00 7.023083603761e+00 7.008616274552e+00 
2 6.670937980807e+00 1.222724647853e+01 1.162004403228e+01 1.144299819709e+01 
3 1.133374751034e+01 1.061483854315e+01 7.497070438026e+00 7.187132667285e+00 
4 6.686989429238e+00 6.550680427186e+00 6.542147872641e+00 1.158861203665e+01 
5 9.885307642785e+00 9.045366124418e+00 8.801333430322e+00 8.713271980918e+00 
... ...
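
For ALS-style models you can reproduce this ranking yourself in Matlab. Here is a hedged sketch (it scores all items for one user and keeps the top 5; note that it omits the filtering of items the user has already rated, which the rating utility presumably performs):

U = mmread('smallnetflix_mm_U.mm');
V = mmread('smallnetflix_mm_V.mm');
u = 1; K = 5;                                % user id and number of items to keep
scores = U(u,:) * V';                        % predicted rating for every item
[top_vals, top_ids] = sort(scores, 'descend');
top_ids(1:K)                                 % compare to row u of the .ids file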


Command line arguments


Basic configuration:
--training - (mandatory) input file name (in sparse matrix market format) for the training data
--num_ratings - (mandatory) number of top items to recommend
--knn_sample_percent - (optional) a value in the range (0,1]. When the dataset is big and there are a lot of user/item pairs, it may not be feasible to compute all possible pairs; knn_sample_percent tells the program what fraction of the pairs to sample.
--minval - truncate allowed ratings in range (optional)
--maxval - truncate allowed ratings in range (optional)
--quiet - less verbose output (optional)
--algorithm - (mandatory) the type of algorithm output for which the top K ratings are computed. For the rating application the following algorithms are supported: als, sparse_als, nmf, sgd, wals. For the rating2 application: svd++, biassgd, rbm. For example --algorithm=als.
--start_user - (optional) limit the rating computation to users starting from start_user (inclusive)
--end_user - (optional) limit the rating computation to users up to end_user (exclusive)

The rating command does not yet support all algorithms. Contact me if you would like to add additional algorithms.

Handling implicit ratings

Implicit ratings handle the case where we have only positive examples (for example, when a user bought a certain product) but we never have an indication that a user DID NOT buy another product. The paper [Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-Class Collaborative Filtering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington, DC, USA, 502-511.] proposes adding negative examples at random for unobserved user/item pairs. Implicit ratings are implemented in the collaborative filtering library and can be used with any of the algorithms explained above.


Basic configuration:
--implicitratingtype=1 - adds implicit ratings at random
--implicitratingpercentage - a number between 1e-8 and 0.8 which determines the percentage of negative ratings (edges) to add to the sparse model. 0 means none, while 1 means a fully dense model.
--implicitratingnumedges - alternatively, the number of negative ratings (edges) to add
--implicitratingvalue - the value of the added rating. By default it is zero, but you can change it.
--implicitratingweight - weight of the implicit rating (for WALS), or time of the explicit rating (for tensor algorithms)


Example for implicit rating addition:

./toolkits/collaborative_filtering/sgd --training=smallnetflix_mm --implicitratingtype=1 --implicitratingvalue=-1 --implicitratingpercentage=0.00001
WARNING:  sgd.cpp(main:182): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[implicitratingtype] => [1]
[implicitratingvalue] => [-1]
[implicitratingpercentage] => [0.00001]
INFO:     sharder.hpp(start_preprocessing:164): Started preprocessing: smallnetflix_mm --> smallnetflix_mm.4B.bin.tmp
INFO:     io.hpp(convert_matrixmarket:190): Starting to read matrix-market input. Matrix dimensions: 95526 x 3561, non-zeros: 3298163
INFO:     implicit.hpp(add_implicit_edges:71): Going to add: 3401 implicit edges.
INFO:     implicit.hpp(add_implicit_edges:79): Finished adding 3401 implicit edges. 

...

Computing test predictions

It is possible to compute test predictions: namely, entering a list of user/movie pairs and getting a prediction for each pair in the list. To create such a list, create a sparse matrix market format file with a user/movie pair in each row (and for the unknown prediction put a zero or any other number).
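
For example, a minimal test file (with made-up user/movie pairs and zeros in place of the unknown ratings) could look like this:

%%MatrixMarket matrix coordinate real general
95526 3561 3
135 1 0
140 1 0
141 1 0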

Here is an example for generating predictions on the user/movie pair list on Netflix data:

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/biassgd --training=smallnetflix_mm --validation=smallnetflix_mme --test=smallnetflix_mme --quiet=1
WARNING:  biassgd.cpp(main:210): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[validation] => [smallnetflix_mme]
[test] => [smallnetflix_mme]
[quiet] => [1]
  0.726158) Iteration:   0 Training RMSE:    1.40926  Validation RMSE:     1.1636
   1.61145) Iteration:   1 Training RMSE:    1.07647  Validation RMSE:    1.09299
     2.536) Iteration:   2 Training RMSE:    1.02413  Validation RMSE:    1.05944
   3.41652) Iteration:   3 Training RMSE:   0.996051  Validation RMSE:    1.03869
   4.29683) Iteration:   4 Training RMSE:   0.977975  Validation RMSE:    1.02426
   5.15537) Iteration:   5 Training RMSE:   0.965243  Validation RMSE:    1.01354

Finished writing 545177 predictions to file: smallnetflix_mme.predict


The input user/movie pair list is specified using the --test=filename command line flag.
The output predictions are found in the file smallnetflix_mme.predict:

bickson@thrust:~/graphchi$ head smallnetflix_mme.predict 
%%MatrixMarket matrix coordinate real general 
%This file contains user/item pair predictions. In each line one prediction. The first column is user id, second column is item id, third column is the computed prediction.
95526 3561 545177
135 1 3.6310739
140 1 3.7827248
141 1 3.5731169
154 1 3.9835398
162 1 3.9378759
167 1 3.9865881
169 1 3.6489052
171 1 4.0544691 
 ...

Speeding up execution

0) Verify that your program is compiled with the "-O3" compiler flag (should be enabled by default). This gives a significant speedup (for example 5x). Also verify that your program is compiled with the EIGEN_NDEBUG compiler flag (should also be enabled by default).

1) If your system has enough memory, you can preload the problem into memory instead of reading it from disk on each iteration. This is done using the --nshards=1 command line flag.

This gives around a 2x speedup.

2) If your system has enough memory, you can increase the amount of memory used via the membudget_mb option. Example:
./toolkits/collaborative_filtering/als --training=smallnetflix_mm --validation=smallnetflix_mme --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1 membudget_mb 20000

3) You can tune the number of execution threads using the execthreads option.
Depending on your machine, a different number of threads may give better results. The rule of thumb is one thread per physical core.
Example for setting the number of threads:
./toolkits/collaborative_filtering/als --training=smallnetflix_mm --validation=smallnetflix_mme --lambda=0.065 --minval=1 --maxval=5 --max_iter=6 --quiet=1 execthreads 4

4) You can disable compression by defining the following macro in your program code:
#define GRAPHCHI_DISABLE_COMPRESSION

 and recompiling. This will require increased disk space but will speed up execution.


K-Fold cross validation


It is possible to apply K-fold cross validation to your dataset. This is done by applying the following two flags:
--kfold_cross_validation=10 - enables k-fold cross validation with K=10, and so on.
--kfold_cross_validation_index=3 - defines that we are working on the 4th fold (out of 10; indices start from zero).

Notes:
1) Currently supported algorithms for k-fold cross validation are: als, wals, sparse_als, svdpp, nmf, pmf, sgd, biassgd, biassgd2, rbm, timesvdpp, baseline.
2) Selection is done by rows, so when using K=10, index=3, the 4th row out of every group of 10 rows is excluded from the training set and used for validation (see the sketch below).
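
A hedged Matlab/Octave sketch of this selection rule (whether row counting is zero- or one-based is an assumption here; check the toolkit source if it matters):

K = 10; idx = 3;
row = 0:19;                        % zero-based row numbers in the training file
held_out = mod(row, K) == idx;     % rows 3 and 13 form the validation fold here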

Example run:
./toolkits/collaborative_filtering/als --training=smallnetflix_mm --kfold_cross_validation=10 --quiet=1 --kfold_cross_validation_index=3 --validation=smallnetflix_validation
WARNING:  common.hpp(print_copyright:149): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[kfold_cross_validation] => [10]
[quiet] => [1]
[kfold_cross_validation_index] => [3]
[validation] => [smallnetflix_validation]
...
   4.91313) Iteration:   0 Training RMSE:   2.03244  Validation  RMSE:   1.19777
   6.33828) Iteration:   1 Training RMSE:  0.748826  Validation  RMSE:   1.15937
    7.8193) Iteration:   2 Training RMSE:  0.690095  Validation  RMSE:   1.14381
   9.25151) Iteration:   3 Training RMSE:  0.665744  Validation  RMSE:   1.13516
   10.6588) Iteration:   4 Training RMSE:  0.649499  Validation  RMSE:   1.13151
   12.0984) Iteration:   5 Training RMSE:  0.638833  Validation  RMSE:   1.13044
Finished writing 329816 predictions to file: smallnetflix_mm.predict

Now run for a different fold:


./toolkits/collaborative_filtering/als --training=smallnetflix_mm --kfold_cross_validation=10 --quiet=1 --kfold_cross_validation_index=4 --validation=smallnetflix_validation



WARNING:  common.hpp(print_copyright:149): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[kfold_cross_validation] => [10]
[quiet] => [1]
[kfold_cross_validation_index] => [4]
[validation] => [smallnetflix_validation]
...
    4.7885) Iteration:   0 Training RMSE:   1.97529  Validation  RMSE:   1.19191
   6.11275) Iteration:   1 Training RMSE:  0.745336  Validation  RMSE:   1.15712
   7.52895) Iteration:   2 Training RMSE:  0.685904  Validation  RMSE:   1.14291
   8.91787) Iteration:   3 Training RMSE:  0.661709  Validation  RMSE:   1.13662
   10.2528) Iteration:   4 Training RMSE:  0.646958  Validation  RMSE:   1.13445
   11.5109) Iteration:   5 Training RMSE:  0.637327  Validation  RMSE:   1.13369

Other cost functions

Most of the algorithms compute RMSE by default. We also support the MAP@K metric, which you can enable using the --calc_ap=XX flag. The --ap_number=XX flag defines K.
Note: the assumption is that the dataset has binary values (0/1).

Common errors and their meaning

File not found error:

bickson@thrust:~/graphchi$ ./bin/example_apps/matrix_factorization/als_vertices_inmem file smallnetflix_mm 
INFO:     sharder.hpp(start_preprocessing:164): Started preprocessing: smallnetflix_mm --> smallnetflix_mm.4B.bin.tmp
ERROR:    als.hpp(convert_matrixmarket_for_ALS:153): Could not open file: smallnetflix_mm, error: No such file or directory

Solution:
The input file was not found. Repeat step 5 and verify the file is in the right folder.

Environment variable error:
bickson@thrust:~/graphchi/bin/example_apps/matrix_factorization$ ./als_vertices_inmem 
ERROR: Could not read configuration file: conf/graphchi.local.cnf
Please define environment variable GRAPHCHI_ROOT or run the program from that directory.

Solution:
export GRAPHCHI_ROOT=/path/to/graphchi/folder/

Error:

FATAL:    io.hpp(convert_matrixmarket:169): Failed to read global mean from filesmallnetflix_mm.gm

Solution: remove all temporary files created by the preprocessor, verify you have write permissions to your working folder and try again.

Adding fault tolerance

For adding fault tolerance, use the command line flag --load_factors_from_file=1 when continuing any previous run.

The following algos are supported: ALS, WALS, sparse_ALS, tensor_ALS, NMF, SGD, bias-SGD and SVD++.

Here is an example for bias-SGD.
1) Run a few rounds of the algo:
bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/biassgd --training=smallnetflix_mm --max_iter=3 --quiet=1
WARNING:  biassgd.cpp(main:210): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[max_iter] => [3]
[quiet] => [1]
    1.5052) Iteration:   0 Training RMSE:    1.40926  Validation RMSE:     1.1636
   3.30333) Iteration:   1 Training RMSE:    1.07647  Validation RMSE:    1.09299
   5.28362) Iteration:   2 Training RMSE:    1.02413  Validation RMSE:    1.05944

 === REPORT FOR biassgd-inmemory-factors() ===
..

2) Now continue from the same run, after the 3 iterations:

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/biassgd --training=smallnetflix_mm --max_iter=3 --quiet=1 --load_factors_from_file=1
WARNING:  biassgd.cpp(main:210): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [smallnetflix_mm]
[max_iter] => [3]
[quiet] => [1]
[load_factors_from_file] => [1]
   2.63894) Iteration:   0 Training RMSE:   0.996053  Validation RMSE:    1.03869
   4.07894) Iteration:   1 Training RMSE:   0.977975  Validation RMSE:    1.02427
    5.6297) Iteration:   2 Training RMSE:   0.965245  Validation RMSE:    1.01355
..


As you can see, the second run starts from the state of the first run.


Item based similarity methods

Item based similarity methods documentation is found here.

Case studies

ACM KDD CUP 2012 - in this post I show how to utilize multiple feature information for predicting advertisement clicked by users, using KDD CUP 2012 data (we won 4th place out of 192 groups).
Airline on time dataset + Hearst machine learning challenge - in this post I show how to predict airplane flight time using the airline on-time dataset, and how to predict user reaction to an email campaign using the Hearst machine learning challenge.
ACM KDD CUP 2010 - in this post I explain how to predict student learning abilities using ACM KDD CUP 2010 dataset.
Million songs dataset - in this post I explain how to obtain the winning solution in the million songs dataset contest, using a computation of item based similarities and their derived recommendations.


Acknowledgements/ Hall of Fame

Deployment of the GraphChi CF toolkit would not have been possible without the great help of data scientists around the world who contributed their efforts to improving my code! Here is a preliminary list; I hope I did not forget anyone...