
Thursday, August 21, 2014

Do you like "The Killing"? Dive into Seattle police data!

Here is an interesting blog post analyzing Seattle police data. I got it from Carlos Guestrin, our CEO.

Another interesting dataset is the Allstate insurance claims data from their Kaggle competition.


Saturday, February 2, 2013

Case study: million songs dataset

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive's suggestion, we now have an implementation of Fabio Aiolli's cost function, as explained in the paper A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.
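For intuition, here is a minimal Python sketch of the asymmetric cosine similarity at the heart of Aiolli's cost function. This is my own toy reimplementation, not the GraphChi code, and the data below is made up; alpha plays the same role as the --asym_cosine_alpha flag used in the commands below.

def asym_cosine(users_by_item, i, j, alpha=0.15):
    # Aiolli's asymmetric cosine: |U(i) & U(j)| / (|U(i)|^alpha * |U(j)|^(1-alpha)).
    # users_by_item maps an item id to the set of users who played it;
    # alpha = 0.5 recovers the ordinary cosine similarity.
    ui, uj = users_by_item[i], users_by_item[j]
    common = len(ui & uj)
    if common == 0:
        return 0.0
    return common / (len(ui) ** alpha * len(uj) ** (1 - alpha))

# Toy example with implicit feedback (sets of user ids per song):
users_by_item = {
    'song_a': {1, 2, 3, 4},
    'song_b': {2, 3, 4},
    'song_c': {4, 5},
}
print(asym_cosine(users_by_item, 'song_a', 'song_b'))  # ~0.96; plain cosine would give ~0.87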

Below are detailed instructions on how to use the GraphChi CF toolkit on the million songs dataset to compute user recommendations from item similarities.

Instructions for computing item to item similarities:

1) To obtain the dataset, download and extract this zip file.

2) Run createTrain.sh to download the million songs dataset and prepare it in a GraphChi-compatible format.
$ sh createTrain.sh
Note: this operation may take an hour or so to prepare the data.

3) Run GraphChi's item-based collaborative filtering to find the top 500 similar items for each item:

./toolkits/collaborative_filtering/itemcf --training=train --K=500 --asym_cosine_alpha=0.15 --distance=3 --min_allowed_intersection=5
Explanation: --training points to the training file. --K=500 means we compute the top 500 similar items.
--distance=3 selects Aiolli's metric. --min_allowed_intersection=5 means we only take into account item pairs that were rated together by at least 5 users.

Note: this operation requires around 20GB of memory and may take a few hours.

Create user recommendations based on item similarities:

1) Run itemsim2rating to compute recommendations based on the item similarities:
$ rm -fR train.* train-topk.*
$ ./toolkits/collaborative_filtering/itemsim2rating --training=train --similarity=train-topk --K=500 membudget_mb 50000 --nshards=1 --max_iter=2 --Q=3 --clean_cache=1
Note: this operation may require around 20GB of RAM and may take a couple of hours, depending on your machine configuration.

Output file is: train-rec
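Conceptually, itemsim2rating scores each candidate item for a user by summing the similarities between that item and the items the user already listened to, raised to the locality exponent Q from Aiolli's paper. A hedged Python sketch follows; the names and data structures are mine, not the actual toolkit internals.

def score_items(user_items, topk_sims, Q=3):
    # user_items: set of item ids the user already listened to.
    # topk_sims:  item id -> list of (similar_item, similarity) pairs,
    #             i.e. the top-K output of the itemcf step above.
    # Returns candidate item -> score, where
    # score(u, i) = sum over j in user_items of sim(i, j) ** Q;
    # a larger Q sharpens locality, weighing near neighbors more heavily.
    scores = {}
    for j in user_items:
        for i, sim in topk_sims.get(j, []):
            if i in user_items:
                continue  # skip items the user already has
            scores[i] = scores.get(i, 0.0) + sim ** Q
    return scores

# Toy usage:
print(score_items({'song_a'}, {'song_a': [('song_b', 0.8), ('song_c', 0.3)]}, Q=3))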

Evaluating the result

1) Prepare test data:
./toolkits/parsers/topk --training=test --K=500

Output file is: test.ids

2) Prepare training recommendations: 
./toolkits/parsers/topk --training=train-rec --K=500

Output file is: train-rec.ids

3) Compute mean average precision @ 500:
./toolkits/collaborative_filtering/metric_eval --training=train-rec.ids --test=test.ids --K=500
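For reference, mean average precision at K is easy to state in a few lines of Python. This is my own sketch of the metric as I understand it from the contest, not the metric_eval code:

def average_precision_at_k(recommended, relevant, k=500):
    # recommended: ranked list of item ids; relevant: set of held-out item ids.
    hits, precision_sum = 0, 0.0
    for n, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / n  # precision at rank n, counted at each hit
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k=500):
    # Mean over users; all_recommended[u] is a ranked list, all_relevant[u] a set.
    users = list(all_relevant)
    return sum(average_precision_at_k(all_recommended.get(u, []), all_relevant[u], k)
               for u in users) / len(users)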

About performance: 

With the following settings: --min_allowed_intersection=5, K=500, Q=1, alpha=0.15 we get:
INFO:     metric_eval.cpp(eval_metrics:114): 7.48179 Finished evaluating 100000 instances.
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.151431

With --min_allowed_intersection=1, K=2500, Q=1, alpha=0.15 we get:

INFO:     metric_eval.cpp(eval_metrics:114): 6.0811 Finished evaluating 100000 instances.
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.167994


Acknowledgements:

  • Clive Cox, RummbleLabs.com, for proposing to implement item-based recommendations in GraphChi and for his support in the process of implementing this method.
  • Fabio Aiolli, University of Padova, winner of the Million Songs Dataset contest, for his great support regarding the implementation of his metric.

Friday, February 1, 2013

Spotlight: Kaggle's RTA Challenge


I had an interesting talk with José P. González-Brenes, a 6th-year grad student in CMU's LTI department.
During the talk, I learned that José participated in Kaggle's RTA challenge and actually won first place out of more than 300 teams.

The challenge was to predict RTA highway travel times; the data consisted of recorded travel times of different cars over different highway segments. The winning solution (by José and Guido Matías Cortés) used a very simple method: a random forest. Unfortunately, no paper was published about it, but here is a blog post summarizing the solution method, and here is a link to their presentation. What is further interesting about the solution is that it was just 90 lines of Matlab code!
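Although the original 90 lines of Matlab are not public, the general recipe is easy to reproduce today. Here is a minimal scikit-learn sketch on synthetic data; the features below are invented for illustration, while the actual winning features were engineered from the RTA segment data:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features (e.g. segment, time of day, recent speeds).
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = X @ np.array([5.0, 20.0, 3.0, 40.0]) + rng.normal(0.0, 1.0, 1000)  # fake travel times

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))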

The reason we actually talked is that Jose was recently trying out my GraphChi collaborative filtering code for his research, so I gave him some advice on which methods to use. Once he has some interesting results I hope he will update us!


Friday, January 4, 2013

Interesting dataset: million songs dataset

As you probably all know, we are always looking for additional free, high-quality datasets to try some of our techniques on. I got the million songs dataset link from Clive Cox, Chief Scientist at Rummble Labs, our man in London.

Here is some information from their website:


The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.
The Million Song Dataset is also a cluster of complementary datasets contributed by the community.

Here is information on getting the dataset. Kaggle ran a contest for recommending music items drawn from this dataset; for evaluating performance they used the MAP@500 metric described here. Anyway, I am soon going to try out our GraphChi CF toolbox on this dataset. Stay tuned for some results!

An update: as promised, here are some GraphChi results on the million songs dataset, and instructions on how to reproduce them.

Wednesday, September 5, 2012

Reddit

I got this link from Igor Carron, the famous compressed sensing and matrix factorization blogger.


In one of the threads, there was a discussion about recommender capabilities. Since we were looking at Arxaliv.org as a model (it is a Reddit clone), I went to the Reddit discussion of the development of that open-source platform and found that Reddit is actually looking for a recommender system, and they have a nice dataset:

There are 23,091,688 votes from 43,976 users over 3,436,063 links in 11,675 reddits. (Interestingly these ~44k users represent almost 17% of our total votes). The dump is 2.2gb uncompressed, 375mb in bz2.

A reddit is a category and a link is a subject (in Arxaliv it would be a paper), so the matrix (43,976 x 3,436,063) is pretty sparsely filled (density around 1.5e-4). Some SVD has been tried, but I am sure they haven't looked at low-rank solvers. Since Reddit is such a massive platform, if your algorithm provides good results, it will get known beyond your expectations.
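To get a feel for the scale, here is a sketch of how one might load such a vote dump into a sparse matrix and try a low-rank solver with SciPy. The file name and line format are my assumptions, not the dump's actual layout:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Assumed format: one vote per line, "user_id link_id vote" (vote in {+1, -1}).
rows, cols, vals = [], [], []
with open('reddit_votes.tsv') as f:   # hypothetical file name
    for line in f:
        u, l, v = line.split()
        rows.append(int(u)); cols.append(int(l)); vals.append(float(v))

A = csr_matrix((vals, (rows, cols)))  # roughly 43,976 x 3,436,063
print("density:", A.nnz / (A.shape[0] * A.shape[1]))

# Rank-20 truncated SVD; the sparse solver never forms the dense matrix.
U, s, Vt = svds(A, k=20)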

Tuesday, August 28, 2012

Airlines on time performance dataset

I got the following interesting dataset link from Brad Cox.
The ASA challenge site has two decades of flight data totaling many gigabytes. The data contains year, month, day, origin, carrier, destination, delay, etc. The goal is to determine which factors (day, carrier, destination, etc.) best predict the delay time.
From ASA website:
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed. 
The aim of the data expo is to provide a graphical summary of important features of the data set. This is intentionally vague in order to allow different entries to focus on different aspects of the data, but here are a few ideas to get you started:
  • When is the best time of day/day of week/time of year to fly to minimise delays?
  • Do older planes suffer more delays?
  • How does the number of people flying between different locations change over time?
  • How well does weather predict plane delays?
  • Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
You are also welcome to work with interesting subsets: you might want to compare flight patterns before and after 9/11, or between the pair of cities that you fly between most often, or all flights to and from a major airport like Chicago (ORD). 
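For instance, the first question in the list above reduces to a simple groupby. Here is a sketch assuming one downloaded year of the data and the column names published on the expo site (DayOfWeek, CRSDepTime, ArrDelay):

import pandas as pd

# One year of the ASA data, e.g. 2007.csv from the expo site.
df = pd.read_csv('2007.csv', usecols=['DayOfWeek', 'CRSDepTime', 'ArrDelay'])
df['DepHour'] = df['CRSDepTime'] // 100   # scheduled departure hour, 0-23

# Average arrival delay by scheduled departure hour and by day of week.
print(df.groupby('DepHour')['ArrDelay'].mean())
print(df.groupby('DayOfWeek')['ArrDelay'].mean())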

Next: how to use GraphChi for computing predictions on this dataset.