Showing posts with label open-source-tools. Show all posts

Wednesday, August 29, 2012

Rapid Miner & Myrrix

Another new thing I learned from Brad Cox: the RapidMiner software seems to be heavily used in the homeland security / defense sectors.

Here is a short introductory video.

It seems it does not scale well to large models, but it has an excellent UI which helps visualize the data.

As a follow up, I got this from Dave Laxer:

Did you know about RADoop?  RADoop = RapidMiner + Hadoop.

It seems to be an effort to scale RapidMiner to larger models.

Slightly related, I got a link to Myrrix from Charles Martin. It seems to be a recommendation engine built on top of Mahout, headed by Sean Owen.

Wednesday, August 15, 2012

Steffen Rendle - libFM

News from the KDD CUP workshop. I was highly impressed by Steffen Rendle, the author of the libFM collaborative filtering library. Steffen took 2nd place in track 1 and 3rd place in track 2. Unlike our team, which had around 15 people, and the Taiwanese team, which had around 20 people, Steffen worked alone, and got a great rating in BOTH tracks.

What is nice about Steffen's work is that he uses only a SINGLE algorithm, and not an ensemble of methods as typically deployed. The trick is that he does very smart feature engineering to create a very good feature matrix. Once he gets the representative feature matrix, he uses the libFM algorithm.

I asked Steffen to explain the essence of the method:
A is the input (design) matrix where each row is a case and each column a (real-valued) predictor variable. I.e. the same way of feature engineering as in other standard ML algorithms such as linear/ polynomial regression, SVM, etc. Internally, the FM model works similarly as a polynomial regression, i.e. it contains all pairwise interactions between any two variables in A. The important difference to polynomial regression is that the model parameters for variable interactions are factorized. This is the reason why FMs perform well in problems such as recommender systems.
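To make this concrete, here is a tiny sketch of mine (not code from libFM) of the FM prediction rule in Python: a global bias, linear terms, and factorized pairwise interactions, computed with the well-known O(nk) reformulation rather than the naive O(n^2) pairwise sum.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machine prediction for one input row x (length n).

    w0: global bias, w: linear weights (n,), V: factor matrix (n, k).
    Pairwise interactions use the identity:
    sum_{i<j} <V_i, V_j> x_i x_j
        = 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2]
    which costs O(n*k) instead of O(n^2).
    """
    linear = w0 + w @ x
    Vx = V.T @ x                   # (k,) per-factor weighted sums
    V2x2 = (V ** 2).T @ (x ** 2)   # (k,) per-factor squared sums
    interactions = 0.5 * np.sum(Vx ** 2 - V2x2)
    return linear + interactions
```

The design matrix row x is exactly the feature-engineered input Steffen describes above; the factorization of the interaction weights into V is what lets FMs estimate interactions between rarely co-occurring variables, as in recommender data.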
Some of the recent notable work of Steffen is a caching method for ALS computation that, according to Steffen, significantly speeds up ALS and makes it a lightweight algorithm like SGD. The work is described in his recent SIGIR 2011 paper.

A second interesting work is an online matrix factorization computation described in the paper: Steffen Rendle, Lars Schmidt-Thieme (2008): Online-Updating Regularized Kernel Matrix Factorization Models for Large-Scale Recommender Systems, in Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys 2008), ACM.
When new users/ items are added into the system, only an incremental update is computed.
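The rough idea, sketched in my own toy code below (not Steffen's): when a new user arrives, keep the learned item factor matrix fixed and solve a small regularized least-squares problem for that user's factor vector alone, instead of re-running the full factorization.

```python
import numpy as np

def new_user_factors(Q, ratings, lam=0.1):
    """Incremental-update sketch for a new user.

    Q: fixed item factor matrix (num_items, k).
    ratings: dict {item_id: rating} observed for the new user.
    Solves the ridge regression
        p = argmin ||r - Q_obs p||^2 + lam * ||p||^2
    so only one k x k system is solved, not a full refactorization.
    """
    items = list(ratings)
    Q_obs = Q[items]                       # factors of the rated items (m, k)
    r = np.array([ratings[i] for i in items])
    k = Q.shape[1]
    A = Q_obs.T @ Q_obs + lam * np.eye(k)
    b = Q_obs.T @ r
    return np.linalg.solve(A, b)
```

The symmetric update for a new item (fix user factors, solve for the item's vector) works the same way.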

Finally, Steffen gave a very detailed tutorial at KDD about a whole bunch of matrix factorization methods and the relations between them. I find the tutorial a good overview of the connections between the algorithms; however, it is intended for intermediate-level users who have already mastered some of the algorithms in this domain.

As you may know, we have a very crude and preliminary libFM implementation in GraphLab. Our implementation contains a subset of the full libFM functionality, with only three feature types: user, item and time. Users are encouraged to check out the original libFM library for a more complete implementation. libFM has a track record of performance in the KDD CUP and is a highly recommended collaborative filtering package.

Wednesday, August 8, 2012

Gephi - nice graph visualization software

I got a recommendation for the Gephi software from two independent sources: my CMU collaborator Jay Haijie and Marcos Sainz (ModCloth). Gephi is a free graph visualization software package. It has some cool properties, as can be seen in the video below. It seems that many graph ranking algorithms like PageRank and HITS are implemented inside and can be computed on the loaded graphs. I am not sure to what data magnitude this software can scale - from my experience, trying to visualize large graphs results in a mess...
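For the curious, here is a minimal power-iteration PageRank sketch (my own toy code, of course not Gephi's implementation) to show what such a ranking computation boils down to:

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-10):
    """Power-iteration PageRank on an adjacency dict {node: [out-neighbors]}."""
    nodes = sorted(adj)
    idx = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    r = np.full(n, 1.0 / n)                # start from the uniform distribution
    while True:
        r_new = np.full(n, (1 - d) / n)    # teleportation mass
        for u, outs in adj.items():
            if outs:                       # spread rank along out-edges
                share = d * r[idx[u]] / len(outs)
                for v in outs:
                    r_new[idx[v]] += share
            else:                          # dangling node: spread uniformly
                r_new += d * r[idx[u]] / n
        if np.abs(r_new - r).sum() < tol:  # L1 convergence check
            return dict(zip(nodes, r_new))
        r = r_new
```

On a real graph of any size you would of course use a sparse-matrix or distributed formulation, which is exactly where the scaling question above kicks in.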

Saturday, August 4, 2012

Data Wrangler

One of the most impressive lectures in our GraphLab workshop was given by Jeffrey Heer from the HCI department at Stanford. Data Wrangler is a visual tool that helps clean large datasets - a time-demanding task which is often ignored when talking about machine learning algorithms. Using Data Wrangler it is possible to visually specify how to clean the data on a small sample of it, and automatically generate MapReduce or Python scripts that will run on the full dataset.
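To give a flavor of the approach (a hypothetical sketch of mine - Wrangler's actual generated output looks different), the cleaning steps you demonstrate on the sample become a pipeline of simple row transforms that can then run unchanged over the full dataset:

```python
# Hypothetical row transforms of the kind a generated cleaning script chains together.

def split_column(row, col, sep, into):
    """Split one column into several, e.g. "Doe, John" -> last/first."""
    parts = [p.strip() for p in row.pop(col).split(sep, len(into) - 1)]
    row.update(zip(into, parts))
    return row

def drop_if_empty(rows, col):
    """Filter out rows where the given column is blank."""
    return [r for r in rows if r.get(col, "").strip()]

def clean(rows):
    """The recorded pipeline: drop blank rows, then split the name column."""
    rows = drop_if_empty(rows, "name")
    return [split_column(r, "name", ",", ["last", "first"]) for r in rows]
```

Because each step is a pure record-level transform, the same pipeline maps naturally onto a MapReduce job for the full data.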

Here is a quick video preview (the full lecture will be online soon): 
Wrangler Demo Video from Stanford Visualization Group on Vimeo.

Here is a link to the full paper.
By the way, my second advisor, Prof. Joe Hellerstein from Berkeley is also involved in this nice project.

Thursday, August 2, 2012

Geo Deepdive

In MMDS I liked the lecture by Christopher Re from WISC about Geo Deepdive.
Geo Deepdive is a data mining application in the field of geology. The application digests a huge number of technical reports and research papers, mines useful measurement information, and presents it using a cool UI. Here is a note I got from Chris:


We've started to get some videos up at www.youtube.com/HazyResearch. One in particular is this one (creepy Siri voice and all!):
More actual technical information (e.g., tutorials on how to build your own system) should be up over the next 2-3 weeks. We'll also have had a handful of geoscientists more intensely using the system by then.

Geo Deepdive reminds me of the work done at quantiFind that I wrote about a while back. It is a much-needed application in a domain that is still missing infrastructure and applications. I can imagine a lot of other use cases besides geology.