
Sunday, March 18, 2012

How to prepare your problem to be used in GraphLab / GraphLab parsers

In many cases your data is a collection of strings, and you would like to convert it to numerical form so it can be used in machine learning software.

In GraphLab v2, I have added a parsers library that will hopefully make this task easier.
Let's start with an example. Suppose I have a collection of documents and in each document I have
a bag of words that appear. The input to GraphLab parser is:

1::the boy with big hat was here
2::no one would have believe in the last years of the nineteenth century

where 1 and 2 are the document numeric IDs, '::' is the separator, and the rest of the line contains the keywords that appear in that document.

Assuming you have this format, it is very easy to convert it to be used in GraphLab. You simply use
the texttokenparser application.
Preliminaries: you will need to install GraphLab v2 (explanation under installation section here).

And here is an example run of the parser:

./texttokenparser --dir=./ --outdir=./ --data=document_corpus.txt --gzip=false --debug=true
WARNING:  texttokenparser.cpp(main:209): Eigen detected. (This is actually good news!)
INFO:     texttokenparser.cpp(main:211): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Currently implemented parsers are: Call data records, document tokens 
Schedule all vertices
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO:     io.hpp(gzip_in_file:698): Opening input file: ./document_corpus.txt
INFO:     io.hpp(gzip_out_file:729): Opening output file ./document_corpus.txt.out
Read line: 1 From: 1 To: 1 string: the
Read line: 1 From: 1 To: 2 string: boy
Read line: 4 From: 1 To: 3 string: with

INFO:     texttokenparser.cpp(operator():159): Parsed line: 50000 map size is: 30219
INFO:     texttokenparser.cpp(operator():159): Parsed line: 100000 map size is: 39510
INFO:     texttokenparser.cpp(operator():159): Parsed line: 150000 map size is: 45200
INFO:     texttokenparser.cpp(operator():159): Parsed line: 200000 map size is: 50310
INFO:     texttokenparser.cpp(operator():164): Finished parsing total of 230114 lines in file document_corpus.txt
total map size: 52655
Finished in 17.0022
Total number of edges: 0
INFO:     io.hpp(save_map_to_file:813): Save map to file: ./.map map size: 52655
INFO:     io.hpp(save_map_to_file:813): Save map to file: ./.reverse.map map size: 52655

The outputs of the parser are:
1) Text file containing consecutive integers in sparse matrix market format. In other words, each string is assigned an id, and a sparse matrix is formed where the rows are the document numbers and the non-zero columns are the strings.
NOTE: currently you will need to manually add the two header lines as explained here. The header lines specify the number of rows, columns and non-zero entries in the matrix. In the future I will automate this process.
2) A mapping from each text keyword to its matching integer.
3) A mapping from each integer back to its matching keyword.
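As a rough sketch (this is not the actual texttokenparser code), the core of the conversion is just assigning the next consecutive integer to each new keyword; every (document, keyword) pair then becomes a non-zero entry of the sparse matrix:

```cpp
#include <map>
#include <string>

// Assign a consecutive integer id (starting from 1) to each distinct token,
// or return the id the token already has. This sketch is single-threaded;
// a multicore parser has to guard insertions with a mutex.
int assign_token_id(std::map<std::string, int> & token2id, const std::string & token) {
  std::map<std::string, int>::const_iterator it = token2id.find(token);
  if (it != token2id.end())
    return it->second;
  int id = (int)token2id.size() + 1;
  token2id[token] = id;
  return id;
}
```

For document 1 above, the keywords "the boy with big hat was here" would get ids 1 through 7, and the matrix market triplets would be `1 1 1`, `1 2 1`, ..., `1 7 1`.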

Advanced options:
1) It is possible to parse in parallel (on a multicore machine) multiple files and still have the ids assigned correctly. Use the --filter= command line argument to select all files starting with a certain prefix. Do not use the --data= command line argument in that case.
2) Support for gzip input format. Using --gzip=true command line option.
3) Save the mapping into readable text file using the --save_in_text=true command line argument.
4) Incrementally add more documents to an existing map by using the --load=true command line flag.
5) Limit the number of parsed lines using --lines=XX command line flag (useful for debugging!)
6) Enable verbose mode using --debug=true command line flag.

Thursday, December 29, 2011

Multicore parser - part 2 - parallel perl tutorial

In the first part of this post, I described how to program a multicore parser, where the task is to translate string IDs into consecutive integers that will be used for formatting the input of many machine learning algorithms. The output of part 1 is a map between strings and unsigned ints. The map is built using a single pass over all the dataset.

Now an additional task remains, namely translating the records (in my case phone call records) into a Graph to be used in Graphlab. This is an embarrassingly parallel task - since the map is read only - multiple threads can read it in parallel and translate the record names into graph edges. For example the following records:
YXVaVQJfYZp BqFnHyiRwam 050803 235959 28
YXVaVQJfYZp BZurhasRwat 050803 235959 6
BqFnHyiRwam jdJBsGbXUwu 050803 235959 242
are translated into undirected edges:
1 2 
1 3
2 4
etc. etc.
The code is part of Graphlab v2 and can be downloaded from our download page.

In the current post, I will quickly explain how to continue setting up the parser.
The task we have now is to merge multiple phone calls into a single edge. It is also useful to sort the edges by their node ID. I have chosen to program this part in Perl since, as you are about to see, it is a very easy task.

INPUT: Gzipped files with phone call records, where each row has two columns: the caller and the receiver, each of them an unsigned integer. The same row may repeat multiple times in a file (when multiple phone calls between the same pair of people were logged at different times).
OUTPUT: A gzipped output file with sorted unique phone call records. Each unique caller receiver pair will appear only once.
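For intuition, here is a hedged C++ sketch of the same merge step (the tutorial that follows does it with Perl and the Unix sort utility instead):

```cpp
#include <set>
#include <utility>
#include <vector>

// Merge repeated (caller, receiver) records into a sorted, unique edge list,
// mimicking what `sort -u -k 1,1 -k 2,2` does. std::set keeps the pairs
// ordered by caller id, then receiver id, and drops duplicates on insertion.
std::vector<std::pair<unsigned, unsigned> >
unique_edges(const std::vector<std::pair<unsigned, unsigned> > & records) {
  std::set<std::pair<unsigned, unsigned> > uniq(records.begin(), records.end());
  return std::vector<std::pair<unsigned, unsigned> >(uniq.begin(), uniq.end());
}
```

Note that this sketch holds everything in memory, while the Unix sort approach below can spill to disk, which matters at the data sizes discussed here.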

Tutorial - how to execute a parallel task in Perl.
1) Download and extract Parallel Fork Manager
wget http://search.cpan.org/CPAN/authors/id/D/DL/DLUX/Parallel-ForkManager-0.7.5.tar.gz
tar xvzf Parallel-ForkManager-0.7.5.tar.gz
mv Parallel-ForkManager-0.7.5 Parallel

2) Create a file named parser.pl with the following lines in it:
#!/usr/bin/perl -w
use strict;
use Parallel::ForkManager;

my $THREADS_NUM = 8;
my $pm = new Parallel::ForkManager($THREADS_NUM);

my $dir = "your/path/to/files";
opendir(DIR, $dir) or die "cannot open $dir: $!";
my @FILES = grep { /\.gz$/ } readdir(DIR);  # skip '.', '..' and non-gzipped files
closedir(DIR);

foreach my $file (@FILES) {

  # Forks and returns the pid for the child:
  my $pid = $pm->start and next;

  print "working on file " . $file . "\n";
  system("gunzip -c $dir/$file | sort -u -T . -n -k 1,1 -k 2,2 -S 4G | gzip > $dir/$file.sorted.gz");

  $pm->finish; # Terminates the child process
}
$pm->wait_all_children;
Explanation
1) ForkManager($THREADS_NUM) sets the number of parallel threads - in this case 8.
2) For each file, the file is unzipped using "gunzip -c" and sorted uniquely (the -u flag). The -T flag is an optional argument in case your temp drive does not have enough space. -k 1,1 sets the sorting key to be the first column, and the second -k 2,2 sets column 2 as a secondary key in case of a tie on the first column. The -S flag sets the buffer size so that the full input file fits into memory; 4G is 4 gigabytes of memory.
3) The system() command runs any shell command line, so you can change the parallel loop execution to perform your own task easily.

Overall, in a few lines of code we got a parallel execution environment that would be much harder to set up otherwise.

Sunday, December 25, 2011

How to write a multicore parser

When dealing with machine learning, one often ignores the (admittedly boring!) task of preparing the data for use in any of the machine learning algorithms. Most algorithms have either a linear algebra or a statistical foundation, and thus the data has to be converted to numeric form.

For the last couple of weeks I have been working on the efficient design of a multicore parser that converts raw string data into a format usable by many machine learning algorithms. Specifically, I am using CDRs (call data records) from a large European country. However, the dataset has several typical properties, so I believe my experience is useful for other domains as well.

The raw CDR data I am using looks like this:
YXVaVQJfYZp BqFnHyiRwam 050803 235959 28
xtGBWvpYgYK jdJBsGbXUwu 050803 235959 242
ZpYQLFFKyTa atslZokWZRL 050803 235959 504
WMjbYygLglR BqFnCfuNgio 050803 235959 51
hcLiEoumskU RcNSJSEmidT 050803 235959 7
qBvUQlMABPv atslBvPNusB 050803 235959 3609
jdSqVjxPlBn BqFnHyiRwam 050803 235959 23
VCWyivVLzRr atslSbOOWXz 050803 235959 8411
PnLaFqLJrEV atslZokWZRL 050803 235959 8806
PnLaCsEnqei atslBvPNusB 050803 235959 590
The first column is an anonymized caller ID, the second column is an anonymized receiver ID, the third column is the date, the fourth is the time, and the last column is the duration of the call.

Now to the data magnitude. If your dataset is small, there is no need for any fancy parsing; you can write a python/perl/matlab script to convert it to numeric form and stop reading here... However, this dataset is rather big: every day there are about 300M unique phone calls. So depending on how many days you aggregate together, you can reach quite a large magnitude: for a month, about 9 billion phone calls are logged.

To make the CDR useful, we need to convert the hashed string ID into a number, hopefully a consecutive increasing number. That way we can express the phone call information as a matrix.
Then we can use any of the fundamental machine learning algorithms like: SVM, Lasso, Sparse logistic regression, matrix factorization, etc. etc.

One possible approach for converting strings to integers is taken in Vowpal Wabbit, where strings are hashed into numeric IDs on the fly. However, there is a risk that two different string IDs will be mapped to the same integer; depending on the application this may or may not be acceptable. I have chosen a different approach, which is to simply assign a growing consecutive ID to each string.
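To illustrate the trade-off with a toy sketch (this is NOT Vowpal Wabbit's actual hash function), hashing into a fixed id space can collide, while consecutive assignment cannot:

```cpp
#include <map>
#include <string>

// Toy hash into a fixed-size id space: two different strings can land
// on the same id (a collision).
unsigned hashed_id(const std::string & s, unsigned id_space) {
  unsigned h = 0;
  for (size_t i = 0; i < s.size(); ++i)
    h = h * 31u + (unsigned char)s[i];
  return h % id_space;
}

// Consecutive assignment: every distinct string gets its own id, no collisions.
unsigned consecutive_id(std::map<std::string, unsigned> & seen, const std::string & s) {
  std::map<std::string, unsigned>::const_iterator it = seen.find(s);
  if (it != seen.end())
    return it->second;
  unsigned id = (unsigned)seen.size() + 1;
  seen[s] = id;
  return id;
}
```

The price of the consecutive scheme is that the full string-to-id map must be kept (and shared) during parsing, which is exactly what the rest of this post deals with.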

I have implemented the code in GraphLab, even though GraphLab is not the obvious tool for this task (it was convenient to use, though). On a multicore machine, several GraphLab threads run in parallel, parsing different portions of the input files concurrently. We have to be careful that node IDs remain consecutive across the different files. Since stl/boost data structures are typically not thread safe, I had to use a mutex to guard against concurrent insertions into the map. (Concurrent reads from an stl/boost map are perfectly fine.)

#include <string>
#include <boost/unordered_map.hpp>
#include <boost/thread/mutex.hpp>

boost::unordered_map<std::string, uint> hash2nodeid;
boost::mutex mymutex;
uint conseq_id = 0; //ids are assigned starting from 1

void assign_id(uint & outval, const std::string & name){

  //find if the string is already in the map.
  //this lookup is done without locking, since find() does not modify the map
  boost::unordered_map<std::string,uint>::const_iterator it = hash2nodeid.find(name);
  if (it != hash2nodeid.end()){
     outval = it->second;
     return;
  }

  //if not, we need to insert it into the map.
  //now we must lock, since operator[] is not thread safe
  mymutex.lock();
  outval = hash2nodeid[name]; //operator[] inserts a 0 if the key is new
  if (outval == 0){           //0 means no other thread inserted it first
      hash2nodeid[name] = ++conseq_id;
      outval = conseq_id;
  }
  mymutex.unlock();
}


One should be careful here: as I verified using the gprof profiler, about 95% of the running time is spent in this critical section of assigning strings to ints.

Initially I used std::map<string,int>, but I found it to be rather slow. std::map is implemented using an underlying tree, so insertions cost O(log N). I switched to boost::unordered_map, which is a hash table implementation with O(1) expected insertions. This gave a 2x speedup in runtime.
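The switch itself is a one-line change, since the two containers share the same interface. A minimal sketch of the lookup-or-insert pattern (using std::unordered_map, the C++11 standard equivalent of boost::unordered_map, and with the mutex omitted for brevity):

```cpp
#include <string>
#include <unordered_map>

// Hash table: O(1) expected insert/lookup, versus O(log N) for the
// tree-based std::map used initially.
std::unordered_map<std::string, unsigned> hash2nodeid_fast;

// Return the id of `name`, assigning the next consecutive id if it is new.
unsigned lookup_or_insert(const std::string & name, unsigned & conseq_id) {
  std::unordered_map<std::string, unsigned>::iterator it = hash2nodeid_fast.find(name);
  if (it != hash2nodeid_fast.end())
    return it->second;
  hash2nodeid_fast[name] = ++conseq_id;
  return conseq_id;
}
```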

Second, since each day of input amounts to about 5GB of gzipped data, I used a boost gzipped stream to avoid the intermediate extraction of the input files. Here is an example:
#include <fstream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

char linebuf[128];
std::ifstream in_file(filename.c_str(), std::ios::binary);
boost::iostreams::filtering_stream<boost::iostreams::input> fin;
fin.push(boost::iostreams::gzip_decompressor());
fin.push(in_file);

while (true){
  fin.getline(linebuf, 128);
  if (fin.eof())
    break;
  //parse the line
}
fin.pop();
fin.pop();
in_file.close();
Overall, I believe the result is quite efficient: parsing 115GB of compressed CDR data (a total of 6.4 billion phone calls) takes 75 minutes on a 4-core machine (quad core AMD Opteron 2.6Ghz). There were about 182M unique IDs assigned, and a total of 12.8 billion map lookups (about 3M lookups a second).

A performance graph was shown here (image omitted).


Summary of lessons learned:

  1. C parsing is way more efficient than perl/python/matlab.
  2. Extracting gzipped files is a waste of time and space - better to work directly on the gzipped version.
  3. Parallel parsing gives a good speedup up to a few (3) cores. More cores do not help (due to heavy IO).
  4. Better to use a hash table than a sorted tree: boost::unordered_map is twice as fast as std::map.