
6. A working example: Regression

As mentioned in Section 2 Basics and definitions, LBJava has two feature types: discrete and real. In machine learning, classification is the problem of predicting a discrete output (a class) for unlabeled data, whereas regression is the problem of predicting a continuous, real-valued output. Section 3 A working example: classifying newsgroup documents into topics shows how to use LBJava with the discrete type; this tutorial covers the real type.

6.1 Setting Up

Let's define a class named MyData for the internal representation of an example.

Each example in the data set has two fields: a feature vector and a label, where the label is of real (continuous) type. Accordingly, the two fields are declared as follows:

private List<Double> features;
private double label;
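
To make the representation concrete, here is a minimal sketch of what the full MyData class might look like. It assumes that each line of the data set holds whitespace-separated numbers with the label as the last value; the constructor signature and the getter names are chosen to match how MyData is used elsewhere in this tutorial, but the parsing details are an assumption rather than the tutorial's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of MyData: one example parsed from a single line of the data set.
// Assumption: values are whitespace-separated and the last value is the label.
class MyData {
    private final List<Double> features;
    private final double label;

    public MyData(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                "Expected at least one feature and a label: '" + line + "'");
        }
        // Every token except the last one is a feature value.
        List<Double> parsed = new ArrayList<>();
        for (int i = 0; i < tokens.length - 1; i++) {
            parsed.add(Double.parseDouble(tokens[i]));
        }
        this.features = Collections.unmodifiableList(parsed);
        // The last token is the real-valued label.
        this.label = Double.parseDouble(tokens[tokens.length - 1]);
    }

    public List<Double> getFeatures() {
        return features;
    }

    public double getLabel() {
        return label;
    }
}
```

With this sketch, new MyData("10 20 30 -1") yields three features and a label of -1.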

The class MyData represents a single example from the data set. Since the data set consists of many examples, let's define a second class, MyDataReader, to represent the data set as a whole.

MyDataReader has two fields: lines holds every line of the data set (one example per line), and currentLineNumber keeps track of the line currently being read.

private final List<String> lines;
private int currentLineNumber;

The constructor of MyDataReader reads each line of the data set file and stores it in the lines field.

public MyDataReader(String filePath) {
    this.lines = new ArrayList<>();
    this.currentLineNumber = 0;

    // try-with-resources closes the file even if reading fails
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath))) {
        String eachLine;
        while ((eachLine = bufferedReader.readLine()) != null) {
            lines.add(eachLine);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

MyDataReader is a subtype of LBJava's Parser and overrides its next() method, which acts as an iterator: each call returns the next example, and null once all examples have been read.

The method body is shown below:

public Object next() {
    if (currentLineNumber < lines.size()) {
        MyData ret = new MyData(lines.get(currentLineNumber));
        this.currentLineNumber ++;
        return ret;
    }
    return null;
}
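
The null-terminated iteration pattern above can be sketched in isolation. The class below is a hypothetical, in-memory stand-in for MyDataReader (it is not part of LBJava and returns raw lines instead of MyData objects). It also includes a reset() method, on the assumption that a reader must be rewound before it can be traversed again, for example when training makes several passes over the data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for MyDataReader, used only to
// illustrate the "return null when exhausted" iteration contract.
class LineCursor {
    private final List<String> lines;
    private int currentLineNumber = 0;

    public LineCursor(List<String> lines) {
        this.lines = new ArrayList<>(lines);
    }

    // Returns the next line, or null once every line has been returned.
    public Object next() {
        if (currentLineNumber < lines.size()) {
            return lines.get(currentLineNumber++);
        }
        return null;
    }

    // Rewinds the cursor so the lines can be traversed again.
    public void reset() {
        currentLineNumber = 0;
    }

    // Drains the cursor until the null sentinel and counts the examples seen.
    public static int countExamples(LineCursor cursor) {
        int count = 0;
        for (Object example = cursor.next(); example != null; example = cursor.next()) {
            count++;
        }
        return count;
    }
}
```

A caller loops until next() returns null and calls reset() before starting a second pass.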

6.2 Classifier Declarations

To declare the classifier, we use the LBJava language described in Section 4 LBJava Language.

6.2.1 Feature

The features are declared as follows:

import java.util.List;

real[] MyFeatures(MyData d) <- {
    for (int i = 0; i < d.getFeatures().size(); i++) {
        sense d.getFeatures().get(i);
    }
}

In particular, the type real[] indicates that the literal values of the features are used, rather than their indices.

For example, suppose an example looks like this:

10 20 30 -1

where 10 20 30 are the features and -1 is the label.

If the type real[] is used, the classifier sees the features as 10 20 30. However, if real% is used, the features become 0 1 2, which are the indices.

Please refer to Section 4.1.2.4 Conjunctions for details on types.

6.2.2 Label

The label is declared as follows:

real MyLabel(MyData d) <- {
    return d.getLabel();
}

6.2.3 Classifier

Since we are using a classifier with a real output type, we need to choose a training method that is compatible with this output type. In this example we use Stochastic Gradient Descent. (See Training Algorithms for the complete list of training algorithms and their expected output types.)

The declaration is the following:

real SGDClassifier(MyData d) <-
    learn MyLabel
    using MyFeatures

    with SGD {}

end
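
For readers unfamiliar with the learning algorithm, the following is an illustrative sketch of stochastic gradient descent for linear regression with squared loss. It is not LBJava's implementation: the class name SgdSketch, the bias term, and the fixed learning rate are assumptions made for this example only.

```java
// Illustrative only, NOT LBJava's SGD learner: a linear model trained with
// stochastic gradient descent on the squared loss 0.5 * (prediction - label)^2.
class SgdSketch {
    private final double[] weights;
    private double bias;
    private final double learningRate;

    public SgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.bias = 0.0;
        this.learningRate = learningRate;
    }

    // Linear prediction: bias + sum_i weights[i] * x[i].
    public double predict(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    // One SGD step on a single example: move against the loss gradient.
    public void update(double[] x, double label) {
        double error = predict(x) - label;
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * error * x[i];
        }
        bias -= learningRate * error;
    }

    // Makes `rounds` passes over the data, updating after every example.
    public void train(double[][] xs, double[] ys, int rounds) {
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < xs.length; i++) {
                update(xs[i], ys[i]);
            }
        }
    }
}
```

Each update uses a single example rather than the whole data set, which is what makes the method stochastic; the number of rounds plays the same role as the argument passed to the trainer later in this tutorial.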

6.3 Using SGDClassifier in a Java Program

6.3.1 Generate SGDClassifier

To compile your LBJava file and execute the LBJava code, run the following:

mvn lbjava:compile

This will compile all Java files pertinent to the .lbj file, then generate Java files from the .lbj file. These generated Java files are put in a location which is determined by two things: the gspFlag parameter, and the package at the top of the .lbj file.

For example, if gspFlag is src/main/java (the default), and the package is "my.package" then the generated Java files are put in ./src/main/java/my/package/.
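
The mapping from the two settings to the output directory can be sketched as a small helper. The class and method names below are hypothetical and are not part of the LBJava Maven plugin; the sketch only mirrors the rule described above, in which dots in the package name become directory separators under the gspFlag directory.

```java
// Hypothetical helper, not part of the LBJava plugin: mirrors how the
// output directory is derived from gspFlag and the .lbj package name.
class GeneratedSourcePath {
    public static String resolve(String gspFlag, String packageName) {
        // Dots in the package name become directory separators.
        return gspFlag + "/" + packageName.replace('.', '/');
    }
}
```

For the values in the example above, GeneratedSourcePath.resolve("src/main/java", "my.package") gives src/main/java/my/package.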

The model files (*.lc, *.lex) are put in the directory determined by the dFlag parameter. By default, this is target/classes.

If you only want to generate the Java translations of the LBJava code without executing it, you can run:

mvn lbjava:generate

Then to compile all classes run:

mvn compile

If you want to remove the Java files generated by running "mvn lbjava:compile", then run the following:

mvn lbjava:clean

To remove target/classes, run:

mvn clean

Note: If the generated Java files already exist (from a previous run of lbjava:compile or lbjava:generate), you need to run lbjava:clean before compiling again.

Acknowledgement to Christos Christodoulopoulos.

6.3.2 Use SGDClassifier programmatically

Once SGDClassifier is generated from the previous step, you may invoke it programmatically.

Here is the sample code to use it:

MyDataReader train = new MyDataReader("data/train.txt");

// training
Learner learner = new SGDClassifier();
BatchTrainer trainer = new BatchTrainer(learner, train);
trainer.train(1000);

First, read the training data set into a MyDataReader and create an SGDClassifier. Then pass the SGDClassifier and the reader to a BatchTrainer and call train with the number of training rounds (1000 here).

6.4 Testing a Real Classifier

Here is sample code that uses the TestReal class:

MyDataReader test = new MyDataReader("data/test.txt");
Classifier oracle = new MyLabel();
TestReal.testReal(learner, oracle, test, true);

First, read the testing data set into a MyDataReader and create an oracle Classifier from the labels (MyLabel).

The class TestReal is used to evaluate classifiers with real output.

testReal is a static method of the TestReal class, so it is called directly on the class, passing the SGDClassifier, the oracle Classifier, the testing data set test, and a boolean debug flag as arguments.

The TestReal class reports the root mean squared error (RMSE) for reference.
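
For reference, the root mean squared error is the square root of the average squared difference between predictions and true labels. The sketch below shows that computation on plain arrays; it is an independent illustration with a hypothetical class name, and it does not reproduce the internals of TestReal.

```java
// Independent illustration of the RMSE metric; not TestReal's own code.
class RmseSketch {
    public static double rootMeanSquaredError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException(
                "Arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumOfSquares += diff * diff;
        }
        // Average the squared errors, then take the square root.
        return Math.sqrt(sumOfSquares / predicted.length);
    }
}
```

A lower value means the predictions are closer to the labels, and a value of 0 means every prediction matches its label exactly.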