YerevaNN

Challenges of reproducing R-NET neural network using Keras

2017-08-25T00:00:00+00:00

By Martin Mirakyan, Karen Hambardzumyan and Hrant Khachatrian.

In this post we describe our attempt to re-implement R-NET, a neural architecture for automated question answering developed by the Natural Language Computing Group of Microsoft Research Asia. This architecture demonstrates the best performance among single models (not ensembles) on the Stanford Question Answering Dataset (SQuAD) as of August 25, 2017. MSR researchers released a technical report describing the model but did not release the code. We tried to implement the architecture in the Keras framework and reproduce their results. This post describes the model and the challenges we faced while implementing it. The code is available on GitHub.

Problem statement
The architecture of R-NET
Implementation details
Results and comparison with R-NET technical report
Challenges of reproducibility

Problem statement

Given a passage and a question, the task is to predict an answer to the question based on the information found in the passage. The SQuAD dataset further constrains the answer to be a contiguous sub-span of the provided passage. Answers usually include non-entities and can be long phrases. The neural network needs to “understand” both the passage and the question in order to give a valid answer. Here is an example from the dataset.

Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla’s breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.

Question: On what did Tesla blame for the loss of the initial money?

Answer: Panic of 1901

The architecture of R-NET

The architecture of R-NET network is designed to take the question and the passage as inputs and to output an interval on the passage that contains the answer. The process consists of several steps:

Encode the question and the passage
Obtain question aware representation for the passage
Apply self-matching attention on the passage to get its final representation.
Predict the interval which contains the answer of the question.

Each of these steps is implemented as some sort of recurrent neural network. The model is trained end-to-end.

Drawing complex recurrent networks

We are using GRU cells (Gated Recurrent Unit) for all RNNs. The authors claim that their performance is similar to LSTM, but they are computationally cheaper.

Most of the modules of R-NET are implemented as recurrent networks with complex cells. We draw these cells using colorful charts. Here is a chart that corresponds to the original GRU cell.

White rectangles represent operations on tensors (dot product, sum, etc.). Yellow rectangles are activations (tanh, softmax or sigmoid). Orange circles are the weights of the network. Compare this to the formula of GRU cell (taken from Olah’s famous blogpost):

$\begin{aligned} z_t &=\sigma(W_z \cdot [h_{t-1}, x_t]) \\ r_t &=\sigma(W_r \cdot [h_{t-1}, x_t]) \\ \tilde{h}_t &= \tanh(W \cdot [r_t \circ h_{t-1}, x_t]) \\ h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \end{aligned}$
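To make the formula concrete, here is a minimal NumPy sketch of a single GRU step. It is only an illustration and not the code of our WrappedGRU: biases are omitted (as in the formula), and the toy sizes and random weights are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step following the formula above (no biases)."""
    hx = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    z = sigmoid(W_z @ hx)                         # update gate z_t
    r = sigmoid(W_r @ hx)                         # reset gate r_t
    # candidate state uses the reset-gated previous state
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))
    return (1.0 - z) * h_prev + z * h_tilde       # new state h_t

rng = np.random.default_rng(0)
H, D = 4, 3                                       # toy sizes (assumed)
W_z, W_r, W = (rng.normal(size=(H, H + D)) for _ in range(3))
h = np.zeros(H)
for x_t in rng.normal(size=(5, D)):               # run over a 5-step toy sequence
    h = gru_step(x_t, h, W_z, W_r, W)
```

Since $h_t$ is a convex combination of $h_{t-1}$ and a tanh output, the state stays within $[-1, 1]$ when it starts from zeros.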

Some parts of the R-NET architecture require tensors that are neither part of a GRU state nor part of the input at time $t$ . These are “global” variables that are used at all timesteps. Following Theano’s terminology, we call these global variables non-sequences.

To make it easier to create GRU cells with additional features and operations we’ve created a utility class called WrappedGRU which is a base class for all GRU modules. WrappedGRU supports operations with non-sequences and sharing weights between modules. Keras doesn’t directly support weight sharing, but instead it supports layer sharing and we use SharedWeight layer to solve this problem (SharedWeight is a layer that has no inputs and returns tensor of weights). WrappedGRU supports taking SharedWeight as an input.

1. Question and passage encoder

This step consists of two parts: preprocessing and text encoding. The preprocessing is done in a separate process and is not part of the neural network. First we preprocess the data by splitting it into parts, and then we convert all the words to corresponding vectors. Word-vectors are generated using gensim.

The next steps are already part of the model. Each word is represented by a concatenation of two vectors: its GloVe vector and another vector that holds character level information. To obtain character level embeddings we use an Embedding layer followed by a Bidirectional GRU cell wrapped inside a TimeDistributed layer. Basically, each character is embedded in $H$ dimensional space, and a BiGRU runs over those embeddings to produce a vector for the word. The process is repeated for all the words using TimeDistributed layer.

Code on GitHub

TimeDistributed(Sequential([
    InputLayer(input_shape=(C,), dtype='int32'),
    Embedding(input_dim=127, output_dim=H, mask_zero=True),
    Bidirectional(GRU(units=H))
]))

When the word is missing from GloVe, we set its word vector to all zeros (as described in the technical report).
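The zero-vector fallback can be sketched as follows. The `lookup` helper and the 3-dimensional toy vectors are hypothetical and only illustrate the idea; they are not our preprocessing code.

```python
import numpy as np

def lookup(words, vectors, dim):
    """Stack word vectors; words missing from `vectors` become all-zero rows."""
    return np.stack([vectors.get(w, np.zeros(dim)) for w in words])

# Hypothetical 3-dimensional vectors standing in for GloVe
toy_vectors = {"tesla": np.array([0.1, 0.2, 0.3])}
emb = lookup(["tesla", "wardenclyffe"], toy_vectors, dim=3)
```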

Following the notation of the paper, we denote the vector representation of the question by $u^Q$ and the representation of the passage by $u^P$ ( $Q$ corresponds to the question and $P$ corresponds to the passage).

The network takes the preprocessed question $Q$ and the passage $P$ , applies masking on each one and then encodes them with 3 consecutive bidirectional GRU layers.

Code on GitHub

# Encode the passage P
uP = Masking() (P)

for i in range(3):
    uP = Bidirectional(GRU(units=H,
                           return_sequences=True,
                           dropout=dropout_rate)) (uP)
uP = Dropout(rate=dropout_rate, name='uP') (uP)

# Encode the question Q
uQ = Masking() (Q)

for i in range(3):
    uQ = Bidirectional(GRU(units=H,
                           return_sequences=True,
                           dropout=dropout_rate)) (uQ)
uQ = Dropout(rate=dropout_rate, name='uQ') (uQ)

After encoding the passage and the question we finally have their vector representations $u^P$ and $u^Q$ . Now we can delve deeper in understanding the meaning of the passage having in mind the question.

2. Obtain question aware representation for the passage

The next module computes another representation for the passage by taking into account the words inside the question sentence. We implement it using the following code:

Code on GitHub

vP = QuestionAttnGRU(units=H,
             return_sequences=True) ([
                 uP, uQ,
                 WQ_u, WP_v, WP_u, v, W_g1
             ])

QuestionAttnGRU is a complex extension of a recurrent layer (extends WrappedGRU and overrides the step method by adding additional operations before passing the input to the GRU cell).

The vectors of the question-aware representation of the passage are denoted by $v^P$ . As a reminder, $u^P_t$ is the vector representation of the $t$ -th word of the passage $P$ , and $u^Q$ is the matrix representation of the question $Q$ (each row corresponds to a single word).

In QuestionAttnGRU first we combine three things:

the previous state of the GRU ( $v^P_{t-1}$ )
matrix representation of the question ( $u^Q$ )
vector representation of the passage ( $u^P_{t}$ ) at the $t$ -th word.

We compute the dot product of each input with the corresponding weights, then sum them all up after broadcasting them into the same shape. The outputs of dot( $u^P_{t}$ , $W^P_{u}$ ) and dot( $v^P_{t-1}$ , $W^P_{v}$ ) are vectors, while the output of dot( $u^Q$ , $W^Q_{u}$ ) is a matrix, so we broadcast (repeat several times) the vectors to match the shape of the matrix and then compute the sum of the three matrices. Then we apply tanh activation on the result. The output of this operation is then multiplied (dot product) by a weight vector $V$ , after which $softmax$ activation is applied. The output of the $softmax$ is a vector of non-negative numbers that represent the “importance” of each word in the question. Vectors of this kind are often called attention vectors. By computing the dot product of $u^Q$ (matrix representation of the question) and the attention vector, we obtain a single vector for the entire question, which is a weighted average of the question word vectors (weighted by the attention scores). The intuition behind this part is that we get a representation of the parts of the question that are relevant to the current word of the passage. This representation, denoted by $c_{t}$ , depends on the current word, the whole question and the previous state of the recurrent cell (formula 4 on page 3 of the report).
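The attention pooling described in this paragraph can be sketched in a few lines of NumPy. This is an illustration under assumed toy shapes (no batching, no masking, random weights), not the actual QuestionAttnGRU step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def question_attention(uQ, uP_t, vP_prev, WQ_u, WP_u, WP_v, v):
    """Attention-pooled question vector c_t for one passage word.

    uQ: (M, D) question matrix, uP_t: (D,) current passage word,
    vP_prev: (H,) previous GRU state, v: (H,) weight vector.
    """
    # (M, H) + (H,) + (H,): the two vectors are broadcast over the M question words
    s = np.tanh(uQ @ WQ_u + uP_t @ WP_u + vP_prev @ WP_v)
    a = softmax(s @ v)           # (M,) attention weights over the question words
    c_t = a @ uQ                 # (D,) weighted average of question word vectors
    return c_t, a

rng = np.random.default_rng(0)
M, D, H = 5, 6, 4                # toy sizes (assumed)
uQ = rng.normal(size=(M, D))
c_t, a = question_attention(
    uQ, rng.normal(size=D), rng.normal(size=H),
    rng.normal(size=(D, H)), rng.normal(size=(D, H)), rng.normal(size=(H, H)),
    rng.normal(size=H))
```

If the weight vector is all zeros, every question word gets the same attention weight and $c_t$ reduces to the plain average of the rows of $u^Q$.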

These ideas seem to come from a paper by Rocktäschel et al. from DeepMind. The authors suggested passing this $c_{t}$ vector as an input to the GRU cell. Wang and Jiang from Singapore Management University argued that passing $c_{t}$ alone is not enough, because we lose information from the “original” input $u^P_{t}$ . So they suggested concatenating $c_{t}$ and $u^P_{t}$ before passing the result to the GRU cell.

The authors of R-NET took one more step. They applied an additional gate to the concatenated vector $[c_{t}, u^P_{t}]$ . The gate is simply a dot product of a new weight matrix $W_{g}$ and the concatenated vector, passed through a sigmoid activation function. The output of the gate is a vector of numbers between 0 and 1, which is then multiplied element-wise by the original concatenated vector (see formula 6 on page 4 of the report). The result of this multiplication is finally passed to the GRU cell as an input.
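A small NumPy sketch of the gate, assuming toy dimensions and a square weight matrix $W_g$ (the function name `gated_input` is ours and is used only for illustration):

```python
import numpy as np

def gated_input(c_t, uP_t, W_g):
    """Gate the concatenated vector [c_t, uP_t] before it enters the GRU cell."""
    x = np.concatenate([c_t, uP_t])
    g = 1.0 / (1.0 + np.exp(-(W_g @ x)))   # sigmoid gate, one value per component
    return g * x                           # element-wise product

rng = np.random.default_rng(0)
D = 3                                      # toy size (assumed)
c_t, uP_t = rng.normal(size=D), rng.normal(size=D)
gru_input = gated_input(c_t, uP_t, rng.normal(size=(2 * D, 2 * D)))
```

With an all-zero $W_g$ the gate equals 0.5 everywhere, so the input is simply halved; a trained gate can instead suppress the components that are irrelevant for the current word.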

3. Apply self-matching attention on the passage to get its final representation

Next, the authors suggest adding a self-attention mechanism on the passage itself.

Code on GitHub

hP = Bidirectional(SelfAttnGRU(units=H,
                               return_sequences=True)) ([
                       vP, vP,
                       WP_v, WPP_v, v, W_g2
                   ])
hP = Dropout(rate=dropout_rate, name='hP') (hP)

The output of the previous step (Question attention) is denoted by $v^P$ . It represents the encoding of the passage while taking into account the question. $v^P$ is passed as an input to the self-matching attention module (top input, left input). The authors argue that the vectors $v^P_{t}$ have very limited information about the context. Self-matching attention module attempts to augment the passage vectors by information from other relevant parts of the passage.

The output of the self-matching GRU cell at time $t$ is denoted by $h^P_{t}$ .

The implementation is very similar to the previous module. We compute dot products of weights $W^{PP}_{v}$ (WPP_v in the code) with the current word vector $v^P_{t}$ , and $W^P_{v}$ with the entire $v^P$ matrix, then add them up and apply $\tanh$ activation. Next, the result is multiplied by a weight vector $V$ and passed through $softmax$ activation, which produces an attention vector. The dot product of the attention vector and the $v^P$ matrix, again denoted by $c_{t}$ , is the weighted average of all word vectors of the passage that are relevant to the current word $v^P_{t}$ . $c_{t}$ is then concatenated with $v^P_{t}$ itself. The concatenated vector is passed through a gate and is given to the GRU cell as an input.

The authors consider this step as their main contribution to the architecture.

It is interesting to note that the authors write BiRNN in Section 3.3 (Self-Matching Attention) and just RNN in Section 3.2 (which describes question-aware passage representation). For that reason we used BiGRU in SelfAttnGRU and unidirectional GRU in QuestionAttnGRU. Later we discovered a sentence in Section 4.1 which suggests that we were not correct: “the gated attention-based recurrent network for question and passage matching is also encoded bidirectionally in our experiment”.

4. Predict the interval which contains the answer of a question

Finally we’re ready to predict the interval of the passage which contains the answer of the question. To do this we use QuestionPooling layer followed by PointerGRU (Vinyals et al., Pointer networks, 2015).

Code on GitHub

rQ = QuestionPooling() ([uQ, WQ_u, WQ_v, v])
rQ = Dropout(rate=dropout_rate, name='rQ') (rQ)

...

ps = PointerGRU(units=2 * H,
                return_sequences=True,
                initial_state_provided=True,
                name='ps') ([
            fake_input, hP,
            WP_h, Wa_h, v, rQ
        ])

answer_start = Slice(0, name='answer_start') (ps)
answer_end = Slice(1, name='answer_end') (ps)

QuestionPooling is the attention pooling of the whole question vector $u^Q$ . Its purpose is to create the first hidden state of PointerGRU. It is similar to the other attention-based modules, but has a strange description in the report. Formula 11 on page 5 includes a product of two tensors $W_v^Q$ and $V_r^Q$ . Both these tensors are trainable parameters (as confirmed by Furu Wei, one of the coauthors of the technical report), and it is not clear why this dot product is not replaced by a single trainable vector.

$h^P$ is the output of the previous module and it contains the final representation of the passage. It is passed to this module as an input to obtain the final answer.

In Section 4.2 of the technical report the authors write that after submitting their paper to ACL they made one more modification. They have added another bidirectional GRU on top of $h^P$ before feeding it to PointerGRU.

PointerGRU is a recurrent network that works for just two steps. The first step predicts the first word of the answer span, and the second step predicts the last word. Here is how it works. Both $h^P$ and the previous state of the PointerGRU cell are multiplied by their corresponding weights $W^P_{h}$ and $W^a_{h}$ (WP_h and Wa_h in the code above). Recall that the initial hidden state of the PointerGRU is the output of QuestionPooling. The products are then summed up and passed through $\tanh$ activation. The result is multiplied by the weight vector $V$ and $softmax$ activation is applied, which outputs scores over $h^P$ . These scores, denoted by $a^t$ , are probabilities over the words of the passage. Argmax of the $a^1$ vector is the predicted starting point, and argmax of $a^2$ is the predicted final point of the answer (formula 9 on page 4 of the report). The hidden state of PointerGRU is determined based on the dot product of $h^P$ and $a^t$ , which is passed as an input to a simple GRU cell (formula 10 on page 4 of the report). So, unlike all previous modules of R-NET, the output of PointerGRU (the red diamond at the top-right corner of the chart) is different from its hidden state.
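The two pointer steps can be sketched as follows. This is a simplified NumPy illustration with assumed toy shapes; `toy_cell` is a stand-in for the GRU cell and is not what PointerGRU actually uses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_steps(hP, h0, WP_h, Wa_h, v, cell):
    """Run the two pointer steps and return both distributions over passage words.

    hP: (N, D) passage matrix, h0: (K,) initial state (the pooled question),
    cell(c, h) -> new state is any recurrent cell, e.g. a GRU step.
    """
    h, probs = h0, []
    for _ in range(2):                            # step 1: start, step 2: end
        s = np.tanh(hP @ WP_h + h @ Wa_h) @ v     # (N,) scores
        a = softmax(s)                            # probabilities over passage words
        probs.append(a)                           # the output differs from the state
        c = a @ hP                                # attention-pooled passage vector
        h = cell(c, h)                            # next hidden state
    return np.stack(probs)                        # (2, N)

rng = np.random.default_rng(0)
N, D, A = 7, 4, 3                                 # toy sizes (assumed)
hP = rng.normal(size=(N, D))
toy_cell = lambda c, h: np.tanh(c + h)            # stand-in for a GRU cell
ps = pointer_steps(hP, rng.normal(size=D), rng.normal(size=(D, A)),
                   rng.normal(size=(D, A)), rng.normal(size=A), toy_cell)
start, end = ps.argmax(axis=1)                    # predicted answer interval
```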

Implementation details

We use the Theano backend for Keras. It was faster than TensorFlow in our experiments, although our experience shows that TensorFlow is usually faster for simple network architectures. Probably Theano’s optimization process is more efficient for complex extensions of recurrent networks.

Layers with masking support

One of the most important challenges in training recurrent networks is to handle different lengths of data points in a single batch. Keras has a Masking layer that handles the basic cases. We use it in the encoding layer. But R-NET has more complex scenarios for which we had to develop our own solutions. For example, in all attention pooling modules we use $softmax$ which is applied along “time” axis (e.g. over the words of the passage). We don’t want to have positive probabilities after the last word of the sentence. So we have implemented a custom Softmax function which supports masking:

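from keras import backend as K

def softmax(x, axis, mask):
    # Subtract the maximum for numerical stability
    m = K.max(x, axis=axis, keepdims=True)
    # Zero out the positions outside the mask
    e = K.exp(x - m) * mask
    s = K.sum(e, axis=axis, keepdims=True)
    # Avoid division by zero if nothing survives the mask
    s = K.clip(s, K.epsilon(), None)
    return e / s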

m is used for numerical stability. To support masking we multiply e by the mask. We also clip s by a very small number, because in theory it is possible that all positive values of e are outside the mask.
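The effect of the mask is easy to check with a NumPy analogue of the function above. The `eps` default and the toy scores are assumptions made for this example.

```python
import numpy as np

def masked_softmax(x, mask, axis=-1, eps=1e-7):
    """NumPy analogue of the masked softmax above."""
    m = x.max(axis=axis, keepdims=True)          # for numerical stability
    e = np.exp(x - m) * mask                     # masked positions become zero
    s = np.clip(e.sum(axis=axis, keepdims=True), eps, None)
    return e / s

scores = np.array([[1.0, 2.0, 3.0, 9.0],
                   [0.5, 0.5, 0.0, 0.0]])
mask = np.array([[1, 1, 1, 0],                   # first sequence has 3 real words
                 [1, 1, 0, 0]], dtype=float)     # second sequence has 2 real words
probs = masked_softmax(scores, mask)
```

The large score 9.0 in the first row sits in a padded position, so it gets zero probability, and the remaining positions of each row still sum to one.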

Note that details like this are not described in the technical report. Probably these are considered as commonly known tricks. But sometimes the details of the masking process can have critical effects on the results (we know this from the work on medical time series).

Slice layer

Slice layer is supposed to slice and return the input tensor at the given indices. It also supports masking. The slice layer in R-NET model is needed to extract the final answer (i.e. the interval_start and interval_end numbers). The final output of the model is a tensor with shape (batch x 2 x passage_length). The first row contains probabilities for answer_start and the second one for answer_end, that’s why we need to slice the rows first and then extract the required information. Obviously we could accomplish the task without creating a new layer, yet it wouldn’t be a “Kerasic” solution.
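In plain NumPy terms, the slicing and the extraction of the interval look roughly like this (the probabilities are made up for the example, and masking is ignored):

```python
import numpy as np

# Toy model output with shape (batch=1, 2, passage_length=4)
ps = np.array([[[0.1, 0.7, 0.2, 0.0],
                [0.0, 0.1, 0.3, 0.6]]])

answer_start = ps[:, 0, :]          # what Slice(0) returns: (batch, passage_length)
answer_end = ps[:, 1, :]            # what Slice(1) returns
interval_start = answer_start.argmax(axis=-1)
interval_end = answer_end.argmax(axis=-1)
```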

Generators

Keras supports batch generators, which are responsible for generating one batch per iteration. One benefit of this approach is that the generator works on a separate thread and does not wait for the network to finish training on the previous batch.
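A batch generator is just a Python generator that loops over the data indefinitely. Here is a minimal sketch; our real generator also pads and vectorizes the samples, which is omitted here.

```python
def batch_generator(samples, batch_size):
    """Yield consecutive batches forever, restarting after each pass over the data."""
    while True:
        for i in range(0, len(samples), batch_size):
            yield samples[i:i + batch_size]

gen = batch_generator(list(range(5)), batch_size=2)
first_four = [next(gen) for _ in range(4)]   # the fourth batch starts a new pass
```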

Bidirectional GRUs

R-NET uses multiple bidirectional GRUs. The common way of implementing BiRNN is to take two copies of the same network (without sharing the weights) and then concatenate the hidden states to produce the output. One can take the sum of the vectors instead of concatenating them, but concatenation seems to be more popular (that’s the default version of Bidirectional layer in Keras).

Dropout

The report indicates that dropout is applied “between layers with a dropout rate of 0.2”. We have applied dropout before each of the three layers of BiGRUs of both encoders, at the outputs of both encoders, right after QuestionAttnGRU, after SelfAttnGRU and after QuestionPooling layer. We are not sure that this is exactly what the authors did.

One more implementation detail is related to the way dropout is applied on the passage and question representation matrices. The rows of these matrices correspond to different words and the “vanilla” dropout will apply different masks on different words. These matrices are used as inputs to recurrent networks. But it is a common trick to apply the same mask at each “timestep”, i.e. each word. That’s how dropout is implemented in recurrent layers in Keras. The report doesn’t discuss these details.
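The difference from “vanilla” dropout is only in the shape of the mask. A NumPy sketch of the shared-mask variant, assuming inverted dropout (kept values are rescaled) and a single unbatched sequence:

```python
import numpy as np

def timestep_shared_dropout(x, rate, rng):
    """Dropout on a (timesteps, features) matrix with one mask for all timesteps."""
    keep = 1.0 - rate
    # One row of the mask, broadcast over the time axis
    mask = rng.binomial(1, keep, size=(1, x.shape[1])) / keep
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((6, 8))                 # 6 words, 8 features (toy sizes)
out = timestep_shared_dropout(x, rate=0.2, rng=rng)
```

Every row of `out` is identical here, because the same features are dropped for every word. Vanilla dropout would sample a mask of shape `x.shape` instead.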

The report doesn’t explicitly describe which weights are shared. We have decided to share those weights that are represented by the same symbol in the report. Note that the authors use the same symbol (e.g. $c_{t}$ ) for different variables (not weights) that obviously cannot be shared. But we hope that our assumption is true for weights. In particular, we share:

$W^Q_{u}$ matrix between QuestionAttnGRU and QuestionPooling layers,
$W^P_{v}$ matrix between QuestionAttnGRU and SelfAttnGRU layers,
$V$ vector between all four instances (it is used right before applying softmax).

We didn’t share the weights of the “attention gates”: $W_{g}$ . The reason is that we have a mix of uni- and bidirectional GRUs that use this gate and require different dimensions.

Hyperparameters

The authors of the report give many details about hyperparameters. Hidden vector lengths are 75 for all layers. As we concatenate the hidden states of two GRUs in bidirectional layers, we effectively get 150-dimensional vectors. 75 is not an even number so it could not refer to the length of the concatenated vector :) AdaDelta optimizer is used to train the network with learning rate=1, $\rho=0.95$ and $\varepsilon=10^{-6}$ . Nothing is written about the size of batches, or the way batches are sampled. We used batch_size=50 in our experiments to fit in 4GB GPU memory.

We couldn’t get good performance with 75 hidden units. The models were quickly overfitting. We got our best results using 45 dimensional hidden states.

Weight initialization

The report doesn’t discuss weight initialization. We used default initialization schemes of Keras. In particular, Keras uses orthogonal initialization for recurrent connections of GRU, and uniform (Glorot, Bengio, 2010) initialization for the connections that come from the inputs. We used Glorot initialization for all shared weights. It is not obvious that this was the best solution.

Training

The training script is very simple. First we create the model:

model = RNet(hdim=args.hdim,                                            # Default is 45
             dropout_rate=args.dropout,                                 # Default is 0 (0.2 in the report)
             N=None,                                                    # Size of passage
             M=None,                                                    # Size of question
             char_level_embeddings=args.char_level_embeddings)          # Default is False

It is possible to slightly speed up computations by fixing M and N. It usually helps Theano’s compiler to further optimize the computational graph.

We compile the model and fit it on the training set. Our training data is 90% of the original training set of the SQuAD dataset. The other 10% is used as an internal validation dataset. We check the validation score after each epoch and save the current state of the model if it is better than the previous best one. The original development set of SQuAD is used as a test set; we don’t do model selection based on it.

We had an idea to form the batches in a way that passages inside each batch have almost the same number of words. That would allow us to train a little bit faster (as there would be many batches with short sequences), but we haven’t used this trick yet. We took at most 300 words from passages and 30 words from questions to avoid very long sequences.
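The batching idea could look roughly like the sketch below. It is not part of our code; the helper name and the toy data are made up for illustration.

```python
def length_bucketed_batches(passages, batch_size):
    """Return batches of sample indices such that each batch has similar lengths."""
    order = sorted(range(len(passages)), key=lambda i: len(passages[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# Toy tokenized passages with 7, 2, 9, 3, 8 and 1 words
toy = [["w"] * n for n in (7, 2, 9, 3, 8, 1)]
batches = length_bucketed_batches(toy, batch_size=2)
```

Each batch then needs padding only up to the length of its own longest passage. In practice one would probably also shuffle the order of the batches between epochs.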

Each epoch took around 100 minutes on a GTX980 GPU. We got our best results after 31 epochs.

Results and comparison with R-NET technical report

R-NET is currently (August 2017) the best model on Stanford QA benchmark among single models. SQuAD dataset uses two performance metrics: exact match (EM) and F1-score (F1). Human performance is estimated to be EM=82.3% and F1=91.2% on the test set.
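For intuition, here is a simplified sketch of the two metrics for a single prediction. The official SQuAD evaluation script additionally strips punctuation and articles and takes the maximum over several reference answers; this version only lowercases and splits on whitespace.

```python
from collections import Counter

def exact_match(prediction, truth):
    """1.0 if the token sequences are identical (case-insensitive), else 0.0."""
    return float(prediction.lower().split() == truth.lower().split())

def f1_score(prediction, truth):
    """Token-level F1 between the predicted and the reference answer."""
    pred, gold = prediction.lower().split(), truth.lower().split()
    common = sum((Counter(pred) & Counter(gold)).values())   # overlapping tokens
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the Panic of 1901", "Panic of 1901")
f1 = f1_score("the Panic of 1901", "Panic of 1901")
```

A prediction with one extra word gets no credit under EM but most of the credit under F1, which is why F1 is always at least as high as EM.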

The report by Microsoft Research describes two versions of R-NET. The first one is called R-NET (Wang et al., 2017) (which refers to a paper which is not yet available online) and reaches EM=71.3% and F1=79.7% on the test set. It is the model we described above without the additional BiGRU between SelfAttnGRU and PointerGRU. The second version, called R-NET (March 2017), has the additional BiGRU and reaches EM=72.3% and F1=80.7%. The current best single model on the SQuAD leaderboard has a higher score, which means R-NET development continued after the technical report was released. Ensemble models reach even higher scores.

The best performance we got so far with our implementation is EM=57.52% and F1=67.42% on the development set. These results would put R-NET at the bottom of the SQuAD leaderboard. The model is available on GitHub. We want to emphasize that R-NET’s technical report is pretty good in terms of the reported details of the architecture compared to many other papers. Probably we misunderstood several important details or have bugs in the code. Any feedback will be appreciated.

Challenges of reproducibility

Recently, ICML 2017 hosted a special workshop devoted to the issues of reproducibility in machine learning. Hugo Larochelle shared the slides of his presentation, where he discussed many aspects of the problem. He argues that the research should be considered as reproducible if the code is open-sourced. On the other hand he suggests that the community should not require researchers to compare their new models with a related published result if the code for the latter is not available.

As a radical solution he suggests using platforms like AI-ON. AI-ON open-sources not only the code, but the whole research process, including discussions and code experiments. We are thinking about starting AI-ON projects just for reproducing the results of important papers that come without code.

On the other hand, there are many simple tricks that can significantly improve reproducibility with little effort. For example, many papers report the number of parameters in the neural network. This number is a good checksum for other people. Another simple trick is to write the shapes of the tensors in the diagrams (just like we did in this post) or even in the text.
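As an example of such a checksum, the number of parameters of a GRU layer is easy to compute by hand. The sketch below assumes the classic GRU formulation with one bias vector per gate; the input size of 300 is an assumption chosen for the example and is not taken from the report.

```python
def gru_param_count(input_dim, units, bidirectional=False):
    """Parameters of a GRU layer: three gates, each with W, U and a bias."""
    per_gate = units * (input_dim + units) + units
    total = 3 * per_gate
    return 2 * total if bidirectional else total

n_params = gru_param_count(input_dim=300, units=75, bidirectional=True)
```

If a paper reports this number per layer, a single mismatch immediately reveals a misunderstanding about input sizes, weight sharing or directionality.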

The best open source model on SQuAD that we are aware of is the implementation of DrQA architecture released in Facebook’s ParlAI repository. It reaches EM=66.4% and F1=76.5%. We will continue to play with our codebase and try to improve the results.

Interpreting neurons in an LSTM network

2017-06-27T00:00:00+00:00

By Tigran Galstyan and Hrant Khachatrian.

A few months ago, we showed how effectively an LSTM network can perform text transliteration.

For humans, transliteration is a relatively easy and interpretable task, so it’s a good task for interpreting what the network is doing, and whether it is similar to how humans approach the same task.

In this post we’ll try to understand: What do individual neurons of the network actually learn? How are they used to make decisions?

Transliteration
Network architecture
Analyzing the neurons
- How does “t” become “ծ”?
- What did this neuron learn?
Visualizing LSTM cells
Concluding remarks

Transliteration

About half of the billions of internet users speak languages written in non-Latin alphabets, like Russian, Arabic, Chinese, Greek and Armenian. Very often, they haphazardly use the Latin alphabet to write those languages.

Привет: Privet, Privyet, Priwjet, …
كيف حالك: kayf halk, keyf 7alek, …
Բարև Ձեզ: Barev Dzez, Barew Dzez, …

So a growing share of user-generated text content is in these “Latinized” or “romanized” formats that are difficult to parse, search or even identify. Transliteration is the task of automatically converting this content into the native canonical format.

Aydpes aveli sirun e.: Այդպես ավելի սիրուն է:

What makes this problem non-trivial?

Different users romanize in different ways, as we saw above. For example, v or w could be Armenian վ.
Multiple letters can be romanized to the same Latin letter. For example, r could be Armenian ր or ռ.
A single letter can be romanized to a combination of multiple Latin letters. For example, ch could be Cyrillic ч or Armenian չ, but c and h by themselves are for other letters.
English words and translingual Latin tokens like URLs occur in non-Latin text. For example, the letters in youtube.com or MSFT should not be changed.

Humans are great at resolving these ambiguities. We showed that LSTMs can also learn to resolve all these ambiguities, at least for Armenian. For example, our model correctly transliterated es sirum em Deep Learning into ես սիրում եմ Deep Learning and not ես սիրում եմ Դեեփ Լէարնինգ.

Network architecture

We took lots of Armenian text from Wikipedia and used probabilistic rules to obtain romanized text. The rules are chosen in a way that they cover most of the romanization rules people use for Armenian.

We encode Latin characters as one-hot vectors and apply character level bidirectional LSTM. At each time-step the network tries to guess the next character of the original Armenian sentence. Sometimes a single Armenian character is represented by multiple Latin letters, so it is very helpful to align the romanized and original texts before giving them to LSTM (otherwise we should use sequence-to-sequence networks, which are harder to train). Fortunately we can do the alignment, because the romanized version was generated by ourselves. For example, dzi should be transliterated into ձի, where dz corresponds to ձ and i to ի. So we add a placeholder character in the Armenian version: ձի becomes ձ_ի, so that now z should be transliterated into _. After the inference we just remove _s from the output string.
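The alignment trick can be sketched as follows, assuming we kept the list of (Armenian character, romanization) pairs that was produced while generating the romanized text. The helper names are ours and are used only for illustration.

```python
PLACEHOLDER = "_"

def pad_target(pairs):
    """Pad each Armenian character so it is as long as its romanization."""
    return "".join(arm + PLACEHOLDER * (len(lat) - 1) for arm, lat in pairs)

def strip_placeholders(text):
    """Undo the padding after inference."""
    return text.replace(PLACEHOLDER, "")

pairs = [("ձ", "dz"), ("ի", "i")]
latin = "".join(lat for _, lat in pairs)     # network input
target = pad_target(pairs)                   # aligned network output
restored = strip_placeholders(target)
```

After padding, the input and the target have the same length, so the network can predict exactly one output character per input character.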

Our network consists of two LSTMs (228 cells) going forward and backward on the Latin sequence. The outputs of the LSTMs are concatenated at each step (concat layer), then a dense layer with 228 neurons is applied on top of it (hidden layer), and another dense layer (output layer) with softmax activations is used to get the output probabilities. We also concatenate the input vector to the hidden layer, so it has 300 neurons. This is a simplified version of the network described in our previous post on this topic (the main difference is that we don’t use the second layer of biLSTM).

Analyzing the neurons

We tried to answer the following questions:

How does the network handle interesting cases with several possible outcomes (e.g. r => ր vs ռ etc.)?
What are the problems particular neurons are helping solve?

How does “t” become “ծ”?

First, we fixed one particular character for the input and one for the output. For example, we are interested in how t becomes ծ (we know t can become տ, թ or ծ). We know that it usually happens when t appears in the bigram ts, which should be converted to ծ_.

For every neuron, we draw the histograms of its activations in cases where the correct output is ծ, and where the correct output is not ծ. For most of the neurons these two histograms are pretty similar, but there are cases like this:

[Figure: two histograms of the activations of this neuron. Panel titles: Input = `t`, Output = `ծ` and Input = `t`, Output != `ծ`]

These histograms show that by looking at the activation of this particular neuron we can guess with high accuracy whether the output for t is ծ. To quantify the difference between the two histograms we used Hellinger distance (we take the minimum and maximum values of neuron activations, split the range into 1000 bins and apply discrete Hellinger distance formula on two histograms). We calculated this distance for all neurons and visualized the most interesting ones in a single image:
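The distance computation described above can be sketched in NumPy like this. The toy activations are made up; in our analysis the two arrays would hold the activations of one neuron for the two groups of cases.

```python
import numpy as np

def hellinger_distance(a, b, bins=1000):
    """Discrete Hellinger distance between the histograms of two samples."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())   # shared range
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()                          # normalize to probabilities
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

same = hellinger_distance(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
disjoint = hellinger_distance(np.linspace(0, 1, 100), np.linspace(2, 3, 100))
```

The distance is 0 for identical histograms and 1 for histograms that do not overlap at all, so values close to 1 indicate neurons that separate the two cases well.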

The color of a neuron indicates the distance between its two histograms (darker colors correspond to larger distances). The width of a line between two neurons indicates the mean of the value that the neuron on the lower end of the connection contributes to the neuron on the higher end. Orange and green lines correspond to positive and negative signals, respectively.

The neurons at the top of the image are from the output layer, the neurons below the output layer are from the hidden layer (top 12 neurons in terms of the distance between histograms). Concat layer comes under the hidden layer. The neurons of the concat layer are split into two parts: the left half of the neurons are the outputs of the LSTM that goes forward on the input sequence and the right half contains the neurons from the LSTM that goes backwards. From each LSTM we display top 10 neurons in terms of the distance between histograms.

In the case of t => ծ, it is obvious that all top 12 neurons of the hidden layer pass positive signals to ծ and ց (another Armenian character that is often romanized as ts), and pass negative signals to տ, թ and others.

We can also see that the outputs of the right-to-left LSTM are darker, which implies that these neurons “have more knowledge” about whether to predict ծ. Moreover, the lines between those neurons and the hidden layer are thicker, which means that they contribute more to activating the top 12 neurons in the hidden layer. This is a very natural result, because we know that t usually becomes ծ when the next symbol is s, and only the right-to-left LSTM is aware of the next character.

We did the same analysis for the neurons and gates inside the LSTMs. The results are visualized as six rows of neurons at the bottom of the image. In particular, it is interesting to note that the most “confident” neurons are the so called cell inputs. Recall that cell inputs, as well as all the gates, depend on the input at the current step and the hidden state of the previous step (which is the hidden state at the next character as we talk about the right-to-left LSTM), so all of them are “aware” of the next s, but for some reason cell inputs are more confident than others.

In the cases where s should be transliterated into _ (the placeholder), the useful information is more likely to come from the LSTM that goes forward, as s becomes _ mainly in case of ts => ծ_. We see that in the next plot:

What did this neuron learn?

In the second part of our analysis we tried to figure out in which ambiguous cases each of the neurons is most helpful. We took the set of Latin characters that can be transliterated into more than one Armenian letter. Then we removed the cases where one of the possible outcomes appears fewer than 300 times in our 5000 sample sentences, because our distance metric didn’t seem to work well with few samples. Finally, we analyzed every neuron for every remaining input-output pair.

For example, here is the analysis of neuron #70 of the output layer of the left-to-right LSTM. We have seen in the previous visualization that it helps determine whether s should be transliterated into _. The top input-output pairs for this neuron are the following:

Hellinger distance	Latin character	Armenian character
0.9482	s	_
0.8285	h	հ
0.8091	h	_
0.6125	o	օ

So this neuron is most helpful when predicting _ from s (as we already knew), but it also helps to determine whether Latin h should be transliterated as Armenian հ or the placeholder _ (e.g. Armenian չ is usually romanized as ch, so h sometimes becomes _).
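
The Hellinger distance used in this table can be computed from two activation histograms as in the sketch below. It assumes both histograms are built over the same bins; the function name and the histogram counts are made up for illustration and are not taken from our experiments.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two histograms over the same bins.

    Both inputs are normalized to sum to 1 first; the result lies in [0, 1],
    where 0 means identical distributions and 1 means no overlap at all.
    """
    sp, sq = float(sum(p)), float(sum(q))
    total = sum((math.sqrt(a / sp) - math.sqrt(b / sq)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * total)

# Hypothetical 10-bin histograms of one neuron's activations,
# split by the correct output for the input character "s".
hist_s_to_placeholder = [0, 0, 0, 1, 2, 5, 40, 180, 420, 352]
hist_s_to_other = [310, 400, 200, 60, 20, 8, 2, 0, 0, 0]

distance = hellinger(hist_s_to_placeholder, hist_s_to_other)
print(round(distance, 4))
```

A neuron whose two histograms barely overlap gets a distance close to 1, which is what we call a “confident” neuron for that input-output pair.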

We visualize Hellinger distances of the histograms of neuron activations when the input is h and the output is _, and see that the neuron #70 is among the top 10 neurons of the left-to-right LSTM for the h=>_ pair.

Visualizing LSTM cells

Inspired by this paper by Andrej Karpathy, Justin Johnson and Fei-Fei Li, we tried to find neurons or LSTM cells specialised in some language specific patterns in the sequences. In particular, we tried to find the neurons that react most to the suffix թյուն (romanized as tyun).

The first row of this visualization is the output sequence. Rows below show the activations of the most interesting neurons:

Cell #6 in the LSTM that goes backwards,
Cell #147 in the LSTM that goes forward,
37th neuron in the hidden layer,
78th neuron in the concat layer.

We can see that Cell #6 is active on tyuns and is not active on the other parts of the sequence. Cell #147 of the forward LSTM behaves the opposite way: it is active on everything except tyuns.

We know that t in the suffix tyun should always become թ in Armenian, so we thought that if a neuron is active on tyuns, it may help in determining whether the Latin t should be transliterated as թ or տ. So we visualized the most important neurons for the pair t => թ.

Indeed, Cell #147 in the forward LSTM is among the top 10.

Concluding remarks

Interpretability of neural networks remains an important challenge in machine learning. CNNs and LSTMs perform well for many learning tasks, but there are very few tools to understand the inner workings of these systems. Transliteration is a pretty good problem for analyzing the impact of particular neurons.

Our experiments showed that too many neurons are involved in the “decision making” even for the simplest cases, but it is possible to identify a subset of neurons that have more influence than the rest. On the other hand, most neurons are involved in multiple decision making processes depending on the context. This is expected, since nothing in the loss functions we use when training neural nets forces the neurons to be independent and interpretable. Recently, there have been some attempts to apply information-theoretic regularization terms in order to obtain more interpretability. It would be interesting to test those ideas in the context of transliteration.

We would like to thank Adam Mathias Bittlingmayer and Zara Alaverdyan for helpful comments and discussions.

Announcing YerevaNN non-profit foundation

2016-10-17T00:00:00+00:00

Today we have officially registered YerevaNN scientific educational foundation, which aims to promote world-class AI research in Armenia and develop high quality educational programs in machine learning and related disciplines. The board members of the foundation are Gor Vardanyan, founder of FimeTech, Vazgen Hakobjanyan, cofounder of Teamable, and Rouben Meschian, founder of Arminova Technologies. Hrant Khachatrian is the director of the foundation.

The core project of the foundation is to support an AI research lab based in Yerevan, Armenia. Inspired by OpenAI, the lab focuses on non-commercial machine learning research and is committed to publish all obtained results and release all the code on GitHub. The three initial members of YerevaNN lab, Tigran Galstyan, Karen Hambardzumyan and Hrayr Harutyunyan, currently work on projects ranging from generative models to natural language processing.

Sentence representations and question answering (slides)

2016-09-21T00:00:00+00:00

The success of neural word embedding models like word2vec and GloVe motivated research on representing sentences in an n-dimensional space. Michael Manukyan and Hrayr Harutyunyan reviewed several sentence representation algorithms and their applications in state-of-the-art automated question answering systems during a talk at the Armenian NLP meetup. The slides of the talk are below. Follow us on SlideShare to get the latest slides from YerevaNN.

Sentence representations and question answering (YerevaNN) from YerevaNN

Automatic transliteration with LSTM

2016-09-09T00:00:00+00:00

By Tigran Galstyan, Hrayr Harutyunyan and Hrant Khachatrian.

Many languages have their own non-Latin alphabets but the web is full of content in those languages written in Latin letters, which makes it inaccessible to various NLP tools (e.g. automatic translation). Transliteration is the process of converting the romanized text back to the original writing system. In theory every language has a strict set of romanization rules, but in practice people do not follow the rules and most of the romanized content is hard to transliterate using rule based algorithms. We believe this problem is solvable using the state of the art NLP tools, and we demonstrate a high quality solution for Armenian based on recurrent neural networks. We invite everyone to adapt our system for more languages.

Problem description
Data processing
Network architecture
Results
Future work

Problem description

Since the early 1990s computers have become widespread in many countries, but the operating systems did not fully support different alphabets out of the box. Most keyboards had only Latin letters printed on them, and people started to invent romanization rules for their languages. Every language has its own story, and these stories are usually not known outside their own communities. In the case of Armenian, some solutions were developed, but even those who knew how to write in Armenian characters were not sure that the readers (e.g. the recipient of an email) would be able to read them.


Armenian alphabet in the Unicode space. Source: Wikipedia

In the Unicode era all major OSes started to support displaying Armenian characters. But the lack of keyboard layouts was still a problem. In the late 2000s mobile internet penetration exploded in Armenia, and most of the early mobile phones did not support writing in Armenian. For example, iOS doesn’t include an Armenian keyboard and started to officially support custom keyboards only in 2014! The result was that lots of people entered the web (mostly through social networks) without having access to Armenian letters. So everyone started to use some sort of romanization (obviously no one was aware that there are fixed standards for the romanization of Armenian).

Currently there are many attempts to fight romanized Armenian on forums and social networks. Armenian keyboard layouts have been developed for every popular platform. But still lots of content is produced in non-Armenian letters (maybe only Facebook knows the exact scale of the problem), and such content remains inaccessible for search indexing, automated translation, text-to-speech, etc. Recently the problem started to flow outside the web: people use romanized Armenian on the streets.


Romanized Armenian on the street. Source: VKontakte social network

There are some online tools that correctly transliterate romanized Armenian if it is written using strict rules. Hayeren.am is the most famous example. Facebook’s search box also recognizes some romanizations (but not all). But for many practical cases these tools do not give a reasonable output. The algorithm must be able to use the context to correctly predict the Armenian character.


Facebook’s search box recognizes some romanized Armenian. Note that the spelling suggestion is not for Armenian.

Finally, there are debates whether these tools actually help fighting the “translit” problem. Some argue that people will not be forced to use Armenian keyboard if there are very good tools to transliterate. We believe that the goal of making this content available for the NLP tools is extremely important, as no one will (and should) develop, say, language translation tools for romanized alphabets.

Wikipedia has similar stories for Greek, Persian and Cyrillic alphabets. The problem exists for many writing systems and is mostly overlooked by the NLP community, although it’s definitely not the hardest problem in NLP. We hope that the solution we develop for Armenian might become helpful for other languages as well.

Data processing

We are using a recurrent neural network that takes a sequence of characters (romanized Armenian) at its input and outputs a sequence of Armenian characters. In order to train such a system we take a lot of text in Armenian, romanize it using probabilistic rules and give the resulting pairs of romanized and original text to the network.

Source of the data

We chose Armenian Wikipedia as the easiest available large corpus of Armenian text. The dumps are available here. These dumps are in a very complicated XML format, but they can be parsed by the WikiExtractor tool. The details are in the Readme file of the repository we released today.

The disadvantage of Wiki is that it doesn’t contain very diverse texts. For example, it doesn’t contain any dialogs or informal speech (while social networks are full of them). On the other hand it’s very easy to parse and it’s quite large (356MB). We split this into training (284MB), validation (36MB) and test (36MB) sets, but then we understood that the overlap between training and validation sets can be very high. Finally we decided to use some fiction text with lots of dialogs as a validation set.

Romanization rules

To generate the input sequences for the network we need to romanize the texts. We use probabilistic rules, as different people prefer different romanizations. Armenian alphabet has 39 characters, while Latin has only 26. Some of the Armenian letters are romanized in a unique way, like ա-a, բ-b, դ-d, ի-i, մ-m, ն-n. Some letters require a combination of two Latin letters: շ-sh, ժ-zh, խ-kh. The latter is also romanized to gh or even x (because this one looks like Russian х which is pronounced the same way as Armenian խ).

But the main obstacle is that the same Latin character can correspond to different Armenian letters. For example c can come from both ց and ծ, t can come from both տ and թ, and so on. This is what the network has to learn to infer from the context.

We have created a probabilistic mapping, so that each Armenian letter is romanized according to the given probabilities. For example, ծ is replaced by ts in 60% of cases, c in 30% of cases, and & in 10% of cases. The full set of rules is here and can be browsed here.


Some of the romanization rules for Armenian
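
A mapping like the one described above can be sampled as in the following sketch. Only the probabilities for ծ are the ones quoted in the text; the other two entries, the `RULES` layout and the `romanize` function are simplified assumptions for illustration and do not reproduce the actual rules file.

```python
import random

# Armenian letter -> [(romanization, probability), ...].
# Only the entry for "ծ" uses probabilities quoted in the text;
# the other entries are illustrative assumptions.
RULES = {
    "ծ": [("ts", 0.6), ("c", 0.3), ("&", 0.1)],
    "ա": [("a", 1.0)],
    "ռ": [("r", 1.0)],
}

def romanize(text, rules, rng):
    """Replace every letter that has a rule by a randomly sampled romanization."""
    out = []
    for ch in text:
        if ch in rules:
            options, weights = zip(*rules[ch])
            out.append(rng.choices(options, weights=weights, k=1)[0])
        else:
            out.append(ch)  # characters without a rule are copied unchanged
    return "".join(out)

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
samples = [romanize("ծառ", RULES, rng) for _ in range(5)]  # "ծառ" means "tree"
print(samples)
```

Each call may produce a different romanization of the same word, so the network sees several Latin spellings for the same Armenian text.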

Geographic dependency

The romanization rules vary a lot in different countries. For example, Armenian letter շ is mostly romanized as sh, but Armenians in Germany prefer sch, Armenians in France sometimes use ch, and Armenians in Russia use w (because w is visually similar to Russian ш which sounds like sh). There are many other similar differences that might require separate analysis.

Finally, Armenian language has two branches: Eastern and Western Armenian. These branches have crucial differences in romanization rules. Here we focus only on the rules for Eastern Armenian and those that are commonly used in Armenia.

Filtering out large non-Armenian chunks

Wikidumps contain some large regions where there are no Armenian characters. We noticed that these regions were confusing the network. So now, when generating chunks to give to the system, we drop those in which fewer than 33% of the characters are Armenian.

This is a difficult decision, as one might want the system to recognize English words in the text and leave them without transliteration. For example, the word YouTube should not be transliterated to Armenian. We hope that such small cases of English words/names will remain in the training set.
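
The filter can be sketched as follows. The Unicode ranges are those of the Armenian letters; counting every character of the chunk (including spaces and punctuation) in the denominator is an assumption of this sketch, and the function names are hypothetical.

```python
def armenian_fraction(chunk):
    """Fraction of the characters in a chunk that are Armenian letters."""
    if not chunk:
        return 0.0
    # U+0531..U+0556 are the capital letters; U+0561..U+0587 are the small
    # letters together with the ligature "և".
    count = sum(
        1 for ch in chunk
        if "\u0531" <= ch <= "\u0556" or "\u0561" <= ch <= "\u0587"
    )
    return count / len(chunk)

def keep_chunk(chunk, threshold=0.33):
    """Keep a chunk only if at least `threshold` of its characters are Armenian."""
    return armenian_fraction(chunk) >= threshold

print(keep_chunk("Բելգիայի տնտեսություն"), keep_chunk("List of references"))
```

A short English name inside an otherwise Armenian chunk lowers the fraction only slightly, so such chunks still pass the threshold.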

Network architecture

Our search for a good network architecture started from the Lasagne implementation of Karpathy’s popular char-rnn network. Char-rnn is a language model: it predicts the next character given the previous ones and is based on 2 layers of LSTMs going from left to right. The context from the right is also important in our case, so we replaced simple LSTMs with bidirectional LSTMs (introduced here back in 1995).

We have also added a shortcut connection from the input to the output of the 2nd biLSTM layer. This should help to learn the “easy” transliteration rules on this short way and leave LSTMs for the complex stuff.

Just like char-rnn, our network works on character level data and has no access to dictionaries.

Encoding the characters

First we define the set of possible characters (“vocabularies”) for the input and the output. The input “vocabulary” contains all the characters that appear in the right hand sides of the romanization rules, the digits and some punctuation (that can provide useful context). Then a special program runs over the entire corpus, generates the romanized version, and every symbol outside the input vocabulary is replaced by some placeholder symbol (#) in both original and romanized versions. The symbols that are left in the original version form the “output vocabulary”.

All symbols are encoded as one-hot vectors and are passed to the network. In our case the input vectors are 72 dimensional and the output vectors are 152 dimensional.
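
A minimal sketch of this encoding is shown below. The tiny vocabulary and the function names are hypothetical; the real input and output vocabularies have 72 and 152 symbols, respectively.

```python
def build_index(vocabulary):
    """Map every symbol of the vocabulary to its position in the one-hot vector."""
    return {ch: i for i, ch in enumerate(vocabulary)}

def one_hot(text, index, placeholder="#"):
    """Encode a string as a list of one-hot vectors.

    Symbols outside the vocabulary are encoded as the placeholder symbol.
    """
    vectors = []
    for ch in text:
        vec = [0] * len(index)
        vec[index.get(ch, index[placeholder])] = 1
        vectors.append(vec)
    return vectors

toy_index = build_index("abc #")     # toy vocabulary: a, b, c, space, placeholder
encoded = one_hot("a?", toy_index)   # "?" is outside the vocabulary
print(encoded)
```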

Aligning

After some experiments we noticed that LSTMs are really struggling when the characters are not aligned in inputs and outputs. As one Armenian character can be replaced by 2 or 3 Latin characters, the input and output sequences usually have different lengths, and the network has to “remember” by how many characters the romanized sequence is ahead of the Armenian sequence in order to print the next character in the correct place. This turned out to be extremely difficult, and we decided to explicitly align the Armenian sequence by adding some placeholder symbols after those characters that are romanized to multi-character Latin.


Character level alignment of Armenian text with the romanization

Also there is one exceptional case in Armenian: the Latin letter ‘u’ should be transliterated to 2 Armenian symbols: ու. This is another source of misalignment. We explicitly replace all ու pairs with some placeholder symbol to avoid the problem.
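
The two alignment steps can be sketched as follows. For simplicity the sketch uses deterministic toy rules instead of the probabilistic ones; the symbol chosen for the ու pair (`$`) and the function name are hypothetical.

```python
PAD = "_"                 # placeholder appended on the Armenian side
OU_PAIR = "\u0578\u0582"  # the two-symbol sequence "ու"
OU = "$"                  # hypothetical single symbol that stands for "ու"

# Deterministic toy rules for this sketch only.
TOY_RULES = {"շ": "sh", OU: "u", "ն": "n"}

def align(armenian, rules):
    """Return (romanized text, padded Armenian text) of equal length."""
    armenian = armenian.replace(OU_PAIR, OU)
    latin, target = [], []
    for ch in armenian:
        roman = rules.get(ch, ch)
        latin.append(roman)
        # One output symbol per Latin character: the letter, then padding.
        target.append(ch + PAD * (len(roman) - 1))
    return "".join(latin), "".join(target)

word = "շ" + OU_PAIR + "ն"  # "շուն" means "dog"
latin, target = align(word, TOY_RULES)
print(latin, target)
```

After alignment both sequences have the same length, so the network can emit exactly one output symbol for every input character.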

Bidirectional LSTM with residual-like connections

An LSTM network expects a sequence of vectors at its input. In our case it is a sequence of one-hot vectors, and the sequence length is a hyperparameter. We used --seq_len 30 for the final model. This means that the network reads 30 characters in Armenian, transforms them to Latin characters (the result is usually a bit longer than 30), then crops it at the last whitespace before the 30th symbol. The remaining cells are filled with another placeholder symbol. This ensures that the words are not split in the middle.


Network architecture. Green boxes encapsulate all the magic inside LSTM. Grey trapezoids denote dense connections. Dotted line is an identity connection without trainable parameters.

These 30 one-hot vectors are passed to the first layer of bidirectional LSTM. Basically it is a combination of two separate LSTMs: the first one passes over the sequence from left to right, and the other passes from right to left. We use 1024 neurons in all LSTMs. Both LSTMs output a 1024-dimensional vector at every position. These outputs are concatenated into a 2048-dimensional vector and are passed through another dense layer that outputs a 1024-dimensional vector. That’s what we call one layer of a bidirectional LSTM. The number of such layers is another hyperparameter (--depth). Our experiments showed that 2 layers learn better than 1 or 3 layers.

At every position the output of the last bidirectional LSTM is concatenated with the one-hot vector of the input forming a 1096 dimensional vector. Then it is densely connected to the final layer with 152 neurons on which softmax is applied. The total loss is the mean of the cross entropy losses of the current sequence.

The concatenation of the input vector to the output of the LSTM is similar to the residual connections introduced in deep residual networks. Some of the transliteration rules are very easy and deterministic, so they can be learned by a diagonal-like matrix between input and output vectors. For more complex rules the output of LSTMs will become important. One important difference from deep residual networks is that instead of adding the input vector to the output of LSTMs, we just concatenate them. Also, our residual connections are not meant to help fight the vanishing/exploding gradient problem; we have LSTM for that.
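
The vector sizes mentioned above can be checked with a few lines of arithmetic. The sketch only does bookkeeping of dimensions; the helper names are hypothetical, and the parameter count assumes that the final dense layer has a bias term.

```python
INPUT_DIM = 72     # one-hot size of a romanized character
HIDDEN = 1024      # neurons in each LSTM
OUTPUT_DIM = 152   # one-hot size of an output character

def bilstm_layer_dims(hidden):
    """(size of the forward+backward concatenation, size after the dense layer)."""
    return 2 * hidden, hidden

def final_layer_dims(hidden, input_dim, output_dim):
    """(size of the shortcut concatenation, weights plus biases of the softmax layer)."""
    concat = hidden + input_dim
    return concat, concat * output_dim + output_dim

concat_size, projected_size = bilstm_layer_dims(HIDDEN)
shortcut_size, softmax_params = final_layer_dims(HIDDEN, INPUT_DIM, OUTPUT_DIM)
print(concat_size, projected_size, shortcut_size, softmax_params)
```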

Results

We have trained this network using adagrad algorithm with gradient clipping (learning rate was set to 0.01 and was not modified). Training is not very stable and it’s very hard to wait until it overfits on our hardware (NVidia GTX 980). We use --batch_size 350 and it consumes more than 2GB of GPU memory.

python -u train.py --hdim 1024 --depth 2 --seq_len 30 --batch_size 350 &> log

The model we got for Armenian was trained for 42 hours. Here are the plots of the training and validation losses:


Loss functions. Green is the validation loss, blue is the training loss.

The loss quickly drops in the first quarter of the first epoch, then continues to slowly decrease. We stopped after 5.1 epochs. The Levenshtein distance between the original Armenian text and the output of the network on the validation set is 405 (the length is 36694). For comparison, the output of hayeren.am’s converter has an edit distance of more than 2500.
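
For reference, the Levenshtein (edit) distance can be computed with the standard dynamic programming recurrence, as in the sketch below. This is a generic textbook implementation written for illustration, not necessarily the code we used for the evaluation. An edit distance of 405 on 36694 characters corresponds to roughly 1.1% of the characters.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```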

Here are some results.

Romanized snippet from Wikipedia (test set)	Transliteration by translit-rnn
Belgiayi gyuxatntesutyuny Belgiayi tntesutyan jyuxeric mekn e։ Gyuxatntesutyany bnorosh e bardzr intyensivutyune, sakayn myec che nra der@ erkri tntesutyan mej։ Byelgian manr ev mijin agrarayin tntesutyunneri erkir e։ Gyuxatntyesutyan mej ogtagortsvox hoghataracutyan mot kese patkanum e 5-ic 20 ha unecox fermernerin, voronq masnagitacats yen qaxaknerin mterqner matakararelu gorcum, talis en apranqayin artadranqi himnakan zangvatse։	Բելգիայի գյուղատնտեսությունը Բելգիայի տնտեսության ճյուղերից մեկն է։ Գյուղատնտեսությանը բնորոշ է բարձր ինտենսիվությունը, սակայն մեծ չէ նրա դերը երկրի տնտեսության մեջ։ Բելգիան մանր և միջին ագրարային տնտեսությունների երկիր է։ Գյուղատնտեսության մեջ օգտագործվող հողատարածության մոտ կեսը պատկանում է 5-ից 20 հա ունեցող ֆերմերներին, որոնք մասնագիտացած են քաղաքներին մթերքներ մատակարարելու գործում, տալիս են ապրանքային արտադրանքի հիմնական զանգվածը։

Edit distance between this output and the original text is 0. Next we try some legal text in Armenian:

Romanized snippet from Armenian constitution	Transliteration by translit-rnn
Zhoghovurdn ir ishkhanutyunn irakanatsnum e azat yntrutyunneri, hanraqveneri, inchpyes naev Sahmanadrutyamb naghatesvac petakan ev teghakan inqnakaravarman marminnyeri u pashtonatar anzanc midjocov:	հողովուրդն իր իշխանությունն իրականացնում է ազատ ընտրությունների, հանրաքվեների, ինչպես նաև Սահմանադրությամբ նախատեսված պետական և տեղական ինքնակառավարման մարմինների ու պաշտոնատար անձանց միջոցով:

There is only one error here. The first word should start with Ժ and not հ. A possible reason for this is that the network doesn’t have a left-side context for that character.

An interesting feature of this system is that it also tries to learn when the Latin letters should not be converted to Armenian. Next example comes from a random Facebook group:

Random post from a Facebook group	Transliteration by translit-rnn
aysor aravotyan jamy 10;40–11;00 ynkac hatvacum 47 hamari yertuxayini miji txa,vor qez pahecir txamardavari u vori hamar MERSI.,xndrum em ete kardas PM gri. p.s.anlurj, animast u antexi commentner chgreq,karevor e u lurj.	այսօր առավոտյան ժամը 10;40–11;00 ընկած հատվածում 47 համարի երթուղայինի միջի տղա,որ քեզ պահեցիր տղամարդավարի ու որի համար ՄԵՐSI.,խնդրում եմ եթե կարդաս ՊՄ գրի. p.s.անլուրջ, անիմաստ ու անտեղի ցոմմենտներ չգրեք,կարևոր է ու լուրջ.

It is interesting that the sequence p.s. is not transliterated. Also it decided to leave half of the letters of MERSI in Latin which is probably because it’s written in all caps (Wikipedia doesn’t contain a lot of text in all caps, maybe except some abbreviations). Also, the word commentner is transliterated as ցոմմենտներ (instead of քոմենթներ), because it’s not really a romanized Armenian word, it just includes the English word comment (and it definitely doesn’t appear in Wiki).

Future work

First we plan to understand what the system actually learned by visualizing its behavior on different cases. It is interesting to see how the residual connection performed and also if the network managed to discover some rules known from Armenian orthography.

Next, we want to bring this tool to the web. We will have to make a much smaller and faster model, translate it to JavaScript, and probably wrap it in a Chrome extension.

Finally, we would like to see this tool applied to more languages. We have released all the code in the translit-rnn repository and prepared instructions on how to add a new language. Basically a large corpus and probabilistic romanization rules are required.

We would like to thank Adam Mathias Bittlingmayer for many valuable discussions.

Combining CNN and RNN for spoken language identification

2016-06-26T00:00:00+00:00

By Hrayr Harutyunyan and Hrant Khachatrian

Last year Hrayr used convolutional networks to identify spoken language from short audio recordings for a TopCoder contest and got 95% accuracy. After the end of the contest we decided to try recurrent neural networks and their combinations with CNNs on the same task. The best combination reached 99.24% accuracy and an ensemble of 33 models reached 99.67%. This work became Hrayr’s bachelor’s thesis.

Inputs and outputs
Network architecture
Ensembling
Final remarks

Inputs and outputs

As before, the inputs of the networks are spectrograms of speech recordings. It seems spectrograms are the standard way to represent audio for deep learning systems (see “Listen, Attend and Spell” and “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”).

Some networks use frequencies up to 11 kHz (858 x 256 image) and others use frequencies up to 5.5 kHz (858 x 128 image). In general the networks which use frequencies up to 5.5 kHz perform a little bit better (probably because the higher frequencies do not contain much useful information and just make overfitting easier).

The output layer of all networks is a fully connected softmax layer with 176 units.

We didn’t augment the data using vocal tract length augmentation.

Network architecture

We have tested several network architectures. The first set of architectures consists of plain AlexNet-like convolutional networks. The second set contains no convolutions and interprets the columns of the spectrogram as a sequence of inputs to a recurrent network. The third set applies an RNN on top of the features extracted by a convolutional network. All models are implemented in Theano and Lasagne.

Almost all networks easily reach 100% accuracy on the training set. In the following tables we describe all architectures we tried and report accuracy on the validation set.

Convolutional networks (CNN)

The network consists of 6 blocks of 2D convolution, ReLU nonlinearity, 2D max pooling and batch normalization. We use 7x7 filters for the first convolutional layer, 5x5 for the second and 3x3 for the rest. Pooling size is always 3x3 with a stride of 2.

Batch normalization significantly increases the training speed (this fact is reported in lots of recent papers). Finally we use only 1 fully connected layer between the last pooling layer and the softmax layer, and apply 50% dropout on that.

Network	Accuracy	Notes
tc_net	<80%	The difference between this network and the CNN described in the previous work is that this network has only one fully connected layer. We didn’t train this network much because of `ignore_border=False`, which slows down the training
tc_net_mod	97.14	This network is the same as `tc_net` but instead of `ignore_border=False`, we put `pad=2`
tc_net_mod_5khz_small	96.49	This network is a smaller copy of the `tc_net_mod` network and works with frequencies up to 5.5 kHz

The Lasagne setting ignore_border=False prevents Theano from using CuDNN. Setting it to True significantly increased the speed.

Here is the detailed description of the best network of this set: tc_net_mod.

Nr	Type	Channels	Width	Height	Kernel size / stride
0	Input	1	858	256
1	Conv	16	852	250	7x7 / 1
	ReLU	16	852	250
	MaxPool	16	427	126	3x3 / 2, pad=2
	BatchNorm	16	427	126
2	Conv	32	423	122	5x5 / 1
	ReLU	32	423	122
	MaxPool	32	213	62	3x3 / 2, pad=2
	BatchNorm	32	213	62
3	Conv	64	211	60	3x3 / 1
	ReLU	64	211	60
	MaxPool	64	107	31	3x3 / 2, pad=2
	BatchNorm	64	107	31
4	Conv	128	105	29	3x3 / 1
	ReLU	128	105	29
	MaxPool	128	54	16	3x3 / 2, pad=2
	BatchNorm	128	54	16
5	Conv	128	52	14	3x3 / 1
	ReLU	128	52	14
	MaxPool	128	27	8	3x3 / 2, pad=2
	BatchNorm	128	27	8
6	Conv	256	25	6	3x3 / 1
	ReLU	256	25	6
	MaxPool	256	14	3	3x3 / 2, pad=2
	BatchNorm	256	14	3
7	Fully connected	1024
	ReLU	1024
	BatchNorm	1024
	Dropout	1024
8	Fully connected	176
	Softmax Loss	176
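
The widths and heights in this table can be reproduced with a small calculation. The sketch assumes unpadded stride-1 convolutions and a pooling rule of the form floor((n + 2*pad - size + stride) / stride); both rules and the function names are our assumptions for illustration and not calls into Lasagne.

```python
def conv_out(n, kernel):
    """Output length of a stride-1 convolution without padding."""
    return n - kernel + 1

def pool_out(n, size=3, stride=2, pad=2):
    """Output length of padded pooling that only uses complete windows (assumed rule)."""
    return (n + 2 * pad - size + stride) // stride

def trace(width, height, kernels):
    """Sizes after each conv + pool block, given the kernel size of every block."""
    shapes = []
    for k in kernels:
        width, height = conv_out(width, k), conv_out(height, k)
        width, height = pool_out(width), pool_out(height)
        shapes.append((width, height))
    return shapes

# First five blocks of tc_net_mod on the 858 x 256 spectrogram.
shapes_11khz = trace(858, 256, kernels=[7, 5, 3, 3, 3])
# Four blocks on the 858 x 128 spectrogram (frequencies up to 5.5 kHz).
shapes_5khz = trace(858, 128, kernels=[7, 5, 3, 3])
print(shapes_11khz)
print(shapes_5khz[-1])
```

The first five blocks match the table. For the sixth block the same rule gives 14 x 4 rather than the listed 14 x 3, so the rule should be treated as an approximation of the library’s behavior.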

During the training we accidentally discovered a bug in Theano, which was quickly fixed by Theano developers.

Recurrent neural networks (RNN)

The spectrogram can be viewed as a sequence of column vectors that consist of 256 numbers (or 128, if only frequencies up to 5.5 kHz are used). We apply recurrent networks with 500 GRU cells in each layer on these sequences.

Network	Accuracy	Notes
rnn	93.27	One GRU layer on top of the input layer
rnn_2layers	95.66	Two GRU layers on top of the input layer
rnn_2layers_5khz	98.42	Two GRU layers on top of the input layer, maximum frequency: 5.5 kHz

The second layer of GRU cells improved the performance. Cropping out frequencies above 5.5 kHz helped fight overfitting. We didn’t use dropout for RNNs.

Both RNNs and CNNs were trained using adadelta for a few epochs, then by SGD with momentum (with a learning rate of 0.003 or 0.0003) until overfitting. If SGD with momentum is applied from the very beginning, the convergence is very slow. Adadelta converges faster but usually doesn’t reach high validation accuracy.

Combinations of CNN and RNN

The general architecture of these combinations is a convolutional feature extractor applied on the input, then some recurrent network on top of the CNN’s output, then an optional fully connected layer on RNN’s output and finally a softmax layer.

The output of the CNN is a set of several channels (also known as feature maps). We can have separate GRUs acting on each channel (with or without weight sharing) as described in this picture:

Another option is to interpret CNN’s output as a 3D-tensor and run a single GRU on 2D slices of that tensor:

The latter option has more parameters, but the information from different channels is mixed inside the GRU, and it seems to improve performance. This architecture is similar to the one described in this paper on speech recognition, except that they also use some residual connections (“shortcuts”) from input to RNN and from CNN to fully connected layers. It is interesting to note that recently it was shown that similar architectures work well for text classification.

Network	Accuracy	Notes
tc_net_rnn	92.4	CNN consists of 3 convolutional blocks and outputs 32 channels of size 104x13. Each of these channels is fed to a separate GRU as a sequence of 104 vectors of size 13. The outputs of GRUs are combined and fed to a fully connected layer
tc_net_rnn_nodense	91.94	Same as above, except there is no fully connected layer on top of GRUs. Outputs of GRU are fed directly to the softmax layer
tc_net_rnn_shared	96.96	Same as above, but the 32 GRUs share weights. This helped to fight overfitting
tc_net_rnn_shared_pad	98.11	4 convolutional blocks in CNN using `pad=2` instead of `ignore_border=False` (which enabled CuDNN and the training became much faster). The output of CNN is a set of 32 channels of size 54x8. 32 GRUs are applied (one for each channel) with shared weights and there is no fully connected layer
tc_net_deeprnn_shared_pad	95.67	4 convolutional blocks as above, but 2-layer GRUs with shared weights are applied on CNN’s outputs. Overfitting became stronger because of this second layer
tc_net_shared_pad_augm	98.68	Same as tc_net_rnn_shared_pad, but the network randomly crops the input and takes a 9s interval. The performance became a bit better due to this
tc_net_rnn_onernn	99.2	The outputs of a CNN with 4 convolutional blocks are grouped into a 32x54x8 3D-tensor and a single GRU runs on a sequence of 54 vectors of size 32*8
tc_net_rnn_onernn_notimepool	99.24	Same as above, but the stride along the time axis is set to 1 in every pooling layer. Because of this the CNN outputs 32 channels of size 852x8
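
The two ways of feeding the CNN output to GRUs differ only in how the feature maps are sliced into sequences. The sketch below shows the slicing on nested lists; the (channels, time, frequency) axis order and the function names are assumptions made for illustration.

```python
def per_channel_sequences(fmap):
    """Option 1: one sequence per channel, each with T vectors of size F."""
    return [[list(vec) for vec in channel] for channel in fmap]

def single_sequence(fmap):
    """Option 2: one sequence with T vectors of size C*F (channels are mixed)."""
    n_channels, n_steps = len(fmap), len(fmap[0])
    return [
        [x for c in range(n_channels) for x in fmap[c][t]]
        for t in range(n_steps)
    ]

# Feature maps shaped like the output of the 4-block CNN: 32 channels of size 54x8.
fmap = [[[0.0] * 8 for _ in range(54)] for _ in range(32)]

separate = per_channel_sequences(fmap)  # input for 32 GRUs
merged = single_sequence(fmap)          # input for a single GRU
print(len(separate), len(separate[0]), len(separate[0][0]))
print(len(merged), len(merged[0]))
```

In the second option every time step carries 32*8 = 256 numbers, which is why the single GRU has more parameters but can mix information from all channels.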

The second layer of GRU in this setup didn’t help due to the overfitting.

It seems that subsampling in the time dimension is not a good idea. The information that is lost during subsampling can be better used by the RNN. In the paper on text classification by Yijun Xiao and Kyunghyun Cho, the authors even suggest that maybe all pooling/subsampling layers can be replaced by recurrent layers. We didn’t experiment with this idea, but it looks very promising.

These networks were trained using SGD with momentum only. The learning rate was set to 0.003 for around 10 epochs, then it was manually decreased to 0.001 and then to 0.0003. On average, it took 35 epochs to train these networks.

Ensembling

The best single model had 99.24% accuracy on the validation set. We had 33 predictions by all these models (there was more than one prediction for some models, taken after different epochs) and we just summed up the predicted probabilities and got 99.67% accuracy. Surprisingly, our other attempts at ensembling (e.g. majority voting, ensembling only a subset of the models) didn’t give better results.
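
The difference between summing probabilities and majority voting can be seen on a toy example. The probabilities below are made up and the function names are hypothetical; the sketch only illustrates the two combination rules.

```python
from collections import Counter

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def sum_ensemble(predictions):
    """Sum the class probabilities of all models, then pick the best class."""
    labels = []
    for s in range(len(predictions[0])):
        totals = [sum(col) for col in zip(*(model[s] for model in predictions))]
        labels.append(argmax(totals))
    return labels

def majority_vote(predictions):
    """Every model votes for its own best class; the most frequent class wins."""
    labels = []
    for s in range(len(predictions[0])):
        votes = Counter(argmax(model[s]) for model in predictions)
        labels.append(votes.most_common(1)[0][0])
    return labels

# Three hypothetical models, one sample, three classes.
predictions = [
    [[0.90, 0.10, 0.00]],
    [[0.40, 0.60, 0.00]],
    [[0.45, 0.55, 0.00]],
]
print(sum_ensemble(predictions), majority_vote(predictions))
```

Here one very confident model outweighs two uncertain ones when the probabilities are summed, while majority voting ignores how confident each model is.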

Final remarks

The number of hyperparameters in these CNN+RNN mixtures is huge. Because of the limited hardware we covered only a very small fraction of possible configurations.

The organizers of the original contest did not publicly release the dataset. Nevertheless we release the full source code on GitHub. We couldn’t find many Theano/Lasagne implementations of CNN+RNN networks on GitHub, and we hope these scripts will partially fill that gap.

This work was part of Hrayr’s bachelor’s thesis, which is available on academia.edu (the text is in Armenian).

Playground for bAbI tasks

2016-02-23T00:00:00+00:00

Recently we have implemented Dynamic memory networks in Theano and trained them on Facebook’s bAbI tasks, which are designed for testing basic reasoning abilities. Our implementation now solves 8 out of 20 bAbI tasks, which is still behind the state of the art. Today we release a web application for testing and comparing several network architectures and pretrained models.

Attention module
Architecture extensions
Results
Visualizing Dynamic memory networks
Looking for feedback

Attention module

One of the key parts in the DMN architecture, as described in the original paper, is its attention system. DMN obtains internal representations of input sentences and question and passes these to the episodic memory module. Episodic memory passes over all the facts, generates episodes, which are finally combined into a memory. Each episode is created by looking at all input sentences according to some attention. Attention system gives a score for each of the sentences, and if the score is low for some sentence, it will be ignored when constructing the episode.

The attention system is a simple 2-layer neural network whose input is a vector of features computed from the input sentence, the question and the current state of the memory. This vector of features is described in the paper as follows:

z(c, m, q) = [c, m, q, c ∘ q, c ∘ m, |c − q|, |c − m|, cᵀWq, cᵀWm]

where c is an input sentence, q is the question, m is the current state of the memory, and ∘ is the element-wise product. We tried to stay as close to the original as possible in our first implementation, but probably we understood these expressions too literally. We implemented |c-q| as the element-wise absolute value of the difference of two vectors, which caused lots of trouble, as Theano’s implementation of (the gradient of) the abs function gave NaNs at random during training. Also, the terms cWq and cWm each produce just a single number, so they have almost no effect within such a large vector.

Later we implemented another version called dmn_smooth which uses Euclidean distance between two vectors (instead of abs). This version is much more stable and gives better results. It is interesting to note that this version trains faster on CPU than on our GPU (GTX 980). It could be because of our not so optimal code or some issue in Theano’s scan function.
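The two variants of the feature vector can be sketched in NumPy as follows. This is only an illustration, not the Theano code from the repository. In particular, reading "Euclidean distance" as the element-wise squared difference (which is smooth at zero, unlike abs) is an assumption, and the function name `attention_features` is hypothetical.

```python
import numpy as np

def attention_features(c, m, q, W, smooth=False):
    """Build the attention input from a fact c, memory m and question q.

    smooth=False uses |c - q| and |c - m| (non-differentiable at zero);
    smooth=True replaces them with squared differences (assumed variant).
    """
    if smooth:
        diff_q = (c - q) ** 2
        diff_m = (c - m) ** 2
    else:
        diff_q = np.abs(c - q)
        diff_m = np.abs(c - m)
    bilinear = np.array([c @ W @ q, c @ W @ m])  # just two scalars
    return np.concatenate([c, m, q, c * q, c * m, diff_q, diff_m, bilinear])

n = 4
c, m, q = np.ones(n), np.zeros(n), np.full(n, 3.0)
W = np.eye(n)
z_abs = attention_features(c, m, q, W)
z_smooth = attention_features(c, m, q, W, smooth=True)
print(z_abs.shape, z_smooth.shape)
```

For n-dimensional inputs the vector has 7n + 2 entries, which shows how little weight the two bilinear terms carry next to the 7n element-wise features.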

Architecture extensions

The only significant difference between our implementation and the original DMN, as we understand it, is the fixed number of episodes. In the paper the authors describe a stop condition, so that the network decides if it needs to compute more episodes. We did not implement it yet.

Our implementations heavily overfit on many tasks. We tried several techniques to fight that, but with little luck. First, we have implemented a version of dmn_smooth which supports mini-batch training. Then we applied dropout and batch normalization on top of the memory module (before passing to the answer module). All of these tricks help for some tasks for some hyperparameters, but still we could not beat the results obtained using simple dmn_smooth trained without mini-batches.

We plan to bring some ideas from the Neural Reasoner paper, especially the idea of recovering the input sentences based on the outputs of the input module.

Results

We train our implementations on bAbI tasks in a weakly supervised setting, as described in our previous post. Here we compare our results to End-to-end memory networks (MemN2N).

So far our best results are obtained by training dmn_smooth with 100 neurons for internal representations, 5 memory hops, using simple gradient descent for 11 epochs. We train jointly on all 20 bAbI tasks.

Task	MemN2N best version	Joint100 75.05%
1. Single supporting fact	99.9%	100%
2. Two supporting facts	81.2%	39.7%
3. Three supporting facts	68.3%	41.5%
4. Two argument relations	82.5%	75.5%
5. Three arguments relations	87.1%	50.1%
6. Yes/no questions	98%	97.7%
7. Counting	89.9%	91.4%
8. Lists/sets	93.9%	95.2%
9. Simple negation	98.5%	99%
10. Indefinite knowledge	97.4%	87.3%
11. Basic coreference	96.7%	100%
12. Conjunction	100%	87%
13. Compound coreference	99.5%	96.4%
14. Time reasoning	98%	73.1%
15. Basic deduction	98.2%	53.9%
16. Basic induction	49%	49.5%
17. Positional reasoning	57.4%	59.3%
18. Size reasoning	90.8%	98.3%
19. Path finding	9.4%	9%
20. Agent’s motivations	99.8%	97.1%
Average accuracy	84.775%	75.05%
Solved tasks	10	8

We solve (obtain >95% accuracy) 8 tasks. Our system outperforms MemN2N on some tasks, but on average stays behind by about 10 percentage points. Experiments show that our networks do not manage to find connections between several sentences at once (tasks 2, 3 etc.). Task 19 (path finding) remains the most difficult one. It is actually the only task on which none of our implementations overfit. The authors of Neural Reasoner claim some success on that task when training on 10 000 examples. We use only 1000 samples per task for all experiments.
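The summary rows of the table can be recomputed from the per-task accuracies. The short script below only restates the numbers from the table; the variable and function names are ours.

```python
THRESHOLD = 95.0  # a task counts as solved above this accuracy (in %)

# Per-task accuracies (tasks 1..20), copied from the table above.
memn2n = [99.9, 81.2, 68.3, 82.5, 87.1, 98, 89.9, 93.9, 98.5, 97.4,
          96.7, 100, 99.5, 98, 98.2, 49, 57.4, 90.8, 9.4, 99.8]
joint100 = [100, 39.7, 41.5, 75.5, 50.1, 97.7, 91.4, 95.2, 99, 87.3,
            100, 87, 96.4, 73.1, 53.9, 49.5, 59.3, 98.3, 9, 97.1]

def summarize(accuracies):
    """Return (average accuracy, number of solved tasks)."""
    average = sum(accuracies) / len(accuracies)
    solved = sum(1 for a in accuracies if a > THRESHOLD)
    return average, solved

memn2n_avg, memn2n_solved = summarize(memn2n)
joint_avg, joint_solved = summarize(joint100)

# Tasks (1-based) on which the joint model beats MemN2N.
better = [i + 1 for i, (a, b) in enumerate(zip(joint100, memn2n)) if a > b]

print(round(memn2n_avg, 3), memn2n_solved)
print(round(joint_avg, 3), joint_solved)
print(better)
```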

Visualizing Dynamic memory networks

We have created a web application / playground for Dynamic memory networks focused on bAbI tasks. It lets you choose a pretrained model and send custom input sentences and questions. The app shows the predicted answer and visualizes attention scores for each memory step.


Web-based playground for bAbI tasks

These visualizations show that the network does not significantly change its attention for different episodes, so it is very hard to correctly answer the questions from tasks 2 or 3.

The web app is accessible at http://yerevann.com/dmn-ui/. Note that the vocabulary of bAbI tasks is quite limited, and our implementation of DMN cannot process out-of-vocabulary words. The Sample button is a good starting point: it gives a random sample from the bAbI test set.

Looking for feedback

Everything described in this post is available on GitHub. DMN implementations are here, the Flask-based RESTful server of the web app is in the /server/ folder, and the UI is in another repository. Feel free to fork, report issues, and please share your thoughts.

Implementing Dynamic memory networks

2016-02-05T00:00:00+00:00

The Allen Institute for Artificial Intelligence has organized a 4-month contest on Kaggle on question answering. The aim is to create a system which can correctly answer the questions from the 8th grade science exams of US schools (biology, chemistry, physics etc.). DeepHack Lab organized a scientific school + hackathon devoted to this contest in Moscow. Our team decided to use this opportunity to explore deep learning techniques for question answering (although they seem to be far behind traditional systems). We tried to implement Dynamic memory networks described in a paper by A. Kumar et al. Here we report some preliminary results. In the next blog post we will describe the techniques we used to get to the top 5% in the contest.

bAbI tasks
Memory networks
Dynamic memory networks
Initial experiments
Next steps

bAbI tasks

The questions of this contest are quite hard: they require not only lots of knowledge in natural sciences, but also the ability to make inferences, generalize concepts, apply general ideas to examples and so on. The methods based on deep learning do not seem to be mature enough to handle all of these difficulties. On the other hand, these questions have 4 answer candidates. That’s why, as was noted by Dr. Vorontsov, a simple search engine indexed on lots of documents will perform better as a question answering system than any “intelligent” system.

But there is already some work on creating question answering / reasoning systems using neural approaches. As another lecturer of the DeepHack event, Tomas Mikolov, told us, we should start from easy, even synthetic questions and try to gradually increase the difficulty. This roadmap towards building intelligent question answering systems is described in a paper by Facebook researchers Weston, Bordes, Chopra, Rush, Merriënboer and Mikolov, where the authors introduce a benchmark of toy questions called bAbI tasks which test several basic reasoning capabilities of a QA system.

Questions in the bAbI dataset are grouped into 20 types, each of which has 1000 samples for training and another 1000 samples for testing. A system is said to have passed a given task if it correctly answers at least 95% of the questions in the test set. There is also a version with 10K samples, but as Mikolov said during the lecture, deep learning is not necessarily about large datasets, and in this setting it is more interesting to see if the systems can learn to answer questions by looking at a few training samples.



Some of the bAbI tasks. More examples can be found in the paper.

Memory networks

bAbI tasks were first evaluated on an LSTM-based system, which achieves 50% performance on average and does not pass any task. Then the authors of the paper try Memory Networks by Weston et al. It is a recurrent network with a long-term memory component, to which it can learn to write some data (the input sentences) and read them later.

bAbI tasks include not only the answers to the questions but also the numbers of the sentences which help answer each question. This information is taken into account when training MemNN: the networks get not only the correct answers but also information about which input sentences affect the answer. Under this so-called strongly supervised setting “plain” Memory networks pass 7 of the 20 tasks. Then the authors apply some modifications to them and pass 16 tasks.


The structure of MemN2N from the paper.

We are mostly interested in the weakly supervised setting, because the additional information on important sentences is not available in many real scenarios. This was investigated in a paper by Sukhbaatar, Szlam, Weston and Fergus (from New York University and Facebook AI Research) where they introduce End-to-end memory networks (MemN2N). They investigate many different configurations of these systems and the best version passes 9 tasks out of 20. Facebook’s MemN2N repository on GitHub lists some implementations of MemN2N.

Dynamic memory networks

Another advancement in the direction of memory networks was made by Kumar, Irsoy, Ondruska, Iyyer, Bradbury, Gulrajani and Socher from Metamind. By the way, Richard Socher is the author of an excellent course on deep learning and NLP at Stanford, which helped us a lot to get into the topic. Their paper introduces a new system called Dynamic memory networks (DMN) which passes 18 bAbI tasks in the strongly supervised setting. The paper does not talk about weakly supervised setting, so we decided to implement DMN from scratch in Theano.


High-level structure of DMN from the paper.

Semantic memory

The input of the DMN is a sequence of word vectors of input sentences. We followed the paper and used pretrained GloVe vectors and added the dimensionality of word vectors to the list of hyperparameters (controlled by the command line argument --word_vector_size). The DMN architecture treats these vectors as part of a so-called semantic memory (in contrast to the episodic memory) which may contain other knowledge as well. Our implementation uses only word vectors and does not fine-tune them during training, so we don’t consider it a part of the neural network.

Input module

The first module of DMN is an input module that is a gated recurrent unit (GRU) running on the sequence of word vectors. GRU is a recurrent unit with 2 gates that control when its content is updated and when its content is erased. The hidden state of the input module is meant to represent the input processed so far in a vector. Input module outputs its hidden states either after every word (--input_mask word) or after every sentence (--input_mask sentence). These outputs are called facts.


Formal definition of GRU. `z` is the update gate and `r` is the reset gate. More details and images can be found here.
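A single GRU step can be sketched in NumPy as below. This is a hedged illustration of the usual GRU equations, not our Theano implementation: the parameter names (`W_z`, `U_z`, `b_z` and so on), the random initialisation and the exact placement of the reset gate are assumptions, and GRU conventions differ slightly between papers and libraries.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(input_dim, hidden_dim, seed=0):
    """Small random weights; names and initialisation are illustrative only."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        params["U_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        params["b_" + gate] = np.zeros(hidden_dim)
    return params

def gru_step(x, h_prev, p):
    """One GRU update: z is the update gate, r is the reset gate."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])
    h_tilde = np.tanh(p["W_h"] @ x + r * (p["U_h"] @ h_prev) + p["b_h"])
    return z * h_tilde + (1.0 - z) * h_prev

def run_gru(inputs, p, hidden_dim):
    """Run the GRU over a sequence and return the hidden state after each input."""
    h = np.zeros(hidden_dim)
    states = []
    for x in inputs:
        h = gru_step(x, h, p)
        states.append(h)
    return states

# Toy input: 5 "words" with 50-dimensional vectors, 40-dimensional hidden state.
word_vectors = np.random.default_rng(1).normal(size=(5, 50))
params = init_gru_params(input_dim=50, hidden_dim=40)
states = run_gru(word_vectors, params, hidden_dim=40)
print(len(states), states[-1].shape)
```

In terms of the input module, keeping every element of `states` corresponds to emitting a fact after every word, while keeping only the states at sentence boundaries corresponds to emitting a fact after every sentence.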

Then there is a question module that processes the question word by word and outputs one vector at the end. This is done by using the same GRU as in the input module using the same weights.

Episodic memory

The fact and question vectors extracted from the input enter the episodic memory module. Episodic memory is basically a composition of two nested GRUs. The outer GRU generates the final memory vector working over a sequence of so called episodes. This GRU state is initialized by the question vector. The inner GRU generates the episodes.


Details of DMN architecture from the paper.

The inner GRU generates the episodes by passing over the facts from the input module. But when updating its inner state, the GRU takes into account the output of an attention function on the current fact. The attention function gives a score (between 0 and 1) to each of the facts, and the GRU (softly) ignores the facts having low scores. The attention function is a simple 2-layer neural network depending on the question vector, the current fact, and the current state of the memory. After each full pass over all facts the inner GRU outputs an episode, which is fed into the outer GRU, which in turn updates the memory. Because the memory has been updated, the attention may now give different scores to the facts, so new episodes can be created. The number of steps of the outer GRU, that is the number of episodes, can be determined dynamically, but we fix it to simplify the implementation. It is configured by the --memory_hops setting.

All facts, episodes and memories are in the same n-dimensional space, which is controlled by the command line argument --dim. Inner and outer GRUs share their weights.
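The interplay of the two nested recurrences can be sketched as follows. This is a simplified illustration and not our Theano code: `toy_step` stands in for a real GRU step, `toy_attention` stands in for the 2-layer attention network, and starting the inner state from zeros is an assumption.

```python
import numpy as np

def make_episode(facts, gates, step, h0):
    """Inner pass: a gate close to 0 makes the recurrent unit skip that fact."""
    h = h0
    for fact, g in zip(facts, gates):
        h = g * step(fact, h) + (1.0 - g) * h
    return h

def episodic_memory(facts, question, attention, step, hops):
    """Outer pass: each hop builds one episode and updates the memory."""
    memory = question  # the memory is initialized by the question vector
    for _ in range(hops):
        gates = [attention(fact, memory, question) for fact in facts]
        episode = make_episode(facts, gates, step, np.zeros_like(question))
        memory = step(episode, memory)  # inner and outer units share weights
    return memory

def toy_step(x, h):
    """Stand-in for a GRU step (hypothetical, for illustration only)."""
    return np.tanh(x + h)

def toy_attention(fact, memory, question):
    """Stand-in for the attention network; returns a score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(fact @ question + fact @ memory)))

facts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
question = np.array([0.0, 1.0])
memory = episodic_memory(facts, question, toy_attention, toy_step, hops=3)
print(memory)
```

Because the gates are recomputed from the updated memory on every hop, later episodes can focus on different facts than earlier ones.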

Answer module

The final state of the memory is fed into the answer module, which produces the answer. We have implemented two kinds of answer modules. The first is a simple linear layer on top of the memory vector with softmax activation (--answer_module feedforward). This is useful if each answer is just one word (like in the bAbI dataset). The second kind of answer module is another GRU that can produce multiple words (--answer_module recurrent). Its implementation is half-baked for now, as we didn’t need it for bAbI.

The whole system is end-to-end differentiable and is trained using stochastic gradient descent. We use adadelta by default. More formulas and details of architecture can be found in the original paper. But the paper does not contain many implementation details, so we may have diverged from the original implementation.

Initial experiments

We have tested this system on bAbI tasks with a few randomly selected hyperparameters. We initialized the word vectors by using 50-dimensional GloVe vectors trained on Wikipedia. Answer module is a simple feedforward classifier over the vocabulary (which is very limited in bAbI tasks). Here are the results.


First two columns are for strongly supervised systems MemNN and DMN. Third column is the best results of MemN2N. The last 3 columns are our results with different dimensions of the memory.

First basic observation is that weakly supervised systems are generally worse than the strongly supervised ones. When compared to MemN2N, our system performs much worse on the tasks 2, 3 and 16. As a result we pass only 7 tasks out of 20. On the other hand, our results on tasks 5, 6, 8, 9, 10 and 18 are better than MemN2N. Surprisingly what we got on the 17th task is better than in strongly supervised systems!

Our system converges very fast on some of the tasks (like the first one), overfits on many other tasks and does not converge on tasks 2, 3 and 19.

The 19th task (path finding) is not solved by any of these systems. Wojciech Zaremba from OpenAI informed us during his lecture about one system which managed to solve it using 10K training samples. This remains a very interesting challenge for us. We need to carefully experiment with various parameters to reach some meaningful conclusions.

We have also tried joint training on the full shuffled list of 20000 bAbI samples (all 20 tasks together). We couldn’t reach 60% average accuracy after 50 hours of training on an Amazon instance, while MemN2N authors report 87.6% accuracy.

This implementation of DMN is available on Github. We really need lots of feedback on this code.

Next steps

We need a good way to visualize the attention in the episodic memory. This will help us understand what is exactly going on inside the system. Many papers now include such visualizations on some examples.
Our model overfits on many of the tasks even with 25-dimensional memory. We briefly experimented with L2 regularization but it didn’t help much (--l2).
Currently we are working on a slightly modified architecture which will be optimized for multiple choice questions. Basically it will include one more input module which will read the answer choices and will provide another input for the attention mechanism.
Then we will be able to evaluate our code on more complex QA datasets like MCTest.
Training with batches is not properly implemented yet. There are several technical challenges related to the variable length of input sequences. It becomes much harder to keep under control because of bugs of this kind in Theano.

We would like to thank the organizers of DeepHack.Q&A for the really amazing atmosphere here in PhysTech.

Generating Constitution with recurrent neural networks

2015-11-12T00:00:00+00:00

By Narek Hovsepyan and Hrant Khachatrian

A few months ago Andrej Karpathy wrote a great blog post about recurrent neural networks. He explained how these networks work and implemented a character-level RNN language model which learns to generate Paul Graham essays, Shakespeare works, Wikipedia articles, LaTeX articles and even C++ code. He also released the code of the network on Github. Lots of people did experiments, like generating recipes, the Bible or Irish folk music. We decided to test it on some legal texts in Armenian.

Character-level RNN language model
Data
Network parameters
Analysis
Generated samples
NaNoGenMo

Character-level RNN language model

Andrej did a great job explaining how the recurrent networks learn and even visualized how they work on text input in his blog. The program, called char-rnn, treats the input as a sequence of characters and has no prior knowledge about them. For example, it doesn’t know that the text is in English, that there are words and there are sentences, that the space character has a special meaning and so on. After some training it manages to figure out that some character combinations appear more often than the others, learns to predict English words, uses proper punctuation, and even understands that open parentheses must be closed. When trained on Wikipedia articles it can generate text in MediaWiki format without syntax errors, although the text has little or no meaning.

Data

We decided to test Karpathy’s RNN on Armenian text. Armenian language has a unique alphabet, and the characters are encoded in the Unicode space by the codes U+0530 - U+058F. In UTF-8 these symbols use two bytes where the first byte is always 0xD4, 0xD5 or 0xD6. So the neural net has to look at almost 2 times larger distances (when compared to English) in order to be able to learn the words. Also, the Armenian alphabet contains 39 letters, 50% more than Latin.
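The claim about the UTF-8 encoding is easy to check with a few lines of Python. The snippet below is only a sanity check of the numbers above; the variable names are ours.

```python
# Armenian Unicode block: U+0530 .. U+058F (the upper bound of range is exclusive).
ARMENIAN_BLOCK = range(0x0530, 0x0590)

# First byte and total length of the UTF-8 encoding of every code point in the block.
lead_bytes = sorted({chr(cp).encode("utf-8")[0] for cp in ARMENIAN_BLOCK})
lengths = {len(chr(cp).encode("utf-8")) for cp in ARMENIAN_BLOCK}

word = "Հոդված"  # "Article"
print([hex(b) for b in lead_bytes])
print(lengths)
print(len(word), len(word.encode("utf-8")))
```

Every code point in the block takes two bytes, so a model that reads bytes sees each Armenian word as a sequence twice as long as its number of letters.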

Recently the main political topic in Armenia has been the Constitutional reform. This helped us choose the corpus for training. We took all three versions of the Constitution of Armenia (the first version voted in 1995, the updated version of 2005, and the new proposal which will be voted on later this year) and concatenated them into a single text file. The size of the corpus is just 440 KB, which is roughly 224 000 Unicode symbols (all non-Armenian symbols, including spaces and numbers, use 1 byte). Andrej suggests using at least 1MB of data, so our corpus is very small. On the other hand, the text is quite specific, the vocabulary is very small and the structure of the text is fairly simple.

All articles are of the following form:

Հոդված 1. Հայաստանի Հանրապետությունը ինքնիշխան, ժողովրդավարական, սոցիալական, իրավական պետություն է:

The first word, Հոդված, means “Article”. Sentences end with the symbol :.

Network parameters

char-rnn works with basic recurrent neural networks, LSTM networks and GRU-RNNs. In our experiments we only used an LSTM network with 2 layers. Actually we don’t really understand how LSTM networks work in detail, but we hope to improve our understanding by watching the videos of Richard Socher’s excellent NLP course.

We trained the network for 50 epochs with the default learning rate parameters (the base rate is 2e-3, and after the first 10 epochs it is multiplied by 0.97 every epoch). We wanted to understand how the size of the LSTM internal state (rnn_size), dropout and batch size affect the performance. We used grid search over the following values:

rnn_size: 128, 256, 512
batch_size: 25, 50, 100
dropout: 0, 0.2, 0.4 and at the end we tried 0.6

After installing Lua, Torch and CUDA (as described on the char-rnn page) we moved our mini-corpus to /data/input.txt and ran the run.sh file, which contains commands like this:

th train.lua -data_dir data/ -batch_size 50 -dropout 0.4 -rnn_size 512 -gpuid 0 -savefile bs50s512d0.4 | tee log_bs50s512d0.4

File names encode the hyperparameters, and the output of char-rnn is logged using the tee command.
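The grid of commands can be generated with a short script like the one below. It is a sketch rather than our actual run.sh: the flags and the file naming pattern are copied from the command above, while the function name `train_command` and the choice to enumerate all four dropout values for every configuration are assumptions.

```python
from itertools import product

RNN_SIZES = (128, 256, 512)
BATCH_SIZES = (25, 50, 100)
DROPOUTS = (0, 0.2, 0.4, 0.6)

def train_command(batch_size, rnn_size, dropout):
    """Build one char-rnn training command; the name encodes the hyperparameters."""
    name = f"bs{batch_size}s{rnn_size}d{dropout:g}"
    return (
        f"th train.lua -data_dir data/ -batch_size {batch_size} "
        f"-dropout {dropout:g} -rnn_size {rnn_size} -gpuid 0 "
        f"-savefile {name} | tee log_{name}"
    )

commands = [
    train_command(batch_size, rnn_size, dropout)
    for batch_size, rnn_size, dropout in product(BATCH_SIZES, RNN_SIZES, DROPOUTS)
]

for command in commands:
    print(command)
```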

Analysis

We have adapted this script written by Hrayr to plot the behavior of loss functions during the 50 epochs. The script, which runs on char-rnn output is available on Github. These graphs show, for example, that we practically do not gain anything after 25 epochs.


Training (blue to aqua) and validation (red to green) loss over 50 epochs. RNN size was set to 256 and the batch size was 50. In particular, this graph shows that when no dropout is used, validation loss actually increases after 20 epochs. Plotted using this script.

Experiments showed that, unsurprisingly, training loss is better (after 50 epochs) when RNN size is increased and when the dropout ratio is decreased. Under all configurations we got the lowest training losses using batch size 50 (compared to 25 and 100), and we don’t have an explanation for this.

For validation loss, we have the following tables.

Batch size	RNN size	Dropout 0	Dropout 0.2	Dropout 0.4	Dropout 0.6
25	128	0.5060	0.4307	0.4813	0.5373
25	256	0.5322	0.4185	0.4021	0.4261
25	512	0.5596	0.4495	0.4380	0.4126
50	128	0.4883	0.4452	0.4813	0.5373
50	256	0.5249	0.3887	0.3996	0.4280
50	512	0.5340	0.4420	0.3997	0.3800
100	128	0.5341	0.5144	0.5454	0.6094
100	256	0.5660	0.4464	0.4500	0.4723
100	512	0.6032	0.4804	0.4599	0.4399

When RNN size is only 128, we notice that the best performance is achieved when dropout is 20%. Larger dropout values do not allow the network to learn enough. When RNN size is increased to 256, the optimal dropout value is somewhere between 20% and 40%. For RNN size 512, the best performance we observed used 60% dropout. We didn’t try to go any further.

As for the batch sizes, we see the best performance with batch size 25 if RNN size is only 128. For larger networks, batch size 50 performs better. Overall we obtained the lowest validation loss, 0.38, using 60% dropout, batch size 50 and RNN size 512.

Generated samples

When the trained models are ready, we can generate text samples by using the sample.lua script included in the repository. It accepts one important parameter called temperature, which determines how much the network can “fantasize”. Higher temperature gives more diversity but at the cost of making more mistakes, as Andrej explains in his blog post. The command looks like this:

th sample.lua cv/lm_bs50s128d0_epoch50.00_0.4883.t7 -length 3000 -temperature 0.5 -gpuid 0 -primetext "Հոդված"
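The effect of the temperature parameter can be illustrated with a small sketch. This is the common way of applying a sampling temperature (dividing the scores by the temperature before the softmax); it is written here in plain Python for illustration and is not taken from sample.lua. The function name and the toy scores are assumptions.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn unnormalized scores into probabilities, sharpened or flattened by temperature."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.0]  # toy scores for three candidate characters
for temperature in (0.5, 0.75, 1.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the most likely character, which gives safer but more repetitive text. Higher temperatures spread the probability out, which gives more diverse text with more mistakes.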

The primetext parameter lets us predefine the first characters of the generated sequence. It also makes the output fully reproducible. Here is a snippet from the bs50s128d0 model, which is available on Github (validation loss is 0.4883, sampled with 0.5 temperature).

Հոդված 111. Սահմանադրական դատարանի կազմավորումը, եթե այլ չեն հասատատիրի առնչամի կարելի սահմանափակվել միայն օրենքով, եթե դա անհրաժեշտ է հանցագործությունների իրավունք: Յուրաքանչյուր ոք ունի Հայաստանի Հանրապետության քաղաքացիությունը որոշում է կայացնում դատավորին կազմավորման կարգը

There are 2 nonexistent words here (marked by italic), others are fine. The sentences have no meaning, some parts are quite unnatural, making them difficult to read.

The network easily (even with 128 RNN size) learns to separate the articles by new line and starts them by the word Հոդված followed by some number. But even the best one doesn’t manage to use increasing numbers for consecutive articles. Actually, very often the article number starts with 1, because more than one third of the articles in the corpus have numbers starting with 1. It also understands some basic punctuation. It correctly puts commas before the word եթե, which is the Armenian word for “if”.

With 256 RNN size and 40% dropout the result is much more readable.

Հոդված 14. Պատգամավոր կարող է դնել իր վստահության հարցը: Կառավարության անդամների լիազորությունները համապատասխանական կազմակերպությունների կամ միավորման և գործունեության կարգը սահմանվում է օրենքով:
Հոդված 107. Պատգամավորի լիազորությունները դադարեցնում է Սահմանադրությամբ և օրենքներով: Այդ իրավունքը կարող է սահմանափակվել միայն օրենքով:
Հոդված 126. Հանրապետության նախագահի հրամանագրերը և կարգադրությունները կամ այլ պետությունը միասնական կառավարման մարմինների կողմից հանցագործության կատարման պահին գործող դատարանների նախագահների թեկնածությունների և առաջարկությամբ սահմանադրական դատարանի նախագահ:
Հայաստանի Հանրապետության իրավունքը

Յուրաքանչյուր ոք ունի իր իրավունքների և ազատությունների պաշտպանության նպատակով:

Ազգային ժողովի նախագահի վերահսկողության կամ Սահմանադրության 190-րդ հոդվածի 1-ին կետով նախատեսված դեպքերում և կարգով ընդունված որոշումները սահմանվում են օրենքով:

Յուրաքանչյուր ոք ունի իր իրավունքների և ազատությունների սահմանափակումների հետ չապահողական կամ այլ դեպքերում վարչապետի նախագահների նախնական հանձնաժողովներն ստեղծվում են Սահմանադրությամբ և օրենքներով:

Յուրաքանչյուր ոք ունի իր ազգային որոշումները սահմանվում են օրենքով:

Յուրաքանչյուր ոք ունի իր իրավունքների և ազատությունների պաշտպանության նպատակով:

Only 2 of the 140 words are nonexistent, but both are syntactically correct. For example there is no such word չապահողական in Armenian, but չ and ապա are prefixes, հող means “soil” and ական is a suffix. Sentences still do not have valid structure.

The network learned that sometimes ordered lists appear in the articles, but couldn’t learn to properly enumerate the points. Sometimes it counts up to 2 only :) It would be interesting to see on what kind of corpora it will be able to count a bit more.

Here is one more snippet using the best performing model bs50s512d0.6 (temperature is again 0.5).

Հոդված 21. Յուրաքանչյուր ոք ունի ազատ տեղաշարժվելու և բնակություն է կառավարության անդամներին: Հանրապետության Նախագահը պաշտոնն ստանձնում է Հանրապետության Նախագահը չի կարող զբաղվել ձեռնարկատիրական գործունեությամբ:
Հոդված 50. Հանրապետության Նախագահը պաշտոնն ստանձնում է Հանրապետության Նախագահի պաշտոնը թափուր մնալու դեպքում Հանրապետության Նախագահի արտահերթ ընտրությունը կազմված է վարչապետի առաջարկությամբ վերահսկողությունը

Յուրաքանչյուր ոք ունի ազատ տեղաշարժվելու և բնակավայր ընտրելու իրավունք:

There are virtually no invalid words anymore (fewer than 0.5%, and most are one-character typos). Sentences are better formed. Sometimes a sentence is composed of two exact copies of different sentences that actually occur in the corpus. For example, the combination Հանրապետության Նախագահը պաշտոնն ստանձնում է appears 7 times in the corpus, and Հանրապետության Նախագահը չի կարող զբաղվել ձեռնարկատիրական գործունեությամբ appears once. So the generated samples are often boring, although sometimes the combination of two such parts does have a meaning. The following article is a very good example, and it doesn’t appear in the corpus.

Հոդված 151. Հանրապետության Նախագահի հրամանագրերը և կարգադրությունները կատարում է Ազգային ժողովի նախագահը:

When the temperature is increased to 0.75, the samples become more interesting.

Հոդված 52. Հանրապետության Նախագահի լիազորությունները սահմանվում են Սահմանադրությամբ և սահմանադրական դատարանի դատավորների մեկ մտնում առաջին ատյանի դատարանները:
Հոդված 107. Ազգային ժողովի լիազորությունների ժամկետը կեղերով բացասական տեղեկատվության ազատության ենթարկելու հարց հարուցելու կամ այլ գործադիր իշխանության, տեղական ինքնակառավարման մարմինների անկախության մասին.
7) եզրակացություն է տալիս իր լիազորությունների երաշխավորվում է միջազգային իրավունքի սկզբունքները և նախարարներից, ներկայացնում է Ազգային ժողովին եզրակացություններ ներկայացնելու համար:

Typos are a bit more common. An “ordered list” is generated here which starts with 7 and has only one entry. Article numbers are not tied to 1s anymore. Higher temperatures produce more nonexistent words.

NaNoGenMo

Since 1999 every November has been declared National Novel Writing Month, when people are encouraged to write a novel in one month. Since 2013, a similar event has been organized for algorithms. It’s called National Novel Generating Month. The rules are very simple: each participant must share one generated novel (at least 50 000 words) and release the source code. The Verge wrote about last year’s results.

Armen Khachikyan told us about this, and we thought that we could take part in it with a long enough generated Constitution. Here is our entry. It was generated by the following command:

th sample.lua cv/lm_bs50s512d0.6_epoch50.00_0.3800.t7 -length 900000 -temperature 0.5 -gpuid 0 -primetext "Գ Լ ՈՒ Խ  1" > sample_bs50s512d0.6t0.5.txt

The model was generated by the following command:

th train.lua -data_dir data/ -batch_size 50 -dropout 0.6 -rnn_size 512 -gpuid 0 -savefile bs50s512d0.6 | tee log_bs50s512d0.6

All related files are in our Github repository.

Spoken language identification with deep convolutional networks

2015-10-11T00:00:00+00:00

By Hrayr Harutyunyan

Recently TopCoder announced a contest to identify the spoken language in audio recordings. I decided to test how well deep convolutional networks would perform on this kind of data. In short, I managed to get around 95% accuracy and finished in 10th place. This post reveals all the details.

Dataset and scoring
Preprocessing
Network architecture
Data augmentation
Ensembling
What we learned from this contest
Unexplored options

Dataset and scoring

Each recording was in one of 176 languages. The training set consisted of 66176 mp3 files, 376 per language, from which I set aside 12320 recordings for validation (the Python script is available on GitHub). The test set consisted of 12320 mp3 files. All recordings had the same length (~10 sec) and seemed to be noise-free (at least all the samples that I checked).

The score was calculated as follows: for every mp3, the top 3 guesses were uploaded in a CSV file. 1000 points were given if the first guess was correct, 400 points if the second guess was correct and 160 points if the third guess was correct. During the contest the score was calculated only on 3520 recordings from the test set. After the contest the final score was calculated on the remaining 8800 recordings.
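The scoring rule can be written down in a few lines. This is just a restatement of the rule above and not the official scorer of the contest; the function names and the language codes in the example are made up.

```python
POINTS = (1000, 400, 160)  # points for a correct 1st, 2nd and 3rd guess

def score_guesses(guesses, truth):
    """Score one recording given its ordered top-3 guesses."""
    for points, guess in zip(POINTS, guesses):
        if guess == truth:
            return points
    return 0

def total_score(all_guesses, truths):
    """Sum the scores over all recordings."""
    return sum(score_guesses(g, t) for g, t in zip(all_guesses, truths))

# Toy example with hypothetical language codes.
all_guesses = [("hy", "ka", "fa"), ("en", "de", "nl"), ("ru", "uk", "bg")]
truths = ["hy", "nl", "pl"]
print(total_score(all_guesses, truths))
```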

Preprocessing

I entered the contest just 14 days before the deadline, so I didn’t have much time to investigate audio-specific techniques. But we had a deep convolutional network developed a few months earlier, and it seemed to be a good idea to test a pure CNN on this problem. A quick Google search revealed that the idea is not new. The earliest attempt I could find was a paper by G. Montavon presented at the NIPS 2009 conference. The author used a network with 3 convolutional layers trained on spectrograms of audio recordings, and the output of the convolutional/subsampling layers was given to a time-delay neural network.

I found a Python script which creates a spectrogram of a wav file. I used the mpg123 library to convert mp3 files to wav format.

The preprocessing script is available on GitHub.

Network architecture

I took the network architecture designed for Kaggle’s diabetic retinopathy detection contest. It has 6 convolutional layers and 2 fully connected layers with 50% dropout. The activation function is always ReLU. Learning rates are set to be higher for the first convolutional layers and lower for the top convolutional layers. The last fully connected layer has 176 neurons and is trained using a softmax loss.

It is important to note that this network does not take into account the sequential characteristics of the audio data. Although recurrent networks perform well on speech recognition tasks (one notable example is this paper by A. Graves, A. Mohamed and G. Hinton, cited by 272 papers according to Google Scholar), I didn’t have time to learn how they work.

I trained the CNN in Caffe with 32 images in a batch. Its description in Caffe prototxt format is available here.

Nr	Type	Batches	Channels	Width	Height	Kernel size / stride
0	Input	32	1	858	256
1	Conv	32	32	852	250	7x7 / 1
2	ReLU	32	32	852	250
3	MaxPool	32	32	426	125	3x3 / 2
4	Conv	32	64	422	121	5x5 / 1
5	ReLU	32	64	422	121
6	MaxPool	32	64	211	60	3x3 / 2
7	Conv	32	64	209	58	3x3 / 1
8	ReLU	32	64	209	58
9	MaxPool	32	64	104	29	3x3 / 2
10	Conv	32	128	102	27	3x3 / 1
11	ReLU	32	128	102	27
12	MaxPool	32	128	51	13	3x3 / 2
13	Conv	32	128	49	11	3x3 / 1
14	ReLU	32	128	49	11
15	MaxPool	32	128	24	5	3x3 / 2
16	Conv	32	256	22	3	3x3 / 1
17	ReLU	32	256	22	3
18	MaxPool	32	256	11	1	3x3 / 2
19	Fully connected	20	1024
20	ReLU	20	1024
21	Dropout	20	1024
22	Fully connected	20	1024
23	ReLU	20	1024
24	Dropout	20	1024
25	Fully connected	20	176
26	Softmax Loss	1	176
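The widths and heights in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming no padding anywhere and assuming that Caffe rounds the output size down for convolutions and up for pooling; the function names are mine and are not part of Caffe.

```python
import math


def conv_out(size, kernel, stride=1):
    # Convolution without padding; the output size is rounded down.
    return (size - kernel) // stride + 1


def pool_out(size, kernel, stride):
    # Pooling without padding; the output size is rounded up.
    return math.ceil((size - kernel) / stride) + 1


# (kernel size, output channels) of the six Conv -> ReLU -> MaxPool groups in the table.
CONV_LAYERS = [(7, 32), (5, 64), (3, 64), (3, 128), (3, 128), (3, 256)]


def trace_shapes(width=858, height=256):
    """Return (layer type, channels, width, height) after every Conv and MaxPool layer."""
    shapes = []
    for kernel, channels in CONV_LAYERS:
        width, height = conv_out(width, kernel), conv_out(height, kernel)
        shapes.append(("Conv", channels, width, height))
        width, height = pool_out(width, 3, 2), pool_out(height, 3, 2)
        shapes.append(("MaxPool", channels, width, height))
    return shapes


for row in trace_shapes():
    print(row)
```

ReLU layers are skipped because they do not change the shape. The last row printed is `('MaxPool', 256, 11, 1)`, which matches row 18 of the table.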

Hrant suggested trying the ADADELTA solver. It is a method which dynamically calculates the learning rate for every network parameter, and the training process is said to be independent of the initial choice of learning rate. It was recently implemented in Caffe.

In practice, the base learning rate set in the Caffe solver did matter. At first I tried a learning rate of 1.0, and the network didn’t learn at all. Setting the base learning rate to 0.01 helped a lot and I trained the network for 90 000 iterations (more than 50 epochs). Then I switched to a base learning rate of 0.001 for another 60 000 iterations. The solver is available here. I am not sure why the base learning rate mattered so much at the early stages of the training. One possible reason could be the large learning rate coefficients on the lower convolutional layers. Both tricks (dynamically updating the learning rates in ADADELTA and large learning rate coefficients) aim to fight the vanishing gradient problem, and maybe their combination is not a very good idea. This should be carefully analysed.


Training (blue) and validation (red) loss over the 150 000 iterations on the non-augmented dataset. The sudden drop of training loss corresponds to the point when the base learning rate was changed from `0.01` to `0.001`. Plotted using this script.

The signs of overfitting were getting more and more visible, so I stopped at 150 000 iterations. The softmax loss reached 0.43, which corresponded to a score of 3 180 000 (out of 3 520 000 possible). Some ensembling with other models of the same network gave a slightly higher score (3 220 000), but it was obvious that data augmentation was needed to overcome the overfitting problem.

Data augmentation

The most important weakness of our team in the previous contest was that we didn’t augment the dataset well enough. So I was looking for ways to augment the set of spectrograms. One obvious idea was to crop random, say, 9 second intervals of the recordings. Hrant suggested another idea: to warp the frequency axis of the spectrogram. This process is known as vocal tract length perturbation, and is generally used for speaker normalization at least since 1998. In 2013 N. Jaitly and G. Hinton used this technique to augment the audio dataset. I used this formula to linearly scale the frequency bins during spectrogram generation:


Frequency warping formula from the paper by L. Lee and R. Rose. α is the scaling factor. Following Jaitly and Hinton, I chose it uniformly between 0.9 and 1.1.

I also randomly cropped the spectrograms so they had 768x256 size. Here are the results:

Spectrogram of one of the recordings

Cropped spectrogram of the same recording with warped frequency axis

For each mp3 I created 20 random spectrograms, but trained the network on 10 of them. It took more than 2 days to create the augmented dataset and convert it to LevelDB format (the format Caffe suggests). But training the network proved to be even harder. For 3 days I couldn’t significantly decrease the training loss. After removing the dropout layers the loss started to decrease, but it would have taken weeks to reach reasonable levels. Finally, Hrant suggested reusing the weights of the model trained on the non-augmented dataset. The problem was that, due to the cropping, the image sizes in the two datasets were different. But it turned out that convolutional and pooling layers in Caffe work with images of variable sizes; only the fully connected layers couldn’t reuse the weights from the first model. So I just renamed the FC layers in the prototxt file and initialized the network (convolution filters) with the weights of the first model:

./build/tools/caffe train --solver=solver.prototxt --weights=models/main_32r-2-64r-2-64r-2-128r-2-128r-2-256r-2-1024rd0.5-1024rd0.5_DLR_72K-adadelta0.01_iter_153000.caffemodel

This helped a lot. I used standard stochastic gradient descent (inverse decay learning rate policy) with base learning rate 0.001 for 36 000 iterations (less than 2 epochs), then increased the base learning rate to 0.01 for another 48 000 iterations (due to the inverse decay policy the rate decreased seemingly too much). These trainings were done without any regularization techniques, weight decay or dropout layers, and there were clear signs of overfitting. I tried to add 50% dropout layers on fully connected layers, but the training was extremely slow. To improve the speed I used 30% dropout, and trained the network for 120 000 more iterations using this solver. Softmax loss on the validation set reached 0.21 which corresponded to 3 390 000 score. The score was calculated by averaging softmax outputs over 20 spectrograms of each recording.
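The averaging step in the last sentence, followed by picking the top 3 guesses, can be sketched as below. This is only an illustration with made-up numbers (3 crops and 4 classes instead of 20 spectrograms and 176 languages); the function names are hypothetical and this is not the script used in the contest.

```python
def average_softmax(outputs):
    """Average the per-class probabilities over several spectrograms of one recording."""
    count = len(outputs)
    return [sum(column) / count for column in zip(*outputs)]


def top_guesses(probabilities, k=3):
    """Indices of the k most probable classes, best first."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:k]


# Made-up softmax outputs: one row per spectrogram, one column per language.
crops = [
    [0.15, 0.60, 0.20, 0.05],
    [0.05, 0.40, 0.50, 0.05],
    [0.15, 0.70, 0.10, 0.05],
]
mean = average_softmax(crops)
print(top_guesses(mean))  # [1, 2, 0]
```

Note that the second crop alone would have put class 2 first; averaging over crops smooths out such individual mistakes.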

Ensembling

30 hours before the deadline I had several models from the same network. Even simple ensembling (just the sum of softmax activations of different models) performed better than any individual model. Hrant suggested using XGBoost, which is a fast implementation of the gradient boosting algorithm and is very popular among Kagglers. XGBoost has good documentation and all parameters are well explained.

To perform the ensembling I created a CSV file containing softmax activations (or the average of softmax activations among 20 augmented versions of the same recording) using this script. Then I ran XGBoost on these CSV files. The submission file (which was requested by TopCoder) was generated using this script.

I also tried to train a simple neural network with one hidden layer on the same CSV files. The results were significantly better than with XGBoost.

The best result was obtained by ensembling the following two models: snapshots of the last network (the one with 30% dropout) after 90 000 iterations and 105 000 iterations. Final score was 3 401 840 and it was the 10th result of the contest.

What we learned from this contest

This was quite an interesting contest, although too short compared with Kaggle’s contests.

Plain, AlexNet-like convolutional networks work quite well for fixed length audio recordings
Vocal tract length perturbation works well as an augmentation technique
Caffe supports sharing weights between convolutional networks having different input sizes
A neural network with a single hidden layer sometimes performs better than XGBoost for ensembling (although I had just one day to test both)

Unexplored options

It would be interesting to see whether a network with 50% dropout layers would improve the accuracy
Maybe larger convolutional networks, like OxfordNet, would perform better. They require much more memory, and it was risky to play with them under a tough deadline
Hybrid methods combining CNN and Hidden Markov Models should work better
We believe it is possible to squeeze more from these models with better ensembling methods
Other contestants report better results based on careful mixing of the results of more traditional techniques, including n-gram and Gaussian Mixture Models. We believe the combination of these techniques with the deep models will provide very good results on this dataset

One important issue is that the organizers of this contest do not allow the dataset to be used outside the contest. We hope this decision will be changed eventually.

Diabetic retinopathy detection contest. What we did wrong

2015-08-17T00:00:00+00:00

After watching the awesome video course by Hugo Larochelle on neural nets (more on this in the previous post) we decided to test our knowledge on some computer vision contest. We looked at Kaggle and the only active competition related to computer vision (except for the digit recognizer contest, for which lots of perfect out-of-the-box solutions exist) was the Diabetic retinopathy detection contest. It was probably too hard to be our very first project, but nevertheless we decided to try. The team included Karen, Tigran, Hrayr, Narek (1st to 3rd year bachelor students) and me (PhD student). Long story short, we finished in 82nd place out of 661 participants, and in this post I will describe in detail what we did and what mistakes we made. All required files are in these 2 GitHub repositories. We hope this will be interesting for those who are just starting to play with neural networks. We also hope to get feedback from experts and other participants.

The contest
Software and hardware
Image preprocessing
Data augmentation
Choosing training / validation sets
Convolutional network architecture
Loss function
Preparing submissions
Attempts to ensemble
More on this contest
Acknowledgements

The contest

Diabetic retinopathy is a disease in which the retina of the eye is damaged due to diabetes. It is one of the leading causes of blindness in the world. The contest’s aim was to see if computer programs can diagnose the disease automatically from an image of the retina. It seems the winners slightly surpassed the performance of general ophthalmologists.

Each eye of a patient can be at one of 5 levels, from 0 to 4, where 0 corresponds to the healthy state and 4 is the most severe state. Different eyes of the same person can be at different levels (although some contestants managed to leverage the fact that the two eyes are not completely independent). Contestants were given 35126 JPEG images of retinas for training (32.5GB), 53576 images for testing (49.6GB) and a CSV file with the disease level of each training image. The goal was to create another CSV file with a disease level for each of the test images. Contestants could submit a maximum of 5 CSV files per day for evaluation.


Healthy eye: level 0	Severe state: level 4

The score was evaluated using a metric called quadratic weighted kappa. It measures the agreement between two raters: here, between the scores assigned by the human rater (which are unknown to contestants) and the predicted scores. If the agreement is random, the score is close to 0 (sometimes it can even be negative). In case of perfect agreement the score is 1. It is quadratic in the sense that, for example, predicting level 4 for a healthy eye is penalized 16 times more than predicting level 1. The winners achieved a score above 0.84. Our best result was around 0.50.
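To make the metric more concrete, here is a minimal pure-Python sketch of quadratic weighted kappa as it is commonly defined: one minus the ratio of the weighted observed disagreement to the weighted disagreement expected by chance. The function name and the toy inputs are mine; this is not the contest's official scoring code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels=5):
    """Agreement between two lists of integer labels in the range 0..num_levels-1."""
    n = len(rater_a)
    observed = [[0] * num_levels for _ in range(num_levels)]
    hist_a = [0] * num_levels
    hist_b = [0] * num_levels
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
        hist_a[a] += 1
        hist_b[b] += 1

    numerator = 0.0
    denominator = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: grows with the squared distance between labels.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            # Count expected if the two raters were independent.
            expected = hist_a[i] * hist_b[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters used one and the same single level everywhere.
        return 1.0
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # 1.0
print(quadratic_weighted_kappa([0, 4], [4, 0]))                    # -1.0
```

The weight for confusing levels 0 and 4 is (0 − 4)² = 16 times the weight for confusing levels 0 and 1, which is the factor mentioned in the text.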

Software and hardware

It was obvious that we were going to use a convolutional neural network for predicting. Not only because of its awesome performance on many computer vision problems, including another Kaggle competition on plankton classification, but also because it was the only technique we knew for image classification. We were aware of several libraries that implement convolutional networks, namely Python-based Theano, Caffe written in C++, cxxnet (developed by the 2nd place winners of the plankton contest) and Torch. We chose Caffe because it seemed to be the simplest one for beginners: it allows defining the neural network in a simple text file (like this) and training a network without writing a single line of code.

We didn’t have a computer with a CUDA-enabled GPU at the university, but our friends at Cyclop Studio donated an Intel Core i5 computer with 4GB RAM and an NVidia GeForce GTX 550 Ti card. The 550 Ti has 1GB of memory, which forced us to use very small batch sizes for the neural network. Later we switched to a GeForce GTX 980 with 4GB memory, which was completely fine for us.

Karen and Tigran managed to install Caffe on Ubuntu and make it work with CUDA, which was enough to start the training. Later Narek and Hrayr found out how to play with Caffe models using Python, so we could run our models on the test set. Karen connected Cloud9 to the server, so we could work remotely through a web interface.

Image preprocessing

Images from the training and test datasets have very different resolutions, aspect ratios and colors, are cropped in various ways, and some are of very low quality, out of focus etc. Neural networks require a fixed input size, so we had to resize / crop all of them to some fixed dimensions. Karen and Tigran looked at many sample images and decided that the optimal resolution which preserves the details required for classification is 512x512. We thought that at 256x256 we might lose the small details that distinguish healthy eye images from level 1 images. In fact, by the end of the competition we saw that our networks could not differentiate between level 0 and 1 images even at 512x512, so we could probably have safely worked at 256x256 from the very beginning (which would have been much faster to train). All preprocessing was done using imagemagick.

We tried three methods to preprocess the images. First, as suggested by Karen and Tigran, we resized the images and then applied the so called charcoal effect which is basically an edge detector. This highlighted the signs of blood on the retina. One of the challenging problems throughout the contest was to define a naming convention for everything: databases of preprocessed images, convnet descriptions, models, CSV files etc. We used the prefix edge for anything which was based on the images preprocessed this way. The best kappa score achieved on this dataset was 0.42.


Preprocessed image (edge) level 0	Preprocessed image (edge) level 3

But later we noticed that this method makes the dirt on lens or other optical issues appear similar to a blood sign, and it really confused our neural networks. The following two images are of healthy eyes (level 0), but both were recognized by almost all our models as level 4.



Original images of healthy eyes	Preprocessed versions `edge` recognized as level 4

So we decided to avoid using filters on the images, and leave all the work to the convolutional network: just resize and convert to one channel image (to save space and memory). We thought that the color information is not very important to detect the disease, although this could be one of our mistakes. Following the discussion at Kaggle forums we decided to use the green channel only. We got our best results (kappa = 0.5) on this dataset. We used prefix g for these images.

Finally we tried to apply the equalize filter on top of the green channel, which makes the histogram of the image uniform. The best kappa score we managed to get on the dataset preprocessed this way was only 0.4. We used prefix ge for these images.


Just the green channel: `g`	Histogram equalization on top of the green channel: `ge`

Data augmentation

One of the problems of neural networks is that they are extremely powerful. They learn so well that they usually learn something that degrades their performance on other (previously unseen) data. One (made-up) example: the images in the training set are taken by different cameras and have different characteristics. If for some reason, say, the percentage of level 2 images among dark images is higher than in general, the network may start to predict level 2 more often for dark images. We are not aware of any way to detect such “misleading” correlations by looking at neuron activations of convolution filters. But, fortunately, it is possible to train the network on one subset of data and test it on another. If the performance on these subsets is different, then the network has learned something very specific to the training data: it has overfit the training data, and we should try to avoid that.

One of the solutions to this problem is to enlarge the dataset in order to minimize the chances of such correlations occurring. This is called data augmentation. The organizers of this contest explicitly forbade the use of data outside the dataset they provided. But it’s obvious that if you take an image, zoom it, rotate it, flip it, change the brightness etc., the level of the disease will not change. So it is possible to apply these transformations to the images and obtain a much larger and “more random” training dataset. One approach is to take all versions of all images into the training set; another approach is to randomly choose one transformation for each of the images. A mixture of these approaches helps to solve another problem which will be discussed in the next section.

We applied only very limited transformations. For every image we created 4 samples: the original, a version rotated by 180 degrees, and the vertically flipped versions of these two. This helped to avoid the problem that some of the images in the dataset were flipped.
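The four samples can be illustrated on a tiny image stored as a list of rows. This is only a sketch in pure Python with hypothetical function names; the actual transformations were applied to image files, not to Python lists.

```python
def rotate_180(image):
    """Rotate by 180 degrees: reverse the order of rows and of pixels in each row."""
    return [row[::-1] for row in image[::-1]]


def flip_vertical(image):
    """Flip top to bottom: reverse the order of rows."""
    return image[::-1]


def four_versions(image):
    """Original, rotated by 180 degrees, and the flipped versions of these two."""
    rotated = rotate_180(image)
    return [image, rotated, flip_vertical(image), flip_vertical(rotated)]


tiny = [[1, 2],
        [3, 4]]
for version in four_versions(tiny):
    print(version)
```

Flipping the rotated image top to bottom gives the same result as flipping the original left to right, so these four samples cover both kinds of flips.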

We believe that we spent way too little time on data augmentation. All other contestants we have seen use much more sophisticated transformations. Probably this was our most important mistake.

Choosing training / validation sets

There are two reasons to train the networks only on a subset of the train dataset provided by Kaggle. The first reason is to be able to compare different models. We need to choose the model which generalizes best to unseen data, not the one which performs best on the data it has been trained on. So we train various models on some subset of the dataset (again called a training set), then compare their performance on the other subset (called a validation set) and pick the one which works better on the latter.

The second reason is to detect overfitting while training. During the training we sometimes (in Caffe this is configured by the test_interval parameter) run the network on the validation set and calculate the loss. When we see that the loss on the validation set does not decrease anymore, we know that overfitting happens. This is best illustrated in this image from Wikipedia.

The distribution of images of different levels in the training set provided by Kaggle was very uneven. Almost three quarters of the images were of healthy eyes:

Level	Number of images	Percentage
0	25810	73.48%
1	2443	6.95%
2	5292	15.07%
3	873	2.49%
4	708	2.02%

Neural networks seem to be very sensitive to this kind of distribution. Our very first neural network (using softmax classification) was randomly giving labels 0 and 2 to almost all images (which brought a kappa score of 0.138). So we had to make the classes more or less equal. Here we made a couple of trivial mistakes.

At first we augmented the dataset by creating lots of rotations (multiples of 30 degrees, 12 versions of each image) and created a dataset of around 100K images with equally distributed classes. So we took 36 times more versions of images of level 4 than of images of level 0. As we had only 12 versions of each image, we took every version of the level 4 images 3 times. Finally, we separated the training and validation sets after these augmentations. After training for 88000 iterations (with batch size 2, we were still on the GeForce 550 Ti) we had a kappa score of 0.55 on our validation set. But on Kaggle’s test set the score was only 0.23. So we had terrible overfitting and didn’t detect it locally.

The most important point here, as I understand it, is that the separation of training and validation sets should have been done before the data augmentation. In our case we had different rotations of the same image in both sets, which didn’t allow us to detect overfitting.

So later we took 7472 images (21%) as a validation set, and performed the data augmentation on the remaining 27654 images. The validation set had the same ratio of classes as Kaggle’s test set. This is important for choosing the best model: the validation set should be as similar to the test set as possible.
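The order of operations is the important part: split first, then augment only the training part. Here is a minimal sketch of that order on toy data. The function names, image ids and version tags are hypothetical, and splitting per level is just one simple way to keep the class ratios similar; it is not necessarily how our file lists were produced.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction, seed=0):
    """Split image ids into (train, validation), keeping class ratios similar.

    `labels` maps an image id to its level.
    """
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for image_id, level in labels.items():
        by_level[level].append(image_id)

    train, validation = [], []
    for level in sorted(by_level):
        ids = sorted(by_level[level])
        rng.shuffle(ids)
        cut = round(len(ids) * validation_fraction)
        validation.extend(ids[:cut])
        train.extend(ids[cut:])
    return train, validation


def augment(image_ids, versions=("orig", "rot180", "flip", "rot180_flip")):
    """List (image id, version tag) pairs; the tags are made-up names."""
    return [(image_id, version) for image_id in image_ids for version in versions]


# Toy labels: 75 images of level 0 and 25 images of level 2.
labels = {f"img{i}": (0 if i % 4 else 2) for i in range(100)}
train_ids, val_ids = stratified_split(labels, 0.21)
train_samples = augment(train_ids)  # augmentation happens only after the split
print(len(train_ids), len(val_ids), len(train_samples))  # 79 21 316
```

Because every version of an image stays on the same side of the split, the validation loss is measured on eyes the network has never seen in any form.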

We also decided to get rid of the rotations by multiples of 30 degrees, as the images were being distorted (we applied rotations after resizing the images). After the competition, however, we saw that other contestants had used such rotations, so maybe this was another mistake.

Then, it turned out that the idea of taking copies of the same image is terrible, because the network overfits the smaller classes (like level 3 and level 4) and it is hard to notice that just by looking at validation loss values, because the corresponding classes are very small in the validation set. We identified this problem by carefully visualizing neuron activations on training and validation sets (just 2 weeks before the competition deadline):


Every dot corresponds to one image. Blue dots are from the training set, orange dots are from the validation set. `x` axis is the activation of a top layer neuron. `y` axis is the original label (0 to 4). Basically there is no overfitting for the images of level 0, 1 or 2: the activations are very similar. But the overfitting of the images of level 3 and 4 is obvious. Training samples are concentrated around fixed values, while validation samples are spread widely

Finally we decided to train a network to differentiate between two classes only: images of level 0 and 1 versus images of level 2, 3 and 4. The ratio of the images in these classes was 4:1. We augmented the training set only by vertical flipping and rotating by 180 degrees. We took all 4 versions of each image of the second class and we randomly took one of the 4 versions of each image of the first class. This way we ended up with a training set of two equal classes. This gave us our best kappa score 0.50.
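The balancing scheme can be sketched as follows. The image ids, version tags and function name are hypothetical, and the toy data simply mirrors the 4:1 ratio mentioned in the text.

```python
import random

VERSIONS = ("orig", "rot180", "flip", "rot180_flip")  # made-up version tags


def balanced_training_list(labels, seed=0):
    """Build (image id, version, binary label) triples for 0,1 vs 2,3,4 training."""
    rng = random.Random(seed)
    samples = []
    for image_id, level in sorted(labels.items()):
        if level >= 2:
            # Second class (levels 2, 3, 4): keep all 4 versions.
            samples.extend((image_id, version, 1) for version in VERSIONS)
        else:
            # First class (levels 0, 1): keep one randomly chosen version.
            samples.append((image_id, rng.choice(VERSIONS), 0))
    rng.shuffle(samples)
    return samples


# Toy labels with a 4:1 ratio: 8 images of level 0 and 2 images of level 2.
labels = {f"img{i}": (2 if i % 5 == 0 else 0) for i in range(10)}
samples = balanced_training_list(labels)
print(sum(1 for s in samples if s[2] == 0), sum(1 for s in samples if s[2] == 1))  # 8 8
```

The downside discussed earlier still applies: the smaller class is balanced by repeating views of the same eyes, not by adding new eyes.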

Later we wanted to train a classifier which would differentiate level 0 images from level 1 images only, but the networks we tried didn’t work at all. Another classifier we used to differentiate between level 2 and level 3 + level 4 images actually learned something, but we couldn’t increase the overall kappa score based on that.

After preparing the list of files for the training and validation sets, we used a tool bundled with Caffe to create a LevelDB database from the directory of images. Caffe prefers to read from LevelDB rather than from a directory:

./build/tools/convert_imageset -backend=leveldb -gray=true -shuffle=true data/train.g/ train.g.01v234.txt leveldb/train.g.01v234

gray is set to true because we use single-channel images and shuffle is required to properly shuffle the images before importing into the database.

Convolutional network architecture

Our best performing neural network architecture and the corresponding solver are on GitHub. Batch size was always fixed to 20 (on the GTX 980 card). We used simple stochastic gradient descent with 0.9 momentum and didn’t touch the learning rate policy at all (it didn’t decrease the rate significantly). We started at a learning rate of 0.001, and sometimes manually decreased it (but not in this particular case, which brought the best kappa score). Also, in this best performing case we started with 0 weight decay, and after the first signs of overfitting (after 48K iterations, which is almost 20 epochs) increased it to 0.0015.

The convolutional part was built similarly to the “traditional” LeNet architecture (developed by Yann LeCun, who invented convolutional networks): one max pooling layer after every convolution layer, with fully connected layers at the end.

Almost all other contestants used the other famous approach, with multiple consecutive convolutional layers with small kernels before a pooling layer. This was developed by Karen Simonyan and Andrew Zisserman at Visual Geometry Group, University of Oxford (that’s why it is called VGGNet or OxfordNet) for the ImageNet 2014 contest where they took 1st and 2nd places for localization and classification tasks, respectively. Their approach was popularized by Andrej Karpathy and was successfully used in the plankton classification contest. I have tried this approach once, but it required significantly more memory and time, so I quickly abandoned it.

Here is the structure of our network:

Nr	Type	Batches	Channels	Width	Height	Kernel size / stride
0	Input	20	1	512	512
1	Conv	20	40	506	506	7x7 / 1
2	ReLU	20	40	506	506
3	MaxPool	20	40	253	253	3x3 / 2
4	Conv	20	40	249	249	5x5 / 1
5	ReLU	20	40	249	249
6	MaxPool	20	40	124	124	3x3 / 2
7	Conv	20	40	120	120	5x5 / 1
8	ReLU	20	40	120	120
9	MaxPool	20	40	60	60	3x3 / 2
10	Conv	20	40	56	56	5x5 / 1
11	ReLU	20	40	56	56
12	MaxPool	20	40	14	14	4x4 / 4
13	Fully connected	20	256
14	ReLU	20	256
15	Dropout	20	256
16	Fully connected	20	256
17	ReLU	20	256
18	Dropout	20	256
19	Fully connected	20	1
20	Euclidean Loss	1	1

Some observations related to the network architecture:

ReLU activations on all convolutional and fully connected layers helped a lot, kappa score increased by almost 0.1. It’s interesting to note that Christian Szegedy, one of the GoogLeNet developers (winner of the classification contest at ImageNet 2014), expressed an opinion that the main reason for the deep learning revolution happening now is the ReLU function :)
2 fully connected layers (256 neurons each) at the end are better than one fully connected layer. Kappa increased by almost 0.03
The number of filters in the convolutional layers is not very important. The difference between, say, 20 and 40 filters is very small
Dropout helps fight overfitting (we used 50% probability everywhere)
We didn’t notice any difference with Local response normalization layers

Below are the 40 filters of the first convolutional layer of our best model (visualization code is adapted from here). They don’t seem to be very meaningful:

I tried to use dropout on convolutional layers as well, but couldn’t make the network learn anything. The loss was quickly becoming nan. Probably the learning rate should have been very different…

Loss function

Submissions of this contest were evaluated by the metric called quadratic weighted kappa. We found an Excel code that implements it which helped us to get some intuition.

At first we used softmax loss on top of the 5 neurons of the final fully connected layer. Later we decided to use something that takes into account the fact that the order of the labels matters (0 and 1 are closer than 0 and 4). We left only one neuron in the last layer and tried to use Euclidean loss. We even tried to “scale” the labels of the images in a way that would make the loss closer to being “quadratic”: we replaced the labels [0,1,2,3,4] with [0,2,3,4,6].

Ideally we would like to have a loss function that implements the kappa metric. But we didn’t risk implementing a new layer in Caffe. Jeffrey De Fauw implemented a continuous approximation of the kappa metric using Theano with a lot of success.

When we switched to 0,1 vs 2,3,4 classification, I thought a 2-neuron softmax would be better than Euclidean loss because of the second neuron: it might bring some information that could help to obtain a better score. But after some tests I saw that the activations of the two softmax neurons always sum to 1 (as follows from the definition of softmax), so the second neuron does not bring new information. The rest of the training was done using Euclidean loss (although I am not sure if that was the best option).
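The observation about the two softmax neurons is easy to check numerically. Here is a minimal sketch with made-up input values; the `softmax` helper is written from the textbook definition and is not Caffe's implementation.

```python
import math


def softmax(logits):
    """Textbook softmax, shifted by the maximum for numerical stability."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up pre-softmax values of the two output neurons for three images.
for logits in ([2.0, -1.0], [0.3, 0.3], [-5.0, 4.0]):
    first, second = softmax(logits)
    print(round(first, 4), round(second, 4), round(first + second, 4))
```

The two outputs always add up to 1, so the second one equals 1 minus the first and carries no extra information.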

We logged the output of Caffe into a file, then plotted the graphs of training and validation losses using a Python script written by Hrayr:

./build/tools/caffe train -solver=solver.prototxt &> log_g_01v234_40r-2-40r-2-40r-2-40r-4-256rd0.5-256rd0.5-wd0-lr0.001.txt

python plot_loss.py log_g_01v234_40r-2-40r-2-40r-2-40r-4-256rd0.5-256rd0.5-wd0-lr0.001.txt

The script allows plotting multiple logs on the same image and uses a moving average to make the graph look smoother. It correctly aligns the graphs even if the log does not start from the first iteration (in case the training is resumed from a Caffe snapshot). For example, in the plot below train 1 and val 1 correspond to the model described in the previous section with weight decay=0; train 2 and val 2 correspond to the model which started from the 48000th iteration of the previous model but used weight decay=0.0015. The best kappa score was obtained at the 81000th iteration of the second model. After that we observe overfitting.
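The smoothing idea can be sketched like this. It is a simple trailing moving average written for illustration, with a hypothetical function name and made-up loss values; the actual plot_loss.py may smooth the curves differently.

```python
def moving_average(values, window):
    """Trailing moving average; the first points average over fewer values."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up noisy loss values, for illustration only.
losses = [1.0, 2.0, 3.0, 4.0, 5.0]
print(moving_average(losses, 2))  # [1.0, 1.5, 2.5, 3.5, 4.5]
```

A larger window gives a smoother curve but delays the point at which a change in the loss, such as the start of overfitting, becomes visible.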

Note that the validation loss is usually lower than the training loss. The reason is that the classes are equal in the training set and are far from being equal in the validation set. So the training and validation losses cannot be compared.

Preparing submissions

After training the models we used a Python script to make predictions for the images in validation set. It creates a CSV file with neuron activations. Then we imported this CSV into Wolfram Mathematica and played with it there.

I use Mathematica mainly because of its nice visualizations. Here is one of them: the x axis is the activation of the single neuron of the last layer, and the graphs present the percentages of the images of each particular label that have x activation. Ideally the graphs corresponding to different labels should be clearly separable by vertical lines. Unfortunately that’s not the case, which visually explains why the kappa score is so low.

In order to convert the neuron activations to predicted levels we need to determine 4 “threshold” numbers. These graphs show that it’s not obvious how to choose these 4 numbers in order to maximize the kappa score. So we take, say, 1000 random 4-tuples of numbers between minimum and maximum activations of the neuron, and calculate the kappa score for each of the tuples. Then we take the 4-tuple for which the kappa was maximal, and use these numbers as thresholds for the images in the test set.

Note that we calculate the kappa scores for the validation set, although there is a risk to overfit the validation set. Ideally we should choose those thresholds which attain maximum kappa score on the train set. But, in practice, the thresholds that maximize the kappa score on validation set perform better on the test set, mainly because the network has already overfit the training set!
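The random threshold search can be sketched in pure Python as follows. We did this step in Mathematica, so the code below is only an illustration of the idea with hypothetical function names and toy activations; the kappa function is a compact version of the commonly used definition.

```python
import bisect
import random


def quadratic_weighted_kappa(rater_a, rater_b, num_levels=5):
    """Compact quadratic weighted kappa for integer labels 0..num_levels-1."""
    n = len(rater_a)
    hist_a = [rater_a.count(level) for level in range(num_levels)]
    hist_b = [rater_b.count(level) for level in range(num_levels)]
    # Weighted observed disagreement (the constant factor of the weights cancels out).
    numerator = sum((a - b) ** 2 for a, b in zip(rater_a, rater_b))
    # Weighted disagreement expected if the two raters were independent.
    denominator = sum((i - j) ** 2 * hist_a[i] * hist_b[j] / n
                      for i in range(num_levels) for j in range(num_levels))
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


def apply_thresholds(activation, thresholds):
    """Map an activation to a level 0..4 using 4 sorted thresholds."""
    return bisect.bisect_right(thresholds, activation)


def search_thresholds(activations, labels, trials=1000, seed=0):
    """Try random 4-tuples of thresholds and keep the one with the best kappa."""
    rng = random.Random(seed)
    low, high = min(activations), max(activations)
    best_kappa, best_thresholds = -2.0, None
    for _ in range(trials):
        thresholds = sorted(rng.uniform(low, high) for _ in range(4))
        predicted = [apply_thresholds(a, thresholds) for a in activations]
        kappa = quadratic_weighted_kappa(labels, predicted)
        if kappa > best_kappa:
            best_kappa, best_thresholds = kappa, thresholds
    return best_thresholds, best_kappa


# Toy validation data: activations scattered around the true level.
labels = [level for level in range(5) for _ in range(3)]
activations = [level + offset for level in range(5) for offset in (-0.2, 0.0, 0.2)]
thresholds, kappa = search_thresholds(activations, labels)
print([round(t, 2) for t in thresholds], round(kappa, 3))
```

The thresholds found on the validation set are then applied unchanged to the activations of the test images.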

Attempts to ensemble

Usually it is possible to improve the scores by merging several models. This is called ensembling. For example, the 3rd place winners of this contest have merged the results of 9 convolutional networks.

We developed a couple of ways to merge the results from two networks, but they didn’t work well for us. They gave very small improvements (less than 0.01) only when both networks gave similar kappa scores. When one network was clearly stronger than the other one, the ensemble didn’t help at all. One of our ensemble methods was an extension of the “thresholding” method described in the previous section to 2 dimensions. We plotted the images on a 2D plane so that each coordinate corresponds to the neuron activation of one model. Then we looked for random lines that split the plane in a way that maximizes the kappa score. We tried two methods of splitting the plane, which are demonstrated below. Each blue dot corresponds to an image of label 0; orange dots correspond to images having label 4.

We didn’t try to merge more than 2 networks at once. Probably this was another mistake.

The only method of ensembling that worked for us was to take an average over 4 rotated / flipped versions of the images. We also tried taking the minimum, maximum and harmonic mean of the neuron activations. Minimum and maximum brought a 0.01 improvement to the kappa score, while harmonic and arithmetic means brought a 0.02 improvement. The best result we achieved used the arithmetic mean. Note that this required 4 versions of the test images (which took 2 days to rotate / flip) and running the network on all versions (which took another day).
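Combining the activations of the rotated / flipped versions is a one-line reduction per method. The sketch below is only an illustration in Python with NumPy, not the code we ran. The function name `combine_activations` and the array layout (one row per image version, one column per image) are assumptions made for the example.

```python
import numpy as np


def combine_activations(activations, method="arithmetic"):
    """Merge per-version activations into one activation per image.

    activations: array of shape (n_versions, n_images), e.g. 4 rows for
    4 rotated / flipped versions of every test image.
    """
    a = np.asarray(activations, dtype=float)
    if method == "arithmetic":
        return a.mean(axis=0)
    if method == "harmonic":
        # Only meaningful when all activations are strictly positive.
        return a.shape[0] / (1.0 / a).sum(axis=0)
    if method == "min":
        return a.min(axis=0)
    if method == "max":
        return a.max(axis=0)
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    # Two versions of two images, with made-up activation values.
    versions = [[1.0, 2.0], [3.0, 6.0]]
    for method in ("arithmetic", "harmonic", "min", "max"):
        print(method, combine_activations(versions, method))
```

The merged activations are then converted to levels with the same thresholds as for a single version.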

All these experiments can be replicated in Mathematica by using the script main.nb and the required CSV files that are available on Github.

Finally, note that Mathematica is the only non-free software used in the whole training process. We believe it is better to keep the ecosystem clean :) We will probably use IPython next time.

More on this contest

Many contestants have published their solutions. Here are the ones I could find. Please let me know if I missed something. Most of the solutions are heavily influenced by the winning method of the plankton classification contest.

1st place: Min-Pooling used OpenCV to preprocess the images, augmented the dataset by scaling, skewing and rotating (and notably not by changing colors), trained several networks on his own SparseConvNet library and used random forests to combine predictions from two eyes of the same person. Kappa = 0.84958
2nd place: o_O team used Theano, Lasagne, nolearn to train an OxfordNet-like network on minimally preprocessed images. They heavily augmented the dataset. They note the importance of using larger images to achieve high scores. Kappa = 0.84479
3rd place: Reformed Gamblers team combined results of 9 convolutional networks (OxfordNet-like and others) with leaky ReLU activations and non-trivial loss functions. They used Torch on multiple GPUs. Kappa = 0.83937
Update: 4th place: Julian and Daniel gave an interview to Kaggle. They did extensive preprocessing and data augmentation, used CXXNet, PyLearn and Keras to train multiple OxfordNet-like networks. They highlight the importance of good parameter initialization.
5th place: Jeffrey De Fauw used Theano to train an OxfordNet-like network with leaky ReLU activations on a significantly augmented dataset. He also implemented a smooth approximation of the kappa metric and used it as a loss layer. Well written blog post. Kappa = 0.82899
20th place: Ilya Kavalerov, again Theano, OxfordNet, good augmentation, non-obvious loss function. Interesting read. Kappa = 0.76523
46th place: Niko Gamulin used Caffe on a GTX 980 GPU (just like us) but with the OxfordNet architecture. Kappa = 0.63129

After the contest we tried to use leaky ReLUs, something we just didn’t think of during the contest. The results are not promising. Here are the plots of the validation losses with negative slope values (ns) 0, 0.01, 0.33 and 0.5 respectively:

Finally, Hrayr suggested using different learning rates for different convolutional layers (Caffe supports this by specifying multiplication constants per layer). He used larger coefficients (12) for the first layers than for the top layers. The full prototxt file is on Github. This network allowed us to reach a kappa score of up to 0.52 on the local validation set. We didn’t try to run it on the test images, although in almost all cases our scores on the private leaderboard were higher than the scores on the local validation sets.

Acknowledgements

We would like to express gratitude to Hugo Larochelle for his excellent video course on neural networks. After watching the videos we could easily understand almost all the terms in the Caffe documentation.

We would like to thank the organizers of the contest for a great competition and the contestants for helpful discussions in forums and published solutions. We learned a lot from this contest.

Getting started with neural networks

2015-07-30T00:00:00+00:00

Who we are

We are a group of students from the department of Informatics and Applied Mathematics at Yerevan State University. In 2014, inspired by successes of neural nets in various fields, especially by GoogLeNet’s excellent performance in ImageNet 2014, we decided to dive into the topic of neural networks. We study calculus, combinatorics, graph theory, algebra and many other topics in the university but we learn nothing about machine learning. Just a few students take some ML courses from Coursera or elsewhere.

Choosing a video course

At the beginning of 2015 the Student Scientific Society of the department initiated a project to study neural networks. We had to choose some video course on the internet, then watch and discuss the videos once per week in the university. We wanted a course that would cover everything from the very basics to convolutional networks and deep learning. We followed Yoshua Bengio’s advice given during his interview on Reddit and chose this excellent class by Hugo Larochelle.

Hugo’s lectures are really great. The first two chapters teach the basic structure of neural networks and describe the backpropagation algorithm in detail. We loved that he showed the derivation of the gradients of the loss function. Because of this, Hrayr managed to implement a simple multilayer neural net on his own. The next two chapters (which we skipped) talk about Conditional Random Fields. The fifth chapter introduces unsupervised learning with Restricted Boltzmann Machines. This was the hardest part for us, mainly because of our lack of knowledge in probabilistic graphical models. The sixth chapter on autoencoders is our favorite: the magic of denoising autoencoders is very surprising. Then there are chapters on deep learning, another unsupervised learning technique called sparse coding (which we also skipped due to time limits) and computer vision (with strong emphasis on convolutional networks). The last chapter is about natural language processing.

The lectures contain lots of references to papers and demonstrations, the slides are full of visualizations and graphs, and, last but not least, Hugo kindly answers all questions posed in the comments of the Youtube videos. After watching the chapter on convolutional networks we decided to apply what we learned to a computer vision contest. We looked at the list of active competitions on Kaggle and the only one related to computer vision was the Diabetic retinopathy detection contest. It seemed very hard for a first project in neural nets, but we decided to try. We’ll describe our experience with this contest in the next post.
