YerevaNN 2017-11-08T14:53:49+00:00 http://yerevann.github.io/ Hrant Khachatrian hrant.khachatrian@ysu.am Challenges of reproducing R-NET neural network using Keras 2017-08-25T00:00:00+00:00 http://yerevann.github.io//2017/08/25/challenges-of-reproducing-r-net-neural-network-using-keras <p>By <a href="https://github.com/MartinXPN">Martin Mirakyan</a>, <a href="https://github.com/mahnerak">Karen Hambardzumyan</a> and <a href="https://github.com/Hrant-Khachatrian">Hrant Khachatrian</a>.</p> <p>In this post we describe our attempt to re-implement a neural architecture for automated question answering called <a href="https://www.microsoft.com/en-us/research/publication/mrc/">R-NET</a>, which was developed by the Natural Language Computing Group of Microsoft Research Asia. This architecture demonstrates the best performance among single models (not ensembles) on the Stanford Question Answering Dataset (SQuAD) as of August 25, 2017. MSR researchers released a <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf">technical report</a> describing the model but did not release the code. We tried to implement the architecture in the Keras framework and reproduce their results. 
This post describes the model and the challenges we faced while implementing it <a class="nav-link" href="https://github.com/YerevaNN/R-NET-in-Keras">[<span class="hidden-xs-down">View on GitHub </span><svg version="1.1" width="16" height="16" viewBox="0 0 16 16" class="octicon octicon-mark-github" aria-hidden="true"><path fill-rule="evenodd" fill="#268bd2" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z"></path></svg>]</a>.</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#problem-statement" id="markdown-toc-problem-statement">Problem statement</a></li> <li><a href="#the-architecture-of-r-net" id="markdown-toc-the-architecture-of-r-net">The architecture of R-NET</a> <ul> <li><a href="#drawing-complex-recurrent-networks" id="markdown-toc-drawing-complex-recurrent-networks">Drawing complex recurrent networks</a></li> <li><a href="#1-question-and-passage-encoder" id="markdown-toc-1-question-and-passage-encoder">1. Question and passage encoder</a></li> <li><a href="#2-obtain-question-aware-representation-for-the-passage" id="markdown-toc-2-obtain-question-aware-representation-for-the-passage">2. Obtain question aware representation for the passage</a></li> <li><a href="#3-apply-self-matching-attention-on-the-passage-to-get-its-final-representation" id="markdown-toc-3-apply-self-matching-attention-on-the-passage-to-get-its-final-representation">3. 
Apply self-matching attention on the passage to get its final representation</a></li> <li><a href="#4-predict-the-interval-which-contains-the-answer-of-a-question" id="markdown-toc-4-predict-the-interval-which-contains-the-answer-of-a-question">4. Predict the interval which contains the answer of a question</a></li> </ul> </li> <li><a href="#implementation-details" id="markdown-toc-implementation-details">Implementation details</a> <ul> <li><a href="#layers-with-masking-support" id="markdown-toc-layers-with-masking-support">Layers with masking support</a></li> <li><a href="#slice-layer" id="markdown-toc-slice-layer">Slice layer</a></li> <li><a href="#generators" id="markdown-toc-generators">Generators</a></li> <li><a href="#bidirectional-grus" id="markdown-toc-bidirectional-grus">Bidirectional GRUs</a></li> <li><a href="#dropout" id="markdown-toc-dropout">Dropout</a></li> <li><a href="#weight-sharing" id="markdown-toc-weight-sharing">Weight sharing</a></li> <li><a href="#hyperparameters" id="markdown-toc-hyperparameters">Hyperparameters</a></li> <li><a href="#weight-initialization" id="markdown-toc-weight-initialization">Weight initialization</a></li> <li><a href="#training" id="markdown-toc-training">Training</a></li> </ul> </li> <li><a href="#results-and-comparison-with-r-net-technical-report" id="markdown-toc-results-and-comparison-with-r-net-technical-report">Results and comparison with <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf">R-NET technical report</a></a></li> <li><a href="#challenges-of-reproducibility" id="markdown-toc-challenges-of-reproducibility">Challenges of reproducibility</a></li> </ul> <h2 id="problem-statement">Problem statement</h2> <p>Given a passage and a question, the task is to predict an answer to the question based on the information found in the passage. The SQuAD dataset further constrains the answer to be a continuous sub-span of the provided passage. 
Answers usually include non-entities and can be long phrases. The neural network needs to “understand” both the passage and the question to give a valid answer. Here is an example from the dataset.</p> <p><strong>Passage:</strong> Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla’s breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.</p> <p><strong>Question:</strong> On what did Tesla blame for the loss of the initial money? <strong>Answer:</strong> Panic of 1901</p> <h2 id="the-architecture-of-r-net">The architecture of R-NET</h2> <p>The <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py">architecture</a> of the R-NET network is designed to take the question and the passage as inputs and to output an interval on the passage that contains the answer. The process consists of several steps:</p> <ol> <li>Encode the question and the passage</li> <li>Obtain question aware representation for the passage</li> <li>Apply self-matching attention on the passage to get its final representation</li> <li>Predict the interval which contains the answer to the question</li> </ol> <p>Each of these steps is implemented as some sort of recurrent neural network. The model is trained end-to-end.</p> <h3 id="drawing-complex-recurrent-networks">Drawing complex recurrent networks</h3> <p>We use <a href="https://arxiv.org/abs/1412.3555">GRU</a> (Gated Recurrent Unit) cells for all RNNs. 
The authors of the linked paper claim that GRUs perform similarly to LSTMs but are computationally cheaper.</p> <p><img src="https://rawgit.com/YerevaNN/yerevann.github.io/master/public/2017-08-22/GRU.svg" alt="GRU network" title="GRU network" /></p> <p>Most of the modules of R-NET are implemented as recurrent networks with complex cells. We draw these cells using colorful charts. Here is a chart that corresponds to the original GRU cell.</p> <p><img src="https://rawgit.com/YerevaNN/yerevann.github.io/master/public/2017-08-22/GRUcell.svg" alt="GRU cell" title="GRU cell" /></p> <p>White rectangles represent operations on tensors (dot product, sum, etc.). Yellow rectangles are activations (tanh, softmax or sigmoid). Orange circles are the weights of the network. Compare this to the formulas of the GRU cell (taken from <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Olah’s famous blogpost</a>):</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \large z_t &=\sigma(W_z \cdot [h_{t-1}, x_t]) \\ r_t &=\sigma(W_r \cdot [h_{t-1}, x_t]) \\ \tilde{h}_t &= \tanh(W \cdot [r_t \circ h_{t-1}, x_t]) \\ h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \end{aligned} %]]></script> <p>Some parts of the R-NET architecture require tensors that are neither part of a GRU state nor part of an input at time <script type="math/tex">t</script>. These are “global” variables that are used in all timesteps. Following <a href="http://deeplearning.net/software/theano/library/scan.html">Theano’s terminology</a>, we call these global variables <em>non-sequences</em>.</p> <p>To make it easier to create GRU cells with additional features and operations we’ve created a <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/WrappedGRU.py">utility class called <strong>WrappedGRU</strong></a> which is a base class for all GRU modules. WrappedGRU supports operations with non-sequences and sharing weights between modules. 
Keras doesn’t directly support weight sharing, but it does support layer sharing, so we use a <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/SharedWeight.py">SharedWeight layer</a> to solve this problem (SharedWeight is a layer that has no inputs and returns a tensor of weights). WrappedGRU supports taking SharedWeight as an input.</p> <h3 id="1-question-and-passage-encoder">1. Question and passage encoder</h3> <p>This step consists of two parts: <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/preprocessing.py">preprocessing</a> and text encoding. The preprocessing is done in a separate process and is not part of the neural network. First we preprocess the data by splitting it into parts, and then we convert all the words to corresponding vectors. Word vectors are generated using <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/preprocessing.py#L35">gensim</a>.</p> <p>The next steps are already part of the model. Each word is represented by a concatenation of two vectors: its GloVe vector and another vector that holds character-level information. To obtain character-level embeddings we use an Embedding layer followed by a Bidirectional GRU cell wrapped inside a TimeDistributed layer. Basically, each character is embedded in an <script type="math/tex">H</script>-dimensional space, and a BiGRU runs over those embeddings to produce a vector for the word. 
The process is repeated for all the words using the TimeDistributed layer.</p> <p><a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L62">Code on GitHub</a></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TimeDistributed</span><span class="p">(</span><span class="n">Sequential</span><span class="p">([</span>
    <span class="n">InputLayer</span><span class="p">(</span><span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">C</span><span class="p">,),</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'int32'</span><span class="p">),</span>
    <span class="n">Embedding</span><span class="p">(</span><span class="n">input_dim</span><span class="o">=</span><span class="mi">127</span><span class="p">,</span> <span class="n">output_dim</span><span class="o">=</span><span class="n">H</span><span class="p">,</span> <span class="n">mask_zero</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
    <span class="n">Bidirectional</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">H</span><span class="p">))</span>
<span class="p">]))</span>
</code></pre></div></div> <p>When a word is missing from GloVe, we set its word vector to all zeros (as described in the technical report).</p> <p>Following the notation of the paper, we denote the vector representation of the question by <script type="math/tex">u^Q</script> and the representation of the passage by <script type="math/tex">u^P</script> (<script type="math/tex">Q</script> corresponds to the question and <script type="math/tex">P</script> corresponds to the passage).</p> <p>The network takes the preprocessed question <script type="math/tex">Q</script> and the passage <script type="math/tex">P</script>, applies masking to each one and then encodes them with 3 
consecutive bidirectional GRU layers.</p> <p><a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L81">Code on GitHub</a></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Encode the passage P</span>
<span class="n">uP</span> <span class="o">=</span> <span class="n">Masking</span><span class="p">()</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="n">uP</span> <span class="o">=</span> <span class="n">Bidirectional</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">H</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">))</span> <span class="p">(</span><span class="n">uP</span><span class="p">)</span>
<span class="n">uP</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'uP'</span><span class="p">)</span> <span class="p">(</span><span class="n">uP</span><span class="p">)</span>

<span class="c"># Encode the question Q</span>
<span class="n">uQ</span> <span class="o">=</span> <span class="n">Masking</span><span class="p">()</span> <span class="p">(</span><span class="n">Q</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="n">uQ</span> <span class="o">=</span> <span class="n">Bidirectional</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">H</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">))</span> <span class="p">(</span><span class="n">uQ</span><span class="p">)</span>
<span class="n">uQ</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'uQ'</span><span class="p">)</span> <span class="p">(</span><span class="n">uQ</span><span class="p">)</span>
</code></pre></div></div> <p>After encoding the passage and the question we finally have their vector representations <script type="math/tex">u^P</script> and <script type="math/tex">u^Q</script>. Now we can look more closely at the meaning of the passage with the question in mind.</p> <h3 id="2-obtain-question-aware-representation-for-the-passage">2. Obtain question aware representation for the passage</h3> <p>The next module computes another representation for the passage by taking into account the words of the question. 
We implement it using the following code:</p> <p><a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L97">Code on GitHub</a></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vP</span> <span class="o">=</span> <span class="n">QuestionAttnGRU</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">H</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="p">([</span>
    <span class="n">uP</span><span class="p">,</span> <span class="n">uQ</span><span class="p">,</span>
    <span class="n">WQ_u</span><span class="p">,</span> <span class="n">WP_v</span><span class="p">,</span> <span class="n">WP_u</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">W_g1</span>
<span class="p">])</span>
</code></pre></div></div> <p><a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/QuestionAttnGRU.py">QuestionAttnGRU</a> is a complex extension of a recurrent layer (it extends WrappedGRU and overrides the step method, adding operations before passing the input to the GRU cell).</p> <p><img src="https://rawgit.com/YerevaNN/yerevann.github.io/master/public/2017-08-22/QuestionAttnGRU.svg" alt="QuestionAttnGRU" title="Question Attention GRU" /></p> <p>The vectors of the question-aware representation of the passage are denoted by <script type="math/tex">v^P</script>. 
As a reminder, <script type="math/tex">u^P_t</script> is the vector representation of the <script type="math/tex">t</script>-th word of the passage <script type="math/tex">P</script>, and <script type="math/tex">u^Q</script> is the matrix representation of the question <script type="math/tex">Q</script> (each row corresponds to a single word).</p> <p>In QuestionAttnGRU we first combine three things:</p> <ol> <li>the previous state of the GRU (<script type="math/tex">v^P_{t-1}</script>)</li> <li>the matrix representation of the question (<script type="math/tex">u^Q</script>)</li> <li>the vector representation of the passage at the <script type="math/tex">t</script>-th word (<script type="math/tex">u^P_{t}</script>)</li> </ol> <p>We compute the dot product of each input with the corresponding weights, then sum the results after broadcasting them to the same shape. The outputs of dot(<script type="math/tex">u^P_{t}</script>, <script type="math/tex">W^P_{u}</script>) and dot(<script type="math/tex">v^P_{t-1}</script>, <script type="math/tex">W^P_{v}</script>) are vectors, while the output of dot(<script type="math/tex">u^Q</script>, <script type="math/tex">W^Q_{u}</script>) is a matrix, so we broadcast (repeat) the vectors to match the shape of the matrix and then sum the three matrices. Then we apply tanh activation to the result. The output of this operation is then multiplied (dot product) by a weight vector <script type="math/tex">V</script>, after which <script type="math/tex">\mathrm{softmax}</script> activation is applied. The output of the <script type="math/tex">\mathrm{softmax}</script> is a vector of non-negative numbers that sum to 1 and represent the “importance” of each word in the question. Such vectors are often called <em>attention vectors</em>. 
Taking the dot product of <script type="math/tex">u^Q</script> (the matrix representation of the question) and the attention vector yields a single vector for the entire question, which is a weighted average of the question word vectors (weighted by the attention scores). The intuition behind this part is that we get a representation of the parts of the question that are relevant to the current word of the passage. This representation, denoted by <script type="math/tex">c_{t}</script>, depends on the current word, the whole question and the previous state of the recurrent cell (formula 4 on page 3 of the <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf">report</a>).</p> <p>These ideas seem to come from a paper by <a href="https://arxiv.org/abs/1509.06664">Rocktäschel et al.</a> from DeepMind. The authors suggested passing this <script type="math/tex">c_{t}</script> vector as an input to the GRU cell. <a href="https://arxiv.org/abs/1512.08849">Wang and Jiang</a> from Singapore Management University argued that passing <script type="math/tex">c_{t}</script> is not enough, because we lose information from the “original” input <script type="math/tex">u^P_{t}</script>. So they suggested concatenating <script type="math/tex">c_{t}</script> and <script type="math/tex">u^P_{t}</script> before passing them to the GRU cell.</p> <p>The authors of R-NET went one step further. They applied an additional gate to the concatenated vector <script type="math/tex">[c_{t}, u^P_{t}]</script>. The gate is simply a dot product of a new weight matrix <script type="math/tex">W_{g}</script> and the concatenated vector, passed through a sigmoid activation function. The output of the gate is a vector of numbers between 0 and 1, which is then (element-wise) multiplied by the original concatenated vector (see formula 6 on page 4 of the <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf">report</a>). 
The result of this multiplication is finally passed to the GRU cell as an input.</p> <h3 id="3-apply-self-matching-attention-on-the-passage-to-get-its-final-representation">3. Apply self-matching attention on the passage to get its final representation</h3> <p>Next, the authors suggest adding a self-attention mechanism on the passage itself.</p> <p><a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L105">Code on GitHub</a></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hP</span> <span class="o">=</span> <span class="n">Bidirectional</span><span class="p">(</span><span class="n">SelfAttnGRU</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="n">H</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span> <span class="p">([</span>
    <span class="n">vP</span><span class="p">,</span> <span class="n">vP</span><span class="p">,</span>
    <span class="n">WP_v</span><span class="p">,</span> <span class="n">WPP_v</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">W_g2</span>
<span class="p">])</span>
<span class="n">hP</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'hP'</span><span class="p">)</span> <span class="p">(</span><span class="n">hP</span><span class="p">)</span>
</code></pre></div></div> <p>The output of the previous step (Question attention) is denoted by <script type="math/tex">v^P</script>. It represents the encoding of the passage while taking into account the question. 
<script type="math/tex">v^P</script> is passed as an input to the self-matching attention module (it serves as both the top input and the left input in the chart below). The authors argue that the vectors <script type="math/tex">v^P_{t}</script> have very limited information about the context. The <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/SelfAttnGRU.py">self-matching attention module</a> attempts to augment the passage vectors with information from other relevant parts of the passage.</p> <p>The output of the self-matching GRU cell at time <script type="math/tex">t</script> is denoted by <script type="math/tex">h^P_{t}</script>.</p> <p><img src="https://rawgit.com/YerevaNN/yerevann.github.io/master/public/2017-08-22/SelfAttnGRU.svg" alt="SelfAttnGRU" title="Self-matching Attention GRU" /></p> <p>The implementation is very similar to the previous module. We compute dot products of weights <script type="math/tex">W^{PP}_{v}</script> with the current word vector <script type="math/tex">v^P_{t}</script>, and <script type="math/tex">W^P_{v}</script> with the entire <script type="math/tex">v^P</script> matrix, then add them up and apply <script type="math/tex">\tanh{}</script> activation. Next, the result is multiplied by a weight vector <script type="math/tex">V</script> and passed through <script type="math/tex">\mathrm{softmax}</script> activation, which produces an attention vector. The dot product of the attention vector and the <script type="math/tex">v^P</script> matrix, again denoted by <script type="math/tex">c_{t}</script>, is a weighted average of the passage word vectors, with higher weights on the words that are relevant to the current word <script type="math/tex">v^P_{t}</script>. <script type="math/tex">c_{t}</script> is then concatenated with <script type="math/tex">v^P_{t}</script> itself. 
The concatenated vector is passed through a gate and is given to the GRU cell as an input.</p> <p>The authors consider this step their main contribution to the architecture.</p> <p>It is interesting to note that the authors write <code class="highlighter-rouge">BiRNN</code> in Section 3.3 (Self-Matching Attention) and just <code class="highlighter-rouge">RNN</code> in Section 3.2 (which describes question-aware passage representation). For that reason we used BiGRU in SelfAttnGRU and a unidirectional GRU in QuestionAttnGRU. Later we discovered a sentence in Section 4.1 which suggests that we were not correct: <code class="highlighter-rouge">the gated attention-based recurrent network for question and passage matching is also encoded bidirectionally in our experiment</code>.</p> <h3 id="4-predict-the-interval-which-contains-the-answer-of-a-question">4. Predict the interval which contains the answer of a question</h3> <p>Finally we’re ready to predict the interval of the passage which contains the answer to the question. 
To do this we use a <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/QuestionPooling.py">QuestionPooling layer</a> followed by PointerGRU (<a href="https://arxiv.org/abs/1506.03134">Vinyals et al., Pointer networks, 2015</a>).</p> <p><a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L118">Code on GitHub</a></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rQ</span> <span class="o">=</span> <span class="n">QuestionPooling</span><span class="p">()</span> <span class="p">([</span><span class="n">uQ</span><span class="p">,</span> <span class="n">WQ_u</span><span class="p">,</span> <span class="n">WQ_v</span><span class="p">,</span> <span class="n">v</span><span class="p">])</span>
<span class="n">rQ</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">dropout_rate</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'rQ'</span><span class="p">)</span> <span class="p">(</span><span class="n">rQ</span><span class="p">)</span>

<span class="o">...</span>

<span class="n">ps</span> <span class="o">=</span> <span class="n">PointerGRU</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">2</span> <span class="o">*</span> <span class="n">H</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">initial_state_provided</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'ps'</span><span class="p">)</span> <span class="p">([</span>
    <span class="n">fake_input</span><span class="p">,</span> <span class="n">hP</span><span class="p">,</span>
    <span class="n">WP_h</span><span class="p">,</span> <span class="n">Wa_h</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">rQ</span>
<span class="p">])</span>

<span class="n">answer_start</span> <span class="o">=</span> <span class="n">Slice</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'answer_start '</span><span class="p">)</span> <span class="p">(</span><span class="n">ps</span><span class="p">)</span>
<span class="n">answer_end</span> <span class="o">=</span> <span class="n">Slice</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'answer_end'</span><span class="p">)</span> <span class="p">(</span><span class="n">ps</span><span class="p">)</span>
</code></pre></div></div> <p>QuestionPooling is the attention pooling of the whole question representation <script type="math/tex">u^Q</script>. Its purpose is to create the first hidden state of PointerGRU. It is similar to the other attention-based modules, but has a strange description in the report. Formula 11 on page 5 includes a product of two tensors <script type="math/tex">W_v^Q</script> and <script type="math/tex">V_r^Q</script>. Both of these tensors are trainable parameters (as confirmed by Furu Wei, one of the coauthors of the technical report), and it is not clear why this dot product is not replaced by a single trainable vector.</p> <p><script type="math/tex">h^P</script> is the output of the previous module and it contains the final representation of the passage. It is passed to this module as an input to obtain the final answer.</p> <p>In Section 4.2 of the <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf">technical report</a> the authors write that after submitting their paper to ACL they made one more modification. 
They added <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L114-L116">another bidirectional GRU</a> on top of <script type="math/tex">h^P</script> before feeding it to PointerGRU.</p> <p><img src="https://rawgit.com/YerevaNN/yerevann.github.io/master//public/2017-08-22/PointerGRU.svg" alt="PointerGRU" title="Pointer GRU" /></p> <p><a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/PointerGRU.py">PointerGRU</a> is a recurrent network that runs for just two steps. The first step predicts the first word of the answer span, and the second step predicts the last word. Here is how it works. Both <script type="math/tex">h^P</script> and the previous state of the PointerGRU cell are multiplied by their corresponding weights <script type="math/tex">W^P_{h}</script> and <script type="math/tex">W^a_{h}</script>. Recall that the initial hidden state of the PointerGRU is the output of QuestionPooling. The products are then summed up and passed through <script type="math/tex">\tanh</script> activation. The result is multiplied by the weight vector <script type="math/tex">V</script> and <script type="math/tex">\mathrm{softmax}</script> activation is applied, which outputs scores over <script type="math/tex">h^P</script>. These scores, denoted by <script type="math/tex">a^t</script>, are probabilities over the words of the passage. The argmax of the <script type="math/tex">a^1</script> vector is the predicted starting point, and the argmax of <script type="math/tex">a^2</script> is the predicted final point of the answer (formula 9 on page 4 of the <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf">report</a>). 
The hidden state of PointerGRU is determined based on the dot product of <script type="math/tex">h^P</script> and <script type="math/tex">a^t</script>, which is passed as an input to a simple GRU cell (formula 10 on page 4 of the <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf">report</a>). So, unlike all previous modules of R-NET, the <em>output</em> of PointerGRU (the red diamond at the top-right corner of the chart) is different from its hidden state.</p> <h2 id="implementation-details">Implementation details</h2> <p>We use the Theano backend for Keras. It was faster than TensorFlow in our experiments. Our experience shows that TensorFlow is usually faster for simple network architectures. Probably Theano’s optimization process is more efficient for complex extensions of recurrent networks.</p> <h4 id="layers-with-masking-support">Layers with masking support</h4> <p>One of the most important challenges in training recurrent networks is handling data points of different lengths in a single batch. Keras has a <a href="https://keras.io/layers/core/#masking">Masking layer</a> that handles the basic cases. We use it in the <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L81">encoding layer</a>. But R-NET has more complex scenarios for which we had to develop our own solutions. For example, in all attention pooling modules we use <script type="math/tex">\mathrm{softmax}</script> which is applied along the “time” axis (e.g. over the words of the passage). We don’t want to assign positive probabilities to the padding positions after the last word of the sentence. 
So we have implemented a <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/helpers.py#L7">custom Softmax function</a> which supports masking:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>

<span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span><span class="p">,</span> <span class="n">mask</span><span class="p">):</span>
    <span class="c"># Subtract the maximum for numerical stability</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="n">axis</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="c"># Zero out the positions outside the mask</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">m</span><span class="p">)</span> <span class="o">*</span> <span class="n">mask</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="n">axis</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="c"># Avoid division by zero when everything is masked out</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">K</span><span class="o">.</span><span class="n">epsilon</span><span class="p">(),</span> <span class="bp">None</span><span class="p">)</span>
    <span 
class="k">return</span> <span class="n">e</span> <span class="o">/</span> <span class="n">s</span> </code></pre></div></div> <p><code class="highlighter-rouge">m</code> is used for numerical stability. To support masking we <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/helpers.py#L15">multiply</a> <code class="highlighter-rouge">e</code> by the mask. We also clip <code class="highlighter-rouge">s</code> by a very small number, because in theory it is possible that all positive values of <code class="highlighter-rouge">e</code> are outside the mask.</p> <p>Note that details like this are not described in the technical report. Probably these are considered as commonly known tricks. But sometimes the details of the masking process can have critical effects on the results (we know this from the work on <a href="https://arxiv.org/abs/1703.07771">medical time series</a>).</p> <h4 id="slice-layer">Slice layer</h4> <p><a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/Slice.py">Slice layer</a> is supposed to slice and return the input tensor at the given indices. It also supports masking. The slice layer in R-NET model is needed to extract the final answer (i.e. the <code class="highlighter-rouge">interval_start</code> and <code class="highlighter-rouge">interval_end</code> numbers). The final output of the model is a tensor with shape <code class="highlighter-rouge">(batch x 2 x passage_length)</code>. The first row contains probabilities for <code class="highlighter-rouge">answer_start</code> and the second one for <code class="highlighter-rouge">answer_end</code>, that’s why we need to <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L134-L135">slice</a> the rows first and then extract the required information. 
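</p> <p>As a rough illustration of the shapes involved, here is a minimal plain-Python sketch. It is not the Keras <code class="highlighter-rouge">Slice</code> layer from the repository; the toy probabilities and the helper name <code class="highlighter-rouge">argmax</code> are assumptions made for this example.</p>

```python
# Hypothetical model output for a batch of 2 passages of length 4,
# laid out as (batch, 2, passage_length):
# row 0 holds answer_start probabilities, row 1 holds answer_end probabilities.
output = [
    [[0.1, 0.6, 0.2, 0.1],
     [0.1, 0.1, 0.7, 0.1]],
    [[0.7, 0.1, 0.1, 0.1],
     [0.2, 0.5, 0.2, 0.1]],
]

# Slice the two rows of every sample.
start_probs = [sample[0] for sample in output]  # (batch, passage_length)
end_probs = [sample[1] for sample in output]    # (batch, passage_length)


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


# The predicted span boundaries are the most probable positions.
answer_start = [argmax(row) for row in start_probs]
answer_end = [argmax(row) for row in end_probs]
```

<p>For this toy batch the predicted spans are positions 1 to 2 for the first passage and 0 to 1 for the second.</p> <p>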
Obviously we could accomplish the task without creating a new layer, yet it wouldn’t be a “Kerasic” solution.</p> <h4 id="generators">Generators</h4> <p>Keras supports <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/data.py#L46">batch generators</a> which are responsible for generating one batch per iteration. One benefit of this approach is that the generator runs in a separate thread and does not wait for the network to finish its training on the previous batch.</p> <h4 id="bidirectional-grus">Bidirectional GRUs</h4> <p>R-NET uses multiple bidirectional GRUs. The common way of implementing a BiRNN is to take two copies of the same network (without sharing the weights), run one of them on the reversed sequence, and then concatenate the hidden states to produce the output. One can take the sum of the vectors instead of concatenating them, but concatenation seems to be more popular (that’s the default behavior of the <a href="https://keras.io/layers/wrappers/">Bidirectional layer</a> in Keras).</p> <h4 id="dropout">Dropout</h4> <p>The report indicates that dropout is applied “between layers with a dropout rate of 0.2”. We have applied dropout <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L85">before each of the three layers</a> of BiGRUs of both encoders, at the <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L87">outputs of both encoders</a>, right <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L103">after QuestionAttnGRU</a>, <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L112">after SelfAttnGRU</a> and <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/model.py#L119">after QuestionPooling</a> layer. We are not sure that this is exactly what the authors did.</p> <p>One more implementation detail is related to the way dropout is applied on the passage and question representation matrices. 
The rows of these matrices correspond to different words and the “vanilla” dropout will apply different masks on different words. These matrices are used as inputs to recurrent networks. But it is a common trick to apply the same mask at each “timestep”, i.e. each word. That’s how dropout is implemented in <a href="https://github.com/fchollet/keras/blob/master/keras/layers/recurrent.py#L15">recurrent layers in Keras</a>. The report doesn’t discuss these details.</p> <h4 id="weight-sharing">Weight sharing</h4> <p>The report doesn’t explicitly describe which weights are shared. We have decided to share those weights that are represented by the same symbol in the report. Note that the authors use the same symbol (e.g. <script type="math/tex">c_{t}</script>) for different variables (not weights) that obviously cannot be shared. But we hope that our assumption is true for weights. In particular, we share:</p> <ul> <li><script type="math/tex">W^Q_{u}</script> matrix between <code class="highlighter-rouge">QuestionAttnGRU</code> and <code class="highlighter-rouge">QuestionPooling</code> layers,</li> <li><script type="math/tex">W^P_{v}</script> matrix between <code class="highlighter-rouge">QuestionAttnGRU</code> and <code class="highlighter-rouge">SelfAttnGRU</code> layers,</li> <li><script type="math/tex">V</script> vector between all four instances (it is used right before applying softmax).</li> </ul> <p>We didn’t share the weights of the “attention gates”: <script type="math/tex">W_{g}</script>. The reason is that we have a mix of uni- and bidirectional GRUs that use this gate and require different dimensions.</p> <h4 id="hyperparameters">Hyperparameters</h4> <p>The authors of the report give many details about hyperparameters. Hidden vector lengths are 75 for all layers. As we concatenate the hidden states of the two GRUs in a bidirectional layer, we effectively get 150 dimensional vectors. 
75 is not an even number so it could not refer to the length of the concatenated vector :) <a href="http://ruder.io/optimizing-gradient-descent/index.html#adadelta">AdaDelta optimizer</a> is used to train the network with learning rate=1, <script type="math/tex">\rho=0.95</script> and <script type="math/tex">\varepsilon=10^{-6}</script>. Nothing is written about the size of batches, or the way batches are sampled. We used <code class="highlighter-rouge">batch_size=50</code> in our experiments to fit in 4GB of GPU memory.</p> <p>We couldn’t get good performance with <code class="highlighter-rouge">75</code> hidden units. The models were quickly overfitting. We got our best results using <code class="highlighter-rouge">45</code> dimensional hidden states.</p> <h4 id="weight-initialization">Weight initialization</h4> <p>The report doesn’t discuss weight initialization. We used default initialization schemes of Keras. In particular, Keras uses orthogonal initialization for recurrent connections of GRU, and uniform (<a href="http://www.jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Glorot, Bengio, 2010</a>) initialization for the connections that come from the inputs. We used Glorot initialization for <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/layers/SharedWeight.py#L12">all shared weights</a>. It is not obvious that this was the best solution.</p> <h4 id="training">Training</h4> <p>The <a href="https://github.com/YerevaNN/R-NET-in-Keras/blob/master/train.py">training script</a> is very simple. 
First we create the model:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">RNet</span><span class="p">(</span><span class="n">hdim</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">hdim</span><span class="p">,</span> <span class="c"># Default is 45</span>
             <span class="n">dropout_rate</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">dropout</span><span class="p">,</span> <span class="c"># Default is 0 (0.2 in the report)</span>
             <span class="n">N</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="c"># Size of passage</span>
             <span class="n">M</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="c"># Size of question</span>
             <span class="n">char_level_embeddings</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">char_level_embeddings</span><span class="p">)</span> <span class="c"># Default is false</span>
</code></pre></div></div> <p>It is possible to slightly speed up computations by fixing <code class="highlighter-rouge">M</code> and <code class="highlighter-rouge">N</code>. It usually helps Theano’s compiler to further optimize the computational graph.</p> <p>We compile the model and fit it on the training set. Our training data is 90% of the original training set of the SQuAD dataset. The other 10% is used as an internal validation dataset. We check the validation score after each epoch and save the current state of the model if it is better than the previous best one. The original <em>development set</em> of SQuAD is used as a test set; we don’t do model selection based on it.</p> <p>We had an idea to form the batches in a way that passages inside each batch have almost the same number of words. 
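</p> <p>A minimal sketch of this batching idea in plain Python, assuming each passage is given as a list of tokens. The helper <code class="highlighter-rouge">length_bucketed_batches</code> and the toy data are hypothetical and are not part of the repository.</p>

```python
import random


def length_bucketed_batches(passages, batch_size, seed=0):
    """Return batches of indices so that each batch holds passages of similar length."""
    # Sort indices by passage length so that neighbours have similar sizes.
    order = sorted(range(len(passages)), key=lambda i: len(passages[i]))
    # Cut the sorted order into consecutive chunks of batch_size indices.
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle the order of the batches (not their contents) between epochs.
    random.Random(seed).shuffle(batches)
    return batches


# Toy passages with 5, 120, 7, 118, 6 and 119 words.
passages = [["w"] * n for n in (5, 120, 7, 118, 6, 119)]
batches = length_bucketed_batches(passages, batch_size=3)
lengths_per_batch = [sorted(len(passages[i]) for i in batch) for batch in batches]
```

<p>In this toy run the short passages end up in one batch and the long ones in the other, so the short batch needs almost no padding.</p> <p>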
That would allow us to train a little bit faster (as there would be many batches with short sequences), but we haven’t used this trick yet. We took at most 300 words from passages and 30 words from questions to avoid very long sequences.</p> <p>Each epoch took around 100 minutes on a GTX980 GPU. We got our best results after 31 epochs.</p> <h2 id="results-and-comparison-with-r-net-technical-report">Results and comparison with <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf">R-NET technical report</a></h2> <p>R-NET is currently (August 2017) the <a href="https://rajpurkar.github.io/SQuAD-explorer/">best model on Stanford QA</a> benchmark among single models. The SQuAD dataset uses two performance metrics: exact match (EM) and F1-score (F1). Human performance is estimated to be EM=82.3% and F1=91.2% on the test set.</p> <p>The report by Microsoft Research describes two versions of R-NET. The first one is called <em>R-NET (Wang et al., 2017)</em> (which refers to a paper that is not yet available online) and reaches EM=71.3% and F1=79.7% on the test set. It is the model we described above without the additional BiGRU between SelfAttnGRU and PointerGRU. The second version, called <em>R-NET (March 2017)</em>, has the additional BiGRU and reaches EM=72.3% and F1=80.7%. The current best single model on the SQuAD leaderboard has a higher score, which means R-NET development has continued since the technical report was released. Ensemble models reach even higher scores.</p> <p>The best performance we got so far with our implementation is EM=57.52% and F1=67.42% on the development set. These results would put our implementation of R-NET at the bottom of the SQuAD leaderboard. The model is available on <a href="https://github.com/YerevaNN/R-NET-in-Keras">GitHub</a>. We want to emphasize that R-NET’s technical report is pretty good in terms of the reported details of the architecture compared to many other papers. 
Probably we misunderstood several important details or have bugs in the code. Any feedback will be appreciated.</p> <h2 id="challenges-of-reproducibility">Challenges of reproducibility</h2> <p>Recently, ICML 2017 hosted a special <a href="https://sites.google.com/view/icml-reproducibility-workshop/home">workshop</a> devoted to the issues of reproducibility in machine learning. Hugo Larochelle shared the <a href="https://drive.google.com/file/d/0B8lLzpxgRHNQZ0paZWQ0cTcxMlNYYnc0TnpHekMxMjVBckVR/view">slides of his presentation</a>, where he discussed many aspects of the problem. He argues that research should be considered reproducible if the code is open-sourced. On the other hand, he suggests that the community should not require researchers to compare their new models with a related published result if the code for the latter is not available.</p> <p>As a radical solution he suggests using platforms like <a href="http://ai-on.org/">AI-ON</a>. AI-ON is open-sourcing not only the code, but the whole research process, including discussions and code experiments. We are thinking about starting AI-ON projects just for reproducing the results of important papers that come without code.</p> <p>On the other hand, there are many simple tricks that can significantly improve reproducibility with little effort. For example, many papers report the number of parameters in the neural network. This number is a good checksum for other people. Another simple trick is to write the shapes of the tensors in the diagrams (just like we did in this post) or even in the text.</p> <p>The best open source model on SQuAD that we are aware of is the implementation of <a href="https://arxiv.org/abs/1704.00051">DrQA architecture</a> released in Facebook’s <a href="https://github.com/facebookresearch/ParlAI">ParlAI repository</a>. It <a href="https://github.com/facebookresearch/ParlAI/issues/109">reaches</a> EM=66.4% and F1=76.5%. 
We will continue to play with our codebase and try to improve the results.</p> Interpreting neurons in an LSTM network 2017-06-27T00:00:00+00:00 http://yerevann.github.io//2017/06/27/interpreting-neurons-in-an-LSTM-network <p>By <a href="https://github.com/TigranGalstyan">Tigran Galstyan</a> and <a href="https://github.com/Hrant-Khachatrian">Hrant Khachatrian</a>.</p> <p>A few months ago, we showed how effectively an LSTM network can perform text <a href="http://yerevann.github.io/2016/09/09/automatic-transliteration-with-lstm/"> transliteration</a>.</p> <p>For humans, transliteration is a relatively easy and interpretable task, so it’s a good task for interpreting what the network is doing, and whether it is similar to how humans approach the same task.</p> <p>In this post we’ll try to understand: What do individual neurons of the network actually learn? How are they used to make decisions?</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#transliteration" id="markdown-toc-transliteration">Transliteration</a></li> <li><a href="#network-architecture" id="markdown-toc-network-architecture">Network architecture</a></li> <li><a href="#analyzing-the-neurons" id="markdown-toc-analyzing-the-neurons">Analyzing the neurons</a> <ul> <li><a href="#how-does-t-become-ծ" id="markdown-toc-how-does-t-become-ծ">How does “t” become “ծ”?</a></li> <li><a href="#what-did-this-neuron-learn" id="markdown-toc-what-did-this-neuron-learn">What did this neuron learn?</a></li> </ul> </li> <li><a href="#visualizing-lstm-cells" id="markdown-toc-visualizing-lstm-cells">Visualizing LSTM cells</a></li> <li><a href="#concluding-remarks" id="markdown-toc-concluding-remarks">Concluding remarks</a></li> </ul> <h2 id="transliteration">Transliteration</h2> <p>About half of the billions of internet users speak languages written in non-Latin alphabets, like Russian, Arabic, Chinese, Greek and Armenian. 
Very often, they haphazardly use the Latin alphabet to write those languages.</p> <p><code class="highlighter-rouge">Привет</code>: <code class="highlighter-rouge">Privet</code>, <code class="highlighter-rouge">Privyet</code>, <code class="highlighter-rouge">Priwjet</code>, …<br /> <code class="highlighter-rouge">كيف حالك</code>: <code class="highlighter-rouge">kayf halk</code>, <code class="highlighter-rouge">keyf 7alek</code>, …<br /> <code class="highlighter-rouge">Բարև Ձեզ</code>: <code class="highlighter-rouge">Barev Dzez</code>, <code class="highlighter-rouge">Barew Dzez</code>, …</p> <p>So a growing share of user-generated text content is in these “Latinized” or “romanized” formats that are difficult to parse, search or even identify. Transliteration is the task of automatically converting this content into the native canonical format.</p> <p><code class="highlighter-rouge">Aydpes aveli sirun e.</code>: <code class="highlighter-rouge">Այդպես ավելի սիրուն է:</code></p> <p>What makes this problem non-trivial?</p> <ol> <li> <p>Different users romanize in different ways, as we saw above. For example, <code class="highlighter-rouge">v</code> or <code class="highlighter-rouge">w</code> could be Armenian <code class="highlighter-rouge">վ</code>.</p> </li> <li> <p>Multiple letters can be romanized to the same Latin letter. For example, <code class="highlighter-rouge">r</code> could be Armenian <code class="highlighter-rouge">ր</code> or <code class="highlighter-rouge">ռ</code>.</p> </li> <li> <p>A single letter can be romanized to a combination of multiple Latin letters. For example, <code class="highlighter-rouge">ch</code> could be Cyrillic <code class="highlighter-rouge">ч</code> or Armenian <code class="highlighter-rouge">չ</code>, but <code class="highlighter-rouge">c</code> and <code class="highlighter-rouge">h</code> by themselves are for other letters.</p> </li> <li> <p>English words and translingual Latin tokens like URLs occur in non-Latin text. 
For example, the letters in <code class="highlighter-rouge">youtube.com</code> or <code class="highlighter-rouge">MSFT</code> should not be changed.</p> </li> </ol> <p>Humans are great at resolving these ambiguities. We showed that LSTMs can also learn to resolve all these ambiguities, at least for Armenian. For example, our model correctly transliterated <code class="highlighter-rouge">es sirum em Deep Learning</code> into <code class="highlighter-rouge">ես սիրում եմ Deep Learning</code> and not <code class="highlighter-rouge">ես սիրում եմ Դեեփ Լէարնինգ</code>.</p> <h2 id="network-architecture">Network architecture</h2> <p>We took lots of Armenian text from Wikipedia and used <a href="https://github.com/YerevaNN/translit-rnn/blob/master/languages/hy-AM/transliteration.json">probabilistic rules</a> to obtain romanized text. The rules are chosen in a way that they cover most of the romanization rules people use for Armenian.</p> <p>We encode Latin characters as one-hot vectors and apply character level bidirectional LSTM. At each time-step the network tries to guess the next character of the original Armenian sentence. Sometimes a single Armenian character is represented by multiple Latin letters, so it is very helpful to align the romanized and original texts before giving them to LSTM (otherwise we should use sequence-to-sequence networks, which are harder to train). Fortunately we can do the alignment, because the romanized version was generated by ourselves. For example, <code class="highlighter-rouge">dzi</code> should be transliterated into <code class="highlighter-rouge">ձի</code>, where <code class="highlighter-rouge">dz</code> corresponds to <code class="highlighter-rouge">ձ</code> and <code class="highlighter-rouge">i</code> to <code class="highlighter-rouge">ի</code>. 
So we add a placeholder character in the Armenian version: <code class="highlighter-rouge">ձի</code> becomes <code class="highlighter-rouge">ձ_ի</code>, so that now <code class="highlighter-rouge">z</code> should be transliterated into <code class="highlighter-rouge">_</code>. After inference we just remove <code class="highlighter-rouge">_</code>s from the output string.</p> <p>Our network consists of two LSTMs (228 cells) going forward and backward on the Latin sequence. The outputs of the LSTMs are concatenated at each step (<em>concat layer</em>), then a dense layer with 228 neurons is applied on top of it (<em>hidden layer</em>), and another dense layer (<em>output layer</em>) with softmax activations is used to get the output probabilities. We also concatenate the input vector to the hidden layer, so it has 300 neurons. This is a simplified version of the network described in our <a href="http://yerevann.github.io/2016/09/09/automatic-transliteration-with-lstm/#network-architecture">previous post</a> on this topic (the main difference is that we don’t use the second layer of biLSTM).</p> <h2 id="analyzing-the-neurons">Analyzing the neurons</h2> <p>We tried to answer the following questions:</p> <ul> <li>How does the network handle interesting cases with several possible outcomes (e.g. <code class="highlighter-rouge">r</code> =&gt; <code class="highlighter-rouge">ր</code> vs <code class="highlighter-rouge">ռ</code> etc.)?</li> <li>What are the problems particular neurons are helping solve?</li> </ul> <h3 id="how-does-t-become-ծ">How does “t” become “ծ”?</h3> <p>First, we fixed one particular character for the input and one for the output. For example we are interested in how <code class="highlighter-rouge">t</code> becomes <code class="highlighter-rouge">ծ</code> (we know <code class="highlighter-rouge">t</code> can become <code class="highlighter-rouge">տ</code>, <code class="highlighter-rouge">թ</code> or <code class="highlighter-rouge">ծ</code>). 
We know that it usually happens when <code class="highlighter-rouge">t</code> appears in a bigram <code class="highlighter-rouge">ts</code>, which should be converted to <code class="highlighter-rouge">ծ_</code>.</p> <p>For every neuron, we draw the histograms of its activations in cases where the correct output is <code class="highlighter-rouge">ծ</code>, and where the correct output is <em>not</em> <code class="highlighter-rouge">ծ</code>. For most of the neurons these two histograms are pretty similar, but there are cases like this:</p> <table> <thead> <tr> <th style="text-align: center">Input = <code class="highlighter-rouge">t</code>, Output = <code class="highlighter-rouge">ծ</code></th> <th style="text-align: center">Input = <code class="highlighter-rouge">t</code>, Output != <code class="highlighter-rouge">ծ</code></th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><img src="http://yerevann.github.io/public/2017-06-27/ts.png" alt="" /></td> <td style="text-align: center"><img src="http://yerevann.github.io/public/2017-06-27/chts.png" alt="" /></td> </tr> </tbody> </table> <p>These histograms show that by looking at the activation of this particular neuron we can guess with high accuracy whether the output for <code class="highlighter-rouge">t</code> is <code class="highlighter-rouge">ծ</code>. To quantify the difference between the two histograms we used <a href="https://en.wikipedia.org/wiki/Hellinger_distance">Hellinger distance</a> (we take the minimum and maximum values of neuron activations, split the range into 1000 bins and apply the discrete Hellinger distance formula to the two histograms). We calculated this distance for all neurons and visualized the most interesting ones in a single image:</p> <p><img src="http://yerevann.github.io/public/2017-06-27/t-ծ.png" alt="t=&gt;ծ" /></p> <p>The color of a neuron indicates the distance between its two histograms (darker colors correspond to larger distances). 
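</p> <p>The distance computation described above can be sketched in plain Python as follows. This is an illustrative reimplementation rather than our analysis code: the function name <code class="highlighter-rouge">hellinger_from_activations</code> and the toy activation values are assumptions, and only the 1000 bins and the shared minimum/maximum range come from the description.</p>

```python
import math


def hellinger_from_activations(a, b, bins=1000):
    """Discrete Hellinger distance between histograms of two activation samples."""
    # Both histograms share one range: the overall minimum and maximum.
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            # Clamp so that the maximum value falls into the last bin.
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # Normalize counts to a probability distribution.
        return [c / len(values) for c in counts]

    p, q = histogram(a), histogram(b)
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total / 2)


# Identical samples give distance 0; non-overlapping samples give distance 1.
same = hellinger_from_activations([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
disjoint = hellinger_from_activations([-0.9, -0.8], [0.8, 0.9])
```

<p>The distance is bounded between 0 and 1, which makes it convenient to map directly to a color scale.</p> <p>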
The width of a line between two neurons indicates the mean of the value that the neuron on the lower end of the connection contributes to the neuron on the higher end. Orange and green lines correspond to positive and negative signals, respectively.</p> <p>The neurons at the top of the image are from the output layer, the neurons below the output layer are from the hidden layer (top 12 neurons in terms of the distance between histograms). Concat layer comes under the hidden layer. The neurons of the concat layer are split into two parts: the left half of the neurons are the outputs of the LSTM that goes forward on the input sequence and the right half contains the neurons from the LSTM that goes backwards. From each LSTM we display top 10 neurons in terms of the distance between histograms.</p> <p>In the case of <code class="highlighter-rouge">t</code> =&gt; <code class="highlighter-rouge">ծ</code>, it is obvious that all top 12 neurons of the hidden layer pass positive signals to <code class="highlighter-rouge">ծ</code> and <code class="highlighter-rouge">ց</code> (another Armenian character that is often romanized as <code class="highlighter-rouge">ts</code>), and pass negative signals to <code class="highlighter-rouge">տ</code>, <code class="highlighter-rouge">թ</code> and others.</p> <p><img src="http://yerevann.github.io/public/2017-06-27/t-ծ-concat.png" alt="t=&gt;ծ - concat layer" /></p> <p>We can also see that the outputs of the right-to-left LSTM are darker, which implies that these neurons “have more knowledge” about whether to predict <code class="highlighter-rouge">ծ</code>. In addition, the lines between those neurons and the hidden layer are thicker, which means that they contribute more to activating the top 12 neurons in the hidden layer. 
This is a very natural result, because we know that <code class="highlighter-rouge">t</code> usually becomes <code class="highlighter-rouge">ծ</code> when the <em>next</em> symbol is <code class="highlighter-rouge">s</code>, and only the right-to-left LSTM is aware of the next character.</p> <p>We did the same analysis for the neurons and gates inside the LSTMs. The results are visualized as six rows of neurons at the bottom of the image. In particular, it is interesting to note that the most “confident” neurons are the so-called <em>cell inputs</em>. Recall that cell inputs, as well as all the gates, depend on the input at the current step and the hidden state of the previous step (which is the hidden state at the <em>next</em> character as we talk about the right-to-left LSTM), so all of them are “aware” of the next <code class="highlighter-rouge">s</code>, but for some reason cell inputs are more confident than others.</p> <p>In the cases where <code class="highlighter-rouge">s</code> should be transliterated into <code class="highlighter-rouge">_</code> (the placeholder), the useful information is more likely to come from the LSTM that goes forward, as <code class="highlighter-rouge">s</code> becomes <code class="highlighter-rouge">_</code> mainly in case of <code class="highlighter-rouge">ts</code> =&gt; <code class="highlighter-rouge">ծ_</code>. We see that in the next plot:</p> <p><img src="http://yerevann.github.io/public/2017-06-27/s-_.png" alt="s=&gt;placeholder" /></p> <h3 id="what-did-this-neuron-learn">What did this neuron learn?</h3> <p>In the second part of our analysis we tried to figure out in which ambiguous cases each of the neurons is most helpful. We took the set of Latin characters that can be transliterated into more than one Armenian letter. Then we removed the cases where one of the possible outcomes appears fewer than 300 times in our 5000 sample sentences, because our distance metric didn’t seem to work well with few samples. 
And we analyzed every fixed neuron for every possible input-output pair.</p> <p>For example, here is the analysis of neuron #70 of the output layer of the left-to-right LSTM. We have seen in the previous visualization that it helps determine whether <code class="highlighter-rouge">s</code> should be transliterated into <code class="highlighter-rouge">_</code>. We see that the top input-output pairs for this neuron are the following:</p> <table> <thead> <tr> <th style="text-align: center">Hellinger distance</th> <th style="text-align: center">Latin character</th> <th style="text-align: center">Armenian character</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">0.9482</td> <td style="text-align: center">s</td> <td style="text-align: center">_</td> </tr> <tr> <td style="text-align: center">0.8285</td> <td style="text-align: center">h</td> <td style="text-align: center">հ</td> </tr> <tr> <td style="text-align: center">0.8091</td> <td style="text-align: center">h</td> <td style="text-align: center">_</td> </tr> <tr> <td style="text-align: center">0.6125</td> <td style="text-align: center">o</td> <td style="text-align: center">օ</td> </tr> </tbody> </table> <p>So this neuron is most helpful when predicting <code class="highlighter-rouge">_</code> from <code class="highlighter-rouge">s</code> (as we already knew), but it also helps to determine whether Latin <code class="highlighter-rouge">h</code> should be transliterated as Armenian <code class="highlighter-rouge">հ</code> or the placeholder <code class="highlighter-rouge">_</code> (e.g. 
Armenian <code class="highlighter-rouge">չ</code> is usually romanized as <code class="highlighter-rouge">ch</code>, so <code class="highlighter-rouge">h</code> sometimes becomes <code class="highlighter-rouge">_</code>).</p> <p>We visualize Hellinger distances of the histograms of neuron activations when the input is <code class="highlighter-rouge">h</code> and the output is <code class="highlighter-rouge">_</code>, and see that the neuron #70 is among the top 10 neurons of the left-to-right LSTM for the <code class="highlighter-rouge">h</code>=&gt;<code class="highlighter-rouge">_</code> pair.</p> <p><img src="http://yerevann.github.io/public/2017-06-27/h-_.png" alt="h=&gt;placeholder" /></p> <h2 id="visualizing-lstm-cells">Visualizing LSTM cells</h2> <p>Inspired by <a href="https://arxiv.org/abs/1506.02078">this paper</a> by Andrej Karpathy, Justin Johnson and Fei-Fei Li, we tried to find neurons or LSTM cells specialised in some language specific patterns in the sequences. In particular, we tried to find the neurons that react most to the suffix <code class="highlighter-rouge">թյուն</code> (romanized as <code class="highlighter-rouge">tyun</code>).</p> <p><img src="http://yerevann.github.io/public/2017-06-27/utyun1.png" alt="tyun" /></p> <p>The first row of this visualization is the output sequence. Rows below show the activations of the most interesting neurons:</p> <ol> <li>Cell #6 in the LSTM that goes backwards,</li> <li>Cell #147 in the LSTM that goes forward,</li> <li>37th neuron in the hidden layer,</li> <li>78th neuron in the concat layer.</li> </ol> <p><img src="http://yerevann.github.io/public/2017-06-27/utyun2.png" alt="tyun" /></p> <p>We can see that Cell #6 is active on <code class="highlighter-rouge">tyun</code>s and is not active on the other parts of the sequence. 
Cell #147 of the forward LSTM behaves the opposite way: it is active on everything except <code class="highlighter-rouge">tyun</code>s.</p> <p>We know that <code class="highlighter-rouge">t</code> in the suffix <code class="highlighter-rouge">tyun</code> should always become <code class="highlighter-rouge">թ</code> in Armenian, so we thought that if a neuron is active on <code class="highlighter-rouge">tyun</code>s, it may help in determining whether the Latin <code class="highlighter-rouge">t</code> should be transliterated as <code class="highlighter-rouge">թ</code> or <code class="highlighter-rouge">տ</code>. So we visualized the most important neurons for the pair <code class="highlighter-rouge">t</code> =&gt; <code class="highlighter-rouge">թ</code>.</p> <p><img src="http://yerevann.github.io/public/2017-06-27/t-թ.png" alt="t-&gt;թ" /></p> <p>Indeed, Cell #147 in the forward LSTM is among the top 10.</p> <h2 id="concluding-remarks">Concluding remarks</h2> <p>Interpretability of neural networks remains an important challenge in machine learning. CNNs and LSTMs perform well for many learning tasks, but there are very few tools to understand the inner workings of these systems. Transliteration is a pretty good problem for analyzing the impact of particular neurons.</p> <p>Our experiments showed that too many neurons are involved in the “decision making” even for the simplest cases, but it is possible to identify a subset of neurons that have more influence than the rest. On the other hand, most neurons are involved in multiple decision making processes depending on the context. This is expected, since nothing in the loss functions we use when training neural nets forces the neurons to be independent and interpretable. Recently, there have been <a href="https://arxiv.org/abs/1606.03657">some attempts</a> to apply information-theoretic regularization terms in order to obtain more interpretability. 
It would be interesting to test those ideas in the context of transliteration.</p> <p>We would like to thank Adam Mathias Bittlingmayer and Zara Alaverdyan for helpful comments and discussions.</p> Announcing YerevaNN non-profit foundation 2016-10-17T00:00:00+00:00 http://yerevann.github.io//2016/10/17/announcing-yerevann-non-profit-foundation <p>Today we have officially registered YerevaNN scientific educational foundation, which aims to promote world-class AI research in Armenia and develop high quality educational programs in machine learning and related disciplines. The board members of the foundation are Gor Vardanyan, founder of FimeTech, Vazgen Hakobjanyan, cofounder of Teamable, and Rouben Meschian, founder of Arminova Technologies. Hrant Khachatrian is the director of the foundation.</p> <!--more--> <p><img src="http://yerevann.github.io/public/2016-10-17/cover.jpg" alt="YerevaNN" /></p> <p>The core project of the foundation is to support an AI research lab based in Yerevan, Armenia. Inspired by <a href="https://openai.com/about/">OpenAI</a>, the lab focuses on non-commercial machine learning research and is committed to publish all obtained results and release all the code on GitHub. 
The three initial members of YerevaNN lab, Tigran Galstyan, Karen Hambardzumyan and Hrayr Harutyunyan, currently work on projects ranging from generative models to natural language processing.</p> <p>Follow us on our <a href="http://yerevann.github.io/">blog</a>, on <a href="https://www.facebook.com/YerevaNNlab/">Facebook</a>, <a href="https://twitter.com/YerevaNN">Twitter</a> and <a href="https://plus.google.com/110195306327238545309">Google Plus</a>.</p> Sentence representations and question answering (slides) 2016-09-21T00:00:00+00:00 http://yerevann.github.io//2016/09/21/presentation-sentence-representations-and-question-answering <p>The success of neural word embedding models like <a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> and <a href="http://nlp.stanford.edu/projects/glove/">GloVe</a> motivated research on representing sentences in an n-dimensional space. <a href="https://github.com/mike1808">Michael Manukyan</a> and <a href="https://github.com/Harhro94">Hrayr Harutyunyan</a> reviewed several sentence representation algorithms and their applications in state-of-the-art <a href="http://arxiv.org/abs/1608.07905">automated question answering</a> systems during a talk at the Armenian NLP meetup. The slides of the talk are below. 
Follow us on <a href="https://www.slideshare.net/YerevaNN/">SlideShare</a> to get the latest slides from YerevaNN.</p> <!--more--> <iframe src="//www.slideshare.net/slideshow/embed_code/key/NkfvTBRSIjKEW0" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe> <div style="margin-bottom:5px"> <strong> <a href="//www.slideshare.net/YerevaNN/sentence-representations-and-question-answering-yerevann" title="Sentence representations and question answering (YerevaNN)" target="_blank">Sentence representations and question answering (YerevaNN)</a> </strong> from <strong><a href="//www.slideshare.net/YerevaNN" target="_blank">YerevaNN</a></strong> </div> Automatic transliteration with LSTM 2016-09-09T00:00:00+00:00 http://yerevann.github.io//2016/09/09/automatic-transliteration-with-lstm <p>By <a href="https://github.com/TigranGalstyan">Tigran Galstyan</a>, <a href="https://github.com/Harhro94">Hrayr Harutyunyan</a> and <a href="https://github.com/Hrant-Khachatrian">Hrant Khachatrian</a>.</p> <p>Many languages have their own non-Latin alphabets but the web is full of content in those languages written in Latin letters, which makes it inaccessible to various NLP tools (e.g. automatic translation). Transliteration is the process of converting the romanized text back to the original writing system. In theory every language has a strict set of romanization rules, but in practice people do not follow the rules and most of the romanized content is hard to transliterate using rule based algorithms. We believe this problem is solvable using the state of the art NLP tools, and we demonstrate a high quality solution for Armenian based on recurrent neural networks. 
We invite everyone to <a href="https://github.com/YerevaNN/translit-rnn">adapt our system</a> for more languages.</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#problem-description" id="markdown-toc-problem-description">Problem description</a></li> <li><a href="#data-processing" id="markdown-toc-data-processing">Data processing</a> <ul> <li><a href="#source-of-the-data" id="markdown-toc-source-of-the-data">Source of the data</a></li> <li><a href="#romanization-rules" id="markdown-toc-romanization-rules">Romanization rules</a></li> <li><a href="#geographic-dependency" id="markdown-toc-geographic-dependency">Geographic dependency</a></li> <li><a href="#filtering-out-large-non-armenian-chunks" id="markdown-toc-filtering-out-large-non-armenian-chunks">Filtering out large non-Armenian chunks</a></li> </ul> </li> <li><a href="#network-architecture" id="markdown-toc-network-architecture">Network architecture</a> <ul> <li><a href="#encoding-the-characters" id="markdown-toc-encoding-the-characters">Encoding the characters</a></li> <li><a href="#aligning" id="markdown-toc-aligning">Aligning</a></li> <li><a href="#bidirectional-lstm-with-residual-like-connections" id="markdown-toc-bidirectional-lstm-with-residual-like-connections">Bidirectional LSTM with residual-like connections</a></li> </ul> </li> <li><a href="#results" id="markdown-toc-results">Results</a></li> <li><a href="#future-work" id="markdown-toc-future-work">Future work</a></li> </ul> <h2 id="problem-description">Problem description</h2> <p>In the early 1990s computers became widespread in many countries, but the operating systems did not fully support different alphabets out of the box. Most keyboards had only Latin letters printed on them, and people started to invent romanization rules for their languages. Every language has its own story, and these stories are usually not known outside their own communities. 
In the case of Armenian, <a href="https://en.wikipedia.org/wiki/ArmSCII">some solutions</a> were developed, but even those who knew how to write in Armenian characters were not sure that the readers (e.g. the recipient of an email) would be able to read the text.</p> <table> <thead> <tr> <th><img src="http://yerevann.github.io/public/2016-09-09/armenian-unicode.jpg" alt="Armenian alphabet in the Unicode space. Source: Wikipedia" /></th> </tr> </thead> <tbody> <tr> <td>Armenian alphabet in the Unicode space. Source: <a href="https://en.wikipedia.org/wiki/Armenian_alphabet#Character_encodings">Wikipedia</a></td> </tr> </tbody> </table> <p>In the Unicode era all major OSes started to support displaying <a href="https://en.wikipedia.org/wiki/Armenian_alphabet">Armenian characters</a>. But the lack of keyboard layouts was still a problem. In the late 2000s mobile internet penetration exploded in Armenia, and most of the early mobile phones did not support writing in Armenian. For example, iOS doesn’t include an Armenian keyboard and started to officially support custom keyboards <a href="http://www.theverge.com/2014/6/2/5773504/developers-already-at-work-on-alternate-ios-8-keyboards/in/6116530">only in 2014</a>! The result was that lots of people entered the web (mostly through social networks) without having access to Armenian letters. So everyone started to use some sort of romanization (obviously no one was aware that there are fixed standards for the <a href="https://en.wikipedia.org/wiki/Romanization_of_Armenian">romanization of Armenian</a>).</p> <p>Currently there are many attempts to fight romanized Armenian on forums and social networks. Armenian keyboard layouts have been developed for every popular platform. But still lots of content is produced in non-Armenian letters (maybe only Facebook knows the exact scale of the problem), and such content remains inaccessible for search indexing, automated translation, text-to-speech, etc. 
Recently the problem has started to spill outside the web: people use romanized Armenian on the streets.</p> <table> <thead> <tr> <th><img src="http://yerevann.github.io/public/2016-09-09/translit-in-the-wild.jpg" alt="Romanized Armenian on the street. Source: VKontakte social network" /></th> </tr> </thead> <tbody> <tr> <td>Romanized Armenian on the street. Source: VKontakte social network</td> </tr> </tbody> </table> <p>There are some online tools that correctly transliterate romanized Armenian if it is written using strict rules. <a href="https://hayeren.am/?p=convertor">Hayeren.am</a> is the most famous example. Facebook’s search box also recognizes some romanizations (but not all). But for many practical cases these tools do not give a reasonable output. The algorithm must be able to use the context to correctly predict the Armenian character.</p> <table> <thead> <tr> <th><img src="http://yerevann.github.io/public/2016-09-09/facebook-translit.jpg" alt="Facebook's search box recognizes some romanized Armenian" /></th> </tr> </thead> <tbody> <tr> <td>Facebook’s search box recognizes some romanized Armenian. Note that the spelling suggestion is not for Armenian.</td> </tr> </tbody> </table> <p>Finally, there are debates whether these tools actually help fight the “translit” problem. Some argue that people will not be forced to use an Armenian keyboard if there are very good tools to transliterate. We believe that the goal of making this content available for the NLP tools is extremely important, as no one will (and should) develop, say, language translation tools for romanized alphabets.</p> <p>Wikipedia has similar stories for <a href="https://en.wikipedia.org/wiki/Greeklish">Greek</a>, <a href="https://en.wikipedia.org/wiki/Fingilish">Persian</a> and <a href="https://en.wikipedia.org/wiki/Translit">Cyrillic</a> alphabets. The problem exists for many writing systems and is mostly overlooked by the NLP community, although it’s definitely not the hardest problem in NLP. 
We hope that the solution we develop for Armenian might become helpful for other languages as well.</p> <h2 id="data-processing">Data processing</h2> <p>We are using a recurrent neural network that takes a sequence of characters (romanized Armenian) at its input and outputs a sequence of Armenian characters. In order to train such a system we take a lot of text in Armenian, romanize it using probabilistic rules and give the result to the network.</p> <h3 id="source-of-the-data">Source of the data</h3> <p>We chose Armenian Wikipedia as the easiest available large corpus of Armenian text. The dumps are available <a href="https://dumps.wikimedia.org/hywiki/">here</a>. These dumps are in a very complicated XML format, but they can be parsed by the <a href="https://github.com/attardi/wikiextractor">WikiExtractor tool</a>. The details are in the <a href="https://github.com/YerevaNN/translit-rnn">Readme file</a> of the repository we released today.</p> <p>The disadvantage of Wiki is that it doesn’t contain very diverse texts. For example, it doesn’t contain any dialogs or informal speech (while social networks are full of them). On the other hand it’s very easy to parse and it’s quite large (356MB). We split this into training (284MB), validation (36MB) and test (36MB) sets, but then we understood that the overlap between training and validation sets can be very high. Finally we decided to use some <a href="http://grapaharan.org/index.php/Պատը">fiction text</a> with lots of dialogs as a validation set.</p> <h3 id="romanization-rules">Romanization rules</h3> <p>To generate the input sequences for the network we need to romanize the texts. We use probabilistic rules, as different people prefer different romanizations. The Armenian alphabet has 39 characters, while Latin has only 26. 
Some of the Armenian letters are romanized in a unique way, like <code class="highlighter-rouge">ա</code>-<code class="highlighter-rouge">a</code>, <code class="highlighter-rouge">բ</code>-<code class="highlighter-rouge">b</code>, <code class="highlighter-rouge">դ</code>-<code class="highlighter-rouge">d</code>, <code class="highlighter-rouge">ի</code>-<code class="highlighter-rouge">i</code>, <code class="highlighter-rouge">մ</code>-<code class="highlighter-rouge">m</code>, <code class="highlighter-rouge">ն</code>-<code class="highlighter-rouge">n</code>. Some letters require a combination of two Latin letters: <code class="highlighter-rouge">շ</code>-<code class="highlighter-rouge">sh</code>, <code class="highlighter-rouge">ժ</code>-<code class="highlighter-rouge">zh</code>, <code class="highlighter-rouge">խ</code>-<code class="highlighter-rouge">kh</code>. The latter is also romanized to <code class="highlighter-rouge">gh</code> or even <code class="highlighter-rouge">x</code> (because this one looks like Russian <code class="highlighter-rouge">х</code> which is pronounced the same way as Armenian <code class="highlighter-rouge">խ</code>).</p> <p>But the main obstacle is that the same Latin character can correspond to different Armenian letters. For example <code class="highlighter-rouge">c</code> can come from both <code class="highlighter-rouge">ց</code> and <code class="highlighter-rouge">ծ</code>, <code class="highlighter-rouge">t</code> can come from both <code class="highlighter-rouge">տ</code> and <code class="highlighter-rouge">թ</code>, and so on. This is what the network has to learn to infer from the context.</p> <p>We have created a probabilistic mapping, so that each Armenian letter is romanized according to the given probabilities. 
For example, <code class="highlighter-rouge">ծ</code> is replaced by <code class="highlighter-rouge">ts</code> in 60% of cases, <code class="highlighter-rouge">c</code> in 30% of cases, and <code class="highlighter-rouge">&amp;</code> in 10% of cases. The full set of rules can be browsed <a href="http://jsoneditoronline.org/?id=ef9f135c1a0b4f3ad4724f5fa628fb00">here</a>.</p> <table> <thead> <tr> <th><img src="http://yerevann.github.io/public/2016-09-09/hy-rules.jpg" alt="Some of the romanization rules for Armenian" /></th> </tr> </thead> <tbody> <tr> <td>Some of the romanization rules for Armenian</td> </tr> </tbody> </table> <h3 id="geographic-dependency">Geographic dependency</h3> <p>The romanization rules vary a lot in different countries. For example, the Armenian letter <code class="highlighter-rouge">շ</code> is mostly romanized as <code class="highlighter-rouge">sh</code>, but Armenians in Germany prefer <code class="highlighter-rouge">sch</code>, Armenians in France sometimes use <code class="highlighter-rouge">ch</code>, and Armenians in Russia use <code class="highlighter-rouge">w</code> (because <code class="highlighter-rouge">w</code> is visually similar to Russian <code class="highlighter-rouge">ш</code> which sounds like <code class="highlighter-rouge">sh</code>). There are many other similar differences that might require separate analysis.</p> <p>Finally, the Armenian language has two branches: Eastern and Western Armenian. These branches have crucial differences in romanization rules. Here we focus only on the rules for Eastern Armenian and those that are commonly used in Armenia.</p> <h3 id="filtering-out-large-non-armenian-chunks">Filtering out large non-Armenian chunks</h3> <p>Wikidumps contain some large regions where there are no Armenian characters. We noticed that these regions were confusing the network. 
So now when generating a chunk to give to the system we drop the ones that do not contain at least 33% Armenian characters.</p> <p>This is a difficult decision, as one might want the system to recognize English words in the text and leave them without transliteration. For example, the word <code class="highlighter-rouge">You Tube</code> should not be transliterated to Armenian. We hope that such small cases of English words/names will remain in the training set.</p> <h2 id="network-architecture">Network architecture</h2> <p>Our search for a good network architecture started from the <a href="https://github.com/Lasagne/Recipes/blob/master/examples/lstm_text_generation.py">Lasagne implementation</a> of <a href="https://github.com/karpathy/char-rnn">Karpathy’s popular char-rnn network</a>. Char-rnn is a language model: it predicts the next character given the previous ones and is based on 2 layers of LSTMs going from left to right. The context from the right is also important in our case, so we replaced simple LSTMs with <a href="http://www.cs.toronto.edu/~graves/asru_2013.pdf">bidirectional LSTMs</a> (introduced <a href="ftp://ftp.idsia.ch/pub/juergen/nn_2005.pdf">here</a> back in 2005).</p> <p>We have also added a shortcut connection from the input to the output of the 2nd biLSTM layer. This should help to learn the “easy” transliteration rules via this short path and leave LSTMs for the complex stuff.</p> <p>Just like char-rnn, our network works on character level data and has no access to dictionaries.</p> <h3 id="encoding-the-characters">Encoding the characters</h3> <p>First we define the set of possible characters (“vocabularies”) for the input and the output. The input “vocabulary” contains all the characters that appear in the right hand sides of the romanization rules, the digits and some punctuation (that can provide useful context). 
Then a special program runs over the entire corpus, generates the romanized version, and every symbol outside the input vocabulary is replaced by some placeholder symbol (<code class="highlighter-rouge">#</code>) in both original and romanized versions. The symbols that are left in the original version form the “output vocabulary”.</p> <p>All symbols are encoded as one-hot vectors and are passed to the network. In our case the input vectors are 72 dimensional and the output vectors are 152 dimensional.</p> <h3 id="aligning">Aligning</h3> <p>After some experiments we noticed that LSTMs really struggle when the characters are not aligned in inputs and outputs. As one Armenian character can be replaced by 2 or 3 Latin characters, the input and output sequences usually have different lengths, and the network has to “remember” by how many characters the romanized sequence is ahead of the Armenian sequence in order to print the next character in the correct place. This turned out to be extremely difficult, and we decided to explicitly align the Armenian sequence by <a href="https://github.com/YerevaNN/translit-rnn/blob/master/utils.py#L227-L232">adding some placeholder symbols</a> after those characters that are romanized to multi-character Latin.</p> <table> <thead> <tr> <th><img src="http://yerevann.github.io/public/2016-09-09/aligning.png" alt="Character level alignment of Armenian text with the romanization" /></th> </tr> </thead> <tbody> <tr> <td>Character level alignment of Armenian text with the romanization</td> </tr> </tbody> </table> <p>Also there is one exceptional case in Armenian: the Latin letter ‘u’ should be transliterated to 2 Armenian symbols: <code class="highlighter-rouge">ու</code>. This is another source of misalignment. 
We <a href="https://github.com/YerevaNN/translit-rnn/blob/master/utils.py#L160-L166">explicitly replace</a> all <code class="highlighter-rouge">ու</code> pairs with some placeholder symbol to avoid the problem.</p> <h3 id="bidirectional-lstm-with-residual-like-connections">Bidirectional LSTM with residual-like connections</h3> <p>LSTM network expects a sequence of vectors at its input. In our case it is a sequence of one-hot vectors, and the sequence length is a hyperparameter. We used <code class="highlighter-rouge">--seq_len 30</code> for the final model. This means that the network reads 30 characters in Armenian, transforms to Latin characters (it usually becomes a bit longer than 30), then crops up to the latest whitespace before the 30th symbol. The remaining cells are filled with another placeholder symbol. This ensures that the words are not split in the middle.</p> <table> <thead> <tr> <th><img src="http://yerevann.github.io/public/2016-09-09/bilstm-network.png" alt="Network architecture" /></th> </tr> </thead> <tbody> <tr> <td>Network architecture. Green boxes encapsulate all the magic inside LSTM. Grey trapezoids denote dense connections. Dotted line is an identity connection without trainable parameters.</td> </tr> </tbody> </table> <p>These 30 one-hot vectors are passed to the first layer of bidirectional LSTM. Basically it is a combination of two separate LSTMs, first one is passing over the sequence from left to right, and the other is passing from right to left. We use 1024 neurons in all LSTMs. Both LSTMs output some 1024-dimensional vectors at every position. These outputs are <a href="https://github.com/YerevaNN/translit-rnn/blob/master/utils.py#L283">concatenated</a> into a 2048 dimensional vector and are passed through another dense layer that outputs a 1024 dimensional vector. That’s what we call one layer of a bidirectional LSTM. The number of such layers is another hyperparameter (<code class="highlighter-rouge">--depth</code>). 
Our experiments showed that 2 layers learn better than 1 or 3 layers.</p> <p>At every position the output of the last bidirectional LSTM is <a href="https://github.com/YerevaNN/translit-rnn/blob/master/utils.py#L292">concatenated with the one-hot vector of the input</a> forming a 1096 dimensional vector. Then it is densely connected to the final layer with 152 neurons on which softmax is applied. The total loss is the mean of the cross entropy losses of the current sequence.</p> <p>The concatenation of the input vector to the output of the LSTM is similar to the residual connections introduced in <a href="https://arxiv.org/abs/1512.03385">deep residual networks</a>. Some of the transliteration rules are very easy and deterministic, so they can be learned by a diagonal-like matrix between input and output vectors. For more complex rules the output of LSTMs will become important. One important difference from deep residual networks is that instead of adding the input vector to the output of LSTMs, we just concatenate them. Also, our residual connections do not help fighting the vanishing/exploding gradient problem, we have LSTM for that.</p> <h2 id="results">Results</h2> <p>We have trained this network using <code class="highlighter-rouge">adagrad</code> algorithm with gradient clipping (learning rate was set to <code class="highlighter-rouge">0.01</code> and was not modified). Training is not very stable and it’s very hard to wait until it overfits on our hardware (NVidia GTX 980). We use <code class="highlighter-rouge">--batch_size 350</code> and it consumes more than 2GB of GPU memory.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -u train.py --hdim 1024 --depth 2 --seq_len 30 --batch_size 350 &amp;&gt; log </code></pre></div></div> <p>The model we got for Armenian was trained for 42 hours. 
Here are the plots of training and validation sets:</p> <table> <thead> <tr> <th><img src="http://yerevann.github.io/public/2016-09-09/loss.png" alt="Loss functions" /></th> </tr> </thead> <tbody> <tr> <td>Loss functions. Green is the validation loss, blue is the training loss.</td> </tr> </tbody> </table> <p>The loss quickly drops in the first quarter of the first epoch, then continues to slowly decrease. We stopped after 5.1 epochs. The <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> between the original Armenian text and the output of the network on the validation test is 405 (the length is 36694). For example, hayeren.am’s converter output has more than 2500 edit distance.</p> <p>Here are some results.</p> <table> <thead> <tr> <th>Romanized snippet from Wikipedia (test set)</th> <th>Transliteration by translit-rnn</th> </tr> </thead> <tbody> <tr> <td>Belgiayi gyuxatntesutyuny Belgiayi tntesutyan jyuxeric mekn e։ Gyuxatntesutyany bnorosh e bardzr intyensivutyune, sakayn myec che nra der@ erkri tntesutyan mej։ Byelgian manr ev mijin agrarayin tntesutyunneri erkir e։ Gyuxatntyesutyan mej ogtagortsvox hoghataracutyan mot kese patkanum e 5-ic 20 ha unecox fermernerin, voronq masnagitacats yen qaxaknerin mterqner matakararelu gorcum, talis en apranqayin artadranqi himnakan zangvatse։</td> <td>Բելգիայի գյուղատնտեսությունը Բելգիայի տնտեսության ճյուղերից մեկն է։ Գյուղատնտեսությանը բնորոշ է բարձր ինտենսիվությունը, սակայն մեծ չէ նրա դերը երկրի տնտեսության մեջ։ Բելգիան մանր և միջին ագրարային տնտեսությունների երկիր է։ Գյուղատնտեսության մեջ օգտագործվող հողատարածության մոտ կեսը պատկանում է 5-ից 20 հա ունեցող ֆերմերներին, որոնք մասնագիտացած են քաղաքներին մթերքներ մատակարարելու գործում, տալիս են ապրանքային արտադրանքի հիմնական զանգվածը։</td> </tr> </tbody> </table> <p>Edit distance between this output and the original text is 0. 
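</p> <p>A metric like the one above can be computed with a short dynamic-programming routine. The following is a minimal sketch in Python, not code taken from the translit-rnn repository: the function name <code class="highlighter-rouge">levenshtein</code> is our own choice for illustration, and the toy strings differ only in the first letter (<code class="highlighter-rouge">Ժ</code> versus <code class="highlighter-rouge">հ</code>).</p>

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # previous[j] is the distance between the already processed prefix
    # of `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty prefix costs i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


# Hypothetical toy strings: the two words differ only in their first letter.
reference = "Ժողովուրդն"
prediction = "հողովուրդն"
print(levenshtein(reference, prediction))

# Character error rate implied by the validation figures quoted in this post.
print(f"{405 / 36694:.2%}")
```

<p>Dividing the distance by the length of the reference text gives a character error rate: 405 edits over 36694 characters is roughly 1.1%.</p> <p>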
Next we try some legal text in Armenian:</p> <table> <thead> <tr> <th>Romanized snippet from Armenian constitution</th> <th>Transliteration by translit-rnn</th> </tr> </thead> <tbody> <tr> <td>Zhoghovurdn ir ishkhanutyunn irakanatsnum e azat yntrutyunneri, hanraqveneri, inchpyes naev Sahmanadrutyamb naghatesvac petakan ev teghakan inqnakaravarman marminnyeri u pashtonatar anzanc midjocov:</td> <td>հողովուրդն իր իշխանությունն իրականացնում է ազատ ընտրությունների, հանրաքվեների, ինչպես նաև Սահմանադրությամբ նախատեսված պետական և տեղական ինքնակառավարման մարմինների ու պաշտոնատար անձանց միջոցով:</td> </tr> </tbody> </table> <p>There is only one error here. The first word should start by <code class="highlighter-rouge">Ժ</code> and not <code class="highlighter-rouge">հ</code>. The possible reason for this is that the network doesn’t have a left-side context for that character.</p> <p>An interesting feature of this system is that it also tries to learn when the Latin letters should not be converted to Armenian. Next example comes from a random Facebook group:</p> <table> <thead> <tr> <th>Random post from a Facebook group</th> <th>Transliteration by translit-rnn</th> </tr> </thead> <tbody> <tr> <td>aysor aravotyan jamy 10;40–11;00 ynkac hatvacum 47 hamari yertuxayini miji txa,vor qez pahecir txamardavari u vori hamar MERSI.,xndrum em ete kardas PM gri. p.s.anlurj, animast u antexi commentner chgreq,karevor e u lurj.</td> <td>այսօր առավոտյան ժամը 10;40–11;00 ընկած հատվածում 47 համարի երթուղայինի միջի տղա,որ քեզ պահեցիր տղամարդավարի ու որի համար ՄԵՐSI.,խնդրում եմ եթե կարդաս ՊՄ գրի. p.s.անլուրջ, անիմաստ ու անտեղի ցոմմենտներ չգրեք,կարևոր է ու լուրջ.</td> </tr> </tbody> </table> <p>It is interesting that the sequence <code class="highlighter-rouge">p.s.</code> is not transliterated. 
Also it decided to leave half of the letters of <code class="highlighter-rouge">MERSI</code> in Latin which is probably because it’s written in all caps (Wikipedia doesn’t contain a lot of text in all caps, maybe except some abbreviations). Also, the word <code class="highlighter-rouge">commentner</code> is transliterated as <code class="highlighter-rouge">ցոմմենտներ</code> (instead of <code class="highlighter-rouge">քոմենթներ</code>), because it’s not really a romanized Armenian word, it just includes the English word <code class="highlighter-rouge">comment</code> (and it definitely doesn’t appear in Wiki).</p> <h2 id="future-work">Future work</h2> <p>First we plan to understand what the system actually learned by visualizing its behavior on different cases. It is interesting to see how the residual connection performed and also if the network managed to discover some rules known from Armenian orthography.</p> <p>Next, we want to bring this tool to the web. We will have to make much smaller/faster model, translate it to Javascript, and probably wrap it in a Chrome extension.</p> <p>Finally, we would like to see this tool applied to more languages. We have released all the code in the <a href="https://github.com/YerevaNN/translit-rnn">translit-rnn repository</a> and prepared instructions on how to add a new language. 
Basically a large corpus and probabilistic romanization rules are required.</p> <p>We would like to thank Adam Mathias Bittlingmayer for many valuable discussions.</p> Combining CNN and RNN for spoken language identification 2016-06-26T00:00:00+00:00 http://yerevann.github.io//2016/06/26/combining-cnn-and-rnn-for-spoken-language-identification <p>By <a href="https://github.com/Harhro94">Hrayr Harutyunyan</a> and <a href="https://github.com/Hrant-Khachatrian">Hrant Khachatrian</a></p> <p>Last year Hrayr used <a href="/2015/10/11/spoken-language-identification-with-deep-convolutional-networks/">convolutional networks to identify spoken language</a> from short audio recordings for a <a href="https://community.topcoder.com/longcontest/?module=ViewProblemStatement&amp;rd=16555&amp;compid=49304">TopCoder contest</a> and got 95% accuracy. After the end of the contest we decided to try recurrent neural networks and their combinations with CNNs on the same task. The best combination allowed to reach 99.24% and an ensemble of 33 models reached 99.67%. 
This work became Hrayr’s bachelor’s thesis.</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#inputs-and-outputs" id="markdown-toc-inputs-and-outputs">Inputs and outputs</a></li> <li><a href="#network-architecture" id="markdown-toc-network-architecture">Network architecture</a> <ul> <li><a href="#convolutional-networks-cnn" id="markdown-toc-convolutional-networks-cnn">Convolutional networks (CNN)</a></li> <li><a href="#recurrent-neural-networks-rnn" id="markdown-toc-recurrent-neural-networks-rnn">Recurrent neural networks (RNN)</a></li> <li><a href="#combinations-of-cnn-and-rnn" id="markdown-toc-combinations-of-cnn-and-rnn">Combinations of CNN and RNN</a></li> </ul> </li> <li><a href="#ensembling" id="markdown-toc-ensembling">Ensembling</a></li> <li><a href="#final-remarks" id="markdown-toc-final-remarks">Final remarks</a></li> </ul> <h2 id="inputs-and-outputs">Inputs and outputs</h2> <p>As before, the inputs of the networks are spectrograms of speech recordings. It seems spectrograms are the standard way to represent audio for deep learning systems (see <a href="http://arxiv.org/abs/1508.01211">“Listen, Attend and Spell”</a> and <a href="http://arxiv.org/abs/1512.02595">“Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”</a>).</p> <p>Some networks use up to 11khz frequencies (858 x 256 image) and others use up to 5.5khz frequencies (858 x 128 image). 
In general the networks which use up to 5.5khz frequencies perform a little bit better (probably because the higher frequencies do not contain much useful information and just make overfitting easier).</p> <p>The output layer of all networks is a fully connected softmax layer with 176 units.</p> <p>We didn’t augment the data using <a href="/2015/10/11/spoken-language-identification-with-deep-convolutional-networks/#data-augmentation"><em>vocal tract length augmentation</em></a>.</p> <h2 id="network-architecture">Network architecture</h2> <p>We have tested several network architectures. The first set of architectures consists of plain AlexNet-like convolutional networks. The second set contains no convolutions and interprets the columns of the spectrogram as a sequence of inputs to a recurrent network. The third set applies RNN on top of the features extracted by a convolutional network. All models are implemented in <a href="http://deeplearning.net/software/theano/">Theano</a> and <a href="http://lasagne.readthedocs.io/en/latest/">Lasagne</a>.</p> <p>Almost all networks easily reach 100% accuracy on the training set. In the following tables we describe all architectures we tried and report accuracy on the validation set.</p> <h3 id="convolutional-networks-cnn">Convolutional networks (CNN)</h3> <p>The network consists of 6 blocks of 2D convolution, ReLU nonlinearity, 2D max pooling and batch normalization. We use 7x7 filters for the first convolutional layer, 5x5 for the second and 3x3 for the rest. Pooling size is always 3x3 with stride 2.</p> <p><a href="https://arxiv.org/abs/1502.03167">Batch normalization</a> significantly increases the training speed (this fact is reported in lots of recent papers). 
Finally we use only 1 fully connected layer between the last pooling layer and the softmax layer, and apply 50% dropout on that.</p> <table> <thead> <tr> <th>Network</th> <th>Accuracy</th> <th>Notes</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net.py">tc_net</a></td> <td>&lt;80%</td> <td>The difference between this network and the CNN described in the previous work is that this network has only one fully connected layer. We didn’t train this network much because of <code class="highlighter-rouge">ignore_border=False</code>, which slows down the training</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_mod.py">tc_net_mod</a></td> <td>97.14</td> <td>This network is the same as <code class="highlighter-rouge">tc_net</code> but instead of <code class="highlighter-rouge">ignore_border=False</code>, we put <code class="highlighter-rouge">pad=2</code></td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_mod_5khz_small.py">tc_net_mod_5khz_small</a></td> <td>96.49</td> <td>This network is a smaller copy of the <code class="highlighter-rouge">tc_net_mod</code> network and works with frequencies up to 5.5 kHz</td> </tr> </tbody> </table> <p>The Lasagne setting <code class="highlighter-rouge">ignore_border=False</code> <a href="http://lasagne.readthedocs.io/en/latest/modules/layers/pool.html#lasagne.layers.MaxPool2DLayer">prevents</a> Theano from using CuDNN. 
Setting it to <code class="highlighter-rouge">True</code> significantly increased the speed.</p> <p>Here is the detailed description of the best network of this set: <a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_mod.py">tc_net_mod</a>.</p> <table> <thead> <tr> <th>Nr</th> <th>Type</th> <th>Channels</th> <th>Width</th> <th>Height</th> <th>Kernel size / stride</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>Input</td> <td>1</td> <td>858</td> <td>256</td> <td> </td> </tr> <tr> <td>1</td> <td>Conv</td> <td>16</td> <td>852</td> <td>250</td> <td>7x7 / 1</td> </tr> <tr> <td> </td> <td>ReLU</td> <td>16</td> <td>852</td> <td>250</td> <td> </td> </tr> <tr> <td> </td> <td>MaxPool</td> <td>16</td> <td>427</td> <td>126</td> <td>3x3 / 2, pad=2</td> </tr> <tr> <td> </td> <td>BatchNorm</td> <td>16</td> <td>427</td> <td>126</td> <td> </td> </tr> <tr> <td>2</td> <td>Conv</td> <td>32</td> <td>423</td> <td>122</td> <td>5x5 / 1</td> </tr> <tr> <td> </td> <td>ReLU</td> <td>32</td> <td>423</td> <td>122</td> <td> </td> </tr> <tr> <td> </td> <td>MaxPool</td> <td>32</td> <td>213</td> <td>62</td> <td>3x3 / 2, pad=2</td> </tr> <tr> <td> </td> <td>BatchNorm</td> <td>32</td> <td>213</td> <td>62</td> <td> </td> </tr> <tr> <td>3</td> <td>Conv</td> <td>64</td> <td>211</td> <td>60</td> <td>3x3 / 1</td> </tr> <tr> <td> </td> <td>ReLU</td> <td>64</td> <td>211</td> <td>60</td> <td> </td> </tr> <tr> <td> </td> <td>MaxPool</td> <td>64</td> <td>107</td> <td>31</td> <td>3x3 / 2, pad=2</td> </tr> <tr> <td> </td> <td>BatchNorm</td> <td>64</td> <td>107</td> <td>31</td> <td> </td> </tr> <tr> <td>4</td> <td>Conv</td> <td>128</td> <td>105</td> <td>29</td> <td>3x3 / 1</td> </tr> <tr> <td> </td> <td>ReLU</td> <td>128</td> <td>105</td> <td>29</td> <td> </td> </tr> <tr> <td> </td> <td>MaxPool</td> <td>128</td> <td>54</td> <td>16</td> <td>3x3 / 2, pad=2</td> </tr> <tr> <td> </td> <td>BatchNorm</td> <td>128</td> <td>54</td> <td>16</td> <td> </td> </tr> <tr> 
<td>5</td> <td>Conv</td> <td>128</td> <td>52</td> <td>14</td> <td>3x3 / 1</td> </tr> <tr> <td> </td> <td>ReLU</td> <td>128</td> <td>52</td> <td>14</td> <td> </td> </tr> <tr> <td> </td> <td>MaxPool</td> <td>128</td> <td>27</td> <td>8</td> <td>3x3 / 2, pad=2</td> </tr> <tr> <td> </td> <td>BatchNorm</td> <td>128</td> <td>27</td> <td>8</td> <td> </td> </tr> <tr> <td>6</td> <td>Conv</td> <td>256</td> <td>25</td> <td>6</td> <td>3x3 / 1</td> </tr> <tr> <td> </td> <td>ReLU</td> <td>256</td> <td>25</td> <td>6</td> <td> </td> </tr> <tr> <td> </td> <td>MaxPool</td> <td>256</td> <td>14</td> <td>3</td> <td>3x3 / 2, pad=2</td> </tr> <tr> <td> </td> <td>BatchNorm</td> <td>256</td> <td>14</td> <td>3</td> <td> </td> </tr> <tr> <td>7</td> <td>Fully connected</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td> </td> <td>ReLU</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td> </td> <td>BatchNorm</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td> </td> <td>Dropout</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>8</td> <td>Fully connected</td> <td>176</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td> </td> <td>Softmax Loss</td> <td>176</td> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <p>During the training we accidentally discovered a <a href="https://github.com/Theano/Theano/issues/4534">bug in Theano</a>, which was quickly fixed by Theano developers.</p> <h3 id="recurrent-neural-networks-rnn">Recurrent neural networks (RNN)</h3> <p>The spectrogram can be viewed as a sequence of column vectors that consist of 256 (or 128, if only &lt;5.5KHz frequencies are used) numbers. 
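</p> <p>A minimal sketch of this view, in plain Python rather than the Theano/Lasagne code from our repository. The names <code class="highlighter-rouge">spectrogram_to_sequence</code> and <code class="highlighter-rouge">max_bins</code> are hypothetical, and we assume that row 0 of the image holds the lowest frequency bin, so that keeping the first 128 rows corresponds to cropping at 5.5 kHz.</p>

```python
def spectrogram_to_sequence(spectrogram, max_bins=None):
    """Turn a spectrogram stored as rows of frequency bins into a list of time steps.

    spectrogram: list of `height` rows, each a list of `width` numbers
                 (assumption: row 0 is the lowest frequency bin).
    max_bins:    if given, keep only the `max_bins` lowest frequency rows.
    """
    rows = spectrogram if max_bins is None else spectrogram[:max_bins]
    # zip(*rows) transposes the image: each result is one column, i.e. one time step.
    return [list(column) for column in zip(*rows)]


# Toy 4-bin x 3-frame "spectrogram"; the real inputs are 256 x 858 or 128 x 858.
toy = [
    [0.1, 0.2, 0.3],
    [1.1, 1.2, 1.3],
    [2.1, 2.2, 2.3],
    [3.1, 3.2, 3.3],
]
full = spectrogram_to_sequence(toy)
low = spectrogram_to_sequence(toy, max_bins=2)
print(len(full), len(full[0]))  # number of time steps, numbers per step
print(len(low), len(low[0]))    # same number of steps, fewer numbers per step
```

<p>Each element of the returned list is what a recurrent layer would receive at one time step. 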
We apply <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent networks</a> with 500 <a href="https://arxiv.org/abs/1412.3555">GRU cells</a> in each layer on these sequences.</p> <p><img src="/public/2016-06-26/rnn.png" alt="GRU runs directly on the spectrogram" title="GRU runs directly on the spectrogram" /></p> <table> <thead> <tr> <th>Network</th> <th>Accuracy</th> <th>Notes</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/rnn.py">rnn</a></td> <td>93.27</td> <td>One GRU layer on top of the input layer</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/rnn_2layers.py">rnn_2layers</a></td> <td>95.66</td> <td>Two GRU layers on top of the input layer</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/rnn_2layers_5khz.py">rnn_2layers_5khz</a></td> <td>98.42</td> <td>Two GRU layers on top of the input layer, maximum frequency: 5.5 kHz</td> </tr> </tbody> </table> <p>The second layer of GRU cells improved the performance. Cropping out frequencies above 5.5 kHz helped fight overfitting. We didn’t use dropout for RNNs.</p> <p>Both RNNs and CNNs were trained using <a href="http://lasagne.readthedocs.io/en/latest/modules/updates.html#lasagne.updates.adadelta">adadelta</a> for a few epochs, then by <a href="http://lasagne.readthedocs.io/en/latest/modules/updates.html#lasagne.updates.momentum">SGD with momentum</a> (learning rate 0.003 or 0.0003) until overfitting. If SGD with momentum is applied from the very beginning, the convergence is very slow. 
Adadelta converges faster but usually doesn’t reach high validation accuracy.</p> <h3 id="combinations-of-cnn-and-rnn">Combinations of CNN and RNN</h3> <p>The general architecture of these combinations is a convolutional feature extractor applied on the input, then some recurrent network on top of the CNN’s output, then an optional fully connected layer on RNN’s output and finally a softmax layer.</p> <p>The output of the CNN is a set of several channels (also known as <em>feature maps</em>). We can have separate GRUs acting on each channel (with or without weight sharing) as described in this picture:</p> <p><img src="/public/2016-06-26/cnn-multi-rnn.png" alt="Multiple GRUs run on CNN output" title="Multiple GRUs run on CNN output" /></p> <p>Another option is to interpret CNN’s output as a 3D-tensor and run a single GRU on 2D slices of that tensor:</p> <p><img src="/public/2016-06-26/cnn-one-rnn.png" alt="Single GRU runs on CNN output" title="Single GRU runs on CNN output" /></p> <p>The latter option has more parameters, but the information from different channels is mixed inside the GRU, and it seems to improve performance. This architecture is similar to the one described in <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43455.pdf">this paper</a> on speech recognition, except that they also use some residual connections (“shortcuts”) from input to RNN and from CNN to fully connected layers. It is interesting to note that recently it was shown that similar architectures work well for <a href="http://arxiv.org/abs/1602.00367">text classification</a>.</p> <table> <thead> <tr> <th>Network</th> <th>Accuracy</th> <th>Notes</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_rnn.py">tc_net_rnn</a></td> <td>92.4</td> <td>CNN consists of 3 convolutional blocks and outputs 32 channels of size 104x13. 
Each of these channels is fed to a separate GRU as a sequence of 104 vectors of size 13. The outputs of the GRUs are combined and fed to a fully connected layer</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_rnn_nodense.py">tc_net_rnn_nodense</a></td> <td>91.94</td> <td>Same as above, except there is no fully connected layer on top of the GRUs. The outputs of the GRUs are fed directly to the softmax layer</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_rnn_shared.py">tc_net_rnn_shared</a></td> <td>96.96</td> <td>Same as above, but the 32 GRUs share weights. This helped to fight overfitting</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_rnn_shared_pad.py">tc_net_rnn_shared_pad</a></td> <td>98.11</td> <td>4 convolutional blocks in the CNN using <code class="highlighter-rouge">pad=2</code> instead of <code class="highlighter-rouge">ignore_border=False</code> (which enabled CuDNN and made the training much faster). The output of the CNN is a set of 32 channels of size 54x8. 32 GRUs are applied (one for each channel) with shared weights and there is no fully connected layer</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_deeprnn_shared_pad.py">tc_net_deeprnn_shared_pad</a></td> <td>95.67</td> <td>4 convolutional blocks as above, but 2-layer GRUs with shared weights are applied on the CNN’s outputs. 
Overfitting became stronger because of this second layer</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_shared_pad_augm.py">tc_net_shared_pad_augm</a></td> <td>98.68</td> <td>Same as <a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_rnn_shared_pad.py">tc_net_rnn_shared_pad</a>, but the network randomly crops the input and takes a 9-second interval. The performance became a bit better due to this</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_rnn_onernn.py">tc_net_rnn_onernn</a></td> <td>99.2</td> <td>The outputs of a CNN with 4 convolutional blocks are grouped into a 32x54x8 3D-tensor and a single GRU runs on a sequence of 54 vectors of size 32*8</td> </tr> <tr> <td><a href="https://github.com/YerevaNN/Spoken-language-identification/blob/master/theano/networks/tc_net_rnn_onernn_notimepool.py">tc_net_rnn_onernn_notimepool</a></td> <td>99.24</td> <td>Same as above, but the stride along the time axis is set to 1 in every pooling layer. Because of this the CNN outputs 32 channels of size 852x8</td> </tr> </tbody> </table> <p>The second layer of GRU in this setup didn’t help because of overfitting.</p> <p>It seems that subsampling in the time dimension is not a good idea. The information that is lost during subsampling can be better used by the RNN. In the <a href="http://arxiv.org/abs/1602.00367v1">paper on text classification</a> by Yijun Xiao and Kyunghyun Cho, the authors even suggest that maybe all pooling/subsampling layers can be replaced by recurrent layers. We didn’t experiment with this idea, but it looks very promising.</p> <p>These networks were trained using SGD with momentum only. The learning rate was set to 0.003 for around 10 epochs, then it was manually decreased to 0.001 and then to 0.0003. 
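</p> <p>We changed the learning rate by hand, but the same idea can be sketched as a piecewise-constant schedule. This is only an illustration: the function name <code class="highlighter-rouge">learning_rate</code> is hypothetical, the first boundary (epoch 10) is approximate, and the second boundary (epoch 20) is an assumed value, since we did not record exactly when each decrease happened.</p>

```python
def learning_rate(epoch, boundaries=(10, 20), rates=(0.003, 0.001, 0.0003)):
    """Piecewise-constant learning rate for a 0-based epoch index.

    boundaries: epochs at which the rate drops. The value 10 is approximate
                and 20 is an assumption made for this sketch.
    rates:      one rate per segment, so len(rates) == len(boundaries) + 1.
    """
    for boundary, rate in zip(boundaries, rates):
        if epoch < boundary:
            return rate
    # Past the last boundary: keep the smallest rate until training stops.
    return rates[-1]


# Rates that a 35-epoch run would see under these assumed boundaries.
schedule = [learning_rate(epoch) for epoch in range(35)]
```

<p>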
On average, it took 35 epochs to train these networks.</p> <h1 id="ensembling">Ensembling</h1> <p>The best single model had 99.24% accuracy on the validation set. We had 33 predictions from all these models (some models contributed more than one prediction, taken after different epochs). We simply summed up the predicted probabilities and got 99.67% accuracy. Surprisingly, our other attempts at ensembling (e.g. <a href="http://www.scholarpedia.org/article/Ensemble_learning#Voting_based_methods">majority voting</a>, ensembling only a subset of the models) didn’t give better results.</p> <h1 id="final-remarks">Final remarks</h1> <p>The number of hyperparameters in these CNN+RNN mixtures is huge. Because of the limited hardware we covered only a very small fraction of possible configurations.</p> <p>The organizers of the original contest <a href="http://apps.topcoder.com/forums//?module=Thread&amp;threadID=866217&amp;start=0&amp;mc=3">did not publicly release</a> the dataset. Nevertheless we release the full source code <a href="https://github.com/YerevaNN/Spoken-language-identification/tree/master/theano">on GitHub</a>. 
We couldn’t find many Theano/Lasagne implementations of CNN+RNN networks on GitHub, and we hope these scripts will partially fill that gap.</p> <p>This work was part of Hrayr’s bachelor’s thesis, which is available on <a href="http://www.academia.edu/25722629/%D4%BD%D5%B8%D5%BD%D6%84%D5%AB%D6%81_%D5%AC%D5%A5%D5%A6%D5%BE%D5%AB_%D5%B3%D5%A1%D5%B6%D5%A1%D5%B9%D5%B8%D6%82%D5%B4_%D5%AD%D5%B8%D6%80%D5%A8_%D5%B8%D6%82%D5%BD%D5%B8%D6%82%D6%81%D5%B4%D5%A1%D5%B6_%D5%B4%D5%A5%D5%A9%D5%B8%D5%A4%D5%B6%D5%A5%D6%80%D5%B8%D5%BE">academia.edu</a> (the text is in Armenian).</p> Playground for bAbI tasks 2016-02-23T00:00:00+00:00 http://yerevann.github.io//2016/02/23/playground-for-babi-tasks <p>Recently we have <a href="/2016/02/05/implementing-dynamic-memory-networks/">implemented</a> Dynamic memory networks in Theano and trained it on Facebook’s bAbI tasks which are designed for testing basic reasoning abilities. Our implementation now solves 8 out of 20 bAbI tasks which is still behind state-of-the-art. Today we release a <a href="http://yerevann.com/dmn-ui/">web application</a> for testing and comparing several network architectures and pretrained models.</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#attention-module" id="markdown-toc-attention-module">Attention module</a></li> <li><a href="#architecture-extensions" id="markdown-toc-architecture-extensions">Architecture extensions</a></li> <li><a href="#results" id="markdown-toc-results">Results</a></li> <li><a href="#visualizing-dynamic-memory-networks" id="markdown-toc-visualizing-dynamic-memory-networks">Visualizing Dynamic memory networks</a></li> <li><a href="#looking-for-feedback" id="markdown-toc-looking-for-feedback">Looking for feedback</a></li> </ul> <h2 id="attention-module">Attention module</h2> <p>One of the key parts in the DMN architecture, as described in the <a href="http://arxiv.org/abs/1506.07285">original paper</a>, is its attention system. 
DMN obtains internal representations of the input sentences and the question and passes these to the episodic memory module. The episodic memory passes over all the facts and generates <em>episodes</em>, which are finally combined into a <em>memory</em>. Each episode is created by looking at all input sentences according to some <em>attention</em>. The attention system gives a score to each of the sentences, and if the score is low for some sentence, it will be ignored when constructing the episode.</p> <p>The attention system is a simple 2-layer neural network whose input is a vector of features computed from the input sentence, the question and the current state of the memory. This vector of features is described in the paper as follows:</p> <p><img src="/public/2016-02-23/attention-vector.png" alt="attention module input" title="attention module input" /></p> <p>where <code class="highlighter-rouge">c</code> is an input sentence, <code class="highlighter-rouge">q</code> is the question, <code class="highlighter-rouge">m</code> is the current state of the memory. We tried to stay as close to the original as possible in our first implementation, but we probably understood these expressions too literally. We implemented <code class="highlighter-rouge">|c-q|</code> as an <a href="https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano/blob/master/dmn_basic.py#L217">absolute value</a> of a difference of two vectors, which caused lots of trouble, as Theano’s implementation of (the gradient of) the <code class="highlighter-rouge">abs</code> function gave <code class="highlighter-rouge">NaN</code>s at random during training. 
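</p> <p>To make the literal reading concrete, here is a plain-Python sketch (not our Theano code). It assumes the feature vector concatenates <code class="highlighter-rouge">c</code>, <code class="highlighter-rouge">m</code>, <code class="highlighter-rouge">q</code>, the element-wise products <code class="highlighter-rouge">c*q</code> and <code class="highlighter-rouge">c*m</code>, the element-wise absolute differences <code class="highlighter-rouge">|c-q|</code> and <code class="highlighter-rouge">|c-m|</code>, and the two bilinear terms <code class="highlighter-rouge">cWq</code> and <code class="highlighter-rouge">cWm</code>, which is how we read the formula. The helper names are hypothetical.</p>

```python
def dot(u, v):
    """Inner product of two equally long lists of numbers."""
    return sum(a * b for a, b in zip(u, v))


def bilinear(u, W, v):
    """u^T W v for a square matrix W given as a list of rows; a single number."""
    return dot(u, [dot(row, v) for row in W])


def attention_features(c, m, q, W):
    """Concatenate the assumed features into one flat list of length 7n + 2."""
    prod_cq = [a * b for a, b in zip(c, q)]
    prod_cm = [a * b for a, b in zip(c, m)]
    abs_cq = [abs(a - b) for a, b in zip(c, q)]
    abs_cm = [abs(a - b) for a, b in zip(c, m)]
    scalars = [bilinear(c, W, q), bilinear(c, W, m)]
    return c + m + q + prod_cq + prod_cm + abs_cq + abs_cm + scalars


# Toy 3-dimensional vectors; W is the identity, so c^T W q equals dot(c, q).
c, m, q = [1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [1.0, 0.0, -1.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
features = attention_features(c, m, q, W)
```

<p>With n-dimensional vectors the result has 7n + 2 entries, of which only the last two come from the bilinear terms.</p> <p>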
Then, the terms <code class="highlighter-rouge">cWq</code> and <code class="highlighter-rouge">cWm</code> actually produce <a href="https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano/blob/master/dmn_basic.py#L215">just two numbers</a>, and they do not affect anything in a large vector.</p> <p>Later we implemented another version called <a href="https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano/blob/master/dmn_smooth.py#L223"><code class="highlighter-rouge">dmn_smooth</code></a> which uses Euclidean distance between two vectors (instead of <code class="highlighter-rouge">abs</code>). This version is much more stable and gives better results. It is interesting to note that this version trains faster on CPU than on our GPU (GTX 980). It could be because of our not so optimal code or some <a href="https://github.com/Theano/Theano/issues/1168">issue</a> in Theano’s <code class="highlighter-rouge">scan</code> function.</p> <h2 id="architecture-extensions">Architecture extensions</h2> <p>The only significant difference between our implementation and the original DMN, as we understand it, is the fixed number of episodes. In the paper the authors describe a stop condition, so that the network decides if it needs to compute more episodes. We did not implement it yet.</p> <p>Our implementations heavily overfit on many tasks. We tried several techniques to fight that, but with little luck. First, we have implemented a version of <code class="highlighter-rouge">dmn_smooth</code> which supports <a href="https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano/blob/master/dmn_batch.py">mini-batch training</a>. Then we applied <a href="https://en.wikipedia.org/wiki/Dropout_(neural_networks)">dropout</a> and <a href="http://arxiv.org/abs/1502.03167">batch normalization</a> on top of the memory module (before passing to the answer module). 
All of these tricks help for some tasks for some hyperparameters, but still we could not beat the results obtained using simple <code class="highlighter-rouge">dmn_smooth</code> trained without mini-batches.</p> <p>We plan to bring some ideas from the <a href="http://arxiv.org/abs/1508.05508">Neural Reasoner paper</a>, especially the idea of recovering the input sentences based on the outputs of the input module.</p> <h2 id="results">Results</h2> <p>We train our implementations on bAbI tasks in a weakly supervised setting, as described in our <a href="http://yerevann.github.io/2016/02/05/implementing-dynamic-memory-networks/#memory-networks">previous post</a>. Here we compare our results to <a href="http://arxiv.org/abs/1410.3916">End-to-end memory networks</a> (MemN2N).</p> <p>So far our best results are obtained by training <code class="highlighter-rouge">dmn_smooth</code> with 100 neurons for internal representations, 5 memory hops, using simple gradient descent for 11 epochs. We train jointly on all 20 bAbI tasks.</p> <table> <thead> <tr> <th>Task</th> <th>MemN2N best version</th> <th>Joint100 75.05%</th> </tr> </thead> <tbody> <tr> <td>1. Single supporting fact</td> <td><strong>99.9%</strong></td> <td><strong>100%</strong></td> </tr> <tr> <td>2. Two supporting facts</td> <td>81.2%</td> <td>39.7%</td> </tr> <tr> <td>3. Three supporting facts</td> <td>68.3%</td> <td>41.5%</td> </tr> <tr> <td>4. Two argument relations</td> <td>82.5%</td> <td>75.5%</td> </tr> <tr> <td>5. Three argument relations</td> <td>87.1%</td> <td>50.1%</td> </tr> <tr> <td>6. Yes/no questions</td> <td><strong>98%</strong></td> <td><strong>97.7%</strong></td> </tr> <tr> <td>7. Counting</td> <td>89.9%</td> <td>91.4%</td> </tr> <tr> <td>8. Lists/sets</td> <td>93.9%</td> <td><strong>95.2%</strong></td> </tr> <tr> <td>9. Simple negation</td> <td><strong>98.5%</strong></td> <td><strong>99%</strong></td> </tr> <tr> <td>10. 
Indefinite knowledge</td> <td><strong>97.4%</strong></td> <td>87.3%</td> </tr> <tr> <td>11. Basic coreference</td> <td><strong>96.7%</strong></td> <td><strong>100%</strong></td> </tr> <tr> <td>12. Conjunction</td> <td><strong>100%</strong></td> <td>87%</td> </tr> <tr> <td>13. Compound coreference</td> <td><strong>99.5%</strong></td> <td><strong>96.4%</strong></td> </tr> <tr> <td>14. Time reasoning</td> <td><strong>98%</strong></td> <td>73.1%</td> </tr> <tr> <td>15. Basic deduction</td> <td><strong>98.2%</strong></td> <td>53.9%</td> </tr> <tr> <td>16. Basic induction</td> <td>49%</td> <td>49.5%</td> </tr> <tr> <td>17. Positional reasoning</td> <td>57.4%</td> <td>59.3%</td> </tr> <tr> <td>18. Size reasoning</td> <td>90.8%</td> <td><strong>98.3%</strong></td> </tr> <tr> <td>19. Path finding</td> <td>9.4%</td> <td>9%</td> </tr> <tr> <td>20. Agent’s motivations</td> <td><strong>99.8%</strong></td> <td><strong>97.1%</strong></td> </tr> <tr> <td><strong>Average accuracy</strong></td> <td><strong>84.775%</strong></td> <td><strong>75.05%</strong></td> </tr> <tr> <td><strong>Solved tasks</strong></td> <td><strong>10</strong></td> <td><strong>8</strong></td> </tr> </tbody> </table> <p>We solve (obtain &gt;95% accuracy) 8 tasks. Our system outperforms MemN2N on some tasks, but on average stays behind by 10 percentage points. Experiments show that our networks do not manage to find connections between several sentences at once (tasks 2, 3 etc.). Task 19 (path finding) remains the most difficult one. It is actually the only task on which none of our implementations overfit. The authors of <a href="http://arxiv.org/abs/1508.05508">Neural Reasoner</a> claim some success on that task when training on 10 000 examples. We use only 1000 samples per task for all experiments.</p> <h2 id="visualizing-dynamic-memory-networks">Visualizing Dynamic memory networks</h2> <p>We have created a web application / playground for Dynamic memory networks focused on bAbI tasks. 
It lets you choose a pretrained model and send custom input sentences and questions. The app shows the predicted answer and visualizes attention scores for each memory step.</p> <table> <thead> <tr> <th><img src="/public/2016-02-23/dmn-ui.png" alt="Playground for bAbI tasks" title="Playground for bAbI tasks" /></th> </tr> </thead> <tbody> <tr> <td>Web-based <a href="http://yerevann.com/dmn-ui/">playground for bAbI tasks</a></td> </tr> </tbody> </table> <p>These visualizations show that the network does not significantly change its attention for different episodes, so it is very hard to correctly answer the questions from tasks 2 or 3.</p> <p>The web app is accessible at <strong><a href="http://yerevann.com/dmn-ui/">http://yerevann.com/dmn-ui/</a></strong>. Note that the vocabulary of bAbI tasks is quite limited, and our implementation of DMN cannot process out-of-vocabulary words. The <code class="highlighter-rouge">Sample</code> button is a good starting point: it gives a random sample from the bAbI test set.</p> <h2 id="looking-for-feedback">Looking for feedback</h2> <p>Everything described in this post is available on GitHub. DMN implementations are <a href="https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano">here</a>, the Flask-based RESTful server of the web app is in the <a href="https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano/tree/master/server">/server/ folder</a>, and the UI is in <a href="https://github.com/YerevaNN/dmn-ui">another repository</a>. Feel free to fork, report issues, and please share your thoughts.</p> Implementing Dynamic memory networks 2016-02-05T00:00:00+00:00 http://yerevann.github.io//2016/02/05/implementing-dynamic-memory-networks <p>The Allen Institute for Artificial Intelligence has organized a 4-month <a href="https://www.kaggle.com/c/the-allen-ai-science-challenge">contest</a> on Kaggle on question answering. 
The aim is to create a system which can correctly answer the questions from the 8th grade science exams of US schools (biology, chemistry, physics etc.). DeepHack Lab organized a <a href="http://qa.deephack.me/">scientific school + hackathon</a> devoted to this contest in Moscow. Our team decided to use this opportunity to explore deep learning techniques for question answering (although they seem to be far behind traditional systems). We tried to implement Dynamic memory networks described <a href="http://arxiv.org/abs/1506.07285">in a paper by A. Kumar et al</a>. Here we report some preliminary results. In the next blog post we will describe the techniques we used to get to top 5% in the contest.</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#babi-tasks" id="markdown-toc-babi-tasks">bAbI tasks</a></li> <li><a href="#memory-networks" id="markdown-toc-memory-networks">Memory networks</a></li> <li><a href="#dynamic-memory-networks" id="markdown-toc-dynamic-memory-networks">Dynamic memory networks</a> <ul> <li><a href="#semantic-memory" id="markdown-toc-semantic-memory">Semantic memory</a></li> <li><a href="#input-module" id="markdown-toc-input-module">Input module</a></li> <li><a href="#episodic-memory" id="markdown-toc-episodic-memory">Episodic memory</a></li> </ul> </li> <li><a href="#initial-experiments" id="markdown-toc-initial-experiments">Initial experiments</a></li> <li><a href="#next-steps" id="markdown-toc-next-steps">Next steps</a></li> </ul> <h2 id="babi-tasks">bAbI tasks</h2> <p>The questions of this contest are quite hard: they require not only lots of knowledge in natural sciences, but also the ability to make inferences, generalize concepts, apply general ideas to examples and so on. The methods based on deep learning do not seem to be mature enough to handle all of these difficulties. On the other hand these questions have 4 answer candidates. 
That’s why, as was noted by <a href="https://www.youtube.com/watch?v=lM2-Mi-2egM">Dr. Vorontsov</a>, a simple search engine indexed on lots of documents will perform better as a question answering system than any “intelligent” system.</p> <p>But there is already some work on creating question answering / reasoning systems using neural approaches. As another lecturer of the DeepHack event, <a href="https://www.youtube.com/watch?v=gi4Zf59_IcU">Tomas Mikolov</a>, told us, we should start from easy, even synthetic questions and try to gradually increase the difficulty. This roadmap towards building intelligent question answering systems is described in <a href="http://arxiv.org/abs/1502.05698">a paper</a> by Facebook researchers Weston, Bordes, Chopra, Rush, Merriënboer and Mikolov, where the authors introduce a benchmark of toy questions called <a href="http://fb.ai/babi">bAbI tasks</a> which test several basic reasoning capabilities of a QA system.</p> <p>Questions in the bAbI dataset are grouped into 20 types, each with 1000 samples for training and another 1000 samples for testing. A system is said to have passed a given task if it correctly answers at least 95% of the questions in the test set. There is also a version with 10K samples, but as Mikolov said during the lecture, deep learning is not necessarily about large datasets, and in this setting it is more interesting to see if the systems can learn to answer questions by looking at a few training samples.</p> <table> <thead> <tr> <th><img src="/public/2016-02-06/babi1.png" alt="some of the bAbI tasks" title="some of the bAbI tasks" /></th> </tr> </thead> <tbody> <tr> <td><img src="/public/2016-02-06/babi2.png" alt="some of the bAbI tasks" title="some of the bAbI tasks" /></td> </tr> </tbody> <tbody> <tr> <td>Some of the bAbI tasks. 
More examples can be found in the <a href="http://arxiv.org/pdf/1502.05698v10.pdf">paper</a>.</td> </tr> </tbody> </table> <h2 id="memory-networks">Memory networks</h2> <p>The bAbI tasks were first evaluated on an LSTM-based system, which achieves 50% performance on average and does not pass any task. Then the authors of the paper try <a href="http://arxiv.org/abs/1410.3916">Memory Networks</a> by Weston et al. It is a recurrent network which has a long-term memory component where it can learn to write some data (the input sentences) and read them later.</p> <p>bAbI tasks include not only the answers to the questions but also the numbers of those sentences which help answer the question. This information is taken into account when training MemNN: the networks receive not only the correct answers but also information about which input sentences affect the answer. Under this so-called <em>strongly supervised</em> setting “plain” Memory networks pass 7 of the 20 tasks. Then the authors apply some modifications to them and pass 16 tasks.</p> <table> <thead> <tr> <th><img src="/public/2016-02-06/memn2n.png" alt="End-to-end memory networks" title="End-to-end memory networks" /></th> </tr> </thead> <tbody> <tr> <td>The structure of MemN2N from the <a href="http://arxiv.org/abs/1410.3916">paper</a>.</td> </tr> </tbody> </table> <p>We are mostly interested in the <em>weakly supervised</em> setting, because the additional information on important sentences is not available in many real scenarios. This was investigated in a paper by Sukhbaatar, Szlam, Weston and Fergus (from New York University and Facebook AI Research) where they introduce <a href="http://arxiv.org/abs/1503.08895">End-to-end memory networks</a> (MemN2N). They investigate many different configurations of these systems and the best version passes 9 tasks out of 20. 
Facebook’s MemN2N repository on GitHub lists <a href="https://github.com/facebook/MemNN">some implementations of MemN2N</a>.</p> <h2 id="dynamic-memory-networks">Dynamic memory networks</h2> <p>Another advancement in the direction of memory networks was made by Kumar, Irsoy, Ondruska, Iyyer, Bradbury, Gulrajani and Socher from Metamind. By the way, Richard Socher is the author of <a href="http://cs224d.stanford.edu/">an excellent course on deep learning and NLP</a> at Stanford, which helped us a lot to get into the topic. Their <a href="http://arxiv.org/abs/1506.07285">paper</a> introduces a new system called Dynamic memory networks (DMN) which passes 18 bAbI tasks in the strongly supervised setting. The paper does not talk about the weakly supervised setting, so we decided to implement DMN from scratch in <a href="http://deeplearning.net/software/theano/">Theano</a>.</p> <table> <thead> <tr> <th><img src="/public/2016-02-06/dmn-high-level.png" alt="High-level structure of DMN" title="High-level structure of DMN" /></th> </tr> </thead> <tbody> <tr> <td>High-level structure of DMN from the <a href="http://arxiv.org/abs/1506.07285">paper</a>.</td> </tr> </tbody> </table> <h3 id="semantic-memory">Semantic memory</h3> <p>The input of the DMN is a sequence of word vectors of input sentences. We followed the paper and used pretrained <a href="http://nlp.stanford.edu/projects/glove/">GloVe vectors</a> and added the dimensionality of word vectors to the list of hyperparameters (controlled by the command line argument <code class="highlighter-rouge">--word_vector_size</code>). The DMN architecture treats these vectors as part of a so-called <em>semantic memory</em> (in contrast to the <em>episodic memory</em>) which may contain other knowledge as well. 
Our implementation uses only word vectors and does <em>not</em> fine tune them during the training, so we don’t consider it as a part of the neural network.</p> <h3 id="input-module">Input module</h3> <p>The first module of DMN is an <em>input module</em> that is a <a href="http://arxiv.org/abs/1412.3555">gated recurrent unit</a> (GRU) running on the sequence of word vectors. GRU is a recurrent unit with 2 gates that control when its content is updated and when its content is erased. The hidden state of the input module is meant to represent the input processed so far in a vector. Input module outputs its hidden states either after every word (<code class="highlighter-rouge">--input_mask word</code>) or after every sentence (<code class="highlighter-rouge">--input_mask sentence</code>). These outputs are called <code class="highlighter-rouge">facts</code>.</p> <table> <thead> <tr> <th><img src="/public/2016-02-06/gru.png" alt="Formal definition of GRU" title="Formal definition of GRU" /></th> </tr> </thead> <tbody> <tr> <td>Formal definition of GRU. <code class="highlighter-rouge">z</code> is the <em>update gate</em> and <code class="highlighter-rouge">r</code> is the <em>reset gate</em>. More details and images can be found <a href="http://deeplearning4j.org/lstm.html">here</a>.</td> </tr> </tbody> </table> <p>Then there is a <em>question module</em> that processes the question word by word and outputs one vector at the end. This is done by using the same GRU as in the input module using the same weights.</p> <h3 id="episodic-memory">Episodic memory</h3> <p>The fact and question vectors extracted from the input enter the <em>episodic memory</em> module. Episodic memory is basically a composition of two nested GRUs. The outer GRU generates the final memory vector working over a sequence of so called <em>episodes</em>. This GRU state is initialized by the question vector. 
The inner GRU generates the episodes.</p> <table> <thead> <tr> <th><img src="/public/2016-02-06/dmn-details.png" alt="Details of DMN architecture" title="Details of DMN architecture" /></th> </tr> </thead> <tbody> <tr> <td>Details of DMN architecture from the <a href="http://arxiv.org/abs/1506.07285">paper</a>.</td> </tr> </tbody> </table> <p>The inner GRU generates the episodes by passing over the facts from the input module. But when updating its inner state, the GRU takes into account the output of some <code class="highlighter-rouge">attention function</code> on the current fact. The attention function gives a score (between 0 and 1) to each of the facts, and the GRU (softly) ignores the facts having low scores. The attention function is a simple 2 layer neural network depending on the question vector, current fact, and current state of the memory. After each full pass on all facts the inner GRU outputs an <em>episode</em> which is fed into the outer GRU, which in turn updates the memory. Because the memory has changed, the attention may now give different scores to the facts, so new episodes can be created. The number of steps of the outer GRU, that is the number of the episodes, can be determined dynamically, but we fix it to simplify the implementation. It is configured by the <code class="highlighter-rouge">--memory_hops</code> setting.</p> <p>All facts, episodes and memories are in the same n-dimensional space, which is controlled by the command line argument <code class="highlighter-rouge">--dim</code>. Inner and outer GRUs share their weights.</p> <h3 id="answer-module">Answer module</h3> <p>The final state of the memory is fed into the <em>answer module</em>, which produces the answer. We have implemented two kinds of answer modules. The first is a simple linear layer on top of the memory vector with softmax activation (<code class="highlighter-rouge">--answer_module feedforward</code>). This is useful if each answer is just one word (like in the bAbI dataset). 
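</p> <p>The feedforward variant can be sketched in a few lines of plain Python. This is only an illustration of "linear layer plus softmax", not the Theano code from our repository: the function name <code class="highlighter-rouge">feedforward_answer</code>, the bias term and the toy numbers are assumptions made for the example.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import math

def feedforward_answer(memory, weights, bias):
    """Return a probability for every vocabulary word given the memory vector."""
    # Linear layer: one score per vocabulary word (one weight row per word).
    scores = [sum(w * m for w, m in zip(row, memory)) + b
              for row, b in zip(weights, bias)]
    # Softmax: subtract the maximum for numerical stability, then normalize.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example (hypothetical values): 2-dimensional memory, 3-word vocabulary.
memory = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

probs = feedforward_answer(memory, weights, bias)
answer = probs.index(max(probs))  # index of the most probable word
print(answer)
</code></pre></figure>

<p>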
The second kind of answer module is another GRU that can produce multiple words (<code class="highlighter-rouge">--answer_module recurrent</code>). Its implementation is half baked now, as we didn’t need it for bAbI.</p> <p>The whole system is end-to-end differentiable and is trained using stochastic gradient descent. We use <a href="http://arxiv.org/abs/1212.5701"><code class="highlighter-rouge">adadelta</code></a> by default. More formulas and details of architecture can be found in the original paper. But the paper does not contain many implementation details, so we may have diverged from the original implementation.</p> <h2 id="initial-experiments">Initial experiments</h2> <p>We have tested this system on bAbI tasks with a few randomly selected hyperparameters. We initialized the word vectors by using 50-dimensional GloVe vectors trained on Wikipedia. Answer module is a simple feedforward classifier over the vocabulary (which is <em>very</em> limited in bAbI tasks). Here are the results.</p> <table> <thead> <tr> <th><img src="/public/2016-02-06/results.png" alt="Results" title="Results" /></th> </tr> </thead> <tbody> <tr> <td>First two columns are for strongly supervised systems <a href="http://arxiv.org/abs/1410.3916">MemNN</a> and <a href="http://arxiv.org/abs/1506.07285">DMN</a>. Third column is the best results of <a href="http://arxiv.org/abs/1410.3916">MemN2N</a>. The last 3 columns are our results with different dimensions of the memory.</td> </tr> </tbody> </table> <p>First basic observation is that weakly supervised systems are generally worse than the strongly supervised ones. When compared to MemN2N, our system performs much worse on the tasks 2, 3 and 16. As a result we pass only 7 tasks out of 20. On the other hand, our results on tasks 5, 6, 8, 9, 10 and 18 are better than MemN2N. 
Surprisingly what we got on the 17th task is better than in strongly supervised systems!</p> <p>Our system converges very fast on some of the tasks (like the first one), overfits on many other tasks and does not converge on tasks 2, 3 and 19.</p> <p>19th task (path finding) is not solved by any of these systems. Wojciech Zaremba from OpenAI informed us <a href="https://www.youtube.com/watch?v=ezE-13X0UoM">during his lecture</a> about one system which managed to solve it using 10K training samples. This remains a very interesting challenge for us. We need to carefully experiment with various parameters to reach some meaningful conclusions.</p> <p>We have tried to test on the full shuffled list of 20000 bAbI tasks. We couldn’t reach 60% average accuracy after 50 hours of training on an Amazon instance, while MemN2N authors report 87.6% accuracy.</p> <p>This implementation of DMN is available on <a href="https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano">Github</a>. We really need lots of feedback on this code.</p> <h2 id="next-steps">Next steps</h2> <ul> <li>We need a good way to visualize the attention in the episodic memory. This will help us understand what is exactly going on inside the system. Many papers now include such visualizations on some examples.</li> <li>Our model overfits on many of the tasks even with 25-dimensional memory. We briefly experimented with L2 regularization but it didn’t help much (<code class="highlighter-rouge">--l2</code>).</li> <li>Currently we are working on a slightly modified architecture which will be optimized for multiple choice questions. Basically it will include one more input module which will read the answer choices and will provide another input for the attention mechanism.</li> <li>Then we will be able to evaluate our code on more complex QA datasets like <a href="http://research.microsoft.com/en-us/um/redmond/projects/mctest/">MCTest</a>.</li> <li>Training with batches is not properly implemented yet. 
There are several technical challenges related to the variable length of input sequences. This becomes much harder to keep under control because of <a href="https://github.com/Theano/Theano/issues/1772">bugs</a> of this kind in Theano.</li> </ul> <p>We would like to thank the organizers of DeepHack.Q&amp;A for the really amazing atmosphere here in <a href="https://mipt.ru/">PhysTech</a>.</p> Generating Constitution with recurrent neural networks 2015-11-12T00:00:00+00:00 http://yerevann.github.io//2015/11/12/generating-constitution-with-recurrent-neural-networks <p>By <a href="https://github.com/hnhnarek">Narek Hovsepyan</a> and <a href="https://github.com/Hrant-Khachatrian">Hrant Khachatrian</a></p> <p>A few months ago <a href="http://cs.stanford.edu/people/karpathy/">Andrej Karpathy</a> wrote a <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">great blog post</a> about recurrent neural networks. He explained how these networks work and implemented a character-level RNN language model which learns to generate Paul Graham essays, <a href="http://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt">Shakespeare works</a>, <a href="http://cs.stanford.edu/people/karpathy/char-rnn/wiki.txt">Wikipedia articles</a>, <a href="http://cs.stanford.edu/people/jcjohns/fake-math/4.pdf">LaTeX articles</a> and even C++ code. He also released the code of the network on <a href="https://github.com/karpathy/char-rnn">Github</a>. Lots of people did experiments, like generating <a href="https://gist.github.com/nylki/1efbaa36635956d35bcc">recipes</a>, <a href="http://cpury.github.io/learning-holiness/">Bible</a> or <a href="https://soundcloud.com/seaandsailor/sets/char-rnn-composes-irish-folk-music">Irish folk music</a>. 
We decided to test it on some legal texts in Armenian.</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#character-level-rnn-language-model" id="markdown-toc-character-level-rnn-language-model">Character-level RNN language model</a></li> <li><a href="#data" id="markdown-toc-data">Data</a></li> <li><a href="#network-parameters" id="markdown-toc-network-parameters">Network parameters</a></li> <li><a href="#analysis" id="markdown-toc-analysis">Analysis</a></li> <li><a href="#generated-samples" id="markdown-toc-generated-samples">Generated samples</a></li> <li><a href="#nanogenmo" id="markdown-toc-nanogenmo">NaNoGenMo</a></li> </ul> <h2 id="character-level-rnn-language-model">Character-level RNN language model</h2> <p>Andrej did a great job explaining how the recurrent networks learn and even visualized how they work on text input in <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">his blog</a>. The program, called <code class="highlighter-rouge">char-rnn</code>, treats the input as a sequence of characters and has no prior knowledge about them. For example, it doesn’t know that the text is in English, that there are words and there are sentences, that the space character has a special meaning and so on. After some training it manages to figure out that some character combinations appear more often than the others, learns to predict English words, uses proper punctuation, and even understands that open parentheses must be closed. When trained on Wikipedia articles it can generate text in MediaWiki format without syntax errors, although the text has little or no meaning.</p> <h2 id="data">Data</h2> <p>We decided to test Karpathy’s RNN on Armenian text. Armenian language has a <a href="https://en.wikipedia.org/wiki/Armenian_alphabet">unique alphabet</a>, and the characters are encoded in the Unicode space by the codes <a href="http://www.unicode.org/charts/PDF/U0530.pdf">U+0530 - U+058F</a>. 
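</p> <p>The byte layout of this block can be inspected with a few lines of Python. This is a small check added for illustration, not part of the original experiments. It loops over the whole U+0530 - U+058F range, including code points that have no assigned character.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"># Encode every code point of the Armenian block and record the byte patterns.
lengths = set()
first_bytes = set()
for code_point in range(0x0530, 0x0590):
    encoded = chr(code_point).encode("utf-8")
    lengths.add(len(encoded))          # number of bytes per symbol
    first_bytes.add(hex(encoded[0]))   # leading byte of the sequence

print(sorted(lengths))      # [2]
print(sorted(first_bytes))  # ['0xd4', '0xd5', '0xd6']
</code></pre></figure>

<p>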
In UTF-8 these symbols use two bytes where the first byte is always <code class="highlighter-rouge">0xD4</code>, <code class="highlighter-rouge">0xD5</code> or <code class="highlighter-rouge">0xD6</code>. So the neural net has to look at almost 2 times larger distances (when compared to English) in order to be able to learn the words. Also, the Armenian alphabet contains 39 letters, 50% more than Latin.</p> <p>Recently the main political topic in Armenia has been the Constitutional reform. This helped us to choose the corpus for training. We took all three versions of the Constitution of Armenia (the <a href="http://www.arlis.am/documentview.aspx?docID=1">first version</a> voted in 1995, the <a href="http://www.arlis.am/documentview.aspx?docID=75780">updated version</a> of 2005, and the <a href="http://moj.am/storage/uploads/Sahmanadrakan_1-15.docx">new proposal</a> which will be voted on later this year) and concatenated them in a <a href="https://github.com/YerevaNN/char-rnn-constitution/blob/master/data/input.txt">single text file</a>. The size of the corpus is just 440 KB, which is roughly 224 000 Unicode symbols (all non-Armenian symbols, including spaces and numbers, use 1 byte). Andrej suggests using at least 1MB of data, so our corpus is very small. On the other hand the text is quite specific, the vocabulary is very small and the structure of the text is fairly simple.</p> <p>All articles are of the following form:</p> <blockquote> <p>Հոդված 1. Հայաստանի Հանրապետությունը ինքնիշխան, ժողովրդավարական, սոցիալական, իրավական պետություն է:</p> </blockquote> <p>The first word, <code class="highlighter-rouge">Հոդված</code>, means “Article”. Sentences end with the symbol <code class="highlighter-rouge">:</code>.</p> <h2 id="network-parameters">Network parameters</h2> <p><code class="highlighter-rouge">char-rnn</code> works with basic recurrent neural networks, LSTM networks and GRU-RNNs. In our experiments we only used an LSTM network with 2 layers. 
Actually we don’t really understand how LSTM networks work in detail, but we hope to improve our understanding by watching the videos of Richard Socher’s excellent <a href="http://cs224d.stanford.edu/index.html">NLP course</a>.</p> <p>We trained the network for 50 epochs with the default learning rate parameters (base rate is <code class="highlighter-rouge">2e-3</code>, which decays by a factor of <code class="highlighter-rouge">0.97</code> every epoch after the first <code class="highlighter-rouge">10</code> epochs). We wanted to understand how the size of LSTM internal state (<code class="highlighter-rouge">rnn_size</code>), <a href="https://www.youtube.com/watch?v=UcKPdAM8cnI">dropout</a> and batch size affect the performance. We used <a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search">grid search</a> over the following values:</p> <ul> <li><code class="highlighter-rouge">rnn_size</code>: <code class="highlighter-rouge">128</code>, <code class="highlighter-rouge">256</code>, <code class="highlighter-rouge">512</code></li> <li><code class="highlighter-rouge">batch_size</code>: <code class="highlighter-rouge">25</code>, <code class="highlighter-rouge">50</code>, <code class="highlighter-rouge">100</code></li> <li><code class="highlighter-rouge">dropout</code>: <code class="highlighter-rouge">0</code>, <code class="highlighter-rouge">0.2</code>, <code class="highlighter-rouge">0.4</code> and at the end we tried <code class="highlighter-rouge">0.6</code></li> </ul> <p>After installing Lua, Torch and CUDA (as described on <a href="https://github.com/karpathy/char-rnn#requirements"><code class="highlighter-rouge">char-rnn</code> page</a>) we moved our mini-corpus to <code class="highlighter-rouge">/data/input.txt</code> and ran the <a href="https://github.com/YerevaNN/char-rnn-constitution/blob/master/run.sh"><code class="highlighter-rouge">run.sh</code> file</a>, which contains commands like this:</p> <figure class="highlight"><pre><code
class="language-bash" data-lang="bash">th train.lua <span class="nt">-data_dir</span> data/ <span class="nt">-batch_size</span> 50 <span class="nt">-dropout</span> 0.4 <span class="nt">-rnn_size</span> 512 <span class="nt">-gpuid</span> 0 <span class="nt">-savefile</span> bs50s512d0.4 | tee log_bs50s512d0.4</code></pre></figure> <p>File names encode the hyperparameters, and the output of <code class="highlighter-rouge">char-rnn</code> is logged using the <a href="https://en.wikipedia.org/wiki/Tee_(command)"><code class="highlighter-rouge">tee</code> command</a>.</p> <h2 id="analysis">Analysis</h2> <p>We adapted <a href="https://github.com/YerevaNN/Caffe-python-tools/blob/master/plot_loss.py">this script</a> written by Hrayr to plot the behavior of loss functions during the 50 epochs. The script, which runs on <code class="highlighter-rouge">char-rnn</code> output, is available on <a href="https://github.com/YerevaNN/char-rnn-constitution/blob/master/plot_loss.py">Github</a>. These graphs show, for example, that we practically do not gain anything after 25 epochs.</p> <table> <thead> <tr> <th><img src="/public/2015-11-11/plot_bs50s256all.png" alt="Training and validation loss" title="Training and validation loss" /></th> </tr> </thead> <tbody> <tr> <td>Training (blue to aqua) and validation (red to green) loss over 50 epochs. RNN size was set to 256 and the batch size was 50. In particular, this graph shows that when no dropout is used, validation loss actually increases after 20 epochs. Plotted using <a href="https://github.com/YerevaNN/char-rnn-constitution/blob/master/plot_loss.py">this script</a>.</td> </tr> </tbody> </table> <p>Experiments showed that, unsurprisingly, training loss is better (after 50 epochs) when RNN size is increased and when dropout ratio is decreased. 
Under all configurations we got the lowest train losses using batch size 50 (compared to 25 and 100) and we don’t have explanation for this.</p> <p>For validation loss, we have the following tables.</p> <table> <tbody> <tr> <td> </td> <td><strong>Dropout</strong></td> <td><strong>0</strong></td> <td><strong>0.2</strong></td> <td><strong>0.4</strong></td> <td><strong>0.6</strong></td> </tr> <tr> <td><strong>Batch size</strong></td> <td><strong>RNN Size</strong></td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td><strong>25</strong></td> <td>128</td> <td>0.5060</td> <td>0.4307</td> <td>0.4813</td> <td>0.5373</td> </tr> <tr> <td> </td> <td><code class="highlighter-rouge">- </code>256</td> <td><code class="highlighter-rouge">- </code>0.5322</td> <td><code class="highlighter-rouge">- </code>0.4185</td> <td><code class="highlighter-rouge">- </code>0.4021</td> <td><code class="highlighter-rouge">- </code>0.4261</td> </tr> <tr> <td> </td> <td><code class="highlighter-rouge">- - </code>512</td> <td><code class="highlighter-rouge">- - </code>0.5596</td> <td><code class="highlighter-rouge">- - </code>0.4495</td> <td><code class="highlighter-rouge">- - </code>0.4380</td> <td><code class="highlighter-rouge">- - </code>0.4126</td> </tr> <tr> <td><strong>50</strong></td> <td>128</td> <td>0.4883</td> <td>0.4452</td> <td>0.4813</td> <td>0.5373</td> </tr> <tr> <td> </td> <td><code class="highlighter-rouge">- </code>256</td> <td><code class="highlighter-rouge">- </code>0.5249</td> <td><code class="highlighter-rouge">- </code>0.3887</td> <td><code class="highlighter-rouge">- </code>0.3996</td> <td><code class="highlighter-rouge">- </code>0.4280</td> </tr> <tr> <td> </td> <td><code class="highlighter-rouge">- - </code>512</td> <td><code class="highlighter-rouge">- - </code>0.5340</td> <td><code class="highlighter-rouge">- - </code>0.4420</td> <td><code class="highlighter-rouge">- - </code>0.3997</td> <td><code class="highlighter-rouge">- - </code>0.3800</td> </tr> <tr> 
<td><strong>100</strong></td> <td>128</td> <td>0.5341</td> <td>0.5144</td> <td>0.5454</td> <td>0.6094</td> </tr> <tr> <td> </td> <td><code class="highlighter-rouge">- </code>256</td> <td><code class="highlighter-rouge">- </code>0.5660</td> <td><code class="highlighter-rouge">- </code>0.4464</td> <td><code class="highlighter-rouge">- </code>0.4500</td> <td><code class="highlighter-rouge">- </code>0.4723</td> </tr> <tr> <td> </td> <td><code class="highlighter-rouge">- - </code>512</td> <td><code class="highlighter-rouge">- - </code>0.6032</td> <td><code class="highlighter-rouge">- - </code>0.4804</td> <td><code class="highlighter-rouge">- - </code>0.4599</td> <td><code class="highlighter-rouge">- - </code>0.4399</td> </tr> </tbody> </table> <p>When RNN size is only 128, we notice that the best performance is achieved when dropout is 20%. Larger dropout values do not allow the network to learn enough. When RNN size is increased to 256, the optimal dropout value is somewhere between 20% and 40%. For RNN size 512, the best performance we observed used 60% dropout. We didn’t try to go any further.</p> <p>As for the batch sizes, we see the best performance on 25 if RNN size is only 128. For larger networks, batch size 50 performs better. Overall we obtained the lowest validation score, 0.38, using 60% dropout, 50 batch size and 512 RNN size.</p> <h2 id="generated-samples">Generated samples</h2> <p>When the trained models are ready, we can generate text samples by using <code class="highlighter-rouge">sample.lua</code> script included in the repository. It accepts one important parameter called <code class="highlighter-rouge">temperature</code> which determines how much the network can “fantasize”. Higher temperature gives more diversity but at a cost of making more mistakes, as Andrej explains in his blog post. 
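</p> <p>The effect of temperature can be illustrated with a minimal plain-Python sketch of the usual temperature-scaling idea. It is not code taken from <code class="highlighter-rouge">char-rnn</code>: the function name <code class="highlighter-rouge">apply_temperature</code> and the toy distribution are assumptions made for the example, and all probabilities are assumed to be positive. Log-probabilities are divided by the temperature and renormalized, so low temperatures sharpen the distribution and high temperatures flatten it.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import math

def apply_temperature(probs, temperature):
    """Rescale a distribution: divide log-probabilities by T, then renormalize."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical next-character distribution over three characters.
probs = [0.7, 0.2, 0.1]

cold = apply_temperature(probs, 0.5)  # sharper: the likely character dominates
hot = apply_temperature(probs, 2.0)   # flatter: rare characters get more mass

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
</code></pre></figure>

<p>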
The command looks like this:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">th sample.lua cv/lm_bs50s128d0_epoch50.00_0.4883.t7 <span class="nt">-length</span> 3000 <span class="nt">-temperature</span> 0.5 <span class="nt">-gpuid</span> 0 <span class="nt">-primetext</span> <span class="s2">"Հոդված"</span></code></pre></figure> <p>The <code class="highlighter-rouge">primetext</code> parameter allows us to predefine the first characters of the generated sequence. Also it makes the output fully reproducible. Here is a snippet from the <code class="highlighter-rouge">bs50s128d0</code> model, which is available <a href="https://github.com/YerevaNN/char-rnn-constitution/tree/master/models">on Github</a> (validation loss is 0.4883, sampled with 0.5 temperature).</p> <blockquote> <p>Հոդված 111. Սահմանադրական դատարանի կազմավորումը, եթե այլ չեն <em>հասատատիրի</em> <em>առնչամի</em> կարելի սահմանափակվել միայն օրենքով, եթե դա անհրաժեշտ է հանցագործությունների իրավունք: Յուրաքանչյուր ոք ունի Հայաստանի Հանրապետության քաղաքացիությունը որոշում է կայացնում դատավորին կազմավորման կարգը</p> </blockquote> <p>There are 2 nonexistent words here (marked in italics), others are fine. The sentences have no meaning, and some parts are quite unnatural, making them difficult to read.</p> <p>The network easily (even with 128 RNN size) learns to separate the articles by new line and starts them by the word <code class="highlighter-rouge">Հոդված</code> followed by some number. But even the best one doesn’t manage to use increasing numbers for consecutive articles. Actually, very often the article number starts with <code class="highlighter-rouge">1</code>, because more than one third of the articles in the corpus have numbers starting with <code class="highlighter-rouge">1</code>. It also understands some basic punctuation. 
It correctly puts commas before the word <code class="highlighter-rouge">եթե</code>, which is the Armenian word for “if”.</p> <p>With 256 RNN size and 40% dropout the result is much more readable.</p> <blockquote> <p>Հոդված 14. Պատգամավոր կարող է դնել իր վստահության հարցը: Կառավարության անդամների լիազորությունները <em>համապատասխանական</em> կազմակերպությունների կամ միավորման և գործունեության կարգը սահմանվում է օրենքով: <br /> Հոդված 107. Պատգամավորի լիազորությունները դադարեցնում է Սահմանադրությամբ և օրենքներով: Այդ իրավունքը կարող է սահմանափակվել միայն օրենքով: <br /> Հոդված 126. Հանրապետության նախագահի հրամանագրերը և կարգադրությունները կամ այլ պետությունը միասնական կառավարման մարմինների կողմից հանցագործության կատարման պահին գործող դատարանների նախագահների թեկնածությունների և առաջարկությամբ սահմանադրական դատարանի նախագահ:<br /> Հայաստանի Հանրապետության իրավունքը<br /></p> <ol> <li>Յուրաքանչյուր ոք ունի իր իրավունքների և ազատությունների պաշտպանության նպատակով:<br /></li> <li>Ազգային ժողովի նախագահի վերահսկողության կամ Սահմանադրության 190-րդ հոդվածի 1-ին կետով նախատեսված դեպքերում և կարգով ընդունված որոշումները սահմանվում են օրենքով:<br /></li> <li>Յուրաքանչյուր ոք ունի իր իրավունքների և ազատությունների սահմանափակումների հետ <em>չապահողական</em> կամ այլ դեպքերում վարչապետի նախագահների նախնական հանձնաժողովներն ստեղծվում են Սահմանադրությամբ և օրենքներով: <br /></li> <li>Յուրաքանչյուր ոք ունի իր ազգային որոշումները սահմանվում են օրենքով:<br /></li> <li>Յուրաքանչյուր ոք ունի իր իրավունքների և ազատությունների պաշտպանության նպատակով:</li> </ol> </blockquote> <p>Only 2 of the 140 words are nonexistent, but both are syntactically correct. For example there is no such word <code class="highlighter-rouge">չապահողական</code> in Armenian, but <code class="highlighter-rouge">չ</code> and <code class="highlighter-rouge">ապա</code> are prefixes, <code class="highlighter-rouge">հող</code> means “soil” and <code class="highlighter-rouge">ական</code> is a suffix. 
Sentences still do not have valid structure.</p> <p>The network learned that sometimes ordered lists appear in the articles, but couldn’t learn to properly enumerate the points. Sometimes it counts up to 2 only :) It would be interesting to see on what kind of corpora it will be able to count a bit more.</p> <p>Here is one more snippet using the best performing model <code class="highlighter-rouge">bs50s512d0.6</code> (temperature is again 0.5).</p> <blockquote> <p>Հոդված 21. Յուրաքանչյուր ոք ունի ազատ տեղաշարժվելու և բնակություն է կառավարության անդամներին: Հանրապետության Նախագահը պաշտոնն ստանձնում է Հանրապետության Նախագահը չի կարող զբաղվել ձեռնարկատիրական գործունեությամբ: <br /> Հոդված 50. Հանրապետության Նախագահը պաշտոնն ստանձնում է Հանրապետության Նախագահի պաշտոնը թափուր մնալու դեպքում Հանրապետության Նախագահի արտահերթ ընտրությունը կազմված է վարչապետի առաջարկությամբ վերահսկողությունը <br /></p> <ol> <li>Յուրաքանչյուր ոք ունի ազատ տեղաշարժվելու և բնակավայր ընտրելու իրավունք:</li> </ol> </blockquote> <p>There are virtually no invalid words anymore (less than 0.5%, and most are one character typos). Sentences are better formed. Sometimes a sentence is composed of two exact copies of different sentences that actually occur in the corpus. For example the combination <code class="highlighter-rouge">Հանրապետության Նախագահը պաշտոնն ստանձնում է</code> appears <a href="https://github.com/YerevaNN/char-rnn-constitution/blob/master/data/input.txt#L130">7 times</a> in the corpus, and <code class="highlighter-rouge">Հանրապետության Նախագահը չի կարող զբաղվել ձեռնարկատիրական գործունեությամբ</code> appears <a href="https://github.com/YerevaNN/char-rnn-constitution/blob/master/data/input.txt#L1501">once</a>. So the generated samples are often boring. Although sometimes the combination of such two parts does have a meaning. 
The following <a href="https://github.com/YerevaNN/char-rnn-constitution/blob/master/samples/sample_bs50s512d0.6t0.5.txt#L222">article</a> is a very good example, and doesn’t appear in the corpus.</p> <blockquote> <p>Հոդված 151. Հանրապետության Նախագահի հրամանագրերը և կարգադրությունները կատարում է Ազգային ժողովի նախագահը:</p> </blockquote> <p>When the temperature is increased to 0.75, the samples become more interesting.</p> <blockquote> <p>Հոդված 52. Հանրապետության Նախագահի լիազորությունները սահմանվում են Սահմանադրությամբ և սահմանադրական դատարանի դատավորների մեկ մտնում առաջին ատյանի դատարանները: <br /> Հոդված 107. Ազգային ժողովի լիազորությունների ժամկետը <em>կեղերով</em> բացասական տեղեկատվության ազատության ենթարկելու հարց հարուցելու կամ այլ գործադիր իշխանության, տեղական ինքնակառավարման մարմինների անկախության մասին.<br /> 7) եզրակացություն է տալիս իր լիազորությունների երաշխավորվում է միջազգային իրավունքի սկզբունքները և նախարարներից, ներկայացնում է Ազգային ժողովին եզրակացություններ ներկայացնելու համար:</p> </blockquote> <p>Typos are a bit more common. An “ordered list” is generated here which starts with <code class="highlighter-rouge">7</code> and has only one entry. Article numbers are not tied to <code class="highlighter-rouge">1</code>s anymore. Higher temperatures produce more nonexistent words.</p> <h2 id="nanogenmo">NaNoGenMo</h2> <p>Since 1999 every November is declared a <a href="http://nanowrimo.org/">National Novel Writing Month</a>, when people are encouraged to write a novel in one month. Since 2013, similar event is organized for algorithms. It’s called <a href="https://github.com/dariusk/NaNoGenMo-2015">National Novel Generating Month</a>. The rules are very simple, each participant must share one generated novel (at least 50 000 words) and release the source code. 
<a href="http://www.theverge.com/2014/11/25/7276157/nanogenmo-robot-author-novel">The Verge</a> wrote about last year’s results.</p> <p><a href="https://www.linkedin.com/in/armen-khachikyan-ba969218">Armen Khachikyan</a> told us about this, and we thought that we can take part in it with a long enough generated Constitution. Here is <a href="https://github.com/dariusk/NaNoGenMo-2015/issues/154">our entry</a>. It was generated by the following command:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">th sample.lua cv/lm_bs50s512d0.6_epoch50.00_0.3800.t7 <span class="nt">-length</span> 900000 <span class="nt">-temperature</span> 0.5 <span class="nt">-gpuid</span> 0 <span class="nt">-primetext</span> <span class="s2">"Գ Լ ՈՒ Խ 1"</span> <span class="o">&gt;</span> sample_bs50s512d0.6t0.5.txt</code></pre></figure> <p>The model was generated by the following command:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">th train.lua <span class="nt">-data_dir</span> data/ <span class="nt">-batch_size</span> 50 <span class="nt">-dropout</span> 0.6 <span class="nt">-rnn_size</span> 512 <span class="nt">-gpuid</span> 0 <span class="nt">-savefile</span> bs50s512d0.6 | tee log_bs50s512d0.6 </code></pre></figure> <p>All related files are in our <a href="https://github.com/YerevaNN/char-rnn-constitution">Github repository</a>.</p> Spoken language identification with deep convolutional networks 2015-10-11T00:00:00+00:00 http://yerevann.github.io//2015/10/11/spoken-language-identification-with-deep-convolutional-networks <p>By <a href="https://github.com/Harhro94">Hrayr Harutyunyan</a></p> <p>Recently <a href="https://topcoder.com/">TopCoder</a> announced a <a href="https://community.topcoder.com/longcontest/?module=ViewProblemStatement&amp;rd=16555&amp;compid=49304">contest</a> to identify the spoken language in audio recordings. I decided to test how well deep convolutional networks will perform on this kind of data. 
In short I managed to get around 95% accuracy and finished at the 10th place. This post reveals all the details.</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#dataset-and-scoring" id="markdown-toc-dataset-and-scoring">Dataset and scoring</a></li> <li><a href="#preprocessing" id="markdown-toc-preprocessing">Preprocessing</a></li> <li><a href="#network-architecture" id="markdown-toc-network-architecture">Network architecture</a></li> <li><a href="#data-augmentation" id="markdown-toc-data-augmentation">Data augmentation</a></li> <li><a href="#ensembling" id="markdown-toc-ensembling">Ensembling</a></li> <li><a href="#what-we-learned-from-this-contest" id="markdown-toc-what-we-learned-from-this-contest">What we learned from this contest</a></li> <li><a href="#unexplored-options" id="markdown-toc-unexplored-options">Unexplored options</a></li> </ul> <h2 id="dataset-and-scoring">Dataset and scoring</h2> <p>The recordings were in one of the 176 languages. Training set consisted of 66176 <code class="highlighter-rouge">mp3</code> files, 376 per language, from which I have separated 12320 recordings for validation (Python script is <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/choose_val_set.py">available on GitHub</a>). Test set consisted of 12320 <code class="highlighter-rouge">mp3</code> files. All recordings had the same length (~10 sec) and seemed to be noise-free (at least all the samples that I have checked).</p> <p>Score was calculated the following way: for every <code class="highlighter-rouge">mp3</code> top 3 guesses were uploaded in a CSV file. 1000 points were given if the first guess is correct, 400 points if the second guess is correct and 160 points if the third guess is correct. During the contest the score was calculated only on 3520 recordings from the test set. 
After the contest the final score was calculated on the remaining 8800 recordings.</p> <h2 id="preprocessing">Preprocessing</h2> <p>I entered the contest just 14 days before the deadline, so I didn’t have much time to investigate audio-specific techniques. But we had a deep convolutional network developed a few months earlier, and it seemed like a good idea to test a pure CNN on this problem. A quick Google search revealed that the idea is not new. The earliest attempt I could find was a <a href="http://research.microsoft.com/en-us/um/people/dongyu/nips2009/papers/montavon-paper.pdf">paper by G. Montavon</a> presented at the NIPS 2009 conference. The author used a network with 3 convolutional layers trained on <a href="https://en.wikipedia.org/wiki/Spectrogram">spectrograms</a> of audio recordings, and the output of the convolutional/subsampling layers was given to a <a href="https://en.wikipedia.org/wiki/Time_delay_neural_network">time-delay neural network</a>.</p> <p>I found a <a href="http://www.frank-zalkow.de/en/code-snippets/create-audio-spectrograms-with-python.html?ckattempt=1">Python script</a> which creates a spectrogram of a <code class="highlighter-rouge">wav</code> file. I used the <a href="http://www.mpg123.de/index.shtml"><code class="highlighter-rouge">mpg123</code> library</a> to convert <code class="highlighter-rouge">mp3</code> files to <code class="highlighter-rouge">wav</code> format.</p> <p>The preprocessing script is available on <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/augment_data.py">GitHub</a>.</p> <h2 id="network-architecture">Network architecture</h2> <p>I took the network architecture designed for Kaggle’s <a href="/2015/08/17/diabetic-retinopathy-detection-contest-what-we-did-wrong/">diabetic retinopathy detection contest</a>. It has 6 convolutional layers and 2 fully connected layers with 50% dropout. The activation function is always ReLU. 
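</p> <p>The spatial sizes listed in the layer table of this network follow directly from the kernel sizes and strides. The sketch below reproduces them under a few assumptions: no padding, convolution output sizes rounded down and pooling output sizes rounded up (the rounding Caffe uses for its pooling layer). The helper names <code class="highlighter-rouge">conv_out</code>, <code class="highlighter-rouge">pool_out</code> and <code class="highlighter-rouge">trace_shapes</code> are hypothetical and do not come from our repository.</p>

```python
import math


def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution (rounded down)."""
    return (size - kernel) // stride + 1


def pool_out(size, kernel=3, stride=2):
    """Output length of an unpadded pooling layer (rounded up)."""
    return math.ceil((size - kernel) / stride) + 1


def trace_shapes(width, height, conv_kernels):
    """Apply a convolution, then 3x3 / 2 max pooling, per kernel size."""
    shapes = []
    for k in conv_kernels:
        width, height = conv_out(width, k), conv_out(height, k)
        shapes.append(("conv", width, height))
        width, height = pool_out(width), pool_out(height)
        shapes.append(("pool", width, height))
    return shapes


if __name__ == "__main__":
    # Input spectrogram of 858x256 and the kernel sizes of the
    # 6 convolutional layers: 7, 5, 3, 3, 3, 3.
    for name, w, h in trace_shapes(858, 256, [7, 5, 3, 3, 3, 3]):
        print(name, w, h)
```

<p>With the 858x256 input this ends at an 11x1 map after the last pooling layer, which is what the first fully connected layer receives.</p> <p>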
Learning rates are set to be higher for the first convolutional layers and lower for the top convolutional layers. The last fully connected layer has 176 neurons and is trained using a softmax loss.</p> <p>It is important to note that this network does not take into account the sequential characteristics of the audio data. Although recurrent networks perform well on speech recognition tasks (one notable example is <a href="http://arxiv.org/abs/1303.5778">this paper</a> by A. Graves, A. Mohamed and G. Hinton, cited by 272 papers according to Google Scholar), I didn’t have time to learn how they work.</p> <p>I trained the CNN on <a href="http://caffe.berkeleyvision.org">Caffe</a> with 32 images in a batch; its description in Caffe prototxt format is available <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/prototxt/main_32r-2-64r-2-64r-2-128r-2-128r-2-256r-2-1024rd0.5-1024rd0.5_DLR.prototxt">here</a>.</p> <table> <tbody> <tr> <td>Nr</td> <td>Type</td> <td>Batches</td> <td>Channels</td> <td>Width</td> <td>Height</td> <td>Kernel size / stride</td> </tr> <tr> <td>0</td> <td>Input</td> <td>32</td> <td>1</td> <td>858</td> <td>256</td> <td> </td> </tr> <tr> <td>1</td> <td>Conv</td> <td>32</td> <td>32</td> <td>852</td> <td>250</td> <td>7x7 / 1</td> </tr> <tr> <td>2</td> <td>ReLU</td> <td>32</td> <td>32</td> <td>852</td> <td>250</td> <td> </td> </tr> <tr> <td>3</td> <td>MaxPool</td> <td>32</td> <td>32</td> <td>426</td> <td>125</td> <td>3x3 / 2</td> </tr> <tr> <td>4</td> <td>Conv</td> <td>32</td> <td>64</td> <td>422</td> <td>121</td> <td>5x5 / 1</td> </tr> <tr> <td>5</td> <td>ReLU</td> <td>32</td> <td>64</td> <td>422</td> <td>121</td> <td> </td> </tr> <tr> <td>6</td> <td>MaxPool</td> <td>32</td> <td>64</td> <td>211</td> <td>60</td> <td>3x3 / 2</td> </tr> <tr> <td>7</td> <td>Conv</td> <td>32</td> <td>64</td> <td>209</td> <td>58</td> <td>3x3 / 1</td> </tr> <tr> <td>8</td> <td>ReLU</td> <td>32</td> <td>64</td> <td>209</td> <td>58</td> <td> 
</td> </tr> <tr> <td>9</td> <td>MaxPool</td> <td>32</td> <td>64</td> <td>104</td> <td>29</td> <td>3x3 / 2</td> </tr> <tr> <td>10</td> <td>Conv</td> <td>32</td> <td>128</td> <td>102</td> <td>27</td> <td>3x3 / 1</td> </tr> <tr> <td>11</td> <td>ReLU</td> <td>32</td> <td>128</td> <td>102</td> <td>27</td> <td> </td> </tr> <tr> <td>12</td> <td>MaxPool</td> <td>32</td> <td>128</td> <td>51</td> <td>13</td> <td>3x3 / 2</td> </tr> <tr> <td>13</td> <td>Conv</td> <td>32</td> <td>128</td> <td>49</td> <td>11</td> <td>3x3 / 1</td> </tr> <tr> <td>14</td> <td>ReLU</td> <td>32</td> <td>128</td> <td>49</td> <td>11</td> <td> </td> </tr> <tr> <td>15</td> <td>MaxPool</td> <td>32</td> <td>128</td> <td>24</td> <td>5</td> <td>3x3 / 2</td> </tr> <tr> <td>16</td> <td>Conv</td> <td>32</td> <td>256</td> <td>22</td> <td>3</td> <td>3x3 / 1</td> </tr> <tr> <td>17</td> <td>ReLU</td> <td>32</td> <td>256</td> <td>22</td> <td>3</td> <td> </td> </tr> <tr> <td>18</td> <td>MaxPool</td> <td>32</td> <td>256</td> <td>11</td> <td>1</td> <td>3x3 / 2</td> </tr> <tr> <td>19</td> <td>Fully connected</td> <td>20</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>20</td> <td>ReLU</td> <td>20</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>21</td> <td>Dropout</td> <td>20</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>22</td> <td>Fully connected</td> <td>20</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>23</td> <td>ReLU</td> <td>20</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>24</td> <td>Dropout</td> <td>20</td> <td>1024</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>25</td> <td>Fully connected</td> <td>20</td> <td>176</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>26</td> <td>Softmax Loss</td> <td>1</td> <td>176</td> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <p><a href="https://github.com/Hrant-Khachatrian">Hrant</a> suggested trying the <a href="http://arxiv.org/abs/1212.5701"><code 
class="highlighter-rouge">ADADELTA</code> solver</a>. It is a method that dynamically computes a learning rate for every network parameter, and the training process is said to be independent of the initial choice of learning rate. It was recently <a href="https://github.com/BVLC/caffe/pull/2782">implemented in Caffe</a>.</p> <p>In practice, the base learning rate set in the Caffe solver did matter. At first I tried a learning rate of <code class="highlighter-rouge">1.0</code>, and the network didn’t learn at all. Setting the base learning rate to <code class="highlighter-rouge">0.01</code> helped a lot and I trained the network for 90 000 iterations (more than 50 epochs). Then I switched to a base learning rate of <code class="highlighter-rouge">0.001</code> for another 60 000 iterations. The solver is available <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/prototxt/solver.main.adadelta.prototxt">here</a>. I am not sure why the base learning rate mattered so much in the early stages of training. One possible reason could be the large learning rate coefficients on the lower convolutional layers. Both tricks (dynamically updating the learning rates in <code class="highlighter-rouge">ADADELTA</code> and large learning rate coefficients) aim to fight the vanishing gradient problem, and maybe their combination is not a very good idea. This should be carefully analysed.</p> <table> <thead> <tr> <th><img src="/public/2015-10-11/no-augm-loss.jpg" alt="Training (blue) and validation (red) loss" title="Training (blue) and validation (red) loss" /></th> </tr> </thead> <tbody> <tr> <td>Training (blue) and validation (red) loss over the 150 000 iterations on the non-augmented dataset. The sudden drop in training loss corresponds to the point when the base learning rate was changed from <code class="highlighter-rouge">0.01</code> to <code class="highlighter-rouge">0.001</code>. 
Plotted using <a href="https://github.com/YerevaNN/Caffe-python-tools/blob/master/plot_loss.py">this script</a>.</td> </tr> </tbody> </table> <p>The signs of overfitting were getting more and more visible and I stopped at 150 000 iterations. The softmax loss reached 0.43, which corresponded to a score of 3 180 000 (out of a possible 3 520 000). Some ensembling with other models of the same network yielded a slightly higher score (3 220 000), but it was obvious that data augmentation was needed to overcome the overfitting problem.</p> <h2 id="data-augmentation">Data augmentation</h2> <p>The most important weakness of our team in the <a href="/2015/08/17/diabetic-retinopathy-detection-contest-what-we-did-wrong/">previous contest</a> was that we didn’t augment the dataset well enough. So I was looking for ways to augment the set of spectrograms. One obvious idea was to crop random, say, 9-second intervals of the recordings. Hrant suggested another idea: to warp the frequency axis of the spectrogram. This process is known as <em>vocal tract length perturbation</em>, and has been used for speaker normalization at least <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;arnumber=650310&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel4%2F89%2F14168%2F00650310">since 1998</a>. In 2013 <a href="https://www.cs.toronto.edu/~hinton/absps/perturb.pdf">N. Jaitly and G. Hinton</a> used this technique to augment an audio dataset. 
I <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/augment_data.py#L32">used this formula</a> to linearly scale the frequency bins during spectrogram generation:</p> <table> <thead> <tr> <th><img src="/public/2015-10-11/frequency-warp-formula.png" alt="Frequency warping formula" title="Frequency warping formula" /></th> </tr> </thead> <tbody> <tr> <td>Frequency warping formula from the <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;arnumber=650310&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel4%2F89%2F14168%2F00650310">paper by L. Lee and R. Rose</a>. α is the scaling factor. Following Jaitly and Hinton, I <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/augment_data.py#L92">chose it uniformly</a> between 0.9 and 1.1.</td> </tr> </tbody> </table> <p>I also <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/augment_data.py#L77">randomly cropped</a> the spectrograms to a size of <code class="highlighter-rouge">768x256</code>. Here are the results:</p> <table> <tbody> <tr> <td><img src="/public/2015-10-11/spectrogram.jpg" alt="Spectrogram without modifications" title="Spectrogram without modifications" /></td> </tr> <tr> <td>Spectrogram of one of the recordings</td> </tr> <tr> <td><img src="/public/2015-10-11/spectrogram-warped-cropped.jpg" alt="Cropped spectrogram with warped frequency axis" title="Cropped spectrogram with warped frequency axis" /></td> </tr> <tr> <td>Cropped spectrogram of the same recording with warped frequency axis</td> </tr> </tbody> </table> <p>For each <code class="highlighter-rouge">mp3</code> I created 20 random spectrograms, but trained the network on only 10 of them. It took more than 2 days to create the augmented dataset and convert it to LevelDB format (the format Caffe suggests). But training the network proved to be even harder. For 3 days I couldn’t significantly decrease the training loss. 
After removing the dropout layers the loss started to decrease, but it would have taken weeks to reach reasonable levels. Finally, Hrant suggested reusing the weights of the model trained on the non-augmented dataset. The problem was that due to the cropping, the image sizes in the two datasets were different. But it turned out that convolutional and pooling layers in Caffe <a href="https://github.com/BVLC/caffe/issues/189#issuecomment-36754479">work with images of variable sizes</a>; only the fully connected layers couldn’t reuse the weights from the first model. So I just <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/prototxt/augm_32r-2-64r-2-64r-2-128r-2-128r-2-256r-2-1024r-1024r_DLR_nolrcoef.prototxt#L292">renamed the FC layers</a> in the prototxt file and <a href="http://caffe.berkeleyvision.org/tutorial/interfaces.html#command-line">initialized</a> the network (convolution filters) with the weights of the first model:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">./build/tools/caffe train <span class="nt">--solver</span><span class="o">=</span>solver.prototxt <span class="nt">--weights</span><span class="o">=</span>models/main_32r-2-64r-2-64r-2-128r-2-128r-2-256r-2-1024rd0.5-1024rd0.5_DLR_72K-adadelta0.01_iter_153000.caffemodel</code></pre></figure> <p>This helped a lot. I used standard stochastic gradient descent (inverse decay learning rate policy) with a base learning rate of <code class="highlighter-rouge">0.001</code> for 36 000 iterations (less than 2 epochs), then increased the base learning rate to <code class="highlighter-rouge">0.01</code> for another 48 000 iterations (due to the inverse decay policy the rate had seemingly decreased too much). These training runs were done without any regularization techniques, weight decay or dropout layers, and there were clear signs of overfitting. I tried to add 50% dropout layers on the fully connected layers, but the training was extremely slow. 
To improve the speed I used 30% dropout, and trained the network for 120 000 more iterations using <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/prototxt/solver.augm.nolrcoef.prototxt">this solver</a>. The softmax loss on the validation set reached 0.21, which corresponded to a score of 3 390 000. The score was calculated by averaging softmax outputs over 20 spectrograms of each recording.</p> <h2 id="ensembling">Ensembling</h2> <p>30 hours before the deadline I had several models from the same network. Even simple ensembling (just the sum of softmax activations of different models) performed better than any individual model. Hrant suggested using <a href="https://github.com/dmlc/xgboost">XGBoost</a>, a fast implementation of the <a href="https://en.wikipedia.org/wiki/Gradient_boosting">gradient boosting</a> algorithm that is very popular among Kagglers. XGBoost has good documentation and all parameters are <a href="https://github.com/dmlc/xgboost/blob/master/doc/parameter.md">well explained</a>.</p> <p>To perform the ensembling I created a CSV file containing softmax activations (or the average of softmax activations among <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/ensembling/get_output_layers.py#L40">20</a> augmented versions of the same recording) using <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/ensembling/get_output_layers.py">this script</a>. Then I ran XGBoost on these CSV files. The submission file (which was requested by TopCoder) was generated using <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/make_submission.py">this script</a>.</p> <p>I also tried to train a <a href="https://github.com/YerevaNN/Spoken-language-identification-CNN/blob/master/ensembling/ensemble.theano.py">simple neural network</a> with one hidden layer on the same CSV files. 
The results were significantly better than with XGBoost.</p> <p>The best result was obtained by ensembling the following two models: snapshots of the last network (the one with 30% dropout) after 90 000 iterations and 105 000 iterations. The final score was 3 401 840, the <a href="http://community.topcoder.com/longcontest/stats/?module=ViewOverview&amp;rd=16555">10th result</a> of the contest.</p> <h2 id="what-we-learned-from-this-contest">What we learned from this contest</h2> <p>This was quite an interesting contest, although too short compared with Kaggle’s contests.</p> <ul> <li>Plain, <a href="http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf">AlexNet</a>-like convolutional networks work quite well for fixed-length audio recordings</li> <li>Vocal tract length perturbation works well as an augmentation technique</li> <li>Caffe supports sharing weights between convolutional networks having different input sizes</li> <li>A neural network with a single hidden layer sometimes performs better than XGBoost for ensembling (although I had just one day to test both)</li> </ul> <h2 id="unexplored-options">Unexplored options</h2> <ul> <li>It would be interesting to see whether a network with 50% dropout layers improves the accuracy</li> <li>Maybe larger convolutional networks, like <em>OxfordNet</em>, would perform better. 
They require much more memory, and it was risky to play with them under a tough deadline</li> <li><a href="http://www.cs.toronto.edu/~asamir/papers/icassp12_cnn.pdf">Hybrid methods</a> combining CNN and Hidden Markov Models should work better</li> <li>We believe it is possible to squeeze more from these models with better ensembling methods</li> <li><a href="https://apps.topcoder.com/forums/?module=Thread&amp;threadID=866734&amp;start=0&amp;mc=4">Other contestants report</a> better results based on careful mixing of the results of more traditional techniques, including <a href="https://en.wikipedia.org/wiki/N-gram">n-gram</a> and <a href="https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model">Gaussian Mixture Models</a>. We believe the combination of these techniques with the deep models will provide very good results on this dataset</li> </ul> <p>One important issue is that the organizers of this contest <a href="http://apps.topcoder.com/forums//?module=Thread&amp;threadID=866217&amp;start=0&amp;mc=3">do not allow</a> the dataset to be used outside the contest. We hope this decision will be changed eventually.</p> Diabetic retinopathy detection contest. What we did wrong 2015-08-17T00:00:00+00:00 http://yerevann.github.io//2015/08/17/diabetic-retinopathy-detection-contest-what-we-did-wrong <p>After watching the <a href="https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH">awesome video course by Hugo Larochelle</a> on neural nets (more on this in the <a href="/2015/07/30/getting-started-with-neural-networks/">previous post</a>) we decided to test our knowledge in a computer vision contest. 
We looked at <a href="https://www.kaggle.com/competitions">Kaggle</a> and the only active competition related to computer vision (except for the <a href="https://www.kaggle.com/c/digit-recognizer">digit recognizer contest</a>, for which lots of perfect out-of-the-box solutions exist) was the <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection">Diabetic retinopathy detection contest</a>. This was probably too hard for a very first project, but we decided to try nevertheless. The team included <a href="https://www.linkedin.com/in/mahnerak">Karen</a>, <a href="https://www.linkedin.com/in/galstyantik">Tigran</a>, <a href="https://github.com/Harhro94">Hrayr</a>, <a href="https://www.linkedin.com/pub/narek-hovsepyan/86/b35/380">Narek</a> (1st to 3rd year bachelor students) and <a href="https://github.com/Hrant-Khachatrian">me</a> (PhD student). Long story short, we finished in <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/leaderboard">82nd place</a> out of 661 participants, and in this post I will describe in detail what we did and what mistakes we made. All required files are in these two <a href="https://github.com/YerevaNN/Caffe-python-tools">GitHub</a> <a href="https://github.com/YerevaNN/Kaggle-diabetic-retinopathy-detection">repositories</a>. We hope this will be interesting for those who are just starting to play with neural networks. 
We also hope to get feedback from experts and other participants.</p> <!--more--> <h2 class="no_toc" id="contents">Contents</h2> <ul id="markdown-toc"> <li><a href="#the-contest" id="markdown-toc-the-contest">The contest</a></li> <li><a href="#software-and-hardware" id="markdown-toc-software-and-hardware">Software and hardware</a></li> <li><a href="#image-preprocessing" id="markdown-toc-image-preprocessing">Image preprocessing</a></li> <li><a href="#data-augmentation" id="markdown-toc-data-augmentation">Data augmentation</a></li> <li><a href="#choosing-training--validation-sets" id="markdown-toc-choosing-training--validation-sets">Choosing training / validation sets</a></li> <li><a href="#convolutional-network-architecture" id="markdown-toc-convolutional-network-architecture">Convolutional network architecture</a></li> <li><a href="#loss-function" id="markdown-toc-loss-function">Loss function</a></li> <li><a href="#preparing-submissions" id="markdown-toc-preparing-submissions">Preparing submissions</a></li> <li><a href="#attempts-to-ensemble" id="markdown-toc-attempts-to-ensemble">Attempts to ensemble</a></li> <li><a href="#more-on-this-contest" id="markdown-toc-more-on-this-contest">More on this contest</a></li> <li><a href="#acknowledgements" id="markdown-toc-acknowledgements">Acknowledgements</a></li> </ul> <h2 id="the-contest">The contest</h2> <p><a href="https://en.wikipedia.org/wiki/Diabetic_retinopathy">Diabetic retinopathy</a> is a disease in which the retina of the eye is damaged due to diabetes. It is one of the leading causes of blindness in the world. The contest’s aim was to see if computer programs can diagnose the disease automatically from the image of the retina. 
<a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/forums/t/15605/human-performance-on-the-competition-data-set">It seems</a> the winners slightly surpassed the performance of general ophthalmologists.</p> <p>Each eye of a patient is assigned one of 5 levels, from 0 to 4, where 0 corresponds to the healthy state and 4 is the most severe state. Different eyes of the same person can be at different levels (although some contestants managed to leverage the fact that two eyes are not completely independent). Contestants <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/data">were given</a> 35126 JPEG images of retinas for training (32.5GB), 53576 images for testing (49.6GB) and a CSV file listing the disease level for each training image. The goal was to create another CSV file with the disease level for each of the test images. Contestants could submit at most 5 CSV files per day for evaluation.</p> <table> <thead> <tr> <th><img src="/public/2015-08-17/eye-0.jpeg" alt="Healthy eye: level 0" title="Healthy eye: level 0" /></th> <th><img src="/public/2015-08-17/eye-4.jpeg" alt="Severe state: level 4" title="Severe state: level 4" /></th> </tr> </thead> <tbody> <tr> <td>Healthy eye: level 0</td> <td>Severe state: level 4</td> </tr> </tbody> </table> <p>The score was evaluated using a metric called <strong>quadratic weighted kappa</strong>. It is <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/details/evaluation">described</a> as being an <em>agreement</em> between two raters: the agreement between the scores assigned by the human rater (which are unknown to contestants) and the predicted scores. If the agreement is random, the score is close to 0 (sometimes it can even be negative). In case of a perfect agreement the score is 1. It is <em>quadratic</em> in the sense that, for example, if you predict level 4 for a healthy eye, it is 16 times worse than if you predict level 1. 
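</p> <p>To make the metric more concrete, here is a minimal sketch of quadratic weighted kappa in plain Python. It follows the standard definition (observed versus expected disagreement, weighted by the squared distance between levels) and is not the code Kaggle used for scoring; the function name and the toy labels are made up for illustration. It assumes the two raters do not both give the same single level to every image, otherwise the denominator is zero.</p>

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels=5):
    """Agreement between two lists of integer ratings in 0..num_levels-1."""
    n = len(rater_a)
    # observed[i][j] = number of items rated i by A and j by B.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(col) for col in zip(*observed)]

    numerator = 0.0
    denominator = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic weight: 0 on the diagonal, 1 for levels 0 vs 4.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            # Expected count if the two raters were independent.
            expected = hist_a[i] * hist_b[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    truth = [0, 0, 0, 0, 1, 2, 3, 4]
    off_by_one = [1, 0, 0, 0, 1, 2, 3, 4]   # healthy eye predicted as level 1
    off_by_four = [4, 0, 0, 0, 1, 2, 3, 4]  # healthy eye predicted as level 4
    print(quadratic_weighted_kappa(truth, truth))
    print(quadratic_weighted_kappa(truth, off_by_one))
    print(quadratic_weighted_kappa(truth, off_by_four))
```

<p>In this toy example a single mistake of 4 levels lowers the score far more than a single mistake of 1 level, because its weight is 16 times larger.</p> <p>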
Winners achieved a score of <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/leaderboard">more than 0.84</a>. Our best result was around 0.50.</p> <h2 id="software-and-hardware">Software and hardware</h2> <p>It was obvious that we were going to use a <a href="https://www.youtube.com/watch?v=rxKrCa4bg1I&amp;index=69&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH">convolutional neural network</a> for prediction. Not only because of its <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network#Applications">awesome performance</a> on many computer vision problems, including another Kaggle competition on <a href="https://www.kaggle.com/c/datasciencebowl">plankton classification</a>, but also because it was the only technique we knew for image classification. We were aware of several libraries that implement convolutional networks, namely Python-based <a href="http://deeplearning.net/software/theano/">Theano</a>, <a href="http://caffe.berkeleyvision.org/">Caffe</a> written in C++, <a href="https://github.com/dmlc/cxxnet">cxxnet</a> (developed by the <a href="https://www.kaggle.com/c/datasciencebowl/forums/t/12887/brief-describe-method-and-cxxnet-v2/69545">2nd place winners</a> of the plankton contest) and <a href="https://github.com/torch/nn/">Torch</a>. We chose Caffe because it seemed to be the simplest one for beginners: it allows one to define the neural network in a simple text file (like <a href="https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet.prototxt">this</a>) and train a network without writing a single line of code.</p> <p>We didn’t have a computer with a CUDA-enabled GPU in the university, but our friends at <a href="http://cyclopstudio.com/">Cyclop Studio</a> donated to us an Intel Core i5 computer with 4GB RAM and an <a href="http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-550ti/specifications">NVidia GeForce GTX 550 Ti</a> card. The 550 Ti has 1GB of memory, which forced us to use very small batch sizes for the neural network. 
Later we switched to a <a href="http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications">GeForce GTX 980</a> with 4GB memory, which was completely fine for us.</p> <p>Karen and Tigran managed to <a href="http://caffe.berkeleyvision.org/install_apt.html">install Caffe on Ubuntu</a> and make it work with CUDA, which was enough to start the training. Later Narek and Hrayr found out how to play with Caffe models <a href="https://github.com/BVLC/caffe/tree/master/python/caffe">using Python</a>, so we could run our models on the test set. Karen <a href="https://docs.c9.io/docs/running-your-own-ssh-workspace">connected Cloud9 to the server</a>, and we could work remotely through a web interface.</p> <h2 id="image-preprocessing">Image preprocessing</h2> <p>Images from the training and test datasets have very different resolutions, aspect ratios and colors, are cropped in various ways, and some are of very low quality or out of focus. Neural networks require a fixed input size, so we had to resize / crop all of them to some fixed dimensions. Karen and Tigran looked at many sample images and decided that the optimal resolution which preserves the details required for classification is 512x512. We thought that at 256x256 we might lose the small details that distinguish healthy eye images from level 1 images. In fact, by the end of the competition we saw that our networks could not differentiate between level 0 and 1 images even at 512x512, so probably we could have safely worked at 256x256 from the very beginning (which would have been much faster to train). All preprocessing was done using <a href="http://www.imagemagick.org/">imagemagick</a>.</p> <p>We tried three methods to preprocess the images. First, as suggested by Karen and Tigran, we resized the images and then applied the so-called <em><a href="http://www.imagemagick.org/Usage/transform/#charcoal">charcoal</a></em> effect, which is basically an edge detector. This highlighted the signs of blood on the retina. 
One of the challenging problems throughout the contest was to define a naming convention for everything: databases of preprocessed images, convnet descriptions, models, CSV files etc. We used the prefix <code class="highlighter-rouge">edge</code> for anything which was based on the images preprocessed this way. The best kappa score achieved on this dataset was 0.42.</p> <table> <thead> <tr> <th><img src="/public/2015-08-17/eye-edge-0.jpg" alt="`edge` level 0" title="`edge` level 0" /></th> <th><img src="/public/2015-08-17/eye-edge-3.jpg" alt="`edge` level 3" title="`edge` level 3" /></th> </tr> </thead> <tbody> <tr> <td>Preprocessed image <em>(edge)</em> level 0</td> <td>Preprocessed image <em>(edge)</em> level 3</td> </tr> </tbody> </table> <p>But later we noticed that this method makes dirt on the lens or other optical issues appear similar to a blood sign, and it really confused our neural networks. The following two images are of healthy eyes (level 0), but both were recognized by almost all our models as level 4.</p> <table> <thead> <tr> <th><img src="/public/2015-08-17/orig-35297_left-0.jpeg" alt="healthy eye" title="healthy eye" /></th> <th><img src="/public/2015-08-17/edge-35297_left-0.jpeg" alt="`edge`, recognized as level 4" title="`edge`, recognized as level 4" /></th> </tr> <tr> <th><img src="/public/2015-08-17/orig-44330_left-0.jpeg" alt="healthy eye" title="healthy eye" /></th> <th><img src="/public/2015-08-17/edge-44330_left-0.jpeg" alt="`edge`, recognized as level 4" title="`edge`, recognized as level 4" /></th> </tr> </thead> <tbody> <tr> <td>Original images of healthy eyes</td> <td>Preprocessed versions <code class="highlighter-rouge">edge</code> recognized as level 4</td> </tr> </tbody> </table> <p>So we decided to avoid using filters on the images, and leave all the work to the convolutional network: just resize and convert to a one-channel image (to save space and memory). 
We thought that the color information was not very important for detecting the disease, although this could be one of our mistakes. Following the discussion at <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/forums/t/13147/rgb-or-grayscale/69138">Kaggle forums</a> we decided to use the green channel only. We got our best results (kappa = 0.5) on this dataset. We used the prefix <code class="highlighter-rouge">g</code> for these images.</p> <p>Finally we tried to apply the <a href="http://www.imagemagick.org/Usage/color_mods/#equalize"><em>equalize</em></a> filter on top of the green channel, which makes the histogram of the image uniform. The best kappa score we managed to get on the dataset preprocessed this way was only 0.4. We used the prefix <code class="highlighter-rouge">ge</code> for these images.</p> <table> <thead> <tr> <th><img src="/public/2015-08-17/g-99_left-3.jpeg" alt="Just the green channel: g" title="Just the green channel: g" /></th> <th><img src="/public/2015-08-17/ge-99_left-3.jpeg" alt="Histogram equalization on top of the green channel: ge" title="Histogram equalization on top of the green channel: ge" /></th> </tr> </thead> <tbody> <tr> <td>Just the green channel: <code class="highlighter-rouge">g</code></td> <td>Histogram equalization on top of the green channel: <code class="highlighter-rouge">ge</code></td> </tr> </tbody> </table> <h2 id="data-augmentation">Data augmentation</h2> <p>One of the problems of neural networks is that they are extremely powerful. They learn so well that they usually learn something that degrades their performance on other (previously unseen) data. One (made-up) example: the images in the training set are taken by different cameras and have different characteristics. If for some reason, say, the share of level 2 images among dark images is higher than in the whole set, the network may start to predict level 2 more often for dark images. 
We are not aware of any way to detect such “misleading” correlations by looking at neuron activations of convolution filters. But, fortunately, it is possible to train the network on one subset of data and test it on another, and if the performance on these subsets differs, then the network has learned something very specific to the training data: it has <strong>overfit</strong> the training data, and we should try to avoid it.</p> <p>One of the solutions to this problem is to enlarge the dataset in order to minimize the chance of such correlations occurring. This is called <em><a href="https://www.youtube.com/watch?v=Km1Q5VcSKAg&amp;index=77&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH">data augmentation</a></em>. The organizers of this contest explicitly <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/rules">forbade</a> the use of data outside the dataset they provided. But it’s obvious that if you take an image, zoom it, rotate it, flip it, change the brightness etc. the level of the disease will not change. So it is possible to apply these transformations to the images and obtain a much larger and “more random” training dataset. One approach is to take all versions of all images into the training set, another approach is to randomly choose one transformation for each of the images. The mixture of these approaches helps to solve another problem which will be discussed in the next section.</p> <p>We applied very limited transformations only. For every image we created 4 samples: the original, the image rotated by 180 degrees, and the vertically flipped versions of these two. This helped to avoid the problem that some of the images in the dataset <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/data">were flipped</a>.</p> <p>We believe that we spent way too little time on data augmentation. All other contestants we have seen used much more sophisticated transformations. 
Probably this was our most important mistake.</p> <h2 id="choosing-training--validation-sets">Choosing training / validation sets</h2> <p>There are two reasons to train the networks only on a subset of the training dataset provided by Kaggle. The first reason is to be able to compare different models. We need to choose the model which generalizes best to the unseen data, not the one which performs best on the data it has been trained on. So we train various models on some subset of the dataset (again called a <em>training set</em>), then compare their performance on the other subset (called a <em>validation set</em>) and pick the one which works better on the latter.</p> <p>The second reason is to detect overfitting while training. During training we periodically (in Caffe this is configured by the <a href="http://caffe.berkeleyvision.org/tutorial/solver.html"><em>test_interval</em> parameter</a>) run the network on the validation set and calculate the loss. When we see that the loss on the validation set does not decrease anymore, we know that the network has started to overfit. This is best illustrated in this <a href="https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting_svg.svg">image from Wikipedia</a>.</p> <p>The distribution of images of different levels in the training set provided by Kaggle was very uneven. Almost three quarters of the images were of healthy eyes:</p> <table> <tbody> <tr> <td>Level</td> <td>Number of images</td> <td>Percentage</td> </tr> <tr> <td>0</td> <td>25810</td> <td>73.48%</td> </tr> <tr> <td>1</td> <td>2443</td> <td>6.95%</td> </tr> <tr> <td>2</td> <td>5292</td> <td>15.07%</td> </tr> <tr> <td>3</td> <td>873</td> <td>2.49%</td> </tr> <tr> <td>4</td> <td>708</td> <td>2.02%</td> </tr> </tbody> </table> <p>Neural networks seem to be very sensitive to this kind of distribution. Our very first neural network (using softmax classification) was randomly giving labels 0 and 2 to almost all images (which brought a kappa score of 0.138). 
So we had to make the classes more or less equal. Here we made a couple of trivial mistakes.</p> <p>At first we augmented the dataset by creating lots of rotations (multiples of 30 degrees, 12 versions of each image) and created a dataset of around 100K images with equally distributed classes. So we took 36 times more versions of images of level 4 than of images of level 0. As we had only 12 versions of each image, we took every image 3 times. Finally, we separated the training and validation sets <em>after</em> these augmentations. After training for 88000 iterations (with batch size 2, we were still on GeForce 550 Ti) we had a kappa score of 0.55 on our validation set. But on Kaggle’s test set the score was only 0.23. So we had terrible overfitting and didn’t detect it locally.</p> <p>The most important point here, as I understand it, is that the separation of training and validation sets should have been done <em>before</em> the data augmentation. In our case we had different rotations of the same image in both sets, which didn’t allow us to detect overfitting.</p> <p>So later we took 7472 images (21%) as a validation set, and performed the data augmentation on the remaining 27654 images. The validation set had the same ratio of classes as Kaggle’s test set. This is important for choosing the best model: the validation set should be as similar to the test set as possible.</p> <p>We also decided to get rid of the rotations by multiples of 30 degrees, as the images were being distorted (we applied rotations <em>after</em> resizing the images). After the competition, however, we saw that <a href="http://jeffreydf.github.io/diabetic-retinopathy-detection/">other contestants</a> had used such rotations. 
So maybe this was another mistake.</p> <p>Then, it turned out that the idea of taking copies of the same image is terrible, because the network overfits the smaller classes (like level 3 and level 4) and it is hard to notice that just by looking at validation loss values, because the corresponding classes are very small in the validation set. We identified this problem by carefully visualizing neuron activations on training and validation sets (just 2 weeks before the competition deadline):</p> <table> <thead> <tr> <th><img src="/public/2015-08-17/3-4-overfit.png" alt="Blue dots are from the training set, orange dots are from the validation set. x axis is the activation of a top layer neuron. y axis is the original label (0 to 4)" title="Blue dots are from the training set, orange dots are from the validation set. x axis is the activation of a top layer neuron. y axis is the original label (0 to 4)" /></th> </tr> </thead> <tbody> <tr> <td>Every dot corresponds to one image. Blue dots are from the training set, orange dots are from the validation set. <code class="highlighter-rouge">x</code> axis is the activation of a top layer neuron. <code class="highlighter-rouge">y</code> axis is the original label (0 to 4). Basically there is no overfitting for the images of level 0, 1 or 2: the activations are very similar. But the overfitting of the images of level 3 and 4 is obvious. Training samples are concentrated around fixed values, while validation samples are spread widely</td> </tr> </tbody> </table> <p>Finally we decided to train a network to differentiate between two classes only: images of level 0 and 1 versus images of level 2, 3 and 4. The ratio of the images in these classes was 4:1. We augmented the training set only by vertical flipping and rotating by 180 degrees. We took all 4 versions of each image of the second class and we randomly took one of the 4 versions of each image of the first class. This way we ended up with a training set of two equal classes. 
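</p> <p>The selection rule can be sketched in a few lines of Python. This is only an illustration under assumed names: the suffixes in <code class="highlighter-rouge">VERSIONS</code>, the file naming scheme and the function <code class="highlighter-rouge">balanced_file_list</code> are hypothetical and are not the scripts we actually ran.</p>

```python
import random

# Hypothetical suffixes for the 4 augmented versions of every image.
VERSIONS = ("orig", "rot180", "flip", "rot180_flip")


def balanced_file_list(levels_by_image, seed=0):
    """Build (file name, binary label) pairs for the 0,1 vs 2,3,4 task.

    levels_by_image maps an image name to its original level (0 to 4).
    """
    rng = random.Random(seed)
    entries = []
    for name, level in sorted(levels_by_image.items()):
        if level in (2, 3, 4):
            # Smaller class: keep all 4 versions of the image.
            chosen = VERSIONS
            label = 1
        else:
            # Larger class: keep one randomly chosen version.
            chosen = (rng.choice(VERSIONS),)
            label = 0
        entries.extend((f"{name}.{version}.jpeg", label) for version in chosen)
    # Mix the two classes so that neighbouring entries are not all alike.
    rng.shuffle(entries)
    return entries
```

<p>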
This gave us our best kappa score 0.50.</p> <p>Later we wanted to train a classifier which would differentiate level 0 images from level 1 images only, but the networks we tried didn’t work at all. Another classifier we used to differentiate between level 2 and level 3 + level 4 images actually learned something, but we couldn’t increase the overall kappa score based on that.</p> <p>After preparing the list of files for the training and validation sets, we used a tool bundled with Caffe to create a <a href="http://leveldb.org/">LevelDB</a> database from the directory of images. Caffe <a href="http://caffe.berkeleyvision.org/tutorial/data.html">prefers</a> to read from LevelDB rather than from directory:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">./build/tools/convert_imageset <span class="nt">-backend</span><span class="o">=</span>leveldb <span class="nt">-gray</span><span class="o">=</span><span class="nb">true</span> <span class="nt">-shuffle</span><span class="o">=</span><span class="nb">true </span>data/train.g/ train.g.01v234.txt leveldb/train.g.01v234</code></pre></figure> <p><code class="highlighter-rouge">gray</code> is set to <code class="highlighter-rouge">true</code> because we use single-channel images and <code class="highlighter-rouge">shuffle</code> is required to properly shuffle the images before importing into the database.</p> <h2 id="convolutional-network-architecture">Convolutional network architecture</h2> <p>Our best performing <a href="https://github.com/YerevaNN/Kaggle-diabetic-retinopathy-detection/blob/master/g_01v234_40r-2-40r-2-40r-2-40r-4-256rd0.5-256rd0.5.prototxt">neural network architecture</a> and corresponding <a href="https://github.com/YerevaNN/Kaggle-diabetic-retinopathy-detection/blob/master/best-performing-solver.prototxt">solver</a> are on Github. <code class="highlighter-rouge">Batch size</code> was always fixed to 20 (on GTX 980 card). 
We used a simple <em>stochastic gradient descent</em> with 0.9 <code class="highlighter-rouge">momentum</code> and didn’t touch learning rate policy at all (it didn’t decrease the rate significantly). We started at 0.001 <code class="highlighter-rouge">learning rate</code>, and sometimes manually decreased it (but not in this particular case which brought the best kappa score). Also in this best performing case we started with 0 <code class="highlighter-rouge">weight decay</code>, and after the first signs of overfitting (after 48K iterations, which is almost 20 epochs) increased it to 0.0015.</p> <p>Convolution was done similar to the “traditional” <a href="http://caffe.berkeleyvision.org/gathered/examples/mnist.html">LeNet architecture</a> (developed by <a href="http://yann.lecun.com/">Yann LeCun</a>, who invented the convolutional networks): one max pooling layer after every convolution layer, with fully connected layers at the end.</p> <p>Almost all other contestants used the other famous approach, with multiple consecutive convolutional layers with small kernels before a pooling layer. This was developed by <a href="http://www.robots.ox.ac.uk/~vgg/research/very_deep/">Karen Simonyan and Andrew Zisserman</a> at Visual Geometry Group, University of Oxford (that’s why it is called <em>VGGNet</em> or <em>OxfordNet</em>) for the <a href="http://www.image-net.org/challenges/LSVRC/2014/results#clsloc">ImageNet 2014 contest</a> where they took 1st and 2nd places for localization and classification tasks, respectively. Their approach was popularized by <a href="http://cs231n.github.io/convolutional-networks/#case">Andrej Karpathy</a> and was successfully used in the <a href="http://benanne.github.io/2015/03/17/plankton.html#architecture">plankton classification contest</a>. 
I have tried this approach once, but it required significantly more memory and time, so I quickly abandoned it.</p> <p>Here is the structure of our network:</p> <table> <tbody> <tr> <td>Nr</td> <td>Type</td> <td>Batches</td> <td>Channels</td> <td>Width</td> <td>Height</td> <td>Kernel size / stride</td> </tr> <tr> <td>0</td> <td>Input</td> <td>20</td> <td>1</td> <td>512</td> <td>512</td> <td> </td> </tr> <tr> <td>1</td> <td>Conv</td> <td>20</td> <td>40</td> <td>506</td> <td>506</td> <td>7x7 / 1</td> </tr> <tr> <td>2</td> <td>ReLU</td> <td>20</td> <td>40</td> <td>506</td> <td>506</td> <td> </td> </tr> <tr> <td>3</td> <td>MaxPool</td> <td>20</td> <td>40</td> <td>253</td> <td>253</td> <td>3x3 / 2</td> </tr> <tr> <td>4</td> <td>Conv</td> <td>20</td> <td>40</td> <td>249</td> <td>249</td> <td>5x5 / 1</td> </tr> <tr> <td>5</td> <td>ReLU</td> <td>20</td> <td>40</td> <td>249</td> <td>249</td> <td> </td> </tr> <tr> <td>6</td> <td>MaxPool</td> <td>20</td> <td>40</td> <td>124</td> <td>124</td> <td>3x3 / 2</td> </tr> <tr> <td>7</td> <td>Conv</td> <td>20</td> <td>40</td> <td>120</td> <td>120</td> <td>5x5 / 1</td> </tr> <tr> <td>8</td> <td>ReLU</td> <td>20</td> <td>40</td> <td>120</td> <td>120</td> <td> </td> </tr> <tr> <td>9</td> <td>MaxPool</td> <td>20</td> <td>40</td> <td>60</td> <td>60</td> <td>3x3 / 2</td> </tr> <tr> <td>10</td> <td>Conv</td> <td>20</td> <td>40</td> <td>56</td> <td>56</td> <td>5x5 / 1</td> </tr> <tr> <td>11</td> <td>ReLU</td> <td>20</td> <td>40</td> <td>56</td> <td>56</td> <td> </td> </tr> <tr> <td>12</td> <td>MaxPool</td> <td>20</td> <td>40</td> <td>14</td> <td>14</td> <td>4x4 / 4</td> </tr> <tr> <td>13</td> <td>Fully connected</td> <td>20</td> <td>256</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>14</td> <td>ReLU</td> <td>20</td> <td>256</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>15</td> <td>Dropout</td> <td>20</td> <td>256</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>16</td> <td>Fully connected</td> <td>20</td> <td>256</td> <td> 
</td> <td> </td> <td> </td> </tr> <tr> <td>17</td> <td>ReLU</td> <td>20</td> <td>256</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>18</td> <td>Dropout</td> <td>20</td> <td>256</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>19</td> <td>Fully connected</td> <td>20</td> <td>1</td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>20</td> <td>Euclidean Loss</td> <td>1</td> <td>1</td> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <p>Some observations related to the network architecture:</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU activations</a> on all convolutional and fully connected layers helped a lot, kappa score increased by almost 0.1. It’s interesting to note that Christian Szegedy, one of the GoogLeNet developers (winner of the classification contest at ImageNet 2014), <a href="https://www.youtube.com/watch?v=ySrj_G5gHWI">expressed an opinion</a> that the main reason for the deep learning revolution happening now is the ReLU function :)</li> <li>2 fully connected layers (256 neurons each) at the end is better than one fully connected layer. Kappa was increased by almost 0.03</li> <li>Number of filters in the convolutional layers are not very important. Difference between, say, 20 and 40 filters is very little</li> <li>Dropout helps fight overfitting (we used 50% probability everywhere)</li> <li>We didn’t notice any difference with Local response normalization layers</li> </ul> <p>Below are the 40 filters of the first convolutional layer of our best model (visualization code is adapted from <a href="http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb">here</a>). 
They don’t seem to be very meaningful:</p> <p><img src="/public/2015-08-17/convolutional-filters.png" alt="Filters of the 1st convolutional layer" title="Filters of the 1st convolutional layer" /></p> <p>I tried to use dropout on convolutional layers as well, but couldn’t make the network learn anything. The loss was quickly becoming <code class="highlighter-rouge">nan</code>. Probably the learning rate should have been very different…</p> <h2 id="loss-function">Loss function</h2> <p>Submissions to this contest were evaluated by the metric called <strong>quadratic weighted kappa</strong>. We found an <a href="http://www.real-statistics.com/reliability/weighted-cohens-kappa/">Excel implementation</a> of it, which helped us to get some intuition.</p> <p>At the beginning we used <a href="http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1SoftmaxWithLossLayer.html">softmax loss</a> on top of the 5 neurons of the final fully connected layer. Later we decided to use something that takes into account the fact that the order of the labels matters (0 and 1 are closer than 0 and 4). We left only one neuron in the last layer and tried to use <a href="http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1EuclideanLossLayer.html">Euclidean loss</a>. We even tried to “scale” the labels of the images in a way that makes the loss closer to being “quadratic”: we replaced the labels [0,1,2,3,4] with [0,2,3,4,6].</p> <p>Ideally we would like to have a loss function that implements the kappa metric. But we didn’t risk implementing a new layer in Caffe. <a href="http://jeffreydf.github.io/diabetic-retinopathy-detection/#the-opening">Jeffrey De Fauw</a> has implemented a continuous approximation of the kappa metric using Theano with a lot of success.</p> <p>When we switched to 0,1 vs 2,3,4 classification, I thought 2-neuron softmax would be better than Euclidean loss because of the second neuron: it might bring some information that could help to obtain a better score. 
But after some tests I saw that the activations of the two softmax neurons always sum to 1 (as softmax outputs do by definition), so the second neuron does not bring new information. The rest of the training was done using Euclidean loss (although I am not sure if that was the best option).</p> <p>We logged the output of Caffe into a file, then plotted the graphs of training and validation losses using a <a href="https://github.com/YerevaNN/Caffe-python-tools/blob/master/plot_loss.py">Python script</a> written by Hrayr:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">./build/tools/caffe train <span class="nt">-solver</span><span class="o">=</span>solver.prototxt &amp;&gt; log_g_01v234_40r-2-40r-2-40r-2-40r-4-256rd0.5-256rd0.5-wd0-lr0.001.txt
python plot_loss.py log_g_01v234_40r-2-40r-2-40r-2-40r-4-256rd0.5-256rd0.5-wd0-lr0.001.txt</code></pre></figure> <p>The script can plot multiple logs on the same image and uses a <code class="highlighter-rouge">moving average</code> to make the graph look smoother. It correctly aligns the graphs even if the log does not start from the first iteration (in case the training is resumed from a Caffe snapshot). For example, in the plot below <code class="highlighter-rouge">train 1</code> and <code class="highlighter-rouge">val 1</code> correspond to the model described in the previous section with <code class="highlighter-rouge">weight decay=0</code>; <code class="highlighter-rouge">train 2</code> and <code class="highlighter-rouge">val 2</code> correspond to the model which started from the 48000th iteration of the previous model but used <code class="highlighter-rouge">weight decay=0.0015</code>. The best kappa score was obtained at the 81000th iteration of the second model. 
Then we observe overfitting.</p> <p><img src="/public/2015-08-17/log_g_01v234_40r-2-40r-2-40r-2-40r-4-256rd0.5-256rd0.5-wd0-lr0.001.txt.png" alt="Training and validation losses for our best model" title="Training and validation losses for our best model" /></p> <p>Note that the validation loss is usually lower than the training loss. The reason is that the classes are equal in the training set and are far from being equal in the validation set. So the training and validation losses cannot be compared.</p> <h2 id="preparing-submissions">Preparing submissions</h2> <p>After training the models we used a <a href="https://github.com/YerevaNN/Caffe-python-tools/blob/master/predict_regression.py">Python script</a> to make predictions for the images in validation set. It creates a CSV file with neuron activations. Then we imported this CSV into Wolfram Mathematica and played with it there.</p> <p>I use Mathematica mainly because of its nice visualizations. Here is one of them: the <code class="highlighter-rouge">x</code> axis is the activation of the single neuron of the last layer, and the graphs present the percentages of the images of each particular label that have <code class="highlighter-rouge">x</code> activation. Ideally the graphs corresponding to different labels should be clearly separable by vertical lines. Unfortunately that’s not the case, which visually explains why the kappa score is so low.</p> <p><img src="/public/2015-08-17/best-model-graphs.png" alt="Percentage of images per given neuron activation" title="Percentage of images per given neuron activation" /></p> <p>In order to convert the neuron activations to predicted levels we need to determine 4 “threshold” numbers. These graphs show that it’s not obvious how to choose these 4 numbers in order to maximize the kappa score. So we take, say, 1000 random 4-tuples of numbers between minimum and maximum activations of the neuron, and calculate the kappa score for each of the tuples. 
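</p> <p>A minimal Python sketch of this scoring step is given below. It is not the Mathematica notebook we actually used: the function names are made up for illustration, the activations and labels of the validation set are assumed to be NumPy arrays, and kappa is computed with the standard quadratic weighted formula.</p>

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Standard quadratic weighted kappa between two integer label arrays."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    # Observed confusion matrix: entry [i, j] counts true level i predicted as j.
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)
    # Expected matrix if labels and predictions were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic disagreement weights: 0 on the diagonal, 1 in the far corners.
    levels = np.arange(n_classes)
    weights = (levels[:, None] - levels[None, :]) ** 2 / (n_classes - 1) ** 2
    # Undefined (division by zero) if all labels and predictions are one level.
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


def levels_from_thresholds(activations, thresholds):
    """Map each activation to a level 0..4 by counting thresholds at or below it."""
    return np.searchsorted(np.sort(thresholds), activations, side="right")


def score_random_thresholds(activations, labels, n_tuples=1000, seed=0):
    """Return a list of (kappa, thresholds) pairs for random sorted 4-tuples."""
    rng = np.random.default_rng(seed)
    low, high = np.min(activations), np.max(activations)
    scored = []
    for _ in range(n_tuples):
        thresholds = np.sort(rng.uniform(low, high, size=4))
        predicted = levels_from_thresholds(activations, thresholds)
        scored.append((quadratic_weighted_kappa(labels, predicted), thresholds))
    return scored
```

<p>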
Then we take the 4-tuple for which the kappa was maximal, and use these numbers as thresholds for the images in the test set.</p> <p>Note that we calculate the kappa scores for the validation set, although there is a risk of overfitting the validation set. Ideally we should choose those thresholds which attain the maximum kappa score on the training set. But, in practice, the thresholds that maximize the kappa score on the validation set perform better on the test set, mainly because the network has already overfit the training set!</p> <h2 id="attempts-to-ensemble">Attempts to ensemble</h2> <p>Usually it is possible to improve the scores by merging several models. This is called <a href="https://en.wikipedia.org/wiki/Ensemble_learning">ensembling</a>. For example, the 3rd place winners of this contest merged the results of 9 convolutional networks.</p> <p>We developed a couple of ways to merge the results from two networks, but they didn’t work well for us. They gave very small improvements (less than 0.01) only when both networks gave similar kappa scores. When one network was clearly stronger than the other one, the ensemble didn’t help at all. One of our ensemble methods was an extension of the “thresholding” method described in the previous section to 2 dimensions. We plotted the images on a 2D plane in a way that each of the coordinates corresponds to a neuron activation of one model. Then we looked for random lines that split the plane in a way that maximizes the kappa score. We tried two methods of splitting the plane which are demonstrated below. 
Each blue dot corresponds to an image with label 0, orange dots correspond to images with label 4.</p> <table> <tbody> <tr> <td><img src="/public/2015-08-17/model-merge-diagonals.png" alt="Ensemble of two networks, threshold lines are diagonal" title="Ensemble of two networks, threshold lines are diagonal" /></td> <td><img src="/public/2015-08-17/model-merge-lines.png" alt="Ensemble of two networks, threshold curves are perpendicular lines" title="Ensemble of two networks, threshold curves are perpendicular lines" /></td> </tr> </tbody> </table> <p>We didn’t try to merge more than 2 networks at once. Probably this was another mistake.</p> <p>The only method of ensembling that worked for us was to take an average over 4 rotated / flipped versions of the images. We also tried to take the minimum, maximum and harmonic mean of the neuron activations. Minimum and maximum brought a 0.01 improvement to the kappa score, while harmonic and arithmetic means brought a 0.02 improvement. The best result we achieved used the arithmetic mean. Note that this required having 4 versions of the test images (which took 2 days to rotate / flip) and running the network on all versions (which took another day).</p> <p>All these experiments can be replicated in Mathematica by using the script <code class="highlighter-rouge">main.nb</code> and the required CSV files that are <a href="https://github.com/YerevaNN/Kaggle-diabetic-retinopathy-detection/tree/master/mathematica">available on Github</a>.</p> <p>Finally, note that Mathematica is the only non-free software used in the whole training process. We believe it is better to keep the ecosystem clean :) We will probably use <a href="http://ipython.org/">IPython</a> next time.</p> <h2 id="more-on-this-contest">More on this contest</h2> <p>Many contestants have published their solutions. Here are the ones I could find. Please, let me know if I missed something. 
Most of the solutions are heavily influenced by the winning method of the plankton classification contest.</p> <ul> <li>1st place: <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/forums/t/15801/competition-report-min-pooling-and-thank-you">Min-Pooling</a> used OpenCV to preprocess the images, augmented the dataset by scaling, skewing and rotating (and notably not by changing colors), trained several networks on his own <a href="https://github.com/btgraham/SparseConvNet">SparseConvNet</a> library and used random forests to combine predictions from two eyes of the same person. Kappa = 0.84958</li> <li>2nd place: <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/forums/t/15807/team-o-o-competition-report-and-code">o_O team</a> used Theano, Lasagne, nolearn to train an OxfordNet-like network on minimally preprocessed images. They heavily augmented the dataset. They note the importance of using larger images to achieve high scores. Kappa = 0.84479</li> <li>3rd place: <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection/forums/t/15845/3rd-place-solution-report">Reformed Gamblers team</a> combined results of 9 convolutional networks (OxfordNet-like and others) with leaky ReLU activations and non-trivial loss functions. They used Torch on multiple GPUs. Kappa = 0.83937</li> <li><strong>Update:</strong> 4th place: Julian and Daniel <a href="http://blog.kaggle.com/2015/08/14/diabetic-retinopathy-winners-interview-4th-place-julian-daniel/">gave an interview</a> to Kaggle. They did extensive preprocessing and data augmentation, used CXXNet, PyLearn and Keras to train multiple OxfordNet-like networks. They highlight the importance of good parameter initialization.</li> <li>5th place: <a href="http://jeffreydf.github.io/diabetic-retinopathy-detection/">Jeffrey De Fauw</a> used Theano to train an OxfordNet-like network with leaky ReLU activations on a significantly augmented dataset. 
He has also implemented a smooth approximation of the kappa metric and used it as a loss layer. Well-written blog post. Kappa = 0.82899</li> <li>20th place: <a href="http://ilyakava.tumblr.com/post/125230881527/my-1st-kaggle-convnet-getting-to-3rd-percentile">Ilya Kavalerov</a>, again Theano, OxfordNet, good augmentation, non-obvious loss function. Interesting read. Kappa = 0.76523</li> <li>46th place: <a href="https://nikogamulin.github.io/2015/07/31/Diabetic-retinopathy-detection-with-convolutional-neural-network.html">Niko Gamulin</a> used Caffe on GTX 980 GPU (just like us) but with an OxfordNet architecture. Kappa = 0.63129</li> </ul> <p>After the contest we tried to use leaky ReLUs, something we just didn’t think of during the contest. The results are not promising. Here are the plots of the validation losses with negative slope values (<code class="highlighter-rouge">ns</code>) 0, 0.01, 0.33 and 0.5 respectively:</p> <p><img src="/public/2015-08-17/leaky-ReLU.png" alt="Validation losses using leaky ReLU activations" title="Validation losses using leaky ReLU activations" /></p> <p>Finally, Hrayr suggested using different learning rates for different convolutional layers (Caffe supports this by specifying multiplication constants per layer). He used larger coefficients (12) for the first layers than for the top layers. The full prototxt file is on <a href="https://github.com/YerevaNN/Kaggle-diabetic-retinopathy-detection/blob/master/g_01v234_32r-2-64r-2-64r-2-128r-2-128r-2-256r-2-512rd0.5-256rd0.5_manual_learning_rates.prototxt">Github</a>. This network allowed us to reach a kappa score of up to 0.52 on the local validation set. We didn’t try to run it on test images, although in almost all cases our scores on the private leaderboard were higher than the scores on local validation sets.</p> <h2 id="acknowledgements">Acknowledgements</h2> <p>We would like to express gratitude to Hugo Larochelle for his excellent video course on neural networks. 
After watching the videos we could easily understand almost all the terms in Caffe documentation.</p> <p>We would like to thank the organizers of the contest for a great competition and the contestants for helpful discussions in forums and published solutions. We learned a lot from this contest.</p> Getting started with neural networks 2015-07-30T00:00:00+00:00 http://yerevann.github.io//2015/07/30/getting-started-with-neural-networks <h2 id="who-we-are">Who we are</h2> <p>We are a group of students from the department of <a href="http://ysu.am/faculties/en/Informatics-and-Applied-Mathematics">Informatics and Applied Mathematics</a> at <a href="http://ysu.am/main/en">Yerevan State University</a>. In 2014, inspired by successes of neural nets in various fields, especially by <a href="http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/">GoogLeNet’s</a> excellent performance in ImageNet 2014, we decided to dive into the topic of neural networks. We study calculus, combinatorics, graph theory, algebra and many other topics in the university but we learn nothing about machine learning. Just a few students take some <a href="https://www.coursera.org/learn/machine-learning/home/info">ML courses</a> from Coursera or elsewhere.</p> <!--more--> <h2 id="choosing-a-video-course">Choosing a video course</h2> <p>At the beginning of 2015 the <a href="http://ysu.am/sss/en">Student Scientific Society</a> of the department initiated a project to study neural networks. We had to choose some video course on the internet, then watch and discuss the videos once per week in the university. We wanted a course that would cover everything from the very basics to convolutional networks and deep learning. 
We followed <a href="http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html">Yoshua Bengio’s</a> advice given during his <a href="http://www.reddit.com/r/MachineLearning/comments/1ysry1/ama_yoshua_bengio">interview on Reddit</a> and chose <a href="https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH">this excellent class</a> by <a href="http://www.dmi.usherb.ca/~larocheh/index_en.html">Hugo Larochelle</a>.</p> <p>Hugo’s lectures are really great. The first two chapters teach the <a href="https://www.youtube.com/watch?v=SGZ6BttHMPw&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&amp;index=1">basic structure</a> of neural networks and describe the <a href="https://www.youtube.com/watch?v=5adNQvSlF50&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&amp;index=7">backpropagation</a> algorithm in detail. We loved that he showed the derivation of the gradients of the loss function. Because of this, <a href="https://github.com/Harhro94">Hrayr</a> managed to implement a simple multilayer neural net on his own. The next two chapters (which we skipped) talk about Conditional Random Fields. The fifth chapter introduces unsupervised learning with <a href="https://www.youtube.com/watch?v=p4Vh_zMw-HQ&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&amp;index=36">Restricted Boltzmann Machines</a>. This was the hardest part for us, mainly because of our lack of knowledge of probabilistic graphical models. The sixth chapter on <a href="https://www.youtube.com/watch?v=FzS3tMl4Nsc&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&amp;index=44">autoencoders</a> is our favorite: the magic of denoising autoencoders is very surprising. 
Then there are chapters on <a href="https://www.youtube.com/watch?v=vXMpKYRhpmI&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&amp;index=51">deep learning</a>, another unsupervised learning technique called <a href="https://www.youtube.com/watch?v=7a0_iEruGoM&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&amp;index=60">sparse coding</a> (which we also skipped due to time limits) and <a href="https://www.youtube.com/watch?v=rxKrCa4bg1I&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&amp;index=69">computer vision</a> (with strong emphasis on convolutional networks). The last chapter is about <a href="https://www.youtube.com/watch?v=OzZIOiMVUyM&amp;list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&amp;index=79">natural language processing</a>.</p> <p><img src="/public/2015-07-30/denoising-autoencoder-slide.png" alt="Denoising autoencoders" title="A slide on denoising autoencoders from Hugo Larochelle's video course" /></p> <p>The lectures contain lots of references to papers and demonstrations, the slides are full of visualizations and graphs, and, last but not least, Hugo kindly answers all questions posed in the comments of Youtube videos. After watching the chapter on convolutional networks we decided to apply what we learned on some computer vision contest. We looked at the list of active competitions on Kaggle and the only one related to computer vision was the <a href="https://www.kaggle.com/c/diabetic-retinopathy-detection">Diabetic retinopathy detection contest</a>. It seemed to be very hard as a first project in neural nets, but we decided to try. We’ll describe our experience with this contest in the next post.</p>