<h1 id="argmax-vs-softmax-vs-sparsemax">Argmax vs Softmax vs Sparsemax</h1>
<p>Tao Li · 2019-01-10</p>
<p>A summary inspired by the SparseMAP paper.</p>
<h2 id="1argmax">1. Argmax</h2>
<p>Argmax is the backbone of softmax and sparsemax. Suppose we want to turn a set of unnormalized scores $\theta$ into a probability distribution; the optimization problem is:</p>
<script type="math/tex; mode=display">y = \arg\max_{y \in \bigtriangleup^d} \theta^T y</script>
<p>where the simplex $\bigtriangleup^d$ requires $\sum_i y_i = 1$ and $\forall_i y_i \geq 0$, i.e. it makes $y$ look like a distribution.</p>
<h2 id="2softmax">2. Softmax</h2>
<p>Softmax, on the other hand, can be formulated on top of argmax.</p>
<script type="math/tex; mode=display">y = \arg\max_{y \in \bigtriangleup^d} \theta^T y - y^T ln(y)</script>
<p>where $-y^T ln(y)$ is a negative entropy prior/normalizer. (This form is exactly as it appears in the SparseMAP paper.) The immediate question is: how is this equation softmax?</p>
<p>To see why, we first need to solve for $y_i$. Rewriting the above optimization as:</p>
<script type="math/tex; mode=display">y = \arg\min_{y \in \bigtriangleup^d} - \theta^T y + y^T ln(y)</script>
<p>we can see the objective is strictly convex. Thus we can take its Lagrangian:</p>
<script type="math/tex; mode=display">L(y, \lambda_1, \lambda_2) = y^T ln(y) - \theta^T y + \lambda_1 (1 - 1^T y) + \lambda_2^T y</script>
<p>With the KKT conditions and complementary slackness, we have the following:</p>
<script type="math/tex; mode=display">\frac{\partial L(y, \lambda_1, \lambda_2)}{\partial y_i} = ln(y_i) + 1 - \theta_i - \lambda_1 + \lambda_{2i} = 0</script>
<script type="math/tex; mode=display">\text{i.e. } y_i = exp(\theta_i + \lambda_1 - \lambda_{2i} - 1)</script>
<script type="math/tex; mode=display">\forall_i \lambda_{2i} = 0 \text{, since } ln(y_i) \text{ prohibits } y_i=0</script>
<p>Then we have $y_i = exp(\theta_i + \lambda_1 - 1)$. To solve $\lambda_1$, we will need the simplex constraint:</p>
<script type="math/tex; mode=display">\sum_i y_i = \sum_i e^{\theta_i + \lambda_1 - 1} = 1</script>
<script type="math/tex; mode=display">e^{\lambda_1} = \frac{e}{\sum_i e^{\theta_i}}</script>
<p>Plugging this back in gives $y_i = \frac{e^{\theta_i}}{\sum_j e^{\theta_j}}$, i.e. softmax itself.</p>
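<p>The closed form is easy to verify numerically; a small NumPy sketch (the scores $\theta$ here are made up):</p>

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])

# lambda_1 from the derivation: e^{lambda_1} = e / sum_j e^{theta_j}
lam1 = 1.0 - np.log(np.exp(theta).sum())
y = np.exp(theta + lam1 - 1.0)          # y_i = exp(theta_i + lambda_1 - 1)

softmax = np.exp(theta) / np.exp(theta).sum()
assert np.allclose(y, softmax)          # the KKT solution is exactly softmax
assert np.isclose(y.sum(), 1.0)         # and it lives on the simplex
```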
<h2 id="3sparsemax">3. Sparsemax</h2>
<p>Sparsemax uses an L2 regularizer instead of a negative entropy prior.</p>
<script type="math/tex; mode=display">y = \arg\max_{y \in \bigtriangleup^d} \theta^T y - \frac{1}{2}y^T y</script>
<p>Again, rewriting it as an $\arg\min$, we can see the objective is strictly convex; the problem is in fact a quadratic program (QP).
Further, because it lacks the $ln(y)$ term of softmax, nothing prohibits $y_i=0$, so solving it becomes more challenging.
Simply put, one can use ADMM or an off-the-shelf QP solver.</p>
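<p>In fact, this particular QP is a Euclidean projection onto the simplex, which admits an exact solution by sorting and thresholding (the algorithm given in the sparsemax paper). A NumPy sketch:</p>

```python
import numpy as np

def sparsemax(theta):
    # Euclidean projection of theta onto the simplex:
    # sort scores, find the support size k*, then shift by tau and clip.
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z)
    k = np.arange(1, len(theta) + 1)
    support = 1 + k * z > cssv
    k_star = k[support][-1]
    tau = (cssv[support][-1] - 1.0) / k_star
    return np.maximum(theta - tau, 0.0)

print(sparsemax(np.array([0.1, 1.1, 0.2])))  # [0.   0.95 0.05] -- a sparse distribution
```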
<p>But either way, we still need the gradient in the backward pass for end-to-end learning.</p>
<p>Well, gradient TODO.</p>
<h1 id="making-a-local-instance-of-conceptnet">Making a Local Instance of ConceptNet</h1>
<p>Tao Li · 2018-09-24</p>
<p>This note explains how to get a local instance of ConceptNet running.</p>
<h2 id="headsup">Heads-up</h2>
<p>This note is built on top of the official guide:</p>
<ul>
<li><a href="https://github.com/commonsense/conceptnet5/wiki/Build-process">https://github.com/commonsense/conceptnet5/wiki/Build-process</a></li>
<li><a href="https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy">https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy</a></li>
</ul>
<p>Hard disk requirement: ~400GB in total (can be two separate blocks of 200GB).</p>
<h2 id="step-by-step">Step-by-Step</h2>
<h3 id="1-fetch-the-data">1. Fetch the data</h3>
<p>First, use puppet to dump the data into your local system.
You can either make a virtual environment or not; I found it does not matter much (explained below).</p>
<p>Just make sure, before you run any of the following lines, that you have about 200GB of free space under the <code class="highlighter-rouge">/home</code> directory.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd [ANY_DIRECTORY]
git clone https://github.com/commonsense/conceptnet-puppet
cd conceptnet-puppet
sudo sh puppet-setup.sh
sudo sh puppet-apply.sh
</code></pre></div></div>
<p>It does not matter where you clone the GitHub repo, because it will not contain the data.</p>
<p>The <code class="highlighter-rouge">puppet-setup.sh</code> script will create a sudo-type user called <code class="highlighter-rouge">conceptnet</code> for which you will not have the password, and in which you therefore cannot run sudo.
<strong>So later on you will have to switch between your sudo account and this <code class="highlighter-rouge">conceptnet</code> account to make things work.</strong></p>
<p>Again, <code class="highlighter-rouge">puppet-setup.sh</code> will download about 24GB of compressed data into <code class="highlighter-rouge">/home/conceptnet/</code> and inflate it to about 200GB. The process can be VERY slow.
When the script finishes, you will be dropped into the conceptnet environment; exit it to stay with your sudo user.</p>
<p>Then, running <code class="highlighter-rouge">puppet-apply.sh</code> will throw tons of warnings about servers not being found or the like. Simply ignore them.</p>
<h3 id="2-build-the-database">2. Build the database</h3>
<p>Stay in your sudo user. Start the database service:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo service postgresql start
</code></pre></div></div>
<p>You cannot get this done as user <code class="highlighter-rouge">conceptnet</code> since you do not have the password:)</p>
<p>Enter the <code class="highlighter-rouge">conceptnet</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo su conceptnet
cd ~/conceptnet5
</code></pre></div></div>
<p>where you will need another 200GB of space (at <code class="highlighter-rouge">~/conceptnet5/data/</code>) to continue smoothly.</p>
<p>There is already data in <code class="highlighter-rouge">~/conceptnet5/data/</code>.
So if your disk is running short, you can fetch another 200GB, create a softlink to it as <code class="highlighter-rouge">~/conceptnet5/data2</code>, move everything from <code class="highlighter-rouge">~/conceptnet5/data/</code> over there, and then rename <code class="highlighter-rouge">data2</code> to <code class="highlighter-rouge">data</code>.</p>
<p>Stay in user <code class="highlighter-rouge">conceptnet</code>. Before moving on, run this in <code class="highlighter-rouge">~/conceptnet5/</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install -e '.[vectors]'
</code></pre></div></div>
<p>Stay in user <code class="highlighter-rouge">conceptnet</code>. Build and test the database. The process turned out to be very fast, and you should see no errors here.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./build.sh
./test.sh
</code></pre></div></div>
<h3 id="3-running-the-backend">3. Running the backend</h3>
<p>Stay in user <code class="highlighter-rouge">conceptnet</code>. Start the conceptnet service (I am not sure if this is already started at this point)</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl restart conceptnet
</code></pre></div></div>
<p>This actually requires sudo, and it will ask which account to use to authenticate. Simply select your sudo account.</p>
<p>Now switch to your sudo user (simply type <code class="highlighter-rouge">exit</code>). Start a <code class="highlighter-rouge">screen</code>. Then switch to user <code class="highlighter-rouge">conceptnet</code> again:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo su conceptnet
cd ~/conceptnet5/web
python conceptnet_web/api.py
</code></pre></div></div>
<p>The backend will be running at <code class="highlighter-rouge">http://127.0.0.1:8084/</code>. Press <code class="highlighter-rouge">Ctrl+A</code>, then <code class="highlighter-rouge">D</code>, to detach the screen and leave it running in the background.</p>
<h3 id="4-query">4. Query</h3>
<p>Now let’s have fun with it. First, you should be able to run the following without error:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://localhost:8084/
</code></pre></div></div>
<p>Then try:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://localhost:8084/c/en/example
</code></pre></div></div>
<p>which will print a JSON response.</p>
<p>As a real-world query, try this (note the quotes around the URL):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl "http://localhost:8084/search?rel=/r/Synonym&end=/c/en/play"
</code></pre></div></div>
<p>which will print out the whole json query result for synonym edges ending in the word <code class="highlighter-rouge">play</code>.</p>
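<p>If the shell quoting bites you (an unquoted ampersand backgrounds the command), you can also build query URLs programmatically. A small sketch with the Python standard library; <code class="highlighter-rouge">search_url</code> is a made-up helper targeting the local API above:</p>

```python
from urllib.parse import urlencode

# Build a /search query for the local ConceptNet API; urlencode handles escaping.
def search_url(base="http://localhost:8084", **params):
    return f"{base}/search?{urlencode(params)}"

url = search_url(rel="/r/Synonym", end="/c/en/play")
print(url)  # http://localhost:8084/search?rel=%2Fr%2FSynonym&end=%2Fc%2Fen%2Fplay
```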
<p>Cheers:)</p>
<h1 id="model-reproduction">Model Reproduction</h1>
<p>Tao Li · 2018-09-11</p>
<p>Model reimplementation is both a pain and a great source of learning.</p>
<p>It’s a pain because it takes time, time that is not well spent on research. It takes time to communicate with the author(s), to read the paper again and again, and to do training and testing. Even if there is a public repo, you might not reproduce the reported F1 scores.</p>
<p>But it’s a good learning process because you need to think through the problem from scratch: how to preprocess, to batch, to regularize, and to tune. I will focus on my limited experience of reproducing end-to-end neural models because they seem like low-hanging fruit in the world of reproduction.</p>
<h2 id="models-reproduced">Models Reproduced</h2>
<p>A list of models I have successfully reproduced in Pytorch (I somehow missed the train of TensorFlow…)</p>
<ul>
<li>Decomposable Attention for Natural Language Inference</li>
<li>Bidirectional Attention Flow for Machine Comprehension</li>
<li>Deep Contextualized Word Representations (the MC model and the NLI model)</li>
</ul>
<p>And a list in progress:</p>
<ul>
<li>Reasoning with Sarcasm by Reading In-between</li>
<li>Get To The Point: Summarization with Pointer-Generator Networks</li>
</ul>
<p>And I will hide the list I failed reproducing…</p>
<p>Yeah, they are all attention models. Once you have done one, it’s easy to make more:)</p>
<h2 id="heads-up">Heads-up</h2>
<p>Here is a list of things I found important:</p>
<ul>
<li>Preprocessing</li>
<li>Evaluation</li>
<li>Positions of dropout</li>
<li>Initialization</li>
<li>Learning algorithm</li>
</ul>
<p>The empire building is easy to spot, but the fuzziness on the street is more important here. Formulations in the paper are easy to implement; the reason you can’t get the F1 score is mostly something else.</p>
<p>Let’s elaborate on each one of them.</p>
<h2 id="1preprocessing">1. Preprocessing</h2>
<p>Generally speaking, end-to-end models don’t require complicated preprocessing, as the only thing you feed the model is embeddings. Preprocessing covers parsing and batching.</p>
<p>For parsing, the first thing I would do is tokenization. Tokenization can be handled by NLTK, Spacy, StanfordCoreNLP, and many other packages. What is important is that none of them is perfect. When it comes to open-domain text, the output can be bad.</p>
<p>For instance, there can be plenty of cases where “sense-making” gets tokenized into “sense-” and “making”. Such things make the vocabulary larger than necessary, and you will end up with a bunch of out-of-vocabulary tokens. Therefore extra manual tokenization is needed: just look into the data first, then come up with some simple regexes to tokenize the tokens again.</p>
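<p>A hypothetical second regex pass over already-tokenized text might look like this (the hyphen-splitting rule is just an illustration; real data will need its own rules):</p>

```python
import re

# Split leftover hyphen artifacts so tokens like "sense-" or "sense-making"
# stop inflating the vocabulary (toy second-pass tokenizer).
def retokenize(tokens):
    out = []
    for tok in tokens:
        out.extend(t for t in re.split(r"(-)", tok) if t)
    return out

print(retokenize(["sense-", "making"]))  # ['sense', '-', 'making']
```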
<p>Beyond tokenization, I rarely do parsing, as parsers tend to be flimsy. Particularly with large training data, these non-tunable preprocessing steps have the potential to hurt performance.</p>
<p>As for sentence chunking, I found it works well on plain English text, but on open-domain text it can be bad (e.g. SQuAD). Consider “…(((n+2)*2)^3)…”: I have seen Spacy chop such innocent math expressions into pieces plenty of times.</p>
<p>Batching is another important thing, as it speeds up training significantly. But in NLP tasks, examples typically have variable sequence lengths, so padding comes into the picture. In attention models, padding can mess up alignment, and you will have to mask off padding-related alignment scores.
That is not to say you can’t do batching without padding, though. Consider the following toy dataset:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A cat is sitting in a tree .
A dog is barking by the tree .
A girl is drinking coffee by the street .
Oh my , I love coffee , medium roast the best .
</code></pre></div></div>
<p>We can batch up according to sentence length, so that batches have variable sizes. For instance, sentences 1 and 2 form one batch, sentence 3 is a batch, and sentence 4 is another; we end up with 3 batches of different sizes. During training, we can then set a separate minimum batch size for gradient updates, say 2, and delay the gradient update whenever the current batch is smaller than that. In my experiments, this works well.</p>
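<p>A minimal sketch of this length-bucketing scheme (names like <code class="highlighter-rouge">max_batch</code> are made up):</p>

```python
from collections import defaultdict

def bucket_by_length(sentences, max_batch=2):
    # Group sentences by token count so no batch needs padding,
    # then chop each group into batches of at most max_batch.
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[len(sent)].append(sent)
    return [group[i:i + max_batch]
            for group in buckets.values()
            for i in range(0, len(group), max_batch)]

data = [
    "A cat is sitting in a tree .".split(),
    "A dog is barking by the tree .".split(),
    "A girl is drinking coffee by the street .".split(),
    "Oh my , I love coffee , medium roast the best .".split(),
]
print([len(b) for b in bucket_by_length(data)])  # [2, 1, 1]
```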
<p>Of course, there are cases where you have to use padding. TensorFlow’s recurrent cell module handles padding implicitly given sequence lengths, but in Pytorch this has to be handled explicitly (e.g. via <code class="highlighter-rouge">pack_padded_sequence</code>).</p>
<h2 id="2evaluation">2. Evaluation</h2>
<p>This is important because you should always use the “official” evaluation, while you might have a customized loss function.</p>
<p>For instance, in the Interpretable Semantic Textual Similarity task, the accuracy evaluation is a complicated script that is not differentiable. For another instance, in SQuAD, the metric is span-overlap F1. The official evaluation function was hidden somewhere I did not notice at first, so I used my own token-span-overlap F1, which turned out to be consistently about 10% lower than the official metric.</p>
<h2 id="3positions-of-dropout">3. Positions of Dropout</h2>
<p>Without a public repo or contact with the author(s), this is hard to pin down. Many papers simply say something like “…we added dropout to every linear layer…”. Then you follow exactly what they said, and your model overfits to hell…</p>
<p>The actual positions of dropout can be interesting. For instance, with ELMo concatenated with GloVe embeddings, the ELMo vectors first get 0.5 dropout, and after concatenation the whole embedding receives 0.2 dropout.</p>
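<p>In Pytorch, that placement might look like the following sketch (the module and the dimensions are made up; the rates mirror the description above):</p>

```python
import torch
import torch.nn as nn

class EmbeddingCombiner(nn.Module):
    """Heavier dropout on ELMo alone, lighter dropout on the concatenation."""
    def __init__(self):
        super().__init__()
        self.elmo_drop = nn.Dropout(0.5)   # on the ELMo vectors only
        self.joint_drop = nn.Dropout(0.2)  # on [ELMo; GloVe]

    def forward(self, elmo, glove):
        x = torch.cat([self.elmo_drop(elmo), glove], dim=-1)
        return self.joint_drop(x)

emb = EmbeddingCombiner()
out = emb(torch.randn(2, 5, 1024), torch.randn(2, 5, 300))
print(out.shape)  # torch.Size([2, 5, 1324])
```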
<p>When downsampling from high dimension to lower dimension, I found it’s better not to have dropout.</p>
<p>During alignment/attention, the similarity calculation typically prefers to use all features, without dropout. But it still depends: on whether you are using ELMo, and on whether the alignment phase is close to the end of the pipeline.</p>
<p>At the very end of the model pipeline, e.g. the final linear layer of the classifier, I found adding dropout always helps. The intuition is that the rear part of the pipeline has a stronger tendency towards overfitting.</p>
<p>Lastly, dropout position is something that needs extra tuning. The actually working version in your reimplementation might differ from the original paper.</p>
<h2 id="4initialization">4. Initialization</h2>
<p>Initialization covers input embeddings and parameter initialization.</p>
<p>When I was struggling with my first attention model, no matter what I tried, I got an F1 that was 3% lower than reported. It later turned out I was using GloVe 100d word embeddings while the author used GloVe 300d. This is another reminder to keep in touch with the author(s) when reproducing a model :)</p>
<p>When possible, use xavier (aka glorot) initialization. For models of just feedforward layers (e.g. the decomposable attention model), you can live with manually scaled random initialization. For models with LSTM/GRU, xavier initialization is a life saver. In particular, good parameter initialization makes the model less volatile against randomness.</p>
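<p>A Pytorch sketch of such an initialization pass (the exact recipe — xavier for weights, zeros for biases — is my assumption, not from any specific paper):</p>

```python
import torch.nn as nn

def init_weights(module):
    # Xavier/glorot for weight matrices, zeros for biases.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)
    elif isinstance(module, (nn.LSTM, nn.GRU)):
        for name, param in module.named_parameters():
            if "weight" in name:
                nn.init.xavier_uniform_(param)
            else:
                nn.init.zeros_(param)

model = nn.Sequential(nn.Linear(300, 200), nn.ReLU(), nn.Linear(200, 3))
model.apply(init_weights)
```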
<h2 id="5learning-algorithm">5. Learning Algorithm</h2>
<p>Among all the models on the reproduced list, I used different learning algorithms and different learning rates; using the same algorithm and rate simply did not work. I found this to be a common issue when rewriting a TensorFlow model in Pytorch. For instance:</p>
<ul>
<li>The decomposable attention paper reported AdaGrad with learning rate 0.05. While I found that works well, Adam with rate 0.0001 works a bit better.</li>
<li>The BiDAF paper reported AdaDelta with learning rate 0.5. The same setting failed in my case but I found Adam with 0.001 works well.</li>
<li>The ELMo enhanced BiDAF uses AdaDelta with learning rate 1.0. The same setting failed in my case but I found Adam with 0.0005 works well.</li>
</ul>
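<p>In Pytorch terms, the swaps above are one-liners (the model here is a stand-in):</p>

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# Reported in the decomposable attention paper vs. what worked for me:
reported = torch.optim.Adagrad(model.parameters(), lr=0.05)
worked = torch.optim.Adam(model.parameters(), lr=1e-4)
```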
<p>So be sure to try more algorithms and redo the hyperparameter search.</p>
<h1 id="reverse-mode-differentiation-with-vector-calculus">Reverse Mode Differentiation with Vector Calculus</h1>
<p>Tao Li · 2018-09-01</p>
<p>This is something from back in 2016, when I was struggling with my first serious neural model. Previously I mostly used Scala for NLP. But there was no native auto differentiation in Scala, so we ended up considering writing our own package. The task, naturally, turned out to be waaaaay too non-trivial, and we gave it up days later. Well, anyway, here is the wrap-up.</p>
<h2 id="heads-up">Heads-up</h2>
<p>The purpose of this note is to brief on how to write backpropagation. Manually:)</p>
<p>Why is this useful? Back in the days when auto-grad libraries (e.g. TensorFlow/Pytorch) were not yet full-fledged, I think this is basically how people trained their networks with SGD.</p>
<p>If you want something mathematically rigorous, check out the “Matrix Cookbook”. This note is just a simplified/optimized brief that works in practice, and is so much easier to write on paper.</p>
<p>Finally, for backpropagation, all we need is the gradient, obtained via the chain rule:</p>
<h2 id="the-forward-computation-of-chain-rule">The Forward Computation of “Chain Rule”</h2>
<p>Say we have an n-layer feedforward network. Around the k-th layer, it looks like this:</p>
<script type="math/tex; mode=display">\boldsymbol{x}_{k+1} = f_{k}(\boldsymbol{\theta}_{k}, \boldsymbol{x}_{k})\\
\boldsymbol{x}_{k+2} = f_{k+1}(\boldsymbol{\theta}_{k+1}, \boldsymbol{x}_{k+1})</script>
<p>where $\boldsymbol{\theta}$’s are the parameters and $\boldsymbol{x}$’s are naturally the input and output.
And at the very end of the pipeline, we have a loss function:</p>
<script type="math/tex; mode=display">J(\cdot) = \sum_{k} g_k(\boldsymbol{x}_k)</script>
<p>where $g_k$ is a layer/function.
So the gradients we are to get are partial derivatives on $\boldsymbol{\theta}$ and $\boldsymbol{x}$. For instance,</p>
<script type="math/tex; mode=display">\frac{\partial J}{\partial \boldsymbol{x}_k} = \frac{\partial J}{\partial g_k} \frac{\partial g_k}{\partial \boldsymbol{x}_k}
+ \frac{\partial J}{\partial \boldsymbol{x}_{k+1}} \frac{\partial \boldsymbol{x}_{k+1}}{\partial \boldsymbol{x}_k}</script>
<p>And the gradient on parameters:</p>
<script type="math/tex; mode=display">\frac{\partial J}{\partial \boldsymbol{\theta}_k} = \frac{\partial J}{\partial g_{k+1}} \frac{\partial g_{k+1}}{\partial \boldsymbol{x}_{k+1}} \frac{\partial \boldsymbol{x}_{k+1}}{\partial \boldsymbol{\theta}_{k}}
+ \frac{\partial J}{\partial \boldsymbol{x}_{k+2}} \frac{\partial \boldsymbol{x}_{k+2}}{\partial \boldsymbol{x}_{k+1}} \frac{\partial \boldsymbol{x}_{k+1}}{\partial \boldsymbol{\theta}_{k}}</script>
<p><strong>Let’s pretend equations (3) and (4) are all we have, without expanding infinitely.</strong></p>
<h2 id="the-reverse-mode-differentiation-chain-rule-in-vector-calculus">The Reverse-Mode Differentiation “Chain Rule” in Vector Calculus</h2>
<p>The above chain rule is notation for the scalar world only. In the vector world, it makes no bloody sense to write it that way.
The reason is that the order of multiplication in vector spaces is not always commutative.
That is, scalar multiplication can be generalized in two possible ways:</p>
<ul>
<li>matrix multiplication (non-commutative)</li>
<li>element-wise multiplication $\odot$ (commutative)</li>
</ul>
<p>Therefore the scalar chain rule needs to be generalized accordingly. The way to do it is to treat the partial derivative as both a function and a value.
The actual gradient ${\partial J} / {\partial \boldsymbol{\theta}_k}$ then becomes:</p>
<script type="math/tex; mode=display">\frac{\partial J}{\partial \boldsymbol{\theta}_k} = \frac{\partial \boldsymbol{x}_{k+1}}{\partial \boldsymbol{\theta}_k} \bigg[ \frac{\partial g_{k+1}}{\partial \boldsymbol{x}_{k+1}} \big[ \frac{\partial J}{\partial g_{k+1}} \big] \bigg]
+ \frac{\partial \boldsymbol{x}_{k+1}}{\partial \boldsymbol{\theta}_{k}} \bigg[ \frac{\partial \boldsymbol{x}_{k+2}}{\partial \boldsymbol{x}_{k+1}} \big[ \frac{\partial J}{\partial \boldsymbol{x}_{k+2}} \big] \bigg]</script>
<p>where we keep using the <strong>squared bracket</strong> notation ${\partial}/{\partial}[\cdot]$ to denote the “derivative function”, and plain ${\partial}/{\partial}$ to denote a value.</p>
<p>PS: what does “both a function and a value” mean? It’s all about the order of operations. If we write something like $A[B[C]]$, then the order is enforced as: calculate $C$, then $B$, then $A$. And when considering $B$ as a value, it is the tensor calculated from $B[C]$.</p>
<h2 id="well-what-is-happening-inside-of-these-functions">Well, what is happening inside of these “functions”?</h2>
<p>What we are going to do here is fill in those gradients in equation (5).</p>
<p>To do that, let’s focus on the k-th layer. Assume the $\boldsymbol{x}$’s are row vectors, and the k-th layer is defined as $f_k(\boldsymbol{\theta}_k, \boldsymbol{x}_k) = \boldsymbol{x}_k \boldsymbol{W}_k \boldsymbol{V}_k$, i.e. two linear transformations
(yes, they can be merged into one, but let’s keep them separate), where $\boldsymbol{\theta}_k = \{\boldsymbol{W}_k, \boldsymbol{V}_k\}$.</p>
<p>Remember that we haven’t defined the functions $g$ yet, so let’s assume the following gradients have already been calculated:</p>
<script type="math/tex; mode=display">\frac{\partial g_{k+1}}{\partial \boldsymbol{x}_{k+1}} \Big[ \frac{\partial J}{\partial g_{k+1}} \Big] = \boldsymbol{d}_{x_{k+1}}^{g_{k+1}}</script>
<script type="math/tex; mode=display">\frac{\partial J}{\partial \boldsymbol{x}_{k+2}} = \boldsymbol{d}_{x_{k+2}}^J</script>
<p>According to the definition of $f_k$, we can have the following immediately:</p>
<script type="math/tex; mode=display">\frac{\partial \boldsymbol{x}_{k+1}}{\partial \boldsymbol{W}_k} \Big[ \boldsymbol{z} \Big] = \boldsymbol{x}_{k}^T (\boldsymbol{z} \boldsymbol{V}_k^T)</script>
<script type="math/tex; mode=display">\frac{\partial \boldsymbol{x}_{k+1}}{\partial \boldsymbol{V}_k} \Big[ \boldsymbol{z} \Big] = (\boldsymbol{x}_{k} \boldsymbol{W}_k)^T \boldsymbol{z}</script>
<script type="math/tex; mode=display">\frac{\partial \boldsymbol{x}_{k+2}}{\partial \boldsymbol{x}_{k+1}} \Big[ \boldsymbol{z} \Big] = \boldsymbol{z} \boldsymbol{V}_{k+1}^T \boldsymbol{W}_{k+1}^T</script>
<p>where $\boldsymbol{z}$ is an arbitrary tensor (e.g. the right-hand sides of equations (6-7)). Now, plugging equations (6-10) back into equation (5) yields the gradient <script type="math/tex">{\partial J} / {\partial \boldsymbol{\theta}_k}</script>.</p>
<p>Above is just a toy example. Let’s see something serious.</p>
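<p>The toy example’s manual gradients can be sanity-checked against an autograd library; a Pytorch sketch (the shapes are made up):</p>

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 4)
W = torch.randn(4, 5, requires_grad=True)
V = torch.randn(5, 3, requires_grad=True)

y = x @ W @ V                           # the layer f(theta, x) = x W V
z = torch.randn(1, 3)                   # pretend upstream gradient dJ/dy
y.backward(z)

dW_manual = x.T @ (z @ V.detach().T)    # dW = x^T (z V^T)
dV_manual = (x @ W.detach()).T @ z      # dV = (x W)^T z
assert torch.allclose(W.grad, dW_manual, atol=1e-6)
assert torch.allclose(V.grad, dV_manual, atol=1e-6)
```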
<h2 id="gradient-of-lstm">Gradient of LSTM</h2>
<p>Let’s take the Graves LSTM formulation for example. The forward pass is</p>
<script type="math/tex; mode=display">\boldsymbol{i}_t = \sigma(\boldsymbol{c}_{t-1} \boldsymbol{V}_i + \boldsymbol{x} \boldsymbol{U}_i + \boldsymbol{h}_{t-1} \boldsymbol{W}_i + \boldsymbol{b}_i)</script>
<script type="math/tex; mode=display">\boldsymbol{f}_t = \sigma(\boldsymbol{c}_{t-1} \boldsymbol{V}_f + \boldsymbol{x} \boldsymbol{U}_f + \boldsymbol{h}_{t-1} \boldsymbol{W}_f + \boldsymbol{b}_f)</script>
<script type="math/tex; mode=display">\boldsymbol{c}_t = \boldsymbol{f}_t \odot \boldsymbol{c}_{t-1} + \boldsymbol{i}_t \odot tanh(\boldsymbol{x} \boldsymbol{U}_c + \boldsymbol{h}_{t-1} \boldsymbol{W}_c + \boldsymbol{b}_c)</script>
<script type="math/tex; mode=display">\boldsymbol{o}_t = \sigma(\boldsymbol{c}_t \boldsymbol{V}_o + \boldsymbol{x} \boldsymbol{U}_o + \boldsymbol{h}_{t-1} \boldsymbol{W}_o + \boldsymbol{b}_o)</script>
<script type="math/tex; mode=display">\boldsymbol{h}_t = \boldsymbol{o}_t \odot tanh(\boldsymbol{c}_t)</script>
<p>(refer to <a href="https://arxiv.org/pdf/1308.0850.pdf">https://arxiv.org/pdf/1308.0850.pdf</a>)</p>
<p>Imagine the backpropagation of this LSTM cell as a function with input $d\boldsymbol{h}$,
the gradient of $\boldsymbol{h}_t$.
Based on that, we are going to get the gradients of the gates and hidden states, $d\boldsymbol{i}$, $d\boldsymbol{f}$, $d\boldsymbol{o}$, and $d\boldsymbol{c}$, and from those the gradients of the learnable parameters: $d\boldsymbol{U}_i$, $d\boldsymbol{U}_f$, and so on.</p>
<p>To this end, let’s focus on equation (11) first. Here we want $d\boldsymbol{U}_i$. Assuming $d\boldsymbol{i}$ is already calculated, the gradient of $\boldsymbol{U}_i$ is:</p>
<script type="math/tex; mode=display">d\boldsymbol{U}_i = \frac{\partial (\cdot)}{\partial \boldsymbol{U}_i} \bigg[ \frac{\partial \sigma_i (\cdot)}{\partial (\cdot)} \Big[ \frac{\partial \boldsymbol{i}}{\partial \sigma (\cdot)} \big[ d\boldsymbol{i}\big] \Big] \bigg]</script>
<p>where $(\cdot)$ represents the linear summation in equation (11). Referring back to the toy example above, what we are going to do next is write down the definitions of these “derivative functions”:</p>
<script type="math/tex; mode=display">\frac{\partial (\cdot)}{\partial \boldsymbol{U}_i}\Big[ \boldsymbol{z} \Big] = \boldsymbol{x}^T \boldsymbol{z}</script>
<script type="math/tex; mode=display">\frac{\partial \sigma_i (\cdot)}{\partial (\cdot)}\Big[ \boldsymbol{z} \Big] = \boldsymbol{z} \odot \boldsymbol{i}_t \odot (1-\boldsymbol{i}_t)</script>
<script type="math/tex; mode=display">\frac{\partial \boldsymbol{i}}{\partial \sigma_i(\cdot)}\Big[ \boldsymbol{z} \Big] = \boldsymbol{z}</script>
<p>where $\boldsymbol{z}$ is a generalized notation for the incoming (upstream) gradient, and $\boldsymbol{i}_t$ is the sigmoid activation saved from the forward pass (since $\sigma'(a) = \sigma(a)(1-\sigma(a)) = \boldsymbol{i}_t \odot (1-\boldsymbol{i}_t)$). Plugging equations (17-19) back into (16) yields:</p>
<script type="math/tex; mode=display">d\boldsymbol{U}_i = \boldsymbol{x}^T \Big( d\boldsymbol{i} \odot \boldsymbol{i}_t \odot (1-\boldsymbol{i}_t) \Big)</script>
<p>A simple sanity check is to make sure the dimensionality of $d\boldsymbol{U}_i$ matches that of $\boldsymbol{U}_i$. Remember $\boldsymbol{x}$ is a row vector, so the dimensionality matches here.</p>
<p>Using the same methods above, we can get $d\boldsymbol{U}_i$, $d\boldsymbol{W}_i$, $d\boldsymbol{V}_i$, $d\boldsymbol{U}_f$, $d\boldsymbol{W}_f$, $d\boldsymbol{V}_f$, $d\boldsymbol{U}_o$, $d\boldsymbol{W}_o$, and $d\boldsymbol{V}_o$. A little extra effort is required to get $d\boldsymbol{U}_c$, and $d\boldsymbol{W}_c$:</p>
<script type="math/tex; mode=display">\frac{\partial (\cdot)}{\partial \boldsymbol{U}_c}\bigg[ \frac{\partial tanh(\cdot)}{\partial (\cdot)} \Big[ \frac{\partial \boldsymbol{c}}{\partial tanh(\cdot)} \big[ d\boldsymbol{c} \big] \Big] \bigg]</script>
<script type="math/tex; mode=display">\frac{\partial (\cdot)}{\partial \boldsymbol{U}_c}\big[ \boldsymbol{z} \big] = \boldsymbol{x}^T \boldsymbol{z}</script>
<script type="math/tex; mode=display">\frac{\partial \tanh_c (\cdot)}{\partial (\cdot)}\Big[ \boldsymbol{z} \Big] = \boldsymbol{z} \odot \big(1-tanh^2(\cdot)\big)</script>
<script type="math/tex; mode=display">\frac{\partial \boldsymbol{c}}{\partial tanh_c(\cdot)}\Big[ \boldsymbol{z} \Big] = \boldsymbol{i} \odot \boldsymbol{z}</script>
<p>where $(\cdot)$ denotes the linear form in equation (13), whose value is saved from the forward pass.</p>
<p>Thus,
<script type="math/tex">d\boldsymbol{U}_c = \boldsymbol{x}^T \big( d\boldsymbol{c} \odot \boldsymbol{i} \odot (1 - tanh^2 (\cdot)) \big)</script>. Similarly we can get $d\boldsymbol{W}_c$.
Next, we need to calculate what exactly {$d\boldsymbol{i}$, $d\boldsymbol{f}$, $d\boldsymbol{o}$, and $d\boldsymbol{c}$} are, given $d\boldsymbol{h}$.</p>
<script type="math/tex; mode=display">d\boldsymbol{c} = \frac{\partial (\cdot)}{\partial \boldsymbol{c}}\bigg[ \frac{\partial \sigma(\cdot)}{\partial (\cdot)} \Big[ \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{o}} \big[ d\boldsymbol{h} \big] \Big] \bigg]
+ \frac{\partial tanh(\boldsymbol{c})}{\partial \boldsymbol{c}} \Big[ \frac{\partial \boldsymbol{h}}{\partial tanh(\boldsymbol{c})} \big[ d\boldsymbol{h} \big] \Big]</script>
<p>where $(\cdot)$ represents the linear form in equation (14). And</p>
<script type="math/tex; mode=display">d\boldsymbol{f} = \frac{\partial \boldsymbol{f} \odot \boldsymbol{c}_{t-1}}{\partial \boldsymbol{f}} \big[ d\boldsymbol{c} \big]</script>
<p>And</p>
<script type="math/tex; mode=display">d\boldsymbol{i} = \frac{\partial \boldsymbol{i} \odot tanh(\cdot)}{\partial \boldsymbol{i}}\big[ d\boldsymbol{c} \big]</script>
<p>where $(\cdot)$ represents the linear form in equation (13). And</p>
<script type="math/tex; mode=display">d\boldsymbol{o} = tanh(\boldsymbol{c}) \odot d\boldsymbol{h}</script>
<p>Carrying out the calculations in equations (25-27) yields what we wanted at the very beginning. Don’t forget the bias terms, whose gradients are easy to get. Furthermore, beware that <script type="math/tex">d\boldsymbol{c}_{t-1}</script> and <script type="math/tex">d\boldsymbol{h}_{t-1}</script> should be summed over the partial derivatives from equations (11-14).</p>
<h2 id="gradient-checking">Gradient Checking</h2>
<p>Make sure all manually written gradients pass gradient check:)</p>
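<p>A generic central-difference checker can do this; a sketch (the linear loss below is just an illustration):</p>

```python
import numpy as np

def grad_check(f, w, analytic, eps=1e-5, tol=1e-4):
    # Compare the analytic gradient with (f(w+eps) - f(w-eps)) / (2 eps),
    # one coordinate of w at a time.
    numeric = np.zeros_like(w)
    it = np.nditer(w, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        orig = w[idx]
        w[idx] = orig + eps
        fp = f(w)
        w[idx] = orig - eps
        fm = f(w)
        w[idx] = orig
        numeric[idx] = (fp - fm) / (2.0 * eps)
    return float(np.max(np.abs(numeric - analytic))) < tol

rng = np.random.default_rng(0)
x, z = rng.normal(size=(1, 3)), rng.normal(size=(1, 2))
W = rng.normal(size=(3, 2))
# For J(W) = z (x W)^T, the gradient is dJ/dW = x^T z.
print(grad_check(lambda W: float(z @ (x @ W).T), W, x.T @ z))  # True
```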
<h2 id="conclusion">Conclusion</h2>
<p>Overall, deriving gradients manually is a lot of fun. But the whole process becomes unsustainable at the modeling stage, which involves constant changes to the network architecture. Besides that, performance can be a big issue, especially compared with cuDNN’s builtin LSTM/GRU cells. Still, it’s good to know what effectively happens in those <code class="highlighter-rouge">backward</code> calls in modern learning libraries.</p>
<p>Updates 10-05-2018: Reverse mode differentiation</p>
<p>Updates 05-17-2019: Minor fixes</p>