Study in IRLAB


Inner Attention based Recurrent Neural Networks for Answer Selection

An RNN model using inner attention for answer selection, by Bingning Wang, Kang Liu and Jun Zhao (ACL 2016)

Abstract

  • Attention: in prior work based on recurrent neural networks, external attention information is added to the hidden representations to obtain an attentive sentence representation.
  • This work:
    • analyze the deficiency of traditional attention-based RNN models quantitatively and qualitatively.
    • propose three new RNN models that add attention information before the RNN hidden representation.
  • Experiment: achieves new state-of-the-art results on the answer selection task.

Introduction

  • Answer selection (AS): given a question, the goal is to choose the correct answer from a set of pre-selected sentences.
  • Traditional AS models: mainly based on lexical features such as parse tree edit distance.
  • Traditional RNN models: represent the meaning of a sentence in a vector space and then compare the question and answer candidates in this hidden space, but they may ignore question-related information when representing the answer.
  • This work follows the same representation-and-compare framework, but addresses the deficiency noted above.
  • Attention based model
    • attention techniques can improve the performance of machine learning models.
    • attention model: one representation is built with attention from another representation.
    • two common ways to obtain attention from the source sentence: from the whole sentence representation (which they call attentive) or word-by-word attention (called impatient).
  • Answer Selection
    • given a question and a set of candidate sentences, choose the best sentence from the candidate set that answers the question.
    • traditional scoring functions are all based on lexical features and parse trees, so they suffer from the need for additional resources and from the errors of NLP tools such as dependency parsers.

Traditional Attention based RNN Models and Their Deficiency

  • RNN architecture:
    • $X=D[q_1, q_2, \ldots, q_n]$: $D$ is an embedding matrix in $R^d$.
    • $h_t=\delta(W_{ih}x_t+W_{hh}h_{t-1}+b_h)$: $W_{ih}, W_{hh}, W_{ho}$ are weight matrices and $b_h$ is a bias vector. $\delta$ is an activation function such as $\tanh$. This means $h_t$ is computed from the current input $x_t$ and the previous hidden state $h_{t-1}$.
    • $y_t=\delta(W_{ho}h_t+b_o)$: $b_o$ is a bias vector.
    • Usually we can ignore the output variables and use the hidden variables: either the last hidden state $h_n$ or the average of all hidden states $\frac{1}{n}\sum_{t=1}^n h_t$ serves as the sentence (question) representation $r_q$ (a NumPy sketch follows this list).
  • Attention based RNN model:
    • $H_a=[h_a(1), h_a(2), \ldots, h_a(m)]$: $h_a(t)$ is the hidden state of the answer sentence at time $t$.
    • $s_t \propto f_{attention}(r_q, h_a(t))$: $f_{attention}$ is computed as $m(t)=\tanh(W_{hm}h_a(t)+W_{qm}r_q)$ and $f_{attention}(r_q, h_a(t))=\exp(w_{ms}^Tm(t))$, where $W_{hm}$ and $W_{qm}$ are attentive weight matrices and $w_{ms}$ is an attentive weight vector.
    • $h'_a(t)=h_a(t)s_t$
    • $r_a=\sum_{t=1}^m h'_a(t)$
    • this model is called OARNN: Outer Attention based RNN (see the sketches below).
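To make the formulas above concrete, here is a minimal NumPy sketch of the plain RNN sentence representation. The dimensions, initialization, and toy inputs are assumptions for illustration, not values from the paper.

```python
import numpy as np

# Minimal sketch of the plain RNN described above (all sizes are assumed).
d, hid = 50, 64                                # embedding size, hidden size
rng = np.random.default_rng(0)

W_ih = rng.normal(scale=0.1, size=(hid, d))    # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hid, hid))  # hidden-to-hidden weights
b_h = np.zeros(hid)

def rnn_hidden_states(X):
    """X: (n, d) word embeddings of one sentence -> (n, hid) hidden states."""
    h_t = np.zeros(hid)
    states = []
    for x_t in X:
        # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b_h)
        h_t = np.tanh(W_ih @ x_t + W_hh @ h_t + b_h)
        states.append(h_t)
    return np.stack(states)

X_q = rng.normal(size=(7, d))                  # toy question of 7 word vectors
H_q = rnn_hidden_states(X_q)
r_q = H_q.mean(axis=0)                         # r_q = (1/n) * sum_t h_t
```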
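And a sketch of the outer attention (OARNN) step, which re-weights the answer's hidden states by their relevance to $r_q$ only after they have been computed; again, all shapes and toy values here are assumptions.

```python
import numpy as np

# OARNN sketch: assume the answer's hidden states H_a and the question
# representation r_q were already produced by an RNN as in the sketch above.
rng = np.random.default_rng(1)
hid, m_len = 64, 10
H_a = rng.normal(size=(m_len, hid))            # h_a(1..m), toy values
r_q = rng.normal(size=hid)                     # question representation, toy

W_hm = rng.normal(scale=0.1, size=(hid, hid))  # attentive weight matrices
W_qm = rng.normal(scale=0.1, size=(hid, hid))
w_ms = rng.normal(scale=0.1, size=hid)         # attentive weight vector

m_t = np.tanh(H_a @ W_hm.T + r_q @ W_qm.T)     # m(t) = tanh(W_hm h_a(t) + W_qm r_q)
scores = np.exp(m_t @ w_ms)                    # f_attention = exp(w_ms^T m(t))
s = scores / scores.sum()                      # s_t proportional to f_attention
r_a = (H_a * s[:, None]).sum(axis=0)           # r_a = sum_t s_t * h_a(t)
```

Because the attention touches $h_a(t)$ only after the RNN has run, the hidden states themselves are built without any question information; this is the deficiency the IARNN models below are designed to fix.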

Inner Attention based Recurrent Neural Networks

Core idea: add the attention information before (not after) computing the hidden representation.

  • IARNN-WORD
  • IARNN-CONTEXT
  • IARNN-GATE
  • IARNN-OCCAM
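As a hedged illustration of the "attention before representation" idea, here is a sketch in the spirit of IARNN-WORD: each answer word embedding is scaled by a question-derived attention weight before the RNN runs, so every hidden state already carries question information. The sigmoid gate, the matrix M_qi, and all dimensions are illustrative assumptions; the exact formulations of the four variants are given in the paper.

```python
import numpy as np

# Sketch of attention applied *before* the RNN (IARNN-WORD style).
# All names and shapes here are illustrative assumptions.
rng = np.random.default_rng(2)
d, hid = 50, 64
r_q = rng.normal(size=hid)              # question representation (from an RNN)
X_a = rng.normal(size=(10, d))          # answer word embeddings x_1..x_m

M_qi = rng.normal(scale=0.1, size=(hid, d))   # assumed attention matrix

# attention on the input words: alpha_t = sigmoid(r_q^T M_qi x_t)
alpha = 1.0 / (1.0 + np.exp(-(X_a @ M_qi.T @ r_q)))
X_tilde = X_a * alpha[:, None]          # x~_t = alpha_t * x_t

# X_tilde is then fed to the RNN, so the hidden states are built from
# question-weighted inputs rather than being re-weighted afterwards.
```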