Notes on NLP Papers at ICLR 2017
The International Conference on Learning Representations (ICLR) is one of the top conferences. Because I have been concentrating on text representation, I reviewed the ICLR papers on NLP.
1. Representation of Character, Word and Sentence
Representations, or embeddings, built in several ways
1.1 Character-Aware Attention Residual Network for Sentence Representation
A sentence-embedding method by Xin Zheng (Nanyang Technological University, Singapore) and Zhenzhou Wu (SAP)
- Goal: Classify short and noisy text.
- Problem:
  - Feature sparsity when using a bag-of-words model with TF-IDF or other weighting schemes.
  - The bag-of-words method has an intrinsic disadvantage: two words with the same root, or one word in different tenses, produce two separate features. In other words, morphology is very important for understanding a short document.
  - Word2vec and Doc2Vec, which are distributed word representations, miss word morphology and word combination information.
 
- Background: The quality of the document representation has a great impact on classification accuracy.
- Contributions of this paper:
  - Take word morphology and word semantics into consideration by using a character-aware embedding together with a distributed word embedding. (This may be the common benefit of such embeddings.)
  - To obtain a sentence representation matrix, concatenate the character-level and distributed word embeddings and arrange the words in order. A sentence representation vector is then derived from different views of this matrix to overcome the data sparsity of short text. Finally, a residual network is applied to the sentence embedding to get a consistent and refined sentence representation. (The details are given later.)
 
- Details of the model: This paper proposes a character-aware attention residual network to generate the sentence representation, as shown in the figure.
  - The matrix formed by the embeddings of the characters in a word is encoded into a vector by a convolutional network.
  - The character-level embedding and the word semantic embedding are concatenated into a word representation vector.
  - A sentence is represented by a matrix.
  - The sentence representation vector is enriched by an attention mechanism, which addresses the problem that not all features contribute equally to classification (or other tasks) and targets the pertinent parts.
  - The sentence representation vector is refined by a residual network, so that the features extracted from different views are consistent.
  - The final sentence vector is obtained for classification (or other tasks).
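The first two steps above (character convolution, then concatenation with a word vector) can be sketched in NumPy as follows. This is only a minimal illustration under my own assumptions, not the paper's implementation: all sizes, the random weights, the `tanh` plus max-pooling choice, and the names `char_vector` and `word_representation` are hypothetical. Padding every word to a fixed length is also my assumption about how words of different lengths could be handled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
CHARS = "abcdefghijklmnopqrstuvwxyz"
D_CHAR, D_WORD, N_FILTERS, WIDTH, MAX_LEN = 8, 16, 12, 3, 10

char_index = {ch: i + 1 for i, ch in enumerate(CHARS)}   # index 0 is reserved for padding
char_emb = rng.normal(size=(len(CHARS) + 1, D_CHAR))     # random stand-in for learned embeddings
char_emb[0] = 0.0                                        # padding embeds to the zero vector
filters = rng.normal(size=(N_FILTERS, WIDTH, D_CHAR))    # 1-D convolution filters over characters


def char_vector(word: str) -> np.ndarray:
    """Encode a word's characters into a fixed-size vector (convolution + max over positions)."""
    # Unknown characters fall back to the padding index 0 (an assumption of this sketch).
    ids = [char_index.get(ch, 0) for ch in word.lower()[:MAX_LEN]]
    ids += [0] * (MAX_LEN - len(ids))                    # pad every word to the same length
    x = char_emb[ids]                                    # (MAX_LEN, D_CHAR)
    windows = np.stack([x[i:i + WIDTH] for i in range(MAX_LEN - WIDTH + 1)])
    feats = np.einsum("twd,fwd->tf", windows, filters)   # (positions, N_FILTERS)
    return np.tanh(feats).max(axis=0)                    # max-pool over positions


def word_representation(word: str, word_vec: np.ndarray) -> np.ndarray:
    """Concatenate the character-level vector with a distributed word vector."""
    return np.concatenate([char_vector(word), word_vec])


sentence = ["cats", "running", "fast"]
word_vecs = {w: rng.normal(size=D_WORD) for w in sentence}   # stand-in for word2vec vectors
Q = np.stack([word_representation(w, word_vecs[w]) for w in sentence])
print(Q.shape)  # (3, 28): one row per word, N_FILTERS + D_WORD columns
```

Each row of `Q` is one word, so the whole sentence becomes a matrix with the words in order.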
 
- More details of the model:
  - Word representation construction: $C$ is the vocabulary of characters, $E\in R^{d_c\times |C|}$ is the character embedding matrix, $d_c$ is the dimensionality of the character embedding, and $E^w\in R^{d_c\times n_c}$ is the character-level embedding matrix of a word with $n_c$ characters. Its $i$-th column is $E_i^w=E\cdot v_i$, where $v_i$ is a one-hot (binary) column vector that selects one column of $E$. A convolutional network then produces the vector $q_c$, which captures the character-level information. (But I still don't know how the paper handles the fact that this matrix has different dimensions for words of different lengths. Maybe a trick such as padding?) The character-level embedding can only capture the morphological features of a word, so the distributed word vector is concatenated to it to reflect the word's semantic and syntactic characteristics.
  - Sentence representation vector construction: each vector of the matrix gets a different weight, and an attention mechanism enriches the sentence representation.
    - The attention mechanism is shown below (with $n_w$ presumably the number of words in the sentence).
      $g(q_i)=\tanh(W_{qh}q_i+b_{qh})$
      $s_i=\frac{\exp(g(q_i))}{\sum_{j=1}^{n_w}\exp(g(q_j))}, \quad \hat q_i=s_iq_i$
    - Convolution operations are then applied on $Q$ with n-grams.
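The attention step can be written out directly from the two formulas for $g(q_i)$ and $s_i$. This is a small sketch, assuming that $W_{qh}$ maps each word vector to a single score (so it is a plain vector `w_qh` here) and that $b_{qh}$ is a scalar; the random inputs and the function name `attend` are hypothetical.

```python
import numpy as np


def attend(Q: np.ndarray, w_qh: np.ndarray, b_qh: float):
    """Q has one row per word, shape (n_w, d). Returns the weights s and the reweighted rows."""
    g = np.tanh(Q @ w_qh + b_qh)     # g(q_i): one score per word
    e = np.exp(g - g.max())          # subtracting the max does not change the softmax
    s = e / e.sum()                  # s_i: sums to 1 over the words of the sentence
    return s, s[:, None] * Q         # hat q_i = s_i * q_i


rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 6))          # 4 words, 6-dimensional word vectors (made-up sizes)
w_qh = rng.normal(size=6)
s, Q_hat = attend(Q, w_qh, 0.0)
print(s)                             # larger s_i means word i contributes more
```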
 
- Residual network for refining the sentence representation: it is a kind of convolutional network. But I know nothing about residual networks. -_-|| So let it go. :)
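For reference, the general idea of a residual connection is that a block outputs its input plus a learned correction, $y = x + F(x)$. The sketch below shows only that generic idea with dense layers; it is not the paper's architecture (which is convolutional), and the weights and the name `residual_block` are hypothetical.

```python
import numpy as np


def residual_block(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Return x + F(x), where F is a small two-layer transformation."""
    h = np.maximum(0.0, x @ W1)      # ReLU hidden layer
    return x + h @ W2                # skip connection: the input is added back


rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d)) * 0.1
W2 = np.zeros((d, d))                # with W2 = 0 we get F(x) = 0
y = residual_block(x, W1, W2)
print(np.allclose(y, x))  # True: when F(x) = 0 the block is the identity
```

Because the block starts from the identity, it only has to learn a refinement of its input, which fits the "refine the sentence representation" role described in these notes.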
 
- Experiments: The model outperforms state-of-the-art models on a few short-text datasets.
  - Dataset
    |Dataset|Classes|Train samples|Test samples|Average text length|
    |:---:|:---:|:---:|:---:|:---:|
    |Tweet|5|28,000|7,500|7|
    |Question|5|2,000|700|25|
    |AG_news|5|120,000|7,600|20|
  - I skipped the other details of the experiments. :)
  - The result is very good. :)
 
- Key insights:
  - We also need to work out word-level representation for Chinese, which is important too.
  - An attention mechanism, which focuses on specific parts of the input, helps because not all the words in a sentence contribute equally when predicting the sentence's label.
 
1.2 Program Synthesis for Character Level Language Modeling
A character-level language model by Pavol Bielik, Veselin Raychev & Martin Vechev (Department of Computer Science)
- Goal: a character-level language model for both program source code and English text.
- How: the model is parameterized by a program from a domain-specific language (DSL).
1.3 Words or Characters? Fine-Grained Gating for Reading Comprehension
A model for reading comprehension by Zhilin Yang, Bhuwan Dhingra, Ye Yuan, Junjie Hu, William W. Cohen and Ruslan Salakhutdinov (CMU)
- This paper should be read more carefully!
- Goal: The authors present a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the word and model the interaction between questions and paragraphs for reading comprehension.
- Method: The authors compute a vector gate as a linear projection of the token features followed by a sigmoid activation, then apply the gate multiplicatively to the character-level and word-level representations. Each dimension of the gate controls how much information flows from the word-level and the character-level representation respectively. The token features that determine the gate are named entity tags, part-of-speech tags, document frequencies, and word-level representations. The gating mechanism can be used more generally to model multiple levels of structure in language, including words, characters, phrases, sentences and paragraphs.
- Datasets: Children’s Book Test dataset.
- Tasks: the Children's Book Test dataset and a social media tag prediction task.
- Experiments: Their approach improves performance on reading comprehension tasks.
- Advantage: Character-level representations alleviate the difficulty of modeling out-of-vocabulary (OOV) tokens.
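The gate described in the Method bullet of 1.3 can be sketched as below. I assume that the gate mixes the two representations as `g * c + (1 - g) * w`, which is one natural reading of "multiplicatively apply the gate" and should be checked against the paper; the feature vector `v`, the sizes, and the name `fine_grained_gate` are hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def fine_grained_gate(c, w, v, W_g, b_g):
    """c: character-level vector (d,), w: word-level vector (d,), v: token features (k,)."""
    g = sigmoid(W_g @ v + b_g)       # one gate value per dimension, each in (0, 1)
    return g * c + (1.0 - g) * w     # assumed mixing rule: per-dimension blend of c and w


rng = np.random.default_rng(0)
d, k = 4, 3                          # made-up sizes
c, w = rng.normal(size=d), rng.normal(size=d)
v = rng.normal(size=k)               # stand-in for NER / POS / frequency features
W_g, b_g = rng.normal(size=(d, k)), np.zeros(d)
h = fine_grained_gate(c, w, v, W_g, b_g)
print(h.shape)  # (4,)
```

Because the gate is a vector rather than a scalar, each dimension can lean toward the character-level or the word-level representation independently.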
1.4 Deep Character-Level Neural Machine Translation By Learning Morphology
A new method for NMT by Shenjian Zhao (Shanghai Jiao Tong University) and Zhihua Zhang (Peking University)
- Goal: NMT aims at building a single large neural network that can be trained to maximize translation performance.
- Problem: The use of a large vocabulary becomes the bottleneck in both training and improving performance.
- Methods: Two recurrent networks and a hierarchical decoder that translates at the character level.
- Advantages: It avoids the large-vocabulary issue radically, and it is more efficient to train than word-based models.
- Experiments: Higher BLEU, and the model learns more morphology.
1.5 Opening The Vocabulary of Neural Language Models with Character-Level Word Representations
An open-vocabulary neural language model by Matthieu Labeau and Alexandre Allauzen (LIMSI-CNRS, Orsay, France)
- Goal: an open-vocabulary neural language model.
- Advantage: it can consider any word; that is, the model can build representations of unknown words.
- Experiment: a gain of up to 0.7 BLEU point.
1.6 Unsupervised Sentence Representation Learning With Adversarial Auto-Encoder
A fixed-dimensional sentence representation by Shuai Tang (UC San Diego) and Hailin Jin, Chen Fang & Zhaowen Wang (Adobe Research)
- Goal: represent a sentence with a vector of fixed dimension.
- Challenge: capture both the semantic and the structural information conveyed by a sentence.
- Advantage: learns the representation from large unlabeled text corpora.
1.7 Offline Bilingual Word Vectors Without A Dictionary
Offline bilingual word vectors by Samuel L. Smith, David H. P. Turban, Nils Y. Hammerla & Steven Hamblin (London, SW3 3DD, UK)
- Goal: Align two pre-trained embeddings by a linear transformation, using dictionaries compiled from expert knowledge.
- Method: an “inverted softmax” for identifying translation pairs.
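As I understand the idea, an ordinary softmax normalizes the similarity scores of one source word over all target words, while the inverted version normalizes over the source words for each target word, which penalizes "hub" targets that are close to everything. The sketch below follows that reading only; it leaves out any further normalization details of the paper, and the similarity matrix `S`, the temperature `beta`, and the function names are made up.

```python
import numpy as np


def standard_softmax(S: np.ndarray, beta: float = 10.0) -> np.ndarray:
    """S[i, j] = similarity of source word i and target word j; normalize over targets."""
    E = np.exp(beta * (S - S.max()))
    return E / E.sum(axis=1, keepdims=True)   # each source row sums to 1


def inverted_softmax(S: np.ndarray, beta: float = 10.0) -> np.ndarray:
    """Same scores, but normalized over source words for each target."""
    E = np.exp(beta * (S - S.max()))
    return E / E.sum(axis=0, keepdims=True)   # each target column sums to 1


# Toy example: target 0 is a hub that is similar to every source word.
S = np.array([[0.9, 0.8, 0.1],
              [0.9, 0.1, 0.8],
              [0.9, 0.2, 0.3]])
print(standard_softmax(S).argmax(axis=1))   # [0 0 0]: every source picks the hub
print(inverted_softmax(S).argmax(axis=1))   # [1 2 0]: the hub is discounted
```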
1.8 Learning Word-like Units from Joint Audio-Visual Analysis
By David Harwath and James R. Glass (Massachusetts Institute of Technology)
- Goal: A method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions, given a collection of images and spoken audio captions.
1.9 Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
A new framework by Hakan Inan and Khashayar Khosravi (Stanford University) and Richard Socher (Salesforce Research)
- Goal: a new loss framework for language modeling.
- Advantage: greatly reduces the number of trainable variables.
- Experiments: Their LSTM model lowers the state-of-the-art word-level perplexity on the Penn Treebank to 68.5.
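The reduction in trainable variables comes from tying: the output classifier reuses the input word-embedding matrix instead of learning a second matrix of the same size. The sketch below shows only this tying idea, not the loss framework of the paper; the vocabulary size, the embedding size, and the name `next_word_probs` are hypothetical.

```python
import numpy as np

V, d = 10000, 200                            # made-up vocabulary and embedding sizes
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, d))       # input embedding matrix, one row per word


def next_word_probs(h: np.ndarray, E: np.ndarray) -> np.ndarray:
    """h: hidden state of shape (d,). The classifier reuses E as its weight matrix."""
    logits = E @ h                           # one score per vocabulary word
    logits -= logits.max()                   # for numerical stability
    p = np.exp(logits)
    return p / p.sum()


p = next_word_probs(rng.normal(size=d), E)

untied_params = 2 * V * d                    # separate embedding and classifier matrices
tied_params = V * d                          # a single shared matrix
print(untied_params, tied_params)  # 4000000 2000000
```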
1.10 Sentence Ordering Using Recurrent Neural Networks
A model for the structure of coherent text by Lajanugen Logeswaran, Honglak Lee & Dragomir Radev (University of Michigan)
- Goal: an end-to-end neural approach, based on the proposed set-to-sequence mapping framework, to address the sentence ordering problem.
- Method: …
- Result: the model captures the high-level logical structure in these paragraphs and also learns rich semantic sentence representations.
2. Retrieval / Q&A / Recommender Systems
Some applications of neural networks
2.1 Learning to Query, Reason, and Answer Questions on Ambiguous Texts
2.2 Group Sparse CNNs for Question Sentence Classification with Answer Sets
2.3 Content2Vec: Specializing Joint Representations of Product Images and Text for the task Product Recommendation
2.4 Is a picture worth a thousand words? A Deep Multi-Modal Fusion Architecture for Product Classification in e-commerce
3. Word/Sentence Embedding
Some approaches to embedding