Highlights of EMNLP 2018

Chris Zhu
6 min read · Nov 5, 2018

At past conferences, I always wished that someone attending would give an overview of a few interesting papers. Since I went this year, I thought I'd give it a shot. This is a quick summary of papers I still remember from EMNLP and my understanding of what they cover. I wanted to get this written while it's still fresh in my mind. The caveat is that I didn't get a chance to review them thoroughly, so some of these overviews may be off.

Phrase-Based & Neural Unsupervised Machine Translation

Won best paper. Admittedly, this talk went a little over my head, but here is my understanding. This work builds on top of previous research from FAIR that learns how to align words in different languages by:

  1. Creating word embeddings in each of two languages, X and Y.
  2. Learning a linear transformation W that maps from X to Y, using pairs of seed words that are known to be translations. For English and French, a set of seed words could be: (dog, chien), (car, voiture), etc.
  3. Doing the learning in an adversarial setting, where the adversarial objective is language classification (a rough sketch of this step follows the list).
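
To make the alignment step concrete, here's a rough sketch of the adversarial training loop in PyTorch. This is my own illustration, not the authors' code; the architecture and hyperparameters are made up.

```python
import torch
import torch.nn as nn

dim = 300  # embedding dimensionality (illustrative)

# W maps source-language embeddings into the target-language space
W = nn.Linear(dim, dim, bias=False)

# The discriminator tries to classify: mapped-source (1) vs. real-target (0)
discriminator = nn.Sequential(
    nn.Linear(dim, 256), nn.LeakyReLU(), nn.Linear(256, 1)
)

bce = nn.BCEWithLogitsLoss()
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=0.1)

def train_step(src_batch, tgt_batch):
    """One adversarial update; src_batch / tgt_batch are (B, dim) embedding tensors."""
    # 1) Train the discriminator to tell mapped-source from real target embeddings.
    mapped = W(src_batch).detach()
    d_loss = (bce(discriminator(mapped), torch.ones(len(mapped), 1))
              + bce(discriminator(tgt_batch), torch.zeros(len(tgt_batch), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train W to fool the discriminator (mapped-source should look like target).
    g_loss = bce(discriminator(W(src_batch)), torch.zeros(len(src_batch), 1))
    opt_w.zero_grad(); g_loss.backward(); opt_w.step()
    return d_loss.item(), g_loss.item()
```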

In this work, they do the same thing, but instead of learning to align in the word embedding space, they learn to align phrase embeddings.

Then, with the aligned phrases, they use traditional phrase-based machine translation techniques: given a sentence in the source language, there are many candidate phrase-by-phrase translations in the target language, and a neural language model trained on the target language is used to rank the candidates.
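
A toy version of that reranking step; lm_log_prob is a stand-in for whatever neural language model is trained on the target language:

```python
def rerank(candidates, lm_log_prob):
    """Pick the candidate the target-language LM finds most likely.

    candidates: list of candidate sentences (token lists) from the phrase table.
    lm_log_prob: function returning the LM log-probability of a sentence
                 (a placeholder for any neural LM trained on target-language text).
    """
    # Length-normalize so the LM doesn't trivially prefer short outputs.
    return max(candidates, key=lambda sent: lm_log_prob(sent) / max(len(sent), 1))
```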

They also use back-translation: the current models translate monolingual text from one language into the other, producing synthetic parallel data, and iteratively retraining on that data improves both translation directions.
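
My mental model of the iterative back-translation loop, as a sketch with placeholder translate/train methods:

```python
def iterative_back_translation(mono_src, mono_tgt, src2tgt, tgt2src, rounds=3):
    """Each round, each model creates synthetic parallel data for the other.

    mono_src / mono_tgt: monolingual corpora.
    src2tgt / tgt2src: translation models with .translate(sentences) and
    .train(pairs) methods (placeholders, not a real API).
    """
    for _ in range(rounds):
        # Translate monolingual target text back into the source language,
        # giving (noisy source, clean target) pairs to train src->tgt on.
        synthetic_src = tgt2src.translate(mono_tgt)
        src2tgt.train(list(zip(synthetic_src, mono_tgt)))

        # And symmetrically for the other direction.
        synthetic_tgt = src2tgt.translate(mono_src)
        tgt2src.train(list(zip(synthetic_tgt, mono_src)))
    return src2tgt, tgt2src
```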

MultiWOZ — A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling

Won the best resource paper award. The authors collected 10k dialogues in the travel domain, which is significantly larger than previous dialogue datasets. The dataset is also quite complex, with an average of 13.6 turns per dialogue, which puts it firmly in the realm of challenging research datasets.

How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks

This reminds me of a workshop on generalization, where Percy Liang showed that the performance of SOTA models on SQuAD decreases significantly when distracting text is appended to the passage. This paper studies a range of question answering datasets and finds that on many of them, surprisingly good performance can be obtained by looking only at the passage and ignoring the question.

Linguistically-Informed Self-Attention for Semantic Role Labeling

Won best paper. I'm not sure I fully understood this one from the presentation, but from what I gathered: LISA uses a self-attention (transformer-style) architecture to jointly model POS tagging, dependency parsing, predicate detection and the main objective, semantic role labeling. The aspect of this paper I find most interesting is its unique way of dealing with pipelines in NLP. Traditionally, syntax-to-semantics parsing is done via a pipeline, which causes error propagation; this is often mitigated with beam search decoding at error-prone pipeline stages, but it is still difficult to handle. The contribution of this paper is a multi-level self-attention model that infers POS at the lowest level, the dependency parse at the next level, predicate detection at the level after that, and SRL at the highest level. Each level uses the label scores from the levels below as a signal for subsequent tasks. This allows for a natural way to inject a gold label at a lower level at inference time: for instance, we can use spaCy's POS tagger and inject those tags into the model in place of the scores that come from the attention layer responsible for POS.
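
Just to illustrate the "one task per level" idea, here's a toy multi-task stack in PyTorch. This is my rough mental model, not the actual LISA architecture (in particular, LISA additionally trains a specific attention head to attend to each token's syntactic head, which is what makes injecting a gold parse possible); all class and parameter names are made up.

```python
import torch
import torch.nn as nn

class MultiLevelTagger(nn.Module):
    """Toy multi-task stack: lower layers are supervised for lower-level tasks,
    and all task losses are summed during training."""
    def __init__(self, dim, n_pos, n_srl, n_heads=8, n_layers=4):
        super().__init__()
        # dim must be divisible by n_heads
        make_layer = lambda: nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.layers = nn.ModuleList([make_layer() for _ in range(n_layers)])
        self.pos_head = nn.Linear(dim, n_pos)        # supervised after layer 1
        self.predicate_head = nn.Linear(dim, 2)      # supervised after layer 3
        self.srl_head = nn.Linear(dim, n_srl)        # supervised at the top

    def forward(self, x):
        # x: (batch, seq_len, dim) token representations
        x = self.layers[0](x)
        pos_logits = self.pos_head(x)
        x = self.layers[1](x)
        x = self.layers[2](x)
        predicate_logits = self.predicate_head(x)
        x = self.layers[3](x)
        srl_logits = self.srl_head(x)
        return pos_logits, predicate_logits, srl_logits
```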

Evaluating the Utility of Hand-crafted Features in Sequence Labelling

This directly applies to what we are doing now. The authors found that simply adding hand-built features to neural sequence models doesn't seem to offer any significant performance gain unless the model is trained with a multi-task objective in which it is forced to reconstruct the hand-built features that were added to the word embeddings. Otherwise the model simply forgets the features.
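
A small sketch of that multi-task setup as I understand it: the hand-crafted features are concatenated to the word embeddings, and an auxiliary head has to reconstruct them from the encoder states so the model can't just forget them. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class TaggerWithFeatureReconstruction(nn.Module):
    def __init__(self, emb_dim, feat_dim, hidden, n_tags):
        super().__init__()
        # The encoder sees word embeddings concatenated with hand-crafted features.
        self.encoder = nn.LSTM(emb_dim + feat_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(2 * hidden, n_tags)     # main sequence-labelling task
        self.feat_head = nn.Linear(2 * hidden, feat_dim)  # auxiliary: recover the features

    def forward(self, word_emb, feats):
        h, _ = self.encoder(torch.cat([word_emb, feats], dim=-1))
        return self.tag_head(h), self.feat_head(h)

def joint_loss(tag_logits, gold_tags, feat_pred, feats, alpha=0.1):
    """Main tagging loss plus reconstruction of the hand-crafted features."""
    ce = nn.functional.cross_entropy(tag_logits.flatten(0, 1), gold_tags.flatten())
    recon = nn.functional.mse_loss(feat_pred, feats)
    return ce + alpha * recon
```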

Developing Production-Level Conversational Interfaces with Shallow Semantic Parsing

This was one of the more practical ones: Cisco built a goal-oriented dialogue system, and this paper describes how they did it. I spent quite a bit of time at this poster; they do a really good job of showing the behind-the-scenes of a task-oriented conversational agent.

Self-Governing Neural Networks for On-Device Short Text Classification

From Google Research. They use locality-sensitive hashing (similar items are hashed to the same bucket) to create binary feature vectors instead of continuous word vectors, and yet still achieve state-of-the-art results on a few datasets. This makes the model extremely memory- and compute-efficient: by dropping the embedding layer, they end up with something like 300k parameters versus the roughly 25 million you'd get if you kept it.
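
Here's a minimal sketch of the general trick as I understand it, using random-projection (hyperplane) LSH over hashed tokens to get a fixed-size binary vector with no embedding table. This is only an illustration of the idea, not the exact SGNN projection operator.

```python
import zlib
import numpy as np

def binary_features(tokens, n_bits=256):
    """Map a token sequence to a fixed-size binary vector with no embedding table."""
    acc = np.zeros(n_bits, dtype=np.float32)
    for tok in tokens:
        # Derive a deterministic random hyperplane per token from a stable hash,
        # so nothing is ever stored or learned per vocabulary item.
        rng = np.random.default_rng(zlib.crc32(tok.encode("utf-8")))
        acc += rng.standard_normal(n_bits)
    # Sign of the summed projections gives the binary feature vector.
    return (acc > 0).astype(np.uint8)

bits = binary_features("play a song by the beatles".split())
```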

Knowledge Graph Embedding with Hierarchical Relation Structure

This one decomposes similar relationships, like /software/developer, /operating_system/developer and /videogame/developer, into hierarchical sub-relation units, constructing a three-layer hierarchy of relations. I'm not sure how they discover the relation taxonomy. They then change the TransE objective from h + r - t = 0 to h + r1 + r2 + r3 - t = 0, where r1, r2, r3 are the embeddings of the relation's three hierarchy levels. They see significant improvements in knowledge graph inference.
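
A toy scoring function to make the change concrete; r1, r2, r3 stand for the embeddings of the relation's three hierarchy levels (names are illustrative):

```python
import torch

def transe_score(h, r, t):
    """Standard TransE: smaller ||h + r - t|| means a more plausible triple."""
    return torch.norm(h + r - t, p=1, dim=-1)

def hierarchical_transe_score(h, r1, r2, r3, t):
    """Hierarchical variant: the relation is the sum of its per-level embeddings,
    e.g. top-level category, sub-category, and the specific relation."""
    return torch.norm(h + r1 + r2 + r3 - t, p=1, dim=-1)
```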

Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

The authors tested a bunch of activation functions across a few common NLP setups (MLP classification, CNN classification and LSTM sequence tagging) and found that penalized tanh, relu, elu, and Swish (x · sigmoid(x)) tend to work quite well across a variety of tasks.
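
For reference, here's what the two less familiar activations look like; treat the 0.25 slope in penalized tanh as my assumption of the commonly used value.

```python
import torch

def swish(x):
    """Swish: x * sigmoid(x) (the beta = 1 variant)."""
    return x * torch.sigmoid(x)

def penalized_tanh(x, a=0.25):
    """Penalized tanh: tanh on the positive side, a damped tanh on the negative side."""
    return torch.where(x > 0, torch.tanh(x), a * torch.tanh(x))
```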

Adversarial Removal of Demographic Attributes from Text Data

This paper discusses how to use adversarial training to remove demographic information from text representations. The setup applies to any model with an encoder-decoder formulation: if some text is encoded with an encoder, an attacker can do better than chance at predicting demographic attributes (like age) from that representation. The fix is to add an adversarial objective: a classifier is trained to predict the demographic attribute from the encoding, while the encoder is trained jointly on the target objective and on fooling that classifier. This results in more privacy-aware language representations, since much of the demographic information is stripped out by the adversarial objective. The technique probably applies to any neural representation.
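
One common way to implement this kind of objective is a gradient reversal layer; I don't know whether the authors do it exactly this way, so treat this as a generic sketch with made-up module names.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass,
    so the encoder is pushed to make the adversary's job harder."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def adversarial_loss(encoder, task_head, adversary, x, y_task, y_demog, lam=1.0):
    z = encoder(x)                                   # shared text representation
    task_loss = nn.functional.cross_entropy(task_head(z), y_task)
    # The adversary learns to predict the demographic attribute, while the
    # reversed gradients push the encoder to strip that information from z.
    adv_logits = adversary(GradReverse.apply(z, lam))
    adv_loss = nn.functional.cross_entropy(adv_logits, y_demog)
    return task_loss + adv_loss
```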

Deep Bayesian Active Learning for Natural Language Processing: Results of a Large-Scale Empirical Study

They tried a different sampling method for active learning, which marginally outperformed standard max-entropy sampling by a few points. To choose the next sample, they run inference multiple times through a model with dropout left on, which yields different predictive distributions. The next sample presented to the oracle is the one whose predicted class varies the most across these passes.
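
A sketch of what that acquisition step could look like with MC dropout; the paper evaluates several Bayesian acquisition functions, and this variation-ratio version is just one of them, with all names here being mine.

```python
import torch

def most_uncertain_index(model, pool, n_passes=10):
    """Pick the pool example whose predicted class varies most across
    stochastic forward passes (MC dropout)."""
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        # preds: (n_passes, n_examples) hard class predictions
        preds = torch.stack([model(pool).argmax(dim=-1) for _ in range(n_passes)])
    # Variation ratio: 1 - (frequency of the most common prediction).
    modes = preds.mode(dim=0).values
    disagreement = 1.0 - (preds == modes).float().mean(dim=0)
    return int(disagreement.argmax())
```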

Similarity-Based Reconstruction Loss for Meaning Representation

Cross-entropy loss is not ideal for tasks like text generation, since it penalizes every word that doesn't exactly match the target. This pushes the generator to just repeat common words from the training corpus. This paper argues for a "semantic loss", which is the cosine distance between the target sentence and the generated sentence in word embedding space. The authors show that this results in more diverse generations.
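
A minimal sketch of such a similarity-based loss, with mean-pooled word embeddings and cosine distance; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def similarity_loss(pred_embs, target_embs):
    """1 - cosine similarity between mean-pooled sentence embeddings.

    pred_embs / target_embs: (batch, seq_len, dim) word-embedding tensors for
    the generated and target sentences.
    """
    pred_vec = pred_embs.mean(dim=1)
    target_vec = target_embs.mean(dim=1)
    return (1.0 - F.cosine_similarity(pred_vec, target_vec, dim=-1)).mean()
```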

HyTE: Hyperplane-based Temporally aware Knowledge Graph Embedding

A knowledge graph often has temporally inconsistent facts, for instance ("Cristiano Ronaldo", "played_for", "Real Madrid") and ("Cristiano Ronaldo", "played_for", "Juventus"). Both of these facts are correct, since Ronaldo moved from Real Madrid to Juventus this year. Embedding objectives like TransE have a hard time capturing this. The authors augmented a knowledge graph (FB15k?) so that each fact carries temporal information (2010–2011, 2011–2012, and so on). They then learn, for each timeframe, a hyperplane onto which the entity and relation embeddings are projected. The TransE objective therefore becomes:

P_τ(s) + P_τ(r) = P_τ(t)

where s is the source (head) entity embedding, r is the relation embedding, t is the target (tail) entity embedding, and P_τ is the projection onto the hyperplane for timeframe τ.
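
A sketch of the scoring function, assuming one learned unit-normal vector w_τ per timeframe (my notation, not necessarily the paper's):

```python
import torch

def project(e, w_tau):
    """Project embedding e onto the hyperplane whose unit normal is w_tau
    (one normal vector is learned per timeframe; w_tau should be normalized)."""
    return e - (e * w_tau).sum(dim=-1, keepdim=True) * w_tau

def hyte_score(s, r, t, w_tau):
    """Temporal TransE score: project head, relation and tail onto the
    timeframe's hyperplane, then score as usual; smaller is more plausible."""
    return torch.norm(project(s, w_tau) + project(r, w_tau) - project(t, w_tau),
                      p=1, dim=-1)
```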

Find me on Twitter
