Is it time to switch to Word Embedding and Recurrent Neural Networks for Spoken Language Understanding?

Vedran Vukotić, Christian Raymond, Guillaume Gravier

INTERSPEECH 2015

Abstract

Recently, word embedding representations have been investigated for slot filling in Spoken Language Understanding, along with the use of Neural Networks as classifiers. Neural Networks, especially Recurrent Neural Networks, which are specifically adapted to sequence labeling problems, have been applied successfully on the popular ATIS database. In this work, we compare this kind of model with the previously state-of-the-art Conditional Random Fields (CRF) classifier on a more challenging SLU database. We show that, despite the efficient word representations used within these Neural Networks, their ability to process sequences is still significantly lower than for CRF, while also carrying the drawback of higher computational cost, and that the ability of CRF to model output label dependencies is crucial for SLU.

Overview

In this work, we're analyzing whether RNNs are a good candidate for the task of slot tagging in spoken language understanding and whether they really outperform the previous state-of-the-art method, CRF. We analyze two things:

  • is the improvement in RNNs more likely due to their architecture or due to the fact that they use continuous word representations?
  • do RNNs outperform CRF when evaluated on a dataset that is not as simple as ATIS (which is unfortunately still the default dataset for SLU)?

To gain insight into the first question, we use a classifier that can take both symbolic and numeric (continuous) representations as input: boosting over decision trees. We then evaluate how the two different inputs affect the same type of classifier:

Dataset   Representation   Precision (%)   Recall (%)   F-measure (%)
ATIS      symbolic         93.00           93.43        93.21
ATIS      numeric          93.50           94.54        94.02
MEDIA     symbolic         71.09           75.48        73.22
MEDIA     numeric          73.61           78.85        76.12

We see that using numerical (continuous) representations brings a significant improvement on both datasets, and that this is clearly one of the factors behind the good results reported for RNNs.
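To make the distinction between the two input types concrete, here is a minimal sketch of per-token classification with boosted decision trees. This is not the Bonzaiboost setup used in the paper; scikit-learn's GradientBoostingClassifier, the toy sentence, and the random embeddings are stand-ins used only to illustrate symbolic versus continuous features.

```python
# Sketch: per-token slot tagging with boosted decision trees, comparing
# symbolic (word-identity) features against continuous (embedding) features.
# Hypothetical stand-in for Bonzaiboost; data and embeddings are toy examples.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

tokens = ["flights", "from", "boston", "to", "denver"]
labels = ["O", "O", "B-fromloc", "O", "B-toloc"]

# Symbolic input: each token is an opaque category, one-hot encoded.
X_symbolic = OneHotEncoder(handle_unknown="ignore").fit_transform(
    np.array(tokens).reshape(-1, 1)
).toarray()

# Numeric input: each token is a dense vector (random here; in the paper,
# Word2Vec embeddings trained on the task corpus play this role).
rng = np.random.default_rng(0)
embeddings = {t: rng.normal(size=50) for t in set(tokens)}
X_numeric = np.vstack([embeddings[t] for t in tokens])

for name, X in [("symbolic", X_symbolic), ("numeric", X_numeric)]:
    clf = GradientBoostingClassifier().fit(X, labels)
    print(name, clf.score(X, labels))
```

With continuous vectors, words unseen in training can still fall close to known words in the embedding space, which is where the generalization gain comes from; with one-hot features, an unseen word carries no usable information.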


To try to answer the second question, we evaluate both CRF and RNNs on two datasets, namely:

  • ATIS - the standard air travel information dataset, which is unfortunately not very challenging
  • MEDIA - a less known but more complex French dataset that contains tourist information / reservation scenarios

For RNNs, we also evaluate both pretrained word representations (Word2Vec) and word representations trained jointly with the RNN (a lookup table at the beginning of the network that is updated through backpropagation); a rough sketch of the two setups follows the table. The results are as follows:

Dataset   Algorithm     Representation       F-measure (%)
ATIS      Bonzaiboost   numeric (Word2Vec)   94.02
ATIS      Bonzaiboost   symbolic             92.97
ATIS      CRF           symbolic             95.23
ATIS      Elman RNN     numeric (joint)      96.16
MEDIA     Bonzaiboost   numeric (Word2Vec)   76.14
MEDIA     Bonzaiboost   symbolic             73.22
MEDIA     CRF           symbolic             86.00
MEDIA     Elman RNN     numeric (joint)      81.76
MEDIA     Elman RNN     numeric (Word2Vec)   81.94
MEDIA     Jordan RNN    numeric (joint)      83.25
MEDIA     Jordan RNN    numeric (Word2Vec)   83.15
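
The sketch below illustrates the two embedding strategies for the RNN taggers. It is hypothetical PyTorch code, not the paper's implementation; the vocabulary size, dimensions, and the `w2v_vectors` tensor (which would come from a real Word2Vec model) are placeholders.

```python
# Sketch of an Elman RNN slot tagger with a trainable embedding lookup table.
# Hypothetical PyTorch stand-in for the paper's implementation; sizes are arbitrary.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, n_labels = 5000, 100, 128, 20

class ElmanTagger(nn.Module):
    def __init__(self, pretrained=None):
        super().__init__()
        # The lookup table: one row per word, updated by backpropagation.
        self.emb = nn.Embedding(vocab_size, emb_dim)
        if pretrained is not None:
            # "numeric (Word2Vec)" variant: start from pretrained vectors.
            self.emb.weight.data.copy_(pretrained)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)  # Elman RNN
        self.out = nn.Linear(hidden_dim, n_labels)

    def forward(self, word_ids):             # word_ids: (batch, seq_len)
        h, _ = self.rnn(self.emb(word_ids))  # (batch, seq_len, hidden_dim)
        return self.out(h)                   # per-token label scores

# "numeric (joint)": embeddings start random and are learned with the tagger.
joint_model = ElmanTagger()
# "numeric (Word2Vec)": embeddings start from pretrained vectors, then fine-tuned.
w2v_vectors = torch.randn(vocab_size, emb_dim)  # placeholder for real Word2Vec weights
w2v_model = ElmanTagger(pretrained=w2v_vectors)

scores = joint_model(torch.randint(0, vocab_size, (2, 7)))
print(scores.shape)  # torch.Size([2, 7, 20])
```

A Jordan RNN differs only in that the recurrence feeds back the previous output (label) layer rather than the previous hidden state; the embedding lookup works the same way in both variants.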

We conclude that continuous representation spaces allow for better generalization (better accuracy) and make the classification algorithm converge faster. Moreover, continuous representations decrease the chance that a classifier produces decision rules fitted to noise, and are thus more robust to noise than symbolic ones. Despite this, algorithms able to exploit them, such as RNNs, are still not able to compete with CRF on MEDIA. Although CRF is trained solely on symbolic features, its ability to model output label dependencies appears crucial for the task. CRF with symbolic features thus remains the best classification algorithm for SLU in terms of prediction performance.
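
The point about output label dependencies is the key difference from the per-token models above: a linear-chain CRF scores label transitions (e.g. how likely B-toloc is to follow O) jointly with the word features. The sketch below uses sklearn-crfsuite as a stand-in, not the CRF toolkit from the paper, and reuses the toy sentence from the earlier example.

```python
# Sketch of a linear-chain CRF tagger with purely symbolic features.
# Hypothetical sklearn-crfsuite stand-in for the paper's CRF setup.
import sklearn_crfsuite

def token_features(sent, i):
    # Symbolic features only: the word and its immediate neighbours.
    return {
        "word": sent[i],
        "prev_word": sent[i - 1] if i > 0 else "<BOS>",
        "next_word": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

sents = [["flights", "from", "boston", "to", "denver"]]
labels = [["O", "O", "B-fromloc", "O", "B-toloc"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)

# The learned transition weights are exactly what the per-token boosting and
# (to a large extent) the RNN taggers lack: scores for (previous label -> current label).
print(crf.transition_features_)
print(crf.predict(X))
```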
