Is it time to switch to Word Embedding and Recurrent Neural Networks for Spoken Language Understanding?

Vedran Vukotić, Christian Raymond, Guillaume Gravier



Recently, word embedding representations have been investigated for slot filling in Spoken Language Understanding, along with the use of Neural Networks as classifiers. Neural Networks, and especially Recurrent Neural Networks (RNNs), which are specifically suited to sequence labeling problems, have been applied successfully on the popular ATIS database. In this work, we compare these models with the previous state-of-the-art classifier, Conditional Random Fields (CRF), on a more challenging SLU database. We show that, despite the efficient word representations used within these Neural Networks, their ability to process sequences is still significantly inferior to that of CRF, at a higher computational cost, and that the ability of CRF to model output label dependencies is crucial for SLU.


In this work, we analyze whether RNNs are a good candidate for the task of slot tagging in spoken language understanding and whether they really outperform the previous state-of-the-art method, CRF. We analyze two things:

  • is the improvement brought by RNNs more likely due to their architecture, or to the fact that they use continuous word representations?
  • do RNNs outperform CRF when evaluated on a dataset that is not as simple as ATIS (which is unfortunately still the default dataset for SLU)?

To gain insight into the first question, we use a classifier that can take both symbolic and numeric (continuous) representations as input: boosting over decision trees. We then evaluate how the two different inputs affect the same type of classifier:

Dataset  Representation  Precision (%)  Recall (%)  F-measure (%)
ATIS     symbolic        93.00          93.43       93.21
ATIS     numeric         93.50          94.54       94.02
MEDIA    symbolic        71.09          75.48       73.22
MEDIA    numeric         73.61          78.85       76.12

We see that using numerical (continuous) representations brings a significant improvement, and that this is very likely a significant factor aiding RNNs.
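The advantage of continuous over symbolic representations can be sketched with a toy comparison (all words and vectors below are made up for illustration): one-hot vectors make every pair of distinct words equally dissimilar, while embedding-like vectors let a classifier generalize between related words.

```python
import numpy as np

# Hypothetical toy vocabulary; the embedding values are invented for illustration.
vocab = ["hotel", "motel", "paris"]

# Symbolic (one-hot): every pair of distinct words is orthogonal.
one_hot = np.eye(len(vocab))

# Continuous (embedding-like): related words can be close in the space.
embeddings = np.array([
    [0.90, 0.10, 0.00],   # "hotel"
    [0.85, 0.15, 0.00],   # "motel" (near "hotel")
    [0.00, 0.20, 0.95],   # "paris"
])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors carry no similarity signal between distinct words:
print(cosine(one_hot[0], one_hot[1]))        # 0.0 for any distinct pair
# Embeddings encode that "hotel" and "motel" are similar:
print(cosine(embeddings[0], embeddings[1]))  # close to 1.0
print(cosine(embeddings[0], embeddings[2]))  # much lower
```

A decision rule learned on "hotel" thus has a chance of transferring to "motel" in the continuous space, which is impossible with one-hot features.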

To try to answer the second question, we evaluate both CRF and RNNs on two datasets, namely:

  • ATIS - the famous, standard air traffic information dataset, which is unfortunately not very challenging
  • MEDIA - a lesser-known but more complex French dataset containing touristic/reservation scenarios

For RNNs, we also evaluate both pretrained word representations (Word2Vec) and word representations trained jointly with the RNN (a lookup table at the beginning of the network, updated through backpropagation). The results are as follows:

Dataset  Algorithm    Representation      F-measure (%)
ATIS     Bonzaiboost  numeric (Word2Vec)  94.02
ATIS     Bonzaiboost  symbolic            92.97
ATIS     CRF          symbolic            95.23
ATIS     Elman RNN    numeric (joint)     96.16
MEDIA    Bonzaiboost  numeric (Word2Vec)  76.14
MEDIA    Bonzaiboost  symbolic            73.22
MEDIA    CRF          symbolic            86.00
MEDIA    Elman RNN    numeric (joint)     81.76
MEDIA    Elman RNN    numeric (Word2Vec)  81.94
MEDIA    Jordan RNN   numeric (joint)     83.25
MEDIA    Jordan RNN   numeric (Word2Vec)  83.15
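The "numeric (joint)" setting corresponds to an embedding lookup table that is just another weight matrix of the network. A minimal forward-pass sketch of such an Elman RNN tagger (dimensions, weights, and word ids are all invented for illustration; no training loop is shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; real systems use far larger vocabularies and dimensions.
V, E, H, L = 10, 4, 5, 3   # vocab size, embedding dim, hidden dim, slot labels

# The lookup table is simply a trainable matrix with one row per word;
# backpropagation would update it together with the recurrent weights.
lookup = rng.normal(scale=0.1, size=(V, E))
W_x = rng.normal(scale=0.1, size=(H, E))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden (Elman recurrence)
W_o = rng.normal(scale=0.1, size=(L, H))   # hidden-to-label weights

def elman_forward(word_ids):
    """Tag a sentence: embed each word, update the hidden state, emit a label."""
    h = np.zeros(H)
    labels = []
    for w in word_ids:
        e = lookup[w]                    # embedding lookup (trained jointly)
        h = np.tanh(W_x @ e + W_h @ h)   # Elman recurrence over the sequence
        labels.append(int(np.argmax(W_o @ h)))  # greedy per-token label
    return labels

print(elman_forward([1, 4, 7, 2]))  # one slot label per input word
```

A Jordan RNN differs only in that the recurrence feeds back from the output layer rather than the hidden layer. Note that both architectures decide each label greedily, token by token.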

We conclude that continuous representation spaces allow for better generalization (better accuracy) and make the classification algorithm converge faster. Moreover, continuous representations decrease the likelihood that a classifier produces decision rules fitted to noise, and are thus more robust to noise than symbolic ones. Despite this, algorithms able to exploit them, such as RNNs, are not able to compete with CRF. Although CRF is trained solely on symbolic features, its ability to model output label dependencies appears crucial for the task. CRF with symbolic features thus remains the best classification algorithm for SLU, in terms of prediction.
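What "modeling output label dependencies" buys can be sketched with a small Viterbi decode (all scores and labels below are invented for illustration): per-token greedy decisions can produce an invalid label sequence such as I-loc directly after O, whereas a transition matrix over label pairs, as a CRF learns, lets the decoder pick the best sequence globally.

```python
import numpy as np

# Hypothetical per-token label scores for a 3-token sentence.
labels = ["O", "B-loc", "I-loc"]
emissions = np.array([
    [2.0, 1.0, 0.5],
    [0.5, 1.1, 1.2],   # per-token scores alone slightly favor I-loc here
    [0.5, 0.6, 1.4],
])
# Hypothetical transition scores between consecutive labels;
# I-loc directly after O is heavily penalized, as a CRF would learn.
transitions = np.array([
    [0.5,  0.5, -5.0],   # from O
    [0.0, -5.0,  2.0],   # from B-loc
    [0.0, -5.0,  1.0],   # from I-loc
])

def viterbi(emissions, transitions):
    """Return the label sequence maximizing emission + transition scores."""
    n, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, L), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Greedy per-token argmax yields O, I-loc, I-loc (an invalid transition):
print([labels[i] for i in emissions.argmax(axis=1)])
# Viterbi with transition scores repairs it to O, B-loc, I-loc:
print([labels[i] for i in viterbi(emissions, transitions)])
```

RNNs as evaluated here make each labeling decision independently given the hidden state, so they have no comparable hard mechanism for enforcing consistency between consecutive output labels.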
