(2015.12) Vocabulary Strategy

  • published in 2015.12

  • Wenlin Chen, David Grangier and Michael Auli

Simple Summary

  • Presents a systematic comparison of strategies for representing and training large output vocabularies

  • Strategies

    • softmax : normalizes over all output classes

    • hierarchical softmax : introduces latent variables, or clusters, to simplify normalization

    • differentiated softmax : adjusts capacity based on token frequency (a novel variation of softmax that assigns more capacity to frequent words, which the authors show to be faster and more accurate than softmax)

    • target sampling : only considers a random subset of classes for normalization

    • noise contrastive estimation : discriminates between genuine data points and samples from a noise distribution

    • infrequent normalization (self-normalization) : computes the partition function only at an infrequent rate
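The cost being avoided is the normalization over the full vocabulary. A minimal numpy sketch contrasting full softmax with target sampling, which normalizes over only the target class plus a random subset of negatives (all sizes and the sample count here are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

V = 10_000                          # vocabulary size (illustrative)
d = 32                              # hidden dimension (illustrative)
h = rng.normal(size=d)              # hidden state for one position
W = rng.normal(size=(V, d)) * 0.1   # output embedding matrix

logits = W @ h

def full_softmax(logits):
    """Full softmax: normalize over every word in the vocabulary."""
    z = logits - logits.max()       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sampled_softmax_prob(logits, target, num_samples=100, rng=rng):
    """Target sampling: normalize over the target plus a random
    subset of classes, so the sum runs over ~num_samples terms
    instead of the full vocabulary."""
    negatives = rng.choice(len(logits), size=num_samples, replace=False)
    support = np.union1d(negatives, [target])   # sorted unique indices
    z = logits[support] - logits[support].max()
    e = np.exp(z)
    probs = e / e.sum()
    return probs[np.searchsorted(support, target)]

target = 42
print(full_softmax(logits)[target])
print(sampled_softmax_prob(logits, target))
```

The sampled probability is biased relative to the full softmax, but the per-step cost drops from O(V) to O(num_samples), which is the trade-off the paper's target-sampling and NCE strategies exploit.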
