(2015. 12) Byte To Span

  • published in December 2015 ("Multilingual Language Processing From Bytes")

  • Dan Gillick, Cliff Brunk, Oriol Vinyals and Amarnag Subramanya

Simple Summary

  • An LSTM-based sequence-to-sequence model, called Byte-to-Span (BTS), that reads text as a sequence of raw bytes and outputs span annotations of the form [start, length, label], where start positions, lengths, and labels are all entries in a single output vocabulary (see the sketch below).

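A minimal sketch of this input/output framing, not the authors' code: the sentence, the annotation, and the helper `to_bts_targets` are all illustrative assumptions. It shows how text becomes a byte sequence and how a labeled span becomes a [start, length, label] triple over byte offsets.

```python
TEXT = "Michael Jordan plays basketball."
# Hypothetical gold annotation: "Michael Jordan" labeled PER.
SPANS = [("Michael Jordan", "PER")]

def to_bts_targets(text: str, spans):
    """Convert character-level spans to byte-level [start, length, label] triples."""
    data = text.encode("utf-8")  # input sequence: raw bytes (vocab size 256)
    targets = []
    for surface, label in spans:
        char_start = text.index(surface)
        # byte offset = length of the UTF-8 encoding of the preceding text
        start = len(text[:char_start].encode("utf-8"))
        length = len(surface.encode("utf-8"))
        targets.append((f"START_{start}", f"LEN_{length}", label))
    return list(data), targets

byte_seq, triples = to_bts_targets(TEXT, SPANS)
print(byte_seq[:8])  # [77, 105, 99, 104, 97, 101, 108, 32]
print(triples)       # [('START_0', 'LEN_14', 'PER')]
```

Because starts, lengths, and labels are just symbols in one output vocabulary, the decoder emits a span as three consecutive tokens rather than tagging every input position.
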
  • A single model handles text in many languages. Because the input vocabulary is just 256 byte values rather than a large word vocabulary, these multilingual models are very compact.

  • Experiments

    • POS Tagging: 13 languages, 2.87M tokens, 25.3M training segments

    • NER: 4 languages, 0.88M tokens, 6M training segments

  • Contributions

    1. Uses the bytes of variable-length Unicode (UTF-8) encodings as input, which allows training a single multilingual model that improves over single-language models without additional parameters. Training also uses byte dropout, where input bytes are randomly replaced with a DROP symbol (a sketch follows this list).

    2. The model produces span annotations directly, where each span is a sequence of three outputs: start, length, and label.

    3. The models are much more compact than traditional word-based systems and are standalone: no processing pipeline is needed.

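A minimal sketch of byte dropout as described in contribution 1. The DROP id and the rate of 0.3 are illustrative assumptions, not values taken from this note.

```python
import random

DROP = 256  # assumed id reserved for the DROP symbol, outside the 0-255 byte range

def byte_dropout(byte_seq, p, rng=random):
    """Independently replace each input byte with DROP with probability p,
    so the model cannot rely on any single byte being present."""
    return [DROP if rng.random() < p else b for b in byte_seq]

seq = list("New York".encode("utf-8"))
print(byte_dropout(seq, p=0.3))  # e.g. [78, 256, 119, 32, 89, 111, 256, 107]
```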