(2017. 11) Neural Text Generation

  • published in 2017. 11

  • Ziang Xie

Simple Summary

  • A practical guide for diagnosing and resolving pathological behaviors during decoding.

  • The primary focus is on tasks where the target is a single sentence, hence the term "text generation" as opposed to "language generation".

Preprocessing

  • Ultimately, if using word tokens, it's important to use a consistent tokenization scheme for all inputs to the system - this includes handling of contractions, punctuation marks such as quotes and hyphens, periods denoting abbreviations (nonbreaking prefixes) vs. sentence boundaries, character escaping, etc.
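
As a minimal illustration of the consistency point, the sketch below routes every string through a single tokenizer; the regex and normalization choices here are illustrative assumptions, not the paper's recipe.

```python
import re

def tokenize(text):
    """One consistent scheme for every input: training sources, targets, and test data."""
    text = text.strip().lower()
    # Normalize curly quotes so straight and curly variants map to the same token.
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    # Keep contractions like "don't" together, otherwise split off punctuation.
    return re.findall(r"\w+'\w+|\w+|[^\w\s]", text)

# The same function must be applied everywhere, including at inference time.
print(tokenize('He said: "Don\'t stop."'))
# ['he', 'said', ':', '"', "don't", 'stop', '.', '"']
```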

Model

  • RNN

    • Uses shared parameter matrices across time steps and combines the input at the current time step with the previous hidden state, which summarizes all previous time steps.

  • CNN

    • Convolutions with kernels reused across time steps can also be used, with masking to avoid peeking ahead at future inputs during training (see the masked-convolution sketch after this list).

  • Attention

    • The attention mechanism acts as a shortcut connection between the target output prediction and the relevant source input hidden states.
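
To make the masking idea in the CNN bullet concrete, here is a small NumPy sketch of a causal (masked) 1-D convolution: left-padding guarantees that the output at time t depends only on inputs up to t, so training never peeks at future inputs. The single-feature setup and kernel values are illustrative assumptions.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D convolution over time that sees only current and past inputs.

    x:      (T,) input sequence (a single feature for simplicity)
    kernel: (K,) filter weights reused at every time step
    """
    K = len(kernel)
    # Left-pad with K-1 zeros so output[t] depends only on x[:t+1].
    padded = np.concatenate([np.zeros(K - 1), x])
    return np.array([padded[t:t + K] @ kernel for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.5])              # average of the current and previous step
print(causal_conv1d(x, k))            # [0.5 1.5 2.5 3.5]
```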

Training

  • Optimize the sequence cross-entropy loss L(θ) = -Σ_t log p_θ(y_t | y_<t, X) over the model parameters θ.

  • Recent research has also explored other training methods:

    • using RL or a separate adversarial loss

  • Useful heuristics

    • Sort the next dozen or so batches of sentences by length so each batch contains similarly sized sentences (see the bucketing sketch after this list).

    • If the training set is small, spend more effort tuning regularization.

    • Measure validation loss after each epoch and anneal the learning rate when validation loss stops decreasing.

    • Periodically checkpoint model parameters and measure downstream performance.

    • Ensembling almost always improves performance.
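
A minimal sketch of the length-sorting heuristic: sort a pool of roughly a dozen batches' worth of sentences by length so each batch contains similarly sized sentences, then shuffle the batch order. The batch and pool sizes are illustrative assumptions.

```python
import random

def length_bucketed_batches(sentences, batch_size=32, pool_size=12):
    """Group tokenized sentences into batches of similar length."""
    pool = batch_size * pool_size
    batches = []
    for start in range(0, len(sentences), pool):
        # Sort the next `pool_size` batches' worth of sentences by length.
        chunk = sorted(sentences[start:start + pool], key=len)
        batches.extend(chunk[i:i + batch_size] for i in range(0, len(chunk), batch_size))
    random.shuffle(batches)   # avoid always presenting batches from short to long
    return batches

data = ["a short one".split(), "tiny".split(),
        "medium length sentence here".split(),
        "this sentence is quite a bit longer than the others".split()]
for batch in length_bucketed_batches(data, batch_size=2):
    print([len(s) for s in batch])
```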

Decoding

  • During decoding, we are given the source sentence X and seek to generate the target Y^ that maximizes some scoring function s(Y^).

  • Beam Search

    • A surprising result with neural models is that relatively small beam sizes yield good results, with rapidly diminishing returns. Further, larger beam sizes can even yield (slightly) worse results.

  • When measuring the scoring function, useful diagnostics include (see the sketch after this list):

    • Average length of decoded outputs Y^ vs average length of reference targets Y.

    • s(Y^) vs s(Y), then inspecting the ratio s(Y^)/s(Y). If the average ratio is especially low, then there may be a bug in the beam search, or the beam size may need to be increased. If the average ratio is high, then the scoring function may not be appropriate.

    • For some applications, computing the edit distance (insertions, substitutions, deletions) between Y^ and Y may also be useful.
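
A sketch of these diagnostics, assuming `score` is a stand-in for the model's scoring function s(.) (for example, the beam-search objective); interpretation of the averages follows the bullets above.

```python
import numpy as np

def decoding_diagnostics(decoded, references, score):
    """Length and score diagnostics for decoded outputs vs. reference targets."""
    return {
        "avg_len_decoded": float(np.mean([len(y) for y in decoded])),
        "avg_len_reference": float(np.mean([len(y) for y in references])),
        "avg_score_ratio": float(np.mean([score(y_hat) / score(y)
                                          for y_hat, y in zip(decoded, references)])),
    }

toy_score = lambda tokens: -float(len(tokens))   # toy scoring function, illustration only
print(decoding_diagnostics([["a", "b"]], [["a", "b", "c"]], toy_score))
```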

Common issues

  • Rare and out-of-vocabulary (OOV) words

    • A common mitigation is subword segmentation; see "Neural machine translation of rare words with subword units" (2015. 8).

  • Decoded output is short, truncated, or ignores portions of the input

    • normalizing the log-probability score and adding a length bonus.

  • Decoded output repeats

    • easily detected using the attention matrix A with some manually selected threshold.

    • See "Get to the point: Summarization with pointer-generator networks" (2017. 4) for a coverage-based mitigation.

  • Lack of diversity

    • Increasing the temperature τ of the softmax, exp(z_i/τ) / Σ_j exp(z_j/τ), is a simple method for encouraging more diversity in decoded outputs (see the sampling sketch after this list).
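
A minimal sketch of temperature sampling from a vector of logits; the logits and τ values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, tau=1.0):
    """Sample a token id from softmax(logits / tau).

    tau > 1 flattens the distribution (more diverse outputs);
    tau < 1 sharpens it (closer to greedy decoding)."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                              # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]
print([sample_with_temperature(logits, tau=0.5) for _ in range(5)])   # mostly token 0
print([sample_with_temperature(logits, tau=2.0) for _ in range(5)])   # more spread out
```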

Attention

  • The basic attention mechanism, used to "attend" to portions of the encoder hidden states at each decoder time step, has many extensions and applications (a minimal sketch follows).
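
A minimal NumPy sketch of that basic mechanism: at each decoder time step, the decoder state is scored against every encoder hidden state and their softmax-weighted average is returned as the context. Dot-product scoring is assumed here; additive (Bahdanau-style) scoring is a common alternative.

```python
import numpy as np

def attend(query, encoder_states):
    """Dot-product attention over encoder hidden states.

    query:          (d,)   current decoder hidden state
    encoder_states: (T, d) one hidden state per source time step
    Returns the context vector and the attention weights
    (one row of the attention matrix A)."""
    scores = encoder_states @ query                    # (T,)
    scores -= scores.max()                             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over source positions
    context = weights @ encoder_states                 # (d,)
    return context, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))         # 5 source time steps, hidden size 4
q = rng.normal(size=4)
context, A_row = attend(q, H)
print(A_row.round(3), A_row.sum())  # attention weights sum to 1
```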

Evaluation

  • Common metrics are based on n-gram overlap, e.g. BLEU and ROUGE.

  • Better evaluation metrics are still needed; see "Why We Need New Evaluation Metrics for NLG" (2017. 6).
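
As an example of an n-gram based metric, a sentence-level BLEU score can be computed with NLTK (assuming NLTK is installed); the toy sentences are illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids zero scores when a higher-order n-gram has no match.
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, hypothesis, smoothing_function=smooth))
```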

References

  • Adversarial Learning for Neural Dialogue Generation (2017. 1)
  • Wasserstein GAN (2017. 1)
  • Neural machine translation of rare words with subword units (2015. 8) - https://arxiv.org/abs/1508.07909
  • Google’s neural machine translation system: Bridging the gap between human and machine translation (2016. 9)
  • Modeling coverage for neural machine translation (2016. 1)
  • Get to the point: Summarization with pointer-generator networks (2017. 4) - https://arxiv.org/abs/1704.04368
  • A Diversity-Promoting Objective Function for Neural Conversation Models (2015. 10)
  • A simple, fast diverse decoding algorithm for neural generation (2016. 11)
  • ROUGE - https://en.wikipedia.org/wiki/ROUGE_(metric)
  • BLEU
  • Why We Need New Evaluation Metrics for NLG (2017. 6)