
(2015. 2) Batch Normalization

  • published in 2015. 2

  • Ioffe, Sergey, and Christian Szegedy

Simple Summary

Addresses the problem of internal covariate shift by normalizing layer inputs. The method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows much higher learning rates and less careful initialization, and it also acts as a regularizer, in some cases eliminating the need for Dropout.

  • Using mini-batches

    1. the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases (a quick check of this is sketched after this list).

    2. computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by modern computing platforms.
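
A quick sanity check of point 1, as a minimal sketch (the linear-regression setup, batch sizes, and sample counts below are illustrative assumptions, not from the paper): the spread of the mini-batch gradient around the full-training-set gradient shrinks roughly as 1/sqrt(m).

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = 2.0
x = rng.normal(size=100_000)
y = w_true * x + rng.normal(scale=0.5, size=x.size)

def grad_estimate(w, batch_size):
    """Mini-batch gradient of the mean squared error, d/dw of mean((w*x - y)^2)."""
    idx = rng.integers(0, x.size, size=batch_size)
    xb, yb = x[idx], y[idx]
    return np.mean(2.0 * (w * xb - yb) * xb)

w = 0.0
for m in (1, 16, 256):
    estimates = [grad_estimate(w, m) for _ in range(1000)]
    print(f"batch size {m:4d}: std of gradient estimate = {np.std(estimates):.3f}")
# the spread shrinks roughly as 1/sqrt(m): larger mini-batches give better
# approximations of the gradient over the whole training set
```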

Internal Covariate Shift

  • The change in the distribution of each layer's inputs (covariate shift) presents a problem because the layers need to continuously adapt to the new distribution.

  • The input distribution properties that make training more efficient apply to training each sub-network as well: if the distribution of nonlinearity inputs remains more stable as the network trains, the optimizer is less likely to get stuck in the saturated regime, and training accelerates.

  • It also causes vanishing/exploding gradient problems (see the sketch after this list).
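
A minimal sketch of the saturation point, assuming a sigmoid nonlinearity (the specific input values are only illustrative): once inputs drift far from zero, the sigmoid's local gradient factor collapses, which is where vanishing gradients come from.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the sigmoid's derivative, sigmoid(z) * (1 - sigmoid(z)), shrinks rapidly
# as the input distribution drifts away from zero
for z in (0.0, 2.0, 5.0, 10.0):
    g = sigmoid(z) * (1.0 - sigmoid(z))
    print(f"input {z:5.1f} -> gradient factor {g:.6f}")
# keeping nonlinearity inputs in a stable, roughly zero-centered range is
# exactly what Batch Normalization aims for
```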

Towards Reducing Internal Covariate Shift

  • Whitening: inputs are linearly transformed to have zero means and unit variances, and are decorrelated.

    • The issue is that gradient descent optimization does not take into account the fact that the normalization takes place (illustrated after this list).

  • For any parameter values, the network should always produce activations with the desired distribution. Doing so allows the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters.

  • Since training uses mini-batches, each mini-batch produces estimates of the mean and variance of each activation, so the statistics used for normalization can fully participate in backpropagation.
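
A minimal sketch of the failure mode above, following the paper's bias example (the squared-error loss, layer size, and learning rate are illustrative assumptions): when the centering is treated as a constant by the gradient, the bias update never changes the output, yet the bias keeps growing.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=64)            # fixed inputs to the layer
t = rng.normal(loc=1.0, size=64)   # arbitrary regression targets
b, lr = 0.0, 0.1                   # learned bias, learning rate

for step in range(3):
    x_hat = (u + b) - (u + b).mean()        # centering done "outside" the gradient
    loss = 0.5 * np.sum((x_hat - t) ** 2)
    grad_b = np.sum(x_hat - t)              # treats the subtracted mean as a constant w.r.t. b
    b -= lr * grad_b
    print(f"step {step}: loss = {loss:.3f}, b = {b:.3f}")
# the loss never changes (x_hat is independent of b), but b keeps growing:
# the gradient has to account for the normalization itself
```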

Algorithms

  • Batch Normalizing Transform, applied to an activation x over a mini-batch (restated after this list).

  • Training a Batch-Normalized Network: mini-batch statistics are used during training; for inference they are replaced by population estimates, so the output depends only on the input (see the sketch after this list).
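
For reference, the Batch Normalizing Transform from the paper, for an activation x over a mini-batch of size m, with learned scale and shift parameters γ, β:

```latex
\begin{align*}
\mu_\mathcal{B} &= \frac{1}{m}\sum_{i=1}^{m} x_i,
  & \sigma_\mathcal{B}^2 &= \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_\mathcal{B}\right)^2, \\
\hat{x}_i &= \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}},
  & y_i &= \gamma\,\hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i).
\end{align*}
```

And a minimal NumPy sketch of how training and inference differ; this is an illustration under assumptions, not the paper's exact procedure: it assumes a (batch, features) activation and uses running averages in place of the paper's post-training population estimates, as most frameworks do.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization for a (batch, features) activation.

    Training: normalize with mini-batch statistics and update running
    estimates of the population mean/variance.
    Inference: normalize with the running (population) estimates.
    """

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)   # learned scale
        self.beta = np.zeros(num_features)   # learned shift
        self.eps = eps
        self.momentum = momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            mu = x.mean(axis=0)              # mini-batch mean
            var = x.var(axis=0)              # mini-batch variance
            # exponential moving averages approximate the population statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)   # normalize
        return self.gamma * x_hat + self.beta        # scale and shift

# usage
bn = BatchNorm1D(4)
batch = np.random.randn(32, 4) * 3.0 + 5.0
out = bn.forward(batch, training=True)        # ~zero mean, unit variance per feature
test = bn.forward(batch[:1], training=False)  # uses the running statistics
```

At inference the transform becomes a fixed affine function of its input, which is what lets the output depend only on the input rather than on the other examples in the batch.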

Experiments
