(2017. 1) Very Large NN More Layer

  • Submitted on 2017. 1

  • Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean

Simple Summary

Introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. ... address large capacity challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.

  • introduce a general-purpose mechanism to scale up neural networks significantly beyond their current size using sparsity of activation.

  • Challenges

    • Modern computing devices (GPU): propose turning on/off large chunks of the

      network with each gating decision.

    • Large batch sizes are critical for performance

    • Network bandwidth can be a bottleneck

    • Depending on the scheme, loss terms may be necessary to achieve the desired level of sparsity per-chunk and/or per example.

    • Model capacity is most critical for very large data sets.

  • The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input.

  • G(x): output of the gating network

    • Softmax Gating: apply Softmax function

    • Noisy Top-K Gating: add sparsity and noise

  • E_i(x): the output of the i-th expert network for a given input x.

  • The Shrinking Batch Problem: each expert receives a much smaller batch of approximately kb/n << b examples.

    • Mixing Data Parallelism and Model Parallelism

    • Taking Advantage of Convolutionality

    • Increasing Batch Size for a Recurrent MoE

  • Balancing Expert Utilization

    • the gating network tends to converge to a state where it always produces

      large weights for the same few experts -> take a soft constraint approach

  • Experiments

Last updated