Introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence
Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. (eg. finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem)
images
images
Softmax normalizes the vector e_ij to be an output distribution over the dictionary of inputs.