Normalization Layer

Layer Normalization: Batch-size Independent Normalization

In Layer Normalization, the inputs to all neurons in the same layer are normalized for each data sample. All neurons in that layer therefore share the same normalization terms: the same mean and the same variance, computed across the layer's activations for that sample. Because the statistics are computed per data point rather than per batch, layer normalization is independent of batch size, and exactly the same computation is performed at training and at test time. In recurrent networks such as RNNs, the normalization is applied at every time step; in Transformers, it is applied to every transformed, positionally encoded word embedding. For these reasons, layer normalization is widely considered far better suited to Transformers and recurrent neural networks than batch normalization.
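
To make the per-sample computation concrete, here is a minimal NumPy sketch of layer normalization. The feature dimension, the epsilon value, and the learnable gamma/beta parameters are illustrative assumptions, not details taken from the text above.

```python
# A minimal sketch of layer normalization (assumed shapes and parameters).
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, features); the statistics are computed per
    # sample, across the feature (layer) dimension -- not across the batch.
    mean = x.mean(axis=-1, keepdims=True)    # one mean per data point
    var = x.var(axis=-1, keepdims=True)      # one variance per data point
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each sample
    return gamma * x_hat + beta              # learnable scale and shift

# Usage: the result for any individual sample does not depend on batch size.
x = np.random.randn(4, 8)                    # 4 samples, 8 neurons in the layer
gamma, beta = np.ones(8), np.zeros(8)
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))                       # approximately 0 for every sample
print(y.std(axis=-1))                        # approximately 1 for every sample
```

Because each sample is normalized using only its own statistics, the same function can be applied unchanged at training and test time, at every RNN time step, or to every token embedding in a Transformer.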