Add and Norm: Improving Training Without Changing the Core of the Model

Essentially, these layers normalize activations in order to reshape the contours of the cost function, which can become elongated when features differ widely in scale.

Normalization makes these contours more evenly sized with respect to all the features, and it is applied repeatedly, layer after layer, through the network. This is why you see so many add and norm layers in most architectures.
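As a rough illustration, here is a minimal sketch of one add and norm step in plain Python. It assumes the "norm" part is layer normalization over the feature dimension, as in the Transformer architecture, and it leaves out the learned gain and bias that real implementations usually include. The function names, the `eps` value, and the sample activations are all hypothetical and chosen only for this example.

```python
import math

def layer_norm(features, eps=1e-5):
    """Rescale one feature vector to zero mean and roughly unit variance."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    # eps (an assumed value) guards against division by zero.
    return [(f - mean) / math.sqrt(var + eps) for f in features]

def add_and_norm(x, sublayer_out, eps=1e-5):
    """Add the sublayer output to its input (the "add"), then normalize (the "norm")."""
    summed = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(summed, eps)

# Hypothetical activations whose features sit on very different scales.
x = [0.5, 120.0, -3.0, 0.01]
sublayer_out = [0.1, 30.0, 1.0, 0.02]

normed = add_and_norm(x, sublayer_out)
print(normed)
```

After the call, the values in `normed` are centered on zero with a variance close to one, so no single large feature dominates the scale of what the next layer receives.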