Multihead Attention

Multihead Attention: Many Attention Layers Running at Once

In the Transformer architecture, attention is not computed just once. Instead, several attention "heads" run in parallel, and the intuition is that each head learns to focus on a different aspect of language. The number of attention heads varies from model to model; in practice it is an architectural choice that commonly ranges from around 12 to nearly 100 heads in a single multihead layer.
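
To make the idea concrete, here is a minimal sketch of a multihead self-attention layer in PyTorch. The class name `MultiheadSelfAttention` and the parameters `embed_dim` and `num_heads` are illustrative, not taken from any particular model: each head attends to the same input independently, and the heads' outputs are concatenated and mixed at the end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiheadSelfAttention(nn.Module):
    """Minimal multihead self-attention: several attention heads run in
    parallel on the same input, and their outputs are concatenated."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # Single query/key/value projections; the per-head split is done by
        # reshaping the projected tensors.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, embed_dim = x.shape

        # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Scaled dot-product attention, computed for all heads at once.
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        weights = F.softmax(scores, dim=-1)
        attended = weights @ v  # (batch, num_heads, seq_len, head_dim)

        # Concatenate the heads back together and mix them with a final projection.
        attended = attended.transpose(1, 2).reshape(batch, seq_len, embed_dim)
        return self.out_proj(attended)


# Example: 12 heads over a 768-dimensional embedding (GPT-2-small-sized).
attn = MultiheadSelfAttention(embed_dim=768, num_heads=12)
tokens = torch.randn(1, 10, 768)   # batch of 1, sequence of 10 token embeddings
print(attn(tokens).shape)          # torch.Size([1, 10, 768])
```

Note that each head works on a slice of the embedding (`head_dim = embed_dim / num_heads`), which is why adding more heads does not by itself increase the layer's parameter count.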