Self Attention: Turning Words Into Contextual Embeddings

Self attention grew out of the shortcomings of Recurrent Neural Networks...mainly, the problem that they struggle to handle long sequences of words.

So, by creating Query, Key and Value vectors for each word, we can make every word's vector carry contextual information from the rest of the sentence. This mechanism is called self attention because we are "attending" to the same sentence that we are using as input. It is also worth noting that many other forms of attention exist, including variants used in computer vision.
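To make the Query/Key/Value idea concrete, here is a minimal sketch of scaled dot-product self attention in NumPy. The function name, the projection matrices, and the toy dimensions are all assumptions chosen for illustration; they are not taken from any particular library or from the text above.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors.

    X:             (seq_len, d_model) matrix of input word embeddings
    W_q, W_k, W_v: (d_model, d_k) projection matrices for queries, keys, values
    Returns a (seq_len, d_k) matrix of context-aware word vectors.
    """
    Q = X @ W_q            # queries: what each word is looking for
    K = X @ W_k            # keys: what each word offers to the others
    V = X @ W_v            # values: the information that actually gets mixed
    d_k = Q.shape[-1]

    # Similarity of every query with every key, scaled to keep softmax stable
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)

    # Softmax over each row turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output vector is a weighted mix of all the value vectors
    return weights @ V

# Toy example: 4 "words" with 8-dimensional embeddings (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
context_vectors = self_attention(X, W_q, W_k, W_v)
print(context_vectors.shape)  # (4, 8): one contextual vector per word
```

Each row of the output is no longer just the original word vector; it is a weighted blend of every word in the sentence, which is exactly what "carrying contextual information" means here.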