Mistral: An Open Weights Model Published By Mistral.ai

(many thanks to Nikhil Goel for providing this information, you can find his article here (opens in a new tab))

Mistral 7B is an open-source LLM from Mistral.ai. It has the following architecture:

Mistral Architecture

The Mistral architecture has the following main components:

  • Self-Attention Layer - implemented with Sliding Window Attention, Grouped Query Attention, and a Rolling Buffer KV Cache. Sliding Window Attention together with the KV cache is what makes Mistral fast, keeps its memory footprint small, and lets it handle very long sequences.
  • Feed Forward Layer (SiLU) - uses the SiLU activation function, which is both accurate and computationally efficient.
  • RMS Norm - Root Mean Square Normalization - computationally simpler and more efficient than LayerNorm.

Additionally, the most important point in the architecture image above is the blue box: it represents 'N' stacked Transformer decoder layers, and here 'N' = 32.

DIMM
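To make the stack concrete, here is a minimal PyTorch-style sketch of one decoder layer built from the components listed above (RMSNorm, self-attention, a SiLU-gated feed-forward) and stacked N times. This is an illustration, not Mistral's actual implementation: plain multi-head attention stands in for the sliding-window/grouped-query attention discussed below, and the tiny dimensions at the bottom are placeholders chosen just so the stack runs quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization: no mean subtraction, cheaper than LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class FeedForward(nn.Module):
    """SiLU-gated (SwiGLU-style) feed-forward block."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class DecoderLayer(nn.Module):
    """One decoder layer: pre-norm attention + pre-norm feed-forward, each with a residual."""
    def __init__(self, dim: int, n_heads: int, hidden_dim: int):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        # Placeholder attention; the real model uses sliding-window + grouped-query attention.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = FeedForward(dim, hidden_dim)

    def forward(self, x):
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + h                              # residual around attention
        return x + self.ffn(self.ffn_norm(x))  # residual around feed-forward

# The blue box in the figure corresponds to stacking N such layers (N = 32 in Mistral 7B).
# Tiny placeholder dimensions here, just to show the stack runs end to end.
model = nn.Sequential(*[DecoderLayer(dim=256, n_heads=4, hidden_dim=512) for _ in range(4)])
out = model(torch.randn(1, 16, 256))  # (batch, sequence, dim)
print(out.shape)                      # torch.Size([1, 16, 256])
```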

With the main architecture covered, let's dive into the architectural elements that make Mistral one of the best LLMs.

Mistral stands out because it can handle very long text sequences while remaining very fast at inference. It accomplishes this with the following components.

  • Sliding Window Attention - This is what makes long text sequences manageable. Sliding window attention is a self-attention variant in which each token attends only to the most recent tokens, up to the length of the sliding window parameter W. If, say, W=3, then during self-attention each token only attends to the previous 3 tokens. The natural question is what happens to the rest of a long sequence if the window is only 3 tokens wide; this is where the stacking of Transformer decoder layers (n_layers = 32) comes in. Because each layer attends to a window of positions in the layer below, information propagates across the whole sequence, so tokens far apart can still influence each other. This will be clear from the image below.
SWA

In the image above, sliding window attention attends to only 3 tokens (highlighted in yellow), whereas vanilla attention attends to all previous tokens. The number of layers is what lets all tokens still influence each other despite the window of 3: as layers are stacked, each token attends to a window of positions in the previous layer, which themselves attended to even earlier positions, so the reachable context grows with every layer. This is referred to as the 'Effective Context Length'. This is KEY.
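To make the mechanism concrete, here is a small PyTorch sketch of a sliding-window causal mask with W=3. It is an illustration only; the sequence length is a placeholder and nothing here is taken from Mistral's code.

```python
# Each query token i may attend only to keys j with i - W < j <= i (causal and within the window).
# Stacking layers widens the effective receptive field: after k layers, a token can indirectly
# draw on information from roughly k * W tokens back.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i                         # no attending to future tokens
    in_window = (i - j) < window            # only the last `window` tokens
    return causal & in_window

print(sliding_window_causal_mask(seq_len=6, window=3).long())
# tensor([[1, 0, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [0, 1, 1, 1, 0, 0],
#         [0, 0, 1, 1, 1, 0],
#         [0, 0, 0, 1, 1, 1]])
```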

  • Rolling Buffer KV Cache - This is what speeds up inference. A KV cache stores the keys and values of earlier tokens so attention does not have to recompute them at every decoding step, and this is KEY for fast generation. Mistral adds a rolling buffer on top of the cache to bound its memory: the cache has a fixed size W (the illustration below uses W=4), and as soon as more than W tokens have been seen, the oldest entries are overwritten. Refer to the image below and the sketch that follows it.
Token Computation
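Below is a minimal sketch of how a rolling buffer of size W=4 might behave during decoding. The tensor shapes and random values are placeholders for illustration, not Mistral's actual cache implementation.

```python
# Rolling-buffer KV cache sketch: keys/values for token i are written to slot i % W,
# so once more than W tokens have been seen, the oldest entries are overwritten and
# memory stays bounded at W entries per layer.
import torch

W, head_dim = 4, 8                  # buffer size and per-head dimension (illustrative values)
k_cache = torch.zeros(W, head_dim)  # rolling key buffer
v_cache = torch.zeros(W, head_dim)  # rolling value buffer

for i in range(10):                 # pretend we decode 10 tokens one by one
    k_i = torch.randn(head_dim)     # stand-ins for the real key/value of token i
    v_i = torch.randn(head_dim)
    slot = i % W                    # rolling index: token 4 overwrites token 0, and so on
    k_cache[slot] = k_i
    v_cache[slot] = v_i
    # Attention at step i can only use the (at most) W most recent keys/values,
    # which is exactly what sliding-window attention needs.
    valid = min(i + 1, W)
    print(f"token {i}: written to slot {slot}, cached entries available = {valid}")
```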

Together, these two concepts let Mistral handle long sequences with very fast inference.

One thing to note is that in v0.2 of Mistral 7B there is no sliding window attention. It simply has a native context of 32k, with no extra tricks to extend the context.