The Llama Architecture: Meta's Series of LLMs

(Many thanks to "Chingis" for writing about this architecture. You can find his original paper here.)

Llama Final Architecture

LLaMA is based on the Transformer architecture; however, the authors incorporate various improvements that were proposed and used in other models, such as PaLM.

Like GPT-3, LLaMA uses the Transformer's decoder-only architecture.

The figure above shows the final LLaMA architecture at a high level. As shown, LLaMA resembles the original Transformer in that it is a stack of Transformer blocks, arranged in a decoder-only configuration.

RMSNorm performs normalization by multiplying each element of x by the reciprocal of the root mean square of the elements of x (the squaring avoids negative values), with a small constant eps added for numerical stability. The normalized output is then scaled by the learnable self.weight parameter.
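As a minimal PyTorch sketch of the description above (the class and parameter names mirror the text but are not taken verbatim from the official implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization, as described above (sketch)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Learnable per-feature gain, the `self.weight` referred to in the text.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multiply each element by the reciprocal of the RMS of x
        # (square root of the mean of squares over the last dimension),
        # with eps added for numerical stability.
        normalized = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return normalized * self.weight
```

Unlike LayerNorm, RMSNorm skips the mean-centering step and only rescales by the root mean square, which makes it slightly cheaper to compute.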

RMSNorm

RMSNorm is applied to the input of each Transformer sub-layer within the TransformerBlock (i.e., pre-normalization), as shown in the figure above.
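The sketch below illustrates this pre-normalization pattern, reusing the RMSNorm class from the previous sketch. The attention and feed-forward modules are passed in as arguments here purely to keep the example self-contained; the real implementation constructs them internally and passes additional arguments such as positional information and masks.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-normalization sketch: RMSNorm is applied to the *input* of each
    sub-layer rather than to its output, with a residual connection around
    each sub-layer."""
    def __init__(self, dim: int, attention: nn.Module, feed_forward: nn.Module):
        super().__init__()
        self.attention = attention          # e.g. a multi-head self-attention module
        self.feed_forward = feed_forward    # e.g. a SwiGLU feed-forward module
        self.attention_norm = RMSNorm(dim)  # normalizes the input to attention
        self.ffn_norm = RMSNorm(dim)        # normalizes the input to the feed-forward

    def forward(self, x):
        # Normalize the input before each sub-layer, then add the residual.
        h = x + self.attention(self.attention_norm(x))
        return h + self.feed_forward(self.ffn_norm(h))
```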

They also replace the absolute positional embeddings with rotary positional embeddings (RoPE), applied at each layer of the network, and replace the ReLU activation with SwiGLU (using a hidden dimension of 2/3 · 4d instead of 4d).
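The sketch below illustrates both ideas under stated assumptions: RoPE is applied by rotating pairs of query/key features by a position-dependent angle (in LLaMA, the rotation is applied to the query and key vectors inside each attention layer), and the feed-forward sub-layer uses SwiGLU with a hidden dimension of 2/3 · 4d. The function and projection names (`precompute_freqs_cis`, `apply_rope`, `w1`/`w2`/`w3`) are illustrative, not a verbatim copy of the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0):
    # One complex rotation e^{i·m·θ_k} per position m and feature pair k.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)  # (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim). Adjacent feature pairs are treated
    # as complex numbers and rotated by a position-dependent angle.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = x_complex * freqs_cis[None, :, None, :]
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)

class SwiGLUFeedForward(nn.Module):
    """Feed-forward sub-layer with the SwiGLU activation; the hidden dimension
    is 2/3 · 4d rather than 4d, as stated in the text."""
    def __init__(self, dim: int):
        super().__init__()
        hidden_dim = int(2 * 4 * dim / 3)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU-gated linear unit, followed by the output projection.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Because RoPE encodes positions by rotating queries and keys rather than by adding a position vector to the token embeddings, no separate absolute positional embedding table is needed at the input of the model.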