Fully Connected Layer/Linear Layer: The Token's Final Destination
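Right before the output probabilities are calculated, all the information gathered in earlier steps of the transformer architecture passes through what is known as a fully connected layer, or linear layer. The layer itself is defined by its number of inputs and its number of outputs. The batch size is a dimension of the data passed through the layer, not a property of the layer: an input of shape (batch size, number of inputs) comes out with shape (batch size, number of outputs).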
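To make the shapes concrete, here is a minimal NumPy sketch of a linear layer followed by a softmax. The sizes, the random weights, and the helper names `linear_layer` and `softmax` are assumptions chosen for illustration, not values from any particular model. In a real transformer the number of outputs of this final layer would typically match the vocabulary size.

```python
import numpy as np

def linear_layer(x, weight, bias):
    """Apply y = x @ W^T + b to every row (example) in the batch."""
    return x @ weight.T + bias

def softmax(logits):
    """Turn each row of logits into probabilities that sum to 1."""
    # Subtracting the row maximum keeps the exponentials numerically stable.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical sizes, picked only to keep the example small.
batch_size, num_inputs, num_outputs = 2, 4, 6

rng = np.random.default_rng(0)
x = rng.normal(size=(batch_size, num_inputs))        # data entering the layer
weight = rng.normal(size=(num_outputs, num_inputs))  # the layer's learned weights
bias = np.zeros(num_outputs)                         # the layer's learned bias

logits = linear_layer(x, weight, bias)
probs = softmax(logits)

print(logits.shape)  # (2, 6): one row of outputs per example in the batch
print(probs.shape)   # (2, 6): same shape, each row is now a distribution
```

Note that `weight` and `bias` have no batch dimension. The same parameters are applied to every example, which is why the batch size can change from one call to the next without changing the layer.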