
All about LLMs

Difference between Transformers and LLMs:

  • Transformer → is a neural network architecture — a blueprint for how to arrange layers, attention mechanisms, and feed-forward networks to process sequences.
  • LLM (Large Language Model) → is a specific type of model trained on huge text datasets, and it uses the transformer architecture (or a variant of it) to understand and generate human-like text.

1. Transformers

  • First introduced in the 2017 paper “Attention Is All You Need”.

  • Purpose: A general-purpose architecture for processing sequences, originally introduced for sequence-to-sequence tasks such as translation and summarization, and since extended to speech recognition, image processing, and more.

  • Key ideas:

    • Self-attention: Each token can “look” at all other tokens in the sequence to find relevant context.
    • Positional encoding: Since attention is order-agnostic by default, positional information is added to the token embeddings (both ideas are sketched in code after this list).
    • Stack of encoder and/or decoder layers.
  • Scope: Can be used for many domains, not just text — e.g., Vision Transformers (ViT) for images, Audio Transformers, etc.
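
A minimal NumPy sketch (illustrative only, not from any particular library) of the two key ideas above: sinusoidal positional encodings added to the token vectors, and scaled dot-product self-attention in which every token attends to every other token.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encodings from "Attention Is All You Need":
    # even dimensions use sin, odd dimensions use cos.
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model)[None, :]                   # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle[:, 0::2])
    enc[:, 1::2] = np.cos(angle[:, 1::2])
    return enc

def self_attention(x, Wq, Wk, Wv):
    # Each token "looks" at all other tokens via query/key similarity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ v                                # context-mixed token vectors

seq_len, d_model = 8, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)            # (8, 16)
```

A real transformer block wraps this in multi-head attention, residual connections, layer normalization, and a feed-forward network, then stacks many such blocks.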

2. Large Language Models (LLMs)

  • Definition: A category of models trained on massive text corpora to perform many natural language tasks.

  • Almost all modern LLMs use a transformer-based architecture (GPT, BERT, LLaMA, Claude, etc.).

  • Key features:

    • Trained on hundreds of billions of tokens.
    • Massive parameter counts (billions+).
    • Pre-trained to predict the next token (or fill in masked tokens) given the surrounding context (see the decoding sketch after this list).
    • Fine-tuned for specific behaviors like instruction-following or conversation.
  • Scope: Specialized for language, reasoning, and knowledge retrieval, but built on transformer principles.
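
To make "predict the next token" concrete, here is a minimal greedy-decoding sketch. It assumes the Hugging Face transformers library and the public "gpt2" checkpoint; any causal LM checkpoint would work the same way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The transformer architecture was introduced in",
                      return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                          # generate 10 tokens, one at a time
        logits = model(input_ids).logits         # (batch, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()         # most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

Pre-training optimizes exactly this objective over hundreds of billions of tokens; fine-tuning for instruction-following or conversation then shapes how the model uses that ability.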

In short

Transformers = the architecture (like the design of an engine). LLMs = a big, trained model using that architecture for language tasks (like a sports car built with that engine).


The Significance of LLM Weights

  • Performance and Size: The number of weights is a key factor in a model’s capability. A model with more weights (billions or even trillions of them) can learn more complex patterns and relationships in the data, which generally leads to more accurate and nuanced outputs.
  • Customization: The ability to access and modify a model's weights allows for fine-tuning, where the model is further trained on a specific, smaller dataset to become an expert in a particular domain or task. This is how a general-purpose LLM can be adapted for a specific industry or application.
  • Efficiency: The sheer size of LLM weights makes them computationally expensive to run. Techniques like quantization reduce the precision of the weights, making the model smaller and faster to run while trying to preserve its performance (a minimal sketch follows this list).
  • Open-source LLM weights:
    • Lower the barrier to entry for developers and researchers. Instead of training a massive model from scratch, a process that requires immense computational resources and cost, they can take a pre-trained model and fine-tune it for their specific needs.
    • Enable techniques like quantization and pruning to create smaller, more efficient versions of a model that can run on consumer devices or at the "edge" (e.g., on a smartphone or in a car). This reduces the need for expensive cloud infrastructure and can lower power consumption.
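
A minimal sketch of the quantization idea mentioned above: symmetric int8 quantization of a single weight matrix with NumPy. Production tools are far more sophisticated; this only shows the basic precision-for-size trade.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)  # one fp32 weight matrix

scale = np.abs(w).max() / 127.0                  # map the fp32 range onto int8
w_int8 = np.round(w / scale).astype(np.int8)     # stored weights: 4x smaller than fp32
w_dequant = w_int8.astype(np.float32) * scale    # approximate reconstruction at run time

print(f"fp32: {w.nbytes / 1e6:.0f} MB, int8: {w_int8.nbytes / 1e6:.0f} MB")
print(f"mean absolute error: {np.abs(w - w_dequant).mean():.6f}")
```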

A Mixture-of-Experts (MoE) transformer

A mixture-of-experts (MoE) transformer is a type of large language model (or more generally, neural network) architecture that splits its computations across multiple specialized “expert” sub-networks instead of having a single, monolithic feed-forward block.

1. The Core Concept

  • In a standard transformer, each feed-forward layer is the same for all tokens and runs for every token in the input sequence.
  • In an MoE transformer, you have many different feed-forward networks (the “experts”) but only a small subset of them are active for any given token.
  • A router (or gating network) decides which experts to use for each token.
  • This means the model’s capacity (number of parameters) can be huge, but its computation cost per token is kept low.
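
A back-of-the-envelope illustration of that capacity-vs-compute split, using hypothetical numbers (8 experts, top-2 routing, 7B parameters per expert):

```python
n_experts, k = 8, 2
params_per_expert = 7e9                     # hypothetical: 7B parameters per expert MLP
total_stored = n_experts * params_per_expert
active_per_token = k * params_per_expert
print(f"stored: {total_stored / 1e9:.0f}B expert params, "
      f"active per token: {active_per_token / 1e9:.0f}B (plus shared attention layers)")
```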

2. How It Works (Inside a Transformer Block)

  1. Input Tokens → Pass through attention layer as usual.
  2. Router/Gating Network → Small learned network decides which K experts each token should be sent to (K is often 1 or 2).
  3. Experts → Each expert is a feed-forward network (MLP) with its own weights. For example, in a large language model, one expert might specialize in processing code, while another is better at handling factual information, and a third is an expert in creative writing.
  4. Combine Outputs → The outputs from the active experts are weighted and merged back into the token representation.
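
A minimal PyTorch sketch of steps 2-4 (names and sizes are illustrative, not any specific model’s implementation): a learned router scores each token, the top-K experts run on it, and their outputs are combined with the renormalized router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)    # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (n_tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)  # (n_tokens, n_experts)
        topk_probs, topk_idx = gate_probs.topk(self.k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Which tokens were routed to expert e, and in which top-k slot?
            tok_idx, slot_idx = (topk_idx == e).nonzero(as_tuple=True)
            if tok_idx.numel() == 0:
                continue                                # this expert is idle for this batch
            out[tok_idx] += topk_probs[tok_idx, slot_idx, None] * expert(x[tok_idx])
        return out

tokens = torch.randn(10, 64)            # 10 token representations after the attention layer
print(MoEFeedForward()(tokens).shape)   # torch.Size([10, 64])
```

The dispatch here is a plain Python loop for clarity; real implementations batch and shard it, which is exactly where the multi-device routing cost mentioned below comes from.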

3. Trade-offs

  • More complex training: load balancing is tricky, since some experts may get overused while others sit idle (a common auxiliary loss for this is sketched after this list).
  • Harder inference distribution (routing tokens to different experts can be costly in multi-device setups).
  • Memory overhead — all experts’ parameters must still be stored, even if unused in a given step.
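
One common mitigation for the load-balancing problem is an auxiliary loss in the style of the Switch Transformer (Fedus et al., 2021), which pushes both the fraction of tokens routed to each expert and the average router probability per expert toward uniform. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, alpha=0.01):
    # router_logits: (n_tokens, n_experts) raw scores from the gating network
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)                          # top-1 expert per token
    f = F.one_hot(top1, n_experts).float().mean(dim=0)   # fraction of tokens per expert
    p = probs.mean(dim=0)                                # mean router probability per expert
    return alpha * n_experts * (f * p).sum()             # ~alpha when routing is uniform,
                                                         # larger when a few experts dominate

router_logits = torch.randn(1024, 8)                     # 1024 tokens, 8 experts
print(load_balancing_loss(router_logits))
```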

4. Notable Recent MoE Models

  • Meta’s LLaMA 4 Series (April 2025)
  • Moonshot AI’s Kimi K2 (July 2025)
  • DeepSeek V2.5 (September 2024)

5. Further reading: