LegoLLM

ONGOING
2025

A complete LLM development framework built from first principles. Every component — tokenization, embeddings, attention, training, generation, alignment — is implemented from scratch as modular "Lego pieces" that can be combined, swapped, and extended. The goal is to deeply understand the entire LLM stack, not just use it.

Core pipeline: Raw Text → Tokenization → Embeddings → Attention → Transformer → Training → Generation → Alignment
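The "Lego piece" modularity hinges on components agreeing on contracts rather than concrete classes. A minimal sketch of that idea using structural typing — the `Tokenizer` protocol and `CharTokenizer` here are illustrative, not the repo's actual interfaces:

```python
from typing import Protocol


class Tokenizer(Protocol):
    """Structural contract: any class with these methods plugs into the pipeline."""
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...


class CharTokenizer:
    """Toy stand-in: one id per character seen in the corpus."""
    def __init__(self, corpus: str) -> None:
        self.vocab = sorted(set(corpus))
        self.stoi = {c: i for i, c in enumerate(self.vocab)}

    def encode(self, text: str) -> list[int]:
        return [self.stoi[c] for c in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.vocab[i] for i in ids)


def roundtrip(tok: Tokenizer, text: str) -> str:
    # Accepts anything satisfying the protocol — swap in a BPE tokenizer unchanged.
    return tok.decode(tok.encode(text))
```

Because the contract is structural, a BPE implementation can replace `CharTokenizer` without either class inheriting from a shared base.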

Roadmap

Phase 1 — Core Foundation

Tokenization (BPE from scratch), embeddings, multi-head causal attention, GPT-2 architecture, DataLoader, Trainer, generation strategies
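The core of a from-scratch BPE tokenizer is the merge loop: repeatedly find the most frequent adjacent byte pair and replace it with a new token id. A minimal sketch of one merge step — function names are illustrative, not the repo's `NaiveBPE` API:

```python
from collections import Counter


def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the adjacent pair that occurs most often in the sequence."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


# One training step: byte ids 0–255 are the base vocab, 256 is the first merge.
ids = list(b"aaabdaaabac")
pair = most_frequent_pair(ids)   # ('a', 'a') is the most frequent pair here
ids = merge(ids, pair, 256)
```

Training repeats this until the target vocabulary size is reached; encoding replays the learned merges in order.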

Phase 2 — Pretrained Weights

Small-scale pretraining, loading HuggingFace GPT-2 weights (safetensors), Conv1D→Linear mapping, fused QKV splitting
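HF GPT-2 checkpoints store attention projections as a `Conv1D` with weight shape `(in_features, out_features)` — the transpose of `nn.Linear` — and fuse Q, K, V into one `c_attn` matrix. Both conversions can be sketched in NumPy (variable names are illustrative):

```python
import numpy as np

n_embd = 8

# HF GPT-2's c_attn Conv1D weight: (n_embd, 3 * n_embd).
conv1d_w = np.arange(n_embd * 3 * n_embd, dtype=np.float32).reshape(n_embd, 3 * n_embd)

# nn.Linear expects (out_features, in_features): transpose once on load.
linear_w = conv1d_w.T                           # (3 * n_embd, n_embd)

# Un-fuse QKV: split the output dimension into three equal projections.
q_w, k_w, v_w = np.split(linear_w, 3, axis=0)   # each (n_embd, n_embd)

# Sanity check: both layouts compute the same projection.
x = np.random.default_rng(0).normal(size=(2, n_embd)).astype(np.float32)
assert np.allclose(x @ conv1d_w, x @ linear_w.T)
```

The same transpose applies to `c_fc` and `c_proj` in the MLP; only layer norms and embeddings load without remapping.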

Phase 3 — Instruction Fine-tuning

SFT, Alpaca-style chat formatting, dynamic padding, loss masking, LoRA
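Loss masking in SFT usually means setting the label of every prompt token to PyTorch's `CrossEntropyLoss` ignore index (−100), so only response tokens contribute to the loss. A dependency-free sketch — the helper name is hypothetical, not the repo's API:

```python
IGNORE_INDEX = -100  # torch.nn.CrossEntropyLoss skips targets with this value


def mask_prompt(input_ids: list[int], prompt_len: int) -> list[int]:
    """Build SFT labels: copy the inputs, but mask prompt positions out of the loss."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

Dynamic padding composes with this directly: pad positions also get `IGNORE_INDEX`, so batches of uneven length train correctly.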

Phase 4 — Modern Architectures

LLaMA 3 (RoPE, RMSNorm, SwiGLU, GQA), KV Cache, Qwen 3
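Of the LLaMA 3 components, RMSNorm is the most compact to show: unlike LayerNorm, it skips mean subtraction and the bias term, rescaling by the root-mean-square alone. A NumPy sketch, not the repo's implementation:

```python
import numpy as np


def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """LLaMA-style RMSNorm: normalize by RMS over the last axis, then scale."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight
```

Dropping the mean/bias terms saves computation and, empirically, trains as well as LayerNorm at scale — one reason modern decoder stacks prefer it.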

Phase 5 — Alignment

DPO, PPO/RLHF, Mixture of Experts
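The DPO objective reduces to a logistic loss on the reward margin between chosen and rejected completions, measured relative to a frozen reference model. A scalar NumPy sketch over sequence log-probabilities (β and argument names are illustrative):

```python
import numpy as np


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * reward margin) over one preference pair.

    Each argument is a sequence log-probability; `pi_*` come from the policy
    being trained, `ref_*` from the frozen reference model.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return float(-np.log(sigmoid(beta * margin)))
```

The loss falls as the policy raises the chosen completion's likelihood relative to the reference — no reward model or PPO rollout loop needed, which is why DPO is often the first alignment stage implemented.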

Architecture Highlights

Repository: GitHub
Stack: PyTorch, NumPy, tiktoken, safetensors, Rich, pytest, MkDocs, Ruff, GitHub Actions
Models: GPT-2 (all 4 sizes) · LLaMA 3 (planned) · Qwen 3 (planned)
Tests: 218 unit tests + integration tests

Module Structure

legollm/
├── architectures/       # GPT-2, LLaMA 3, Qwen 3
├── components/
│   ├── attention/       # Multi-head, Grouped-query
│   ├── blocks/          # Transformer block
│   ├── embeddings/
│   ├── feedforward/
│   └── normalization/
├── core/
│   ├── interfaces.py    # Protocol contracts
│   └── tokenization/    # NaiveBPE, RegexBPE
├── data/                # Memmap DataLoader
├── training/            # Trainer (cosine LR, AdamW)
├── generation/          # Greedy, top-k, top-p
├── optimization/        # KV Cache
├── peft/                # LoRA (upcoming)
└── finetuning/          # SFT, chat formatting
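The strategies in `generation/` all end in a filtering step over the next-token distribution. A NumPy sketch of top-p (nucleus) filtering — the function name is illustrative, not the repo's API:

```python
import numpy as np


def top_p_filter(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]          # token ids, most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # include the token that crosses p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()
```

Greedy decoding is the `p → 0` limit (one surviving token); top-k is the same pattern with a fixed-size cutoff instead of a probability-mass one.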