LegoLLM
Status: ongoing.
A complete LLM development framework built from first principles. Every component — tokenization, embeddings, attention, training, generation, alignment — is implemented from scratch as modular "Lego pieces" that can be combined, swapped, and extended. The goal is to deeply understand the entire LLM stack, not just use it.
Core pipeline: Raw Text → Tokenization → Embeddings → Attention → Transformer → Training → Generation → Alignment
Roadmap
- Tokenization (BPE from scratch), embeddings, multi-head causal attention, GPT-2 architecture, DataLoader, Trainer, generation strategies
- Small-scale pretraining, loading HuggingFace GPT-2 weights (safetensors), Conv1D→Linear mapping, fused QKV splitting
- SFT, Alpaca-style chat formatting, dynamic padding, loss masking, LoRA
- LLaMA 3 (RoPE, RMSNorm, SwiGLU, GQA), KV Cache, Qwen 3
- DPO, PPO/RLHF, Mixture of Experts
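The Conv1D→Linear mapping and fused QKV splitting mentioned above can be sketched roughly as follows. Shapes follow HuggingFace's GPT-2 checkpoint layout (where `c_attn` is a fused Conv1D of shape `(d_model, 3*d_model)`); the variable names are illustrative, not the repo's actual loader code:

```python
import numpy as np

# HuggingFace GPT-2 stores the attention projection as a fused Conv1D:
# c_attn.weight has shape (d_model, 3 * d_model), while an nn.Linear
# expects (out_features, in_features) -- hence the transpose.
d_model = 8
rng = np.random.default_rng(0)
c_attn_weight = rng.standard_normal((d_model, 3 * d_model))

linear_weight = c_attn_weight.T                     # (3*d_model, d_model)

# Split the fused projection into separate Q, K, V weight matrices.
w_q, w_k, w_v = np.split(linear_weight, 3, axis=0)  # each (d_model, d_model)

x = rng.standard_normal(d_model)
q = w_q @ x  # query projection for a single token
```

The same transpose-then-split applies to the fused bias, split along its only axis.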
Architecture Highlights
- Self-contained model files — each architecture (GPT-2, LLaMA 3, Qwen 3) is readable as a single file with everything inline
- Reusable components — attention (MHA, GQA), normalization, embeddings, feedforward layers exist as separate swappable modules
- Protocol-based interfaces — Python Protocol contracts instead of abstract base classes
- Two BPE tokenizers — NaiveBPE (educational) and RegexBPE (production-grade, GPT-2/GPT-4 compatible)
- Memory-efficient DataLoader — NumPy memmap with circular buffer, no full-dataset RAM load
- Proper weight loading — HuggingFace Conv1D→Linear transpose, fused QKV splitting, tied embeddings
- Generation — greedy, top-k, top-p, temperature sampling with pre-allocated buffers
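The Protocol-based interfaces mentioned above can be sketched minimally like this (the `Tokenizer` contract and its method names here are illustrative assumptions, not the repo's actual definitions):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Tokenizer(Protocol):
    """Structural contract: any class with these methods conforms,
    no inheritance from an abstract base class required."""
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...

class ByteTokenizer:
    # Conforms to Tokenizer purely by shape -- it never subclasses it.
    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")

tok: Tokenizer = ByteTokenizer()
assert isinstance(tok, Tokenizer)  # structural check via @runtime_checkable
```

Because conformance is structural, components can be swapped without touching a shared class hierarchy.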
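The heart of a naive BPE trainer like NaiveBPE is a loop that repeatedly merges the most frequent adjacent pair into a new token id. A minimal sketch of one such step (function names are illustrative, not the repo's API):

```python
from collections import Counter

def get_pair_counts(ids: list[int]) -> Counter:
    """Count adjacent token-id pairs in the sequence."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# One training step: merge the most frequent pair into a fresh id (256,
# the first id past the 0-255 byte range).
ids = list(b"aaabdaaabac")                          # raw bytes as initial ids
pair = get_pair_counts(ids).most_common(1)[0][0]    # (97, 97), i.e. "aa"
ids = merge_pair(ids, pair, 256)
```

Training repeats this until the vocabulary reaches the target size; RegexBPE adds a pre-tokenization regex so merges never cross word or category boundaries.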
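Top-k and top-p (nucleus) filtering from the generation bullet can be sketched in NumPy like this; it is a simplified illustration, not the repo's pre-allocated-buffer implementation:

```python
import numpy as np

def top_k_top_p_filter(logits: np.ndarray, k: int = 0, p: float = 1.0) -> np.ndarray:
    """Return a copy of `logits` with disallowed tokens set to -inf,
    ready for softmax + sampling."""
    logits = logits.copy()
    if k > 0:
        kth = np.sort(logits)[-k]            # k-th largest value
        logits[logits < kth] = -np.inf       # keep only the top k
    if p < 1.0:
        order = np.argsort(logits)[::-1]                 # descending
        probs = np.exp(logits[order] - logits[order][0]) # stable softmax
        probs /= probs.sum()
        cutoff = np.searchsorted(np.cumsum(probs), p) + 1
        logits[order[cutoff:]] = -np.inf     # smallest set with mass >= p
    return logits

logits = np.array([2.0, 1.0, 0.5, -1.0])
filtered = top_k_top_p_filter(logits, k=2)   # only the top 2 logits survive
```

Temperature sampling then divides the filtered logits by a temperature before the softmax; greedy decoding is simply `argmax` with no filtering.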
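The memmap-backed DataLoader idea can be sketched as follows: token ids live in a binary file on disk, and training samples random windows from a lazy `np.memmap` view instead of loading the dataset into RAM. This is an illustrative sketch (the repo's loader additionally uses a circular buffer), with hypothetical names:

```python
import os
import tempfile
import numpy as np

# Write token ids to disk once; training then reads windows on demand.
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
np.arange(1000, dtype=np.uint16).tofile(path)

tokens = np.memmap(path, dtype=np.uint16, mode="r")  # lazy, on-disk view

def get_batch(data, batch_size: int, block_size: int, rng):
    """Sample random (input, target) windows; targets are inputs shifted by one."""
    starts = rng.integers(0, len(data) - block_size - 1, size=batch_size)
    x = np.stack([np.asarray(data[s : s + block_size]) for s in starts])
    y = np.stack([np.asarray(data[s + 1 : s + 1 + block_size]) for s in starts])
    return x, y

x, y = get_batch(tokens, batch_size=4, block_size=8, rng=np.random.default_rng(0))
```

Only the touched pages are read from disk, so the same code path scales from toy corpora to datasets far larger than memory.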
- Repository: GitHub
- Stack: PyTorch, NumPy, tiktoken, safetensors, Rich, pytest, MkDocs, Ruff, GitHub Actions
- Models: GPT-2 (all 4 sizes) · LLaMA 3 (planned) · Qwen 3 (planned)
- Tests: 218 unit tests + integration tests
Module Structure
legollm/
├── architectures/        # GPT-2, LLaMA 3, Qwen 3
├── components/
│   ├── attention/        # Multi-head, Grouped-query
│   ├── blocks/           # Transformer block
│   ├── embeddings/
│   ├── feedforward/
│   └── normalization/
├── core/
│   ├── interfaces.py     # Protocol contracts
│   └── tokenization/     # NaiveBPE, RegexBPE
├── data/                 # Memmap DataLoader
├── training/             # Trainer (cosine LR, AdamW)
├── generation/           # Greedy, top-k, top-p
├── optimization/         # KV Cache
├── peft/                 # LoRA (upcoming)
└── finetuning/           # SFT, chat formatting