diff --git a/README.md b/README.md
index 627e087..bc25682 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,10 @@
 # ARIA — Zero-to-Tiny LLM (Python)
 
-**ARIA** is a beginner-friendly, step-by-step course that takes you from **“Hello World”** to training a **tiny decoder-only, character-level LLM** in Python. Each lesson is a single, runnable file with clear docstrings, doctests where helpful, and minimal dependencies.
+ARIA is a beginner-friendly, step-by-step course that takes you from “Hello World” to training a
+tiny decoder-only, character-level LLM in Python. Each lesson is a single, runnable file with clear
+docstrings, doctests where helpful, and minimal dependencies.
 
-> **Note:** This repository’s instructional content was **generated with the assistance of an AI language model**.
+> Note: This repository’s instructional content was generated with the assistance of an AI language model.
 
 ---
 
@@ -18,48 +20,117 @@ ## Who this is for
 - Beginners who can run `python script.py` and have written a basic “Hello World”.
-- Learners who want a **clear path** to an LLM without heavy math or large datasets.
+- Learners who want a clear path to an LLM without heavy math or large datasets.
 
 ---
 
 ## Course outline (lessons)
 
-1. Read a Text File (with docstrings)
-2. Character Frequency Counter
-3. Train/Val Split
-4. Char Vocabulary + Encode/Decode
-5. Uniform Random Text Generator
-6. Bigram Counts Language Model
-7. Laplace Smoothing (compare w/ and w/o)
-8. Temperature & Top-k Sampling
-9. Perplexity on Validation
-10. NumPy Softmax + Cross-Entropy (toy)
-11. PyTorch Tensors 101
-12. Autograd Mini-Lab (fit *y = 2x + 3*)
-13. Char Bigram Neural LM (PyTorch)
-14. Sampling Function (PyTorch)
-15. Single-Head Self-Attention (causal mask)
-16. Mini Transformer Block (pre-LN)
-17. Tiny Decoder-Only Model (1–2 blocks)
-18. *(Optional)* Save/Load & CLI Interface
+1. Read a Text File (with docstrings)
+2. Character Frequency Counter
+3. Train/Val Split
+4. Char Vocabulary + Encode/Decode
+5. Uniform Random Text Generator
+6. Bigram Counts Language Model
+7. Laplace Smoothing (compare w/ and w/o)
+8. Temperature & Top-k Sampling
+9. Perplexity on Validation
+10. NumPy Softmax + Cross-Entropy (toy)
+11. PyTorch Tensors 101
+12. Autograd Mini-Lab (fit y = 2x + 3)
+13. Char Bigram Neural LM (PyTorch)
+14. Sampling Function (PyTorch)
+15. Single-Head Self-Attention (causal mask)
+16. Mini Transformer Block (pre-LN)
+17. Tiny Decoder-Only Model (1–2 blocks)
+18. (Optional) Save/Load & CLI Interface
 
-Each lesson includes: **Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests, Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.**
+Each lesson includes: Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests,
+Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.
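+
+As a small taste of the course’s coding style, here is a stdlib-only sketch of the temperature and
+top-k sampling idea that Lesson 8 builds up to. The function name and numbers below are illustrative
+assumptions, not the lesson’s exact API:
+
+    import math
+    import random
+
+    def sample_char(probs: dict[str, float], temperature: float = 0.8, top_k: int = 50) -> str:
+        """Sample one character from a char -> probability mapping.
+
+        Lower temperature is greedier; top-k keeps only the k most likely
+        characters before sampling.
+        """
+        # Rescale log-probabilities by temperature, keep the top-k, renormalize.
+        logits = {c: math.log(p) / temperature for c, p in probs.items() if p > 0}
+        top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
+        weights = [math.exp(v) for _, v in top]
+        total = sum(weights)
+        chars = [c for c, _ in top]
+        return random.choices(chars, weights=[w / total for w in weights], k=1)[0]
+
+    random.seed(0)  # fixed seed for reproducibility, as in the lessons
+    print(sample_char({"a": 0.5, "b": 0.3, "c": 0.2}, temperature=0.8, top_k=2))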
 
 ---
 
 ## Requirements
 
-- **Python**: 3.10+
-- **OS**: Windows/macOS/Linux (UTF-8 locale recommended)
-- **Dependencies**:
-  - Stdlib only until Lesson 9
-  - **NumPy** for Lessons 8–10
-  - **PyTorch** (CPU is fine) from Lesson 11 onward
-- **Hardware**: CPU is enough for all lessons; tiny models, short runs
+- Python: 3.10+
+- OS: Windows/macOS/Linux (UTF-8 locale recommended)
+- Dependencies:
+  - Stdlib only until Lesson 9
+  - NumPy for Lessons 8–10
+  - PyTorch (CPU is fine) from Lesson 11 onward
+- Hardware: CPU is enough for all lessons; tiny models, short runs
 
 Install common deps (when needed):
 
+    pip install numpy torch --upgrade
-```bash
-pip install numpy torch --upgrade
+
+---
-```
 
+## Getting started
+
+1) Clone or download this project.
+2) Place a small corpus in `data.txt` (public-domain text is ideal).
+   If `data.txt` is missing, the scripts include a tiny fallback so you can still run them.
+3) Start at Lesson 1:
+   python 01_read_text.py
+4) (Optional) Run doctests:
+   python -m doctest -v 01_read_text.py
+
+---
+
+## Repository layout (suggested)
+
+    aria/
+      data.txt
+      01_read_text.py
+      02_char_counts.py
+      03_split.py
+      04_vocab.py
+      05_uniform_gen.py
+      06_bigram_counts.py
+      07_laplace.py
+      08_sampling.py
+      09_perplexity.py
+      10_numpy_softmax_ce.py
+      11_torch_tensors_101.py
+      12_autograd_linreg.py
+      13_bigram_nn.py
+      14_generate.py
+      15_attention.py
+      16_block.py
+      17_tiny_decoder.py
+      18_cli.py
+
+---
+
+## What to expect
+
+- Short, focused lessons (often < 200 LOC) with runnable starters.
+- Docstrings everywhere: module & function-level (Args/Returns/Raises), plus doctests where useful.
+- Reproducibility: fixed seeds for random, numpy, and torch.
+- No safety/guardrail features: this is purely a learning project in a controlled environment.
+- Incremental wins: you’ll see text samples improve as models get smarter.
+
+---
+
+## About the project
+
+- Author: Dani
+- AI Assistance: Content and structure generated with the help of an AI language model.
+- License: MIT (recommended).
+- Intended use: Education, personal learning, teaching small groups.
+
+---
+
+## FAQ (quick)
+
+- Do I need a GPU? No. CPU is fine; models are tiny.
+- Where’s the data from? Provide your own public-domain `data.txt`.
+- Why character-level? Simpler pipeline; no tokenizer complexity early on.
+- Why pre-LN Transformer? Stable training and cleaner gradients in small models.
+
+---
+
+## AI-Generated Content Notice
+
+Parts of this repository (instructions, lesson templates, and examples) were generated by an AI model and
+reviewed for clarity. Always run and test code in your own environment.
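+
+---
+
+## Appendix: what pre-LN looks like
+
+The FAQ above mentions the pre-LN layout used in Lesson 16 and beyond. As a rough sketch only (the
+sizes, names, and use of `nn.MultiheadAttention` are choices made here, not the lessons’ exact
+modules), pre-LN means each sublayer normalizes its input first, so the residual additions stay on a
+clean path:
+
+    import torch
+    from torch import nn
+
+    class PreLNBlock(nn.Module):
+        """One pre-LN Transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""
+
+        def __init__(self, d_model: int = 64, n_heads: int = 4):
+            super().__init__()
+            self.ln1 = nn.LayerNorm(d_model)
+            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
+            self.ln2 = nn.LayerNorm(d_model)
+            self.mlp = nn.Sequential(
+                nn.Linear(d_model, 4 * d_model),
+                nn.GELU(),
+                nn.Linear(4 * d_model, d_model),
+            )
+
+        def forward(self, x: torch.Tensor) -> torch.Tensor:
+            t = x.size(1)
+            # Causal mask: each position may attend only to itself and earlier positions.
+            mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
+            h = self.ln1(x)                  # normalize *before* attention
+            attn_out, _ = self.attn(h, h, h, attn_mask=mask)
+            x = x + attn_out                 # residual around attention
+            x = x + self.mlp(self.ln2(x))    # residual around the MLP
+            return x
+
+    torch.manual_seed(0)
+    out = PreLNBlock()(torch.randn(2, 8, 64))  # (batch, time, channels)
+    print(out.shape)                           # torch.Size([2, 8, 64])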
diff --git a/prompt.md b/prompt.md
index a843511..1db5c04 100644
--- a/prompt.md
+++ b/prompt.md
@@ -51,28 +51,28 @@ Use this exact section order:
 18) (Optional) Save/Load & CLI Interface
 
 === Constraints & Defaults
-- Dataset: do NOT auto-download. Expect a local `data.txt`. If missing, include a tiny built-in fallback sample so scripts still run.
+- Dataset: do NOT auto-download. Expect a local data.txt. If missing, include a tiny built-in fallback sample so scripts still run.
 - Encoding: UTF-8. Normalize newlines to "\n" for consistency.
-- Seeds: demonstrate reproducibility (`random`, `numpy`, `torch`).
+- Seeds: demonstrate reproducibility (random, numpy, torch).
 - Dependencies:
-  * Stdlib only until Lesson 9;
-  * NumPy in Lessons 8–10;
-  * PyTorch from Lesson 11 onward.
+  * Stdlib only until Lesson 9
+  * NumPy in Lessons 8–10
+  * PyTorch from Lesson 11 onward
 - Training defaults (for Lessons 13+):
-  * Batch size ~32, block size ~128, AdamW(lr=3e-4).
-  * Brief note on early stopping when val loss plateaus.
+  * Batch size ~32, block size ~128, AdamW(lr=3e-4)
+  * Brief note on early stopping when val loss plateaus
 - Inference defaults:
-  * Start with greedy; then temperature=0.8, top-k=50.
+  * Start with greedy; then temperature=0.8, top-k=50
 - Keep code clean: type hints where helpful; no frameworks beyond NumPy/PyTorch; no external data loaders.
 
 === Lesson 1 Specifics
 For Lesson 1, include:
-- Module docstring with Usage example (`python 01_read_text.py`).
-- Functions: `load_text(path: Optional[Path])`, `normalize_newlines(text: str)`,
-  `make_preview(text: str, n_chars: int = 200)`, `report_stats(text: str)`, `main()`.
-- At least one doctest per function where reasonable.
-- Fallback text snippet if `data.txt` isn’t found.
-- Output: total chars, unique chars, 200-char preview with literal "\n".
+- Module docstring with Usage example (python 01_read_text.py)
+- Functions: load_text(path: Optional[Path]), normalize_newlines(text: str),
+  make_preview(text: str, n_chars: int = 200), report_stats(text: str), main()
+- At least one doctest per function where reasonable
+- Fallback text snippet if data.txt isn’t found
+- Output: total chars, unique chars, 200-char preview with literal "\n"
 
 === Delivery
 - Start with a short “How to use this repo” preface and a file tree suggestion.
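+
+=== Lesson 1 Reference Sketch (Non-Normative)
+One possible shape for the Lesson 1 functions named above. This is a sketch under assumptions, not
+the required implementation; the fallback string, exact messages, and bodies are placeholders:
+
+    """Lesson 1 sketch. Usage: python 01_read_text.py"""
+    from pathlib import Path
+    from typing import Optional
+
+    FALLBACK = "Tiny built-in sample text.\nUsed when data.txt is missing.\n"  # placeholder snippet
+
+    def normalize_newlines(text: str) -> str:
+        r"""Normalize "\r\n" and "\r" to "\n".
+
+        >>> normalize_newlines("a\r\nb\rc")
+        'a\nb\nc'
+        """
+        return text.replace("\r\n", "\n").replace("\r", "\n")
+
+    def load_text(path: Optional[Path]) -> str:
+        """Return the corpus at `path`, or the built-in fallback if it is missing."""
+        if path is not None and path.exists():
+            return normalize_newlines(path.read_text(encoding="utf-8"))
+        return FALLBACK
+
+    def make_preview(text: str, n_chars: int = 200) -> str:
+        """Return the first `n_chars` characters with newlines shown literally as \\n."""
+        return text[:n_chars].replace("\n", "\\n")
+
+    def report_stats(text: str) -> None:
+        """Print total chars, unique chars, and a 200-char preview."""
+        print("total chars: ", len(text))
+        print("unique chars:", len(set(text)))
+        print("preview:     ", make_preview(text))
+
+    def main() -> None:
+        report_stats(load_text(Path("data.txt")))
+
+    if __name__ == "__main__":
+        main()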