# ARIA — Zero-to-Tiny LLM (Python)

ARIA is a beginner-friendly, step-by-step course that takes you from “Hello World” to training a
tiny decoder-only, character-level LLM in Python. Each lesson is a single, runnable file with clear
docstrings, doctests where helpful, and minimal dependencies.

> Note: This repository’s instructional content was generated with the assistance of an AI language model.

---
## What you’ll build

- A progression of tiny language models (the first stage is sketched below):
  - Count-based bigram model → NumPy softmax toy → PyTorch bigram NN
  - Single-head self-attention → Mini Transformer block
  - A tiny decoder-only model trained on a small corpus (e.g., Tiny Shakespeare)
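
As a taste of the first stage, here is a minimal sketch of a count-based character bigram model. The function name and the sample string are illustrative, not taken from the course files.

```python
from collections import Counter, defaultdict

def bigram_counts(text: str) -> dict[str, Counter]:
    """Count how often each character follows another.

    >>> counts = bigram_counts("abab")
    >>> counts["a"]["b"]
    2
    """
    counts: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```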
---
## Who this is for

- Beginners who can run `python script.py` and have written a basic “Hello World”.
- Learners who want a clear path to an LLM without heavy math or large datasets.
---
## Course outline (lessons)

1. Read a Text File (with docstrings)
2. Character Frequency Counter
3. Train/Val Split
4. Char Vocabulary + Encode/Decode
5. Uniform Random Text Generator
6. Bigram Counts Language Model
7. Laplace Smoothing (compare w/ and w/o)
8. Temperature & Top-k Sampling
9. Perplexity on Validation
10. NumPy Softmax + Cross-Entropy (toy)
11. PyTorch Tensors 101
12. Autograd Mini-Lab (fit y = 2x + 3)
13. Char Bigram Neural LM (PyTorch)
14. Sampling Function (PyTorch)
15. Single-Head Self-Attention (causal mask)
16. Mini Transformer Block (pre-LN)
17. Tiny Decoder-Only Model (1–2 blocks)
18. (Optional) Save/Load & CLI Interface

Each lesson includes: Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests,
Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.
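
To give a feel for that format, here is a hypothetical starter-code skeleton in the spirit of Lesson 4 (vocabulary + encode/decode). It only illustrates the docstring-plus-doctest pattern; the actual lesson files may differ.

```python
"""Lesson 4 (sketch): build a character vocabulary and encode/decode text."""

def build_vocab(text: str) -> tuple[dict[str, int], dict[int, str]]:
    """Map each unique character to an integer id and back.

    >>> stoi, itos = build_vocab("aba")
    >>> stoi["a"], itos[stoi["b"]]
    (0, 'b')
    """
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text: str, stoi: dict[str, int]) -> list[int]:
    """Turn a string into a list of integer ids."""
    return [stoi[ch] for ch in text]

def decode(ids: list[int], itos: dict[int, str]) -> str:
    """Turn integer ids back into a string."""
    return "".join(itos[i] for i in ids)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```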
---
## Requirements

- Python: 3.10+
- OS: Windows/macOS/Linux (UTF-8 locale recommended)
- Dependencies:
  - Stdlib only until Lesson 9
  - NumPy for Lessons 8–10
  - PyTorch (CPU is fine) from Lesson 11 onward
- Hardware: CPU is enough for all lessons; tiny models, short runs

Install common deps (when needed):

    pip install numpy torch --upgrade
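
If you want to confirm the installs worked, a quick optional check (not part of any lesson) is to print the installed versions:

```python
# Optional sanity check; assumes NumPy and PyTorch are already installed.
import numpy
import torch

print("numpy", numpy.__version__, "| torch", torch.__version__)
```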
---
## Getting started

1) Clone or download this project.
2) Place a small corpus in `data.txt` (public-domain text is ideal).
   If `data.txt` is missing, the scripts include a tiny fallback so you can still run them (see the sketch after this list).
3) Start at Lesson 1:

       python 01_read_text.py

4) (Optional) Run doctests:

       python -m doctest -v 01_read_text.py
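
The fallback mentioned in step 2 could look roughly like this; the function name, fallback string, and missing-file name are illustrative, and the actual lesson file may differ.

```python
from pathlib import Path

FALLBACK_TEXT = "To be, or not to be, that is the question.\n"

def load_corpus(path: str = "data.txt") -> str:
    """Return the contents of ``path``, or a tiny built-in sample if the file is missing.

    >>> isinstance(load_corpus("no_such_file.txt"), str)
    True
    """
    p = Path(path)
    if p.exists():
        return p.read_text(encoding="utf-8")
    return FALLBACK_TEXT

if __name__ == "__main__":
    text = load_corpus()
    print(f"Loaded {len(text)} characters")
```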
---
## Repository layout (suggested)

    aria/
      data.txt
      01_read_text.py
      02_char_counts.py
      03_split.py
      04_vocab.py
      05_uniform_gen.py
      06_bigram_counts.py
      07_laplace.py
      08_sampling.py
      09_perplexity.py
      10_numpy_softmax_ce.py
      11_torch_tensors_101.py
      12_autograd_linreg.py
      13_bigram_nn.py
      14_generate.py
      15_attention.py
      16_block.py
      17_tiny_decoder.py
      18_cli.py
---
## What to expect

- Short, focused lessons (often < 200 LOC) with runnable starters.
- Docstrings everywhere: module & function-level (Args/Returns/Raises), plus doctests where useful.
- Reproducibility: fixed seeds for random, numpy, and torch (see the seeding sketch below).
- No safety/guardrail features: this is purely a learning project in a controlled environment.
- Incremental wins: you’ll see text samples improve as models get smarter.
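
A seeding helper along these lines is one way to get that reproducibility; the helper name is illustrative, and the NumPy/PyTorch parts only matter once those lessons are reached.

```python
import random

def set_seed(seed: int = 42) -> None:
    """Seed Python's, NumPy's, and PyTorch's RNGs so runs are repeatable."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:  # NumPy is not needed in the earliest lessons
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:  # PyTorch only appears from Lesson 11 onward
        pass

set_seed(42)
```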
---
## About the project

- Author: Dani
- AI Assistance: Content and structure generated with the help of an AI language model.
- License: MIT (recommended).
- Intended use: Education, personal learning, teaching small groups.
---
## FAQ (quick)

- Do I need a GPU? No. CPU is fine; models are tiny.
- Where’s the data from? Provide your own public-domain `data.txt`.
- Why character-level? Simpler pipeline; no tokenizer complexity early on.
- Why pre-LN Transformer? Stable training and cleaner gradients in small models (see the sketch below).
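
For the curious: pre-LN just means LayerNorm is applied before the attention and MLP sublayers, with residual connections around each. The rough PyTorch sketch below shows that ordering; it uses the built-in `nn.MultiheadAttention` for brevity (the lessons build attention by hand), and the class name and dimensions are illustrative.

```python
import torch
from torch import nn

class PreLNBlock(nn.Module):
    """Transformer block with LayerNorm applied before each sublayer (pre-LN)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4) -> None:
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: True marks positions a token must NOT attend to (the future).
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        # Normalize first, then add each sublayer's output back as a residual.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

# Example: batch of 2 sequences, 8 positions, 64-dim embeddings.
y = PreLNBlock()(torch.randn(2, 8, 64))
```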
---
## AI-Generated Content Notice

Parts of this repository (instructions, lesson templates, and examples) were generated by an AI model and
reviewed for clarity. Always run and test code in your own environment.