docs: update README with improved formatting and structured content

- Reformatted the README for better readability with consistent indentation and line breaks
- Restructured course outline with clear lesson numbering and descriptions
- Added detailed getting started instructions with step-by-step setup process
- Included repository layout diagram showing file organization
- Enhanced requirements section with clearer dependency structure
- Added what to expect section outlining project characteristics and learning approach
commit b1bb6fc705 (parent 272172e87c)
2025-09-23 12:01:01 -04:00
2 changed files with 117 additions and 46 deletions

README.md

@@ -1,8 +1,10 @@
# ARIA — Zero-to-Tiny LLM (Python)
**ARIA** is a beginner-friendly, step-by-step course that takes you from **“Hello World”** to training a **tiny decoder-only, character-level LLM** in Python. Each lesson is a single, runnable file with clear docstrings, doctests where helpful, and minimal dependencies.
ARIA is a beginner-friendly, step-by-step course that takes you from “Hello World” to training a
tiny decoder-only, character-level LLM in Python. Each lesson is a single, runnable file with clear
docstrings, doctests where helpful, and minimal dependencies.
> **Note:** This repository's instructional content was **generated with the assistance of an AI language model**.
> Note: This repository's instructional content was generated with the assistance of an AI language model.
---
@@ -18,7 +20,7 @@
## Who this is for
- Beginners who can run `python script.py` and have written a basic “Hello World”.
- Learners who want a **clear path** to an LLM without heavy math or large datasets.
- Learners who want a clear path to an LLM without heavy math or large datasets.
---
@@ -35,31 +37,100 @@
9. Perplexity on Validation
10. NumPy Softmax + Cross-Entropy (toy; sketched after this outline)
11. PyTorch Tensors 101
12. Autograd Mini-Lab (fit *y = 2x + 3*)
12. Autograd Mini-Lab (fit y = 2x + 3; sketched after this outline)
13. Char Bigram Neural LM (PyTorch)
14. Sampling Function (PyTorch)
15. Single-Head Self-Attention (causal mask)
16. Mini Transformer Block (pre-LN)
17. Tiny Decoder-Only Model (12 blocks)
18. *(Optional)* Save/Load & CLI Interface
18. (Optional) Save/Load & CLI Interface
Each lesson includes: **Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests, Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.**
Each lesson includes: Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests,
Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.
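For a flavor of Lesson 10, here is a minimal NumPy sketch of softmax plus cross-entropy. The function names, shapes, and the toy check are illustrative assumptions, not the course's actual starter code.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities, shifted by the max for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def cross_entropy(logits: np.ndarray, target: int) -> float:
    """Negative log-probability of the correct class under softmax(logits)."""
    return float(-np.log(softmax(logits)[target]))

# Toy check: a confident, correct prediction gives a low loss.
print(cross_entropy(np.array([4.0, 1.0, 0.5]), target=0))  # ~0.08
```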
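Likewise, a minimal sketch of the Lesson 12 mini-lab, fitting y = 2x + 3 with autograd; the optimizer choice, learning rate, and step count here are assumptions tuned only to converge on this toy problem.

```python
import torch

torch.manual_seed(0)
x = torch.linspace(-1, 1, 64).unsqueeze(1)  # 64 scalar inputs
y = 2 * x + 3                               # the line we want to recover

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.1)

for _ in range(500):
    loss = ((x * w + b - y) ** 2).mean()    # mean squared error
    opt.zero_grad()
    loss.backward()                         # autograd fills w.grad and b.grad
    opt.step()

print(round(w.item(), 2), round(b.item(), 2))  # approaches 2.0 and 3.0
```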
---
## Requirements
- **Python**: 3.10+
- **OS**: Windows/macOS/Linux (UTF-8 locale recommended)
- **Dependencies**:
- Python: 3.10+
- OS: Windows/macOS/Linux (UTF-8 locale recommended)
- Dependencies:
- Stdlib only until Lesson 9
- **NumPy** for Lessons 8–10
- **PyTorch** (CPU is fine) from Lesson 11 onward
- **Hardware**: CPU is enough for all lessons; tiny models, short runs
- NumPy for Lessons 8–10
- PyTorch (CPU is fine) from Lesson 11 onward
- Hardware: CPU is enough for all lessons; tiny models, short runs
Install common deps (when needed):
pip install numpy torch --upgrade
```bash
pip install numpy torch --upgrade
```
---
## Getting started
1) Clone or download this project.
2) Place a small corpus in `data.txt` (public-domain text is ideal).
If `data.txt` is missing, the scripts include a tiny fallback so you can still run them (a minimal version is sketched after these steps).
3) Start at Lesson 1:
python 01_read_text.py
4) (Optional) Run doctests:
python -m doctest -v 01_read_text.py
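A minimal sketch of the fallback behavior from step 2, assuming a `load_text` helper along the lines of what Lesson 1 specifies; the fallback string is a placeholder.

```python
from pathlib import Path

FALLBACK = "To be, or not to be, that is the question.\n"  # placeholder sample

def load_text(path: Path = Path("data.txt")) -> str:
    """Return the corpus at `path`, or a tiny built-in fallback if it is missing."""
    if path.exists():
        return path.read_text(encoding="utf-8")
    return FALLBACK

print(len(load_text()), "characters loaded")
```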
---
## Repository layout (suggested)
aria/
data.txt
01_read_text.py
02_char_counts.py
03_split.py
04_vocab.py
05_uniform_gen.py
06_bigram_counts.py
07_laplace.py
08_sampling.py
09_perplexity.py
10_numpy_softmax_ce.py
11_torch_tensors_101.py
12_autograd_linreg.py
13_bigram_nn.py
14_generate.py
15_attention.py
16_block.py
17_tiny_decoder.py
18_cli.py
---
## What to expect
- Short, focused lessons (often < 200 LOC) with runnable starters.
- Docstrings everywhere: module & function-level (Args/Returns/Raises), plus doctests where useful.
- Reproducibility: fixed seeds for random, numpy, and torch (see the sketch after this list).
- No safety/guardrail features: this is purely a learning project in a controlled environment.
- Incremental wins: you'll see text samples improve as models get smarter.
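The reproducibility bullet boils down to a few lines. A sketch, assuming the later lessons where NumPy and PyTorch are installed; the seed value itself is arbitrary:

```python
import random

import numpy as np
import torch

SEED = 1337  # any fixed integer works

random.seed(SEED)        # stdlib RNG (Lessons 1-9)
np.random.seed(SEED)     # NumPy RNG (Lessons 8-10)
torch.manual_seed(SEED)  # PyTorch RNG (Lesson 11 onward)
```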
---
## About the project
- Author: Dani
- AI Assistance: Content and structure generated with the help of an AI language model.
- License: MIT (recommended).
- Intended use: Education, personal learning, teaching small groups.
---
## FAQ (quick)
- Do I need a GPU? No. CPU is fine; models are tiny.
- Where's the data from? Provide your own public-domain `data.txt`.
- Why character-level? Simpler pipeline; no tokenizer complexity early on.
- Why pre-LN Transformer? Stable training and cleaner gradients in small models.
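To make the pre-LN answer concrete: each sublayer sees LayerNorm-ed input, and the residual adds the raw input back, which keeps gradients well-behaved in small models. The lessons build attention from scratch; this sketch leans on `nn.MultiheadAttention` for brevity, so treat it as an illustration rather than the course code.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN ordering: normalize *before* each sublayer, then add the residual."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True entries are blocked, so position t sees only <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)                       # normalize first (pre-LN)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                      # residual around attention
        x = x + self.mlp(self.ln2(x))         # residual around the MLP
        return x

print(PreLNBlock()(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```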
---
## AI-Generated Content Notice
Parts of this repository (instructions, lesson templates, and examples) were generated by an AI model and
reviewed for clarity. Always run and test code in your own environment.


@@ -51,28 +51,28 @@ Use this exact section order:
18) (Optional) Save/Load & CLI Interface
=== Constraints & Defaults
- Dataset: do NOT auto-download. Expect a local `data.txt`. If missing, include a tiny built-in fallback sample so scripts still run.
- Dataset: do NOT auto-download. Expect a local data.txt. If missing, include a tiny built-in fallback sample so scripts still run.
- Encoding: UTF-8. Normalize newlines to "\n" for consistency.
- Seeds: demonstrate reproducibility (`random`, `numpy`, `torch`).
- Seeds: demonstrate reproducibility (random, numpy, torch).
- Dependencies:
* Stdlib only until Lesson 9;
* NumPy in Lessons 8–10;
* PyTorch from Lesson 11 onward.
* Stdlib only until Lesson 9
* NumPy in Lessons 8–10
* PyTorch from Lesson 11 onward
- Training defaults (for Lessons 13+):
* Batch size ~32, block size ~128, AdamW(lr=3e-4).
* Brief note on early stopping when val loss plateaus.
* Batch size ~32, block size ~128, AdamW(lr=3e-4)
* Brief note on early stopping when val loss plateaus
- Inference defaults:
* Start with greedy; then temperature=0.8, top-k=50.
* Start with greedy; then temperature=0.8, top-k=50
- Keep code clean: type hints where helpful; no frameworks beyond NumPy/PyTorch; no external data loaders.
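A sketch of how those training defaults might be wired up. The model and batches below are stand-ins (random tokens, a toy embedding + linear head) purely so the snippet runs; the real lessons define their own models and data loading.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, block_size, vocab_size = 32, 128, 65  # vocab size is illustrative

model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

xb = torch.randint(vocab_size, (batch_size, block_size))  # fake input tokens
yb = torch.randint(vocab_size, (batch_size, block_size))  # fake next-token targets

logits = model(xb)  # (batch, block, vocab)
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
print(f"one optimizer step done, loss={loss.item():.3f}")
```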
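And a sketch of the inference defaults: greedy decoding is just argmax over the logits, while the sampled path divides by temperature=0.8 and keeps only the top-k=50 candidates. The function name and vocabulary size are illustrative assumptions.

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Pick the next token id from a (vocab,) logits vector."""
    logits = logits / temperature                 # <1.0 sharpens the distribution
    kth = torch.topk(logits, top_k).values[-1]    # k-th largest logit
    logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(65)
greedy_id = int(torch.argmax(logits))  # greedy: deterministic argmax
sampled_id = sample_next(logits)       # temperature + top-k sampling
print(greedy_id, sampled_id)
```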
=== Lesson 1 Specifics
For Lesson 1, include:
- Module docstring with Usage example (`python 01_read_text.py`).
- Functions: `load_text(path: Optional[Path])`, `normalize_newlines(text: str)`,
`make_preview(text: str, n_chars: int = 200)`, `report_stats(text: str)`, `main()`.
- At least one doctest per function where reasonable.
- Fallback text snippet if `data.txt` isn't found.
- Output: total chars, unique chars, 200-char preview with literal "\n".
- Module docstring with Usage example (python 01_read_text.py)
- Functions: load_text(path: Optional[Path]), normalize_newlines(text: str),
make_preview(text: str, n_chars: int = 200), report_stats(text: str), main()
- At least one doctest per function where reasonable
- Fallback text snippet if data.txt isn't found
- Output: total chars, unique chars, 200-char preview with literal "\n"
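A minimal sketch of two of those Lesson 1 functions with doctests; the signatures follow the spec above, but the bodies and docstrings are illustrative.

```python
def normalize_newlines(text: str) -> str:
    r"""Normalize Windows/old-Mac line endings to "\n".

    >>> normalize_newlines("a\r\nb\rc")
    'a\nb\nc'
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")

def make_preview(text: str, n_chars: int = 200) -> str:
    r"""Return the first `n_chars` characters with newlines shown literally.

    >>> make_preview("hi\nthere", n_chars=4)
    'hi\\nt'
    """
    return text[:n_chars].replace("\n", "\\n")

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # run with -v for verbose doctest output
```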
=== Delivery
- Start with a short “How to use this repo” preface and a file tree suggestion.