docs: update README with improved formatting and structured content
- Reformatted the README for better readability with consistent indentation and line breaks
- Restructured course outline with clear lesson numbering and descriptions
- Added detailed getting started instructions with step-by-step setup process
- Included repository layout diagram showing file organization
- Enhanced requirements section with clearer dependency structure
- Added what to expect section outlining project characteristics and learning approach
README.md (135 changed lines)
@@ -1,8 +1,10 @@
# ARIA — Zero-to-Tiny LLM (Python)

**ARIA** is a beginner-friendly, step-by-step course that takes you from **“Hello World”** to training a **tiny decoder-only, character-level LLM** in Python. Each lesson is a single, runnable file with clear docstrings, doctests where helpful, and minimal dependencies.

ARIA is a beginner-friendly, step-by-step course that takes you from “Hello World” to training a
tiny decoder-only, character-level LLM in Python. Each lesson is a single, runnable file with clear
docstrings, doctests where helpful, and minimal dependencies.

> **Note:** This repository’s instructional content was **generated with the assistance of an AI language model**.

> Note: This repository’s instructional content was generated with the assistance of an AI language model.

---
@@ -18,48 +20,117 @@
## Who this is for

- Beginners who can run `python script.py` and have written a basic “Hello World”.
- Learners who want a **clear path** to an LLM without heavy math or large datasets.
- Learners who want a clear path to an LLM without heavy math or large datasets.

---
## Course outline (lessons)

1. Read a Text File (with docstrings)
2. Character Frequency Counter
3. Train/Val Split
4. Char Vocabulary + Encode/Decode
5. Uniform Random Text Generator
6. Bigram Counts Language Model
7. Laplace Smoothing (compare w/ and w/o)
8. Temperature & Top-k Sampling
9. Perplexity on Validation
10. NumPy Softmax + Cross-Entropy (toy)
11. PyTorch Tensors 101
12. Autograd Mini-Lab (fit *y = 2x + 3*)
13. Char Bigram Neural LM (PyTorch)
14. Sampling Function (PyTorch)
15. Single-Head Self-Attention (causal mask)
16. Mini Transformer Block (pre-LN)
17. Tiny Decoder-Only Model (1–2 blocks)
18. *(Optional)* Save/Load & CLI Interface

1. Read a Text File (with docstrings)
2. Character Frequency Counter
3. Train/Val Split
4. Char Vocabulary + Encode/Decode
5. Uniform Random Text Generator
6. Bigram Counts Language Model
7. Laplace Smoothing (compare w/ and w/o)
8. Temperature & Top-k Sampling
9. Perplexity on Validation
10. NumPy Softmax + Cross-Entropy (toy)
11. PyTorch Tensors 101
12. Autograd Mini-Lab (fit y = 2x + 3)
13. Char Bigram Neural LM (PyTorch)
14. Sampling Function (PyTorch)
15. Single-Head Self-Attention (causal mask)
16. Mini Transformer Block (pre-LN)
17. Tiny Decoder-Only Model (1–2 blocks)
18. (Optional) Save/Load & CLI Interface

Each lesson includes: **Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests, Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.**

Each lesson includes: Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests,
Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.
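To give a feel for that starter-code format, here is a small illustrative sketch in the style of Lesson 2 (not the actual lesson file; the shipped version may differ), with a docstring and a doctest:

```python
from collections import Counter


def char_counts(text: str) -> Counter:
    """Count how often each character appears in `text`.

    >>> char_counts("abba")["a"]
    2
    >>> char_counts("abba")["b"]
    2
    """
    return Counter(text)


if __name__ == "__main__":
    # Tiny demo corpus; lessons read from data.txt instead.
    for ch, n in char_counts("hello world").most_common(3):
        print(repr(ch), n)
```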
---
## Requirements

- **Python**: 3.10+
- **OS**: Windows/macOS/Linux (UTF-8 locale recommended)
- **Dependencies**:
  - Stdlib only until Lesson 9
  - **NumPy** for Lessons 8–10
  - **PyTorch** (CPU is fine) from Lesson 11 onward
- **Hardware**: CPU is enough for all lessons; tiny models, short runs

- Python: 3.10+
- OS: Windows/macOS/Linux (UTF-8 locale recommended)
- Dependencies:
  - Stdlib only until Lesson 9
  - NumPy for Lessons 8–10
  - PyTorch (CPU is fine) from Lesson 11 onward
- Hardware: CPU is enough for all lessons; tiny models, short runs
Install common deps (when needed):

pip install numpy torch --upgrade

```bash
pip install numpy torch --upgrade
```

---
## Getting started

1) Clone or download this project.
2) Place a small corpus in `data.txt` (public-domain text is ideal).
   If `data.txt` is missing, the scripts include a tiny fallback so you can still run them.
3) Start at Lesson 1:

   python 01_read_text.py

4) (Optional) Run doctests:

   python -m doctest -v 01_read_text.py

---
## Repository layout (suggested)

    aria/
      data.txt
      01_read_text.py
      02_char_counts.py
      03_split.py
      04_vocab.py
      05_uniform_gen.py
      06_bigram_counts.py
      07_laplace.py
      08_sampling.py
      09_perplexity.py
      10_numpy_softmax_ce.py
      11_torch_tensors_101.py
      12_autograd_linreg.py
      13_bigram_nn.py
      14_generate.py
      15_attention.py
      16_block.py
      17_tiny_decoder.py
      18_cli.py

---
## What to expect

- Short, focused lessons (often < 200 LOC) with runnable starters.
- Docstrings everywhere: module & function-level (Args/Returns/Raises), plus doctests where useful.
- Reproducibility: fixed seeds for random, numpy, and torch (see the sketch after this list).
- No safety/guardrail features: this is purely a learning project in a controlled environment.
- Incremental wins: you’ll see text samples improve as models get smarter.
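A minimal sketch of that reproducibility setup, using the standard `random`, `numpy`, and `torch` seeding calls (the seed value itself is arbitrary):

```python
import random

import numpy as np
import torch

SEED = 42  # arbitrary fixed value so runs are repeatable

random.seed(SEED)        # Python's built-in RNG
np.random.seed(SEED)     # NumPy's global RNG
torch.manual_seed(SEED)  # PyTorch CPU (and default CUDA) RNG
```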
---
## About the project

- Author: Dani
- AI Assistance: Content and structure generated with the help of an AI language model.
- License: MIT (recommended).
- Intended use: Education, personal learning, teaching small groups.

---
## FAQ (quick)

- Do I need a GPU? No. CPU is fine; models are tiny.
- Where’s the data from? Provide your own public-domain `data.txt`.
- Why character-level? Simpler pipeline; no tokenizer complexity early on (see the sketch below).
- Why pre-LN Transformer? Stable training and cleaner gradients in small models.
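To make the character-level point concrete, a vocabulary with encode/decode fits in a few lines. This is only an illustrative sketch in the spirit of Lesson 4, not the lesson file itself:

```python
text = "hello world"

# The sorted set of characters is the whole "tokenizer".
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode(text)) == text  # round-trips losslessly
```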
---
## AI-Generated Content Notice

Parts of this repository (instructions, lesson templates, and examples) were generated by an AI model and
reviewed for clarity. Always run and test code in your own environment.
prompt.md (28 changed lines)
@@ -51,28 +51,28 @@ Use this exact section order:

18) (Optional) Save/Load & CLI Interface

=== Constraints & Defaults
- Dataset: do NOT auto-download. Expect a local `data.txt`. If missing, include a tiny built-in fallback sample so scripts still run.
- Dataset: do NOT auto-download. Expect a local data.txt. If missing, include a tiny built-in fallback sample so scripts still run.
- Encoding: UTF-8. Normalize newlines to "\n" for consistency.
- Seeds: demonstrate reproducibility (`random`, `numpy`, `torch`).
- Seeds: demonstrate reproducibility (random, numpy, torch).
- Dependencies:
  * Stdlib only until Lesson 9;
  * NumPy in Lessons 8–10;
  * PyTorch from Lesson 11 onward.
  * Stdlib only until Lesson 9
  * NumPy in Lessons 8–10
  * PyTorch from Lesson 11 onward
- Training defaults (for Lessons 13+):
  * Batch size ~32, block size ~128, AdamW(lr=3e-4).
  * Brief note on early stopping when val loss plateaus.
  * Batch size ~32, block size ~128, AdamW(lr=3e-4)
  * Brief note on early stopping when val loss plateaus
- Inference defaults:
  * Start with greedy; then temperature=0.8, top-k=50.
  * Start with greedy; then temperature=0.8, top-k=50
- Keep code clean: type hints where helpful; no frameworks beyond NumPy/PyTorch; no external data loaders.
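A minimal PyTorch sketch of the training and inference defaults listed above (names such as `model`, `logits`, and `sample_next` are placeholders for this sketch, not code from the lessons):

```python
import torch

# Training defaults (Lessons 13+): batch ~32, block ~128, AdamW(lr=3e-4).
batch_size, block_size = 32, 128
# `model` stands in for whichever nn.Module the lesson builds:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)


# Inference defaults: greedy first, then temperature=0.8 with top-k=50.
def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Pick the next token id from a 1-D logits vector."""
    logits = logits / temperature                     # soften/sharpen the distribution
    k = min(top_k, logits.numel())
    topk_vals, topk_idx = torch.topk(logits, k)       # keep the k most likely tokens
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)  # sample within the top-k set
    return int(topk_idx[choice])

# Greedy decoding is simply: int(torch.argmax(logits))
```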
=== Lesson 1 Specifics
For Lesson 1, include:
- Module docstring with Usage example (`python 01_read_text.py`).
- Functions: `load_text(path: Optional[Path])`, `normalize_newlines(text: str)`,
  `make_preview(text: str, n_chars: int = 200)`, `report_stats(text: str)`, `main()`.
- At least one doctest per function where reasonable.
- Fallback text snippet if `data.txt` isn’t found.
- Output: total chars, unique chars, 200-char preview with literal "\n".
- Module docstring with Usage example (python 01_read_text.py)
- Functions: load_text(path: Optional[Path]), normalize_newlines(text: str),
  make_preview(text: str, n_chars: int = 200), report_stats(text: str), main()
- At least one doctest per function where reasonable
- Fallback text snippet if data.txt isn’t found
- Output: total chars, unique chars, 200-char preview with literal "\n"
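One way those Lesson 1 functions could be fleshed out; this is a hedged sketch under the signatures named above (the fallback text and output wording are placeholders, and the shipped file may differ):

```python
"""Lesson 1: read a text file.

Usage:
    python 01_read_text.py
"""
from pathlib import Path
from typing import Optional

FALLBACK = "To be, or not to be, that is the question.\n"  # placeholder fallback snippet


def normalize_newlines(text: str) -> str:
    r"""Normalize Windows/Mac newlines to "\n".

    >>> normalize_newlines("a\r\nb\rc")
    'a\nb\nc'
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


def load_text(path: Optional[Path] = Path("data.txt")) -> str:
    """Return the corpus from `path`, or the fallback snippet if it is missing."""
    if path is not None and path.exists():
        return normalize_newlines(path.read_text(encoding="utf-8"))
    return FALLBACK


def make_preview(text: str, n_chars: int = 200) -> str:
    """Return the first `n_chars` characters with newlines shown as a literal \\n."""
    return text[:n_chars].replace("\n", "\\n")


def report_stats(text: str) -> None:
    """Print total chars, unique chars, and a short preview."""
    print(f"total chars:  {len(text)}")
    print(f"unique chars: {len(set(text))}")
    print(f"preview:      {make_preview(text)}")


def main() -> None:
    report_stats(load_text())


if __name__ == "__main__":
    main()
```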
=== Delivery
- Start with a short “How to use this repo” preface and a file tree suggestion.