docs: update README with improved formatting and structured content
- Reformatted the README for better readability with consistent indentation and line breaks
- Restructured course outline with clear lesson numbering and descriptions
- Added detailed getting started instructions with step-by-step setup process
- Included repository layout diagram showing file organization
- Enhanced requirements section with clearer dependency structure
- Added what to expect section outlining project characteristics and learning approach
README.md (135 changed lines)
@@ -1,8 +1,10 @@
# ARIA — Zero-to-Tiny LLM (Python)

**ARIA** is a beginner-friendly, step-by-step course that takes you from **“Hello World”** to training a **tiny decoder-only, character-level LLM** in Python. Each lesson is a single, runnable file with clear docstrings, doctests where helpful, and minimal dependencies.

ARIA is a beginner-friendly, step-by-step course that takes you from “Hello World” to training a
tiny decoder-only, character-level LLM in Python. Each lesson is a single, runnable file with clear
docstrings, doctests where helpful, and minimal dependencies.

> **Note:** This repository’s instructional content was **generated with the assistance of an AI language model**.

> Note: This repository’s instructional content was generated with the assistance of an AI language model.

---
@@ -18,48 +20,117 @@
## Who this is for

- Beginners who can run `python script.py` and have written a basic “Hello World”.
- Learners who want a **clear path** to an LLM without heavy math or large datasets.
- Learners who want a clear path to an LLM without heavy math or large datasets.

---
## Course outline (lessons)

1. Read a Text File (with docstrings)
2. Character Frequency Counter
3. Train/Val Split
4. Char Vocabulary + Encode/Decode
5. Uniform Random Text Generator
6. Bigram Counts Language Model
7. Laplace Smoothing (compare w/ and w/o)
8. Temperature & Top-k Sampling
9. Perplexity on Validation
10. NumPy Softmax + Cross-Entropy (toy)
11. PyTorch Tensors 101
12. Autograd Mini-Lab (fit *y = 2x + 3*)
13. Char Bigram Neural LM (PyTorch)
14. Sampling Function (PyTorch)
15. Single-Head Self-Attention (causal mask)
16. Mini Transformer Block (pre-LN)
17. Tiny Decoder-Only Model (1–2 blocks)
18. *(Optional)* Save/Load & CLI Interface

1. Read a Text File (with docstrings)
2. Character Frequency Counter
3. Train/Val Split
4. Char Vocabulary + Encode/Decode
5. Uniform Random Text Generator
6. Bigram Counts Language Model
7. Laplace Smoothing (compare w/ and w/o)
8. Temperature & Top-k Sampling
9. Perplexity on Validation
10. NumPy Softmax + Cross-Entropy (toy)
11. PyTorch Tensors 101
12. Autograd Mini-Lab (fit y = 2x + 3)
13. Char Bigram Neural LM (PyTorch)
14. Sampling Function (PyTorch)
15. Single-Head Self-Attention (causal mask)
16. Mini Transformer Block (pre-LN)
17. Tiny Decoder-Only Model (1–2 blocks)
18. (Optional) Save/Load & CLI Interface

Each lesson includes: **Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests, Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.**

Each lesson includes: Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests,
Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.
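To give a feel for that starter-code format, here is a small illustrative sketch in the style of Lesson 2 (not the actual lesson file; the shipped version may differ), with a docstring and a doctest:

```python
from collections import Counter


def char_counts(text: str) -> Counter:
    """Count how often each character appears in `text`.

    >>> char_counts("abba")["a"]
    2
    >>> char_counts("abba")["b"]
    2
    """
    return Counter(text)


if __name__ == "__main__":
    # Tiny demo corpus; lessons read from data.txt instead.
    for ch, n in char_counts("hello world").most_common(3):
        print(repr(ch), n)
```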
---
## Requirements

- **Python**: 3.10+
- **OS**: Windows/macOS/Linux (UTF-8 locale recommended)
- **Dependencies**:
  - Stdlib only until Lesson 9
  - **NumPy** for Lessons 8–10
  - **PyTorch** (CPU is fine) from Lesson 11 onward
- **Hardware**: CPU is enough for all lessons; tiny models, short runs

- Python: 3.10+
- OS: Windows/macOS/Linux (UTF-8 locale recommended)
- Dependencies:
  - Stdlib only until Lesson 9
  - NumPy for Lessons 8–10
  - PyTorch (CPU is fine) from Lesson 11 onward
- Hardware: CPU is enough for all lessons; tiny models, short runs
Install common deps (when needed):

pip install numpy torch --upgrade

```bash
pip install numpy torch --upgrade
```

---
## Getting started

1) Clone or download this project.
2) Place a small corpus in `data.txt` (public-domain text is ideal).
   If `data.txt` is missing, the scripts include a tiny fallback so you can still run them.
3) Start at Lesson 1:

   python 01_read_text.py

4) (Optional) Run doctests:

   python -m doctest -v 01_read_text.py

---
## Repository layout (suggested)

    aria/
      data.txt
      01_read_text.py
      02_char_counts.py
      03_split.py
      04_vocab.py
      05_uniform_gen.py
      06_bigram_counts.py
      07_laplace.py
      08_sampling.py
      09_perplexity.py
      10_numpy_softmax_ce.py
      11_torch_tensors_101.py
      12_autograd_linreg.py
      13_bigram_nn.py
      14_generate.py
      15_attention.py
      16_block.py
      17_tiny_decoder.py
      18_cli.py

---
## What to expect

- Short, focused lessons (often < 200 LOC) with runnable starters.
- Docstrings everywhere: module & function-level (Args/Returns/Raises), plus doctests where useful.
- Reproducibility: fixed seeds for random, numpy, and torch (see the sketch after this list).
- No safety/guardrail features: this is purely a learning project in a controlled environment.
- Incremental wins: you’ll see text samples improve as models get smarter.
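A minimal sketch of that reproducibility setup, using the standard `random`, `numpy`, and `torch` seeding calls (the seed value itself is arbitrary):

```python
import random

import numpy as np
import torch

SEED = 42  # arbitrary fixed value so runs are repeatable

random.seed(SEED)        # Python's built-in RNG
np.random.seed(SEED)     # NumPy's global RNG
torch.manual_seed(SEED)  # PyTorch CPU (and default CUDA) RNG
```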
---
## About the project

- Author: Dani
- AI Assistance: Content and structure generated with the help of an AI language model.
- License: MIT (recommended).
- Intended use: Education, personal learning, teaching small groups.

---
## FAQ (quick)

- Do I need a GPU? No. CPU is fine; models are tiny.
- Where’s the data from? Provide your own public-domain `data.txt`.
- Why character-level? Simpler pipeline; no tokenizer complexity early on (see the sketch below).
- Why pre-LN Transformer? Stable training and cleaner gradients in small models.
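To make the character-level point concrete, a vocabulary with encode/decode fits in a few lines. This is only an illustrative sketch in the spirit of Lesson 4, not the lesson file itself:

```python
text = "hello world"

# The sorted set of characters is the whole "tokenizer".
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode(text)) == text  # round-trips losslessly
```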
---
## AI-Generated Content Notice

Parts of this repository (instructions, lesson templates, and examples) were generated by an AI model and
reviewed for clarity. Always run and test code in your own environment.
prompt.md (28 changed lines)
@@ -51,28 +51,28 @@ Use this exact section order:

18) (Optional) Save/Load & CLI Interface

=== Constraints & Defaults
- Dataset: do NOT auto-download. Expect a local `data.txt`. If missing, include a tiny built-in fallback sample so scripts still run.
- Dataset: do NOT auto-download. Expect a local data.txt. If missing, include a tiny built-in fallback sample so scripts still run.
- Encoding: UTF-8. Normalize newlines to "\n" for consistency.
- Seeds: demonstrate reproducibility (`random`, `numpy`, `torch`).
- Seeds: demonstrate reproducibility (random, numpy, torch).
- Dependencies:
  * Stdlib only until Lesson 9;
  * NumPy in Lessons 8–10;
  * PyTorch from Lesson 11 onward.
  * Stdlib only until Lesson 9
  * NumPy in Lessons 8–10
  * PyTorch from Lesson 11 onward
- Training defaults (for Lessons 13+):
  * Batch size ~32, block size ~128, AdamW(lr=3e-4).
  * Brief note on early stopping when val loss plateaus.
  * Batch size ~32, block size ~128, AdamW(lr=3e-4)
  * Brief note on early stopping when val loss plateaus
- Inference defaults:
  * Start with greedy; then temperature=0.8, top-k=50.
  * Start with greedy; then temperature=0.8, top-k=50
- Keep code clean: type hints where helpful; no frameworks beyond NumPy/PyTorch; no external data loaders.
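A minimal PyTorch sketch of the training and inference defaults listed above (names such as `model`, `logits`, and `sample_next` are placeholders for this sketch, not code from the lessons):

```python
import torch

# Training defaults (Lessons 13+): batch ~32, block ~128, AdamW(lr=3e-4).
batch_size, block_size = 32, 128
# `model` stands in for whichever nn.Module the lesson builds:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)


# Inference defaults: greedy first, then temperature=0.8 with top-k=50.
def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Pick the next token id from a 1-D logits vector."""
    logits = logits / temperature                     # soften/sharpen the distribution
    k = min(top_k, logits.numel())
    topk_vals, topk_idx = torch.topk(logits, k)       # keep the k most likely tokens
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)  # sample within the top-k set
    return int(topk_idx[choice])

# Greedy decoding is simply: int(torch.argmax(logits))
```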
=== Lesson 1 Specifics
For Lesson 1, include:
- Module docstring with Usage example (`python 01_read_text.py`).
- Functions: `load_text(path: Optional[Path])`, `normalize_newlines(text: str)`,
  `make_preview(text: str, n_chars: int = 200)`, `report_stats(text: str)`, `main()`.
- At least one doctest per function where reasonable.
- Fallback text snippet if `data.txt` isn’t found.
- Output: total chars, unique chars, 200-char preview with literal "\n".
- Module docstring with Usage example (python 01_read_text.py)
- Functions: load_text(path: Optional[Path]), normalize_newlines(text: str),
  make_preview(text: str, n_chars: int = 200), report_stats(text: str), main()
- At least one doctest per function where reasonable
- Fallback text snippet if data.txt isn’t found
- Output: total chars, unique chars, 200-char preview with literal "\n"
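One way those Lesson 1 functions could be fleshed out; this is a hedged sketch under the signatures named above (the fallback text and output wording are placeholders, and the shipped file may differ):

```python
"""Lesson 1: read a text file.

Usage:
    python 01_read_text.py
"""
from pathlib import Path
from typing import Optional

FALLBACK = "To be, or not to be, that is the question.\n"  # placeholder fallback snippet


def normalize_newlines(text: str) -> str:
    r"""Normalize Windows/Mac newlines to "\n".

    >>> normalize_newlines("a\r\nb\rc")
    'a\nb\nc'
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


def load_text(path: Optional[Path] = Path("data.txt")) -> str:
    """Return the corpus from `path`, or the fallback snippet if it is missing."""
    if path is not None and path.exists():
        return normalize_newlines(path.read_text(encoding="utf-8"))
    return FALLBACK


def make_preview(text: str, n_chars: int = 200) -> str:
    """Return the first `n_chars` characters with newlines shown as a literal \\n."""
    return text[:n_chars].replace("\n", "\\n")


def report_stats(text: str) -> None:
    """Print total chars, unique chars, and a short preview."""
    print(f"total chars:  {len(text)}")
    print(f"unique chars: {len(set(text))}")
    print(f"preview:      {make_preview(text)}")


def main() -> None:
    report_stats(load_text())


if __name__ == "__main__":
    main()
```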
=== Delivery
- Start with a short “How to use this repo” preface and a file tree suggestion.