Dani 538a44247b feat: add uniform random text generator with reproducible sampling
This commit introduces a new script that generates random text by uniformly sampling characters from the training data's vocabulary. It loads text from train.txt or falls back to data.txt, normalizes line endings, builds a sorted character vocabulary, and samples characters using a fixed RNG seed for reproducibility. The implementation includes command-line arguments for specifying generation length and random seed, making it configurable while maintaining consistent output for the same inputs.
2025-09-23 21:18:57 -04:00
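
The generator described in the commit message above could look roughly like the sketch below. This is not the committed script: the function names, the `--length` and `--seed` flag names, their defaults, and the built-in fallback string are all assumptions made for illustration.

```python
import argparse
import random
from pathlib import Path

# Hypothetical stand-in corpus, used only when neither file exists.
FALLBACK_TEXT = "hello world\n"


def load_text() -> str:
    """Read train.txt, else data.txt, else the built-in fallback."""
    for name in ("train.txt", "data.txt"):
        path = Path(name)
        if path.exists():
            return path.read_text(encoding="utf-8")
    return FALLBACK_TEXT


def normalize_newlines(text: str) -> str:
    """Convert Windows (CRLF) and old Mac (CR) line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def build_vocab(text: str) -> list:
    """Return the distinct characters of text in sorted order."""
    return sorted(set(text))


def sample_uniform(vocab: list, length: int, seed: int) -> str:
    """Draw `length` characters, each equally likely, from a seeded RNG."""
    rng = random.Random(seed)  # private RNG, so global state is untouched
    return "".join(rng.choice(vocab) for _ in range(length))


def main(argv=None) -> None:
    """Parse the (assumed) flags and print one sample."""
    parser = argparse.ArgumentParser(description="Uniform random text")
    parser.add_argument("--length", type=int, default=200)
    parser.add_argument("--seed", type=int, default=1337)
    args = parser.parse_args(argv)
    vocab = build_vocab(normalize_newlines(load_text()))
    print(sample_uniform(vocab, args.length, args.seed))
```

Calling `main(["--length", "40", "--seed", "7"])` twice should print the same 40 characters, because the sorted vocabulary and the seeded RNG are both deterministic for a given input file.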

ARIA — Zero-to-Tiny LLM (Python)

ARIA is a beginner-friendly, step-by-step course that takes you from “Hello World” to training a tiny decoder-only, character-level LLM in Python. Each lesson is a single, runnable file with clear docstrings, doctests where helpful, and minimal dependencies.

Note: This repository's instructional content was generated with the assistance of an AI language model.


What you'll build

  • A progression of tiny language models:
    • Count-based bigram model → NumPy softmax toy → PyTorch bigram NN
    • Single-head self-attention → Mini Transformer block
    • A tiny decoder-only model trained on a small corpus (e.g., Tiny Shakespeare)
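
To give a feel for the first step in that progression, here is a minimal sketch of a count-based bigram model. The function names are hypothetical and the lesson's own starter code may be organized differently.

```python
from collections import Counter


def bigram_counts(text: str) -> dict:
    """Count how often each character follows each other character."""
    counts = {}
    for prev, nxt in zip(text, text[1:]):
        counts.setdefault(prev, Counter())[nxt] += 1
    return counts


def next_char_probs(counts: dict, prev: str) -> dict:
    """Turn one row of counts into probabilities (no smoothing)."""
    row = counts.get(prev, Counter())
    total = sum(row.values())
    # An unseen `prev` has an empty row, so this returns {}.
    return {ch: n / total for ch, n in row.items()}


counts = bigram_counts("abac")
print(next_char_probs(counts, "a"))
```

In "abac" the character "a" is followed once by "b" and once by "c", so each gets probability 0.5. The later models in the list learn a similar next-character table instead of counting it.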

Who this is for

  • Beginners who can run python script.py and have written a basic “Hello World”.
  • Learners who want a clear path to an LLM without heavy math or large datasets.

Course outline (lessons)

  1. Read a Text File (with docstrings)
  2. Character Frequency Counter
  3. Train/Val Split
  4. Char Vocabulary + Encode/Decode
  5. Uniform Random Text Generator
  6. Bigram Counts Language Model
  7. Laplace Smoothing (compare w/ and w/o)
  8. Temperature & Top-k Sampling
  9. Perplexity on Validation
  10. NumPy Softmax + Cross-Entropy (toy)
  11. PyTorch Tensors 101
  12. Autograd Mini-Lab (fit y = 2x + 3)
  13. Char Bigram Neural LM (PyTorch)
  14. Sampling Function (PyTorch)
  15. Single-Head Self-Attention (causal mask)
  16. Mini Transformer Block (pre-LN)
  17. Tiny Decoder-Only Model (1-2 blocks)
  18. (Optional) Save/Load & CLI Interface
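
As a taste of Lesson 8, the sketch below combines temperature scaling with top-k filtering using only the standard library. It assumes the model exposes next-character scores as a non-empty dict of unnormalized log-scores; that interface and the function name are illustrative, not taken from the lesson file.

```python
import math
import random


def sample_top_k(scores: dict, temperature: float, k: int,
                 rng: random.Random) -> str:
    """Sample one character from the k highest-scoring candidates.

    Lower temperatures sharpen the distribution; higher ones flatten it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Keep only the k best candidates.
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Scale by temperature, then apply a numerically stable softmax.
    scaled = [score / temperature for _, score in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    chars = [ch for ch, _ in top]
    # random.choices normalizes the weights internally.
    return rng.choices(chars, weights=weights, k=1)[0]


rng = random.Random(0)
scores = {"a": 2.0, "b": 1.0, "c": 0.0}
print(sample_top_k(scores, temperature=0.8, k=2, rng=rng))
```

With `k=1` the function always returns the highest-scoring character, which is a handy sanity check when experimenting.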

Each lesson includes: Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests, Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.
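
For a sense of what starter code with docstrings and doctests might look like, here is a hypothetical starter in the spirit of Lesson 4. The function names and docstring layout are assumptions, not the lesson's actual file.

```python
"""Hypothetical lesson starter: character vocabulary with encode/decode."""


def build_char_maps(text: str) -> tuple:
    """Map each distinct character to an integer id and back.

    Args:
        text: Source text to scan for characters.

    Returns:
        A (stoi, itos) pair: char -> id and id -> char dictionaries.

    >>> stoi, itos = build_char_maps("abca")
    >>> stoi
    {'a': 0, 'b': 1, 'c': 2}
    >>> itos[2]
    'c'
    """
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text: str, stoi: dict) -> list:
    """Convert text to a list of ids.

    Raises:
        KeyError: If text contains a character missing from stoi.

    >>> encode("cab", {'a': 0, 'b': 1, 'c': 2})
    [2, 0, 1]
    """
    return [stoi[ch] for ch in text]


def decode(ids: list, itos: dict) -> str:
    """Convert a list of ids back to text.

    >>> decode([2, 0, 1], {0: 'a', 1: 'b', 2: 'c'})
    'cab'
    """
    return "".join(itos[i] for i in ids)
```

Saved as a file, the examples in the docstrings can be checked with `python -m doctest -v` followed by the file name.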


Requirements

  • Python: 3.10+
  • OS: Windows/macOS/Linux (UTF-8 locale recommended)
  • Dependencies:
    • Stdlib only until Lesson 9
    • NumPy for Lessons 8-10
    • PyTorch (CPU is fine) from Lesson 11 onward
  • Hardware: CPU is enough for all lessons; tiny models, short runs

Install common deps (when needed): pip install numpy torch --upgrade


Getting started

  1. Clone or download this project.
  2. Place a small corpus in data.txt (public-domain text is ideal). If data.txt is missing, the scripts include a tiny fallback so you can still run them.
  3. Start at Lesson 1: python 01_read_text.py
  4. (Optional) Run doctests: python -m doctest -v 01_read_text.py

Repository layout (suggested)

aria/
  data.txt
  01_read_text.py
  02_char_counts.py
  03_split.py
  04_vocab.py
  05_uniform_gen.py
  06_bigram_counts.py
  07_laplace.py
  08_sampling.py
  09_perplexity.py
  10_numpy_softmax_ce.py
  11_torch_tensors_101.py
  12_autograd_linreg.py
  13_bigram_nn.py
  14_generate.py
  15_attention.py
  16_block.py
  17_tiny_decoder.py
  18_cli.py

What to expect

  • Short, focused lessons (often < 200 LOC) with runnable starters.
  • Docstrings everywhere: module & function-level (Args/Returns/Raises), plus doctests where useful.
  • Reproducibility: fixed seeds for random, numpy, and torch.
  • No safety/guardrail features: this is purely a learning project in a controlled environment.
  • Incremental wins: you'll see text samples improve as models get smarter.
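
One way the fixed-seed convention mentioned above could be implemented is a small helper like the one below. The name `seed_everything` and the default seed are assumptions. NumPy and PyTorch are seeded only if they are installed, so the helper also works in the stdlib-only lessons.

```python
import random


def seed_everything(seed: int = 1337) -> None:
    """Seed every RNG the course uses, skipping libraries not installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy RNG
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything(42)
first = random.random()
seed_everything(42)
print(first == random.random())  # same seed, same draw
```

Calling the helper once at the top of a script should make repeated runs produce the same samples on the same machine and library versions.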

About the project

  • Author: Dani
  • AI Assistance: Content and structure generated with the help of an AI language model.
  • License: MIT (recommended).
  • Intended use: Education, personal learning, teaching small groups.

FAQ (quick)

  • Do I need a GPU? No. CPU is fine; models are tiny.
  • Where's the data from? Provide your own public-domain data.txt.
  • Why character-level? Simpler pipeline; no tokenizer complexity early on.
  • Why pre-LN Transformer? Stable training and cleaner gradients in small models.
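
The pre-LN answer above can be made concrete with a toy comparison in plain Python. This is only a sketch of the ordering difference: the layer norm here has no learnable gain or bias, vectors are plain lists, and `sublayer` stands in for attention or an MLP.

```python
import math


def layer_norm(x: list, eps: float = 1e-5) -> list:
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_residual(x: list, sublayer) -> list:
    """Pre-LN ordering: x + sublayer(LayerNorm(x))."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def post_ln_residual(x: list, sublayer) -> list:
    """Post-LN ordering: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def zero_sublayer(v: list) -> list:
    """A sublayer that contributes nothing, to expose the skip path."""
    return [0.0] * len(v)


x = [1.0, 2.0, 3.0]
print(pre_ln_residual(x, zero_sublayer))   # input passes through unchanged
print(post_ln_residual(x, zero_sublayer))  # input is rescaled by the norm
```

In the pre-LN ordering the skip connection carries the input through untouched, which is the usual explanation for why gradients flow more cleanly through a stack of such blocks.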

AI-Generated Content Notice

Parts of this repository (instructions, lesson templates, and examples) were generated by an AI model and reviewed for clarity. Always run and test code in your own environment.
