Dani 674c53651c feat: add Laplace-smoothed bigram model perplexity computation script
This commit introduces a new script that implements a Laplace-smoothed bigram language model for computing validation perplexity. The implementation includes:
- Data loading and splitting functionality (90/10 train/validation split)
- Character vocabulary building from training data only
- Bigram counting and Laplace smoothing with alpha=1.0
- Negative log-likelihood and perplexity computation
- Proper handling of out-of-vocabulary characters during evaluation

The script can process existing train.txt/val.txt files or automatically split a data.txt file if the required input files are missing, making it self-contained and easy to use for language model evaluation tasks.
2025-09-24 00:33:23 -04:00
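
For orientation, a minimal sketch of the computation the commit above describes might look like the following. The file names (data.txt, train.txt, val.txt), the 90/10 split, and alpha=1.0 come from the description; the function names and structure are illustrative, not the actual script.

  import math
  from collections import Counter

  def load_splits(path="data.txt", frac=0.9):
      """Fall back to a 90/10 split of data.txt when train.txt/val.txt are missing."""
      with open(path, encoding="utf-8") as f:
          text = f.read()
      cut = int(len(text) * frac)
      return text[:cut], text[cut:]

  def fit_bigrams(train_text, alpha=1.0):
      """Build the character vocabulary and bigram/context counts from training data only."""
      vocab = sorted(set(train_text))
      bigram_counts = Counter(zip(train_text, train_text[1:]))
      context_counts = Counter(train_text[:-1])
      return vocab, bigram_counts, context_counts, alpha

  def perplexity(val_text, vocab, bigram_counts, context_counts, alpha):
      """Exponentiated average negative log-likelihood under Laplace smoothing."""
      V = len(vocab)
      nll, n = 0.0, 0
      for prev, cur in zip(val_text, val_text[1:]):
          # Unseen and out-of-vocabulary pairs still get probability > 0
          # thanks to +alpha in the numerator and +alpha*V in the denominator.
          p = (bigram_counts.get((prev, cur), 0) + alpha) / (context_counts.get(prev, 0) + alpha * V)
          nll -= math.log(p)
          n += 1
      return math.exp(nll / n)

With these pieces, perplexity(val_text, *fit_bigrams(train_text)) yields the validation perplexity.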

ARIA — Zero-to-Tiny LLM (Python)

ARIA is a beginner-friendly, step-by-step course that takes you from “Hello World” to training a tiny decoder-only, character-level LLM in Python. Each lesson is a single, runnable file with clear docstrings, doctests where helpful, and minimal dependencies.

Note: This repository's instructional content was generated with the assistance of an AI language model.


What you'll build

  • A progression of tiny language models:
    • Count-based bigram model → NumPy softmax toy → PyTorch bigram NN
    • Single-head self-attention → Mini Transformer block
    • A tiny decoder-only model trained on a small corpus (e.g., Tiny Shakespeare)

Who this is for

  • Beginners who can run python script.py and have written a basic “Hello World”.
  • Learners who want a clear path to an LLM without heavy math or large datasets.

Course outline (lessons)

  1. Read a Text File (with docstrings)
  2. Character Frequency Counter
  3. Train/Val Split
  4. Char Vocabulary + Encode/Decode
  5. Uniform Random Text Generator
  6. Bigram Counts Language Model
  7. Laplace Smoothing (compare w/ and w/o)
  8. Temperature & Top-k Sampling
  9. Perplexity on Validation
  10. NumPy Softmax + Cross-Entropy (toy)
  11. PyTorch Tensors 101
  12. Autograd Mini-Lab (fit y = 2x + 3)
  13. Char Bigram Neural LM (PyTorch)
  14. Sampling Function (PyTorch)
  15. Single-Head Self-Attention (causal mask); see the sketch below the outline
  16. Mini Transformer Block (pre-LN)
  17. Tiny Decoder-Only Model (1–2 blocks)
  18. (Optional) Save/Load & CLI Interface

Each lesson includes: Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests, Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.
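
As a taste of where the outline is headed, here is a rough sketch of what Lesson 15's single-head causal self-attention could look like in PyTorch. The class name, dimensions, and buffer handling are illustrative assumptions, not the course's actual starter code.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SingleHeadSelfAttention(nn.Module):
      """One causal attention head: each position attends only to itself and earlier positions."""

      def __init__(self, embed_dim, head_dim, block_size):
          super().__init__()
          self.key = nn.Linear(embed_dim, head_dim, bias=False)
          self.query = nn.Linear(embed_dim, head_dim, bias=False)
          self.value = nn.Linear(embed_dim, head_dim, bias=False)
          # Lower-triangular mask: position t may look at positions <= t only.
          self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

      def forward(self, x):
          B, T, C = x.shape
          k, q, v = self.key(x), self.query(x), self.value(x)
          # Scaled dot-product scores, with the future masked out before softmax.
          att = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
          att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
          att = F.softmax(att, dim=-1)
          return att @ v  # (B, T, head_dim)

A quick smoke test: SingleHeadSelfAttention(32, 16, 64)(torch.randn(2, 10, 32)) should return a (2, 10, 16) tensor.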


Requirements

  • Python: 3.10+
  • OS: Windows/macOS/Linux (UTF-8 locale recommended)
  • Dependencies:
    • Stdlib only until Lesson 9
    • NumPy for Lessons 8–10
    • PyTorch (CPU is fine) from Lesson 11 onward
  • Hardware: CPU is enough for all lessons; tiny models, short runs

Install common deps (when needed): pip install numpy torch --upgrade


Getting started

  1. Clone or download this project.
  2. Place a small corpus in data.txt (public-domain text is ideal). If data.txt is missing, the scripts include a tiny fallback so you can still run them.
  3. Start at Lesson 1: python 01_read_text.py
  4. (Optional) Run doctests: python -m doctest -v 01_read_text.py

Repository layout (suggested)

aria/
  data.txt
  01_read_text.py
  02_char_counts.py
  03_split.py
  04_vocab.py
  05_uniform_gen.py
  06_bigram_counts.py
  07_laplace.py
  08_sampling.py
  09_perplexity.py
  10_numpy_softmax_ce.py
  11_torch_tensors_101.py
  12_autograd_linreg.py
  13_bigram_nn.py
  14_generate.py
  15_attention.py
  16_block.py
  17_tiny_decoder.py
  18_cli.py

What to expect

  • Short, focused lessons (often < 200 LOC) with runnable starters.
  • Docstrings everywhere: module & function-level (Args/Returns/Raises), plus doctests where useful.
  • Reproducibility: fixed seeds for random, numpy, and torch (see the snippet after this list).
  • No safety/guardrail features: this is purely a learning project in a controlled environment.
  • Incremental wins: you'll see text samples improve as models get smarter.
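
For example, the seed fixing and docstring/doctest conventions mentioned above might look roughly like this; the helper name set_seed and the Args-style docstring layout are illustrative choices, not prescribed by the lessons.

  import random

  import numpy as np
  import torch

  def set_seed(seed: int = 42) -> None:
      """Fix the seeds of random, numpy, and torch so runs repeat exactly.

      Args:
          seed: Any integer; reusing the same value reproduces the same run.

      >>> set_seed(0); a = random.random()
      >>> set_seed(0); b = random.random()
      >>> a == b
      True
      """
      random.seed(seed)
      np.random.seed(seed)
      torch.manual_seed(seed)

Running python -m doctest -v on such a file exercises the embedded example, just as Lesson 1 suggests.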

About the project

  • Author: Dani
  • AI Assistance: Content and structure generated with the help of an AI language model.
  • License: MIT (recommended).
  • Intended use: Education, personal learning, teaching small groups.

FAQ (quick)

  • Do I need a GPU? No. CPU is fine; models are tiny.
  • Where's the data from? Provide your own public-domain data.txt.
  • Why character-level? Simpler pipeline; no tokenizer complexity early on.
  • Why pre-LN Transformer? Stable training and cleaner gradients in small models; see the sketch below.
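
To make the pre-LN point concrete, here is a minimal sketch of the ordering, assuming attn and mlp are modules that map embed_dim back to embed_dim:

  import torch.nn as nn

  class PreLNBlock(nn.Module):
      """Pre-LN ordering: normalize first, then add the sublayer output back onto the residual path."""

      def __init__(self, embed_dim, attn, mlp):
          super().__init__()
          self.ln1 = nn.LayerNorm(embed_dim)
          self.ln2 = nn.LayerNorm(embed_dim)
          self.attn = attn
          self.mlp = mlp

      def forward(self, x):
          x = x + self.attn(self.ln1(x))  # LayerNorm before attention (pre-LN)
          x = x + self.mlp(self.ln2(x))   # LayerNorm before the feed-forward MLP
          return x

A post-LN block would instead normalize after each residual addition; the pre-LN order keeps gradients better behaved in small, short training runs.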

AI-Generated Content Notice

Parts of this repository (instructions, lesson templates, and examples) were generated by an AI model and reviewed for clarity. Always run and test code in your own environment.
