From 272172e87cb155481ffb699d637f1966346abd4e Mon Sep 17 00:00:00 2001
From: Dani
Date: Tue, 23 Sep 2025 11:54:02 -0400
Subject: [PATCH] Creating the project.

---
 .gitignore | 219 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 README.md  |  65 ++++++++++++++++
 prompt.md  |  83 ++++++++++++++++++++
 3 files changed, 367 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 README.md
 create mode 100644 prompt.md

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..82ec0a7
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,219 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[codz]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py.cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+# Pipfile.lock
+
+# UV
+# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# uv.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+# poetry.lock
+# poetry.toml
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
+# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
+# pdm.lock
+# pdm.toml
+.pdm-python
+.pdm-build/
+
+# pixi
+# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
+# pixi.lock
+# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
+# in the .venv directory. It is recommended not to include this directory in version control.
+.pixi
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# Redis
+*.rdb
+*.aof
+*.pid
+
+# RabbitMQ
+mnesia/
+rabbitmq/
+rabbitmq-data/
+
+# ActiveMQ
+activemq-data/
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.envrc
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+# .idea/
+
+# Abstra
+# Abstra is an AI-powered process automation framework.
+# Ignore directories containing user credentials, local state, and settings.
+# Learn more at https://abstra.io/docs
+.abstra/
+
+# Visual Studio Code
+# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
+# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
+# and can be added to the global gitignore or merged into this file. However, if you prefer,
+# you could uncomment the following to ignore the entire vscode folder
+# .vscode/
+
+# Ruff stuff:
+.ruff_cache/
+
+# PyPI configuration file
+.pypirc
+
+# Marimo
+marimo/_static/
+marimo/_lsp/
+__marimo__/
+
+# Streamlit
+.streamlit/secrets.toml
+
+# Data/Material that should not be synced
+data.txt
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..627e087
--- /dev/null
+++ b/README.md
@@ -0,0 +1,65 @@
+# ARIA — Zero-to-Tiny LLM (Python)
+
+**ARIA** is a beginner-friendly, step-by-step course that takes you from **“Hello World”** to training a **tiny decoder-only, character-level LLM** in Python. Each lesson is a single, runnable file with clear docstrings, doctests where helpful, and minimal dependencies.
+
+> **Note:** This repository’s instructional content was **generated with the assistance of an AI language model**.
+
+---
+
+## What you’ll build
+
+- A progression of tiny language models:
+  - Count-based bigram model → NumPy softmax toy → PyTorch bigram NN
+  - Single-head self-attention → Mini Transformer block
+  - A tiny decoder-only model trained on a small corpus (e.g., Tiny Shakespeare)
+
+---
+
+## Who this is for
+
+- Beginners who can run `python script.py` and have written a basic “Hello World”.
+- Learners who want a **clear path** to an LLM without heavy math or large datasets.
+
+---
+
+## Course outline (lessons)
+
+1. Read a Text File (with docstrings)
+2. Character Frequency Counter
+3. Train/Val Split
+4. Char Vocabulary + Encode/Decode
+5. Uniform Random Text Generator
+6. Bigram Counts Language Model
+7. Laplace Smoothing (compare w/ and w/o)
+8. Temperature & Top-k Sampling
+9. Perplexity on Validation
+10. NumPy Softmax + Cross-Entropy (toy)
+11. PyTorch Tensors 101
+12. Autograd Mini-Lab (fit *y = 2x + 3*)
+13. Char Bigram Neural LM (PyTorch)
+14. Sampling Function (PyTorch)
+15. Single-Head Self-Attention (causal mask)
+16. Mini Transformer Block (pre-LN)
+17. Tiny Decoder-Only Model (1–2 blocks)
+18. *(Optional)* Save/Load & CLI Interface
+
+Each lesson includes: **Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests, Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.**
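+
+To make the progression concrete, Lessons 6–8 boil down to ideas like the
+sketch below. This is an illustrative sample, **not** the actual lesson code;
+the corpus string stands in for `data.txt`:
+
+```python
+# Count-based character bigram model: count pairs, then sample by weight.
+import random
+from collections import Counter, defaultdict
+
+text = "hello world. hello aria. "  # stand-in corpus
+
+counts: defaultdict[str, Counter] = defaultdict(Counter)
+for a, b in zip(text, text[1:]):
+    counts[a][b] += 1  # how often character b follows character a
+
+def sample_next(ch: str) -> str:
+    """Draw the next character from the bigram distribution after ``ch``."""
+    nxt = counts[ch]
+    return random.choices(list(nxt), weights=list(nxt.values()), k=1)[0]
+
+random.seed(0)  # the lessons stress reproducible runs
+out = "h"
+for _ in range(40):
+    out += sample_next(out[-1])
+print(out)
+```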
+
+---
+
+## Requirements
+
+- **Python**: 3.10+
+- **OS**: Windows/macOS/Linux (UTF-8 locale recommended)
+- **Dependencies**:
+  - Stdlib only for Lessons 1–7
+  - **NumPy** for Lessons 8–10
+  - **PyTorch** (CPU is fine) from Lesson 11 onward
+- **Hardware**: CPU is enough for all lessons; tiny models, short runs
+
+Install the common dependencies (when a lesson needs them):
+
+```bash
+pip install numpy torch --upgrade
+```
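+
+### A note on `data.txt`
+
+The training corpus is deliberately git-ignored (see `.gitignore` above), and
+the prompt in `prompt.md` requires a built-in fallback when it is missing.
+A minimal sketch of that pattern, assuming the `load_text` signature named in
+the Lesson 1 spec below (the real lesson code may differ):
+
+```python
+# Load data.txt if present; otherwise fall back to a built-in sample.
+from pathlib import Path
+from typing import Optional
+
+FALLBACK = "To be, or not to be, that is the question.\n"  # stand-in sample
+
+def load_text(path: Optional[Path] = Path("data.txt")) -> str:
+    """Return the corpus as UTF-8 text, using FALLBACK when path is absent."""
+    text = path.read_text(encoding="utf-8") if path and path.exists() else FALLBACK
+    return text.replace("\r\n", "\n").replace("\r", "\n")  # normalize newlines
+
+print(f"{len(load_text())} chars loaded")
+```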
diff --git a/prompt.md b/prompt.md
new file mode 100644
index 0000000..a843511
--- /dev/null
+++ b/prompt.md
@@ -0,0 +1,83 @@
+# Prompt.md
+
+Copy the prompt below exactly to replicate this course:
+
+```text
+You are an expert Python instructor. Generate a complete, beginner-friendly course called
+“ARIA — Zero-to-Tiny LLM (Python)” that takes a learner from “Hello World” to training a tiny
+decoder-only, character-level LLM in ~17–18 single-file lessons. No safety/guardrail features;
+assume a controlled learning environment.
+
+=== Audience & Scope
+- Audience: absolute beginners who have only written “Hello World”.
+- Language: Python.
+- Goal: build up to a tiny decoder-only LLM trained on a small corpus (e.g., Tiny Shakespeare).
+- Keep each lesson runnable in a single .py file (≤ ~200 lines where feasible).
+
+=== Output Format (for EACH lesson)
+Use this exact section order:
+1) Title
+2) Duration (estimate)
+3) Outcome (what they will accomplish)
+4) Files to create (filenames)
+5) Dependencies (Python stdlib / NumPy / PyTorch as specified)
+6) Step-by-step Directions
+7) Starter code (complete, runnable) with:
+   - A clear module docstring that includes: what it does, how to run, and notes.
+   - Function-level Google-style docstrings (Args/Returns/Raises) + at least one doctest where reasonable.
+8) How to run (CLI commands)
+9) What you learned (bullets)
+10) Troubleshooting (common errors + fixes)
+11) Mini-exercises (3–5 quick tasks)
+12) What’s next (name the next lesson)
+
+=== Curriculum (keep these names and order)
+01) Read a Text File (with docstrings)
+02) Character Frequency Counter
+03) Train/Val Split
+04) Char Vocabulary + Encode/Decode
+05) Uniform Random Text Generator
+06) Bigram Counts Language Model
+07) Laplace Smoothing (compare w/ and w/o)
+08) Temperature & Top-k Sampling
+09) Perplexity on Validation
+10) NumPy Softmax + Cross-Entropy (toy)
+11) PyTorch Tensors 101
+12) Autograd Mini-Lab (fit y=2x+3)
+13) Char Bigram Neural LM (PyTorch)
+14) Sampling Function (PyTorch)
+15) Single-Head Self-Attention (causal mask)
+16) Mini Transformer Block (pre-LN)
+17) Tiny Decoder-Only Model (1–2 blocks)
+18) (Optional) Save/Load & CLI Interface
+
+=== Constraints & Defaults
+- Dataset: do NOT auto-download. Expect a local `data.txt`. If missing, include a tiny built-in fallback sample so scripts still run.
+- Encoding: UTF-8. Normalize newlines to "\n" for consistency.
+- Seeds: demonstrate reproducibility (`random`, `numpy`, `torch`).
+- Dependencies:
+  * Stdlib only for Lessons 1–7;
+  * NumPy in Lessons 8–10;
+  * PyTorch from Lesson 11 onward.
+- Training defaults (for Lessons 13+):
+  * Batch size ~32, block size ~128, AdamW(lr=3e-4).
+  * Brief note on early stopping when val loss plateaus.
+- Inference defaults:
+  * Start with greedy; then temperature=0.8, top-k=50.
+- Keep code clean: type hints where helpful; no frameworks beyond NumPy/PyTorch; no external data loaders.
+
+=== Lesson 1 Specifics
+For Lesson 1, include:
+- Module docstring with Usage example (`python 01_read_text.py`).
+- Functions: `load_text(path: Optional[Path])`, `normalize_newlines(text: str)`,
+  `make_preview(text: str, n_chars: int = 200)`, `report_stats(text: str)`, `main()`.
+- At least one doctest per function where reasonable.
+- Fallback text snippet if `data.txt` isn’t found.
+- Output: total chars, unique chars, 200-char preview with literal "\n".
+
+=== Delivery
+- Start with a short “How to use this repo” preface and a file tree suggestion.
+- Then render Lessons 01–18 in order, each with the exact section headings above.
+- End with a short FAQ (Windows vs. macOS paths, UTF-8 issues, CPU vs. GPU notes).
+
+Generate now.
+```
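+
+---
+
+*Not part of the prompt.* For reference, the inference defaults above (greedy
+first, then `temperature=0.8`, `top-k=50`) look roughly like this in PyTorch.
+An illustrative sketch, not the course's generated code:
+
+```python
+# Temperature + top-k sampling over next-token logits.
+import torch
+
+def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
+    """Sample one token id from a [vocab_size] logits vector."""
+    if temperature <= 0:  # treat 0 as greedy decoding
+        return int(torch.argmax(logits))
+    scaled = logits / temperature             # <1 sharpens, >1 flattens
+    topv, topi = torch.topk(scaled, min(top_k, scaled.size(-1)))
+    probs = torch.softmax(topv, dim=-1)       # renormalize over the top k
+    choice = torch.multinomial(probs, num_samples=1)
+    return int(topi[choice])
+
+torch.manual_seed(0)
+print(sample_next_token(torch.randn(65)))    # e.g., a 65-char vocabulary
+```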