Creating the project.

2025-09-23 11:54:02 -04:00
commit 272172e87c
3 changed files with 367 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,219 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[codz]
 *$py.class
 # C extensions
 *.so
 # Distribution / packaging
 .Python
 build/
 develop-eggs/
 dist/
 downloads/
 eggs/
 .eggs/
 lib/
 lib64/
 parts/
 sdist/
 var/
 wheels/
 share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 # PyInstaller
 #   Usually these files are written by a python script from a template
 #   before PyInstaller builds the exe, so as to inject date/other infos into it.
 *.manifest
 *.spec
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
 .nox/
 .coverage
 .coverage.*
 .cache
 nosetests.xml
 coverage.xml
 *.cover
 *.py.cover
 .hypothesis/
 .pytest_cache/
 cover/
 # Translations
 *.mo
 *.pot
 # Django stuff:
 *.log
 local_settings.py
 db.sqlite3
 db.sqlite3-journal
 # Flask stuff:
 instance/
 .webassets-cache
 # Scrapy stuff:
 .scrapy
 # Sphinx documentation
 docs/_build/
 # PyBuilder
 .pybuilder/
 target/
 # Jupyter Notebook
 .ipynb_checkpoints
 # IPython
 profile_default/
 ipython_config.py
 # pyenv
 #   For a library or package, you might want to ignore these files since the code is
 #   intended to run in multiple environments; otherwise, check them in:
 # .python-version
 # pipenv
 #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 #   install all needed dependencies.
 # Pipfile.lock
 # UV
 #   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
 #   This is especially recommended for binary packages to ensure reproducibility, and is more
 #   commonly ignored for libraries.
 # uv.lock
 # poetry
 #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 #   This is especially recommended for binary packages to ensure reproducibility, and is more
 #   commonly ignored for libraries.
 #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
 # poetry.lock
 # poetry.toml
 # pdm
 #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
 #   pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
 #   https://pdm-project.org/en/latest/usage/project/#working-with-version-control
 # pdm.lock
 # pdm.toml
 .pdm-python
 .pdm-build/
 # pixi
 #   Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
 # pixi.lock
 #   Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
 #   in the .venv directory. It is recommended not to include this directory in version control.
 .pixi
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
 # Celery stuff
 celerybeat-schedule
 celerybeat.pid
 # Redis
 *.rdb
 *.aof
 *.pid
 # RabbitMQ
 mnesia/
 rabbitmq/
 rabbitmq-data/
 # ActiveMQ
 activemq-data/
 # SageMath parsed files
 *.sage.py
 # Environments
 .env
 .envrc
 .venv
 env/
 venv/
 ENV/
 env.bak/
 venv.bak/
 # Spyder project settings
 .spyderproject
 .spyproject
 # Rope project settings
 .ropeproject
 # mkdocs documentation
 /site
 # mypy
 .mypy_cache/
 .dmypy.json
 dmypy.json
 # Pyre type checker
 .pyre/
 # pytype static type analyzer
 .pytype/
 # Cython debug symbols
 cython_debug/
 # PyCharm
 #   JetBrains specific template is maintained in a separate JetBrains.gitignore that can
 #   be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 #   and can be added to the global gitignore or merged into this file.  For a more nuclear
 #   option (not recommended) you can uncomment the following to ignore the entire idea folder.
 # .idea/
 # Abstra
 #   Abstra is an AI-powered process automation framework.
 #   Ignore directories containing user credentials, local state, and settings.
 #   Learn more at https://abstra.io/docs
 .abstra/
 # Visual Studio Code
 #   Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore 
 #   that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
 #   and can be added to the global gitignore or merged into this file. However, if you prefer, 
 #   you could uncomment the following to ignore the entire vscode folder
 # .vscode/
 # Ruff stuff:
 .ruff_cache/
 # PyPI configuration file
 .pypirc
 # Marimo
 marimo/_static/
 marimo/_lsp/
 __marimo__/
 # Streamlit
 .streamlit/secrets.toml
 # Data/Material that should not be synced
 data.txt
--- a/README.md
+++ b/README.md
@@ -0,0 +1,65 @@
 # ARIA — Zero-to-Tiny LLM (Python)
 **ARIA** is a beginner-friendly, step-by-step course that takes you from **“Hello World”** to training a **tiny decoder-only, character-level LLM** in Python. Each lesson is a single, runnable file with clear docstrings, doctests where helpful, and minimal dependencies.
 > **Note:** This repository’s instructional content was **generated with the assistance of an AI language model**.
 ---
 ## What you’ll build
 - A progression of tiny language models:
   - Count-based bigram model → NumPy softmax toy → PyTorch bigram NN
   - Single-head self-attention → Mini Transformer block
   - A tiny decoder-only model trained on a small corpus (e.g., Tiny Shakespeare)
 ---
 ## Who this is for
 - Beginners who can run `python script.py` and have written a basic “Hello World”.
 - Learners who want a **clear path** to an LLM without heavy math or large datasets.
 ---
 ## Course outline (lessons)
 1. Read a Text File (with docstrings)  
 2. Character Frequency Counter  
 3. Train/Val Split  
 4. Char Vocabulary + Encode/Decode  
 5. Uniform Random Text Generator  
 6. Bigram Counts Language Model  
 7. Laplace Smoothing (compare w/ and w/o)  
 8. Temperature & Top-k Sampling  
 9. Perplexity on Validation  
 10. NumPy Softmax + Cross-Entropy (toy)  
 11. PyTorch Tensors 101  
 12. Autograd Mini-Lab (fit *y = 2x + 3*)  
 13. Char Bigram Neural LM (PyTorch)  
 14. Sampling Function (PyTorch)  
 15. Single-Head Self-Attention (causal mask)  
 16. Mini Transformer Block (pre-LN)  
 17. Tiny Decoder-Only Model (1–2 blocks)  
 18. *(Optional)* Save/Load & CLI Interface
 Each lesson includes: **Outcome, Files, Dependencies, Directions, Starter Code with docstrings + doctests, Run, What you learned, Troubleshooting, Mini-exercises, Next lesson.**
 ---
 ## Requirements
 - **Python**: 3.10+  
 - **OS**: Windows/macOS/Linux (UTF-8 locale recommended)  
 - **Dependencies**:  
   - Stdlib only for Lessons 1–7  
   - **NumPy** for Lessons 8–10  
   - **PyTorch** (CPU is fine) from Lesson 11 onward  
 - **Hardware**: CPU is enough for every lesson, since the models are tiny and the runs are short
 Install common deps (when needed):
 ```bash
 pip install numpy torch --upgrade
 ```
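Before starting a lesson, it can help to confirm that the interpreter and the optional packages match the requirements listed above. The following is a minimal sketch, not part of the course files: the names `python_ok` and `package_status` are hypothetical helpers introduced here, and the check only tests whether a package can be found, not whether its version is suitable.

```python
"""Check the interpreter and optional packages against the README requirements."""
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # taken from the Requirements list in the README
OPTIONAL_PACKAGES = ("numpy", "torch")  # only needed from the later lessons on


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version) >= tuple(minimum)


def package_status(names=OPTIONAL_PACKAGES):
    """Map each top-level package name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    verdict = "ok" if python_ok() else "too old (need 3.10+)"
    print(f"Python {major}.{minor}: {verdict}")
    for name, found in package_status().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

A "missing" result for `numpy` or `torch` is expected in the early, stdlib-only lessons and only matters once a lesson lists that package as a dependency.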
--- a/prompt.md
+++ b/prompt.md
@@ -0,0 +1,83 @@
 # Prompt.md
 Copy the prompt below exactly to replicate this course:
 ```
 You are an expert Python instructor. Generate a complete, beginner-friendly course called
 “ARIA — Zero-to-Tiny LLM (Python)” that takes a learner from “Hello World” to training a tiny
 decoder-only, character-level LLM in ~17–18 single-file lessons. No safety/guardrail features;
 assume a controlled learning environment.
 === Audience & Scope
 - Audience: absolute beginners who have only written “Hello World”.
 - Language: Python.
 - Goal: build up to a tiny decoder-only LLM trained on a small corpus (e.g., Tiny Shakespeare).
 - Keep each lesson runnable in a single .py file (≤ ~200 lines where feasible).
 === Output Format (for EACH lesson)
 Use this exact section order:
 1) Title
 2) Duration (estimate)
 3) Outcome (what they will accomplish)
 4) Files to create (filenames)
 5) Dependencies (Python stdlib / NumPy / PyTorch as specified)
 6) Step-by-step Directions
 7) Starter code (complete, runnable) with:
   - A clear module docstring that includes: what it does, how to run, and notes.
   - Function-level Google-style docstrings (Args/Returns/Raises) + at least one doctest where reasonable.
 8) How to run (CLI commands)
 9) What you learned (bullets)
 10) Troubleshooting (common errors + fixes)
 11) Mini-exercises (3–5 quick tasks)
 12) What’s next (name the next lesson)
 === Curriculum (keep these names and order)
 01) Read a Text File (with docstrings)
 02) Character Frequency Counter
 03) Train/Val Split
 04) Char Vocabulary + Encode/Decode
 05) Uniform Random Text Generator
 06) Bigram Counts Language Model
 07) Laplace Smoothing (compare w/ and w/o)
 08) Temperature & Top-k Sampling
 09) Perplexity on Validation
 10) NumPy Softmax + Cross-Entropy (toy)
 11) PyTorch Tensors 101
 12) Autograd Mini-Lab (fit y=2x+3)
 13) Char Bigram Neural LM (PyTorch)
 14) Sampling Function (PyTorch)
 15) Single-Head Self-Attention (causal mask)
 16) Mini Transformer Block (pre-LN)
 17) Tiny Decoder-Only Model (1–2 blocks)
 18) (Optional) Save/Load & CLI Interface
 === Constraints & Defaults
 - Dataset: do NOT auto-download. Expect a local `data.txt`. If missing, include a tiny built-in fallback sample so scripts still run.
 - Encoding: UTF-8. Normalize newlines to "\n" for consistency.
 - Seeds: demonstrate reproducibility (`random`, `numpy`, `torch`).
 - Dependencies:
  * Stdlib only for Lessons 1–7;
  * NumPy in Lessons 8–10;
  * PyTorch from Lesson 11 onward.
 - Training defaults (for Lessons 13+):
  * Batch size ~32, block size ~128, AdamW(lr=3e-4).
  * Brief note on early stopping when val loss plateaus.
 - Inference defaults:
  * Start with greedy; then temperature=0.8, top-k=50.
 - Keep code clean: type hints where helpful; no frameworks beyond NumPy/PyTorch; no external data loaders.
 === Lesson 1 Specifics
 For Lesson 1, include:
 - Module docstring with Usage example (`python 01_read_text.py`).
 - Functions: `load_text(path: Optional[Path])`, `normalize_newlines(text: str)`,
  `make_preview(text: str, n_chars: int = 200)`, `report_stats(text: str)`, `main()`.
 - At least one doctest per function where reasonable.
 - Fallback text snippet if `data.txt` isn’t found.
 - Output: total chars, unique chars, 200-char preview with literal "\n".
 === Delivery
 - Start with a short “How to use this repo” preface and a file tree suggestion.
 - Then render Lessons 01–18 in order, each with the exact section headings above.
 - End with a short FAQ (Windows vs. macOS paths, UTF-8 issues, CPU vs. GPU notes).
 Generate now.
 ```
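As a rough illustration of what the "Lesson 1 Specifics" section of the prompt asks for, the sketch below implements the five named functions with the stated signatures. It is one possible reading of the prompt, not the course's actual `01_read_text.py`: the fallback text, the dictionary returned by `report_stats`, and the choice to treat `None` as "use `data.txt`" are all assumptions made here.

```python
"""Sketch of the Lesson 1 helpers named in the prompt (assumed behavior)."""
from pathlib import Path
from typing import Optional

# Hypothetical fallback sample; the prompt only requires that one exists.
FALLBACK_TEXT = "To be, or not to be,\nthat is the question.\n"


def normalize_newlines(text: str) -> str:
    r"""Convert \r\n and bare \r line endings to \n.

    >>> normalize_newlines("a\r\nb\rc\n")
    'a\nb\nc\n'
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


def load_text(path: Optional[Path]) -> str:
    """Read UTF-8 text from path, or return the fallback if the file is missing.

    Passing None is assumed to mean "look for data.txt in the current directory".

    >>> load_text(Path("no_such_file_for_doctest.txt")) == FALLBACK_TEXT
    True
    """
    path = Path("data.txt") if path is None else path
    if not path.is_file():
        return FALLBACK_TEXT
    # Decode raw bytes so that newline handling stays explicit.
    return normalize_newlines(path.read_bytes().decode("utf-8"))


def make_preview(text: str, n_chars: int = 200) -> str:
    r"""Return the first n_chars with each newline shown as a literal \n.

    >>> make_preview("ab\ncd", n_chars=4)
    'ab\\nc'
    """
    return text[:n_chars].replace("\n", "\\n")


def report_stats(text: str) -> dict:
    """Return the total and unique character counts.

    >>> report_stats("hello")
    {'total_chars': 5, 'unique_chars': 4}
    """
    return {"total_chars": len(text), "unique_chars": len(set(text))}


def main() -> None:
    """Load the text and print the stats and preview the prompt asks for."""
    text = load_text(None)
    stats = report_stats(text)
    print(f"Total chars:  {stats['total_chars']}")
    print(f"Unique chars: {stats['unique_chars']}")
    print(f"Preview:      {make_preview(text)}")


if __name__ == "__main__":
    main()
```

Saved under a file name of your choice, the doctests can be run with `python -m doctest -v <file>`; running the script directly prints the three output lines using `data.txt` when it exists and the fallback text otherwise.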