This commit introduces a new script that generates random text by uniformly sampling characters from the training data's vocabulary. It loads text from train.txt or falls back to data.txt, normalizes line endings, builds a sorted character vocabulary, and samples characters using a fixed RNG seed for reproducibility. The implementation includes command-line arguments for specifying generation length and random seed, making it configurable while maintaining consistent output for the same inputs.
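A minimal sketch of what such a sampler might look like, assuming the file names and CLI flags from the commit description; the function names and fallback text are illustrative, not taken from the actual script.

```python
import argparse
import random
from pathlib import Path

FALLBACK = "hello world\n"  # assumed built-in fallback text

def load_text() -> str:
    """Load train.txt, fall back to data.txt, then to a built-in sample."""
    for name in ("train.txt", "data.txt"):
        p = Path(name)
        if p.exists():
            text = p.read_text(encoding="utf-8")
            return text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    return FALLBACK

def main() -> None:
    parser = argparse.ArgumentParser(description="Uniform character sampler")
    parser.add_argument("--length", type=int, default=200)  # generation length
    parser.add_argument("--seed", type=int, default=0)      # fixed seed for reproducibility
    args = parser.parse_args()

    vocab = sorted(set(load_text()))   # sorted character vocabulary
    rng = random.Random(args.seed)     # seeded RNG -> same output for same inputs
    print("".join(rng.choice(vocab) for _ in range(args.length)))

if __name__ == "__main__":
    main()
```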
This commit introduces a new script that implements character-level vocabulary building and text encoding/decoding functionality. The script loads text from train.txt or falls back to data.txt, normalizes line endings, builds character-to-id mappings, and includes round-trip encoding/decoding validation. It's designed for CPU-only operation using only Python standard library modules and provides clear error handling for unseen characters during encoding.
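An illustrative sketch of the vocabulary and encode/decode helpers; names like `build_vocab`, `stoi`, and `itos` are assumptions, and only the behaviour described in the commit (character-to-id mapping, round-trip check, clear error on unseen characters) is taken as given.

```python
from typing import Dict, List

def build_vocab(text: str) -> Dict[str, int]:
    """Map each distinct character to an integer id (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}

def encode(text: str, stoi: Dict[str, int]) -> List[int]:
    """Convert text to ids, raising a clear error on characters outside the vocabulary."""
    try:
        return [stoi[ch] for ch in text]
    except KeyError as exc:
        raise ValueError(f"character {exc.args[0]!r} not in vocabulary") from None

def decode(ids: List[int], itos: Dict[int, str]) -> str:
    return "".join(itos[i] for i in ids)

# Round-trip validation as described in the commit
text = "abcabc"
stoi = build_vocab(text)
itos = {i: ch for ch, i in stoi.items()}
assert decode(encode(text, stoi), itos) == text
```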
- Added 03_train_val_split.py to create deterministic train/validation splits from data.txt or fallback text
- Updated .gitignore to un-comment .vscode/ directory exclusion
- Changed data.txt pattern to *.txt for better file matching in gitignore
- Script handles UTF-8 text loading with newline normalization and writes train.txt/val.txt files (see the sketch after this list)
- Includes doctest examples and proper type hints
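A minimal sketch of a deterministic split, assuming the split is a simple prefix/suffix cut of the normalized text; the 90/10 ratio and fallback text are illustrative, not taken from the script.

```python
from pathlib import Path

FALLBACK = "fallback sample text\n"  # assumed built-in fallback

def split_text(text: str, train_frac: float = 0.9) -> tuple[str, str]:
    """Split at a fixed fraction so repeated runs produce identical files."""
    cut = int(len(text) * train_frac)
    return text[:cut], text[cut:]

src = Path("data.txt")
raw = src.read_text(encoding="utf-8") if src.exists() else FALLBACK
raw = raw.replace("\r\n", "\n").replace("\r", "\n")   # normalize line endings

train, val = split_text(raw)
Path("train.txt").write_text(train, encoding="utf-8")
Path("val.txt").write_text(val, encoding="utf-8")
```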
This commit introduces a new script (02_char_freq.py) that analyzes character frequencies in text corpora. The script loads text from train.txt or data.txt (with fallback), normalizes all line endings to LF, counts character occurrences, sorts them by frequency (descending) and character (ascending), then prints a formatted ASCII table showing the top 50 most frequent characters. Newlines are displayed as the literal two-character sequence "\n" in the output. The script includes proper type hints, docstrings with doctests, and handles missing files gracefully with a built-in fallback text.
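A rough sketch of the counting and sort order described above; the exact table layout and column headings are assumptions, and the file-loading fallback chain is reduced to a comment for brevity.

```python
from collections import Counter
from pathlib import Path

# Simplified loading; the real script falls back from train.txt to data.txt to a built-in sample
text = Path("train.txt").read_text(encoding="utf-8")
text = text.replace("\r\n", "\n").replace("\r", "\n")   # normalize line endings to LF

counts = Counter(text)

# Sort by frequency descending, then by character ascending; keep the top 50
rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:50]

print(f"{'char':<6} {'count':>8}")
print("-" * 15)
for ch, n in rows:
    label = "\\n" if ch == "\n" else ch   # show the newline character as a literal \n
    print(f"{label:<6} {n:>8}")
```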
Adds a new script to read local text files, normalize line endings, and display character statistics and previews. The script handles missing data files gracefully by using a fallback sample and provides detailed output including total characters, unique characters, and a 200-character preview with literal newline representations.
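A minimal sketch of the loader and stats printout, assuming data.txt as the input file; the fallback sample and the exact wording of the output lines are illustrative.

```python
from pathlib import Path

FALLBACK = "To be, or not to be, that is the question.\n"  # assumed fallback sample

def load(path: str = "data.txt") -> str:
    p = Path(path)
    if not p.exists():
        return FALLBACK                                   # graceful handling of a missing file
    text = p.read_text(encoding="utf-8")
    return text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings

text = load()
print(f"total characters : {len(text)}")
print(f"unique characters: {len(set(text))}")
preview = text[:200].replace("\n", "\\n")                  # show newlines literally in the preview
print(f"preview          : {preview}")
```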
- Reformatted the README for better readability with consistent indentation and line breaks
- Restructured course outline with clear lesson numbering and descriptions
- Added detailed getting started instructions with step-by-step setup process
- Included repository layout diagram showing file organization
- Enhanced requirements section with clearer dependency structure
- Added a "What to expect" section outlining project characteristics and the learning approach