Commit Graph

6 Commits

Author SHA1 Message Date
feecf05ee3 feat: add vocabulary encoding/decoding script with character-level tokenization
This commit introduces a new script that implements character-level vocabulary building and text encoding/decoding functionality. The script loads text from train.txt or falls back to data.txt, normalizes line endings, builds character-to-id mappings, and includes round-trip encoding/decoding validation. It's designed for CPU-only operation using only Python standard library modules and provides clear error handling for unseen characters during encoding.
2025-09-23 20:57:48 -04:00
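The round trip described in this commit can be sketched as follows. This is a minimal illustration, not the committed script: the function names (`build_vocab`, `encode`, `decode`) and the choice of sorted character order for id assignment are assumptions, since the commit message does not state them.

```python
from typing import Dict, List, Tuple


def build_vocab(text: str) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build character-to-id and id-to-character maps (hypothetical helper).

    Characters are sorted so the mapping is deterministic; the real script
    may assign ids differently.
    """
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text: str, stoi: Dict[str, int]) -> List[int]:
    """Convert text to ids, raising a clear error for unseen characters."""
    ids = []
    for pos, ch in enumerate(text):
        if ch not in stoi:
            raise ValueError(f"unseen character {ch!r} at position {pos}")
        ids.append(stoi[ch])
    return ids


def decode(ids: List[int], itos: Dict[int, str]) -> str:
    """Convert ids back to text."""
    return "".join(itos[i] for i in ids)


# Normalize line endings to LF, then check that encoding round-trips.
sample = "hello\r\nworld".replace("\r\n", "\n").replace("\r", "\n")
stoi, itos = build_vocab(sample)
assert decode(encode(sample, stoi), itos) == sample
```

Raising `ValueError` on an unseen character is one reasonable reading of "clear error handling"; mapping unknown characters to a reserved id would be an alternative design.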
abba60a798 feat: Add train/val split script and update gitignore
- Added 03_train_val_split.py to create deterministic train/validation splits from data.txt or fallback text
- Updated .gitignore to un-comment .vscode/ directory exclusion
- Changed data.txt pattern to *.txt for better file matching in gitignore
- Script handles UTF-8 text loading with newline normalization and writes train.txt/val.txt files
- Includes doctest examples and proper type hints
2025-09-23 20:28:22 -04:00
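A deterministic split of the kind this commit describes might look like the sketch below. The commit message does not say how the split point is chosen, so the contiguous cut at a fixed `train_fraction` (and the names `normalize_newlines` and `split_text`) are assumptions made for illustration.

```python
from typing import Tuple


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_text(text: str, train_fraction: float = 0.9) -> Tuple[str, str]:
    """Split text into (train, val) at a fixed fraction of its length.

    No randomness is involved, so the same input always yields the same
    split. The 0.9 default is an assumed value.

    >>> split_text("abcdefghij", 0.5)
    ('abcde', 'fghij')
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(text) * train_fraction)
    return text[:cut], text[cut:]


train, val = split_text(normalize_newlines("line one\r\nline two\r\n"), 0.5)
# Concatenating the two parts recovers the normalized text.
assert train + val == "line one\nline two\n"
```

The two parts could then be written out as train.txt and val.txt with UTF-8 encoding, as the commit describes.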
a82efe6ea2 add character frequency counter script with newline normalization and sorted output
This commit introduces a new script (02_char_freq.py) that analyzes character frequencies in text corpora. The script loads text from train.txt or data.txt (with fallback), normalizes all line endings to LF, counts character occurrences, sorts them by frequency (descending) and character (ascending), then prints a formatted ASCII table showing the top 50 most frequent characters. Newlines are displayed as literal "\n" sequences in the output. The script includes proper type hints, docstrings with doctests, and handles missing files gracefully with a built-in fallback text.
2025-09-23 20:08:30 -04:00
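The counting and ordering rule from this commit (frequency descending, then character ascending) can be sketched with the standard library. The helper names `char_frequencies` and `display` and the table layout are hypothetical and not taken from 02_char_freq.py.

```python
from collections import Counter
from typing import List, Tuple


def char_frequencies(text: str) -> List[Tuple[str, int]]:
    """Count characters after normalizing line endings to LF.

    Results are sorted by count (descending), then by character (ascending).
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    counts = Counter(text)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def display(ch: str) -> str:
    """Show a newline as the two-character sequence backslash + n."""
    return "\\n" if ch == "\n" else ch


# Print at most the top 50 rows as a simple two-column table.
for ch, n in char_frequencies("banana\r\nband")[:50]:
    print(f"{display(ch):<4} {n:>6}")
```

Negating the count in the sort key gives a descending primary order while keeping the character tiebreak ascending in a single pass.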
68d9d00123 feat: add text file reader with normalization and stats preview
Adds a new script to read local text files, normalize line endings, and display character statistics and previews. The script handles missing data files gracefully by using a fallback sample and provides detailed output including total characters, unique characters, and a 200-character preview with literal newline representations.
2025-09-23 12:44:22 -04:00
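The reader described in this commit could be approximated as below. The fallback string, the function names `load_text` and `summarize`, and the output labels are assumptions for illustration; only the behaviors (fallback on a missing file, newline normalization, character totals, a 200-character preview with literal newlines) come from the commit message.

```python
from pathlib import Path

# Hypothetical fallback sample; the actual built-in text is not shown in the log.
FALLBACK_TEXT = "Hello, world!\nThis is a fallback sample.\n"


def load_text(path: str = "data.txt") -> str:
    """Read a UTF-8 file, or use the fallback if it is missing.

    Bytes are decoded manually so that line endings are normalized here
    rather than by the platform's text-mode translation.
    """
    p = Path(path)
    raw = p.read_bytes().decode("utf-8") if p.is_file() else FALLBACK_TEXT
    return raw.replace("\r\n", "\n").replace("\r", "\n")


def summarize(text: str, preview_len: int = 200) -> str:
    """Report total and unique character counts plus a short preview."""
    preview = text[:preview_len].replace("\n", "\\n")
    return (
        f"total characters: {len(text)}\n"
        f"unique characters: {len(set(text))}\n"
        f"preview: {preview}"
    )


print(summarize(load_text()))
```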
b1bb6fc705 docs: update README with improved formatting and structured content
- Reformatted the README for better readability with consistent indentation and line breaks
- Restructured course outline with clear lesson numbering and descriptions
- Added detailed getting started instructions with step-by-step setup process
- Included repository layout diagram showing file organization
- Enhanced requirements section with clearer dependency structure
- Added what to expect section outlining project characteristics and learning approach
2025-09-23 12:01:01 -04:00
272172e87c Creating the project.
2025-09-23 11:54:02 -04:00