1 Commit

Author SHA1 Message Date
feecf05ee3 feat: add vocabulary encoding/decoding script with character-level tokenization
This commit adds a script that builds a character-level vocabulary and encodes and decodes text with it. The script loads text from train.txt or falls back to data.txt, normalizes line endings, builds character-to-id mappings, and validates a round trip through encoding and decoding. It runs on CPU only, uses only Python standard library modules, and provides clear error handling for unseen characters during encoding.
2025-09-23 20:57:48 -04:00
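
The behavior described in the commit message can be sketched roughly as follows. This is a minimal illustration and not the committed script: the function names (`normalize_newlines`, `load_text`, `build_vocab`, `encode`, `decode`), the UTF-8 encoding, the sorted id order, and the choice of `ValueError` for unseen characters are all assumptions made for the example. Only the file names train.txt and data.txt come from the commit message.

```python
from pathlib import Path


def normalize_newlines(text):
    """Convert Windows (CRLF) and old Mac (CR) line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def load_text(candidates=("train.txt", "data.txt")):
    """Return the normalized contents of the first candidate file that exists.

    The order (train.txt first, then data.txt) follows the commit message;
    UTF-8 is an assumption.
    """
    for name in candidates:
        path = Path(name)
        if path.is_file():
            # Decode from bytes so the newline handling stays explicit.
            return normalize_newlines(path.read_bytes().decode("utf-8"))
    raise FileNotFoundError(f"none of {list(candidates)} found")


def build_vocab(text):
    """Build character-to-id and id-to-character mappings.

    Sorting the character set makes the ids deterministic (an assumption).
    """
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Map each character to its id; fail clearly on unseen characters."""
    ids = []
    for pos, ch in enumerate(text):
        if ch not in stoi:
            raise ValueError(f"unseen character {ch!r} at position {pos}")
        ids.append(stoi[ch])
    return ids


def decode(ids, itos):
    """Map a sequence of ids back to the string they represent."""
    return "".join(itos[i] for i in ids)
```

Under these assumptions, a round-trip check amounts to calling `load_text()`, passing the result to `build_vocab`, and asserting that `decode(encode(text, stoi), itos) == text`. Encoding a string that contains a character absent from the vocabulary raises `ValueError` naming the character and its position.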