mystiatech/Aria

6 Commits 1 Branch 0 Tags

Author	SHA1	Message	Date
Dani	feecf05ee3	feat: add vocabulary encoding/decoding script with character-level tokenization. This commit introduces a new script that implements character-level vocabulary building and text encoding/decoding. The script loads text from train.txt or falls back to data.txt, normalizes line endings, builds character-to-id mappings, and includes round-trip encoding/decoding validation. It is designed for CPU-only operation using only Python standard library modules and provides clear error handling for unseen characters during encoding.	2025-09-23 20:57:48 -04:00
Dani	abba60a798	feat: Add train/val split script and update gitignore. Added 03_train_val_split.py to create deterministic train/validation splits from data.txt or fallback text; updated .gitignore to un-comment the .vscode/ directory exclusion; changed the data.txt pattern to *.txt for better file matching in .gitignore; the script handles UTF-8 text loading with newline normalization and writes train.txt/val.txt files; includes doctest examples and proper type hints.	2025-09-23 20:28:22 -04:00
Dani	a82efe6ea2	add character frequency counter script with newline normalization and sorted output. This commit introduces a new script (02_char_freq.py) that analyzes character frequencies in text corpora. The script loads text from train.txt or data.txt (with fallback), normalizes all line endings to LF, counts character occurrences, sorts them by frequency (descending) and then by character (ascending), and prints a formatted ASCII table of the 50 most frequent characters. Newlines are displayed as literal "\\n" sequences in the output. The script includes proper type hints and docstrings with doctests, and handles missing files gracefully with a built-in fallback text.	2025-09-23 20:08:30 -04:00
Dani	68d9d00123	feat: add text file reader with normalization and stats preview. Adds a new script to read local text files, normalize line endings, and display character statistics and previews. The script handles missing data files gracefully by using a fallback sample and provides detailed output including total characters, unique characters, and a 200-character preview with literal newline representations.	2025-09-23 12:44:22 -04:00
Dani	b1bb6fc705	docs: update README with improved formatting and structured content. Reformatted the README for better readability with consistent indentation and line breaks; restructured the course outline with clear lesson numbering and descriptions; added detailed getting-started instructions with a step-by-step setup process; included a repository layout diagram showing file organization; enhanced the requirements section with a clearer dependency structure; added a "what to expect" section outlining project characteristics and learning approach.	2025-09-23 12:01:01 -04:00
Dani	272172e87c	Creating the project.	2025-09-23 11:54:02 -04:00
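
Commit feecf05ee3 describes character-level vocabulary building with round-trip validation and an error for unseen characters. The following is a minimal sketch of that idea, not the repository's actual script: the function names (build_vocab, encode, decode), the sorted id ordering, and the choice of ValueError are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def build_vocab(text: str) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Map each distinct character to an integer id.

    Sorting the characters is an assumption here; it makes ids reproducible.
    """
    chars = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(chars)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char


def encode(text: str, char_to_id: Dict[str, int]) -> List[int]:
    """Turn text into ids, failing loudly on characters outside the vocabulary."""
    ids = []
    for pos, ch in enumerate(text):
        if ch not in char_to_id:
            # Hypothetical error type; the real script may report this differently.
            raise ValueError(f"unseen character {ch!r} at position {pos}")
        ids.append(char_to_id[ch])
    return ids


def decode(ids: List[int], id_to_char: Dict[int, str]) -> str:
    """Turn ids back into text."""
    return "".join(id_to_char[i] for i in ids)


# Normalize CRLF and lone CR to LF before building the vocabulary.
sample = "hello\r\nworld".replace("\r\n", "\n").replace("\r", "\n")
char_to_id, id_to_char = build_vocab(sample)

# Round-trip check: decoding the encoding should give back the input.
assert decode(encode(sample, char_to_id), id_to_char) == sample
```

Encoding a string that contains a character absent from the sample (for example "z") raises the error instead of silently dropping or remapping it.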
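
Commit abba60a798 mentions a deterministic train/validation split, but the log does not say how the split is made. The sketch below assumes a line-level split with a seeded shuffle; the function name train_val_split, the 10% validation fraction, the seed value, and the dropping of blank lines are hypothetical choices, and the real 03_train_val_split.py may work differently (for instance by writing train.txt and val.txt directly).

```python
import random
from typing import Tuple


def train_val_split(
    text: str, val_fraction: float = 0.1, seed: int = 1337
) -> Tuple[str, str]:
    """Split text into (train, val) by lines, reproducibly for a given seed."""
    # Normalize CRLF and lone CR to LF so line boundaries are consistent.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: blank lines are dropped before splitting.
    lines = [ln for ln in text.split("\n") if ln]

    # A private Random instance keeps the shuffle independent of global state.
    rng = random.Random(seed)
    rng.shuffle(lines)

    # Always hold out at least one line for validation.
    n_val = max(1, int(len(lines) * val_fraction))
    val, train = lines[:n_val], lines[n_val:]
    return "\n".join(train), "\n".join(val)


corpus = "\n".join(f"line {i}" for i in range(20))
train_text, val_text = train_val_split(corpus)
```

Because the shuffle is seeded, calling train_val_split again on the same corpus returns the same two strings, which is the property the commit message calls deterministic.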
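
Commit a82efe6ea2 describes counting characters, sorting by frequency (descending) and then by character (ascending), and printing the top 50 with newlines shown as a literal backslash-n. One way this could look is sketched below; the function names char_frequencies and format_table and the exact table layout are assumptions, not taken from 02_char_freq.py.

```python
from collections import Counter
from typing import List, Tuple


def char_frequencies(text: str, top_n: int = 50) -> List[Tuple[str, int]]:
    """Return up to top_n (char, count) pairs, most frequent first."""
    # Normalize all line endings to LF before counting.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    counts = Counter(text)
    # Negative count gives descending frequency; the character breaks ties ascending.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def format_table(rows: List[Tuple[str, int]]) -> str:
    """Render rows as a plain ASCII table (hypothetical layout)."""
    lines = ["char | count", "-----+------"]
    for ch, n in rows:
        # Show a newline as the two characters backslash and n.
        shown = "\\n" if ch == "\n" else ch
        lines.append(f"{shown:<4} | {n:>5}")
    return "\n".join(lines)


rows = char_frequencies("banana\r\nab")
print(format_table(rows))
```

The two-part sort key is what keeps the output stable: characters with equal counts always appear in the same order from run to run.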
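
Commit 68d9d00123 describes reading a text file, falling back to a sample when the file is missing, normalizing line endings, and reporting total characters, unique characters, and a 200-character preview with literal newline representations. A rough sketch under those stated behaviors follows; the names load_text and summarize, the FALLBACK string, and the dictionary return shape are hypothetical.

```python
from pathlib import Path
from typing import Dict, Union

# Hypothetical fallback sample; the real script's built-in text is not shown in the log.
FALLBACK = "Hello, Aria!\nThis is a fallback sample.\n"


def load_text(path: str = "data.txt") -> str:
    """Read UTF-8 text, or use the fallback if the file does not exist."""
    p = Path(path)
    raw = p.read_text(encoding="utf-8") if p.is_file() else FALLBACK
    # Normalize CRLF and lone CR to LF.
    return raw.replace("\r\n", "\n").replace("\r", "\n")


def summarize(text: str, preview_len: int = 200) -> Dict[str, Union[int, str]]:
    """Collect basic statistics and a preview with newlines made visible."""
    return {
        "total_chars": len(text),
        "unique_chars": len(set(text)),
        # Replace each newline with the two characters backslash and n.
        "preview": text[:preview_len].replace("\n", "\\n"),
    }


stats = summarize(load_text("data.txt"))
print(f"total: {stats['total_chars']}, unique: {stats['unique_chars']}")
print(f"preview: {stats['preview']}")
```

Normalizing before measuring means the totals do not depend on whether the file was saved with Windows or Unix line endings.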