This commit introduces a new script that implements a Laplace-smoothed bigram language model for computing validation perplexity. The implementation includes:
- Data loading and splitting functionality (90/10 train/validation split)
- Character vocabulary building from training data only
- Bigram counting and Laplace smoothing with alpha=1.0
- Negative log-likelihood and perplexity computation
- Proper handling of out-of-vocabulary characters during evaluation (the core pieces are sketched below)
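
As a rough illustration of how these pieces fit together, here is a minimal sketch of the core model. The smoothed conditional probability is P(c2 | c1) = (count(c1, c2) + alpha) / (count(c1, *) + alpha * V), and perplexity is exp(NLL / N) over the N validation bigrams. The function names, the `<unk>` bucket, and the exact OOV strategy are assumptions for illustration, not the script's actual code:

```python
import math
from collections import defaultdict

ALPHA = 1.0    # Laplace smoothing constant, as stated above
UNK = "<unk>"  # hypothetical bucket for out-of-vocabulary characters

def build_vocab(train_text):
    # Vocabulary comes from the training data only; <unk> absorbs
    # any validation character never seen during training.
    return sorted(set(train_text)) + [UNK]

def count_bigrams(text, vocab):
    # Count adjacent character pairs, mapping unknowns to <unk>.
    known = set(vocab)
    chars = [c if c in known else UNK for c in text]
    counts = defaultdict(lambda: defaultdict(int))
    for c1, c2 in zip(chars, chars[1:]):
        counts[c1][c2] += 1
    return counts

def bigram_prob(counts, vocab_size, c1, c2, alpha=ALPHA):
    # Laplace-smoothed P(c2 | c1):
    #   (count(c1, c2) + alpha) / (count(c1, *) + alpha * V)
    row = counts.get(c1, {})
    total = sum(row.values())
    return (row.get(c2, 0) + alpha) / (total + alpha * vocab_size)

def perplexity(val_text, counts, vocab):
    # Perplexity = exp(mean negative log-likelihood over bigrams).
    known = set(vocab)
    chars = [c if c in known else UNK for c in val_text]
    nll = 0.0
    for c1, c2 in zip(chars, chars[1:]):
        nll -= math.log(bigram_prob(counts, len(vocab), c1, c2))
    n = max(len(chars) - 1, 1)
    return math.exp(nll / n)
```

Because alpha = 1.0 gives every bigram nonzero probability mass, the log never sees zero, even for validation bigrams that never occurred in training.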
The script reads existing train.txt/val.txt files, or automatically splits a data.txt file 90/10 when those inputs are missing, making it self-contained for language model evaluation; a sketch of that fallback follows.
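
A possible shape for the fallback, assuming a shuffled line-level split with a fixed seed (the script may instead use a contiguous character-level split; the helper name and seed are illustrative):

```python
import os
import random

def ensure_splits(data_path="data.txt", train_path="train.txt",
                  val_path="val.txt", val_frac=0.1, seed=0):
    # Reuse existing splits if both files are already present.
    if os.path.exists(train_path) and os.path.exists(val_path):
        return train_path, val_path
    # Otherwise split data.txt into 90% train / 10% validation.
    with open(data_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * (1 - val_frac))
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:cut])
    with open(val_path, "w", encoding="utf-8") as f:
        f.writelines(lines[cut:])
    return train_path, val_path
```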