Initial commit: NOVA - Neuro-Optimizing Versatile Agent
Complete transformer LLM built from scratch with:

Core Features:
- Full transformer architecture (RoPE, RMSNorm, SwiGLU, KV-cache)
- SentencePiece tokenizer (BPE/Unigram)
- Training pipeline (AMP, gradient checkpointing, DDP)
- Persona system with personality matrix (NO AI disclosure by default)
- Genetic evolution (NOVA-EVO) for hyperparameter optimization
- Legal-only data pipeline with license tracking
- Chat interface (CLI + REST API)
- Conversation memory (SQLite)

Model Sizes:
- 125M, 350M, 1.3B, 3B parameters
- Local-first, runs on CPU or GPU
- Python 3.10.6+, PyTorch 2.0+

Personas:
- girlfriend_gentle (high warmth, high empathy)
- girlfriend_playful (high humor, high playfulness)
- girlfriend_supportive (balanced, default)

Documentation:
- Complete README with quickstart
- Model card with ethical considerations
- Privacy documentation (local-first, zero telemetry)
- Data licenses and attribution
- Contributing guide

Infrastructure:
- GitHub Actions CI/CD
- Comprehensive test suite
- Quickstart script
- CLI tool

License: Apache 2.0

🤖 Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
227
docs/CONTRIBUTING.md
Normal file
@@ -0,0 +1,227 @@
|
||||
# Contributing to NOVA
|
||||
|
||||
Thank you for your interest in contributing to NOVA! This document provides guidelines for contributing.
|
||||
|
||||
---
|
||||
|
||||
## How to Contribute
|
||||
|
||||
### Reporting Issues
|
||||
|
||||
**Bug Reports:**
|
||||
1. Check existing issues first
2. Use the bug report template
3. Include:
   - Python version
   - OS and hardware
   - Steps to reproduce
   - Expected vs actual behavior
   - Error messages/logs
|
||||
|
||||
**Feature Requests:**
|
||||
1. Check if already proposed
|
||||
2. Explain the use case
|
||||
3. Describe the desired behavior
|
||||
|
||||
### Code Contributions
|
||||
|
||||
**Setup Development Environment:**
|
||||
|
||||
```bash
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/yourusername/nova.git
cd nova

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies, then the package in editable mode with dev extras
pip install -r requirements.txt
pip install -e ".[dev]"  # quoted so shells such as zsh do not expand the brackets
```
|
||||
|
||||
**Before Submitting:**
|
||||
|
||||
1. **Run Tests:**
   ```bash
   pytest tests/ -v
   ```

2. **Lint Code:**
   ```bash
   ruff check .
   black --check .
   ```

3. **Format Code:**
   ```bash
   black nova_core/ nova_tokenizer/ nova_train/ nova_evo/ nova_chat/
   ```

4. **Type Check (optional but recommended):**
   ```bash
   mypy nova_core/ --ignore-missing-imports
   ```
|
||||
|
||||
### Pull Request Process
|
||||
|
||||
1. **Branch Naming:**
   - `feature/description` for new features
   - `fix/description` for bug fixes
   - `docs/description` for documentation

2. **Commit Messages:**
   - Clear, descriptive messages
   - Reference issues: `Fix #123: Description`

3. **PR Description:**
   - What changed
   - Why the change
   - Testing performed
   - Screenshots (if UI changes)

4. **Review Process:**
   - CI must pass
   - At least one approval required
   - Address review feedback
|
||||
|
||||
---
|
||||
|
||||
## Development Guidelines
|
||||
|
||||
### Code Style
|
||||
|
||||
**Python:**
|
||||
- Follow PEP 8
|
||||
- Use Black formatter (line length 100)
|
||||
- Type hints encouraged
|
||||
- Docstrings for public APIs
|
||||
|
||||
**Example:**
|
||||
```python
def example_function(param: str, optional: int = 0) -> bool:
    """
    Brief description.

    Args:
        param: Description
        optional: Description (default: 0)

    Returns:
        Description
    """
    return True
```
|
||||
|
||||
### Testing
|
||||
|
||||
**Write Tests For:**
|
||||
- New features
|
||||
- Bug fixes
|
||||
- Public APIs
|
||||
|
||||
**Test Locations:**
|
||||
- `tests/test_core.py` - Core transformer
|
||||
- `tests/test_tokenizer.py` - Tokenizer
|
||||
- `tests/test_persona.py` - Persona system
|
||||
- `tests/test_<module>.py` - Other modules
|
||||
|
||||
**Run Tests:**
|
||||
```bash
# All tests
pytest

# Specific file
pytest tests/test_core.py

# With coverage
pytest --cov=nova_core
```
|
||||
|
||||
### Documentation
|
||||
|
||||
**Update Docs For:**
|
||||
- API changes
|
||||
- New features
|
||||
- Configuration options
|
||||
|
||||
**Documentation Files:**
|
||||
- `README.md` - Main documentation
|
||||
- `docs/MODEL_CARD.md` - Model information
|
||||
- `docs/PRIVACY_LOCAL.md` - Privacy details
|
||||
- `docs/DATA_LICENSES.md` - Data licensing
|
||||
|
||||
---
|
||||
|
||||
## Contribution Areas
|
||||
|
||||
### High Priority
|
||||
|
||||
- **Pre-trained Models:** Training and releasing checkpoints
|
||||
- **Export Tools:** GGUF converter, quantization improvements
|
||||
- **Evaluation Suite:** Comprehensive benchmarks
|
||||
- **Dataset Downloaders:** Legal dataset acquisition scripts
|
||||
|
||||
### Medium Priority
|
||||
|
||||
- **LoRA Support:** Fine-tuning with adapters
|
||||
- **Multi-language:** Support for non-English
|
||||
- **Performance:** Optimization improvements
|
||||
- **Tests:** Increase coverage
|
||||
|
||||
### Documentation
|
||||
|
||||
- **Tutorials:** Step-by-step guides
|
||||
- **Examples:** Real-world use cases
|
||||
- **API Docs:** Complete API documentation
|
||||
- **Architecture:** Deep-dive technical docs
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
|
||||
By contributing, you agree that your contributions will be licensed under Apache License 2.0.
|
||||
|
||||
---
|
||||
|
||||
## Code of Conduct
|
||||
|
||||
### Our Pledge
|
||||
|
||||
- Be respectful and inclusive
|
||||
- Welcome newcomers
|
||||
- Focus on constructive feedback
|
||||
- Assume good intentions
|
||||
|
||||
### Unacceptable Behavior
|
||||
|
||||
- Harassment or discrimination
|
||||
- Trolling or insulting comments
|
||||
- Publishing others' private information
|
||||
- Other unprofessional conduct
|
||||
|
||||
### Enforcement
|
||||
|
||||
Violations can be reported to project maintainers. All complaints will be reviewed and investigated.
|
||||
|
||||
---
|
||||
|
||||
## Questions?
|
||||
|
||||
- **Discussions:** GitHub Discussions
|
||||
- **Issues:** GitHub Issues
|
||||
- **General:** Open an issue with the "question" label
|
||||
|
||||
---
|
||||
|
||||
## Recognition
|
||||
|
||||
Contributors will be:
|
||||
- Listed in CONTRIBUTORS.md
|
||||
- Mentioned in release notes
|
||||
- Credited for significant features
|
||||
|
||||
---
|
||||
|
||||
Thank you for contributing to NOVA! 🌟
|
315
docs/DATA_LICENSES.md
Normal file
@@ -0,0 +1,315 @@
|
||||
# Data Licenses and Attribution
|
||||
|
||||
NOVA is committed to using **only legally licensed datasets** for training. This document tracks all approved data sources and their licenses.
|
||||
|
||||
---
|
||||
|
||||
## License Philosophy
|
||||
|
||||
### What We Use
|
||||
|
||||
✅ **Public Domain:** No restrictions
✅ **CC0:** Public domain dedication
✅ **CC-BY, CC-BY-SA, ODC-BY:** Attribution required (CC-BY-SA adds a share-alike condition)
✅ **MIT/Apache/BSD:** Permissive open source
|
||||
|
||||
### What We DON'T Use
|
||||
|
||||
❌ **All Rights Reserved:** Copyrighted without permission
|
||||
❌ **CC-BY-NC:** Non-commercial restrictions
|
||||
❌ **CC-BY-ND:** No derivatives restrictions
|
||||
❌ **Unknown/Unlicensed:** No verified license
|
||||
❌ **Scraped Web Data:** Without license verification
|
||||
|
||||
---
|
||||
|
||||
## Approved Dataset Sources
|
||||
|
||||
### 1. Wikipedia (English)
|
||||
|
||||
**License:** CC-BY-SA 3.0
|
||||
**URL:** https://dumps.wikimedia.org/
|
||||
**Size:** ~20 GB (compressed)
|
||||
**Language:** English
|
||||
**Description:** English Wikipedia articles
|
||||
|
||||
**Attribution:**
|
||||
> Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.
|
||||
|
||||
**Usage:** Text data for general knowledge
|
||||
|
||||
---
|
||||
|
||||
### 2. Project Gutenberg
|
||||
|
||||
**License:** Public Domain
|
||||
**URL:** https://www.gutenberg.org/
|
||||
**Size:** ~15 GB
|
||||
**Language:** Primarily English
|
||||
**Description:** Public domain books (in the US, works published before 1930 as of 2025)
|
||||
|
||||
**Attribution:**
|
||||
> Project Gutenberg. Public domain literary works.
|
||||
|
||||
**Usage:** Literary text, historical documents
|
||||
|
||||
---
|
||||
|
||||
### 3. OpenWebText
|
||||
|
||||
**License:** CC0 1.0 (Public Domain Dedication)
|
||||
**URL:** https://huggingface.co/datasets/Skylion007/openwebtext
|
||||
**Size:** ~38 GB
|
||||
**Language:** English
|
||||
**Description:** Open reproduction of WebText (Reddit links)
|
||||
|
||||
**Attribution:**
|
||||
> OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.
|
||||
|
||||
**Usage:** Web-scraped text (Reddit-filtered)
|
||||
|
||||
---
|
||||
|
||||
### 4. C4 (Colossal Clean Crawled Corpus)
|
||||
|
||||
**License:** ODC-BY (Open Data Commons Attribution)
|
||||
**URL:** https://huggingface.co/datasets/c4
|
||||
**Size:** ~300 GB (en subset)
|
||||
**Language:** English
|
||||
**Description:** Cleaned Common Crawl data
|
||||
|
||||
**Attribution:**
|
||||
> C4 dataset from Google's T5 paper. ODC-BY license.
|
||||
|
||||
**Usage:** Large-scale web text
|
||||
|
||||
---
|
||||
|
||||
### 5. The Pile - ArXiv Subset
|
||||
|
||||
**License:** Various (mostly permissive for ArXiv subset)
|
||||
**URL:** https://pile.eleuther.ai/
|
||||
**Size:** ~60 GB (ArXiv subset)
|
||||
**Language:** English
|
||||
**Description:** ArXiv papers (scientific articles)
|
||||
|
||||
**Attribution:**
|
||||
> The Pile by EleutherAI. ArXiv papers subset.
|
||||
|
||||
**Usage:** Scientific and technical text
|
||||
|
||||
**Note:** Only use subsets with verified permissive licenses
|
||||
|
||||
---
|
||||
|
||||
## License Tracking System
|
||||
|
||||
### Ledger File
|
||||
|
||||
All downloaded datasets are tracked in:

```
data/processed/license_ledger.json
```

**Format:**

```json
{
  "sources": [
    {
      "name": "wikipedia-en",
      "license": "cc-by-sa-3.0",
      "url": "https://dumps.wikimedia.org/enwiki/",
      "download_date": "2025-01-15",
      "size_gb": 20.5,
      "attribution": "Wikipedia contributors..."
    }
  ]
}
```
|
||||
|
||||
### Verification
|
||||
|
||||
Before training, verify licenses:
|
||||
|
||||
```bash
python -m nova_data.pipeline verify_licenses
```
|
||||
|
||||
This checks that all data sources have approved licenses.
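
As a rough illustration of what such a check involves, the sketch below compares each ledger entry against an allow-list. It is not the `nova_data.pipeline` implementation: the function name `find_unapproved_sources` and the license identifiers in `APPROVED_LICENSES` are assumptions made for this example, and only the ledger fields `name` and `license` come from the format shown above.

```python
import json

# Hypothetical allow-list mirroring the "What We Use" section; the real
# pipeline may use different identifiers.
APPROVED_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-3.0",
    "odc-by", "mit", "apache-2.0", "bsd-3-clause",
}


def find_unapproved_sources(ledger: dict) -> list[str]:
    """Return the names of ledger sources whose license is not on the allow-list."""
    unapproved = []
    for source in ledger.get("sources", []):
        license_id = str(source.get("license", "")).strip().lower()
        if license_id not in APPROVED_LICENSES:
            unapproved.append(source.get("name", "<unnamed>"))
    return unapproved


# Made-up ledger content, used only to exercise the check.
example_ledger = json.loads("""
{
  "sources": [
    {"name": "wikipedia-en", "license": "cc-by-sa-3.0"},
    {"name": "mystery-scrape", "license": "unknown"}
  ]
}
""")

print(find_unapproved_sources(example_ledger))  # ['mystery-scrape']
```

A source with a missing or misspelled license field is reported as unapproved, which is the safer default for a legal-only pipeline.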
|
||||
|
||||
---
|
||||
|
||||
## Attribution Requirements
|
||||
|
||||
### CC-BY Datasets
|
||||
|
||||
**Required:**
|
||||
- Attribute the original creator
|
||||
- Include license name
|
||||
- Link to license
|
||||
- Indicate if changes were made
|
||||
|
||||
**Our Attribution:**
|
||||
|
||||
All NOVA models trained on CC-BY data include:
|
||||
|
||||
> This model was trained on data including:
|
||||
> - Wikipedia (CC-BY-SA 3.0)
|
||||
> - [Other CC-BY sources]
|
||||
>
|
||||
> Full attributions in DATA_LICENSES.md
|
||||
|
||||
### Public Domain
|
||||
|
||||
**Required:** None (but we attribute anyway for transparency)
|
||||
|
||||
---
|
||||
|
||||
## Custom Datasets
|
||||
|
||||
### User-Provided Data
|
||||
|
||||
If training NOVA on your own data:
|
||||
|
||||
**Your Responsibility:**
|
||||
- Ensure you have rights to use the data
|
||||
- Verify any license requirements
|
||||
- Add custom sources to ledger
|
||||
|
||||
**Example:**
|
||||
```yaml
# configs/data/custom.yaml
sources:
  - name: my-custom-dataset
    license: mit  # or your license
    path: /path/to/data
    description: My custom training data
```
|
||||
|
||||
---
|
||||
|
||||
## Commercial Use Considerations
|
||||
|
||||
### NOVA Code
|
||||
|
||||
**License:** Apache 2.0
|
||||
**Commercial Use:** ✅ Allowed
|
||||
|
||||
### Training Data
|
||||
|
||||
Depends on dataset:
|
||||
|
||||
| Dataset | Commercial Use |
|---------|----------------|
| Wikipedia | ✅ Allowed (with attribution) |
| Project Gutenberg | ✅ Allowed (public domain) |
| OpenWebText | ✅ Allowed (CC0) |
| C4 | ✅ Allowed (ODC-BY, with attribution) |
| The Pile (ArXiv) | ⚠️ Verify per-subset |
|
||||
|
||||
**Recommendation:** Review each dataset's license for commercial projects.
|
||||
|
||||
---
|
||||
|
||||
## Excluded Sources
|
||||
|
||||
### Why We Don't Use Certain Data
|
||||
|
||||
**Common Crawl (raw):**
|
||||
- Contains copyrighted material
|
||||
- License status unclear for many pages
|
||||
- We use filtered versions (C4) instead
|
||||
|
||||
**Social Media (Twitter, etc.):**
|
||||
- Terms of Service restrictions
|
||||
- Privacy concerns
|
||||
- Unclear licensing
|
||||
|
||||
**Books3/LibGen:**
|
||||
- Contains copyrighted books
|
||||
- Legal issues
|
||||
- Not permissively licensed
|
||||
|
||||
**YouTube Subtitles:**
|
||||
- Copyright unclear
|
||||
- TOS restrictions
|
||||
|
||||
---
|
||||
|
||||
## Compliance Checklist
|
||||
|
||||
Before training NOVA:
|
||||
|
||||
- [ ] All data sources listed in `license_ledger.json`
|
||||
- [ ] Each source has verified license
|
||||
- [ ] Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
|
||||
- [ ] Attribution prepared for CC-BY sources
|
||||
- [ ] No excluded sources used
|
||||
|
||||
---
|
||||
|
||||
## Future Datasets
|
||||
|
||||
### Planned Additions
|
||||
|
||||
We're evaluating these sources:
|
||||
|
||||
- **BookCorpus:** Open domain books (pending license review)
|
||||
- **Stack Exchange:** CC-BY-SA (with attribution)
|
||||
- **OpenSubtitles:** Public domain/permissive subset
|
||||
- **Code datasets:** GitHub permissive licenses (MIT, Apache, BSD)
|
||||
|
||||
**Criteria:**
|
||||
- Clear, permissive license
|
||||
- High quality
|
||||
- Legally distributable
|
||||
|
||||
---
|
||||
|
||||
## Dataset Removal Requests
|
||||
|
||||
If you believe we've incorrectly listed a dataset:
|
||||
|
||||
1. Open an issue: [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
2. Include:
   - Dataset name
   - License concern
   - Supporting documentation
3. We'll review and respond within 7 days
|
||||
|
||||
---
|
||||
|
||||
## Legal Disclaimer
|
||||
|
||||
**This project aims for legal compliance, but:**
|
||||
|
||||
- We're not lawyers
|
||||
- License interpretation may vary by jurisdiction
|
||||
- Users are responsible for their own compliance
|
||||
- Consult legal counsel for commercial use
|
||||
|
||||
**NOVA project provides this information for transparency, but makes no warranties about legal compliance.**
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
### License Texts
|
||||
|
||||
- **CC-BY 4.0:** https://creativecommons.org/licenses/by/4.0/
|
||||
- **CC0 1.0:** https://creativecommons.org/publicdomain/zero/1.0/
|
||||
- **Apache 2.0:** https://www.apache.org/licenses/LICENSE-2.0
|
||||
- **MIT:** https://opensource.org/licenses/MIT
|
||||
- **ODC-BY:** https://opendatacommons.org/licenses/by/
|
||||
|
||||
### Resources
|
||||
|
||||
- Creative Commons: https://creativecommons.org/
|
||||
- Open Data Commons: https://opendatacommons.org/
|
||||
- OSI Licenses: https://opensource.org/licenses
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** 2025
|
||||
**Document Version:** 1.0
|
||||
**Review Frequency:** Quarterly
|
232
docs/MODEL_CARD.md
Normal file
@@ -0,0 +1,232 @@
|
||||
# NOVA Model Card
|
||||
|
||||
## Model Details
|
||||
|
||||
**Name:** NOVA (Neuro-Optimizing Versatile Agent)
|
||||
**Version:** 0.1.0
|
||||
**Date:** 2025
|
||||
**License:** Apache 2.0
|
||||
**Type:** Decoder-only transformer language model
|
||||
|
||||
### Model Sizes
|
||||
|
||||
NOVA comes in four sizes:
|
||||
|
||||
| Size | Parameters | Layers | Hidden Size | Attention Heads | Context Length |
|------|------------|--------|-------------|-----------------|----------------|
| 125M | 125M | 12 | 768 | 12 | 2048 |
| 350M | 350M | 24 | 1024 | 16 | 2048 |
| 1.3B | 1.3B | 24 | 2048 | 32 (8 KV) | 2048 |
| 3B | 3B | 32 | 2560 | 32 (8 KV) | 4096 |
|
||||
|
||||
### Architecture
|
||||
|
||||
- **Positional Encoding:** RoPE (Rotary Position Embedding)
|
||||
- **Normalization:** RMSNorm (default) or LayerNorm
|
||||
- **Activation:** SwiGLU (default), GeGLU, or GELU
|
||||
- **Attention:** Multi-head with optional grouped-query attention (GQA)
|
||||
- **Features:** KV-cache, gradient checkpointing, Flash Attention support
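
To make the default normalization concrete, here is a minimal sketch of RMSNorm on a single vector in plain Python. NOVA's own layer presumably operates on PyTorch tensors, so treat this as an illustration of the formula only; the function name `rms_norm` and the `eps` value are assumptions.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


hidden = [3.0, 4.0]
normalized = rms_norm(hidden, weight=[1.0, 1.0])

# With a unit gain, the output has a root mean square of approximately 1.
output_rms = math.sqrt(sum(v * v for v in normalized) / len(normalized))
```

Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term, which makes it slightly cheaper per token.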
|
||||
|
||||
## Intended Use
|
||||
|
||||
### Primary Use Cases
|
||||
|
||||
- **Personal companion AI:** Conversational agent with customizable personas
|
||||
- **Local inference:** Privacy-focused applications on consumer hardware
|
||||
- **Research:** Transformer architecture experimentation
|
||||
- **Education:** Learning about modern LLM implementation
|
||||
|
||||
### Out of Scope
|
||||
|
||||
- **Production deployment without safety measures:** Additional content filtering recommended
|
||||
- **High-stakes decisions:** Not suitable for medical, legal, or financial advice
|
||||
- **Scalable services:** Designed for local/personal use, not cloud deployment
|
||||
|
||||
## Training Data
|
||||
|
||||
NOVA uses **only legally licensed datasets**:
|
||||
|
||||
### Approved Sources
|
||||
|
||||
- **Public Domain:** Project Gutenberg books
|
||||
- **CC0 / CC-BY-SA / ODC-BY:** OpenWebText (CC0), Wikipedia (CC-BY-SA 3.0), C4 corpus (ODC-BY)
|
||||
- **Open Licensed:** The Pile (ArXiv), OSI-approved code datasets
|
||||
|
||||
### License Tracking
|
||||
|
||||
All training data sources logged in `license_ledger.json` with:
|
||||
- Source name and URL
|
||||
- License type
|
||||
- Download date
|
||||
- Data provenance
|
||||
|
||||
### Exclusions
|
||||
|
||||
- No scraped data without verified licenses
|
||||
- No copyrighted material without a verified permissive license
|
||||
- No personally identifiable information (PII)
|
||||
- No user data without explicit consent
|
||||
|
||||
## Training Procedure
|
||||
|
||||
### Hyperparameters
|
||||
|
||||
Default training configuration (125M):
|
||||
|
||||
```yaml
batch_size: 8
gradient_accumulation: 4
learning_rate: 3e-4
weight_decay: 0.1
warmup_steps: 1000
max_steps: 100000
optimizer: AdamW
lr_schedule: cosine with warmup
```
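
The "cosine with warmup" schedule can be sketched as follows, using the values from the configuration above. This is one common formulation, not necessarily the one in `nova_train`: the linear warmup shape and the decay floor `MIN_LR` are assumptions, since the config does not specify them.

```python
import math

LEARNING_RATE = 3e-4
WARMUP_STEPS = 1000
MAX_STEPS = 100_000
MIN_LR = 0.0  # assumption: decay to zero; the config above does not state a floor


def lr_at(step: int) -> float:
    """Linear warmup to LEARNING_RATE, then cosine decay to MIN_LR at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (LEARNING_RATE - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Sequences per optimizer step on one device: batch_size * gradient_accumulation
effective_batch_size = 8 * 4
```

With these settings the rate peaks at 3e-4 once warmup ends, is half of that midway through the decay, and each optimizer step sees 32 sequences per device.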
|
||||
|
||||
### Hardware
|
||||
|
||||
- **Minimum:** CPU (4+ cores), 8GB RAM
|
||||
- **Recommended:** NVIDIA GPU (8GB+ VRAM), 16GB+ RAM
|
||||
- **Optimal:** NVIDIA GPU (24GB+ VRAM), 32GB+ RAM
|
||||
|
||||
### Optimizations
|
||||
|
||||
- **Mixed Precision:** AMP (Automatic Mixed Precision) on GPU
|
||||
- **Gradient Checkpointing:** Reduces memory usage
|
||||
- **Distributed Training:** DDP (DistributedDataParallel) support
|
||||
|
||||
## Evaluation
|
||||
|
||||
### Metrics
|
||||
|
||||
- **Perplexity:** Language modeling quality
|
||||
- **Latency:** Inference speed (tokens/second)
|
||||
- **Memory:** Peak RAM/VRAM usage
|
||||
- **Persona Adherence:** Style consistency with selected persona
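
For reference, the first two metrics reduce to short formulas. The sketch below assumes per-token log-probabilities in natural log; the function names are illustrative and are not part of NOVA's evaluation code.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput of a generation run."""
    return num_tokens / elapsed_seconds


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options.
uniform_four = perplexity([math.log(0.25)] * 3)
```

Lower perplexity is better, and a value of 1.0 would mean the model predicts every token with certainty.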
|
||||
|
||||
### Benchmarks
|
||||
|
||||
(To be added as pre-trained models become available)
|
||||
|
||||
## Persona System
|
||||
|
||||
### Design Philosophy
|
||||
|
||||
NOVA includes a **personality matrix** system for controllable conversational style:
|
||||
|
||||
- **No AI Disclosure by Default:** `always_disclose: false`
|
||||
- **Private Use Context:** Designed for personal, local deployment
|
||||
- **Customizable:** Users can create custom personas
|
||||
|
||||
### Personality Traits
|
||||
|
||||
Eight traits (0.0-1.0) that modulate generation:
|
||||
|
||||
1. Warmth
|
||||
2. Humor
|
||||
3. Empathy
|
||||
4. Decisiveness
|
||||
5. Creativity
|
||||
6. Intimacy
|
||||
7. Playfulness
|
||||
8. Formality
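
One way to represent the matrix in code is a small validated dataclass. This is a sketch under assumptions: the class name `PersonalityMatrix`, the 0.5 defaults, and the example values are hypothetical, and only the eight trait names and the 0.0-1.0 range come from the list above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PersonalityMatrix:
    """Eight traits, each in [0.0, 1.0]. Field names follow the list above."""
    warmth: float = 0.5
    humor: float = 0.5
    empathy: float = 0.5
    decisiveness: float = 0.5
    creativity: float = 0.5
    intimacy: float = 0.5
    playfulness: float = 0.5
    formality: float = 0.5

    def __post_init__(self) -> None:
        # Reject out-of-range values at construction time.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0.0, 1.0], got {value}")


# Illustrative values only; the shipped persona files may differ.
gentle = PersonalityMatrix(warmth=0.9, empathy=0.9)
```

Validating at construction keeps a typo such as `humor: 15` in a persona file from silently skewing generation.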
|
||||
|
||||
### Default Personas
|
||||
|
||||
- **girlfriend_gentle:** High warmth, high empathy
|
||||
- **girlfriend_playful:** High humor, high playfulness
|
||||
- **girlfriend_supportive:** Balanced traits (default)
|
||||
|
||||
## Ethical Considerations
|
||||
|
||||
### Privacy
|
||||
|
||||
- **Local-First:** All processing on-device
|
||||
- **No Telemetry:** Zero data collection
|
||||
- **User Control:** Complete control over data and models
|
||||
|
||||
### Bias and Fairness
|
||||
|
||||
- **Training Data Bias:** Inherits biases from source datasets
|
||||
- **Mitigation:** Use diverse, openly licensed sources
|
||||
- **Ongoing Work:** Bias evaluation and mitigation strategies
|
||||
|
||||
### Content Safety
|
||||
|
||||
- **Basic Filters:** Profanity and unsafe content detection
|
||||
- **Limitations:** Not a complete safety solution
|
||||
- **Recommendation:** Additional filtering for public-facing use
|
||||
|
||||
### AI Disclosure
|
||||
|
||||
- **Configurable:** `always_disclose` setting in persona config
|
||||
- **Default:** False (for private, personal use)
|
||||
- **Recommendation:** Enable for any public or shared deployment
|
||||
|
||||
## Limitations
|
||||
|
||||
### Technical
|
||||
|
||||
- **Small Context:** 2048-4096 tokens (not suitable for long documents)
|
||||
- **Compute:** Smaller models may have lower quality than larger LLMs
|
||||
- **Hallucination:** May generate factually incorrect information
|
||||
|
||||
### Use Case
|
||||
|
||||
- **Not a knowledge base:** May not have up-to-date information
|
||||
- **Not a specialist:** General-purpose, not domain-specific
|
||||
- **Not production-ready (as-is):** Requires additional safety/filtering
|
||||
|
||||
## Evolutionary Algorithm (NOVA-EVO)
|
||||
|
||||
### Purpose
|
||||
|
||||
Optional genetic algorithm for automatic configuration optimization:
|
||||
|
||||
- **Hyperparameter Search:** Learning rate, batch size, warmup
|
||||
- **Architecture Search:** Activation, normalization, positional encoding
|
||||
- **Multi-Objective:** Optimizes loss, latency, memory simultaneously
|
||||
|
||||
### Fitness Metrics
|
||||
|
||||
- **Loss/Perplexity:** (50% weight)
|
||||
- **Latency:** (20% weight)
|
||||
- **Memory:** (20% weight)
|
||||
- **Quality:** (10% weight)
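
The weights above can be combined into a single score with a weighted sum. The sketch below assumes each objective has already been normalized to [0, 1] with higher meaning better; that normalization, the dictionary keys, and the function name `fitness` are assumptions for illustration, not NOVA-EVO's actual code.

```python
WEIGHTS = {"loss": 0.5, "latency": 0.2, "memory": 0.2, "quality": 0.1}


def fitness(scores: dict[str, float]) -> float:
    """Weighted sum of per-objective scores, each already normalized to [0, 1]
    with higher meaning better (so a low loss maps to a high "loss" score)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores for one candidate configuration.
candidate = {"loss": 0.8, "latency": 0.5, "memory": 0.5, "quality": 1.0}
print(round(fitness(candidate), 3))  # 0.7
```

Because loss carries half the weight, a candidate cannot compensate for poor modeling quality through speed or memory savings alone.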
|
||||
|
||||
### Compute Budget
|
||||
|
||||
- **Small:** 20 individuals, 10 generations (~6-12 hours)
|
||||
- **Medium:** 40 individuals, 20 generations (~24-48 hours)
|
||||
- **Large:** 100 individuals, 50 generations (~1-2 weeks)
|
||||
|
||||
## Contact
|
||||
|
||||
For questions, issues, or contributions:
|
||||
|
||||
- **GitHub:** [github.com/yourusername/nova](https://github.com/yourusername/nova)
|
||||
- **Issues:** [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
|
||||
|
||||
## Citation
|
||||
|
||||
```bibtex
@software{nova2025,
  title={NOVA: Neuro-Optimizing Versatile Agent},
  author={NOVA Project Contributors},
  year={2025},
  url={https://github.com/yourusername/nova},
  license={Apache-2.0}
}
```
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
- Transformer architecture inspired by GPT, LLaMA, and modern LLM research
|
||||
- RoPE, RMSNorm, SwiGLU from recent papers (Su et al.; Zhang and Sennrich; Shazeer)
|
||||
- Open source community for datasets and tools
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** 2025
|
||||
**Model Card Version:** 1.0
|
330
docs/PRIVACY_LOCAL.md
Normal file
@@ -0,0 +1,330 @@
|
||||
# Privacy and Local Use
|
||||
|
||||
## NOVA Privacy Statement
|
||||
|
||||
NOVA is designed as a **local-first, privacy-focused** language model. This document explains how NOVA handles your data.
|
||||
|
||||
---
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. Local-First
|
||||
|
||||
**Everything runs on your device.**
|
||||
|
||||
- Model inference happens locally
|
||||
- Training data stays on your machine
|
||||
- No cloud dependencies
|
||||
- No internet required (except for dataset downloads)
|
||||
|
||||
### 2. Zero Telemetry
|
||||
|
||||
**NOVA collects zero data.**
|
||||
|
||||
- No usage tracking
|
||||
- No error reporting
|
||||
- No analytics
|
||||
- No phone-home functionality
|
||||
|
||||
### 3. Complete User Control
|
||||
|
||||
**You own everything.**
|
||||
|
||||
- Your conversations
|
||||
- Your trained models
|
||||
- Your custom personas
|
||||
- Your data
|
||||
|
||||
---
|
||||
|
||||
## Data Storage
|
||||
|
||||
### Where Your Data Lives
|
||||
|
||||
```
C:\Development\Nova\
├── memory.db           # Your conversation history (SQLite)
├── checkpoints/        # Your trained models
├── data/               # Your training data
└── configs/persona/    # Your custom personas
```
|
||||
|
||||
**All on your device. Never uploaded.**
|
||||
|
||||
### Conversation Memory
|
||||
|
||||
- **Location:** `memory.db` (SQLite database)
|
||||
- **Contents:** Your chat history
|
||||
- **Encryption:** Not encrypted by default (it's local)
|
||||
- **Deletion:** Delete `memory.db` file to erase all history
|
||||
- **Recommendation:** Encrypt your drive if sharing the device
|
||||
|
||||
### Model Checkpoints
|
||||
|
||||
- **Location:** `checkpoints/` directory
|
||||
- **Contents:** Model weights and training state
|
||||
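- **Sharing:** Weights trained only on public datasets contain no personal data and are safe to share; weights trained on your own data may memorize parts of it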
|
||||
|
||||
---
|
||||
|
||||
## Network Activity
|
||||
|
||||
### When NOVA Uses the Internet
|
||||
|
||||
NOVA **only** uses the internet for:
|
||||
|
||||
1. **Dataset Downloads:** Downloading legal training datasets (opt-in)
|
||||
2. **Optional:** Downloading pre-trained weights (if available)
|
||||
|
||||
### When NOVA Does NOT Use Internet
|
||||
|
||||
- **Chat inference:** 100% offline
|
||||
- **Model training:** 100% offline
|
||||
- **Persona customization:** 100% offline
|
||||
- **Evolution (NOVA-EVO):** 100% offline
|
||||
|
||||
### Firewall Safety
|
||||
|
||||
NOVA is safe to run behind a firewall with no internet access (after initial setup).
|
||||
|
||||
---
|
||||
|
||||
## AI Disclosure Setting
|
||||
|
||||
### `always_disclose` Flag
|
||||
|
||||
NOVA personas have an `always_disclose` setting:
|
||||
|
||||
```yaml
always_disclose: false  # Default
```
|
||||
|
||||
**What this means:**
|
||||
|
||||
- `false` (default): NOVA does NOT disclose being AI
  - Designed for **private, personal use**
  - Appropriate for local companion scenarios

- `true`: NOVA includes AI disclosure text
  - Recommended for **shared or public use**
  - Adds transparency about AI nature
|
||||
|
||||
### When to Enable Disclosure
|
||||
|
||||
✅ **Enable `always_disclose: true` if:**
|
||||
- Sharing NOVA with others
|
||||
- Deploying publicly (e.g., website, app)
|
||||
- Any scenario where users might not know it's AI
|
||||
|
||||
❌ **Keep `always_disclose: false` if:**
|
||||
- Personal, private use on your own device
|
||||
- You're fully aware it's a language model
|
||||
- Testing/development
|
||||
|
||||
**Default:** False (personal use assumption)
|
||||
|
||||
---
|
||||
|
||||
## Persona System Privacy
|
||||
|
||||
### Personality Matrix
|
||||
|
||||
The personality matrix (warmth, humor, empathy, etc.) is:
|
||||
|
||||
- **Stored:** In persona YAML files
|
||||
- **Processed:** Locally during generation
|
||||
- **Shared:** Never (unless you share the files)
|
||||
|
||||
### Custom Personas
|
||||
|
||||
Your custom persona configurations:
|
||||
|
||||
- **Location:** `configs/persona/` directory
|
||||
- **Format:** YAML (human-readable text)
|
||||
- **Privacy:** Stored locally, never transmitted
|
||||
|
||||
---
|
||||
|
||||
## Training Data Privacy
|
||||
|
||||
### Legal Data Only
|
||||
|
||||
NOVA enforces **legal-only datasets**:
|
||||
|
||||
- Public domain sources
|
||||
- Openly licensed datasets (CC0, CC-BY, MIT, Apache)
|
||||
- License tracking in `license_ledger.json`
|
||||
|
||||
**No private data scraping.**
|
||||
|
||||
### Your Own Data
|
||||
|
||||
If you train NOVA on your own data:
|
||||
|
||||
- **Stays local:** Never leaves your device
|
||||
- **Your responsibility:** Ensure you have rights to use it
|
||||
- **Recommendation:** Don't train on sensitive/private data you don't want in the model
|
||||
|
||||
---
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Running NOVA Safely
|
||||
|
||||
✅ **Do:**
|
||||
- Run on a trusted device
|
||||
- Keep your OS and Python dependencies updated
|
||||
- Use filesystem encryption if device is shared
|
||||
- Review code before running (it's open source!)
|
||||
|
||||
⚠️ **Don't:**
|
||||
- Expose the REST API to the internet without authentication
|
||||
- Train on sensitive data you can't afford to leak
|
||||
- Share `memory.db` if it contains private conversations
|
||||
|
||||
### REST API Security
|
||||
|
||||
If using the REST API (`nova chat serve`):
|
||||
|
||||
- **Default:** Binds to `0.0.0.0:8000` (all interfaces)
|
||||
- **Recommendation:** Use `--host 127.0.0.1` for local-only
|
||||
- **Authentication:** Not included (add if exposing externally)
|
||||
- **HTTPS:** Not included (add if exposing externally)
|
||||
|
||||
**For personal use:** Keep localhost-only.
|
||||
**For shared use:** Add authentication, HTTPS, rate limiting.
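
A launcher script can guard against accidental exposure by checking the bind address before starting the server. The sketch below uses only the standard library; the helper names `is_loopback_host` and `check_bind_host` are hypothetical and are not part of the `nova` CLI.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """True if host is 'localhost' or a loopback IP such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an IP literal; treat unknown hostnames as non-local


def check_bind_host(host: str) -> str:
    """Return a warning message for non-loopback binds, or an empty string."""
    if is_loopback_host(host):
        return ""
    return f"Binding to {host} exposes the API beyond this machine; add authentication and HTTPS."


print(check_bind_host("0.0.0.0"))
```

Since the documented default is `0.0.0.0`, a check like this would warn on a default launch and stay silent once `--host 127.0.0.1` is passed.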
|
||||
|
||||
---
|
||||
|
||||
## Data Deletion
|
||||
|
||||
### Clear All Conversations
|
||||
|
||||
```bash
# Delete the conversation database
rm memory.db

# Or clear it programmatically via the Python API
python -c "from nova_chat import ConversationMemory; ConversationMemory().clear_all()"
```
|
||||
|
||||
### Remove Models
|
||||
|
||||
```bash
# Delete checkpoints
rm -rf checkpoints/
```
|
||||
|
||||
### Complete Reset
|
||||
|
||||
```bash
# Remove all data
rm -rf data/ checkpoints/ memory.db
```
|
||||
|
||||
---
|
||||
|
||||
## Third-Party Dependencies
|
||||
|
||||
NOVA uses standard open-source libraries:
|
||||
|
||||
- **PyTorch:** ML framework
|
||||
- **SentencePiece:** Tokenization
|
||||
- **FastAPI/Uvicorn:** REST API (optional)
|
||||
- **SQLite:** Conversation storage
|
||||
|
||||
**All are open source and widely audited.**
|
||||
|
||||
### Dependency Privacy
|
||||
|
||||
- PyTorch: No telemetry (when installed normally)
|
||||
- SentencePiece: No telemetry
|
||||
- FastAPI: No telemetry
|
||||
- SQLite: Local database, no telemetry
|
||||
|
||||
---
|
||||
|
||||
## Comparison to Cloud LLMs
|
||||
|
||||
| Feature | NOVA | Cloud LLMs |
|---------|------|------------|
| **Data Location** | Your device | Company servers |
| **Privacy** | Complete | Varies by provider |
| **Telemetry** | None | Usually tracked |
| **Internet Required** | No (after setup) | Yes |
| **Cost** | One-time (hardware) | Per-token/monthly |
| **Customization** | Full control | Limited |
| **Data Retention** | Your choice | Company policy |
|
||||
|
||||
---
|
||||
|
||||
## Transparency
|
||||
|
||||
### Open Source
|
||||
|
||||
NOVA is **fully open source** under Apache 2.0:
|
||||
|
||||
- **Source code:** Fully auditable
|
||||
- **No hidden functionality:** What you see is what you get
|
||||
- **Community review:** Anyone can inspect for privacy issues
|
||||
|
||||
### No Hidden Behavior
|
||||
|
||||
NOVA does **not**:
|
||||
- Phone home
|
||||
- Send analytics
|
||||
- Track usage
|
||||
- Report errors to external services
|
||||
- Auto-update without your action
|
||||
|
||||
---
|
||||
|
||||
## Recommendations
|
||||
|
||||
### For Maximum Privacy
|
||||
|
||||
1. **Offline Mode:** Disable network after downloading dependencies
|
||||
2. **Encrypt Storage:** Use full-disk encryption (BitLocker, FileVault, LUKS)
|
||||
3. **Regular Cleanup:** Clear `memory.db` periodically if desired
|
||||
4. **Review Code:** Inspect the source before running
|
||||
|
||||
### For Shared Devices
|
||||
|
||||
1. **Enable Disclosure:** Set `always_disclose: true`
|
||||
2. **Separate Accounts:** Use OS user accounts to isolate data
|
||||
3. **Clear Conversations:** Delete history after sessions
|
||||
|
||||
### For Development
|
||||
|
||||
1. **Test Data Only:** Don't use real sensitive data for testing
|
||||
2. **Version Control:** Add `memory.db` and `checkpoints/` to `.gitignore`
|
||||
|
||||
---
|
||||
|
||||
## Contact for Privacy Concerns
|
||||
|
||||
If you find privacy issues:
|
||||
|
||||
- **GitHub Issues:** [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
|
||||
- **Security:** Tag issues with `security` label
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
**NOVA is designed for local, private use.**
|
||||
|
||||
✅ No data collection
|
||||
✅ No telemetry
|
||||
✅ No cloud dependencies
|
||||
✅ Complete user control
|
||||
✅ Open source and auditable
|
||||
|
||||
**Your data stays on your device.**
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** 2025
|
||||
**Document Version:** 1.0
|