# Data Licenses and Attribution

NOVA is committed to using **only legally licensed datasets** for training. This document tracks all approved data sources and their licenses.

---

## License Philosophy

### What We Use

- ✅ **Public Domain:** No restrictions
- ✅ **CC0:** Public domain dedication
- ✅ **CC-BY:** Attribution required
- ✅ **CC-BY-SA:** Attribution and share-alike required
- ✅ **ODC-BY:** Attribution required (database license)
- ✅ **MIT/Apache/BSD:** Permissive open source

### What We DON'T Use

- ❌ **All Rights Reserved:** Copyrighted without permission
- ❌ **CC-BY-NC:** Non-commercial restrictions
- ❌ **CC-BY-ND:** No-derivatives restrictions
- ❌ **Unknown/Unlicensed:** No verified license
- ❌ **Scraped Web Data:** Without license verification

---

## Approved Dataset Sources

### 1. Wikipedia (English)

- **License:** CC-BY-SA 4.0
- **URL:** https://dumps.wikimedia.org/
- **Size:** ~20 GB (compressed)
- **Language:** English
- **Description:** English Wikipedia articles

**Attribution:**

> Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 4.0.

**Usage:** Text data for general knowledge

---

### 2. Project Gutenberg

- **License:** Public Domain
- **URL:** https://www.gutenberg.org/
- **Size:** ~15 GB
- **Language:** Primarily English
- **Description:** Books in the US public domain (as of 2025, generally works published before 1930)

**Attribution:**

> Project Gutenberg. Public domain literary works.

**Usage:** Literary text, historical documents

---

### 3. OpenWebText

- **License:** CC0 1.0 (Public Domain Dedication)
- **URL:** https://huggingface.co/datasets/Skylion007/openwebtext
- **Size:** ~38 GB
- **Language:** English
- **Description:** Open reproduction of WebText (pages linked from Reddit)

**Attribution:**

> OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.

**Usage:** Web-scraped text (Reddit-filtered)

---

### 4. C4 (Colossal Clean Crawled Corpus)

- **License:** ODC-BY (Open Data Commons Attribution)
- **URL:** https://huggingface.co/datasets/c4
- **Size:** ~300 GB (en subset)
- **Language:** English
- **Description:** Cleaned Common Crawl data

**Attribution:**

> C4 dataset from Google's T5 paper. ODC-BY license.

**Usage:** Large-scale web text

---

### 5. The Pile - ArXiv Subset

- **License:** Various (set per paper; verify before use)
- **URL:** https://pile.eleuther.ai/
- **Size:** ~60 GB (ArXiv subset)
- **Language:** English
- **Description:** ArXiv papers (scientific articles)

**Attribution:**

> The Pile by EleutherAI. ArXiv papers subset.

**Usage:** Scientific and technical text

**Note:** Only use subsets with verified permissive licenses.

---

## License Tracking System

### Ledger File

All downloaded datasets are tracked in:

```
data/processed/license_ledger.json
```

**Format:**

```json
{
  "sources": [
    {
      "name": "wikipedia-en",
      "license": "cc-by-sa-4.0",
      "url": "https://dumps.wikimedia.org/enwiki/",
      "download_date": "2025-01-15",
      "size_gb": 20.5,
      "attribution": "Wikipedia contributors..."
    }
  ]
}
```

### Verification

Before training, verify licenses:

```bash
python -m nova_data.pipeline verify_licenses
```

This checks that every data source in the ledger has an approved license.

---

## Attribution Requirements

### CC-BY Datasets

**Required:**

- Attribute the original creator
- Include the license name
- Link to the license
- Indicate if changes were made

CC-BY-SA and ODC-BY sources carry the same attribution duties; CC-BY-SA also requires adaptations of the licensed text to be shared under the same license.

**Our Attribution:**

All NOVA models trained on CC-BY data include:

> This model was trained on data including:
> - Wikipedia (CC-BY-SA 4.0)
> - [Other CC-BY sources]
>
> Full attributions in DATA_LICENSES.md

### Public Domain

**Required:** None (but we attribute anyway for transparency)

---

## Custom Datasets

### User-Provided Data

If you train NOVA on your own data:

**Your Responsibility:**

- Ensure you have the rights to use the data
- Verify any license requirements
- Add custom sources to the ledger

**Example:**

```yaml
# configs/data/custom.yaml
sources:
  - name: my-custom-dataset
    license: mit  # or your license
    path: /path/to/data
    description: My custom training data
```

---

## Commercial Use Considerations

### NOVA Code

- **License:** Apache 2.0
- **Commercial Use:** ✅ Allowed

### Training Data

Commercial use depends on the dataset:

| Dataset | Commercial Use |
|---------|---------------|
| Wikipedia | ✅ Allowed (with attribution and share-alike) |
| Project Gutenberg | ✅ Allowed (public domain) |
| OpenWebText | ✅ Allowed (CC0) |
| C4 | ✅ Allowed (ODC-BY, with attribution) |
| The Pile (ArXiv) | ⚠️ Verify per-subset |

**Recommendation:** Review each dataset's license for commercial projects.

---

## Excluded Sources

### Why We Don't Use Certain Data

**Common Crawl (raw):**

- Contains copyrighted material
- License status unclear for many pages
- We use filtered versions (C4) instead

**Social Media (Twitter, etc.):**

- Terms of Service restrictions
- Privacy concerns
- Unclear licensing

**Books3/LibGen:**

- Contains copyrighted books
- Legal issues
- Not permissively licensed

**YouTube Subtitles:**

- Copyright unclear
- TOS restrictions

---

## Compliance Checklist

Before training NOVA:

- [ ] All data sources listed in `license_ledger.json`
- [ ] Each source has a verified license
- [ ] Licenses are on the approved list (CC-BY, MIT, Apache, public domain, etc.)
- [ ] Attribution prepared for all attribution-required sources (CC-BY, CC-BY-SA, ODC-BY)
- [ ] No excluded sources used

---

## Future Datasets

### Planned Additions

We're evaluating these sources:

- **BookCorpus:** Open domain books (pending license review)
- **Stack Exchange:** CC-BY-SA (with attribution)
- **OpenSubtitles:** Public domain/permissive subset
- **Code datasets:** GitHub repositories under permissive licenses (MIT, Apache, BSD)

**Criteria:**

- Clear, permissive license
- High quality
- Legally distributable

---

## Dataset Removal Requests

If you believe we've incorrectly listed a dataset:

1. Open an issue: [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
2. Include:
   - Dataset name
   - License concern
   - Supporting documentation
3. We'll review and respond within 7 days

---

## Legal Disclaimer

**This project aims for legal compliance, but:**

- We're not lawyers
- License interpretation may vary by jurisdiction
- Users are responsible for their own compliance
- Consult legal counsel for commercial use

**The NOVA project provides this information for transparency, but makes no warranties about legal compliance.**

---

## References

### License Texts

- **CC-BY 4.0:** https://creativecommons.org/licenses/by/4.0/
- **CC-BY-SA 4.0:** https://creativecommons.org/licenses/by-sa/4.0/
- **CC0 1.0:** https://creativecommons.org/publicdomain/zero/1.0/
- **Apache 2.0:** https://www.apache.org/licenses/LICENSE-2.0
- **MIT:** https://opensource.org/licenses/MIT
- **ODC-BY:** https://opendatacommons.org/licenses/by/

### Resources

- Creative Commons: https://creativecommons.org/
- Open Data Commons: https://opendatacommons.org/
- OSI Licenses: https://opensource.org/licenses

---

**Last Updated:** 2025
**Document Version:** 1.0
**Review Frequency:** Quarterly
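
### Appendix: Sketch of a Ledger Check

To make the ledger verification described under "License Tracking System" more concrete, here is a minimal sketch of what such a check could look like. It is not the implementation behind `nova_data.pipeline verify_licenses`. The function names, the `APPROVED_LICENSES` identifiers, and the set of required fields are all assumptions chosen for illustration, loosely following the ledger format shown in this document.

```python
import json

# Assumption: lowercase identifiers for the license families this document
# approves. The real allow-list used by NOVA may differ.
APPROVED_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0",
    "cc-by-sa-3.0", "cc-by-sa-4.0", "odc-by-1.0",
    "mit", "apache-2.0", "bsd-3-clause",
}

# Assumption: fields every ledger entry should carry, taken from the
# example ledger entry in this document.
REQUIRED_FIELDS = ("name", "license", "url", "download_date", "attribution")


def find_license_problems(ledger: dict) -> list[str]:
    """Return one human-readable message per problem found in the ledger."""
    problems = []
    for index, source in enumerate(ledger.get("sources", [])):
        label = source.get("name", f"<entry {index}>")

        # Flag entries with absent or empty required fields.
        missing = [field for field in REQUIRED_FIELDS if not source.get(field)]
        if missing:
            problems.append(f"{label}: missing {', '.join(missing)}")

        # Flag licenses that are present but not on the allow-list.
        license_id = str(source.get("license", "")).lower()
        if license_id and license_id not in APPROVED_LICENSES:
            problems.append(f"{label}: license '{license_id}' is not approved")
    return problems


def load_ledger(path: str) -> dict:
    """Read a ledger JSON file such as data/processed/license_ledger.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A caller could run `find_license_problems(load_ledger("data/processed/license_ledger.json"))` and refuse to start training if the returned list is non-empty. Returning every problem, rather than stopping at the first, lets all ledger entries be fixed in one pass.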