# Data Licenses and Attribution

NOVA is committed to using **only legally licensed datasets** for training. This document tracks all approved data sources and their licenses.

---

## License Philosophy

### What We Use

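✅ **Public Domain:** No restrictions
✅ **CC0:** Public domain dedication
✅ **CC-BY / CC-BY-SA:** Attribution required (CC-BY-SA also requires share-alike)
✅ **ODC-BY:** Attribution required (open data license)
✅ **MIT/Apache/BSD:** Permissive open source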

### What We DON'T Use

❌ **All Rights Reserved:** Copyrighted without permission
❌ **CC-BY-NC:** Non-commercial restrictions
❌ **CC-BY-ND:** No derivatives restrictions
❌ **Unknown/Unlicensed:** No verified license
❌ **Scraped Web Data:** Without license verification

---

## Approved Dataset Sources

### 1. Wikipedia (English)

**License:** CC-BY-SA 3.0 (Wikipedia moved to CC-BY-SA 4.0 in June 2023; confirm the version that applies to your dump)
**URL:** https://dumps.wikimedia.org/
**Size:** ~20 GB (compressed)
**Language:** English
**Description:** English Wikipedia articles

**Attribution:**
> Wikipedia contributors. English Wikipedia. Wikimedia Foundation. Licensed under CC-BY-SA 3.0.

**Usage:** Text data for general knowledge

---

### 2. Project Gutenberg

**License:** Public Domain
**URL:** https://www.gutenberg.org/
**Size:** ~15 GB
**Language:** Primarily English
**Description:** Books in the US public domain (published before 1930, as of 2025)

**Attribution:**
> Project Gutenberg. Public domain literary works.

**Usage:** Literary text, historical documents

---

### 3. OpenWebText

**License:** CC0 1.0 (Public Domain Dedication)
**URL:** https://huggingface.co/datasets/Skylion007/openwebtext
**Size:** ~38 GB
**Language:** English
**Description:** Open reproduction of OpenAI's WebText corpus (text from pages linked on Reddit)

**Attribution:**
> OpenWebText dataset by Aaron Gokaslan and Vanya Cohen. CC0 1.0 Universal.

**Usage:** Web-scraped text (Reddit-filtered)

---

### 4. C4 (Colossal Clean Crawled Corpus)

**License:** ODC-BY (Open Data Commons Attribution)
**URL:** https://huggingface.co/datasets/allenai/c4
**Size:** ~300 GB (en subset)
**Language:** English
**Description:** Cleaned Common Crawl data

**Attribution:**
> C4 dataset from Google's T5 paper. ODC-BY license.

**Usage:** Large-scale web text

---

### 5. The Pile - ArXiv Subset

**License:** Varies per paper (each ArXiv submission carries its own license; verify before use)
**URL:** https://pile.eleuther.ai/
**Size:** ~60 GB (ArXiv subset)
**Language:** English
**Description:** ArXiv papers (scientific articles)

**Attribution:**
> The Pile by EleutherAI. ArXiv papers subset.

**Usage:** Scientific and technical text

**Note:** Only use subsets with verified permissive licenses

---

## License Tracking System

### Ledger File

All downloaded datasets are tracked in:
```
data/processed/license_ledger.json
```

**Format:**
```json
{
  "sources": [
    {
      "name": "wikipedia-en",
      "license": "cc-by-sa-3.0",
      "url": "https://dumps.wikimedia.org/enwiki/",
      "download_date": "2025-01-15",
      "size_gb": 20.5,
      "attribution": "Wikipedia contributors..."
    }
  ]
}
```
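
As a rough illustration, the ledger format above could be read and checked for missing fields with a few lines of Python. This is a minimal sketch, not NOVA's actual loader: `LedgerEntry`, `parse_ledger`, and `REQUIRED_FIELDS` are hypothetical names, and the field list is taken directly from the example entry.

```python
import json
from dataclasses import dataclass

# Assumption: every ledger entry carries the fields shown in the example above.
REQUIRED_FIELDS = ("name", "license", "url", "download_date", "size_gb", "attribution")


@dataclass(frozen=True)
class LedgerEntry:
    name: str
    license: str
    url: str
    download_date: str  # ISO date string, e.g. "2025-01-15"
    size_gb: float
    attribution: str


def parse_ledger(text: str) -> list[LedgerEntry]:
    """Parse ledger JSON text; raise ValueError if an entry lacks a field."""
    data = json.loads(text)
    entries = []
    for index, raw in enumerate(data.get("sources", [])):
        missing = [field for field in REQUIRED_FIELDS if field not in raw]
        if missing:
            raise ValueError(f"source #{index} is missing fields: {', '.join(missing)}")
        entries.append(LedgerEntry(**{field: raw[field] for field in REQUIRED_FIELDS}))
    return entries
```

Failing on the first incomplete entry keeps a half-documented source from slipping into a training run unnoticed.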

### Verification

Before training, verify licenses:

```bash
python -m nova_data.pipeline verify_licenses
```

This checks that all data sources have approved licenses.
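
The internals of `verify_licenses` are not documented here, so the following is only a sketch of the kind of allowlist check it describes. `APPROVED_LICENSES` and `find_unapproved` are hypothetical names, and the lowercase license identifiers are an assumed convention modeled on the `"cc-by-sa-3.0"` value in the ledger example.

```python
# Assumption: identifiers for the licenses this document approves.
# The real pipeline may use a different list or naming scheme.
APPROVED_LICENSES = frozenset({
    "public-domain",
    "cc0-1.0",
    "cc-by-4.0",
    "cc-by-sa-3.0",
    "odc-by",
    "mit",
    "apache-2.0",
    "bsd-3-clause",
})


def find_unapproved(sources):
    """Return (name, license) pairs whose license is not on the allowlist."""
    problems = []
    for src in sources:
        # Normalize case and whitespace so "CC0-1.0 " matches "cc0-1.0".
        license_id = str(src.get("license", "")).strip().lower()
        if license_id not in APPROVED_LICENSES:
            problems.append((src.get("name", "<unnamed>"), license_id or "<missing>"))
    return problems
```

An allowlist rejects anything it does not recognize, which matches the rule that unknown or unlicensed data is not used.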

---

## Attribution Requirements

### CC-BY Datasets

**Required:**
- Attribute the original creator
- Include license name
- Link to license
- Indicate if changes were made

**Our Attribution:**

All NOVA models trained on CC-BY data include the following notice:

> This model was trained on data including:
> - Wikipedia (CC-BY-SA 3.0)
> - [Other CC-BY sources]
>
> Full attributions in DATA_LICENSES.md

### Public Domain

**Required:** None (but we attribute anyway for transparency)
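
The notice shown under "Our Attribution" could be generated from ledger entries rather than maintained by hand. The sketch below assumes each source is a dict with `name` and `license` keys; `render_attribution` is a hypothetical helper, not part of the NOVA codebase.

```python
def render_attribution(sources):
    """Build the Markdown blockquote notice from a list of source dicts."""
    lines = ["This model was trained on data including:"]
    for src in sources:
        lines.append(f"- {src['name']} ({src['license']})")
    lines.append("")  # blank quoted line before the pointer to this document
    lines.append("Full attributions in DATA_LICENSES.md")
    # Prefix every line with "> "; rstrip turns the blank line into a bare ">".
    return "\n".join(f"> {line}".rstrip() for line in lines)


print(render_attribution([{"name": "Wikipedia", "license": "CC-BY-SA 3.0"}]))
```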

---

## Custom Datasets

### User-Provided Data

If you train NOVA on your own data:

**Your Responsibility:**
- Ensure you have the rights to use the data
- Verify any license requirements
- Add custom sources to the ledger

**Example:**
```yaml
# configs/data/custom.yaml
sources:
  - name: my-custom-dataset
    license: mit  # or your license
    path: /path/to/data
    description: My custom training data
```
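
Adding a custom source to the ledger could look roughly like the sketch below. It is an assumption-laden illustration: `add_source` is a hypothetical helper, the entry is passed as a plain dict (parsing the YAML config would need a third-party library such as PyYAML), and how config fields map onto ledger fields is not specified in this document.

```python
import json
from pathlib import Path


def add_source(ledger_path, entry):
    """Append `entry` to the ledger file, creating the file if needed.

    The parent directory must already exist. Raises ValueError if a source
    with the same name is already recorded.
    """
    path = Path(ledger_path)
    if path.exists():
        ledger = json.loads(path.read_text(encoding="utf-8"))
    else:
        ledger = {"sources": []}
    sources = ledger.setdefault("sources", [])
    if any(s.get("name") == entry["name"] for s in sources):
        raise ValueError(f"source {entry['name']!r} is already in the ledger")
    sources.append(entry)
    path.write_text(json.dumps(ledger, indent=2) + "\n", encoding="utf-8")
    return ledger
```

Refusing duplicate names means each dataset keeps a single ledger entry, so its license is recorded once and cannot be contradicted by a second entry.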

---

## Commercial Use Considerations

### NOVA Code

**License:** Apache 2.0
**Commercial Use:** ✅ Allowed

### Training Data

Commercial use depends on the dataset:

| Dataset | Commercial Use |
|---------|---------------|
| Wikipedia | ✅ Allowed (with attribution) |
| Project Gutenberg | ✅ Allowed (public domain) |
| OpenWebText | ✅ Allowed (CC0) |
| C4 | ✅ Allowed (ODC-BY, with attribution) |
| The Pile (ArXiv) | ⚠️ Verify per-subset |

**Recommendation:** Review each dataset's license for commercial projects.

---

## Excluded Sources

### Why We Don't Use Certain Data

**Common Crawl (raw):**
- Contains copyrighted material
- License status unclear for many pages
- We use filtered versions (C4) instead

**Social Media (Twitter, etc.):**
- Terms of Service restrictions
- Privacy concerns
- Unclear licensing

**Books3/LibGen:**
- Contains copyrighted books
- Legal issues
- Not permissively licensed

**YouTube Subtitles:**
- Copyright unclear
- TOS restrictions

---

## Compliance Checklist

Before training NOVA:

- [ ] All data sources are listed in `license_ledger.json`
- [ ] Each source has a verified license
- [ ] Licenses are permissive (CC-BY, MIT, Apache, public domain, etc.)
- [ ] Attribution is prepared for CC-BY sources
- [ ] No excluded sources are used
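
The last checklist item can be partly automated with a name-based screen. The sketch below is illustrative only: `EXCLUDED_KEYWORDS` and `find_excluded` are hypothetical, the keywords are assumed spellings of the sources listed under "Excluded Sources", and a name match is a prompt for manual review rather than proof of a violation.

```python
# Assumption: keywords that identify the excluded sources by dataset name.
EXCLUDED_KEYWORDS = ("books3", "libgen", "twitter", "youtube-subtitles", "common-crawl-raw")


def find_excluded(source_names):
    """Return the names that match any excluded-source keyword."""
    hits = []
    for name in source_names:
        # Treat underscores and spaces like hyphens, and ignore case.
        normalized = name.lower().replace("_", "-").replace(" ", "-")
        if any(keyword in normalized for keyword in EXCLUDED_KEYWORDS):
            hits.append(name)
    return hits
```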

---

## Future Datasets

### Planned Additions

We're evaluating these sources:

- **BookCorpus:** Self-published books collected from Smashwords; license status unclear (pending license review)
- **Stack Exchange:** CC-BY-SA (with attribution)
- **OpenSubtitles:** Public domain/permissive subset only
- **Code datasets:** GitHub repositories under permissive licenses (MIT, Apache, BSD)

**Criteria:**
- Clear, permissive license
- High quality
- Legally distributable

---

## Dataset Removal Requests

If you believe we've incorrectly listed a dataset:

1. Open an issue: [github.com/yourusername/nova/issues](https://github.com/yourusername/nova/issues)
2. Include:
   - Dataset name
   - License concern
   - Supporting documentation
3. We'll review and respond within 7 days

---

## Legal Disclaimer

**This project aims for legal compliance, but:**

- We're not lawyers
- License interpretation may vary by jurisdiction
- Users are responsible for their own compliance
- Consult legal counsel for commercial use

**The NOVA project provides this information for transparency and makes no warranties about legal compliance.**

---

## References

### License Texts

- **CC-BY 4.0:** https://creativecommons.org/licenses/by/4.0/
- **CC0 1.0:** https://creativecommons.org/publicdomain/zero/1.0/
- **Apache 2.0:** https://www.apache.org/licenses/LICENSE-2.0
- **MIT:** https://opensource.org/licenses/MIT
- **ODC-BY:** https://opendatacommons.org/licenses/by/

### Resources

- Creative Commons: https://creativecommons.org/
- Open Data Commons: https://opendatacommons.org/
- OSI Licenses: https://opensource.org/licenses

---

**Last Updated:** 2025
**Document Version:** 1.0
**Review Frequency:** Quarterly