Vivi-Speech/.planning/research/FEATURES.md
Dani B 901574f8c8 docs: complete project research (STACK, FEATURES, ARCHITECTURE, PITFALLS, SUMMARY)
Synthesized research findings from 4 parallel researcher agents:

Key Findings:
- Stack: discord.py 2.6.4 + PostgreSQL/SQLite with webhook-driven PluralKit integration
- Architecture: 7-component system with clear separation of concerns, async-native
- Features: Rule-based learning system starting simple, avoiding context inference and ML
- Pitfalls: 8 critical risks identified with phase assignments and prevention strategies

Recommended Approach:
- 5-phase build order (detection → translation → teaching → config → polish)
- Focus on dysgraphia accessibility for teaching interface
- Start with message detection reliability (Phase 1, load-bearing)
- Shared emoji dictionary (Phase 1-3); per-server overrides deferred to Phase 4+

Confidence Levels:
- Tech Stack: VERY HIGH (all production-proven, no experimental choices)
- Architecture: VERY HIGH (mirrors successful production bots)
- Features: HIGH (tight scope, transparent approach)
- Roadmap: HIGH (logical phase progression with value delivery)

Gaps to Address in Requirements:
- Vivi's teaching UX preferences (dysgraphia-specific patterns)
- Exact emoji coverage and naming conventions
- Moderation/teaching permissions model
- Multi-system scope and per-system customization needs

Ready for requirements definition and roadmap creation.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-01-29 11:02:32 -05:00


Features Research: Vivi Speech Translator

Executive Summary

The Vivi Speech Translator should combine table-stakes Discord bot features with a hybrid translation approach that starts rule-based but can learn and improve over time. The core differentiator is intelligent emoji-to-text translation that understands Vivi's unique communication system, enabled by a learning mechanism that teaches the bot new emoji meanings while maintaining deterministic, debuggable behavior for known sequences.

Table Stakes Features

These are features Discord bot users expect and that are necessary for basic functionality:

Message Event Handling

  • Description: Detect and respond to messages containing emojis from Vivi
  • Complexity: Low
  • Dependencies: discord.py or discord.js framework
  • Why Essential: Without this, the bot cannot observe the messages it needs to translate
  • Implementation: Monitor on_message events and filter for messages containing emoji sequences

Emoji Parsing

  • Description: Parse both standard Unicode emojis and Discord custom emojis from message content
  • Complexity: Low
  • Dependencies: Emoji library (emoji, discord.py emoji utilities)
  • Why Essential: Must accurately identify which emojis are present to begin translation
  • Implementation: Use regex patterns to find custom emoji tokens of the form <:name:id> (or <a:name:id> for animated emoji, where id is numeric) and standard Unicode emoji
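
A minimal sketch of the parsing step, using only the standard library. The custom-emoji pattern is the one given later in this document under "Implementation: Message Handling". The Unicode range is a coarse approximation that covers a few common emoji blocks, not the full Unicode emoji set, and the function name `parse_emojis` is hypothetical.

```python
import re

# Custom emoji appear in message content as <:name:id> or <a:name:id> (animated).
CUSTOM_EMOJI = re.compile(r"<(a?):([a-zA-Z0-9_]{1,32}):([0-9]{18,20})>")

# Assumption: a rough range check, not a complete emoji detector.
UNICODE_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def parse_emojis(content: str) -> dict:
    """Return the custom and Unicode emoji found in a message string."""
    custom = [
        {"animated": bool(flag), "name": name, "id": int(emoji_id)}
        for flag, name, emoji_id in CUSTOM_EMOJI.findall(content)
    ]
    # Strip custom tokens first so their names and IDs are not rescanned.
    remainder = CUSTOM_EMOJI.sub("", content)
    return {"custom": custom, "unicode": UNICODE_EMOJI.findall(remainder)}
```

A dedicated emoji library would likely be more accurate for Unicode emoji, particularly for multi-codepoint sequences such as 👩‍💻, which this range check splits into separate characters.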

Reply/Response Messages

  • Description: Send translation responses back to chat or as replies
  • Complexity: Low
  • Dependencies: Message API (discord.py message.reply() or channel.send())
  • Why Essential: Users need to see the translation output
  • Implementation: Format translations as clear, readable text messages; optionally use embeds for rich formatting

Command Interface

  • Description: Allow teaching the bot new emoji meanings via commands
  • Complexity: Medium
  • Dependencies: Command parser, permission checking
  • Why Essential: Enables the learning system that makes the bot scale without hardcoding every emoji
  • Implementation: Slash commands or hybrid prefix/slash commands (see "Message Handling Modes" below)

Per-Server Configuration

  • Description: Store server-specific settings (translation mode, custom emoji meanings)
  • Complexity: Medium
  • Dependencies: Database (SQLite for small scale, PostgreSQL for production)
  • Why Essential: Different servers have different preferences for how verbose translations should be
  • Implementation: Simple key-value store per guild_id (server ID)

Rate Limiting & Error Handling

  • Description: Gracefully handle Discord API limits and network errors
  • Complexity: Medium
  • Dependencies: Framework error handling
  • Why Essential: Prevents bot crashes and makes service reliable
  • Implementation: Exponential backoff for failed API calls, catch timeouts
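
The backoff idea can be sketched as a small retry helper. This is an illustration under assumptions: `with_backoff` and its parameters are hypothetical names, and the set of retryable exceptions would need to match whatever the chosen framework actually raises. discord.py already handles HTTP 429 rate limits internally, so a helper like this is mainly relevant to other transient failures.

```python
import asyncio
import random


async def with_backoff(call, *, retries=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(asyncio.TimeoutError, TimeoutError, ConnectionError)):
    """Await call(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the original error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter avoids many callers retrying at the same instant.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

Usage would look like `await with_backoff(lambda: message.reply(text))`, where the lambda creates a fresh coroutine for each attempt.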

Differentiating Features

These features set Vivi's translator apart from generic Discord bots:

Learning System (Rule-Based Foundation with Growth)

  • Description: Bot learns emoji meanings when explicitly taught by users
  • Complexity: Medium
  • Why This Differentiates: Makes the translator sustainable for Vivi's unique and evolving emoji language without requiring the developer to manually hardcode every sequence
  • Constraints:
    • No implicit learning: Never infer emoji meanings from context—require explicit teaching
    • Explicit confirmation: Always confirm back what was learned so users can verify correctness
    • Easy correction: Provide /correct :emoji: new_meaning for when meanings change
    • Community contribution: Allow anyone in the server to teach, not just Vivi
  • Database: Store emoji meanings with metadata (who taught it, when, how many uses)
  • Feedback loop: Track which translations are most helpful, identify ambiguous emojis

Emoji Sequence Detection

  • Description: Recognize sequences of emojis that form compound meanings (e.g., 👩‍💻📱 = "coding on phone")
  • Complexity: Medium
  • Why This Differentiates: Vivi likely uses emoji clusters; simple one-to-one mapping isn't enough
  • Implementation Approach:
    • Define sequence patterns (consecutive emojis with or without separators)
    • Store meanings for common sequences
    • Fall back to individual emoji meanings if no sequence match
    • Allow users to teach sequences via /teach :emoji1: :emoji2: compound_meaning
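
One way to realize "sequence first, then fall back to individual emoji" is greedy longest-match over the parsed emoji tokens. The sketch below assumes the message has already been split into a list of emoji tokens and that meanings are stored in a dict keyed by tuples of tokens. `translate_tokens` and `max_len` are hypothetical names.

```python
def translate_tokens(tokens, meanings, max_len=4):
    """Translate emoji tokens, preferring the longest taught sequence at each position."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate sequence first, shrinking down to a single emoji.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in meanings:
                out.append(meanings[key])
                i += size
                break
        else:
            # Nothing taught for this emoji, alone or in a sequence.
            out.append(f"[unknown {tokens[i]}]")
            i += 1
    return out
```

Greedy matching is simple and deterministic, which fits the rule-based goal, but it may not pick the split a human would when taught sequences overlap.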

Context-Aware Translation Formatting

  • Description: Vary translation output based on conversational context
  • Complexity: High
  • Why This Differentiates: Translations that feel natural, not robotic; adapts tone to channel context
  • Examples:
    • In #general: "Vivi is having fun 😊"
    • In #serious-discussion: "Expressing contentment and readiness"
    • Response variation: Sometimes expand, sometimes summarize based on recent context
  • Implementation: Store channel context settings; analyze surrounding messages for tone

PluralKit Integration

  • Description: Detect which alter is communicating via PluralKit webhook metadata
  • Complexity: High
  • Why This Differentiates: Essential for communities where Vivi shares an account with headmates; respects system identity
  • How It Works:
    • PluralKit creates webhooks with system/member metadata in username
    • Bot can recognize proxied messages by their webhook ID and, where needed, query PluralKit's API for that message to identify the system member
    • Enables "Vivi says: [translation]" vs generic bot response
  • Limitations: Requires PluralKit to be active in server; only works with PluralKit proxy format

Translation Quality Tracking

  • Description: Monitor which translations get positive vs negative reactions
  • Complexity: Medium
  • Why This Differentiates: Enables continuous improvement and identifies ambiguous emojis needing clarification
  • Implementation:
    • Store emoji usage statistics (frequency, accuracy ratings)
    • Provide admin dashboard /emoji-stats to see problematic emojis
    • Optionally flag ambiguous emojis for human review

Global Dictionary Option (Future Differentiator)

  • Description: Share emoji meanings across servers with opt-in
  • Complexity: High
  • Why This Differentiates: Could benefit other systems and communities; positions Vivi's system as a standard
  • Constraints:
    • Privacy-first: Only share if opted-in by server admin
    • Vivi-specific: Focus on Vivi's emoji system, not generic emoji translation
    • Conflict resolution: Last-teaching-wins or voting system for disagreements

Translation Approaches Analyzed

Rule-Based Pattern Matching

How it works: Explicit rules define emoji → text mappings. New mappings must be explicitly added.

Pros:

  • Fully deterministic and debuggable
  • Fast performance (no ML overhead)
  • Transparent to users (users understand why emoji means what)
  • Easy to correct mistakes (just update the rule)

Cons:

  • Requires explicit teaching for every emoji/sequence
  • Doesn't generalize to patterns the bot hasn't seen
  • Becomes unwieldy with hundreds of emoji combinations

Statistical/ML-Based Approach

How it works: Train a model on known emoji → text pairs, predict meanings for unknown emojis using similarity or context.

Pros:

  • Can handle novel emoji combinations through pattern inference
  • Generalizes from limited training data
  • Captures semantic relationships between emoji meanings

Cons:

  • Black-box behavior ("why did the bot translate it that way?")
  • Requires significant training data to work well
  • Harder to debug when translations are wrong
  • Users don't understand or trust the mechanism

Recommended: Hybrid Approach

Phase 1 (MVP): Rule-Based with Community Learning

  • Start with hardcoded emoji meanings for common sequences
  • Allow users to teach new emojis via /teach command
  • 100% transparent: users know exactly why each translation happens
  • Fast, reliable, debuggable

Phase 2 (Future): Statistical Fallback

  • Analyze emoji usage patterns in learned meanings
  • If emoji appears in multiple compounds, infer partial meanings
  • Use embedding-based similarity to suggest translations for unknown emoji sequences
  • Always show confidence scores; require confirmation before using inferred meanings

Phase 3 (Long-term): Continuous Learning

  • Track user corrections and positive/negative reactions
  • Retrain fallback model on accepted vs rejected translations
  • Identify consistently ambiguous emojis for human review
  • Adjust translation format based on what's most helpful per server

Why This Is Best:

  • Starts simple and user-friendly
  • Scales to hundreds of emojis through learning
  • Maintains trust through transparency
  • Enables improvement over time without requiring ML expertise

Message Handling Modes

Discord bots can respond to messages in different ways. Choose the approach that best serves Vivi's community:

Mode 1: Automatic Translation (On Every Message)

How it works: Bot automatically translates every message from Vivi (or messages with emoji content)

Pros:

  • Instant understanding without extra steps
  • No friction for casual readers
  • Good for channels where translation is the main purpose

Cons:

  • Can be noisy in mixed-audience channels
  • Spoils the "reading Vivi directly" experience for community members who prefer it
  • Uses Discord API quota faster

Best For: Translation channels (#vivi-translated or similar)

Mode 2: On-Demand Translation (Reaction or Command)

How it works: Users react with a specific emoji or use /translate command to request translation

Pros:

  • Keeps channels clean by default
  • Respects users who want to interpret emojis themselves
  • Lower API usage
  • More intentional interaction

Cons:

  • Extra step for users
  • May miss important messages if people forget to request
  • Less discoverable for new community members

Best For: Social channels where emoji is part of fun, not solely for understanding

Mode 3: Toggle (Per-Server Setting)

How it works: Server admins choose between automatic or on-demand via /settings

Pros:

  • Respects different community preferences
  • Maximizes adoption across servers with different cultures
  • Can differentiate channels (auto in #vivi-translations, on-demand in #general)

Cons:

  • More complex to implement
  • Users must learn about settings

Recommendation for V1: Implement Mode 3 with default Mode 1 (automatic). Let server admins customize via:

  • /settings translation-mode [auto|on-demand]
  • /settings translate-channels [list of channel IDs] for auto mode
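
The toggle logic reduces to a small predicate over the stored settings. In this sketch, `GuildSettings` and `should_auto_translate` are hypothetical names, and treating an empty channel list as "all channels" is an assumption, not something the settings commands above specify.

```python
from dataclasses import dataclass, field


@dataclass
class GuildSettings:
    translation_mode: str = "auto"  # "auto" or "on-demand"
    auto_channels: set = field(default_factory=set)  # channel IDs for auto mode


def should_auto_translate(settings: GuildSettings, channel_id: int) -> bool:
    """Decide whether a message in this channel is translated without a request."""
    if settings.translation_mode != "auto":
        return False
    # Assumption: no configured channels means auto mode applies everywhere.
    return not settings.auto_channels or channel_id in settings.auto_channels
```

On-demand requests (a reaction or /translate) would bypass this check entirely, since the user has asked explicitly.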

Implementation: Message Handling

  • Event: Discord on_message event
  • Filter: Check for emoji content using regex: <a?:[a-zA-Z0-9_]{1,32}:[0-9]{18,20}> (custom) and Unicode emoji
  • Action: Call translation engine, format output, send reply

Command Interface: Slash Commands vs Prefix

Recommendation: Use slash commands as primary, offer hybrid support

Why slash commands:

  • Modern Discord standard (easier discoverability)
  • No Message Content intent required for the commands themselves (better privacy); automatic translation still needs that intent to read message text
  • Built-in autocomplete for parameters
  • Better for /teach :emoji: meaning (emoji picker integration)

Key slash commands for Vivi:

  • /teach :emoji: meaning — Add emoji to dictionary
  • /translate [emoji-string] — Manually trigger translation
  • /what :emoji: — Look up emoji meaning
  • /correct :emoji: new_meaning — Fix a taught emoji
  • /settings — Server configuration
  • /emoji-stats — View accuracy/usage statistics

Learning Interface & Feedback System

The learning system is crucial for scalability and adoption. Here's the recommended approach:

Teaching Commands

/teach :emoji1: meaning
→ "Learned! 🎓 I'll translate :emoji1: as 'meaning'"

/teach :emoji1: :emoji2: compound meaning
→ "Learned! 🎓 I'll translate :emoji1: :emoji2: as 'compound meaning'"

Correction System

/correct :emoji: new_meaning
→ "Updated! ✏️ I'll now translate :emoji: as 'new_meaning' (was 'old meaning')"

Query System

/what :emoji:
→ "Emoji meaning for :emoji: is 'definition'\nTaught by @User on 2024-01-15\nUsed 47 times this month"
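
The teach, correct, and query flows above can be sketched with an in-memory store that returns the confirmation text shown. This is a framework-independent illustration: `EmojiDictionary` and `Entry` are hypothetical names, persistence is omitted, and the usage count is total uses, not uses this month.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Entry:
    meaning: str
    taught_by: str
    taught_at: datetime
    uses: int = 0


class EmojiDictionary:
    def __init__(self):
        self._entries = {}  # emoji key (single emoji or sequence) -> Entry

    def teach(self, key, meaning, user):
        self._entries[key] = Entry(meaning, user, datetime.now(timezone.utc))
        return f"Learned! 🎓 I'll translate {key} as '{meaning}'"

    def correct(self, key, new_meaning, user):
        old = self._entries.get(key)
        if old is None:
            return f"I don't know {key} yet. Use /teach first."
        # Keep the usage count; replace meaning, teacher, and timestamp.
        self._entries[key] = Entry(new_meaning, user, datetime.now(timezone.utc), old.uses)
        return f"Updated! ✏️ I'll now translate {key} as '{new_meaning}' (was '{old.meaning}')"

    def what(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return f"No meaning taught for {key} yet."
        return (f"Emoji meaning for {key} is '{entry.meaning}'\n"
                f"Taught by {entry.taught_by} on {entry.taught_at:%Y-%m-%d}\n"
                f"Used {entry.uses} times")
```

Returning the confirmation string from each method keeps the "always repeat back what was learned" rule in one place, whichever command framework calls it.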

Validation & Confirmation

  • Always repeat back what was learned
  • Show who taught it and when (build community recognition)
  • Optionally show confidence if from ML fallback
  • Highlight ambiguous meanings if same emoji has multiple teachings

Feedback Mechanisms

  1. Reaction-based: Users react with 👍/👎 to translations

    • Track which emojis get positive vs negative reactions
    • Identify consistently wrong translations for correction
  2. Correction Commands: /correct :emoji: new_meaning explicitly fixes errors

    • Creates audit trail of meaning changes
    • Enables tracking learning over time
  3. Conflict Resolution: If multiple teachings for same emoji

    • Show all known meanings with vote counts
    • Use most recent teaching by default, surface conflicts
    • Option: /disambiguate :emoji: choose_meaning to select preferred one
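
A small sketch of how reaction tallies and conflict resolution might fit together. The names (`Teaching`, `accuracy`, `resolve`) are hypothetical. It assumes each teaching carries a timestamp and positive/negative reaction counts, and it follows the "most recent teaching by default, surface conflicts" rule above.

```python
from dataclasses import dataclass


@dataclass
class Teaching:
    meaning: str
    taught_at: int  # e.g. a Unix timestamp
    up: int = 0     # positive reactions
    down: int = 0   # negative reactions


def accuracy(teaching):
    """Share of positive reactions in [0, 1], or None when there is no feedback yet."""
    total = teaching.up + teaching.down
    return teaching.up / total if total else None


def resolve(teachings):
    """Return (default_meaning, has_conflict) for one emoji's non-empty list of teachings."""
    latest = max(teachings, key=lambda t: t.taught_at)
    distinct_meanings = {t.meaning for t in teachings}
    return latest.meaning, len(distinct_meanings) > 1
```

Returning `None` instead of 0.0 for untested teachings keeps "no feedback" distinguishable from "consistently wrong" when flagging emojis for review.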

Best Practices

  • Community over individual: Encourage anyone to teach, not just Vivi or admins
  • Transparency: Always show source of taught meanings
  • Auditability: Maintain history of meaning changes
  • Disambiguation: Flag emojis with conflicting meanings early
  • Escalation: Provide /report-ambiguous :emoji: for admin review

Accessibility Considerations

Translation bots serve accessibility functions, so they must be accessible themselves:

Text Accessibility

  • Plain text output: Always provide plain text translations, not just embeds
  • No emoji-only responses: Never respond with just emoji; always include text
  • Clear language: Use simple, direct language in translations (avoid jargon)
  • Consistent formatting: Same emoji always translates the same way (aids screen reader prediction)

Discord Accessibility

  • Slash commands: Easier for keyboard navigation than prefix commands
  • Accessible embeds: If using embeds for formatted output:
    • Include plain text alternative in message content
    • Avoid using embeds for critical information
    • Note: Discord embeds cannot have alt-text for images—only use text-based embeds

Screen Reader Compatibility

  • Emoji descriptions: Include what each emoji is called (e.g., "woman technologist emoji")
  • Sequence clarity: When translating compound sequences, explain the combination
  • No hidden information: Never put crucial meaning in embed footers or nested fields

Examples of Accessible Responses

Good:

Vivi: 👩‍💻📱
Bot: (woman technologist emoji, mobile phone emoji)
Translates to: "coding on phone" or "responding to work messages"

Bad:

Vivi: 👩‍💻📱
Bot: [embed with only emoji in footer, no text explanation]

Implementation

  • Test output with screen readers (NVDA, JAWS)
  • Provide an alternative text format via a verbose option on /translate for complex sequences
  • Include emoji names in debug/development output
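
The "Good" response format can be produced by a small formatter. In this sketch, `emoji_name` and `accessible_reply` are hypothetical names. Single-codepoint emoji fall back to their Unicode character names via the standard `unicodedata` module, while multi-codepoint sequences such as 👩‍💻 need a name supplied by the caller, because the standard library does not carry names for emoji sequences.

```python
import unicodedata


def emoji_name(emoji, known_names=None):
    """Return a readable name for an emoji, preferring caller-supplied names."""
    known_names = known_names or {}
    if emoji in known_names:
        return known_names[emoji]
    if len(emoji) == 1:
        # Unicode character name, e.g. "MOBILE PHONE" -> "mobile phone".
        return unicodedata.name(emoji, "unknown").lower()
    return "unknown"


def accessible_reply(emojis, translation, known_names=None):
    """Build a plain-text reply that names each emoji before the translation."""
    names = ", ".join(f"{emoji_name(e, known_names)} emoji" for e in emojis)
    return f'({names})\nTranslates to: "{translation}"'
```

The output is plain message text, so it can be sent as message content whether or not an embed is attached.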

Anti-Features (What NOT to Build)

These features sound good but should be avoided:

Anti-Feature 1: Persistent Context Learning

  • What it is: Bot infers emoji meanings from conversation context without explicit teaching
  • Why not:
    • Creates non-deterministic behavior (same emoji means different things in different contexts)
    • Impossible to debug or correct
    • Users don't understand the bot's logic
    • High error rate leads to loss of trust
  • Better approach: Explicit /teach commands only

Anti-Feature 2: Cross-Discord Emoji Translation

  • What it is: Translate emojis the same way across all Discord servers
  • Why not:
    • Emoji meanings are highly personal and context-dependent
    • Vivi's system is specific to her community
    • Would bloat the dictionary with conflicting meanings
    • Not scalable for other users' emoji systems
  • Better approach: Per-server dictionaries with optional public sharing for Vivi's specific system

Anti-Feature 3: Real-Time Chat Simulation

  • What it is: Bot attempts to continue Vivi's conversation or generate new emoji sequences
  • Why not:
    • Out of scope (translation, not generation)
    • Risk of impersonation
    • Confusion about what Vivi actually said vs what bot generated
    • Community prefers Vivi's authentic communication
  • Better approach: Stick to translating Vivi's actual messages

Anti-Feature 4: Full NLP Context Analysis

  • What it is: Use complex NLP to understand message context and vary translations
  • Why not:
    • Over-engineering for the problem
    • Adds maintenance burden
    • Makes behavior unpredictable
    • Initial rule-based approach is more trustworthy
  • Better approach: Simple context hints (channel type, time of day) with explicit teaching

Configuration Per-Server

Different servers may have different translation preferences:

Server Settings to Store

guild_id: 12345678
settings:
  translation_mode: "auto"  # or "on-demand"
  auto_channels: [chan_id_1, chan_id_2]  # channels where auto-translation is enabled
  verbose_translations: false  # expand vs summarize
  show_confidence: false  # show certainty for learned meanings
  allow_community_teaching: true  # can non-mods teach emoji meanings?
  default_language: "en"  # future: support other languages
  include_emoji_names: true  # include emoji name for accessibility
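
A sketch of how stored per-guild overrides might be merged with these defaults and validated. `DEFAULTS` mirrors the keys listed above. `resolve_settings` is a hypothetical helper, and rejecting unknown keys is a design assumption.

```python
DEFAULTS = {
    "translation_mode": "auto",
    "auto_channels": [],
    "verbose_translations": False,
    "show_confidence": False,
    "allow_community_teaching": True,
    "default_language": "en",
    "include_emoji_names": True,
}

VALID_MODES = ("auto", "on-demand")


def resolve_settings(stored: dict) -> dict:
    """Overlay a guild's stored overrides on the defaults, rejecting bad values."""
    unknown = set(stored) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    merged = {**DEFAULTS, **stored}
    if merged["translation_mode"] not in VALID_MODES:
        raise ValueError(f"translation_mode must be one of {VALID_MODES}")
    return merged
```

Storing only the overrides means a change to a default reaches every guild that has not explicitly chosen otherwise.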

Database Schema

Emoji_Meanings:
  id: UUID
  guild_id: int
  emoji_unicode: str (or custom_emoji_id)
  meaning: str
  taught_by_user_id: int
  taught_at: timestamp
  usage_count: int
  accuracy_rating: float (0-1, from reactions)
  is_sequence: bool
  confidence: float (1.0 for taught, < 1.0 for inferred)

Guild_Settings:
  guild_id: int
  translation_mode: str
  auto_channels: json
  ...

Emoji_Statistics:
  emoji_id: UUID
  guild_id: int
  total_uses: int
  positive_reactions: int
  negative_reactions: int
  conflicts_count: int
  last_updated: timestamp
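
As an illustration, the Emoji_Meanings table could be expressed in SQLite roughly as follows. This is a sketch, not the project's schema: the column `emoji_key`, the helper names, and the UNIQUE constraint (which implements last-teaching-wins and leaves out conflicting teachings and the accuracy rating) are assumptions. The upsert syntax requires SQLite 3.24 or later.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS emoji_meanings (
    id                TEXT PRIMARY KEY,            -- UUID stored as text
    guild_id          INTEGER NOT NULL,
    emoji_key         TEXT NOT NULL,               -- Unicode emoji, custom emoji ID, or sequence
    meaning           TEXT NOT NULL,
    taught_by_user_id INTEGER NOT NULL,
    taught_at         TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    usage_count       INTEGER NOT NULL DEFAULT 0,
    is_sequence       INTEGER NOT NULL DEFAULT 0,  -- boolean as 0/1
    confidence        REAL NOT NULL DEFAULT 1.0,   -- 1.0 for taught meanings
    UNIQUE (guild_id, emoji_key)
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def teach(conn, guild_id, emoji_key, meaning, user_id):
    """Insert a meaning, or overwrite the existing one for this guild and emoji."""
    conn.execute(
        """INSERT INTO emoji_meanings (id, guild_id, emoji_key, meaning, taught_by_user_id)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT (guild_id, emoji_key) DO UPDATE SET
               meaning = excluded.meaning,
               taught_by_user_id = excluded.taught_by_user_id,
               taught_at = CURRENT_TIMESTAMP""",
        (str(uuid.uuid4()), guild_id, emoji_key, meaning, user_id),
    )
    conn.commit()


def lookup(conn, guild_id, emoji_key):
    row = conn.execute(
        "SELECT meaning FROM emoji_meanings WHERE guild_id = ? AND emoji_key = ?",
        (guild_id, emoji_key),
    ).fetchone()
    return row[0] if row else None
```

Keeping an audit trail of meaning changes, as recommended elsewhere in this document, would need a separate history table that this sketch leaves out.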

Database Choice Recommendation

  • Development/Small Scale (< 100 servers): SQLite via an async driver such as aiosqlite
  • Production (100+ servers): PostgreSQL with connection pooling
  • Real-Time Stats (future): Redis for caching popular emoji definitions

Configuration Commands

/settings translation-mode [auto|on-demand]
/settings auto-channels [#channel-list]
/settings verbose [true|false]
/settings allow-teaching [true|false]

Implementation Roadmap

MVP (Phase 1): Foundation

  • Features: Message detection, basic emoji dictionary, /teach command, on-demand translation
  • Complexity: Medium
  • Timeline: 2-3 weeks
  • Core: Rule-based translations only; no ML

V1 (Phase 2): Polished & Learning-Ready

  • Add: Per-server settings, /what query, /correct command, reaction-based feedback
  • Add: Accessibility improvements (emoji names, plain text)
  • Add: Basic statistics (/emoji-stats)
  • Timeline: 3-4 weeks
  • Focus: Community testing and meaning refinement

V2 (Phase 3): Smart Fallback (Optional)

  • Add: Statistical fallback for unknown emoji sequences
  • Add: Confidence scores for inferred meanings
  • Add: Emoji conflict detection and disambiguation
  • Add: Optional global dictionary sharing
  • Timeline: 4-6 weeks
  • Focus: Scalability and reduced manual maintenance

Summary

The Vivi Speech Translator should be built as a hybrid system:

  1. Start with deterministic, rule-based translation that's fully transparent and debuggable
  2. Enable community learning via simple /teach commands that grow the dictionary organically
  3. Provide feedback mechanisms (reactions, corrections) to improve accuracy over time
  4. Remain focused on Vivi's specific emoji system, not generic emoji translation
  5. Prioritize accessibility since translation itself is an accessibility feature
  6. Leave room for future ML enhancement but don't build it until needed

The core differentiator is not sophisticated AI, but intentional design for learning and community participation. The bot becomes more valuable as more people teach it, creating network effects that benefit the whole community.

Feature scope at a glance (⏭️ marks deferred items):

  • Message detection and emoji parsing
  • Rule-based translation with /teach command
  • Per-server configuration (auto vs on-demand mode)
  • Correction and query commands (/what, /correct)
  • Reaction-based feedback (👍/👎)
  • Accessible output (plain text, emoji names)
  • PluralKit integration (if Vivi's community uses it)
  • ⏭️ Statistics dashboard (Phase 2)
  • ⏭️ ML fallback (Phase 3+, if needed)
