# RFC 0002: Persistence Strategy for Battery-Efficient State Management

**Status:** Draft
**Authors:** Sienna
**Created:** 2025-11-15
**Related:** RFC 0001 (CRDT Sync Protocol)

## Abstract

This RFC defines a persistence strategy that balances data durability with battery efficiency for mobile platforms (iPad). The core challenge: Bevy runs at 60fps and generates continuous state changes, but we can't write to SQLite on every frame without destroying battery life and flash storage.

## The Problem

**Naive approach (bad)**:

```rust
fn sync_to_db_system(query: Query<&NetworkedEntity, Changed<Transform>>) {
    for entity in query.iter() {
        db.execute("UPDATE components SET data = ? WHERE entity_id = ?", ...)?;
        // This runs 60 times per second!
        // iPad battery: 💀
    }
}
```

**Why this is terrible**:
- SQLite writes trigger `fsync()` syscalls (flush to physical storage)
- Each `fsync()` on iOS can take 5-20ms and drains battery significantly
- At 60fps with multiple entities, we'd be doing hundreds of disk writes per second
- Flash wear: mobile devices have limited write cycles
- User moves object around → hundreds of unnecessary writes of intermediate positions

## Requirements

1. **Survive crashes**: If the app crashes, the user shouldn't lose more than a few seconds of work
2. **Battery efficient**: Minimize disk I/O, especially `fsync()` calls
3. **Flash-friendly**: Reduce write amplification on mobile storage
4. **Low latency**: Persistence shouldn't block rendering or input
5. **Recoverable**: On startup, we should be able to reconstruct recent state

## Categorizing Data by Persistence Needs

Not all data is equal. We need to categorize by how critical immediate persistence is:

### Tier 1: Critical State (Persist Immediately)

**What**: State that's hard or impossible to reconstruct if lost
- User-created entities (the fact that they exist)
- Operation log entries (for CRDT sync)
- Vector clock state (for causality tracking)
- Document metadata (name, creation time, etc.)

**Why**: These are the "source of truth" - if we lose them, data is gone

**Strategy**: Write to database within ~1 second of creation, but still batched

### Tier 2: Derived State (Defer and Batch)

**What**: State that can be reconstructed or is constantly changing
- Entity positions during drag operations
- Transform components (position, rotation, scale)
- UI state (selected items, viewport position)
- Temporary drawing strokes in progress

**Why**: These change rapidly and the intermediate states aren't valuable

**Strategy**: Batch writes, flush every 5-10 seconds or on specific events

### Tier 3: Ephemeral State (Never Persist)

**What**: State that only matters during the current session
- Remote peer cursors
- Presence indicators (who's online)
- Network connection status
- Frame-rate metrics

**Why**: These are meaningless after restart

**Strategy**: Keep in-memory only (Bevy resources, not components)

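One way to make the tier policy explicit in code - a hypothetical sketch, not part of the current implementation - is an enum that maps each tier to its maximum flush delay:

```rust
use std::time::Duration;

/// Persistence tiers from this RFC (names are illustrative, not final API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PersistenceTier {
    Critical,  // Tier 1: oplog, vector clocks, entity existence
    Derived,   // Tier 2: transforms, UI state
    Ephemeral, // Tier 3: cursors, presence
}

impl PersistenceTier {
    /// Maximum time a change in this tier may sit unflushed.
    /// `None` means the data is never persisted.
    fn max_flush_delay(self) -> Option<Duration> {
        match self {
            PersistenceTier::Critical => Some(Duration::from_secs(1)),
            PersistenceTier::Derived => Some(Duration::from_secs(10)),
            PersistenceTier::Ephemeral => None,
        }
    }
}
```

Tagging each `PersistenceOp` with a tier like this would let the write buffer compute a single earliest deadline instead of special-casing "critical" writes.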
## Write Strategy: The Three-Buffer System

We use a three-tier approach to minimize disk writes while maintaining durability:

### Layer 1: In-Memory Dirty Tracking (0ms latency)

Bevy change detection marks components as dirty, but we don't write immediately. Instead, we maintain a dirty set:

```rust
#[derive(Resource)]
struct DirtyEntities {
    // Entities with changes not yet in write buffer
    entities: HashSet<Uuid>,
    components: HashMap<Uuid, HashSet<String>>, // entity → dirty component types
    last_modified: HashMap<Uuid, Instant>,      // when was it last changed
}
```

**Update frequency**: Every frame (cheap - just memory operations)

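The per-frame bookkeeping is pure memory writes. A dependency-free sketch of how a system might record a change (using `u64` in place of `Uuid` so the example stands alone):

```rust
use std::collections::{HashMap, HashSet};
use std::time::Instant;

type EntityId = u64; // stand-in for Uuid in this sketch

#[derive(Default)]
struct DirtyEntities {
    entities: HashSet<EntityId>,
    components: HashMap<EntityId, HashSet<String>>,
    last_modified: HashMap<EntityId, Instant>,
}

impl DirtyEntities {
    /// Record that `component` on `entity` changed this frame.
    /// Pure in-memory bookkeeping: no I/O, safe to call at 60fps.
    fn mark(&mut self, entity: EntityId, component: &str) {
        self.entities.insert(entity);
        self.components
            .entry(entity)
            .or_default()
            .insert(component.to_string());
        self.last_modified.insert(entity, Instant::now());
    }
}
```

Marking the same entity+component twice in one flush window is idempotent - the set absorbs it, which is what makes this layer cheap.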
### Layer 2: Write Buffer (100ms-1s batching)

Periodically (every 100ms-1s), we collect dirty entities and prepare a write batch:

```rust
#[derive(Resource)]
struct WriteBuffer {
    // Pending writes not yet committed to SQLite
    pending_operations: Vec<PersistenceOp>,
    last_flush: Instant,
}

enum PersistenceOp {
    UpsertEntity { id: Uuid, data: EntityData },
    UpsertComponent { entity_id: Uuid, component_type: String, data: Vec<u8> },
    LogOperation { node_id: NodeId, seq: u64, op: Vec<u8> },
    UpdateVectorClock { node_id: NodeId, counter: u64 },
}
```

**Update frequency**: Every 100ms-1s (configurable based on battery level)

**Strategy**: Accumulate operations in memory, then batch-write them

### Layer 3: SQLite with WAL Mode (5-10s commit interval)

The write buffer is flushed to SQLite, but we don't call `fsync()` immediately. Instead, we use WAL mode and control checkpoint timing:

```sql
-- Enable Write-Ahead Logging
PRAGMA journal_mode = WAL;

-- Don't auto-checkpoint on every transaction
PRAGMA wal_autocheckpoint = 0;

-- Synchronous = NORMAL (fsync WAL on commit, but not every write)
PRAGMA synchronous = NORMAL;
```

**Update frequency**: Manual checkpoints every 5-10 seconds (or on specific events)

## Flush Events: When to Force Persistence

Certain events require immediate persistence (within 1 second):

### 1. Entity Creation

When the user creates a new entity, we need to persist its existence quickly:
- Add to write buffer immediately
- Trigger flush within 1 second

### 2. Major User Actions

Actions that represent "savepoints" in the user's mental model:
- Finishing a drawing stroke (stroke start → immediate, intermediate points → batched, stroke end → flush)
- Deleting entities
- Changing document metadata
- Undo/redo operations

### 3. Application State Transitions

State changes that might precede app termination:
- App going to background (iOS `applicationWillResignActive`)
- Low memory warning
- User explicitly saving (if we have a save button)
- Switching documents/workspaces

### 4. Network Events

Sync protocol events that need persistence:
- Receiving operation log entries from peers
- Vector clock updates (every 5 operations or 5 seconds, whichever comes first)

### 5. Periodic Background Flush

Even if no major events happen:
- Flush every 10 seconds during active use
- Flush every 30 seconds when idle (no user input for >1 minute)

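The idle/active rule above can be sketched as a pure function (names are illustrative; the real system would combine this with battery state):

```rust
use std::time::Duration;

/// Periodic background flush interval: 10s during active use,
/// 30s once the user has been idle for more than a minute.
fn background_flush_interval(time_since_last_input: Duration) -> Duration {
    if time_since_last_input > Duration::from_secs(60) {
        Duration::from_secs(30) // idle
    } else {
        Duration::from_secs(10) // active
    }
}
```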
## Battery-Adaptive Flushing

Different flush strategies based on battery level:

```rust
fn get_flush_interval(battery_level: f32, is_charging: bool) -> Duration {
    if is_charging {
        Duration::from_secs(5) // Aggressive - power available
    } else if battery_level > 0.5 {
        Duration::from_secs(10) // Normal
    } else if battery_level > 0.2 {
        Duration::from_secs(30) // Conservative
    } else {
        Duration::from_secs(60) // Very conservative - low battery
    }
}
```

**On iOS**: Use `UIDevice.current.batteryLevel` and `UIDevice.current.batteryState`

## SQLite Optimizations for Mobile

### Transaction Batching

Group multiple writes into a single transaction:

```rust
async fn flush_write_buffer(buffer: &WriteBuffer, db: &Connection) -> Result<()> {
    let tx = db.transaction()?;

    // All writes in one transaction
    for op in &buffer.pending_operations {
        match op {
            PersistenceOp::UpsertEntity { id, data } => {
                tx.execute("INSERT OR REPLACE INTO entities (...) VALUES (...)", ...)?;
            }
            PersistenceOp::UpsertComponent { entity_id, component_type, data } => {
                tx.execute("INSERT OR REPLACE INTO components (...) VALUES (...)", ...)?;
            }
            // ...
        }
    }

    tx.commit()?; // Single fsync for entire batch
    Ok(())
}
```

**Impact**: 100 individual writes = 100 fsyncs. 1 transaction with 100 writes = 1 fsync.

### WAL Mode Checkpoint Control

```rust
async fn checkpoint_wal(db: &Connection) -> Result<()> {
    // Manually checkpoint WAL to database file
    db.execute("PRAGMA wal_checkpoint(PASSIVE)", [])?;
    Ok(())
}
```

**PASSIVE checkpoint**: Doesn't block readers, syncs when possible

**When to checkpoint**: Every 10 seconds, or when WAL exceeds 1MB

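The checkpoint rule can be captured in a small predicate (sketch; function and parameter names are assumptions):

```rust
use std::time::Duration;

/// Decide whether to run a PASSIVE checkpoint, per the rule above:
/// every 10 seconds, or as soon as the WAL exceeds 1 MB.
fn should_checkpoint(since_last: Duration, wal_size_bytes: u64) -> bool {
    since_last >= Duration::from_secs(10) || wal_size_bytes > 1_000_000
}
```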
### Index Strategy

Be selective about indexes - they increase write cost:

```sql
-- Only index what we actually query frequently
CREATE INDEX idx_components_entity ON components(entity_id);
CREATE INDEX idx_oplog_node_seq ON operation_log(node_id, sequence_number);

-- DON'T index everything just because we can
-- Every index = extra writes on every INSERT/UPDATE
```

### Page Size Optimization

```sql
-- Larger page size = fewer I/O operations for sequential writes
-- Default is 4KB, but 8KB or 16KB can be better for mobile
PRAGMA page_size = 8192;
```

**Caveat**: Must be set before the database is created (or VACUUM to rebuild)

## Recovery Strategy

What happens if the app crashes before flush?

### What We Lose

**Worst case**: Up to 10 seconds of component updates (positions, transforms)

**What we DON'T lose**:
- Entity existence (flushed within 1 second of creation)
- Operation log entries (flushed with vector clock updates)
- Any data from before the last checkpoint

### Recovery on Startup

```mermaid
graph TB
    A[App Starts] --> B[Open SQLite]
    B --> C{Check WAL file}
    C -->|WAL exists| D[Recover from WAL]
    C -->|No WAL| E[Load from main DB]
    D --> F[Load entities from DB]
    E --> F
    F --> G[Load operation log]
    G --> H[Rebuild vector clock]
    H --> I[Connect to gossip]
    I --> J[Request sync from peers]
    J --> K[Fill any gaps via anti-entropy]
    K --> L[Fully recovered]
```

**Key insight**: Even if we lose local state, gossip sync repairs it. Peers send us missing operations.

### Crash Detection

On startup, detect if the previous session crashed:

```sql
CREATE TABLE session_state (
    key TEXT PRIMARY KEY,
    value TEXT
);

-- On startup, check if previous session closed cleanly
SELECT value FROM session_state WHERE key = 'clean_shutdown';

-- If not found or 'false', we crashed
-- Trigger recovery procedures

-- Then mark the new session as not-yet-clean;
-- flipped to 'true' when we shut down or background cleanly
INSERT OR REPLACE INTO session_state VALUES ('clean_shutdown', 'false');
```

## Platform-Specific Concerns

### iOS / iPadOS

**Background app suspension**: iOS aggressively suspends apps. We have ~5 seconds when moving to background:

```rust
// When app moves to background: force a flush, then mark clean shutdown
async fn handle_background_event(db: &Connection) -> Result<()> {
    // Force immediate flush
    flush_write_buffer().await?;
    checkpoint_wal(db).await?;

    // Mark clean shutdown
    db.execute("INSERT OR REPLACE INTO session_state VALUES ('clean_shutdown', 'true')", [])?;
    Ok(())
}
```

**Low Power Mode**: Detect and reduce flush frequency:

```swift
// iOS-specific detection
if ProcessInfo.processInfo.isLowPowerModeEnabled {
    set_flush_interval(Duration::from_secs(60));
}
```

### Desktop (macOS/Linux/Windows)

More relaxed constraints:
- Battery life less critical on plugged-in desktops
- Can use more aggressive flush intervals (every 5 seconds)
- Larger WAL sizes acceptable (up to 10MB before checkpoint)

## Monitoring & Metrics

Track these metrics to tune persistence:

```rust
struct PersistenceMetrics {
    // Write volume
    total_writes: u64,
    bytes_written: u64,

    // Timing
    flush_count: u64,
    avg_flush_duration: Duration,
    checkpoint_count: u64,
    avg_checkpoint_duration: Duration,

    // WAL health
    wal_size_bytes: u64,
    max_wal_size_bytes: u64,

    // Recovery
    crash_recovery_count: u64,
    clean_shutdown_count: u64,
}
```

**Alerts**:
- Flush duration >50ms (disk might be slow or overloaded)
- WAL size >5MB (checkpoint more frequently)
- Crash recovery rate >10% (need more aggressive flushing)

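The alert thresholds can be checked mechanically. A sketch, assuming metrics are sampled periodically (names illustrative):

```rust
use std::time::Duration;

/// Evaluate the alert thresholds above against collected metrics.
fn persistence_alerts(
    avg_flush_duration: Duration,
    wal_size_bytes: u64,
    crash_recovery_count: u64,
    clean_shutdown_count: u64,
) -> Vec<&'static str> {
    let mut alerts = Vec::new();
    if avg_flush_duration > Duration::from_millis(50) {
        alerts.push("flush slow: disk might be slow or overloaded");
    }
    if wal_size_bytes > 5 * 1024 * 1024 {
        alerts.push("WAL too large: checkpoint more frequently");
    }
    let total_sessions = crash_recovery_count + clean_shutdown_count;
    if total_sessions > 0 && crash_recovery_count * 10 > total_sessions {
        alerts.push("crash recovery rate >10%: flush more aggressively");
    }
    alerts
}
```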
## Write Coalescing: Deduplication

When the same entity is modified multiple times before flush, we only keep the latest:

```rust
fn add_to_write_buffer(op: PersistenceOp, buffer: &mut WriteBuffer) {
    if let PersistenceOp::UpsertComponent { entity_id, component_type, .. } = &op {
        // Remove any existing pending write for this entity+component
        buffer.pending_operations.retain(|existing_op| {
            !matches!(existing_op,
                PersistenceOp::UpsertComponent {
                    entity_id: e_id,
                    component_type: c_type,
                    ..
                } if e_id == entity_id && c_type == component_type
            )
        });
    }

    // Add the new one (latest state); other op kinds are appended as-is
    buffer.pending_operations.push(op);
}
```

**Impact**: User drags object for 5 seconds @ 60fps = 300 transform updates → coalesced to 1 write

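A stripped-down demonstration of the coalescing effect (types pared down from `PersistenceOp` so the example stands alone):

```rust
/// Simplified pending write: repeated updates to the same
/// (entity, component) pair collapse to a single entry.
struct Upsert {
    entity_id: u64,
    component_type: &'static str,
    data: Vec<u8>,
}

fn coalesce(pending: &mut Vec<Upsert>, op: Upsert) {
    // Drop any stale write for the same entity+component, keep the newest.
    pending.retain(|p| !(p.entity_id == op.entity_id && p.component_type == op.component_type));
    pending.push(op);
}
```

Feeding 300 per-frame transform updates through `coalesce` leaves exactly one pending write holding the final value.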
## Persistence vs Sync: Division of Responsibility

Important distinction:

**Persistence layer** (this RFC):
- Writes to local SQLite
- Optimized for durability and battery life
- Only cares about local state survival

**Sync layer** (RFC 0001):
- Broadcasts operations via gossip
- Maintains operation log for anti-entropy
- Ensures eventual consistency across peers

**Key insight**: These operate independently. An operation can be:
1. Logged to operation log (for sync) - happens immediately
2. Applied to ECS (for rendering) - happens immediately
3. Persisted to SQLite (for durability) - happens on flush schedule

If local state is lost due to delayed flush, the sync layer repairs it from peers.

## Configuration Schema

Expose configuration for tuning:

```toml
[persistence]
# Base flush interval (may be adjusted by battery level)
flush_interval_secs = 10

# Max time to defer critical writes (entity creation, etc.)
critical_flush_delay_ms = 1000

# WAL checkpoint interval
checkpoint_interval_secs = 30

# Max WAL size before forced checkpoint
max_wal_size_mb = 5

# Adaptive flushing based on battery
battery_adaptive = true

# Flush intervals per battery tier
[persistence.battery_tiers]
charging = 5
high = 10     # >50%
medium = 30   # 20-50%
low = 60      # <20%

# Platform overrides
[persistence.ios]
background_flush_timeout_secs = 5
low_power_mode_interval_secs = 60
```

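On the Rust side, this configuration might mirror a struct like the following (hypothetical sketch; a real implementation would likely derive `serde::Deserialize` from the TOML):

```rust
use std::time::Duration;

/// Rust-side mirror of the `[persistence]` table above (names illustrative).
struct PersistenceConfig {
    flush_interval_secs: u64,
    battery_adaptive: bool,
    tier_charging_secs: u64,
    tier_high_secs: u64,   // >50%
    tier_medium_secs: u64, // 20-50%
    tier_low_secs: u64,    // <20%
}

impl PersistenceConfig {
    /// Effective flush interval for the current battery state,
    /// matching the tier boundaries documented in the TOML.
    fn effective_flush_interval(&self, battery_level: f32, is_charging: bool) -> Duration {
        if !self.battery_adaptive {
            return Duration::from_secs(self.flush_interval_secs);
        }
        let secs = if is_charging {
            self.tier_charging_secs
        } else if battery_level > 0.5 {
            self.tier_high_secs
        } else if battery_level > 0.2 {
            self.tier_medium_secs
        } else {
            self.tier_low_secs
        };
        Duration::from_secs(secs)
    }
}
```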
## Example System Implementation

```rust
fn persistence_system(
    mut dirty: ResMut<DirtyEntities>,
    mut write_buffer: ResMut<WriteBuffer>,
    db: Res<DatabaseConnection>,
    battery: Res<BatteryStatus>,
    query: Query<(Entity, &NetworkedEntity, &Transform /* , other components */)>,
) {
    // Step 1: Check if it's time to collect dirty entities
    let flush_interval = get_flush_interval(battery.level, battery.is_charging);

    if write_buffer.last_flush.elapsed() < flush_interval {
        return; // Not time yet
    }

    // Step 2: Collect dirty entities into write buffer
    for entity_uuid in &dirty.entities {
        if let Some((_entity, _net_entity, transform)) =
            query.iter().find(|(_, ne, _)| ne.network_id == *entity_uuid)
        {
            // Serialize component (skip on failure; systems can't use `?`)
            let Ok(transform_data) = bincode::serialize(transform) else { continue; };

            // Add to write buffer (coalescing happens here)
            write_buffer.add(PersistenceOp::UpsertComponent {
                entity_id: *entity_uuid,
                component_type: "Transform".to_string(),
                data: transform_data,
            });
        }
    }

    // Step 3: Flush write buffer to SQLite (async, non-blocking)
    if !write_buffer.pending_operations.is_empty() {
        let ops = std::mem::take(&mut write_buffer.pending_operations);
        let db = db.clone(); // assumes the connection handle is cheaply cloneable

        // Spawn blocking task to write to SQLite off the main thread
        spawn_blocking(move || flush_to_sqlite(&ops, &db));

        write_buffer.last_flush = Instant::now();
    }

    // Step 4: Clear dirty tracking (changes are now in write buffer/SQLite)
    dirty.entities.clear();
}
```

## Trade-offs and Decisions

### Why WAL Mode?

**Alternatives**:
- DELETE mode (traditional journaling)
- MEMORY mode (no durability)

**Decision**: WAL mode because:
- Better write concurrency (readers don't block writers)
- Fewer `fsync()` calls (only on checkpoint)
- Better crash recovery (WAL can be replayed)

### Why Not Use a Dirty Flag on Components?

We could mark components with a `#[derive(Dirty)]` flag, but:
- Bevy's `Changed<T>` already gives us change detection for free
- A separate dirty flag adds memory overhead
- We'd need to manually clear flags after persistence

**Decision**: Use Bevy's change detection + our own dirty tracking resource

### Why Not Use a Separate Persistence Thread?

We could run SQLite writes on a dedicated thread:

**Pros**: Never blocks main thread

**Cons**: More complex synchronization, harder to guarantee flush order

**Decision**: Use `spawn_blocking` from the async runtime (Tokio). Simpler, good enough.

## Open Questions

1. **Write ordering**: Do we need to guarantee operation log entries are persisted before entity state? Or can they be out of order?
2. **Compression**: Should we compress component data before writing to SQLite? Trade-off: CPU vs I/O
3. **Memory limits**: On an iPad with 2GB RAM, how large can the write buffer grow before we force a flush?

## Success Criteria

We'll know this is working when:
- [ ] App can run for 30 minutes with <5% battery drain attributed to persistence
- [ ] Crash recovery loses <10 seconds of work
- [ ] No perceptible frame drops during flush operations
- [ ] SQLite file size grows linearly with user data, not explosively
- [ ] WAL checkpoints complete in <100ms

## Implementation Phases

1. **Phase 1**: Basic in-memory dirty tracking + batched writes
2. **Phase 2**: WAL mode + manual checkpoint control
3. **Phase 3**: Battery-adaptive flushing
4. **Phase 4**: iOS background handling
5. **Phase 5**: Monitoring and tuning based on metrics

## References

- [SQLite WAL Mode](https://www.sqlite.org/wal.html)
- [iOS Background Execution](https://developer.apple.com/documentation/uikit/app_and_environment/scenes/preparing_your_ui_to_run_in_the_background)
- [Bevy Change Detection](https://docs.rs/bevy/latest/bevy/ecs/change_detection/)

# RFCs

Request for Comments (RFCs) for major design decisions in the Lonni project.

## Active RFCs

- [RFC 0001: CRDT Synchronization Protocol over iroh-gossip](./0001-crdt-gossip-sync.md) - Draft
- [RFC 0002: Persistence Strategy for Battery-Efficient State Management](./0002-persistence-strategy.md) - Draft

## RFC Process

1. **Draft**: Initial proposal, open for discussion
2. **Review**: Team reviews and provides feedback
3. **Accepted**: Approved for implementation
4. **Implemented**: Design has been built
5. **Superseded**: Replaced by a newer RFC

RFCs are living documents - they can be updated as we learn during implementation.

## When to Write an RFC

Write an RFC when:
- Making architectural decisions that affect multiple parts of the system
- Choosing between significantly different approaches
- Introducing new protocols or APIs
- Making breaking changes

Don't write an RFC for:
- Small bug fixes
- Minor refactors
- Isolated feature additions
- Experimental prototypes

## RFC Format

- **Narrative first**: Tell the story of why and how
- **Explain trade-offs**: What alternatives were considered?
- **API examples**: Show how it would be used (not full implementations)
- **Open questions**: What's still unclear?
- **Success criteria**: How do we know it works?