This RFC proposes a gossip-based CRDT synchronization protocol for building eventually-consistent multiplayer collaborative applications. The protocol enables 2-5 concurrent users to collaborate in real-time on shared entities and resources with support for presence, cursors, selections, and drawing operations.
## Motivation
We want to build a multiplayer collaborative system where:
- Two people can draw/edit together and see each other's changes in real-time
- Changes work offline and sync automatically when reconnected
- There's no "server" - all peers are equal
- State is eventually consistent across all peers
- We can see cursors, presence, selections, etc.
### Why iroh-gossip?
**Message bus > point-to-point**: In a 3-person session, gossip naturally handles the "multicast" pattern better than explicit peer-to-peer connections. Each message gets epidemic-broadcast to all peers efficiently.
**Built-in mesh networking**: iroh-gossip handles peer discovery and maintains the mesh topology automatically. We just subscribe to a topic.
**QUIC underneath**: Low latency, multiplexed, encrypted by default.
### Why CRDTs?
**The core problem**: When two people edit the same thing simultaneously, how do we merge the changes without coordination?
**CRDT solution**: Operations are designed to be commutative (order doesn't matter), associative (grouping doesn't matter), and idempotent (re-applying is safe). This means concurrent operations can be applied in any order, and duplicate deliveries are harmless: every replica converges to the same state without coordination.
### Set CRDT (OR-Set)
**For**: Collections where items can be added/removed (selections, tags, participants)
**Concept**: Each add gets a unique token. Removes specify which add tokens to remove. This prevents the "add-remove" problem where concurrent operations conflict.
**Example**: If Alice adds "red" and Bob removes "red" concurrently, we need to know if Bob saw Alice's add or not.
```rust
// From the crdts library; the actor type is simplified to String here
use crdts::{CmRDT, Orswot};

let mut set: Orswot<String, String> = Default::default();

// Adding: derive an add-context from our current view, then apply the op
let add_ctx = set.read().derive_add_ctx("alice".to_string());
let add_op = set.add("red".to_string(), add_ctx);
set.apply(add_op);

// Removing: the remove-context records which adds we observed, so a
// concurrent add from another peer survives this remove
let rm_ctx = set.contains(&"red".to_string()).derive_rm_ctx();
let rm_op = set.rm("red".to_string(), rm_ctx);
set.apply(rm_op);
```
### Map CRDT
**For**: Nested key-value data (metadata, properties, nested objects)
**Concept**: Each key can have its own CRDT semantics. The map handles add/remove of keys.
```rust
// From crdts library
use crdts::{CmRDT, Map, Orswot};

// Actor simplified to String; an Orswot value follows the crdts Map
// documentation (each key gets its own CRDT semantics)
let mut map: Map<String, Orswot<String, String>, String> = Map::new();

// Derive an add-context, then produce an op on the value CRDT under a key
let add_ctx = map.len().derive_add_ctx("alice".to_string());
let op = map.update("tags".to_string(), add_ctx, |set, ctx| {
    set.add("red".to_string(), ctx)
});
map.apply(op);
```
### Anti-Entropy Sync Example
When two peers exchange vector clocks, each side can compute exactly which operations the other is missing and reply with only those deltas:
```mermaid
sequenceDiagram
PeerB->>PeerB: Compare clocks<br/>{A:10, B:5} vs {A:8, B:7}
Note over PeerB: PeerA missing: B[6,7]<br/>PeerB missing: A[9,10]
PeerB->>OpLog_B: Query operations<br/>where node_id=B AND seq IN (6,7)
OpLog_B-->>PeerB: Return ops
PeerB->>Gossip: MissingDeltas<br/>[B:6, B:7]
Gossip-->>PeerA: Receive
PeerA->>PeerA: Apply missing operations
PeerA->>OpLog_A: Store in log
Note over PeerA: Now has B:6, B:7
PeerA->>OpLog_A: Query operations<br/>where node_id=A AND seq IN (9,10)
OpLog_A-->>PeerA: Return ops
PeerA->>Gossip: MissingDeltas<br/>[A:9, A:10]
Gossip-->>PeerB: Receive
PeerB->>PeerB: Apply missing operations
Note over PeerA,PeerB: Clocks now match: {A:10, B:7}
```
**Why this works**:
- Vector clocks tell us exactly what's missing
- Operation log preserves operations for catchup
- Anti-entropy repairs any message loss
- Eventually all peers converge
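The clock comparison in the sequence above can be expressed as a pure function over two vector clocks (a minimal sketch; `NodeId` is simplified to a `String` and `missing_ranges` is a hypothetical helper name):

```rust
use std::collections::HashMap;

type NodeId = String;
type VectorClock = HashMap<NodeId, u64>;

/// For each node, the inclusive range of sequence numbers that `theirs`
/// is missing relative to `ours`.
fn missing_ranges(ours: &VectorClock, theirs: &VectorClock) -> Vec<(NodeId, u64, u64)> {
    let mut missing: Vec<(NodeId, u64, u64)> = ours
        .iter()
        .filter_map(|(node, &our_seq)| {
            let their_seq = theirs.get(node).copied().unwrap_or(0);
            (our_seq > their_seq).then(|| (node.clone(), their_seq + 1, our_seq))
        })
        .collect();
    missing.sort(); // deterministic order for logging and tests
    missing
}
```

Running this both ways with the clocks from the diagram yields `A[9,10]` for PeerB and `B[6,7]` for PeerA.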
## Operation Log: The Anti-Entropy Engine
The operation log is critical for ensuring eventual consistency. It's an append-only log of all CRDT operations that allows peers to catch up after network partitions or disconnections.
### Structure
```rust
struct OperationLogEntry {
    // Primary key: (node_id, sequence_number)
    node_id: NodeId,
    sequence_number: u64,
    // The actual operation
    operation: Delta, // EntityDelta or ResourceDelta
    // When this operation was created
    timestamp: DateTime<Utc>,
    // How many bytes (for storage metrics)
    size_bytes: u32,
}
```
### Storage Schema (SQLite)
```sql
CREATE TABLE operation_log (
    -- Composite primary key ensures uniqueness per node
    node_id TEXT NOT NULL,
    sequence_number INTEGER NOT NULL,
    -- Entity this operation targets (needed for per-entity GC queries)
    entity_id TEXT NOT NULL,
    -- Operation type for filtering
    operation_type TEXT NOT NULL, -- 'EntityDelta' | 'ResourceDelta'
    -- The serialized Delta (bincode)
    operation_blob BLOB NOT NULL,
    -- Timestamp for pruning
    created_at INTEGER NOT NULL,
    -- Size tracking
    size_bytes INTEGER NOT NULL,
    PRIMARY KEY (node_id, sequence_number)
);
-- Index for time-based queries (pruning old operations)
CREATE INDEX idx_operation_log_timestamp
ON operation_log(created_at);
-- Index for node lookups (common query pattern)
CREATE INDEX idx_operation_log_node
ON operation_log(node_id, sequence_number);
```
### Write Path
Every time we generate an operation:
```mermaid
graph LR
A[Generate Op] --> B[Increment Vector Clock]
B --> C[Serialize Operation]
C --> D[Write to OpLog Table]
D --> E[Broadcast via Gossip]
style D fill:#bfb,stroke:#333,stroke-width:2px
```
**Atomicity**: The vector clock increment and operation log write should be atomic. If we fail after incrementing the clock but before logging, we'll have a gap in the sequence.
**Solution**: Use SQLite transaction:
```sql
BEGIN TRANSACTION;
UPDATE vector_clock SET counter = counter + 1 WHERE node_id = ?;
INSERT INTO operation_log (...) VALUES (...);
COMMIT;
```
### Query Path (Anti-Entropy)
When we receive a SyncRequest with a vector clock that's behind ours:
```mermaid
graph TB
A[Receive SyncRequest] --> B{Compare Clocks}
B -->|Behind| C[Calculate Missing Ranges]
C --> D[Query OpLog]
D --> E[Serialize MissingDeltas]
E --> F[Broadcast to Peer]
style D fill:#bfb,stroke:#333,stroke-width:2px
```
**Query example**:
```sql
-- Peer is missing operations from node 'alice' between 5-10
SELECT operation_blob
FROM operation_log
WHERE node_id = 'alice'
AND sequence_number BETWEEN 5 AND 10
ORDER BY sequence_number ASC;
```
**Batch limits**: If the gap is too large (e.g., more than 1,000 operations), sending FullState is likely cheaper than streaming individual deltas.
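That cutoff can be made explicit when planning a sync response (a sketch; the threshold value and the `SyncPlan` type are placeholders, not an existing API):

```rust
/// What to send back for a SyncRequest (hypothetical message types).
#[derive(Debug, PartialEq)]
enum SyncPlan {
    /// (node_id, from_seq, to_seq) ranges to read from the operation log.
    MissingDeltas(Vec<(String, u64, u64)>),
    /// The gap is too large; a full-state snapshot is cheaper.
    FullState,
}

const MAX_DELTA_BATCH: u64 = 1000;

fn plan_sync(missing: Vec<(String, u64, u64)>) -> SyncPlan {
    // Total number of operations across all missing ranges
    let total_ops: u64 = missing.iter().map(|(_, from, to)| to - from + 1).sum();
    if total_ops > MAX_DELTA_BATCH {
        SyncPlan::FullState
    } else {
        SyncPlan::MissingDeltas(missing)
    }
}
```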
### Pruning Strategy
We can't keep operations forever. Three pruning strategies:
#### 1. Time-based Pruning (Simple)
Delete operations older than 24 hours:
```sql
DELETE FROM operation_log
WHERE created_at < (strftime('%s', 'now') - 86400) * 1000;
```
**Pros**: Simple, predictable storage
**Cons**: Peers offline for >24 hours need full resync
#### 2. Vector Clock-based Pruning (Smart)
Only delete operations that **all known peers have seen**:
```mermaid
graph TB
A[Collect All Peer Clocks] --> B[Find Minimum Seq Per Node]
B --> C[Delete Below Minimum]
style C fill:#fbb,stroke:#333,stroke-width:2px
```
**Example**:
- Alice's clock: `{alice: 100, bob: 50}`
- Bob's clock: `{alice: 80, bob: 60}`
- Minimum: `{alice: 80, bob: 50}`
- Can delete: Alice's ops 1-79, Bob's ops 1-49
**Pros**: Maximally efficient storage
**Cons**: Complex, requires tracking peer clocks
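The per-node minimum from the example is just an element-wise minimum over all peers' clocks (a sketch; assumes every clock carries an entry for each known node, with a missing entry treated as "seen nothing"):

```rust
use std::collections::HashMap;

type VectorClock = HashMap<String, u64>;

/// Element-wise minimum over all peers' clocks: operations at or below this
/// floor have been seen by everyone and are safe to prune.
fn prune_floor(clocks: &[VectorClock]) -> VectorClock {
    let mut floor = VectorClock::new();
    if let Some((first, rest)) = clocks.split_first() {
        for (node, &seq) in first {
            // If any peer has no entry for this node, it has seen nothing
            // from it, so the floor drops to 0.
            let min = rest
                .iter()
                .try_fold(seq, |acc, c| c.get(node).map(|&s| acc.min(s)));
            floor.insert(node.clone(), min.unwrap_or(0));
        }
    }
    floor
}
```

With Alice's `{alice: 100, bob: 50}` and Bob's `{alice: 80, bob: 60}`, the floor is `{alice: 80, bob: 50}`, matching the example above.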
#### 3. Hybrid (Recommended)
- Delete ops older than 24 hours AND seen by all peers
- Keep at least last 1000 operations per node (safety buffer)
```sql
WITH peer_min AS (
    SELECT node_id, MIN(counter) AS min_seq
    FROM vector_clock
    GROUP BY node_id
)
DELETE FROM operation_log
WHERE (node_id, sequence_number) IN (
    SELECT ol.node_id, ol.sequence_number
    FROM operation_log ol
    JOIN peer_min pm ON ol.node_id = pm.node_id
    WHERE ol.sequence_number < pm.min_seq
      -- Older than 24 hours
      AND ol.created_at < (strftime('%s', 'now') - 86400) * 1000
      -- Keep a safety buffer of the last 1000 operations per node
      AND ol.sequence_number < (
          SELECT MAX(sequence_number) - 1000
          FROM operation_log ol2
          WHERE ol2.node_id = ol.node_id
      )
);
```
### Operation Log Cleanup Methodology
The cleanup methodology defines **when**, **how**, and **what** to delete while maintaining correctness.
#### Triggering Conditions
Cleanup triggers (whichever comes first):
- **Time-based**: Every 1 hour
- **Storage threshold**: When log exceeds 50 MB
- **Operation count**: When exceeding 100,000 ops
- **Manual trigger**: Via admin API
**Why multiple triggers?**: Different workloads have different patterns. Active sessions hit operation count first, idle sessions hit time first.
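The trigger check itself is a disjunction over the thresholds listed above (a sketch; the struct and field names are placeholders):

```rust
struct CleanupPolicy {
    max_age_secs: u64,  // time-based: 1 hour
    max_log_bytes: u64, // storage threshold: 50 MB
    max_op_count: u64,  // operation count: 100,000
}

struct LogStats {
    secs_since_cleanup: u64,
    log_bytes: u64,
    op_count: u64,
}

/// Whichever condition trips first triggers a cleanup pass
/// (a manual admin trigger would bypass this check entirely).
fn cleanup_due(policy: &CleanupPolicy, stats: &LogStats) -> bool {
    stats.secs_since_cleanup >= policy.max_age_secs
        || stats.log_bytes >= policy.max_log_bytes
        || stats.op_count >= policy.max_op_count
}
```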
### Conflict Resolution: Last-Write-Wins
**Decision**: LWW is simple and works for most use cases. Timestamp conflicts are rare. For critical data (where "last write wins" isn't good enough), use proper CRDTs.
### Why Operation Log?
**Alternative**: State-based sync only (send full state periodically)
**Decision**: Hybrid - operations for real-time (small), full state for joins (comprehensive), operation log for anti-entropy (reliable).
## Security and Identity
Even for a private, controlled deployment, the protocol must define security boundaries to prevent accidental data corruption and ensure system integrity.
### Authentication: Cryptographic Node Identity
**Problem**: As written, NodeId could be any string. A buggy client or malicious peer could forge messages with another node's ID, corrupting vector clocks and CRDT state.
**Solution**: Use cryptographic identities built into iroh.
```rust
use iroh::NodeId; // iroh's NodeId is already a cryptographic public key
// Each peer has a keypair
// NodeId is derived from the public key
// All gossip messages are implicitly authenticated by QUIC's transport layer
```
**Key properties**:
- NodeId = public key (or hash thereof)
- iroh-gossip over QUIC already provides message authenticity (peer signatures)
- No additional signing needed - the transport guarantees messages come from the claimed NodeId
**Implementation**: Just use iroh's `NodeId` type consistently instead of `String`.
### Authorization: Session Membership
**Problem**: How do we ensure only authorized peers can join a session and access the data?
**Solution**: Pre-shared topic secret (simplest for 2-5 users).
```rust
// Generate a session secret when creating a new document/workspace
use iroh_gossip::proto::TopicId;
use blake3;
let session_secret = generate_random_bytes(32); // Share out-of-band
let topic = TopicId::from_bytes(*blake3::hash(&session_secret).as_bytes());
// To join, a peer must know the session_secret
// They derive the same TopicId and subscribe
```
**Access control**:
- Secret is shared via invite link, QR code, or manual entry
- Without the secret, you can't derive the TopicId, so you can't join the gossip
- For better UX, encrypt the secret into a shareable URL: `lonni://join/<base64-secret>`
**Revocation**: If someone leaves the session and shouldn't have access:
- Generate a new secret and topic
- Broadcast a "migration" message on the old topic
- All peers move to the new topic
- The revoked peer is left on the old topic (no data flow)
### Encryption: Application-Level Data Protection
**Question**: Is the gossip topic itself private, or could passive observers see it?
**Answer**: iroh-gossip over QUIC is transport-encrypted, but:
- Any peer in the gossip network can see message content
- If you don't trust all peers (or future peers after revocation), add application-level encryption
**Proposal** (for future consideration):
```rust
// Encrypt FullState and EntityDelta messages with session key derived from session_secret
let encrypted_payload = encrypt_aead(
    &session_secret,
    &bincode::serialize(&sync_msg)?,
)?;
```
**For v1**: Skip application-level encryption. Transport encryption + topic secret is sufficient for a private, trusted peer group.
### Resource Limits: Preventing Abuse
**Problem**: What if a buggy client creates 100,000 entities or floods with operations?
**Solution**: Validate per-node limits before applying a remote delta, and drop operations that exceed them:
```rust
// Sketch: `entities_by_owner`, `operation_rate_exceeds_limit`, and the
// `limits` fields are assumed helpers on the sync state, not an existing API
fn apply_remote_delta(&mut self, delta: &Delta) -> Result<(), RateLimitError> {
    // Check entity count for the sending node
    let node_entity_count = self.entities_by_owner(&delta.source_node).count();
    if node_entity_count >= self.limits.max_entities_per_node {
        return Err(RateLimitError::TooManyEntities);
    }
    // Check operation rate (requires tracking ops/sec per node)
    if self.operation_rate_exceeds_limit(&delta.source_node) {
        return Err(RateLimitError::TooManyOperations);
    }
    // Apply the operation...
    Ok(())
}
```
**Handling violations**:
- Log a warning
- Ignore the operation (don't apply)
- Optionally: display a warning to the user ("Peer X is sending too many operations")
**Why this works**: CRDTs allow us to safely ignore operations. Other peers will still converge correctly.
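The `operation_rate_exceeds_limit` check referenced above can be backed by a simple fixed-window counter (a sketch; timestamps are passed in explicitly so the logic stays deterministic and testable):

```rust
use std::collections::HashMap;

/// Fixed-window rate limiter: at most `max_ops` per `window_secs` per node.
struct RateTracker {
    window_secs: u64,
    max_ops: u64,
    // node -> (current window slot, ops seen in that window)
    windows: HashMap<String, (u64, u64)>,
}

impl RateTracker {
    fn new(window_secs: u64, max_ops: u64) -> Self {
        Self { window_secs, max_ops, windows: HashMap::new() }
    }

    /// Record one operation at `now` (seconds); returns false if over limit.
    fn allow(&mut self, node: &str, now: u64) -> bool {
        let slot = now / self.window_secs;
        let entry = self.windows.entry(node.to_string()).or_insert((slot, 0));
        if entry.0 != slot {
            *entry = (slot, 0); // new window, reset the counter
        }
        entry.1 += 1;
        entry.1 <= self.max_ops
    }
}
```

A fixed window is coarser than a sliding window, but it needs only one counter per node, which is plenty for a 2-5 peer session.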
## Entity Deletion and Garbage Collection
**Problem**: Without deletion, the dataset grows indefinitely. Users can't remove unwanted entities.
**Solution**: Tombstones (the standard CRDT deletion pattern).
### Deletion Protocol
**Strategy**: Despawn from ECS, keep tombstone in database only.
When an entity is deleted:
1. **Despawn from Bevy's ECS** - The entity is removed from the world entirely using `despawn_recursive()`
2. **Store tombstone in SQLite** - Record the deletion (entity ID, timestamp, node ID) in a `deleted_entities` table
3. **Remove from network map** - Clean up the UUID → Entity mapping
4. **Broadcast deletion operation** - Send a "Deleted" component operation via gossip so other peers delete it too
**Why this works:**
- **Zero footguns**: Deleted entities don't exist in ECS, so queries can't accidentally find them. No need to remember `Without<Deleted>` on every query.
- **Better performance**: ECS doesn't waste time iterating over tombstones during queries.
- **Tombstones live where they belong**: In the persistent database, not cluttering the real-time ECS.
**Receiving deletions from peers:**
When we receive a deletion operation:
1. Check if the entity exists in our ECS (via the network map)
2. If yes: despawn it, record tombstone, remove from map
3. If no: just record the tombstone (peer is catching us up on something we never spawned)
**Receiving operations for deleted entities:**
When we receive an operation for an entity UUID that's not in our ECS map:
1. Check the database for a tombstone
2. If tombstone exists and operation timestamp < deletion timestamp: **Ignore** (stale operation)
3. If tombstone exists and operation timestamp > deletion timestamp: **Resurrection conflict** - log warning, potentially notify user
4. If no tombstone: Spawn the entity normally (we're just learning about it for the first time)
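Steps 1-4 reduce to a small decision function (a sketch; timestamps are milliseconds and the names are hypothetical):

```rust
#[derive(Debug, PartialEq)]
enum Disposition {
    Stale,                // tombstone exists, op predates the deletion: ignore
    ResurrectionConflict, // tombstone exists, op postdates the deletion: warn
    SpawnNew,             // no tombstone: first time we hear of this entity
}

/// Decide what to do with an operation for an entity not in our ECS map.
fn classify_unknown_entity_op(op_ts: u64, tombstone_ts: Option<u64>) -> Disposition {
    match tombstone_ts {
        Some(deleted_at) if op_ts < deleted_at => Disposition::Stale,
        // Equal timestamps are treated as conflicts to be safe
        Some(_) => Disposition::ResurrectionConflict,
        None => Disposition::SpawnNew,
    }
}
```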
### Garbage Collection: Operation Log Only
**Critical rule**: Never delete entity rows from the database. Only garbage collect their operation log entries.
Tombstones (the `Deleted` component and the entity row) must be kept **forever**. They're tiny (just metadata) and prevent the resurrection problem.
**What gets garbage collected**:
- Operation log entries for deleted entities (once all peers have seen the deletion)
- Blob data referenced by deleted entities (see Blob Garbage Collection section)
**What never gets deleted**:
- Entity rows (even with `Deleted` component)
- The `Deleted` component itself
**Operation log cleanup for deleted entities**:
```sql
-- During operation log cleanup, we CAN delete operations for deleted entities
-- once all peers have acknowledged the deletion
WITH peer_min AS (
    SELECT node_id, MIN(counter) AS min_seq
    FROM vector_clock
    GROUP BY node_id
),
deleted_entities AS (
    SELECT e.network_id, e.owner_node_id, c.updated_at AS deleted_at
    FROM entities e
    JOIN components c ON e.network_id = c.entity_id
    WHERE c.component_type = 'Deleted'
      AND c.updated_at < (strftime('%s', 'now') - 86400) * 1000 -- Deleted >24h ago
)
DELETE FROM operation_log
WHERE (node_id, sequence_number) IN (
    SELECT ol.node_id, ol.sequence_number
    FROM operation_log ol
    JOIN deleted_entities de ON ol.entity_id = de.network_id
    JOIN peer_min pm ON ol.node_id = pm.node_id
    WHERE ol.sequence_number < pm.min_seq -- All peers have seen this operation
);
```
**Why keep entity rows forever?**:
- Prevents resurrection conflicts (see next section)
- Tombstone size is negligible (~100 bytes per entity)
- Allows us to detect late operations and handle them correctly
### Edge Case: Resurrection
What if a peer deletes an entity, but a partitioned peer still has the old version?
```mermaid
sequenceDiagram
participant Alice
participant Bob
Note over Alice,Bob: Both have entity E
Alice->>Alice: Delete E (add Deleted component)
Note over Bob: Network partition (offline)
Alice->>Alice: Garbage collect E
Note over Alice: E is gone from database
Bob->>Bob: Edit E (change position)
Note over Bob: Reconnects
Bob->>Alice: EntityDelta for E
Alice->>Alice: Entity E doesn't exist!<br/>Recreate from delta? Or ignore?
```
**Solution**: Never garbage collect entities. Only garbage collect their operation log entries. Keep the entity row and the `Deleted` component forever (they're tiny - just metadata).
This way, if a late operation arrives for a deleted entity, we can detect the conflict instead of silently resurrecting the entity.
## Binary Data: iroh-blobs
Large payloads such as images don't belong in gossip messages; they're stored content-addressed in iroh-blobs and referenced by hash from components. Fetching is a single call (a sketch; the surrounding function and the `Blobs` handle are illustrative):
```rust
async fn load_image(blobs: &Blobs, hash: Hash) -> anyhow::Result<()> {
    // Fetch the full blob content by its hash
    let image_data = blobs.read_to_bytes(hash).await?;
    // ... render in Bevy
    Ok(())
}
```
### Blob Discovery
iroh-blobs has built-in peer discovery and transfer. As long as peers are connected (via gossip or direct connections), they can find and fetch blobs from each other.
**Optimization**: When broadcasting the metadata, include a "providers" list:
```rust
struct ImageComponent {
    blob_hash: String,
    providers: Vec<NodeId>, // Which peers currently have this blob
    width: u32,
    height: u32,
}
```
This helps peers find the blob faster.
### Threshold
Define a threshold for what goes through blobs vs gossip: small payloads (say, under ~16 KB) can be inlined in gossip messages, while anything larger is stored in iroh-blobs and referenced by hash. The exact cutoff is tunable.
### Blob Garbage Collection
**Problem**: When an entity with a blob reference is deleted, the blob becomes orphaned. Without GC, blobs accumulate indefinitely.
**Solution**: Mark-and-sweep garbage collection using vector clock synchronization.
#### How It Works
**Mark Phase** - Scan database for all blob hashes currently referenced by non-deleted entities. Build a `HashSet<BlobHash>` of active blobs.
**Sweep Phase** - List all blobs in iroh-blobs. For each blob not in the active set:
- Check if all peers have seen the deletion (using vector clock + blob_references table)
- Apply 24-hour safety window
- If safe, delete the blob and record bytes freed
**Safety coordination**: The critical challenge is knowing when all peers have seen a blob deletion. We solve this by:
1. Recording blob add/remove events in a `blob_references` table with (node_id, sequence_number)
2. Using vector clock logic: only delete if all peers' sequence numbers exceed the removal operation's sequence
3. Adding a 24-hour time window for additional safety
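These three safety conditions combine into a single predicate (a sketch; the clock representation and function name are assumptions, with `removal_node`/`removal_seq` identifying the remove event from `blob_references`):

```rust
use std::collections::HashMap;

type VectorClock = HashMap<String, u64>;

const SAFETY_WINDOW_SECS: u64 = 24 * 60 * 60;

/// A blob may be swept only if every known peer's clock has advanced past
/// the removal operation AND the removal is older than the safety window.
fn blob_gc_safe(
    removal_node: &str,
    removal_seq: u64,
    removal_ts_secs: u64,
    now_secs: u64,
    peer_clocks: &[VectorClock],
) -> bool {
    let all_peers_saw_removal = peer_clocks
        .iter()
        .all(|c| c.get(removal_node).copied().unwrap_or(0) >= removal_seq);
    let past_safety_window =
        now_secs.saturating_sub(removal_ts_secs) >= SAFETY_WINDOW_SECS;
    all_peers_saw_removal && past_safety_window
}
```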
#### Database Schema
Track blob lifecycle events:
```sql
CREATE TABLE blob_references (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    blob_hash TEXT NOT NULL,
    entity_id TEXT NOT NULL,
    event_type TEXT NOT NULL, -- 'added' | 'removed'
    node_id TEXT NOT NULL,
    sequence_number INTEGER NOT NULL,
    timestamp INTEGER NOT NULL
);
```
**Optimization**: Maintain reference counts for efficiency:
```sql
CREATE TABLE blob_refcount (
    blob_hash TEXT PRIMARY KEY,
    ref_count INTEGER NOT NULL DEFAULT 0,
    last_accessed INTEGER NOT NULL,
    total_size_bytes INTEGER NOT NULL
);
```
This avoids full component scans - GC only checks blobs where `ref_count = 0`.
#### Scheduling
Run every 4 hours (less frequent than operation log cleanup, since it's more expensive). Process max 100 blobs per run to avoid blocking.
#### Key Edge Cases
**Offline peer**: Bob is offline. Alice adds and deletes a 50MB blob. GC runs and deletes it. Bob reconnects and gets FullState - no blob reference, no problem.
**Concurrent add/delete**: Alice adds entity E with blob B. Bob deletes E. CRDT resolves to deleted. Both peers have orphaned blob B, which gets cleaned up after the 24-hour safety period.
**Why it works**: Deletion tombstones win (LWW timestamp), and vector clock coordination ensures both peers converge and safely GC the blob.
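The tombstone-wins resolution is plain last-write-wins with a deterministic tiebreak; a minimal sketch (the node-id tiebreak is an assumption so all peers agree on identical timestamps):

```rust
/// LWW merge over (timestamp_ms, node_id, payload) triples.
/// Ties on timestamp break deterministically by node id, so every
/// peer picks the same winner.
fn lww_merge<'a, T>(
    a: (u64, &'a str, &'a T),
    b: (u64, &'a str, &'a T),
) -> (u64, &'a str, &'a T) {
    if (a.0, a.1) >= (b.0, b.1) { a } else { b }
}
```

In the concurrent add/delete scenario above, the deletion carries the later timestamp, so both peers resolve to the tombstone.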
## Schema Evolution and Data Migration
**Problem**: The version envelope only handles message format changes, not data semantics changes.