How we built a real-time collaboration platform for 5K+ workspaces and 25K users, with CRDT-based document sync, presence, permissions, and offline sync at under 100ms sync latency.
CRDT · Presence · Offline Sync
A productivity startup wanted to compete with Notion and Google Docs — documents and chat in one place, with real-time collaboration. The MVP used polling and had 2–3 second lag when multiple users typed. Offline support was non-existent.
They wanted sub-100ms sync for cursor presence and document edits, granular permissions (view vs edit vs comment), and offline sync so users could work without connectivity and merge changes when back online — without conflicts or data loss.
When two users edited the same paragraph simultaneously, their changes had to merge correctly. Plain last-write-wins caused overwrites, so we needed OT or a CRDT for conflict-free merging.
Users needed to see who else was viewing a document and where their cursors were. High-frequency updates (cursor position) couldn't flood the server. Throttling and batching were critical.
Workspaces had folders and documents with inherited permissions. View, edit, comment, and admin levels. Sharing links with expiry. Permissions had to be checked on every operation without slowing sync.
Users on trains or unreliable networks needed to keep working. Local edits had to be queued and merged when reconnected. Divergent edits required conflict resolution — CRDT could handle this automatically.
We built the platform on WebSocket connections, CRDT-based document sync, and Redis for presence broadcast. Documents use Yjs, a CRDT library, for conflict-free merging. Presence updates are throttled and broadcast via Redis pub/sub. Permissions are cached and checked at the sync layer. The client's offline queue replays on reconnect.
A CRDT (Yjs) was the right choice for document sync. Unlike typical OT systems, it doesn't need a central server to transform operations: each client can merge independently. That enables offline editing. Local changes are stored on the client, and when it reconnects, they are merged with the remote state. We persist document state to PostgreSQL on a debounced schedule and on room disconnect.
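The offline behaviour described above can be sketched as a small client-side queue. This is a minimal illustration, not the production code: `OfflineQueue`, `Update`, and the `send` callback are hypothetical names, and the payload is a plain string standing in for an encoded document update.

```typescript
// Hypothetical shape of a queued local edit (assumption, not the real wire format).
type Update = { seq: number; payload: string };

class OfflineQueue {
  private pending: Update[] = [];
  private online = false;
  private nextSeq = 0;
  private send: (u: Update) => void;

  constructor(send: (u: Update) => void) {
    this.send = send;
  }

  // Record a local edit: send it right away when connected, otherwise buffer it.
  push(payload: string): void {
    const update: Update = { seq: this.nextSeq++, payload };
    if (this.online) {
      this.send(update);
    } else {
      this.pending.push(update);
    }
  }

  // Called on connect/disconnect. Going online replays the buffer in original order.
  setOnline(online: boolean): void {
    this.online = online;
    if (!online) return;
    const toReplay = this.pending;
    this.pending = [];
    toReplay.forEach((u) => this.send(u));
  }

  // Number of edits still waiting for a connection.
  get size(): number {
    return this.pending.length;
  }
}
```

Edits pushed while offline accumulate in `pending` and are flushed by `setOnline(true)`. With a CRDT, the replayed updates can be applied in any order and more than once without changing the result, so the queue does not need to coordinate with other clients.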
Evaluated OT vs CRDT. Chose Yjs for CRDT-based sync. Designed permission model and workspace hierarchy. Defined WebSocket room and message schemas.
Built WebSocket server with room routing. Integrated Yjs for document sync. Implemented presence with throttling. Built workspace, doc, and permission APIs.
Implemented client-side offline queue and merge on reconnect. Built chat with Redis pub/sub. Added permission checks at sync layer. Persisted document state to PostgreSQL.
Optimized for <100ms sync. Load-tested with 500 concurrent doc editors. Phased rollout — beta users first, then general availability. Monitored latency and conflict rates.
Choosing between a CRDT (Yjs) and OT is a fundamental decision. OT typically relies on a central authority to transform operations, which works well for always-online editing but gets complex once offline support is involved. A CRDT merges without a server: any two replicas that have seen the same updates converge to the same state. The tradeoff is that CRDT state grows with edit history; we rely on Yjs's garbage collection to keep that growth in check. For documents under 1MB, it works well.
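To make the convergence property concrete, here is a toy state-based sequence CRDT. It is deliberately not Yjs's algorithm: `ToyText` and its fractional `pos` scheme are assumptions chosen for brevity, and floating-point midpoints would run out of precision in real use. It only shows that merging by set union gives the same text on every replica, whatever the merge order.

```typescript
// Toy state-based sequence CRDT. NOT how Yjs works internally; illustration only.
type Elem = { id: string; pos: number; ch: string };

class ToyText {
  private clientId: string;
  private counter = 0;
  private elems = new Map<string, Elem>(); // grow-only set of inserted characters
  private tombstones = new Set<string>();  // grow-only set of deleted element ids

  constructor(clientId: string) {
    this.clientId = clientId;
  }

  // Live characters ordered by position, with the unique id as a deterministic tie-break.
  private visible(): Elem[] {
    return Array.from(this.elems.values())
      .filter((e) => !this.tombstones.has(e.id))
      .sort((a, b) => a.pos - b.pos || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  }

  // Insert `ch` at a visible index by picking a position between its neighbours.
  insert(index: number, ch: string): void {
    const v = this.visible();
    const left = index > 0 ? v[index - 1].pos : 0;
    const right = index < v.length ? v[index].pos : 1;
    const id = `${this.clientId}:${this.counter++}`;
    this.elems.set(id, { id, pos: (left + right) / 2, ch });
  }

  // Deleting only records a tombstone, so state grows with edit history.
  delete(index: number): void {
    const target = this.visible()[index];
    if (target) this.tombstones.add(target.id);
  }

  // Merge is a union of both sets: commutative, associative, and idempotent.
  merge(other: ToyText): void {
    other.elems.forEach((e, id) => this.elems.set(id, e));
    other.tombstones.forEach((id) => this.tombstones.add(id));
  }

  toString(): string {
    return this.visible().map((e) => e.ch).join("");
  }
}
```

Two replicas that edit independently and then call `merge` on each other end up with identical text. The tombstone set never shrinks in this toy version, which is the history-growth cost that garbage collection addresses in a real library.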
Presence is high-frequency and must be throttled. We send cursor updates at most every 100ms per user. Presence is ephemeral — stored in Redis with 30s TTL. When a user disconnects, we don't persist; they simply disappear from the presence list. Redis pub/sub broadcasts to all clients in the document room.
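The throttle and TTL rules above can be sketched in a few lines. This is an in-memory stand-in under stated assumptions: `CursorThrottle` and `PresenceStore` are hypothetical names, a `Map` replaces Redis, and the clock is passed in explicitly so the behaviour is easy to test. The 100ms and 30s defaults come from the text.

```typescript
// Allows at most one cursor update per user per interval (100ms, per the text).
class CursorThrottle {
  private lastSent = new Map<string, number>();
  private intervalMs: number;

  constructor(intervalMs: number = 100) {
    this.intervalMs = intervalMs;
  }

  // Returns true if this user's update may be sent at time `nowMs`.
  tryAcquire(userId: string, nowMs: number): boolean {
    const last = this.lastSent.get(userId);
    if (last !== undefined && nowMs - last < this.intervalMs) return false;
    this.lastSent.set(userId, nowMs);
    return true;
  }
}

// In-memory stand-in for Redis keys with a TTL (30s, per the text).
class PresenceStore {
  private entries = new Map<string, { cursor: number; expiresAt: number }>();
  private ttlMs: number;

  constructor(ttlMs: number = 30_000) {
    this.ttlMs = ttlMs;
  }

  // Each presence update refreshes the user's expiry.
  heartbeat(userId: string, cursor: number, nowMs: number): void {
    this.entries.set(userId, { cursor, expiresAt: nowMs + this.ttlMs });
  }

  // Users whose entries have not expired; expired entries are dropped, never persisted.
  active(nowMs: number): string[] {
    const live: string[] = [];
    this.entries.forEach((entry, userId) => {
      if (entry.expiresAt <= nowMs) {
        this.entries.delete(userId);
      } else {
        live.push(userId);
      }
    });
    return live.sort();
  }
}
```

A user who stops sending updates drops out of `active` once the TTL passes, which matches the "simply disappear" behaviour with no explicit cleanup on disconnect.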
Permissions are checked at the WebSocket layer before accepting document updates. We cache permission results in Redis (short TTL) to avoid hitting PostgreSQL on every keystroke. Folder inheritance is computed on permission grant/revoke and stored — we don't walk the tree on each sync. When permissions change, we invalidate the cache and disconnect affected clients so they re-auth.
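A minimal sketch of the cached check, with several assumptions: `PermissionCache`, the `load` callback (standing in for the PostgreSQL lookup of the stored, already-inherited level), the 5-second TTL, and the ordering view < comment < edit < admin are all illustrative rather than taken from the production system.

```typescript
type Level = "none" | "view" | "comment" | "edit" | "admin";

// Assumed ordering of levels; the source lists them but does not rank them.
const RANK: Record<Level, number> = { none: 0, view: 1, comment: 2, edit: 3, admin: 4 };

type CacheEntry = { level: Level; expiresAt: number };

class PermissionCache {
  // docId -> userId -> cached level, so one document can be invalidated in one step.
  private byDoc = new Map<string, Map<string, CacheEntry>>();
  private load: (userId: string, docId: string) => Level;
  private ttlMs: number;
  lookups = 0; // how many times the backing store was consulted

  constructor(load: (userId: string, docId: string) => Level, ttlMs: number = 5_000) {
    this.load = load;
    this.ttlMs = ttlMs;
  }

  levelFor(userId: string, docId: string, nowMs: number): Level {
    let users = this.byDoc.get(docId);
    const hit = users ? users.get(userId) : undefined;
    if (hit && hit.expiresAt > nowMs) return hit.level;
    // Cache miss or expired entry: fall back to the backing store.
    this.lookups++;
    const level = this.load(userId, docId);
    if (!users) {
      users = new Map<string, CacheEntry>();
      this.byDoc.set(docId, users);
    }
    users.set(userId, { level, expiresAt: nowMs + this.ttlMs });
    return level;
  }

  // Gate applied before accepting a document update from a client.
  canEdit(userId: string, docId: string, nowMs: number): boolean {
    return RANK[this.levelFor(userId, docId, nowMs)] >= RANK.edit;
  }

  // Called when a grant or revoke touches this document.
  invalidateDoc(docId: string): void {
    this.byDoc.delete(docId);
  }
}
```

Repeated edits within the TTL hit the cache, so the backing store sees one lookup per user and document per TTL window. After `invalidateDoc`, the next check reloads the level, which is where a revoked editor would be rejected.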