Flow - Architecture Improvements & Technical Debt
Document Version: 1.0.0 Last Updated: March 10, 2026 Status: Active Development Owner: Tech Lead
Pre-migration document (Supabase-only)
The technical-debt register listed here covers the legacy microservices. The Supabase-only migration of 2026-03-29 resolved (or made irrelevant) most of the items marked as critical: the debt around the gateway, service mesh, event broker, and Mongo↔Postgres consistency does not apply to the current architecture.
The surviving debt (e.g. test coverage, observability) should be re-triaged in the Supabase context. Current architecture: Architecture Overview.
Executive Summary
This document consolidates all identified architectural improvements, technical debt items, and infrastructure enhancements for the Flow project. It serves as the central tracking document for technical improvements across the platform.
Key Areas of Focus:
- Inter-service communication and event bus implementation
- Database strategy and data consistency
- Real-time service scalability
- AI/ML data pipeline architecture
- API Gateway evolution
- Observability and resilience
- Performance and caching strategies
- Security and secrets management
- Testing and quality assurance
- Schema governance and migrations
Table of Contents
- High Priority Issues
- Medium Priority Issues
- Low Priority Issues
- 90-Day Roadmap
- Risk Assessment
- Detailed Issue Breakdown
High Priority Issues
Issue #1: Event Bus Implementation
Priority: 🔴 Critical Impact: High (prevents failure cascades, improves resilience) Effort: Medium (2-3 weeks) Timeline: Q2 2026 (Month 1-2)
Problem: Synchronous HTTP calls between microservices (user ↔ event ↔ social) create:
- High latency
- Tight coupling
- Failure cascades
- “Distributed monolith” anti-pattern
Proposed Solution:
- Phase 1 (Quick Win): Implement Redis Streams (already available)
  - Define event contracts: `EventCreated`, `EventUpdated`, `EventCancelled`, `UserProfileUpdated`, `NotificationRequested`
  - Update services to publish/consume events
  - Implement Dead Letter Queue (DLQ) and retry logic
  - Add metrics for lag and consumer health
- Phase 2 (Production): Evaluate and migrate to Kafka
- Better throughput for high volumes
- Event replay capabilities
- Mature ecosystem and tooling
Benefits:
- ✅ Backpressure handled by message broker
- ✅ Independent consumer services
- ✅ Natural audit log of events
- ✅ Integrated retry and DLQ mechanisms
- ✅ Improved system resilience
Implementation Steps:
- Define event schema and contracts
- Set up Redis Streams infrastructure
- Implement event publishers in each service
- Implement event consumers with error handling
- Add monitoring and metrics
- Load test event throughput
- Document event-driven architecture
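The Phase 1 retry and DLQ behavior can be sketched as follows. This is an in-memory illustration of the policy, not the Redis Streams integration itself; field names such as `retries` and the `maxRetries` default are illustrative assumptions, not an agreed schema.

```javascript
// In-memory sketch of the Phase 1 retry/DLQ policy. A real consumer would read
// from a Redis Stream (XREADGROUP) and XACK on success; here the broker is a
// plain array so the control flow can be shown end to end.
const EVENT_TYPES = [
  "EventCreated", "EventUpdated", "EventCancelled",
  "UserProfileUpdated", "NotificationRequested",
];

function makeEvent(type, payload) {
  if (!EVENT_TYPES.includes(type)) throw new Error(`Unknown event type: ${type}`);
  return { type, payload, occurredAt: new Date().toISOString(), retries: 0 };
}

// Process one event; on failure, retry up to maxRetries times, then park it in
// the DLQ with the failure reason so it can be inspected and replayed.
function consume(event, handler, dlq, maxRetries = 3) {
  for (;;) {
    try {
      handler(event);
      return "acked";
    } catch (err) {
      event.retries += 1;
      if (event.retries > maxRetries) {
        dlq.push({ event, reason: String(err) });
        return "dead-lettered";
      }
    }
  }
}
```

A handler that keeps throwing is attempted once plus `maxRetries` retries before the event lands in the DLQ, which is the behavior the monitoring in step 5 would alert on.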
Reference: 1-event-bus.md
Issue #2: Redis Adapter for Socket.IO
Priority: 🔴 Critical Impact: High (required for horizontal scaling) Effort: Low (1 week) Timeline: Q2 2026 (Month 1)
Problem: Without the Redis adapter, Socket.IO cannot scale horizontally, so the real-time service is limited to a single instance, creating:
- Single point of failure
- Limited concurrent user capacity
- No high availability
Proposed Solution:
- Enable `socket.io-redis-adapter` in production
- Configure sticky sessions on load balancer
- Test horizontal scaling with multiple realtime service instances
- Implement rate limiting for WebSocket connections
- Separate namespaces (`chat`, `presence`, `typing`)
Implementation Steps:
- Install and configure socket.io-redis-adapter
- Update realtime service configuration
- Configure load balancer with sticky sessions
- Test with multiple service instances
- Implement per-connection rate limiting
- Add WebSocket-specific monitoring
- Document scaling architecture
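The per-connection rate limiting in step 5 could be implemented as a token bucket kept per socket id. A minimal sketch (the capacity and refill numbers are illustrative defaults, not tuned values; the clock is injected so the logic is testable):

```javascript
// Token-bucket limiter: each connection may burst up to `capacity` messages,
// refilled at `refillPerSec` tokens per second. `now` is passed in explicitly
// so the clock can be controlled in tests.
class ConnectionRateLimiter {
  constructor({ capacity = 10, refillPerSec = 5 } = {}) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.buckets = new Map(); // socketId -> { tokens, last }
  }

  // Returns true if the message may pass, false if it should be dropped.
  allow(socketId, now = Date.now()) {
    let b = this.buckets.get(socketId);
    if (!b) {
      b = { tokens: this.capacity, last: now };
      this.buckets.set(socketId, b);
    }
    const elapsedSec = (now - b.last) / 1000;
    b.tokens = Math.min(this.capacity, b.tokens + elapsedSec * this.refillPerSec);
    b.last = now;
    if (b.tokens >= 1) {
      b.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

In the realtime service this would sit in a Socket.IO event middleware: when `allow(socket.id)` returns false, the message is dropped (or the connection closed after repeated abuse).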
Reference: 2-redis-socketio.md
Issue #12: Performance and Indexing
Priority: 🔴 Critical Impact: High (query performance) Effort: Low (1-2 weeks) Timeline: Q2 2026 (Month 2-3)
Problem: Missing MongoDB indexes cause slow queries, especially for:
- Event searches by location
- Event filtering by category and date
- User lookups by various fields
- Social graph queries
Proposed Solution:
- Enable MongoDB profiling in staging environment
- Create indexes for common query patterns:
  - `events`: `slug`, `organizer.id`, `datetime.start`, `location.coordinates` (geospatial), `category`
  - `users`: `email`, `username`, `location.coordinates`
  - `social_connections`: `userId`, `status`, `createdAt`
- Add compound indexes for complex queries
- Benchmark performance improvements
- Document indexing strategy
Implementation Steps:
- Enable MongoDB profiling
- Analyze slow queries
- Create single-field indexes
- Create compound indexes
- Test query performance improvements
- Document index strategy
- Add index creation to migration scripts
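The indexes listed above can be expressed as driver-ready specs that a migration script would pass to the Node MongoDB driver's `createIndexes`. A sketch (the compound index and the `unique` option choices are illustrative assumptions, to be confirmed against the profiling results):

```javascript
// Index specs for the collections named above, kept as plain data so they can
// be reviewed and versioned; a migration would feed each array to
// db.collection(name).createIndexes(specs).
const indexSpecs = {
  events: [
    { key: { slug: 1 }, unique: true },
    { key: { "organizer.id": 1 } },
    { key: { "datetime.start": 1 } },
    { key: { "location.coordinates": "2dsphere" } }, // geospatial queries
    { key: { category: 1, "datetime.start": 1 } },   // compound: category + date filtering
  ],
  users: [
    { key: { email: 1 }, unique: true },
    { key: { username: 1 }, unique: true },
    { key: { "location.coordinates": "2dsphere" } },
  ],
  social_connections: [
    { key: { userId: 1, status: 1 } },
    { key: { createdAt: 1 } },
  ],
};

// Derive a stable index name from its key, e.g. { slug: 1 } -> "slug_1",
// matching MongoDB's default naming so migrations stay idempotent.
function indexName(key) {
  return Object.entries(key).map(([field, v]) => `${field}_${v}`).join("_");
}
```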
Reference: 12-performance-indexing.md
Issue #6: Observability with OpenTelemetry
Priority: 🔴 High Impact: High (debugging, performance monitoring) Effort: Medium (2-3 weeks) Timeline: Q2 2026 (Month 2-3)
Problem: Limited visibility into distributed system behavior:
- Difficult to trace requests across services
- No centralized metrics collection
- Hard to debug performance issues
- Lack of distributed tracing
Proposed Solution:
- Implement OpenTelemetry SDK in all Node.js services
- Propagate `trace-id` across service boundaries (in addition to `correlation-id`)
- Export traces to Jaeger or Google Cloud Trace
- Export metrics to Prometheus
- Create Grafana dashboards for:
- Request latency (p50, p95, p99)
- Error rates
- Service health
- Database performance
- Cache hit rates
Implementation Steps:
- Install OpenTelemetry SDKs
- Configure trace propagation
- Set up trace collection (Jaeger/Cloud Trace)
- Set up metrics collection (Prometheus)
- Create Grafana dashboards
- Add instrumentation to critical paths
- Document observability architecture
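The `trace-id` propagation above rides on the W3C Trace Context `traceparent` header, which OpenTelemetry's propagators read and write automatically. A hand-rolled parser/serializer, just to show what actually crosses the service boundary (the helper names are ours, not an OTel API):

```javascript
// W3C traceparent: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>".
// The low bit of the flags byte marks the trace as sampled.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const m = TRACEPARENT_RE.exec(header);
  if (!m) return null; // malformed header: start a fresh trace instead
  return { traceId: m[1], parentId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}

function formatTraceparent(traceId, spanId, sampled) {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```

A downstream service keeps the incoming `traceId`, generates a new span id for its own work, and forwards a freshly formatted header, so the whole request chain shares one trace in Jaeger or Cloud Trace.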
Reference: 6-observability-otel.md
Issue #3: Port Alignment
Priority: 🟡 Medium Impact: Low (consistency) Effort: Low (< 1 day) Timeline: Q2 2026 (Month 1)
Problem: Inconsistent port configuration across services and documentation.
Proposed Solution:
- Standardize notification service to port 3004
- Update all environment files
- Update docker-compose.yml
- Update documentation
- Update Kubernetes manifests
Standard Port Mapping:
- API Gateway: 3000
- User Service: 3001
- Event Service: 3002
- Social Service: 3003
- Notification Service: 3004
- Realtime Service: 3005
- Recommendation Engine (AI): 8001
- Matchmaking Service (AI): 8002
Reference: 3-notification-port-align.md
Medium Priority Issues
Issue #4: CDC Pipeline (Mongo to Postgres)
Priority: 🟡 Medium Impact: Medium (data consistency) Effort: High (3-4 weeks) Timeline: Q3 2026
Problem: Data split between MongoDB (operational) and PostgreSQL (admin/analytics) creates:
- Data inconsistency risks
- Duplicate data models
- Slow cross-database queries
- Complex synchronization logic
Proposed Solution:
- Implement Change Data Capture (CDC) using Debezium or similar
- Stream MongoDB changes to PostgreSQL
- Create read-only analytics views in Postgres
- Remove duplicate SQL models from user-service
- Define clear operational vs analytical data boundaries
Data Plane Strategy:
- Operational Store (MongoDB): Real-time OLTP workloads
- Analytical Store (PostgreSQL): Dashboard queries, reporting, admin portal
- CDC Pipeline: Near real-time synchronization
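The transform step of such a pipeline could look like the following, assuming a Debezium-style change event (`op` of `c`/`u`/`d`, with the new document in `after`). The table layout and the JSON-document column are illustrative assumptions, not a decided target schema:

```javascript
// Map a Debezium-style MongoDB change event onto a parameterized Postgres
// statement. Real Debezium events carry more metadata; only op/id/after are
// used here to show the sink logic.
function changeToSql(event, table) {
  if (event.op === "d") {
    return { text: `DELETE FROM ${table} WHERE id = $1`, values: [event.id] };
  }
  // "c" (create) and "u" (update) both become an upsert on the primary key,
  // which keeps the sink idempotent under CDC replays.
  return {
    text: `INSERT INTO ${table} (id, doc) VALUES ($1, $2)
           ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc`,
    values: [event.id, JSON.stringify(event.after)],
  };
}
```

The upsert-on-conflict shape matters: CDC pipelines deliver at-least-once, so every write into the analytical store has to tolerate duplicates.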
Reference: 4-cdc-mongo-to-postgres.md
Issue #5: AI Read Replica
Priority: 🟡 Medium Impact: High (prevents AI from impacting user performance) Effort: Medium (2 weeks) Timeline: Q3 2026 (before AI features launch)
Problem: AI services reading from operational MongoDB can degrade user-facing performance:
- Recommendation engine queries compete with user requests
- ML feature extraction is resource-intensive
- No isolation between operational and ML workloads
Proposed Solution:
- Create dedicated MongoDB read replica for AI workloads
- Implement data pipeline (batch or CDC) to ML feature store
- Version datasets for reproducibility
- Schedule nightly feature materialization jobs
- Export to Parquet on object storage for ML training
ML Data Pipeline:
MongoDB (Primary) → Read Replica → Feature Engineering → Feature Store → ML Models
↓
Parquet Export → Object Storage → Training Pipeline
Reference: 5-ai-read-replica.md
Issue #7: Gateway Caching
Priority: 🟡 Medium Impact: Medium (performance) Effort: Medium (2 weeks) Timeline: Q3 2026
Problem: Repeated requests for the same data increase backend load and latency.
Proposed Solution:
- Implement response caching in API gateway for GET requests
- Cache popular event details, category lists, venue information
- Invalidate cache via event bus events
- Add cache hit/miss metrics
- Configure appropriate TTLs
Caching Strategy:
- High TTL (1 hour): Categories, venues, static content
- Medium TTL (15 min): Event details, user profiles
- Low TTL (5 min): Event lists, search results
- No cache: User-specific data, real-time updates
Reference: 7-gateway-caching.md
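The TTL tiers above can be sketched as a small cache in front of the route handlers. In production this would more likely live in Redis than in process memory, and the key-by-path convention is an assumption; the sketch shows the tier TTLs and the event-bus invalidation hook:

```javascript
// TTLs per tier, in seconds, matching the caching strategy above.
const TTL_SECONDS = { high: 3600, medium: 900, low: 300 };

class TieredCache {
  constructor() {
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  set(key, value, tier, now = Date.now()) {
    const ttl = TTL_SECONDS[tier];
    if (ttl === undefined) return; // "no cache" tier: store nothing
    this.entries.set(key, { value, expiresAt: now + ttl * 1000 });
  }

  get(key, now = Date.now()) {
    const e = this.entries.get(key);
    if (!e || e.expiresAt <= now) {
      this.entries.delete(key);
      return undefined;
    }
    return e.value;
  }

  // Event-bus invalidation: e.g. on EventUpdated, drop every cached response
  // whose key touches that event (detail page, lists containing it, etc.).
  invalidatePrefix(prefix) {
    for (const k of this.entries.keys()) {
      if (k.startsWith(prefix)) this.entries.delete(k);
    }
  }
}
```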
Issue #8: Secrets Management
Priority: 🔴 High (Security) Impact: Critical (security) Effort: Low (1 week) Timeline: Q3 2026 (before production launch)
Problem: Secrets stored in `.env` files are:
- Vulnerable to accidental commits
- Hard to rotate
- Not audited
- Shared across environments
Proposed Solution:
- Migrate secrets to GCP Secret Manager (or AWS Secrets Manager/HashiCorp Vault)
- Implement secret rotation for:
- JWT signing keys
- API keys (SendGrid, Firebase, etc.)
- Database credentials
- Update CI/CD to fetch secrets dynamically
- Remove `.env` files from deployment
Implementation:
- Set up GCP Secret Manager
- Migrate all secrets
- Update service configurations
- Implement rotation policies
- Update CI/CD pipelines
- Document secrets management process
- Remove .env from production deployments
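A startup loader along these lines could sit between the services and the secret backend. The provider client is injected (in production, a thin wrapper over e.g. GCP Secret Manager's access call) so the fail-fast logic is testable; the secret names shown are illustrative:

```javascript
// Fetch a fixed set of secrets at startup and fail fast if any is missing,
// so a misconfigured deployment dies immediately instead of failing at the
// first request that needs the secret. `client.access(name)` stands in for
// the provider SDK call.
async function loadSecrets(client, names) {
  const out = {};
  for (const name of names) {
    const value = await client.access(name);
    if (value === undefined || value === null || value === "") {
      throw new Error(`Missing required secret: ${name}`);
    }
    out[name] = value;
  }
  return out;
}
```

At boot a service would call something like `await loadSecrets(gcpClient, ["JWT_SIGNING_KEY", "SENDGRID_API_KEY"])` and pass the result to its config, replacing the `.env` read.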
Reference: 8-secrets-manager.md
Low Priority Issues
Issue #9: Contract Testing and E2E
Priority: 🟢 Low Impact: Medium (quality) Effort: High (3-4 weeks) Timeline: Q4 2026
Problem: No API contract validation or comprehensive E2E tests, leading to:
- Integration breakages
- API version incompatibilities
- Regression bugs
Proposed Solution:
- Implement OpenAPI specs for all services
- Add API schema validation in gateway
- Pact contract tests between services
- Playwright E2E tests for critical user flows
- Load testing with k6
Test Coverage:
- Contract Tests: All inter-service API calls
- E2E Tests: Signup, event creation, RSVP, chat, friend connections
- Load Tests: Auth, event list, WebSocket fan-out
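The core idea behind the contract tests above (which Pact automates) reduces to the consumer recording the response shape it depends on, and CI checking the provider against it. A deliberately minimal structural check, with a hypothetical contract the social service might hold against the event service:

```javascript
// The consumer-driven contract in miniature: a contract maps each field the
// consumer relies on to its expected primitive type. Extra provider fields
// are allowed (consumers must tolerate additions); missing or retyped fields
// are breakages.
function matchesContract(response, contract) {
  return Object.entries(contract).every(
    ([field, type]) => typeof response[field] === type
  );
}

// Hypothetical contract: what the social service expects from GET /events/:id.
const eventContract = { id: "string", title: "string", capacity: "number" };
```

Pact adds the important operational parts on top of this: recording contracts from consumer tests, publishing them to a broker, and verifying providers in CI before deploy.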
Reference: 9-contract-e2e-testing.md
Issue #10: Schema Governance
Priority: 🟢 Low Impact: Medium (maintainability) Effort: Low (1 week) Timeline: Q4 2026
Problem: No formal schema versioning or migration strategy:
- Ad-hoc schema changes
- No migration tests
- Difficult rollbacks
Proposed Solution:
- Implement
migrate-mongofor all services - Create schema changelog
- Add migration tests in CI
- Document migration process
- Version all schema changes
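A `migrate-mongo` migration is a module exporting paired `up`/`down` functions that receive the driver's `Db` handle, which is what makes rollbacks and CI migration tests possible. A sketch (the collection and index are examples):

```javascript
// Shape of a migrate-mongo migration file, e.g.
// migrations/20260310-add-event-slug-index.js. `db` is the MongoDB driver Db
// handle that migrate-mongo passes in; `down` must undo exactly what `up` did.
const migration = {
  async up(db, client) {
    await db.collection("events").createIndex({ slug: 1 }, { unique: true, name: "slug_1" });
  },
  async down(db, client) {
    await db.collection("events").dropIndex("slug_1");
  },
};

// Guarded so the sketch also loads outside a CommonJS module context.
if (typeof module !== "undefined") module.exports = migration;
```

A CI migration test runs `up` then `down` against a throwaway database and asserts the schema returns to its prior state, which is what makes the "difficult rollbacks" item above tractable.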
Reference: 10-schema-governance.md
Issue #11: Gateway Evolution
Priority: 🟢 Low Impact: Medium (features) Effort: High (4-6 weeks) Timeline: Q1 2027
Problem: Express-based gateway lacks advanced features:
- Circuit breaker
- Sophisticated rate limiting
- Built-in caching
- Service mesh capabilities
Proposed Solution:
- Evaluate Kong, Traefik, or cloud-managed API Gateway
- PoC in staging environment
- Migrate routes progressively
- Leverage plugins for:
- Authentication/authorization
- Rate limiting
- Response caching
- mTLS for internal communication
- Built-in observability
Reference: 11-gateway-evolution.md
90-Day Roadmap
Month 1 (March 2026)
Focus: Critical fixes and foundation
- Issue #3: Port alignment across all services
- Issue #2: Enable Redis adapter for Socket.IO
- Issue #6 (Part 1): Basic OpenTelemetry instrumentation
- Add `/health` and `/ready` endpoints to all services
- Issue #12 (Part 1): Enable MongoDB profiling and identify slow queries
Deliverables:
- Aligned port configuration
- Horizontally scalable real-time service
- Basic distributed tracing
- Health check standardization
Month 2 (April 2026)
Focus: Event bus and performance
- Issue #1 (Phase 1): Redis Streams event bus prototype
- Issue #12 (Part 2): Create MongoDB indexes
- Issue #6 (Part 2): Complete OpenTelemetry integration
- Issue #7 (Part 1): Gateway response caching for GET requests
- Load test WebSocket with Redis adapter
Deliverables:
- Event-driven communication prototype
- Optimized database queries
- Full observability stack
- Response caching in gateway
Month 3 (May 2026)
Focus: Data pipeline and security
- Issue #4: CDC PoC (Mongo → Postgres)
- Issue #8: Secrets management migration
- Issue #11: Evaluate managed gateway options (staging)
- Define SLOs and create Grafana dashboards
- Issue #7 (Part 2): Event-based cache invalidation
Deliverables:
- Analytics data pipeline
- Secure secrets management
- Gateway evaluation complete
- SLO monitoring dashboards
Risk Assessment
High Risk if Not Addressed
Realtime Service Not Scalable (Issue #2)
Risk: Chat and presence unreliable under load Impact: User experience degradation, churn Mitigation: Implement Redis adapter immediately
Secrets in .env Files (Issue #8)
Risk: Secret exposure, data breach Impact: Security incident, compliance violation Mitigation: Migrate to secret manager before production
No Observability (Issue #6)
Risk: Unable to debug production issues Impact: Long MTTR, poor user experience Mitigation: Implement OpenTelemetry in Q2
Medium Risk
Synchronous Service Communication (Issue #1)
Risk: Failure cascades, high latency Impact: System-wide outages Mitigation: Implement event bus in Q2
Data Inconsistency (Issue #4)
Risk: Inaccurate dashboards, wrong decisions Impact: Business decisions based on bad data Mitigation: Implement CDC in Q3
Low Risk
Missing Contract Tests (Issue #9)
Risk: Integration breakages Impact: Development velocity slowdown Mitigation: Implement in Q4
Detailed Issue Breakdown
For detailed documentation on each issue, see:
- Issue #1: Event Bus Implementation
- Issue #2: Redis Adapter for Socket.IO
- Issue #3: Port Alignment
- Issue #4: CDC Mongo to Postgres
- Issue #5: AI Read Replica
- Issue #6: Observability with OpenTelemetry
- Issue #7: Gateway Caching
- Issue #8: Secrets Management
- Issue #9: Contract Testing and E2E
- Issue #10: Schema Governance
- Issue #11: Gateway Evolution
- Issue #12: Performance and Indexing
Tracking and Updates
Update Cadence
- Weekly: Review in-progress issues
- Monthly: Update roadmap and priorities
- Quarterly: Comprehensive technical debt review
Ownership
- Overall: Tech Lead
- Individual Issues: Assigned engineers (see individual issue docs)
Status Tracking
- Issues tracked in Project Roadmap
- Progress updated in weekly engineering meetings
- Completed items moved to archive
Next Review: April 10, 2026 Document Owner: Tech Lead Last Updated: March 10, 2026