Flow - Architecture Improvements & Technical Debt

Document Version: 1.0.0
Last Updated: March 10, 2026
Status: Active Development
Owner: Tech Lead

This document predates the Supabase-only migration

The technical-debt register below covers the legacy microservices. The Supabase-only migration of 2026-03-29 resolved (or made irrelevant) most of the items marked as critical: the debt around the gateway, service mesh, event broker, and Mongo↔Postgres consistency does not apply to the current architecture.

Surviving debt (e.g. test coverage, observability) should be re-triaged in the Supabase context. Current architecture: Architecture Overview.


Executive Summary

This document consolidates all identified architectural improvements, technical debt items, and infrastructure enhancements for the Flow project. It serves as the central tracking document for technical improvements across the platform.

Key Areas of Focus:

  1. Inter-service communication and event bus implementation
  2. Database strategy and data consistency
  3. Real-time service scalability
  4. AI/ML data pipeline architecture
  5. API Gateway evolution
  6. Observability and resilience
  7. Performance and caching strategies
  8. Security and secrets management
  9. Testing and quality assurance
  10. Schema governance and migrations

Table of Contents

  1. High Priority Issues
  2. Medium Priority Issues
  3. Low Priority Issues
  4. 90-Day Roadmap
  5. Risk Assessment
  6. Detailed Issue Breakdown

High Priority Issues

Issue #1: Event Bus Implementation

Priority: 🔴 Critical
Impact: High (prevents failure cascades, improves resilience)
Effort: Medium (2-3 weeks)
Timeline: Q2 2026 (Month 1-2)

Problem: Synchronous HTTP calls between microservices (user ↔ event ↔ social) create:

  • High latency
  • Tight coupling
  • Failure cascades
  • “Distributed monolith” anti-pattern

Proposed Solution:

  1. Phase 1 (Quick Win): Implement Redis Streams (already available)

    • Define event contracts: EventCreated, EventUpdated, EventCancelled, UserProfileUpdated, NotificationRequested
    • Update services to publish/consume events
    • Implement Dead Letter Queue (DLQ) and retry logic
    • Add metrics for lag and consumer health
  2. Phase 2 (Production): Evaluate and migrate to Kafka

    • Better throughput for high volumes
    • Event replay capabilities
    • Mature ecosystem and tooling

Benefits:

  • ✅ Backpressure handled by message broker
  • ✅ Independent consumer services
  • ✅ Natural audit log of events
  • ✅ Integrated retry and DLQ mechanisms
  • ✅ Improved system resilience

Implementation Steps:

  • Define event schema and contracts
  • Set up Redis Streams infrastructure
  • Implement event publishers in each service
  • Implement event consumers with error handling
  • Add monitoring and metrics
  • Load test event throughput
  • Document event-driven architecture

Reference: 1-event-bus.md


Issue #2: Redis Adapter for Socket.IO

Priority: 🔴 Critical
Impact: High (required for horizontal scaling)
Effort: Low (1 week)
Timeline: Q2 2026 (Month 1)

Problem: Without the Redis adapter, Socket.IO cannot scale horizontally. The real-time service is limited to a single instance, creating:

  • Single point of failure
  • Limited concurrent user capacity
  • No high availability

Proposed Solution:

  1. Enable socket.io-redis-adapter in production
  2. Configure sticky sessions on load balancer
  3. Test horizontal scaling with multiple realtime service instances
  4. Implement rate limiting for WebSocket connections
  5. Separate namespaces (chat, presence, typing)

Implementation Steps:

  • Install and configure socket.io-redis-adapter
  • Update realtime service configuration
  • Configure load balancer with sticky sessions
  • Test with multiple service instances
  • Implement per-connection rate limiting
  • Add WebSocket-specific monitoring
  • Document scaling architecture
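
The per-connection rate limiting listed above can be sketched as a token bucket keyed by socket id and checked in a packet middleware. Capacity and refill rate here are illustrative defaults, not tuned values:

```javascript
// Minimal token bucket for per-WebSocket-connection rate limiting.
// capacity / refillPerSec are illustrative, not tuned numbers.
class TokenBucket {
  constructor(capacity = 20, refillPerSec = 5, now = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.now = now;
    this.last = now();
  }

  // Returns true if the message is allowed, false if it should be dropped.
  allow() {
    const t = this.now();
    const elapsedSec = (t - this.last) / 1000;
    this.last = t;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

In the realtime service one bucket would hang off each connection (e.g. in a `Map` keyed by socket id) and be consulted from the per-socket middleware before handling a packet.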

Reference: 2-redis-socketio.md


Issue #12: Performance and Indexing

Priority: 🔴 Critical
Impact: High (query performance)
Effort: Low (1-2 weeks)
Timeline: Q2 2026 (Month 2-3)

Problem: Missing MongoDB indexes cause slow queries, especially for:

  • Event searches by location
  • Event filtering by category and date
  • User lookups by various fields
  • Social graph queries

Proposed Solution:

  1. Enable MongoDB profiling in staging environment
  2. Create indexes for common query patterns:
    • events: slug, organizer.id, datetime.start, location.coordinates (geospatial), category
    • users: email, username, location.coordinates
    • social_connections: userId, status, createdAt
  3. Add compound indexes for complex queries
  4. Benchmark performance improvements
  5. Document indexing strategy

Implementation Steps:

  • Enable MongoDB profiling
  • Analyze slow queries
  • Create single-field indexes
  • Create compound indexes
  • Test query performance improvements
  • Document index strategy
  • Add index creation to migration scripts
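
The index set from step 2 can be captured declaratively so migration scripts create it idempotently. Exact field paths here mirror the names in this document and are assumptions about the live schema:

```javascript
// Declarative index specs matching the query patterns listed above.
// Field paths are assumptions based on this document, not the live schema.
const INDEXES = {
  events: [
    { key: { slug: 1 }, options: { unique: true } },
    { key: { 'organizer.id': 1 } },
    { key: { 'datetime.start': 1 } },
    { key: { 'location.coordinates': '2dsphere' } },   // geospatial searches
    { key: { category: 1, 'datetime.start': 1 } },     // compound: filter + sort
  ],
  users: [
    { key: { email: 1 }, options: { unique: true } },
    { key: { username: 1 }, options: { unique: true } },
    { key: { 'location.coordinates': '2dsphere' } },
  ],
  social_connections: [
    { key: { userId: 1, status: 1, createdAt: -1 } },  // compound for graph queries
  ],
};

// A migration script would loop over this map:
//   for (const [coll, specs] of Object.entries(INDEXES))
//     for (const { key, options } of specs)
//       await db.collection(coll).createIndex(key, options);
```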

Reference: 12-performance-indexing.md


Issue #6: Observability with OpenTelemetry

Priority: 🔴 High
Impact: High (debugging, performance monitoring)
Effort: Medium (2-3 weeks)
Timeline: Q2 2026 (Month 2-3)

Problem: Limited visibility into distributed system behavior:

  • Difficult to trace requests across services
  • No centralized metrics collection
  • Hard to debug performance issues
  • Lack of distributed tracing

Proposed Solution:

  1. Implement OpenTelemetry SDK in all Node.js services
  2. Propagate trace-id across service boundaries (in addition to correlation-id)
  3. Export traces to Jaeger or Google Cloud Trace
  4. Export metrics to Prometheus
  5. Create Grafana dashboards for:
    • Request latency (p50, p95, p99)
    • Error rates
    • Service health
    • Database performance
    • Cache hit rates

Implementation Steps:

  • Install OpenTelemetry SDKs
  • Configure trace propagation
  • Set up trace collection (Jaeger/Cloud Trace)
  • Set up metrics collection (Prometheus)
  • Create Grafana dashboards
  • Add instrumentation to critical paths
  • Document observability architecture

Reference: 6-observability-otel.md


Issue #3: Port Alignment

Priority: 🟡 Medium
Impact: Low (consistency)
Effort: Low (< 1 day)
Timeline: Q2 2026 (Month 1)

Problem: Inconsistent port configuration across services and documentation.

Proposed Solution:

  1. Standardize notification service to port 3004
  2. Update all environment files
  3. Update docker-compose.yml
  4. Update documentation
  5. Update Kubernetes manifests

Standard Port Mapping:

  • API Gateway: 3000
  • User Service: 3001
  • Event Service: 3002
  • Social Service: 3003
  • Notification Service: 3004
  • Realtime Service: 3005
  • Recommendation Engine (AI): 8001
  • Matchmaking Service (AI): 8002

Reference: 3-notification-port-align.md


Medium Priority Issues

Issue #4: CDC Pipeline (Mongo to Postgres)

Priority: 🟡 Medium
Impact: Medium (data consistency)
Effort: High (3-4 weeks)
Timeline: Q3 2026

Problem: Data split between MongoDB (operational) and PostgreSQL (admin/analytics) creates:

  • Data inconsistency risks
  • Duplicate data models
  • Slow cross-database queries
  • Complex synchronization logic

Proposed Solution:

  1. Implement Change Data Capture (CDC) using Debezium or similar
  2. Stream MongoDB changes to PostgreSQL
  3. Create read-only analytics views in Postgres
  4. Remove duplicate SQL models from user-service
  5. Define clear operational vs analytical data boundaries
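
Step 1 with Debezium would register a MongoDB source connector on Kafka Connect. A minimal connector config might look like the fragment below; the connector name, connection string, and collection list are placeholders, and exact property names vary by Debezium version:

```json
{
  "name": "flow-mongo-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
    "mongodb.connection.string": "mongodb://mongo-primary:27017/?replicaSet=rs0",
    "topic.prefix": "flow",
    "collection.include.list": "flow.events,flow.users,flow.social_connections"
  }
}
```

A JDBC sink connector (or a small consumer service) would then apply the resulting change topics to the Postgres analytics schema.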

Data Plane Strategy:

  • Operational Store (MongoDB): Real-time OLTP workloads
  • Analytical Store (PostgreSQL): Dashboard queries, reporting, admin portal
  • CDC Pipeline: Near real-time synchronization

Reference: 4-cdc-mongo-to-postgres.md


Issue #5: AI Read Replica

Priority: 🟡 Medium
Impact: High (prevents AI from impacting user performance)
Effort: Medium (2 weeks)
Timeline: Q3 2026 (before AI features launch)

Problem: AI services reading from operational MongoDB can degrade user-facing performance:

  • Recommendation engine queries compete with user requests
  • ML feature extraction is resource-intensive
  • No isolation between operational and ML workloads

Proposed Solution:

  1. Create dedicated MongoDB read replica for AI workloads
  2. Implement data pipeline (batch or CDC) to ML feature store
  3. Version datasets for reproducibility
  4. Schedule nightly feature materialization jobs
  5. Export to Parquet on object storage for ML training

ML Data Pipeline:

MongoDB (Primary) → Read Replica → Feature Engineering → Feature Store → ML Models
                                      ↓
                                  Parquet Export → Object Storage → Training Pipeline
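
The workload isolation in step 1 is typically expressed through MongoDB read preference with replica set tags, so AI services only ever read from the analytics-tagged secondary. Hosts and the tag name below are illustrative:

```javascript
// Illustrative connection string for AI workloads: read only from the
// secondary tagged for analytics. Hosts and tag names are assumptions.
function aiConnectionString(hosts, db) {
  const params = [
    'replicaSet=rs0',
    'readPreference=secondary',
    'readPreferenceTags=workload:analytics',
  ].join('&');
  return `mongodb://${hosts.join(',')}/${db}?${params}`;
}
```

User-facing services keep the default primary read preference, so recommendation and feature-extraction queries never compete with operational traffic.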

Reference: 5-ai-read-replica.md


Issue #7: Gateway Caching

Priority: 🟡 Medium
Impact: Medium (performance)
Effort: Medium (2 weeks)
Timeline: Q3 2026

Problem: Repeated requests for the same data increase backend load and latency.

Proposed Solution:

  1. Implement response caching in API gateway for GET requests
  2. Cache popular event details, category lists, venue information
  3. Invalidate cache via event bus events
  4. Add cache hit/miss metrics
  5. Configure appropriate TTLs

Caching Strategy:

  • High TTL (1 hour): Categories, venues, static content
  • Medium TTL (15 min): Event details, user profiles
  • Low TTL (5 min): Event lists, search results
  • No cache: User-specific data, real-time updates
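
The tiers above can be encoded as a small TTL resolver the gateway consults before caching a GET response. The route patterns are illustrative, not the gateway's actual routes:

```javascript
// TTL tiers from the caching strategy above, in seconds.
// Route patterns are illustrative, not the gateway's actual routes.
const TTL_RULES = [
  { pattern: /^\/(categories|venues)/, ttl: 3600 }, // high: static content
  { pattern: /^\/events\/[^/]+$/,      ttl: 900 },  // medium: event details
  { pattern: /^\/users\/[^/]+$/,       ttl: 900 },  // medium: public profiles
  { pattern: /^\/(events|search)/,     ttl: 300 },  // low: lists, search results
];

// Returns the TTL in seconds, or 0 for "do not cache".
function cacheTtl(method, path, isUserSpecific) {
  if (method !== 'GET' || isUserSpecific) return 0; // never cache per-user data
  const rule = TTL_RULES.find(r => r.pattern.test(path));
  return rule ? rule.ttl : 0;
}
```

Because rules are checked in order, the more specific event-detail pattern wins over the event-list pattern; event-bus invalidation (step 3) then evicts entries before their TTL when the underlying data changes.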

Reference: 7-gateway-caching.md


Issue #8: Secrets Management

Priority: 🔴 High (Security)
Impact: Critical (security)
Effort: Low (1 week)
Timeline: Q3 2026 (before production launch)

Problem: Secrets stored in .env files are:

  • Vulnerable to accidental commits
  • Hard to rotate
  • Not audited
  • Shared across environments

Proposed Solution:

  1. Migrate secrets to GCP Secret Manager (or AWS Secrets Manager/HashiCorp Vault)
  2. Implement secret rotation for:
    • JWT signing keys
    • API keys (SendGrid, Firebase, etc.)
    • Database credentials
  3. Update CI/CD to fetch secrets dynamically
  4. Remove .env files from deployment

Implementation:

  • Set up GCP Secret Manager
  • Migrate all secrets
  • Update service configurations
  • Implement rotation policies
  • Update CI/CD pipelines
  • Document secrets management process
  • Remove .env from production deployments

Reference: 8-secrets-manager.md


Low Priority Issues

Issue #9: Contract Testing and E2E

Priority: 🟢 Low
Impact: Medium (quality)
Effort: High (3-4 weeks)
Timeline: Q4 2026

Problem: No API contract validation or comprehensive E2E tests, leading to:

  • Integration breakages
  • API version incompatibilities
  • Regression bugs

Proposed Solution:

  1. Implement OpenAPI specs for all services
  2. Add API schema validation in gateway
  3. Pact contract tests between services
  4. Playwright E2E tests for critical user flows
  5. Load testing with k6

Test Coverage:

  • Contract Tests: All inter-service API calls
  • E2E Tests: Signup, event creation, RSVP, chat, friend connections
  • Load Tests: Auth, event list, WebSocket fan-out

Reference: 9-contract-e2e-testing.md


Issue #10: Schema Governance

Priority: 🟢 Low
Impact: Medium (maintainability)
Effort: Low (1 week)
Timeline: Q4 2026

Problem: No formal schema versioning or migration strategy:

  • Ad-hoc schema changes
  • No migration tests
  • Difficult rollbacks

Proposed Solution:

  1. Implement migrate-mongo for all services
  2. Create schema changelog
  3. Add migration tests in CI
  4. Document migration process
  5. Version all schema changes

Reference: 10-schema-governance.md


Issue #11: Gateway Evolution

Priority: 🟢 Low
Impact: Medium (features)
Effort: High (4-6 weeks)
Timeline: Q1 2027

Problem: Express-based gateway lacks advanced features:

  • Circuit breaker
  • Sophisticated rate limiting
  • Built-in caching
  • Service mesh capabilities

Proposed Solution:

  1. Evaluate Kong, Traefik, or cloud-managed API Gateway
  2. PoC in staging environment
  3. Migrate routes progressively
  4. Leverage plugins for:
    • Authentication/authorization
    • Rate limiting
    • Response caching
    • mTLS for internal communication
    • Built-in observability
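
Until a managed gateway provides one out of the box, the missing circuit breaker can be approximated in-process. A minimal sketch, with illustrative thresholds:

```javascript
// Minimal circuit breaker: opens after maxFailures consecutive failures,
// fails fast during cooldownMs, then allows one probe (half-open).
// Thresholds are illustrative, not tuned values.
class CircuitBreaker {
  constructor(maxFailures = 5, cooldownMs = 10000, now = Date.now) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
    this.now = now;
  }

  async call(fn) {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error('circuit open');   // fail fast, protect the upstream
      }
      this.openedAt = null;                // half-open: allow one probe
    }
    try {
      const result = await fn();
      this.failures = 0;                   // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = this.now();
      throw err;
    }
  }
}
```

Wrapping each proxied upstream call in a per-service breaker like this keeps one failing backend from tying up gateway connections, which is the same guarantee the Kong/Traefik plugins provide.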

Reference: 11-gateway-evolution.md


90-Day Roadmap

Month 1 (March 2026)

Focus: Critical fixes and foundation

  • Issue #3: Port alignment across all services
  • Issue #2: Enable Redis adapter for Socket.IO
  • Issue #6 (Part 1): Basic OpenTelemetry instrumentation
  • Add /health and /ready endpoints to all services
  • Issue #12 (Part 1): Enable MongoDB profiling and identify slow queries

Deliverables:

  • Aligned port configuration
  • Horizontally scalable real-time service
  • Basic distributed tracing
  • Health check standardization

Month 2 (April 2026)

Focus: Event bus and performance

  • Issue #1 (Phase 1): Redis Streams event bus prototype
  • Issue #12 (Part 2): Create MongoDB indexes
  • Issue #6 (Part 2): Complete OpenTelemetry integration
  • Issue #7 (Part 1): Gateway response caching for GET requests
  • Load test WebSocket with Redis adapter

Deliverables:

  • Event-driven communication prototype
  • Optimized database queries
  • Full observability stack
  • Response caching in gateway

Month 3 (May 2026)

Focus: Data pipeline and security

  • Issue #4: CDC PoC (Mongo → Postgres)
  • Issue #8: Secrets management migration
  • Issue #11: Evaluate managed gateway options (staging)
  • Define SLOs and create Grafana dashboards
  • Issue #7 (Part 2): Event-based cache invalidation

Deliverables:

  • Analytics data pipeline
  • Secure secrets management
  • Gateway evaluation complete
  • SLO monitoring dashboards

Risk Assessment

High Risk if Not Addressed

Realtime Service Not Scalable (Issue #2)

Risk: Chat and presence unreliable under load
Impact: User experience degradation, churn
Mitigation: Implement Redis adapter immediately

Secrets in .env Files (Issue #8)

Risk: Secret exposure, data breach
Impact: Security incident, compliance violation
Mitigation: Migrate to secret manager before production

No Observability (Issue #6)

Risk: Unable to debug production issues
Impact: Long MTTR, poor user experience
Mitigation: Implement OpenTelemetry in Q2

Medium Risk

Synchronous Service Communication (Issue #1)

Risk: Failure cascades, high latency
Impact: System-wide outages
Mitigation: Implement event bus in Q2

Data Inconsistency (Issue #4)

Risk: Inaccurate dashboards, wrong decisions
Impact: Business decisions based on bad data
Mitigation: Implement CDC in Q3

Low Risk

Missing Contract Tests (Issue #9)

Risk: Integration breakages
Impact: Development velocity slowdown
Mitigation: Implement in Q4


Detailed Issue Breakdown

For detailed documentation on each issue, see:

  • 1-event-bus.md
  • 2-redis-socketio.md
  • 3-notification-port-align.md
  • 4-cdc-mongo-to-postgres.md
  • 5-ai-read-replica.md
  • 6-observability-otel.md
  • 7-gateway-caching.md
  • 8-secrets-manager.md
  • 9-contract-e2e-testing.md
  • 10-schema-governance.md
  • 11-gateway-evolution.md
  • 12-performance-indexing.md


Tracking and Updates

Update Cadence

  • Weekly: Review in-progress issues
  • Monthly: Update roadmap and priorities
  • Quarterly: Comprehensive technical debt review

Ownership

  • Overall: Tech Lead
  • Individual Issues: Assigned engineers (see individual issue docs)

Status Tracking

  • Issues tracked in Project Roadmap
  • Progress updated in weekly engineering meetings
  • Completed items moved to archive

Next Review: April 10, 2026
Document Owner: Tech Lead
Last Updated: March 10, 2026