# Non-Linear: Tech Stack & Architecture

## Stack Overview

```
┌─────────────────────────────────────────────────────────┐
│  FRONTEND                                               │
│  Vue 3 + Tailwind + Headless UI + ECharts               │
│  Graph Viz: TBD (D3 vs Cytoscape — eval pending)        │
│  Command Palette: vue-command-palette / custom          │
│  Keybindings: VueUse useMagicKeys                       │
│  Icons: Lucide │ Font: Inter │ Motion: @vueuse/motion   │
│  State: Pinia │ HTTP: ofetch │ WS: centrifuge-js        │
├─────────────────────────────────────────────────────────┤
│  CROSS-PLATFORM                                         │
│  Desktop: Tauri (thin wrapper, no offline) — v0.1       │
│  Mobile: Capacitor (responsive web first) — v0.2+       │
├─────────────────────────────────────────────────────────┤
│  BACKEND                                                │
│  FastAPI (Python)                                       │
│  Taskiq (async task queue — webhooks, imports, agents)  │
├─────────────────────────────────────────────────────────┤
│  DATA LAYER                                             │
│  Neo4j — graph topology (nodes, edges, status, labels)  │
│  Postgres — content & metadata (rich text, comments,    │
│             attachments meta, audit logs, project cfg)  │
│  Redis — caching, rate limiting                         │
│  Meilisearch — full-text search (issues, comments)      │
│  MinIO — S3-compatible file storage (attachments)       │
├─────────────────────────────────────────────────────────┤
│  REAL-TIME                                              │
│  Centrifugo — WebSocket server, live updates, push      │
├─────────────────────────────────────────────────────────┤
│  AUTH                                                   │
│  Authentik — OIDC, API tokens, role mgmt, SSO-ready     │
├─────────────────────────────────────────────────────────┤
│  INFRA/OPS                                              │
│  Caddy (reverse proxy + TLS) │ Vault (secrets)          │
│  Prometheus + Grafana (metrics + dashboards)            │
│  Loki (logs) │ Tempo (traces) │ OpenTelemetry (SDK)     │
├─────────────────────────────────────────────────────────┤
│  DEPLOYMENT                                             │
│  Docker Compose (dev + single-node production)          │
└─────────────────────────────────────────────────────────┘
```

## Data Boundary

### Neo4j — Graph Topology

Owns the decomposition tree and all overlay edges across the four data layers:

- **Node labels:** `Component` (Layer 1), `Issue` (Layer 2),
`Artifact` (Layer 4), `Cycle`, `Project`
- Node identity (UUID), short ID
- Lightweight properties: status, labels, assignee_id, created_at, updated_at
- **Layer 1 edges:** `HAS_CHILD` between components (decomposition tree)
- **Layer 1→2 edges:** `HAS_CHILD` from components to issues (work attachment)
- **Layer 2 edges:** `BLOCKS`, `DUPLICATES`, `RELATES_TO` between issues (work coordination)
- **Layer 3 edges:** `DEPENDS_ON`, `IMPORTS`, `CALLS_API`, `SHARES_DB` between components (code connections)
- **Layer 4 edges:** `HAS_ARTIFACT` from components/issues to artifacts
- Project root references, cycle membership

Apart from `HAS_CHILD`, which forms the decomposition tree across Layers 1 and 2, each edge type is scoped to a single layer. This enables efficient layer-filtered queries: a Cypher query for "show me this subtree with only Layer 3 edges" simply filters by relationship type.

**Why Neo4j over Postgres recursive CTEs:** Queries like "find all unblocked leaves in this subtree," "critical path through `BLOCKS` links," and "everything 3 hops from this node" are what Cypher is built for. CTEs get painful with lateral links and variable-depth queries. The gap widens with Layer 3 code connections (multi-hop dependency chains) and in v0.2+ with cross-project edges.
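
A minimal sketch of the layer filter described above, assuming a hypothetical helper named `subtree_query`. Relationship types cannot be passed as Cypher parameters, so the sketch picks them from a fixed allowlist keyed by layer, and only the root id travels as a query parameter. The query shape itself is an assumption for illustration, not the project's actual Cypher.

```python
# Overlay relationship types per layer, taken from the edge list above.
LAYER_EDGES: dict[int, tuple[str, ...]] = {
    2: ("BLOCKS", "DUPLICATES", "RELATES_TO"),
    3: ("DEPENDS_ON", "IMPORTS", "CALLS_API", "SHARES_DB"),
    4: ("HAS_ARTIFACT",),
}


def subtree_query(layer: int) -> str:
    """Build a Cypher query returning one layer's overlay edges inside a subtree.

    Hypothetical helper: relationship types come from a fixed allowlist
    (never from user input); the root id is supplied as the $root_id parameter.
    """
    if layer not in LAYER_EDGES:
        raise ValueError(f"unknown overlay layer: {layer}")
    rel_types = "|".join(LAYER_EDGES[layer])
    return (
        "MATCH (root {id: $root_id})-[:HAS_CHILD*0..]->(n) "
        f"MATCH (n)-[r:{rel_types}]->(m) "
        "RETURN n.id AS source, type(r) AS link_type, m.id AS target"
    )


# Example: Layer 3 edges under one root; params would be passed to the driver.
query = subtree_query(3)
params = {"root_id": "abc-123"}
```

Keeping the layer-to-type mapping in one place means the "filter by relationship type" rule stays a lookup rather than string handling scattered across query code.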
### Postgres — Content & Metadata

- **Rich text content:** issue and component descriptions (markdown)
- **Comment threads:** body, author, parent_comment_id (threading), timestamps
- **Attachment metadata:** filename, size, mime_type, s3_key, uploader_id, uploaded_at (inline attachments in comments/descriptions)
- **Artifact metadata (Layer 4):** title, kind, url/file_ref, mime_type — rich metadata for external docs, designs, and uploaded files attached to nodes
- **User/agent accounts:** profile data, preferences, notification settings
- **Project settings:** configuration, member lists, default policies
- **Audit logs:** who changed what, when, with before/after snapshots
- **Policy definitions:** role templates, custom permission rules

**Linked to Neo4j by UUID.** Neo4j node stores `id: "abc-123"`. Postgres stores full content keyed by the same UUID. FastAPI joins them as needed. This applies to all node types: Components (Layer 1), Issues (Layer 2), and Artifacts (Layer 4).

### Redis — Caching & Rate Limiting

- Subtree query cache (TTL, invalidated on graph mutations)
- Rate limiting for agent API
- Authentik token validation cache
- Not used for WebSocket pub/sub; real-time delivery goes through Centrifugo's server API

### Meilisearch — Search Index

- Indexes issue titles, descriptions, comments, labels
- Fed from both Neo4j and Postgres
- Powers command palette search (issues + commands in one result set)
- Typo-tolerant, prefix search, filtering by label/status/assignee

### MinIO — File Storage

- S3-compatible API, self-hosted
- Stores attachment files (images, docs)
- Postgres stores metadata and S3 key; MinIO stores bytes
- Migration path to AWS S3: zero code changes

## Concrete Database Schemas

### UUID Strategy

All entities use UUIDv7 (time-sortable). Generated application-side by FastAPI before writing to either database. The same UUID is used as the primary key in both Neo4j and Postgres, serving as the cross-database join key.

### Neo4j Schema

Neo4j stores graph topology and lightweight node properties.
All content lives in Postgres.

**Node labels and properties:**

```cypher
// Layer 1: Component node
CREATE (c:Component {
  id: "uuidv7",
  short_id: "NL-C12",
  title: "auth-service",
  status: null,                 // components have no status
  labels: ["backend", "core"],
  owner_id: "uuidv7",
  assignee_id: null,
  repo_provider: "github",
  repo_url: "https://github.com/team/auth",
  repo_path: "/src/oauth",
  repo_branch: "main",
  created_at: datetime(),
  updated_at: datetime()
})

// Layer 2: Issue node
CREATE (i:Issue {
  id: "uuidv7",
  short_id: "NL-42",
  title: "implement refresh tokens",
  status: "todo",
  labels: ["feature", "p1"],
  assignee_id: "uuidv7",
  created_by: "uuidv7",
  cycle_id: "uuidv7",
  created_at: datetime(),
  updated_at: datetime()
})

// Layer 4: Artifact node
CREATE (a:Artifact {
  id: "uuidv7",
  title: "Login flow mockup",
  kind: "link",                 // "link" | "file" | "embed"
  url: "https://figma.com/...", // for links/embeds
  file_ref: null,               // MinIO s3_key for uploaded files
  mime_type: null,
  size_bytes: null,
  created_by: "uuidv7",
  created_at: datetime()
})

// Project root (virtual node linking to decomposition tree root)
CREATE (p:Project {
  id: "uuidv7",
  workspace_id: "uuidv7",
  root_id: "uuidv7"
})
```

**Relationships (organized by layer):**

```cypher
// Decomposition tree (parent → child) — Layer 1 + Layer 2
(component)-[:HAS_CHILD]->(component)   // Layer 1: structure nesting
(component)-[:HAS_CHILD]->(issue)       // Layer 1→2: work attached to structure
(issue)-[:HAS_CHILD]->(issue)           // Layer 2: sub-tasks

// Layer 2: Work coordination links (between issues)
(issue)-[:BLOCKS]->(issue)
(issue)-[:RELATES_TO]->(issue)
(issue)-[:DUPLICATES]->(issue)

// Layer 3: Code connection links (between components)
(component)-[:DEPENDS_ON {source: "manual"}]->(component)
(component)-[:IMPORTS {source: "inferred"}]->(component)
(component)-[:CALLS_API {source: "inferred"}]->(component)
(component)-[:SHARES_DB {source: "manual"}]->(component)

// Layer 4: Artifact attachments
(component)-[:HAS_ARTIFACT]->(artifact)
(issue)-[:HAS_ARTIFACT]->(artifact)

// Cycle membership
(issue)-[:IN_CYCLE]->(cycle:Cycle { id, name, start_date, end_date })
```

Layer 3 edges carry a `source` property (`"manual"` or `"inferred"`) to distinguish human-declared dependencies from code-analysis results.

**Indexes:**

```cypher
CREATE INDEX comp_id FOR (c:Component) ON (c.id);
CREATE INDEX comp_short FOR (c:Component) ON (c.short_id);
CREATE INDEX issue_id FOR (i:Issue) ON (i.id);
CREATE INDEX issue_short FOR (i:Issue) ON (i.short_id);
CREATE INDEX issue_status FOR (i:Issue) ON (i.status);
CREATE INDEX issue_assignee FOR (i:Issue) ON (i.assignee_id);
CREATE INDEX artifact_id FOR (a:Artifact) ON (a.id);
CREATE INDEX project_id FOR (p:Project) ON (p.id);
```

### Postgres Schema (SQLModel)

Postgres stores all content, metadata, and configuration. Managed via Alembic migrations.

```python
import uuid
from datetime import datetime

from sqlalchemy import JSON, Column
from sqlmodel import Field, SQLModel
# Assumption: uuid7 comes from the third-party `uuid6` package
# (Python 3.14+ also provides uuid.uuid7 in the standard library).
from uuid6 import uuid7


class NodeContent(SQLModel, table=True):
    """Rich content for both components and issues."""
    id: uuid.UUID = Field(primary_key=True)  # matches Neo4j node id
    description: str | None = None           # markdown
    description_html: str | None = None      # pre-rendered, sanitized HTML


class Comment(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    node_id: uuid.UUID = Field(foreign_key="nodecontent.id", index=True)
    author_id: uuid.UUID = Field(foreign_key="actor.id")
    # Threading, as listed under "Comment threads" in the data boundary
    parent_comment_id: uuid.UUID | None = Field(default=None, foreign_key="comment.id")
    body: str       # markdown
    body_html: str  # pre-rendered, sanitized HTML
    created_at: datetime
    updated_at: datetime


class CommentReaction(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    comment_id: uuid.UUID = Field(foreign_key="comment.id", index=True)
    actor_id: uuid.UUID = Field(foreign_key="actor.id")
    emoji: str  # e.g. "+1", "rocket"
    created_at: datetime


class Attachment(SQLModel, table=True):
    """File attached inline to a comment or description (e.g. pasted image)."""
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    node_id: uuid.UUID = Field(foreign_key="nodecontent.id", index=True)
    filename: str
    size_bytes: int
    mime_type: str
    s3_key: str  # MinIO object key
    uploader_id: uuid.UUID = Field(foreign_key="actor.id")
    uploaded_at: datetime


class ArtifactContent(SQLModel, table=True):
    """Layer 4: external context attached to a component or issue.
    Topology (HAS_ARTIFACT edge) lives in Neo4j; rich metadata lives here."""
    id: uuid.UUID = Field(primary_key=True)  # matches Neo4j Artifact node id
    title: str
    kind: str                      # "link" | "file" | "embed"
    url: str | None = None         # external URL (Figma, Docs, etc.)
    file_ref: str | None = None    # MinIO s3_key for uploaded files
    mime_type: str | None = None
    size_bytes: int | None = None
    node_id: uuid.UUID = Field(foreign_key="nodecontent.id", index=True)
    created_by: uuid.UUID = Field(foreign_key="actor.id")
    created_at: datetime


class Actor(SQLModel, table=True):
    """Human user or AI agent."""
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    type: str  # "user" | "agent"
    name: str
    email: str | None = None
    authentik_uid: str | None = None  # OIDC subject claim
    # dict fields need an explicit JSON column; SQLModel cannot infer one
    preferences: dict = Field(default_factory=dict, sa_column=Column(JSON))  # theme, notifications, etc.
    created_at: datetime


class Workspace(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    name: str
    slug: str = Field(unique=True, index=True)
    created_at: datetime


class WorkspaceMember(SQLModel, table=True):
    workspace_id: uuid.UUID = Field(foreign_key="workspace.id", primary_key=True)
    actor_id: uuid.UUID = Field(foreign_key="actor.id", primary_key=True)
    role: str  # workspace-level role
    joined_at: datetime


class ProjectConfig(SQLModel, table=True):
    id: uuid.UUID = Field(primary_key=True)  # matches Neo4j Project id
    workspace_id: uuid.UUID = Field(foreign_key="workspace.id", index=True)
    name: str
    settings: dict = Field(default_factory=dict, sa_column=Column(JSON))  # custom statuses, defaults
    created_at: datetime


class PolicyRule(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    project_id: uuid.UUID = Field(foreign_key="projectconfig.id", index=True)
    actor_id: uuid.UUID | None = Field(default=None)  # null = role-level
    role_name: str | None = None
    action: str          # e.g. "read_node", "create_child", "*"
    resource_scope: str  # "global" | "subtree:{node_id}" | "node:{node_id}"
    effect: str          # "allow" | "deny"


class AuditLog(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    project_id: uuid.UUID = Field(foreign_key="projectconfig.id", index=True)
    actor_id: uuid.UUID = Field(foreign_key="actor.id")
    action: str  # e.g. "status_changed", "reparented"
    node_id: uuid.UUID | None = None
    before: dict | None = Field(default=None, sa_column=Column(JSON))  # JSON snapshot
    after: dict | None = Field(default=None, sa_column=Column(JSON))   # JSON snapshot
    created_at: datetime = Field(index=True)


class WebhookConfig(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    project_id: uuid.UUID = Field(foreign_key="projectconfig.id", index=True)
    url: str
    secret_hash: str  # hashed, never stored plaintext
    events: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    active: bool = True
    consecutive_failures: int = 0
    created_at: datetime
```

## Dual-Database Consistency

Neo4j and Postgres are **not replicated** — they own different data, linked by UUID. Both writes happen in the same API request. The consistency strategy for v0.1:

### Write Order

1. **Postgres first.** Open a SQLAlchemy transaction. Write content/metadata. Do not commit yet.
2. **Neo4j second.** Perform the graph mutation (create node, update properties, create edge).
3. **Commit Postgres.** If Postgres commit succeeds, the operation is complete.

### Failure Handling

- **Neo4j write fails:** Roll back the Postgres transaction (it hasn't committed). Clean failure, no orphans.
- **Postgres commit fails after Neo4j succeeds:** Issue a compensating operation on Neo4j (delete the node/revert the property change). Log the incident for review.
- **Partial Neo4j failure (e.g., network timeout with unknown state):** Flag the UUID for reconciliation review.

### Reconciliation Job

A periodic background task (Taskiq, runs every 15 minutes) checks for inconsistencies:

- UUIDs present in Neo4j but missing from Postgres (orphan graph nodes)
- UUIDs present in Postgres `NodeContent` but missing from Neo4j (orphan content)
- Mismatched lightweight properties (status, assignee) between Neo4j and Postgres audit log

Orphans are logged and surfaced in an admin dashboard. Auto-repair is deferred — manual review for v0.1.

### What's Eventually Consistent

- **Meilisearch index:** Updated asynchronously via Taskiq.
Acceptable lag of seconds.
- **Redis cache:** Invalidated on mutation. TTL-based expiry as fallback.
- **Centrifugo events:** Fire-and-forget publish. Missed events are recoverable by client re-fetch.

## Backend Architecture

### FastAPI Application Structure

```
non-linear-api/
├── app/
│   ├── main.py              # App, middleware, startup/shutdown
│   ├── config.py            # Settings from env vars
│   ├── dependencies.py      # Shared deps (db sessions, auth, current_user)
│   ├── auth/                # Authentik integration
│   │   ├── oidc.py          # Token validation, OIDC discovery
│   │   ├── permissions.py   # Policy engine evaluation
│   │   └── agent_tokens.py  # API token management for agents
│   ├── graph/               # Neo4j layer
│   │   ├── connection.py    # Neo4j driver management
│   │   ├── queries.py       # Cypher query templates
│   │   ├── mutations.py     # Graph write operations
│   │   └── traversal.py     # Subtree, path, neighbor queries
│   ├── content/             # Postgres layer
│   │   ├── models.py        # SQLAlchemy/SQLModel models
│   │   ├── descriptions.py  # Rich text CRUD
│   │   ├── comments.py      # Comment thread CRUD
│   │   ├── attachments.py   # Inline attachment metadata + MinIO upload/download
│   │   └── artifacts.py     # Layer 4: artifact CRUD (links, files, embeds)
│   ├── connections/         # Layer 3: code connection analysis
│   │   ├── inference.py     # Auto-infer dependencies from repo analysis
│   │   └── manual.py        # Manual code connection CRUD
│   ├── search/              # Meilisearch integration
│   │   ├── indexer.py       # Index updates on mutations
│   │   └── search.py        # Query interface
│   ├── realtime/            # WebSocket layer
│   │   ├── manager.py       # Connection management
│   │   └── events.py        # Event types and broadcasting
│   ├── tasks/               # Taskiq background jobs
│   │   ├── webhooks.py      # Deliver webhooks to agent endpoints
│   │   ├── indexing.py      # Async search index updates
│   │   ├── notifications.py # Notification delivery
│   │   └── connections.py   # Layer 3: periodic code connection inference
│   └── api/v1/              # Route handlers
│       ├── nodes.py         # CRUD + tree operations
│       ├── links.py         # Lateral link management (Layer 2 + Layer 3)
│       ├── projects.py      # Project CRUD
│       ├── comments.py      # Comment endpoints
│       ├── attachments.py   # Inline upload/download
│       ├── artifacts.py     # Layer 4: artifact endpoints
│       ├── connections.py   # Layer 3: code connection endpoints
│       ├── search.py        # Search endpoint
│       └── agent.py         # Agent-specific API surface
├── tests/
├── alembic/                 # Postgres migrations
├── docker-compose.yml
└── pyproject.toml
```

### Request Flows

**Typical read ("get node with full context"):**

```
Client → FastAPI
  → Auth middleware (validate token via Authentik)
  → Policy engine (check permissions)
  → Neo4j: fetch node + parent + children + links
  → Postgres: fetch description, comments, attachment meta
  → Merge response → Client
```

**Typical write ("change node status"):**

```
Client → FastAPI
  → Auth → Policy engine
  → Neo4j: update node status
  → Redis: invalidate cache
  → Centrifugo: publish event (broadcast to connected clients)
  → Taskiq: queue webhook delivery, search index update
  → Response → Client
```

### Sync Strategy (Neo4j ↔ Postgres)

Not replicated — they own different data. Linked by UUID. Both operations happen in the same API request. Compensating transaction pattern for consistency. Eventual consistency is acceptable for the search index and cache.

## Auth Architecture

```
┌──────────┐     OIDC token       ┌───────────┐
│ Vue App  ├─────────────────────►│ Authentik │
└────┬─────┘    (login flow)      └─────┬─────┘
     │                                  │
     │ Bearer token                     │ Token introspection
     ▼                                  ▼
┌──────────┐◄─────────────────────┌───────────┐
│ FastAPI  │   validate token     │ Authentik │
│ (resource│   check claims       │ (OIDC     │
│  server) │                      │  provider)│
└──────────┘                      └───────────┘
```

- **Human users:** OIDC login flow. JWT access tokens.
- **AI agents:** API tokens issued through Authentik, tied to agent actor accounts.
- **FastAPI:** pure resource server. Validates tokens, reads claims, enforces policies.
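
As a rough illustration of the resource-server role, the sketch below maps already-validated token claims to an actor record. It assumes signature and expiry checks happened upstream, and that the OIDC `sub` claim is what `Actor.authentik_uid` stores, as the schema comment suggests. `ActorRecord`, `UnknownActor`, `resolve_actor`, and the in-memory lookup are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorRecord:
    """Hypothetical, trimmed-down view of the Actor table."""
    id: str
    type: str  # "user" | "agent"
    authentik_uid: str


class UnknownActor(Exception):
    """Raised when validated claims do not match any known actor."""


def resolve_actor(claims: dict, actors_by_uid: dict[str, ActorRecord]) -> ActorRecord:
    """Map validated OIDC claims to an actor via the subject claim."""
    subject = claims.get("sub")
    if not subject or subject not in actors_by_uid:
        raise UnknownActor("no actor for token subject")
    return actors_by_uid[subject]


# Example lookup with an in-memory table standing in for Postgres.
actors = {"ak-001": ActorRecord(id="actor-1", type="agent", authentik_uid="ak-001")}
actor = resolve_actor({"sub": "ak-001"}, actors)
```

Human users and agents resolve through the same path here; only the `type` field differs, which keeps the policy engine's input uniform.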
## API Error Contract

All error responses use a consistent envelope:

```json
{
  "error": {
    "code": "validation_error",
    "message": "Human-readable description",
    "details": [
      { "field": "title", "message": "Field is required" }
    ]
  }
}
```

### HTTP Status Codes

| Code | Usage |
|------|-------|
| `400` | Malformed request (bad JSON, missing required fields) |
| `404` | Resource not found **or** actor lacks permission to see it. Permission-denied nodes return 404 (not 403) to prevent information leakage about resource existence. |
| `409` | Conflict (e.g., duplicate `short_id`, stale update) |
| `422` | Validation error. Standard FastAPI/Pydantic response with field-level detail. |
| `429` | Rate limited. Includes `Retry-After` header (seconds). |
| `500` | Internal server error. Logged with correlation ID for debugging. |

### Rate Limiting

- Agent API: token bucket per actor, configurable per role (default: 100 req/min).
- Human API: higher limits (default: 300 req/min).
- Enforced via Redis. `429` response includes `Retry-After` and `X-RateLimit-Remaining` headers.

## Security

### Input Sanitization

- **Cypher injection:** All Neo4j queries use parameterized Cypher exclusively. User-supplied values are never interpolated into query strings. The `graph/queries.py` module enforces this by accepting only typed parameters.
- **SQL injection:** SQLModel/SQLAlchemy parameterized queries. No raw SQL with string formatting.
- **XSS prevention:** All markdown content (descriptions, comments) is sanitized server-side using `nh3` (Rust-based HTML sanitizer) before storage. Both raw markdown and pre-rendered sanitized HTML are stored. The frontend renders the pre-sanitized HTML.
- **File upload validation:** MIME type validation against allowlist (images, PDFs, common doc formats). Size limit: 25 MB per file. Filename sanitization to prevent path traversal.

### Transport & Headers

- **TLS:** All traffic encrypted via Caddy reverse proxy (automatic Let's Encrypt certificates).
- **CSRF:** SameSite=Lax cookies for browser sessions. Bearer token API calls are inherently CSRF-safe.
- **Content-Security-Policy:** Strict CSP headers served by Caddy. `script-src 'self'`, no inline scripts, no `eval`.
- **CORS:** Allowlist of known origins (frontend domain). No wildcard in production.
- **Security headers:** `X-Content-Type-Options: nosniff`, `X-Frame-Options: DENY`, `Strict-Transport-Security`.

## Design Language

Targets Linear's aesthetic: minimal, fast, slightly dark-IDE feel.

- **Spacing:** tight, no wasted space
- **Colors:** muted base palette, high-contrast accents only for status/priority
- **Borders:** almost none — separation via spacing and subtle background shifts
- **Dark mode:** default, light mode secondary
- **Typography:** Inter, small-but-readable sizes
- **Animations:** subtle slides and fades, 100-150ms, nothing bouncy
- **Optimistic updates:** every interaction feels instant, syncs in background

## Real-Time Updates (Centrifugo)

Centrifugo handles both live UI updates and notification delivery over WebSocket. Redis is not used for WebSocket pub/sub: Centrifugo manages its own client connections and receives events that the backend publishes through its server API.
### Channel Structure

| Channel | Scope | Subscribers |
|---------|-------|-------------|
| `project:{id}` | All mutations in a project | All connected project members |
| `node:{id}` | Mutations to a specific node | Clients viewing the focus widget for that node |
| `user:{id}` | Personal notifications | Single user's connected clients |

### Events Pushed

| Event | Layer | Channel | Payload |
|-------|-------|---------|---------|
| `node.status_changed` | 2 | `project:{id}` + `node:{id}` | node_id, old_status, new_status, actor |
| `node.created` | 1/2 | `project:{id}` | node_id, parent_id, type, title, actor |
| `node.deleted` | 1/2 | `project:{id}` + `node:{id}` | node_id, actor |
| `node.reparented` | 1/2 | `project:{id}` + `node:{id}` | node_id, old_parent, new_parent, actor |
| `comment.added` | 2 | `node:{id}` | comment_id, node_id, author, preview |
| `link.changed` | 2/3 | `project:{id}` | source_id, target_id, link_type, layer, action (created/removed) |
| `assignment.changed` | 2 | `project:{id}` + `node:{id}` | node_id, old_assignee, new_assignee |
| `artifact.attached` | 4 | `project:{id}` + `node:{id}` | artifact_id, node_id, title, kind, actor |
| `artifact.removed` | 4 | `project:{id}` + `node:{id}` | artifact_id, node_id, actor |
| `connection.inferred` | 3 | `project:{id}` | source_id, target_id, link_type, source: "inferred" |
| `notification` | — | `user:{id}` | notification object |

The `layer` field on `link.changed` events tells the client which layer the change affects, enabling clients to ignore events for inactive layers.

### Backend Publish Flow

```
Mutation request
  → Postgres + Neo4j writes
  → Centrifugo server API: publish event to relevant channels
  → Taskiq: queue webhook delivery + search index update
  → Response to client
```

The backend publishes to Centrifugo via its HTTP server API (not through Redis pub/sub). This gives direct control over which channels receive which events.
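
The publish flow above, together with the write order from the Dual-Database Consistency section, can be sketched as a single function. This is a simplified, synchronous illustration with hypothetical names (`Stores`, `change_status`, and the callables inside the bundle); real code would use the SQLAlchemy session, the Neo4j driver, Centrifugo's HTTP server API, and Taskiq in place of plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stores:
    """Hypothetical bundle of side-effecting callables, injected for clarity."""
    pg_write: Callable[[str, str], None]         # stage a row, uncommitted
    pg_commit: Callable[[], None]
    pg_rollback: Callable[[], None]
    graph_set_status: Callable[[str, str], str]  # returns the previous status
    publish: Callable[[str, dict], None]         # channel, payload
    enqueue: Callable[[str, dict], None]         # task name, payload


def change_status(s: Stores, project_id: str, node_id: str,
                  new_status: str, actor: str) -> dict:
    # 1. Postgres first: write inside an open transaction, do not commit yet.
    s.pg_write(node_id, new_status)
    # 2. Neo4j second: on failure, roll back Postgres (nothing was committed).
    try:
        old_status = s.graph_set_status(node_id, new_status)
    except Exception:
        s.pg_rollback()
        raise
    # 3. Commit Postgres: on failure, compensate by reverting the graph change.
    try:
        s.pg_commit()
    except Exception:
        s.graph_set_status(node_id, old_status)
        raise
    # 4. Publish to the channels listed for node.status_changed, then queue async work.
    payload = {"event": "node.status_changed", "node_id": node_id,
               "old_status": old_status, "new_status": new_status, "actor": actor}
    for channel in (f"project:{project_id}", f"node:{node_id}"):
        s.publish(channel, payload)
    s.enqueue("webhooks", payload)
    s.enqueue("indexing", payload)
    return payload


# Usage with in-memory fakes standing in for the real services.
graph = {"n1": "todo"}
published: list[str] = []
queued: list[str] = []


def set_status(node_id: str, status: str) -> str:
    old = graph[node_id]
    graph[node_id] = status
    return old


stores = Stores(
    pg_write=lambda node_id, status: None,
    pg_commit=lambda: None,
    pg_rollback=lambda: None,
    graph_set_status=set_status,
    publish=lambda channel, payload: published.append(channel),
    enqueue=lambda task, payload: queued.append(task),
)
result = change_status(stores, "p1", "n1", "done", "actor-1")
```

Injecting the side effects as callables keeps the ordering visible in one place and lets integration tests swap in failing variants to exercise the compensation path.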
### Client-Side Handling

- **Pinia store:** Incoming Centrifugo events are applied to the Pinia store. The graph view, focus widget, and list view all react to store changes.
- **Optimistic updates:** The client applies mutations locally before the server responds. If the server rejects the mutation (4xx), the client reverts the optimistic change by re-fetching the affected node.
- **Conflict model:** Last-write-wins for simple fields (status, assignee, labels). The server is the source of truth. When two clients modify the same field concurrently, the last write committed to Neo4j is the one that Centrifugo broadcasts.
- **Reconnection:** On WebSocket disconnect, the client re-subscribes to channels and fetches the current state to catch up on missed events.

### Cross-Platform

- **Tauri desktop:** No offline support. Tauri wraps the Vue app as-is. When the network is unavailable, the app shows a connection-lost banner and retries. No local mutation queue.

## Docker Compose

### Development

```yaml
services:
  api:           # FastAPI (uvicorn --reload)
  frontend:      # Vue 3 (vite dev server)
  worker:        # Taskiq worker (same codebase as api)
  neo4j:         # Graph database
  postgres:      # Relational database
  redis:         # Cache + rate limiting
  meilisearch:   # Search engine
  minio:         # Object storage
  centrifugo:    # Real-time WebSocket server
  authentik:     # Identity provider (server + worker)
  authentik-db:  # Authentik's own Postgres
```

~12 containers. Runs comfortably on 16GB RAM.

### Production (Single-Node)

Same Docker Compose topology with production-grade additions:

```yaml
services:
  # ... all of the above, plus:
  caddy:         # Reverse proxy + automatic TLS
  vault:         # Secrets management (HashiCorp Vault)
  prometheus:    # Metrics collection
  grafana:       # Dashboards + alerting
  loki:          # Log aggregation
  tempo:         # Distributed tracing
```

~18 containers total. Recommended: 32GB RAM, 4+ CPU cores for production.
## Reverse Proxy (Caddy)

Caddy serves as the single entry point for all traffic:

- **Automatic TLS** via Let's Encrypt (ACME). Zero-config HTTPS.
- **Routes:** `/api/*` → FastAPI, `/ws/*` → Centrifugo, `/*` → Vue frontend (nginx or static files).
- **Security headers:** CSP, HSTS, X-Frame-Options, X-Content-Type-Options injected at this layer.
- **Rate limiting:** Basic connection-level rate limiting as a first defense layer (application-level rate limiting in FastAPI for finer control).

## Secrets Management

### HashiCorp Vault (Primary)

- All sensitive configuration (database passwords, Authentik client secrets, agent API token signing keys, webhook HMAC secrets, MinIO credentials) stored in Vault.
- FastAPI reads secrets from Vault at startup via the `hvac` Python client.
- Secret rotation supported without application restart (Vault dynamic secrets for Postgres credentials).

### Docker Secrets (Fallback)

For simpler deployments that don't want Vault overhead, Docker secrets via compose files are supported. Environment variables as the last resort.

## Observability

### Metrics (Prometheus + Grafana)

- **FastAPI:** `prometheus-fastapi-instrumentator` exposes request latency, status codes, in-flight requests at `/metrics`.
- **Neo4j:** Neo4j Prometheus plugin or `neo4j-exporter` for query latency, cache hit rates, transaction counts.
- **Postgres:** `postgres_exporter` for connection pool, query stats, replication lag.
- **Redis:** `redis_exporter` for memory, hit rate, connected clients.
- **Centrifugo:** Built-in Prometheus metrics for connections, channels, messages.
- **Grafana dashboards:** Pre-built dashboards for each service. Alerting rules for error rate spikes, high latency, container restarts.

### Tracing (OpenTelemetry + Tempo)

- OpenTelemetry SDK instrumented in FastAPI. Traces span the full request lifecycle: auth → policy check → Neo4j query → Postgres query → response.
- Trace context propagated to Taskiq workers (webhook delivery, indexing).
- Traces stored in Grafana Tempo, queryable from Grafana.

### Logging (Structured JSON + Loki)

- All services emit structured JSON logs (Python `structlog` for FastAPI).
- Fields: timestamp, level, correlation_id, actor_id, action, duration_ms.
- Collected by Grafana Loki via Docker logging driver or Promtail.
- Correlation ID links logs across FastAPI → Taskiq → Centrifugo for a single request.

### Health Checks

Every service exposes a health check endpoint used by Docker Compose `healthcheck` directives:

- `GET /health` on FastAPI, Centrifugo
- TCP checks for Neo4j, Postgres, Redis, Meilisearch, MinIO
- Grafana alerts on health check failures.

## Database Migrations

### Postgres (Alembic)

- Alembic manages all Postgres schema migrations.
- Migration files stored in `alembic/versions/`.
- Auto-generated from SQLModel model changes (`alembic revision --autogenerate`).
- Applied on deployment: `alembic upgrade head` runs before the API container starts.

### Neo4j (Versioned Cypher Scripts)

- Migration scripts stored in `neo4j/migrations/` as numbered Cypher files (`001_initial_schema.cypher`, `002_add_cycle_nodes.cypher`).
- A lightweight migration runner (Python script) tracks applied migrations in a Neo4j `:Migration` node.
- Applied on deployment before the API container starts.

## Testing Strategy

### Integration Tests (Primary)

- **Framework:** pytest with testcontainers.
- **Containers:** Neo4j, Postgres, Redis, Meilisearch spun up per test session (shared across tests for speed, reset between test classes).
- **Scope:** API endpoint tests hitting real databases. Policy engine tests with real Neo4j graph structures. Dual-DB consistency tests verifying write-order semantics.
- **Fixtures:** Factory functions that create graph structures (components, issues, links) for test scenarios.

### End-to-End Tests

- **Framework:** Playwright against the full Docker Compose stack.
- **Scope:** Critical user flows — create project, add components, navigate graph, triage inbox, agent API workflows.
- **Environment:** Dedicated `docker-compose.test.yml` with ephemeral containers.

### What's Not Mandated

Isolated unit tests are not required by convention. The dual-DB architecture makes mocking both databases brittle. Integration tests with real containers are the priority.

## CI/CD Pipeline

```
push/MR → lint → test → build → push → deploy
```

| Stage | Tools | Description |
|-------|-------|-------------|
| **Lint** | ruff (Python), eslint + prettier (Vue/TS) | Code style and static analysis |
| **Test** | pytest + testcontainers, Playwright | Integration + E2E tests |
| **Build** | Docker | Build API, frontend, worker images |
| **Push** | Container registry | Push tagged images to GitLab Container Registry |
| **Deploy** | SSH + docker compose pull | Pull new images on production server, rolling restart |

CI runs on GitLab CI. Pipeline definition in `.gitlab-ci.yml`. Testcontainers require Docker-in-Docker or a privileged runner.

## Open Technical Questions

1. **Graph viz library:** D3 vs Cytoscape — prototype comparison pending
2. **Neo4j driver:** official `neo4j` Python driver vs `neomodel` OGM
3. **Gantt implementation:** custom or frappe-gantt as starting point