# Non-Linear: Tech Stack & Architecture

## Stack Overview

```
┌─────────────────────────────────────────────────────────┐
│                        FRONTEND                         │
│  Vue 3 + Tailwind + Headless UI + ECharts               │
│  Graph Viz: TBD (D3 vs Cytoscape — eval pending)        │
│  Command Palette: vue-command-palette / custom           │
│  Keybindings: VueUse useMagicKeys                       │
│  Icons: Lucide  │  Font: Inter  │  Motion: @vueuse/motion│
│  State: Pinia   │  HTTP: ofetch │  WS: centrifuge-js     │
├─────────────────────────────────────────────────────────┤
│                     CROSS-PLATFORM                      │
│  Desktop: Tauri (thin wrapper, no offline) — v0.1       │
│  Mobile: Capacitor (responsive web first) — v0.2+       │
├─────────────────────────────────────────────────────────┤
│                        BACKEND                          │
│  FastAPI (Python)                                       │
│  Taskiq (async task queue — webhooks, imports, agents)   │
├─────────────────────────────────────────────────────────┤
│                      DATA LAYER                         │
│  Neo4j — graph topology (nodes, edges, status, labels)  │
│  Postgres — content & metadata (rich text, comments,    │
│             attachments meta, audit logs, project cfg)   │
│  Redis — caching, rate limiting                         │
│  Meilisearch — full-text search (issues, comments)      │
│  MinIO — S3-compatible file storage (attachments)       │
├─────────────────────────────────────────────────────────┤
│                      REAL-TIME                          │
│  Centrifugo — WebSocket server, live updates, push      │
├─────────────────────────────────────────────────────────┤
│                         AUTH                            │
│  Authentik — OIDC, API tokens, role mgmt, SSO-ready     │
├─────────────────────────────────────────────────────────┤
│                       INFRA/OPS                         │
│  Caddy (reverse proxy + TLS)  │  Vault (secrets)        │
│  Prometheus + Grafana (metrics + dashboards)            │
│  Loki (logs) │ Tempo (traces) │ OpenTelemetry (SDK)     │
├─────────────────────────────────────────────────────────┤
│                     DEPLOYMENT                          │
│  Docker Compose (dev + single-node production)          │
└─────────────────────────────────────────────────────────┘
```

## Data Boundary

### Neo4j — Graph Topology

Owns the decomposition tree and all overlay edges across the four data layers:

- **Node labels:** `Component` (Layer 1), `Issue` (Layer 2), `Artifact` (Layer 4), `Cycle`, `Project`
- Node identity (UUID), short ID
- Lightweight properties: status, labels, assignee_id, created_at, updated_at
- **Layer 1 edges:** `HAS_CHILD` between components (decomposition tree)
- **Layer 1→2 edges:** `HAS_CHILD` from components to issues (work attachment)
- **Layer 2 edges:** `BLOCKS`, `DUPLICATES`, `RELATES_TO` between issues (work coordination)
- **Layer 3 edges:** `DEPENDS_ON`, `IMPORTS`, `CALLS_API`, `SHARES_DB` between components (code connections)
- **Layer 4 edges:** `HAS_ARTIFACT` from components/issues to artifacts
- Project root references, cycle membership

Each overlay edge type (Layers 2 to 4) is scoped to a single layer, while `HAS_CHILD` forms the decomposition tree across Layers 1 and 2. This enables efficient layer-filtered queries: a Cypher query for "show me this subtree with only Layer 3 edges" simply filters by relationship type.
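As a rough illustration of layer filtering, the sketch below builds a Cypher query that lists one layer's edges inside a subtree. The names `LAYER_EDGE_TYPES` and `layer_edges_query` are hypothetical, and the query shape is an assumption rather than the project's actual query. Cypher cannot take relationship types as parameters, so the types are interpolated only from a fixed allowlist, while the user-supplied root ID stays a `$root_id` parameter.

```python
# Overlay relationship types per layer, copied from the lists above.
LAYER_EDGE_TYPES = {
    2: ("BLOCKS", "DUPLICATES", "RELATES_TO"),
    3: ("DEPENDS_ON", "IMPORTS", "CALLS_API", "SHARES_DB"),
}


def layer_edges_query(layer: int) -> str:
    """Return Cypher listing one layer's edges inside the subtree under $root_id."""
    if layer not in LAYER_EDGE_TYPES:
        raise ValueError(f"no overlay edge types defined for layer {layer}")
    # Types come from the allowlist above, never from user input.
    rel_types = "|".join(LAYER_EDGE_TYPES[layer])
    return (
        "MATCH (root {id: $root_id})-[:HAS_CHILD*0..]->(n)\n"
        "WITH collect(DISTINCT n) AS nodes\n"
        "UNWIND nodes AS a\n"
        f"MATCH (a)-[r:{rel_types}]->(b)\n"
        "WHERE b IN nodes\n"
        "RETURN a.id AS source, type(r) AS type, b.id AS target"
    )


print(layer_edges_query(3))
```

The root match carries no label here for brevity; a real query would likely add one so the `id` index can be used.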

**Why Neo4j over Postgres recursive CTEs:** Queries like "find all unblocked leaves in this subtree," "critical path through `BLOCKS` links," and "everything 3 hops from this node" are what Cypher is built for. Recursive CTEs get painful with lateral links and variable-depth queries. The gap widens with Layer 3 code connections (multi-hop dependency chains) and, in v0.2+, with cross-project edges.

### Postgres — Content & Metadata

- **Rich text content:** issue and component descriptions (markdown)
- **Comment threads:** body, author, parent_comment_id (threading), timestamps
- **Attachment metadata:** filename, size, mime_type, s3_key, uploader_id, uploaded_at (inline attachments in comments/descriptions)
- **Artifact metadata (Layer 4):** title, kind, url/file_ref, mime_type — rich metadata for external docs, designs, and uploaded files attached to nodes
- **User/agent accounts:** profile data, preferences, notification settings
- **Project settings:** configuration, member lists, default policies
- **Audit logs:** who changed what, when, with before/after snapshots
- **Policy definitions:** role templates, custom permission rules

**Linked to Neo4j by UUID.** Each Neo4j node stores an `id` (e.g. `"abc-123"`), and Postgres stores the full content keyed by the same UUID. FastAPI joins the two as needed. This applies to all node types: Components (Layer 1), Issues (Layer 2), and Artifacts (Layer 4).

### Redis — Caching & Rate Limiting

- Subtree query cache (TTL, invalidated on graph mutations)
- Rate limiting for agent API
- Authentik token validation cache

### Meilisearch — Search Index

- Indexes issue titles, descriptions, comments, labels
- Fed from both Neo4j and Postgres
- Powers command palette search (issues + commands in one result set)
- Typo-tolerant, prefix search, filtering by label/status/assignee

### MinIO — File Storage

- S3-compatible API, self-hosted
- Stores attachment files (images, docs)
- Postgres stores metadata and S3 key; MinIO stores bytes
- Migration path to AWS S3: change the endpoint and credentials in configuration; no application code changes expected

## Concrete Database Schemas

### UUID Strategy

All entities use UUIDv7 (time-sortable). Generated application-side by FastAPI before writing to either database. The same UUID is used as the primary key in both Neo4j and Postgres, serving as the cross-database join key.

### Neo4j Schema

Neo4j stores graph topology and lightweight node properties. All content lives in Postgres.

**Node labels and properties:**

```cypher
// Layer 1: Component node
CREATE (c:Component {
  id: "uuidv7",
  short_id: "NL-C12",
  title: "auth-service",
  status: null,                    // components have no status
  labels: ["backend", "core"],
  owner_id: "uuidv7",
  assignee_id: null,
  repo_provider: "github",
  repo_url: "https://github.com/team/auth",
  repo_path: "/src/oauth",
  repo_branch: "main",
  created_at: datetime(),
  updated_at: datetime()
})

// Layer 2: Issue node
CREATE (i:Issue {
  id: "uuidv7",
  short_id: "NL-42",
  title: "implement refresh tokens",
  status: "todo",
  labels: ["feature", "p1"],
  assignee_id: "uuidv7",
  created_by: "uuidv7",
  cycle_id: "uuidv7",
  created_at: datetime(),
  updated_at: datetime()
})

// Layer 4: Artifact node
CREATE (a:Artifact {
  id: "uuidv7",
  title: "Login flow mockup",
  kind: "link",                    // "link" | "file" | "embed"
  url: "https://figma.com/...",    // for links/embeds
  file_ref: null,                  // MinIO s3_key for uploaded files
  mime_type: null,
  size_bytes: null,
  created_by: "uuidv7",
  created_at: datetime()
})

// Project root (virtual node linking to decomposition tree root)
CREATE (p:Project {
  id: "uuidv7",
  workspace_id: "uuidv7",
  root_id: "uuidv7"
})
```

**Relationships (organized by layer):**

```cypher
// Decomposition tree (parent → child) — Layer 1 + Layer 2
(component)-[:HAS_CHILD]->(component)    // Layer 1: structure nesting
(component)-[:HAS_CHILD]->(issue)        // Layer 1→2: work attached to structure
(issue)-[:HAS_CHILD]->(issue)            // Layer 2: sub-tasks

// Layer 2: Work coordination links (between issues)
(issue)-[:BLOCKS]->(issue)
(issue)-[:RELATES_TO]->(issue)
(issue)-[:DUPLICATES]->(issue)

// Layer 3: Code connection links (between components)
(component)-[:DEPENDS_ON {source: "manual"}]->(component)
(component)-[:IMPORTS {source: "inferred"}]->(component)
(component)-[:CALLS_API {source: "inferred"}]->(component)
(component)-[:SHARES_DB {source: "manual"}]->(component)

// Layer 4: Artifact attachments
(component)-[:HAS_ARTIFACT]->(artifact)
(issue)-[:HAS_ARTIFACT]->(artifact)

// Cycle membership
(issue)-[:IN_CYCLE]->(cycle:Cycle { id, name, start_date, end_date })
```

Layer 3 edges carry a `source` property (`"manual"` or `"inferred"`) to distinguish human-declared dependencies from code-analysis results.

**Indexes:**

```cypher
CREATE INDEX comp_id FOR (c:Component) ON (c.id);
CREATE INDEX comp_short FOR (c:Component) ON (c.short_id);
CREATE INDEX issue_id FOR (i:Issue) ON (i.id);
CREATE INDEX issue_short FOR (i:Issue) ON (i.short_id);
CREATE INDEX issue_status FOR (i:Issue) ON (i.status);
CREATE INDEX issue_assignee FOR (i:Issue) ON (i.assignee_id);
CREATE INDEX artifact_id FOR (a:Artifact) ON (a.id);
CREATE INDEX project_id FOR (p:Project) ON (p.id);
```

### Postgres Schema (SQLModel)

Postgres stores all content, metadata, and configuration. Managed via Alembic migrations.

```python
import uuid
from datetime import datetime

from sqlmodel import JSON, Column, Field, SQLModel
# Assumption: uuid7 comes from the third-party `uuid6` package.
from uuid6 import uuid7


class NodeContent(SQLModel, table=True):
    """Rich content for both components and issues."""
    id: uuid.UUID = Field(primary_key=True)  # matches Neo4j node id
    description: str | None = None           # markdown
    description_html: str | None = None      # pre-rendered, sanitized HTML

class Comment(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    node_id: uuid.UUID = Field(foreign_key="nodecontent.id", index=True)
    author_id: uuid.UUID = Field(foreign_key="actor.id")
    # Threading: null for top-level comments
    parent_comment_id: uuid.UUID | None = Field(default=None, foreign_key="comment.id")
    body: str                                # markdown
    body_html: str                           # pre-rendered, sanitized HTML
    created_at: datetime
    updated_at: datetime

class CommentReaction(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    comment_id: uuid.UUID = Field(foreign_key="comment.id", index=True)
    actor_id: uuid.UUID = Field(foreign_key="actor.id")
    emoji: str                               # e.g. "+1", "rocket"
    created_at: datetime

class Attachment(SQLModel, table=True):
    """File attached inline to a comment or description (e.g. pasted image)."""
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    node_id: uuid.UUID = Field(foreign_key="nodecontent.id", index=True)
    filename: str
    size_bytes: int
    mime_type: str
    s3_key: str                              # MinIO object key
    uploader_id: uuid.UUID = Field(foreign_key="actor.id")
    uploaded_at: datetime

class ArtifactContent(SQLModel, table=True):
    """Layer 4: external context attached to a component or issue.
    Topology (HAS_ARTIFACT edge) lives in Neo4j; rich metadata lives here."""
    id: uuid.UUID = Field(primary_key=True)  # matches Neo4j Artifact node id
    title: str
    kind: str                                # "link" | "file" | "embed"
    url: str | None = None                   # external URL (Figma, Docs, etc.)
    file_ref: str | None = None              # MinIO s3_key for uploaded files
    mime_type: str | None = None
    size_bytes: int | None = None
    node_id: uuid.UUID = Field(foreign_key="nodecontent.id", index=True)
    created_by: uuid.UUID = Field(foreign_key="actor.id")
    created_at: datetime

class Actor(SQLModel, table=True):
    """Human user or AI agent."""
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    type: str                                # "user" | "agent"
    name: str
    email: str | None = None
    authentik_uid: str | None = None         # OIDC subject claim
    # JSON: theme, notifications, etc. (dict fields need an explicit JSON column)
    preferences: dict = Field(default_factory=dict, sa_column=Column(JSON))
    created_at: datetime

class Workspace(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    name: str
    slug: str = Field(unique=True, index=True)
    created_at: datetime

class WorkspaceMember(SQLModel, table=True):
    workspace_id: uuid.UUID = Field(foreign_key="workspace.id", primary_key=True)
    actor_id: uuid.UUID = Field(foreign_key="actor.id", primary_key=True)
    role: str                                # workspace-level role
    joined_at: datetime

class ProjectConfig(SQLModel, table=True):
    id: uuid.UUID = Field(primary_key=True)  # matches Neo4j Project id
    workspace_id: uuid.UUID = Field(foreign_key="workspace.id", index=True)
    name: str
    # JSON: custom statuses, defaults
    settings: dict = Field(default_factory=dict, sa_column=Column(JSON))
    created_at: datetime

class PolicyRule(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    project_id: uuid.UUID = Field(foreign_key="projectconfig.id", index=True)
    actor_id: uuid.UUID | None = Field(default=None, foreign_key="actor.id")  # null = role-level
    role_name: str | None = None
    action: str                              # e.g. "read_node", "create_child", "*"
    resource_scope: str                      # "global" | "subtree:{node_id}" | "node:{node_id}"
    effect: str                              # "allow" | "deny"

class AuditLog(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    project_id: uuid.UUID = Field(foreign_key="projectconfig.id", index=True)
    actor_id: uuid.UUID = Field(foreign_key="actor.id")
    action: str                              # e.g. "status_changed", "reparented"
    node_id: uuid.UUID | None = None
    before: dict | None = Field(default=None, sa_column=Column(JSON))  # JSON snapshot
    after: dict | None = Field(default=None, sa_column=Column(JSON))   # JSON snapshot
    created_at: datetime = Field(index=True)

class WebhookConfig(SQLModel, table=True):
    id: uuid.UUID = Field(default_factory=uuid7, primary_key=True)
    project_id: uuid.UUID = Field(foreign_key="projectconfig.id", index=True)
    url: str
    secret_hash: str                         # hashed, never stored plaintext
    events: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    active: bool = True
    consecutive_failures: int = 0
    created_at: datetime
```

## Dual-Database Consistency

Neo4j and Postgres are **not replicated** — they own different data, linked by UUID. Both writes happen in the same API request. The consistency strategy for v0.1:

### Write Order

1. **Postgres first.** Open a SQLAlchemy transaction. Write content/metadata. Do not commit yet.
2. **Neo4j second.** Perform the graph mutation (create node, update properties, create edge).
3. **Commit Postgres.** If Postgres commit succeeds, the operation is complete.

### Failure Handling

- **Neo4j write fails:** Rollback the Postgres transaction (it hasn't committed). Clean failure, no orphans.
- **Postgres commit fails after Neo4j succeeds:** Issue a compensating operation on Neo4j (delete the node/revert the property change). Log the incident for review.
- **Partial Neo4j failure (e.g., network timeout with unknown state):** Flag the UUID for reconciliation review.
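The write order and failure handling can be sketched as a single orchestration function. This is a simplified illustration, not the project's implementation: every callable name below is hypothetical, and rolling back Postgres in the unknown-state timeout case is an assumption, since the text only says to flag the UUID.

```python
import logging
from typing import Callable

log = logging.getLogger("dual_write")


def dual_write(
    pg_write: Callable[[], None],
    pg_commit: Callable[[], None],
    pg_rollback: Callable[[], None],
    neo_write: Callable[[], None],
    neo_compensate: Callable[[], None],
    flag_for_reconciliation: Callable[[], None],
) -> None:
    pg_write()                      # 1. Postgres write, transaction left open
    try:
        neo_write()                 # 2. graph mutation
    except TimeoutError:
        pg_rollback()               # assumption: roll back on unknown state
        flag_for_reconciliation()   # graph state unknown, review later
        raise
    except Exception:
        pg_rollback()               # nothing committed, clean failure
        raise
    try:
        pg_commit()                 # 3. commit Postgres
    except Exception:
        neo_compensate()            # undo the graph mutation
        log.error("Postgres commit failed after Neo4j write; compensated")
        raise
```

Passing the steps in as callables keeps the ordering logic testable without either database running.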

### Reconciliation Job

A periodic background task (Taskiq, runs every 15 minutes) checks for inconsistencies:

- UUIDs present in Neo4j but missing from Postgres (orphan graph nodes)
- UUIDs present in Postgres `NodeContent` but missing from Neo4j (orphan content)
- Mismatched lightweight properties (status, assignee) between Neo4j and the latest snapshot in the Postgres audit log

Orphans are logged and surfaced in an admin dashboard. Auto-repair is deferred — manual review for v0.1.
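The orphan checks reduce to two set differences over the UUIDs found in each store. A minimal sketch, assuming the job has already loaded both ID sets into memory (the function name `find_orphans` and the result keys are hypothetical):

```python
def find_orphans(neo4j_ids: set[str], postgres_ids: set[str]) -> dict[str, set[str]]:
    """Compare the UUIDs known to each store and report what is missing."""
    return {
        # In the graph but with no content row
        "orphan_graph_nodes": neo4j_ids - postgres_ids,
        # Content row with no graph node
        "orphan_content": postgres_ids - neo4j_ids,
    }


report = find_orphans({"a", "b"}, {"b", "c"})
print(report)
```

For large projects the comparison would more likely run in batches, but the logic stays the same.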

### What's Eventually Consistent

- **Meilisearch index:** Updated asynchronously via Taskiq. Acceptable lag of seconds.
- **Redis cache:** Invalidated on mutation. TTL-based expiry as fallback.
- **Centrifugo events:** Fire-and-forget publish. Missed events are recoverable by client re-fetch.

## Backend Architecture

### FastAPI Application Structure

```
non-linear-api/
├── app/
│   ├── main.py                 # App, middleware, startup/shutdown
│   ├── config.py               # Settings from env vars
│   ├── dependencies.py         # Shared deps (db sessions, auth, current_user)
│   ├── auth/                   # Authentik integration
│   │   ├── oidc.py             # Token validation, OIDC discovery
│   │   ├── permissions.py      # Policy engine evaluation
│   │   └── agent_tokens.py     # API token management for agents
│   ├── graph/                  # Neo4j layer
│   │   ├── connection.py       # Neo4j driver management
│   │   ├── queries.py          # Cypher query templates
│   │   ├── mutations.py        # Graph write operations
│   │   └── traversal.py        # Subtree, path, neighbor queries
│   ├── content/                # Postgres layer
│   │   ├── models.py           # SQLAlchemy/SQLModel models
│   │   ├── descriptions.py     # Rich text CRUD
│   │   ├── comments.py         # Comment thread CRUD
│   │   ├── attachments.py      # Inline attachment metadata + MinIO upload/download
│   │   └── artifacts.py        # Layer 4: artifact CRUD (links, files, embeds)
│   ├── connections/            # Layer 3: code connection analysis
│   │   ├── inference.py        # Auto-infer dependencies from repo analysis
│   │   └── manual.py           # Manual code connection CRUD
│   ├── search/                 # Meilisearch integration
│   │   ├── indexer.py          # Index updates on mutations
│   │   └── search.py           # Query interface
│   ├── realtime/               # WebSocket layer
│   │   ├── manager.py          # Connection management
│   │   └── events.py           # Event types and broadcasting
│   ├── tasks/                  # Taskiq background jobs
│   │   ├── webhooks.py         # Deliver webhooks to agent endpoints
│   │   ├── indexing.py         # Async search index updates
│   │   ├── notifications.py    # Notification delivery
│   │   └── connections.py      # Layer 3: periodic code connection inference
│   └── api/v1/                 # Route handlers
│       ├── nodes.py            # CRUD + tree operations
│       ├── links.py            # Lateral link management (Layer 2 + Layer 3)
│       ├── projects.py         # Project CRUD
│       ├── comments.py         # Comment endpoints
│       ├── attachments.py      # Inline upload/download
│       ├── artifacts.py        # Layer 4: artifact endpoints
│       ├── connections.py      # Layer 3: code connection endpoints
│       ├── search.py           # Search endpoint
│       └── agent.py            # Agent-specific API surface
├── tests/
├── alembic/                    # Postgres migrations
├── docker-compose.yml
└── pyproject.toml
```

### Request Flows

**Typical read ("get node with full context"):**

```
Client → FastAPI → Auth middleware (validate token via Authentik)
  → Policy engine (check permissions)
  → Neo4j: fetch node + parent + children + links
  → Postgres: fetch description, comments, attachment meta
  → Merge response → Client
```

**Typical write ("change node status"):**

```
Client → FastAPI → Auth → Policy engine
  → Neo4j: update node status
  → Redis: invalidate cache
  → Centrifugo: publish event (broadcast to subscribed clients)
  → Taskiq: queue webhook delivery, search index update
  → Response → Client
```

### Sync Strategy (Neo4j ↔ Postgres)

Neo4j and Postgres are not replicated; each owns different data, linked by UUID. Both writes happen in the same API request, and a compensating-transaction pattern keeps them consistent (see Dual-Database Consistency). Eventual consistency is acceptable for the search index and cache.

## Auth Architecture

```
┌──────────┐     OIDC token      ┌───────────┐
│  Vue App ├─────────────────────►│ Authentik │
└────┬─────┘     (login flow)    └─────┬─────┘
     │                                 │
     │ Bearer token                    │ Token introspection
     ▼                                 ▼
┌──────────┐◄────────────────────┌───────────┐
│ FastAPI  │  validate token     │ Authentik │
│ (resource│  check claims       │  (OIDC    │
│  server) │                     │  provider)│
└──────────┘                     └───────────┘
```

- **Human users:** OIDC login flow. JWT access tokens.
- **AI agents:** API tokens issued through Authentik, tied to agent actor accounts.
- **FastAPI:** pure resource server. Validates tokens, reads claims, enforces policies.

## API Error Contract

All error responses use a consistent envelope:

```json
{
  "error": {
    "code": "validation_error",
    "message": "Human-readable description",
    "details": [
      { "field": "title", "message": "Field is required" }
    ]
  }
}
```

### HTTP Status Codes

| Code | Usage |
|------|-------|
| `400` | Malformed request that cannot be parsed (e.g., invalid JSON) |
| `404` | Resource not found **or** actor lacks permission to see it. Permission-denied nodes return 404 (not 403) to prevent information leakage about resource existence. |
| `409` | Conflict (e.g., duplicate `short_id`, stale update) |
| `422` | Validation error (e.g., missing or invalid fields). FastAPI/Pydantic field-level detail, returned in the envelope's `details` array. |
| `429` | Rate limited. Includes `Retry-After` header (seconds). |
| `500` | Internal server error. Logged with correlation ID for debugging. |

### Rate Limiting

- Agent API: token bucket per actor, configurable per role (default: 100 req/min).
- Human API: higher limits (default: 300 req/min).
- Enforced via Redis. `429` response includes `Retry-After` and `X-RateLimit-Remaining` headers.

## Security

### Input Sanitization

- **Cypher injection:** All Neo4j queries use parameterized Cypher exclusively. User-supplied values are never interpolated into query strings. The `graph/queries.py` module enforces this by accepting only typed parameters.
- **SQL injection:** SQLModel/SQLAlchemy parameterized queries. No raw SQL with string formatting.
- **XSS prevention:** All markdown content (descriptions, comments) is rendered to HTML server-side, and the rendered HTML is sanitized with `nh3` (Rust-based HTML sanitizer) before storage. Both the raw markdown and the sanitized HTML are stored; the frontend renders only the pre-sanitized HTML.
- **File upload validation:** MIME type validation against allowlist (images, PDFs, common doc formats). Size limit: 25 MB per file. Filename sanitization to prevent path traversal.
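The upload checks can be sketched with the standard library alone. The allowlist contents and helper names below are hypothetical (the text only names the categories), and reading 25 MB as 25 × 1024 × 1024 bytes is an assumption.

```python
import os
import re

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as MiB
# Hypothetical allowlist; the real list is not specified in this document.
ALLOWED_MIME_TYPES = {"image/png", "image/jpeg", "image/gif", "application/pdf"}


def sanitize_filename(name: str) -> str:
    """Strip directory parts and unsafe characters from a client filename."""
    base = os.path.basename(name.replace("\\", "/"))
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or "unnamed"


def validate_upload(filename: str, mime_type: str, size_bytes: int) -> str:
    """Return a safe filename, or raise ValueError if the upload is rejected."""
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError(f"MIME type not allowed: {mime_type}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 25 MB limit")
    return sanitize_filename(filename)


print(validate_upload("../../etc/report final.pdf", "application/pdf", 1024))
```

A client-declared MIME type is only a hint, so a production check would likely also inspect the file's leading bytes.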

### Transport & Headers

- **TLS:** All traffic encrypted via Caddy reverse proxy (automatic Let's Encrypt certificates).
- **CSRF:** SameSite=Lax cookies for browser sessions. Bearer token API calls are inherently CSRF-safe.
- **Content-Security-Policy:** Strict CSP headers served by Caddy. `script-src 'self'`, no inline scripts, no `eval`.
- **CORS:** Allowlist of known origins (frontend domain). No wildcard in production.
- **Security headers:** `X-Content-Type-Options: nosniff`, `X-Frame-Options: DENY`, `Strict-Transport-Security`.

## Design Language

Targets Linear's aesthetic: minimal, fast, slightly dark-IDE feel.

- **Spacing:** tight, no wasted space
- **Colors:** muted base palette, high-contrast accents only for status/priority
- **Borders:** almost none — separation via spacing and subtle background shifts
- **Dark mode:** default, light mode secondary
- **Typography:** Inter, small-but-readable sizes
- **Animations:** subtle slides and fades, 100-150ms, nothing bouncy
- **Optimistic updates:** every interaction feels instant, syncs in background

## Real-Time Updates (Centrifugo)

Centrifugo handles both live UI updates and notification delivery over WebSocket. Redis is not used for WebSocket pub/sub: Centrifugo manages its own client connections, and the backend publishes events to it through the Centrifugo server API.

### Channel Structure

| Channel | Scope | Subscribers |
|---------|-------|-------------|
| `project:{id}` | All mutations in a project | All connected project members |
| `node:{id}` | Mutations to a specific node | Clients viewing the focus widget for that node |
| `user:{id}` | Personal notifications | Single user's connected clients |

### Events Pushed

| Event | Layer | Channel | Payload |
|-------|-------|---------|---------|
| `node.status_changed` | 2 | `project:{id}` + `node:{id}` | node_id, old_status, new_status, actor |
| `node.created` | 1/2 | `project:{id}` | node_id, parent_id, type, title, actor |
| `node.deleted` | 1/2 | `project:{id}` + `node:{id}` | node_id, actor |
| `node.reparented` | 1/2 | `project:{id}` + `node:{id}` | node_id, old_parent, new_parent, actor |
| `comment.added` | 2 | `node:{id}` | comment_id, node_id, author, preview |
| `link.changed` | 2/3 | `project:{id}` | source_id, target_id, link_type, layer, action (created/removed) |
| `assignment.changed` | 2 | `project:{id}` + `node:{id}` | node_id, old_assignee, new_assignee |
| `artifact.attached` | 4 | `project:{id}` + `node:{id}` | artifact_id, node_id, title, kind, actor |
| `artifact.removed` | 4 | `project:{id}` + `node:{id}` | artifact_id, node_id, actor |
| `connection.inferred` | 3 | `project:{id}` | source_id, target_id, link_type, source: "inferred" |
| `notification` | — | `user:{id}` | notification object |

The `layer` field on `link.changed` events tells the client which layer the change affects, enabling clients to ignore events for inactive layers.

### Backend Publish Flow

```
Mutation request → Postgres + Neo4j writes
  → Centrifugo server API: publish event to relevant channels
  → Taskiq: queue webhook delivery + search index update
  → Response to client
```

The backend publishes to Centrifugo via its HTTP server API (not through Redis pub/sub). This gives direct control over which channels receive which events.

### Client-Side Handling

- **Pinia store:** Incoming Centrifugo events are applied to the Pinia store. The graph view, focus widget, and list view all react to store changes.
- **Optimistic updates:** The client applies mutations locally before the server responds. If the server rejects the mutation (4xx), the client reverts the optimistic change by re-fetching the affected node.
- **Conflict model:** Last-write-wins for simple fields (status, assignee, labels). The server is the source of truth. When two clients modify the same field concurrently, the last write committed to Neo4j wins, and all clients converge on it through the resulting Centrifugo event.
- **Reconnection:** On WebSocket disconnect, the client re-subscribes to channels and fetches the current state to catch up on missed events.

### Cross-Platform

- **Tauri desktop:** No offline support. Tauri wraps the Vue app as-is. When the network is unavailable, the app shows a connection-lost banner and retries. No local mutation queue.

## Docker Compose

### Development

```yaml
services:
  api:          # FastAPI (uvicorn --reload)
  frontend:     # Vue 3 (vite dev server)
  worker:       # Taskiq worker (same codebase as api)
  neo4j:        # Graph database
  postgres:     # Relational database
  redis:        # Cache + rate limiting
  meilisearch:  # Search engine
  minio:        # Object storage
  centrifugo:   # Real-time WebSocket server
  authentik:    # Identity provider (server + worker)
  authentik-db: # Authentik's own Postgres
```

~12 containers. Runs comfortably on 16 GB RAM.

### Production (Single-Node)

Same Docker Compose topology with production-grade additions:

```yaml
services:
  # ... all of the above, plus:
  caddy:        # Reverse proxy + automatic TLS
  vault:        # Secrets management (HashiCorp Vault)
  prometheus:   # Metrics collection
  grafana:      # Dashboards + alerting
  loki:         # Log aggregation
  tempo:        # Distributed tracing
```

~18 containers total. Recommended: 32 GB RAM, 4+ CPU cores for production.

## Reverse Proxy (Caddy)

Caddy serves as the single entry point for all traffic:

- **Automatic TLS** via Let's Encrypt (ACME). Zero-config HTTPS.
- **Routes:** `/api/*` → FastAPI, `/ws/*` → Centrifugo, `/*` → Vue frontend (nginx or static files).
- **Security headers:** CSP, HSTS, X-Frame-Options, X-Content-Type-Options injected at this layer.
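- **Rate limiting:** Basic connection-level rate limiting as a first line of defense, with application-level rate limiting in FastAPI for finer control. Caddy's standard distribution does not ship a rate-limit module, so this assumes a build with a rate-limiting plugin.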

## Secrets Management

### HashiCorp Vault (Primary)

- All sensitive configuration (database passwords, Authentik client secrets, agent API token signing keys, webhook HMAC secrets, MinIO credentials) stored in Vault.
- FastAPI reads secrets from Vault at startup via the `hvac` Python client.
- Secret rotation supported without application restart (Vault dynamic secrets for Postgres credentials).

### Docker Secrets (Fallback)

For simpler deployments that don't want Vault overhead, Docker secrets via compose files are supported. Environment variables as the last resort.

## Observability

### Metrics (Prometheus + Grafana)

- **FastAPI:** `prometheus-fastapi-instrumentator` exposes request latency, status codes, in-flight requests at `/metrics`.
- **Neo4j:** Neo4j Prometheus plugin or `neo4j-exporter` for query latency, cache hit rates, transaction counts.
- **Postgres:** `postgres_exporter` for connection pool, query stats, replication lag.
- **Redis:** `redis_exporter` for memory, hit rate, connected clients.
- **Centrifugo:** Built-in Prometheus metrics for connections, channels, messages.
- **Grafana dashboards:** Pre-built dashboards for each service. Alerting rules for error rate spikes, high latency, container restarts.

### Tracing (OpenTelemetry + Tempo)

- OpenTelemetry SDK instrumented in FastAPI. Traces span the full request lifecycle: auth → policy check → Neo4j query → Postgres query → response.
- Trace context propagated to Taskiq workers (webhook delivery, indexing).
- Traces stored in Grafana Tempo, queryable from Grafana.

### Logging (Structured JSON + Loki)

- All services emit structured JSON logs (Python `structlog` for FastAPI).
- Fields: timestamp, level, correlation_id, actor_id, action, duration_ms.
- Collected by Grafana Loki via Docker logging driver or Promtail.
- Correlation ID links logs across FastAPI → Taskiq → Centrifugo for a single request.
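The log shape can be illustrated with the standard library's `logging` module instead of `structlog` (a deliberate substitution to keep the sketch dependency-free). The `correlation_id` context variable and `JsonFormatter` class are hypothetical names; the field set follows the list above.

```python
import contextvars
import json
import logging
import time

# Set once per request (e.g. in middleware); read by every log call.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the documented fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "correlation_id": correlation_id.get(),
            "actor_id": getattr(record, "actor_id", None),
            "action": getattr(record, "action", record.getMessage()),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("non_linear")
log.addHandler(handler)
log.setLevel(logging.INFO)

correlation_id.set("req-123")
log.info("status_changed", extra={"actor_id": "a1", "duration_ms": 12})
```

Propagating the same correlation ID into Taskiq job arguments would let worker logs join the request's trail.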

### Health Checks

Every service has a health check, wired into Docker Compose `healthcheck` directives:

- `GET /health` on FastAPI, Centrifugo
- TCP checks for Neo4j, Postgres, Redis, Meilisearch, MinIO
- Grafana alerts on health check failures.

## Database Migrations

### Postgres (Alembic)

- Alembic manages all Postgres schema migrations.
- Migration files stored in `alembic/versions/`.
- Auto-generated from SQLModel model changes (`alembic revision --autogenerate`).
- Applied on deployment: `alembic upgrade head` runs before the API container starts.

### Neo4j (Versioned Cypher Scripts)

- Migration scripts stored in `neo4j/migrations/` as numbered Cypher files (`001_initial_schema.cypher`, `002_add_cycle_nodes.cypher`).
- A lightweight migration runner (Python script) tracks applied migrations in a Neo4j `:Migration` node.
- Applied on deployment before the API container starts.
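The core of such a runner is deciding which scripts still need to run. A minimal sketch of that selection step, assuming the set of already-applied filenames has been read from the `:Migration` tracking data (the function name and the filename pattern are illustrative):

```python
import re

# Matches names like "001_initial_schema.cypher"
MIGRATION_NAME = re.compile(r"^(\d+)_.+\.cypher$")


def pending_migrations(filenames: list[str], applied: set[str]) -> list[str]:
    """Return unapplied migration files in numeric order."""
    candidates = []
    for name in filenames:
        match = MIGRATION_NAME.match(name)
        if match and name not in applied:
            candidates.append((int(match.group(1)), name))
    return [name for _, name in sorted(candidates)]


files = ["002_add_cycle_nodes.cypher", "001_initial_schema.cypher", "README.md"]
print(pending_migrations(files, applied={"001_initial_schema.cypher"}))
```

The runner would then execute each returned file in order and record its name in Neo4j before moving to the next one.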

## Testing Strategy

### Integration Tests (Primary)

- **Framework:** pytest with testcontainers.
- **Containers:** Neo4j, Postgres, Redis, Meilisearch spun up per test session (shared across tests for speed, reset between test classes).
- **Scope:** API endpoint tests hitting real databases. Policy engine tests with real Neo4j graph structures. Dual-DB consistency tests verifying write-order semantics.
- **Fixtures:** Factory functions that create graph structures (components, issues, links) for test scenarios.

### End-to-End Tests

- **Framework:** Playwright against the full Docker Compose stack.
- **Scope:** Critical user flows — create project, add components, navigate graph, triage inbox, agent API workflows.
- **Environment:** Dedicated `docker-compose.test.yml` with ephemeral containers.

### What's Not Mandated

Isolated unit tests are not required by convention. The dual-DB architecture makes mocking both databases brittle. Integration tests with real containers are the priority.

## CI/CD Pipeline

```
push/MR → lint → test → build → push → deploy
```

| Stage | Tools | Description |
|-------|-------|-------------|
| **Lint** | ruff (Python), eslint + prettier (Vue/TS) | Code style and static analysis |
| **Test** | pytest + testcontainers, Playwright | Integration + E2E tests |
| **Build** | Docker | Build API, frontend, worker images |
| **Push** | Container registry | Push tagged images to GitLab Container Registry |
| **Deploy** | SSH + docker compose pull | Pull new images on production server, rolling restart |

CI runs on GitLab CI. Pipeline definition in `.gitlab-ci.yml`. Testcontainers require Docker-in-Docker or a privileged runner.

## Open Technical Questions

1. **Graph viz library:** D3 vs Cytoscape — prototype comparison pending
2. **Neo4j driver:** official `neo4j` Python driver vs `neomodel` OGM
3. **Gantt implementation:** custom or frappe-gantt as starting point