# Architecture

This document describes how the workflow engine actually works. If you're looking for setup instructions, check the README. If you want to understand the design decisions behind the workflow orchestration, see WORKFLOW_ORCHESTRATOR_DESIGN.md.

## What This System Does

The workflow engine is a multi-tenant backend service that executes user-defined workflows. Think of it as the business logic layer behind a visual workflow builder like Zapier or n8n, but designed to be embedded in your own product.

Users create workflow graphs (nodes and edges) through your frontend. When triggered, the engine executes those graphs node-by-node, handling API calls, conditional logic, parallel branches, delays, and callbacks from external systems. Each organization's data is isolated at the database level using PostgreSQL Row-Level Security.

The system doesn't have its own UI. It exposes REST APIs that your frontend consumes. Authentication is done via API keys (service-to-service), not user logins.

## High-Level Architecture

The system splits responsibilities between two main components:

**Django (Control Plane)** - Stores workflow definitions, manages triggers, handles multi-tenancy, and provides REST APIs for frontends. It's the source of truth for "what workflows exist" and "who can access what."

**Hatchet (Runtime)** - Executes workflow graphs durably. When a workflow starts, Django hands off execution to Hatchet, which handles the hard distributed systems problems: retries, timeouts, waiting for external callbacks, and surviving process crashes.

This separation keeps concerns clear: Django does CRUD and multi-tenant filtering, Hatchet handles durable execution, and each sticks to what it's good at.

## Three-Layer Structure

### API Layer

The API layer is Django REST Framework running in Gunicorn (for HTTP) and Daphne (for WebSockets). It handles:

- REST endpoints for managing workflows, triggers, and viewing execution history
- API key authentication and tenant isolation
- Input validation against JSON schemas
- WebSocket connections for real-time execution updates

When a request comes in, middleware extracts the API key, validates it, looks up the organization, and sets PostgreSQL's RLS context. From that point on, all database queries automatically filter by that organization.

Key middleware:
- `TenantMiddleware` - Validates API keys and sets RLS context (runs before DRF auth)
- `RequestLoggingMiddleware` - Logs requests with timing for debugging

The API is documented via drf-spectacular (OpenAPI/Swagger).

### Worker Layer

The worker layer has two types of processes:

**Celery Workers** - Handle background tasks like checking scheduled triggers (cron jobs) and cleaning up expired callbacks. These are traditional background jobs, not workflow execution. Celery Beat acts as the scheduler.

**Hatchet Workers** - Run workflow graphs. These long-lived processes register with the Hatchet server, receive workflow execution requests, and interpret graphs node-by-node. If a node needs to wait for an external callback (e.g., human approval), the workflow pauses and Hatchet persists the state. When the callback arrives, execution resumes right where it left off.

There's a single Hatchet workflow definition (`graph-executor`) that handles all workflow templates by interpreting their graph structure at runtime. This avoids creating separate Hatchet workflows for each user template.

### Data Layer

PostgreSQL 16 stores everything: organizations, API keys, workflow templates, execution history, credentials metadata, audit logs. Redis handles caching and acts as the message broker for Celery and Channels (WebSockets).

The key feature here is Row-Level Security (RLS). Every tenant-scoped table has policies that automatically filter queries by `app.current_tenant`. Middleware sets this per request, so application code can't read another org's rows, even when a query forgets the tenant filter.

Tables without RLS:
- `api_key` - Needs to be looked up before knowing which tenant
- `organization` - The tenant root
- `auth_user` - Django admin users
- `workflow_triggers` - Bootstrap lookup for webhooks (secured via webhook_secret)
- `action_callbacks` - Bootstrap lookup for callbacks (secured via callback_token)

## Authentication Flow

The system uses API key authentication, not user sessions or JWT. Each organization has one or more API keys with a format like `wfe_a1b2c3d4e5f6...` (256-bit entropy).

Here's what happens on each request:

1. Client sends `X-API-Key: wfe_a1b2c3d4e5f6...` header
2. `TenantMiddleware` runs (before DRF authentication):
   - Extracts the key
   - Looks up the key by prefix and validates the SHA256 hash
   - Sets `request.organization` and `request.api_key`
   - Calls PostgreSQL's `set_current_tenant(org_id)` function
3. `APIKeyAuthentication` (DRF) reads the middleware result and returns a dummy user object
4. View handles the request with `request.organization` available

The actual key is only shown once at creation. After that, only the SHA256 hash is stored.
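
The key handling above can be sketched with the standard library alone. This is a minimal illustration, not the project's `APIKey` model: `generate_api_key`, `verify_api_key`, and the 12-character lookup prefix are assumptions made for the example.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX = "wfe_"  # prefix shown in this document
LOOKUP_LEN = 12      # hypothetical: leading characters kept in clear for lookup


def generate_api_key() -> tuple[str, str, str]:
    """Return (plaintext_key, lookup_prefix, sha256_hex).

    The plaintext is shown to the caller once; only the prefix and
    the hash would be persisted.
    """
    plaintext = KEY_PREFIX + secrets.token_hex(32)  # 32 random bytes = 256 bits
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, plaintext[:LOOKUP_LEN], digest


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare it to the stored hash in constant time."""
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

A plain SHA-256 hash is reasonable here because the key itself carries 256 bits of randomness; password-style slow hashing is aimed at low-entropy secrets.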

Django admin uses separate authentication: Django's standard admin login, with accounts created via `createsuperuser`. Superusers can use the `X-Switch-Tenant` header to access specific tenant data for debugging.

## Multi-Tenancy via RLS

The multi-tenancy strategy is: single database, shared schema, automatic filtering via PostgreSQL Row-Level Security.

Every tenant-scoped table has an `organization_id` foreign key and RLS policies like:

```sql
CREATE POLICY workflow_template_isolation_policy ON workflow_template
    USING (organization_id = get_current_tenant());
```

Middleware sets the current tenant before any ORM queries:

```sql
SELECT set_current_tenant('organization-uuid'::uuid);
```

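From that point on, PostgreSQL automatically filters all queries. You can write `WorkflowTemplate.objects.all()` and it only returns templates for that org. Application code can't turn this off: only PostgreSQL superusers and roles with `BYPASSRLS` skip the policies (table owners do too, unless the table uses `FORCE ROW LEVEL SECURITY`).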

This defense-in-depth approach means even if application code has a bug, cross-tenant data leaks are prevented at the database level.
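
From Python, setting the tenant comes down to one parameterized call on a database cursor. The helper below is a sketch under that assumption; `set_tenant` is a hypothetical name, and the real `TenantMiddleware` may structure this differently.

```python
import uuid


def set_tenant(cursor, organization_id: uuid.UUID) -> None:
    """Call the set_current_tenant() SQL function on a DB-API style cursor.

    The organization ID is passed as a query parameter, never interpolated
    into the SQL string.
    """
    cursor.execute(
        "SELECT set_current_tenant(%s::uuid)",
        [str(organization_id)],
    )
```

In Django, the cursor would typically come from `django.db.connection.cursor()`, so the call runs on the same connection as the ORM queries that follow.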

## Project Structure

The codebase is organized in a domain-grouped layout. Here's the complete structure with what each piece does:

```
workflow-engine/
├── config/                  # Django project configuration
│   ├── settings.py         # Main settings (database, DRF, middleware, apps)
│   ├── urls.py             # Root URL routing
│   ├── celery.py           # Celery configuration and beat schedule
│   ├── wsgi.py             # WSGI entry point (Gunicorn)
│   └── asgi.py             # ASGI entry point (Daphne, WebSockets)
│
├── apps/                    # All application code
│   ├── core/               # Foundation layer (no business logic)
│   │   ├── authentication/ # API key authentication
│   │   │   ├── models.py   # APIKey model (hash storage, validation)
│   │   │   ├── api_key_auth.py  # DRF authentication class
│   │   │   ├── admin.py    # Django admin for API keys
│   │   │   ├── urls.py     # Empty (auth via middleware)
│   │   │   ├── apps.py     # App config
│   │   │   ├── management/commands/
│   │   │   │   └── create_api_key.py  # CLI tool to generate keys
│   │   │   └── migrations/ # Database migrations
│   │   │
│   │   ├── organizations/  # Multi-tenancy root
│   │   │   ├── models.py   # Organization model (tenant entity)
│   │   │   ├── views.py    # CRUD endpoints for orgs
│   │   │   ├── serializers.py
│   │   │   ├── urls.py     # Organization routes
│   │   │   ├── admin.py    # Django admin for orgs
│   │   │   ├── apps.py     # App config
│   │   │   ├── tests.py    # Unit tests
│   │   │   └── migrations/ # Database migrations
│   │   │
│   │   └── common/         # Shared utilities
│   │       ├── enums.py    # All status enums (TemplateStatus, TriggerType, etc.)
│   │       ├── middleware.py  # TenantMiddleware (RLS), RequestLoggingMiddleware
│   │       ├── apps.py     # App config
│   │       ├── management/commands/
│   │       │   ├── seed_data.py     # Create dev org + API key
│   │       │   ├── verify_setup.py  # Verify RLS works
│   │       │   └── reset_db.py      # Nuke database (dev only)
│   │       └── migrations/ # Database migrations
│   │
│   ├── workflows/          # Core domain (MOST IMPORTANT)
│   │   ├── models.py       # 6 main models:
│   │   │                   #  - WorkflowAction (action catalog)
│   │   │                   #  - WorkflowTemplate (graph definitions)
│   │   │                   #  - WorkflowTrigger (trigger configs)
│   │   │                   #  - WorkflowInstance (execution projections)
│   │   │                   #  - ActionCallback (async callback tracking)
│   │   │                   #  - WorkflowEvent (audit log)
│   │   │
│   │   ├── views.py        # REST API ViewSets:
│   │   │                   #  - WorkflowActionViewSet (read-only catalog)
│   │   │                   #  - WorkflowTemplateViewSet (CRUD + activate/clone)
│   │   │                   #  - WorkflowTriggerViewSet (CRUD + pause/resume)
│   │   │                   #  - WorkflowInstanceViewSet (manual trigger + list)
│   │   │                   #  - WebhookTriggerView (public webhook receiver)
│   │   │                   #  - CallbackView (public callback receiver)
│   │   │
│   │   ├── serializers.py  # DRF serializers for all models
│   │   ├── urls.py         # URL routing
│   │   ├── tasks.py        # Celery tasks (check scheduled triggers, cleanup)
│   │   ├── admin.py        # Django admin config (workflow testing interface)
│   │   ├── apps.py         # App config
│   │   ├── tests.py        # Unit tests
│   │   │
│   │   ├── services/       # Business logic layer (SERVICE LAYER PATTERN)
│   │   │   ├── __init__.py
│   │   │   ├── execution_service.py    # Start, cancel, complete workflows
│   │   │   ├── template_service.py     # Create, activate, clone templates
│   │   │   ├── trigger_manager.py      # Trigger matching and webhook validation
│   │   │   ├── graph_validator.py      # Graph structure validation
│   │   │   └── schema_validator.py     # JSON schema validation
│   │   │
│   │   ├── hatchet_workflows/  # Hatchet workflow definitions
│   │   │   ├── __init__.py
│   │   │   ├── graph_executor.py   # Main workflow: interprets graphs
│   │   │   └── executor.py         # Helper: node execution logic
│   │   │
│   │   ├── hatchet_client.py   # Singleton Hatchet client
│   │   │
│   │   ├── internal_actions/   # Built-in action handlers
│   │   │   ├── __init__.py
│   │   │   ├── catalog.py          # System action definitions (SYSTEM_ACTIONS)
│   │   │   ├── handler_registry.py # Action handler lookup
│   │   │   └── handlers/
│   │   │       ├── __init__.py
│   │   │       ├── http.py         # HTTP requests (GET, POST, etc.)
│   │   │       ├── email.py        # Send emails
│   │   │       ├── transform.py    # Data transformations (jq, JSONPath)
│   │   │       ├── log.py          # Logging
│   │   │       ├── delay.py        # Sleep/wait
│   │   │       └── ai.py           # AI completions (LLM calls)
│   │   │
│   │   ├── templates/      # Django templates (for admin UI)
│   │   │   └── admin/workflows/
│   │   │       └── workflow_tester.html  # Admin workflow testing interface
│   │   │
│   │   ├── management/commands/
│   │   │   ├── __init__.py
│   │   │   ├── run_hatchet_worker.py   # Start Hatchet worker
│   │   │   └── sync_action_catalog.py  # Sync SYSTEM_ACTIONS to DB
│   │   │
│   │   └── migrations/     # Database migrations
│   │
│   ├── iam/                # Identity and access management
│   │   ├── credentials/    # Credential management
│   │   │   ├── models.py   # CredentialsMetadata, CredentialAccessLog, LocalCredential
│   │   │   ├── views.py    # CRUD endpoints for credentials
│   │   │   ├── serializers.py  # Credential serializers
│   │   │   ├── services.py # CredentialService (fetch with caching, rotation)
│   │   │   ├── permissions.py  # Permission checks
│   │   │   ├── throttling.py   # Rate limiting
│   │   │   ├── urls.py     # Credential routes
│   │   │   ├── admin.py    # Django admin
│   │   │   ├── apps.py     # App config
│   │   │   ├── tests.py    # Unit tests
│   │   │   └── migrations/ # Database migrations
│   │   │
│   │   └── rbac/           # Role-based access control (DEPRECATED)
│   │       ├── models.py   # Role, Permission models (unused with API keys)
│   │       ├── views.py    # RBAC endpoints (deprecated)
│   │       ├── admin.py    # Django admin
│   │       ├── apps.py     # App config
│   │       ├── tests.py    # Unit tests
│   │       └── migrations/ # Database migrations
│   │
│   ├── integrations/       # External service integrations
│   │   ├── infisical/      # Secrets management
│   │   │   ├── __init__.py # get_vault_manager() factory
│   │   │   ├── vault_manager.py        # VaultManager (Infisical SDK wrapper)
│   │   │   ├── local_vault_manager.py  # LocalVaultManager (dev fallback)
│   │   │   ├── models.py              # LocalCredential (encrypted storage)
│   │   │   ├── views.py    # Infisical endpoints (if any)
│   │   │   ├── admin.py    # Django admin
│   │   │   ├── apps.py     # App config
│   │   │   ├── tests.py    # Unit tests
│   │   │   └── migrations/ # Database migrations
│   │   │
│   │   └── n8n/            # n8n integration
│   │       ├── models.py   # N8NWorkflow, N8NExecution, ActivityN8NMapping
│   │       ├── views.py    # n8n webhook endpoints
│   │       ├── admin.py    # Django admin
│   │       ├── apps.py     # App config
│   │       ├── tests.py    # Unit tests
│   │       └── migrations/ # Database migrations
│   │
│   └── operations/         # Cross-cutting operational concerns
│       ├── audit/          # Audit logging
│       │   ├── models.py   # AuditLog model
│       │   ├── views.py    # Audit log endpoints
│       │   ├── admin.py    # Django admin
│       │   ├── apps.py     # App config
│       │   ├── tests.py    # Unit tests
│       │   └── migrations/ # Database migrations
│       │
│       ├── files/          # File management
│       │   ├── models.py   # File metadata
│       │   ├── views.py    # File upload/download endpoints
│       │   ├── admin.py    # Django admin
│       │   ├── apps.py     # App config
│       │   ├── tests.py    # Unit tests
│       │   └── migrations/ # Database migrations
│       │
│       ├── notifications/  # Multi-channel notifications
│       │   ├── models.py   # NotificationTemplate, NotificationLog
│       │   ├── views.py    # Notification endpoints
│       │   ├── admin.py    # Django admin
│       │   ├── apps.py     # App config
│       │   ├── tests.py    # Unit tests
│       │   └── migrations/ # Database migrations
│       │
│       └── system/         # System monitoring
│           ├── models.py   # SystemHealth, Metrics
│           ├── views.py    # Health check endpoints
│           ├── admin.py    # Django admin
│           ├── apps.py     # App config
│           ├── tests.py    # Unit tests
│           └── migrations/ # Database migrations
│
├── docs/                   # Documentation
│   ├── ARCHITECTURE.md     # This file (system architecture guide)
│   ├── WORKFLOW_ORCHESTRATOR_DESIGN.md  # Workflow design deep dive
│   ├── MIGRATION_PLAN.md   # Temporal → Hatchet migration plan
│   ├── SECURITY.md         # Security considerations
│   └── adr/                # Architecture Decision Records
│       ├── README.md       # ADR index
│       ├── 0001-domain-grouped-project-structure.md
│       ├── 0002-webhook-secret-write-once-pattern.md
│       ├── 0003-manual-trigger-endpoint-and-trigger-metadata-rename.md
│       └── 0004-api-key-authentication.md
│
├── scripts/                # Utility scripts
│   └── init-db.sql         # PostgreSQL initialization (RLS functions, extensions)
│
├── manage.py               # Django management script
├── pyproject.toml          # Project dependencies (PEP 621)
├── requirements.txt        # Python dependencies (Docker uses this)
├── requirements-local.txt  # Local dev dependencies
├── pytest.ini              # Pytest configuration
├── .env.example            # Environment variables template (if exists)
├── .gitignore              # Git ignore patterns
├── Dockerfile              # Backend container definition
├── docker-compose.yml      # Main services (postgres, redis, backend, workers)
├── docker-compose.hatchet.yml   # Hatchet engine services
├── docker-compose.infisical.yml # Infisical services (optional)
├── docker-compose.override.yml  # Local dev overrides
├── CLAUDE.md               # AI assistant instructions
├── README.md               # Setup and getting started
├── TODOS.md                # Task tracker and migration progress
└── Agents.md               # Agent workflow documentation (if exists)
```

### Key Directories Explained

**config/** - Django project settings. Start here to understand how apps are wired together. `settings.py` has the installed apps list, middleware stack, and DRF config. `celery.py` defines periodic tasks. `urls.py` is the root URL router that includes all app URLs.

**apps/workflows/** - The heart of the system. If you're touching business logic, it's here. Models define the domain, views handle HTTP, services contain the logic, and hatchet_workflows run the graphs. The `templates/` subdirectory contains Django templates for the admin interface workflow testing UI.

**apps/workflows/services/** - Service layer pattern. All business logic goes here, NOT in views or models. Views are thin adapters, services do the work. This keeps business rules testable and reusable. Every service returns domain objects or raises domain exceptions.

**apps/workflows/internal_actions/** - Built-in action system. `catalog.py` defines SYSTEM_ACTIONS (the built-in action definitions). `handler_registry.py` maps action slugs to handler functions. `handlers/` contains the actual handler implementations (http.py for HTTP requests, email.py for emails, etc.). Add new system actions by creating a handler and registering it in the catalog.
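
To illustrate the slug-to-handler mapping, here is a minimal registry sketch. The class name `HandlerRegistry`, the decorator style, and the `log.message` slug are assumptions for the example; the actual `handler_registry.py` may look different.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


class HandlerRegistry:
    """Maps action slugs to handler callables (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, slug: str) -> Callable[[Handler], Handler]:
        def decorator(func: Handler) -> Handler:
            if slug in self._handlers:
                raise ValueError(f"duplicate handler for {slug!r}")
            self._handlers[slug] = func
            return func
        return decorator

    def get(self, slug: str) -> Handler:
        try:
            return self._handlers[slug]
        except KeyError:
            raise LookupError(f"no handler registered for {slug!r}") from None


registry = HandlerRegistry()


@registry.register("log.message")  # hypothetical slug
def log_message(inputs: Dict[str, Any]) -> Dict[str, Any]:
    # A real handler would write to the log; this one just echoes its input.
    return {"logged": inputs.get("message", "")}
```

Failing loudly on duplicate or unknown slugs keeps catalog mistakes visible at registration or dispatch time instead of silently running the wrong handler.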

**apps/workflows/hatchet_workflows/** - Hatchet workflow definitions. `graph_executor.py` contains the main `graph-executor` workflow that interprets all user templates at runtime. `executor.py` has the node execution logic. This is the runtime that executes user-defined graphs.

**apps/workflows/templates/** - Django templates (NOT workflow templates). Contains HTML files for Django admin UI, specifically the workflow testing interface at `admin/workflows/workflow_tester.html`. This is standard Django templates directory structure.

**apps/core/common/** - Shared utilities. `enums.py` has ALL status enums (single source of truth). `middleware.py` has the RLS logic and request logging. The management commands for dev setup (seed_data, verify_setup, reset_db) also live here.

**apps/iam/credentials/** - Credential metadata only. Actual secrets are in Infisical (vault_manager.py) or LocalCredential model (dev). Never store plaintext secrets in Django models. `services.py` handles credential fetching with caching.

**apps/integrations/** - External service adapters. Keep integration code isolated here so it's easy to swap or mock. `infisical/` handles secrets management. `n8n/` handles external workflow execution.

**apps/operations/** - Cross-cutting concerns that don't fit in the core domain. Audit logs, file management, notifications, and system health monitoring. These are supporting features, not core business logic.

**scripts/** - Utility scripts for setup and maintenance. `init-db.sql` is run during Docker initialization to set up PostgreSQL extensions (uuid-ossp, pg_trgm, btree_gin) and create RLS functions (`set_current_tenant()`, `get_current_tenant()`). This is the foundation for multi-tenant isolation.

### Important Conventions

**Service Layer Pattern** - Business logic lives in `services/`, not views or models. Views validate input and call services. Models are just data structures.

**RLS Everywhere** - Every tenant-scoped model has `organization_id` and RLS policies. Middleware sets context per-request. You can't opt out.

**Status Enums** - Defined in `apps/core/common/enums.py`. Don't hardcode status strings. Import the enum.

**UUID Primary Keys** - All models use UUIDs, not integers. This makes IDs impractical to enumerate and keeps multi-database sharding possible.

**No Soft Deletes on Most Models** - Templates have ARCHIVED status. Otherwise, deletes are hard deletes. KISS principle.

**Migrations** - Each app has its own `migrations/` directory containing database schema changes. Django auto-generates these when you modify models. Run `poe makemigrations` after model changes to generate migration files, then `poe migrate` to apply them to the database. Migrations are sequential and versioned.

**Management Commands** - Custom Django management commands go in `management/commands/` within each app. Each command is a Python file with a `Command` class that extends `BaseCommand`. Examples: `seed_data` (creates test org and API key), `sync_action_catalog` (syncs SYSTEM_ACTIONS to DB), `run_hatchet_worker` (starts Hatchet workflow worker). Run with `python manage.py <command_name>` or `poe <command_name>`.

**Apps.py** - Every Django app has an `apps.py` file defining the app configuration (name, verbose name, default auto field). This registers the app with Django.

**Admin.py** - Django admin interface configuration. Registers models for admin CRUD and customizes the admin UI. The workflows admin has a custom workflow tester interface.

**__init__.py** - Python package marker. Makes directories importable as Python modules. Some have initialization code (e.g., `integrations/infisical/__init__.py` exports `get_vault_manager()`).

### Entry Points

**HTTP API** - `config/wsgi.py` → Gunicorn → Django → `config/urls.py` → app URLs → views

**WebSocket** - `config/asgi.py` → Daphne → Channels → routing (not heavily used yet)

**Celery Worker** - `config/celery.py` → app tasks (e.g., `apps/workflows/tasks.py`)

**Hatchet Worker** - `python manage.py run_hatchet_worker` → registers workflows → waits for execution requests

**Django Admin** - `/admin/` → standard Django admin (for superusers only)

### Testing

**Location** - Each app has a `tests.py` file. Tests live next to the code they test.

**Run Tests** - `poe test` (runs pytest in Docker) or `pytest` locally

**Config** - `pytest.ini` has settings. Uses `pytest-django` for database fixtures.

**Coverage** - `pytest --cov=apps` generates coverage reports

### Development Workflow

1. **First Time Setup**: `poe bootstrap` (builds containers, starts services, runs migrations, seeds data)
2. **Daily Start**: `poe up` (starts all services)
3. **Make Code Changes**: Edit files, Django auto-reloads
4. **Add Dependencies**: Edit `pyproject.toml`, rebuild container: `poe build`
5. **Database Changes**: Edit models → `poe makemigrations` → `poe migrate`
6. **Run Tests**: `poe test`
7. **Check Logs**: `poe logs` (backend), `poe hatchet-logs` (Hatchet)
8. **Shell Access**: `poe shell` (Django shell), `poe bash` (container bash), `poe dbshell` (psql)

### Where to Find Things

**Add a new API endpoint?** → Create a view in `apps/{app}/views.py`, add URL in `apps/{app}/urls.py`, include in `config/urls.py`

**Add a new action handler?** → Create handler in `apps/workflows/internal_actions/handlers/{name}.py`, register in `catalog.py`, run `poe sync-catalog`

**Add a new model?** → Create in `apps/{app}/models.py`, add RLS policy in SQL if tenant-scoped, run `poe makemigrations` and `poe migrate`

**Change authentication?** → Edit `apps/core/common/middleware.py` and `apps/core/authentication/api_key_auth.py`

**Add a Celery task?** → Create in `apps/{app}/tasks.py`, register in `config/celery.py` if periodic

**Change Hatchet workflow?** → Edit `apps/workflows/hatchet_workflows/graph_executor.py`, restart Hatchet worker

**Add environment variable?** → Add to `.env`, update `config/settings.py` to read it via `decouple`

The structure is designed to be greppable and navigable. Domain-grouped means related code stays together. Services keep logic testable. The `workflows` app is the core; everything else is supporting infrastructure.

## Core Models

Six models form the workflow domain:

**WorkflowAction** - The action catalog. Each entry defines what an action does (e.g., "create Salesforce lead") and how to execute it (internal handler, n8n workflow, or webhook). System actions have `organization=NULL`, org-specific actions are scoped to an org.

**WorkflowTemplate** - User-created workflow graphs stored as `{nodes: [...], edges: [...]}`. Templates have a status lifecycle: DRAFT (editable) → ACTIVE (frozen, triggers can run) → INACTIVE/ARCHIVED. Active templates can't be modified; you clone them to create new drafts.

**WorkflowTrigger** - Links templates to trigger conditions. Types: MANUAL, WEBHOOK_INBOUND, SCHEDULE_CRON, and various events. Config includes cron expressions, webhook secrets, and JSONLogic filters for event matching.

**WorkflowInstance** - Queryable projection of execution state. This is NOT the source of truth (Hatchet is). It's an index for multi-tenant queries and WebSocket fanout. Updated via events from Hatchet, not by polling.

**ActionCallback** - Tracks callbacks awaited by async action nodes (e.g., waiting for n8n to complete, or for a human approval). Security via `callback_token` (64-char random string). Idempotency via `consumed_at` timestamp.

**WorkflowEvent** - Append-only event log for debugging and audit trails. Captures instance lifecycle events (started, completed, failed) and node-level events (node.started, node.completed, etc.).

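These models have RLS policies, except for the tables listed under "Tables without RLS" in the Data Layer section: `workflow_triggers` and `action_callbacks` need a bootstrap lookup before the tenant is known, and `organization` and `api_key` sit outside RLS entirely.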
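
The `WorkflowTemplate` lifecycle can be pictured as a small state machine. The sketch below is an assumption-heavy illustration: the document only names the statuses and the DRAFT → ACTIVE → INACTIVE/ARCHIVED direction, so the exact transition table (for example, whether INACTIVE can return to ACTIVE) is hypothetical.

```python
from enum import Enum


class TemplateStatus(str, Enum):
    DRAFT = "DRAFT"
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    ARCHIVED = "ARCHIVED"


# Assumed transition table; the real rules live in the template service.
ALLOWED_TRANSITIONS = {
    TemplateStatus.DRAFT: {TemplateStatus.ACTIVE, TemplateStatus.ARCHIVED},
    TemplateStatus.ACTIVE: {TemplateStatus.INACTIVE, TemplateStatus.ARCHIVED},
    TemplateStatus.INACTIVE: {TemplateStatus.ACTIVE, TemplateStatus.ARCHIVED},
    TemplateStatus.ARCHIVED: set(),
}


def can_transition(current: TemplateStatus, target: TemplateStatus) -> bool:
    """Return True if the assumed lifecycle allows current -> target."""
    return target in ALLOWED_TRANSITIONS[current]


def is_editable(status: TemplateStatus) -> bool:
    """Only drafts can be modified; other templates are cloned into new drafts."""
    return status is TemplateStatus.DRAFT
```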

## Workflow Execution Flow

Here's what happens when a workflow runs:

### 1. Trigger Fires

A workflow can be triggered in several ways:

- **Manual**: `POST /api/v1/workflows/instances/` with template ID and input data
- **Webhook**: External system posts to `/api/v1/workflows/webhooks/{trigger_id}/` with webhook secret
- **Scheduled**: Celery beat runs `check_scheduled_triggers` task every minute, evaluates cron expressions
- **Event**: Internal events trigger via `TriggerManager.match_triggers()`

### 2. Execution Service Creates Instance

`ExecutionService.start_workflow()` is called with:
- Template reference
- Trigger type and metadata
- Input data
- Optional idempotency key

It creates a `WorkflowInstance` record (status=PENDING) and dispatches to Hatchet:

```python
workflow_run = graph_executor_workflow.run_no_wait(
    input={
        'instance_id': str(instance.id),
        'organization_id': str(org.id),
        'graph': template.graph,
        'input_data': input_data,
    },
)
```

The instance is updated with `hatchet_workflow_id` and status=RUNNING.
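
The create-then-dispatch sequence, including the optional idempotency key, can be sketched in memory. Everything here is illustrative: `InMemoryExecutionService`, the `Instance` dataclass, and the `dispatch` callable stand in for the real service, model, and Hatchet call.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class Instance:
    organization_id: str
    template_id: str
    status: str = "PENDING"
    hatchet_workflow_id: Optional[str] = None
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


class InMemoryExecutionService:
    """Illustrative stand-in for ExecutionService (no database, no Hatchet)."""

    def __init__(self, dispatch: Callable[[Instance, Dict[str, Any]], str]) -> None:
        self._dispatch = dispatch  # returns a run ID, like the Hatchet dispatch
        self._by_key: Dict[Tuple[str, str], Instance] = {}

    def start_workflow(
        self,
        organization_id: str,
        template_id: str,
        input_data: Dict[str, Any],
        idempotency_key: Optional[str] = None,
    ) -> Instance:
        # A repeated key within the same organization returns the first instance.
        if idempotency_key is not None:
            existing = self._by_key.get((organization_id, idempotency_key))
            if existing is not None:
                return existing

        instance = Instance(organization_id, template_id)  # status=PENDING
        instance.hatchet_workflow_id = self._dispatch(instance, input_data)
        instance.status = "RUNNING"

        if idempotency_key is not None:
            self._by_key[(organization_id, idempotency_key)] = instance
        return instance
```

Scoping the key by organization means two tenants can reuse the same idempotency key without colliding.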

### 3. Hatchet Worker Executes Graph

The `GraphExecutor` class interprets the graph using ready-set scheduling:

1. Start with the trigger node in the ready set
2. Execute all ready nodes concurrently via `asyncio.gather()`
3. After batch completes, find successors that are now ready (dependencies met, edge conditions passed)
4. Repeat until the ready set is empty

This naturally handles PARALLEL/MERGE nodes:
- PARALLEL node completes → all its successors become ready simultaneously
- Multiple branches execute concurrently in the next batch
- MERGE node waits until all (or any, depending on mode) incoming branches complete
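
The ready-set loop described above can be sketched in a few lines of `asyncio`. This is a simplified illustration, not the real `GraphExecutor`: `run_graph` is a hypothetical function, edge conditions are left out, and a node becomes ready only when all of its predecessors have finished (the "all" merge mode).

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]


async def run_graph(
    nodes: List[str],
    edges: List[Edge],
    execute: Callable[[str], Awaitable[object]],
    start: str,
) -> List[List[str]]:
    """Run nodes batch by batch and return the batches in execution order."""
    predecessors: Dict[str, Set[str]] = {n: set() for n in nodes}
    successors: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        predecessors[dst].add(src)
        successors[src].append(dst)

    done: Set[str] = set()
    ready: List[str] = [start]
    batches: List[List[str]] = []

    while ready:
        # Every node in the ready set runs concurrently.
        await asyncio.gather(*(execute(node) for node in ready))
        done.update(ready)
        batches.append(ready)

        # A successor becomes ready once all of its predecessors are done.
        next_ready: List[str] = []
        for node in ready:
            for succ in successors[node]:
                if (
                    succ not in done
                    and succ not in next_ready
                    and predecessors[succ] <= done
                ):
                    next_ready.append(succ)
        ready = next_ready

    return batches
```

For a graph shaped trigger → parallel → (a, b) → merge → end, both branches land in the same batch and the merge node runs only after both have finished.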

For each node, `GraphExecutor._execute_single_node()` is called:

- **trigger**: Returns the input data
- **action**: Dispatches to the executor (internal/n8n/webhook)
- **condition**: Evaluates a boolean expression
- **delay**: Sleeps for specified duration
- **parallel**: Pass-through (scheduling handles the forking)
- **merge**: Collects outputs from branches
- **end**: Terminal node

### 4. Action Execution

For action nodes, the executor:

1. Fetches the `WorkflowAction` from the catalog
2. Resolves variable references in inputs (e.g., `{{trigger.email}}` becomes the actual email)
3. Fetches credentials if needed (from Infisical or local vault)
4. Dispatches based on executor type:

**INTERNAL**: Looks up handler in `ActionHandlerRegistry` and calls it directly. Example: `http.request` handler makes an HTTP call and returns the response.

**N8N**: Creates an `ActionCallback` record, calls n8n's webhook with a callback URL, then waits for the callback by listening for a Hatchet event. When n8n finishes, it posts to the callback URL; the callback view publishes the event, which resumes the workflow.

**WEBHOOK**: Makes an HTTP request to the configured URL with the action inputs. If the webhook is async, creates a callback similar to n8n.
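
Step 2 above (resolving references like `{{trigger.email}}`) could look roughly like this. The sketch assumes dotted paths into a dictionary of prior node outputs; `resolve` and `lookup` are hypothetical names, and the real resolver may support richer syntax.

```python
import re
from typing import Any, Dict

# Matches {{ path.to.value }} with optional whitespace inside the braces.
_REF = re.compile(r"\{\{\s*([A-Za-z0-9_.]+)\s*\}\}")


def lookup(path: str, context: Dict[str, Any]) -> Any:
    """Follow a dotted path through nested dicts; KeyError if it is unknown."""
    value: Any = context
    for part in path.split("."):
        value = value[part]
    return value


def resolve(value: Any, context: Dict[str, Any]) -> Any:
    """Recursively replace {{...}} references in dicts, lists, and strings."""
    if isinstance(value, dict):
        return {key: resolve(item, context) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, context) for item in value]
    if isinstance(value, str):
        whole = _REF.fullmatch(value)
        if whole:
            # A lone reference keeps the referenced value's original type.
            return lookup(whole.group(1), context)
        return _REF.sub(lambda m: str(lookup(m.group(1), context)), value)
    return value
```

Returning the raw value for a lone reference means numbers and nested objects pass through unchanged, while references embedded in longer strings are stringified.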

### 5. Completion

When all nodes finish, `GraphExecutor` returns `{success: True, outputs: {...}}`. The execution service updates the instance status to COMPLETED/FAILED and creates an `INSTANCE_COMPLETED` or `INSTANCE_FAILED` event.

Frontends can poll `/api/v1/workflows/instances/{id}/` or subscribe via WebSocket to get real-time updates.

## Hatchet Integration

Hatchet is configured via environment variables (no Django settings):
- `HATCHET_CLIENT_TOKEN` - API token from Hatchet dashboard
- `HATCHET_CLIENT_TLS_STRATEGY` - Set to "none" for self-hosted

The Hatchet client is a singleton (`apps/workflows/hatchet_client.py`). Workers register the `graph-executor` workflow on startup via the `run_hatchet_worker` management command.

There's only one workflow definition. It interprets all templates at runtime by reading their graph structure. This avoids creating hundreds of Hatchet workflows (one per template).

Hatchet handles:
- Durable execution (survives process crashes)
- Retries with exponential backoff
- Timeouts at the task level
- Event-based waiting (for callbacks)
- Workflow cancellation

The control plane (Django) doesn't know about Hatchet's internal state. It only knows the `hatchet_workflow_id` for correlation. When Hatchet emits events, the execution service updates the `WorkflowInstance` projection.

## n8n Integration

n8n acts as an external workflow executor for actions that need complex orchestration. The integration works like this:

1. An action with `executor=N8N` references an n8n workflow via `executor_config.n8n_workflow_id`
2. When the action runs, `GraphExecutor` creates an `ActionCallback` and calls n8n's webhook URL with:
   - Input data
   - Callback URL: `/api/v1/workflows/callbacks/{instance_id}/{node_id}/`
   - Callback token (for validation)
3. n8n executes its workflow and posts results to the callback URL
4. `CallbackView` validates the token and publishes a Hatchet event
5. Hatchet workflow resumes from the waiting node

The `N8NWorkflow` model maps n8n workflows to workflow templates for tracking. `N8NExecution` records track n8n runs.
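
The token check and `consumed_at` idempotency from the `ActionCallback` model can be sketched as follows. The `Callback` dataclass, the `consume` function, and its string results are assumptions for illustration; the real `CallbackView` also publishes the Hatchet event.

```python
import hmac
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Callback:
    expires_at: datetime
    # token_hex(32) yields 64 hexadecimal characters.
    token: str = field(default_factory=lambda: secrets.token_hex(32))
    consumed_at: Optional[datetime] = None


def consume(callback: Callback, presented_token: str,
            now: Optional[datetime] = None) -> str:
    """Return 'accepted', 'duplicate', 'expired', or 'rejected'."""
    now = now or utc_now()
    # Constant-time comparison avoids leaking how much of the token matched.
    if not hmac.compare_digest(callback.token.encode(), presented_token.encode()):
        return "rejected"
    if callback.consumed_at is not None:
        return "duplicate"  # already handled; do not resume the workflow twice
    if now >= callback.expires_at:
        return "expired"
    callback.consumed_at = now
    return "accepted"
```

In a database-backed version, the `consumed_at` check and update would need to happen atomically (for example, in one conditional `UPDATE`) so two concurrent deliveries cannot both be accepted.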

## Credentials and Secrets

Credentials are stored in Infisical (or a local encrypted fallback for dev). The `CredentialsMetadata` model stores metadata only:
- `vault_path` - Path in Infisical (e.g., `/tenants/{org_id}/credentials`)
- `vault_key` - Key name
- `credential_type` - What kind of credential (oauth, api_key, etc.)
- Rotation schedule and validation status

With the Infisical backend, secret values are never persisted in PostgreSQL; Django keeps only the metadata above. Workers fetch secrets directly from Infisical at execution time:

```python
vault = VaultManager(organization_id)
creds = vault.get(credential_id)
```

The `CredentialService` wraps this with caching (5-minute TTL) and access logging. Every credential access is logged in `CredentialAccessLog` with IP, user, and purpose.
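
A 5-minute TTL cache in front of the vault fetch can be sketched as below. `TTLCache` and its injectable `clock` are hypothetical; the real `CredentialService` may well use Redis or Django's cache framework instead of an in-process dictionary.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Caches fetch(credential_id) results for ttl_seconds (illustrative only)."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float = 300.0,  # 5 minutes, as described above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, credential_id: str) -> Any:
        now = self._clock()
        entry = self._entries.get(credential_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh
        value = self._fetch(credential_id)  # miss or stale: go to the vault
        self._entries[credential_id] = (now, value)
        return value
```

Passing the clock in makes expiry testable without sleeping, and `time.monotonic` avoids surprises when the wall clock is adjusted.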

For local development, `VAULT_BACKEND=local` uses the `LocalCredential` model (encrypted with Fernet) instead of Infisical.

## Background Tasks

Celery handles three main periodic tasks:

**check_scheduled_triggers** (every minute) - Evaluates cron expressions for SCHEDULE_CRON triggers. If the cron matches the current time and the trigger's `status=ACTIVE`, starts the workflow.

**cleanup_expired_callbacks** (hourly) - Finds `ActionCallback` records where `expires_at < now` and `consumed_at IS NULL`. Marks them as expired and fails the associated workflow instance.

**check_credential_rotation** (daily) - Checks which credentials are due for rotation based on their rotation schedule and notifies administrators.
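
A Celery Beat schedule for these three tasks might look roughly like the following. The dotted task paths and entry names are hypothetical (the real module layout is not shown here), and the `CELERY_` prefix assumes the Celery app loads Django settings with `namespace="CELERY"`.

```python
from datetime import timedelta

# Hypothetical task paths; adjust to wherever the tasks actually live.
CELERY_BEAT_SCHEDULE = {
    "check-scheduled-triggers": {
        "task": "workflows.tasks.check_scheduled_triggers",
        "schedule": timedelta(minutes=1),
    },
    "cleanup-expired-callbacks": {
        "task": "workflows.tasks.cleanup_expired_callbacks",
        "schedule": timedelta(hours=1),
    },
    "check-credential-rotation": {
        "task": "credentials.tasks.check_credential_rotation",
        "schedule": timedelta(days=1),
    },
}
```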

Celery is configured with:
- Redis as broker and result backend
- Task time limit: 30 minutes (soft 25, hard 30)
- Late acks (ensures tasks aren't lost on worker crash)
- Prefetch multiplier of 1 (one task at a time per worker)
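
Expressed as Django settings, the options above would correspond to something like this sketch. The Redis URLs are placeholders, and the `CELERY_` prefix again assumes the app is configured with `namespace="CELERY"`; the actual values live in `config/settings.py`.

```python
# Placeholder URLs; in this project they come from environment variables.
CELERY_BROKER_URL = "redis://localhost:6379/0"
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"

# Soft limit raises SoftTimeLimitExceeded inside the task; hard limit kills it.
CELERY_TASK_SOFT_TIME_LIMIT = 25 * 60
CELERY_TASK_TIME_LIMIT = 30 * 60

# Acknowledge only after the task finishes, so a crashed worker's task is redelivered.
CELERY_TASK_ACKS_LATE = True

# Each worker process reserves one task at a time.
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
```

Late acks mean a task can run more than once after a crash, so the periodic tasks should be safe to repeat.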

## API Endpoints

All endpoints require the `X-API-Key` header with a valid API key, except webhooks and callbacks, which use their own validation.

### Authentication
```
/api/auth/                             # Currently empty (auth is via middleware)
```

### Organizations
```
GET    /api/organizations/              List organizations (admin only)
POST   /api/organizations/              Create organization (admin only)
GET    /api/organizations/{id}/         Organization detail
PUT    /api/organizations/{id}/         Update organization
DELETE /api/organizations/{id}/         Delete organization
```

### Credentials
```
GET    /api/credentials/                List credentials for organization
POST   /api/credentials/                Create credential (stores in vault)
GET    /api/credentials/{id}/           Credential metadata detail
PUT    /api/credentials/{id}/           Update credential
DELETE /api/credentials/{id}/           Delete credential (removes from vault)

GET    /api/credential-access-logs/     List access logs (audit trail)
```

### Workflows - Actions
```
GET    /api/v1/workflows/actions/       List action catalog (system + org actions)
GET    /api/v1/workflows/actions/{id}/  Action detail
```

### Workflows - Templates
```
GET    /api/v1/workflows/templates/     List templates
                                        Query params:
                                        - ?status=DRAFT|ACTIVE|INACTIVE|ARCHIVED
                                        - ?include_archived=true

POST   /api/v1/workflows/templates/     Create new template (DRAFT status)

GET    /api/v1/workflows/templates/{id}/     Template detail

PUT    /api/v1/workflows/templates/{id}/     Update template (DRAFT only, 400 if not DRAFT)

DELETE /api/v1/workflows/templates/{id}/     Archive template (soft delete)

POST   /api/v1/workflows/templates/{id}/validate/    Validate graph structure
                                                      Returns: {is_valid, errors, warnings}

POST   /api/v1/workflows/templates/{id}/activate/    Activate template (DRAFT/INACTIVE → ACTIVE)
                                                      Validates first, 400 if invalid

POST   /api/v1/workflows/templates/{id}/deactivate/  Deactivate template (ACTIVE → INACTIVE)

POST   /api/v1/workflows/templates/{id}/clone/       Clone to new DRAFT
                                                      Body: {new_slug, new_name}
```

### Workflows - Triggers
```
GET    /api/v1/workflows/triggers/      List triggers
                                        Query params:
                                        - ?trigger_type=MANUAL|WEBHOOK_INBOUND|SCHEDULE_CRON|...
                                        - ?status=ACTIVE|PAUSED|DISABLED
                                        - ?workflow_template={uuid}

POST   /api/v1/workflows/triggers/      Create trigger
                                        Response includes webhook_secret (shown ONCE)

GET    /api/v1/workflows/triggers/{id}/      Trigger detail

PUT    /api/v1/workflows/triggers/{id}/      Update trigger config/filter/status

DELETE /api/v1/workflows/triggers/{id}/      Delete trigger

POST   /api/v1/workflows/triggers/{id}/pause/    Pause trigger (ACTIVE → PAUSED)

POST   /api/v1/workflows/triggers/{id}/resume/   Resume trigger (PAUSED → ACTIVE)

POST   /api/v1/workflows/triggers/{id}/disable/  Disable trigger (→ DISABLED)
```

### Workflows - Instances
```
GET    /api/v1/workflows/instances/     List instances
                                        Query params:
                                        - ?status=PENDING|RUNNING|COMPLETED|FAILED|...
                                        - ?workflow_template={uuid}
                                        - ?trigger_type=MANUAL|WEBHOOK_INBOUND|...

POST   /api/v1/workflows/instances/     Manually trigger workflow
                                        Body: {
                                          workflow_template: uuid,
                                          input_data: {},
                                          idempotency_key: "optional"
                                        }
                                        Returns: 201 (new) or 200 (idempotent replay)

GET    /api/v1/workflows/instances/{id}/     Instance detail
                                              Includes: node_states, input/output data

POST   /api/v1/workflows/instances/{id}/cancel/    Cancel running instance
                                                    Body: {reason: "optional"}
```
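
As an illustration of the manual trigger endpoint, a client could build the request as in the sketch below. The helper name, host, API key, and template UUID are made up for the example; the path, header, and body fields follow the listing above.

```python
import json
import urllib.request
from typing import Optional


def build_trigger_request(
    base_url: str,
    api_key: str,
    template_id: str,
    input_data: dict,
    idempotency_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to the instances endpoint."""
    body = {"workflow_template": template_id, "input_data": input_data}
    if idempotency_key is not None:
        body["idempotency_key"] = idempotency_key
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/workflows/instances/",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_trigger_request(
    "https://engine.example.com",
    "wfe_example_key",
    "11111111-2222-3333-4444-555555555555",
    {"customer_id": 42},
    idempotency_key="order-42-created",
)
```

Sending it with `urllib.request.urlopen(request)` should yield 201 on the first call and 200 when the same `idempotency_key` is replayed.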

### Workflows - Webhooks (Public)
```
POST   /api/v1/workflows/webhooks/{trigger_id}/    Trigger workflow via webhook
                                                    Headers:
                                                    - X-Webhook-Signature: sha256=<hex> (if secret configured)
                                                    Body: {} (becomes workflow input)
                                                    Returns: 201 + instance_id
```

### Workflows - Callbacks (Public)
```
POST   /api/v1/workflows/callbacks/{instance_id}/{node_id}/    Complete async action
                                                                Headers:
                                                                - X-Callback-Token: <token>
                                                                Body: {
                                                                  success: bool,
                                                                  outputs: {} or error: {}
                                                                }
                                                                Returns: 200 {acknowledged: true}
```

### API Documentation
```
GET    /api/schema/                     OpenAPI schema (JSON)
GET    /api/docs/                       Swagger UI (interactive docs)
GET    /api/redoc/                      ReDoc UI (alternative docs view)
```

### Django Admin
```
GET    /admin/                          Django admin panel (superusers only)
```

### Authentication Details

**Standard Endpoints** - Send `X-API-Key: wfe_<your_key>` header. Middleware validates the key, looks up the organization, sets RLS context, and attaches `request.organization` and `request.api_key`.

**Webhook Endpoints** - No API key needed. Validates via `X-Webhook-Signature` header (HMAC-SHA256) if the trigger has a `webhook_secret`. Signature format: `sha256=<hex_digest>`.
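
A sender could compute the signature header along these lines. This sketch assumes the HMAC is taken over the raw request body bytes using the trigger's `webhook_secret` as the key, which is the common convention but is not spelled out here; the function names are illustrative.

```python
import hashlib
import hmac


def sign_webhook(secret: str, body: bytes) -> str:
    """Return the X-Webhook-Signature value for a raw request body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_webhook(secret: str, body: bytes, header_value: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_webhook(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))


raw_body = b'{"order_id": 42}'
signature = sign_webhook("example-webhook-secret", raw_body)
print(verify_webhook("example-webhook-secret", raw_body, signature))
```

Using `hmac.compare_digest` rather than `==` avoids leaking how many leading characters matched through timing differences.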

**Callback Endpoints** - No API key needed. Validates via `X-Callback-Token` header. Token is a 64-character random string returned when creating the callback. One-time use (idempotent).

**Query Filters** - Most list endpoints support filtering via query parameters. Filters are tenant-scoped automatically via RLS.

**Pagination** - List endpoints use DRF's default pagination (page size in settings). Use `?page=2` for next page.
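
A client can walk every page with a small generator like the one below. It assumes the response has the `results` and `next` keys produced by DRF's `PageNumberPagination`; the `fetch_page` callable is a stand-in for whatever HTTP client you use.

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield items from every page until the response has no `next` link."""
    page = 1
    while True:
        payload = fetch_page(page)
        yield from payload["results"]
        if not payload.get("next"):
            break
        page += 1


# Fake two-page API used only to demonstrate the loop.
fake_pages = {
    1: {"count": 3, "next": "/api/v1/workflows/templates/?page=2", "results": [{"id": "a"}, {"id": "b"}]},
    2: {"count": 3, "next": None, "results": [{"id": "c"}]},
}
template_ids = [item["id"] for item in iter_results(lambda page: fake_pages[page])]
print(template_ids)
```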

**Error Responses** - Standard DRF error format:
- 400 Bad Request - Validation errors
- 401 Unauthorized - Missing/invalid API key or callback token
- 403 Forbidden - Valid auth but insufficient permissions
- 404 Not Found - Resource doesn't exist or belongs to another org
- 409 Conflict - Callback already consumed, duplicate slug, etc.
- 410 Gone - Callback expired
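
For callbacks, the mapping from callback state to status code could be sketched as follows. The `Callback` dataclass stands in for the `ActionCallback` model, and the order of the checks is an assumption; only the status codes themselves come from the list above.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Callback:
    """Minimal stand-in for an ActionCallback record."""
    token: str
    expires_at: datetime
    consumed_at: Optional[datetime] = None


def callback_status(callback: Callback, presented_token: str, now: datetime) -> int:
    """Return the HTTP status a callback request would receive."""
    if not secrets.compare_digest(callback.token.encode(), presented_token.encode()):
        return 401  # invalid callback token
    if callback.consumed_at is not None:
        return 409  # callback already consumed
    if callback.expires_at < now:
        return 410  # callback expired
    return 200


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
pending = Callback(token="a" * 64, expires_at=now + timedelta(hours=1))
print(callback_status(pending, "a" * 64, now))
```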

## Configuration

Configuration is via environment variables (using `python-decouple`). Key settings:

**Database**: `DB_NAME`, `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`

**Redis**: `REDIS_HOST`, `REDIS_PORT`, `REDIS_PASSWORD`

**Celery**: `CELERY_BROKER_URL`, `CELERY_RESULT_BACKEND`

**Hatchet**: `HATCHET_CLIENT_TOKEN`, `HATCHET_CLIENT_TLS_STRATEGY`

**Infisical**: `INFISICAL_API_URL`, `INFISICAL_CLIENT_ID`, `INFISICAL_CLIENT_SECRET`

**App**: `BASE_URL` (for constructing callback URLs), `VAULT_BACKEND` (infisical or local)

**Security**: `SECRET_KEY`, `DEBUG`, `ALLOWED_HOSTS`, `CORS_ALLOWED_ORIGINS`

Settings are in `config/settings.py`. Most defaults work for local development. Production requires proper secrets and TLS configuration.

## Deployment

Local development uses Docker Compose with separate files:
- `docker-compose.yml` - Main app services (postgres, redis, backend, workers)
- `docker-compose.hatchet.yml` - Hatchet engine (postgres, rabbitmq, hatchet server, setup worker)

Poe tasks wrap common operations:
- `poe infra-up` - Start databases and Hatchet
- `poe app-up` - Start Django, Celery, and Hatchet worker
- `poe migrate` - Run migrations
- `poe seed` - Seed data and sync action catalog

For production, you'd typically:
1. Run Django behind a load balancer (multiple instances)
2. Run Celery workers scaled based on load
3. Run Hatchet workers scaled based on workflow concurrency needs
4. Use managed PostgreSQL and Redis
5. Run Hatchet server as a separate cluster
6. Store credentials in Infisical Cloud or self-hosted

The application processes (Django, Celery workers, Hatchet workers) keep no local state. All state lives in PostgreSQL, Redis, and Hatchet, so each tier scales horizontally.

## Technology Stack

- **Backend**: Django 5.2.8, Django REST Framework 3.16.1
- **Database**: PostgreSQL 16 (with Row-Level Security)
- **Cache/Queue**: Redis 7
- **Task Queue**: Celery 5.4.0 with Celery Beat
- **Workflow Runtime**: Hatchet (self-hosted or cloud)
- **WebSocket**: Django Channels 4.2.0 with Channels Redis
- **Secrets**: Infisical (or local encrypted storage)
- **API Docs**: drf-spectacular (OpenAPI 3.0)
- **Servers**: Gunicorn (WSGI), Daphne (ASGI)
- **Container**: Docker with Docker Compose

Python dependencies are declared in `pyproject.toml`, but the Docker images currently install them from `requirements.txt`. A migration to `uv` is planned.

## Design Decisions

**Why RLS for multi-tenancy?** It's defense-in-depth. An application bug such as a forgotten tenant filter can't leak another tenant's rows, because the database enforces isolation. It's also simpler than schema-per-tenant or database-per-tenant.

**Why API keys instead of JWT?** This is a backend service, not a user-facing app. Services authenticate with keys. It's one table instead of five (User, Session, Account, Verification, RefreshToken). Django admin works with normal `createsuperuser`.

**Why Hatchet?** We needed durable execution (survive crashes, wait for callbacks, retries). Temporal was too complex operationally. Hatchet is MIT licensed, self-hostable, and has a clean Python SDK.

**Why a single Hatchet workflow?** Creating a separate Hatchet workflow per template would mean dynamic registration, version management, and N workflows in Hatchet's UI. Interpreting graphs at runtime keeps it simple.

**Why separate control plane and runtime?** Clear separation of concerns. Django manages "what exists" and "who can access it." Hatchet manages "execute this workflow durably." Neither bleeds into the other's domain.

**Why Celery for scheduled triggers?** Cron evaluation is a periodic task, not a workflow. Celery Beat is built for this. Hatchet workflows are for long-running, stateful execution.

## Further Reading

- `README.md` - Setup and getting started
- `TODOS.md` - Migration progress and remaining tasks
- `docs/adr/` - Architecture Decision Records
- `docs/WORKFLOW_ORCHESTRATOR_DESIGN.md` - Detailed workflow design (if it exists)
