Queue & BullMQ

  • Redis ≥ 7 (single-shard; no Cluster).
  • BullMQ — queue library + worker pool + repeatable-job scheduler.
  • Redlock-ish discipline — one instance owns the cron scheduler.

One queue per role:

| Queue name | Consumer |
| --- | --- |
| queue:chief_of_staff | ChiefOfStaffService (in-process) |
| queue:ba | BAAgentService (in-process) |
| queue:architect | ArchitectService (in-process) |
| queue:dev | Agent daemon (remote) |
| queue:qa | QAAgentService (in-process) or agent daemon |
| queue:memory | MemoryOptimizerService (in-process) |
| queue:webhook-outbound | Outbound webhook retries |
| queue:digest | Daily / weekly digest cron |
```typescript
interface JobPayload {
  ticketId: string;
  tenantId: string;
  roleSlug: string;
  attempt: number;
  // role-specific extras
  skills?: string[];
  subagentSlug?: string | null;
}
```

Payloads stay small — the worker fetches the full ticket state from Postgres. This keeps Redis memory usage low and makes payload schema drift non-breaking.

Retries use BullMQ's attempts: 3 with backoff: { type: "exponential", delay: 5000 }. After the third failed attempt, the job moves to failed and emits job.failed, which the chief-of-staff subscribes to.
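A minimal sketch of the retry policy above. The `defaultJobOptions` object name is an assumption; the delay helper just restates BullMQ's documented exponential formula (delay × 2^(attemptsMade − 1)) for clarity:

```typescript
// Job options matching the retry policy above (object name is hypothetical).
const defaultJobOptions = {
  attempts: 3,
  backoff: { type: "exponential" as const, delay: 5000 },
};

// BullMQ's built-in exponential backoff waits delay * 2^(attemptsMade - 1)
// between retries: 5 s after the first failure, 10 s after the second,
// then the job lands in `failed`.
function backoffDelayMs(attemptsMade: number, baseDelayMs = 5000): number {
  return baseDelayMs * 2 ** (attemptsMade - 1);
}
```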

Priorities 1..10, lower = higher priority:

  • 1 — urgent (exec-triggered).
  • 5 — normal (default).
  • 10 — background (e.g. digest).

BullMQ’s priority queueing cuts urgent jobs ahead of normal ones.
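The priority levels above can be pinned down as constants; the `PRIORITY` map name is hypothetical, and in BullMQ a lower number is served first:

```typescript
// Priority constants; lower number = higher priority in BullMQ.
const PRIORITY = {
  urgent: 1,     // exec-triggered
  normal: 5,     // default
  background: 10 // e.g. digest
} as const;

// Usage sketch (queue wiring omitted):
// await queue.add("run", payload, { priority: PRIORITY.urgent });
```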

Concurrency is configured per worker via BULLMQ_CONCURRENCY_<ROLE>:

  • BULLMQ_CONCURRENCY_CHIEF_OF_STAFF=2 — planners are expensive.
  • BULLMQ_CONCURRENCY_BA=5
  • BULLMQ_CONCURRENCY_DEV=3 — remote agents handle their own.
  • Default: 5.
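The lookup from role slug to concurrency might be sketched as follows; the helper name is hypothetical, but the env-var naming and the default of 5 come from the list above:

```typescript
// Resolve worker concurrency from BULLMQ_CONCURRENCY_<ROLE>, falling back
// to the default of 5 when the variable is unset or not a positive integer.
function concurrencyFor(
  roleSlug: string,
  env: Record<string, string | undefined> = process.env
): number {
  const raw = env[`BULLMQ_CONCURRENCY_${roleSlug.toUpperCase()}`];
  const parsed = raw ? Number.parseInt(raw, 10) : NaN;
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 5;
}

// Usage sketch: new Worker(name, processor, { concurrency: concurrencyFor("ba") })
```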

Each job carries a ticketId. The worker checks the ticket’s status before running — if it’s already done, the job is a no-op. This defends against at-least-once delivery.
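The no-op guard can be reduced to a pure check; the status type is a hypothetical stand-in for the ticket's Postgres status column:

```typescript
// Idempotency guard: with at-least-once delivery the same ticket can be
// delivered twice, so a job whose ticket is already done becomes a no-op.
type TicketStatus = "queued" | "in_progress" | "done";

function shouldProcess(status: TicketStatus): boolean {
  return status !== "done";
}
```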

BullMQ’s repeatable-job API schedules:

  • Daily digest — DIGEST_CRON_HOUR local time.
  • Weekly digest — Mondays 09:00.
  • Google Drive push-channel renewal — 24 h before expiry.
  • Graph staleness check — hourly, flags plans as stale when applicable.
  • Provider-health recovery probes — every 60 s per degraded provider.
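The two digest schedules above map onto cron patterns; `dailyDigestCron` is a hypothetical helper assuming DIGEST_CRON_HOUR holds a local-time hour (0–23):

```typescript
// Cron patterns for the digest schedules.
function dailyDigestCron(hour: number): string {
  return `0 ${hour} * * *`; // minute 0 of <hour>, every day
}

const WEEKLY_DIGEST_CRON = "0 9 * * 1"; // Mondays 09:00

// Registration sketch with BullMQ's repeatable-job API (wiring omitted):
// await digestQueue.add("daily", {}, { repeat: { pattern: dailyDigestCron(8) } });
```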

Only one instance should own the cron scheduler across a horizontally-scaled backend. Controlled via WORKFORCE0_CRON_ENABLED=1 — set on exactly one replica.
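The leader gate is a one-line check at startup; the helper name is hypothetical:

```typescript
// Only the replica started with WORKFORCE0_CRON_ENABLED=1 registers
// repeatable jobs; every other replica skips scheduler setup entirely.
function cronEnabled(
  env: Record<string, string | undefined> = process.env
): boolean {
  return env.WORKFORCE0_CRON_ENABLED === "1";
}
```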

A worker crash mid-job means the job's lock stops being renewed. Once the lock expires (30 s by default, configurable via lockDuration), BullMQ's stalled-job check moves the job back to the wait state for another worker to pick up.

Jobs that fail all attempts move to failed. We don’t move them to a separate DLQ — the failed state in BullMQ is queryable and the chief-of-staff’s replan logic reads from it.

For post-mortems, the admin UI exposes a Queue → Failed view per role.

Prometheus metrics:

  • wf0_queue_jobs_total{queue,status} — counter per terminal state.
  • wf0_queue_wait_seconds — histogram of queue lag.
  • wf0_queue_processing_seconds — histogram of processing time.

Dashboards in docs/observability/grafana-dashboard.json.
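To make the wait-time histogram concrete: BullMQ stamps each job with a creation time (job.timestamp) and a pickup time (job.processedOn), both in epoch milliseconds, so queue lag is just their difference. The helper name is hypothetical:

```typescript
// wf0_queue_wait_seconds observation: time between enqueue and the moment
// a worker picks the job up, derived from BullMQ's job timestamps.
function queueWaitSeconds(timestampMs: number, processedOnMs: number): number {
  return (processedOnMs - timestampMs) / 1000;
}
```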

Redis becomes a single point of failure. Mitigations:

  • Managed Redis (ElastiCache, Upstash, Memorystore) for production.
  • Persistence (AOF) enabled for crash recovery.
  • Regular backups if Redis holds anything besides BullMQ state.

Alternatives considered:

  • Sidekiq / Celery / Temporal — overkill; BullMQ is the minimum that works.
  • PostgreSQL as a queue — LISTEN/NOTIFY is tempting, but it adds DB write pressure and lacks BullMQ's priority, concurrency, and retry ergonomics.
  • SQS / Cloud Tasks — would commit us to one cloud; BullMQ is portable.