# Testing

## The stack

- Vitest for unit + integration tests in backend / frontend / agent.
- Playwright for frontend e2e.
- Smoke tests in `backend/scripts/smoke-test.ts`: end-to-end against a deployed environment.
## What goes where

### Unit tests

Colocated under `__tests__/`. Test pure functions, class methods, and
service logic with mocked dependencies.

```ts
describe('extractFromPythonFile', () => {
  it('extracts classes and functions', () => {
    const out = extractFromPythonFile(fixture('a.py', 'class Foo: pass'));
    expect(out.nodes.some(n => n.name === 'Foo')).toBe(true);
  });
});
```

### Integration tests

Hit real Postgres + Redis via `docker-compose.test.yml`. Mock LLMs
unless testing the LLM integration itself.
### E2E tests

Playwright driving the frontend with a real backend. Fixtures seed the DB; actions flow through the full stack.

Run locally:

```sh
cd frontend
npm run e2e
```

### Smoke tests

Against a running deployment:

```sh
cd backend
WORKFORCE0_API_URL=https://your-instance npm run smoke-test
```

Seeds a fake meeting, generates a brief, fans out child tickets, and prints a PASS/FAIL report.
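The exact report format isn't pinned down in this doc; as an illustration, aggregating step results into the printed PASS/FAIL verdict might look like this (all names here are hypothetical, not from the real script):

```ts
// Hypothetical shape for one smoke-test step result.
interface StepResult {
  name: string;    // e.g. 'seed meeting', 'generate brief'
  ok: boolean;
  detail?: string; // failure reason, if any
}

// Collapse step results into the printed PASS/FAIL report.
function summarize(steps: StepResult[]): string {
  const failed = steps.filter((s) => !s.ok).length;
  const lines = steps.map(
    (s) => `${s.ok ? 'PASS' : 'FAIL'}  ${s.name}${s.detail ? ` (${s.detail})` : ''}`,
  );
  lines.push(failed === 0 ? 'OVERALL: PASS' : `OVERALL: FAIL (${failed} failed)`);
  return lines.join('\n');
}
```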
## Expected PR coverage

- Business logic change: unit test covering the change.
- Route change: integration test covering the route.
- UI change: e2e test for user-visible flows; Vitest for components.
- Bug fix: a regression test for the specific bug (even if it’s a one-liner).
- Docs-only change: no tests needed.
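To make the bug-fix bullet concrete: the regression test can genuinely be one line. In this sketch, the `progress` helper and its bug are invented for illustration, not taken from this codebase:

```ts
// Hypothetical helper: fraction of a ticket's steps that are done.
// Invented bug: progress([]) used to return NaN (0 / 0).
function progress(steps: { done: boolean }[]): number {
  if (steps.length === 0) return 0; // the fix under test
  return steps.filter((s) => s.done).length / steps.length;
}

// The regression test is one line in Vitest:
// it('returns 0 for a ticket with no steps', () => expect(progress([])).toBe(0));
```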
## Running the full suite

```sh
# Backend
cd backend
npm test

# Frontend
cd frontend
npm test && npm run e2e

# Agent
cd agent
npm test
```

CI does all of the above on every PR.
## Writing good tests

- Arrange-Act-Assert. Each test has one clear action.
- Names say what, not how. “returns 0 when no god-nodes match” is better than “tests listGodNodes with empty input”.
- One assertion per behaviour. Multiple asserts for the same behaviour are fine; multiple behaviours mean split the test.
- Avoid snapshots for LLM output. They are too brittle; assert structural properties instead.
## Mocking LLMs

Use `vi.mock` to replace the client factory. A canonical setup:

```ts
vi.mock('../../agent-runtime/clients/client-factory.js', () => ({
  createModelClient: () => ({
    chat: vi.fn().mockResolvedValue({
      content: JSON.stringify({ summary: 'x', steps: [{ title: 'y', … }] }),
      toolCalls: [],
      tokenUsage: { input: 0, output: 0 },
      stopReason: 'end_turn',
    }),
  }),
}));
```

See `backend/src/services/chief-of-staff/__tests__/planner-llm.test.ts`
for the full pattern.
## Test databases

Unit tests mock Prisma. Integration tests use a throwaway schema:

```ts
beforeAll(async () => {
  await prisma.$executeRaw`CREATE SCHEMA IF NOT EXISTS test_${randomId}`;
});

afterAll(async () => {
  await prisma.$executeRaw`DROP SCHEMA test_${randomId} CASCADE`;
});
```

This keeps tests parallel-safe.
## Deterministic randomness

Snapshot-like tests that depend on time or UUIDs should use a fake
clock (`vi.useFakeTimers`) and seed any RNG.
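`vi.useFakeTimers` (with `vi.setSystemTime`) pins the clock; for randomness, a small seeded generator can stand in for `Math.random`. A sketch using mulberry32 as a test helper; the Vitest wiring in the comments is one possible setup, not a convention this repo prescribes:

```ts
// mulberry32: tiny deterministic PRNG; same seed, same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// In a Vitest setup:
//   vi.useFakeTimers();
//   vi.setSystemTime(new Date('2024-01-01T00:00:00Z'));
//   vi.spyOn(Math, 'random').mockImplementation(mulberry32(42));
```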
## What not to test

- Provider SDKs. Don’t test that Anthropic returns a response; test that our code handles provider responses correctly.
- Prisma query correctness. Prisma is well-tested upstream.
- Third-party webhook payload shapes. Trust Slack / GitHub / Twilio; test that we handle their documented shapes.
## CI behaviour

- Tests run on every push to any branch.
- Typecheck runs alongside tests.
- A failed test blocks merge. Retries are OK for flaky tests, but flaky tests are bugs; file them.
## Metric harness (PG.14)

A separate mechanism for measuring planner quality over time. Not a
traditional “test”; see Metric harness.

Used for prompt changes: gate PRs on a verdict of `improved` or `flat`, not
`regressed`.