# Local models (Ollama, llama.cpp, vLLM, LM Studio)
## Why local

- Privacy. Nothing transits a third-party provider. Transcripts, briefs, code context all stay on your machine.
- Cost. You already paid for the GPU; inference is free.
- Offline. Train deployments, air-gapped workshops, flaky-network environments.
## What works

Anything that exposes the OpenAI-compatible chat completions endpoint:

- Ollama — `brew install ollama` or the installer.
- LM Studio — macOS / Windows GUI; OpenAI-compatible server mode.
- vLLM — high-throughput serving (production).
- llama.cpp with the `server` binary.
- TabbyAPI, KoboldCPP, etc.

All use the same config path.
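Before wiring a server in, you can confirm it actually speaks the OpenAI dialect by querying its `/v1/models` route directly. A minimal check, assuming a default Ollama install listening on port 11434:

```shell
# Should return a JSON object with an OpenAI-style "data" array of models.
# Any server listed above exposes the same route under its own base URL.
curl -s http://localhost:11434/v1/models
```

If this returns 404 or a non-OpenAI payload, fix the server's base URL (many servers nest the API under `/v1`) before touching Workforce0 config.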
## Set it up

```shell
# Example: Ollama on the host machine, with llama3.1:70b pulled
AI_CUSTOM_ENABLED=1
AI_CUSTOM_BASE_URL=http://host.docker.internal:11434/v1
AI_CUSTOM_API_KEY=sk-local-placeholder  # most local servers ignore this
AI_CUSTOM_MODELS=llama3.1:70b,qwen2.5-coder:14b,mistral-small:22b
```

The backend auto-registers a provider entry named `custom`. In Settings → AI providers it shows as “Custom (local)”.
## Per-role routing guide

Not every role is equal — some need frontier quality, others are narrow specialists that run happily on local models. Use this as the default split:
| Role | Needs frontier? | Good local option |
|---|---|---|
| `chief_of_staff` (planner) | ✅ Yes — open-ended decomposition | Not recommended; degrades sharply on ambiguous inputs |
| `ba_agent` (brief generator) | ⚠️ Strong preference | 70B+ local models are acceptable for structured transcripts |
| `dev_agent` | ❌ Scoped specialist | Qwen 2.5 Coder, DeepSeek Coder via Ollama |
| `qa_agent` | ❌ Scoped specialist | Same as `dev_agent` |
| `memory_optimizer` | ❌ Summarization | Any 8B+ local model |
Keep the planner on a frontier provider. Local 7B / 13B models produce plans that the critique step rejects — you burn tokens retrying without quality gain. Every other role tolerates local substitution reasonably well.
## Routing roles to local

By default, specialists keep using whichever BYOK provider you set. To route them local, set:

```shell
# Scoped specialists — local works well
MODEL_DEV_CUSTOM=qwen2.5-coder:14b
MODEL_QA_CUSTOM=qwen2.5-coder:14b
MODEL_MEMORY_CUSTOM=llama3.1:8b

# BA agent — OK on 70B+ local for structured transcripts
MODEL_BA_CUSTOM=llama3.1:70b

# Planner stays on a frontier provider — don't route local
MODEL_PLANNER_ANTHROPIC=claude-sonnet-4-6
```

Specialists route to `custom` if it’s set; otherwise they fall back to the other providers in priority order.
## Recommended local models

| Use case | Model | Size | Reasoning |
|---|---|---|---|
| Specialist (BA, architect) | llama3.1:70b | 40 GB | Best reasoning/size ratio. |
| Coding tasks | qwen2.5-coder:14b | 8 GB | Strong code synthesis. |
| Cheap summarisation | llama3.1:8b | 4 GB | Fast; works on CPU. |
| Critique | deepseek-r1:14b | 8 GB | Reasoning-tuned. |
The 70B models need an Apple Silicon M2 Max / M3 Max or a 24 GB+ GPU. 13B variants run fine on a 16 GB Mac.
## vLLM for production

vLLM’s OpenAI-compatible server is the production-grade option:

```shell
docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 vllm/vllm-openai \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --api-key sk-local-placeholder
```

Then point Workforce0 at it:

```shell
AI_CUSTOM_BASE_URL=http://vllm-host:8000/v1
```

Throughput is ~10× Ollama on the same GPU because of continuous batching.
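Because the server was started with `--api-key`, requests must present that key as a bearer token. A quick smoke test (`vllm-host` taken from the config line above):

```shell
# Expect a JSON model list; a 401 here means the Authorization header is
# missing or doesn't match the --api-key the server was launched with.
curl -s http://vllm-host:8000/v1/models \
  -H "Authorization: Bearer sk-local-placeholder"
```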
## Debugging local calls

```shell
# See what Workforce0 is sending
LOG_LEVEL=debug docker compose logs -f backend | grep 'ai:chat'
```

Logs show the full request (model, messages, params) at debug level. Compare against a raw curl to the local endpoint to isolate model-side vs Workforce0-side issues.
## Common problems

### “Model not found”

- Ollama: `ollama pull llama3.1:70b` first.
- LM Studio: load the model in the GUI before starting the server.
### Slow first response

Cold-start is real. Ollama unloads idle models after 5 minutes. Set `OLLAMA_KEEP_ALIVE=24h` to keep models hot.
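The variable must be visible to the Ollama server process, not just your shell. Two common placements, assuming the default `ollama` systemd unit on Linux:

```shell
# Running the server by hand:
OLLAMA_KEEP_ALIVE=24h ollama serve

# Linux with the systemd service — add an override, then restart:
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl restart ollama
```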
### Context window too small

Ollama’s default `num_ctx` is 2048 — nowhere near enough for a brief with context. Configure per-model:

```
# In a Modelfile
FROM llama3.1:70b
PARAMETER num_ctx 8192
```

Rebuild with `ollama create custom-llama3 -f ./Modelfile`.
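To confirm the parameter took, dump the rebuilt model’s Modelfile back out (`custom-llama3` is the name from the create command above):

```shell
# Prints the Modelfile Ollama stored for the model; the
# "PARAMETER num_ctx 8192" line should appear in the output.
ollama show custom-llama3 --modelfile
```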
### Tool-use / function-calling mismatches

Some local models don’t support OpenAI-style tool calls cleanly. Workforce0 works around this — if a call returns non-tool-call output, it falls back to JSON parsing. But model-specific quirks exist; grep backend logs for `tool:mismatch` after a regression.
## Hybrid mode (recommended)

The most cost-effective setup for real teams:
- Planner + critique → Anthropic Claude (~40% of calls by volume, ~60% by quality impact).
- Specialists → Local via Ollama / vLLM (~60% of volume).
- Voice dial-in → Gemini Live (free tier often covers it).
Monthly cost: ~$20–50 Anthropic + electricity.