
Local models (Ollama, llama.cpp, vLLM, LM Studio)

  • Privacy. Nothing transits a third-party provider. Transcripts, briefs, code context all stay on your machine.
  • Cost. You already paid for the GPU; inference is free.
  • Offline. Train deployments, air-gapped workshops, flaky-network environments.

Anything that exposes an OpenAI-compatible chat completions endpoint works:

  • Ollama — brew install ollama or the installer.
  • LM Studio — macOS / Windows GUI; OpenAI-compatible server mode.
  • vLLM — high-throughput serving (production).
  • llama.cpp with the server binary.
  • TabbyAPI, KoboldCPP, etc.

All use the same config path.

```shell
# Example: Ollama on the host machine, with llama3.1:70b pulled
AI_CUSTOM_ENABLED=1
AI_CUSTOM_BASE_URL=http://host.docker.internal:11434/v1
AI_CUSTOM_API_KEY=sk-local-placeholder   # most local servers ignore this
AI_CUSTOM_MODELS=llama3.1:70b,qwen2.5-coder:14b,mistral-small:22b
```

The backend auto-registers a provider entry named custom. In Settings → AI providers it shows as “Custom (local)”.
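Before wiring the endpoint into the backend, it's worth confirming it is actually reachable. A quick check, assuming Ollama on its default port (the path is the standard OpenAI-compatible model-listing route):

```shell
# List the models the local server exposes —
# the response should include everything you put in AI_CUSTOM_MODELS
curl -s http://localhost:11434/v1/models
```

If this hangs or 404s, fix the server before touching Workforce0 config.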

Not every role is equal — some need frontier quality, others are narrow specialists that run happily on local models. Use this as the default split:

| Role | Needs frontier? | Good local option |
| --- | --- | --- |
| chief_of_staff (planner) | ✅ Yes — open-ended decomposition | Not recommended; degrades sharply on ambiguous inputs |
| ba_agent (brief generator) | ⚠️ Strong preference | 70B+ local models are acceptable for structured transcripts |
| dev_agent | ❌ Scoped specialist | Qwen 2.5 Coder, DeepSeek Coder via Ollama |
| qa_agent | ❌ Scoped specialist | Same as dev_agent |
| memory_optimizer | ❌ Summarisation | Any 8B+ local model |

Keep the planner on a frontier provider. Local 7B / 13B models produce plans that the critique step rejects — you burn tokens retrying without quality gain. Every other role tolerates local substitution reasonably well.

By default, specialists keep using whichever BYOK provider you set. To route them local, set:

```shell
# Scoped specialists — local works well
MODEL_DEV_CUSTOM=qwen2.5-coder:14b
MODEL_QA_CUSTOM=qwen2.5-coder:14b
MODEL_MEMORY_CUSTOM=llama3.1:8b

# BA agent — OK on 70B+ local for structured transcripts
MODEL_BA_CUSTOM=llama3.1:70b

# Planner stays on a frontier provider — don't route local
MODEL_PLANNER_ANTHROPIC=claude-sonnet-4-6
```

Specialists route to custom if it’s set; otherwise they fall back to the other providers in priority order.

| Use case | Model | Size | Reasoning |
| --- | --- | --- | --- |
| Specialist (BA, architect) | llama3.1:70b | 40 GB | Best reasoning/size ratio |
| Coding tasks | qwen2.5-coder:14b | 8 GB | Strong code synthesis |
| Cheap summarisation | llama3.1:8b | 4 GB | Fast; works on CPU |
| Critique | deepseek-r1:14b | 8 GB | Reasoning-tuned |

The 70B models need roughly 40 GB of memory: an Apple Silicon M2 Max / M3 Max with 64 GB unified memory, or a 48 GB+ GPU setup. The 13–14B variants run fine on a 16 GB Mac.
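The sizes in the table come from quantised weights. A back-of-envelope check for the 70B figure (4-bit quantisation, ignoring KV cache and runtime overhead):

```shell
# weights_GB ≈ params_in_billions × bits_per_weight / 8
echo $(( 70 * 4 / 8 ))   # 35 — KV cache and overhead push the real footprint toward 40 GB
```

The same formula puts a 14B model at ~7 GB of weights, consistent with the 8 GB rows above.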

vLLM’s OpenAI-compatible server is the production-grade option:

```shell
docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 vllm/vllm-openai \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --api-key sk-local-placeholder
```

Then point Workforce0 at it:

```shell
AI_CUSTOM_BASE_URL=http://vllm-host:8000/v1
```

Throughput is ~10× Ollama on the same GPU because of continuous batching.

```shell
# See what Workforce0 is sending
LOG_LEVEL=debug docker compose logs -f backend | grep 'ai:chat'
```

Logs show the full request (model, messages, params) at debug level. Compare against a raw curl to the local endpoint to isolate model-side vs Workforce0-side issues.
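A minimal raw request to use for that comparison — host, port, and model name are placeholders here (Ollama defaults assumed); match them to whatever the debug logs show Workforce0 sending:

```shell
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen2.5-coder:14b", "messages": [{"role": "user", "content": "ping"}]}'
```

If this curl succeeds but Workforce0's request fails, the problem is on the Workforce0 side (model name, base URL, params); if both fail, it's the local server.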

Ollama: run ollama pull llama3.1:70b before the first request. LM Studio: load the model in the GUI before starting the server.

Cold-start is real. Ollama unloads idle models after 5 minutes. Set OLLAMA_KEEP_ALIVE=24h to keep models hot.
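Where that variable goes depends on how Ollama runs; two common placements (the systemd unit name assumes a standard Linux install):

```shell
# One-off foreground server:
OLLAMA_KEEP_ALIVE=24h ollama serve

# Linux systemd install — add Environment="OLLAMA_KEEP_ALIVE=24h" to the unit:
sudo systemctl edit ollama
```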

Default Ollama num_ctx is 2048 — nowhere near enough for a brief with context. Configure per-model:

```
# In a Modelfile
FROM llama3.1:70b
PARAMETER num_ctx 8192
```

Rebuild with ollama create custom-llama3 -f ./Modelfile.
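To confirm the parameter took effect, inspect the rebuilt model (output layout varies by Ollama version; look for num_ctx under the parameters section):

```shell
ollama show custom-llama3   # check that "num_ctx 8192" appears in the parameters
```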

Some local models don’t support OpenAI-style tool calls cleanly. Workforce0 works around this — if a call returns non-tool-call output, it falls back to JSON parsing. But model-specific quirks exist; grep backend logs for tool:mismatch after a regression.

The most cost-effective setup for real teams:

  • Planner + critique → Anthropic Claude (~40% of calls by volume, ~60% by quality impact).
  • Specialists → Local via Ollama / vLLM (~60% of volume).
  • Voice dial-in → Gemini Live (free tier often covers it).

Monthly cost: ~$20–50 Anthropic + electricity.