pq.io — LLM API

OpenAI-compatible · bearer-gated · https://llm.pq.io

Quick start

Get a bearer token from matthew@pq.io, then:

curl https://llm.pq.io/main/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "hello"}]
  }'

Swap "model" for any model below. Add "stream": true for SSE.
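
For the "stream": true case, here is a minimal Python sketch using only the standard library. It assumes the endpoint emits OpenAI-style SSE lines ("data: {json}" chunks, terminated by "data: [DONE]"), which is typical for OpenAI-compatible servers but is not spelled out on this page. The helper names build_request, delta_text, and stream_chat are illustrative, not part of any API.

```python
import json
import urllib.request

BASE_URL = "https://llm.pq.io/main/v1"  # from the quick start above


def build_request(token, prompt, model="qwen3-coder"):
    """Build a streaming chat-completions request (nothing is sent yet)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def delta_text(line):
    """Return the text carried by one SSE line, or None if it has none.

    Assumption: OpenAI-style chunks, i.e. 'data: {...}' lines ended by
    'data: [DONE]'.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator or comment line
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(token, prompt, model="qwen3-coder"):
    """Yield text fragments as they arrive from the server."""
    with urllib.request.urlopen(build_request(token, prompt, model)) as resp:
        for raw in resp:  # the response object iterates line by line
            text = delta_text(raw.decode("utf-8"))
            if text:
                yield text
```

To print a reply as it streams, loop over stream_chat(TOKEN, "hello") and print each fragment with end="" and flush=True.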

Chat & agentic models /main/v1

Only one /main lane is GPU-resident at a time; a lane swap takes ~30-60s. Each lane has a single slot, so concurrent requests queue at llama-server. Decode rates are measured on a fresh context; expect ~30% slowdown near full context.
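
The figures above suggest a rough way to size a client timeout. The sketch below is back-of-envelope arithmetic, not a measured guarantee: it reads "~30% slowdown" as the decode rate dropping to 70% of its quoted value, and it ignores prompt processing and time spent queued behind other requests. The function name timeout_budget_s is hypothetical.

```python
def timeout_budget_s(max_tokens, decode_tok_s, lane_swap_s=60.0, slowdown=0.30):
    """Rough upper bound on seconds to wait for one response.

    lane_swap_s: worst case of the ~30-60s lane swap quoted above.
    slowdown:    fractional decode slowdown near full context (~30%),
                 interpreted here as rate * (1 - slowdown) (an assumption).
    Prompt processing and queueing are NOT modelled; add headroom for both.
    """
    worst_rate = decode_tok_s * (1.0 - slowdown)  # tok/s near full context
    return lane_swap_s + max_tokens / worst_rate


# 8192 output tokens on a ~120 tok/s lane, after a worst-case lane swap
budget = timeout_budget_s(8192, 120)
print(round(budget))  # prints 158
```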

qwen3-coder (default)
Code-specialist, non-thinking. Daily-driver chat lane.
Context: 192K × 1 · Decode: ~120 tok/s · Compaction fires: ~167K input (87%)

qwen3.6
Reasoning + agentic. Dedicated OpenCode compactor; NIAH 100% at 200K.
Context: 208K × 1 · Decode: ~123 tok/s · Compaction fires: ~183K input (88%)

qwen3.5
Reasoning + agentic. On-demand thinking lane at native max context.
Context: 256K × 1 · Decode: ~122 tok/s · Compaction fires: ~231K input (90%)

Utility models /small/v1

qwen2.5-coder
Qwen2.5-Coder-7B: quick utility lane (summarize, classify, tag, structured JSON).
Context: 32K · Decode: ~35 tok/s

bge-m3
Multilingual embeddings (1024-dim).
Input: up to 8K tokens

curl https://llm.pq.io/small/v1/embeddings \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "hello world"}'
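
The same embeddings call can be made from Python and followed by a similarity comparison. This is a sketch under one assumption: that the response follows the OpenAI embeddings shape (data[0].embedding), which this page implies but does not show. The helper names embed and cosine_similarity are illustrative.

```python
import json
import math
import urllib.request

EMBED_URL = "https://llm.pq.io/small/v1/embeddings"  # from the curl example


def embed(token, text, model="bge-m3"):
    """Return the embedding vector for one string.

    Assumption: OpenAI embeddings response shape, data[0].embedding.
    """
    req = urllib.request.Request(
        EMBED_URL,
        data=json.dumps({"model": model, "input": text}).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Typical use is cosine_similarity(embed(TOKEN, "hello world"), embed(TOKEN, "hi there")); values closer to 1.0 indicate more similar texts.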

OpenCode setup

Drop this into ~/.config/opencode/opencode.json. Replace <your-token> with your bearer token.

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "pq.io": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "pq.io",
      "options": {
        "baseURL": "https://llm.pq.io/main/v1",
        "headers": {
          "Authorization": "Bearer <your-token>"
        }
      },
      "models": {
        "qwen3-coder": { "name": "Qwen3-Coder", "limit": { "context": 196608, "input": 188416, "output": 8192 } },
        "qwen3.6":     { "name": "Qwen3.6",     "limit": { "context": 212992, "input": 204800, "output": 8192 } },
        "qwen3.5":     { "name": "Qwen3.5",     "limit": { "context": 262144, "input": 253952, "output": 8192 } }
      }
    }
  },
  "model": "pq.io/qwen3-coder",
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 17000,
    "tail_turns": 2,
    "preserve_recent_tokens": 8000
  },
  "agent": {
    "compaction": { "model": "pq.io/qwen3.6" }
  }
}

An explicit limit.input is required: upstream bug #13980 silently no-ops reserved without it. Compaction triggers at limit.input - reserved, which is where the per-model compaction points above come from.
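
As a worked check of that formula, the sketch below recomputes the trigger points from the limits in the config. All numbers are copied from this page; the only assumptions are that "K" means 1024 tokens and that the percentages are relative to limit.context, both of which reproduce the figures quoted per model.

```python
RESERVED = 17000  # "compaction.reserved" from the config above

# (context, input) limits copied from the "models" block above
LIMITS = {
    "qwen3-coder": (196608, 188416),
    "qwen3.6": (212992, 204800),
    "qwen3.5": (262144, 253952),
}


def compaction_trigger(limit_input, reserved=RESERVED):
    """Input-token count at which compaction fires: limit.input - reserved."""
    return limit_input - reserved


for model, (context, limit_input) in LIMITS.items():
    trigger = compaction_trigger(limit_input)
    print(f"{model}: {trigger} tokens (~{trigger / 1024:.0f}K, {trigger / context:.0%} of context)")
# qwen3-coder: 171416 tokens (~167K, 87% of context)
# qwen3.6: 187800 tokens (~183K, 88% of context)
# qwen3.5: 236952 tokens (~231K, 90% of context)
```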

Other clients

OpenAI SDK: OpenAI(base_url="https://llm.pq.io/main/v1", api_key=TOKEN)
Aider: --openai-api-base https://llm.pq.io/main/v1 --openai-api-key $TOKEN --model openai/qwen3-coder
Cline / Continue / Roo: OpenAI-compatible provider, model qwen3-coder

Advanced: sampling defaults

Per Qwen team recommendations. Already applied server-side; override per-request via standard OpenAI sampling fields.

model        temperature  top_p  top_k  repeat_penalty  thinking
qwen3-coder  0.7          0.8    20     1.05            no
qwen3.6      0.6          0.95   20     (not listed)    yes (<think> tags)
qwen3.5      0.6          0.95   20     1.10            yes (<think> tags)
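
A per-request override amounts to adding the sampling fields to the request body. The sketch below builds such a body and also strips <think> blocks from a reply. It assumes the thinking models return their reasoning inline in the message content, wrapped in <think>...</think>, as the table suggests; some servers put reasoning in a separate field instead, in which case the stripping step is unnecessary. chat_payload and strip_thinking are hypothetical helper names.

```python
import re

# Matches one <think>...</think> block plus any whitespace that follows it
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def chat_payload(model, prompt, **sampling):
    """Build a chat-completions body; keyword args override server defaults.

    Only fields passed explicitly are sent, so anything omitted keeps
    the server-side default from the table above.
    """
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    body.update(sampling)
    return body


def strip_thinking(text):
    """Remove inline <think>...</think> blocks from a reply."""
    return THINK_BLOCK.sub("", text).strip()


# Override two standard OpenAI sampling fields for a single request
payload = chat_payload("qwen3.5", "hello", temperature=0.2, top_p=0.9)
```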