OpenAI-compatible · bearer-gated · https://llm.pq.io
Get a bearer token from matthew@pq.io, then:
curl https://llm.pq.io/main/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "hello"}]
  }'
Swap "model" for any model below. Add "stream": true for SSE.
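With "stream": true, the response arrives as server-sent events instead of a single JSON body. The following is a minimal Python sketch of the client side, assuming the stream follows the usual OpenAI chunk shape (`data: {...}` lines carrying `choices[].delta.content`, ended by `data: [DONE]`). The helper name `iter_stream_text` is hypothetical, and the sketch works on already-decoded text lines so it stays independent of any particular HTTP client.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines (assumed chunk format)."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separators, SSE comments, and any non-data fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # The stream is assumed to end with a literal [DONE] sentinel.
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    # Hand-written sample lines, not captured from the real server.
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Hel"}}]}',
        'data: {"choices": [{"delta": {"content": "lo"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(sample)))
```

In practice the lines would come from whatever HTTP client is in use, decoded to text as they arrive.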
/main/v1
Only one /main lane is GPU-resident at a time; lane-swap takes ~30-60s. Single slot per lane: concurrent requests queue at llama-server. Decode rates are fresh-context; expect ~30% slowdown near full context.
Models: qwen3-coder (default), qwen3.6, qwen3.5

/small/v1
Models: qwen2.5-coder, bge-m3

curl https://llm.pq.io/small/v1/embeddings \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "hello world"}'
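The same embeddings call can be made from Python with only the standard library. This is a sketch under assumptions: the response is assumed to follow the OpenAI embeddings shape (`data[i].embedding`, `data[i].index`), and passing a list as `input` is assumed to be accepted the way it is by the OpenAI API. The function names and the 120-second timeout are illustrative choices, not values taken from the service.

```python
import json
import math
import urllib.request

BASE_URL = "https://llm.pq.io/small/v1"


def embeddings_request(token: str, texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for bge-m3 embeddings."""
    body = json.dumps({"model": "bge-m3", "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embed(token: str, texts: list[str], timeout: float = 120.0) -> list[list[float]]:
    """Send the request; assumes an OpenAI-style response body."""
    with urllib.request.urlopen(embeddings_request(token, texts), timeout=timeout) as resp:
        data = json.load(resp)["data"]
    # Sort by index so the vectors line up with the input order.
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

A typical use would be `a, b = embed(TOKEN, ["hello world", "hi there"])` followed by `cosine(a, b)` to compare the two texts.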
Drop this into ~/.config/opencode/opencode.json, replacing <your-token> with your bearer token.
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "pq.io": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "pq.io",
      "options": {
        "baseURL": "https://llm.pq.io/main/v1",
        "headers": {
          "Authorization": "Bearer <your-token>"
        }
      },
      "models": {
        "qwen3-coder": { "name": "Qwen3-Coder", "limit": { "context": 196608, "input": 188416, "output": 8192 } },
        "qwen3.6": { "name": "Qwen3.6", "limit": { "context": 212992, "input": 204800, "output": 8192 } },
        "qwen3.5": { "name": "Qwen3.5", "limit": { "context": 262144, "input": 253952, "output": 8192 } }
      }
    }
  },
  "model": "pq.io/qwen3-coder",
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 17000,
    "tail_turns": 2,
    "preserve_recent_tokens": 8000
  },
  "agent": {
    "compaction": { "model": "pq.io/qwen3.6" }
  }
}
An explicit limit.input is required: without it, upstream bug #13980 silently turns reserved into a no-op. Compaction triggers at limit.input - reserved, which is where the per-model compaction points above come from.
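The arithmetic behind those trigger points can be checked with a short Python sketch. It uses only the numbers from the config above; the function names are illustrative. Note that in that config each limit.input happens to equal context minus output, which the sketch reproduces rather than assumes as a rule of opencode.

```python
RESERVED = 17000  # compaction.reserved from the config above

# context and output limits per model, copied from the config above
LIMITS = {
    "qwen3-coder": {"context": 196608, "output": 8192},
    "qwen3.6": {"context": 212992, "output": 8192},
    "qwen3.5": {"context": 262144, "output": 8192},
}


def input_limit(context: int, output: int) -> int:
    """limit.input as it appears in the config: context minus output."""
    return context - output


def compaction_trigger(limit_input: int, reserved: int = RESERVED) -> int:
    """Token count at which compaction fires: limit.input - reserved."""
    return limit_input - reserved


if __name__ == "__main__":
    for name, lim in LIMITS.items():
        li = input_limit(lim["context"], lim["output"])
        print(f"{name}: input={li} trigger={compaction_trigger(li)}")
    # qwen3-coder: input=188416 trigger=171416
    # qwen3.6: input=204800 trigger=187800
    # qwen3.5: input=253952 trigger=236952
```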
| client | setup |
|---|---|
| OpenAI SDK | OpenAI(base_url="https://llm.pq.io/main/v1", api_key=TOKEN) |
| Aider | --openai-api-base https://llm.pq.io/main/v1 --openai-api-key $TOKEN --model openai/qwen3-coder |
| Cline / Continue / Roo | OpenAI-compatible provider, model qwen3-coder |
Sampling defaults follow the Qwen team's recommendations. They are already applied server-side; override them per request via the standard OpenAI sampling fields.
| model | temperature | top_p | top_k | repeat_penalty | thinking |
|---|---|---|---|---|---|
| qwen3-coder | 0.7 | 0.8 | 20 | 1.05 | no |
| qwen3.6 | 0.6 | 0.95 | 20 | - | yes (<think> tags) |
| qwen3.5 | 0.6 | 0.95 | 20 | 1.10 | yes (<think> tags) |
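A per-request override and the handling of thinking output might look like the Python sketch below. It rests on two assumptions: only temperature and top_p are treated as overridable, since those are the standard OpenAI sampling fields (top_k and repeat_penalty are not, so the sketch leaves them at the server defaults), and the thinking models are assumed to return their reasoning inline in the message content wrapped in <think>...</think>. Both function names are hypothetical.

```python
import re

# Standard OpenAI sampling fields this sketch allows as overrides.
ALLOWED_OVERRIDES = {"temperature", "top_p"}

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def chat_payload(model: str, prompt: str, **sampling: float) -> dict:
    """Build a chat request body; unspecified fields keep server defaults."""
    unknown = set(sampling) - ALLOWED_OVERRIDES
    if unknown:
        raise ValueError(f"not a standard OpenAI sampling field: {sorted(unknown)}")
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    payload.update(sampling)
    return payload


def split_thinking(content: str) -> tuple[str, str]:
    """Return (reasoning, answer), assuming inline <think>...</think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(content))
    answer = THINK_RE.sub("", content).strip()
    return reasoning, answer
```

For example, `chat_payload("qwen3.6", "hello", temperature=0.2)` yields a body that lowers only the temperature, and `split_thinking` can be applied to the returned message content before showing it to a user. If the server is configured to return reasoning in a separate field instead, the split is unnecessary.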