OpenAI-compatible. Bearer-gated. https://llm.pq.io
| model | description | context | ~tok/s | endpoint |
|---|---|---|---|---|
| qwen3.5 (default) | Qwen3.5 — reasoning + agentic coding. Recommended for long-running tasks. | 96K | ~110 | /main/v1 |
| qwen3.6 (experimental) | Qwen3.6 — newer than 3.5, but with known agentic-loop regressions. | 96K | ~110 | /main/v1 |
| qwen3-coder | Qwen3-Coder — code-specialist, non-thinking. Faster, less verbose. | 96K | ~120 | /main/v1 |
| general | Llama 3.1 8B — quick chat / utility (summarize, classify, tag). | 8K | ~60 | /small/v1 |
| bge-m3 | Multilingual embeddings (1024-dim). | 8K input | — | /small/v1 |
tok/s is decode throughput on a fresh context; expect ~30% slowdown near full context. Models on /main share GPU memory — only one is loaded at a time, switching takes ~30-60s.
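Most OpenAI-compatible servers also expose a /models route for discovering what each endpoint currently serves; whether it is enabled here is an assumption, as is the TOKEN environment variable (token from matthew@pq.io, see below). A quick sketch with the OpenAI Python SDK:

```python
# List the model IDs served by each endpoint.
# Assumes the standard OpenAI-compatible /models route is enabled (not confirmed above)
# and that the bearer token is in the TOKEN environment variable.
import os
from openai import OpenAI

for base in ("https://llm.pq.io/main/v1", "https://llm.pq.io/small/v1"):
    client = OpenAI(base_url=base, api_key=os.environ["TOKEN"])
    print(base, [m.id for m in client.models.list()])
```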
Get a bearer token from matthew@pq.io, then:
curl https://llm.pq.io/main/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "hello"}]
  }'
Add "stream": true for SSE streaming. Use "temperature": 0 for deterministic agentic loops.
curl https://llm.pq.io/small/v1/embeddings \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "hello world"}'
| client | config |
|---|---|
| OpenAI Python SDK | OpenAI(base_url="https://llm.pq.io/main/v1", api_key=TOKEN) |
| OpenCode | See snippet below |
| Aider | --openai-api-base https://llm.pq.io/main/v1 --openai-api-key $TOKEN --model openai/qwen3.5 |
| Cline / Continue / Roo | Add as OpenAI-compatible provider, model qwen3.5 |
Drop this into ~/.config/opencode/opencode.json, replacing <your-token> with your bearer token.
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "pq.io": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "pq.io",
      "options": {
        "baseURL": "https://llm.pq.io/main/v1",
        "headers": {
          "Authorization": "Bearer <your-token>"
        }
      },
      "models": {
        "qwen3.5": { "name": "Qwen3.5 (default)", "limit": { "context": 98304, "input": 90112, "output": 8192 } },
        "qwen3.6": { "name": "Qwen3.6 (experimental)", "limit": { "context": 98304, "input": 90112, "output": 8192 } },
        "qwen3-coder": { "name": "Qwen3-Coder", "limit": { "context": 98304, "input": 90112, "output": 8192 } }
      }
    }
  },
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 8192,
    "tail_turns": 2,
    "preserve_recent_tokens": 8000
  },
  "agent": {
    "compaction": { "model": "pq.io/qwen3.5" }
  }
}
The explicit limit.input field is required for compaction to work — without it, OpenCode silently ignores compaction.reserved (upstream bug #13980).
Defaults per the Qwen team's recommendations. Override per request via the standard OpenAI sampling fields.
| model | temperature | top_p | top_k | repeat_penalty | thinking |
|---|---|---|---|---|---|
| qwen3.5 | 0.6 | 0.95 | 20 | — | yes (`<think>` tags) |
| qwen3.6 | 0.6 | 0.95 | 20 | — | yes (`<think>` tags) |
| qwen3-coder | 0.7 | 0.8 | 20 | 1.05 | no |
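For example, applying the qwen3-coder row per request with the Python SDK. temperature and top_p are standard OpenAI fields; top_k and repeat_penalty are not, so they are passed via extra_body, and whether the backend honors them is an assumption:

```python
# Per-request override of the qwen3-coder sampling defaults.
import os
from openai import OpenAI

client = OpenAI(base_url="https://llm.pq.io/main/v1", api_key=os.environ["TOKEN"])

resp = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Write a one-line hello world in Python."}],
    temperature=0.7,
    top_p=0.8,
    # Non-standard sampling keys, forwarded as extra JSON body fields.
    # Whether this server accepts them is an assumption.
    extra_body={"top_k": 20, "repeat_penalty": 1.05},
)
print(resp.choices[0].message.content)
```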
Health: /health (no auth). Issues: matthew@pq.io.