pq.io — LLM API

OpenAI-compatible. Bearer-gated. https://llm.pq.io

Models

| model | description | context | ~tok/s | endpoint |
|---|---|---|---|---|
| qwen3.5 (default) | Qwen3.5 — reasoning + agentic coding. Recommended for long-running tasks. | 96K | ~110 | /main/v1 |
| qwen3.6 (exp) | Qwen3.6 — newer than 3.5, but with known agentic-loop regressions. | 96K | ~110 | /main/v1 |
| qwen3-coder | Qwen3-Coder — code-specialist, non-thinking. Faster, less verbose. | 96K | ~120 | /main/v1 |
| general | Llama 3.1 8B — quick chat / utility (summarize, classify, tag). | 8K | ~60 | /small/v1 |
| bge-m3 | Multilingual embeddings (1024-dim). | 8K input | | /small/v1 |

tok/s is decode throughput on a fresh context; expect ~30% slowdown near full context. Models on /main share GPU memory — only one is loaded at a time, switching takes ~30-60s.
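
To check what the gateway currently serves, the standard OpenAI model-listing call should work through the Python SDK; note that GET /models is an assumption here, since only chat completions and embeddings are documented on this page. A minimal sketch:

from openai import OpenAI
import os

client = OpenAI(base_url="https://llm.pq.io/main/v1", api_key=os.environ["TOKEN"])

# GET /main/v1/models is assumed; verify it exists before relying on it.
for m in client.models.list():
    print(m.id)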

Quick start

Get a bearer token from matthew@pq.io, then:

curl https://llm.pq.io/main/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "hello"}]
  }'

Add "stream": true for SSE streaming. Use "temperature": 0 for deterministic agentic loops.
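
The same request through the OpenAI Python SDK (see Clients below), with streaming and temperature 0 enabled; a minimal sketch that assumes your bearer token is in the TOKEN environment variable:

from openai import OpenAI
import os

client = OpenAI(base_url="https://llm.pq.io/main/v1", api_key=os.environ["TOKEN"])

stream = client.chat.completions.create(
    model="qwen3.5",
    messages=[{"role": "user", "content": "hello"}],
    temperature=0,   # greedy-ish decoding for agentic loops
    stream=True,     # same as "stream": true over curl (SSE)
)
for chunk in stream:
    # delta.content is None on role/finish chunks, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()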

Embeddings

curl https://llm.pq.io/small/v1/embeddings \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "hello world"}'
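
The equivalent call via the OpenAI Python SDK, batching two inputs and checking cosine similarity of the resulting 1024-dim vectors; a sketch that again assumes TOKEN is set in the environment:

from openai import OpenAI
import os

client = OpenAI(base_url="https://llm.pq.io/small/v1", api_key=os.environ["TOKEN"])

resp = client.embeddings.create(model="bge-m3", input=["hello world", "hello there"])
a, b = resp.data[0].embedding, resp.data[1].embedding

# cosine similarity without external dependencies
dot = sum(x * y for x, y in zip(a, b))
norm_a = sum(x * x for x in a) ** 0.5
norm_b = sum(x * x for x in b) ** 0.5
print(len(a), round(dot / (norm_a * norm_b), 4))  # 1024, similarity in [-1, 1]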

Clients

| client | config |
|---|---|
| OpenAI Python SDK | OpenAI(base_url="https://llm.pq.io/main/v1", api_key=TOKEN) |
| OpenCode | See snippet below |
| Aider | --openai-api-base https://llm.pq.io/main/v1 --openai-api-key $TOKEN --model openai/qwen3.5 |
| Cline / Continue / Roo | Add as an OpenAI-compatible provider, model qwen3.5 |

OpenCode

Drop this into ~/.config/opencode/opencode.json, replacing <your-token> with your bearer token.

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "pq.io": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "pq.io",
      "options": {
        "baseURL": "https://llm.pq.io/main/v1",
        "headers": {
          "Authorization": "Bearer <your-token>"
        }
      },
      "models": {
        "qwen3.5":     { "name": "Qwen3.5 (default)",       "limit": { "context": 98304, "input": 90112, "output": 8192 } },
        "qwen3.6":     { "name": "Qwen3.6 (experimental)",  "limit": { "context": 98304, "input": 90112, "output": 8192 } },
        "qwen3-coder": { "name": "Qwen3-Coder",             "limit": { "context": 98304, "input": 90112, "output": 8192 } }
      }
    }
  },
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 8192,
    "tail_turns": 2,
    "preserve_recent_tokens": 8000
  },
  "agent": {
    "compaction": { "model": "pq.io/qwen3.5" }
  }
}

The explicit limit.input field is required for compaction to work — without it, OpenCode silently ignores compaction.reserved (upstream bug #13980).

Sampling defaults

Per the Qwen team's recommendations. Override per request via the standard OpenAI sampling fields; see the example after the table.

| model | temperature | top_p | top_k | repeat_penalty | thinking |
|---|---|---|---|---|---|
| qwen3.5 | 0.6 | 0.95 | 20 | | yes (<think> tags) |
| qwen3.6 | 0.6 | 0.95 | 20 | | yes (<think> tags) |
| qwen3-coder | 0.7 | 0.8 | 20 | 1.05 | no |
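
For example, overriding the qwen3-coder defaults on a single request: temperature and top_p are standard OpenAI fields, while top_k and repeat_penalty are not part of the OpenAI schema, so passing them via the SDK's extra_body only works if the backing server accepts those key names (an assumption to verify):

from openai import OpenAI
import os

client = OpenAI(base_url="https://llm.pq.io/main/v1", api_key=os.environ["TOKEN"])

resp = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=0.2,  # standard field, overrides the 0.7 default
    top_p=0.8,        # standard field
    # top_k / repeat_penalty are non-standard; extra_body forwards them verbatim,
    # and the accepted key names vary by server.
    extra_body={"top_k": 20, "repeat_penalty": 1.05},
)
print(resp.choices[0].message.content)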

Health: /health (no auth). Issues: matthew@pq.io.