pq.io — LLM API

OpenAI-compatible. Bearer-gated. https://llm.pq.io

Models

model	description	context	~tok/s	endpoint
`qwen3.5` (default)	Qwen3.5: reasoning + agentic coding. Recommended for long-running tasks.	96K	~110	`/main/v1`
`qwen3.6` (experimental)	Qwen3.6: newer than 3.5, but with known agentic-loop regressions.	96K	~110	`/main/v1`
`qwen3-coder`	Qwen3-Coder: code specialist, non-thinking. Faster, less verbose.	96K	~120	`/main/v1`
`general`	Llama 3.1 8B: quick chat / utility (summarize, classify, tag).	8K	~60	`/small/v1`
`bge-m3`	Multilingual embeddings (1024-dim).	8K input	—	`/small/v1`

tok/s is decode throughput on a fresh context; expect roughly 30% lower throughput near full context. Models on /main share GPU memory, so only one is loaded at a time; switching between them takes ~30-60 s.
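As a rough worked example of the figures above, the sketch below estimates how long a reply takes to decode. It reads "~30% slowdown" as throughput dropping to 70% of the table value, and charges the upper 60 s bound for a model switch; both readings are assumptions, and `estimate_decode_seconds` is a hypothetical helper, not part of the API.

```python
def estimate_decode_seconds(output_tokens, tok_per_s,
                            near_full_context=False, model_switch=False):
    """Back-of-envelope decode time from the Models table figures."""
    # Assumption: "~30% slowdown" means throughput falls to 70% of nominal.
    rate = tok_per_s * (0.7 if near_full_context else 1.0)
    seconds = output_tokens / rate
    if model_switch:
        # Assumption: use the upper bound of the ~30-60 s switch time.
        seconds += 60.0
    return seconds


if __name__ == "__main__":
    # 2,000 output tokens from qwen3.5 (~110 tok/s) on a fresh context.
    print(round(estimate_decode_seconds(2000, 110), 1))
    # Same reply near full context, after another /main model was loaded.
    print(round(estimate_decode_seconds(2000, 110, True, True), 1))
```

Prompt processing time is not covered by the tok/s column, so treat the result as a lower bound.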

Quick start

Get a bearer token from matthew@pq.io, then:

curl https://llm.pq.io/main/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "hello"}]
  }'

Add "stream": true for SSE streaming. Use "temperature": 0 for deterministic agentic loops.
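A minimal sketch of consuming the SSE stream with only the Python standard library. It assumes the server emits the usual OpenAI streaming shape (`data: {...}` lines carrying `choices[0].delta.content`, ending with `data: [DONE]`); `iter_stream_text` and `stream_chat` are hypothetical helper names.

```python
import json
import urllib.request


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        piece = choices[0].get("delta", {}).get("content")
        if piece:
            yield piece


def stream_chat(token, prompt, model="qwen3.5"):
    """POST a streaming chat request and yield the reply as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "temperature": 0,
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://llm.pq.io/main/v1/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_stream_text(resp)
```

Usage would look like `for piece in stream_chat(os.environ["TOKEN"], "hello"): print(piece, end="", flush=True)`.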

Embeddings

curl https://llm.pq.io/small/v1/embeddings \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "hello world"}'
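To show what to do with the result, here is a small sketch that pulls vectors out of an embeddings response and compares them. It assumes the response follows the OpenAI embeddings shape (`data[i].embedding` plus `data[i].index`); the helper names are illustrative only.

```python
import math


def extract_embeddings(response):
    """Return embedding vectors from an OpenAI-style response, in input order."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With `bge-m3` each vector should have 1024 entries, so `len(vectors[0]) == 1024` is a cheap sanity check on a real response.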

Clients

client	config
OpenAI Python SDK	`OpenAI(base_url="https://llm.pq.io/main/v1", api_key=TOKEN)`
OpenCode	See snippet below
Aider	`--openai-api-base https://llm.pq.io/main/v1 --openai-api-key $TOKEN --model openai/qwen3.5`
Cline / Continue / Roo	Add as OpenAI-compatible provider, model `qwen3.5`
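The table gives only the constructor call for the Python SDK, so the sketch below shows one full request. The model-to-endpoint mapping is copied from the Models table; `base_url_for` and `ask` are hypothetical helpers, the token is assumed to live in a `TOKEN` environment variable, and the `openai` package must be installed before `ask` is called.

```python
import os

BASE = "https://llm.pq.io"
# Endpoint per model, as listed in the Models table.
MAIN_MODELS = {"qwen3.5", "qwen3.6", "qwen3-coder"}
SMALL_MODELS = {"general", "bge-m3"}


def base_url_for(model):
    """Return the OpenAI-compatible base URL that serves `model`."""
    if model in MAIN_MODELS:
        return f"{BASE}/main/v1"
    if model in SMALL_MODELS:
        return f"{BASE}/small/v1"
    raise ValueError(f"unknown model: {model}")


def ask(model, prompt):
    """Send one chat message and return the reply text (chat models only)."""
    from openai import OpenAI  # lazy import: requires `pip install openai`

    client = OpenAI(base_url=base_url_for(model), api_key=os.environ["TOKEN"])
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```

Picking the base URL from the model name avoids sending a `general` request to `/main/v1`, which the table suggests is served elsewhere.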

OpenCode

Drop this into ~/.config/opencode/opencode.json, replacing <your-token> with your bearer token.

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "pq.io": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "pq.io",
      "options": {
        "baseURL": "https://llm.pq.io/main/v1",
        "headers": {
          "Authorization": "Bearer <your-token>"
        }
      },
      "models": {
        "qwen3.5":     { "name": "Qwen3.5 (default)",       "limit": { "context": 98304, "input": 90112, "output": 8192 } },
        "qwen3.6":     { "name": "Qwen3.6 (experimental)",  "limit": { "context": 98304, "input": 90112, "output": 8192 } },
        "qwen3-coder": { "name": "Qwen3-Coder",             "limit": { "context": 98304, "input": 90112, "output": 8192 } }
      }
    }
  },
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 8192,
    "tail_turns": 2,
    "preserve_recent_tokens": 8000
  },
  "agent": {
    "compaction": { "model": "pq.io/qwen3.5" }
  }
}

The explicit limit.input field is required for compaction to work — without it, OpenCode silently ignores compaction.reserved (upstream bug #13980).
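When editing the limits by hand, a quick check like the one below can catch a missing or mismatched `input` value. In the snippet above, `input + output` equals `context` (90112 + 8192 = 98304, i.e. 96 × 1024); treating that as a rule is inferred from those numbers, not from documented OpenCode behaviour, and `inconsistent_limits` is a hypothetical helper.

```python
# Limits copied from the OpenCode config above.
LIMITS = {
    "qwen3.5":     {"context": 98304, "input": 90112, "output": 8192},
    "qwen3.6":     {"context": 98304, "input": 90112, "output": 8192},
    "qwen3-coder": {"context": 98304, "input": 90112, "output": 8192},
}


def inconsistent_limits(models):
    """Return names of models whose limits lack `input` or do not add up."""
    bad = []
    for name, limit in models.items():
        if "input" not in limit:
            bad.append(name)  # compaction needs an explicit input limit
        elif limit["input"] + limit["output"] != limit["context"]:
            bad.append(name)  # assumption: input + output should equal context
    return bad


if __name__ == "__main__":
    print(inconsistent_limits(LIMITS) or "all limits consistent")
```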

Sampling defaults

Defaults follow the Qwen team's recommendations. Override them per request via the standard OpenAI sampling fields.

model	temperature	top_p	top_k	repeat_penalty	thinking
`qwen3.5`	0.6	0.95	20	—	yes (`<think>` tags)
`qwen3.6`	0.6	0.95	20	—	yes (`<think>` tags)
`qwen3-coder`	0.7	0.8	20	1.05	no
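A short sketch of overriding the defaults and handling thinking output. It sends only `temperature` and `top_p`, which are standard OpenAI fields, and assumes the reasoning arrives inline in the message content wrapped in `<think>` tags, as the table suggests. Both helper names are hypothetical.

```python
import re

# Standard OpenAI sampling fields used in this sketch (not an exhaustive list).
ALLOWED_OVERRIDES = {"temperature", "top_p"}
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def build_payload(model, prompt, **sampling):
    """Build a chat request body with optional per-request sampling overrides."""
    unknown = set(sampling) - ALLOWED_OVERRIDES
    if unknown:
        raise ValueError(f"unsupported overrides: {sorted(unknown)}")
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    payload.update(sampling)
    return payload


def split_thinking(text):
    """Split a reply into (reasoning, answer) around <think>...</think> blocks."""
    reasoning = "\n".join(part.strip() for part in _THINK.findall(text))
    answer = _THINK.sub("", text).strip()
    return reasoning, answer
```

For `qwen3-coder`, which is non-thinking, `split_thinking` simply returns an empty reasoning string and the reply unchanged apart from trimmed whitespace.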

Health: /health (no auth). Issues: matthew@pq.io.
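A minimal liveness probe using only the standard library. It assumes that an HTTP 200 from /health means the service is up; the response body format is not documented here, so it is ignored. `is_healthy` is a hypothetical helper, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.request


def is_healthy(base="https://llm.pq.io", timeout=5.0,
               opener=urllib.request.urlopen):
    """Return True if GET {base}/health answers with HTTP 200."""
    try:
        with opener(f"{base}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers URLError, HTTPError, timeouts, and connection failures.
        return False
```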