# pq.io — LLM API

> OpenAI-compatible, bearer-gated LLM API at https://llm.pq.io
> Get a bearer token from matthew@pq.io.

This is the agent-readable mirror of the https://llm.pq.io landing page.

## Quick start

Replace $TOKEN with your bearer token. Set "model" to any model listed
below; add "stream": true to receive the response as server-sent events (SSE).

    curl https://llm.pq.io/main/v1/chat/completions \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3-coder",
        "messages": [{"role": "user", "content": "hello"}]
      }'
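
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `iter_sse_text` are hypothetical, and the SSE parser assumes the stream follows the usual OpenAI chunk shape (`data: {...}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`).

```python
import json
import urllib.request

BASE_URL = "https://llm.pq.io/main/v1"  # same base URL as the curl example


def build_chat_request(token, prompt, model="qwen3-coder", stream=False):
    """Build (but do not send) a POST to /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if stream:
        payload["stream"] = True  # ask for server-sent events
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_sse_text(lines):
    """Yield content deltas from OpenAI-style SSE lines (str or bytes).

    Assumption: each event is a single 'data: <json>' line and the
    stream ends with 'data: [DONE]'.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Usage (performs a real network call, so it is left commented out):
#   req = build_chat_request(token, "hello", stream=True)
#   with urllib.request.urlopen(req, timeout=300) as resp:
#       for piece in iter_sse_text(resp):
#           print(piece, end="", flush=True)
```

The generous timeout in the usage comment is an assumption, chosen so a request survives a lane swap or a short queue.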

## Chat & agentic models — /main/v1

Only one /main lane is GPU-resident at a time; a lane swap takes ~30-60s.
Each lane has a single slot, so concurrent requests queue at llama-server.
Decode rates are measured with a fresh context; expect a ~30% slowdown near full context.

| model | description | context x slots | decode | compaction fires at |
|---|---|---|---|---|
| qwen3-coder (default) | Code-specialist, non-thinking. Daily-driver chat lane. | 192K x 1 | ~120 tok/s | ~167K input (87%) |
| qwen3.6 | Reasoning + agentic. Dedicated OpenCode compactor; 100% needle-in-a-haystack (NIAH) retrieval at 200K. | 208K x 1 | ~123 tok/s | ~183K input (88%) |
| qwen3.5 | Reasoning + agentic. On-demand thinking lane at native max context. | 256K x 1 | ~122 tok/s | ~231K input (90%) |
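
For client-side timeout budgeting, the decode rates above can be turned into a rough latency estimate. The sketch below is back-of-the-envelope arithmetic only: `estimate_decode_seconds` is a hypothetical helper, the "~30% slowdown" is modeled as multiplying the decode rate by 0.7 (an assumption about what the figure means), and prompt processing and queueing time are ignored.

```python
def estimate_decode_seconds(output_tokens, tok_per_s,
                            near_full_context=False, lane_swap_s=0.0):
    """Rough time to generate `output_tokens` at a given decode rate.

    Assumptions: near full context the rate drops to 70% of the
    fresh-context figure; prompt processing and queueing are not modeled.
    """
    rate = tok_per_s * (0.7 if near_full_context else 1.0)
    return lane_swap_s + output_tokens / rate


# 1200 output tokens on qwen3-coder (~120 tok/s), fresh context.
fresh = estimate_decode_seconds(1200, 120)

# Same output near full context, after a worst-case 60s lane swap.
worst = estimate_decode_seconds(1200, 120, near_full_context=True,
                                lane_swap_s=60)

print(f"fresh: {fresh:.1f}s, worst case: {worst:.1f}s")
```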

## Utility models — /small/v1

| model | description | context | decode |
|---|---|---|---|
| qwen2.5-coder | Qwen2.5-Coder-7B — quick utility lane (summarize, classify, tag, structured JSON). | 32K | ~35 tok/s |
| bge-m3 | Multilingual embeddings (1024-dim). | up to 8K tokens input | — |

Embeddings:

    curl https://llm.pq.io/small/v1/embeddings \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"model": "bge-m3", "input": "hello world"}'
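
A typical use of the embeddings endpoint is comparing two texts. The sketch below builds the request and computes cosine similarity from a response; the helper names are hypothetical, and it assumes the server follows the OpenAI embeddings shape (a list of strings is accepted as `input`, and the response carries `data[i].index` and `data[i].embedding`).

```python
import json
import math
import urllib.request

EMBEDDINGS_URL = "https://llm.pq.io/small/v1/embeddings"


def build_embeddings_request(token, texts, model="bge-m3"):
    """Build (but do not send) a POST to /embeddings for a list of texts."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_vectors(response_json):
    """Return embeddings ordered by their 'index' field.

    Assumption: OpenAI-style response, {"data": [{"index", "embedding"}]}.
    """
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Usage (real network call, left commented out):
#   req = build_embeddings_request(token, ["hello world", "hi there"])
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       first, second = extract_vectors(json.load(resp))
#   print(cosine(first, second))
```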

## OpenCode setup

Put the following in ~/.config/opencode/opencode.json and replace
<your-token> with your bearer token.

    {
      "$schema": "https://opencode.ai/config.json",
      "provider": {
        "pq.io": {
          "npm": "@ai-sdk/openai-compatible",
          "name": "pq.io",
          "options": {
            "baseURL": "https://llm.pq.io/main/v1",
            "headers": {
              "Authorization": "Bearer <your-token>"
            }
          },
          "models": {
            "qwen3-coder": { "name": "Qwen3-Coder", "limit": { "context": 196608, "input": 188416, "output": 8192 } },
            "qwen3.6":     { "name": "Qwen3.6",     "limit": { "context": 212992, "input": 204800, "output": 8192 } },
            "qwen3.5":     { "name": "Qwen3.5",     "limit": { "context": 262144, "input": 253952, "output": 8192 } }
          }
        }
      },
      "model": "pq.io/qwen3-coder",
      "compaction": {
        "auto": true,
        "prune": true,
        "reserved": 17000,
        "tail_turns": 2,
        "preserve_recent_tokens": 8000
      },
      "agent": {
        "compaction": { "model": "pq.io/qwen3.6" }
      }
    }

An explicit limit.input is required: without it, upstream bug #13980 turns
reserved into a silent no-op. Compaction triggers at limit.input - reserved
tokens; for qwen3-coder that is 188416 - 17000 = 171416 tokens (~167K).
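
The trigger formula can be checked against the "compaction fires" figures in the model table. The sketch below uses only the limits and the `reserved` value from the config above; the function name `compaction_trigger` is hypothetical, and "K" is taken to mean 1024 tokens.

```python
# Limits copied from the opencode.json above.
LIMITS = {
    "qwen3-coder": {"context": 196608, "input": 188416, "output": 8192},
    "qwen3.6":     {"context": 212992, "input": 204800, "output": 8192},
    "qwen3.5":     {"context": 262144, "input": 253952, "output": 8192},
}
RESERVED = 17000  # compaction.reserved from the config


def compaction_trigger(limit_input, reserved=RESERVED):
    """Input token count at which compaction fires: limit.input - reserved."""
    return limit_input - reserved


for name, limit in LIMITS.items():
    trigger = compaction_trigger(limit["input"])
    pct = 100 * trigger / limit["context"]
    print(f"{name}: {trigger} tokens (~{trigger / 1024:.0f}K, {pct:.0f}% of context)")

# Prints:
# qwen3-coder: 171416 tokens (~167K, 87% of context)
# qwen3.6: 187800 tokens (~183K, 88% of context)
# qwen3.5: 236952 tokens (~231K, 90% of context)
```

In each entry, `context` equals `input` + `output`, so the 8192-token output budget is what separates `limit.input` from the full context window.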

## Other clients

| client | config |
|---|---|
| OpenAI SDK | OpenAI(base_url="https://llm.pq.io/main/v1", api_key=TOKEN) |
| Aider | --openai-api-base https://llm.pq.io/main/v1 --openai-api-key $TOKEN --model openai/qwen3-coder |
| Cline / Continue / Roo | OpenAI-compatible provider, model qwen3-coder |

## Sampling defaults

These defaults follow the Qwen team's recommendations and are already applied
server-side; override them per request via the standard OpenAI sampling fields.

| model | temperature | top_p | top_k | repeat_penalty | thinking |
|---|---|---|---|---|---|
| qwen3-coder | 0.7 | 0.8 | 20 | 1.05 | no |
| qwen3.6 | 0.6 | 0.95 | 20 | — | yes (think tags) |
| qwen3.5 | 0.6 | 0.95 | 20 | 1.10 | yes (think tags) |
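
A per-request override might look like the sketch below. The helper `chat_payload` is hypothetical, and it deliberately accepts only `temperature` and `top_p`, the two columns above that are standard OpenAI request fields; whether the server also honors `top_k` or `repeat_penalty` in the request body is not stated here, so those are left out.

```python
STANDARD_SAMPLING_FIELDS = {"temperature", "top_p"}


def chat_payload(model, prompt, **sampling):
    """Build a chat-completions body with optional sampling overrides.

    Fields that are omitted are not sent, so the server-side defaults
    from the table apply to them.
    """
    unknown = set(sampling) - STANDARD_SAMPLING_FIELDS
    if unknown:
        raise ValueError(f"unsupported sampling fields: {sorted(unknown)}")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(sampling)
    return payload


# Lower the temperature for qwen3.5 on this request only; top_p keeps
# its server-side default because it is not sent.
body = chat_payload("qwen3.5", "Summarize this diff.", temperature=0.2)
print(body)
```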

## Endpoints

- GET /          — HTML landing page (no auth)
- GET /llms.txt  — this document (no auth)
- GET /health    — 200 ok (no auth)
- /main/v1/*     — chat/agentic models (bearer required)
- /small/v1/*    — utility models + embeddings (bearer required)
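
The auth split in the list above can be expressed as a small request builder. This is a sketch under the assumption that only paths beneath /main/v1/ and /small/v1/ need the bearer token; `requires_bearer` and `build_request` are hypothetical names.

```python
import urllib.request

ROOT = "https://llm.pq.io"
PROTECTED_PREFIXES = ("/main/v1/", "/small/v1/")


def requires_bearer(path):
    """True for the model routes, False for /, /llms.txt and /health."""
    return path.startswith(PROTECTED_PREFIXES)


def build_request(path, token=None):
    """Build a GET request, attaching the bearer only where it is needed."""
    headers = {}
    if requires_bearer(path):
        if not token:
            raise ValueError(f"{path} requires a bearer token")
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(ROOT + path, headers=headers, method="GET")


# Usage (real network call, left commented out):
#   with urllib.request.urlopen(build_request("/health"), timeout=10) as resp:
#       print(resp.status)
```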

Health: https://llm.pq.io/health · Issues: matthew@pq.io
