<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://allanchan339.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://allanchan339.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-05-02T20:52:55+08:00</updated><id>https://allanchan339.github.io/feed.xml</id><title type="html">blank</title><subtitle>My personal website </subtitle><entry><title type="html">qwen3.6-enhanced.jinja: CoT leakage into tool turns and why preserve_thinking works now</title><link href="https://allanchan339.github.io/bug-fixes/2026/05/02/Qwen36-27B-updated-jinja.html" rel="alternate" type="text/html" title="qwen3.6-enhanced.jinja: CoT leakage into tool turns and why preserve_thinking works now"/><published>2026-05-02T00:00:00+08:00</published><updated>2026-05-02T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/05/02/Qwen36-27B-updated-jinja</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/05/02/Qwen36-27B-updated-jinja.html"><![CDATA[<p>In <a href="/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html">my April note on Qwen 3.6-27B</a> I described a stack that survived a long agentic trace: <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> on the 3.6 checkpoint, <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> for streaming extraction, <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong>, and NCCL tweaks after <strong>Studio Driver 595.79</strong>.</p> <p><strong>That is the same cluster of reasons</strong> <code class="language-plaintext highlighter-rouge">preserve_thinking</code> <strong>had to stay off:</strong> <strong>Qwen 3.6</strong> <strong>sustains</strong> interleaved <strong>thinking</strong> in a way <strong>3.5</strong> largely does not; <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> <strong>does not repair</strong> missing <code>&lt;/redacted_thinking&gt;</code> and can <strong>double-wrap</strong> assistant turns on <strong>3.6</strong>; with <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=true</code></strong> the template <strong>keeps</strong> more of that <strong>broken</strong> structure in <strong>rendered history</strong>, so <strong>prefix pollution</strong>, <strong>CoT bleed</strong>, and <strong>ignored <code class="language-plaintext highlighter-rouge">tool_call</code></strong> <strong>get worse</strong>. <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> was the <strong>pressure-release</strong>—<strong>stripping</strong> much <strong>think</strong> from <strong>earlier</strong> turns so agent runs could finish—not a statement that <strong>3.6</strong> “should not” expose reasoning. 
I dug in when <strong>reasoning still leaked into <code class="language-plaintext highlighter-rouge">tool_response</code></strong> and <strong>tools stopped firing</strong> even with the flag off.</p> <p>I developed <strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></a></strong> so the <strong>Qwen 3.6 family</strong> can use an enhanced chat template <strong>without</strong> that compromise: <strong>multimodal</strong> paths, <strong>interleaved</strong> thinking aligned to how <strong>3.6</strong> actually behaves, <strong>self-healing</strong> before the reasoning split, and <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code> supported</strong> (<code class="language-plaintext highlighter-rouge">true</code> or <code class="language-plaintext highlighter-rouge">false</code>)—i.e. the <strong>full surface</strong> the <strong>3.6 series</strong> is meant to expose, instead of <strong>turning off</strong> <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> to paper over <strong>3.5-enhanced-on-3.6</strong> bugs. <a href="https://raw.githubusercontent.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/main/chat-template/qwen3.6-enhanced.jinja">Raw file</a> for <code class="language-plaintext highlighter-rouge">vllm serve --chat-template</code>. <strong>Working proof (~128k tokens spent):</strong> <strong><a href="https://github.com/allanchan339/qwen36_27B_36jinja_project">qwen36_27B_36jinja_project</a></strong>.</p> <p><img src="/assets/img/posts/2026-05-02-qwen36-jinja-token-trace.png" alt="Token trace after the qwen3.6-enhanced.jinja run (~128k tokens spent; served from this site)"/></p> <p>This post is the template-side story: <strong>why</strong> pointing raw <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> at 3.6 could corner the runtime, why that file <strong>never inserts</strong> a missing <code>&lt;/redacted_thinking&gt;</code> (it <strong>leaves the broken assistant text in the prompt</strong>—<strong>causal</strong> models still <strong>condition on it</strong>), and the <strong>minimal self‑healing</strong> step I put in the <strong><code class="language-plaintext highlighter-rouge">assistant</code></strong> branch of <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> before the reasoning split.</p> <h2 id="what-broke-in-plain-terms">What broke in plain terms</h2> <p>Sometimes the assistant emitted something shaped like:</p> <ul> <li> <p>an opening think marker (<strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> and <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> share the same literal <code>&lt;redacted_thinking&gt;</code> family; the April post used <strong><code class="language-plaintext highlighter-rouge">thinking</code></strong> casually for readability),</p> </li> <li> <p><strong>no closing tag</strong> before a raw <code>&lt;tool_call&gt;</code> block.</p> </li> </ul> <p>Training and runtime prompts encourage <strong>closed</strong> think sections. 
Reality is messier: the model can wedge a tool payload <strong>inside</strong> what is effectively still “thinking.”</p> <p>Separately, running <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> on <strong>Qwen 3.6</strong> applied assistant logic equivalent to wrapping <strong>every</strong> qualifying turn in a synthetic think sandwich <strong>even when <code class="language-plaintext highlighter-rouge">reasoning_content</code> stayed empty</strong>. That interacted badly with malformed history: after rendering, it could look like the model was <strong>still inside</strong> an outer think envelope when <code>&lt;tool_call&gt;</code> appeared. Downstream behaviour matches what I observed as <strong>CoT leakage across turn boundaries</strong> and <strong>tool instructions that never get scheduled</strong>.</p> <p>None of this negates <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> on 3.6—the parser lane still matters—but fixing the template removes a <strong>structural</strong> failure mode rather than leaning only on parsing heuristics.</p> <h2 id="why-reasoning-extraction-silently-failed">Why reasoning extraction silently failed</h2> <p>The template extracts <code class="language-plaintext highlighter-rouge">reasoning_content</code> by looking for <code>&lt;/redacted_thinking&gt;</code> in the <strong>message body</strong>. When the assistant never emits that closing tag, the splitter never runs, <code class="language-plaintext highlighter-rouge">reasoning_content</code> stays <strong>empty</strong>, and the <strong>remainder</strong> stays the full raw string—<strong>including</strong> the unclosed opening think tag ahead of <code>&lt;tool_call&gt;</code>.</p> <p>A <strong>3.6-style</strong> handler unconditionally wrapped “post–last-user” assistant text in opening and closing redacted-thinking fences, <strong>plus</strong> the recombined body, and so effectively produced stacked think markup: a <strong>vacant</strong> fenced block followed by thought text that <strong>still began with</strong> another dangling <code>&lt;redacted_thinking&gt;</code> ahead of <code>&lt;tool_call&gt;</code>.</p> <p>From the model’s point of view that is dangerously close to “tool call emitted while still reasoning,” which rationalizes <strong>ignored tool XML</strong> and <strong>follow-up prose that belongs in the think block leaking into structured tool payloads</strong>.</p> <h3 id="what-qwen35-enhancedjinja-actually-does-and-does-not-do">What <code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> actually does (and does not do)</h3> <p><strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> does not repair a missing <code>&lt;/redacted_thinking&gt;</code>: there is no pass that closes a dangling opener or strips half-open think markup. 
Whatever the assistant emitted—including <code>&lt;redacted_thinking&gt;</code> with no matching close before <strong><code class="language-plaintext highlighter-rouge">tool_call</code></strong>—can still <strong>show up</strong> in the <strong>serialized prompt</strong> the <strong>causal</strong> model conditions on at the next step; “letting it be” is <strong>input-side pollution</strong> in principle whenever that text <strong>stays</strong> in the prefix.</p> <p><strong>Why the same no-fix workaround looked “fine” on Qwen 3.5:</strong> in my runs <strong>Qwen 3.5 does not really sustain</strong> a long-lived <strong>interleaved thinking</strong> block the way <strong>Qwen 3.6</strong> does—it <strong>lacks</strong> that <strong>stickier</strong> “keep thinking open across turns” behaviour. <strong>Interleaved</strong> chat templating also <strong>discards</strong> many <strong>think</strong> segments for assistant turns <strong>before</strong> the last real user message, so <strong>most</strong> of the half-open scaffold <strong>never re-enters</strong> the prefix the model sees. <strong>3.6</strong> is where that stops being a sufficient safety net, so the <strong>same</strong> “don’t repair the close” policy starts to <strong>hurt</strong> visibly (<strong>CoT bleed</strong>, ignored tools) and <strong>self-healing</strong> in <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> becomes worth the complexity.</p> <p>Earlier <strong>3.5</strong> assistant logic only wrapped output in an explicit think block when <code class="language-plaintext highlighter-rouge">reasoning_content</code> was <strong>non-empty</strong> after splitting. With <strong>no close tag</strong>, <code class="language-plaintext highlighter-rouge">reasoning_content</code> stayed blank, the template <strong>skipped an extra synthetic think envelope</strong>, and the <strong>same dirty assistant string</strong> (still containing the unclosed opener) was emitted as <strong>bare assistant content</strong>. That sometimes kept <strong><code class="language-plaintext highlighter-rouge">tool_call</code></strong> <strong>outside</strong> a <strong>second</strong> layer of scaffolding the template would have invented—helping <strong>scheduling</strong>—but it <strong>did not</strong> make the <strong>token history</strong> structurally clean. 
On the faulty <strong>3.6-on-3.5-enhanced</strong> path, the unconditional wrapper <strong>added</strong> that outer layer on top of the still-unclosed inner block, which made tool behaviour worse <strong>without</strong> fixing the underlying transcript hygiene problem.</p> <h2 id="the-fix-i-settled-on">The fix I settled on</h2> <p>I wanted <strong>deterministic repair</strong>, not another special case that might leave historic turns ending in <code>&lt;redacted_thinking&gt;</code> without a sibling close before <code>&lt;tool_call&gt;</code>:</p> <ol> <li> <p><strong>Self-healing (before splitting):</strong><br/> When both <code>&lt;tool_call&gt;</code> and <code>&lt;redacted_thinking&gt;</code> appear and the <strong>last</strong> <code>&lt;/redacted_thinking&gt;</code> sits <strong>before</strong> the <strong>last</strong> <code>&lt;redacted_thinking&gt;</code> (including the <code class="language-plaintext highlighter-rouge">-1 / missing</code> cases), inject <code>&lt;/redacted_thinking&gt;</code> immediately <strong>before</strong> the first <code>&lt;tool_call&gt;</code> when that tool call sits after the dangling opener; otherwise append <code>&lt;/redacted_thinking&gt;</code> at the end.</p> </li> <li> <p><strong>Keep the outer think wrapper unchanged</strong> afterward: splitting now sees balanced markers, extracts <code class="language-plaintext highlighter-rouge">reasoning_content</code> cleanly, and the tool payload never sits upstream of <strong>two</strong> contradictory think layers.</p> </li> </ol> <p>Roughly—the snippet lives today in <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong>; the operative structure is:</p> <div class="language-jinja highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{%- elif message.role == &quot;assistant&quot; -%}
    {%- set content = render_content(message.content, true)|trim -%}

    {# Ensure &lt;/redacted_thinking&gt; exists before tool XML when opener was left dangling #}
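    {# Worked example (hypothetical content, for illustration only): given
       &#x27;&lt;redacted_thinking&gt;plan…&lt;tool_call&gt;{…}&lt;/tool_call&gt;&#x27;,
       last_close is -1 and tool_pos &gt; last_think, so the close is injected before the tool XML:
       &#x27;&lt;redacted_thinking&gt;plan…&lt;/redacted_thinking&gt;&lt;tool_call&gt;{…}&lt;/tool_call&gt;&#x27; #}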
    {%- if &#x27;&lt;tool_call&gt;&#x27; in content and &#x27;&lt;redacted_thinking&gt;&#x27; in content -%}
        {%- set last_think = content.rfind(&#x27;&lt;redacted_thinking&gt;&#x27;) -%}
        {%- set last_close = content.rfind(&#x27;&lt;/redacted_thinking&gt;&#x27;) -%}
        {%- set tool_pos = content.find(&#x27;&lt;tool_call&gt;&#x27;) -%}
        {%- if last_close &lt; last_think or last_close == -1 -%}
            {%- if tool_pos &gt; last_think -%}
                {%- set content = content[:tool_pos] ~ &#x27;&lt;/redacted_thinking&gt;&#x27; ~ content[tool_pos:] -%}
            {%- else -%}
                {%- set content = content ~ &#x27;&lt;/redacted_thinking&gt;&#x27; -%}
            {%- endif -%}
        {%- endif -%}
    {%- endif -%}

    {%- set reasoning_content = &#x27;&#x27; -%}
    {# … existing reasoning extraction + interleaved-thinking render … #}
{%- endif -%}
</code></pre></div></div> <p>Above, tags match <strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></a></strong> as checked in; if you merge this into another fork, substitute your <strong>literal</strong> open/close think strings verbatim.</p> <p><strong>A branch I deliberately did not adopt:</strong> emitting a <strong>tail-only</strong> opening tag <code>&lt;redacted_thinking&gt;</code> immediately followed by a newline wrapper for assistant history when reasoning is blank but the turn qualifies for preservation. That mirrors the <strong><code class="language-plaintext highlighter-rouge">add_generation_prompt</code></strong> tail—which is appropriate at <strong>generation start</strong>—but is <strong>incorrect</strong> mid-conversation because it nests the next <code>&lt;tool_call&gt;</code> beneath an unfinished think scaffold.</p> <h2 id="practical-scope">Practical scope</h2> <ul> <li><strong>Surface area:</strong> the <code class="language-plaintext highlighter-rouge">assistant</code> message branch through the unchanged <code class="language-plaintext highlighter-rouge">tool</code> message handler—nothing else needed in my audits (system preamble, structured <code class="language-plaintext highlighter-rouge">tool_calls</code> serialization, trailing generation prompt untouched).</li> <li><strong>Interaction with knobs:</strong> with <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong>, <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=true</code></strong> is a safe option again—histories carry <strong>balanced</strong> fences after self-healing, so interleaved-thinking strip/keep semantics stay predictable. On bare <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> against 3.6 I still recommend <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> until you migrate.</li> </ul> <h2 id="what-stays-the-same-in-the-april-stack">What stays the same in the April stack</h2> <p><a href="/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html"><strong>The April launcher</strong></a> remains the blueprint for parsers, GPUs, MARLIN-aligned FP8, NCCL tweaks, <strong><code class="language-plaintext highlighter-rouge">--disable-custom-all-reduce</code></strong> on <strong>595.79</strong>, and <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> on 3.6. 
Point <strong><code class="language-plaintext highlighter-rouge">--chat-template</code></strong> at the local path of <strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></a></strong> (clone or copy from <a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/tree/main/chat-template">the <code class="language-plaintext highlighter-rouge">chat-template/</code> folder</a>); <strong><code class="language-plaintext highlighter-rouge">--default-chat-template-kwargs</code></strong> can then set <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> to <strong><code class="language-plaintext highlighter-rouge">true</code></strong> or <strong><code class="language-plaintext highlighter-rouge">false</code></strong> as you prefer (April’s <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> was keyed to <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> on 3.6, not to vLLM itself).</p> <p>Where I reran transcripts that previously reproduced leakage, executions <strong>scheduled</strong> reliably again and stray reasoning stopped surfacing downstream of repaired <code>&lt;tool_call&gt;</code> markers; the public trace and code live in <strong><a href="https://github.com/allanchan339/qwen36_27B_36jinja_project">qwen36_27B_36jinja_project</a></strong>. Others’ mileage will vary by checkpoint and client parsing, which is exactly why <strong>I publish both halves</strong>: parser ergonomics plus <strong>truthful templating</strong>, plus a <strong>repo you can clone</strong> when a blog post is not enough.</p> <h2 id="vllm-launch-recipe-qwen36-enhancedjinja-preserve_thinkingtrue">vLLM launch recipe (<code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code>, <code class="language-plaintext highlighter-rouge">preserve_thinking=true</code>)</h2> <p>Below is the <strong>vLLM</strong> recipe I use with <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> and <strong><code class="language-plaintext highlighter-rouge">preserve_thinking: true</code></strong> (the pairing this post is about). <strong>I tested this configuration on vLLM v0.19.0</strong>; newer or older releases may need small flag or env tweaks. Point <strong><code class="language-plaintext highlighter-rouge">--chat-template</code></strong> at your local copy—e.g. from <a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">chat-template/qwen3.6-enhanced.jinja</code></a>. Adjust <strong><code class="language-plaintext highlighter-rouge">source …/activate</code></strong>, <strong>GPU</strong> indices, and paths for your box. 
Lines that end with <code class="language-plaintext highlighter-rouge">\</code> plus an inline <code class="language-plaintext highlighter-rouge"># …</code> can trip some shells; drop those comments after <code class="language-plaintext highlighter-rouge">\</code> if paste fails.</p> <p>On <strong>NVIDIA Studio 595.79</strong> with <strong>mixed GPUs</strong> I still needed <strong><code class="language-plaintext highlighter-rouge">--disable-custom-all-reduce</code></strong> for stability (<a href="/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html">April note</a>); it is commented here so you can enable it without hunting the flag.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
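<span class="c"># Qwen 3.6-27B-FP8 + qwen3.6-enhanced.jinja, preserve_thinking=true</span>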
<span class="c"># vLLM v0.19.0 (recipe tested on this version)</span>
<span class="c"># ------------------------------</span>
<span class="c"># Safe, Speed-Focused Env Vars</span>
<span class="c"># ------------------------------</span>
<span class="nb">export </span><span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID  <span class="c"># mixed-GPU safeguard</span>
<span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1
<span class="nb">export </span><span class="nv">NCCL_CUMEM_ENABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">VLLM_ENABLE_CUDAGRAPH_GC</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_USE_FLASHINFER_SAMPLER</span><span class="o">=</span>1

<span class="nb">export </span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>8

<span class="c"># NCCL tuning for SYS/PCIe topology</span>
<span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_SHM_DISABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
<span class="nb">export </span><span class="nv">NCCL_P2P_LEVEL</span><span class="o">=</span>LOC
<span class="nb">export </span><span class="nv">VLLM_RPC_TIMEOUT</span><span class="o">=</span>180
<span class="nb">export </span><span class="nv">VLLM_WORKER_MULTIPROC_METHOD</span><span class="o">=</span>spawn
<span class="nb">export </span><span class="nv">MODEL_NAME</span><span class="o">=</span><span class="s2">"Qwen/Qwen3.6-27B-FP8"</span>

<span class="c"># --------------------------</span>
<span class="c"># Clean stale FlashInfer cache</span>
<span class="c"># --------------------------</span>
<span class="nb">rm</span> <span class="nt">-rf</span> ~/.cache/flashinfer

<span class="c"># Activate virtual environment (change to your path)</span>
<span class="nb">source</span> /home/cychan/vLLM/.venv/bin/activate

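<span class="c"># FP8 &amp; memory: FP8_MARLIN keeps the SM89 card on W8A16 (see April post)</span>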
<span class="nb">export </span><span class="nv">VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_SLEEP_WHEN_IDLE</span><span class="o">=</span>1

vllm serve <span class="nv">$MODEL_NAME</span> <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3.5-27B <span class="se">\</span>
  <span class="nt">--chat-template</span> qwen3.6-enhanced.jinja <span class="se">\</span>
  <span class="nt">--default-chat-template-kwargs</span> <span class="s1">'{"preserve_thinking": true}'</span> <span class="se">\</span>
  <span class="nt">--attention-backend</span> FLASHINFER <span class="se">\</span>
  <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="se">\</span>
  <span class="nt">--max-model-len</span> 219520 <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.91 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
  <span class="nt">--enable-chunked-prefill</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--max-num-batched-tokens</span> 12288 <span class="se">\</span>
  <span class="nt">--max-num-seqs</span> 4 <span class="se">\</span>
  <span class="nt">--kv-cache-dtype</span> fp8 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> qwen3_coder <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3 <span class="se">\</span>
  <span class="nt">--no-use-tqdm-on-load</span> <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--language-model-only</span>
<span class="c">#  --disable-custom-all-reduce   # uncomment on Studio 595.79 + mixed GPU if you hit NCCL deadlocks (see April post)</span>

<span class="c"># Optional: Qwen3 MTP speculative decoding (needs headroom; 80B-A3B speculator not on current hardware)</span>
<span class="c">#  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' \</span>
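<span class="c"># Optional smoke test from another shell once the server is up (illustrative, not part of the run):</span>
<span class="c">#   curl -s http://localhost:8000/v1/models</span>
<span class="c">#   curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \</span>
<span class="c">#     -d '{"model": "Qwen3.5-27B", "messages": [{"role": "user", "content": "ping"}]}'</span>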
</code></pre></div></div> <h2 id="summary">Summary</h2> <p>The flawed <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> assistant branch aimed at <strong>3.6</strong>, combined with <strong>sometimes-unclosed <code>&lt;redacted_thinking&gt;</code> markers</strong>, yielded <strong>double layering</strong> after rendering: vacant synthetic think blocks atop still-open reasoning. Downstream failures looked like ignored tools and polluted tool responses—easy to confuse with NCCL deadlocks, but not the same failure.</p> <p><strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> could <strong>look</strong> less explosive on <strong>3.5</strong> partly because an <strong>empty</strong> <code class="language-plaintext highlighter-rouge">reasoning_content</code> <strong>skipped</strong> an extra synthetic wrapper—<strong>not</strong> because it <strong>healed</strong> think markup—and partly because <strong>Qwen 3.5</strong> in my experience <strong>does not keep</strong> a <strong>thinking</strong> block <strong>alive</strong> the way <strong>3.6</strong> does, so <strong>prefix pollution</strong> rarely <strong>compounds</strong>. <strong>That skip disappeared on the faulty 3.6-on-3.5 path</strong>, and <strong>3.6</strong> <strong>does</strong> sustain interleaved thinking, so <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> on the old file masked <strong>double-layer</strong> tool failures while <strong>dirty prefixes</strong> became a <strong>first-class</strong> problem.</p> <p><strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> uses <strong>pre-split self-healing</strong> to <strong>insert</strong> the missing close where needed so <strong><code class="language-plaintext highlighter-rouge">tool_call</code></strong> is not trapped inside an unterminated think region <strong>and</strong> the serialized history is <strong>not</strong> stuck carrying an endless “still thinking” span before the tool payload. That is what lets <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> work without the old trade-off between <strong>trace fidelity</strong> and <strong>clean conditioning</strong>. <strong>Operationally</strong>, keep parser and GPU settings from April; swap <strong><code class="language-plaintext highlighter-rouge">--chat-template</code></strong> and <strong>revisit</strong> <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> as intended rather than forced off. 
<strong><a href="https://github.com/allanchan339/qwen36_27B_36jinja_project">qwen36_27B_36jinja_project</a></strong> is the end-to-end proof repository for this template path.</p> <h2 id="resources">Resources</h2> <ul> <li><strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code> (source)</a></strong> — <a href="https://raw.githubusercontent.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/main/chat-template/qwen3.6-enhanced.jinja">raw</a></li> <li><strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.5-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> (source)</a></strong></li> <li><strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix">Repository: vLLM Qwen 3 / 3.5 / 3.6 chat-template fix</a></strong> — <code class="language-plaintext highlighter-rouge">chat-template/</code></li> <li><strong><a href="https://github.com/allanchan339/qwen36_27B_36jinja_project">Proof: <code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code> agentic run — qwen36_27B_36jinja_project</a></strong></li> <li><a href="/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html">Prior field note: Qwen 3.6-27B on vLLM with <code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></a> — <a href="https://www.reddit.com/r/LocalLLM/comments/1sv6cqk/follow_up_tested_tool_calling_fixes_for_qwen/">Reddit discussion</a></li> <li><a href="https://github.com/allanchan339/qwen36_27B_own_project">April demo project (<code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> on 3.6)</a></li> </ul>]]></content><author><name></name></author><category term="bug-fixes"/><category term="vllm"/><category term="qwen"/><category term="tool-calling"/><category term="llm"/><category term="jinja"/><category term="agent"/><category term="inference"/><summary type="html"><![CDATA[Why Qwen 3.6 with qwen3.5-enhanced.jinja forced preserve_thinking=false, and how qwen3.6-enhanced.jinja restores full Qwen 3.6-series capability—self-healing think/tool boundaries, safe preserve_thinking. Launch recipe tested on vLLM v0.19.0.]]></summary></entry><entry><title type="html">Why I built this blog?</title><link href="https://allanchan339.github.io/reflection/2026/05/01/reason-blog.html" rel="alternate" type="text/html" title="Why I built this blog?"/><published>2026-05-01T21:00:00+08:00</published><updated>2026-05-01T21:00:00+08:00</updated><id>https://allanchan339.github.io/reflection/2026/05/01/reason-blog</id><content type="html" xml:base="https://allanchan339.github.io/reflection/2026/05/01/reason-blog.html"><![CDATA[<p>Why I built this blog?</p> <p>I built this website because I want one place to record my work and my growth clearly.</p> <p>I want to document four things:</p> <ul> <li>research</li> <li>journey</li> <li>bug fixes and workarounds</li> <li>reflection &amp; progress</li> </ul> <p>I have tried sharing my ideas and notes on Reddit, X, Threads, and LinkedIn. They are useful for quick updates, but they cannot fit everything I need. For example, I often need proper math display, PPT showcase, and embedded PDF rendering. 
Those are important for how I think, build, and explain technical work.</p> <p>So instead of forcing my content into platform limits, I decided to build my own website.</p> <p>This blog will mainly track my work across AI engineering, quantitative research &amp; trading infrastructure, and practical ML systems, including LLM-agent harnesses, local LLM deployment, and multi-modal models for image, video, and audio, or even combinations of them (e.g. digital humans). I am currently building end-to-end quant research workflows, and I also spend a lot of time on debugging and implementation details. I want this space to be a long-term technical notebook, not only highlights.</p> <p>If you are working on similar topics, I hope these notes can save you time, give you ideas, or start useful conversations.</p>]]></content><author><name></name></author><category term="reflection"/><category term="intro"/><category term="ai"/><category term="quant"/><summary type="html"><![CDATA[Why I built this website and what I will document here.]]></summary></entry><entry><title type="html">Qwen 3.6-27B-FP8 on vLLM: enhanced.jinja, qwen3_coder, and fixing NCCL after Studio Driver 595.79</title><link href="https://allanchan339.github.io/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html" rel="alternate" type="text/html" title="Qwen 3.6-27B-FP8 on vLLM: enhanced.jinja, qwen3_coder, and fixing NCCL after Studio Driver 595.79"/><published>2026-04-29T00:00:00+08:00</published><updated>2026-04-29T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/04/29/Qwen36-27B-tool-calling</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html"><![CDATA[<p>This post continues from <a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">my earlier notes on Qwen 3.5 tool-calling</a> and <a href="https://www.reddit.com/r/LocalLLM/comments/1sqpsut/qwen_3635ba3b_reddit_asked_so_i_tested_if_the_35/">the Qwen 3.6-35B-A3B follow-up</a>. I reused the same <code class="language-plaintext highlighter-rouge">enhanced.jinja</code> stack and ran <strong>Qwen 3.6-27B-FP8</strong> in a long unsupervised agentic session. <strong>NVIDIA Studio Driver 595.79</strong> introduced <strong>NCCL deadlocks</strong> until I added the environment and flag overrides in the sections below. After that, the run reached about <strong>180 000 tokens</strong> with no malformed tool calls. The resulting project is <a href="https://github.com/allanchan339/qwen36_27B_own_project">on GitHub</a>.</p> <p>The <code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> template <strong>requires</strong> <code class="language-plaintext highlighter-rouge">preserve_thinking=false</code> under Qwen 3.6 (a new surface-area flag). With <code class="language-plaintext highlighter-rouge">preserve_thinking=true</code>, that template breaks and tool calls fail. The rest of this note assumes the flag is set to <code class="language-plaintext highlighter-rouge">false</code>.</p> <h2 id="background-what-already-worked-on-qwen-35">Background: what already worked on Qwen 3.5</h2> <p>Earlier work on <strong>Qwen 3.5-27B</strong> and <strong>35B-A3B</strong> used an <strong>RTX 4090</strong> and <strong>RTX 3090</strong> together. 
The configuration that made long agentic runs reliable included:</p> <ul> <li><strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> — interleaved-thinking template that treats an <strong>unclosed <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> block as plain content</strong>, not reasoning-only text, so the harness still sees tool output when the model omits the closing tag (“CoT leakage”). <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> is required (and was the workable default for this path).</li> <li><strong>Streaming tool-call parsing</strong> — the template assumes tokens are parsed as they arrive so <strong><code class="language-plaintext highlighter-rouge">&lt;tool_call&gt;</code></strong> can be recognized while <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> is still open. On <strong>Qwen 3.5-27B</strong>, <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> behaved well and was the more robust option. On <strong>Qwen 3.6</strong>, <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code> did not emit tool calls</strong> in that unclosed-thinking situation; <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> did.</li> <li><strong><code class="language-plaintext highlighter-rouge">VLLM_TEST_FORCE_FP8_MARLIN=1</code></strong> — keeps the <strong>4090 (SM89)</strong> on <strong>W8A16</strong> instead of native <strong>W8A8</strong>, avoiding precision drift across the two GPUs.</li> <li><strong>NCCL tuning</strong> (<code class="language-plaintext highlighter-rouge">P2P_DISABLE</code>, <code class="language-plaintext highlighter-rouge">IB_DISABLE</code>, <code class="language-plaintext highlighter-rouge">Ring</code>) for stability on PCIe topologies.</li> </ul> <p>With that setup, <strong>Qwen 3.5-27B</strong> completed a <strong>1h 9m</strong> agentic session at <strong>138K</strong> tokens and built a <strong>FastAPI + React</strong> application without tool-calling failures.</p> <h2 id="moving-to-qwen-36-27b-and-changing-the-parser">Moving to Qwen 3.6-27B and changing the parser</h2> <p>I pointed the same server at <strong><code class="language-plaintext highlighter-rouge">Qwen/Qwen3.6-27B-FP8</code></strong>, still using <strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code></strong> and <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong>. <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong>, which was preferable on 3.5, <strong>did not trigger tool calls</strong> when <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> stayed open—the case the template is designed for—so I moved the <strong><code class="language-plaintext highlighter-rouge">--tool-call-parser</code></strong> to <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong>, which streams more aggressively and still picks up the tool call inside the unclosed block. A related fix is discussed in <a href="https://www.reddit.com/r/Vllm/comments/1suasv2/comment/oi02krw/?context=1">this vLLM thread</a>; a future <strong>vLLM 0.20.1</strong> release might let me revisit <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> for this workload.</p> <p>On <strong>driver 591.86</strong> the stack was stable. 
After moving to <strong>Studio Driver 595.79</strong>, the server began hitting <strong>NCCL deadlocks</strong>: hard freezes mid-generation that required a restart. In logs the failure showed up as <strong>NCCL timeouts</strong>, not parser errors, so it was easy to confuse with tool-calling regressions.</p> <h2 id="why-qwen3_coder-pairs-with-this-template-on-36">Why <code class="language-plaintext highlighter-rouge">qwen3_coder</code> pairs with this template on 3.6</h2> <p>Two separate behaviors interact usefully here:</p> <ul> <li><strong>Model output:</strong> Qwen 3.6 sometimes emits a <strong><code class="language-plaintext highlighter-rouge">&lt;tool_call&gt;</code></strong> before closing <strong><code class="language-plaintext highlighter-rouge">&lt;/thinking&gt;</code></strong>. The enhanced template leaves that tool call in <strong>plain content</strong> so downstream code can still parse it.</li> <li><strong>Parser behavior:</strong> <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> is tuned for code-like streams and detects tool-call patterns <strong>mid-stream</strong> even when the XML framing is incomplete: it does not need a fully closed <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> section and will fire on <strong><code class="language-plaintext highlighter-rouge">&lt;tool_call&gt;</code></strong>. <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> behaves more like strict XML; an unclosed <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> can block it from surfacing the nested <strong><code class="language-plaintext highlighter-rouge">&lt;tool_call&gt;</code></strong>.</li> </ul> <p>Together, that yields <strong>more resilient extraction</strong> for this template on 3.6 than either piece alone. Neither behavior is ideal in isolation; combined they amount to a <strong>production-viable</strong> setup. I therefore keep <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> for <strong>Qwen 3.6</strong> with <strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code></strong>, while <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> remains the better fit for <strong>3.5-27B</strong> on my hardware. If this interpretation disagrees with how the maintainers model the parsers, I am glad to be corrected.</p> <h2 id="driver-59579-nccl-and-vllm-all-reduce">Driver 595.79, NCCL, and vLLM all-reduce</h2> <p>I had been on <strong>591.86</strong> with acceptable behavior. <strong>595.79</strong> introduced <strong>NCCL deadlocks</strong> that froze generation. My working hypothesis is that the newer driver tightens <strong>NCCL</strong> behavior on <strong>mixed-GPU PCIe</strong> topologies enough to break vLLM’s <strong>custom all-reduce</strong> path. I did not roll back the driver; instead I applied:</p> <ol> <li><strong>Additional environment variables:</strong> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">NCCL_SHM_DISABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">NCCL_P2P_LEVEL</span><span class="o">=</span>LOC          <span class="c"># restrict P2P to local GPUs only</span>
<span class="nb">export </span><span class="nv">VLLM_RPC_TIMEOUT</span><span class="o">=</span>180        <span class="c"># prevent premature RPC timeouts</span>
<span class="nb">export </span><span class="nv">VLLM_WORKER_MULTIPROC_METHOD</span><span class="o">=</span>spawn  <span class="c"># more robust worker lifecycle</span>
</code></pre></div> </div> </li> <li><strong><code class="language-plaintext highlighter-rouge">--disable-custom-all-reduce</code></strong> on <strong><code class="language-plaintext highlighter-rouge">vllm serve</code></strong>, forcing <strong>native NCCL</strong> all-reduce instead of vLLM’s custom path on this <strong>PCIe-only</strong> topology.</li> </ol> <p>Without these settings on <strong>595.79</strong>, I saw <strong>intermittent deadlocks</strong> that resembled tool-calling failures in the UI but were not parser issues.</p> <h2 id="long-run-180k-tokens">Long run: 180K tokens</h2> <p>With the driver and NCCL changes in place, I gave <strong>Qwen 3.6-27B</strong> ownership of a directory and a <strong>10 000-token</strong> budget per step, without manual steering.</p> <table> <thead> <tr> <th>Prompt</th> <th>Wall time</th> <th>Accumulated tokens</th> </tr> </thead> <tbody> <tr> <td>“Welcome to life, you are Qwen 3.6-27B. Full leadership. What project do you want to build?”</td> <td>0s</td> <td>0k</td> </tr> <tr> <td>“Don’t ask me – you have full leadership. 10k token budget.” <em>(model used a Question tool to clarify, then proceeded)</em></td> <td>31s</td> <td>14.0k</td> </tr> <tr> <td>“Did you check if this is bug-free? It’s your own project.”</td> <td>17m 13s</td> <td>63.3k</td> </tr> <tr> <td>“Deliver the first possible functional upgrade. Do it nicely.”</td> <td>11m 35s</td> <td>126.7k</td> </tr> <tr> <td><em>(session ended naturally)</em></td> <td>10m 46s</td> <td><strong>180.0k</strong></td> </tr> </tbody> </table> <p>The model built a <strong>React + Vite + TypeScript</strong> front end with a <strong>FastAPI</strong> backend, revised it after critical feedback, and shipped a further upgrade. I did not observe a <strong>malformed tool call</strong> in that trace. Code: <a href="https://github.com/allanchan339/qwen36_27B_own_project">qwen36_27B_own_project</a>.</p> <h2 id="launch-script">Launch script</h2> <p>The script below is the same one I published in the discussion thread. Lines that end with <code class="language-plaintext highlighter-rouge">\</code> followed by an inline <code class="language-plaintext highlighter-rouge">#</code> comment can confuse some shells; if paste fails, drop the comments after <code class="language-plaintext highlighter-rouge">\</code> and keep the flags.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># -------------------------------------------------</span>
<span class="c"># Qwen 3.6-27B-FP8 – Agentic-Ready vLLM Launch Script</span>
<span class="c"># Tested: 180K tokens, zero tool-calling failures</span>
<span class="c"># Driver: NVIDIA Studio 595.79</span>
<span class="c"># -------------------------------------------------</span>

<span class="c"># ---- Safe, Speed-Focused Env Vars ----</span>
<span class="nb">export </span><span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID
<span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1
<span class="nb">export </span><span class="nv">NCCL_CUMEM_ENABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">VLLM_ENABLE_CUDAGRAPH_GC</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_USE_FLASHINFER_SAMPLER</span><span class="o">=</span>1

<span class="nb">export </span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>8

<span class="c"># ---- NCCL Tuning for SYS/PCIe Topology ----</span>
<span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_SHM_DISABLE</span><span class="o">=</span>0          <span class="c"># NEW for driver 595.79</span>
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
<span class="nb">export </span><span class="nv">NCCL_P2P_LEVEL</span><span class="o">=</span>LOC          <span class="c"># NEW for driver 595.79</span>

<span class="c"># ---- vLLM Stability (Driver-Dependent) ----</span>
<span class="nb">export </span><span class="nv">VLLM_RPC_TIMEOUT</span><span class="o">=</span>180                  <span class="c"># NEW</span>
<span class="nb">export </span><span class="nv">VLLM_WORKER_MULTIPROC_METHOD</span><span class="o">=</span>spawn    <span class="c"># NEW</span>

<span class="c"># ---- FP8 &amp; Memory ----</span>
<span class="nb">export </span><span class="nv">VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_SLEEP_WHEN_IDLE</span><span class="o">=</span>1

<span class="c"># Clean stale FlashInfer cache</span>
<span class="nb">rm</span> <span class="nt">-rf</span> ~/.cache/flashinfer

<span class="c"># Activate environment</span>
<span class="nb">source</span> /home/cychan/vLLM/.venv/bin/activate

<span class="c"># MANDATORY: keep preserve_thinking false; the enhanced jinja breaks when it is true</span>
<span class="c"># REQUIRED: qwen3_coder for Qwen 3.6 with enhanced.jinja; on Qwen 3.5-27B qwen3_xml also works</span>
<span class="c"># (see https://www.reddit.com/r/Vllm/comments/1suasv2/)</span>
vllm serve Qwen/Qwen3.6-27B-FP8 <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3.5-27B <span class="se">\</span>
  <span class="nt">--chat-template</span> qwen3.5-enhanced.jinja <span class="se">\</span>
  <span class="nt">--default-chat-template-kwargs</span> <span class="s1">'{"preserve_thinking": false}'</span> <span class="se">\</span>
  <span class="nt">--attention-backend</span> FLASHINFER <span class="se">\</span>
  <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="se">\</span>
  <span class="nt">--max-model-len</span> 219520 <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.91 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
  <span class="nt">--enable-chunked-prefill</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--max-num-batched-tokens</span> 12288 <span class="se">\</span>
  <span class="nt">--max-num-seqs</span> 4 <span class="se">\</span>
  <span class="nt">--kv-cache-dtype</span> fp8 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> qwen3_coder <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3 <span class="se">\</span>
  <span class="nt">--no-use-tqdm-on-load</span> <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--language-model-only</span> <span class="se">\</span>
  <span class="nt">--disable-custom-all-reduce</span>            <span class="c"># CRITICAL for driver 595.79</span>
</code></pre></div></div> <h2 id="summary">Summary</h2> <p>The <strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code></strong> path needs a <strong>streaming</strong> tool parser that still fires when <strong><code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code></strong> is left open. On <strong>Qwen 3.5-27B</strong>, <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> met that requirement and stayed the more robust option in my tests (<a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">detail</a>). On <strong>Qwen 3.6</strong>, <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> missed tool calls in that pattern, so <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> was necessary.</p> <p><strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> must stay <strong><code class="language-plaintext highlighter-rouge">false</code></strong> for <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> on Qwen 3.6; <strong><code class="language-plaintext highlighter-rouge">true</code></strong> is incompatible with that template in my setup.</p> <p><strong>591.86 → 595.79</strong> introduced <strong>NCCL deadlocks</strong> on the mixed <strong>4090/3090</strong> box. Mitigations were <strong><code class="language-plaintext highlighter-rouge">NCCL_SHM_DISABLE=0</code></strong>, <strong><code class="language-plaintext highlighter-rouge">NCCL_P2P_LEVEL=LOC</code></strong>, <strong><code class="language-plaintext highlighter-rouge">VLLM_RPC_TIMEOUT=180</code></strong>, <strong><code class="language-plaintext highlighter-rouge">VLLM_WORKER_MULTIPROC_METHOD=spawn</code></strong>, and <strong><code class="language-plaintext highlighter-rouge">--disable-custom-all-reduce</code></strong>. Without them, deadlocks can look like tool failures.</p> <p><strong><code class="language-plaintext highlighter-rouge">VLLM_TEST_FORCE_FP8_MARLIN=1</code></strong> and <strong>NCCL</strong> tuning remain mandatory for mixed FP8 ranks; the driver upgrade added the extra variables above.</p> <p><strong>Qwen 3.6-27B</strong> is a dense <strong>27B</strong> checkpoint that, on the public agentic-coding figures I cite, beats <strong>Qwen 3.5-397B-A17B</strong> (<strong>SWE-bench Verified</strong> 77.2 vs 76.2, <strong>Pro</strong> 53.5 vs 50.9, <strong>SkillsBench</strong> 48.2 vs 30.0)—a larger step than a minor revision.</p> <p>In one continuous session the stack sustained about <strong>180K</strong> tokens with <strong>no tool-calling errors</strong> and roughly <strong>10 minutes</strong> of uninterrupted agentic use on consumer GPUs, which matches what I wanted from a production-minded local setup.</p> <p>The same template works on <strong>Qwen 3.5</strong> and <strong>3.6</strong> if <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> and the parser matches the model generation. With the <strong>595.79</strong> workarounds, I get stable <strong>180K-token</strong> agentic runs. 
Step-by-step background and earlier tuning notes are in the <a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">Qwen 3.5 deep-dive</a>.</p> <h2 id="resources">Resources</h2> <ul> <li><a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">Qwen 3.5 tool-calling deep-dive</a></li> <li><a href="https://www.reddit.com/r/Vllm/comments/1suasv2/">Parser behaviour: Qwen 3.5 vs 3.6</a></li> <li><a href="https://github.com/allanchan339/qwen36_27B_own_project">Qwen 3.6-27B project repository</a></li> <li><a href="https://github.com/allanchan339/vLLM-Qwen3.5-27B">vLLM / Qwen config repository</a></li> </ul>]]></content><author><name></name></author><category term="bug-fixes"/><category term="vllm"/><category term="qwen"/><category term="tool-calling"/><category term="llm"/><category term="agent"/><category term="inference"/><summary type="html"><![CDATA[Same qwen3.5-enhanced.jinja and mixed-GPU stack as earlier Qwen 3.5 notes; switching to qwen3_coder for 3.6, mandatory preserve_thinking=false, and NCCL overrides that stopped deadlocks on NVIDIA Studio 595.79—plus a 180k-token agentic run.]]></summary></entry><entry><title type="html">Findings: Karpathy-style autoresearch on a crypto backtester (local LLM)</title><link href="https://allanchan339.github.io/research/2026/04/24/MVP-LLM-alpha-mining.html" rel="alternate" type="text/html" title="Findings: Karpathy-style autoresearch on a crypto backtester (local LLM)"/><published>2026-04-24T00:00:00+08:00</published><updated>2026-04-24T00:00:00+08:00</updated><id>https://allanchan339.github.io/research/2026/04/24/MVP-LLM-alpha-mining</id><content type="html" xml:base="https://allanchan339.github.io/research/2026/04/24/MVP-LLM-alpha-mining.html"><![CDATA[<p>The thread I started from was whether <strong>Karpathy-style autoresearch</strong>—an LLM in a tight <strong>read / act / evaluate</strong> loop—could apply to <strong>quant mining</strong> on my own stack: <strong>local</strong> inference, <strong>crypto</strong> data, a <strong>custom backtester</strong>, no paid API for the search itself. I connected a <strong>Qwen 3.5</strong> agent to that backtester and DB, let it run for <strong>~2 hours</strong> and <strong>30+</strong> strategy cycles with <strong>no per-step prompting</strong>, and watched it <strong>self-learn</strong> the repo, burn through bad ideas, and land <strong>one</strong> configuration that cleared my gates (<strong>$0</strong> inference bill for that grind). It was <strong>not</strong> a fully closed loop: when it stalled in weak local optima, I used <strong>human-in-the-loop</strong> nudges (short instructions, ideas from notes or papers) and it folded those into the next iterations without restarting the harness.</p> <p>The point of this note is the <strong>trail</strong>: what blocked, what the loop did, and what I take away for the next run. 
Nothing here claims the strategy generalizes out-of-sample.</p> <p><strong>Recording</strong> (<a href="https://youtu.be/aEvj0SiU6WI">YouTube</a>)—captures how the autonomous loop behaved on my stack:</p> <div style="position:relative;padding-bottom:56.25%;height:0;overflow:hidden;max-width:100%;margin:1rem 0;"> <iframe style="position:absolute;top:0;left:0;width:100%;height:100%;border:0;" src="https://www.youtube.com/embed/aEvj0SiU6WI" title="Karpathy-style autoresearch — local LLM on crypto backtester" loading="lazy" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe> </div> <h2 id="what-i-was-testing">What I was testing</h2> <p><strong>Hypothesis:</strong> The same <strong>generate → backtest → gate → decide</strong> skeleton Karpathy describes for research automation is enough here for the model to learn my strategy API from the codebase, iterate without per-step prompting, and eventually hit <strong>strict</strong> metrics (Sharpe, max drawdown, profit factor, minimum trades).</p> <p><strong>What I wanted out of the experiment:</strong> A <strong>working skeleton</strong>—stable tools, objective discard, bounded wall clock—not a proof that the first passing parameter set is economic edge. I also wanted to see whether <strong>human-in-the-loop</strong> could stay <strong>lightweight</strong>: occasional steering instead of babysitting every iteration.</p> <h2 id="setup-fixed-for-the-run">Setup (fixed for the run)</h2> <table> <thead> <tr> <th>Component</th> <th>Detail</th> </tr> </thead> <tbody> <tr> <td><strong>Model</strong></td> <td>Qwen3.5-27B via vLLM</td> </tr> <tr> <td><strong>GPUs</strong></td> <td>Mixed setup (RTX 4090 + 3090)</td> </tr> <tr> <td><strong>Inference</strong></td> <td>vLLM with FP8, custom Jinja template, <code class="language-plaintext highlighter-rouge">qwen3_xml</code> parser</td> </tr> <tr> <td><strong>Runtime</strong></td> <td>~2 hours; autonomous iterations, <strong>human-in-the-loop</strong> when stuck</td> </tr> <tr> <td><strong>Iterations</strong></td> <td>~30+ strategy cycles</td> </tr> <tr> <td><strong>API cost</strong></td> <td>$0</td> </tr> </tbody> </table> <ul> <li><strong>Data:</strong> Crypto spot, 1m bars; TimescaleDB-HA for continuous aggregation.</li> <li><strong>Backtester:</strong> Custom DB + engine (Nautilus-based).</li> <li><strong>Harness:</strong> Ralph loop + Claude Code as execution harness.</li> </ul> <h2 id="finding-tool-calling-was-the-real-prerequisite">Finding: tool calling was the real prerequisite</h2> <p>Before any autoresearch, Qwen 3.5 in agent mode was <strong>unreliable</strong> for me: premature stops, mid-thought tool calls, format drift—unacceptable for multi-hour loops.</p> <p><strong>What eventually worked:</strong> A <strong>custom M2.5-style Jinja template</strong>, the <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code> parser</strong>, and <strong>precision alignment</strong> across the mixed GPU pair. Calendar time on this was on the order of <strong>weeks</strong>.</p> <p>Reference I kept for myself while debugging: <a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">Qwen 3.5 27B/35B tool calling on vLLM</a>. 
Without that stack, I would not rerun the experiment and expect the same stability.</p> <h2 id="procedure-what-i-actually-ran">Procedure (what I actually ran)</h2> <p>Each iteration:</p> <ol> <li><strong>Generate</strong> strategy code + YAML config.</li> <li><strong>Backtest</strong> immediately.</li> <li><strong>Score</strong> against the gates.</li> <li><strong>Decide:</strong> abandon, fix error, or change approach.</li> </ol> <p><strong>Observation:</strong> The useful part is not the four bullets in isolation—it is that the same interface and discard semantics repeat every cycle so the run has a <strong>trajectory</strong> instead of isolated guesses.</p> <h2 id="observations-from-the-long-run">Observations from the long run</h2> <p><strong>Bootstrap.</strong> The agent searched the repo, read existing strategies, matched the template, and produced a <code class="language-plaintext highlighter-rouge">run_backtest()</code>-compatible scaffold before leaning on novel ideas.</p> <p><strong>Regime coverage.</strong> Under my high-level brief it systematically tried <strong>momentum</strong> (EMA crosses, breakouts, ATR stops) and <strong>mean reversion</strong> (RSI, Bollinger, range fades). Most candidates failed fast on the gates; WandB showed <strong>DISCARD</strong> with clear numeric causes (Sharpe, drawdown, trade count).</p> <p><strong>Adaptation.</strong> When returns were weak but non-terrible, it <strong>tuned parameters</strong> (EMA lengths, sizing, vol filters). It also <strong>moved the backtest start date</strong> on its own to get more history.</p> <p><strong>Finding on the date move:</strong> That behavior is a <strong>leak / overfitting lever</strong> if left unrestricted. My takeaway is to <strong>freeze</strong> evaluation windows and walk-forward rules in the harness so the model cannot silently widen the training corridor.</p> <p><strong>Grind.</strong> Over ~2 hours it cycled EMA stacks (10/20, 5/15, 50/200), RSI + confirmation, MACD, 4h Bollinger reversion, vol-adjusted trend with jump detection, grid-style scalps, fixed R/R templates, long/short EMA divergence—<strong>dozens</strong> of variants, all discarded on the numbers until late in the run.</p> <p><strong>Eligibility.</strong> After <strong>~30+</strong> iterations, <strong>one</strong> configuration cleared all gates; I stopped there for this MVP.</p> <h2 id="human-in-the-loop-what-actually-happened">Human-in-the-loop (what actually happened)</h2> <p>The loop was <strong>mostly hands-off</strong>: I did not craft each strategy or click through each backtest. The agent ran the <strong>generate → backtest → gate → decide</strong> cycle on its own until the trajectory looked <strong>stuck</strong>—same family of tweaks, marginal metrics, no path to the gates.</p> <p>When I intervened, the input was <strong>small and textual</strong>: one-line briefs like <em>volatility-adjusted position sizing</em> or <em>dual trend filters</em>, or a concept lifted from a paper or my own notes. The agent <strong>ingested</strong> that and <strong>reprioritized</strong> the next hypotheses without me editing the harness or resetting state.</p> <p><strong>Finding:</strong> For this run, <strong>human-in-the-loop</strong> was the escape hatch from <strong>local optima</strong>, not a substitute for the loop. 
The economics still felt like automation: hours of machine time between nudges, and <strong>$0</strong> marginal API cost for the search itself.</p> <h2 id="comparative-note-ga-why-i-contrast-it">Comparative note: GA (why I contrast it)</h2> <p>I keep <strong>genetic algorithms</strong> in mind as a baseline: genome encode, <strong>mutation / crossover</strong>, batch backtest, selection—<strong>no</strong> explicit operator that reads a failure and chooses among <em>abandon</em>, <em>patch</em>, or <em>new hypothesis</em>; steering is emergent from selection.</p> <p><strong>Finding for my setting (multi-constraint gates, not one fitness scalar):</strong></p> <ul> <li><strong>Cost:</strong> Generations imply <strong>batches</strong> of backtests; many individuals are obviously bad but consume slots until selection removes them.</li> <li><strong>Completion:</strong> A run often ends without any individual that <strong>simultaneously</strong> satisfies Sharpe, drawdown, trade count, etc.; “best so far” still fails a gate.</li> </ul> <p><strong>Contrast I recorded:</strong> The LLM loop ends each step with a <strong>decision conditioned on the backtest and repo context</strong>. That does not remove overfitting, but in this run it gave a <strong>bounded</strong> path to one gate-passing config without a large per-generation fan-out.</p> <h2 id="findings-i-am-carrying-forward">Findings I am carrying forward</h2> <ol> <li><strong>Integration over single-strategy hype.</strong> The durable artifact is <strong>tool-stable loop + backtester + gates + harness</strong>, not the first passing parameter set.</li> <li><strong>Autonomous date-window edits are a policy bug</strong> unless intentionally allowed; they read as implicit curve fitting in my book.</li> <li><strong>Discard speed matters.</strong> The model moved on immediately on bad metrics; that matched what I want from mining hygiene.</li> <li><strong>Reuse worked.</strong> It recombined ideas already present in the repo and notes (e.g. vol-adjusted trend with jump suppression).</li> <li><strong>Economics.</strong> ~2h local inference, <strong>$0</strong> API line item—material for how long I am willing to let a search run.</li> <li><strong>Human-in-the-loop scales the search.</strong> Mid-run nudges broke local optima; the agent absorbed paper- or note-level hints without harness churn. I am keeping that pattern: <strong>autonomous bulk + sparse human steering</strong>, not full manual mining.</li> <li><strong>Diversity risk.</strong> With the <strong>same</strong> harness, prompts, and data, an LLM driver can still <strong>converge</strong> to the same or nearly the same strategy across runs—inductive bias and “default” repairs narrow the trajectory. 
I am treating <strong>perturbation</strong> (sampling, seeds, varied briefs / inits, parallel nudges, small prompt jitter) as <strong>part of the method</strong>, not an optional polish.</li> </ol> <h2 id="limitations-and-planned-follow-ups">Limitations and planned follow-ups</h2> <ul> <li><strong>Eligibility is not alpha:</strong> Clearing the backtest gates is not, by itself, a claim of edge out-of-sample.</li> <li><strong>Next checks:</strong> Lock windows, walk-forward or holdout, and use “passes gates once” as a <strong>regression signal for the harness</strong>, not as validation of edge.</li> </ul>]]></content><author><name></name></author><category term="research"/><category term="llm"/><category term="quant"/><category term="backtesting"/><category term="vllm"/><category term="qwen"/><category term="automation"/><category term="crypto"/><category term="local-inference"/><summary type="html"><![CDATA[Local Qwen 3.5 autoresearch on my crypto DB + Nautilus-style backtester (~2h, 30+ iter, $0 API): tool-calling blocker, run observations, human-in-the-loop steering, GA contrast, diversity, gates.]]></summary></entry><entry><title type="html">Qwen 3.6 35B-A3B on vLLM: do the Qwen 3.5 tool-calling fixes carry over?</title><link href="https://allanchan339.github.io/bug-fixes/2026/04/20/Qwen36-35B-A3B-tool-calling.html" rel="alternate" type="text/html" title="Qwen 3.6 35B-A3B on vLLM: do the Qwen 3.5 tool-calling fixes carry over?"/><published>2026-04-20T00:00:00+08:00</published><updated>2026-04-20T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/04/20/Qwen36-35B-A3B-tool-calling</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/04/20/Qwen36-35B-A3B-tool-calling.html"><![CDATA[<p>This note follows the earlier <a href="https://www.reddit.com/r/vLLM/comments/1skks8n/">Qwen 3.5 / vLLM discussion</a>. After weeks spent stabilizing <strong>Qwen 3.5-27B/35B</strong> for agentic use—the same <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> parser, <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> template, and GPU-side tuning—readers kept asking whether <strong>Qwen 3.6</strong> behaved the same.</p> <p><strong>Short answer:</strong> the <strong>same configuration remains the most stable option</strong> on <strong>Qwen3.6-35B-A3B-FP8</strong>, but compared with <strong>Qwen3.5-27B</strong> the newer checkpoint is <strong>more prone to reasoning loops</strong> and to <strong>malformed tool calls</strong> that <strong>interrupt</strong> long agent runs. This post documents what still works, what I measured in three controlled runs, a <strong>partial mitigation</strong> on the client side, and why <strong>Qwen3.5-27B-FP8</strong> is still <strong>my</strong> default for <strong>reliability-first</strong> agents.</p> <h1 id="what-still-carries-over-from-the-35-setup">What still carries over from the 3.5 setup</h1> <h2 id="qwen3_xml-tool-call-parser"><code class="language-plaintext highlighter-rouge">qwen3_xml</code> tool-call parser</h2> <p>The registry-backed parser continues to handle <strong>complex tool arguments</strong> without the corruption seen with regex-oriented paths. 
<strong>Official documentation still recommends <code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong>; for this workload the recommendation remains <strong>not</strong> to use it for demanding agentic traces.</p> <h2 id="qwen35-enhancedjinja-chat-template"><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> chat template</h2> <p>The <strong>interleaved thinking</strong> template still applies to <strong>3.6 35B-A3B</strong>: correct <strong><code class="language-plaintext highlighter-rouge">&lt;/thinking&gt;</code></strong> boundaries and <strong>clean tool-call framing</strong> relative to the stock template.</p> <h2 id="mixed-gpu-precision-alignment">Mixed-GPU precision alignment</h2> <p><strong>RTX 4090 (SM89)</strong> prefers <strong>W8A8</strong> paths; <strong>RTX 3090 (SM80)</strong> falls back to <strong>W8A16</strong>. <strong><code class="language-plaintext highlighter-rouge">VLLM_TEST_FORCE_FP8_MARLIN=1</code></strong> still forces both ranks onto a <strong>matched</strong> effective precision. <strong>Without it, long conversations drift</strong>—the same failure mode as on 3.5.</p> <h2 id="nccl-tuning">NCCL tuning</h2> <p>Unchanged for this <strong>mixed consumer topology</strong>:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
</code></pre></div></div> <h1 id="agentic-test-protocol">Agentic test protocol</h1> <p>Each <strong>trial</strong> used the <strong>same prompt shape</strong>: grant <strong>full ownership of the working folder</strong>, ask the model to <strong>build a full-stack project</strong> (frontend and backend), and cap planning around a <strong>~10k-token budget</strong> for the initial instruction. The goal was to observe <strong>how long the stack survives</strong> before <strong>tool-calling or format errors</strong> end the session.</p> <h1 id="three-runs-same-hardware-different-vllm-templateparser-pairs">Three runs (same hardware, different vLLM template/parser pairs)</h1> <h2 id="run-1-enhancedjinja--qwen3_xml-best-observed-config-file-on-disk-qwen35-enhancedjinja">Run 1: <code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code> (best observed config; file on disk: <code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code>)</h2> <p>The model chose to build <strong>oss-inspect</strong>—an <strong>autonomous codebase quality analysis</strong> tool (project name as generated by the model).</p> <table> <thead> <tr> <th>Prompt / phase</th> <th>Accumulated tokens</th> </tr> </thead> <tbody> <tr> <td>Project setup</td> <td>13.9k</td> </tr> <tr> <td>“Did you check if this is bug free? This is your own project.”</td> <td>135.1k</td> </tr> <tr> <td>DCP sweep auto-triggered</td> <td>107.0k</td> </tr> <tr> <td>“Fix it then”</td> <td>110.0k</td> </tr> <tr> <td><strong>Session ended</strong> — improper tool calling</td> <td>111.1k</td> </tr> </tbody> </table> <p>This configuration <strong>lasted the longest</strong> (<strong>~13m 20s</strong> wall time to failure). It reached <strong>~130k+ tokens</strong> in the productive phase before <strong>improper tool calling</strong> terminated the run. After a <strong>DCP sweep</strong> at <strong>135k</strong> (tokens reported <strong>dropped to 107k</strong> in the log), the session <strong>continued</strong> until the final failure at <strong>111.1k</strong> on the last line—<strong>improper tool calling</strong> again.</p> <p><strong>Baseline comparison:</strong> <strong>Qwen3.5-27B</strong> with the <strong>same</strong> <code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code> pairing <strong>routinely exceeds 130k tokens without that class of interruption</strong> in <strong>my</strong> runs.</p> <h2 id="run-2-officialjinja--qwen3_coder">Run 2: <code class="language-plaintext highlighter-rouge">official.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_coder</code></h2> <p>The model proposed a <strong>knowledge-graph platform</strong> oriented toward <strong>Graphify</strong>-style skill ingestion—the ingestion behavior was <strong>aggressive</strong> relative to expectations.</p> <p><strong>Time to failure: 6m 32s</strong> — <strong>improper tool calling</strong>. 
Too early to trust for <strong>long-horizon</strong> agentic work.</p> <h2 id="run-3-officialjinja--qwen3_xml">Run 3: <code class="language-plaintext highlighter-rouge">official.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code></h2> <p>The model proposed <strong>TaskFlow</strong>—a <strong>Kanban</strong> app with <strong>authentication</strong>, <strong>drag-and-drop</strong> tasks, and a <strong>polished UI</strong>.</p> <p><strong>Time to failure: 1m 16s</strong> — <strong>malformed tool calls emitted inside the thinking block</strong>. Again <strong>unsuitable</strong> for reliable agents.</p> <h3 id="note-on-generated-stacks">Note on generated stacks</h3> <p>For the <strong>concrete frameworks and libraries</strong> the model selected in these runs, I did <strong>not</strong> have prior familiarity; observations are about <strong>tool protocol stability</strong>, not stack-specific code review.</p> <h2 id="cross-run-comparison">Cross-run comparison</h2> <table> <thead> <tr> <th>Configuration</th> <th>Survival (approx.)</th> <th>Failure mode</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code></td> <td>~111k tokens (~13m 20s)</td> <td>Improper tool calling (session died)</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">official.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_coder</code></td> <td>6m 32s</td> <td>Improper tool calling</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">official.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code></td> <td>~1m 16s</td> <td>Malformed tool calls inside thinking box</td> </tr> </tbody> </table> <p><strong>Summary:</strong> even the <strong>best</strong> 3.6 configuration <strong>fails more often</strong> than <strong>Qwen3.5-27B</strong> under the same harness. <strong>Qwen3.5-27B</strong> remains <strong>more stable for agentic</strong> use in these tests, <strong>despite slower TTFT</strong> on 3.5.</p> <h1 id="behaviors-that-look-specific-to-qwen36-35b-a3b">Behaviors that look specific to Qwen3.6-35B-A3B</h1> <h2 id="1-more-frequent-reasoning-loops">1. More frequent reasoning loops</h2> <p>The model <strong>revisits the same analysis step</strong> repeatedly, <strong>burning tokens</strong> before advancing. This reads as a <strong>checkpoint behavior</strong> change, not a template bug: <strong>Qwen3.5-27B</strong> showed the pattern <strong>occasionally</strong>; on <strong>3.6 35B-A3B</strong> it is <strong>common enough</strong> to <strong>hurt long sessions</strong>.</p> <h2 id="2-malformed-tool-calls-despite-a-correct-wire-format">2. Malformed tool calls despite a “correct” wire format</h2> <p>With <strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong>—the pair that <strong>works cleanly</strong> on <strong>3.5-27B</strong>—<strong>3.6 35B-A3B</strong> still produces <strong>malformed tool calls</strong> at <strong>higher frequency</strong>. 
The <strong>XML shape</strong> can remain <strong>technically valid</strong> when it succeeds; the problem is <strong>how often</strong> failures occur and that <strong>one bad turn</strong> can <strong>abort</strong> a run with <strong>no recovery</strong>.</p> <p>On <strong>3.5-27B</strong>, after template fixes, <strong>bad tool turns</strong> are a <strong>rare</strong> edge case. On <strong>3.6 35B-A3B</strong>, they are <strong>regular enough</strong> that <strong>any</strong> long agentic session <strong>eventually</strong> hits them, <strong>independent of which</strong> of the tested template/parser combinations is selected.</p> <h1 id="partial-mitigation-opencode-1418">Partial mitigation: OpenCode 1.4.18</h1> <p><strong>OpenCode 1.4.18</strong> reduced <strong>client-side</strong> tool friction. <strong>Older OpenCode</strong> versions had <strong>tool-calling bugs</strong> that <strong>amplified</strong> failures—especially around the <strong><code class="language-plaintext highlighter-rouge">question</code></strong> tool. <strong>Upgrading to 1.4.18</strong> addressed <strong>that</strong> class of <strong>malformed tool call</strong> interaction.</p> <p><strong>Limitation:</strong> the client upgrade <strong>does not remove</strong> <strong>reasoning loops</strong> or the <strong>model-level</strong> <strong>higher baseline failure rate</strong> on <strong>3.6</strong>. The remaining issues sit primarily in the <strong>model</strong> (and possibly <strong>thinking-state handling</strong>—e.g. <strong>preserved thinking</strong>—but that is <strong>a hypothesis</strong>, not a confirmed root cause).</p> <h1 id="reference-environment-and-vllm-serve-command">Reference environment and <code class="language-plaintext highlighter-rouge">vllm serve</code> command</h1> <p><strong>Software pinned for the run:</strong></p> <ul> <li><strong>vLLM:</strong> 0.19.1</li> <li><strong>Transformers:</strong> 5.5.4</li> <li><strong>CUDA:</strong> 12.8.1 (<strong>nvcc</strong> 12.8.93)</li> </ul> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID
<span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1
<span class="nb">export </span><span class="nv">NCCL_CUMEM_ENABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">VLLM_ENABLE_CUDAGRAPH_GC</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_USE_FLASHINFER_SAMPLER</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>4
<span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
<span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_SLEEP_WHEN_IDLE</span><span class="o">=</span>1

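<span class="c"># assumption on intent: drop FlashInfer's JIT kernel cache so kernels compiled</span>
<span class="c"># against an earlier build are never reused after an upgrade</span>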
<span class="nb">rm</span> <span class="nt">-rf</span> ~/.cache/flashinfer

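<span class="c"># template + parser below is the Run 1 pairing, the longest-surviving config here</span>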
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3.6-35B-A3B <span class="se">\</span>
  <span class="nt">--chat-template</span> qwen3.5-enhanced.jinja <span class="se">\</span>
  <span class="nt">--attention-backend</span> FLASHINFER <span class="se">\</span>
  <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="se">\</span>
  <span class="nt">--max-model-len</span> 200000 <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.91 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
  <span class="nt">--enable-chunked-prefill</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--max-num-batched-tokens</span> 12288 <span class="se">\</span>
  <span class="nt">--max-num-seqs</span> 4 <span class="se">\</span>
  <span class="nt">--kv-cache-dtype</span> fp8 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> qwen3_xml <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3 <span class="se">\</span>
  <span class="nt">--no-use-tqdm-on-load</span> <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--language-model-only</span>
</code></pre></div></div> <h1 id="conclusions">Conclusions</h1> <ul> <li><strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code> + OpenCode 1.4.18</strong> is still the <strong>strongest combination tested</strong> on <strong>Qwen3.6-35B-A3B</strong>, but it <strong>does not match</strong> <strong>Qwen3.5-27B</strong> on <strong>looping</strong> or <strong>long-run tool reliability</strong>.</li> <li>It is <strong>not obvious</strong> why <strong>tool regressions</strong> reappear on <strong>3.6 35B-A3B</strong> when many <strong>3.5 fixes</strong> carry over; <strong>preserved thinking</strong> is one <strong>speculative</strong> lever worth tracking if <strong>Qwen</strong> ships <strong>flash</strong> or <strong>template</strong> updates aimed at agents.</li> <li><strong>Qwen3.5-27B</strong>, <strong>Qwen3.5-35B-A3B</strong>, and <strong>Qwen3.6-35B-A3B</strong> share the <strong>same official chat template</strong> in distribution—if <strong>Qwen3.6 Flash</strong> (or similar) launches with <strong>different</strong> templating, that may indicate <strong>intentional</strong> handling of <strong>tool/thinking</strong> edge cases.</li> </ul> <p><strong>Operational choice:</strong> I <strong>default to <code class="language-plaintext highlighter-rouge">Qwen3.5-27B-FP8</code></strong> for <strong>agentic obedience</strong>—<strong>instruction following</strong>, <strong>clean tool execution</strong>, and <strong>low loop rate</strong>. <strong>Qwen3.6-35B-A3B</strong> offers <strong>much faster TTFT</strong> and <strong>similar headline capability</strong> to <strong>Qwen3.5-27B</strong> on <strong>AA-style</strong> benchmarks, but in these runs it <strong>trades</strong> that for <strong>loops</strong> and <strong>tool failures</strong> that <strong>terminate long sessions</strong>. For <strong>agent</strong> work, I <strong>prioritize reliability</strong> over <strong>raw benchmark scores</strong>.</p>]]></content><author><name></name></author><category term="bug-fixes"/><category term="vllm"/><category term="qwen"/><category term="tool-calling"/><category term="llm"/><category term="agent"/><category term="inference"/><summary type="html"><![CDATA[Follow-up testing: same qwen3_xml parser, qwen3.5-enhanced.jinja template, and mixed-GPU tuning as Qwen 3.5-27B—plus three agentic runs comparing official vs enhanced configs on Qwen3.6-35B-A3B-FP8.]]></summary></entry><entry><title type="html">Claude Code with local vLLM: client validation, model aliases, and a working settings.json</title><link href="https://allanchan339.github.io/bug-fixes/2026/04/19/Claude-code-vLLM.html" rel="alternate" type="text/html" title="Claude Code with local vLLM: client validation, model aliases, and a working settings.json"/><published>2026-04-19T00:00:00+08:00</published><updated>2026-04-19T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/04/19/Claude-code-vLLM</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/04/19/Claude-code-vLLM.html"><![CDATA[<p><strong>Straight story:</strong> I wanted Claude Code to talk to <strong>my own</strong> model on <strong>vLLM</strong>, not to Anthropic’s hosted API. Tutorials usually say: set <code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code> and <code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code>. 
<strong>That was not enough.</strong> The CLI applies <strong>its own checks</strong> and can fail with “issue with the selected model” <strong>before</strong> meaningful traffic hits your server. The fix is a small set of aligned settings: <strong>tier aliases</strong> (<code class="language-plaintext highlighter-rouge">"model": "sonnet"</code> + <code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_*_MODEL</code>), a <strong>root</strong> base URL (no extra <code class="language-plaintext highlighter-rouge">/v1</code>), <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1</code></strong>, and a <strong>dummy</strong> <code class="language-plaintext highlighter-rouge">ANTHROPIC_AUTH_TOKEN</code> so vLLM still gets a header.</p> <p><strong>Why I care (and maybe you do too):</strong> with <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> pointed at <strong>local vLLM</strong> and a <strong>placeholder</strong> token, <strong>model traffic does not use the real Claude API</strong>—no Anthropic API key or inference billing for that path. That matters if you <strong>cannot register</strong>, are <strong>out of region</strong>, or <strong>cannot obtain API access</strong>, but still want the <strong>Claude Code</strong> loop against a model you control. (You still install and run <strong>Claude Code</strong>; this is about <strong>where completions are served</strong>, not a different product.)</p> <p><strong>vLLM / Qwen:</strong> Tooling and template notes for Qwen 3.5 on vLLM are in this <a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">vLLM / Qwen 3.5 thread</a>. Below assumes <strong>vLLM is already up</strong> and passes a simple <code class="language-plaintext highlighter-rouge">curl</code> check.</p> <h1 id="baseline-vllm-responds-claude-code-does-not-yet">Baseline: vLLM responds, Claude Code does not (yet)</h1> <p>I run <strong>Qwen 3.5-27B</strong> behind vLLM. Direct HTTP calls succeed:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://127.0.0.1:8000/v1/chat/completions <span class="nt">-X</span> POST <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"model":"Qwen3.5-27B","messages":[{"role":"user","content":"test"}]}'</span>
<span class="c"># Works</span>
</code></pre></div></div> <p>So I expected a quick env change. Instead I iterated through docs and issues, then grepped <strong><code class="language-plaintext highlighter-rouge">cli.js</code></strong> to see why validation fired.</p> <h1 id="the-trap-anthropic_custom_model_option">The trap: <code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code></h1> <h2 id="what-the-official-docs-suggest">What the official docs suggest</h2> <p>The <a href="https://docs.anthropic.com/en/docs/claude-code/model-config">Claude Code model configuration docs</a> describe <code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code> as a way to add a custom entry to the <code class="language-plaintext highlighter-rouge">/model</code> picker and imply relaxed handling for that id.</p> <p>I tried:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"ANTHROPIC_CUSTOM_MODEL_OPTION"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p><strong>Observed error:</strong> <code class="language-plaintext highlighter-rouge">There's an issue with the selected model (Qwen3.5-27B). It may not exist or you may not have access to it.</code></p> <h2 id="what-actually-happens">What actually happens</h2> <p>The variable <strong>does</strong> add a picker entry, but it <strong>does not</strong> reliably bypass validation when you drive the CLI via <strong><code class="language-plaintext highlighter-rouge">--model</code></strong>, <strong><code class="language-plaintext highlighter-rouge">settings.json</code></strong>, or similar. In practice you still hit the same guardrails unless you adopt the alias + env pattern later in this note.</p> <p>This behavior shows up in community threads—for example GitHub issues <strong>#18025</strong>, <strong>#23266</strong>, and <strong>#34821</strong>—while the product docs have not caught up.</p> <p><strong>Takeaway:</strong> when the documented env var does not match runtime behavior, the implementation (not the blog post) is the source of truth.</p> <h1 id="what-i-learned-from-clijs">What I learned from <code class="language-plaintext highlighter-rouge">cli.js</code></h1> <p>I stopped relying on tutorials and searched the installed <strong><code class="language-plaintext highlighter-rouge">cli.js</code></strong> (on the order of <strong>~50k</strong> lines, minified) for the error string:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="nt">-n</span> <span class="s2">"There's an issue with the selected model"</span> ~/.nvm/versions/node/<span class="k">*</span>/lib/node_modules/@anthropic-ai/claude-code/cli.js
</code></pre></div></div> <p>The hit landed near <strong>line 5146</strong>. The logic, paraphrased from the minified source, is:</p> <div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if </span><span class="p">(</span><span class="nx">q</span> <span class="k">instanceof</span> <span class="nx">AnthropicError</span> <span class="o">&amp;&amp;</span> <span class="nx">q</span><span class="p">.</span><span class="nx">status</span> <span class="o">===</span> <span class="mi">404</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// Reject custom models on 404</span>
  <span class="k">return</span> <span class="p">{</span>
    <span class="na">content</span><span class="p">:</span> <span class="s2">`There's an issue with the selected model (</span><span class="p">${</span><span class="nx">K</span><span class="p">}</span><span class="s2">). 
              It may not exist or you may not have access to it.`</span><span class="p">,</span>
    <span class="na">error</span><span class="p">:</span> <span class="dl">"</span><span class="s2">invalid_request</span><span class="dl">"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>So the CLI issues <strong>validation-style requests</strong>, gets <strong>404</strong> responses when the id is not on Anthropic’s expected list, and <strong>returns the “selected model” error before</strong> the path you care about (your vLLM <strong><code class="language-plaintext highlighter-rouge">/v1/messages</code></strong> traffic) is exercised normally.</p> <p>That is <strong>client-side validation</strong>, not “your server returned 404 on chat.”</p> <h2 id="the-undocumented-lever-that-matters">The undocumented lever that matters</h2> <p>Experimenting with env vars, <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1</code></strong> consistently reduced the failure mode where the client keeps probing endpoints that will never acknowledge a local model id. I did not find this called out in the same place as the high-level “custom model” docs; it is nonetheless <strong>necessary</strong> for a stable loop in my setup.</p> <h1 id="working-claudesettingsjson-tested-here-not-copy-pasted-blind">Working <code class="language-plaintext highlighter-rouge">~/.claude/settings.json</code> (tested here, not copy-pasted blind)</h1> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sonnet"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_AUTH_TOKEN"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dummy"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_DEFAULT_OPUS_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_DEFAULT_SONNET_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_DEFAULT_HAIKU_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"API_TIMEOUT_MS"</span><span class="p">:</span><span class="w"> </span><span class="s2">"3000000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CLAUDE_CODE_ATTRIBUTION_HEADER"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <h2 id="the-settings-that-must-agree-if-any-drift-you-get-confusing-errors">The settings that must agree (if any drift, you get confusing errors)</h2> <table> <thead> <tr> <th>Setting</th> <th>Why it matters</th> <th>Typical failure if wrong</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">"model": "sonnet"</code> <strong>and</strong> <code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_SONNET_MODEL</code></td> <td>Claude resolves the <strong>alias</strong> “sonnet” to your real vLLM id; putting the <strong>custom id</strong> directly in <code class="language-plaintext highlighter-rouge">"model"</code> triggers list validation</td> <td>“Issue with the selected model”</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code> is <strong><code class="language-plaintext highlighter-rouge">http://127.0.0.1:8000</code></strong> (no <code class="language-plaintext highlighter-rouge">/v1</code>)</td> <td>The client appends <strong><code class="language-plaintext highlighter-rouge">/v1/messages</code></strong> itself; a base URL that already ends in <code class="language-plaintext highlighter-rouge">/v1</code> becomes <strong><code class="language-plaintext highlighter-rouge">/v1/v1/messages</code></strong></td> <td><strong>404</strong> on API calls</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code>: <code class="language-plaintext highlighter-rouge">"1"</code></td> <td>Cuts non-essential / validation traffic that assumes Anthropic-hosted models</td> <td><strong>Intermittent</strong> validation failures</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">ANTHROPIC_AUTH_TOKEN</code> (e.g. <code class="language-plaintext highlighter-rouge">"dummy"</code>) plus aligned <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_*_MODEL</code></strong> for <strong>Opus / Sonnet / Haiku</strong></td> <td>vLLM still expects an Authorization-shaped header; mapping <strong>all three</strong> tiers to the same served name avoids internal tier switches pointing at invalid ids</td> <td>Auth or “wrong model” surprises when the CLI switches tier</td> </tr> </tbody> </table> <h2 id="vllm-side-must-match-the-json-exactly">vLLM side (must match the JSON exactly)</h2> <ul> <li><strong><code class="language-plaintext highlighter-rouge">--served-model-name Qwen3.5-27B</code></strong> must match the strings in <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_*_MODEL</code></strong> <strong>character for character</strong>.</li> <li>Avoid <strong><code class="language-plaintext highlighter-rouge">/</code></strong> in the served name if your settings use a flat id (a <strong><code class="language-plaintext highlighter-rouge">Qwen/...</code></strong> vs <strong><code class="language-plaintext highlighter-rouge">Qwen3.5-27B</code></strong> mismatch broke one of my attempts).</li> <li>Server should listen where <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> points (here <strong><code class="language-plaintext highlighter-rouge">8000</code></strong>).</li> </ul> <h2 id="smoke-test">Smoke test</h2> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>claude <span class="s2">"test"</span>
<span class="c"># Expect a normal assistant reply, e.g. readiness to help.</span>
</code></pre></div></div> <p>If this fails, reconcile the table above <strong>in order</strong> before chasing unrelated flags.</p> <h1 id="debugging-sequence-short">Debugging sequence (short)</h1> <p>If something below matches your error, fix that first; the full <strong><code class="language-plaintext highlighter-rouge">settings.json</code></strong> block is the target state.</p> <p><strong>Attempt 1 — vLLM-style base URL with <code class="language-plaintext highlighter-rouge">/v1</code></strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000/v1"</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> <code class="language-plaintext highlighter-rouge">API Error: 404</code> — the client adds <code class="language-plaintext highlighter-rouge">/v1/messages</code> again.</p> <hr/> <p><strong>Attempt 2 — custom id in <code class="language-plaintext highlighter-rouge">"model"</code> (GitHub #18025-style reports)</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> <code class="language-plaintext highlighter-rouge">There's an issue with the selected model</code> — no alias mapping; Anthropic list validation wins.</p> <hr/> <p><strong>Attempt 3 — slash in <code class="language-plaintext highlighter-rouge">--served-model-name</code> vs settings</strong></p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--served-model-name</span> Qwen/Qwen3.5-27B
</code></pre></div></div> <p>vs settings expecting <code class="language-plaintext highlighter-rouge">Qwen3.5-27B</code> without <code class="language-plaintext highlighter-rouge">/</code>.</p> <p><strong>Error:</strong> model not found / mismatch.</p> <hr/> <p><strong>Attempt 4 — <code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code> only (official wording)</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"ANTHROPIC_CUSTOM_MODEL_OPTION"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> still validation errors — picker entry ≠ full bypass for <strong><code class="language-plaintext highlighter-rouge">settings.json</code></strong> flows.</p> <hr/> <p><strong>Attempt 5 — <code class="language-plaintext highlighter-rouge">ANTHROPIC_API_KEY</code> instead of token</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"ANTHROPIC_API_KEY"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dummy"</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> authentication friction — <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_AUTH_TOKEN</code></strong> behaved better with vLLM in my tests.</p> <hr/> <p><strong>Attempt 6 — correct URL and aliases but no traffic / validation flag</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span><span class="w"> </span><span class="err">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</span><span class="w"> </span><span class="err">omitted</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> <strong>intermittent</strong> validation failures — sometimes works, sometimes not.</p> <hr/> <p><strong>Attempt 7 — minimal working core (before I added timeout / attribution / all three tiers)</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sonnet"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_AUTH_TOKEN"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dummy"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_DEFAULT_SONNET_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p><strong>Result:</strong> <strong>stable enough to proceed</strong>; I then expanded to the full block at the top (Opus/Haiku defaults, long <strong><code class="language-plaintext highlighter-rouge">API_TIMEOUT_MS</code></strong>, <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_ATTRIBUTION_HEADER</code></strong>) for day-to-day use.</p> <h1 id="why-the-alias--base-url--flag-pattern-works">Why the alias + base URL + flag pattern works</h1> <h2 id="model-tiers-and-aliases">Model tiers and aliases</h2> <p>Claude Code still thinks in <strong>Opus / Sonnet / Haiku</strong> tiers. If <strong><code class="language-plaintext highlighter-rouge">"model": "sonnet"</code></strong>, the runtime resolves that label via <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_SONNET_MODEL</code></strong>. If you put <strong><code class="language-plaintext highlighter-rouge">"model": "Qwen3.5-27B"</code></strong> directly, the CLI tries to treat it like an Anthropic-hosted id and <strong>fails validation</strong>.</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sonnet"</span><span class="w">
</span><span class="nl">"ANTHROPIC_DEFAULT_SONNET_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="w">
</span></code></pre></div></div> <h2 id="url-construction">URL construction</h2> <p>The client builds:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ANTHROPIC_BASE_URL}/v1/messages
</code></pre></div></div> <p>So <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL=http://127.0.0.1:8000/v1</code></strong> becomes:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://127.0.0.1:8000/v1/v1/messages
</code></pre></div></div> <p>which <strong>404s</strong>. The base should stop at the host (and port), e.g. <strong><code class="language-plaintext highlighter-rouge">http://127.0.0.1:8000</code></strong>.</p> <h2 id="claude_code_disable_nonessential_traffic"><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code></h2> <p>With <strong><code class="language-plaintext highlighter-rouge">=1</code></strong>, the CLI skips <strong>some</strong> checks and ancillary calls that assume Anthropic’s catalog. For <strong>local</strong> ids, those checks are exactly where <strong>404 → “invalid model”</strong> loops come from. <strong>Without</strong> the flag I still saw <strong>sporadic</strong> failures even when aliases and URLs were otherwise correct.</p> <h1 id="common-errors-quick-map">Common errors (quick map)</h1> <table> <thead> <tr> <th>Symptom</th> <th>Likely cause</th> </tr> </thead> <tbody> <tr> <td>“There’s an issue with the selected model”</td> <td>Custom string in <strong><code class="language-plaintext highlighter-rouge">"model"</code></strong> without alias mapping</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">API Error: 404</code></td> <td><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> includes <strong><code class="language-plaintext highlighter-rouge">/v1</code></strong></td> </tr> <tr> <td>Model not found</td> <td><strong><code class="language-plaintext highlighter-rouge">--served-model-name</code></strong> does not match JSON, or contains <strong><code class="language-plaintext highlighter-rouge">/</code></strong> when settings do not</td> </tr> <tr> <td>Intermittent validation</td> <td>Missing <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code></strong></td> </tr> <tr> <td>Relying on <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code> alone</strong></td> <td>Docs oversell bypass; <strong>picker ≠ full CLI bypass</strong></td> </tr> </tbody> </table> <h1 id="pre-flight-checklist">Pre-flight checklist</h1> <p>Before invoking <code class="language-plaintext highlighter-rouge">claude</code>:</p> <ul> <li><strong><code class="language-plaintext highlighter-rouge">"model"</code></strong> is an <strong>alias</strong> such as <strong><code class="language-plaintext highlighter-rouge">sonnet</code></strong>, not the vLLM id.</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_SONNET_MODEL</code></strong> (and siblings if you use tier changes) points at the served name.</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> ends at <strong><code class="language-plaintext highlighter-rouge">...:8000</code></strong> with <strong>no</strong> trailing <strong><code class="language-plaintext highlighter-rouge">/v1</code></strong>.</li> <li><strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code></strong> is <strong><code class="language-plaintext highlighter-rouge">"1"</code></strong>.</li> <li><strong><code class="language-plaintext highlighter-rouge">--served-model-name</code></strong> matches the JSON <strong>exactly</strong> (no stray <strong><code class="language-plaintext highlighter-rouge">/</code></strong>).</li> <li>vLLM is up and reachable at that host/port.</li> <li>Do <strong>not</strong> treat <strong><code class="language-plaintext 
highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code></strong> as sufficient on its own.</li> </ul> <h1 id="summary">Summary</h1> <ol> <li><strong>Accessibility:</strong> Pointing Claude Code at <strong>local vLLM</strong> means <strong>Anthropic API access is not required for the model layer</strong>—useful when you <strong>cannot register</strong>, <strong>cannot get API keys</strong>, or want <strong>zero</strong> hosted inference spend. You still use the CLI; completions hit <strong>your</strong> server.</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code></strong> alone did <strong>not</strong> match what I needed; treat tier <strong>aliases</strong> + env as the real fix.</li> <li><strong><code class="language-plaintext highlighter-rouge">"model": "sonnet"</code></strong> (or another tier label) plus <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_*_MODEL</code></strong> → your <strong><code class="language-plaintext highlighter-rouge">Qwen3.5-27B</code></strong> (or served name).</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> stops at <strong><code class="language-plaintext highlighter-rouge">http://host:port</code></strong>; the client adds <strong><code class="language-plaintext highlighter-rouge">/v1/messages</code></strong>.</li> <li><strong><code class="language-plaintext highlighter-rouge">--served-model-name</code></strong> matches those env strings <strong>exactly</strong> (watch <strong><code class="language-plaintext highlighter-rouge">/</code></strong> in ids).</li> <li><strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1</code></strong> avoids <strong>intermittent</strong> validation against Anthropic’s catalog.</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_AUTH_TOKEN</code></strong> (e.g. <strong><code class="language-plaintext highlighter-rouge">dummy</code></strong>) worked better than <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_API_KEY</code></strong> with my vLLM.</li> <li>When docs and <strong><code class="language-plaintext highlighter-rouge">cli.js</code></strong> disagree, the bundle wins; most bad copy-pastes omit <strong>one</strong> of alias mapping, base URL shape, or the traffic flag.</li> </ol> <h1 id="resources">Resources</h1> <ul> <li><a href="https://github.com/allanchan339/ForgeBookAuto/blob/main/docs/claude-code-third-party-models.md">ForgeBookAuto — Claude Code third-party models (quick reference)</a></li> <li><a href="https://docs.bigmodel.cn/cn/coding-plan/tool/claude">BigModel docs — coding plan / Claude (working third-party pattern)</a></li> <li><a href="https://docs.vllm.ai/en/latest/serving/integrations/claude_code/">vLLM docs — Claude Code integration</a> (useful but incomplete versus real client behavior)</li> <li>Related GitHub issues: <strong>#18025</strong>, <strong>#23266</strong>, <strong>#34821</strong></li> </ul> <p>If you want <strong>Claude Code’s workflow</strong> without <strong>Claude API</strong> inference, start from the <strong><code class="language-plaintext highlighter-rouge">settings.json</code></strong> block and checklist: <strong>aliases</strong>, <strong>root base URL</strong>, <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code></strong>, then align vLLM’s served name. 
That order saves time versus chasing <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code></strong> alone.</p>]]></content><author><name></name></author><category term="bug-fixes"/><category term="claude-code"/><category term="vllm"/><category term="llm"/><category term="local-inference"/><category term="anthropic-api"/><summary type="html"><![CDATA[Run Claude Code against local vLLM without Anthropic API access: why common env-only recipes fail, the alias + settings.json pattern that works, and when this matters if you cannot register or use the Claude API.]]></summary></entry><entry><title type="html">Stable tool calling for Qwen 3.5 27B/35B on vLLM: template, parser, and mixed-GPU fixes</title><link href="https://allanchan339.github.io/bug-fixes/2026/04/13/Qwen35-tool-calling.html" rel="alternate" type="text/html" title="Stable tool calling for Qwen 3.5 27B/35B on vLLM: template, parser, and mixed-GPU fixes"/><published>2026-04-13T00:00:00+08:00</published><updated>2026-04-13T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/04/13/Qwen35-tool-calling</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/04/13/Qwen35-tool-calling.html"><![CDATA[<p>Public write-ups on Qwen 3.5 often emphasize reasoning quality and slow time-to-first-token (TTFT). The reasoning capability on <strong>Qwen 3.5-27B</strong> is genuinely strong, and yes, TTFT is slow—but for <strong>agentic</strong> workloads <strong>tool calling</strong> is frequently what actually breaks: malformed XML, mid-stream stops, and long-context format drift are not always visible in short demos. This note comes from roughly <strong>a month</strong> of running that model on a <strong>mixed-GPU</strong> workstation (<strong>RTX 4090 + 3090</strong>), plus <strong>many hours</strong> debugging failed runs and reading <strong>vLLM</strong> source. The same patterns apply to <strong>27B/35B-class</strong> checkpoints (including <strong>A3B</strong>-style variants where instruction-following is comparable). The resulting configuration has stayed stable in production (<strong>weeks</strong> of use after the initial fix pass). Official model cards describe the happy path; they understate <strong>edge cases that smaller models</strong> hit more often than <strong>122B+</strong> checkpoints.</p> <h1 id="1-chat-template-qwen35_officialjinja-and-smaller-models">1. Chat template: <code class="language-plaintext highlighter-rouge">qwen3.5_official.jinja</code> and smaller models</h1> <h2 id="symptoms">Symptoms</h2> <p>The run started from the official <code class="language-plaintext highlighter-rouge">qwen3.5_official.jinja</code> template. For the first handful of tool turns, output looked fine. 
Then failures clustered:</p> <ul> <li>Tool calls appeared <strong>mid-thought</strong>, including closing <strong><code class="language-plaintext highlighter-rouge">&lt;/redacted_thinking&gt;</code></strong> without having opened a matching <strong><code class="language-plaintext highlighter-rouge">&lt;redacted_thinking&gt;</code></strong> tag.</li> <li><strong>Premature stops</strong> in the middle of XML tool calls—for example, the model produced a line like “Let me do that for you:” and then <strong>stopped</strong> without finishing the tool payload.</li> <li><strong>Historical thinking blocks leaked into context</strong>, so later turns saw polluted reasoning boundaries and inconsistent tool formatting.</li> </ul> <p>At first the usual suspects were operator error, a possible <strong>vLLM</strong> bug, or <strong>heterogeneous GPUs</strong>. Instrumentation and template experiments showed the <strong>chat template</strong> was the actual root cause.</p> <h2 id="cause">Cause</h2> <p>The official template contains <strong>edge cases that 122B+ models tend to absorb but 27B/35B models do not</strong>. Smaller checkpoints have <strong>less robust instruction following</strong>; the same ambiguity around where “thinking” ends and tool XML begins produces <strong>silent parser-level failures</strong> rather than self-correction.</p> <h2 id="fix">Fix</h2> <p>The working approach was a custom <strong>M2.5-style interleaved thinking</strong> template (<code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code>) that:</p> <ul> <li>Closes <strong><code class="language-plaintext highlighter-rouge">&lt;/thinking&gt;</code></strong> (the structured thinking segment) <strong>before</strong> tool calls, not after—so tool XML is not interleaved in a way that confuses the runtime.</li> <li><strong>Hides historical reasoning</strong> from the context the model sees on subsequent turns while keeping <strong>current</strong> reasoning visible where needed.</li> <li>Uses XML-shaped tool output that <strong>does not accidentally trigger</strong> <code class="language-plaintext highlighter-rouge">&lt;stop&gt;</code>-style termination patterns.</li> <li><strong>Handles the edge cases</strong> that smaller models struggle with when the stock template leaves boundaries implicit.</li> </ul> <p><strong>vLLM does not auto-detect</strong> the right chat template for this stack. You must pass the Jinja file explicitly; otherwise the default template remains in force and instability tends to persist <strong>regardless of other optimizations</strong>:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--chat-template</span> qwen3.5-enhanced.jinja
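<span class="c"># in context (sketch; model id and remaining flags come from your own serve command):</span>
<span class="c"># vllm serve Qwen/Qwen3.5-27B-FP8 --chat-template qwen3.5-enhanced.jinja ...</span>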
</code></pre></div></div> <h1 id="2-tool-call-parser-qwen3_coder-vs-qwen3_xml">2. Tool-call parser: <code class="language-plaintext highlighter-rouge">qwen3_coder</code> vs <code class="language-plaintext highlighter-rouge">qwen3_xml</code></h1> <h2 id="official-guidance">Official guidance</h2> <p>The <a href="https://huggingface.co/Qwen/Qwen3.5-27B-FP8">Qwen3.5-27B-FP8 Hugging Face page</a> recommends:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--tool-call-parser</span> qwen3_coder
</code></pre></div></div> <h2 id="what-went-wrong-in-practice">What went wrong in practice</h2> <p>For <strong>complex tool calls</strong> and <strong>long-context agentic</strong> runs (on the order of <strong>50K+ tokens</strong> in a trace), <code class="language-plaintext highlighter-rouge">qwen3_coder</code> was a primary source of breakage. A lot of wall-clock time went into proving that the parser—not only the model—was responsible.</p> <p>From reading <strong>vLLM’s implementation</strong>, the distinction is structural:</p> <table> <thead> <tr> <th>Parser</th> <th>How it works</th> <th>Special characters (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, <code class="language-plaintext highlighter-rouge">&amp;</code>)</th> <th>Nested JSON while streaming</th> <th>Malformed XML</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">qwen3_coder</code></td> <td>Regex string extraction</td> <td>Breaks pattern matching</td> <td>Often corrupts mid-stream</td> <td>Fails hard</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">qwen3_xml</code></td> <td>C-based parser (<code class="language-plaintext highlighter-rouge">xml.parsers.expat</code>)</td> <td>Auto-sanitizes</td> <td>Deferred / safer parsing</td> <td>Often auto-heals</td> </tr> </tbody> </table> <p><strong>Concrete example:</strong> a tool argument that contains code such as <code class="language-plaintext highlighter-rouge">if (a &lt; b)</code> breaks <code class="language-plaintext highlighter-rouge">qwen3_coder</code> because <code class="language-plaintext highlighter-rouge">&lt;</code> and <code class="language-plaintext highlighter-rouge">&gt;</code> interfere with regex-based extraction. <code class="language-plaintext highlighter-rouge">qwen3_xml</code> treats the stream as XML-shaped text under a real parser and <strong>does not depend on that fragile string match</strong>.</p> <h2 id="fix-1">Fix</h2> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--tool-call-parser</span> qwen3_xml
</code></pre></div></div> <p>This <strong>contradicts the one-line official recommendation</strong> on the model card. The preference for <code class="language-plaintext highlighter-rouge">qwen3_xml</code> here is grounded in <strong>vLLM source inspection</strong> (not only empirical trial-and-error): the <strong>C-based XML path</strong> is fundamentally more robust than regex extraction for messy, nested, or streaming tool payloads.</p> <h1 id="3-mixed-gpu-precision-drift-4090--3090">3. Mixed-GPU precision drift (4090 + 3090)</h1> <h2 id="problem">Problem</h2> <p>Tensor parallelism splits matrix multiplications across devices. In this setup:</p> <ul> <li><strong>RTX 4090 (SM89)</strong> exposes <strong>native FP8</strong> <strong>W8A8</strong> tensor-core paths.</li> <li><strong>RTX 3090 (SM80)</strong> has <strong>no native FP8</strong> and falls back to <strong>W8A16</strong>.</li> </ul> <p>So <strong>different ranks use different precision</strong>: <strong>W8A8</strong> on one GPU and <strong>W8A16</strong> on the other → <strong>mismatched partial products</strong> → <strong>error accumulation</strong> over depth and sequence length.</p> <h2 id="symptoms-1">Symptoms</h2> <p>Beyond roughly <strong>30–40K tokens</strong>, conversations <strong>drifted</strong>: tool calls grew inconsistent and reasoning quality degraded, consistent with numerical divergence rather than a single bad sampling draw.</p> <h2 id="fix-2">Fix</h2> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1
</code></pre></div></div> <p>This forces the 4090 onto <strong>W8A16</strong> (via the Marlin path) so it <strong>matches the 3090</strong> instead of using native <strong>W8A8</strong> alone. Both ranks then share the same effective precision, which removed the long-run drift in this configuration.</p> <p><strong>NCCL tuning</strong> for stability on this <strong>mixed consumer topology</strong> (helpful in practice alongside the precision alignment):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
</code></pre></div></div> <h1 id="4-checkpoint-choice-sft-distilled-variants-eg-qwopus35-vs-official-weights">4. Checkpoint choice: SFT-distilled variants (e.g. Qwopus3.5) vs official weights</h1> <h2 id="the-failure-mode">The failure mode</h2> <p>Checkpoints such as <strong><code class="language-plaintext highlighter-rouge">QuantTrio/Qwopus3.5-27B-v3-AWQ</code></strong> are <strong>SFT-distilled from Claude 4.6 Opus</strong>. They can look excellent initially:</p> <ul> <li>For the <strong>first ~65K tokens</strong>, tool calling stayed stable.</li> <li><strong>After ~65K+ tokens</strong>, output began <strong>mixing XML tool format with JSON-style</strong> tool messages.</li> </ul> <p>This branch of debugging <strong>cost the most calendar time</strong> before the hypothesis “wrong checkpoint for the protocol” was confirmed.</p> <h2 id="cause-1">Cause</h2> <p><strong>SFT</strong> shifted the <strong>surface</strong> tool format toward a <strong>Hermes-style JSON</strong> tool protocol to align with <strong>Claude-like</strong> training targets, but it does <strong>not fully realign</strong> the underlying token distribution with the base <strong>Qwen XML</strong> tool priors. In long context, the model <strong>drifts between</strong> the original <strong>Qwen <code class="language-plaintext highlighter-rouge">qwen3_xml</code> shape</strong> and the SFT’d <strong>JSON</strong> shape—something post-processing cannot reliably paper over.</p> <h2 id="what-to-run-instead">What to run instead</h2> <p><strong>If you have about 48 GB VRAM</strong> (best quality in this comparison):</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Qwen/Qwen3.5-27B-FP8
</code></pre></div></div> <ul> <li><strong>Near-lossless</strong> accuracy relative to denser formats.</li> <li><strong>Full ~219K context</strong> support as advertised for the stack used here.</li> <li><strong>Stable tool calling</strong> when paired with the <strong>custom template</strong> above.</li> </ul> <p><strong>If VRAM is below ~48 GB</strong> (accept some accuracy loss):</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Intel/Qwen3.5-27B-int4-AutoRound
</code></pre></div></div> <ul> <li>Saves on the order of <strong>~4 GB VRAM</strong> versus FP8 in this class of setup.</li> <li>Remains <strong>stable with the same custom template</strong> in testing.</li> <li><strong>Higher perplexity than FP8</strong>; <strong>INT4 is not lossless</strong>.</li> </ul> <p><strong>FP8 quantization is near-lossless</strong> in practice for this model line: avoid dropping to INT4 <strong>unless</strong> VRAM truly forces it.</p> <h1 id="reference-vllm-serve-configuration">Reference <code class="language-plaintext highlighter-rouge">vllm serve</code> configuration</h1> <p>After consolidation in an <strong>independent repo</strong> and roughly <strong>three days</strong> of <strong>production-style</strong> use, the following <strong>environment</strong> and <strong>serve</strong> command matched the stable behavior described above:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Environment variables</span>
<span class="nb">export </span><span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID
<span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1
<span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
<span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1

<span class="c"># vLLM serve command</span>
vllm serve Qwen/Qwen3.5-27B-FP8 <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3.5-27B <span class="se">\</span>
  <span class="nt">--chat-template</span> qwen3.5-enhanced.jinja <span class="se">\</span>
  <span class="nt">--attention-backend</span> FLASHINFER <span class="se">\</span>
  <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="se">\</span>
  <span class="nt">--max-model-len</span> 219520 <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.92 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
  <span class="nt">--enable-chunked-prefill</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--max-num-batched-tokens</span> 4096 <span class="se">\</span>
  <span class="nt">--max-num-seqs</span> 4 <span class="se">\</span>
  <span class="nt">--kv-cache-dtype</span> fp8 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> qwen3_xml <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3 <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--language-model-only</span>
</code></pre></div></div> <h1 id="validation">Validation</h1> <p>One <strong>continuous agentic session</strong> lasted about <strong>1h 9m</strong> on this configuration:</p> <ul> <li><strong>138.2K tokens</strong> generated.</li> <li><strong>Stable tool calling</strong> throughout—<strong>no XML/JSON format drift</strong> tied to parser or template collapse.</li> <li><strong>M2.5-style interleaved thinking</strong> remained coherent across the run.</li> <li>The model <strong>autonomously</strong> implemented a <strong>production-oriented knowledge-graph platform</strong> (<strong>FastAPI + React</strong>); <strong>18 minutes</strong> of that session were <strong>uninterrupted</strong> end-to-end work <strong>without tool-calling failures</strong>—the sort of reliability that matters for real agents rather than toy demos.</li> </ul> <p>The stack has also remained <strong>stable for weeks</strong> after that validation period in <strong>my</strong> deployment. As always, numbers are <strong>tied to one hardware topology</strong> and one workload mix; they are included as <strong>evidence</strong>, not a universal benchmark.</p> <h1 id="summary">Summary</h1> <ol> <li><strong>Jinja template</strong>: For <strong>27B/35B-class</strong> Qwen 3.5, the <strong>custom interleaved-thinking template</strong> is <strong>critical</strong>; the <strong>official template</strong> leaves <strong>edge cases</strong> that <strong>smaller models</strong> hit routinely.</li> <li><strong>Parser</strong>: Do not treat <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> as mandatory for agentic work; <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong>’s <strong>Expat-based</strong> path is <strong>more robust</strong> than <strong>regex</strong> for long, messy tool traces—and that conclusion is supported by <strong>reading vLLM</strong>, not only by trial runs.</li> <li><strong>Mixed GPU</strong>: When <strong>FP8 paths differ by generation</strong>, <strong><code class="language-plaintext highlighter-rouge">VLLM_TEST_FORCE_FP8_MARLIN=1</code></strong> (or an equivalent precision-alignment strategy) is <strong>effectively required</strong> to stop <strong>long-context drift</strong>.</li> <li><strong>Weights</strong>: <strong>SFT-distilled “Claude-shaped”</strong> forks (e.g. 
<strong>Qwopus3.5</strong>) can <strong>mix formats after ~65K tokens</strong>; for <strong>long</strong> tool-heavy jobs, prefer <strong>official Qwen FP8</strong> (or the INT4 variant if VRAM demands it).</li> <li><strong>Quantization</strong>: <strong>FP8 is near-lossless</strong> here; downgrade formats <strong>only</strong> when VRAM leaves no alternative.</li> </ol> <h1 id="resources">Resources</h1> <ul> <li><strong>Working setup (templates, env, notes):</strong> <a href="https://github.com/allanchan339/vLLM-Qwen3.5-27B">GitHub — vLLM Qwen 3.5 27B config</a> — includes <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong>.</li> <li><strong>Concrete long-session example:</strong> <a href="https://github.com/allanchan339/qwen_own_project">qwen_own_project</a>.</li> <li><strong>Earlier discussion:</strong> <a href="https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/">Reddit — tool calling fixes thread</a>.</li> </ul> <p>If you run <strong>Qwen 3.5 27B/35B-class</strong> models for <strong>agents</strong> and see <strong>silent</strong> tool failures—<strong>truncation</strong>, <strong>wrong boundaries</strong>, or <strong>thinking leakage</strong>—inspect the <strong>Jinja chat template</strong> first. <strong>In my experience</strong> it was <strong>almost always</strong> the dominant factor: the <strong>stock template</strong> does not cover the <strong>failure modes smaller checkpoints</strong> actually hit.</p>]]></content><author><name></name></author><category term="bug-fixes"/><category term="vllm"/><category term="qwen"/><category term="tool-calling"/><category term="llm"/><category term="inference"/><category term="gpu"/><summary type="html"><![CDATA[Debugging notes on Jinja chat templates, qwen3_xml vs qwen3_coder parsers, mixed-GPU FP8 drift, and SFT-distilled checkpoints when running Qwen 3.5 27B/35B-class models for long agentic sessions on vLLM.]]></summary></entry><entry><title type="html">Workaround for Enabling NCCL P2P Communication for NVIDIA RTX 4090 Workstations</title><link href="https://allanchan339.github.io/bug-fixes/2025/05/21/4090-P2P.html" rel="alternate" type="text/html" title="Workaround for Enabling NCCL P2P Communication for NVIDIA RTX 4090 Workstations"/><published>2025-05-21T00:00:00+08:00</published><updated>2025-05-21T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2025/05/21/4090-P2P</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2025/05/21/4090-P2P.html"><![CDATA[<h1 id="why-bother-with-nccl-p2p">Why bother with NCCL P2P?</h1> <p>When you train or fine-tune on <strong>more than one GPU</strong>, libraries such as PyTorch and JAX rely on <strong>collective communication</strong> so every device sees the same gradients or activations at the right time. On NVIDIA hardware that usually means <strong>NCCL</strong> is in the hot path. <strong>Peer-to-peer (P2P)</strong>—direct GPU-to-GPU memory access—is how NCCL prefers to move data when the driver and topology allow it; when P2P is blocked, collectives may still run but <strong>through slower fallbacks</strong> (see the next section for definitions).</p> <p>NVIDIA <strong>does not officially enable NCCL P2P</strong> on some consumer GeForce boards, including the <strong>RTX 4090</strong>, even when the hardware could support it in principle. 
If you bought two 4090s for a workstation and expect NCCL to “just work” like on a datacenter GPU, you can end up chasing cryptic logs until you either accept degraded comms or apply a <strong>driver/kernel workaround</strong>. This post documents one such path: <strong>modified open GPU kernel modules</strong> plus a few <strong>firmware and boot</strong> settings.</p> <h1 id="what-is-nccl-p2p">What is NCCL P2P?</h1> <p><strong>NCCL</strong> (NVIDIA Collective Communications Library) implements <strong>multi-GPU collective operations</strong>: patterns such as <strong>all-reduce</strong> (combine gradients across GPUs), <strong>all-gather</strong>, <strong>broadcast</strong>, and <strong>reduce-scatter</strong>. Frameworks typically call these through a distributed backend (for example PyTorch’s <code class="language-plaintext highlighter-rouge">ProcessGroup</code> using NCCL). Internally, NCCL chooses <strong>topologies</strong>—rings, trees, or hybrids—and schedules <strong>send/recv</strong>-style steps along edges between GPUs.</p> <p><strong>P2P</strong> in this context means <strong>CUDA peer access</strong>: GPU <em>i</em> is allowed to <strong>load and store</strong> another GPU <em>j</em>’s device memory <strong>without an explicit copy through host DRAM</strong>. On PCIe-only setups that usually implies <strong>P2P DMA over the fabric</strong> between those endpoints (when <strong>NVLink</strong> exists, NCCL can use that too). <strong>“NCCL P2P”</strong> is shorthand for: <strong>NCCL is using peer-to-peer GPU memory paths</strong> as part of those collectives, rather than only staging through pinned host buffers.</p> <p>When peer access is <strong>unavailable or disabled</strong>, NCCL can fall back to other transports, but you often see <strong>higher latency, lower effective bandwidth, or extra PCIe traffic</strong>—sometimes bad enough that scaling to two GPUs barely helps. Diagnostic hooks include <strong><code class="language-plaintext highlighter-rouge">cudaDeviceCanAccessPeer</code> / <code class="language-plaintext highlighter-rouge">cudaDeviceEnablePeerAccess</code></strong>, NCCL’s environment knobs (for example <code class="language-plaintext highlighter-rouge">NCCL_P2P_LEVEL</code>, <code class="language-plaintext highlighter-rouge">NCCL_P2P_DISABLE</code>), and <strong>NCCL debug / topology logs</strong> that record whether the runtime believes P2P is usable between pairs of GPUs.</p> <h1 id="what-is-rebar-and-why-it-shows-up-here">What is ReBAR (and why it shows up here)?</h1> <p><strong>ReBAR</strong> is <strong>Resizable BAR</strong> (part of the PCIe specification). Traditionally the CPU could only map a <strong>small fixed window</strong> of GPU video memory at a time. With ReBAR, the system can expose a <strong>much larger contiguous mapping</strong> of VRAM to the CPU. That mainly helps <strong>CPU ↔ GPU</strong> traffic (textures, uploads, some unified-memory style use). It is <strong>not</strong> the same thing as GPU-to-GPU P2P, but on consumer platforms <strong>BIOS and driver stacks</strong> often treat BAR sizing and routing as part of the same configuration story.</p> <p>For the workaround in this guide, <strong>ReBAR should be enabled</strong> in firmware where possible. The practical check is: <strong>Total BAR size reported by the driver is at least on the order of hundreds of MB</strong> (the steps below use <code class="language-plaintext highlighter-rouge">nvidia-smi</code>). 
If ReBAR is off, update <strong>motherboard BIOS</strong> and enable the relevant <strong>“Above 4G decoding” / Resizable BAR</strong> options before spending time on kernels.</p> <p><strong>IOMMU</strong> (I/O memory management unit) virtualization for devices can <strong>interfere with certain P2P paths</strong> on some boards. This guide turns IOMMU off in the kernel command line for the workstation case—<strong>only do that if you understand the tradeoff</strong> (simpler device DMA; less isolation for PCI passthrough / VFIO workflows).</p> <hr/> <p>Once the definitions and motivation are clear, the rest is <strong>execution</strong>: driver build, firmware/boot knobs, CUDA, and sanity tests.</p> <h1 id="expected-results">Expected results</h1> <p>The following images show P2P working end-to-end after configuration:</p> <p><img src="https://github.com/user-attachments/assets/3c27b585-0bbe-4d82-8e49-946658019cbe" alt="P2P Communication Result 1"/> <img src="https://github.com/user-attachments/assets/86d90742-8fcb-4e43-b916-63bc71685872" alt="P2P Communication Result 2"/></p> <h1 id="how-implementation-guide">How: implementation guide</h1> <h2 id="1-driver-installation">1. Driver installation</h2> <h3 id="11-remove-existing-nvidia-drivers">1.1 Remove existing NVIDIA drivers</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt purge <span class="s1">'^nvidia-.*'</span>
<span class="nb">sudo </span>apt autoremove
<span class="nb">sudo </span>apt autoclean
</code></pre></div></div> <h3 id="12-reboot">1.2 Reboot</h3> <p>Restart the machine so the old stack is fully unloaded.</p> <h3 id="13-unload-the-nvidia-drm-module-text-mode">1.3 Unload the NVIDIA DRM module (text mode)</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl isolate multi-user.target
modprobe <span class="nt">-r</span> nvidia-drm

<span class="c"># If the GUI does not return afterward:</span>
systemctl start graphical.target
</code></pre></div></div> <h3 id="14-install-the-modified-kernel-modules">1.4 Install the modified kernel modules</h3> <ol> <li> <p>Clone the open GPU kernel modules repository (use the repo URL, not the GitHub <code class="language-plaintext highlighter-rouge">tree</code> page):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/tinygrad/open-gpu-kernel-modules.git
<span class="nb">cd </span>open-gpu-kernel-modules
</code></pre></div> </div> </li> <li> <p>Check out the P2P branch that matches your target driver line (example branch name—confirm on the repo):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git fetch <span class="nt">--all</span>
git branch <span class="nt">-a</span>
git switch 565.57.01-p2p
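<span class="c"># If the switch fails, the branch was likely renamed; pick the *-p2p entry</span>
<span class="c"># matching your driver line from the listing above.</span>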
</code></pre></div> </div> </li> <li> <p>Build the modules:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make modules <span class="nt">-j</span><span class="si">$(</span><span class="nb">nproc</span><span class="si">)</span>
</code></pre></div> </div> <p>If the build fails on GCC, install GCC 12 and point alternatives at it:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt update
<span class="nb">sudo </span>apt <span class="nb">install </span>gcc-12 g++-12
<span class="nb">sudo </span>update-alternatives <span class="nt">--install</span> /usr/bin/gcc gcc /usr/bin/gcc-12 120 <span class="nt">--slave</span> /usr/bin/g++ g++ /usr/bin/g++-12
</code></pre></div> </div> </li> <li> <p>Install the built modules:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>make modules_install <span class="nt">-j</span><span class="si">$(</span><span class="nb">nproc</span><span class="si">)</span>
</code></pre></div> </div> </li> <li> <p>Install the <strong>user-space</strong> driver from NVIDIA for the same version, <strong>without</strong> replacing the kernel modules you just built—for example:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Download the matching runfile from NVIDIA, e.g.:</span>
<span class="c"># https://www.nvidia.com/en-us/drivers/details/233008/</span>
sh ./NVIDIA-Linux-[...].run <span class="nt">--no-kernel-modules</span>
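<span class="c"># --no-kernel-modules installs only the user-space stack, leaving the</span>
<span class="c"># P2P-patched modules from the previous step in place; keep the runfile</span>
<span class="c"># version identical to the branch you built.</span>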
</code></pre></div> </div> </li> <li> <p>Reboot again.</p> </li> </ol> <h2 id="2-system-configuration">2. System configuration</h2> <h3 id="21-verify-rebar">2.1 Verify ReBAR</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nvidia-smi <span class="nt">-q</span> | <span class="nb">grep</span> <span class="nt">-i</span> bar <span class="nt">-A</span> 3
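<span class="c"># Expected shape (sketch; exact figures vary by board and VRAM size):</span>
<span class="c">#   BAR1 Memory Usage</span>
<span class="c">#       Total                 : 32768 MiB</span>
<span class="c"># A tiny Total (well under 256 MiB) usually means ReBAR is still off.</span>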
</code></pre></div></div> <p>Treat <strong>Total ≥ 256 MiB</strong> (order-of-magnitude) as a sign that ReBAR-style mapping is in play; if numbers look tiny, fix <strong>BIOS</strong> options before debugging NCCL.</p> <p><img src="https://github.com/user-attachments/assets/8f951980-498a-46ab-b5cd-a7b07ed6931e" alt="ReBar Configuration"/></p> <h3 id="22-disable-iommu-in-grub-amd-example">2.2 Disable IOMMU in GRUB (AMD example)</h3> <p>Edit <code class="language-plaintext highlighter-rouge">/etc/default/grub</code> and adjust the default command line, for example:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/default/grub
</code></pre></div></div> <p>Set something like:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">GRUB_CMDLINE_LINUX_DEFAULT</span><span class="o">=</span><span class="s2">"quiet splash amd_iommu=off iommu=off"</span>
</code></pre></div></div> <p>Then <code class="language-plaintext highlighter-rouge">sudo update-grub</code> (Debian/Ubuntu) and reboot.</p> <p><strong>Reminder:</strong> P2P in this setup expects <strong>ReBAR on</strong> and <strong>IOMMU off</strong> for the paths described here.</p> <h2 id="3-cuda-toolkit">3. CUDA toolkit</h2> <ol> <li>Install a CUDA toolkit from <a href="https://developer.nvidia.com/cuda-downloads">NVIDIA’s CUDA downloads</a>.</li> <li> <p>Example environment for build tools:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/usr/local/cuda-12.9/bin:<span class="nv">$PATH</span>
<span class="nb">export </span><span class="nv">CUDAHOSTCXX</span><span class="o">=</span>/usr/bin/g++-12
</code></pre></div> </div> </li> </ol> <h2 id="4-p2p-tests">4. P2P tests</h2> <h3 id="41-simplep2p">4.1 SimpleP2P</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/NVIDIA/cuda-samples
<span class="nb">cd </span>cuda-samples/Samples/0_Introduction/simpleP2P/
<span class="nb">mkdir </span>build <span class="o">&amp;&amp;</span> <span class="nb">cd </span>build
cmake ..
make <span class="nt">-j</span><span class="si">$(</span><span class="nb">nproc</span><span class="si">)</span>
./simpleP2P
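<span class="c"># Success looks roughly like (sketch, not verbatim output):</span>
<span class="c">#   Peer access from ... (GPU0) -&gt; ... (GPU1) : Yes</span>
<span class="c">#   followed by a cudaMemcpyPeer bandwidth figure and a passing result.</span>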
</code></pre></div></div> <h3 id="42-bandwidth-and-latency">4.2 Bandwidth and latency</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd </span>cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/
<span class="nb">mkdir </span>build <span class="o">&amp;&amp;</span> <span class="nb">cd </span>build
cmake ..
make <span class="nt">-j</span><span class="si">$(</span><span class="nb">nproc</span><span class="si">)</span>
./p2pBandwidthLatencyTest
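<span class="c"># Read the printed matrices: compare P2P=Enabled vs P2P=Disabled bandwidth</span>
<span class="c"># and latency; a clear uplift on the off-diagonal (GPU0-GPU1) cells shows</span>
<span class="c"># the patched modules are actually carrying direct GPU-to-GPU traffic.</span>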
</code></pre></div></div> <h1 id="conclusion">Conclusion</h1> <p>NCCL P2P is worth the trouble when <strong>multi-GPU collectives</strong> are on your critical path and you are on <strong>consumer cards with official limitations</strong>. <strong>ReBAR</strong> is primarily about <strong>how much GPU memory the CPU can map at once</strong>; here it is a <strong>firmware prerequisite</strong> to line up with the driver workaround, alongside <strong>IOMMU settings</strong> and a <strong>matched open-kernel-module build + runfile install</strong>. Treat the whole stack as <strong>unsupported by NVIDIA for production</strong>—validate with the CUDA samples above, then run your real training job and watch NCCL logs for a clean bill of health.</p>]]></content><author><name></name></author><category term="bug-fixes"/><category term="nvidia"/><category term="driver"/><category term="p2p"/><category term="gpu"/><category term="deep learning"/><summary type="html"><![CDATA[What NCCL P2P means, why it matters on multi-GPU workstations, how Resizable BAR fits in, and a concrete setup path for RTX 4090.]]></summary></entry><entry><title type="html">IELTS - After Class Note, Week 8</title><link href="https://allanchan339.github.io/journey/2024/08/24/IELTS-week8.html" rel="alternate" type="text/html" title="IELTS - After Class Note, Week 8"/><published>2024-08-24T00:00:00+08:00</published><updated>2024-08-24T00:00:00+08:00</updated><id>https://allanchan339.github.io/journey/2024/08/24/IELTS-week8</id><content type="html" xml:base="https://allanchan339.github.io/journey/2024/08/24/IELTS-week8.html"><![CDATA[<h1 id="listening">Listening</h1> <ul> <li>Listening 遇到填充題，一定要判別詞性，係 V, Vs, N. Ns, Adj</li> </ul> <h1 id="writing">Writing</h1> <h2 id="short-task">Short Task</h2> <h3 id="diagram-流程圖">Diagram (流程圖)</h3> <ul> <li>結構同樣係 <ul> <li> <ol> <li>intro, 1-3 overiew, 2*3-4 features, 1 conclusion</li> </ol> </li> </ul> </li> <li>用現在式</li> <li>遊戲玩法係將每一個步驟用完整句子＋連接詞連接起來 <ul> <li>First of all, then, next, after that , in the next stage, subsequently, finally, at the end,</li> </ul> </li> <li>Overview: <ul> <li>寫有幾多個step, 第一步同最後一步係乜</li> </ul> </li> <li>文中如果吾知佢要做乜，可以寫 <ul> <li>They / the products undergo XXX (N.)</li> </ul> </li> </ul> <h1 id="speaking">Speaking</h1> <ul> <li>遊戲只能靠平時的底子 <ul> <li>着重於陽長避短</li> </ul> </li> <li>4 domains <ul> <li>Fluency and coherence <ul> <li>最重要</li> <li>keep talking, 而且要用完整句子</li> </ul> </li> <li>Grammatical range <ul> <li>tense 好看重，如果可以就用多些少tense</li> <li>多用連接詞，e.g. and beside, moreover instead</li> <li>多用比較，e.g. instead, but, perfer, A is better than B as</li> </ul> </li> <li>Lexial resources <ul> <li>吾好作狀，speaking 用既詞語同寫作係吾同</li> </ul> </li> <li>Pronounciation <ul> <li>accent 吾記分，所以吾洗扮口音</li> <li>但係快慢，聲調，節奏會計</li> </ul> </li> </ul> </li> <li>6 分起跳 <ul> <li>通常往上調</li> </ul> </li> <li>11-14 mins <ul> <li>首 30s 吾記分，但係用黎比印象分</li> </ul> </li> <li>聽吾切 <ul> <li>Would you please repeat the question for me (X again)?</li> </ul> </li> </ul> <h2 id="回答套路">回答套路</h2> <h3 id="kfc-approach">KFC approach</h3> <ul> <li>Key details/ reason</li> <li>Friends</li> <li>Contrast <ul> <li>X if</li> <li>opinions differ</li> </ul> </li> </ul> <h3 id="ppf-approach">PPF approach</h3> <ul> <li>Past</li> <li>Present</li> <li>Future</li> </ul> <h3 id="5w-approach">5W approach</h3> <ul> <li>What</li> <li>Why</li> <li>How</li> <li>When</li> <li>Where</li> </ul> <h2 id="part-1">Part 1</h2> <ul> <li>問習慣，住乜，讀書定工作 etc.</li> <li>要顯示出你肯答，keep talking ! 
<ul> <li>otherwise it costs you lexical marks</li> </ul> </li> </ul> <h3 id="question-bank-for-review">Question Bank for review</h3> <p><img src="https://s2.loli.net/2023/08/24/JGrRT56LwVbuHmg.png" alt="Figure 0"/><br/> <img src="https://s2.loli.net/2023/08/24/NK2V7DnOIpBcshk.png" alt="Figure 1"/><br/> <img src="https://s2.loli.net/2023/08/24/JTikmcjgYFHlQOb.png" alt="Figure 2"/><br/> <img src="https://s2.loli.net/2023/08/24/cvoBqjIUE9Ld8XN.png" alt="Figure 3"/><br/> <img src="https://s2.loli.net/2023/08/24/gNPkIrKMaol8iLE.png" alt="Figure 4"/><br/> <img src="https://s2.loli.net/2023/08/24/FAc63Piax9X8NqS.png" alt="Figure 5"/></p> <h2 id="part-2-3">Part 2-3</h2> <p><img src="https://s2.loli.net/2023/08/24/kdAlrz4c2pDV1b6.png" alt="Figure 6"/></p> <h1 id="錯題集">Mistake log</h1> <h2 id="reading-incorrect-answer-from-lesson-7">Reading: incorrect answers from Lesson 7</h2> <h3 id="q4">Q4</h3> <p>What do the experiments described in the fifth paragraph suggest about the paintings of Mondrian?</p> <ul> <li>A. They are more <strong>carefully</strong> put together than they appear. <ul> <li>Quote: Mondrian’s works are deceptively simple, but eye-tracking studies confirm that they are <strong>meticulously</strong> composed.</li> <li>Meticulously: in fine detail</li> <li>it matches the word <strong>carefully</strong></li> </ul> </li> <li>B. They can be interpreted in a number of different ways. (mentioned, but not what the experiments suggest about the paintings)</li> <li>C. They challenge our assumptions about shape and color. (no “assumptions”)</li> <li>D. They are easier to appreciate than many other abstract works. (no such comparison)</li> </ul> <h3 id="q7">Q7</h3> <p>She also observes that pleasing works of art often contain repeated _____ which occur frequently in the natural world.</p> <ul> <li>Quote: What’s more, appealing pieces both abstract and representational, show signs of ‘fractals’ - <strong>repeated motifs recurring in different scales</strong>. Fractals are common throughout nature, for example in the shapes of mountain peaks or the branches of trees. It is possible that our visual system, which evolved in the great outdoors, find it easier to process <strong>such patterns</strong>.</li> <li>Patterns -&gt; repeated motifs -&gt; repeated ??</li> <li>A. layout (X)</li> <li>B. images (O)</li> <li>Requires <strong>a plural noun</strong> here</li> </ul> <h3 id="q8-13-views-of-the-writer">Q8-13 Views of the writer</h3> <ul> <li>We must use <strong>Yes/No/Not Given</strong>, not True/False/Not Given</li> </ul> <h4 id="q10">Q10</h4> <p>People’s taste in paintings depends entirely on the current artistic trends of the period.</p> <ul> <li>Quote: While the fashions of the time might shape what is currently popular, works that are best adapted to our visual system may be the most likely to linger once the trends of previous generations have been forgotten. <ul> <li>Mentions people’s taste shaped by fashion and current artistic trends</li> <li>But not <strong>entirely</strong>: works best adapted to our visual system may outlast those trends</li> </ul> </li> <li>Ans: <strong>No</strong></li> </ul> <h4 id="q11">Q11</h4> <p>Scientists should seek to define the precise rules which govern people’s reactions to works of art.</p> <ul> <li>Quote: It would, however, be foolish to reduce art appreciation to a set of scientific laws.
<ul> <li>Mentions scientific laws -&gt; precise rules</li> <li>Mentions art appreciation -&gt; people’s reactions to works of art</li> </ul> </li> <li>Ans: <strong>No</strong></li> </ul> <h2 id="listening-section3">Listening Section 3</h2> <h3 id="q21-22">Q21-22</h3> <p>Which TWO characteristics were shared by the subjects of Joanna’s psychology study?</p> <ul> <li>Ans: B+D</li> <li>A. They had all won prizes for their music. <ul> <li>Trap. Quote: And quite a few had won prizes and competitions as well</li> <li>only some of them, not all</li> </ul> </li> <li>B. They had all made music recordings</li> <li>C. They were all under 27 years old</li> <li>D. They had all toured internationally <ul> <li>Quote: They were all very <strong>highly regarded</strong> in the music world and they’d done quite extensive <strong>tours</strong> in different <strong>continents</strong></li> </ul> </li> <li>E. They all played a string instrument.</li> </ul> <h3 id="q25-26">Q25-26</h3> <p>Which TWO topics did Joanna <strong>originally</strong> intend to investigate in her research?</p> <ul> <li>The question asks about the original plan, not the final focus</li> <li>A. regulations concerning concert dress <ul> <li>regulations are never mentioned at all</li> </ul> </li> <li>B. audience reactions to the dress of performers <ul> <li>Quote: When I started I was more interested in trying to investigate the impact of what was worn on those listening</li> </ul> </li> <li>C. changes in performer attitudes to concert dress</li> <li>D. how choice of dress relates to performer roles <ul> <li>Trap. Quote: My research investigated the way players see their role as a musician and how this is linked to the type of clothing they decide to wear, but that focus didn’t emerge immediately</li> </ul> </li> <li>E. links between musical instrument and dress choice <ul> <li>Quote: and also whether someone like violinist might adopt a different style of clothing from someone playing the flute or the trumpet</li> </ul> </li> </ul> <h3 id="q28">Q28</h3> <p>Mike Frost’s article suggests that in popular music, women’s dress is affected by</p> <ul> <li>A. their wish to be taken seriously (O)</li> <li>B. their tendency to copy each other</li> <li>C.
their reaction to the masculine nature of the music</li> <li>Quote: He points out that a lot of female singers and musicians in popular music tend to dress down (dress modestly) in performances, and wear less feminine clothes, and he suggests this is because otherwise they’d just be discounted as trivial <ul> <li>they fear being looked down on and discounted</li> </ul> </li> </ul> <h2 id="listening-section4">Listening Section 4</h2> <h3 id="q34">Q34</h3> <p>some CO2 moves from the <em>__</em> of plants to microbes in the soil</p> <ul> <li>Ans = roots (plural)</li> <li>it is the roots of the plants, not “roof”</li> </ul> <h3 id="q35">Q35</h3> <p>uses established practices to make sure soil remains fertile and <em>__</em></p> <ul> <li>Ans = moist / wet; no marks if it is misspelled</li> </ul> <h3 id="q37">Q37</h3> <p>taking place on a big _____ farm</p> <ul> <li>Ans: cattle</li> </ul> <h3 id="q38">Q38</h3> <p>uses compost made from waste from agriculture and _____</p> <ul> <li>Ans: gardens</li> <li>must be plural, not “garden”</li> </ul> <h3 id="q39">Q39</h3> <p>aims to increase soil carbon by using _____ that <strong>are</strong> always green</p> <ul> <li>must be plural</li> <li>Ans: grasses</li> </ul>]]></content><author><name></name></author><category term="journey"/><category term="ielts"/><category term="english"/><category term="reading"/><category term="usage"/><summary type="html"><![CDATA[Final-week listening, writing, speaking, and reading reminders for IELTS.]]></summary></entry><entry><title type="html">IELTS - After Class Note, Week 7</title><link href="https://allanchan339.github.io/journey/2024/08/20/IELTS-week7.html" rel="alternate" type="text/html" title="IELTS - After Class Note, Week 7"/><published>2024-08-20T00:00:00+08:00</published><updated>2024-08-20T00:00:00+08:00</updated><id>https://allanchan339.github.io/journey/2024/08/20/IELTS-week7</id><content type="html" xml:base="https://allanchan339.github.io/journey/2024/08/20/IELTS-week7.html"><![CDATA[<h1 id="listening">Listening</h1> <h2 id="新題型-matching-題">New question type: matching</h2> <ul> <li>Read all the options first</li> <li>The questions then come in order; watch for nouns, times, numbers, and adjectives
<ul> <li>\(\because\) it is 7 options for 1 answer, and you cannot keep up otherwise</li> <li>by contrast, MC questions are mostly 3 options for 1</li> </ul> </li> <li>(That way of handling them works well for map and matching questions; this time not a single one was wrong) <ul> <li>check clearly where the entrance is and which way the route runs, then follow the description as it goes; that is usually right</li> </ul> </li> </ul> <h1 id="writing">Writing</h1> <h2 id="short-task">Short Task</h2> <ul> <li>Recommended: finish within 20 mins</li> </ul> <h2 id="structure">Structure</h2> <ol> <li>Introduction (1 sentence)</li> <li>Overview (1-3 sentences) <ol> <li>Key information only</li> <li>Overall trend (increase/ decrease)</li> <li>Comparison (Beginning vs Latest data only)</li> <li>Polar data (maximum/minimum) (largest vs smallest contributor, for a pie chart)</li> <li>Fluctuation <ul> <li>do not quote numbers!</li> </ul> </li> </ol> </li> <li>Features * 2 <ol> <li>Group data by segment (cover each country once)</li> <li>Group data by trend (rising in one group; falling plus unchanged in the other)</li> </ol> </li> <li><strong>Do not write a conclusion</strong>!!!</li> </ol> <h2 id="question-type">Question Type</h2> <ul> <li>Line graph, bar chart, pie chart (80%)</li> <li>Flow chart (20%)</li> </ul> <h2 id="answer-flow">Answer Flow</h2> <ol> <li>Rephrase the prompt slightly as the intro</li> <li>While reading the chart, note what can serve as the overview</li> <li>Decide how the features should be split: by segment or by trend?</li> <li>Write everything in the <strong>past tense</strong>!!!!!</li> </ol> <h2 id="罐頭句系列">Canned sentences</h2> <ul> <li>to experience a sharp increase / decrease in XXX</li> <li>A sharp rise / fall in XXX can be witnessed (observed)</li> <li>There was a surge/ reduction in XXX</li> <li>There was an upward / a downward trend in</li> <li>The number soared / rocketed to (a record high / low of ) XXX</li> <li>The figures peaked at XXX in XXX</li> <li>A significant increase / decrease occurred from A to B // between A and B.</li> <li>The number hit a low-point of XXX</li> <li>XXX accounted for // constituted / contributed to ? % of XXX</li> </ul>]]></content><author><name></name></author><category term="journey"/><category term="ielts"/><category term="english"/><category term="reading"/><category term="usage"/><summary type="html"><![CDATA[Listening, writing task structure, and vocabulary from IELTS week 7.]]></summary></entry></feed>