<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://allanchan339.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://allanchan339.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-05-02T20:52:55+08:00</updated><id>https://allanchan339.github.io/feed.xml</id><title type="html">blank</title><subtitle>My personal website </subtitle><entry><title type="html">qwen3.6-enhanced.jinja: CoT leakage into tool turns and why preserve_thinking works now</title><link href="https://allanchan339.github.io/bug-fixes/2026/05/02/Qwen36-27B-updated-jinja.html" rel="alternate" type="text/html" title="qwen3.6-enhanced.jinja: CoT leakage into tool turns and why preserve_thinking works now"/><published>2026-05-02T00:00:00+08:00</published><updated>2026-05-02T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/05/02/Qwen36-27B-updated-jinja</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/05/02/Qwen36-27B-updated-jinja.html"><![CDATA[<p>In <a href="/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html">my April note on Qwen 3.6-27B</a> I described a stack that survived a long agentic trace: <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> on the 3.6 checkpoint, <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> for streaming extraction, <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong>, and NCCL tweaks after <strong>Studio Driver 595.79</strong>.</p> <p><strong>That is the same cluster of reasons</strong> <code class="language-plaintext highlighter-rouge">preserve_thinking</code> <strong>had to stay off:</strong> <strong>Qwen 3.6</strong> <strong>sustains</strong> interleaved <strong>thinking</strong> in a way <strong>3.5</strong> largely does not; <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> <strong>does not repair</strong> missing <code>&lt;/redacted_thinking&gt;</code> and can <strong>double-wrap</strong> assistant turns on <strong>3.6</strong>; with <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=true</code></strong> the template <strong>keeps</strong> more of that <strong>broken</strong> structure in <strong>rendered history</strong>, so <strong>prefix pollution</strong>, <strong>CoT bleed</strong>, and <strong>ignored <code class="language-plaintext highlighter-rouge">tool_call</code></strong> <strong>get worse</strong>. <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> was the <strong>pressure-release</strong>—<strong>stripping</strong> much <strong>think</strong> from <strong>earlier</strong> turns so agent runs could finish—not a statement that <strong>3.6</strong> “should not” expose reasoning. 
I dug in when <strong>reasoning still leaked into <code class="language-plaintext highlighter-rouge">tool_response</code></strong> and <strong>tools stopped firing</strong> even with the flag off.</p> <p>I developed <strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></a></strong> so the <strong>Qwen 3.6 family</strong> can use an enhanced chat template <strong>without</strong> that compromise: <strong>multimodal</strong> paths, <strong>interleaved</strong> thinking aligned to how <strong>3.6</strong> actually behaves, <strong>self-healing</strong> before the reasoning split, and <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code> supported</strong> (<code class="language-plaintext highlighter-rouge">true</code> or <code class="language-plaintext highlighter-rouge">false</code>)—i.e. the <strong>full surface</strong> the <strong>3.6 series</strong> is meant to expose, instead of <strong>turning off</strong> <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> to paper over <strong>3.5-enhanced-on-3.6</strong> bugs. <a href="https://raw.githubusercontent.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/main/chat-template/qwen3.6-enhanced.jinja">Raw file</a> for <code class="language-plaintext highlighter-rouge">vllm serve --chat-template</code>. <strong>Working proof (~128k tokens spent):</strong> <strong><a href="https://github.com/allanchan339/qwen36_27B_36jinja_project">qwen36_27B_36jinja_project</a></strong>.</p> <p><img src="/assets/img/posts/2026-05-02-qwen36-jinja-token-trace.png" alt="Token trace after the qwen3.6-enhanced.jinja run (~128k tokens spent; served from this site)"/></p> <p>This post is the template-side story: <strong>why</strong> pointing raw <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> at 3.6 could corner the runtime, why that file <strong>never inserts</strong> a missing <code>&lt;/redacted_thinking&gt;</code> (it <strong>leaves the broken assistant text in the prompt</strong>—<strong>causal</strong> models still <strong>condition on it</strong>), and the <strong>minimal self‑healing</strong> step I put in the <strong><code class="language-plaintext highlighter-rouge">assistant</code></strong> branch of <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> before the reasoning split.</p> <h2 id="what-broke-in-plain-terms">What broke in plain terms</h2> <p>Sometimes the assistant emitted something shaped like:</p> <ul> <li> <p>an opening think marker (<strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> and <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> share the same literal <code>&lt;redacted_thinking&gt;</code> family; the April post used <strong><code class="language-plaintext highlighter-rouge">thinking</code></strong> casually for readability),</p> </li> <li> <p><strong>no closing tag</strong> before a raw <code>&lt;tool_call&gt;</code> block.</p> </li> </ul> <p>Training and runtime prompts encourage <strong>closed</strong> think sections. 
Reality is messier: the model can wedge a tool payload <strong>inside</strong> what is effectively still “thinking.”</p> <p>Separately, running <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> on <strong>Qwen 3.6</strong> applied assistant logic equivalent to wrapping <strong>every</strong> qualifying turn in a synthetic think sandwich <strong>even when <code class="language-plaintext highlighter-rouge">reasoning_content</code> stayed empty</strong>. That interacted badly with malformed history: after rendering, it could look like the model was <strong>still inside</strong> an outer think envelope when <code>&lt;tool_call&gt;</code> appeared. Downstream behaviour matches what I observed as <strong>CoT leakage across turn boundaries</strong> and <strong>tool instructions that never get scheduled</strong>.</p> <p>None of this negates <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> on 3.6—the parser lane still matters—but fixing the template removes a <strong>structural</strong> failure mode rather than leaning only on parsing heuristics.</p> <h2 id="why-reasoning-extraction-silently-failed">Why reasoning extraction silently failed</h2> <p>The template extracts <code class="language-plaintext highlighter-rouge">reasoning_content</code> by looking for <code>&lt;/redacted_thinking&gt;</code> in the <strong>message body</strong>. When the assistant never emits that closing tag, the splitter never runs, <code class="language-plaintext highlighter-rouge">reasoning_content</code> stays <strong>empty</strong>, and the <strong>remainder</strong> stays the full raw string—<strong>including</strong> the unclosed opening think tag ahead of <code>&lt;tool_call&gt;</code>.</p> <p>A <strong>3.6-style</strong> handler unconditionally wrapped “post–last-user” assistant text in opening and closing redacted-thinking fences, <strong>plus</strong> the recombined body, and so effectively produced stacked think markup: a <strong>vacant</strong> fenced block followed by thought text that <strong>still began with</strong> another dangling <code>&lt;redacted_thinking&gt;</code> ahead of <code>&lt;tool_call&gt;</code>.</p> <p>From the model’s point of view that is dangerously close to “tool call emitted while still reasoning,” which rationalizes <strong>ignored tool XML</strong> and <strong>follow-up prose that belongs in the think block leaking into structured tool payloads</strong>.</p> <h3 id="what-qwen35-enhancedjinja-actually-does-and-does-not-do">What <code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> actually does (and does not do)</h3> <p><strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> does not repair a missing <code>&lt;/redacted_thinking&gt;</code>: there is no pass that closes a dangling opener or strips half-open think markup. 
Whatever the assistant emitted—including <code>&lt;redacted_thinking&gt;</code> with no matching close before <strong><code class="language-plaintext highlighter-rouge">tool_call</code></strong>—can still <strong>show up</strong> in the <strong>serialized prompt</strong> the <strong>causal</strong> model conditions on at the next step; “letting it be” is <strong>input-side pollution</strong> in principle whenever that text <strong>stays</strong> in the prefix.</p> <p><strong>Why the same no-fix workaround looked “fine” on Qwen 3.5:</strong> in my runs <strong>Qwen 3.5 does not really sustain</strong> a long-lived <strong>interleaved thinking</strong> block the way <strong>Qwen 3.6</strong> does—it <strong>lacks</strong> that <strong>stickier</strong> “keep thinking open across turns” behaviour. <strong>Interleaved</strong> chat templating also <strong>discards</strong> many <strong>think</strong> segments for assistant turns <strong>before</strong> the last real user message, so <strong>most</strong> of the half-open scaffold <strong>never re-enters</strong> the prefix the model sees. <strong>3.6</strong> is where that stops being a sufficient safety net, so the <strong>same</strong> “don’t repair the close” policy starts to <strong>hurt</strong> visibly (<strong>CoT bleed</strong>, ignored tools) and <strong>self-healing</strong> in <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> becomes worth the complexity.</p> <p>Earlier <strong>3.5</strong> assistant logic only wrapped output in an explicit think block when <code class="language-plaintext highlighter-rouge">reasoning_content</code> was <strong>non-empty</strong> after splitting. With <strong>no close tag</strong>, <code class="language-plaintext highlighter-rouge">reasoning_content</code> stayed blank, the template <strong>skipped an extra synthetic think envelope</strong>, and the <strong>same dirty assistant string</strong> (still containing the unclosed opener) was emitted as <strong>bare assistant content</strong>. That sometimes kept <strong><code class="language-plaintext highlighter-rouge">tool_call</code></strong> <strong>outside</strong> a <strong>second</strong> layer of scaffolding the template would have invented—helping <strong>scheduling</strong>—but it <strong>did not</strong> make the <strong>token history</strong> structurally clean. 
On the faulty <strong>3.6-on-3.5-enhanced</strong> path, the unconditional wrapper <strong>added</strong> that outer layer on top of the still-unclosed inner block, which made tool behaviour worse <strong>without</strong> fixing the underlying transcript hygiene problem.</p> <h2 id="the-fix-i-settled-on">The fix I settled on</h2> <p>I wanted <strong>deterministic repair</strong>, not another special case that might leave historic turns ending in <code>&lt;redacted_thinking&gt;</code> without a sibling close before <code>&lt;tool_call&gt;</code>:</p> <ol> <li> <p><strong>Self-healing (before splitting):</strong><br/> When both <code>&lt;tool_call&gt;</code> and <code>&lt;redacted_thinking&gt;</code> appear and the <strong>last</strong> <code>&lt;/redacted_thinking&gt;</code> sits <strong>before</strong> the <strong>last</strong> <code>&lt;redacted_thinking&gt;</code> (including the <code class="language-plaintext highlighter-rouge">-1 / missing</code> cases), inject <code>&lt;/redacted_thinking&gt;</code> immediately <strong>before</strong> the first <code>&lt;tool_call&gt;</code> when that tool call sits after the dangling opener; otherwise append <code>&lt;/redacted_thinking&gt;</code> at the end.</p> </li> <li> <p><strong>Keep the outer think wrapper unchanged</strong> afterward: splitting now sees balanced markers, extracts <code class="language-plaintext highlighter-rouge">reasoning_content</code> cleanly, and the tool payload never sits upstream of <strong>two</strong> contradictory think layers.</p> </li> </ol> <p>Roughly—the snippet lives today in <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong>; the operative structure is:</p> <div class="language-jinja highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{%- elif message.role == &quot;assistant&quot; -%}
    {%- set content = render_content(message.content, true)|trim -%}

    {# Ensure &lt;/redacted_thinking&gt; exists before tool XML when opener was left dangling #}
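    {# Worked example (hypothetical content, for illustration only): given
       &#x27;&lt;redacted_thinking&gt;plan…&lt;tool_call&gt;{…}&lt;/tool_call&gt;&#x27;,
       last_close is -1 and tool_pos &gt; last_think, so the close is injected before the tool XML:
       &#x27;&lt;redacted_thinking&gt;plan…&lt;/redacted_thinking&gt;&lt;tool_call&gt;{…}&lt;/tool_call&gt;&#x27; #}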
    {%- if &#x27;&lt;tool_call&gt;&#x27; in content and &#x27;&lt;redacted_thinking&gt;&#x27; in content -%}
        {%- set last_think = content.rfind(&#x27;&lt;redacted_thinking&gt;&#x27;) -%}
        {%- set last_close = content.rfind(&#x27;&lt;/redacted_thinking&gt;&#x27;) -%}
        {%- set tool_pos = content.find(&#x27;&lt;tool_call&gt;&#x27;) -%}
        {%- if last_close &lt; last_think or last_close == -1 -%}
            {%- if tool_pos &gt; last_think -%}
                {%- set content = content[:tool_pos] ~ &#x27;&lt;/redacted_thinking&gt;&#x27; ~ content[tool_pos:] -%}
            {%- else -%}
                {%- set content = content ~ &#x27;&lt;/redacted_thinking&gt;&#x27; -%}
            {%- endif -%}
        {%- endif -%}
    {%- endif -%}

    {%- set reasoning_content = &#x27;&#x27; -%}
    {# … existing reasoning extraction + interleaved-thinking render … #}
{%- endif -%}
</code></pre></div></div> <p>Above, tags match <strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></a></strong> as checked in; if you merge this into another fork, substitute your <strong>literal</strong> open/close think strings verbatim.</p> <p><strong>A branch I deliberately did not adopt:</strong> emitting a <strong>tail-only</strong> opening tag <code>&lt;redacted_thinking&gt;</code> immediately followed by a newline wrapper for assistant history when reasoning is blank but the turn qualifies for preservation. That mirrors the <strong><code class="language-plaintext highlighter-rouge">add_generation_prompt</code></strong> tail—which is appropriate at <strong>generation start</strong>—but is <strong>incorrect</strong> mid-conversation because it nests the next <code>&lt;tool_call&gt;</code> beneath an unfinished think scaffold.</p> <h2 id="practical-scope">Practical scope</h2> <ul> <li><strong>Surface area:</strong> the <code class="language-plaintext highlighter-rouge">assistant</code> message branch through the unchanged <code class="language-plaintext highlighter-rouge">tool</code> message handler—nothing else needed in my audits (system preamble, structured <code class="language-plaintext highlighter-rouge">tool_calls</code> serialization, trailing generation prompt untouched).</li> <li><strong>Interaction with knobs:</strong> with <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong>, <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=true</code></strong> is a safe option again—histories carry <strong>balanced</strong> fences after self-healing, so interleaved-thinking strip/keep semantics stay predictable. On bare <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> against 3.6 I still recommend <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> until you migrate.</li> </ul> <h2 id="what-stays-the-same-in-the-april-stack">What stays the same in the April stack</h2> <p><a href="/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html"><strong>The April launcher</strong></a> remains the blueprint for parsers, GPUs, MARLIN-aligned FP8, NCCL tweaks, <strong><code class="language-plaintext highlighter-rouge">--disable-custom-all-reduce</code></strong> on <strong>595.79</strong>, and <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> on 3.6. 
Point <strong><code class="language-plaintext highlighter-rouge">--chat-template</code></strong> at the local path of <strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></a></strong> (clone or copy from <a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/tree/main/chat-template">the <code class="language-plaintext highlighter-rouge">chat-template/</code> folder</a>); <strong><code class="language-plaintext highlighter-rouge">--default-chat-template-kwargs</code></strong> can then set <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> to <strong><code class="language-plaintext highlighter-rouge">true</code></strong> or <strong><code class="language-plaintext highlighter-rouge">false</code></strong> as you prefer (April’s <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> was keyed to <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> on 3.6, not to vLLM itself).</p> <p>Where I reran transcripts that previously reproduced leakage, executions <strong>scheduled</strong> reliably again and stray reasoning stopped surfacing downstream of repaired <code>&lt;tool_call&gt;</code> markers; the public trace and code live in <strong><a href="https://github.com/allanchan339/qwen36_27B_36jinja_project">qwen36_27B_36jinja_project</a></strong>. Others’ mileage will vary by checkpoint and client parsing, which is exactly why <strong>I publish both halves</strong>: parser ergonomics plus <strong>truthful templating</strong>, plus a <strong>repo you can clone</strong> when a blog post is not enough.</p> <h2 id="vllm-launch-recipe-qwen36-enhancedjinja-preserve_thinkingtrue">vLLM launch recipe (<code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code>, <code class="language-plaintext highlighter-rouge">preserve_thinking=true</code>)</h2> <p>Below is the <strong>vLLM</strong> recipe I use with <strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> and <strong><code class="language-plaintext highlighter-rouge">preserve_thinking: true</code></strong> (the pairing this post is about). <strong>I tested this configuration on vLLM v0.19.0</strong>; newer or older releases may need small flag or env tweaks. Point <strong><code class="language-plaintext highlighter-rouge">--chat-template</code></strong> at your local copy—e.g. from <a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">chat-template/qwen3.6-enhanced.jinja</code></a>. Adjust <strong><code class="language-plaintext highlighter-rouge">source …/activate</code></strong>, <strong>GPU</strong> indices, and paths for your box. 
Lines that end with <code class="language-plaintext highlighter-rouge">\</code> plus an inline <code class="language-plaintext highlighter-rouge"># …</code> can trip some shells; drop those comments after <code class="language-plaintext highlighter-rouge">\</code> if paste fails.</p> <p>On <strong>NVIDIA Studio 595.79</strong> with <strong>mixed GPUs</strong> I still needed <strong><code class="language-plaintext highlighter-rouge">--disable-custom-all-reduce</code></strong> for stability (<a href="/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html">April note</a>); it is commented here so you can enable it without hunting the flag.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
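<span class="c"># Qwen 3.6-27B-FP8 + qwen3.6-enhanced.jinja, preserve_thinking=true</span>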
<span class="c"># vLLM v0.19.0 (recipe tested on this version)</span>
<span class="c"># ------------------------------</span>
<span class="c"># Safe, Speed-Focused Env Vars</span>
<span class="c"># ------------------------------</span>
<span class="nb">export </span><span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID  <span class="c"># mixed-GPU safeguard</span>
<span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1
<span class="nb">export </span><span class="nv">NCCL_CUMEM_ENABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">VLLM_ENABLE_CUDAGRAPH_GC</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_USE_FLASHINFER_SAMPLER</span><span class="o">=</span>1

<span class="nb">export </span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>8

<span class="c"># NCCL tuning for SYS/PCIe topology</span>
<span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_SHM_DISABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
<span class="nb">export </span><span class="nv">NCCL_P2P_LEVEL</span><span class="o">=</span>LOC
<span class="nb">export </span><span class="nv">VLLM_RPC_TIMEOUT</span><span class="o">=</span>180
<span class="nb">export </span><span class="nv">VLLM_WORKER_MULTIPROC_METHOD</span><span class="o">=</span>spawn
<span class="nb">export </span><span class="nv">MODEL_NAME</span><span class="o">=</span><span class="s2">"Qwen/Qwen3.6-27B-FP8"</span>

<span class="c"># --------------------------</span>
<span class="c"># Clean stale FlashInfer cache</span>
<span class="c"># --------------------------</span>
<span class="nb">rm</span> <span class="nt">-rf</span> ~/.cache/flashinfer

<span class="c"># Activate virtual environment (change to your path)</span>
<span class="nb">source</span> /home/cychan/vLLM/.venv/bin/activate

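<span class="c"># FP8 &amp; memory: FP8_MARLIN keeps the SM89 card on W8A16 (see April post)</span>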
<span class="nb">export </span><span class="nv">VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_SLEEP_WHEN_IDLE</span><span class="o">=</span>1

vllm serve <span class="nv">$MODEL_NAME</span> <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3.5-27B <span class="se">\</span>
  <span class="nt">--chat-template</span> qwen3.6-enhanced.jinja <span class="se">\</span>
  <span class="nt">--default-chat-template-kwargs</span> <span class="s1">'{"preserve_thinking": true}'</span> <span class="se">\</span>
  <span class="nt">--attention-backend</span> FLASHINFER <span class="se">\</span>
  <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="se">\</span>
  <span class="nt">--max-model-len</span> 219520 <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.91 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
  <span class="nt">--enable-chunked-prefill</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--max-num-batched-tokens</span> 12288 <span class="se">\</span>
  <span class="nt">--max-num-seqs</span> 4 <span class="se">\</span>
  <span class="nt">--kv-cache-dtype</span> fp8 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> qwen3_coder <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3 <span class="se">\</span>
  <span class="nt">--no-use-tqdm-on-load</span> <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--language-model-only</span>
<span class="c">#  --disable-custom-all-reduce   # uncomment on Studio 595.79 + mixed GPU if you hit NCCL deadlocks (see April post)</span>

<span class="c"># Optional: Qwen3 MTP speculative decoding (needs headroom; 80B-A3B speculator not on current hardware)</span>
<span class="c">#  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' \</span>
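<span class="c"># Optional smoke test from another shell once the server is up (illustrative, not part of the run):</span>
<span class="c">#   curl -s http://localhost:8000/v1/models</span>
<span class="c">#   curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \</span>
<span class="c">#     -d '{"model": "Qwen3.5-27B", "messages": [{"role": "user", "content": "ping"}]}'</span>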
</code></pre></div></div> <h2 id="summary">Summary</h2> <p>The flawed <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> assistant branch aimed at <strong>3.6</strong>, combined with <strong>sometimes-unclosed <code>&lt;redacted_thinking&gt;</code> markers</strong>, yielded <strong>double layering</strong> after rendering: vacant synthetic think blocks atop still-open reasoning. Downstream failures looked like ignored tools and polluted tool responses—easy to confuse with NCCL deadlocks, but not the same failure.</p> <p><strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> could <strong>look</strong> less explosive on <strong>3.5</strong> partly because an <strong>empty</strong> <code class="language-plaintext highlighter-rouge">reasoning_content</code> <strong>skipped</strong> an extra synthetic wrapper—<strong>not</strong> because it <strong>healed</strong> think markup—and partly because <strong>Qwen 3.5</strong> in my experience <strong>does not keep</strong> a <strong>thinking</strong> block <strong>alive</strong> the way <strong>3.6</strong> does, so <strong>prefix pollution</strong> rarely <strong>compounds</strong>. <strong>That skip disappeared on the faulty 3.6-on-3.5 path</strong>, and <strong>3.6</strong> <strong>does</strong> sustain interleaved thinking, so <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> on the old file masked <strong>double-layer</strong> tool failures while <strong>dirty prefixes</strong> became a <strong>first-class</strong> problem.</p> <p><strong><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code></strong> uses <strong>pre-split self-healing</strong> to <strong>insert</strong> the missing close where needed so <strong><code class="language-plaintext highlighter-rouge">tool_call</code></strong> is not trapped inside an unterminated think region <strong>and</strong> the serialized history is <strong>not</strong> stuck carrying an endless “still thinking” span before the tool payload. That is what lets <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> work without the old trade-off between <strong>trace fidelity</strong> and <strong>clean conditioning</strong>. <strong>Operationally</strong>, keep parser and GPU settings from April; swap <strong><code class="language-plaintext highlighter-rouge">--chat-template</code></strong> and <strong>revisit</strong> <strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> as intended rather than forced off. 
<strong><a href="https://github.com/allanchan339/qwen36_27B_36jinja_project">qwen36_27B_36jinja_project</a></strong> is the end-to-end proof repository for this template path.</p> <h2 id="resources">Resources</h2> <ul> <li><strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.6-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code> (source)</a></strong> — <a href="https://raw.githubusercontent.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/main/chat-template/qwen3.6-enhanced.jinja">raw</a></li> <li><strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/blob/main/chat-template/qwen3.5-enhanced.jinja"><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> (source)</a></strong></li> <li><strong><a href="https://github.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix">Repository: vLLM Qwen 3 / 3.5 / 3.6 chat-template fix</a></strong> — <code class="language-plaintext highlighter-rouge">chat-template/</code></li> <li><strong><a href="https://github.com/allanchan339/qwen36_27B_36jinja_project">Proof: <code class="language-plaintext highlighter-rouge">qwen3.6-enhanced.jinja</code> agentic run — qwen36_27B_36jinja_project</a></strong></li> <li><a href="/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html">Prior field note: Qwen 3.6-27B on vLLM with <code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></a> — <a href="https://www.reddit.com/r/LocalLLM/comments/1sv6cqk/follow_up_tested_tool_calling_fixes_for_qwen/">Reddit discussion</a></li> <li><a href="https://github.com/allanchan339/qwen36_27B_own_project">April demo project (<code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> on 3.6)</a></li> </ul>]]></content><author><name></name></author><category term="bug-fixes"/><category term="vllm"/><category term="qwen"/><category term="tool-calling"/><category term="llm"/><category term="jinja"/><category term="agent"/><category term="inference"/><summary type="html"><![CDATA[Why Qwen 3.6 with qwen3.5-enhanced.jinja forced preserve_thinking=false, and how qwen3.6-enhanced.jinja restores full Qwen 3.6-series capability—self-healing think/tool boundaries, safe preserve_thinking. Launch recipe tested on vLLM v0.19.0.]]></summary></entry><entry><title type="html">Why I built this blog?</title><link href="https://allanchan339.github.io/reflection/2026/05/01/reason-blog.html" rel="alternate" type="text/html" title="Why I built this blog?"/><published>2026-05-01T21:00:00+08:00</published><updated>2026-05-01T21:00:00+08:00</updated><id>https://allanchan339.github.io/reflection/2026/05/01/reason-blog</id><content type="html" xml:base="https://allanchan339.github.io/reflection/2026/05/01/reason-blog.html"><![CDATA[<p>Why I built this blog?</p> <p>I built this website because I want one place to record my work and my growth clearly.</p> <p>I want to document four things:</p> <ul> <li>research</li> <li>journey</li> <li>bug fixes and workarounds</li> <li>reflection &amp; progress</li> </ul> <p>I have tried sharing my ideas and notes on Reddit, X, Threads, and LinkedIn. They are useful for quick updates, but they cannot fit everything I need. For example, I often need proper math display, PPT showcase, and embedded PDF rendering. 
Those are important for how I think, build, and explain technical work.</p> <p>So instead of forcing my content into platform limits, I decided to build my own website.</p> <p>This blog will mainly track my work across AI engineering, quantitative research &amp; trading infrastructure, and practical ML systems, including LLM-agent harnesses, local LLM deployment, and multi-modal models for image, video, and audio, or even combinations of them (e.g. digital humans). I am currently building end-to-end quant research workflows, and I also spend a lot of time on debugging and implementation details. I want this space to be a long-term technical notebook, not only highlights.</p> <p>If you are working on similar topics, I hope these notes can save you time, give you ideas, or start useful conversations.</p>]]></content><author><name></name></author><category term="reflection"/><category term="intro"/><category term="ai"/><category term="quant"/><summary type="html"><![CDATA[Why I built this website and what I will document here.]]></summary></entry><entry><title type="html">Qwen 3.6-27B-FP8 on vLLM: enhanced.jinja, qwen3_coder, and fixing NCCL after Studio Driver 595.79</title><link href="https://allanchan339.github.io/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html" rel="alternate" type="text/html" title="Qwen 3.6-27B-FP8 on vLLM: enhanced.jinja, qwen3_coder, and fixing NCCL after Studio Driver 595.79"/><published>2026-04-29T00:00:00+08:00</published><updated>2026-04-29T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/04/29/Qwen36-27B-tool-calling</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/04/29/Qwen36-27B-tool-calling.html"><![CDATA[<p>This post continues from <a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">my earlier notes on Qwen 3.5 tool-calling</a> and <a href="https://www.reddit.com/r/LocalLLM/comments/1sqpsut/qwen_3635ba3b_reddit_asked_so_i_tested_if_the_35/">the Qwen 3.6-35B-A3B follow-up</a>. I reused the same <code class="language-plaintext highlighter-rouge">enhanced.jinja</code> stack and ran <strong>Qwen 3.6-27B-FP8</strong> in a long unsupervised agentic session. <strong>NVIDIA Studio Driver 595.79</strong> introduced <strong>NCCL deadlocks</strong> until I added the environment and flag overrides in the sections below. After that, the run reached about <strong>180 000 tokens</strong> with no malformed tool calls. The resulting project is <a href="https://github.com/allanchan339/qwen36_27B_own_project">on GitHub</a>.</p> <p>The <code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> template <strong>requires</strong> <code class="language-plaintext highlighter-rouge">preserve_thinking=false</code> under Qwen 3.6 (a new surface-area flag). With <code class="language-plaintext highlighter-rouge">preserve_thinking=true</code>, that template breaks and tool calls fail. The rest of this note assumes the flag is set to <code class="language-plaintext highlighter-rouge">false</code>.</p> <h2 id="background-what-already-worked-on-qwen-35">Background: what already worked on Qwen 3.5</h2> <p>Earlier work on <strong>Qwen 3.5-27B</strong> and <strong>35B-A3B</strong> used an <strong>RTX 4090</strong> and <strong>RTX 3090</strong> together. 
The configuration that made long agentic runs reliable included:</p> <ul> <li><strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> — interleaved-thinking template that treats an <strong>unclosed <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> block as plain content</strong>, not reasoning-only text, so the harness still sees tool output when the model omits the closing tag (“CoT leakage”). <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> is required (and was the workable default for this path).</li> <li><strong>Streaming tool-call parsing</strong> — the template assumes tokens are parsed as they arrive so <strong><code class="language-plaintext highlighter-rouge">&lt;tool_call&gt;</code></strong> can be recognized while <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> is still open. On <strong>Qwen 3.5-27B</strong>, <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> behaved well and was the more robust option. On <strong>Qwen 3.6</strong>, <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code> did not emit tool calls</strong> in that unclosed-thinking situation; <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> did.</li> <li><strong><code class="language-plaintext highlighter-rouge">VLLM_TEST_FORCE_FP8_MARLIN=1</code></strong> — keeps the <strong>4090 (SM89)</strong> on <strong>W8A16</strong> instead of native <strong>W8A8</strong>, avoiding precision drift across the two GPUs.</li> <li><strong>NCCL tuning</strong> (<code class="language-plaintext highlighter-rouge">P2P_DISABLE</code>, <code class="language-plaintext highlighter-rouge">IB_DISABLE</code>, <code class="language-plaintext highlighter-rouge">Ring</code>) for stability on PCIe topologies.</li> </ul> <p>With that setup, <strong>Qwen 3.5-27B</strong> completed a <strong>1h 9m</strong> agentic session at <strong>138K</strong> tokens and built a <strong>FastAPI + React</strong> application without tool-calling failures.</p> <h2 id="moving-to-qwen-36-27b-and-changing-the-parser">Moving to Qwen 3.6-27B and changing the parser</h2> <p>I pointed the same server at <strong><code class="language-plaintext highlighter-rouge">Qwen/Qwen3.6-27B-FP8</code></strong>, still using <strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code></strong> and <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong>. <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong>, which was preferable on 3.5, <strong>did not trigger tool calls</strong> when <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> stayed open—the case the template is designed for—so I moved the <strong><code class="language-plaintext highlighter-rouge">--tool-call-parser</code></strong> to <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong>, which streams more aggressively and still picks up the tool call inside the unclosed block. A related fix is discussed in <a href="https://www.reddit.com/r/Vllm/comments/1suasv2/comment/oi02krw/?context=1">this vLLM thread</a>; a future <strong>vLLM 0.20.1</strong> release might let me revisit <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> for this workload.</p> <p>On <strong>driver 591.86</strong> the stack was stable. 
After moving to <strong>Studio Driver 595.79</strong>, the server began hitting <strong>NCCL deadlocks</strong>: hard freezes mid-generation that required a restart. In logs the failure showed up as <strong>NCCL timeouts</strong>, not parser errors, so it was easy to confuse with tool-calling regressions.</p> <h2 id="why-qwen3_coder-pairs-with-this-template-on-36">Why <code class="language-plaintext highlighter-rouge">qwen3_coder</code> pairs with this template on 3.6</h2> <p>Two separate behaviors interact usefully here:</p> <ul> <li><strong>Model output:</strong> Qwen 3.6 sometimes emits a <strong><code class="language-plaintext highlighter-rouge">&lt;tool_call&gt;</code></strong> before closing <strong><code class="language-plaintext highlighter-rouge">&lt;/thinking&gt;</code></strong>. The enhanced template leaves that tool call in <strong>plain content</strong> so downstream code can still parse it.</li> <li><strong>Parser behavior:</strong> <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> is tuned for code-like streams and detects tool-call patterns <strong>mid-stream</strong> even when the XML framing is incomplete: it does not need a fully closed <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> section and will fire on <strong><code class="language-plaintext highlighter-rouge">&lt;tool_call&gt;</code></strong>. <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> behaves more like strict XML; an unclosed <code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code> can block it from surfacing the nested <strong><code class="language-plaintext highlighter-rouge">&lt;tool_call&gt;</code></strong>.</li> </ul> <p>Together, that yields <strong>more resilient extraction</strong> for this template on 3.6 than either piece alone. Neither behavior is ideal in isolation; combined they amount to a <strong>production-viable</strong> setup. I therefore keep <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> for <strong>Qwen 3.6</strong> with <strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code></strong>, while <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> remains the better fit for <strong>3.5-27B</strong> on my hardware. If this interpretation disagrees with how the maintainers model the parsers, I am glad to be corrected.</p> <h2 id="driver-59579-nccl-and-vllm-all-reduce">Driver 595.79, NCCL, and vLLM all-reduce</h2> <p>I had been on <strong>591.86</strong> with acceptable behavior. <strong>595.79</strong> introduced <strong>NCCL deadlocks</strong> that froze generation. My working hypothesis is that the newer driver tightens <strong>NCCL</strong> behavior on <strong>mixed-GPU PCIe</strong> topologies enough to break vLLM’s <strong>custom all-reduce</strong> path. I did not roll back the driver; instead I applied:</p> <ol> <li><strong>Additional environment variables:</strong> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">NCCL_SHM_DISABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">NCCL_P2P_LEVEL</span><span class="o">=</span>LOC          <span class="c"># restrict P2P to local GPUs only</span>
<span class="nb">export </span><span class="nv">VLLM_RPC_TIMEOUT</span><span class="o">=</span>180        <span class="c"># prevent premature RPC timeouts</span>
<span class="nb">export </span><span class="nv">VLLM_WORKER_MULTIPROC_METHOD</span><span class="o">=</span>spawn  <span class="c"># more robust worker lifecycle</span>
</code></pre></div> </div> </li> <li><strong><code class="language-plaintext highlighter-rouge">--disable-custom-all-reduce</code></strong> on <strong><code class="language-plaintext highlighter-rouge">vllm serve</code></strong>, forcing <strong>native NCCL</strong> all-reduce instead of vLLM’s custom path on this <strong>PCIe-only</strong> topology.</li> </ol> <p>Without these settings on <strong>595.79</strong>, I saw <strong>intermittent deadlocks</strong> that resembled tool-calling failures in the UI but were not parser issues.</p> <h2 id="long-run-180k-tokens">Long run: 180K tokens</h2> <p>With the driver and NCCL changes in place, I gave <strong>Qwen 3.6-27B</strong> ownership of a directory and a <strong>10 000-token</strong> budget per step, without manual steering.</p> <table> <thead> <tr> <th>Prompt</th> <th>Wall time</th> <th>Accumulated tokens</th> </tr> </thead> <tbody> <tr> <td>“Welcome to life, you are Qwen 3.6-27B. Full leadership. What project do you want to build?”</td> <td>0s</td> <td>0k</td> </tr> <tr> <td>“Don’t ask me – you have full leadership. 10k token budget.” <em>(model used a Question tool to clarify, then proceeded)</em></td> <td>31s</td> <td>14.0k</td> </tr> <tr> <td>“Did you check if this is bug-free? It’s your own project.”</td> <td>17m 13s</td> <td>63.3k</td> </tr> <tr> <td>“Deliver the first possible functional upgrade. Do it nicely.”</td> <td>11m 35s</td> <td>126.7k</td> </tr> <tr> <td><em>(session ended naturally)</em></td> <td>10m 46s</td> <td><strong>180.0k</strong></td> </tr> </tbody> </table> <p>The model built a <strong>React + Vite + TypeScript</strong> front end with a <strong>FastAPI</strong> backend, revised it after critical feedback, and shipped a further upgrade. I did not observe a <strong>malformed tool call</strong> in that trace. Code: <a href="https://github.com/allanchan339/qwen36_27B_own_project">qwen36_27B_own_project</a>.</p> <h2 id="launch-script">Launch script</h2> <p>The script below is the same one I published in the discussion thread. Lines that end with <code class="language-plaintext highlighter-rouge">\</code> followed by an inline <code class="language-plaintext highlighter-rouge">#</code> comment can confuse some shells; if paste fails, drop the comments after <code class="language-plaintext highlighter-rouge">\</code> and keep the flags.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># -------------------------------------------------</span>
<span class="c"># Qwen 3.6-27B-FP8 – Agentic-Ready vLLM Launch Script</span>
<span class="c"># Tested: 180K tokens, zero tool-calling failures</span>
<span class="c"># Driver: NVIDIA Studio 595.79</span>
<span class="c"># -------------------------------------------------</span>

<span class="c"># ---- Safe, Speed-Focused Env Vars ----</span>
<span class="nb">export </span><span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID
<span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1
<span class="nb">export </span><span class="nv">NCCL_CUMEM_ENABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">VLLM_ENABLE_CUDAGRAPH_GC</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_USE_FLASHINFER_SAMPLER</span><span class="o">=</span>1

<span class="nb">export </span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>8

<span class="c"># ---- NCCL Tuning for SYS/PCIe Topology ----</span>
<span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_SHM_DISABLE</span><span class="o">=</span>0          <span class="c"># NEW for driver 595.79</span>
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
<span class="nb">export </span><span class="nv">NCCL_P2P_LEVEL</span><span class="o">=</span>LOC          <span class="c"># NEW for driver 595.79</span>

<span class="c"># ---- vLLM Stability (Driver-Dependent) ----</span>
<span class="nb">export </span><span class="nv">VLLM_RPC_TIMEOUT</span><span class="o">=</span>180                  <span class="c"># NEW</span>
<span class="nb">export </span><span class="nv">VLLM_WORKER_MULTIPROC_METHOD</span><span class="o">=</span>spawn    <span class="c"># NEW</span>

<span class="c"># ---- FP8 &amp; Memory ----</span>
<span class="nb">export </span><span class="nv">VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_SLEEP_WHEN_IDLE</span><span class="o">=</span>1

<span class="c"># Clean stale FlashInfer cache</span>
<span class="nb">rm</span> <span class="nt">-rf</span> ~/.cache/flashinfer

<span class="c"># Activate environment</span>
<span class="nb">source</span> /home/cychan/vLLM/.venv/bin/activate

<span class="c"># MANDATORY: keep preserve_thinking false; the enhanced jinja breaks when it is true</span>
<span class="c"># REQUIRED: qwen3_coder for Qwen 3.6 with enhanced.jinja; on Qwen 3.5-27B qwen3_xml also works</span>
<span class="c"># (see https://www.reddit.com/r/Vllm/comments/1suasv2/)</span>
vllm serve Qwen/Qwen3.6-27B-FP8 <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3.5-27B <span class="se">\</span>
  <span class="nt">--chat-template</span> qwen3.5-enhanced.jinja <span class="se">\</span>
  <span class="nt">--default-chat-template-kwargs</span> <span class="s1">'{"preserve_thinking": false}'</span> <span class="se">\</span>
  <span class="nt">--attention-backend</span> FLASHINFER <span class="se">\</span>
  <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="se">\</span>
  <span class="nt">--max-model-len</span> 219520 <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.91 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
  <span class="nt">--enable-chunked-prefill</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--max-num-batched-tokens</span> 12288 <span class="se">\</span>
  <span class="nt">--max-num-seqs</span> 4 <span class="se">\</span>
  <span class="nt">--kv-cache-dtype</span> fp8 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> qwen3_coder <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3 <span class="se">\</span>
  <span class="nt">--no-use-tqdm-on-load</span> <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--language-model-only</span> <span class="se">\</span>
  <span class="nt">--disable-custom-all-reduce</span>            <span class="c"># CRITICAL for driver 595.79</span>
</code></pre></div></div> <h2 id="summary">Summary</h2> <p>The <strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code></strong> path needs a <strong>streaming</strong> tool parser that still fires when <strong><code class="language-plaintext highlighter-rouge">&lt;thinking&gt;</code></strong> is left open. On <strong>Qwen 3.5-27B</strong>, <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> met that requirement and stayed the more robust option in my tests (<a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">detail</a>). On <strong>Qwen 3.6</strong>, <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> missed tool calls in that pattern, so <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> was necessary.</p> <p><strong><code class="language-plaintext highlighter-rouge">preserve_thinking</code></strong> must stay <strong><code class="language-plaintext highlighter-rouge">false</code></strong> for <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> on Qwen 3.6; <strong><code class="language-plaintext highlighter-rouge">true</code></strong> is incompatible with that template in my setup.</p> <p><strong>591.86 → 595.79</strong> introduced <strong>NCCL deadlocks</strong> on the mixed <strong>4090/3090</strong> box. Mitigations were <strong><code class="language-plaintext highlighter-rouge">NCCL_SHM_DISABLE=0</code></strong>, <strong><code class="language-plaintext highlighter-rouge">NCCL_P2P_LEVEL=LOC</code></strong>, <strong><code class="language-plaintext highlighter-rouge">VLLM_RPC_TIMEOUT=180</code></strong>, <strong><code class="language-plaintext highlighter-rouge">VLLM_WORKER_MULTIPROC_METHOD=spawn</code></strong>, and <strong><code class="language-plaintext highlighter-rouge">--disable-custom-all-reduce</code></strong>. Without them, deadlocks can look like tool failures.</p> <p><strong><code class="language-plaintext highlighter-rouge">VLLM_TEST_FORCE_FP8_MARLIN=1</code></strong> and <strong>NCCL</strong> tuning remain mandatory for mixed FP8 ranks; the driver upgrade added the extra variables above.</p> <p><strong>Qwen 3.6-27B</strong> is a dense <strong>27B</strong> checkpoint that, on the public agentic-coding figures I cite, beats <strong>Qwen 3.5-397B-A17B</strong> (<strong>SWE-bench Verified</strong> 77.2 vs 76.2, <strong>Pro</strong> 53.5 vs 50.9, <strong>SkillsBench</strong> 48.2 vs 30.0)—a larger step than a minor revision.</p> <p>In one continuous session the stack sustained about <strong>180K</strong> tokens with <strong>no tool-calling errors</strong> and roughly <strong>10 minutes</strong> of uninterrupted agentic use on consumer GPUs, which matches what I wanted from a production-minded local setup.</p> <p>The same template works on <strong>Qwen 3.5</strong> and <strong>3.6</strong> if <strong><code class="language-plaintext highlighter-rouge">preserve_thinking=false</code></strong> and the parser matches the model generation. With the <strong>595.79</strong> workarounds, I get stable <strong>180K-token</strong> agentic runs. 
Step-by-step background and earlier tuning notes are in the <a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">Qwen 3.5 deep-dive</a>.</p> <h2 id="resources">Resources</h2> <ul> <li><a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">Qwen 3.5 tool-calling deep-dive</a></li> <li><a href="https://www.reddit.com/r/Vllm/comments/1suasv2/">Parser behaviour: Qwen 3.5 vs 3.6</a></li> <li><a href="https://github.com/allanchan339/qwen36_27B_own_project">Qwen 3.6-27B project repository</a></li> <li><a href="https://github.com/allanchan339/vLLM-Qwen3.5-27B">vLLM / Qwen config repository</a></li> </ul>]]></content><author><name></name></author><category term="bug-fixes"/><category term="vllm"/><category term="qwen"/><category term="tool-calling"/><category term="llm"/><category term="agent"/><category term="inference"/><summary type="html"><![CDATA[Same qwen3.5-enhanced.jinja and mixed-GPU stack as earlier Qwen 3.5 notes; switching to qwen3_coder for 3.6, mandatory preserve_thinking=false, and NCCL overrides that stopped deadlocks on NVIDIA Studio 595.79—plus a 180k-token agentic run.]]></summary></entry><entry><title type="html">Findings: Karpathy-style autoresearch on a crypto backtester (local LLM)</title><link href="https://allanchan339.github.io/research/2026/04/24/MVP-LLM-alpha-mining.html" rel="alternate" type="text/html" title="Findings: Karpathy-style autoresearch on a crypto backtester (local LLM)"/><published>2026-04-24T00:00:00+08:00</published><updated>2026-04-24T00:00:00+08:00</updated><id>https://allanchan339.github.io/research/2026/04/24/MVP-LLM-alpha-mining</id><content type="html" xml:base="https://allanchan339.github.io/research/2026/04/24/MVP-LLM-alpha-mining.html"><![CDATA[<p>The thread I started from was whether <strong>Karpathy-style autoresearch</strong>—an LLM in a tight <strong>read / act / evaluate</strong> loop—could apply to <strong>quant mining</strong> on my own stack: <strong>local</strong> inference, <strong>crypto</strong> data, a <strong>custom backtester</strong>, no paid API for the search itself. I connected a <strong>Qwen 3.5</strong> agent to that backtester and DB, let it run for <strong>~2 hours</strong> and <strong>30+</strong> strategy cycles with <strong>no per-step prompting</strong>, and watched it <strong>self-learn</strong> the repo, burn through bad ideas, and land <strong>one</strong> configuration that cleared my gates (<strong>$0</strong> inference bill for that grind). It was <strong>not</strong> a fully closed loop: when it stalled in weak local optima, I used <strong>human-in-the-loop</strong> nudges (short instructions, ideas from notes or papers) and it folded those into the next iterations without restarting the harness.</p> <p>The point of this note is the <strong>trail</strong>: what blocked, what the loop did, and what I take away for the next run. 
Nothing here claims the strategy generalizes out-of-sample.</p> <p><strong>Recording</strong> (<a href="https://youtu.be/aEvj0SiU6WI">YouTube</a>)—captures how the autonomous loop behaved on my stack:</p> <div style="position:relative;padding-bottom:56.25%;height:0;overflow:hidden;max-width:100%;margin:1rem 0;"> <iframe style="position:absolute;top:0;left:0;width:100%;height:100%;border:0;" src="https://www.youtube.com/embed/aEvj0SiU6WI" title="Karpathy-style autoresearch — local LLM on crypto backtester" loading="lazy" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe> </div> <h2 id="what-i-was-testing">What I was testing</h2> <p><strong>Hypothesis:</strong> The same <strong>generate → backtest → gate → decide</strong> skeleton Karpathy describes for research automation is enough here for the model to learn my strategy API from the codebase, iterate without per-step prompting, and eventually hit <strong>strict</strong> metrics (Sharpe, max drawdown, profit factor, minimum trades).</p> <p><strong>What I wanted out of the experiment:</strong> A <strong>working skeleton</strong>—stable tools, objective discard, bounded wall clock—not a proof that the first passing parameter set is economic edge. I also wanted to see whether <strong>human-in-the-loop</strong> could stay <strong>lightweight</strong>: occasional steering instead of babysitting every iteration.</p> <h2 id="setup-fixed-for-the-run">Setup (fixed for the run)</h2> <table> <thead> <tr> <th>Component</th> <th>Detail</th> </tr> </thead> <tbody> <tr> <td><strong>Model</strong></td> <td>Qwen3.5-27B via vLLM</td> </tr> <tr> <td><strong>GPUs</strong></td> <td>Mixed setup (RTX 4090 + 3090)</td> </tr> <tr> <td><strong>Inference</strong></td> <td>vLLM with FP8, custom Jinja template, <code class="language-plaintext highlighter-rouge">qwen3_xml</code> parser</td> </tr> <tr> <td><strong>Runtime</strong></td> <td>~2 hours; autonomous iterations, <strong>human-in-the-loop</strong> when stuck</td> </tr> <tr> <td><strong>Iterations</strong></td> <td>~30+ strategy cycles</td> </tr> <tr> <td><strong>API cost</strong></td> <td>$0</td> </tr> </tbody> </table> <ul> <li><strong>Data:</strong> Crypto spot, 1m bars; TimescaleDB-HA for continuous aggregation.</li> <li><strong>Backtester:</strong> Custom DB + engine (Nautilus-based).</li> <li><strong>Harness:</strong> Ralph loop + Claude Code as execution harness.</li> </ul> <h2 id="finding-tool-calling-was-the-real-prerequisite">Finding: tool calling was the real prerequisite</h2> <p>Before any autoresearch, Qwen 3.5 in agent mode was <strong>unreliable</strong> for me: premature stops, mid-thought tool calls, format drift—unacceptable for multi-hour loops.</p> <p><strong>What eventually worked:</strong> A <strong>custom M2.5-style Jinja template</strong>, the <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code> parser</strong>, and <strong>precision alignment</strong> across the mixed GPU pair. Calendar time on this was on the order of <strong>weeks</strong>.</p> <p>Reference I kept for myself while debugging: <a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">Qwen 3.5 27B/35B tool calling on vLLM</a>. 
Without that stack, I would not rerun the experiment and expect the same stability.</p> <h2 id="procedure-what-i-actually-ran">Procedure (what I actually ran)</h2> <p>Each iteration:</p> <ol> <li><strong>Generate</strong> strategy code + YAML config.</li> <li><strong>Backtest</strong> immediately.</li> <li><strong>Score</strong> against the gates.</li> <li><strong>Decide:</strong> abandon, fix error, or change approach.</li> </ol> <p><strong>Observation:</strong> The useful part is not the four bullets in isolation—it is that the same interface and discard semantics repeat every cycle so the run has a <strong>trajectory</strong> instead of isolated guesses.</p> <h2 id="observations-from-the-long-run">Observations from the long run</h2> <p><strong>Bootstrap.</strong> The agent searched the repo, read existing strategies, matched the template, and produced a <code class="language-plaintext highlighter-rouge">run_backtest()</code>-compatible scaffold before leaning on novel ideas.</p> <p><strong>Regime coverage.</strong> Under my high-level brief it systematically tried <strong>momentum</strong> (EMA crosses, breakouts, ATR stops) and <strong>mean reversion</strong> (RSI, Bollinger, range fades). Most candidates failed fast on the gates; WandB showed <strong>DISCARD</strong> with clear numeric causes (Sharpe, drawdown, trade count).</p> <p><strong>Adaptation.</strong> When returns were weak but non-terrible, it <strong>tuned parameters</strong> (EMA lengths, sizing, vol filters). It also <strong>moved the backtest start date</strong> on its own to get more history.</p> <p><strong>Finding on the date move:</strong> That behavior is a <strong>leak / overfitting lever</strong> if left unrestricted. My takeaway is to <strong>freeze</strong> evaluation windows and walk-forward rules in the harness so the model cannot silently widen the training corridor.</p> <p><strong>Grind.</strong> Over ~2 hours it cycled EMA stacks (10/20, 5/15, 50/200), RSI + confirmation, MACD, 4h Bollinger reversion, vol-adjusted trend with jump detection, grid-style scalps, fixed R/R templates, long/short EMA divergence—<strong>dozens</strong> of variants, all discarded on the numbers until late in the run.</p> <p><strong>Eligibility.</strong> After <strong>~30+</strong> iterations, <strong>one</strong> configuration cleared all gates; I stopped there for this MVP.</p> <h2 id="human-in-the-loop-what-actually-happened">Human-in-the-loop (what actually happened)</h2> <p>The loop was <strong>mostly hands-off</strong>: I did not craft each strategy or click through each backtest. The agent ran the <strong>generate → backtest → gate → decide</strong> cycle on its own until the trajectory looked <strong>stuck</strong>—same family of tweaks, marginal metrics, no path to the gates.</p> <p>When I intervened, the input was <strong>small and textual</strong>: one-line briefs like <em>volatility-adjusted position sizing</em> or <em>dual trend filters</em>, or a concept lifted from a paper or my own notes. The agent <strong>ingested</strong> that and <strong>reprioritized</strong> the next hypotheses without me editing the harness or resetting state.</p> <p><strong>Finding:</strong> For this run, <strong>human-in-the-loop</strong> was the escape hatch from <strong>local optima</strong>, not a substitute for the loop. 
The economics still felt like automation: hours of machine time between nudges, and <strong>$0</strong> marginal API cost for the search itself.</p> <h2 id="comparative-note-ga-why-i-contrast-it">Comparative note: GA (why I contrast it)</h2> <p>I keep <strong>genetic algorithms</strong> in mind as a baseline: genome encode, <strong>mutation / crossover</strong>, batch backtest, selection—<strong>no</strong> explicit operator that reads a failure and chooses among <em>abandon</em>, <em>patch</em>, or <em>new hypothesis</em>; steering is emergent from selection.</p> <p><strong>Finding for my setting (multi-constraint gates, not one fitness scalar):</strong></p> <ul> <li><strong>Cost:</strong> Generations imply <strong>batches</strong> of backtests; many individuals are obviously bad but consume slots until selection removes them.</li> <li><strong>Completion:</strong> A run often ends without any individual that <strong>simultaneously</strong> satisfies Sharpe, drawdown, trade count, etc.; “best so far” still fails a gate.</li> </ul> <p><strong>Contrast I recorded:</strong> The LLM loop ends each step with a <strong>decision conditioned on the backtest and repo context</strong>. That does not remove overfitting, but in this run it gave a <strong>bounded</strong> path to one gate-passing config without a large per-generation fan-out.</p> <h2 id="findings-i-am-carrying-forward">Findings I am carrying forward</h2> <ol> <li><strong>Integration over single-strategy hype.</strong> The durable artifact is <strong>tool-stable loop + backtester + gates + harness</strong>, not the first passing parameter set.</li> <li><strong>Autonomous date-window edits are a policy bug</strong> unless intentionally allowed; they read as implicit curve fitting in my book.</li> <li><strong>Discard speed matters.</strong> The model moved on immediately on bad metrics; that matched what I want from mining hygiene.</li> <li><strong>Reuse worked.</strong> It recombined ideas already present in the repo and notes (e.g. vol-adjusted trend with jump suppression).</li> <li><strong>Economics.</strong> ~2h local inference, <strong>$0</strong> API line item—material for how long I am willing to let a search run.</li> <li><strong>Human-in-the-loop scales the search.</strong> Mid-run nudges broke local optima; the agent absorbed paper- or note-level hints without harness churn. I am keeping that pattern: <strong>autonomous bulk + sparse human steering</strong>, not full manual mining.</li> <li><strong>Diversity risk.</strong> With the <strong>same</strong> harness, prompts, and data, an LLM driver can still <strong>converge</strong> to the same or nearly the same strategy across runs—inductive bias and “default” repairs narrow the trajectory. 
I am treating <strong>perturbation</strong> (sampling, seeds, varied briefs / inits, parallel nudges, small prompt jitter) as <strong>part of the method</strong>, not an optional polish.</li> </ol> <h2 id="limitations-and-planned-follow-ups">Limitations and planned follow-ups</h2> <ul> <li><strong>Eligibility is not alpha:</strong> Clearing the backtest gates is not, by itself, a claim of edge out-of-sample.</li> <li><strong>Next checks:</strong> Lock windows, walk-forward or holdout, and use “passes gates once” as a <strong>regression signal for the harness</strong>, not as validation of edge.</li> </ul>]]></content><author><name></name></author><category term="research"/><category term="llm"/><category term="quant"/><category term="backtesting"/><category term="vllm"/><category term="qwen"/><category term="automation"/><category term="crypto"/><category term="local-inference"/><summary type="html"><![CDATA[Local Qwen 3.5 autoresearch on my crypto DB + Nautilus-style backtester (~2h, 30+ iter, $0 API): tool-calling blocker, run observations, human-in-the-loop steering, GA contrast, diversity, gates.]]></summary></entry><entry><title type="html">Qwen 3.6 35B-A3B on vLLM: do the Qwen 3.5 tool-calling fixes carry over?</title><link href="https://allanchan339.github.io/bug-fixes/2026/04/20/Qwen36-35B-A3B-tool-calling.html" rel="alternate" type="text/html" title="Qwen 3.6 35B-A3B on vLLM: do the Qwen 3.5 tool-calling fixes carry over?"/><published>2026-04-20T00:00:00+08:00</published><updated>2026-04-20T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/04/20/Qwen36-35B-A3B-tool-calling</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/04/20/Qwen36-35B-A3B-tool-calling.html"><![CDATA[<p>This note follows the earlier <a href="https://www.reddit.com/r/vLLM/comments/1skks8n/">Qwen 3.5 / vLLM discussion</a>. After weeks spent stabilizing <strong>Qwen 3.5-27B/35B</strong> for agentic use—the same <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong> parser, <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong> template, and GPU-side tuning—readers kept asking whether <strong>Qwen 3.6</strong> behaved the same.</p> <p><strong>Short answer:</strong> the <strong>same configuration remains the most stable option</strong> on <strong>Qwen3.6-35B-A3B-FP8</strong>, but compared with <strong>Qwen3.5-27B</strong> the newer checkpoint is <strong>more prone to reasoning loops</strong> and to <strong>malformed tool calls</strong> that <strong>interrupt</strong> long agent runs. This post documents what still works, what I measured in three controlled runs, a <strong>partial mitigation</strong> on the client side, and why <strong>Qwen3.5-27B-FP8</strong> is still <strong>my</strong> default for <strong>reliability-first</strong> agents.</p> <h1 id="what-still-carries-over-from-the-35-setup">What still carries over from the 3.5 setup</h1> <h2 id="qwen3_xml-tool-call-parser"><code class="language-plaintext highlighter-rouge">qwen3_xml</code> tool-call parser</h2> <p>The registry-backed parser continues to handle <strong>complex tool arguments</strong> without the corruption seen with regex-oriented paths. 
<strong>Official documentation still recommends <code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong>; for this workload the recommendation remains <strong>not</strong> to use it for demanding agentic traces.</p> <h2 id="qwen35-enhancedjinja-chat-template"><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code> chat template</h2> <p>The <strong>interleaved thinking</strong> template still applies to <strong>3.6 35B-A3B</strong>: correct <strong><code class="language-plaintext highlighter-rouge">&lt;/thinking&gt;</code></strong> boundaries and <strong>clean tool-call framing</strong> relative to the stock template.</p> <h2 id="mixed-gpu-precision-alignment">Mixed-GPU precision alignment</h2> <p><strong>RTX 4090 (SM89)</strong> prefers <strong>W8A8</strong> paths; <strong>RTX 3090 (SM80)</strong> falls back to <strong>W8A16</strong>. <strong><code class="language-plaintext highlighter-rouge">VLLM_TEST_FORCE_FP8_MARLIN=1</code></strong> still forces both ranks onto a <strong>matched</strong> effective precision. <strong>Without it, long conversations drift</strong>—the same failure mode as on 3.5.</p> <h2 id="nccl-tuning">NCCL tuning</h2> <p>Unchanged for this <strong>mixed consumer topology</strong>:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
</code></pre></div></div> <h1 id="agentic-test-protocol">Agentic test protocol</h1> <p>Each <strong>trial</strong> used the <strong>same prompt shape</strong>: grant <strong>full ownership of the working folder</strong>, ask the model to <strong>build a full-stack project</strong> (frontend and backend), and cap planning around a <strong>~10k-token budget</strong> for the initial instruction. The goal was to observe <strong>how long the stack survives</strong> before <strong>tool-calling or format errors</strong> end the session.</p> <h1 id="three-runs-same-hardware-different-vllm-templateparser-pairs">Three runs (same hardware, different vLLM template/parser pairs)</h1> <h2 id="run-1-enhancedjinja--qwen3_xml-best-observed-config-file-on-disk-qwen35-enhancedjinja">Run 1: <code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code> (best observed config; file on disk: <code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code>)</h2> <p>The model chose to build <strong>oss-inspect</strong>—an <strong>autonomous codebase quality analysis</strong> tool (project name as generated by the model).</p> <table> <thead> <tr> <th>Prompt / phase</th> <th>Accumulated tokens</th> </tr> </thead> <tbody> <tr> <td>Project setup</td> <td>13.9k</td> </tr> <tr> <td>“Did you check if this is bug free? This is your own project.”</td> <td>135.1k</td> </tr> <tr> <td>DCP sweep auto-triggered</td> <td>107.0k</td> </tr> <tr> <td>“Fix it then”</td> <td>110.0k</td> </tr> <tr> <td><strong>Session ended</strong> — improper tool calling</td> <td>111.1k</td> </tr> </tbody> </table> <p>This configuration <strong>lasted the longest</strong> (<strong>~13m 20s</strong> wall time to failure). It reached <strong>~130k+ tokens</strong> in the productive phase before <strong>improper tool calling</strong> terminated the run. After a <strong>DCP sweep</strong> at <strong>135k</strong> (tokens reported <strong>dropped to 107k</strong> in the log), the session <strong>continued</strong> until the final failure at <strong>111.1k</strong> on the last line—<strong>improper tool calling</strong> again.</p> <p><strong>Baseline comparison:</strong> <strong>Qwen3.5-27B</strong> with the <strong>same</strong> <code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code> pairing <strong>routinely exceeds 130k tokens without that class of interruption</strong> in <strong>my</strong> runs.</p> <h2 id="run-2-officialjinja--qwen3_coder">Run 2: <code class="language-plaintext highlighter-rouge">official.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_coder</code></h2> <p>The model proposed a <strong>knowledge-graph platform</strong> oriented toward <strong>Graphify</strong>-style skill ingestion—the ingestion behavior was <strong>aggressive</strong> relative to expectations.</p> <p><strong>Time to failure: 6m 32s</strong> — <strong>improper tool calling</strong>. 
Too early to trust for <strong>long-horizon</strong> agentic work.</p> <h2 id="run-3-officialjinja--qwen3_xml">Run 3: <code class="language-plaintext highlighter-rouge">official.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code></h2> <p>The model proposed <strong>TaskFlow</strong>—a <strong>Kanban</strong> app with <strong>authentication</strong>, <strong>drag-and-drop</strong> tasks, and a <strong>polished UI</strong>.</p> <p><strong>Time to failure: 1m 16s</strong> — <strong>malformed tool calls emitted inside the thinking block</strong>. Again <strong>unsuitable</strong> for reliable agents.</p> <h3 id="note-on-generated-stacks">Note on generated stacks</h3> <p>For the <strong>concrete frameworks and libraries</strong> the model selected in these runs, I did <strong>not</strong> have prior familiarity; observations are about <strong>tool protocol stability</strong>, not stack-specific code review.</p> <h2 id="cross-run-comparison">Cross-run comparison</h2> <table> <thead> <tr> <th>Configuration</th> <th>Survival (approx.)</th> <th>Failure mode</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code></td> <td>~111k tokens (~13m 20s)</td> <td>Improper tool calling (session died)</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">official.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_coder</code></td> <td>6m 32s</td> <td>Improper tool calling</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">official.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code></td> <td>~1m 16s</td> <td>Malformed tool calls inside thinking box</td> </tr> </tbody> </table> <p><strong>Summary:</strong> even the <strong>best</strong> 3.6 configuration <strong>fails more often</strong> than <strong>Qwen3.5-27B</strong> under the same harness. <strong>Qwen3.5-27B</strong> remains <strong>more stable for agentic</strong> use in these tests, <strong>despite slower TTFT</strong> on 3.5.</p> <h1 id="behaviors-that-look-specific-to-qwen36-35b-a3b">Behaviors that look specific to Qwen3.6-35B-A3B</h1> <h2 id="1-more-frequent-reasoning-loops">1. More frequent reasoning loops</h2> <p>The model <strong>revisits the same analysis step</strong> repeatedly, <strong>burning tokens</strong> before advancing. This reads as a <strong>checkpoint behavior</strong> change, not a template bug: <strong>Qwen3.5-27B</strong> showed the pattern <strong>occasionally</strong>; on <strong>3.6 35B-A3B</strong> it is <strong>common enough</strong> to <strong>hurt long sessions</strong>.</p> <h2 id="2-malformed-tool-calls-despite-a-correct-wire-format">2. Malformed tool calls despite a “correct” wire format</h2> <p>With <strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong>—the pair that <strong>works cleanly</strong> on <strong>3.5-27B</strong>—<strong>3.6 35B-A3B</strong> still produces <strong>malformed tool calls</strong> at <strong>higher frequency</strong>. 
The <strong>XML shape</strong> can remain <strong>technically valid</strong> when it succeeds; the problem is <strong>how often</strong> failures occur and that <strong>one bad turn</strong> can <strong>abort</strong> a run with <strong>no recovery</strong>.</p> <p>On <strong>3.5-27B</strong>, after template fixes, <strong>bad tool turns</strong> are a <strong>rare</strong> edge case. On <strong>3.6 35B-A3B</strong>, they are <strong>regular enough</strong> that <strong>any</strong> long agentic session <strong>eventually</strong> hits them, <strong>independent of which</strong> of the tested template/parser combinations is selected.</p> <h1 id="partial-mitigation-opencode-1418">Partial mitigation: OpenCode 1.4.18</h1> <p><strong>OpenCode 1.4.18</strong> reduced <strong>client-side</strong> tool friction. <strong>Older OpenCode</strong> versions had <strong>tool-calling bugs</strong> that <strong>amplified</strong> failures—especially around the <strong><code class="language-plaintext highlighter-rouge">question</code></strong> tool. <strong>Upgrading to 1.4.18</strong> addressed <strong>that</strong> class of <strong>malformed tool call</strong> interaction.</p> <p><strong>Limitation:</strong> the client upgrade <strong>does not remove</strong> <strong>reasoning loops</strong> or the <strong>model-level</strong> <strong>higher baseline failure rate</strong> on <strong>3.6</strong>. The remaining issues sit primarily in the <strong>model</strong> (and possibly <strong>thinking-state handling</strong>—e.g. <strong>preserved thinking</strong>—but that is <strong>a hypothesis</strong>, not a confirmed root cause).</p> <h1 id="reference-environment-and-vllm-serve-command">Reference environment and <code class="language-plaintext highlighter-rouge">vllm serve</code> command</h1> <p><strong>Software pinned for the run:</strong></p> <ul> <li><strong>vLLM:</strong> 0.19.1</li> <li><strong>Transformers:</strong> 5.5.4</li> <li><strong>CUDA:</strong> 12.8.1 (<strong>nvcc</strong> 12.8.93)</li> </ul> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID
<span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1
<span class="nb">export </span><span class="nv">NCCL_CUMEM_ENABLE</span><span class="o">=</span>0
<span class="nb">export </span><span class="nv">VLLM_ENABLE_CUDAGRAPH_GC</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_USE_FLASHINFER_SAMPLER</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>4
<span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
<span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">VLLM_SLEEP_WHEN_IDLE</span><span class="o">=</span>1

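<span class="c"># assumption on intent: drop FlashInfer's JIT kernel cache so kernels compiled</span>
<span class="c"># against an earlier build are never reused after an upgrade</span>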
<span class="nb">rm</span> <span class="nt">-rf</span> ~/.cache/flashinfer

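<span class="c"># template + parser below is the Run 1 pairing, the longest-surviving config here</span>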
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3.6-35B-A3B <span class="se">\</span>
  <span class="nt">--chat-template</span> qwen3.5-enhanced.jinja <span class="se">\</span>
  <span class="nt">--attention-backend</span> FLASHINFER <span class="se">\</span>
  <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="se">\</span>
  <span class="nt">--max-model-len</span> 200000 <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.91 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
  <span class="nt">--enable-chunked-prefill</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--max-num-batched-tokens</span> 12288 <span class="se">\</span>
  <span class="nt">--max-num-seqs</span> 4 <span class="se">\</span>
  <span class="nt">--kv-cache-dtype</span> fp8 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> qwen3_xml <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3 <span class="se">\</span>
  <span class="nt">--no-use-tqdm-on-load</span> <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--language-model-only</span>
</code></pre></div></div> <h1 id="conclusions">Conclusions</h1> <ul> <li><strong><code class="language-plaintext highlighter-rouge">enhanced.jinja</code> + <code class="language-plaintext highlighter-rouge">qwen3_xml</code> + OpenCode 1.4.18</strong> is still the <strong>strongest combination tested</strong> on <strong>Qwen3.6-35B-A3B</strong>, but it <strong>does not match</strong> <strong>Qwen3.5-27B</strong> on <strong>looping</strong> or <strong>long-run tool reliability</strong>.</li> <li>It is <strong>not obvious</strong> why <strong>tool regressions</strong> reappear on <strong>3.6 35B-A3B</strong> when many <strong>3.5 fixes</strong> carry over; <strong>preserved thinking</strong> is one <strong>speculative</strong> lever worth tracking if <strong>Qwen</strong> ships <strong>flash</strong> or <strong>template</strong> updates aimed at agents.</li> <li><strong>Qwen3.5-27B</strong>, <strong>Qwen3.5-35B-A3B</strong>, and <strong>Qwen3.6-35B-A3B</strong> share the <strong>same official chat template</strong> in distribution—if <strong>Qwen3.6 Flash</strong> (or similar) launches with <strong>different</strong> templating, that may indicate <strong>intentional</strong> handling of <strong>tool/thinking</strong> edge cases.</li> </ul> <p><strong>Operational choice:</strong> I <strong>default to <code class="language-plaintext highlighter-rouge">Qwen3.5-27B-FP8</code></strong> for <strong>agentic obedience</strong>—<strong>instruction following</strong>, <strong>clean tool execution</strong>, and <strong>low loop rate</strong>. <strong>Qwen3.6-35B-A3B</strong> offers <strong>much faster TTFT</strong> and <strong>similar headline capability</strong> to <strong>Qwen3.5-27B</strong> on <strong>AA-style</strong> benchmarks, but in these runs it <strong>trades</strong> that for <strong>loops</strong> and <strong>tool failures</strong> that <strong>terminate long sessions</strong>. For <strong>agent</strong> work, I <strong>prioritize reliability</strong> over <strong>raw benchmark scores</strong>.</p>]]></content><author><name></name></author><category term="bug-fixes"/><category term="vllm"/><category term="qwen"/><category term="tool-calling"/><category term="llm"/><category term="agent"/><category term="inference"/><summary type="html"><![CDATA[Follow-up testing: same qwen3_xml parser, qwen3.5-enhanced.jinja template, and mixed-GPU tuning as Qwen 3.5-27B—plus three agentic runs comparing official vs enhanced configs on Qwen3.6-35B-A3B-FP8.]]></summary></entry><entry><title type="html">Claude Code with local vLLM: client validation, model aliases, and a working settings.json</title><link href="https://allanchan339.github.io/bug-fixes/2026/04/19/Claude-code-vLLM.html" rel="alternate" type="text/html" title="Claude Code with local vLLM: client validation, model aliases, and a working settings.json"/><published>2026-04-19T00:00:00+08:00</published><updated>2026-04-19T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/04/19/Claude-code-vLLM</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/04/19/Claude-code-vLLM.html"><![CDATA[<p><strong>Straight story:</strong> I wanted Claude Code to talk to <strong>my own</strong> model on <strong>vLLM</strong>, not to Anthropic’s hosted API. Tutorials usually say: set <code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code> and <code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code>. 
<strong>That was not enough.</strong> The CLI applies <strong>its own checks</strong> and can fail with “issue with the selected model” <strong>before</strong> meaningful traffic hits your server. The fix is a small set of aligned settings: <strong>tier aliases</strong> (<code class="language-plaintext highlighter-rouge">"model": "sonnet"</code> + <code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_*_MODEL</code>), a <strong>root</strong> base URL (no extra <code class="language-plaintext highlighter-rouge">/v1</code>), <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1</code></strong>, and a <strong>dummy</strong> <code class="language-plaintext highlighter-rouge">ANTHROPIC_AUTH_TOKEN</code> so vLLM still gets a header.</p> <p><strong>Why I care (and maybe you do too):</strong> with <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> pointed at <strong>local vLLM</strong> and a <strong>placeholder</strong> token, <strong>model traffic does not use the real Claude API</strong>—no Anthropic API key or inference billing for that path. That matters if you <strong>cannot register</strong>, are <strong>out of region</strong>, or <strong>cannot obtain API access</strong>, but still want the <strong>Claude Code</strong> loop against a model you control. (You still install and run <strong>Claude Code</strong>; this is about <strong>where completions are served</strong>, not a different product.)</p> <p><strong>vLLM / Qwen:</strong> Tooling and template notes for Qwen 3.5 on vLLM are in this <a href="https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/">vLLM / Qwen 3.5 thread</a>. Below assumes <strong>vLLM is already up</strong> and passes a simple <code class="language-plaintext highlighter-rouge">curl</code> check.</p> <h1 id="baseline-vllm-responds-claude-code-does-not-yet">Baseline: vLLM responds, Claude Code does not (yet)</h1> <p>I run <strong>Qwen 3.5-27B</strong> behind vLLM. Direct HTTP calls succeed:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://127.0.0.1:8000/v1/chat/completions <span class="nt">-X</span> POST <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"model":"Qwen3.5-27B","messages":[{"role":"user","content":"test"}]}'</span>
<span class="c"># Works</span>
</code></pre></div></div> <p>So I expected a quick env change. Instead I iterated through docs and issues, then grepped <strong><code class="language-plaintext highlighter-rouge">cli.js</code></strong> to see why validation fired.</p> <h1 id="the-trap-anthropic_custom_model_option">The trap: <code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code></h1> <h2 id="what-the-official-docs-suggest">What the official docs suggest</h2> <p>The <a href="https://docs.anthropic.com/en/docs/claude-code/model-config">Claude Code model configuration docs</a> describe <code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code> as a way to add a custom entry to the <code class="language-plaintext highlighter-rouge">/model</code> picker and imply relaxed handling for that id.</p> <p>I tried:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"ANTHROPIC_CUSTOM_MODEL_OPTION"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p><strong>Observed error:</strong> <code class="language-plaintext highlighter-rouge">There's an issue with the selected model (Qwen3.5-27B). It may not exist or you may not have access to it.</code></p> <h2 id="what-actually-happens">What actually happens</h2> <p>The variable <strong>does</strong> add a picker entry, but it <strong>does not</strong> reliably bypass validation when you drive the CLI via <strong><code class="language-plaintext highlighter-rouge">--model</code></strong>, <strong><code class="language-plaintext highlighter-rouge">settings.json</code></strong>, or similar. In practice you still hit the same guardrails unless you adopt the alias + env pattern later in this note.</p> <p>This behavior shows up in community threads—for example GitHub issues <strong>#18025</strong>, <strong>#23266</strong>, and <strong>#34821</strong>—while the product docs have not caught up.</p> <p><strong>Takeaway:</strong> when the documented env var does not match runtime behavior, the implementation (not the blog post) is the source of truth.</p> <h1 id="what-i-learned-from-clijs">What I learned from <code class="language-plaintext highlighter-rouge">cli.js</code></h1> <p>I stopped relying on tutorials and searched the installed <strong><code class="language-plaintext highlighter-rouge">cli.js</code></strong> (on the order of <strong>~50k</strong> lines, minified) for the error string:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="nt">-n</span> <span class="s2">"There's an issue with the selected model"</span> ~/.nvm/versions/node/<span class="k">*</span>/lib/node_modules/@anthropic-ai/claude-code/cli.js
</code></pre></div></div> <p>The hit landed near <strong>line 5146</strong>. The logic, paraphrased from the minified source, is:</p> <div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if </span><span class="p">(</span><span class="nx">q</span> <span class="k">instanceof</span> <span class="nx">AnthropicError</span> <span class="o">&amp;&amp;</span> <span class="nx">q</span><span class="p">.</span><span class="nx">status</span> <span class="o">===</span> <span class="mi">404</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// Reject custom models on 404</span>
  <span class="k">return</span> <span class="p">{</span>
    <span class="na">content</span><span class="p">:</span> <span class="s2">`There's an issue with the selected model (</span><span class="p">${</span><span class="nx">K</span><span class="p">}</span><span class="s2">). 
              It may not exist or you may not have access to it.`</span><span class="p">,</span>
    <span class="na">error</span><span class="p">:</span> <span class="dl">"</span><span class="s2">invalid_request</span><span class="dl">"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>So the CLI issues <strong>validation-style requests</strong>, gets <strong>404</strong> responses when the id is not on Anthropic’s expected list, and <strong>returns the “selected model” error before</strong> the path you care about (your vLLM <strong><code class="language-plaintext highlighter-rouge">/v1/messages</code></strong> traffic) is exercised normally.</p> <p>That is <strong>client-side validation</strong>, not “your server returned 404 on chat.”</p> <h2 id="the-undocumented-lever-that-matters">The undocumented lever that matters</h2> <p>Experimenting with env vars, <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1</code></strong> consistently reduced the failure mode where the client keeps probing endpoints that will never acknowledge a local model id. I did not find this called out in the same place as the high-level “custom model” docs; it is nonetheless <strong>necessary</strong> for a stable loop in my setup.</p> <h1 id="working-claudesettingsjson-tested-here-not-copy-pasted-blind">Working <code class="language-plaintext highlighter-rouge">~/.claude/settings.json</code> (tested here, not copy-pasted blind)</h1> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sonnet"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_AUTH_TOKEN"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dummy"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_DEFAULT_OPUS_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_DEFAULT_SONNET_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_DEFAULT_HAIKU_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"API_TIMEOUT_MS"</span><span class="p">:</span><span class="w"> </span><span class="s2">"3000000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CLAUDE_CODE_ATTRIBUTION_HEADER"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <h2 id="the-settings-that-must-agree-if-any-drift-you-get-confusing-errors">The settings that must agree (if any drift, you get confusing errors)</h2> <table> <thead> <tr> <th>Setting</th> <th>Why it matters</th> <th>Typical failure if wrong</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">"model": "sonnet"</code> <strong>and</strong> <code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_SONNET_MODEL</code></td> <td>Claude resolves the <strong>alias</strong> “sonnet” to your real vLLM id; putting the <strong>custom id</strong> directly in <code class="language-plaintext highlighter-rouge">"model"</code> triggers list validation</td> <td>“Issue with the selected model”</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code> is <strong><code class="language-plaintext highlighter-rouge">http://127.0.0.1:8000</code></strong> (no <code class="language-plaintext highlighter-rouge">/v1</code>)</td> <td>The client appends <strong><code class="language-plaintext highlighter-rouge">/v1/messages</code></strong> itself; a base URL that already ends in <code class="language-plaintext highlighter-rouge">/v1</code> becomes <strong><code class="language-plaintext highlighter-rouge">/v1/v1/messages</code></strong></td> <td><strong>404</strong> on API calls</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code>: <code class="language-plaintext highlighter-rouge">"1"</code></td> <td>Cuts non-essential / validation traffic that assumes Anthropic-hosted models</td> <td><strong>Intermittent</strong> validation failures</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">ANTHROPIC_AUTH_TOKEN</code> (e.g. <code class="language-plaintext highlighter-rouge">"dummy"</code>) plus aligned <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_*_MODEL</code></strong> for <strong>Opus / Sonnet / Haiku</strong></td> <td>vLLM still expects an Authorization-shaped header; mapping <strong>all three</strong> tiers to the same served name avoids internal tier switches pointing at invalid ids</td> <td>Auth or “wrong model” surprises when the CLI switches tier</td> </tr> </tbody> </table> <h2 id="vllm-side-must-match-the-json-exactly">vLLM side (must match the JSON exactly)</h2> <ul> <li><strong><code class="language-plaintext highlighter-rouge">--served-model-name Qwen3.5-27B</code></strong> must match the strings in <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_*_MODEL</code></strong> <strong>character for character</strong>.</li> <li>Avoid <strong><code class="language-plaintext highlighter-rouge">/</code></strong> in the served name if your settings use a flat id (a <strong><code class="language-plaintext highlighter-rouge">Qwen/...</code></strong> vs <strong><code class="language-plaintext highlighter-rouge">Qwen3.5-27B</code></strong> mismatch broke one of my attempts).</li> <li>Server should listen where <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> points (here <strong><code class="language-plaintext highlighter-rouge">8000</code></strong>).</li> </ul> <h2 id="smoke-test">Smoke test</h2> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>claude <span class="s2">"test"</span>
<span class="c"># Expect a normal assistant reply, e.g. readiness to help.</span>
</code></pre></div></div> <p>If this fails, reconcile the table above <strong>in order</strong> before chasing unrelated flags.</p> <h1 id="debugging-sequence-short">Debugging sequence (short)</h1> <p>If something below matches your error, fix that first; the full <strong><code class="language-plaintext highlighter-rouge">settings.json</code></strong> block is the target state.</p> <p><strong>Attempt 1 — vLLM-style base URL with <code class="language-plaintext highlighter-rouge">/v1</code></strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000/v1"</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> <code class="language-plaintext highlighter-rouge">API Error: 404</code> — the client adds <code class="language-plaintext highlighter-rouge">/v1/messages</code> again.</p> <hr/> <p><strong>Attempt 2 — custom id in <code class="language-plaintext highlighter-rouge">"model"</code> (GitHub #18025-style reports)</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> <code class="language-plaintext highlighter-rouge">There's an issue with the selected model</code> — no alias mapping; Anthropic list validation wins.</p> <hr/> <p><strong>Attempt 3 — slash in <code class="language-plaintext highlighter-rouge">--served-model-name</code> vs settings</strong></p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--served-model-name</span> Qwen/Qwen3.5-27B
</code></pre></div></div> <p>vs settings expecting <code class="language-plaintext highlighter-rouge">Qwen3.5-27B</code> without <code class="language-plaintext highlighter-rouge">/</code>.</p> <p><strong>Error:</strong> model not found / mismatch.</p> <hr/> <p><strong>Attempt 4 — <code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code> only (official wording)</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"ANTHROPIC_CUSTOM_MODEL_OPTION"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> still validation errors — picker entry ≠ full bypass for <strong><code class="language-plaintext highlighter-rouge">settings.json</code></strong> flows.</p> <hr/> <p><strong>Attempt 5 — <code class="language-plaintext highlighter-rouge">ANTHROPIC_API_KEY</code> instead of token</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"ANTHROPIC_API_KEY"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dummy"</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> authentication friction — <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_AUTH_TOKEN</code></strong> behaved better with vLLM in my tests.</p> <hr/> <p><strong>Attempt 6 — correct URL and aliases but no traffic / validation flag</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span><span class="w"> </span><span class="err">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</span><span class="w"> </span><span class="err">omitted</span><span class="w">
</span></code></pre></div></div> <p><strong>Error:</strong> <strong>intermittent</strong> validation failures — sometimes works, sometimes not.</p> <hr/> <p><strong>Attempt 7 — minimal working core (before I added timeout / attribution / all three tiers)</strong></p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sonnet"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"ANTHROPIC_BASE_URL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:8000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_AUTH_TOKEN"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dummy"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"ANTHROPIC_DEFAULT_SONNET_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p><strong>Result:</strong> <strong>stable enough to proceed</strong>; I then expanded to the full block at the top (Opus/Haiku defaults, long <strong><code class="language-plaintext highlighter-rouge">API_TIMEOUT_MS</code></strong>, <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_ATTRIBUTION_HEADER</code></strong>) for day-to-day use.</p> <h1 id="why-the-alias--base-url--flag-pattern-works">Why the alias + base URL + flag pattern works</h1> <h2 id="model-tiers-and-aliases">Model tiers and aliases</h2> <p>Claude Code still thinks in <strong>Opus / Sonnet / Haiku</strong> tiers. If <strong><code class="language-plaintext highlighter-rouge">"model": "sonnet"</code></strong>, the runtime resolves that label via <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_SONNET_MODEL</code></strong>. If you put <strong><code class="language-plaintext highlighter-rouge">"model": "Qwen3.5-27B"</code></strong> directly, the CLI tries to treat it like an Anthropic-hosted id and <strong>fails validation</strong>.</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sonnet"</span><span class="w">
</span><span class="nl">"ANTHROPIC_DEFAULT_SONNET_MODEL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Qwen3.5-27B"</span><span class="w">
</span></code></pre></div></div> <h2 id="url-construction">URL construction</h2> <p>The client builds:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ANTHROPIC_BASE_URL}/v1/messages
</code></pre></div></div> <p>So <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL=http://127.0.0.1:8000/v1</code></strong> becomes:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://127.0.0.1:8000/v1/v1/messages
</code></pre></div></div> <p>which <strong>404s</strong>. The base should stop at the host (and port), e.g. <strong><code class="language-plaintext highlighter-rouge">http://127.0.0.1:8000</code></strong>.</p> <h2 id="claude_code_disable_nonessential_traffic"><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code></h2> <p>With <strong><code class="language-plaintext highlighter-rouge">=1</code></strong>, the CLI skips <strong>some</strong> checks and ancillary calls that assume Anthropic’s catalog. For <strong>local</strong> ids, those checks are exactly where <strong>404 → “invalid model”</strong> loops come from. <strong>Without</strong> the flag I still saw <strong>sporadic</strong> failures even when aliases and URLs were otherwise correct.</p> <h1 id="common-errors-quick-map">Common errors (quick map)</h1> <table> <thead> <tr> <th>Symptom</th> <th>Likely cause</th> </tr> </thead> <tbody> <tr> <td>“There’s an issue with the selected model”</td> <td>Custom string in <strong><code class="language-plaintext highlighter-rouge">"model"</code></strong> without alias mapping</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">API Error: 404</code></td> <td><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> includes <strong><code class="language-plaintext highlighter-rouge">/v1</code></strong></td> </tr> <tr> <td>Model not found</td> <td><strong><code class="language-plaintext highlighter-rouge">--served-model-name</code></strong> does not match JSON, or contains <strong><code class="language-plaintext highlighter-rouge">/</code></strong> when settings do not</td> </tr> <tr> <td>Intermittent validation</td> <td>Missing <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code></strong></td> </tr> <tr> <td>Relying on <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code> alone</strong></td> <td>Docs oversell bypass; <strong>picker ≠ full CLI bypass</strong></td> </tr> </tbody> </table> <h1 id="pre-flight-checklist">Pre-flight checklist</h1> <p>Before invoking <code class="language-plaintext highlighter-rouge">claude</code>:</p> <ul> <li><strong><code class="language-plaintext highlighter-rouge">"model"</code></strong> is an <strong>alias</strong> such as <strong><code class="language-plaintext highlighter-rouge">sonnet</code></strong>, not the vLLM id.</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_SONNET_MODEL</code></strong> (and siblings if you use tier changes) points at the served name.</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> ends at <strong><code class="language-plaintext highlighter-rouge">...:8000</code></strong> with <strong>no</strong> trailing <strong><code class="language-plaintext highlighter-rouge">/v1</code></strong>.</li> <li><strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code></strong> is <strong><code class="language-plaintext highlighter-rouge">"1"</code></strong>.</li> <li><strong><code class="language-plaintext highlighter-rouge">--served-model-name</code></strong> matches the JSON <strong>exactly</strong> (no stray <strong><code class="language-plaintext highlighter-rouge">/</code></strong>).</li> <li>vLLM is up and reachable at that host/port.</li> <li>Do <strong>not</strong> treat <strong><code class="language-plaintext 
highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code></strong> as sufficient on its own.</li> </ul> <h1 id="summary">Summary</h1> <ol> <li><strong>Accessibility:</strong> Pointing Claude Code at <strong>local vLLM</strong> means <strong>Anthropic API access is not required for the model layer</strong>—useful when you <strong>cannot register</strong>, <strong>cannot get API keys</strong>, or want <strong>zero</strong> hosted inference spend. You still use the CLI; completions hit <strong>your</strong> server.</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code></strong> alone did <strong>not</strong> match what I needed; treat tier <strong>aliases</strong> + env as the real fix.</li> <li><strong><code class="language-plaintext highlighter-rouge">"model": "sonnet"</code></strong> (or another tier label) plus <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_DEFAULT_*_MODEL</code></strong> → your <strong><code class="language-plaintext highlighter-rouge">Qwen3.5-27B</code></strong> (or served name).</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_BASE_URL</code></strong> stops at <strong><code class="language-plaintext highlighter-rouge">http://host:port</code></strong>; the client adds <strong><code class="language-plaintext highlighter-rouge">/v1/messages</code></strong>.</li> <li><strong><code class="language-plaintext highlighter-rouge">--served-model-name</code></strong> matches those env strings <strong>exactly</strong> (watch <strong><code class="language-plaintext highlighter-rouge">/</code></strong> in ids).</li> <li><strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1</code></strong> avoids <strong>intermittent</strong> validation against Anthropic’s catalog.</li> <li><strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_AUTH_TOKEN</code></strong> (e.g. <strong><code class="language-plaintext highlighter-rouge">dummy</code></strong>) worked better than <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_API_KEY</code></strong> with my vLLM.</li> <li>When docs and <strong><code class="language-plaintext highlighter-rouge">cli.js</code></strong> disagree, the bundle wins; most bad copy-pastes omit <strong>one</strong> of alias mapping, base URL shape, or the traffic flag.</li> </ol> <h1 id="resources">Resources</h1> <ul> <li><a href="https://github.com/allanchan339/ForgeBookAuto/blob/main/docs/claude-code-third-party-models.md">ForgeBookAuto — Claude Code third-party models (quick reference)</a></li> <li><a href="https://docs.bigmodel.cn/cn/coding-plan/tool/claude">BigModel docs — coding plan / Claude (working third-party pattern)</a></li> <li><a href="https://docs.vllm.ai/en/latest/serving/integrations/claude_code/">vLLM docs — Claude Code integration</a> (useful but incomplete versus real client behavior)</li> <li>Related GitHub issues: <strong>#18025</strong>, <strong>#23266</strong>, <strong>#34821</strong></li> </ul> <p>If you want <strong>Claude Code’s workflow</strong> without <strong>Claude API</strong> inference, start from the <strong><code class="language-plaintext highlighter-rouge">settings.json</code></strong> block and checklist: <strong>aliases</strong>, <strong>root base URL</strong>, <strong><code class="language-plaintext highlighter-rouge">CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</code></strong>, then align vLLM’s served name. 
That order saves time versus chasing <strong><code class="language-plaintext highlighter-rouge">ANTHROPIC_CUSTOM_MODEL_OPTION</code></strong> alone.</p>]]></content><author><name></name></author><category term="bug-fixes"/><category term="claude-code"/><category term="vllm"/><category term="llm"/><category term="local-inference"/><category term="anthropic-api"/><summary type="html"><![CDATA[Run Claude Code against local vLLM without Anthropic API access: why common env-only recipes fail, the alias + settings.json pattern that works, and when this matters if you cannot register or use the Claude API.]]></summary></entry><entry><title type="html">Stable tool calling for Qwen 3.5 27B/35B on vLLM: template, parser, and mixed-GPU fixes</title><link href="https://allanchan339.github.io/bug-fixes/2026/04/13/Qwen35-tool-calling.html" rel="alternate" type="text/html" title="Stable tool calling for Qwen 3.5 27B/35B on vLLM: template, parser, and mixed-GPU fixes"/><published>2026-04-13T00:00:00+08:00</published><updated>2026-04-13T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2026/04/13/Qwen35-tool-calling</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2026/04/13/Qwen35-tool-calling.html"><![CDATA[<p>Public write-ups on Qwen 3.5 often emphasize reasoning quality and slow time-to-first-token (TTFT). The reasoning capability on <strong>Qwen 3.5-27B</strong> is genuinely strong, and yes, TTFT is slow—but for <strong>agentic</strong> workloads <strong>tool calling</strong> is frequently what actually breaks: malformed XML, mid-stream stops, and long-context format drift are not always visible in short demos. This note comes from roughly <strong>a month</strong> of running that model on a <strong>mixed-GPU</strong> workstation (<strong>RTX 4090 + 3090</strong>), plus <strong>many hours</strong> debugging failed runs and reading <strong>vLLM</strong> source. The same patterns apply to <strong>27B/35B-class</strong> checkpoints (including <strong>A3B</strong>-style variants where instruction-following is comparable). The resulting configuration has stayed stable in production (<strong>weeks</strong> of use after the initial fix pass). Official model cards describe the happy path; they understate <strong>edge cases that smaller models</strong> hit more often than <strong>122B+</strong> checkpoints.</p> <h1 id="1-chat-template-qwen35_officialjinja-and-smaller-models">1. Chat template: <code class="language-plaintext highlighter-rouge">qwen3.5_official.jinja</code> and smaller models</h1> <h2 id="symptoms">Symptoms</h2> <p>The run started from the official <code class="language-plaintext highlighter-rouge">qwen3.5_official.jinja</code> template. For the first handful of tool turns, output looked fine. 
Then failures clustered:</p> <ul> <li>Tool calls appeared <strong>mid-thought</strong>, including closing <strong><code class="language-plaintext highlighter-rouge">&lt;/redacted_thinking&gt;</code></strong> without having opened a matching <strong><code class="language-plaintext highlighter-rouge">&lt;redacted_thinking&gt;</code></strong> tag.</li> <li><strong>Premature stops</strong> in the middle of XML tool calls—for example, the model produced a line like “Let me do that for you:” and then <strong>stopped</strong> without finishing the tool payload.</li> <li><strong>Historical thinking blocks leaked into context</strong>, so later turns saw polluted reasoning boundaries and inconsistent tool formatting.</li> </ul> <p>At first the usual suspects were operator error, a possible <strong>vLLM</strong> bug, or <strong>heterogeneous GPUs</strong>. Instrumentation and template experiments showed the <strong>chat template</strong> was the actual root cause.</p> <h2 id="cause">Cause</h2> <p>The official template contains <strong>edge cases that 122B+ models tend to absorb but 27B/35B models do not</strong>. Smaller checkpoints have <strong>less robust instruction following</strong>; the same ambiguity around where “thinking” ends and tool XML begins produces <strong>silent parser-level failures</strong> rather than self-correction.</p> <h2 id="fix">Fix</h2> <p>The working approach was a custom <strong>M2.5-style interleaved thinking</strong> template (<code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code>) that:</p> <ul> <li>Closes <strong><code class="language-plaintext highlighter-rouge">&lt;/thinking&gt;</code></strong> (the structured thinking segment) <strong>before</strong> tool calls, not after—so tool XML is not interleaved in a way that confuses the runtime.</li> <li><strong>Hides historical reasoning</strong> from the context the model sees on subsequent turns while keeping <strong>current</strong> reasoning visible where needed.</li> <li>Uses XML-shaped tool output that <strong>does not accidentally trigger</strong> <code class="language-plaintext highlighter-rouge">&lt;stop&gt;</code>-style termination patterns.</li> <li><strong>Handles the edge cases</strong> that smaller models struggle with when the stock template leaves boundaries implicit.</li> </ul> <p><strong>vLLM does not auto-detect</strong> the right chat template for this stack. You must pass the Jinja file explicitly; otherwise the default template remains in force and instability tends to persist <strong>regardless of other optimizations</strong>:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--chat-template</span> qwen3.5-enhanced.jinja
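<span class="c"># in context (sketch; model id and remaining flags come from your own serve command):</span>
<span class="c"># vllm serve Qwen/Qwen3.5-27B-FP8 --chat-template qwen3.5-enhanced.jinja ...</span>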
</code></pre></div></div> <h1 id="2-tool-call-parser-qwen3_coder-vs-qwen3_xml">2. Tool-call parser: <code class="language-plaintext highlighter-rouge">qwen3_coder</code> vs <code class="language-plaintext highlighter-rouge">qwen3_xml</code></h1> <h2 id="official-guidance">Official guidance</h2> <p>The <a href="https://huggingface.co/Qwen/Qwen3.5-27B-FP8">Qwen3.5-27B-FP8 Hugging Face page</a> recommends:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--tool-call-parser</span> qwen3_coder
</code></pre></div></div> <h2 id="what-went-wrong-in-practice">What went wrong in practice</h2> <p>For <strong>complex tool calls</strong> and <strong>long-context agentic</strong> runs (on the order of <strong>50K+ tokens</strong> in a trace), <code class="language-plaintext highlighter-rouge">qwen3_coder</code> was a primary source of breakage. A lot of wall-clock time went into proving that the parser—not only the model—was responsible.</p> <p>From reading <strong>vLLM’s implementation</strong>, the distinction is structural:</p> <table> <thead> <tr> <th>Parser</th> <th>How it works</th> <th>Special characters (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, <code class="language-plaintext highlighter-rouge">&amp;</code>)</th> <th>Nested JSON while streaming</th> <th>Malformed XML</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">qwen3_coder</code></td> <td>Regex string extraction</td> <td>Breaks pattern matching</td> <td>Often corrupts mid-stream</td> <td>Fails hard</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">qwen3_xml</code></td> <td>C-based parser (<code class="language-plaintext highlighter-rouge">xml.parsers.expat</code>)</td> <td>Auto-sanitizes</td> <td>Deferred / safer parsing</td> <td>Often auto-heals</td> </tr> </tbody> </table> <p><strong>Concrete example:</strong> a tool argument that contains code such as <code class="language-plaintext highlighter-rouge">if (a &lt; b)</code> breaks <code class="language-plaintext highlighter-rouge">qwen3_coder</code> because <code class="language-plaintext highlighter-rouge">&lt;</code> and <code class="language-plaintext highlighter-rouge">&gt;</code> interfere with regex-based extraction. <code class="language-plaintext highlighter-rouge">qwen3_xml</code> treats the stream as XML-shaped text under a real parser and <strong>does not depend on that fragile string match</strong>.</p> <h2 id="fix-1">Fix</h2> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--tool-call-parser</span> qwen3_xml
</code></pre></div></div> <p>This <strong>contradicts the one-line official recommendation</strong> on the model card. The preference for <code class="language-plaintext highlighter-rouge">qwen3_xml</code> here is grounded in <strong>vLLM source inspection</strong> (not only empirical trial-and-error): the <strong>C-based XML path</strong> is fundamentally more robust than regex extraction for messy, nested, or streaming tool payloads.</p> <h1 id="3-mixed-gpu-precision-drift-4090--3090">3. Mixed-GPU precision drift (4090 + 3090)</h1> <h2 id="problem">Problem</h2> <p>Tensor parallelism splits matrix multiplications across devices. In this setup:</p> <ul> <li><strong>RTX 4090 (SM89)</strong> exposes <strong>native FP8</strong> <strong>W8A8</strong> tensor-core paths.</li> <li><strong>RTX 3090 (SM80)</strong> has <strong>no native FP8</strong> and falls back to <strong>W8A16</strong>.</li> </ul> <p>So <strong>different ranks use different precision</strong>: <strong>W8A8</strong> on one GPU and <strong>W8A16</strong> on the other → <strong>mismatched partial products</strong> → <strong>error accumulation</strong> over depth and sequence length.</p> <h2 id="symptoms-1">Symptoms</h2> <p>Beyond roughly <strong>30–40K tokens</strong>, conversations <strong>drifted</strong>: tool calls grew inconsistent and reasoning quality degraded, consistent with numerical divergence rather than a single bad sampling draw.</p> <h2 id="fix-2">Fix</h2> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1
</code></pre></div></div> <p>This forces the 4090 onto <strong>W8A16</strong> (via the Marlin path) so it <strong>matches the 3090</strong> instead of using native <strong>W8A8</strong> alone. Both ranks then share the same effective precision, which removed the long-run drift in this configuration.</p> <p><strong>NCCL tuning</strong> for stability on this <strong>mixed consumer topology</strong> (helpful in practice alongside the precision alignment):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
</code></pre></div></div> <h1 id="4-checkpoint-choice-sft-distilled-variants-eg-qwopus35-vs-official-weights">4. Checkpoint choice: SFT-distilled variants (e.g. Qwopus3.5) vs official weights</h1> <h2 id="the-failure-mode">The failure mode</h2> <p>Checkpoints such as <strong><code class="language-plaintext highlighter-rouge">QuantTrio/Qwopus3.5-27B-v3-AWQ</code></strong> are <strong>SFT-distilled from Claude 4.6 Opus</strong>. They can look excellent initially:</p> <ul> <li>For the <strong>first ~65K tokens</strong>, tool calling stayed stable.</li> <li><strong>After ~65K+ tokens</strong>, output began <strong>mixing XML tool format with JSON-style</strong> tool messages.</li> </ul> <p>This branch of debugging <strong>cost the most calendar time</strong> before the hypothesis “wrong checkpoint for the protocol” was confirmed.</p> <h2 id="cause-1">Cause</h2> <p><strong>SFT</strong> shifted the <strong>surface</strong> tool format toward a <strong>Hermes-style JSON</strong> tool protocol to align with <strong>Claude-like</strong> training targets, but it does <strong>not fully realign</strong> the underlying token distribution with the base <strong>Qwen XML</strong> tool priors. In long context, the model <strong>drifts between</strong> the original <strong>Qwen <code class="language-plaintext highlighter-rouge">qwen3_xml</code> shape</strong> and the SFT’d <strong>JSON</strong> shape—something post-processing cannot reliably paper over.</p> <h2 id="what-to-run-instead">What to run instead</h2> <p><strong>If you have about 48 GB VRAM</strong> (best quality in this comparison):</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Qwen/Qwen3.5-27B-FP8
</code></pre></div></div> <ul> <li><strong>Near-lossless</strong> accuracy relative to denser formats.</li> <li><strong>Full ~219K context</strong> support as advertised for the stack used here.</li> <li><strong>Stable tool calling</strong> when paired with the <strong>custom template</strong> above.</li> </ul> <p><strong>If VRAM is below ~48 GB</strong> (accept some accuracy loss):</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Intel/Qwen3.5-27B-int4-AutoRound
</code></pre></div></div> <ul> <li>Saves on the order of <strong>~4 GB VRAM</strong> versus FP8 in this class of setup.</li> <li>Remains <strong>stable with the same custom template</strong> in testing.</li> <li><strong>Higher perplexity than FP8</strong>; <strong>INT4 is not lossless</strong>.</li> </ul> <p><strong>FP8 quantization is near-lossless</strong> in practice for this model line: avoid dropping to INT4 <strong>unless</strong> VRAM truly forces it.</p> <h1 id="reference-vllm-serve-configuration">Reference <code class="language-plaintext highlighter-rouge">vllm serve</code> configuration</h1> <p>After consolidation in an <strong>independent repo</strong> and roughly <strong>three days</strong> of <strong>production-style</strong> use, the following <strong>environment</strong> and <strong>serve</strong> command matched the stable behavior described above:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Environment variables</span>
<span class="nb">export </span><span class="nv">CUDA_DEVICE_ORDER</span><span class="o">=</span>PCI_BUS_ID
<span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0,1
<span class="nb">export </span><span class="nv">NCCL_P2P_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_IB_DISABLE</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">NCCL_ALGO</span><span class="o">=</span>Ring
<span class="nb">export </span><span class="nv">VLLM_TEST_FORCE_FP8_MARLIN</span><span class="o">=</span>1

<span class="c"># vLLM serve command</span>
vllm serve Qwen/Qwen3.5-27B-FP8 <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3.5-27B <span class="se">\</span>
  <span class="nt">--chat-template</span> qwen3.5-enhanced.jinja <span class="se">\</span>
  <span class="nt">--attention-backend</span> FLASHINFER <span class="se">\</span>
  <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="se">\</span>
  <span class="nt">--max-model-len</span> 219520 <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.92 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
  <span class="nt">--enable-chunked-prefill</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--max-num-batched-tokens</span> 4096 <span class="se">\</span>
  <span class="nt">--max-num-seqs</span> 4 <span class="se">\</span>
  <span class="nt">--kv-cache-dtype</span> fp8 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> qwen3_xml <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3 <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--language-model-only</span>
</code></pre></div></div> <h1 id="validation">Validation</h1> <p>One <strong>continuous agentic session</strong> lasted about <strong>1h 9m</strong> on this configuration:</p> <ul> <li><strong>138.2K tokens</strong> generated.</li> <li><strong>Stable tool calling</strong> throughout—<strong>no XML/JSON format drift</strong> tied to parser or template collapse.</li> <li><strong>M2.5-style interleaved thinking</strong> remained coherent across the run.</li> <li>The model <strong>autonomously</strong> implemented a <strong>production-oriented knowledge-graph platform</strong> (<strong>FastAPI + React</strong>); <strong>18 minutes</strong> of that session were <strong>uninterrupted</strong> end-to-end work <strong>without tool-calling failures</strong>—the sort of reliability that matters for real agents rather than toy demos.</li> </ul> <p>The stack has also remained <strong>stable for weeks</strong> after that validation period in <strong>my</strong> deployment. As always, numbers are <strong>tied to one hardware topology</strong> and one workload mix; they are included as <strong>evidence</strong>, not a universal benchmark.</p> <h1 id="summary">Summary</h1> <ol> <li><strong>Jinja template</strong>: For <strong>27B/35B-class</strong> Qwen 3.5, the <strong>custom interleaved-thinking template</strong> is <strong>critical</strong>; the <strong>official template</strong> leaves <strong>edge cases</strong> that <strong>smaller models</strong> hit routinely.</li> <li><strong>Parser</strong>: Do not treat <strong><code class="language-plaintext highlighter-rouge">qwen3_coder</code></strong> as mandatory for agentic work; <strong><code class="language-plaintext highlighter-rouge">qwen3_xml</code></strong>’s <strong>Expat-based</strong> path is <strong>more robust</strong> than <strong>regex</strong> for long, messy tool traces—and that conclusion is supported by <strong>reading vLLM</strong>, not only by trial runs.</li> <li><strong>Mixed GPU</strong>: When <strong>FP8 paths differ by generation</strong>, <strong><code class="language-plaintext highlighter-rouge">VLLM_TEST_FORCE_FP8_MARLIN=1</code></strong> (or an equivalent precision-alignment strategy) is <strong>effectively required</strong> to stop <strong>long-context drift</strong>.</li> <li><strong>Weights</strong>: <strong>SFT-distilled “Claude-shaped”</strong> forks (e.g. 
<strong>Qwopus3.5</strong>) can <strong>mix formats after ~65K tokens</strong>; for <strong>long</strong> tool-heavy jobs, prefer <strong>official Qwen FP8</strong> (or the INT4 variant if VRAM demands it).</li> <li><strong>Quantization</strong>: <strong>FP8 is near-lossless</strong> here; downgrade formats <strong>only</strong> when VRAM leaves no alternative.</li> </ol> <h1 id="resources">Resources</h1> <ul> <li><strong>Working setup (templates, env, notes):</strong> <a href="https://github.com/allanchan339/vLLM-Qwen3.5-27B">GitHub — vLLM Qwen 3.5 27B config</a> — includes <strong><code class="language-plaintext highlighter-rouge">qwen3.5-enhanced.jinja</code></strong>.</li> <li><strong>Concrete long-session example:</strong> <a href="https://github.com/allanchan339/qwen_own_project">qwen_own_project</a>.</li> <li><strong>Earlier discussion:</strong> <a href="https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/">Reddit — tool calling fixes thread</a>.</li> </ul> <p>If you run <strong>Qwen 3.5 27B/35B-class</strong> models for <strong>agents</strong> and see <strong>silent</strong> tool failures—<strong>truncation</strong>, <strong>wrong boundaries</strong>, or <strong>thinking leakage</strong>—inspect the <strong>Jinja chat template</strong> first. <strong>In my experience</strong> it was <strong>almost always</strong> the dominant factor: the <strong>stock template</strong> does not cover the <strong>failure modes smaller checkpoints</strong> actually hit.</p>]]></content><author><name></name></author><category term="bug-fixes"/><category term="vllm"/><category term="qwen"/><category term="tool-calling"/><category term="llm"/><category term="inference"/><category term="gpu"/><summary type="html"><![CDATA[Debugging notes on Jinja chat templates, qwen3_xml vs qwen3_coder parsers, mixed-GPU FP8 drift, and SFT-distilled checkpoints when running Qwen 3.5 27B/35B-class models for long agentic sessions on vLLM.]]></summary></entry><entry><title type="html">Workaround for Enabling NCCL P2P Communication for NVIDIA RTX 4090 Workstations</title><link href="https://allanchan339.github.io/bug-fixes/2025/05/21/4090-P2P.html" rel="alternate" type="text/html" title="Workaround for Enabling NCCL P2P Communication for NVIDIA RTX 4090 Workstations"/><published>2025-05-21T00:00:00+08:00</published><updated>2025-05-21T00:00:00+08:00</updated><id>https://allanchan339.github.io/bug-fixes/2025/05/21/4090-P2P</id><content type="html" xml:base="https://allanchan339.github.io/bug-fixes/2025/05/21/4090-P2P.html"><![CDATA[<h1 id="why-bother-with-nccl-p2p">Why bother with NCCL P2P?</h1> <p>When you train or fine-tune on <strong>more than one GPU</strong>, libraries such as PyTorch and JAX rely on <strong>collective communication</strong> so every device sees the same gradients or activations at the right time. On NVIDIA hardware that usually means <strong>NCCL</strong> is in the hot path. <strong>Peer-to-peer (P2P)</strong>—direct GPU-to-GPU memory access—is how NCCL prefers to move data when the driver and topology allow it; when P2P is blocked, collectives may still run but <strong>through slower fallbacks</strong> (see the next section for definitions).</p> <p>NVIDIA <strong>does not officially enable NCCL P2P</strong> on some consumer GeForce boards, including the <strong>RTX 4090</strong>, even when the hardware could support it in principle. 
If you bought two 4090s for a workstation and expect NCCL to “just work” like on a datacenter GPU, you can end up chasing cryptic logs until you either accept degraded comms or apply a <strong>driver/kernel workaround</strong>. This post documents one such path: <strong>modified open GPU kernel modules</strong> plus a few <strong>firmware and boot</strong> settings.</p> <h1 id="what-is-nccl-p2p">What is NCCL P2P?</h1> <p><strong>NCCL</strong> (NVIDIA Collective Communications Library) implements <strong>multi-GPU collective operations</strong>: patterns such as <strong>all-reduce</strong> (combine gradients across GPUs), <strong>all-gather</strong>, <strong>broadcast</strong>, and <strong>reduce-scatter</strong>. Frameworks typically call these through a distributed backend (for example PyTorch’s <code class="language-plaintext highlighter-rouge">ProcessGroup</code> using NCCL). Internally, NCCL chooses <strong>topologies</strong>—rings, trees, or hybrids—and schedules <strong>send/recv</strong>-style steps along edges between GPUs.</p> <p><strong>P2P</strong> in this context means <strong>CUDA peer access</strong>: GPU <em>i</em> is allowed to <strong>load and store</strong> another GPU <em>j</em>’s device memory <strong>without an explicit copy through host DRAM</strong>. On PCIe-only setups that usually implies <strong>P2P DMA over the fabric</strong> between those endpoints (when <strong>NVLink</strong> exists, NCCL can use that too). <strong>“NCCL P2P”</strong> is shorthand for: <strong>NCCL is using peer-to-peer GPU memory paths</strong> as part of those collectives, rather than only staging through pinned host buffers.</p> <p>When peer access is <strong>unavailable or disabled</strong>, NCCL can fall back to other transports, but you often see <strong>higher latency, lower effective bandwidth, or extra PCIe traffic</strong>—sometimes bad enough that scaling to two GPUs barely helps. Diagnostic hooks include <strong><code class="language-plaintext highlighter-rouge">cudaDeviceCanAccessPeer</code> / <code class="language-plaintext highlighter-rouge">cudaDeviceEnablePeerAccess</code></strong>, NCCL’s environment knobs (for example <code class="language-plaintext highlighter-rouge">NCCL_P2P_LEVEL</code>, <code class="language-plaintext highlighter-rouge">NCCL_P2P_DISABLE</code>), and <strong>NCCL debug / topology logs</strong> that record whether the runtime believes P2P is usable between pairs of GPUs.</p> <h1 id="what-is-rebar-and-why-it-shows-up-here">What is ReBAR (and why it shows up here)?</h1> <p><strong>ReBAR</strong> is <strong>Resizable BAR</strong> (part of the PCIe specification). Traditionally the CPU could only map a <strong>small fixed window</strong> of GPU video memory at a time. With ReBAR, the system can expose a <strong>much larger contiguous mapping</strong> of VRAM to the CPU. That mainly helps <strong>CPU ↔ GPU</strong> traffic (textures, uploads, some unified-memory style use). It is <strong>not</strong> the same thing as GPU-to-GPU P2P, but on consumer platforms <strong>BIOS and driver stacks</strong> often treat BAR sizing and routing as part of the same configuration story.</p> <p>For the workaround in this guide, <strong>ReBAR should be enabled</strong> in firmware where possible. The practical check is: <strong>Total BAR size reported by the driver is at least on the order of hundreds of MB</strong> (the steps below use <code class="language-plaintext highlighter-rouge">nvidia-smi</code>). 
If ReBAR is off, update <strong>motherboard BIOS</strong> and enable the relevant <strong>“Above 4G decoding” / Resizable BAR</strong> options before spending time on kernels.</p> <p><strong>IOMMU</strong> (I/O memory management unit) virtualization for devices can <strong>interfere with certain P2P paths</strong> on some boards. This guide turns IOMMU off in the kernel command line for the workstation case—<strong>only do that if you understand the tradeoff</strong> (simpler device DMA; less isolation for PCI passthrough / VFIO workflows).</p> <hr/> <p>Once the definitions and motivation are clear, the rest is <strong>execution</strong>: driver build, firmware/boot knobs, CUDA, and sanity tests.</p> <h1 id="expected-results">Expected results</h1> <p>The following images show P2P working end-to-end after configuration:</p> <p><img src="https://github.com/user-attachments/assets/3c27b585-0bbe-4d82-8e49-946658019cbe" alt="P2P Communication Result 1"/> <img src="https://github.com/user-attachments/assets/86d90742-8fcb-4e43-b916-63bc71685872" alt="P2P Communication Result 2"/></p> <h1 id="how-implementation-guide">How: implementation guide</h1> <h2 id="1-driver-installation">1. Driver installation</h2> <h3 id="11-remove-existing-nvidia-drivers">1.1 Remove existing NVIDIA drivers</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt purge <span class="s1">'^nvidia-.*'</span>
<span class="nb">sudo </span>apt autoremove
<span class="nb">sudo </span>apt autoclean
</code></pre></div></div> <h3 id="12-reboot">1.2 Reboot</h3> <p>Restart the machine so the old stack is fully unloaded.</p> <h3 id="13-unload-the-nvidia-drm-module-text-mode">1.3 Unload the NVIDIA DRM module (text mode)</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl isolate multi-user.target
modprobe <span class="nt">-r</span> nvidia-drm

<span class="c"># If the GUI does not return afterward:</span>
systemctl start graphical.target
</code></pre></div></div> <h3 id="14-install-the-modified-kernel-modules">1.4 Install the modified kernel modules</h3> <ol> <li> <p>Clone the open GPU kernel modules repository (use the repo URL, not the GitHub <code class="language-plaintext highlighter-rouge">tree</code> page):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/tinygrad/open-gpu-kernel-modules.git
<span class="nb">cd </span>open-gpu-kernel-modules
</code></pre></div> </div> </li> <li> <p>Check out the P2P branch that matches your target driver line (example branch name—confirm on the repo):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git fetch <span class="nt">--all</span>
git branch <span class="nt">-a</span>
git switch 565.57.01-p2p
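<span class="c"># If the switch fails, the branch was likely renamed; pick the *-p2p entry</span>
<span class="c"># matching your driver line from the listing above.</span>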
</code></pre></div> </div> </li> <li> <p>Build the modules:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make modules <span class="nt">-j</span><span class="si">$(</span><span class="nb">nproc</span><span class="si">)</span>
</code></pre></div> </div> <p>If the build fails on GCC, install GCC 12 and point alternatives at it:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt update
<span class="nb">sudo </span>apt <span class="nb">install </span>gcc-12 g++-12
<span class="nb">sudo </span>update-alternatives <span class="nt">--install</span> /usr/bin/gcc gcc /usr/bin/gcc-12 120 <span class="nt">--slave</span> /usr/bin/g++ g++ /usr/bin/g++-12
</code></pre></div> </div> </li> <li> <p>Install the built modules:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>make modules_install <span class="nt">-j</span><span class="si">$(</span><span class="nb">nproc</span><span class="si">)</span>
</code></pre></div> </div> </li> <li> <p>Install the <strong>user-space</strong> driver from NVIDIA for the same version, <strong>without</strong> replacing the kernel modules you just built—for example:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Download the matching runfile from NVIDIA, e.g.:</span>
<span class="c"># https://www.nvidia.com/en-us/drivers/details/233008/</span>
sh ./NVIDIA-Linux-[...].run <span class="nt">--no-kernel-modules</span>
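<span class="c"># --no-kernel-modules installs only the user-space stack, leaving the</span>
<span class="c"># P2P-patched modules from the previous step in place; keep the runfile</span>
<span class="c"># version identical to the branch you built.</span>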
</code></pre></div> </div> </li> <li> <p>Reboot again.</p> </li> </ol> <h2 id="2-system-configuration">2. System configuration</h2> <h3 id="21-verify-rebar">2.1 Verify ReBAR</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nvidia-smi <span class="nt">-q</span> | <span class="nb">grep</span> <span class="nt">-i</span> bar <span class="nt">-A</span> 3
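<span class="c"># Expected shape (sketch; exact figures vary by board and VRAM size):</span>
<span class="c">#   BAR1 Memory Usage</span>
<span class="c">#       Total                 : 32768 MiB</span>
<span class="c"># A tiny Total (well under 256 MiB) usually means ReBAR is still off.</span>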
</code></pre></div></div> <p>Treat <strong>Total ≥ 256 MiB</strong> (order-of-magnitude) as a sign that ReBAR-style mapping is in play; if numbers look tiny, fix <strong>BIOS</strong> options before debugging NCCL.</p> <p><img src="https://github.com/user-attachments/assets/8f951980-498a-46ab-b5cd-a7b07ed6931e" alt="ReBar Configuration"/></p> <h3 id="22-disable-iommu-in-grub-amd-example">2.2 Disable IOMMU in GRUB (AMD example)</h3> <p>Edit <code class="language-plaintext highlighter-rouge">/etc/default/grub</code> and adjust the default command line, for example:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nano /etc/default/grub
</code></pre></div></div> <p>Set something like:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">GRUB_CMDLINE_LINUX_DEFAULT</span><span class="o">=</span><span class="s2">"quiet splash amd_iommu=off iommu=off"</span>
</code></pre></div></div> <p>Then <code class="language-plaintext highlighter-rouge">sudo update-grub</code> (Debian/Ubuntu) and reboot.</p> <p><strong>Reminder:</strong> P2P in this setup expects <strong>ReBAR on</strong> and <strong>IOMMU off</strong> for the paths described here.</p> <h2 id="3-cuda-toolkit">3. CUDA toolkit</h2> <ol> <li>Install a CUDA toolkit from <a href="https://developer.nvidia.com/cuda-downloads">NVIDIA’s CUDA downloads</a>.</li> <li> <p>Example environment for build tools:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/usr/local/cuda-12.9/bin:<span class="nv">$PATH</span>
<span class="nb">export </span><span class="nv">CUDAHOSTCXX</span><span class="o">=</span>/usr/bin/g++-12
</code></pre></div> </div> </li> </ol> <h2 id="4-p2p-tests">4. P2P tests</h2> <h3 id="41-simplep2p">4.1 SimpleP2P</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/NVIDIA/cuda-samples
<span class="nb">cd </span>cuda-samples/Samples/0_Introduction/simpleP2P/
<span class="nb">mkdir </span>build <span class="o">&amp;&amp;</span> <span class="nb">cd </span>build
cmake ..
make <span class="nt">-j</span><span class="si">$(</span><span class="nb">nproc</span><span class="si">)</span>
./simpleP2P
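<span class="c"># Success looks roughly like (sketch, not verbatim output):</span>
<span class="c">#   Peer access from ... (GPU0) -&gt; ... (GPU1) : Yes</span>
<span class="c">#   followed by a cudaMemcpyPeer bandwidth figure and a passing result.</span>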
</code></pre></div></div> <h3 id="42-bandwidth-and-latency">4.2 Bandwidth and latency</h3> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd </span>cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/
<span class="nb">mkdir </span>build <span class="o">&amp;&amp;</span> <span class="nb">cd </span>build
cmake ..
make <span class="nt">-j</span><span class="si">$(</span><span class="nb">nproc</span><span class="si">)</span>
./p2pBandwidthLatencyTest
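<span class="c"># Read the printed matrices: compare P2P=Enabled vs P2P=Disabled bandwidth</span>
<span class="c"># and latency; a clear uplift on the off-diagonal (GPU0-GPU1) cells shows</span>
<span class="c"># the patched modules are actually carrying direct GPU-to-GPU traffic.</span>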
</code></pre></div></div> <h1 id="conclusion">Conclusion</h1> <p>NCCL P2P is worth the trouble when <strong>multi-GPU collectives</strong> are on your critical path and you are on <strong>consumer cards with official limitations</strong>. <strong>ReBAR</strong> is primarily about <strong>how much GPU memory the CPU can map at once</strong>; here it is a <strong>firmware prerequisite</strong> to line up with the driver workaround, alongside <strong>IOMMU settings</strong> and a <strong>matched open-kernel-module build + runfile install</strong>. Treat the whole stack as <strong>unsupported by NVIDIA for production</strong>—validate with the CUDA samples above, then run your real training job and watch NCCL logs for a clean bill of health.</p>]]></content><author><name></name></author><category term="bug-fixes"/><category term="nvidia"/><category term="driver"/><category term="p2p"/><category term="gpu"/><category term="deep learning"/><summary type="html"><![CDATA[What NCCL P2P means, why it matters on multi-GPU workstations, how Resizable BAR fits in, and a concrete setup path for RTX 4090.]]></summary></entry><entry><title type="html">IELTS - After Class Note, Week 8</title><link href="https://allanchan339.github.io/journey/2024/08/24/IELTS-week8.html" rel="alternate" type="text/html" title="IELTS - After Class Note, Week 8"/><published>2024-08-24T00:00:00+08:00</published><updated>2024-08-24T00:00:00+08:00</updated><id>https://allanchan339.github.io/journey/2024/08/24/IELTS-week8</id><content type="html" xml:base="https://allanchan339.github.io/journey/2024/08/24/IELTS-week8.html"><![CDATA[<h1 id="listening">Listening</h1> <ul> <li>Listening 遇到填充題，一定要判別詞性，係 V, Vs, N. Ns, Adj</li> </ul> <h1 id="writing">Writing</h1> <h2 id="short-task">Short Task</h2> <h3 id="diagram-流程圖">Diagram (流程圖)</h3> <ul> <li>結構同樣係 <ul> <li> <ol> <li>intro, 1-3 overiew, 2*3-4 features, 1 conclusion</li> </ol> </li> </ul> </li> <li>用現在式</li> <li>遊戲玩法係將每一個步驟用完整句子＋連接詞連接起來 <ul> <li>First of all, then, next, after that , in the next stage, subsequently, finally, at the end,</li> </ul> </li> <li>Overview: <ul> <li>寫有幾多個step, 第一步同最後一步係乜</li> </ul> </li> <li>文中如果吾知佢要做乜，可以寫 <ul> <li>They / the products undergo XXX (N.)</li> </ul> </li> </ul> <h1 id="speaking">Speaking</h1> <ul> <li>遊戲只能靠平時的底子 <ul> <li>着重於陽長避短</li> </ul> </li> <li>4 domains <ul> <li>Fluency and coherence <ul> <li>最重要</li> <li>keep talking, 而且要用完整句子</li> </ul> </li> <li>Grammatical range <ul> <li>tense 好看重，如果可以就用多些少tense</li> <li>多用連接詞，e.g. and beside, moreover instead</li> <li>多用比較，e.g. instead, but, perfer, A is better than B as</li> </ul> </li> <li>Lexial resources <ul> <li>吾好作狀，speaking 用既詞語同寫作係吾同</li> </ul> </li> <li>Pronounciation <ul> <li>accent 吾記分，所以吾洗扮口音</li> <li>但係快慢，聲調，節奏會計</li> </ul> </li> </ul> </li> <li>6 分起跳 <ul> <li>通常往上調</li> </ul> </li> <li>11-14 mins <ul> <li>首 30s 吾記分，但係用黎比印象分</li> </ul> </li> <li>聽吾切 <ul> <li>Would you please repeat the question for me (X again)?</li> </ul> </li> </ul> <h2 id="回答套路">回答套路</h2> <h3 id="kfc-approach">KFC approach</h3> <ul> <li>Key details/ reason</li> <li>Friends</li> <li>Contrast <ul> <li>X if</li> <li>opinions differ</li> </ul> </li> </ul> <h3 id="ppf-approach">PPF approach</h3> <ul> <li>Past</li> <li>Present</li> <li>Future</li> </ul> <h3 id="5w-approach">5W approach</h3> <ul> <li>What</li> <li>Why</li> <li>How</li> <li>When</li> <li>Where</li> </ul> <h2 id="part-1">Part 1</h2> <ul> <li>問習慣，住乜，讀書定工作 etc.</li> <li>要顯示出你肯答，keep talking ! 
<ul> <li>otherwise it costs you lexical marks</li> </ul> </li> </ul> <h3 id="question-bank-for-review">Question Bank for review</h3> <p><img src="https://s2.loli.net/2023/08/24/JGrRT56LwVbuHmg.png" alt="Figure 0"/><br/> <img src="https://s2.loli.net/2023/08/24/NK2V7DnOIpBcshk.png" alt="Figure 1"/><br/> <img src="https://s2.loli.net/2023/08/24/JTikmcjgYFHlQOb.png" alt="Figure 2"/><br/> <img src="https://s2.loli.net/2023/08/24/cvoBqjIUE9Ld8XN.png" alt="Figure 3"/><br/> <img src="https://s2.loli.net/2023/08/24/gNPkIrKMaol8iLE.png" alt="Figure 4"/><br/> <img src="https://s2.loli.net/2023/08/24/FAc63Piax9X8NqS.png" alt="Figure 5"/></p> <h2 id="part-2-3">Part 2-3</h2> <p><img src="https://s2.loli.net/2023/08/24/kdAlrz4c2pDV1b6.png" alt="Figure 6"/></p> <h1 id="錯題集">Mistake log</h1> <h2 id="reading-incorrect-answer-from-lesson-7">Reading: incorrect answers from Lesson 7</h2> <h3 id="q4">Q4</h3> <p>What do the experiments described in the fifth paragraph suggest about the paintings of Mondrian?</p> <ul> <li>A. They are more <strong>carefully</strong> put together than they appear. <ul> <li>Quote: Mondrian’s works are deceptively simple, but eye-tracking studies confirm that they are <strong>meticulously</strong> composed.</li> <li>Meticulously: in fine detail</li> <li>it matches the word <strong>carefully</strong></li> </ul> </li> <li>B. They can be interpreted in a number of different ways. (mentioned, but not what the experiments suggest about the paintings)</li> <li>C. They challenge our assumptions about shape and color. (no “assumptions”)</li> <li>D. They are easier to appreciate than many other abstract works. (no such comparison)</li> </ul> <h3 id="q7">Q7</h3> <p>She also observes that pleasing works of art often contain repeated _____ which occur frequently in the natural world.</p> <ul> <li>Quote: What’s more, appealing pieces both abstract and representational, show signs of ‘fractals’ - <strong>repeated motifs recurring in different scales</strong>. Fractals are common throughout nature, for example in the shapes of mountain peaks or the branches of trees. It is possible that our visual system, which evolved in the great outdoors, find it easier to process <strong>such patterns</strong>.</li> <li>Patterns -&gt; repeated motifs -&gt; repeated ??</li> <li>A. layout (X)</li> <li>B. images (O)</li> <li>Requires <strong>a plural noun</strong> here</li> </ul> <h3 id="q8-13-views-of-the-writer">Q8-13 Views of the writer</h3> <ul> <li>We must use <strong>Yes/No/Not Given</strong>, not True/False/Not Given</li> </ul> <h4 id="q10">Q10</h4> <p>People’s taste in paintings depends entirely on the current artistic trends of the period.</p> <ul> <li>Quote: While the fashions of the time might shape what is currently popular, works that are best adapted to our visual system may be the most likely to linger once the trends of previous generations have been forgotten. <ul> <li>Mentions people’s taste shaped by fashion and current artistic trends</li> <li>But not <strong>entirely</strong>: works best adapted to our visual system may outlast those trends</li> </ul> </li> <li>Ans: <strong>No</strong></li> </ul> <h4 id="q11">Q11</h4> <p>Scientists should seek to define the precise rules which govern people’s reactions to works of art.</p> <ul> <li>Quote: It would, however, be foolish to reduce art appreciation to a set of scientific laws.
<ul> <li>Mentions scientific laws -&gt; precise rules</li> <li>Mentions art appreciation -&gt; people’s reactions to works of art</li> </ul> </li> <li>Ans: <strong>No</strong></li> </ul> <h2 id="listening-section3">Listening Section 3</h2> <h3 id="q21-22">Q21-22</h3> <p>Which TWO characteristics were shared by the subjects of Joanna’s psychology study?</p> <ul> <li>Ans: B+D</li> <li>A. They had all won prizes for their music. <ul> <li>Trap. Quote: And quite a few had won prizes and competitions as well</li> <li>only some of them, not all</li> </ul> </li> <li>B. They had all made music recordings</li> <li>C. They were all under 27 years old</li> <li>D. They had all toured internationally <ul> <li>Quote: They were all very <strong>highly regarded</strong> in the music world and they’d done quite extensive <strong>tours</strong> in different <strong>continents</strong></li> </ul> </li> <li>E. They all played a string instrument.</li> </ul> <h3 id="q25-26">Q25-26</h3> <p>Which TWO topics did Joanna <strong>originally</strong> intend to investigate in her research?</p> <ul> <li>The question asks about the original plan, not the final focus</li> <li>A. regulations concerning concert dress <ul> <li>regulations are never mentioned at all</li> </ul> </li> <li>B. audience reactions to the dress of performers <ul> <li>Quote: When I started I was more interested in trying to investigate the impact of what was worn on those listening</li> </ul> </li> <li>C. changes in performer attitudes to concert dress</li> <li>D. how choice of dress relates to performer roles <ul> <li>Trap. Quote: My research investigated the way players see their role as a musician and how this is linked to the type of clothing they decide to wear, but that focus didn’t emerge immediately</li> </ul> </li> <li>E. links between musical instrument and dress choice <ul> <li>Quote: and also whether someone like violinist might adopt a different style of clothing from someone playing the flute or the trumpet</li> </ul> </li> </ul> <h3 id="q28">Q28</h3> <p>Mike Frost’s article suggests that in popular music, women’s dress is affected by</p> <ul> <li>A. their wish to be taken seriously (O)</li> <li>B. their tendency to copy each other</li> <li>C.
their reaction to the masculine nature of the music</li> <li>Quote: He points out that a lot of female singers and musicians in popular music tend to dress down (dress modestly) in performances, and wear less feminine clothes, and he suggests this is because otherwise they’d just be discounted as trivial <ul> <li>they fear being looked down on and discounted</li> </ul> </li> </ul> <h2 id="listening-section4">Listening Section 4</h2> <h3 id="q34">Q34</h3> <p>some CO2 moves from the <em>__</em> of plants to microbes in the soil</p> <ul> <li>Ans = roots (plural)</li> <li>it is the roots of the plants, not “roof”</li> </ul> <h3 id="q35">Q35</h3> <p>uses established practices to make sure soil remains fertile and <em>__</em></p> <ul> <li>Ans = moist / wet; no marks if it is misspelled</li> </ul> <h3 id="q37">Q37</h3> <p>taking place on a big _____ farm</p> <ul> <li>Ans: cattle</li> </ul> <h3 id="q38">Q38</h3> <p>uses compost made from waste from agriculture and _____</p> <ul> <li>Ans: gardens</li> <li>must be plural, not “garden”</li> </ul> <h3 id="q39">Q39</h3> <p>aims to increase soil carbon by using _____ that <strong>are</strong> always green</p> <ul> <li>must be plural</li> <li>Ans: grasses</li> </ul>]]></content><author><name></name></author><category term="journey"/><category term="ielts"/><category term="english"/><category term="reading"/><category term="usage"/><summary type="html"><![CDATA[Final-week listening, writing, speaking, and reading reminders for IELTS.]]></summary></entry><entry><title type="html">IELTS - After Class Note, Week 7</title><link href="https://allanchan339.github.io/journey/2024/08/20/IELTS-week7.html" rel="alternate" type="text/html" title="IELTS - After Class Note, Week 7"/><published>2024-08-20T00:00:00+08:00</published><updated>2024-08-20T00:00:00+08:00</updated><id>https://allanchan339.github.io/journey/2024/08/20/IELTS-week7</id><content type="html" xml:base="https://allanchan339.github.io/journey/2024/08/20/IELTS-week7.html"><![CDATA[<h1 id="listening">Listening</h1> <h2 id="新題型-matching-題">New question type: matching</h2> <ul> <li>Read all the options first</li> <li>The questions then come in order; watch for nouns, times, numbers, and adjectives
<ul> <li>\(\because\) it is 7 options for 1 answer, and you cannot keep up otherwise</li> <li>by contrast, MC questions are mostly 3 options for 1</li> </ul> </li> <li>(That way of handling them works well for map and matching questions; this time not a single one was wrong) <ul> <li>check clearly where the entrance is and which way the route runs, then follow the description as it goes; that is usually right</li> </ul> </li> </ul> <h1 id="writing">Writing</h1> <h2 id="short-task">Short Task</h2> <ul> <li>Recommended: finish within 20 mins</li> </ul> <h2 id="structure">Structure</h2> <ol> <li>Introduction (1 sentence)</li> <li>Overview (1-3 sentences) <ol> <li>Key information only</li> <li>Overall trend (increase/ decrease)</li> <li>Comparison (Beginning vs Latest data only)</li> <li>Polar data (maximum/minimum) (largest vs smallest contributor, for a pie chart)</li> <li>Fluctuation <ul> <li>do not quote numbers!</li> </ul> </li> </ol> </li> <li>Features * 2 <ol> <li>Group data by segment (cover each country once)</li> <li>Group data by trend (rising in one group; falling plus unchanged in the other)</li> </ol> </li> <li><strong>Do not write a conclusion</strong>!!!</li> </ol> <h2 id="question-type">Question Type</h2> <ul> <li>Line graph, bar chart, pie chart (80%)</li> <li>Flow chart (20%)</li> </ul> <h2 id="answer-flow">Answer Flow</h2> <ol> <li>Rephrase the prompt slightly as the intro</li> <li>While reading the chart, note what can serve as the overview</li> <li>Decide how the features should be split: by segment or by trend?</li> <li>Write everything in the <strong>past tense</strong>!!!!!</li> </ol> <h2 id="罐頭句系列">Canned sentences</h2> <ul> <li>to experience a sharp increase / decrease in XXX</li> <li>A sharp rise / fall in XXX can be witnessed (observed)</li> <li>There was a surge/ reduction in XXX</li> <li>There was an upward / a downward trend in</li> <li>The number soared / rocketed to (a record high / low of ) XXX</li> <li>The figures peaked at XXX in XXX</li> <li>A significant increase / decrease occurred from A to B // between A and B.</li> <li>The number hit a low-point of XXX</li> <li>XXX accounted for // constituted / contributed to ? % of XXX</li> </ul>]]></content><author><name></name></author><category term="journey"/><category term="ielts"/><category term="english"/><category term="reading"/><category term="usage"/><summary type="html"><![CDATA[Listening, writing task structure, and vocabulary from IELTS week 7.]]></summary></entry></feed>