Qwen 3.6 35B-A3B on vLLM: do the Qwen 3.5 tool-calling fixes carry over?
This note follows the earlier Qwen 3.5 / vLLM discussion. After weeks spent stabilizing Qwen 3.5-27B/35B for agentic use—the same qwen3_xml parser, qwen3.5-enhanced.jinja template, and GPU-side tuning—readers kept asking whether Qwen 3.6 behaved the same.
Short answer: the same configuration remains the most stable option on Qwen3.6-35B-A3B-FP8, but compared with Qwen3.5-27B the newer checkpoint is more prone to reasoning loops and to malformed tool calls that interrupt long agent runs. This post documents what still works, what I measured in three controlled runs, a partial mitigation on the client side, and why Qwen3.5-27B-FP8 is still my default for reliability-first agents.
What still carries over from the 3.5 setup
qwen3_xml tool-call parser
The registry-backed parser continues to handle complex tool arguments without the corruption seen with regex-oriented paths. Official documentation still recommends qwen3_coder; for demanding agentic traces, my recommendation remains to avoid it.
qwen3.5-enhanced.jinja chat template
The interleaved thinking template still applies to 3.6 35B-A3B: correct </thinking> boundaries and clean tool-call framing relative to the stock template.
Mixed-GPU precision alignment
RTX 4090 (SM89) prefers W8A8 paths; RTX 3090 (SM80) falls back to W8A16. VLLM_TEST_FORCE_FP8_MARLIN=1 still forces both ranks onto a matched effective precision. Without it, long conversations drift—the same failure mode as on 3.5.
NCCL tuning
Unchanged for this mixed consumer topology:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring
Agentic test protocol
Each trial used the same prompt shape: grant full ownership of the working folder, ask the model to build a full-stack project (frontend and backend), and cap planning around a ~10k-token budget for the initial instruction. The goal was to observe how long the stack survives before tool-calling or format errors end the session.
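One way to enforce that ~10k-token cap on the initial instruction client-side (a sketch, not the exact harness code; `cap_planning_budget` is a hypothetical helper, and the ~4-characters-per-token ratio is a crude stand-in for the model's real tokenizer):

```python
def cap_planning_budget(instruction: str, max_tokens: int = 10_000,
                        chars_per_token: float = 4.0) -> str:
    """Truncate the initial planning instruction to an approximate token budget.

    Uses a rough chars-per-token ratio; swap in the served model's tokenizer
    for exact accounting.
    """
    max_chars = int(max_tokens * chars_per_token)
    if len(instruction) <= max_chars:
        return instruction
    # Cut at the last whitespace before the limit to avoid splitting a word.
    cut = instruction.rfind(" ", 0, max_chars)
    return instruction[: cut if cut > 0 else max_chars]
```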
Three runs (same hardware, different vLLM template/parser pairs)
Run 1: enhanced.jinja + qwen3_xml (best observed config; file on disk: qwen3.5-enhanced.jinja)
The model chose to build oss-inspect—an autonomous codebase quality analysis tool (project name as generated by the model).
| Prompt / phase | Accumulated tokens |
|---|---|
| Project setup | 13.9k |
| “Did you check if this is bug free? This is your own project.” | 135.1k |
| DCP sweep auto-triggered | 107.0k |
| “Fix it then” | 110.0k |
| Session ended — improper tool calling | 111.1k |
This configuration lasted the longest: ~13m 20s wall time to failure, reaching ~135k accumulated tokens in the productive phase. After the DCP sweep auto-triggered at 135.1k (the log's reported token count dropped to 107.0k), the session continued until improper tool calling ended it at 111.1k.
Baseline comparison: Qwen3.5-27B with the same enhanced.jinja + qwen3_xml pairing routinely exceeds 130k tokens without that class of interruption in my runs.
Run 2: official.jinja + qwen3_coder
The model proposed a knowledge-graph platform oriented toward Graphify-style skill ingestion; its ingestion behavior was more aggressive than I expected.
Time to failure: 6m 32s — improper tool calling. Too early to trust for long-horizon agentic work.
Run 3: official.jinja + qwen3_xml
The model proposed TaskFlow—a Kanban app with authentication, drag-and-drop tasks, and a polished UI.
Time to failure: 1m 16s — malformed tool calls emitted inside the thinking block. Again unsuitable for reliable agents.
Note on generated stacks
I had no prior familiarity with the concrete frameworks and libraries the model selected in these runs; the observations here concern tool-protocol stability, not stack-specific code review.
Cross-run comparison
| Configuration | Survival (approx.) | Failure mode |
|---|---|---|
| enhanced.jinja + qwen3_xml | ~111k tokens (~13m 20s) | Improper tool calling (session died) |
| official.jinja + qwen3_coder | ~6m 32s | Improper tool calling |
| official.jinja + qwen3_xml | ~1m 16s | Malformed tool calls inside thinking block |
Summary: even the best 3.6 configuration fails more often than Qwen3.5-27B under the same harness. Qwen3.5-27B remains more stable for agentic use in these tests, despite slower TTFT on 3.5.
Behaviors that look specific to Qwen3.6-35B-A3B
1. More frequent reasoning loops
The model revisits the same analysis step repeatedly, burning tokens before advancing. This reads as a checkpoint behavior change, not a template bug: Qwen3.5-27B showed the pattern occasionally; on 3.6 35B-A3B it is common enough to hurt long sessions.
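One cheap client-side guard is to abort or re-prompt when the stream starts repeating itself verbatim. A minimal sketch (`looks_looped` is a hypothetical helper, not part of the harness above, and the thresholds are illustrative):

```python
def looks_looped(text: str, window: int = 200, min_repeats: int = 3) -> bool:
    """Heuristic loop detector for streamed reasoning text.

    Returns True when the trailing `window` characters already occur
    `min_repeats` or more times in the full transcript, which is the
    signature of a model revisiting the same analysis step verbatim.
    """
    if len(text) < window * min_repeats:
        return False
    tail = text[-window:]
    return text.count(tail) >= min_repeats
```

This only catches verbatim repetition; paraphrased loops need a fuzzier similarity check.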
2. Malformed tool calls despite a “correct” wire format
With enhanced.jinja + qwen3_xml—the pair that works cleanly on 3.5-27B—3.6 35B-A3B still produces malformed tool calls at higher frequency. The XML shape can remain technically valid when it succeeds; the problem is how often failures occur and that one bad turn can abort a run with no recovery.
On 3.5-27B, after template fixes, bad tool turns are a rare edge case. On 3.6 35B-A3B, they are regular enough that any long agentic session eventually hits them, independent of which of the tested template/parser combinations is selected.
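On the client, the cheapest guard I found conceptually is to validate each tool-call payload before dispatching it, and to drop or re-request the turn when parsing fails. A minimal sketch, assuming the tool call arrives as an XML fragment (the real wire format is whatever qwen3_xml emits; `parse_tool_call` is a hypothetical helper):

```python
import xml.etree.ElementTree as ET
from typing import Optional

def parse_tool_call(payload: str) -> Optional[dict]:
    """Return {'name': ..., 'args': {...}} for a well-formed tool-call
    fragment, or None when the XML is malformed (truncated tags, stray
    text, etc.) so the caller can retry instead of aborting the run."""
    try:
        root = ET.fromstring(payload)
    except ET.ParseError:
        return None
    return {
        "name": root.tag,
        "args": {child.tag: (child.text or "") for child in root},
    }
```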
Partial mitigation: OpenCode 1.4.18
OpenCode 1.4.18 reduced client-side tool friction. Older OpenCode versions had tool-calling bugs that amplified failures, especially around the question tool; upgrading to 1.4.18 eliminated that class of malformed tool-call interaction.
Limitation: the client upgrade removes neither the reasoning loops nor the higher baseline failure rate, which sit at the model level (possibly in thinking-state handling, e.g. preserved thinking, but that is a hypothesis, not a confirmed root cause).
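A related client-side pattern is to re-request a turn whenever its tool call fails validation, so one bad turn does not kill the session. A sketch with hypothetical `request_turn` and `validate` callables standing in for the actual OpenCode/vLLM round trip and parser:

```python
def tool_turn_with_retry(request_turn, validate, max_attempts: int = 3):
    """Re-request a tool-call turn until the payload validates.

    request_turn: zero-arg callable returning the raw tool-call payload.
    validate: callable returning a parsed call, or None on malformed input.
    Raises RuntimeError when every attempt produced a malformed call,
    i.e. the turn that would otherwise silently end a long agent run.
    """
    for _ in range(max_attempts):
        parsed = validate(request_turn())
        if parsed is not None:
            return parsed
    raise RuntimeError("tool call still malformed after retries")
```

Retrying spends tokens, so it mitigates the symptom without touching the model-level failure rate.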
Reference environment and vllm serve command
Software pinned for the run:
- vLLM: 0.19.1
- Transformers: 5.5.4
- CUDA: 12.8.1 (nvcc 12.8.93)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_CUMEM_ENABLE=0
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export OMP_NUM_THREADS=4
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1
rm -rf ~/.cache/flashinfer
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--served-model-name Qwen3.6-35B-A3B \
--chat-template qwen3.5-enhanced.jinja \
--attention-backend FLASHINFER \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-model-len 200000 \
--gpu-memory-utilization 0.91 \
--enable-auto-tool-choice \
--enable-chunked-prefill \
--enable-prefix-caching \
--max-num-batched-tokens 12288 \
--max-num-seqs 4 \
--kv-cache-dtype fp8 \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--no-use-tqdm-on-load \
--host 0.0.0.0 \
--port 8000 \
--language-model-only
Conclusions
- enhanced.jinja + qwen3_xml + OpenCode 1.4.18 is still the strongest combination tested on Qwen3.6-35B-A3B, but it does not match Qwen3.5-27B on looping or long-run tool reliability.
- It is not obvious why tool regressions reappear on 3.6 35B-A3B when many 3.5 fixes carry over; preserved thinking is one speculative lever worth tracking if Qwen ships Flash or template updates aimed at agents.
- Qwen3.5-27B, Qwen3.5-35B-A3B, and Qwen3.6-35B-A3B share the same official chat template in distribution—if Qwen3.6 Flash (or similar) launches with different templating, that may indicate intentional handling of tool/thinking edge cases.
Operational choice: I default to Qwen3.5-27B-FP8 for agentic obedience—instruction following, clean tool execution, and low loop rate. Qwen3.6-35B-A3B offers much faster TTFT and similar headline capability to Qwen3.5-27B on AA-style benchmarks, but in these runs it trades that for loops and tool failures that terminate long sessions. For agent work, I prioritize reliability over raw benchmark scores.