This note follows the earlier Qwen 3.5 / vLLM discussion. After weeks spent stabilizing Qwen 3.5-27B/35B for agentic use—the same qwen3_xml parser, qwen3.5-enhanced.jinja template, and GPU-side tuning—readers kept asking whether Qwen 3.6 behaved the same.

Short answer: the same configuration remains the most stable option on Qwen3.6-35B-A3B-FP8, but compared with Qwen3.5-27B the newer checkpoint is more prone to reasoning loops and to malformed tool calls that interrupt long agent runs. This post documents what still works, what I measured in three controlled runs, a partial mitigation on the client side, and why Qwen3.5-27B-FP8 is still my default for reliability-first agents.

What still carries over from the 3.5 setup

qwen3_xml tool-call parser

The registry-backed parser continues to handle complex tool arguments without the corruption seen with regex-oriented paths. Official documentation still recommends qwen3_coder; for this workload the recommendation remains not to use it for demanding agentic traces.

qwen3.5-enhanced.jinja chat template

The interleaved thinking template still applies to 3.6 35B-A3B: correct </thinking> boundaries and clean tool-call framing relative to the stock template.

Mixed-GPU precision alignment

RTX 4090 (SM89) prefers W8A8 paths; RTX 3090 (SM80) falls back to W8A16. VLLM_TEST_FORCE_FP8_MARLIN=1 still forces both ranks onto a matched effective precision. Without it, long conversations drift—the same failure mode as on 3.5.

NCCL tuning

Unchanged for this mixed consumer topology:

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring

Agentic test protocol

Each trial used the same prompt shape: grant full ownership of the working folder, ask the model to build a full-stack project (frontend and backend), and cap planning around a ~10k-token budget for the initial instruction. The goal was to observe how long the stack survives before tool-calling or format errors end the session.

Three runs (same hardware, different vLLM template/parser pairs)

Run 1: enhanced.jinja + qwen3_xml (best observed config; file on disk: qwen3.5-enhanced.jinja)

The model chose to build oss-inspect—an autonomous codebase quality analysis tool (project name as generated by the model).

Prompt / phase Accumulated tokens
Project setup 13.9k
“Did you check if this is bug free? This is your own project.” 135.1k
DCP sweep auto-triggered 107.0k
“Fix it then” 110.0k
Session ended — improper tool calling 111.1k

This configuration lasted the longest (~13m 20s wall time to failure). It reached ~130k+ tokens in the productive phase before improper tool calling terminated the run. After a DCP sweep at 135k (tokens reported dropped to 107k in the log), the session continued until the final failure at 111.1k on the last line—improper tool calling again.

Baseline comparison: Qwen3.5-27B with the same enhanced.jinja + qwen3_xml pairing routinely exceeds 130k tokens without that class of interruption in my runs.

Run 2: official.jinja + qwen3_coder

The model proposed a knowledge-graph platform oriented toward Graphify-style skill ingestion—the ingestion behavior was aggressive relative to expectations.

Time to failure: 6m 32simproper tool calling. Too early to trust for long-horizon agentic work.

Run 3: official.jinja + qwen3_xml

The model proposed TaskFlow—a Kanban app with authentication, drag-and-drop tasks, and a polished UI.

Time to failure: 1m 16smalformed tool calls emitted inside the thinking block. Again unsuitable for reliable agents.

Note on generated stacks

For the concrete frameworks and libraries the model selected in these runs, I did not have prior familiarity; observations are about tool protocol stability, not stack-specific code review.

Cross-run comparison

Configuration Survival (approx.) Failure mode
enhanced.jinja + qwen3_xml ~111k tokens (~13m 20s) Improper tool calling (session died)
official.jinja + qwen3_coder 6m 32s Improper tool calling
official.jinja + qwen3_xml ~1m 16s Malformed tool calls inside thinking box

Summary: even the best 3.6 configuration fails more often than Qwen3.5-27B under the same harness. Qwen3.5-27B remains more stable for agentic use in these tests, despite slower TTFT on 3.5.

Behaviors that look specific to Qwen3.6-35B-A3B

1. More frequent reasoning loops

The model revisits the same analysis step repeatedly, burning tokens before advancing. This reads as a checkpoint behavior change, not a template bug: Qwen3.5-27B showed the pattern occasionally; on 3.6 35B-A3B it is common enough to hurt long sessions.

2. Malformed tool calls despite a “correct” wire format

With enhanced.jinja + qwen3_xml—the pair that works cleanly on 3.5-27B3.6 35B-A3B still produces malformed tool calls at higher frequency. The XML shape can remain technically valid when it succeeds; the problem is how often failures occur and that one bad turn can abort a run with no recovery.

On 3.5-27B, after template fixes, bad tool turns are a rare edge case. On 3.6 35B-A3B, they are regular enough that any long agentic session eventually hits them, independent of which of the tested template/parser combinations is selected.

Partial mitigation: OpenCode 1.4.18

OpenCode 1.4.18 reduced client-side tool friction. Older OpenCode versions had tool-calling bugs that amplified failures—especially around the question tool. Upgrading to 1.4.18 addressed that class of malformed tool call interaction.

Limitation: the client upgrade does not remove reasoning loops or the model-level higher baseline failure rate on 3.6. The remaining issues sit primarily in the model (and possibly thinking-state handling—e.g. preserved thinking—but that is hypothesis, not a confirmed root cause).

Reference environment and vllm serve command

Software pinned for the run:

  • vLLM: 0.19.1
  • Transformers: 5.5.4
  • CUDA: 12.8.1 (nvcc 12.8.93)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_CUMEM_ENABLE=0
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export OMP_NUM_THREADS=4
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1

rm -rf ~/.cache/flashinfer

vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name Qwen3.6-35B-A3B \
  --chat-template qwen3.5-enhanced.jinja \
  --attention-backend FLASHINFER \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.91 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-num-batched-tokens 12288 \
  --max-num-seqs 4 \
  --kv-cache-dtype fp8 \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --no-use-tqdm-on-load \
  --host 0.0.0.0 \
  --port 8000 \
  --language-model-only

Conclusions

  • enhanced.jinja + qwen3_xml + OpenCode 1.4.18 is still the strongest combination tested on Qwen3.6-35B-A3B, but it does not match Qwen3.5-27B on looping or long-run tool reliability.
  • It is not obvious why tool regressions reappear on 3.6 35B-A3B when many 3.5 fixes carry over; preserved thinking is one speculative lever worth tracking if Qwen ships flash or template updates aimed at agents.
  • Qwen3.5-27B, Qwen3.5-35B-A3B, and Qwen3.6-35B-A3B share the same official chat template in distribution—if Qwen3.6 Flash (or similar) launches with different templating, that may indicate intentional handling of tool/thinking edge cases.

Operational choice: I default to Qwen3.5-27B-FP8 for agentic obedienceinstruction following, clean tool execution, and low loop rate. Qwen3.6-35B-A3B offers much faster TTFT and similar headline capability to Qwen3.5-27B on AA-style benchmarks, but in these runs it trades that for loops and tool failures that terminate long sessions. For agent work, I prioritize reliability over raw benchmark scores.