Inference Radar

TL;DR

KV cache became the battleground: vLLM, SGLang, TensorRT-LLM, ai-dynamo, oMLX, and Cactus all pushed on cache routing, offload, compression, or speculative execution because long-context inference is now a memory-systems problem.1
Gemma 4 drove ecosystem-wide fixes: llama.cpp, Ollama, MLX, oMLX, LocalAI, LiteRT, and multiple Apple-side runtimes worked through multimodal, QAT, MTP, and tool-calling edge cases.2
Edge inference got less toy-like: ExecuTorch, LiteRT, MNN, sherpa-onnx, FluidAudio, Cactus, and Qualcomm AI Hub all advanced low-bit, NPU, ANE, QNN, Cortex-M, or Android deployment paths.3
Compiler work chased the new low-precision frontier: Triton, TileLang, OpenXLA, ONNX Runtime, TVM, and OpenVINO all moved on FP8, FP4, graph capture, WebGPU, ROCm, or backend lowering.4
Agent gateways hardened into infrastructure: LiteLLM, LocalAI, Ray Serve, ai-dynamo, and Open WebUI spent the week on tool-calling, Responses-style APIs, routing, auth, observability, and production failure modes.5

This Week in Inference

The dominant model story was not one single model drop; it was the downstream blast radius of models designed for local, multimodal, and long-context use. Google’s Gemma 4 continued to propagate through local and edge runtimes, with Ollama adding Gemma QAT tags and llama.cpp fixing multimodal conversion/runtime paths for text, image, video, and audio-oriented flows (Ollama, llama.cpp).6 Hugging Face landed DiffusionGemma in Transformers, DeepSeek V3.2 experimental support, and Cosmos3 integrations across Transformers and Diffusers, while NVIDIA’s stack kept moving Nemotron and Cosmos coverage through TensorRT-LLM and ai-dynamo previews (Transformers, TensorRT-LLM, ai-dynamo).7 Outside the repo stream, market coverage also flagged MiniMax M3 as a native-multimodal, long-context model, reinforcing why inference memory hierarchy is now the limiting layer (coverage).10

The technical frontier was KV cache and low precision. KVarN-style two-bit KV-cache quantization showed up as a live topic in serving-engine communities rather than just as a paper, with vLLM and SGLang users explicitly discussing KV-cache quantization for long-context scaling (vLLM, SGLang).11 In parallel, NVFP4, FP8, MXFP4, and FP4 execution paths kept spreading across TensorRT-LLM, SGLang, ROCm AITER/ATOM, TileLang, Triton, and vLLM, which makes “one quantization format” look increasingly unrealistic for serious fleets (TensorRT-LLM, SGLang, AITER, TileLang).15
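For a sense of scale, here is a back-of-the-envelope sketch of why two-bit KV-cache quantization matters at long context. The model shape is hypothetical and is not taken from any model named above. The formula counts only the raw K and V values; real schemes also store scales or zero-points, so actual savings would be somewhat smaller.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Raw KV-cache size in bytes for one sequence (no quantization metadata)."""
    # Factor of 2 covers the separate key and value tensors in every layer.
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return values * bits_per_value // 8


# Hypothetical model shape and context length, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
TOKENS = 131072  # 128K-token context

fp16_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, 16)
int2_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, 2)

GIB = 2**30
print(f"16-bit KV cache: {fp16_bytes / GIB:.1f} GiB per sequence")
print(f" 2-bit KV cache: {int2_bytes / GIB:.1f} GiB per sequence")
print(f"reduction: {fp16_bytes // int2_bytes}x")
```

Under these assumptions a single 128K-token sequence drops from 16 GiB to 2 GiB of cache, which is the difference between one long request per accelerator and several.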

The infrastructure news was that edge and browser inference are now being treated as first-class deployment targets rather than downstream ports. Meta’s ExecuTorch pushed mobile, Apple, Qualcomm QNN, Arm, WebGPU, Vulkan, CUDA/AOTI, MLX, Cortex-M, and NXP work in the same week, while Google’s LiteRT/LiteRT-LM stack moved on QNN, OpenVINO, Metal, browser WASM, WebGPU demos, and Android/Python serving ergonomics (ExecuTorch, LiteRT, LiteRT-LM).3 That is the real industry move: edge inference is being institutionalized inside the same open-source governance and CI systems that already serve cloud inference workloads (ExecuTorch).3

Deeper Dive

Everything below is for readers who want the full picture. Feel free to scroll.

Code Changes by Category

Cloud & Datacenter Serving

vLLM’s week centered on KV/offload, speculative decoding, MoE refactors, ROCm/XPU/Gaudi support, and model/API compatibility, with vllm-gaudi tracking upstream churn around offloading, serving imports, environment handling, and the MoE runner split (vLLM, vllm-gaudi).1 SGLang had the highest serving-engine velocity: n-gram speculative decoding, CUDA graph phase runners, unified-KV attention on ROCm, DeepSeek/MoE serving, diffusion serving, NPU/XPU work, HiCache fixes, and GB300/Arm64 CI all landed in one dense slice (SGLang).19 TensorRT-LLM focused on AutoDeploy, disaggregated KV serving, Eagle/MTP speculative decoding, VisualGen, NVFP4 kernels, DeepGemm MoE paths, and hardware matrix validation across B200/GB200/GB300/H100/A100-class systems (TensorRT-LLM).27

Ray focused on production reliability around Serve LLM, Ray Data, autoscaling, object-store pressure, shutdown races, dashboard metrics, and CUDA image publishing (Ray).28 ai-dynamo spent the week on KV router correctness, streaming/frontend hot paths, parser/tool-calling parity, XPU support, runtime-free routing, autoscaling signals, and experimental backend/model previews (ai-dynamo).25 Triton Inference Server had a smaller but useful week, fixing OpenAI frontend compatibility with newer Hugging Face tokenizers, normalizing container install permissions, and hardening HTTP/L0 tests (Triton Inference Server).29

Local LLM Runtimes

llama.cpp expanded Gemma 4 MTP, assistant-model support, video input, Granite vision, Qwen-style frame merging, parser fixes, and backend acceleration across Vulkan, WebGPU, CUDA, HIP, SYCL, OpenCL, Metal, WASM, and RVV (llama.cpp).2 Ollama stabilized its local distribution around Gemma 4 crashes, MLX KV/speculation internals, Hermes Desktop and Oh My Pi launch flows, OpenAI model-list compatibility, and QAT model tags (Ollama).20 LocalAI shipped distributed-mode fixes, realtime and RAG improvements, video input to llama.cpp backends, Nemotron ASR, Ideogram support, and a broad backend refresh across llama.cpp, whisper.cpp, stable-diffusion.cpp, parakeet.cpp, qwen3-tts.cpp, and more (LocalAI).24

llamafile had a targeted Windows reliability fix that moves GPU probing out of process so failed Vulkan/CUDA probes cannot corrupt the runtime before CPU fallback (llamafile).30 WebLLM aligned OPFS cache behavior with TVM, fixed OpenAI-compatible timestamp units, repaired cache docs, and opened example-app HTML-sink hardening work (WebLLM).31 GPT4All and FastChat were code-quiet but saw community PRs around desktop UX, remote-provider headers, benchmarks, T5 context budgeting, and stale bug fixes (GPT4All, FastChat).32
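The out-of-process probing idea generalizes beyond llamafile. The sketch below is not llamafile's implementation (that project is native code); it is a minimal Python illustration of the pattern. The probe strings, `probe_in_child`, and `choose_backend` are hypothetical names, and the probes are stand-ins that deliberately crash or fail rather than touch a real GPU driver.

```python
import subprocess
import sys


def probe_in_child(probe_code: str, timeout_s: float = 10.0) -> bool:
    """Run probe code in a separate interpreter; True only on a clean exit."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe_code],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable probe counts as a failed backend.
        return False
    return result.returncode == 0


def choose_backend(probes: dict) -> str:
    """Return the first backend whose probe succeeds, else fall back to CPU."""
    for name, code in probes.items():
        if probe_in_child(code):
            return name
    return "cpu"


# Hypothetical stand-ins for driver probes: one aborts the child process,
# the other exits with a nonzero status.
HYPOTHETICAL_PROBES = {
    "vulkan": "import os; os.abort()",
    "cuda": "raise SystemExit(3)",
}

backend = choose_backend(HYPOTHETICAL_PROBES)
print(f"selected backend: {backend}")
```

Because the crash happens in a child process, the parent's memory and state are untouched when it falls back to CPU, which is the property the fix is after.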

Apple Silicon & MLX Ecosystem

Apple’s MLX core focused on correctness: thread-local tracing, compile-cache safety, overflow fixes, large-offset GEMV/gather-MM support, complex VJPs, and NN helper fixes (MLX).34 mlx-lm added Mellum, Qwen pipelining, Gemma unified loading, LFM2 MoE routing, Granite loading fixes, server/tokenizer fixes, and a chat/LoRA UI, while mlx-swift-lm added audio plumbing and VLM/Gemma correctness fixes (mlx-lm, mlx-swift-lm).35 Blaizzy’s MLX ecosystem was especially active: mlx-audio and mlx-audio-swift added Nemotron ASR ports and cache-aware streaming, while mlx-vlm shipped Gemma/Qwen/APC/server fixes and broader multimodal support (mlx-audio, mlx-vlm).37

oMLX pushed Memory Guard, macOS memory telemetry, SSD KV-cache safety, Gemma 4 Unified multimodal support, DFlash fixes, native embeddings/reranking, and OpenAI/Anthropic API polish (oMLX).21 vllm-mlx hardened Apple-side serving with fail-fast admission, SSD-cache spill snapshots, SpecPrefill knobs, and continuous-batching crash fixes (vllm-mlx).39 Osaurus shipped a rapid macOS agent release train around Gemma4/vMLX, MiMo/N2/LFM/Nemotron paths, Skill import, App Intents, diagnostics, i18n, and crash repair (Osaurus).40

Mobile & Edge Frameworks

ExecuTorch was the week’s broadest edge-runtime story, with Apple image processing, QNN quantization paths, Arm/VGF/Ethos-U/Cortex-M/NXP work, WebGPU/Vulkan updates, CUDA/AOTI delegate composability, and MLX/GGUF kernels (ExecuTorch).3 Google’s LiteRT stack advanced QNN, OpenVINO, Metal, INT2, external weights, large model loading, browser WASM, WebGPU demos, Gemma export, AICore diagnostics, and AI Edge samples for Samsung NPU paths (LiteRT, LiteRT-LM, AI Edge Gallery).17 Alibaba MNN added 2-bit/3-bit GPU weight execution across OpenCL, Metal, and Vulkan, RVV norm/GELU kernels, and GEMM/GEMV profiling coverage (MNN).42

sherpa-onnx expanded Qualcomm QNN execution for offline and streaming Zipformer, added X-ASR export/publishing automation, and landed Arabic CATT diacritization for TTS quality (sherpa-onnx).43 FluidAudio and mobius pushed CoreML/ANE speech inference with Nemotron ASR conversion/runtime work, multilingual streaming support, Supertonic/PocketTTS routing, and ANE optimization records (FluidAudio, mobius).44 Cactus advanced long-context mobile inference with KeyDiff KV compaction, chunked prefill, Gemma 4 multimodal handling, and Needle v2 engine/transpiler work (Cactus).46

Compilers, Runtimes & Graph Engines

Triton’s week was dominated by AMD stabilization, Blackwell TMEM/MMA work, Gluon scheduling, warp-specialization lowering, frontend/JIT correctness, and LLVM artifact cleanup (Triton).4 OpenXLA advanced GPU tiling/fusion, VMM allocation, NCCL cleanup, PDL launch support, ROCm infrastructure, SPMD sharding correctness, and MLIR/CPU math cleanup (OpenXLA).23 TVM and tvm-ffi coordinated release-candidate work around FFI integration, wheel publishing, TIR/Relax fixes, Web/WASM OPFS caching, and serialization hardening (TVM, tvm-ffi).47

ONNX and ONNX Runtime worked from spec to runtime: Attention semantics, ONNX release prep, fuzzing, packed sub-byte validation, CUDA Attention, WebGPU buffer pools, QMoE prepacking, RISC-V MLAS kernels, CoreML builders, and security hardening all moved together (ONNX, ONNX Runtime).49 OpenVINO pushed GPU, RV64 snippets, NPU/NPUW, GenAI tokenizer/VLM benchmarking, and NNCF typing/dependency cleanup (OpenVINO, OpenVINO GenAI).51 TileLang released a major compiler/runtime update around CuTeDSL, SM100, FP8/FP4, JIT caching, and backend modularization (TileLang).22

Models, Quantization & Optimization

Hugging Face added DiffusionGemma, DeepSeek V3.2 experimental support, Cosmos3 support, AutoRound in Diffusers, Triton FP8/FP4 support in Transformers, and low-level Candle fixes for Metal, GGUF loading, and CPU execution (Transformers, Diffusers, Candle).53 ROCm AITER/ATOM focused on gfx1250/gfx950, MI400/MI455-facing kernels, FP4/FP8 MoE/GEMM, MLA, KV cache, DeepSeek V4, vLLM/SGLang plugins, and paired release infrastructure (AITER, ATOM).56 ktransformers added end-to-end Qwen3.5 MoE KT LoRA serving through SGLang and fixed expert-loader correctness for CPU/GPU hybrid MoE deployment (ktransformers).58

Other Notable Changes

LiteLLM had a very high-volume week across provider expansion, Responses routing, MCP gateway features, user-scoped MCP env vars, encrypted storage, guardrails, accounting, dashboard API generation, and deployment cleanup (LiteLLM).5 Qualcomm’s AI Hub stack improved QDC scorecards, Windows setup, model catalogs, LLM evaluation, device matrices, app CLI setup, and Snapdragon workflow reproducibility (ai-hub-models, ai-hub-apps, nexa-sdk).59 RunanywhereAI merged a broad V2 SDK architecture across C++ core, generated IDL bindings, plugin ABI, routing, telemetry, and five frontend SDKs (runanywhere-sdks).62

Community Pulse

The most active communities were the ones absorbing model churn fastest: llama.cpp saw heavy Gemma 4, parser, server, and backend discussion; vLLM saw intense issue traffic around Gemma 4, DeepSeek, Qwen, ROCm, CUDA packaging, tokenizer behavior, and KV quantization; SGLang saw active debate around speculative decoding, DeepSeek/MoE, ROCm, diffusion, and KVarN; and Ollama saw large regression/support volume around Gemma 4, GPU selection, Windows packaging, and MLX behavior (llama.cpp, vLLM, SGLang, Ollama).1

Open WebUI was code-quiet but support-heavy, with regressions around search, Ollama indicators, MCP OAuth scopes, tool-call proxying, and a folder-delete permission bypass fix in flight (Open WebUI).63 JAX maintainers explicitly noted pressure from AI-generated bug reports and special-function corner cases, which is a signal that maintainer bandwidth is now a platform-level constraint (JAX).64 MLX, LiteLLM, oMLX, Osaurus, and LocalAI all showed strong community traction because they sit at the intersection of local runtime, agent UX, and practical deployment pain (MLX, LiteLLM, oMLX, Osaurus, LocalAI).5

Community Debates

Ollama’s Intel GPU question remains unresolved. A long-running SYCL-backed Intel iGPU/Arc proposal closed without merge after strong community interest and debate over SYCL versus Vulkan as the right long-term abstraction (proposal).65 The closure leaves Intel GPU support as one of the most visible gaps in the local inference distribution layer.

SGLang pushed back on abstraction for abstraction’s sake. A PassManager/Fusion proposal closed after concerns about complexity, debuggability, and overhead, even though SGLang is aggressively refactoring its execution paths elsewhere (proposal).66 The direction is not “no architecture”; it is “prove the abstraction pays rent in hot serving paths.”

Cactus split ARM SME2 acceleration into smaller pieces. A broad SME2 acceleration proposal with strong performance claims was closed and decomposed into file-borne packed-panel CQ4 weights and SME2 streaming kernels (original, successor).67 The technical debate centered on layout, runtime weight formats, power/thread tradeoffs, and energy per token.

Uzu rejected a weakly justified OpenAI-compatibility surface. A proposal for user stop sequences closed after maintainers argued that grammar-based control already covers the maintainable structured-output cases and that each OpenAI-compatible feature carries long-term support cost (proposal).69 The same project narrowed response_format support away from unenforceable JSON schema behavior until runtime enforcement is stronger (proposal).70

MLX maintainers kept drawing API boundaries. Several MLX-side PRs closed because maintainers preferred explicit APIs, long-term Metal direction, or compile/fusion-based approaches over quick patches for metallib discovery, bf16 random memory behavior, fp8/bfp8, memory tracking, and NaN semantics (MLX).71 This is healthy friction: Apple’s local stack is becoming production infrastructure, and API shape now matters as much as feature velocity.

Worth Watching

KV-cache quantization is the trend to track next week, with KVarN appearing simultaneously in vLLM and SGLang community threads and with multiple projects already building offload, compression, and disk-persistence paths (vLLM, SGLang, exo).72 DiffusionGemma support is likely to ripple through serving stacks after Hugging Face landed it and vLLM opened support work (Transformers, vLLM).7 Security around model files and web-exposed examples is also rising: WebLLM had example HTML-sink hardening, oobabooga saw SSRF/upload-validation proposals, and SGLang/GGUF RCE coverage remains part of the market backdrop (WebLLM, text-generation-webui, SGLang).75

Major Releases

Version numbers and release-note links live here as the canonical reference.

ai-dynamo shipped five experimental v1.3.0 preview variants, focused on parser/tool-calling parity and model/backend previews for DeepSeek-V4-Pro on TensorRT-LLM, Nemotron-3 Super/Ultra on vLLM, and Kimi K2.6 on vLLM. The main release signal was not production readiness but early validation of runtime-free routing and new model-serving targets.76

Apache shipped tvm v0.25.0.rc0 and tvm-ffi v0.1.12, a coordinated release train centered on FFI integration, renamed package/wheel hardening, TIR/Relax fixes, Web/WASM OPFS caching, and serialization safety. The most important change is TVM’s deeper move onto the newer FFI substrate. TVM release notes.77

BerriAI shipped an unusually dense LiteLLM release train from v1.84.5 through v1.89.0-rc.2, with the available notes emphasizing Docker image signature verification while the code stream focused on provider expansion, Responses coverage, MCP gateway work, enterprise proxy hardening, guardrails, accounting, and dashboard cleanup. The most visible release artifact was the signed-image verification guidance. Latest release.78

Blaizzy shipped mlx-audio v0.4.4 plus mlx-vlm v0.6.2 and v0.6.3, spanning Qwen3-TTS cache work, Higgs Audio streaming, APC multimodal splitting, Qwen quantized KV-cache prompt state, Gemma 4 unified input fixes, Ideogram 4, and Nemotron-H Nano Omni stability. The cross-release theme was MLX multimodal reliability under real inputs. mlx-vlm release notes.79

FluidInference shipped FluidAudio v0.15.0, v0.15.1, and v0.15.2, moving from SenseVoice/Paraformer/offline enforcement into Parakeet GPU placement, Nemotron English streaming, Supertonic-3 ANE-bucketed variants, and PocketTTS fused flow-decoder work. The release arc reflects a serious CoreML/ANE speech-inference optimization program. Latest release.80

ggml ran a very high-cadence llama.cpp release train, with 65 release events across Gemma 4 MTP and assistant support, MTMD video input, LFM parser fixes, backend rolling builds, and a LibreSSL vendor update; whisper.cpp had no release despite a large backend sync. The week’s most important llama.cpp releases were the Gemma 4 and multimodal/runtime fixes. Latest llama.cpp release.81

Hugging Face shipped Transformers v5.11.0 and v5.10.2; the headline release added DiffusionGemma, while the patch release fixed a CLIP conversion issue affecting models such as SAM3. The broader code stream also added DeepSeek V3.2 experimental support, Cosmos3, Triton FP8/FP4 support, and quantization/runtime fixes. Transformers release notes.7

jundot shipped seven oMLX releases from v0.4.2.dev1 through v0.4.4.dev1, focused on Memory Guard, macOS 27 compatibility, SSD KV-cache reliability, DFlash fixes, Gemma 4 Unified multimodal support, native embedding/reranker serving, and audio-model support. The most impactful change was memory-pressure recovery for long-context Apple Silicon serving. Latest release.82

Liquid4All shipped liquid-audio v1.3.0, adding Japanese model support, adjustable interleaved-token ratio in data processing, removal of torchcodec, and Japanese instructions. The release paired with broader cookbook and finetuning work, but the shipped artifact was squarely about productizing Japanese audio support.83

LocalAI shipped v4.4.0, framed around multimodal and distributed deployments, with new audio backends, native object detection/segmentation, distributed serving fixes, realtime/RAG improvements, video input, Nemotron ASR, and extensive backend refreshes. This was one of the week’s most complete local-serving releases.84

NVIDIA shipped TensorRT-LLM v1.3.0rc18, a prerelease covering Nemotron-H NVFP4 on Hopper, Qwen image support, Step-3.7-Flash, Cosmos3 variants, AFMoE Trinity, logprobs_simple_format, and known DeepSeek/CuteDSL MoE perf issues on GB-class systems. The release sits on top of a week heavy in AutoDeploy, speculative decoding, VisualGen, KV, and kernel work.8

Ollama shipped v0.30.5, v0.30.6, and v0.30.7, moving from a Gemma 4 12B crash fix and Hermes Desktop launch support into Gemma 4 QAT tags, Oh My Pi launch support, and native Windows Hermes behavior. The dominant theme was stabilizing the 0.30.x local runtime line under Gemma 4 and MLX pressure. Latest release.85

OpenNMT shipped CTranslate2 v4.8.0, adding Google T5Gemma2 support, enabling PACKED_GEMM by default for Intel MKL, upgrading Thrust to CCCL 2.7.0, and including additional dependency/performance fixes. The release is compact but relevant for CPU inference and model conversion coverage.86

Osaurus shipped nine 0.19.x releases, ending at 0.19.17, focused on Gemma4/vMLX reliability, MiMo/N2/LFM/Nemotron runtime paths, local model discovery, Skill import, App Intents, diagnostics, Chinese translations, and crash repair. The release cadence reflects a native macOS agent app rapidly absorbing local-model runtime churn. Latest release.87

Qualcomm shipped ai-hub-apps v0.30.1 and ai-hub-models v0.55.0, with the apps release adding single-command setup/fetch flows and the models release adding NAFNet, PiperTTS languages, YOLO variants, EdgeTAM tracking, Voice AI SDK links, and refreshed performance data. The broader org theme was Snapdragon AI workflow reproducibility across QDC, scorecards, Windows, and model catalogs. ai-hub-apps release notes.88

ROCm shipped paired AITER and ATOM releases: AITER v0.1.15, v0.1.15.post1, v0.1.14.post1, plus ATOM v0.1.4 and v0.1.4-rc0. The dominant theme was synchronizing ROCm kernel and serving layers for vLLM/SGLang/DeepSeek-style production inference. AITER release notes.89

TileLang shipped v0.1.11, covering dynamic-index atomic_load, SM100 GEMM examples, CuTeDSL updates, FFI host-binder refactors, scan ops, CUDA __ffs, named barriers, mixed-dtype reduce fixes, software-pipeline refactors, shared-memory liveness, and Windows symlink gating. It was one of the week’s most important compiler-runtime releases for sub-byte GPU kernel experimentation.16

try-mirai shipped lalamo v0.13.0, v0.13.1, and a rapid Uzu 0.5.x release sequence through 0.5.12, focused on end-of-thinking tags, Qwen generation parameters, pull/registry fixes, unified sampling, repetition penalty, CLI/server startup polish, telemetry, benchmarks, and quantization prep. The release theme was tightening a small on-device model/runtime stack into a usable product surface. lalamo release notes.90

vLLM shipped v0.22.1, a patch release adding JetBrains Mellum v2 support, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and model-loading regressions. The broader week was far larger than the patch release, but this was the canonical shipped artifact.91

Gemma 4 Exposes Inference’s Memory Wall