TL;DR
- MTP becomes mainstream: llama.cpp, SGLang, vLLM, Ollama, MLX-adjacent projects, and local desktop servers all pushed multi-token prediction or speculative decoding closer to default inference behavior.1
- Blackwell and low-bit serving harden fast: vLLM, SGLang, TensorRT-LLM, ROCm AITER, and Triton all advanced FP8, FP4, MXFP4, NVFP4, MoE, and next-gen GPU paths.2
- Edge stacks look more like cloud servers: LiteRT-LM, ExecuTorch, Qualcomm AI Hub, Cactus, MNN, ncnn, and sherpa-onnx all moved on mobile APIs, hardware backends, model packaging, or deployability.3
- Apple Silicon inference is getting serious infrastructure: Apple MLX, MLX Swift LM, vllm-mlx, omlx, Osaurus, Ollama, FluidAudio, and Blaizzy’s MLX projects all shipped serving, cache, VLM, audio, or app-runtime improvements.
- Security is now part of inference engineering: Open WebUI, LocalAI, Triton, FastChat, and ONNX Runtime all surfaced concrete hardening work around SSRF, file access, unsafe deserialization, archive safety, or supply-chain verification.4
This Week in Inference
The model news was less about a new open-weight frontier release and more about the ecosystem digesting recent model families into production runtimes. Google’s Gemini 3.5 Flash showed up quickly in LiteLLM provider metadata, while TensorRT-Edge-LLM added Alpamayo-1-10B, Qwen3.5 MTP, and Qwen3-TTS serving paths for edge deployment workflows (LiteLLM, TensorRT-Edge-LLM).6 Hugging Face Transformers added Cohere2 MoE / Command A+, PP-OCRv6, HRM-Text, and Parakeet TDT ASR support, showing that “model support” increasingly means tokenizer quirks, processors, multimodal inputs, serving metadata, and distributed loading rather than just a config class (Transformers).7
The dominant technical theme was speculative decoding moving from an advanced option to expected infrastructure. llama.cpp wired MTP heads from the same GGUF into a draft context, SGLang formalized DeepSeek V4 and MTP-heavy serving paths, Ollama added DFlash speculative decoding in its MLX runner, and vLLM continued folding speculative and offload-oriented work into a Blackwell-era serving release (llama.cpp, SGLang, Ollama, vLLM).2 At the same time, KV-cache and memory work showed up everywhere: vLLM’s hybrid memory allocator and KV offload, SGLang’s HiCache and disaggregated serving fixes, ai-dynamo routing metadata, omlx hot-cache work, and Google/Apple edge runtimes all treated memory layout as the real serving constraint (vLLM, SGLang, ai-dynamo, omlx).10
Hardware and infrastructure work followed the same pattern: Blackwell, Hopper, AMD MI35x, Intel XPU, Qualcomm, Apple GPUs, RISC-V, Arm, and WebGPU all got meaningful attention in the same week. Triton-lang and TileLang worked on compiler/backend plumbing, ROCm AITER moved its Triton baseline forward, Google’s LiteRT stack broadened WebGPU, NPU, and mobile coverage, and Meta’s ExecuTorch pushed MLX/Gemma/LFM2.5 export and edge backend enablement (Triton, TileLang, AITER, LiteRT-LM, ExecuTorch).3 The broader industry signal is clear: inference optimization is now a product surface, not a backend detail, and the projects that win will be the ones that make speed, memory, safety, and compatibility boring.
Top Stories
llama.cpp makes MTP portable
llama.cpp landed the week’s most important local-runtime shift by loading MTP heads from the same GGUF into a separate draft context, making speculative decoding usable in the portability layer that anchors much of desktop and edge inference (llama.cpp).1 Follow-up work cleaned up rollback, graph reuse, prompt overhead, backend sampling direction, conversion help, and speculative-device crash behavior, which is the kind of unglamorous stabilization required before a speed feature becomes default infrastructure (llama.cpp).18
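For readers new to the technique, the draft-then-verify loop behind MTP-style speculative decoding can be sketched in a few lines. This is a toy, greedy-only illustration and not llama.cpp's implementation: toy_target, toy_draft, speculative_step, generate, and greedy are hypothetical names, and a real runtime would verify all drafted tokens in one batched forward pass instead of calling the target once per token.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Return the tokens accepted in one draft-then-verify step (greedy)."""
    # 1. Draft k tokens with the cheap model (an MTP head plays this role).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. Verify: keep drafted tokens while they match the target's greedy pick.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: emit the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the target adds one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]


def greedy(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Hypothetical toy models: the target counts upward modulo 10.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


# The draft agrees with the target except right after token 2.
def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10
```

The property worth noticing is that the output matches plain greedy decoding exactly; the draft only changes how many target evaluations can be amortized per step, which is why rollback and acceptance bookkeeping matter so much in the follow-up work.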
vLLM and SGLang race into Blackwell-era serving
vLLM centered its release around hybrid memory allocation, KV offload, Blackwell-oriented backend work, FP8/FP4 kernels, and a toolchain break that signals a more demanding serving baseline (vLLM).2 SGLang, meanwhile, paired DeepSeek V4 support with disaggregated serving, HiCache, Blackwell FP8 defaults, speculative decoding fixes, and hardware-specific kernel work across NVIDIA, AMD, XPU, and NPU paths (SGLang).8
Google’s edge stack starts looking like a full serving platform
LiteRT-LM shipped Swift/iOS Metal APIs, WebGPU and CPU JavaScript inference, CLI NPU support, and OpenAI-compatible serving work, while Gallery added experimental MCP-enabled agent chat and permission flows (LiteRT-LM).3 That matters because Google’s edge work is no longer just “run a small model on-device”; it now spans export, quantization, web/mobile runtime APIs, agent UX, and backend selection (Gallery).19
Apple Silicon inference gets a production layer
Apple MLX hardened Windows JIT, CPU allocator behavior, Metal correctness, and Swift LM performance, while MLX-adjacent projects added OpenAI-style APIs, prefix caching, speculative decoding, VLM serving, audio models, and macOS app runtime fixes (MLX).20 The Apple ecosystem is increasingly mirroring the cloud stack: not just kernels, but schedulers, caches, parsers, APIs, release channels, and model-management UX (vllm-mlx).21
Security and supply chain become inference features
Open WebUI hardened SSRF and shared-file authorization paths, LocalAI added backend OCI image verification with a hard-fail mode, and Triton added request-path and archive-safety hardening work (Open WebUI, LocalAI, Triton).4 FastChat also received a high-severity report around image loading enabling SSRF and local file reads, reinforcing that inference servers now sit in the same threat model as API gateways and data services (FastChat).24
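The SSRF class of bug mentioned above usually comes down to a server fetching a user-supplied URL without checking where it points. A minimal sketch of the kind of guard involved follows, using only the Python standard library. It is not taken from Open WebUI, FastChat, or Triton; is_public_ip, resolve_host, and is_url_allowed are hypothetical names, and the resolver parameter is an assumption added so the check can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """True only for addresses outside private, loopback, and similar ranges."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before checking.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def resolve_host(host: str) -> List[str]:
    """Default resolver: every address the hostname maps to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_url_allowed(url: str,
                   resolver: Callable[[str], Iterable[str]] = resolve_host) -> bool:
    """Allow only http(s) URLs whose host resolves solely to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # blocks file://, gopher://, and similar schemes
    host = parts.hostname
    if not host:
        return False
    try:
        addresses = list(resolver(host))
    except OSError:
        return False
    # Reject if any resolved address is internal, not just the first one.
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

A check like this is only part of the picture: the fetch itself should connect to the address that was validated, since a hostname can resolve differently between the check and the request, and redirects need the same treatment.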
Deeper Dive
Everything below is for readers who want the full picture. Feel free to scroll.
Code Changes by Category
Cloud & Datacenter Serving
vLLM shipped a major serving release focused on KV offload, hybrid memory allocation, Blackwell kernels, FP8/FP4 paths, MoE correctness, disaggregated serving, and Gaudi plugin compatibility (vLLM).2 SGLang had the heaviest single-repo week in cloud serving, with DeepSeek V4, disaggregated execution, HiCache, speculative decoding, scheduler fixes, diffusion paths, and hardware-specific kernels moving together (SGLang).8
NVIDIA’s TensorRT-LLM work centered on serving/runtime maturity, multimodal models, MoE quantization, disaggregated metrics, and regression testing, while TensorRT-Edge-LLM shipped new edge model support and a migration path toward its experimental loader (TensorRT-LLM, TensorRT-Edge-LLM).6 ai-dynamo advanced routing constraints, stable worker identity, request metadata propagation, XPU/NIXL support, Prometheus hardening, and Kubernetes/operator reliability in one tightly coordinated serving pass (ai-dynamo).12
Ray Serve added HAProxy ingress metrics, splice defaults, retry knobs, a pick-only LLM ingress path, and stalled-replica watchdogs while Ray Data moved Parquet reads to its newer datasource path by default (Ray).26 LMDeploy added OpenAI-compatible routed-expert/token telemetry, Qwen3/Qwen3.5 Lite AWQ support, TurboMind diagnostics, and VLM serving memory/correctness fixes (LMDeploy).27 LiteLLM continued broad provider-surface expansion with Gemini managed agents, Gemini metadata, Anthropic/DeepSeek compatibility, MCP routing, proxy admin controls, and Docker signature guidance (LiteLLM).28
Local LLM Runtimes
llama.cpp dominated local runtime work with MTP support, a unified executable, GPU backend acceleration across CUDA/HIP/Vulkan/Metal/OpenCL/SYCL/Hexagon, and a large wave of server/UI/multimodal changes (llama.cpp).1 Ollama shipped Codex App support, MLX sampling rewrites, DFlash speculative decoding, startup caching improvements for large local model stores, and issue-driven work around GGUF parsing and tool calling (Ollama).29
LocalAI improved OpenAI-compatible streaming and realtime behavior, Ollama/Home Assistant compatibility, llama.cpp/MTP backend pins, supply-chain verification, Nix support, and wrapper alignment for stable-diffusion.cpp and ACE-Step (LocalAI).30 text-generation-webui shipped MTP speculative decoding, automatic MTP GGUF enablement, multimodal projector discovery, Electron packaging polish, and path/CORS/web-search hardening (text-generation-webui).31
exo improved distributed-runtime diagnostics, OpenAI/Ollama API compatibility, first-run setup, and model-card backend metadata validation (exo).32 llamafile focused on docs, GPU-offload troubleshooting, AMD ROCm guidance, CPU regression investigation, Unix socket diagnostics, and local history privacy expectations (llamafile).33
Apple Silicon & MLX Ecosystem
Apple MLX landed Windows CPU JIT, allocator safety, thread/lifetime fixes, Metal/steel GEMM correctness, depthwise Conv2D fixes, and Swift LM decode/model-parity improvements (MLX, mlx-swift-lm).20 Blaizzy’s MLX projects added FSMN-VAD, Irodori-TTS, Anthropic-style VLM messages, OpenAI-style responses, speculative decoding refactors, text-only compatibility, LoRA unification, and Swift audio fixes (mlx-audio, mlx-vlm).35
vllm-mlx added system-prompt KV snapshots, MTP fixes, Gemma 4 JSON-schema output fixes, lifecycle tests, hot-path trimming, and streamed reasoning-parser behavior improvements (vllm-mlx).21 omlx focused on DFlash/MTP reliability, oQ quantization, VLM loading, per-model cache hit-rate UI, hot-cache eviction, OpenAI-compatible reasoning/tool flows, and low-memory Mac protection (omlx).37 Osaurus had a rapid-release macOS app week around vMLX consolidation, provider compatibility, MCP/plugin expansion, local model stability, document extraction, and app UX hardening (Osaurus).38
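Prefix snapshots, hot-cache eviction, and hit-rate reporting fit together roughly as sketched below. This is a generic illustration under stated assumptions, not the vllm-mlx or omlx code: PrefixKVCache and its methods are hypothetical, KV state is an opaque object, and eviction is plain least-recently-used under a byte budget.

```python
from collections import OrderedDict
from typing import Optional, Sequence, Tuple


class PrefixKVCache:
    """LRU cache of KV snapshots keyed by token prefix, with a byte budget."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self.hits = 0
        self.lookups = 0
        # Maps prefix -> (kv_state, size_bytes); oldest entries come first.
        self._entries: "OrderedDict[Tuple[int, ...], Tuple[object, int]]" = OrderedDict()

    def put(self, prefix: Sequence[int], kv_state: object, size_bytes: int) -> bool:
        """Store a snapshot, evicting least-recently-used entries to make room."""
        key = tuple(prefix)
        if size_bytes > self.budget_bytes:
            return False  # can never fit, so refuse instead of emptying the cache
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        while self.used_bytes + size_bytes > self.budget_bytes:
            _, (_, freed) = self._entries.popitem(last=False)
            self.used_bytes -= freed
        self._entries[key] = (kv_state, size_bytes)
        self.used_bytes += size_bytes
        return True

    def longest_prefix(self, tokens: Sequence[int]) -> Optional[Tuple[Tuple[int, ...], object]]:
        """Return the longest cached prefix of tokens and its KV state, if any."""
        self.lookups += 1
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._entries:
                self._entries.move_to_end(key)  # mark as recently used
                self.hits += 1
                return key, self._entries[key][0]
        return None

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0
```

In a chat server, the system prompt would be the natural first entry: every request that starts with it can skip prefill for those tokens, and the hit rate is the number a per-model cache UI would surface.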
Mobile & Edge Frameworks
Google’s LiteRT ecosystem moved across LiteRT, LiteRT-LM, litert-torch, samples, Gallery, MediaPipe, XNNPACK, and AI Edge Quantizer with Gemma 4, Swift/iOS, WebGPU, NPU, Tensor G5, OpenVINO, 2-bit quantization, and MCP-enabled mobile agent work (LiteRT-LM, Gallery).3 Meta ExecuTorch added Android LLM instrumentation, MLX/Gemma/LFM2.5 export and artifacts, RISC-V CI, Arm bare-metal modernization, Qualcomm QNN debugging, NXP Neutron work, and CoreML fallback fixes (ExecuTorch).17
Qualcomm AI Hub expanded model packaging for multi-file downloads, added Pi0.5 support, modernized Qwen2.5-VL, enabled LLM Scorecard/eval work, and removed llama.cpp runtime integration from ai-hub-models (Qualcomm AI Hub).39 Alibaba MNN added W8A8 Vulkan cooperative-matrix work, DFlash speculative decoding, QNN Windows fixes, Metal ROIAlign, RVV quant/matmul helpers, and converter fixes (MNN).40 Tencent ncnn pushed RISC-V, MIPS/LoongArch, ARM BF16/SDPA, x86 int8, Vulkan driver workarounds, and quantization tooling cleanup (ncnn).41
Cactus consolidated conversion/transpilation, added model-family support for Qwen, LFM, Gemma4, Parakeet, and Whisper, refactored quantization parameters, fixed KV-cache reset behavior, and continued Android/iOS bindings work (Cactus).42 sherpa-onnx improved C/C++ docs and examples, Node.js TTS/microphone paths, React Native discoverability, and RISC-V64 tooling (sherpa-onnx).43 FluidAudio exposed Supertonic-3 CoreML TTS, added PocketTTS voices, improved Parakeet long-form ASR arbitration, and expanded CLI configurability (FluidAudio).44
Compilers, Runtimes & Graph Engines
Apache TVM introduced initial TIRx infrastructure for low-level GPU programming, Blackwell-oriented codegen, CUDA intrinsics, TVMScript support, and extensive test coverage (TVM).45 TileLang migrated to TVM’s newer TIRx/TVM-FFI stack, added Hopper/Blackwell TMA cluster-copy features, CUDA SM75 MMA GEMM, Programmatic Dependent Launch support, ROCm wheels, and sparse/pipeline fixes (TileLang).15
OpenXLA had a huge GPU/compiler week with command-buffer conditionals, collective buffer analysis, SPMD fixes, cuBLASLt cleanup, dynamic-slice fusion, CPU multi-module HLO compilation, ROCm/oneAPI coverage, and build hygiene (OpenXLA).46 TensorFlow mirrored much of that XLA/GPU work in framework land while also fixing LiteRT calibration symbols, tf.data service teardown, and distributed/runtime issues (TensorFlow).47 JAX shipped a release while pushing Pallas/Mosaic GPU work, explicit sharding, random APIs, ROCm hermetic LLVM, and compiler/runtime performance fixes (JAX).48
Triton-lang focused on AMD/ROCm backend correctness, Proton profiling modularity, NVIDIA/Gluon lowering, interpreter parity, FP4 dot-scaled layout support, and sanitizer behavior (Triton).14 Microsoft’s ONNX stack advanced ONNX Runtime QMoE, WebGPU attention, CPU MLAS, quantization correctness, overflow hardening, release modernization, SBOM/protobuf cleanup, and ONNX shape inference fixes (ONNX Runtime, ONNX).49 OpenVINO worked across GenAI chat APIs, Gemma/GQA/sliding-window attention, GPU LoRA/MoE/paged attention, NPU graph lifetime, and NNCF compression correctness (OpenVINO, OpenVINO GenAI).51
Models, Quantization & Optimization
Hugging Face Transformers expanded distributed TP/FSDP2, continuous batching, serving behavior, Cohere2 MoE / Command A+, PP-OCRv6, HRM-Text, Parakeet TDT ASR, and multimodal processor paths (Transformers).7 Diffusers added Motif-Video, LTX-2 IC LoRA/HDR pipelines, LLaDA-2 fixes, TorchAO dequantization, GGUF diagnostics, and Flux/SD3 LoRA training fixes (Diffusers).53 Candle improved Metal concurrent dispatch, inter-encoder synchronization, quantized cache clearing, and layout support for local Rust inference (Candle).54
Intel Neural Compressor carried FP8 functionality into mainline, added JAX/Keras Conv2D quantization, added Gemma 3 JAX quantization validation, and made ViT tests less brittle (Neural Compressor).55 Google AI Edge Quantizer added profiler-based calibration and 2-bit recipes, while XNNPACK added int2/int4 conversion kernels and a qint2 datatype (AI Edge Quantizer, XNNPACK).56 ktransformers added native MXFP4 MoE acceleration for AVX512F-only CPUs and fixed Qwen3.5 packed BF16 MoE loading (ktransformers).58
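For context on what an MXFP4 path has to compute, here is a simplified pure-Python sketch of block-scaled 4-bit quantization in the style of the OCP Microscaling format: a block of values shares one power-of-two scale, and each element is stored as an FP4 (E2M1) value. It is an illustration based on my reading of that format, not the ktransformers kernel. The function names are hypothetical, ties are resolved toward the smaller magnitude instead of round-to-nearest-even, and the 8-bit range limit on the shared exponent is not modeled.

```python
import math
from typing import List, Sequence, Tuple

# Magnitudes representable by FP4 E2M1 (one sign bit is stored separately).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32    # elements sharing one scale in MXFP4
ELEMENT_EMAX = 2   # largest E2M1 exponent: 6.0 == 1.5 * 2**2


def quantize_block(block: Sequence[float]) -> Tuple[int, List[float]]:
    """Return (shared exponent, FP4 values) for one block of floats."""
    amax = max((abs(x) for x in block), default=0.0)
    if amax == 0.0:
        exponent = 0
    else:
        # frexp gives amax == m * 2**e with 0.5 <= m < 1, so floor(log2) == e - 1.
        _, e = math.frexp(amax)
        exponent = (e - 1) - ELEMENT_EMAX
    scale = math.ldexp(1.0, exponent)  # 2**exponent

    codes: List[float] = []
    for x in block:
        target = abs(x) / scale
        # Nearest representable magnitude; values above 6.0 saturate to 6.0.
        mag = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(mag, x))
    return exponent, codes


def dequantize_block(exponent: int, codes: Sequence[float]) -> List[float]:
    """Reconstruct approximate floats from the shared exponent and FP4 values."""
    scale = math.ldexp(1.0, exponent)
    return [c * scale for c in codes]
```

The appeal for CPU and GPU kernels is that dequantization is one multiply by a power of two per element, and the shared scale costs a single byte per block.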
Other Notable Changes
DeepSpeed added AMD MI300 SDMA/Mori allgather for ZeRO-3, bf16 optimizer states with CPU offload, ZeRO-3 late-module wrapping fixes, ZenFlow hang fixes, and data-analyzer command-injection hardening (DeepSpeed).59 Triton Inference Server removed Windows support docs, added C++ gRPC cancellation tests, Torch AOTI QA, BF16 and TensorRT QA modernization, and security-oriented frontend work (Triton Server).60 CTranslate2 added Gemma4 dense model support, CUDA curand cleanup, PyPI publishing fixes, and Wav2Vec2 test reliability improvements (CTranslate2).61
Community Pulse
The hottest community clusters were the same ones showing up in code: MTP correctness, cache pressure, model-family compatibility, and hardware-specific regressions. llama.cpp users reported MTP speed regressions and speculative decoding timeouts after cleanup work, while SGLang users pushed DeepSeek, GLM, B300, H20, MI325X, and /v1/responses edge cases through rapid triage (llama.cpp, SGLang).62
Open WebUI’s issue traffic stayed intense around production deployments, RAG/Knowledge behavior, API tool-calling, MCP OAuth scopes, and contribution-process friction (Open WebUI).64 vLLM’s community discussion focused on architecture simplification, ROCm regressions, DeepSeek-V4 prefix-cache behavior, Gemma4 loading, NIXL limits, and wheel/toolchain expectations (vLLM).65 Apple/local communities concentrated on determinism, memory pressure, TurboQuant expectations, MLX crashes, Mac app reliability, and self-hosted Claude Code-style workflows (MLX, omlx, vllm-mlx).66
Community Debates
Where should TurboQuant-style kernels live? Apple MLX rejected built-in TurboQuant SDPA as too specialized before a generic quantized SDPA path exists, and related Swift-side proposals were redirected away from bindings toward core MLX primitives (MLX).69 The debate matters because Apple inference projects clearly want aggressive KV/cache compression, but maintainers are trying to avoid baking model-specific kernels into the wrong layer.
Should Open WebUI run native tool loops server-side? A REST API proposal for server-side native tool-calling loops was closed on the grounds that the current behavior is intended, despite contributor arguments that automation and MCP use cases need it (Open WebUI).70 The disagreement is really about product boundaries: chat UI orchestration versus API-native agent execution.
llama.cpp maintainers drew lines around speculative decoding design and AI-generated contributions. A partial rollback design for GDN speculative decoding was closed in favor of checkpoint-based speculative decoding, and a Gemma 4 assistant-decoding PR was closed with maintainers citing both overlap and policy concerns around fully AI-agent-coded changes (llama.cpp, llama.cpp).71 That is a useful signal for contributors: performance PRs need architecture fit, reproducible evidence, and maintainable authorship.
Triton rejected a library split on complexity grounds. A proposal to split Triton into separate shared libraries was rejected over exported-symbol risk, ODR conflicts, loading complexity, and unclear value (Triton).73 As inference stacks increasingly embed Triton kernels, this debate will keep returning under the banners of wheel size, plugin boundaries, and backend-free builds.
ai-dynamo refused to vendor unreleased vLLM internals. A compatibility backport proposed bringing in an unreleased vLLM method, but maintainers preferred contributing upstream rather than depending on unstable internals (ai-dynamo).74 That is the right instinct for a serving orchestrator trying to stay compatible across fast-moving engines.
Worth Watching
- MTP everywhere: llama.cpp, SGLang, vLLM, Ollama, MLX-derived stacks, text-generation-webui, Cactus, TensorRT-Edge-LLM, and omlx all moved on speculative decoding or MTP paths this week (llama.cpp, Ollama).1
- KV-cache compression and offload pressure: vLLM’s HMA/KV offload, SGLang HiCache, omlx hot-cache work, Cactus KV fixes, and Google/Apple edge memory issues suggest cache policy is becoming a product differentiator (vLLM, omlx).10
- Blackwell plus AMD MI35x as forcing functions: vLLM, SGLang, TensorRT-LLM, ROCm AITER, Triton, TileLang, and OpenXLA are all bending kernel and compiler stacks around new low-bit hardware paths (AITER, TileLang).75
- MCP and local agents are spreading into inference apps: Google Gallery, Open WebUI, Osaurus, LiteLLM, and LocalAI all touched MCP, tools, permissions, plugins, or agent UX this week (Gallery, Osaurus).77
- Security is no longer optional: SSRF, unsafe file access, archive extraction, deserialization, supply-chain signatures, and header-forwarding behavior all appeared in active inference repos this week (LocalAI, Triton Server).22
Major Releases
vLLM shipped v0.21.0, a substantial infrastructure release with a Transformers v4 deprecation path, a C++20 compiler requirement, KV Offload, Hybrid Memory Allocator work, and Blackwell-oriented serving changes. This was the week’s most consequential datacenter serving release.2
SGLang shipped v0.5.12, centered on DeepSeek V4 full inference, parallelism modes, NVIDIA and AMD accelerator coverage, prefill-decode disaggregation, HiSparse CPU KV offload, parsers, and DeepGEMM/FlashMLA kernels. The release formalized the week’s biggest SGLang engineering themes.8
ggml / llama.cpp shipped an unusually large run of build releases, with the most important themes being Qwen3.x MTP, unified app packaging, Hopper+ PDL, Metal pad/copy optimizations, RDNA tuning, OpenCL Adreno MoE support, and Hexagon kernels. The week’s most strategic llama.cpp milestone was the initial MTP release path.79
Ollama shipped v0.24.0, headlined by Codex App support through ollama launch codex-app and a broader week of MLX speculative decoding and startup-cache work. The release made app-level local-agent workflows the visible product surface.29
Google AI Edge shipped Gallery 1.0.14, LiteRT v2.1.5, LiteRT-LM v0.12.0, and litert-torch v0.9.1, with the dominant theme being mobile/web/runtime expansion across MCP, Swift/iOS, WebGPU, NPU, and packaging. LiteRT-LM was the most technically significant release because it broadened local LLM APIs across Apple, browser, CLI, and Flutter surfaces.3
NVIDIA shipped TensorRT-Edge-LLM v0.7.1 with Alpamayo-1-10B, Qwen3.5 MTP, Qwen3-TTS streaming, FP8 ViT/Qwen3-TTS experimental-loader support, Mamba prefill improvements, and composable runtime stacks. The release also signaled deprecation of the older tensorrt_edgellm path.6
Hugging Face shipped Transformers v5.9.0, led by Cohere2Moe / Command A+ support and backed by a week of distributed loading, serving, multimodal, tokenizer, and security/CI work. This release reinforced Transformers’ role as both model library and increasingly serious serving substrate.7
BerriAI shipped a rapid LiteLLM sequence from v1.84.0 through v1.87.0-dev.1, with the stable train focused on provider expansion, Gemini/Gemini-managed-agent work, proxy hardening, observability, and Docker signature verification. The most representative stable release was v1.85.0.28
LocalAI shipped v4.2.5 and v4.2.6, focused on Ollama/Home Assistant compatibility, realtime output modality behavior, float-encoded Ollama options, MTP llama.cpp defaults, docs, and backend pin churn. v4.2.6 best captures the week’s MTP/runtime-update direction.30
text-generation-webui shipped v4.9 with MTP speculative decoding via draft-mtp, automatic MTP enablement for MTP GGUFs, web-search snippet support, cleaner webpage fetching, and a large set of desktop/runtime/security fixes. It was the week’s biggest end-user local UI release.31
JAX shipped v0.10.1 with ResizeMethod.AREA and new jax.scipy.linalg constructors including hadamard, circulant, dft, and leslie. The release sat atop a very active week of Pallas/Mosaic GPU and sharding work.48
OpenNMT shipped CTranslate2 v4.7.2, adding Gemma4 dense-model support, Gemma 3 conversion fixes, ROCm source-version updates, and CUDA curand-state cleanup before thread destruction. The CUDA cleanup addressed long-running crash pressure from downstream faster-whisper-style users.61
FluidInference shipped FluidAudio v0.14.6 and v0.14.7, moving Supertonic-3 CoreML TTS into the product, adding PocketTTS native voices, reducing downloads, exposing more CLI config, and improving Parakeet v3 no-mel decode arbitration. v0.14.7 is the best summary of the week’s TTS and ASR stabilization work.44
Qualcomm shipped ai-hub-models v0.54.0 with Pi0.5 / pi05 support and UI/device-list improvements for selected Qualcomm devices. The release tied model-zoo expansion to deployability and hardware visibility.39
ROCm shipped AITER v0.1.14-rc0, highlighting DSv4 fusions, production-model GSM8K validation, and Kimi-K2.5-MXFP4 unblocking with the corresponding ATOM side. The release candidate reflects ROCm’s push to make MI35x-class inference kernels production-relevant.80
try-mirai shipped lalamo v0.8.0, v0.9.0, v0.10.0, and uzu 0.4.9, with the dominant theme being model import support, compression refactors, eval fixes, server RAM cleanup, and runtime shape/cache flexibility. lalamo v0.10.0 best captures the latest user-facing package state.81
jundot shipped omlx v0.3.9rc1, focused on stronger low-memory Mac protection through phys_footprint enforcement and prefill admission control. It is a release candidate, but it directly targets the local-inference failure mode Apple Silicon users care about most.37
Osaurus shipped a dense run from 0.18.17 through 0.18.32, with themes spanning vMLX consolidation, provider compatibility, MCP/plugins, agent orchestration, document handling, app stability, and local model fixes. The latest release focused on maintenance, DeepSeek KV-cache provider updates, and binary-capable read_file.38
