Inference Radar

TL;DR

DeepSeek V4 became the week’s forcing function: vLLM, SGLang, and ai-dynamo all moved quickly on DeepSeek V4 support, showing how fast new frontier models now propagate through open serving infrastructure.1
Apple inference had a threading reckoning: MLX changed core runtime behavior and downstream projects like mlx-lm, mlx-vlm, Ollama, and vllm-mlx spent the week hardening thread-local stream handling.4
Serving engines are converging on memory as the main battleground: vLLM, TensorRT-LLM, OpenVINO, and SGLang all shipped KV-cache, paged attention, or disaggregated-serving fixes that matter more than raw token speed.10
Edge inference kept getting more serious: LiteRT, ExecuTorch, MNN, and ncnn all pushed meaningful runtime improvements for NPUs, mobile CPUs, and on-device LLM workloads.14
Compiler and runtime plumbing is being rebuilt under load: TVM, OpenXLA, ONNX Runtime, and Triton all landed foundational changes that will shape how models move across hardware over the next few quarters.17

This Week in Inference

The most important thing that happened this week is that model support is no longer a single-project event. DeepSeek V4 arrived not as a clean “launch,” but as a distributed systems exercise across the stack: vLLM integrated it into the flagship datacenter server path, SGLang built out deployment recipes and hardware-specific validation, and ai-dynamo wired it into orchestration, frontend parsing, and serving recipes.1 That matters because the open-source inference market increasingly rewards not who can run a model first, but who can operationalize it across heterogeneous fleets fastest.

The second big theme was memory discipline. Across cloud and edge, projects spent less time chasing flashy benchmark claims and more time fixing the hard parts of long-context and multimodal serving: vLLM expanded KV offload behavior, TensorRT-LLM fixed multimodal KV cache reuse in disaggregated serving, OpenVINO added new paged sequence operators, and SGLang cleaned up distributed MoE reduction paths.10 The implication is straightforward: the next competitive frontier in inference is not just faster decode, but predictable memory behavior under mixed workloads.
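To make "predictable memory behavior" concrete, here is a minimal sketch of the block-table idea behind paged KV caches. It is a toy allocator, not the implementation used by vLLM, TensorRT-LLM, OpenVINO, or SGLang; the class name, the method names, and the block size of 16 tokens are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not taken from any engine)


@dataclass
class PagedKVAllocator:
    """Toy allocator: maps each sequence id to a list of physical block ids."""

    num_blocks: int
    free: list[int] = field(default_factory=list)
    tables: dict[str, list[int]] = field(default_factory=dict)
    lengths: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every physical block starts out free.
        self.free = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False if no block is free."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # sequence is new, or its last block is full
            if not self.free:
                return False  # a real engine would preempt, offload, or queue here
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return True

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

In this sketch a 40-token sequence occupies three blocks rather than one contiguous region sized for the maximum context, and the `False` return is the point where offload or preemption policy would take over. That decision point is where much of the week's KV-cache work sits.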

The third theme was that Apple, mobile, and edge inference are maturing into first-class deployment targets rather than hobbyist side paths. MLX triggered a downstream compatibility wave, Ollama pushed hard on MLX performance and OpenAI-compatible reasoning semantics, LiteRT expanded NPU support, and ExecuTorch improved recurrent and long-context on-device LLM execution.4 The old distinction between “real serving” in the datacenter and “toy inference” on-device keeps getting weaker.

Deeper Dive

Everything below is for readers who want the full picture. Feel free to scroll.

Code Changes by Category

Cloud & Datacenter Serving

The cloud-serving layer had the busiest and most strategically important week. vLLM led with DeepSeek V4 support, but the more durable story was its work on KV offload, disaggregated serving, parser correctness, and MoE cleanup across both the main repo and vllm-gaudi.1 The project is increasingly balancing three roles at once: reference OpenAI-compatible server, frontier-model integration point, and hardware abstraction layer.

SGLang matched that intensity with a more deployment-centric posture.2 DeepSeek V4 support was surrounded by verified recipes, docker flows, hardware-specific docs, and rapid correctness fixes like the swiglu clamp patch.23 It also kept pushing on distributed MoE correctness and diffusion serving, which suggests SGLang is trying to own both the “big MoE cluster” and “next-gen multimodal pipeline” narratives.

TensorRT-LLM had one of the most consequential infrastructure weeks of any repo, even if the changes were less visible to casual users.26 AutoDeploy sharding, cache correctness, routing refactors, speculative decoding work, and Blackwell/FP4 enablement all point in the same direction: NVIDIA is building a serving stack optimized for its newest hardware while trying to reduce the operational friction of deploying very large models. The open issues around DeepSeek V4 and Qwen support show that users increasingly expect TensorRT-LLM to keep pace with vLLM and SGLang, not trail them.

ai-dynamo deserves more attention than its star count suggests.3 It spent the week integrating DeepSeek V4 across frontend parsing, recipes, and docs while also fixing health checks, planner registration, KV transfer timing, and /v1/responses compatibility. That’s exactly the kind of work that matters when inference moves from “single binary” to “fleet of cooperating services.”

Ray Serve kept building the orchestration layer around modern inference, especially for TPU and SGLang integration.27 Multi-host TPU support for Serve LLM, shutdown correctness, label-locality rollout, and the growing SGLang roadmap issues all show Ray trying to become the control plane for heterogeneous serving rather than just a Python cluster framework.

LMDeploy had a smaller but meaningful week focused on parser refactors, scheduler fairness, and runtime correctness.28 The project’s open issues around embeddings, speculative decoding constraints, and TurboMind multimodal gaps suggest it remains important in production deployments, especially where OpenAI-compatible APIs and high-throughput inference meet.

Triton Inference Server was quieter on features and louder on maintenance.29 That’s not a criticism. The work on vLLM integration tests, request validation, filesystem safety, and packaging drift is exactly what a mature serving substrate should be doing.

Local LLM Runtimes

The local runtime story was dominated by llama.cpp, which had another one of its classic “everything everywhere all at once” weeks.30 WebGPU, CUDA, Metal, SYCL, OpenCL, Hexagon, and CPU quantized paths all moved forward, while server crashes and parser issues got fixed in parallel. The project’s breadth remains unmatched: it is still the place where new hardware backends and quantization formats become real for end users.

Ollama spent the week straddling two identities: local runtime and managed product surface.22 On the runtime side, MLX got batched sampling, fused sampling improvements, and thread-safety work. On the product side, ollama launch, OpenClaw onboarding, Kimi integrations, and OpenAI Responses reasoning mapping all expanded. Ollama increasingly looks less like a wrapper around local inference and more like a full-stack inference product with local-first ergonomics.

LocalAI had a very strong operational week.31 Per-node backend installation, distributed custom OCI backend handling, UI redesign, Sherpa ONNX voice support, and broader hardware coverage across CUDA, ROCm, Intel, and MLX-VLM all landed. LocalAI’s differentiator remains its willingness to be messy in the service of supporting everything.

text-generation-webui focused on tool safety and MCP integration.32 Tool-call confirmation, stdio MCP support, Gemma 4 thinking-tag fixes, and SSRF hardening show the project evolving from “UI for local models” into a more agentic local runtime surface.

omlx had a noisy but important week around batching correctness, cache visibility, VLM support, Gemma 4 compatibility, and Metal threading fixes.33 The issue volume suggests the project is under real user load, especially for Qwen 3.6 and Apple-centric deployments.

exo continued to carve out a niche in distributed local inference.34 Sampling defaults, Kimi and DeepSeek fixes, Gemma tensor-parallel behavior, and MLX CUDA support for DGX Spark all point to a future where “local” increasingly means “small cluster of heterogeneous personal devices.”

Apple Silicon & MLX Ecosystem

This was one of the most consequential weeks yet for the Apple inference ecosystem. MLX changed core runtime behavior around threading and streams, and the downstream blast radius was immediate.4 mlx-lm moved generation streams to thread-local handling, mlx-vlm did the same, and multiple serving layers had to patch around “no stream in current thread” failures.5
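The general shape of a thread-local stream fix can be sketched with the Python standard library alone. This is an illustration of the pattern, not the actual mlx-lm or mlx-vlm change: `Stream`, `current_stream`, and `worker` are hypothetical names, and `Stream` is a stand-in object rather than an MLX type.

```python
import threading


class Stream:
    """Stand-in for a runtime stream object (hypothetical, not an MLX class)."""

    def __init__(self) -> None:
        # Remember which thread created this stream.
        self.owner = threading.get_ident()


_local = threading.local()  # each thread sees its own attributes on this object


def current_stream() -> Stream:
    """Return the calling thread's stream, creating it lazily on first use."""
    stream = getattr(_local, "stream", None)
    if stream is None:
        stream = Stream()
        _local.stream = stream
    return stream


def worker(results: dict, key: str) -> None:
    """Simulated generation call running on a server worker thread."""
    stream = current_stream()
    # Record the stream and whether it was created by the thread that uses it.
    results[key] = (stream, stream.owner == threading.get_ident())
```

The design choice is that no thread ever inherits a stream created elsewhere. A module-level stream created at import time belongs to the main thread, which is the kind of setup that can surface as a "no stream in current thread" failure once requests run on worker threads.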

That kind of cross-repo synchronization is a sign of maturity, not just of fragility. It means MLX is no longer a standalone library; it is the substrate for a real ecosystem. The same week also brought parser fixes, hybrid-cache fixes, speculative decoding work, Gemma 4 support, distributed inference docs, and broader multimodal coverage in mlx-vlm and mlx-audio.35

coremltools landed one of the week’s most important correctness fixes anywhere in the stack: a silent fp16 NaN corruption path affecting Gemma-, Llama-, and Mistral-style decoder exports.37 That’s the kind of bug that can quietly poison an entire on-device deployment pipeline, and its fix matters well beyond Apple’s own repos.

The broader Apple-adjacent ecosystem also kept moving. uzu added Python bindings and improved Metal performance, FluidAudio shipped Cohere Transcribe support on Apple platforms, and vllm-mlx hardened multimodal prefill, constrained decoding, and thread-affinity behavior.38

Mobile & Edge Frameworks

The edge layer had a quietly excellent week. LiteRT expanded Intel OpenVINO NPU support, added more AOT and packaging work, and kept improving Python and WASM paths.13 LiteRT-LM simultaneously broadened its C API and session controls, which is exactly what you want if you’re trying to make on-device LLM inference a real product surface rather than a demo.41

ExecuTorch continued its steady climb toward serious on-device serving.14 Fused recurrent ops for Qwen-style attention, top-k sampling, wider token accounting, and chunked prefill robustness all matter for real mobile deployments. Qualcomm and Arm backend work in the same repo reinforces the sense that ExecuTorch is becoming Meta’s answer to the full edge stack, not just a model export target.

MNN delivered one of the clearest mobile performance wins of the week with ARM decode-path optimizations and broader Vulkan/OpenCL portability work.15 The issue tracker, however, shows the cost of broad device reach: Android ANRs, HarmonyOS CPU usage, and model-specific regressions are all surfacing in production-like conditions.

ncnn had a smaller week but still shipped a meaningful x86 SIMD speedup for PixelShuffle.16 That’s classic ncnn: not flashy, but relentlessly useful for edge deployment.
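For readers unfamiliar with the operator, PixelShuffle rearranges channels into spatial resolution. The scalar reference below is a sketch of what the op computes, not ncnn's kernel; it assumes the PyTorch-style channel ordering, and `pixel_shuffle` is a hypothetical helper name. A SIMD implementation speeds up exactly this kind of strided copy.

```python
def pixel_shuffle(x, r):
    """Rearrange nested lists of shape [C*r*r][H][W] into [C][H*r][W*r]."""
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    h, w = len(x[0]), len(x[0][0])
    c_out = channels_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r-by-r output tile
            for j in range(r):      # column offset inside each r-by-r output tile
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for col in range(w):
                        out[c][y * r + i][col * r + j] = src[y][col]
    return out
```

With an upscale factor of 2, four 1x1 input channels become a single 2x2 output channel. The inner loops are pure data movement with a fixed stride pattern, which is why the op lends itself to vectorization.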

sherpa-onnx kept expanding the speech edge stack with a Tauri desktop app, JNI fixes, and broader Piper TTS language support.42 It remains one of the most practical open-source speech inference projects for cross-platform deployment.

Compilers, Runtimes & Graph Engines

The compiler layer had a foundational week. TVM split runtime and compiler libraries and kept migrating functionality toward tvm-ffi.17 That’s not just cleanup; it’s a structural change that should make TVM easier to embed, package, and use as a runtime-only dependency.

OpenXLA had a huge volume of activity, but the important bits were platform enablement, parser hardening, PJRT boundary cleanup, and collective/symmetric memory work.18 This is the kind of week that doesn’t generate headlines but determines whether the compiler stack can keep up with increasingly heterogeneous hardware.

ONNX Runtime shipped one of the week’s most important runtime correctness fixes with the CUDA attention NaN patch for large-head GQA models.19 Combined with CoreML, WebGPU, MLAS, and serialization fixes, it was a strong reminder that ORT remains one of the most strategically important “boring” projects in inference.

Triton kept pushing on layout generalization, sanitizer coverage, and AMD backend work.20 The addition of GenericLinearEncoding is the kind of change that will matter more in six months than it does today, because it broadens what the compiler can express without forcing brittle special cases.

TileLang also deserves mention for its Blackwell MXFP8, CUDA int4 GEMM, and AMD CDNA4/RDNA work.44 It’s still earlier-stage than Triton, but it is increasingly relevant in the low-level kernel conversation.

Models, Quantization & Optimization

The model and optimization layer was less about new model announcements and more about support races and quantization plumbing. DeepSeek V4 was the obvious headline, but Gemma 4 remained a strong cross-repo theme in Transformers, mlx-vlm, Ollama, omlx, and coremltools.37

Transformers had a strong week with OpenAI Privacy Filter support, SonicMoe, more distributed support for Qwen and Gemma, and rapid patching around flash attention and FP8 issues.49 Candle added CPU causal flash attention with varlen support, which is a meaningful step for Rust-native inference on CPUs and Apple Silicon.50

ktransformers pushed into AMX-backed MoE SFT with LoRA support and cleaned up its packaging story.51 That’s notable because it blurs the line between inference optimization and training-side adaptation for consumer hardware.

ROCm AITER and ATOM kept pushing DeepSeek, Qwen, Gemma, and Kimi-specific kernel tuning and quantization correctness.52 ROCm’s inference story is still fragmented, but the pace of model-specific optimization is real.

Other Notable Changes

Open WebUI had a classic stabilization week after a major release, with PaddleOCR-vl ingestion, Firecrawl integration changes, and lots of regression triage.54 It remains the most important UI layer in the open inference ecosystem simply because it sits on top of so many backends.

LiteLLM added proxy memory CRUD, improved Bedrock and Claude Code compatibility, and kept hardening guardrails and router behavior.55 It’s increasingly the protocol glue for teams that don’t want to marry one provider or one runtime.

FluidAudio and mobius continued building a vertically integrated on-device speech stack, which is worth watching because speech inference is following the same cloud-to-edge convergence path as text.39

Community Pulse

The hottest community pattern this week was not a benchmark war; it was support pressure around new model families and runtime correctness. DeepSeek V4 requests and bug reports showed up across vLLM, SGLang, TensorRT-LLM, ktransformers, and even local/distributed projects like exo.57 That’s the clearest sign that frontier-model support has become a community expectation, not a nice-to-have.

The second major pulse was Apple-threading fallout. Issues around stream affinity and worker-thread crashes appeared in mlx-vlm, Ollama, vllm-mlx, and omlx.7 This was one of those weeks where a low-level runtime change became visible to ordinary users because it broke real apps.

There was also a notable rise in discussions around protocol correctness and reasoning controls. Projects from LiteLLM to Ollama to vLLM are all wrestling with how to represent thinking, tool calls, and OpenAI-compatible semantics consistently.65 That’s becoming a core interoperability problem across the stack.
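One small piece of that interoperability problem is separating a model's reasoning from its final answer. The sketch below assumes reasoning is wrapped in `<think>...</think>` tags and returned as a separate field; both the tag name and the `ParsedMessage` shape are assumptions for illustration, and the projects named above each use their own parsers and field names.

```python
import re
from typing import NamedTuple


class ParsedMessage(NamedTuple):
    reasoning: str  # text found inside the thinking tags
    content: str    # user-visible answer with the tags removed


# Non-greedy match so multiple thinking blocks are captured separately.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> ParsedMessage:
    """Split complete model output into reasoning and visible content."""
    reasoning = "\n".join(part.strip() for part in _THINK.findall(text))
    content = _THINK.sub("", text).strip()
    return ParsedMessage(reasoning, content)
```

This version only handles complete, well-formed output. Streaming responses, unclosed tags, and model families with different delimiters are where consistent handling gets harder.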

Worth Watching

First, watch whether DeepSeek V4 support consolidates around a few serving engines or fragments by hardware. Right now vLLM, SGLang, and TensorRT-LLM are all still working through hardware-specific boundaries.67 The next phase will be less about “supported” and more about “supported well on your accelerator.”

Second, watch the Apple stack for a post-threading stabilization wave. MLX forced downstream projects to adapt quickly, and there will likely be another round of fixes as more edge cases surface in servers, VLMs, and desktop apps.4

Third, keep an eye on disaggregated serving. ai-dynamo, vLLM, TensorRT-LLM, SGLang, and Ray Serve are all touching some version of the same problem.10 That usually means the architecture is moving from experiment to default.

Fourth, the compiler layer is getting more strategic. TVM, Triton, OpenXLA, and OpenVINO all made changes that affect how models travel across hardware.17 As model architectures diversify, these plumbing layers will matter more, not less.

Major Releases

Apple MLX shipped mlx v0.31.2, a consequential core runtime release centered on multi-threaded independent computations, CUDA-side improvements, and low-level stream behavior changes that immediately affected downstream serving stacks. The most important impact wasn’t a single feature but the ecosystem-wide compatibility wave it triggered across MLX-based runtimes. 4

Apple MLX-LM followed quickly with mlx-lm v0.31.3, focused on restoring stability after the MLX runtime changes by moving generation streams to thread-local handling and fixing cache-extension behavior. This was a fast-response compatibility release rather than a feature drop, and it mattered because it stabilized one of the most widely used Apple-native LLM runtimes. 72

ai-dynamo shipped two releases, with the headline being v1.2.0-sglang-deepseek-v4-dev.1, which brought experimental DeepSeek V4 support, SGLang recipes, and KV-router fixes for the new model family. The broader theme across the week was production hardening for disaggregated serving and frontend compatibility. 73

Apache tvm-ffi released v0.1.11-rc2, reflecting the week’s broader TVM push toward cleaner runtime/compiler separation and a stronger FFI boundary. The most important change was improved callback and DLPack conversion behavior that supports the larger architectural migration underway in TVM itself. 74

EXO shipped v1.0.71, a release centered on better sampling defaults, RDMA and M5 Mac fixes, and Kimi support. The dominant theme was making distributed inference less fragile while broadening model coverage. 75

Fluid Inference had a strong release week led by FluidAudio v0.14.0 and its follow-up patch, which introduced Cohere Transcribe support on Apple platforms and then quickly improved ANE placement, diarization, and concurrency safety. The org’s releases show a clear vertical integration strategy across model conversion, runtime, and text normalization. 76

ggml / llama.cpp produced a flood of tagged builds, but the week’s most meaningful release trend was broad backend acceleration across WebGPU, CUDA, OpenCL, Metal, and SYCL, plus server-stability fixes. The important takeaway is not any single build but that llama.cpp kept shipping daily backend improvements at a pace few projects can match. 77

Google LiteRT stack shipped two releases, with the visible release activity anchored in the broader LiteRT ecosystem while code changes concentrated on NPU support, packaging, and LM APIs. The most important practical shift was that Google’s edge stack keeps getting closer to a full deployment platform rather than a collection of demos. 78

Hugging Face Transformers shipped a major release plus rapid patches, with v5.6.0 as the anchor and follow-up fixes for flash-attention and FP8 regressions. The dominant theme was fast-moving model support and immediate patch response, which is increasingly how Transformers operates when it sits at the center of so many downstream stacks. 79

k2-fsa / sherpa-onnx released v1.12.40, bundling the new Tauri VAD+ASR desktop app, JNI UTF-8 fixes, and broader Piper TTS language support. The release reflects sherpa-onnx’s steady expansion from library to full cross-platform speech toolkit. 80

MLC WebLLM shipped v0.2.83, focused on browser deployment improvements like subgroup-aware WASM distribution, integrity verification, and model-list cleanup. The release matters because browser inference is increasingly a packaging and runtime-discipline problem, not just a demo problem. 81

Microsoft ONNX Runtime shipped v1.25.0, the week’s most important ONNX-stack release, anchored by broader platform updates and a stream of correctness fixes around attention, serialization, and execution providers. What stands out is less the version bump than the project’s continued role as the default portable runtime across hardware classes. 82

Ollama shipped three releases in quick succession. The most significant was v0.21.3-rc0, which added OpenAI Responses reasoning mapping and "max" think support; the earlier releases improved OpenClaw onboarding and MLX logprobs. The broader theme was clear: Apple runtime stabilization plus product-surface expansion. 83

Open WebUI shipped a rapid three-release train ending in v0.9.2, focused on post-major-release stabilization, retrieval improvements, and document-processing expansion with PaddleOCR-vl. The cadence reflects heavy user adoption and a tight regression-response loop. 84

Qualcomm shipped releases across both ai-hub-models and ai-hub-apps, with the most important being ai-hub-models v0.51.0, which expanded the model catalog and reinforced Qualcomm’s release automation around Snapdragon-targeted deployment. The companion AI Hub Apps release pushed the CLI and PyPI distribution story forward. 85

Ray shipped ray-2.55.1, a patch release whose headline fix was SSH connectivity in the ray-llm image. The more important weekly story, though, was the code-level push around Serve LLM, TPU topology, and SGLang integration. 86

ROCm AITER shipped v0.1.12.post2 and then v0.1.13-rc1, both focused on ABI compatibility, wheel stability, and downstream smoke testing under real inference stacks like vLLM. This was a release cycle about keeping ROCm inference usable under pressure, not adding flashy features. 87

RunanywhereAI shipped two patch releases, with v0.19.13 the more important because it fixed checksum-sync bugs that broke Swift package distribution. The week was all about packaging correctness and CI health rather than runtime features. 88

TensorRT-LLM shipped v1.2.1, a release centered on KV cache corruption fixes and dependency updates. That fits the week’s broader NVIDIA theme: cache correctness and deployment reliability are now first-order concerns. 89

text-generation-webui shipped three releases in one day, with v4.6.2 as the canonical one, adding tool-call confirmation, stdio MCP support, and related fixes. The release train shows a fast maintainer loop around agentic local inference features. 90

TileLang shipped v0.1.9, summarizing a week of backend/compiler work across Blackwell, AMD, async pipelines, and GEMM infrastructure. The release matters because TileLang is becoming a more serious participant in the low-level inference compiler conversation. 91

vllm-mlx shipped v0.2.9, focused on server hardening and MCP sandboxing, but the more important weekly trend was the project’s rapid response to MLX threading fallout and multimodal scheduling issues. It is quickly becoming a serious Apple-native serving layer rather than an experiment. 92

ZeticMLangeiOS shipped 1.6.0 and 1.7.0-beta.1, moving from quantization selection controls toward multimodal support in its iOS SDK. Low activity overall, but a clear product direction for mobile inference. 93

DeepSeek V4 Sets Off a Stackwide Sprint