Inference Radar

TL;DR

DeepSeek V4 became the week’s forcing function: SGLang, vLLM, TensorRT-LLM, ROCm aiter, and LocalAI all spent the week adapting kernels, MoE routing, quantization, and serving paths around the same fast-moving model family.1
KV cache moved from optimization to architecture: vLLM, OpenVINO, ai-dynamo, and SGLang all pushed deeper into offloading, paging, disaggregation, and cache-aware routing, turning memory management into the core serving battleground.1
Apple Silicon is no longer a side quest: Ollama, Apple MLX, mlx-vlm, oMLX, and vLLM-MLX all shipped meaningful fixes or features, showing that Mac inference is now a first-class deployment target rather than a hobbyist port.8
Edge runtimes are getting serious about production constraints: LiteRT, ExecuTorch, MNN, ncnn, and sherpa-onnx focused on cache behavior, backend enablement, packaging, and real-device reliability instead of just demo throughput.13
API compatibility is becoming a competitive feature: Ollama, Open WebUI, LiteLLM, LMDeploy, and EXO all spent time on OpenAI-, Anthropic-, MCP-, or Ollama-compatible behavior, because the winning runtime increasingly looks like the one that can impersonate everyone else.8

This Week in Inference

The market-trends search layer failed this week, so there’s no responsible way to claim a verified set of external launches, funding rounds, or benchmark announcements from outside the repo corpus. But the code itself was unusually coherent. Across the open-source inference stack, the same themes appeared everywhere: DeepSeek V4 support, Gemma 4 support, Qwen-family fixes, KV-cache redesigns, speculative decoding upgrades, and a scramble to make multimodal serving behave consistently across APIs and hardware targets. That kind of cross-repo convergence is its own market signal.

The most important shift is structural. Cloud serving engines like vLLM, SGLang, TensorRT-LLM, OpenVINO, and ai-dynamo are now wrestling with the same problems local and edge runtimes face: memory admission, cache reuse, multimodal preprocessing, tool-calling semantics, and backend fragmentation.1 Meanwhile, local projects like Ollama, llama.cpp, oMLX, and LocalAI are increasingly adopting datacenter-style concerns such as scheduler behavior, disaggregated serving, speculative decoding controls, and OpenAI-compatible surface area.5

The other big pattern is that “edge” no longer means “small.” Apple Silicon, Qualcomm NPUs, Android runtimes, browser stacks, and embedded speech pipelines are all being asked to keep pace with the same model churn as the datacenter. LiteRT worked on compilation-cache invalidation and bounded JIT reuse; ExecuTorch expanded Gemma 4 export and quantization; MNN improved LinearAttention prefill and speech latency; ncnn added an LLM benchmark path; and sherpa-onnx repaired WASM and expanded multilingual TTS.13 The stack is converging not because the hardware is the same, but because the software expectations are.

Deeper Dive

Everything below is for readers who want the full picture. Feel free to scroll.

Code Changes by Category

Cloud & Datacenter Serving

The center of gravity this week was memory and model churn.

vLLM pushed hard on KV-cache architecture with multi-tier offloading, request-aware tracking, and externalized cache metadata, while also broadening speculative decoding and backend support.2 The practical implication is that vLLM is trying to become the canonical abstraction layer for memory-aware serving across CUDA, ROCm, XPU, and specialized connectors — not just the fastest OpenAI-compatible server.
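To make the multi-tier idea concrete, here is a minimal sketch of a two-tier block cache with offloading. It is not vLLM's implementation or API: the class `TwoTierKVCache`, the `fast` and `slow` tiers, and the least-recently-used policy are all assumptions chosen for illustration. Blocks that overflow the fast tier (think GPU memory) are offloaded to a slow tier (think host memory), and a hit in the slow tier promotes the block back.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy two-tier KV block cache. Hypothetical names; not vLLM's API."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        self.fast = OrderedDict()  # stands in for GPU memory
        self.slow = OrderedDict()  # stands in for host memory

    def put(self, block_id, data):
        """Insert into the fast tier, offloading LRU blocks if it overflows."""
        self.slow.pop(block_id, None)  # never hold two copies of a block
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)
        while len(self.fast) > self.fast_capacity:
            victim_id, victim = self.fast.popitem(last=False)
            self._offload(victim_id, victim)

    def _offload(self, block_id, data):
        """Move a block to the slow tier, dropping the oldest if it overflows."""
        self.slow[block_id] = data
        self.slow.move_to_end(block_id)
        while len(self.slow) > self.slow_capacity:
            self.slow.popitem(last=False)  # dropped blocks must be recomputed

    def get(self, block_id):
        """Return the block, promoting it from the slow tier on a hit there."""
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        if block_id in self.slow:
            data = self.slow.pop(block_id)
            self.put(block_id, data)
            return data
        return None  # miss in both tiers

    def tier_of(self, block_id):
        """Report which tier currently holds the block, or None."""
        if block_id in self.fast:
            return "fast"
        if block_id in self.slow:
            return "slow"
        return None
```

A real engine would track blocks per request and copy tensors asynchronously; the sketch only shows the admission, offload, and promotion bookkeeping.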

SGLang was the week’s most active serving repo.1 DeepSeek V4 touched nearly every subsystem: MoE kernels, FP4/FP8 paths, Hopper and Blackwell tuning, scheduler correctness, KV compression, and disaggregated serving. The project is increasingly optimized for operators who care about getting the newest model family into production before the rest of the ecosystem catches up.

TensorRT-LLM broadened in two directions at once: lower-level kernel work for Blackwell, FP4, paged attention, and MoE; and higher-level runtime work for agent scheduling, tool-calling, and multimodal Gemma support.3 That combination is important. NVIDIA is trying to ensure that the fastest path on its hardware is also the most feature-complete path for modern application patterns.

ai-dynamo looked increasingly like glue for the post-monolithic serving world.7 Its work on disaggregated multimodal serving, routing reliability, frontend cleanup, and Kubernetes/operator support suggests a future where the serving engine, router, cache layer, and control plane are all independently swappable. Dynamo wants to be the thing that makes that complexity survivable.

LMDeploy had a release-driven week focused on TurboMind internals, structured output, API compatibility, and security controls around remote code.20 It’s a reminder that “serving engine” now includes parser behavior, request logging policy, and adapter correctness for multiple API dialects.

OpenVINO and openvino.genai continued their quiet transformation into a serious generative serving stack.6 Paged attention, shared KV support, continuous batching, Gemma and Qwen family support, and GPU memory reductions all point in the same direction: OpenVINO is no longer just an optimization backend; it is becoming a full serving substrate for Intel-centric deployments.

Triton Inference Server had a smaller but high-signal week, focused on HTTP and gRPC memory safety.24 That’s not glamorous, but it matters. As more LLM traffic flows through generic inference gateways, request-path robustness becomes a differentiator.

Ray wasn’t the flashiest inference repo this week, but its Serve and transport changes matter.25 Better object transfer, more flexible routing APIs, and less disruptive config updates all improve the substrate that many higher-level inference systems sit on top of.

Local LLM Runtimes

The local stack is becoming more server-like.

llama.cpp had another sprawling week: MiMo support, speculative decoding refactors, WebGPU work, Adreno and Hexagon improvements, server capability exposure, and continued fallout management around Vulkan and AMD.22 The project remains the broadest hardware portability layer in open-source inference, but that breadth increasingly comes with the same regression-management burden as a cloud runtime.

Ollama spent the week on MLX stability, multimodal correctness, and release engineering, while its in-flight architectural shift toward direct llama.cpp integration remains the bigger story.8 If that migration lands cleanly, Ollama could shorten the lag between upstream model support and downstream product usability. If it lands messily, it will expose how much hidden complexity its current engine abstraction was carrying.

LocalAI had one of the busiest local-runtime weeks, adding a DeepSeek V4 Flash backend, more speculative decoding controls, realtime audio capabilities, and a long list of deployment fixes.5 The project’s ambition is increasingly obvious: be the “bring your own backend” local inference platform that can absorb llama.cpp, SGLang, ds4, Liquid Audio, and more under one API surface.

oMLX is becoming one of the more interesting Apple-native serving projects.11 Chunked prefill, stricter memory admission, overload signaling, ParoQuant support, and audio API iteration all point to a runtime that is moving from “fast local app” toward “serious local server.”
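Chunked prefill and memory admission are easy to picture with a little arithmetic. The sketch below is not oMLX code: `ModelShape`, `kv_bytes`, `admit`, and `prefill_chunks` are hypothetical names, and the sizing formula assumes a standard attention layout that stores one key and one value vector per layer, per KV head, per token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions used only for this estimate."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # 16-bit KV cache (fp16 or bf16)


def kv_bytes(shape, num_tokens):
    """Estimated KV-cache size: 2 tensors (K and V) per layer per token."""
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_element)
    return per_token * num_tokens


def admit(shape, prompt_tokens, max_new_tokens, free_bytes):
    """Strict admission: accept only if the worst-case KV footprint fits."""
    return kv_bytes(shape, prompt_tokens + max_new_tokens) <= free_bytes


def prefill_chunks(prompt_tokens, chunk_size):
    """Split a prompt into [start, end) token ranges of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, prompt_tokens))
            for start in range(0, prompt_tokens, chunk_size)]
```

With an assumed shape of 32 layers, 8 KV heads, and head dimension 128 in 16-bit precision, each token costs 131,072 bytes of KV cache, so a 4,096-token request needs 512 MiB. Rejecting the request up front, rather than failing mid-generation, is what turns the estimate into an overload signal.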

text-generation-webui had a more product-facing week with UI and Electron polish, but the llama.cpp integration fixes are the more durable signal.26 Local frontends now live or die by how quickly they track backend churn.

EXO continued to mature as a distributed local-cluster runtime, with better process safety, logging, shutdown behavior, and API compatibility.21 Mac clustering is still a niche, but EXO is one of the clearest examples of local inference inheriting datacenter concerns.

Apple Silicon & MLX Ecosystem

Apple’s ecosystem had one of its strongest collective weeks in months.

MLX itself focused on correctness: DLPack semantics, Metal hangs, distributed all-reduce races, decode-time RoPE behavior, and assorted runtime fixes.9 That’s exactly the kind of work a platform does when it is being used hard enough to expose real edge cases.

mlx-swift-lm expanded speculative decoding, ParoQuant support, tool-calling robustness, and media stability.27 The Swift stack is no longer just a wrapper around MLX; it is becoming its own opinionated application runtime for local LLMs on Apple hardware.

mlx-vlm was arguably the most aggressive Apple-side model-support repo this week, with Gemma 4 and Qwen speculative decoding, server fixes, metrics, and new model families.10 It increasingly looks like the fastest path for getting new multimodal models onto Apple Silicon with a usable server surface.

vLLM-MLX is still early, but its work on multi-model serving, Gemma 4 audio, constrained decoding, and cache correctness is strategically interesting.12 It suggests that the vLLM mental model — scheduler, API, model registry, cache semantics — is portable to Apple hardware, not just datacenter GPUs.

uzu deserves more attention than it gets.28 Sparse buffer support, allocation refactors, Metal copy-path improvements, and RoPE migration work all point to a runtime thinking seriously about long-context memory scaling on-device.

Mobile & Edge Frameworks

The edge story this week was less about flashy model demos and more about infrastructure maturity.

LiteRT worked on compilation cache metadata, invalidation, bounded JIT reuse, and backend cleanup.13 That’s the right work for a runtime that wants to be trusted in production mobile deployments, where stale compiled artifacts and unbounded cache growth are real operational problems.
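The two problems named above, stale artifacts and unbounded growth, map onto two small mechanisms: a cache key that changes whenever an input to compilation changes, and a size cap with eviction. The following is a sketch under those assumptions, not LiteRT's design; `cache_key`, `BoundedCompileCache`, and the choice of key fields are hypothetical.

```python
import hashlib
from collections import OrderedDict


def cache_key(model_bytes, backend, compiler_version):
    """Key that changes when the model, backend, or compiler version changes."""
    digest = hashlib.sha256()
    for part in (model_bytes, backend.encode(), compiler_version.encode()):
        # Length-prefix each field so adjacent fields cannot blur together.
        digest.update(len(part).to_bytes(8, "little"))
        digest.update(part)
    return digest.hexdigest()


class BoundedCompileCache:
    """LRU cache of compiled artifacts with a hard entry limit."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self.compiles = 0  # counts cache misses that triggered a compile

    def __len__(self):
        return len(self._entries)

    def get_or_compile(self, model_bytes, backend, compiler_version, compile_fn):
        key = cache_key(model_bytes, backend, compiler_version)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        # A changed input yields a new key, so a stale artifact is never reused.
        artifact = compile_fn(model_bytes)
        self.compiles += 1
        self._entries[key] = artifact
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return artifact
```

Invalidation here is implicit: nothing is deleted when the compiler version changes, but the old entry can no longer be hit and ages out under the size cap.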

LiteRT-LM expanded multimodal APIs, NPU support, cache-keying, and deployment stability across CPU, GPU, and NPU paths.29 The project is increasingly trying to make “same app, different accelerator” feel normal.

ExecuTorch had a major week for Gemma 4, quantization, device placement metadata, and backend breadth.14 The key point is not any single feature; it’s that ExecuTorch is steadily building the plumbing needed to make edge deployment less bespoke.

MNN delivered some of the clearest edge-performance wins of the week, including faster LinearAttention prefill and much lower voice-interaction latency.15 Just as important, it also fixed speculative decoding, cache correctness, QNN export, and mobile packaging issues — the unglamorous work that determines whether a fast benchmark survives contact with users.

ncnn added an LLM benchmark path and continued optimizing BF16 and Vulkan behavior.16 That’s a subtle but important signal: ncnn is extending from its CV roots into LLM-oriented edge inference without abandoning its low-level performance culture.

sherpa-onnx had a standout week in speech and WASM, adding new TTS models, buffered streaming ASR support, and repairing web deployment after ONNX Runtime-related breakage.17 It remains one of the best examples of edge inference as a full product surface, not just a model runner.

Compilers, Runtimes & Graph Engines

Compiler and graph-layer work was unusually relevant to inference this week.

Apache TVM shipped a release while investing heavily in ONNX and TFLite frontend correctness, plus Metal cooperative tensor groundwork.30 That matters because the edge and local inference story still depends on reliable import and lowering more than on any single kernel trick.

Triton shipped a major release centered on AMD backend work, layout semantics, and compiler reliability.31 The AMD emphasis is especially notable: the open-source kernel ecosystem is no longer NVIDIA-only, and Triton is adapting accordingly.

OpenXLA, TensorFlow, and JAX all had active weeks around GPU execution, sharding semantics, and platform compatibility.32 The quick revert of an NCCL optimization after downstream failures is a good reminder that the compiler stack is now tightly coupled across projects.

Luminal is still small, but its FlashInfer backend, dynamic-shape support, and CUDA memory-planning work make it one of the more interesting emerging compiler-runtime hybrids to watch.35

ONNX Runtime had a strong release week around low-bit inference, attention optimization, and platform breadth, while ONNX itself pushed schema and tooling modernization.36 Together they show the format/runtime layer responding directly to generative-model pressure.

Models, Quantization & Optimization

Quantization is now a systems problem, not a post-processing step.

ROCm aiter, ATOM, SGLang, TensorRT-LLM, and vLLM all spent time on FP4, FP8, MXFP4, W4A16, or related low-precision paths.1 The common thread is that quantization is now deeply entangled with routing, cache layout, fused kernels, and hardware-specific execution.
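For readers less familiar with the low-precision vocabulary, W4A16 means 4-bit weights with 16-bit activations. The sketch below shows one common way to produce 4-bit weights, symmetric group-wise quantization, in plain Python. It is an illustration only, not the scheme used by any of the projects above; the function names and the group size are assumptions.

```python
def quantize_group_int4(weights, group_size):
    """Symmetric per-group quantization to signed 4-bit codes in [-8, 7].

    Returns (codes, scales) with one floating-point scale per group.
    """
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a positive multiple of group_size")
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to code 7.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            codes.append(max(-8, min(7, round(w / scale))))
    return codes, scales


def dequantize_group_int4(codes, scales, group_size):
    """Recover approximate weights; activations would stay in 16-bit."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]
```

Even in this toy form the systems entanglement is visible: the per-group scales have to be stored, laid out, and fetched next to the packed codes, which is why quantization format and kernel design end up being decided together.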

oMLX adding ParoQuant, mlx-swift-lm adding ParoQuant INT4, and ExecuTorch expanding Gemma quantization all reinforce the same point: low-bit support is no longer confined to datacenter stacks.11

Work in Transformers and Diffusers was less about raw inference speed than about model correctness, modularity, and exportability.39 Those changes still matter downstream, because every serving engine inherits their assumptions.

Other Notable Changes

LiteLLM had a huge week around MCP auth, OAuth delegation, provider translation, and operational correctness.19 It’s increasingly the compatibility layer for a fragmented API world.

Open WebUI showed how much inference-adjacent software now revolves around security and permissions.18

Osaurus and FluidAudio both showed that local AI apps are becoming full-stack runtime products, not just wrappers around a model call.41

Community Pulse

The hottest community conversations clustered around three pressure points.

First, DeepSeek V4 operationalization. Across SGLang, vLLM, TensorRT-LLM, ROCm aiter, and ktransformers, users were stress-testing support on Hopper, Blackwell, MI355, Ada, and mixed CPU-memory systems.1 The pattern is familiar: model support lands, then the real work begins.

Second, Apple and MLX reliability. Ollama, MLX, mlx-vlm, and oMLX all saw active issue traffic around Gemma, Qwen, MTP, RoPE, and memory behavior.8 Apple inference is clearly mainstream enough now to generate production-style bug reports rather than novelty feedback.

Third, API compatibility and security regressions. Open WebUI, LiteLLM, EXO, and Ollama all had users reporting breakage at the seams between OpenAI, Anthropic, MCP, Ollama, and custom client expectations.8 The ecosystem still lacks a stable contract for “OpenAI-compatible” behavior, and everyone is paying that tax.
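A small example shows why these seams break. OpenAI-style chat requests carry the system prompt as a message with role `system` and treat `max_tokens` as optional, while Anthropic's Messages API takes a top-level `system` field and requires `max_tokens`. The translator below is a deliberately simplified sketch: the function name and default are hypothetical, it handles only plain-string content, and it passes the model name through unchanged, which a real gateway would not do.

```python
def openai_to_anthropic(request, default_max_tokens=1024):
    """Translate a minimal OpenAI-style chat request to Anthropic's shape.

    Simplified sketch: string content only; no tools, images, or streaming.
    """
    system_parts = []
    messages = []
    for message in request["messages"]:
        role, content = message["role"], message["content"]
        if role == "system":
            system_parts.append(content)  # hoisted to the top-level field
        elif role in ("user", "assistant"):
            messages.append({"role": role, "content": content})
        else:
            raise ValueError(f"role not handled in this sketch: {role!r}")

    translated = {
        "model": request["model"],  # assumption: passed through unchanged
        "messages": messages,
        # Required on the Anthropic side, optional on the OpenAI side.
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        translated["system"] = "\n\n".join(system_parts)
    return translated
```

Every branch the sketch refuses to handle (tool calls, multimodal content, streaming deltas) is a place where two "compatible" servers can disagree, which is roughly where the reported breakage clusters.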

Worth Watching

KV cache standardization: The discussion around backend-agnostic KV layouts in vLLM is bigger than one repo.2 If the ecosystem converges on portable cache metadata and layouts, it could reshape disaggregated serving and cache-sharing products.
Direct llama.cpp integration in Ollama: The open migration in Ollama could materially change how quickly local users get upstream model support.8
Apple-native serving stacks: oMLX, vLLM-MLX, and mlx-vlm are converging on a world where Apple hardware gets real server semantics, not just local inference demos.10
Agent-aware runtimes: TensorRT-LLM, LiteLLM, and Open WebUI all touched agentic infrastructure this week.3 Expect more serving engines to expose tool-aware scheduling and cache hints.
ROCm’s inference stack coherence: aiter, ATOM, and AMDMIGraphX are starting to look less like isolated repos and more like a layered alternative stack.4

Major Releases

vLLM shipped v0.20.2, a patch release centered on DeepSeek V4, gpt-oss, and Qwen3-VL fixes, including sparse-attention correctness and Hopper path stabilization. The broader theme was rapid stabilization around frontier-model support while the main branch kept moving on KV offload and speculative decoding. 45

vLLM Gaudi released v0.19.1, continuing Intel Gaudi support while hardening HPU-specific behavior around cleanup, attention, and decode execution. The release reflects a maintenance-first posture: keep pace with upstream vLLM without letting Gaudi-specific regressions accumulate. 46

TensorRT-LLM published v1.3.0rc14, with the dominant theme of model-family expansion and serving-path stabilization, especially for prefix caching, Qwen support, and Nemotron improvements. The most important signal wasn’t the prerelease itself so much as how much of NVIDIA’s week was spent turning new model support into production-grade runtime behavior. 47

Ollama shipped four releases this week: v0.23.2 through v0.23.4, plus the v0.30.0-rc15 prerelease. The clear focus was Apple Silicon/MLX stabilization, multimodal correctness, and preparation for the larger architectural move toward direct llama.cpp support. 48

llama.cpp shipped a rapid run of release tags through the week, reflecting its usual rolling-release cadence. The dominant themes were speculative decoding, MiMo support, WebGPU and mobile backend work, and server/API improvements, a reminder that llama.cpp remains both a local runtime and a fast-moving systems project. Latest release: 49

Apache TVM released v0.24.0, with the week’s work centered on frontend correctness, especially ONNX and TFLite import behavior, plus backend groundwork for Apple Metal and PyTorch interop. This was a compiler release aimed less at headlines than at making real model graphs import and run correctly. 50

PyTorch shipped v2.12.0, the week’s biggest framework release, while the surrounding work emphasized compiler/runtime refactoring and edge deployment through ExecuTorch. For inference readers, the release matters less as a monolith than as the base layer under many downstream serving and export stacks. 51

ONNX Runtime released v1.26.0, focused on low-bit inference, attention optimization, and broader platform support. The most important change was not one kernel but the continued expansion of ONNX Runtime as a serious generative inference engine rather than just a generic execution layer. 52

Triton shipped v3.7.0, with AMD backend work, layout semantics, and compiler reliability as the dominant themes. This was one of the more consequential compiler releases of the week because it directly affects the kernel-generation substrate used by multiple inference stacks. 53

LMDeploy released v0.13.0, centered on TurboMind refactoring, Ascend support, KV-cache quantization, and API compatibility fixes. The release shows LMDeploy leaning into production serving concerns rather than just model execution speed. 54

LocalAI shipped five releases from v4.2.0 through v4.2.4, all orbiting a major feature wave followed by rapid stabilization. The org’s focus was broad backend coverage, realtime audio, OpenAI/Ollama compatibility, and deployment reliability, with the new ds4 backend standing out as the week’s biggest addition. Latest release: 55

Open WebUI shipped v0.9.3, v0.9.4, and v0.9.5 in quick succession. The theme was clear: new agent and voice features, followed by aggressive security hardening and then fast regression response as users hit breakage in direct API, search, and image workflows. Latest release: 56

FluidAudio released v0.14.5, with a strong focus on TTS quality, CoreML behavior, and Apple-device startup pain points. The most important shift was the return of StyleTTS2 as a serious CoreML path rather than an experimental branch. 57

sherpa-onnx shipped v1.13.1 and v1.13.2, bundling WASM repairs, new TTS models, and expanded streaming ASR support. The release cadence reflects a project that is increasingly driven by real deployment feedback across web, mobile, and embedded targets. Latest release: 58

Transformers released v5.8.1, a patch release anchored by DeepSeek V4 correctness fixes and better failure signaling for continuous batching. It was a small release with outsized downstream impact because so many inference stacks inherit model behavior from Transformers. 59

BentoML shipped v1.4.39, a modest patch release focused on input handling and build reliability. Small release, but a useful reminder that serving platforms still win trust through boring correctness. 60

LiteLLM released v1.83.14-stable.patch.3, with the week’s broader work centered on MCP auth, OAuth hardening, provider translation, and operational correctness. The release itself was small, but the project’s strategic role as an API and auth compatibility layer keeps growing. 61

vLLM-MLX released v0.3.0, focused on registry-backed multi-model serving, Gemma 4 audio support, and serving correctness. It’s still early, but this is one of the clearest signs that Apple-native serving is starting to inherit cloud-runtime design patterns. 62

oobabooga text-generation-webui shipped v4.8, centered on UI refreshes, smoother chat behavior, and desktop-app polish, while also keeping llama.cpp integration current. The release underscores how local inference UX now depends on backend tracking as much as frontend design. 63

Liquid Audio released v1.2.0, adding fine-tuning support and reinforcing the org’s train-package-deploy story for audio models. That’s notable because it pushes the project beyond inference into adaptation workflows. 64

aiter shipped v0.1.13-rc5 and then v0.1.13, with the production release focused on DeepSeek, GPT-OSS, Kimi, and GLM enablement on AMD hardware plus new gfx950 kernels. This was the week’s clearest ROCm kernel-stack release. 65

uzu released 0.4.8, capping a week focused on sparse memory infrastructure and runtime refactors. The release itself is less important than the direction: long-context, memory-aware on-device inference is becoming a first-class design goal. 66

Osaurus shipped six releases in seven days, from 0.18.11 through 0.18.16, all centered on plugin architecture, MCP OAuth, runtime compatibility, and agent UX. The cadence reflects a product in tight feedback loops with users rather than a slow-moving framework. Latest release: 67

DeepSeek V4 Drags Every Runtime