Edge AI Weekly — 2026-W13 (2026-03-26 to 2026-04-01)

Tracking 119 repositories across 38 organizations in on-device and edge ML inference.
This week: 1675 commits, 85 releases, 730 new issues, 1535 merged PRs.


This Week in Inference — Market Context

The broader inference market kept pushing in two directions at once this week: down-market toward edge devices and up-stack toward more efficient serving architectures. The clearest external signals were a 1-bit open model release in GGUF form, fresh attention on KV-cache compression via TurboQuant, and new hardware like ASUS’s UGen300 USB AI accelerator aimed at practical local LLM/VLM deployment on commodity systems (Bonsai 8B roundup, TurboQuant coverage, ASUS UGen300). At the same time, framework-level competition intensified around structured decoding, Blackwell-class GPU optimization, and ultra-low-bit runtime support (SGLang summary, vLLM summary, TensorRT-LLM support matrix context).

The code changes below map closely to those trends. ggml/llama.cpp and adjacent projects doubled down on quantization quality, WebGPU portability, and backend breadth; sglang, TensorRT-LLM, and LocalAI all pushed harder on disaggregated or distributed serving; and ExecuTorch, LiteRT, coremltools, and speech-focused Apple stacks kept improving the practical path from model artifact to mobile or embedded deployment (llama.cpp PR #21038, sglang PR #19890, LocalAI PR #9124, ExecuTorch v1.2.0, coremltools PR #2664).

Highlights

  1. SGLang had one of the week’s biggest systems pushes, advancing disaggregated serving and KV transfer with GPU staging-buffer infrastructure and a dynamic ring allocator, while also shipping v0.5.10rc0 (PR #19890, commit 821a8a9, release v0.5.10rc0). This lines up directly with the market’s shift toward heterogeneous serving and cache-aware inference architectures.

  2. ggml/llama.cpp made high-impact quantization and backend moves, most notably activation rotation before quantization, plus CUDA/HIP Flash Attention improvements and WebGPU modernization (PR #21038, PR #20998, PR #21046). In a week dominated by 1-bit and KV-cache efficiency narratives, llama.cpp again looked like the reference edge runtime where those ideas become usable.

  3. Meta shipped ExecuTorch v1.2.0, with meaningful progress for real-time and embedded inference: Voxtral Realtime, stronger Cortex‑M support, backend expansion, and binary-size reduction (release v1.2.0). Follow-on work in streaming audio, Qualcomm, Vulkan, OpenVINO, and NXP backends reinforced ExecuTorch’s role as a serious edge deployment substrate (PR #18637, PR #18309, PR #18516).

  4. NVIDIA’s TensorRT-LLM focused heavily on MoE, speculative decoding, and disaggregated serving, culminating in v1.3.0rc10 (PR #10479, PR #12239, release v1.3.0rc10). That makes NVIDIA’s stack one of the clearest datacenter-side counterparts to the edge/runtime efficiency work happening elsewhere.

  5. Fluid Inference had a standout week in on-device speech, shipping six FluidAudio releases while adding Nemotron Speech Streaming 0.6B, Parakeet-TDT-CTC-110M support, ARPA-backed CTC decoding, and modular export work in mobius (FluidAudio v0.13.0, v0.13.1, v0.13.2, v0.13.2.5, v0.13.2.6, v0.13.4, mobius PR #36). It was one of the strongest examples this week of edge inference moving beyond text into production-grade speech pipelines.

Activity Dashboard

Organization            | Commits | New Issues | Merged PRs | Releases
google                  | 243     | 8          | 190        | 1
sgl-project             | 238     | 78         | 245        | 1
vllm-project            | 218     | 146        | 236        | 1
ggml                    | 153     | 91         | 95         | 55
openvinotoolkit         | 88      | 18         | 90         | 0
meta                    | 82      | 13         | 102        | 1
huggingface             | 78      | 41         | 76         | 1
nvidia                  | 64      | 19         | 57         | 1
miscellaneous           | 52      | 9          | 32         | 1
microsoft               | 50      | 24         | 55         | 1
mudler                  | 47      | 8          | 39         | 0
apple                   | 35      | 30         | 35         | 3
ollama                  | 29      | 60         | 28         | 2
fluid-inference         | 27      | 12         | 27         | 6
try-mirai               | 27      | 0          | 24         | 3
apache                  | 26      | 3          | 26         | 0
k2-fsa                  | 24      | 17         | 24         | 1
open-webui              | 23      | 78         | 26         | 1
exo-explore             | 17      | 9          | 17         | 1
internlm                | 16      | 3          | 16         | 0
deepspeedai             | 15      | 4          | 15         | 1
cactus-compute          | 12      | 2          | 8          | 0
alibaba                 | 11      | 9          | 5          | 0
oobabooga               | 9       | 20         | 0          | 0
tencent                 | 7       | 2          | 7          | 0
rocm                    | 7       | 1          | 7          | 0
ubiquitouslearning      | 4       | 1          | 2          | 0
runanywhereai           | 4       | 1          | 0          | 0
mozilla-ai              | 4       | 4          | 4          | 0
triton-inference-server | 4       | 2          | 4          | 1
picovoice               | 3       | 0          | 3          | 0
mlc-ai                  | 2       | 4          | 2          | 0
argmax                  | 1       | 2          | 1          | 1
tensorflow              | 1       | 1          | 1          | 0

Major Releases

ggml

  • llama.cpp b8611 (2026-04-01) — fixes RWKV thread assignment behavior in the runtime. Release notes
  • llama.cpp b8610 (2026-04-01) — fixes RVV fallback behavior when zvfh is unavailable. Release notes
  • llama.cpp b8609 (2026-04-01) — adds CUDA Flash Attention support for head dimension 512. Release notes
  • llama.cpp b8607 (2026-04-01) — switches WebGPU quantized buffers to u32 for broader compatibility. Release notes
  • llama.cpp b8606 (2026-04-01) — ports WebGPU AOT operators to JIT. Release notes

huggingface

  • transformers v5.4.0 (2026-03-27) — broad model expansion including PaddlePaddle models, Mistral 4, PI0, VidEoMT, UVDoc, SLANeXt, and Jina Embeddings v3. Release notes

meta

  • ExecuTorch v1.2.0 (2026-04-01) — adds Voxtral Realtime, strengthens Cortex‑M support, improves backend coverage, and reduces binary size. Release notes

nvidia

  • TensorRT-LLM v1.3.0rc10 (2026-03-31) — prerelease with Qwen 3.5 NVFP4 support, Nemotron-H fused all-reduce+norm, request priority in the LLM API, and log-prob behavior changes. Release notes

sgl-project

  • sglang v0.5.10rc0 (2026-03-28) — enables piecewise CUDA graph by default and highlights elastic EP for partial failure tolerance. Release notes

fluid-inference

  • FluidAudio v0.13.4 (2026-03-29) — standalone CTC head for custom vocabulary, minimal BPE tokenizer, and benchmark RTFx tracking. Release notes
  • FluidAudio v0.13.2.6 (2026-03-28) — ASR directory structure docs and benchmark/regression updates. Release notes
  • FluidAudio v0.13.2.5 (2026-03-28) — ASR refactor by model family with StreamingAsrEngine protocol. Release notes
  • FluidAudio v0.13.2 (2026-03-26) — Parakeet-TDT-CTC-110M and CTC decoder with ARPA LM support. Release notes
  • FluidAudio v0.13.1 (2026-03-26) — Nemotron Speech Streaming 0.6B support. Release notes
  • FluidAudio v0.13.0 (2026-03-26) — broader model additions and code reorg prep. Release notes

try-mirai

  • uzu-swift v0.3.0 (2026-03-30) — substantial Swift SDK update for Apple-platform client delivery. Release notes

argmax

  • One org release published this week, but version/date details were not included in the source summaries, so they cannot be stated here without risking inaccuracy.

deepspeedai

  • One org release published this week, but version/date details were not included in the source summaries.

exo-explore

  • One org release published this week, but version/date details were not included in the source summaries.

google

  • One org release published this week, but the supplied summaries only exposed linked PRs/commits, not the release tag itself.

k2-fsa

  • One org release published this week, but version/date details were not included in the source summaries.

ollama

  • Two org releases published this week, but version/date details were not included in the source summaries.

open-webui

  • One org release published this week, but version/date details were not included in the source summaries.

triton-inference-server

  • One org release published this week, but version/date details were not included in the source summaries.

Code Changes by Category

LLM Inference Engines

ggml / llama.cpp had a classic “everything at once” week: quality, kernels, browser portability, and serving fixes all moved. The most consequential change was activation rotation before quantization in PR #21038, a notable quality-oriented move in the same week the market was talking about 1-bit models and KV-cache compression. On the backend side, the project added CUDA/HIP Flash Attention support for head dimension 512 in PR #20998, fixed CUDA/HIP FA kernel selection in PR #21271, and broadened NVFP4 support across CUDA and SYCL in PR #21074 and PR #21227. WebGPU also kept maturing through quantized-buffer u32 changes in PR #21046 and AOT→JIT operator porting in PR #20728.
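
The intuition behind rotating activations before quantization is that an orthogonal rotation spreads outlier energy across channels, shrinking the dynamic range each quantization scale has to cover. A minimal numpy sketch of that effect (illustrative only, not llama.cpp's kernel; a random orthogonal matrix stands in for the fast Hadamard transforms typically used in practice):

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-row int4 quantization: round each row to integers in [-7, 7]."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -7, 7)
    return q * scale  # return the dequantized values for error measurement

rng = np.random.default_rng(0)
# Activations with a few large outlier channels -- the hard case for quantization.
x = rng.normal(size=(64, 128))
x[:, :4] *= 50.0

# Random orthogonal rotation via QR of a Gaussian matrix.
q_mat, _ = np.linalg.qr(rng.normal(size=(128, 128)))

err_plain = np.abs(quantize_int4(x) - x).mean()
# Rotate, quantize in the rotated basis, then rotate back.
err_rot = np.abs(quantize_int4(x @ q_mat) @ q_mat.T - x).mean()

print(f"mean abs error without rotation: {err_plain:.4f}")
print(f"mean abs error with rotation:    {err_rot:.4f}")  # typically much smaller
```

Because the rotation is orthogonal, QuaRot-style schemes can usually fold it into adjacent weight matrices at load time, so the quality win comes at little runtime cost.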

The serving layer also got practical fixes: WebUI auth/static asset handling in PR #21269, proxy header normalization in PR #21235, gzip removal from the WebUI bundle in PR #21073, and better tool/function-call parsing in PR #21242. Rapid releases b8606–b8611 packaged many of those changes for users (b8606, b8607, b8609, b8610, b8611).

sglang was arguably the week’s most aggressive serving-systems project. The headline was new GPU staging-buffer infrastructure with a dynamic ring allocator for heterogeneous TP KV transfer in PR #19890 and commit 821a8a9, a direct response to the industry’s growing focus on cache movement and disaggregated serving. Kernel/backend work also stayed intense: TRT-LLM sparse MLA kernel support for prefill batches in PR #21783, FlashInfer/TRT-LLM MoE deduplication in PR #21233, and flashinfer_trtllm MXFP8 GEMM integration in PR #21576. The release v0.5.10rc0 formalized a lot of this momentum (release).
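
A ring allocator suits staging buffers because transfer chunks are produced and consumed in order, making allocation and release FIFO. A toy sketch of the pattern (hypothetical names; not SGLang's actual implementation):

```python
class RingAllocator:
    """Toy FIFO ring allocator over a fixed-size staging buffer.

    Chunks are released in the order they were allocated, matching the
    producer/consumer pattern of staged KV transfers.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.head = 0      # next free offset
        self.used = 0      # bytes currently live (including wrap padding)
        self.pending = []  # (offset, size_with_pad), oldest first

    def alloc(self, size: int) -> int:
        pad = 0
        if self.head + size > self.capacity:
            pad = self.capacity - self.head  # skip the fragment at the end
        if self.used + pad + size > self.capacity:
            raise MemoryError("staging buffer full")
        offset = 0 if pad else self.head
        self.pending.append((offset, pad + size))
        self.used += pad + size
        self.head = (offset + size) % self.capacity
        return offset

    def free_oldest(self) -> None:
        _, sz = self.pending.pop(0)
        self.used -= sz

ring = RingAllocator(16)
a = ring.alloc(6)   # offset 0
b = ring.alloc(6)   # offset 6
ring.free_oldest()  # the oldest chunk (a) is consumed first
c = ring.alloc(6)   # 12..16 cannot fit 6 bytes, so the allocation wraps to offset 0
print(a, b, c)      # 0 6 0
```

The appeal over a general-purpose allocator is that there is no fragmentation bookkeeping: a head pointer, a live-byte counter, and a FIFO of outstanding chunks suffice.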

Beyond raw serving, SGLang also expanded hardware and modality coverage. It landed a full NPU test pipeline in PR #20751, updated Ascend docs in PR #21807, added MUSA FA3 attention backend support for diffusion in PR #18648, and merged native MLX execution backend support for Apple Silicon in PR #20342. That breadth makes SGLang one of the clearest examples of a project trying to span both hyperscale serving and heterogeneous deployment.

NVIDIA / TensorRT-LLM concentrated on high-end serving economics. MoE backend expansion via a new densegemm backend landed in PR #10479, Nemotron-H EPLB/load-balancing support in PR #12280, and Mamba2 speculative decoding kernel work in PR #12537. On the serving architecture side, “gen-first” disaggregated scheduling arrived in PR #12239, FORCE_CHUNK context chunking in PR #12483, and KV-aware routing in PR #12315, with reliability fixes for lost requests and sampler synchronization in PR #12348 and commit d036f74.

The release v1.3.0rc10 packaged several of those themes, including Qwen 3.5 NVFP4 support and request priority in the LLM API (release). This is exactly the kind of stack-level work you’d expect in a week where Blackwell optimization and serving efficiency were major market talking points.

mudler / LocalAI made perhaps the most visible open-source move toward operable distributed inference. The foundational distributed mode landed in PR #9124, then quickly gained node reconciliation in PR #9186, autoscaler min/max controls in commit 8862e3c, redundant model-transfer avoidance in PR #9193, and inflight accounting fixes in PR #9194. Node lifecycle controls also improved with undrain/resume support in PR #9197 and offline-node restoration in commit 3cc05af.
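
Distributed modes like this usually revolve around a reconciliation loop: diff the desired cluster state against what nodes actually report, and emit corrective actions. A toy sketch of the pattern (hypothetical names and actions; not LocalAI's implementation):

```python
def reconcile(desired, observed):
    """Toy node reconciliation: compute actions that move the observed
    cluster state toward the desired one. Both arguments map
    node_id -> loaded model name."""
    actions = []
    for node, model in desired.items():
        if node not in observed:
            actions.append(("provision", node, model))   # node missing entirely
        elif observed[node] != model:
            actions.append(("load", node, model))        # wrong model loaded
    for node in observed:
        if node not in desired:
            actions.append(("drain", node, None))        # node no longer wanted
    return actions

desired = {"n1": "llama-3.2-1b", "n2": "qwen-0.6b"}
observed = {"n1": "llama-3.2-1b", "n3": "qwen-0.6b"}
print(reconcile(desired, observed))  # provision n2, drain n3
```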

This was not just backend plumbing. LocalAI also surfaced cluster status in the UI via commit 221ff0f, improved node-page presentation in commit b4fff92, and tightened API/auth behavior in PR #9133 and PR #9189. In a market increasingly focused on split pipelines and heterogeneous serving, LocalAI’s progress stood out as a practical, operator-facing implementation.

Hugging Face / transformers concentrated its biggest inference work in serving infrastructure rather than model count. The serving stack was reorganized into proper modules in PR #44796, while continuous batching improved through PR #45057, PR #45112, commit 8213e0d, and docs in PR #44896. That’s notable because text-generation-inference itself was quiet this week, making transformers the org’s main inference-serving locus.
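
Continuous batching is the scheduling idea underneath much of this work: a finished sequence frees its batch slot immediately, instead of the whole batch waiting on the slowest request. A toy simulation of the loop (illustrative only, not the transformers serving code):

```python
from collections import deque

def continuous_batching(requests, max_batch_size):
    """Toy continuous-batching loop. Each request is (request_id, tokens_to_generate);
    returns total decode steps and the completion order."""
    waiting = deque(requests)
    running = {}   # request_id -> tokens still to generate
    steps = 0
    order = []
    while waiting or running:
        # Admit new requests into any free batch slots.
        while waiting and len(running) < max_batch_size:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step: every running sequence emits one token.
        steps += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # slot is freed immediately
                order.append(rid)
    return steps, order

steps, order = continuous_batching([("a", 2), ("b", 5), ("c", 1)], max_batch_size=2)
print(steps, order)  # 5 ['a', 'c', 'b'] -- "c" slips into the slot "a" vacated
```

With static batching the same workload would need 5 steps for the first batch and 1 more for "c"; the slot reuse is where the throughput gain comes from.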

The other major theme was post-release hardening after v5.4.0 (release). Fixes for remote-code/config resolution and LightGlue remote code execution landed in PR #45094 and PR #45169, with additional regression cleanup in PR #45122, PR #45045, and PR #45007. It was a reminder that broad model/platform releases increasingly require immediate stabilization work as downstream inference users adopt them.

Ollama, vLLM, open-webui, exo-explore, deepspeedai, internlm, oobabooga, and mlc-ai all showed meaningful aggregate activity, but the supplied summaries did not include repo-level links for most of their concrete code changes. The strongest verified signals are that Ollama shipped two releases with 29 commits and 28 merged PRs, while vLLM had one of the busiest weeks in the dataset at 218 commits, 236 merged PRs, and 146 new issues, plus one org-level release. Because the repo summaries for vllm-project were unavailable due to an API error, specific technical claims cannot safely be expanded beyond those aggregate metrics.

Apple Silicon & MLX Ecosystem

Apple had a quieter but meaningful week centered on coremltools. Conversion coverage improved with TensorFlow OnesLike support in PR #2664 and a meshgrid fix for non-1D inputs in PR #2665. These are not flashy changes, but they matter because edge deployment friction often comes from exactly these conversion edge cases rather than from the runtime itself. The week’s release count shows three Apple-org releases, though the supplied summaries only exposed linked PRs and not the release tags.

The more interesting Apple-adjacent story came from outside Apple’s org. SGLang merged native MLX execution backend support for Apple Silicon Mac in PR #20342, while Fluid Inference and try-mirai both kept investing in Swift-native delivery. That combination—Core ML conversion fixes, MLX backend adoption, and more Swift SDK packaging—suggests Apple Silicon remains one of the most strategically important local inference targets.

Fluid Inference had one of the strongest Apple-native weeks in the dataset. In mobius, the team added KittenTTS Nano CoreML conversion in PR #33 and standalone CTC head export for parakeet-tdt-ctc-110m in PR #36. In FluidAudio, they added Nemotron Speech Streaming 0.6B in PR #432, Parakeet-TDT-CTC-110M support in PR #433, ARPA-backed CTC decoding in PR #436, and a StreamingAsrEngine protocol in PR #440. The rapid release train from v0.13.0 through v0.13.4 shows a team iterating quickly on a native Apple speech stack (v0.13.0, v0.13.4).

try-mirai / uzu-swift shipped v0.3.0 via commit dc69396 and its tagged release (release notes). The source summary didn’t include a detailed changelog body, but the release commit touched enough files to suggest a substantive SDK update. For teams building Apple-native inference clients, that kind of packaging work matters as much as model support.

Mobile & Embedded

Meta / ExecuTorch was the category leader this week. The release v1.2.0 added Voxtral Realtime, improved Cortex‑M support, expanded backends, and reduced binary size (release). Real-time speech work was especially notable: decoder ring-buffer KV cache support for effectively unbounded streaming inference landed in PR #18637 and commit 043a09b, while a streaming Silero VAD runner appeared in commit 3616c3d. That makes ExecuTorch one of the clearest examples this week of edge inference moving toward real-time multimodal interaction, not just offline text generation.
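
The ring-buffer KV cache idea behind that streaming work is simple to sketch: keep a fixed window of the most recent positions and overwrite the oldest slot, so memory stays constant however long the stream runs. An illustrative numpy toy (not ExecuTorch's implementation):

```python
import numpy as np

class RingKVCache:
    """Toy ring-buffer KV cache holding a fixed window of recent positions."""

    def __init__(self, window, head_dim):
        self.window = window
        self.keys = np.zeros((window, head_dim))
        self.values = np.zeros((window, head_dim))
        self.pos = 0  # total positions seen so far

    def append(self, k, v):
        slot = self.pos % self.window  # overwrite the oldest entry
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

    def valid_length(self):
        return min(self.pos, self.window)

cache = RingKVCache(window=4, head_dim=8)
for t in range(10):  # stream 10 steps through a 4-slot cache
    cache.append(np.full(8, t), np.full(8, t))
print(cache.valid_length(), cache.keys[:, 0])  # only positions 6..9 survive
```

The one real-model subtlety this toy skips is position handling: attention must still see the surviving entries in stream order, typically via rotary embeddings applied relative to the window.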

Backend breadth also expanded materially. Vulkan gained a fused HuggingFace RoPE op in commit b93a21a and a compatibility fix for devices lacking VK_KHR_16bit_storage in PR #18653. Qualcomm AI Engine Direct improved AOT lowering time in PR #18516 and added per-channel quantization for embedding ops in PR #18433. OpenVINO backend packaging was added to Linux x86_64 wheels in PR #18309, while NXP support advanced through PR #17623, PR #17818, and PR #18102. That is unusually broad backend motion for a single week.

Google’s mobile/edge stack also had a strong week, though the supplied org summary exposed it through linked repos rather than a single named release. litert-torch migrated from ai_edge_litert to litert_converter in PR #987 and updated support to PyTorch 2.11.0 in PR #990. litert-samples added docs and output assets for an image classification app using the compiled model API in PR #95, while mediapipe-samples refreshed its Interactive Segmentation sample in PR #669. This looks like steady ecosystem consolidation: fewer naming seams, better sample coverage, and more polished conversion/deployment guidance.

UbiquitousLearning / mllm focused tightly on Qualcomm/QNN correctness. The project fixed qnn-aot block sizing for Qwen-4B in PR #661 and added Qualcomm calibration dataset compatibility checks in PR #660. Those are exactly the kinds of changes that matter when edge deployment moves from demos to actual device-specific pipelines. The closed bug report on unsupported Int16 in QNN, issue #662, reinforces that users are actively exercising these paths.

RunAnywhereAI spent the week on mobile SDK integration hardening. runanywhere-sdks fixed JNI symbol names in commit 9764faa, duplicate libc++_shared.so packaging in Flutter Android builds in commit cd703ed, a stale RAG import in commit a32907c, and document picker/version mismatch plus Flutter Android UI issues in commit de307c1. Not glamorous, but highly relevant for teams embedding inference into real apps.

Tencent / TNN, Paddle-Lite, TensorFlow Lite-adjacent tracking, ROCm, and Alibaba all had either minimal visible activity or missing repo-level detail this week. Where links were unavailable or summaries failed, it’s safest to say only that activity existed at the org level without over-interpreting it.

Model Serving & Deployment

OpenVINO had a strong backend-focused week. Attention correctness improved through SDPA fixes in PR #35021 and PR #34177, while NPU support expanded with FlashAttentionTile for GQA in PR #34929. GPU backend work included GatedDeltaNet support in PR #34481, optimized cubic interpolation in PR #34945, and Level Zero runtime unit tests in PR #34922. The project also switched to the new VM Runtime API in PR #34961, a sign of deeper runtime modernization.
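
For context on why GQA needs dedicated attention-kernel work: each KV head is shared by a group of query heads, so the kernel must map many query heads onto fewer KV heads. A reference numpy sketch of grouped-query SDPA (illustrative; real kernels fuse and tile this):

```python
import numpy as np

def gqa_sdpa(q, k, v, n_kv_heads):
    """Toy grouped-query scaled dot-product attention.
    Shapes: q (n_q_heads, seq, d); k and v (n_kv_heads, seq, d)."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads   # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group               # which shared KV head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
        weights = scores / scores.sum(axis=-1, keepdims=True)
        out[h] = weights @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))   # 8 query heads
k = rng.normal(size=(2, 5, 16))   # 2 KV heads, each shared by 4 query heads
v = rng.normal(size=(2, 5, 16))
out = gqa_sdpa(q, k, v, n_kv_heads=2)
print(out.shape)  # (8, 5, 16)
```

The sharing is exactly what shrinks the KV cache, and it is also why tiled kernels like the FlashAttentionTile work above need a correct head-to-group mapping.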

OpenVINO’s frontend and portability work also mattered. RV64 support in Snippets advanced in PR #34372, ONNX GroupNormalization v21 landed in PR #32700, aten::rot90 support arrived in PR #33177, and an ONNX converter safety fix landed in PR #34760. In a week where the market was talking about edge hardware proliferation, OpenVINO looked like a project investing in the unglamorous but essential work of making more models run correctly on more devices.

Triton Inference Server shipped one release and had a light maintenance week at the org level, but the supplied summaries did not include repo-level change links. The verified signal is continued release cadence with 4 commits, 4 merged PRs, and 2 new issues. Without source links for specific changes, it’s best treated as a steady-state serving platform week rather than a headline week.

Open WebUI also had a high-signal community week—78 new issues and 26 merged PRs—plus one release at the org level. The lack of repo-level links prevents detailed technical breakdown, but the issue volume alone suggests the project remains one of the most actively used front ends in the local inference ecosystem.

Other Notable Changes

Google deserves a second mention outside mobile because the org posted the single highest commit count in the table at 243 commits and 190 merged PRs. The linked work we do have points to a practical edge-developer focus: converter consolidation in LiteRT, sample refreshes, and MediaPipe maintenance (litert-torch PR #987, litert-torch PR #990, litert-samples PR #95, mediapipe-samples PR #669). That fits the broader market context around Gemma, AI Edge Gallery demand, and local deployment momentum.

Fluid Inference was one of the most coherent cross-repo stories of the week. FluidAudio and mobius both advanced the same speech pipeline: model support, exportability, decoding flexibility, and architecture cleanup. The issue-to-execution loop was also unusually visible, with architecture debt called out in issue #457 and then addressed through refactors like PR #440 and PR #466. For anyone tracking on-device speech, this org is gaining momentum fast.

Cactus-compute’s tracked ecosystem surfaced an interesting pattern even though most of the work happened in adjacent upstreams: Core ML export additions in mobius, Qualcomm/QNN fixes in mllm, LiteRT converter changes, and the uzu-swift release all point to a broader edge-app toolchain maturing around packaging and deployment rather than just raw model execution (mobius PR #33, mllm PR #661, litert-torch PR #987, uzu-swift v0.3.0).

Community Pulse

The loudest discussion this week was around quantization quality and cache efficiency. In llama.cpp, PR #21038 on activation rotation before quantization drew 49 comments and 114 reactions, making it one of the clearest community hotspots in the dataset. That lines up neatly with the external attention on 1-bit models and TurboQuant-style KV compression.

SGLang also had a very active issue stream around release quality and hardware compatibility. The most visible thread was issue #21696, a high-priority regression report after upgrading to 0.5.10rc0 with Qwen3.5-27B-FP8 on Blackwell SM120. Additional issues on ROCm image compatibility and node health in PD separation scenarios—issue #21774 and issue #21837—show that the project’s rapid backend expansion is being tested hard by users.

On the serving side, TensorRT-LLM users focused on disaggregated serving reliability and model onboarding. Notable threads included issue #12560 on hangs during block reuse, issue #12660 on reuse-tree eviction during context transfer, and issue #12628 on degraded FP4 inference quality on RTX 5090. These are exactly the kinds of issues you’d expect when a project is pushing hard on advanced serving architectures and low-precision execution simultaneously.

Hugging Face saw a different kind of community pulse: post-release stabilization. transformers issues on MoE router loss behavior, remote_code breakage, and config typing compatibility—issue #45120, issue #45020, issue #45070, issue #45042—show how quickly a broad release like v5.4.0 gets pressure-tested by downstream users.

A final community signal worth noting: LocalAI, Open WebUI, Ollama, and vLLM all posted high issue counts at the org level—8, 78, 60, and 146 respectively—suggesting that the most-used inference tools continue to accumulate support load quickly as adoption broadens. In vLLM’s case, the missing repo summaries mean we can’t safely identify the hottest threads, but the volume alone is notable.

Worth Watching

KV-cache optimization is moving from research topic to implementation roadmap. The external TurboQuant narrative was echoed internally by issue and PR activity across llama.cpp, OpenVINO, TensorRT-LLM, and SGLang. OpenVINO’s issue #34954 explicitly asks about KV-cache quantization with TurboQuant-style ideas, while SGLang and TensorRT-LLM are both investing in the infrastructure needed to move and reuse cache efficiently.
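
The basic shape of KV-cache quantization is easy to sketch: store cached keys and values in int8 with a scale per position, and dequantize on read. An illustrative toy (per-token scales only; schemes like TurboQuant are considerably more sophisticated):

```python
import numpy as np

def quantize_kv_int8(kv):
    """Toy per-token int8 KV quantization: one scale per cached position.
    kv shape: (positions, head_dim)."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(256, 64)).astype(np.float32)   # 256 cached positions
q, scale = quantize_kv_int8(kv)
err = np.abs(dequantize_kv(q, scale) - kv).max()
print(q.dtype, f"max reconstruction error: {err:.4f}")
```

Even this naive scheme cuts cache memory roughly 4x versus fp32 (2x versus fp16), which is why per-position or per-channel variants keep showing up across llama.cpp, OpenVINO, and the serving stacks.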

Disaggregated and distributed serving is no longer niche. SGLang’s KV-transfer work (PR #19890), TensorRT-LLM’s gen-first scheduling (PR #12239), and LocalAI’s distributed mode (PR #9124) all point in the same direction: inference stacks are being redesigned around separate prompt, cache, and decode concerns rather than a monolithic single-node server.

Apple-native inference keeps broadening from hobbyist to product surface. Between coremltools conversion fixes (PR #2664), SGLang’s MLX backend (PR #20342), FluidAudio’s rapid Swift-native speech releases (v0.13.0), and uzu-swift (v0.3.0), the Apple ecosystem looks increasingly complete across conversion, runtime, and app integration.

Speech is becoming a first-class edge workload. ExecuTorch’s Voxtral Realtime and streaming VAD work (v1.2.0, PR #18637) and Fluid Inference’s ASR/TTS pipeline expansion (PR #432, PR #436, PR #471) suggest that on-device speech is now one of the fastest-moving subcategories in edge ML.

Watch the open PRs around next-step backend capability. In llama.cpp, CPU TurboQuant KV cache types remain open in PR #21089, DeepSeek sparse attention support is in PR #21149, and Responses API/Codex CLI compatibility is in PR #21174. In SGLang, the pending Transformers 5.4.0 refactor/upgrade in PR #21569 and FP8 PCG inductor optimization in PR #21734 look especially relevant for next week.


Generated by the ODLM Newsletter Pipeline | Data: git logs, GitHub API, OpenAI GPT-5.4