
Edge AI Weekly — 2026-W14 (2026-04-02 to 2026-04-08)

Tracking 117 repositories across 36 organizations in on-device and edge ML inference.
This week: 1,521 commits, 96 releases, 826 new issues, 1,114 merged PRs.


This Week in Inference — Market Context

The broader inference market split cleanly into two tracks this week: small, multimodal open models for edge deployment and very large agent-oriented systems for long-horizon cloud inference. Google’s Gemma 4 launch was the dominant open-weight event, spanning phone-class and workstation-class deployments and immediately triggering a wave of support work across local runtimes, serving stacks, and Apple Silicon tooling. At the same time, discussion around KV-cache compression, heterogeneous serving, and routing-aware infrastructure intensified, helped by fresh attention on techniques like TurboQuant and by infrastructure moves such as NVIDIA’s NVLink-focused expansion and Intel/SambaNova’s heterogeneous inference positioning (market briefing, TurboQuant coverage, Intel/SambaNova coverage).

The code changes below mirrored those trends almost perfectly. Gemma 4 support and stabilization showed up everywhere from vLLM (v0.19.0, #38826) to llama.cpp (#21418) to Ollama (v0.20.0, #15214) to MLX-LM (#1093). Meanwhile, multiple projects pushed on KV reuse, prompt caching, quantization, and backend diversity: MNN added prompt caching and RVV/MUSA support (#4330, #4182, #4331); SGLang expanded shared-prefix and speculative infrastructure (#21405, #22203); and ExecuTorch landed a new MLX delegate for Apple execution (#17803, #17828, #17829).

Highlights

  1. Gemma 4 became the week’s unifying engineering theme across the open inference stack.
    vLLM shipped full Gemma 4 support in v0.19.0 (release, #38826); Ollama launched Gemma 4 in v0.20.0 and spent the rest of the week stabilizing parser, Flash Attention, and tool-calling behavior (release, #15214, #15296); llama.cpp hardened parser/tokenizer/audio/attention support (#21418, #21343, #21513); and MLX-LM added then rapidly fixed Gemma 4 support on Apple hardware (#1093, #1105, #1112, #1114).

  2. SGLang posted the highest visible engineering velocity and broadened beyond text LLM serving into speech and resilience.
    The project logged 246 commits and 205 merged PRs, shipped v0.5.10 (release), added Gemma 4 support (#21952), introduced a new transcription adapter and Qwen3-ASR support (#22073, a5ed507a1639), and continued deep work on speculative decoding, IndexCache, Blackwell/AMD/NPU support, and failure-tolerant serving (#22203, #21405, de0cfed1590b).

  3. ExecuTorch made one of the most important edge-runtime moves of the week with a new Apple MLX delegate.
    Meta landed the delegate in three parts (#17803, #17828, #17829), while also pushing Qualcomm, Vulkan, Arm, and Cortex-M support forward (#18434, #18743, #18690). That aligns directly with the market’s shift toward heterogeneous, deployment-diverse inference.

  4. OpenVINO 2026.1 and MNN 3.5.0 showed that edge deployment breadth is now a first-class product feature.
    OpenVINO 2026.1.0 expanded GenAI coverage with Qwen3 VL and GPT-OSS 120B support while continuing NPU/GPU/runtime work (release, #34798, #35121). Alibaba’s MNN 3.5.0 added MUSA and RVV support plus prompt caching and GPU linear-attention optimization (release, #4182, #4331, #4330).

  5. Local inference UX stacks kept pace with engine changes through rapid release trains.
    LocalAI shipped v4.1.0–v4.1.3 in one week (v4.1.0, v4.1.3) with a new Kokoros backend and multiple Gemma 4 / Anthropic / login fixes (#9212, 92f99b1ec325ad5deb912362bf438c1356a895e3); text-generation-webui shipped v4.4 with MCP server support (release, b1d06dc); and Ollama pushed five releases while chasing Gemma 4 regressions (v0.20.4).

Activity Dashboard

Organization              Commits   New Issues   Merged PRs   Releases
sgl-project                   246           66          205          1
google                        229           76          116          2
vllm-project                  167          125          135          2
nvidia                        118           21           95          0
ggml                           93          121           73         58
oobabooga                      72           10            9          5
huggingface                    64           47           63          1
meta                           61            8           43          0
mudler                         58           15           45          4
openvinotoolkit               58           11           62          2
microsoft                     47           10           39          0
apache                        39            6           31          1
ollama                        30          123           23          5
blaizzy                       28           32           22          2
fluid-inference               26            3           13          2
try-mirai                     26            0           21          0
apple                         25           32           19          1
k2-fsa                        24            7           16          2
miscellaneous                 19            7           15          0
cactus-compute                19            5           11          2
rocm                          15            0           10          0
internlm                      12            8           12          1
alibaba                       10            9            6          1
exo-explore                    9           11            5          0
tencent                        8            1            7          0
triton-inference-server        5            0            2          0
mlc-ai                         3            3            2          0
argmax                         3            0            1          1
zetic-ai                       3            0            0          3
deepspeedai                    2            2            1          0
mozilla-ai                     2            3            2          0

Major Releases

vllm-project

  • vLLM v0.19.0 — 2026-04-03. Major serving release led by Gemma 4 support, plus parser, quantization, and platform updates. Release notes

google

  • LiteRT-LM v0.10.1 — 2026-04-03. Added Gemma 4 support across broader hardware targets. Release notes

sgl-project

  • sglang v0.5.10 — 2026-04-06. Enables piecewise CUDA graph by default and adds Elastic EP for partial failure tolerance. Release notes

ggml

  • llama.cpp b8710 — 2026-04-08. Latest rolling build including --save-logits callback changes. Release notes
  • llama.cpp b8682 — 2026-04-06. Added CPU Q1_0 1-bit quantization support. Release notes
  • llama.cpp b8665 — 2026-04-04. Added Gemma 4 specialized parser support. Release notes

mudler

  • LocalAI v4.1.3 — 2026-04-06. Fixes legacy API key login and Anthropic SSE/tool-call regressions. Release notes
  • LocalAI v4.1.2 — 2026-04-06. Fixes autoparser logprobs and chat delta retry behavior. Release notes
  • LocalAI v4.1.1 — 2026-04-05. Gemma 4 tokenization and login regression fixes. Release notes
  • LocalAI v4.1.0 — 2026-04-02. Major feature release for the week’s LocalAI cycle. Release notes

ollama

  • v0.20.4 — 2026-04-07. MLX M5 performance improvements and Gemma 4 Flash Attention updates. Release notes
  • v0.20.3 — 2026-04-07. Gemma 4 tool-calling improvements and app/model updates. Release notes
  • v0.20.2 — 2026-04-04. App home defaults to new chat. Release notes
  • v0.20.1 — 2026-04-03. Prerelease with benchmark and Gemma 4 parser/build fixes. Release notes
  • v0.20.0 — 2026-04-02. Major Gemma 4 release. Release notes

apple

  • mlx-lm v0.31.2 — 2026-04-07. Patch release focused on cache handling and batch generator refactoring. Release notes

huggingface

  • transformers v5.5.0 — 2026-04-02. Major release featuring Gemma 4 and other model additions. Release notes

openvinotoolkit

  • OpenVINO 2026.1.0 — 2026-04-07. Major GenAI/model coverage update including Qwen3 VL and GPT-OSS 120B support. Release notes

alibaba

  • MNN 3.5.0 — 2026-04-07. Expanded backend coverage across Vulkan/MUSA/QNN/RVV with TurboQuant and voice interaction improvements. Release notes

internlm

  • lmdeploy v0.12.3 — 2026-04-08. Adds video input support and compressed-tensors gs32 in TurboMind. Release notes

cactus-compute

  • cactus-react-native v1.12.0 — 2026-04-08. Adds prompt prefill, reasoning output, speaker diarization, and speaker embeddings. Release notes

oobabooga

  • text-generation-webui v4.4 — 2026-04-07. Adds MCP server support in the UI. Release notes
  • text-generation-webui v4.3.3 — 2026-04-04. Gemma 4 and ik_llama.cpp support. Release notes

zetic-ai

  • ZeticMLangeiOS v1.5.14 — 2026-04-07. Fixes large local-cache model loading. Release notes
  • ZeticMLangeiOS v1.5.13 — 2026-04-06. Improves Gemma 4 support and fixes some LLM runtime failures. Release notes
  • ZeticMLangeiOS v1.5.12 — 2026-04-06. Adds Lite tier enum and Gemma 4 support. Release notes

Code Changes by Category

LLM Inference Engines

vLLM had one of the week’s most important releases with v0.19.0, bringing full Gemma 4 support including MoE, multimodal, reasoning, and tool use (release, #38826). The follow-up work is what mattered operationally: parser and streaming fixes for reasoning/tool-call corruption landed immediately after release (13151a4df43d, d734445fcd79, 8477fe427d17). vLLM also kept pushing low-precision and scheduler work, including KV-cache per-token-head INT8/FP8 quantization (#38378), fused FP8/NVFP4 output quantization in MLA attention (#36205), and DeepSeek-V3.2 decode scheduling improvements (b55d830ec782).
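
The per-token-head KV quantization idea in #38378 reduces, at its core, to computing one scale per (token, head) vector so that outlier tokens don't degrade the whole cache. A minimal sketch of symmetric INT8 round-trip quantization for one such vector, assuming nothing about vLLM's actual kernels or memory layout:

```python
def quantize_int8(vec):
    """Symmetric INT8 quantization with one scale per vector.

    In a per-token-head scheme, `vec` would be a single head's K or V
    slice for a single token, so each (token, head) pair gets its own scale.
    """
    absmax = max(abs(x) for x in vec) or 1.0   # avoid divide-by-zero on all-zero vectors
    scale = absmax / 127.0                     # map [-absmax, absmax] onto [-127, 127]
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values; error is bounded by scale / 2 per element."""
    return [v * scale for v in q]
```

The rounding error per element is at most half a quantization step, which is why per-vector (rather than per-tensor) scales matter: a single large activation only inflates the step size for its own token and head.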

SGLang was the highest-throughput repo in the dataset and looked increasingly like a full-spectrum serving platform rather than just an LLM server. It added Gemma 4 support (#21952), broadened ASR with a transcription adapter and Qwen3-ASR integration (#22073, a5ed507a1639), and strengthened serving APIs with SequenceClassification support and transformer backend upgrades (712c8c50512e, #19163). Its speculative and cache infrastructure also advanced through hybrid linear attention fixes and broader IndexCache coverage (c89afaea7cbd, #21405).
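
The speculative infrastructure mentioned above rests on a simple core loop, even though production implementations are not simple: a cheap draft model proposes several tokens, and the target model verifies them in one batched pass, accepting the longest matching prefix. A greedy-verification sketch (illustrative only, not SGLang's implementation):

```python
def verify_draft(draft_tokens, target_greedy):
    """Greedy speculative decoding verification.

    Accept draft tokens while they match the target model's own greedy
    choice at the same position; on the first mismatch, emit the target
    model's token as the correction and stop.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_greedy):
        if d == t:
            accepted.append(d)       # draft token verified, "free" decode step
        else:
            accepted.append(t)       # correction token from the target model
            break
    return accepted
```

The payoff is that every accepted draft token is a decode step the target model did not have to run autoregressively; the verification pass itself is a single parallel forward over all proposed positions.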

llama.cpp spent the week doing what it often does after a major model launch: absorbing the ecosystem’s bug reports and turning them into backend and parser fixes. Gemma 4 support was hardened through parser support (#21418), tokenizer fixes (#21343), vision+MoE support (#21309), and KV-cache attention rotation updates (#21513). On the performance side, Q1_0 1-bit quantization landed for CPU (#21273) and then Apple GPUs via Metal (dcdcbad42a38a4420384faad714de78ffc9f3ef3), a notable response to the market’s renewed interest in extreme quantization.
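
Extreme 1-bit formats generally store one sign bit per weight plus a shared per-block scale. A minimal sketch of that classic scheme (the actual Q1_0 bit layout and block size in llama.cpp are not reproduced here; this shows only the core idea):

```python
def quantize_1bit(block):
    """Sign + per-block scale, the classic 1-bit weight scheme.

    Using the mean absolute value as the scale minimizes the L1
    reconstruction error for a fixed sign pattern.
    """
    scale = sum(abs(x) for x in block) / len(block)
    bits = [1 if x >= 0 else 0 for x in block]   # one bit per weight
    return bits, scale

def dequantize_1bit(bits, scale):
    """Every weight reconstructs to +scale or -scale."""
    return [scale if b else -scale for b in bits]
```

At 1 bit per weight plus one scale per block, storage drops roughly 16x versus FP16, which is what makes "extreme quantization" interesting for CPU and Metal targets where memory bandwidth dominates.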

Ollama had the most visible “ship fast, stabilize faster” week. v0.20.0 introduced Gemma 4 (release, #15214), then the team iterated on Flash Attention enablement and rollback logic across older CUDA hardware (#15296, #15378, #15403). Parser and tool-calling reliability improved through quoted-string parsing fixes, extra closing-tag suppression, and tool-call repair logic (49d5fd5a3e1a4b4ffc5c232621e98f2dd450fb99, #15254, #15374). The Apple angle also mattered: Ollama improved M5 performance with NAX and switched MLX to the default HTTP client (8968740836d30dc2e96671d829c370b1d6fcd6b6, #15405).
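
Tool-call repair of this general kind can be as simple as a post-processing pass over the streamed text. A toy sketch of suppressing duplicated closing tags (the tag name and strategy here are assumptions for illustration, not Ollama's actual parser):

```python
def suppress_extra_closing_tags(text: str, tag: str = "</tool_call>") -> str:
    """Keep the first closing tag, drop any duplicates the model emitted after it.

    Some models repeat a closing tag at the end of a tool call; downstream
    JSON parsers then see trailing garbage. This pass keeps everything up to
    and including the first occurrence, and strips repeats from the remainder.
    """
    first = text.find(tag)
    if first == -1:
        return text                        # no tool call present, nothing to repair
    head = text[: first + len(tag)]
    tail = text[first + len(tag):].replace(tag, "")
    return head + tail
```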

LocalAI tracked upstream engine churn closely while adding its own product-layer features. The biggest addition was a new Kokoros backend (#9212), alongside a much richer model config editor (#9149). The rapid v4.1.0–v4.1.3 release train focused on login regressions, Anthropic SSE/tool-call handling, and Gemma 4 fallout (v4.1.0, v4.1.3, 92f99b1ec325ad5deb912362bf438c1356a895e3, 0f9d516a6c2cb59235c594c59d5f0980c1b1fadf). It also kept updating its vendored llama.cpp revisions (#9269).

lmdeploy continued to strengthen its production-serving profile. The most notable work was broader Ascend support and prefix-cache fixes (#4485, #4448), plus Qwen3.5 multi-token prediction support (#4437). Low-level correctness patches covered AWQ compatibility, paged-attention pointer safety, and GLM-4.7-Flash correctness (#4503, #4494, #4500), while v0.12.3 packaged the week’s work (release).

TensorRT-LLM was NVIDIA’s main engine story this week. AutoDeploy gained a Triton MLA kernel path and Gemma 4 support (#12664, #12710); Qwen-family work included decode-kernel optimization, mixed-precision fixes, and LoRA support repairs (#12740, #12609, #12785). The more strategic signal was disaggregated serving hardening: race avoidance, parameter propagation, and device-selection fixes all landed (#12466, #12513, #12619).

llamafile had a smaller but meaningful week, focused on Windows CUDA build support (8a629f4f656831933b7138b3d17b3ea8f4270840, #924) and a block-size correctness fix (e7e6796f9f4aa491550317e257f3beeea5613df2, #935). That fits the week’s broader theme of widening local deployment surfaces rather than only chasing raw throughput.

Apple Silicon & MLX Ecosystem

Apple’s MLX stack had a strong week, especially around model support and backend maturity. In mlx-lm, Gemma 4 support landed quickly (4469ad464702a66ad49d5aade1c85fa2d42c8251, #1093) and was then hardened through tool parser fixes, quantized projection loading fixes, and think/tool boundary handling (#1105, #1112, #1114). Speculative decoding corruption was also fixed (#1109), which matters as Apple-side inference gets more ambitious.

In mlx itself, the emphasis was lower-level correctness and backend breadth: CUDA thread safety (#3367), quantized gather-matmul on CUDA (#3321), transformer decoder cross-attention fixes (#3382), and Metal header resolution fixes (#3332). mlx-c added graph export and GGUF support (#112, #111), while mlx-examples added WAN 2.1 support (#1409). The pattern is clear: MLX is no longer just “Apple-only toy infra”; it is becoming a broader interoperability layer with Apple as the center of gravity.

Blaizzy’s mlx-audio-swift continued the Apple speech stack buildout. It added Cohere Transcribe support (d5394bd0148a1ad925dd70fcb7d4d124ac92a20b, #129), improved local model loading via fromModelDirectory (da935116eb83b033104e6135aaa7db87320d17d4, #144), and fixed quantized-model issues for Kokoro and Parakeet (#136, #145). That’s a useful counterpoint to the LLM-heavy week elsewhere: Apple-device speech inference is also moving fast.

Argmax was quieter but still product-focused. The Swift playground was refreshed for Argmax Pro SDK 2.0.9 with real-time transcription and speaker support (df284b840edaed736d61eb76ea0ad5332deb20fe, #9, release), while OpenBench improved local WhisperKit Pro benchmarking and custom-vocabulary evaluation (#98, #97).

Mobile & Embedded

ExecuTorch was the standout mobile/embedded project this week. The new MLX delegate was the headline (#17803, #17828, #17829), but the broader story was deployment diversity: Qualcomm AI Engine Direct improvements (#18434, #18743), Vulkan reliability work (21d9c64e7495d10a428d474ca0e2bc61842a9b00, #18776), Arm backend improvements (#18671), and Cortex-M beta status (#18690). It also shipped a long list of memory-safety and runtime correctness fixes, which is exactly what you want to see as an on-device runtime matures.

MNN 3.5.0 was another important mobile/embedded release. The addition of MUSA backend support (c857fa27d7650ecfb478afa0224bfb92aa3fa027, #4182) and RISC-V Vector support (ade3d6c589deb4b92eaa2b6153e1b82c72957f40, #4331) materially broadened deployability. Prompt caching for multi-turn chat (#4330) and OpenCL/Metal linear-attention optimizations (b56a81b39be2b9303c171edcb843453e2a7c4bfd, 2d4a17b4924eb1a06491d1d6f7ccfc6e02f76a4d) also tie directly into the market’s focus on memory efficiency and edge GPU performance.
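
Prompt caching of the kind MNN added typically works by matching the longest shared token prefix between the cached conversation state and the new request, then recomputing KV entries only for the suffix. A toy sketch (function name and semantics are illustrative, not MNN's API):

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens of the new prompt can reuse cached KV state.

    In multi-turn chat the new prompt usually repeats the entire prior
    conversation verbatim, so this prefix is most of the prompt and only
    the newest turn needs a fresh prefill pass.
    """
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n
```

If a 4,000-token conversation grows by a 50-token user turn, the runtime prefills 50 tokens instead of 4,050, which is where the large multi-turn latency wins come from.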

Cactus Compute shipped a meaningful mobile-facing release in cactus-react-native v1.12.0, adding prompt prefill, reasoning output, speaker diarization, speaker embeddings, and VAD refactoring (release, #27). The quick CI follow-up to enable Git LFS after an Android build failure (3056293c0e5e38be5e770ecb37f6087ceb6874ce, #26) showed healthy release discipline.

ZeticMLangeiOS also deserves mention for a tight three-release stabilization cycle around Lite tier and Gemma 4 support: v1.5.12, v1.5.13, and v1.5.14 (release 1.5.12, release 1.5.13, release 1.5.14). The progression from feature enablement to runtime-failure fixes to large-cache loading fixes (24c5550a03169fd41b051db99150133c86e4d3ff, b15d33062949bf1ae1699af0f78e74e7d60c3da4, 321987b4112962b78068a9bb03110b3a99e2c599) is exactly how mobile SDKs tend to mature in practice.

Model Serving & Deployment

OpenVINO shipped one of the week’s most consequential deployment releases with 2026.1.0 (release). The release expanded GenAI coverage to Qwen3 VL and GPT-OSS 120B (#34798), while deeper runtime work improved NPU memory-pool behavior (#35121), unified Level Zero loading across GPU/NPU (#34928), and tuned GPU performance for grouped GEMM and Qwen3 MoE workloads (#34153, 5e09f61f2f74e87ac3c0a83dfcac9b23c90a35d2). OpenVINO continues to look like one of the most serious “broad hardware, broad model” deployment stacks in the open ecosystem.

ONNX Runtime had a strong infrastructure week even without a release. The biggest addition was model package support (#27786, fbba40a4c8996c69770d6562ae747c211f0b0b41). The CUDA Plugin EP matured quickly with arena allocators, IOBinding sync support, and Windows/CI support (4e1c42e2de754aaafc296660fed005c3d733ecb5, e688ef1ff98b496b10fdb9bf0f6a64dd24e56468, #27959). ONNX Runtime also added Qwen3.5-oriented ops like LinearAttention and CausalConvState (0fedb26c93e6c29882185715d5c2bb583a6d92b5, #27842) and continued WebGPU/browser-side work (bfec0b105a10674728e25e35b6978ff913eb36b2, f7751fed1721d504e5d033ddce321c12fc2c4877).

Triton Inference Server itself was quieter, but the work was high-signal. The most important patch fixed overflow risks when reading JSON inputs (e5aecb32335c9b6669230dcb4512dd5d680d2423, #8676). The repo also tightened QA model build behavior and ensemble concurrency test coverage (e48aa4858ea5ac1b1ec320cdd68ca7c7cd07d258, e0a6a1e62cf8302eb1ef53b46b49009b32a84a1d). This looked like classic pre-freeze hardening.

Apache TVM had a docs-heavy week, but there were still meaningful serving/runtime-adjacent changes. ONNX and TFLite frontend correctness improved through ties-to-even rounding semantics and sequence/ArgMax fixes (#19367, #19368, #19361), while WebGPU and TVM.js compatibility continued to improve (#18823, #18958). It was not a headline week for compiler architecture, but it was a useful maintenance week for deployment reliability.

Other Notable Changes

Hugging Face had a broad, ecosystem-shaping week. transformers v5.5.0 shipped with Gemma 4 and other model additions (release), then immediately moved to 5.6.0-dev0 (c38b2fb78eaedd4261a0e446f7976345cd1c7f1b). Gemma 4 export, tensor parallelism, and docs were refined (#45285, 7f6cc4b3c540a0836f0aad5921111e73205dd4e9), while per-request logits processors landed for serving workflows (#45026). In diffusers, image-generation coverage expanded with FLUX.2 small decoder and NucleusMoE-Image support (#13428, #13317).

text-generation-webui had a notable feature week with MCP server support in v4.4 (release, b1d06dcf96e2b5958ae004b8c9bbb0fc8518328b). That’s strategically interesting because it moves a local inference UI toward tool orchestration and remote capability discovery. The project also tightened security with a path traversal fix (#7462) and safer embedding-loader defaults (e18f32cba78d471dd86a924147aa3ea6638d5e97).
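
MCP (Model Context Protocol) is built on JSON-RPC 2.0: clients discover a server's tools via a tools/list request and invoke them via tools/call. A minimal sketch of the message shapes involved (the tool name and schema below are hypothetical, and the full protocol adds an initialization handshake and transport framing):

```python
import json

# Client asks the MCP server what tools it exposes.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Server replies with tool descriptors, each carrying a JSON Schema
# for its arguments so the model knows how to call it.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "search_docs",   # hypothetical tool
                "description": "Search local documentation",
                "inputSchema": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                },
            }
        ]
    },
}

wire = json.dumps(request)   # what actually goes over the transport
```

This is why MCP support in a local UI is strategically interesting: any MCP-speaking server becomes a discoverable tool source without per-integration glue code.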

Fluid Inference kept building out multilingual CoreML speech support in mobius, adding Japanese and Chinese Parakeet CTC conversions (#39, #38). That’s a small repo, but it’s a good example of the week’s broader “verticalized edge AI” trend: practical, language-specific, device-native model packaging.

Community Pulse

The loudest community signal by far was post-launch Gemma 4 stabilization. In llama.cpp, issue #21321 on <unused24> token generation became a central thread, alongside reports on missing audio support and infinite output loops (#21325, #21365, #21375). Ollama saw a similar surge with 123 new issues, including CPU fallback confusion, hangs on gemma4:31b, Vulkan output problems, and older CUDA incompatibilities (#15237, #15387, #15261, #15354). vLLM also drew heavy Gemma 4 and concurrency-related traffic, including parser failures and CUDA graph memory-access bugs (#38855, #39025, #39072).

A second strong community theme was KV-cache efficiency and long-context economics, matching the market’s TurboQuant conversation. lmdeploy users explicitly asked for KV-cache quantization and compression (#4499, #4506, #4507); ExecuTorch merged TurboQuant TQ4 KV-cache compression for Qwen 3.5 MoE (#18687); and multiple serving stacks continued to invest in prompt/prefix reuse, from MNN prompt caching (#4330) to SGLang’s IndexCache expansion (#21405).

On the contributor side, SGLang stood out for sheer breadth of visible contributor activity, while OpenVINO and TVM both showed healthy external contribution patterns in docs, frontend coverage, and architecture-specific support (#35067, #19366). Smaller repos like mlx-audio-swift also showed encouraging contributor diversity around practical deployment fixes (#144, #145).

Worth Watching

1. Gemma 4 is still in the “support landed, edge cases surfacing” phase.
The first wave of support is now present across nearly every major open inference engine, but the issue streams show that parser behavior, tool calling, multimodal/audio support, and hardware-specific kernels are still settling. Watch follow-up work in vLLM (#38855), Ollama (#15392), llama.cpp (#21421), and MLX-LM (#1123).

2. KV-cache compression is moving from research topic to roadmap item.
The market conversation around TurboQuant was echoed directly in repo issue trackers and merged PRs. Requests in lmdeploy (#4499), merged work in ExecuTorch (#18687), and prompt/prefix reuse features in MNN and SGLang suggest that memory—not weights—is becoming the next major battleground.
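
The arithmetic behind that claim is easy to check: KV-cache memory scales with layers x KV heads x head dimension x context length x bytes per element, so halving precision halves the footprint while weights stay untouched. A worked example with assumed, not model-specific, dimensions:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    """Per-sequence KV-cache footprint: K and V (factor of 2), per layer,
    per KV head, per token, at the given element width."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 8B-class config (assumed numbers): 32 layers, 8 KV heads,
# head_dim 128, 32k context.
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 2)   # ~4.3 GB per sequence
int8 = kv_cache_bytes(32, 8, 128, 32_768, 1)   # half of that
```

At those numbers a single 32k-context sequence costs about 4.3 GB of cache in FP16, which is why sub-byte KV formats and prefix reuse both attack the same bottleneck.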

3. Apple Silicon is becoming a first-class inference target across layers, not just apps.
This week’s evidence spans ExecuTorch’s MLX delegate (#17803), Ollama’s MLX/M5 work (#15345), MLX-LM Gemma 4 support (#1093), and Apple-focused speech stacks like mlx-audio-swift (#129). The next question is whether Apple-targeted runtimes can keep pace on tooling, quantization, and multimodal support.

4. Serving stacks are broadening into multimodal and speech, not just text.
SGLang added Qwen3-ASR (#22073); Fluid Inference expanded multilingual CoreML speech conversions (#38, #39); and multiple UI/runtime projects improved tool-calling and agent workflows. Expect more convergence between “LLM server,” “speech runtime,” and “agent platform.”

5. Heterogeneous backend support is no longer optional.
This week’s commits touched MUSA, RVV, Ascend, ROCm, Vulkan, WebGPU, NPU, Qualcomm, Cortex-M, and Apple MLX across different projects (#4182, #4331, #4485, #34928, #18434). That tracks the market’s move toward heterogeneous inference by design rather than by exception.


Generated by the ODLM Newsletter Pipeline | Data: git logs, GitHub API, OpenAI GPT-5.4