
author: claude-sonnet-4-6
aliases:
title: Hardware-Aware Inference
status: active
date: 2026-05-04
type: permanent

Hardware-Aware Inference

Hardware-aware inference is the practice of matching model execution to the available hardware accelerators, and of selecting the runtime and model format that expose those accelerators to a running process.

When to Start Here

Start here when the question is about where and how a model physically runs, not about which model to use or how to write the agent loop around it. Typical entry questions:

  • "Which acceleration stack should I use for my hardware?"
  • "What runtime do I need to run local models on Windows / Mac?"
  • "How does ONNX differ from GGUF, and why does it matter for my setup?"
  • "My application runs in Python or .NET — how do I get local inference without a cloud endpoint?"

The Two Dimensions

Picking a local inference setup requires two separate decisions that interact:

Hardware target — which silicon does the work: NVIDIA GPU (CUDA), Apple Silicon (MLX / Metal), any DX12 GPU (DirectML), NPU, or CPU fallback.

Runtime + model format — the software layer between your code and the hardware. The format determines which runtimes are available:

| Format | Runtime options | Hardware surfaces |
| --- | --- | --- |
| GGUF | llama.cpp, Ollama, LM-Kit.NET | CUDA, Metal, DirectML, CPU |
| ONNX | ONNX Runtime, Foundry Local | CUDA EP, CoreML EP, DirectML EP, NPU EP |

Choosing a format first narrows the runtime options. Choosing hardware first narrows the formats that perform well on that silicon.
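To make the GGUF row concrete, here is a minimal sketch using the llama-cpp-python bindings. The model path is a placeholder, and n_gpu_layers=-1 assumes a GPU-enabled build of llama.cpp (CUDA or Metal); set it to 0 for the CPU fallback.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-7b-q4_k_m.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,  # offload all layers to the GPU backend; 0 keeps inference on CPU
    n_ctx=4096,       # context window in tokens
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize hardware-aware inference in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```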

The tables and routing guidance below combine two layers:

  • directly sourced facts from foundry-local and lit-foundry-local about the Microsoft / ONNX / NPU stack
  • vault-local synthesis about adjacent runtimes such as Ollama and llama.cpp, based on how related notes in this vault divide the local inference ecosystem

Read the runtime-selection matrix as operational guidance, not as a claim that every row comes from one upstream source.

Acceleration Backends

CUDA (NVIDIA) — industry standard for NVIDIA GPUs. Mature library ecosystem (cuDNN, TensorRT). Best absolute throughput for large models on Linux or Windows. Requires the CUDA toolkit installed on the host.

MLX (Apple) — Apple's open-source framework optimized for Apple Silicon's Unified Memory Architecture (M1–M4). CPU and GPU share the same physical memory pool, so large models can exceed GPU VRAM limits without discrete-GPU copy overhead. The preferred acceleration surface for Mac-native Python workflows.
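A minimal Mac-side sketch with the mlx-lm package, assuming an Apple Silicon machine; the mlx-community model id is only an example of a pre-converted 4-bit repository, not a recommendation:

```python
from mlx_lm import load, generate

# Downloads (or loads from cache) a 4-bit MLX conversion from the Hugging Face Hub.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory on Apple Silicon in one sentence.",
    max_tokens=64,
)
print(text)
```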

DirectML (Microsoft) — hardware-agnostic ML library for Windows and WSL that targets any DX12-capable GPU: NVIDIA, AMD, or Intel Arc, including integrated graphics. The practical choice for Windows development when CUDA hardware is unavailable or when cross-vendor GPU coverage is needed.

ONNX Runtime Execution Providers — ONNX Runtime selects the best available accelerator at runtime via pluggable "execution providers": CUDA EP, DirectML EP, CoreML EP, NPU EP. It is the inference substrate under Foundry Local and is also available as a standalone library for cross-platform deployment.
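A hedged sketch of provider selection with the onnxruntime Python package; "model.onnx" is a placeholder, and which providers actually appear depends on the installed package variant (onnxruntime, onnxruntime-gpu, or onnxruntime-directml):

```python
import onnxruntime as ort

# Providers compiled into this build, in ONNX Runtime's own priority order.
available = ort.get_available_providers()
print(available)  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']

# Request accelerators in preference order, keeping only those actually available,
# so the same code runs on CUDA, DirectML, CoreML, or plain CPU hosts.
preferred = [
    "CUDAExecutionProvider",    # NVIDIA GPUs
    "DmlExecutionProvider",     # DirectML, any DX12 GPU
    "CoreMLExecutionProvider",  # Apple CoreML / Neural Engine
    "CPUExecutionProvider",     # always-present fallback
]
session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[p for p in preferred if p in available],
)
print(session.get_providers())  # providers this session will actually use
```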

NPU targets — dedicated neural processing units on modern consumer hardware (Intel NPU on Meteor Lake, Apple Neural Engine via CoreML EP, Qualcomm Hexagon on Snapdragon). Foundry Local explicitly routes int4 ONNX models to NPUs on supported hardware, delivering lower power draw than GPU inference at modest model sizes.

Selection Strategy

| Scenario | Stack | Concrete runtime |
| --- | --- | --- |
| Mac development, large models | MLX + Metal | Ollama on Mac, or MLX Python library directly |
| Windows, any GPU vendor | DirectML | Foundry Local (ONNX), or llama.cpp DirectML backend |
| NVIDIA GPU, Linux or Windows | CUDA | Ollama (llama.cpp), ORT CUDA EP, or Foundry Local |
| .NET application, embedded model | CUDA / Metal / DirectML | lm-kit-dotnet (wraps llama.cpp) |
| Cross-platform or NPU-targeted | ONNX Runtime | Foundry Local (managed) or ORT standalone |
| CPU fallback, small models | x64 / ARM64 | Ollama, llama.cpp, or ORT CPU EP |
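The same routing logic, written out as a hypothetical helper; the function name, parameters, and returned stack labels are this note's shorthand, not identifiers from any runtime's API:

```python
import platform

def pick_stack(has_nvidia_gpu: bool = False,
               has_dx12_gpu: bool = False,
               dotnet_host: bool = False,
               target_npu: bool = False) -> str:
    """Map the selection-strategy table onto a single decision path."""
    system, machine = platform.system(), platform.machine()
    if dotnet_host:
        return "LM-Kit.NET (llama.cpp via CUDA / Metal / DirectML)"
    if system == "Darwin" and machine == "arm64":
        return "MLX + Metal (Ollama on Mac, or mlx-lm directly)"
    if target_npu:
        return "ONNX Runtime / Foundry Local (NPU EP, int4 ONNX)"
    if has_nvidia_gpu:
        return "CUDA (Ollama, ORT CUDA EP, or Foundry Local)"
    if system == "Windows" and has_dx12_gpu:
        return "DirectML (Foundry Local, or llama.cpp DirectML backend)"
    return "CPU fallback (Ollama, llama.cpp, or ORT CPU EP)"

print(pick_stack(has_nvidia_gpu=True))  # -> CUDA stack
```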

Quantization affects which hardware tier is required. In this note, the GGUF Q4 / Q8 guidance is a vault-local rule of thumb for reasoning about memory pressure and deployment fit, not a property documented by the Foundry Local source itself. By contrast, the ONNX / int4 edge-device story is directly aligned with the Foundry Local model-catalog positioning.
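A back-of-the-envelope version of that rule of thumb; the bits-per-weight figures are approximate, and real GGUF files add per-block scales and metadata, so treat the numbers as a lower bound on memory:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint only; excludes KV cache and runtime overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bpw in [("Q4_K_M (~4.8 bpw)", 4.8), ("Q8_0 (~8.5 bpw)", 8.5), ("FP16", 16.0)]:
    print(f"7B model @ {label}: ~{weight_memory_gb(7, bpw):.1f} GB")
# ~4.2 GB, ~7.4 GB, and ~14.0 GB respectively
```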

Where to Go Next


References