Literature: Azure AI Foundry Local
This documentation covers the Azure AI Foundry Local SDK and runtime for hardware-optimized, on-device inference.
Core Pillars
- Privacy: Runtime inference stays on the local device, though initial model download, setup, telemetry configuration, or licensing flows may still require network access.
- Efficiency: Automatic optimization for NPUs, GPUs (DirectML, CUDA, Metal), and CPUs.
- Compatibility: OpenAI API-compatible in common local-serving scenarios, not necessarily a perfect drop-in replacement across every endpoint or behavior (a client sketch follows this list).
- Resilience: Offline operation is a design goal after initial configuration and model download.
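The compatibility pillar in practice: a minimal sketch that points the standard OpenAI Python client at a Foundry Local endpoint. The base URL, port, and model alias below are illustrative assumptions, not documented values.

```python
# Minimal sketch: reusing the standard OpenAI client against a
# Foundry Local endpoint. URL, port, and model alias are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",  # assumed local endpoint
    api_key="not-needed-locally",         # local service ignores the key
)

response = client.chat.completions.create(
    model="phi-3.5-mini",  # hypothetical model alias
    messages=[{"role": "user", "content": "Summarize ONNX in one sentence."}],
)
print(response.choices[0].message.content)
```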
Runtimes
- Standalone: Embedded in applications.
- Shared Service: Background process exposing a local OpenAI-compatible endpoint.
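Because the shared service speaks the OpenAI wire format, the generic model-listing call can double as a liveness check. As above, the endpoint URL is an assumption; the actual port is assigned by the local service.

```python
# Liveness check sketch: if /v1/models answers, the shared service is up.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5273/v1", api_key="local")

try:
    models = client.models.list()
    print("Shared service is up; locally available models:")
    for m in models.data:
        print(" -", m.id)
except Exception as exc:
    print("Shared service unreachable:", exc)
```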
Key Model Support
Supports ONNX-optimized model variants, including:
- Phi-3.5: Efficient small language model.
- Qwen 2.5: High-performance multilingual model.
- Whisper: Local speech-to-text.
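A hedged sketch of local speech-to-text: it assumes the local service implements the OpenAI audio-transcription route and exposes a Whisper alias, neither of which is guaranteed (see Deployment Caveats below).

```python
# Speech-to-text sketch. Assumes the local service supports the
# /v1/audio/transcriptions route and a "whisper" alias (unverified).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5273/v1", api_key="local")

with open("meeting.wav", "rb") as audio_file:  # hypothetical audio file
    transcript = client.audio.transcriptions.create(
        model="whisper",  # hypothetical alias for the local Whisper model
        file=audio_file,
    )
print(transcript.text)
```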
Deployment Caveats
- API parity: OpenAI compatibility should be read as interface compatibility for common cases, not guaranteed parity for every streaming behavior, error shape, or model capability; see the fallback sketch after this list.
- Offline scope: Offline use depends on whether the required models and runtimes have already been installed locally.
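One way to code against the API-parity caveat is to treat streaming as optional: attempt a streaming call and fall back to a plain request if the local endpoint diverges. Endpoint and model alias remain illustrative assumptions.

```python
# Defensive sketch: try streaming, fall back to a non-streaming request
# if the local endpoint handles streaming or errors differently.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5273/v1", api_key="local")
messages = [{"role": "user", "content": "One sentence on NPUs."}]

try:
    stream = client.chat.completions.create(
        model="phi-3.5-mini", messages=messages, stream=True
    )
    for chunk in stream:
        # Some chunks carry no delta; guard before printing.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()
except Exception:
    # Fallback: request the full completion in one shot.
    response = client.chat.completions.create(
        model="phi-3.5-mini", messages=messages
    )
    print(response.choices[0].message.content)
```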
See Also
- foundry-local (Permanent Note)
- hardware-aware-inference
- docker-sandbox