Compiling Models

Qualcomm® AI Hub Workbench supports compiling trained models in the following formats:

PyTorch
ONNX
AI Model Efficiency Toolkit (AIMET) quantized models
TensorFlow (via ONNX)

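Any of the above models can be compiled to the following target runtimes:

TensorFlow Lite (recently renamed LiteRT; recommended for Android developers)
ONNX (recommended for Windows developers)
Qualcomm® AI Engine Direct (QNN) context binary (SoC-specific)
Qualcomm® AI Engine Direct (QNN) DLC (hardware-agnostic)

To specify the Qualcomm® AI Engine Direct version, include the --qairt_version option. See Common Options.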
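
The options passed to submit_compile_job() are a single string of flags. The sketch below shows one way to assemble that string for the runtimes listed above. The helper names are hypothetical, the table leaves TensorFlow Lite without a flag because the TensorFlow Lite examples in this guide pass no options, the .bin extension for a context binary is taken from the folder layout shown later in this guide, and the version value "2.31" in the usage note is a placeholder rather than a recommended version.

```python
from typing import Optional

# Target runtimes named in this guide: (--target_runtime value, file extension).
# None means "pass no flag", which is how the TensorFlow Lite examples are written.
TARGET_RUNTIMES = {
    "tflite": (None, ".tflite"),
    "onnx": ("onnx", ".onnx"),
    "qnn_dlc": ("qnn_dlc", ".dlc"),
    "qnn_context_binary": ("qnn_context_binary", ".bin"),
}


def build_compile_options(runtime: str, qairt_version: Optional[str] = None) -> str:
    """Return an options string suitable for the `options` argument."""
    if runtime not in TARGET_RUNTIMES:
        raise ValueError(f"unknown runtime: {runtime!r}")
    flag_value, _extension = TARGET_RUNTIMES[runtime]
    parts = []
    if flag_value is not None:
        parts.append(f"--target_runtime {flag_value}")
    if qairt_version is not None:
        parts.append(f"--qairt_version {qairt_version}")
    return " ".join(parts)


def output_filename(stem: str, runtime: str) -> str:
    """Return a download file name with the extension used in this guide."""
    if runtime not in TARGET_RUNTIMES:
        raise ValueError(f"unknown runtime: {runtime!r}")
    return stem + TARGET_RUNTIMES[runtime][1]
```

For example, build_compile_options("qnn_dlc", "2.31") yields "--target_runtime qnn_dlc --qairt_version 2.31", which could then be passed as options= in the compile calls shown in the sections that follow.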

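Compiling PyTorch to TensorFlow Lite

To compile a PyTorch model, first use PyTorch's jit.trace method to generate a TorchScript model in memory. Once the model is traced, it can be compiled with the submit_compile_job() API.

Alternatively, AI Hub Workbench offers beta support for torch.export. Use PyTorch's export.export method to generate a torch.export model in memory; once exported, the model can be compiled with the submit_compile_job() API. A torch.export model can also be serialized with PyTorch's export.save method. When saved with a .pt2 extension, the file can likewise be compiled with the submit_compile_job() API.

TensorFlow Lite models can run on the CPU, the GPU (using the GPU delegate), or the NPU (using the QNN delegate).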

import torch
import torchvision

import qai_hub as hub

client = hub.Client()

# Using pre-trained MobileNet
torch_model = torchvision.models.mobilenet_v2(pretrained=True)
torch_model.eval()

# Trace model
input_shape: tuple[int, ...] = (1, 3, 224, 224)
example_input = torch.rand(input_shape)
pt_model = torch.jit.trace(torch_model, example_input)

# Compile model on a specific device
compile_job = client.submit_compile_job(
    pt_model,
    name="MobileNet_V2",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=input_shape),
)

# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.tflite")
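
The examples in this guide write input_specs in two forms: a bare shape, as in dict(image=(1, 3, 224, 224)), and a (shape, dtype) pair, as in {"x": ((1, 3, 224, 224), "float32")}. The sketch below normalizes both forms into (shape, dtype) pairs so they can be checked before a job is submitted. The helper name is hypothetical, and treating a missing dtype as "float32" is an assumption made for this sketch.

```python
from typing import Dict, Tuple, Union

Shape = Tuple[int, ...]
InputSpec = Union[Shape, Tuple[Shape, str]]


def normalize_input_specs(
    specs: Dict[str, InputSpec], default_dtype: str = "float32"
) -> Dict[str, Tuple[Shape, str]]:
    """Convert every entry to a (shape, dtype) pair and validate the shape."""
    normalized = {}
    for name, spec in specs.items():
        # A (shape, dtype) pair has exactly two items: a tuple and a string.
        if len(spec) == 2 and isinstance(spec[0], tuple) and isinstance(spec[1], str):
            shape, dtype = spec
        else:
            # Assumption: a bare shape means the default dtype.
            shape, dtype = spec, default_dtype
        if not all(isinstance(dim, int) and dim > 0 for dim in shape):
            raise ValueError(f"input {name!r} has an invalid shape: {shape!r}")
        normalized[name] = (tuple(shape), dtype)
    return normalized
```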

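If you already have a traced or scripted torch model saved with torch.jit.save, you can submit it directly. We will use mobilenet_v2.pt as an example. In this example, we also profile the compiled model: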

import qai_hub as hub

client = hub.Client()

# Compile a model
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=(1, 3, 224, 224)),
)

# Profile the compiled model
profile_job = client.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)

# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.tflite")

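Compiling a PyTorch Model to a QNN DLC

Qualcomm® AI Hub supports compiling and profiling PyTorch models as QNN DLCs. In this example, we use mobilenet_v2.pt and compile it to a QNN DLC (a .dlc file).

A DLC is hardware-agnostic. The Qualcomm® AI Engine Direct SDK guarantees that DLCs are compatible with newer SDK versions: a DLC compiled with one SDK version can run on later versions. For details, see Qualcomm® AI Engine Direct Options.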

import qai_hub as hub

client = hub.Client()

# Compile a model to QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_dlc",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

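The return value is an instance of CompileJob. See this example for how to profile this model on the Snapdragon® Neural Processing Unit (NPU).

Compiling a PyTorch Model to a QNN Context Binary

Qualcomm® AI Hub Workbench supports compiling a PyTorch model to a QNN context binary and then profiling it. In this example, we use mobilenet_v2.pt and compile it to a QNN context binary optimized for a specific device. Because these binaries are optimized for the target hardware, each one can only be compiled for a single device.

A context binary is an SoC-specific deployment mechanism. When a model is compiled for a device, it is expected to be deployed to that same device. The format is operating-system agnostic, so the same model can be deployed on Android, Linux, or Windows. Context binaries are designed for the NPU only.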

import qai_hub as hub

client = hub.Client()

# Compile a model to QNN context binary
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_context_binary",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

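The return value is an instance of CompileJob. See this example for how to profile this model on the Snapdragon® Neural Processing Unit (NPU).

A QNN context binary can also be embedded in an ONNX model.

Compiling to Precompiled QNN ONNX

Qualcomm® AI Hub Workbench supports compiling and profiling precompiled ONNX Runtime models. Such a model is ONNX Runtime compatible and contains a precompiled QNN binary that can run on Snapdragon devices using ONNX Runtime. For more details, see the ONNX Runtime QNN Execution Provider documentation.

Advantages of precompiled QNN ONNX:

Easy deployment: works on Android, Linux, or Windows.
Performance: on par with a QNN context binary.
Simple inference code: ONNX Runtime runs inference on the compiled model using the QNN Execution Provider.
Large models: suited to large models (>1GB) such as LLMs and Stable Diffusion.

Note that a QNN context binary is operating-system agnostic but device-specific. In addition, context binaries are designed for the NPU only.

Generating a Precompiled QNN ONNX Model

A precompiled QNN ONNX model can be generated in two steps:

First, use submit_compile_and_link_jobs() to compile the source model (PyTorch, ONNX, etc.) to a QNN context binary.
Then, use submit_compile_job() with no options to wrap the QNN context binary as a PrecompiledQnnOnnx artifact.

In this example, we assume the target device is Snapdragon® 8 Elite: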

import qai_hub as hub

# Step 1: Compile a PyTorch model to QNN context binary
_, link_job = hub.submit_compile_and_link_jobs(
    models="mobilenet_v2.pt",
    device=hub.Device("Snapdragon 8 Elite QRD"),
    input_specs={"image": (1, 3, 224, 224)},
)
assert isinstance(link_job, hub.LinkJob)

# Step 2: Get the QNN context binary from LinkJob and wrap it as PrecompiledQnnOnnx
# Note: When wrapping an ONNX wrappable model (QNN Context Binary), do not pass options
qnn_context_binary = link_job.get_target_model()
compile_job = hub.submit_compile_job(
    model=qnn_context_binary,
    device=hub.Device("Snapdragon 8 Elite QRD"),
)
assert isinstance(compile_job, hub.CompileJob)
compile_job.download_target_model("Precompiled_MobileNet_V2.onnx")

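The compiled model is a directory (with a .onnx extension, and optionally zipped) that contains an ONNX file and a QNN context binary file. If you upload your own precompiled ONNX Runtime model, it should follow this folder structure: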

<modeldir>.onnx
   ├── <model>.onnx
   └── <model>.bin

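Note that the ONNX model refers to the QNN context binary by a relative path, so take care to keep that reference valid if you rename or move the .bin file.

Compiling a PyTorch Model for ONNX Runtime

Qualcomm® AI Hub Workbench supports compiling PyTorch models for ONNX Runtime. In this example, we use mobilenet_v2.pt and compile it to an ONNX model. This model can be profiled using ONNX Runtime.

ONNX Runtime supports execution on the CPU, the GPU (using the DML Execution Provider), or the NPU (using the QNN Execution Provider):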
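
Before uploading your own precompiled model, the folder structure above can be checked locally. The following is a minimal sketch using only the standard library; the function name is hypothetical, and it checks file names only, not the relative path reference stored inside the ONNX file. The temporary directory at the end stands in for a real model directory.

```python
import tempfile
from pathlib import Path


def check_precompiled_qnn_onnx_dir(model_dir):
    """Return (onnx_file, bin_file) if the directory matches the expected layout."""
    root = Path(model_dir)
    if not root.is_dir() or root.suffix != ".onnx":
        raise ValueError(f"{root} must be a directory with a .onnx extension")
    onnx_files = sorted(p for p in root.iterdir() if p.is_file() and p.suffix == ".onnx")
    bin_files = sorted(p for p in root.iterdir() if p.is_file() and p.suffix == ".bin")
    if len(onnx_files) != 1 or len(bin_files) != 1:
        raise ValueError(f"{root} must contain exactly one .onnx file and one .bin file")
    return onnx_files[0], bin_files[0]


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "mobilenet_v2.onnx"
    demo_dir.mkdir()
    (demo_dir / "model.onnx").touch()
    (demo_dir / "model.bin").touch()
    onnx_file, bin_file = check_precompiled_qnn_onnx_dir(demo_dir)
    found_names = (onnx_file.name, bin_file.name)
```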

import qai_hub as hub

client = hub.Client()

# Compile a model to an ONNX model
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.onnx")

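ONNX Source vs. Workbench Compilation

ONNX models produced by AI Hub Workbench include hardware-compatibility fixes optimized for Qualcomm hardware. ONNX models uploaded directly may not include these fixes, so we recommend compiling your model with Qualcomm® AI Hub Workbench.

Compiling an ONNX Model to TensorFlow Lite or QNN

Qualcomm® AI Hub Workbench also supports compiling ONNX models to TensorFlow Lite or a Qualcomm® Deep Learning Container. This example uses mobilenet_v2.onnx.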

import qai_hub as hub

client = hub.Client()

# Compile a model to TensorFlow Lite
compile_job = client.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
compile_job.download_target_model("MobileNet_V2.tflite")

# Compile a model to a QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_dlc",
)
compile_job.download_target_model("MobileNet_V2.dlc")

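Note that an ONNX model may be unquantized (as in the example above) or quantized (as we saw in Quantization). If the source model is quantized, its quantization parameters are honored to produce a quantized deployable asset.

An ONNX model directory can also be used to supply external weights for an ONNX model. This directory (with a .onnx extension), which may optionally be zipped, must contain one .onnx file and one weights file with a .data extension. It should follow this folder structure: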

<modeldir>.onnx
   ├── <model>.onnx
   └── <model>.data

where <modeldir> and <model> can be any name. If your ONNX model does not follow this structure, use the following code to make it conform:

# if you have an ONNX model "file.onnx" which uses external weights,
# but does not adhere to Qualcomm AI Hub's required format, use this
# code to make it adhere

import onnx

model = onnx.load("file.onnx")
onnx.save(model, "new_file.onnx", save_as_external_data=True, location="new_file.data")

# place both "new_file.onnx" and "new_file.data" in a new directory with
# a .onnx extension, without any other files and upload that directory
# to Qualcomm AI Hub, either as is or as a .zip file

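Note that the ONNX model refers to the weights file by a relative path, so take care to keep that reference valid if you rename or move the weights file.

Compiling AIMET-Quantized Models to TensorFlow Lite or QNN

AI Model Efficiency Toolkit (AIMET) is an open-source library that provides advanced model quantization and compression techniques for trained neural network models. An AIMET QuantizationSimModel can be exported as an ONNX model (.onnx) together with an encodings file (.encodings) that holds the quantization parameters.

To use such a model, create a directory whose name contains .aimet. It should contain a .onnx model and the corresponding encodings file: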
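
The packaging step described in the code comments above (a new .onnx directory holding only the two files, optionally zipped) can be sketched with the standard library as follows. The function name is hypothetical. Files are copied under their original names so that the relative reference from the model to its weights stays valid. Placing the directory itself at the top level of the archive is an assumption of this sketch. The demonstration uses placeholder bytes instead of a real model.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def package_onnx_with_weights(onnx_path, data_path, out_dir):
    """Copy a model and its weights into a new <name>.onnx directory and zip it."""
    out = Path(out_dir)
    if out.suffix != ".onnx":
        raise ValueError("the output directory must have a .onnx extension")
    out.mkdir(parents=True, exist_ok=False)  # must be new, so no other files are present
    shutil.copy2(onnx_path, out / Path(onnx_path).name)
    shutil.copy2(data_path, out / Path(data_path).name)
    # Produces <out_dir>.zip next to the directory, with the directory as the top-level entry.
    archive = shutil.make_archive(str(out), "zip", root_dir=str(out.parent), base_dir=out.name)
    return Path(archive)


# Demonstration with placeholder files in a throwaway location.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "new_file.onnx").write_bytes(b"placeholder model")
    (src / "new_file.data").write_bytes(b"placeholder weights")
    archive_path = package_onnx_with_weights(
        src / "new_file.onnx", src / "new_file.data", Path(tmp) / "upload" / "new_model.onnx"
    )
    archive_name = archive_path.name
    with zipfile.ZipFile(archive_path) as zf:
        archive_members = sorted(n for n in zf.namelist() if not n.endswith("/"))
```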

<modeldir>.aimet
   ├── <model>.onnx
   ├── <model>.data (optional)
   └── <encodings>.encodings

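where <modeldir>, <model>, and <encodings> can be any name. <model>.data is only needed if the ONNX model has external weights.

Let's take mobilenet_v2_onnx.aimet.zip as an example. After unzipping it into the mobilenet_v2_onnx.aimet directory, we can submit compile jobs as follows: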

import qai_hub as hub

client = hub.Client()

# Compile to TensorFlow Lite
compile_job = client.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)
compile_job.download_target_model("MobileNet_V2.tflite")

# Compile to a QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_dlc --quantize_full_type int8",
)
compile_job.download_target_model("MobileNet_V2.dlc")
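
The unzip step mentioned before this example can also be scripted. The sketch below extracts an archive and confirms that the result matches the .aimet layout described earlier. The function name is hypothetical, and it assumes the archive holds the .aimet directory as its single top-level entry. The demonstration builds a small placeholder archive rather than using a real AIMET export.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_aimet_archive(zip_path, dest_dir):
    """Unzip an AIMET export and return the path of the extracted .aimet directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    candidates = [p for p in dest.iterdir() if p.is_dir() and ".aimet" in p.name]
    if len(candidates) != 1:
        raise ValueError("expected exactly one directory with .aimet in its name")
    model_dir = candidates[0]
    suffixes = [p.suffix for p in model_dir.iterdir() if p.is_file()]
    for required in (".onnx", ".encodings"):
        if suffixes.count(required) != 1:
            raise ValueError(f"{model_dir} must contain exactly one {required} file")
    return model_dir


# Demonstration with a placeholder archive.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "mobilenet_v2_onnx.aimet.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("mobilenet_v2_onnx.aimet/model.onnx", b"placeholder model")
        zf.writestr("mobilenet_v2_onnx.aimet/model.encodings", "{}")
    extracted = extract_aimet_archive(zip_path, Path(tmp) / "extracted")
    extracted_name = extracted.name
    extracted_files = sorted(p.name for p in extracted.iterdir())
```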

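Compiling and Linking Models into a Weight-Shared QNN Context Binary

Qualcomm® AI Hub Workbench can compile and link multiple models, or a single model with multiple input variants, into one weight-shared (multi-graph) QNN context binary. This is useful for consolidating multiple graphs that share the same weights and deploying them to a specific device's NPU. For each model input variant, you must specify the corresponding model, compile options, and a unique graph name, as in the example code below. The graph name serves as the key for accessing each model variant in the resulting QNN context binary. For more on weight-shared QNN context binaries, see Linking.

Supported source models: ONNX with dynamic shapes, TorchScript (.pt)

The API returns a tuple of (CompileJob, LinkJob) for each specified device.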

import torch

import qai_hub as hub

client = hub.Client()

pt_model1 = torch.jit.load("encoder.pt")
pt_model2 = torch.jit.load("decoder.pt")

input_specs1 = [
    {"x": ((1, 3, 224, 224), "float32")},
    {"x": ((1, 3, 192, 192), "float32")},
]
# Compile options are repeated to match the number of model input_specs variants
# Each input_spec can have its own compile options
compile_options1 = ["--force_channel_last_input x --quantize_io"] * 2

input_specs2 = [
    {"x": ((1, 3, 224, 224), "float32")},
    {"x": ((1, 3, 192, 192), "float32")},
    {"x": ((1, 3, 160, 160), "float32")},
]
compile_options2 = ["--qnn_options default_graph_htp_precision=FLOAT16"] * 3

# Model entries in list are repeated to match their respective number of input_specs variants
models = [pt_model1, pt_model1, pt_model2, pt_model2, pt_model2]

# models: list of models to compile (ONNX, TorchScript)
# device: target device or list of target devices for compilation and linking
# name: optional name for the compile and link job
# input_specs: list of I/O specifications for each model variant
# graph_names: list of unique graph names for each model variant
# compile_options: list of compile options for each model variant
# link_options: link options for each device

jobs = client.submit_compile_and_link_jobs(
    models,
    device=hub.Device("Samsung Galaxy S23"),
    name="encoder + decoder",
    input_specs=[*input_specs1, *input_specs2],
    graph_names=[
        "encoder_224",
        "encoder_192",
        "decoder_224",
        "decoder_192",
        "decoder_160",
    ],
    compile_options=[*compile_options1, *compile_options2],
    link_options="--qnn_options default_graph_htp_optimizations=O=3",
)
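
The parallel lists in the example above (models, input_specs, graph_names, compile_options) must stay the same length and in the same order. One way to keep them aligned is to describe each model once and expand its variants programmatically. The following is a sketch under stated assumptions: VariantGroup and flatten_variant_groups are hypothetical names, graph names are derived from the last dimension of the first input's shape, and plain file-name strings stand in for loaded models.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class VariantGroup:
    model: Any                         # a loaded model, or a path such as "encoder.pt"
    prefix: str                        # graph name prefix, e.g. "encoder"
    input_specs: List[Dict[str, Any]]  # one {name: (shape, dtype)} dict per variant
    compile_options: str = ""          # applied to every variant of this model


def flatten_variant_groups(groups: List[VariantGroup]) -> Dict[str, list]:
    """Expand each group into one entry per input variant, in matching order."""
    flat = {"models": [], "input_specs": [], "graph_names": [], "compile_options": []}
    for group in groups:
        for spec in group.input_specs:
            shape, _dtype = next(iter(spec.values()))
            flat["models"].append(group.model)
            flat["input_specs"].append(spec)
            flat["graph_names"].append(f"{group.prefix}_{shape[-1]}")
            flat["compile_options"].append(group.compile_options)
    if len(set(flat["graph_names"])) != len(flat["graph_names"]):
        raise ValueError("graph names must be unique")
    return flat


flat = flatten_variant_groups([
    VariantGroup(
        model="encoder.pt",
        prefix="encoder",
        input_specs=[{"x": ((1, 3, size, size), "float32")} for size in (224, 192)],
        compile_options="--force_channel_last_input x --quantize_io",
    ),
    VariantGroup(
        model="decoder.pt",
        prefix="decoder",
        input_specs=[{"x": ((1, 3, size, size), "float32")} for size in (224, 192, 160)],
        compile_options="--qnn_options default_graph_htp_precision=FLOAT16",
    ),
])
```

The resulting dictionary has the same keys as the keyword arguments used in the example above, so it could be unpacked into the submit call alongside device, name, and link_options.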