Compiling Models

Qualcomm® AI Hub Workbench supports compiling trained models in the following formats:

PyTorch
ONNX
AI Model Efficiency Toolkit (AIMET) quantized models
TensorFlow (via ONNX)

Any of the above models can be compiled to the following target runtimes:

TensorFlow Lite (recently renamed LiteRT; recommended for Android developers)
ONNX (recommended for Windows developers)
Qualcomm® AI Engine Direct (QNN) context binary (SoC-specific)
Qualcomm® AI Engine Direct (QNN) DLC (hardware-agnostic)

To specify the Qualcomm® AI Engine Direct version, include --qairt_version in the compile options. See Common Options for details.
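The examples on this page pass compile options to submit_compile_job() as a single space-separated string. As a rough sketch, that string can be assembled with a small helper. Note that build_compile_options is a hypothetical name and not part of the qai_hub API, and the version value "2.31" is only a placeholder; see Common Options for the values that are actually accepted.

```python
from typing import Optional


def build_compile_options(
    target_runtime: Optional[str] = None,
    qairt_version: Optional[str] = None,
) -> str:
    """Join the flags that are set into one options string (hypothetical helper)."""
    parts = []
    if target_runtime is not None:
        parts.append(f"--target_runtime {target_runtime}")
    if qairt_version is not None:
        # "--qairt_version" is the flag named above; the value is supplied by the caller.
        parts.append(f"--qairt_version {qairt_version}")
    return " ".join(parts)


# "2.31" is a placeholder version used only for illustration.
options = build_compile_options(target_runtime="qnn_dlc", qairt_version="2.31")
print(options)
```

The resulting string would then be passed as the options argument of submit_compile_job().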

Compiling PyTorch to TensorFlow Lite

To compile a PyTorch model, first use PyTorch's torch.jit.trace method to generate a TorchScript model in memory. Once traced, the model can be compiled with the submit_compile_job() API.

Alternatively, AI Hub Workbench offers beta support for torch.export. Use PyTorch's torch.export.export method to generate a torch.export model in memory; once exported, the model can be compiled with the submit_compile_job() API. A torch.export model can also be serialized with PyTorch's torch.export.save method. When saved as a file with the .pt2 extension, that file can likewise be compiled with the submit_compile_job() API.

TensorFlow Lite models can run on the CPU, the GPU (using GPU delegation), or the NPU (using QNN delegation).

import torch
import torchvision

import qai_hub as hub

client = hub.Client()

# Using pre-trained MobileNet
torch_model = torchvision.models.mobilenet_v2(pretrained=True)
torch_model.eval()

# Trace model
input_shape: tuple[int, ...] = (1, 3, 224, 224)
example_input = torch.rand(input_shape)
pt_model = torch.jit.trace(torch_model, example_input)

# Compile model on a specific device
compile_job = client.submit_compile_job(
    pt_model,
    name="MobileNet_V2",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=input_shape),
)

# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.tflite")

If you already have a saved traced or scripted torch model (saved with torch.jit.save), you can submit it directly. We use mobilenet_v2.pt as an example. This example also profiles the compiled model:

import qai_hub as hub

client = hub.Client()

# Compile a model
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=(1, 3, 224, 224)),
)

# Profile the compiled model
profile_job = client.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)

# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.tflite")

Compiling a PyTorch model to a QNN DLC

Qualcomm® AI Hub supports compiling a PyTorch model to a QNN DLC and profiling it. In this example, we use mobilenet_v2.pt and compile it to a QNN DLC (a .dlc file).

A DLC is hardware-agnostic. The Qualcomm® AI Engine Direct SDK guarantees that DLCs are forward compatible with future SDK versions: a DLC compiled with one SDK version is guaranteed to work with later SDK versions. See Qualcomm® AI Engine Direct Options for details.

import qai_hub as hub

client = hub.Client()

# Compile a model to QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_dlc",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. See this example for how to profile this model for the Snapdragon® Neural Processing Unit (NPU).
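The forward-compatibility guarantee for DLCs described in this section amounts to a one-directional version comparison. The sketch below only illustrates that rule. Both function names are hypothetical, the version strings are placeholders, and the code assumes versions are dotted integers. A False result means the combination is not covered by the guarantee, not that loading is known to fail.

```python
def parse_version(version: str, width: int = 3) -> tuple[int, ...]:
    """Turn a dotted version such as "2.31" into a zero-padded tuple like (2, 31, 0)."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def covered_by_forward_compatibility(compiled_with: str, runtime_sdk: str) -> bool:
    """True when the runtime SDK is the same as or newer than the SDK that built the DLC."""
    return parse_version(runtime_sdk) >= parse_version(compiled_with)


# Placeholder version numbers, used only to show both directions.
print(covered_by_forward_compatibility("2.28", "2.31"))  # True
print(covered_by_forward_compatibility("2.31", "2.28"))  # False
```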

Compiling a PyTorch model to a QNN context binary

Qualcomm® AI Hub Workbench supports compiling a PyTorch model to a QNN context binary and then profiling it. In this example, we use mobilenet_v2.pt and compile it to a QNN context binary optimized to run on a specific device. Because context binaries are optimized for the targeted hardware, each one can only be compiled for a single device.

A context binary is an SoC-specific deployment mechanism. When compiled for a device, the model is expected to be deployed to that same device. The format is operating-system agnostic, so the same model can be deployed on Android, Linux, or Windows. Context binaries are designed for the NPU only.

import qai_hub as hub

client = hub.Client()

# Compile a model to QNN context binary
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_context_binary",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. See this example for how to profile this model for the Snapdragon® Neural Processing Unit (NPU).

A QNN context binary can also be embedded in an ONNX model.

Compiling precompiled QNN ONNX

Qualcomm® AI Hub Workbench supports compiling and profiling a precompiled ONNX Runtime model. This is an ONNX Runtime-compatible model that contains a precompiled QNN binary and can be run with ONNX Runtime on Snapdragon devices. See the ONNX Runtime QNN Execution Provider documentation for details.

Benefits of using precompiled QNN ONNX:

Ease of deployment: works on Android, Linux, or Windows.
Performance: on par with a QNN context binary.
Simple inference code: ONNX Runtime runs inference on the compiled model using the QNN Execution Provider.
Large models: supports large models (>1GB) such as LLMs and Stable Diffusion.

Note that a QNN context binary is OS-agnostic but device-specific. Context binaries are also designed for the NPU only.

Generating a precompiled QNN ONNX model

A precompiled QNN ONNX model can be generated in two steps.

First, use submit_compile_and_link_jobs() to compile the source model (PyTorch, ONNX, etc.) to a QNN context binary.
Then, use submit_compile_job() without specifying any options to wrap the QNN context binary as a PrecompiledQnnOnnx artifact.

This example assumes the target is Snapdragon® 8 Elite.

import qai_hub as hub

# Step 1: Compile a PyTorch model to QNN context binary
_, link_job = hub.submit_compile_and_link_jobs(
    models="mobilenet_v2.pt",
    device=hub.Device("Snapdragon 8 Elite QRD"),
    input_specs={"image": (1, 3, 224, 224)},
)
assert isinstance(link_job, hub.LinkJob)

# Step 2: Get the QNN context binary from LinkJob and wrap it as PrecompiledQnnOnnx
# Note: When wrapping an ONNX wrappable model (QNN Context Binary), do not pass options
qnn_context_binary = link_job.get_target_model()
compile_job = hub.submit_compile_job(
    model=qnn_context_binary,
    device=hub.Device("Snapdragon 8 Elite QRD"),
)
assert isinstance(compile_job, hub.CompileJob)
compile_job.download_target_model("Precompiled_MobileNet_V2.onnx")

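The compiled model is a compressed directory (with the extension .onnx) containing an ONNX file and a QNN context binary file. If you upload a precompiled ONNX Runtime model that you compiled yourself, it must follow this folder structure: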

<modeldir>.onnx
   ├── <model>.onnx
   └── <model>.bin

The ONNX model references the QNN context binary by relative path, so take care to keep that reference valid if you rename or move the .bin file.

Compiling a PyTorch model for ONNX Runtime

Qualcomm® AI Hub Workbench supports compiling a PyTorch model for ONNX Runtime. In this example, we use mobilenet_v2.pt and compile it to an ONNX model. This model can then be profiled using ONNX Runtime.

ONNX Runtime supports execution on the CPU, the GPU (using the DML execution provider), or the NPU (using the QNN execution provider).

import qai_hub as hub

client = hub.Client()

# Compile a model to an ONNX model
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.onnx")

ONNX provenance and Workbench compilation

ONNX models produced by AI Hub Workbench include hardware-compatibility fixes that optimize them for Qualcomm hardware. An ONNX model uploaded directly may not include these fixes, so we recommend compiling the model with Qualcomm® AI Hub Workbench.

Compiling an ONNX model to TensorFlow Lite or QNN

Qualcomm® AI Hub Workbench also supports compiling ONNX models to TensorFlow Lite or to a Qualcomm® Deep Learning Container (DLC). This example uses mobilenet_v2.onnx.

import qai_hub as hub

client = hub.Client()

# Compile a model to TensorFlow Lite
compile_job = client.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
compile_job.download_target_model("MobileNet_V2.tflite")

# Compile a model to a QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_dlc",
)
compile_job.download_target_model("MobileNet_V2.dlc")
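The two jobs above differ only in their options string, and the file extension passed to download_target_model() follows from it. The sketch below makes that mapping explicit. It is a hypothetical helper, not part of qai_hub. The extensions are taken only from the download calls shown on this page, and treating a missing --target_runtime as TensorFlow Lite is an assumption based on those examples.

```python
import shlex

# Extensions observed in the download_target_model() calls on this page.
EXTENSION_BY_RUNTIME = {"qnn_dlc": ".dlc", "onnx": ".onnx"}


def target_extension(options: str = "") -> str:
    """Guess the download extension from a compile options string (hypothetical helper)."""
    tokens = shlex.split(options)
    if "--target_runtime" not in tokens:
        # Assumption: the examples without --target_runtime download a .tflite file.
        return ".tflite"
    index = tokens.index("--target_runtime")
    if index + 1 >= len(tokens):
        raise ValueError("--target_runtime is missing its value")
    runtime = tokens[index + 1]
    if runtime not in EXTENSION_BY_RUNTIME:
        raise ValueError(f"no extension recorded for runtime {runtime!r}")
    return EXTENSION_BY_RUNTIME[runtime]


print("MobileNet_V2" + target_extension("--target_runtime qnn_dlc"))  # MobileNet_V2.dlc
```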

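An ONNX model can be unquantized (as in the example above) or quantized (described in Quantization). If the source model is quantized, its quantization parameters are respected and a quantized deployable asset is produced.

An ONNX model can also be a directory, which supports ONNX models with external weights. The directory (with the extension .onnx), optionally compressed, must contain exactly one .onnx file and exactly one weights file with the .data extension. It must follow this folder structure: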

<modeldir>.onnx
   ├── <model>.onnx
   └── <model>.data

<modeldir> and <model> can be any name. If your ONNX model does not follow this structure, use the following code to make it conform:

# if you have an ONNX model "file.onnx" which uses external weights,
# but does not adhere to Qualcomm AI Hub's required format, use this
# code to make it adhere

import onnx

model = onnx.load("file.onnx")
onnx.save(model, "new_file.onnx", save_as_external_data=True, location="new_file.data")

# place both "new_file.onnx" and "new_file.data" in a new directory with
# a .onnx extension, without any other files and upload that directory
# to Qualcomm AI Hub, either as is or as a .zip file

The ONNX model references the weights file by relative path, so take care to keep that reference valid if you rename or move the weights file.
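The final step described in the comments above (placing both files in a new directory with a .onnx extension and nothing else) can be sketched with the standard library. package_onnx_with_weights is a hypothetical helper name. The files are copied under their original names so that the relative reference from the ONNX model to the weights file stays valid. The demo uses empty placeholder files in place of real onnx.save output.

```python
import shutil
import tempfile
from pathlib import Path


def package_onnx_with_weights(onnx_file, data_file, model_dir) -> Path:
    """Copy one .onnx file and one .data file into a fresh <modeldir>.onnx directory."""
    onnx_file, data_file, model_dir = Path(onnx_file), Path(data_file), Path(model_dir)
    if onnx_file.suffix != ".onnx" or data_file.suffix != ".data":
        raise ValueError("expected one .onnx file and one .data file")
    if model_dir.suffix != ".onnx":
        raise ValueError("the directory name must end in .onnx")
    # mkdir raises if the directory already exists, so it starts out empty.
    model_dir.mkdir(parents=True)
    shutil.copy2(onnx_file, model_dir / onnx_file.name)
    shutil.copy2(data_file, model_dir / data_file.name)
    return model_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Empty placeholders standing in for the output of onnx.save(...).
    (root / "new_file.onnx").touch()
    (root / "new_file.data").touch()
    packaged = package_onnx_with_weights(
        root / "new_file.onnx", root / "new_file.data", root / "model_dir.onnx"
    )
    print(sorted(p.name for p in packaged.iterdir()))  # ['new_file.data', 'new_file.onnx']
```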

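Compiling an AIMET-quantized model to TensorFlow Lite or QNN

AI Model Efficiency Toolkit (AIMET) is an open-source library that provides advanced model quantization and compression techniques for trained neural network models. AIMET's QuantizationSimModel can be exported to an ONNX model (.onnx) with quantization parameters and an encodings file (.encodings).

To use such a model, create a directory whose name contains .aimet. It must contain one .onnx model and the corresponding encodings file.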

<modeldir>.aimet
   ├── <model>.onnx
   ├── <model>.data (optional)
   └── <encodings>.encodings

<modeldir>, <model>, and <encodings> can be any name. <model>.data is needed only if the ONNX model has external weights.

As an example, we use mobilenet_v2_onnx.aimet.zip. After unzipping it into the mobilenet_v2_onnx.aimet directory, you can submit a compile job.

import qai_hub as hub

client = hub.Client()

# Compile to TensorFlow Lite
compile_job = client.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)
compile_job.download_target_model("MobileNet_V2.tflite")

# Compile to a QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_dlc --quantize_full_type int8",
)
compile_job.download_target_model("MobileNet_V2.dlc")
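Before submitting an AIMET directory, it can help to check it locally against the layout described in this section. The following is a minimal sketch using only the standard library. check_aimet_dir is a hypothetical helper, and it checks only file names and counts, not the contents of the model or the encodings.

```python
import tempfile
from pathlib import Path


def check_aimet_dir(model_dir) -> list[str]:
    """Return a list of layout problems; an empty list means none were found."""
    model_dir = Path(model_dir)
    problems = []
    if ".aimet" not in model_dir.name:
        problems.append("directory name must contain .aimet")
    if not model_dir.is_dir():
        return problems + ["path is not a directory"]
    counts = {
        ext: len(list(model_dir.glob(f"*{ext}")))
        for ext in (".onnx", ".encodings", ".data")
    }
    if counts[".onnx"] != 1:
        problems.append("expected exactly one .onnx file")
    if counts[".encodings"] != 1:
        problems.append("expected exactly one .encodings file")
    if counts[".data"] > 1:
        problems.append("expected at most one .data file")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "mobilenet_v2_onnx.aimet"
    model_dir.mkdir()
    # Empty placeholder files; a real export would come from AIMET.
    (model_dir / "model.onnx").touch()
    (model_dir / "model.encodings").touch()
    print(check_aimet_dir(model_dir))  # []
```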

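Compiling and linking models to produce a weight-sharing QNN context binary

Qualcomm® AI Hub Workbench can compile and link multiple models, or a model with multiple input variations, into a weight-sharing (multi-graph) QNN context binary. This is useful for bundling graphs that share weights when targeting the NPU of a specific device. For each model input variation, you must specify the corresponding model, compile options, and a unique graph name, as in the sample code below. The graph name is used as the key for accessing the model variation in the generated QNN context binary. See Linking for details on weight-sharing QNN context binaries.

Supported source models: ONNX with dynamic shapes, TorchScript (.pt)

The API returns a tuple of CompileJob and LinkJob instances for each specified device.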

import torch

import qai_hub as hub

client = hub.Client()

pt_model1 = torch.jit.load("encoder.pt")
pt_model2 = torch.jit.load("decoder.pt")

input_specs1 = [
    {"x": ((1, 3, 224, 224), "float32")},
    {"x": ((1, 3, 192, 192), "float32")},
]
# Compile options are repeated to match the number of model input_specs variants
# Each input_spec can have its own compile options
compile_options1 = ["--force_channel_last_input x --quantize_io"] * 2

input_specs2 = [
    {"x": ((1, 3, 224, 224), "float32")},
    {"x": ((1, 3, 192, 192), "float32")},
    {"x": ((1, 3, 160, 160), "float32")},
]
compile_options2 = ["--qnn_options default_graph_htp_precision=FLOAT16"] * 3

# Model entries in list are repeated to match their respective number of input_specs variants
models = [pt_model1, pt_model1, pt_model2, pt_model2, pt_model2]

# models: list of models to compile (ONNX, TorchScript)
# device: target device or list of target devices for compilation and linking
# name: optional name for the compile and link job
# input_specs: list of I/O specifications for each model variant
# graph_names: list of unique graph names for each model variant
# compile_options: list of compile options for each model variant
# link_options: link options for each device

jobs = client.submit_compile_and_link_jobs(
    models,
    device=hub.Device("Samsung Galaxy S23"),
    name="encoder + decoder",
    input_specs=[*input_specs1, *input_specs2],
    graph_names=[
        "encoder_224",
        "encoder_192",
        "decoder_224",
        "decoder_192",
        "decoder_160",
    ],
    compile_options=[*compile_options1, *compile_options2],
    link_options="--qnn_options default_graph_htp_optimizations=O=3",
)
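In the call above, models, input_specs, graph_names, and compile_options are parallel lists that must line up entry by entry, with each model repeated once per input variation. One way to keep them aligned is to describe each variant once and flatten the description. The sketch below does that with a hypothetical helper, flatten_variants. It uses file paths as stand-ins for the loaded models and does not call qai_hub.

```python
from typing import Any


def flatten_variants(
    groups: list[tuple[Any, list[tuple[str, dict, str]]]],
) -> dict[str, list]:
    """Expand (model, [(graph_name, input_spec, compile_options), ...]) into parallel lists."""
    models, input_specs, graph_names, compile_options = [], [], [], []
    for model, variants in groups:
        for graph_name, input_spec, options in variants:
            if graph_name in graph_names:
                raise ValueError(f"duplicate graph name: {graph_name}")
            models.append(model)  # the model is repeated once per variant
            input_specs.append(input_spec)
            graph_names.append(graph_name)
            compile_options.append(options)
    return {
        "models": models,
        "input_specs": input_specs,
        "graph_names": graph_names,
        "compile_options": compile_options,
    }


encoder_options = "--force_channel_last_input x --quantize_io"
decoder_options = "--qnn_options default_graph_htp_precision=FLOAT16"
args = flatten_variants([
    ("encoder.pt", [
        ("encoder_224", {"x": ((1, 3, 224, 224), "float32")}, encoder_options),
        ("encoder_192", {"x": ((1, 3, 192, 192), "float32")}, encoder_options),
    ]),
    ("decoder.pt", [
        ("decoder_224", {"x": ((1, 3, 224, 224), "float32")}, decoder_options),
        ("decoder_192", {"x": ((1, 3, 192, 192), "float32")}, decoder_options),
        ("decoder_160", {"x": ((1, 3, 160, 160), "float32")}, decoder_options),
    ]),
])
print(args["models"])
```

Because the dictionary keys match the keyword names used in the example above, the result could presumably be unpacked into the call with **args alongside device, name, and link_options.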