Compiling Models

Qualcomm® AI Hub Workbench supports compiling trained models in the following formats:

PyTorch
ONNX
AI Model Efficiency Toolkit (AIMET) quantized models
TensorFlow (via ONNX)

These models can be compiled for the following target runtimes:

TensorFlow Lite (recently renamed LiteRT; recommended for Android developers)
ONNX (recommended for Windows developers)
Qualcomm® AI Engine Direct (QNN) context binary (SoC-specific)
Qualcomm® AI Engine Direct (QNN) DLC (hardware-agnostic)

To specify the version of Qualcomm® AI Engine Direct, include --qairt_version in the compile options. See Common Options.
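
The options argument of submit_compile_job() is a single space-separated string of flags, so it can help to assemble it in one place. The sketch below uses a hypothetical helper, build_compile_options, which is not part of qai_hub. It only emits the --target_runtime and --qairt_version flags named on this page, and the version value "2.31" is a placeholder that should be checked against Common Options.

```python
from typing import Optional


def build_compile_options(
    target_runtime: Optional[str] = None,
    qairt_version: Optional[str] = None,
) -> str:
    """Assemble an options string for a compile job.

    Hypothetical helper, not part of qai_hub. Flags left as None are
    omitted, so the service defaults apply to them.
    """
    parts = []
    if target_runtime is not None:
        parts += ["--target_runtime", target_runtime]
    if qairt_version is not None:
        parts += ["--qairt_version", qairt_version]
    return " ".join(parts)


# "2.31" is a placeholder; see Common Options for the accepted values.
options = build_compile_options("qnn_dlc", qairt_version="2.31")
print(options)
```

The resulting string can be passed as options=... to submit_compile_job() in the examples that follow.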

Compiling PyTorch to TensorFlow Lite

To compile a PyTorch model, first generate a TorchScript model in memory using the jit.trace method in PyTorch. Once traced, you can compile the model using the submit_compile_job() API.

Alternatively, AI Hub Workbench has beta support for torch.export. A torch.export model can be generated in memory using the export.export method in PyTorch. Once exported, you can compile the model using the submit_compile_job() API. A torch.export model can be serialized using the export.save method in PyTorch. Once saved to a file with a .pt2 extension, this file can be compiled as well using the submit_compile_job() API.

TensorFlow Lite models can run on the CPU, the GPU (using GPU delegation), or the NPU (using QNN delegation).

import torch
import torchvision

import qai_hub as hub

client = hub.Client()

# Using pre-trained MobileNet
torch_model = torchvision.models.mobilenet_v2(pretrained=True)
torch_model.eval()

# Trace model
input_shape: tuple[int, ...] = (1, 3, 224, 224)
example_input = torch.rand(input_shape)
pt_model = torch.jit.trace(torch_model, example_input)

# Compile model on a specific device
compile_job = client.submit_compile_job(
    pt_model,
    name="MobileNet_V2",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=input_shape),
)

# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.tflite")

If you already have a traced or scripted Torch model saved with torch.jit.save, you can submit it directly. We use mobilenet_v2.pt as an example. In this example, we also profile the compiled model.

import qai_hub as hub

client = hub.Client()

# Compile a model
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    input_specs=dict(image=(1, 3, 224, 224)),
)

# Profile the compiled model
profile_job = client.submit_profile_job(
    model=compile_job.get_target_model(),
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)

# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.tflite")

Compiling a PyTorch model to a QNN DLC

Qualcomm® AI Hub supports compiling PyTorch models to QNN DLCs and profiling them. In this example, we use mobilenet_v2.pt and compile it to a QNN DLC (.dlc file).

A DLC is hardware-agnostic. The Qualcomm® AI Engine Direct SDK guarantees that a DLC is compatible with later SDK versions; that is, a DLC compiled with a given SDK version can also run on later SDK versions. See Qualcomm® AI Engine Direct Options for details.

import qai_hub as hub

client = hub.Client()

# Compile a model to QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_dlc",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. To learn how to profile this model on the Snapdragon® Neural Processing Unit (NPU), see this example.
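
The DLC compatibility guarantee described in this section amounts to a one-directional version comparison: the runtime SDK must be the same version as, or newer than, the SDK used for compilation. The sketch below illustrates that rule with a hypothetical helper, dlc_can_run. It assumes SDK versions are plain dotted integers such as "2.28", which is an assumption about the version format and not something this page specifies.

```python
def _parse_version(version: str) -> tuple[int, ...]:
    """Turn "2.28.0" into (2, 28); trailing zeros are dropped so that
    "2.28" and "2.28.0" compare as equal."""
    parts = [int(part) for part in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def dlc_can_run(compiled_with: str, runtime_version: str) -> bool:
    """Hypothetical check: a DLC runs on the same or a later SDK version."""
    return _parse_version(runtime_version) >= _parse_version(compiled_with)


print(dlc_can_run("2.28", "2.31"))  # prints True
print(dlc_can_run("2.31", "2.28"))  # prints False
```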

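Compiling a PyTorch model to a QNN context binary

Qualcomm® AI Hub Workbench supports compiling PyTorch models to QNN context binaries and profiling them. In this example, we use mobilenet_v2.pt and compile it to a QNN context binary optimized to run on a specific device. Because context binaries are optimized for particular hardware, each one can be compiled for a single device only.

A context binary is an SoC-specific deployment mechanism. When you compile for a device, the model is expected to be deployed to that same device. The format is operating-system agnostic, so the same model can be deployed on Android, Linux, or Windows. Context binaries are designed only for the NPU.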

import qai_hub as hub

client = hub.Client()

# Compile a model to QNN context binary
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_context_binary",
    input_specs=dict(image=(1, 3, 224, 224)),
)
assert isinstance(compile_job, hub.CompileJob)

The return value is an instance of CompileJob. To learn how to profile this model on the Snapdragon® Neural Processing Unit (NPU), see this example.

A QNN context binary can also be embedded in an ONNX model.

Compiling to a Precompiled QNN ONNX

Qualcomm® AI Hub Workbench supports compiling to precompiled ONNX Runtime models and profiling them. Such a model is an ONNX Runtime compatible model that embeds a precompiled QNN binary and can be run with ONNX Runtime on Snapdragon devices. See the ONNX Runtime QNN Execution Provider documentation for details.

Benefits of using a precompiled QNN ONNX model:

Ease of deployment: works on Android, Linux, and Windows.
Performance: the same as a QNN context binary.
Simple inference code: ONNX Runtime runs inference on the compiled model using the QNN Execution Provider.
Large models: suitable for large models (>1GB) such as LLMs and Stable Diffusion.

Please note that the QNN context binary is operating system agnostic, but device specific. Additionally, context binaries are designed only for the NPU.

Generating a Precompiled QNN ONNX Model

A Precompiled QNN ONNX model can be generated in two steps:

First, compile your source model (PyTorch, ONNX, etc.) to a QNN context binary using submit_compile_and_link_jobs()
Then, wrap the QNN context binary as a PrecompiledQnnOnnx artifact using submit_compile_job() (without specifying any options).

In this example, let us assume we want to target the Snapdragon® 8 Elite:

import qai_hub as hub

# Step 1: Compile a PyTorch model to QNN context binary
_, link_job = hub.submit_compile_and_link_jobs(
    models="mobilenet_v2.pt",
    device=hub.Device("Snapdragon 8 Elite QRD"),
    input_specs={"image": (1, 3, 224, 224)},
)
assert isinstance(link_job, hub.LinkJob)

# Step 2: Get the QNN context binary from LinkJob and wrap it as PrecompiledQnnOnnx
# Note: When wrapping an ONNX wrappable model (QNN Context Binary), do not pass options
qnn_context_binary = link_job.get_target_model()
compile_job = hub.submit_compile_job(
    model=qnn_context_binary,
    device=hub.Device("Snapdragon 8 Elite QRD"),
)
assert isinstance(compile_job, hub.CompileJob)
compile_job.download_target_model("Precompiled_MobileNet_V2.onnx")

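The compiled model is an optionally zipped directory (with a .onnx extension) that contains an ONNX file and a QNN context binary file. If you upload a precompiled ONNX Runtime model that you compiled yourself, it must follow this folder structure: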

<modeldir>.onnx
   ├── <model>.onnx
   └── <model>.bin

Note that the ONNX model refers to the QNN context binary by a relative path, so take care to keep that reference valid if you rename or move the .bin file.

Compiling a PyTorch model for ONNX Runtime

Qualcomm® AI Hub Workbench supports compiling PyTorch models for ONNX Runtime. In this example, we use mobilenet_v2.pt and compile it to an ONNX model. The resulting model can be profiled using ONNX Runtime.

ONNX Runtime supports execution on the CPU, the GPU (using the DML execution provider), or the NPU (using the QNN execution provider).

import qai_hub as hub

client = hub.Client()

# Compile a model to an ONNX model
compile_job = client.submit_compile_job(
    model="mobilenet_v2.pt",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime onnx",
    input_specs=dict(image=(1, 3, 224, 224)),
)
# Download the optimized compiled model
compile_job.download_target_model("MobileNet_V2.onnx")
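
When running the downloaded model locally, the choice between NPU, GPU, and CPU comes down to which execution providers are requested from ONNX Runtime, in order of preference. The sketch below shows one way to pick that order. choose_providers is a hypothetical helper, and the preference order is an assumption rather than a recommendation from this page; the provider names are the standard ONNX Runtime identifiers.

```python
from typing import Iterable, List

# Standard ONNX Runtime execution provider names, in an assumed
# order of preference: NPU first, then GPU, then CPU.
PREFERRED_PROVIDERS = [
    "QNNExecutionProvider",  # NPU (QNN execution provider)
    "DmlExecutionProvider",  # GPU (DirectML execution provider)
    "CPUExecutionProvider",  # CPU fallback
]


def choose_providers(available: Iterable[str]) -> List[str]:
    """Return the preferred providers that are actually available."""
    available_set = set(available)
    chosen = [name for name in PREFERRED_PROVIDERS if name in available_set]
    if not chosen:
        raise ValueError("none of the preferred execution providers is available")
    return chosen


# Example with a hard-coded list standing in for a real installation.
print(choose_providers(["CPUExecutionProvider", "DmlExecutionProvider"]))
```

With onnxruntime installed, the available list would typically come from onnxruntime.get_available_providers(), and the result would be passed as providers=... to onnxruntime.InferenceSession. The QNN provider usually needs additional provider options to target the NPU; see the ONNX Runtime QNN Execution Provider documentation.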

ONNX provenance and Workbench compilation

ONNX models produced by AI Hub Workbench contain hardware-compatibility fixes that optimize for Qualcomm hardware. ONNX models uploaded directly may not contain these fixes, and we therefore recommend compiling the model with Qualcomm® AI Hub Workbench.

Compiling an ONNX model to TensorFlow Lite or QNN

Qualcomm® AI Hub Workbench also supports the compilation of ONNX models to TensorFlow Lite or a Qualcomm® Deep Learning Container. We will use mobilenet_v2.onnx as an example.

import qai_hub as hub

client = hub.Client()

# Compile a model to TensorFlow Lite
compile_job = client.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
)
compile_job.download_target_model("MobileNet_V2.tflite")

# Compile a model to a QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2.onnx",
    device=hub.Device("Samsung Galaxy S23 (Family)"),
    options="--target_runtime qnn_dlc",
)
compile_job.download_target_model("MobileNet_V2.dlc")

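An ONNX model may be unquantized (as in the example above) or quantized (as shown in Quantization). If the source model is quantized, compilation honors its quantization parameters and produces a quantized deployable asset. An ONNX model can also be a directory, in order to support external weights. The optionally zipped directory (with a .onnx extension) must contain exactly one .onnx file and exactly one weights file with a .data extension, in the following folder structure: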

<modeldir>.onnx
   ├── <model>.onnx
   └── <model>.data

<modeldir> and <model> can be any name. If your ONNX model does not follow this structure, use the following code to make it conform:

# if you have an ONNX model "file.onnx" which uses external weights,
# but does not adhere to Qualcomm AI Hub's required format, use this
# code to make it adhere

import onnx

model = onnx.load("file.onnx")
onnx.save(model, "new_file.onnx", save_as_external_data=True, location="new_file.data")

# place both "new_file.onnx" and "new_file.data" in a new directory with
# a .onnx extension, without any other files and upload that directory
# to Qualcomm AI Hub, either as is or as a .zip file

Note that the ONNX model refers to the weights file by a relative path, so keep that reference in mind if you rename or move the weights file.
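
Before uploading, the directory layout described in this section can be checked locally. The sketch below uses a hypothetical helper, check_onnx_model_dir, that only inspects file names and counts. It does not open the ONNX file, so it cannot verify that the relative path to the weights file is still valid, and it does not handle a zipped directory.

```python
import tempfile
from pathlib import Path


def check_onnx_model_dir(model_dir: str) -> list[str]:
    """Return a list of layout problems; an empty list means the layout is OK."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    if root.suffix != ".onnx":
        problems.append("directory name should end in .onnx")

    files = [p for p in root.iterdir() if p.is_file()]
    onnx_files = [p for p in files if p.suffix == ".onnx"]
    data_files = [p for p in files if p.suffix == ".data"]
    extras = [p for p in files if p.suffix not in (".onnx", ".data")]

    if len(onnx_files) != 1:
        problems.append(f"expected exactly one .onnx file, found {len(onnx_files)}")
    if len(data_files) != 1:
        problems.append(f"expected exactly one .data file, found {len(data_files)}")
    if extras:
        problems.append("unexpected files: " + ", ".join(sorted(p.name for p in extras)))
    return problems


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "new_file_dir.onnx"
    model_dir.mkdir()
    (model_dir / "new_file.onnx").touch()
    (model_dir / "new_file.data").touch()
    print(check_onnx_model_dir(str(model_dir)))  # prints []
```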

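Compiling an AIMET-quantized model to TensorFlow Lite or QNN

The AI Model Efficiency Toolkit (AIMET) is an open-source library that provides advanced model quantization and compression techniques for trained neural network models. An AIMET QuantizationSimModel can be exported as an ONNX model (.onnx) together with an encodings file (.encodings) that holds the quantization parameters.

To use such a model, create a directory with .aimet in its name. The directory must contain one .onnx model and its corresponding encodings file.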

<modeldir>.aimet
   ├── <model>.onnx
   ├── <model>.data (optional)
   └── <encodings>.encodings

<modeldir>, <model>, and <encodings> can be any name. <model>.data is needed only if the ONNX model has external weights.
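
If the exported files are scattered across a working directory, the .aimet layout above can be assembled with a few standard-library calls. The sketch below is one way to do that; make_aimet_dir is a hypothetical helper and not part of AIMET or qai_hub. Files are copied under their original names so that any relative reference from the ONNX model to its external weights file stays intact.

```python
import shutil
from pathlib import Path
from typing import Optional


def make_aimet_dir(
    onnx_path: str,
    encodings_path: str,
    out_dir: str,
    data_path: Optional[str] = None,
) -> Path:
    """Copy the exported files into a new directory with .aimet in its name."""
    out = Path(out_dir)
    if ".aimet" not in out.name:
        raise ValueError("the directory name must contain .aimet")
    if Path(onnx_path).suffix != ".onnx":
        raise ValueError("the model file must have a .onnx extension")
    if Path(encodings_path).suffix != ".encodings":
        raise ValueError("the encodings file must have a .encodings extension")
    if data_path is not None and Path(data_path).suffix != ".data":
        raise ValueError("the weights file must have a .data extension")

    # Refuse to reuse an existing directory so stale files cannot leak in.
    out.mkdir(parents=True, exist_ok=False)
    for src in (onnx_path, encodings_path, data_path):
        if src is not None:
            shutil.copy2(src, out / Path(src).name)
    return out
```

The returned path can then be passed as the model argument of submit_compile_job().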

We will use mobilenet_v2_onnx.aimet.zip as an example. After unzipping it to a mobilenet_v2_onnx.aimet directory, you can submit compile jobs:

import qai_hub as hub

client = hub.Client()

# Compile to TensorFlow Lite
compile_job = client.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
)
compile_job.download_target_model("MobileNet_V2.tflite")

# Compile to a QNN DLC
compile_job = client.submit_compile_job(
    model="mobilenet_v2_onnx.aimet",
    device=hub.Device("Samsung Galaxy S24 (Family)"),
    options="--target_runtime qnn_dlc --quantize_full_type int8",
)
compile_job.download_target_model("MobileNet_V2.dlc")

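Compiling and linking models into a weight-sharing QNN context binary

Qualcomm® AI Hub Workbench can compile and link multiple models, or a model with multiple input variants, into a weight-sharing (multi-graph) QNN context binary. This is useful for bundling graphs that share the same weights when targeting the NPU of a specific device. For each model input variant, you must specify the model, its compile options, and a unique graph name, as in the example code below. The graph name is the key used to access that model variant in the generated QNN context binary. See Linking for more on weight-sharing QNN context binaries.

Supported source models: ONNX with dynamic shapes, TorchScript (.pt)

The API returns a tuple of CompileJob and LinkJob instances for each specified device.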

import torch

import qai_hub as hub

client = hub.Client()

pt_model1 = torch.jit.load("encoder.pt")
pt_model2 = torch.jit.load("decoder.pt")

input_specs1 = [
    {"x": ((1, 3, 224, 224), "float32")},
    {"x": ((1, 3, 192, 192), "float32")},
]
# Compile options are repeated to match the number of model input_specs variants
# Each input_spec can have its own compile options
compile_options1 = ["--force_channel_last_input x --quantize_io"] * 2

input_specs2 = [
    {"x": ((1, 3, 224, 224), "float32")},
    {"x": ((1, 3, 192, 192), "float32")},
    {"x": ((1, 3, 160, 160), "float32")},
]
compile_options2 = ["--qnn_options default_graph_htp_precision=FLOAT16"] * 3

# Model entries in list are repeated to match their respective number of input_specs variants
models = [pt_model1, pt_model1, pt_model2, pt_model2, pt_model2]

# models: list of models to compile (ONNX, TorchScript)
# device: target device or list of target devices for compilation and linking
# name: optional name for the compile and link job
# input_specs: list of I/O specifications for each model variant
# graph_names: list of unique graph names for each model variant
# compile_options: list of compile options for each model variant
# link_options: link options for each device

jobs = client.submit_compile_and_link_jobs(
    models,
    device=hub.Device("Samsung Galaxy S23"),
    name="encoder + decoder",
    input_specs=[*input_specs1, *input_specs2],
    graph_names=[
        "encoder_224",
        "encoder_192",
        "decoder_224",
        "decoder_192",
        "decoder_160",
    ],
    compile_options=[*compile_options1, *compile_options2],
    link_options="--qnn_options default_graph_htp_optimizations=O=3",
)
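
The call above relies on four parallel lists (models, input_specs, graph_names, compile_options) that must stay the same length and in the same order. One way to reduce bookkeeping mistakes is to describe each graph variant once and derive the lists from that description. The sketch below does this with a hypothetical GraphVariant dataclass and to_parallel_lists helper; neither is part of qai_hub, and the strings used as models are placeholders for loaded models or file paths.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

InputSpec = Dict[str, Tuple[Tuple[int, ...], str]]


@dataclass
class GraphVariant:
    """One graph in the linked context binary (hypothetical helper type)."""
    model: Any
    graph_name: str
    input_spec: InputSpec
    compile_options: str = ""


def to_parallel_lists(variants: List[GraphVariant]) -> Dict[str, list]:
    """Expand variants into the keyword arguments used in the example above."""
    names = [v.graph_name for v in variants]
    if len(set(names)) != len(names):
        raise ValueError("graph names must be unique")
    return {
        "models": [v.model for v in variants],
        "input_specs": [v.input_spec for v in variants],
        "graph_names": names,
        "compile_options": [v.compile_options for v in variants],
    }


variants = [
    GraphVariant("encoder.pt", "encoder_224", {"x": ((1, 3, 224, 224), "float32")}),
    GraphVariant("encoder.pt", "encoder_192", {"x": ((1, 3, 192, 192), "float32")}),
]
kwargs = to_parallel_lists(variants)
print(kwargs["graph_names"])  # prints ['encoder_224', 'encoder_192']
```

The resulting dictionary could then be unpacked into the call, for example client.submit_compile_and_link_jobs(device=..., **kwargs), using the same keyword names as the example above.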