Introduction

Quantization is a technique that reduces the precision of a generative AI model’s parameters, optimizing memory usage and inference speed while maintaining acceptable accuracy. This tutorial will walk you through the process of serving quantized models with Friendli Container.
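As a rough illustration of the idea (this is not what Friendli's tooling does internally, just a toy example), here is a minimal sketch of symmetric INT8 quantization of a single weight tensor using NumPy:

import numpy as np

# A toy FP32 weight matrix standing in for a model parameter.
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric INT8 quantization: map [-max|w|, max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to measure the rounding error the compression introduces.
dequantized = q_weights.astype(np.float32) * scale
print("max abs error:", np.abs(weights - dequantized).max())

Each INT8 value takes a quarter of the memory of its FP32 counterpart, which is where the memory and bandwidth savings come from.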

Off-the-Shelf Model Checkpoints from Hugging Face Hub

You can serve model checkpoints that are already quantized and published on Hugging Face Hub, such as FriendliAI/Llama-3.1-8B-Instruct-fp8, without quantizing anything yourself.

For details on how to use these models, go directly to Serving Quantized Models.

Quantizing Your Own Models (FP8/INT8)

To quantize your own models with FP8 or INT8, follow these steps:

  1. Install the friendli-model-optimizer package. This tool provides model quantization for efficient generative AI serving with Friendli Engine. Install it with the following command:

pip install "friendli-model-optimizer"

  2. Prepare the original model. Ensure you have the original model checkpoint, which must be loadable with Hugging Face’s transformers library (a quick way to verify this is sketched after these steps).

  3. Quantize the model with Friendli Model Optimizer (FMO). Run quantization with the command below:

export MODEL_NAME_OR_PATH="" # Hugging Face pretrained model name or directory path of the original model checkpoint.
export OUTPUT_DIR="" # Directory path to save the quantized checkpoint and related configurations.
export QUANTIZATION_SCHEME="" # Quantization scheme to apply. Supported values: fp8, int8.
export DEVICE="" # Device to run the quantization process. Defaults to "cuda:0".

fmo quantize \
  --model-name-or-path $MODEL_NAME_OR_PATH \
  --output-dir $OUTPUT_DIR \
  --mode $QUANTIZATION_SCHEME \
  --device $DEVICE

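As mentioned in step 2, you can sanity-check that the original checkpoint loads with transformers before quantizing. A minimal sketch (the model name below is a placeholder; substitute your own checkpoint):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder; replace with your model name or local checkpoint directory.
model_name_or_path = "meta-llama/Llama-3.1-8B-Instruct"

# If both of these load without errors, the checkpoint is ready for FMO.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
print(f"Loaded {model.config.model_type} with {model.num_parameters():,} parameters")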
When the model checkpoint is successfully quantized, the following files are created in $OUTPUT_DIR:

  • config.json
  • model.safetensors
  • special_tokens_map.json
  • tokenizer_config.json
  • tokenizer.json

If the model is larger than 10 GB, the checkpoint is sharded into multiple files instead of a single model.safetensors, for example:

  • model-00001-of-00005.safetensors
  • model-00002-of-00005.safetensors
  • model-00003-of-00005.safetensors
  • model-00004-of-00005.safetensors
  • model-00005-of-00005.safetensors
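To spot-check the quantized output, you can list the tensors and their dtypes with the safetensors library. A minimal sketch, assuming the single-file (non-sharded) case and a placeholder path:

from safetensors import safe_open

# Placeholder path; point this at your $OUTPUT_DIR.
checkpoint = "/path/to/output_dir/model.safetensors"

with safe_open(checkpoint, framework="pt") as f:
    for name in list(f.keys())[:5]:  # peek at the first few tensors
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)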

For more information about FMO, check out this documentation.

Serving Quantized Models

Search Optimal Policy

To serve quantized models efficiently, you first need to run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.

Serving FP8 Models

Once you have prepared the quantized model checkpoint, you are ready to create a serving endpoint.

# Fill in the values of the following variables.
export HF_MODEL_NAME="" # Quantized model name in Hugging Face Hub or directory path of the quantized model checkpoint.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""  # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""  # GPUs (e.g., '"device=0,1"')
export POLICY_DIR=$PWD/policy

mkdir -p $POLICY_DIR

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name $HF_MODEL_NAME \
    --algo-policy-dir /policy \
    --search-policy true

Example: FriendliAI/Llama-3.1-8B-Instruct-fp8

FP8 model serving is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.
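If you are unsure whether your GPU qualifies, you can check its compute capability with PyTorch; Ada reports 8.9 and Hopper 9.0, and newer architectures report higher values. A minimal sketch:

import torch

# FP8 serving requires Ada (8.9), Hopper (9.0), or a newer architecture.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("FP8-capable:", (major, minor) >= (8, 9))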

# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""  # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""  # GPUs (e.g., '"device=0,1"')
export POLICY_DIR=$PWD/policy  # Directory that contains the results of the policy search step above

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
    --algo-policy-dir /policy \
    --search-policy true
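Once the container is up, you can send a test request to confirm the endpoint works. The sketch below assumes the container exposes an OpenAI-compatible chat completions endpoint on port 8000, as configured above; the field names follow the OpenAI convention:

import requests

# Assumes the container above is listening on localhost:8000.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "FriendliAI/Llama-3.1-8B-Instruct-fp8",  # may be optional for a single-model server
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(response.json())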