Serving Quantized Models
Introduction
Quantization is a technique that reduces the precision of a generative AI model's parameters, cutting memory usage and improving inference speed while maintaining acceptable accuracy. For example, storing weights in FP8 instead of FP16 roughly halves weight memory, so an 8B-parameter model needs about 8 GB instead of 16 GB. This tutorial walks you through serving quantized models with Friendli Container.
Off-the-Shelf Model Checkpoints from Hugging Face Hub
To use model checkpoints that are already quantized and available on Hugging Face Hub, check the following options:
- Checkpoints quantized with friendli-model-optimizer
- Quantized model checkpoints by FriendliAI
- A subset of models quantized with other quantization tools
For details on how to use these models, skip ahead to the Serving Quantized Models section below.
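To fetch one of these checkpoints ahead of time, you can use the standard Hugging Face CLI. The checkpoint name below is the FP8 Llama model used as the example later in this tutorial:
# Optional: pre-download a quantized checkpoint from Hugging Face Hub.
# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
huggingface-cli download FriendliAI/Llama-3.1-8B-Instruct-fp8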
Quantizing Your Own Models (FP8/INT8)
To quantize your own models with FP8 or INT8, follow these steps:
- Install the friendli-model-optimizer package. This tool provides model quantization for efficient generative AI serving with Friendli Engine. Install it using the following command:
pip install "friendli-model-optimizer"
- Prepare the original model. Ensure you have the original model checkpoint that can be loaded with Hugging Face's transformers library.
- Quantize the model with Friendli Model Optimizer (FMO). Run quantization with the command below; a complete, filled-in example appears at the end of this section:
export MODEL_NAME_OR_PATH="" # Hugging Face pretrained model name or directory path of the original model checkpoint.
export OUTPUT_DIR="" # Directory path to save the quantized checkpoint and related configurations.
export QUANTIZATION_SCHEME="" # Quantization techniques to apply. You can use fp8, int8.
export DEVICE="" # Device to run the quantization process. Defaults to "cuda:0".
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--device $DEVICE
When the model checkpoint is successfully quantized, the following files will be created in $OUTPUT_DIR:
config.json
model.safetensors
special_tokens_map.json
tokenizer_config.json
tokenizer.json
If the size of the model exceeds 10 GB, multiple sharded checkpoints are generated instead of a single model.safetensors, as follows:
model-00001-of-00005.safetensors
model-00002-of-00005.safetensors
model-00003-of-00005.safetensors
model-00004-of-00005.safetensors
model-00005-of-00005.safetensors
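Putting the steps together, here is a hypothetical end-to-end run that first sanity-checks that the original checkpoint loads with transformers and then quantizes it to FP8. The model name and output directory are illustrative assumptions, not requirements:
# Illustrative values; substitute your own model and paths.
export MODEL_NAME_OR_PATH="meta-llama/Llama-3.1-8B-Instruct" # Assumed example model
export OUTPUT_DIR="$PWD/llama-3.1-8b-instruct-fp8"
export QUANTIZATION_SCHEME="fp8"
export DEVICE="cuda:0"
# Optional sanity check: confirm transformers can resolve the checkpoint's config.
python -c "from transformers import AutoConfig; AutoConfig.from_pretrained('$MODEL_NAME_OR_PATH'); print('checkpoint OK')"
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--device $DEVICE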
For more information about FMO, check out this documentation.
Serving Quantized Models
Searching for the Optimal Policy
To serve quantized models efficiently, you must first run a policy search to find the optimal execution policy. Learn how to run a policy search at Running Policy Search.
Serving FP8 Models
Once you have prepared the quantized model checkpoint, you are ready to create a serving endpoint.
# Fill in the values of the following variables.
export HF_MODEL_NAME="" # Quantized model name in Hugging Face Hub or directory path of the quantized model checkpoint.
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--algo-policy-dir /policy \
--search-policy true
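Once the engine is up (and the policy search has completed), you can send a quick request to verify the endpoint. This sketch assumes the container exposes an OpenAI-compatible chat completions API on port 8000; adjust the path if your deployment differs:
# Hypothetical smoke test against the local endpoint.
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'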
Example: FriendliAI/Llama-3.1-8B-Instruct-fp8
FP8 model serving is supported only on NVIDIA Ada, Hopper, and Blackwell GPU architectures.
# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')
export POLICY_DIR=$PWD/policy # Policy directory used in the policy search above
# Make sure the policy search has been run so $POLICY_DIR contains a policy.
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
--algo-policy-dir /policy \
--search-policy true
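Once the engine reports it is ready, the same smoke-test request from the previous section applies here as well, since this container also listens on port 8000.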