Running Friendli Container

Introduction

Friendli Container lets you deploy your generative AI model on your own machine. This tutorial guides you through running a Friendli Container. The current version of Friendli Container supports most major generative language models.

Prerequisites

Before you begin, make sure you have signed up for Friendli Suite. You can use Friendli Container free of charge for 60 days.
Friendli Container currently only supports NVIDIA GPUs, so please prepare proper GPUs and a compatible driver by referring to our required CUDA compatibility guide.
Prepare a Friendli Token following this guide.
Prepare a Friendli Container Secret following this guide.

Preparing Friendli Token

A Friendli Token is the user credential for logging in to our container registry.

Sign in to Friendli Suite.
Go to User settings > Tokens and click ‘Create new token’.
Save the created token value and export it as FRIENDLI_TOKEN.

Preparing Container Secret

A container secret is a secret code used to activate Friendli Container. Pass the container secret as an environment variable when you run the container image.

Sign in to Friendli Suite.
Go to Container > Container Secrets and click ‘Create secret’.
Save the created secret value and export it as FRIENDLI_CONTAINER_SECRET.
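
Before moving on, it can help to confirm that both credentials are actually set in the current shell. The following is a minimal sketch; `require_env` is a hypothetical helper name and is not part of any Friendli tooling.

```shell
# Hypothetical helper: fail early if a required variable is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect variable lookup that works in POSIX sh as well as bash.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage: require_env FRIENDLI_TOKEN FRIENDLI_CONTAINER_SECRET
```

The function returns a non-zero status on the first missing variable, so it can guard later commands with `require_env ... && docker ...`.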

🔑 Secret Rotation

You can rotate the container secret for security reasons. When you rotate the container secret, a new secret is created and the previous secret is automatically revoked after 30 minutes.

Pulling Friendli Container Image

export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS"
export FRIENDLI_TOKEN="YOUR FRIENDLI TOKEN"
docker login registry.friendli.ai -u "$FRIENDLI_EMAIL" -p "$FRIENDLI_TOKEN"
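
Passing the token with -p can expose it in process listings, and Docker prints a warning about this. As an alternative sketch, Docker's --password-stdin flag reads the password from standard input instead; `friendli_login` is a hypothetical wrapper name used only for illustration.

```shell
# Hypothetical wrapper around docker login that feeds the token via stdin.
friendli_login() {
  printf '%s' "$FRIENDLI_TOKEN" |
    docker login registry.friendli.ai -u "$FRIENDLI_EMAIL" --password-stdin
}

# Usage (after exporting FRIENDLI_EMAIL and FRIENDLI_TOKEN): friendli_login
```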

Pull image

docker pull registry.friendli.ai/[your_repository]:[your_tag]

💰 60-Day Free Trial

During the 60-day free trial period, you can use only the registry.friendli.ai/trial image, which can be pulled with docker pull registry.friendli.ai/trial.

Running Friendli Container with Hugging Face Models

If your model is in safetensors format, which is compatible with Hugging Face transformers, you can serve the model directly with Friendli Container.

The current version of Friendli Container supports direct loading of safetensors checkpoints for the following models (and corresponding Hugging Face transformers classes):

Arctic (ArcticForCausalLM)
Baichuan (BaichuanForCausalLM)
Blenderbot (BlenderbotForConditionalGeneration)
BLOOM (BloomForCausalLM)
Cohere (CohereForCausalLM)
DBRX (DbrxForCausalLM)
DeepSeek (DeepseekForCausalLM)
EXAONE (ExaoneForCausalLM)
Falcon (FalconForCausalLM)
Gemma2 (Gemma2ForCausalLM)
Gemma (GemmaForCausalLM)
GPT2 (GPT2LMHeadModel)
GPT-J (GPTJForCausalLM)
GPT-NeoX (GPTNeoXForCausalLM)
Grok-1 (Grok1ForCausalLM)
Llama (LlamaForCausalLM)
Mistral (MistralForCausalLM)
Mixtral (MixtralForCausalLM)
MPT (MPTForCausalLM)
MT5 (MT5ForConditionalGeneration)
OPT (OPTForCausalLM)
Phi3 (Phi3ForCausalLM)
Phi (PhiForCausalLM)
Qwen2 (Qwen2ForCausalLM)
Solar (SolarForCausalLM)
StarCoder2 (Starcoder2ForCausalLM)
T5 (T5ForConditionalGeneration)

If your model does not belong to one of the above model types, please contact us for support.

Here are the instructions to run Friendli Container to serve a Hugging Face model:

# Fill in the values of the following variables.
export HF_MODEL_NAME=""  # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""  # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""  # GPUs (e.g., '"device=0,1"')

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name $HF_MODEL_NAME \
    [LAUNCH_OPTIONS]

The [LAUNCH_OPTIONS] should be replaced with Launch Options for Friendli Container.

Running the above command starts a Docker container that exposes an HTTP endpoint for handling inference requests.

Examples: Deploying the Llama 3 8B Instruct model

This example runs the Llama 3 8B Instruct model on a single GPU.

export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret (leave it if it's already set in your environment)
export HF_TOKEN=""  # Access token from Hugging Face (see the caution below)
export HF_MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION="device=0"

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -e HF_TOKEN=$HF_TOKEN \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name $HF_MODEL_NAME

Since downloading meta-llama/Meta-Llama-3-8B-Instruct is allowed only for authorized users, you need to provide your Hugging Face User Access Token through the HF_TOKEN environment variable. The same applies to all private repositories.

Multi-GPU Serving

Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference.

Tensor Parallelism

Tensor parallelism is employed when serving large models that exceed the memory capacity of a single GPU, by distributing parts of the model’s weights across multiple GPUs. To leverage tensor parallelism with the Friendli Container:

Specify multiple GPUs for $GPU_ENUMERATION (e.g., '"device=0,1,2,3"').
Use the --num-devices (or -d) option to specify the tensor parallelism degree (e.g., --num-devices 4).
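
Putting the two steps together, a 4-GPU tensor-parallel launch might look like the sketch below. It reuses the variables from the earlier docker run command; the wrapper name `run_tensor_parallel` and the choice of four GPUs are illustrative assumptions.

```shell
# Hypothetical wrapper: serve $HF_MODEL_NAME with tensor parallelism degree 4.
run_tensor_parallel() {
  # The inner double quotes are part of the --gpus value so that Docker
  # treats the comma-separated device list as a single argument.
  docker run \
    --gpus '"device=0,1,2,3"' \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
    "$FRIENDLI_CONTAINER_IMAGE" \
      --hf-model-name "$HF_MODEL_NAME" \
      --num-devices 4
}
```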

Pipeline Parallelism

Pipeline parallelism splits a model into multiple segments that are processed across different GPUs, enabling the deployment of larger models that would not otherwise fit on a single GPU. To use pipeline parallelism with Friendli Container:

Specify multiple GPUs for $GPU_ENUMERATION (e.g., '"device=0,1,2,3"').
Use the --num-workers (or -n) option to specify the pipeline parallelism degree (e.g., --num-workers 4).

🆚 Choosing between Tensor Parallelism and Pipeline Parallelism

When deploying models with Friendli Container, you can combine tensor parallelism and pipeline parallelism. We recommend exploring a balance between the two based on their distinct characteristics. Tensor parallelism involves “expensive” all-reduce operations to aggregate partial results across all devices, whereas pipeline parallelism relies on “cheaper” peer-to-peer communication. Thus, in setups with a limited interconnect, such as PCIe, pipeline parallelism is preferable. Conversely, in setups with a fast interconnect, such as NVLink, tensor parallelism is recommended for its superior parallel computation efficiency.
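
When combining the two, the GPU list has to cover every device in use. The sketch below assumes that the total GPU count equals the tensor parallelism degree multiplied by the pipeline parallelism degree, and that devices are numbered from 0. Both are assumptions made for illustration, and `gpu_enumeration` is a hypothetical helper.

```shell
# Hypothetical helper: print a --gpus value covering tp * pp devices.
gpu_enumeration() {
  tp="$1"   # tensor parallelism degree   (--num-devices)
  pp="$2"   # pipeline parallelism degree (--num-workers)
  total=$((tp * pp))
  devices="0"
  i=1
  while [ "$i" -lt "$total" ]; do
    devices="$devices,$i"
    i=$((i + 1))
  done
  # Emit the inner double quotes that Docker expects around the list.
  printf '"device=%s"\n' "$devices"
}

# Example: 2-way tensor parallelism x 2-way pipeline parallelism on 4 GPUs.
# GPU_ENUMERATION="$(gpu_enumeration 2 2)"
# docker run --gpus "$GPU_ENUMERATION" ... --num-devices 2 --num-workers 2
```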

Advanced: Serving Quantized Models

Running quantized models requires an additional step to search for an execution policy. See Serving Quantized Models to learn how to create an inference endpoint for a quantized model.

Advanced: Serving MoE Models

Running MoE (Mixture of Experts) models requires an additional step to search for an execution policy. See Serving MoE Models to learn how to create an inference endpoint for an MoE model.

Sending Inference Requests

We can now send inference requests to the running Friendli Container. For information on all parameters that can be used in an inference request, please refer to this document.

Examples
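
As a minimal sketch, the helpers below build and send a text completion request to the port published by the docker run command above. The /v1/completions route and the prompt and max_tokens fields are assumptions here, so check the API reference for the parameters your version accepts. `completion_body` and `send_completion` are hypothetical helper names.

```shell
# Assumed endpoint: adjust host, port, and route to match your deployment.
FRIENDLI_ENDPOINT="${FRIENDLI_ENDPOINT:-http://localhost:8000/v1/completions}"

# Build a minimal JSON request body. The prompt must not contain double
# quotes or backslashes, because no JSON escaping is performed here.
completion_body() {
  printf '{"prompt": "%s", "max_tokens": %d}' "$1" "$2"
}

# Send the request with curl (defined here, not invoked).
send_completion() {
  curl -sS -X POST "$FRIENDLI_ENDPOINT" \
    -H "Content-Type: application/json" \
    -d "$(completion_body "$1" "$2")"
}

# Usage: send_completion "Python is a popular" 30
```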

Options for Running Friendli Container

General Options

Options	Type	Summary	Default	Required
`--version`	-	Print Friendli Container version.	-	❌
`--help`	-	Print Friendli Container help message.	-	❌

Launch Options

Options	Type	Summary	Default	Required
`--web-server-port`	INT	Web server port.	8000	❌
`--metrics-port`	INT	Prometheus metrics export port.	8281	❌
`--hf-model-name`	TEXT	Model name hosted on the Hugging Face Models Hub or a path to a local directory containing a model. When a model name is provided, Friendli Container first checks whether the model is already cached at ~/.cache/huggingface/hub and uses it if available. If not, it downloads the model from the Hugging Face Models Hub before creating the inference endpoint. When a local path is provided, it loads the model from that location without downloading. This option is only available for models in safetensors format.	-	❌
`--tokenizer-file-path`	TEXT	Absolute path of tokenizer file. This option is not needed when `tokenizer.json` is located under the path specified at `--ckpt-path`.	-	❌
`--tokenizer-add-special-tokens`	BOOLEAN	Whether or not to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer’s `add_special_tokens` argument. The default value is false for versions < v1.6.0.	`true`	❌
`--tokenizer-skip-special-tokens`	BOOLEAN	Whether or not to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer’s `skip_special_tokens` argument.	`true`	❌
`--dtype`	CHOICE: [bf16, fp16, fp32]	Data type of weights and activations. Choose one of fp16, bf16, or fp32. This argument applies to non-quantized weights and activations. If not specified, Friendli Container follows the value of `torch_dtype` in the `config.json` file or assumes fp16.	fp16	❌
`--bad-stop-file-path`	TEXT	JSON file path that contains stop sequences or bad words/tokens.	-	❌
`--num-request-threads`	INT	Thread pool size for handling HTTP requests.	4	❌
`--timeout-microseconds`	INT	Server-side timeout for client requests, in microseconds.	0 (no timeout)	❌
`--ignore-nan-error`	BOOLEAN	If set to True, ignore NaN error. Otherwise, respond with a 400 status code if NaN values are detected while processing a request.	-	❌
`--max-batch-size`	INT	Max number of sequences that can be processed in a batch.	384	❌
`--num-devices`, `-d`	INT	Number of devices to use for tensor parallelism (i.e., tensor parallelism degree).	1	❌
`--num-workers`, `-n`	INT	Number of workers to use in a pipeline (i.e., pipeline parallelism degree).	1	❌
`--search-policy`	BOOLEAN	Searches for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at Optimizing Inference with Policy Search.	false	❌
`--terminate-after-search`	BOOLEAN	Terminates engine container after the policy search.	false	❌
`--algo-policy-dir`	TEXT	Path to directory containing the policy file. The default value is the current working directory. Learn more about policy search at Optimizing Inference with Policy Search.	current working dir	❌
`--adapter-model`	TEXT	Add an adapter model, given as <adapter_name>:<adapter_ckpt_path>. The path can also be a model name on the Hugging Face Models Hub.	-	❌
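
Because --hf-model-name also accepts a local directory, a checkpoint stored on the host can be served by mounting it into the container. The sketch below is one way to do that; the wrapper name `run_local_model` and the /model mount point are arbitrary choices for illustration, not Friendli conventions.

```shell
# Hypothetical wrapper: serve a safetensors checkpoint from a host directory.
# $1 is the host directory; /model is an arbitrary mount point in the container.
run_local_model() {
  docker run \
    --gpus '"device=0"' \
    -p 8000:8000 \
    -v "$1":/model \
    -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
    "$FRIENDLI_CONTAINER_IMAGE" \
      --hf-model-name /model
}

# Usage: run_local_model /path/to/checkpoint
```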

Model Specific Options

T5

Options	Type	Summary	Default	Required
`--max-input-length`	INT	Maximum input length.	-	✅
`--max-output-length`	INT	Maximum output length.	-	✅
