Running Friendli Container
Introduction
Friendli Containers enable you to deploy your generative AI model on your own machine. This tutorial guides you through the process of running a Friendli Container. The current version of Friendli Containers supports most major generative language models.
Prerequisites
- Before you begin, make sure you have signed up for Friendli Suite. You can use Friendli Containers free of charge for four weeks.
- Friendli Container currently only supports NVIDIA GPUs, so please prepare proper GPUs and a compatible driver by referring to our required CUDA compatibility guide.
- Prepare a Personal Access Token following this guide.
- Prepare a Friendli Container Secret following this guide.
Preparing Personal Access Token
A Personal Access Token (PAT) is the user credential for logging in to our container registry.
- Sign in to Friendli Suite.
- Go to User Settings > Tokens and click 'Create new token'.
- Save your created token value.
Preparing Container Secret
A container secret is a code used to activate Friendli Container. Pass the container secret to the container as an environment variable when you run the image.
- Sign in to Friendli Suite.
- Go to Container > Container Secrets and click 'Create secret'.
- Save your created secret value.
🔑 Secret Rotation
You can rotate the container secret for security reasons. When you rotate the container secret, a new secret is created and the previous secret is revoked automatically after 30 minutes.
Pulling Friendli Container Image
Log in to the Docker client using the personal access token created as outlined in Preparing Personal Access Token.
export FRIENDLI_PAT="YOUR PERSONAL ACCESS TOKEN"
docker login registry.friendli.ai -u $YOUR_EMAIL -p $FRIENDLI_PAT
Pull image
docker pull registry.friendli.ai/[your_repository]:[your_tag]
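The login command above passes the token with -p, which can leave it in shell history and in the process list. As a hedged alternative, the sketch below feeds the token to docker login through standard input using Docker's --password-stdin flag. The helper names build_login_command and docker_login are illustrative and not part of any Friendli tooling, and the sketch assumes the docker CLI is available on PATH.

```python
import subprocess

REGISTRY = "registry.friendli.ai"


def build_login_command(email: str, registry: str = REGISTRY) -> list[str]:
    """Return the argv for a `docker login` that reads the token from stdin."""
    return ["docker", "login", registry, "-u", email, "--password-stdin"]


def docker_login(email: str, token: str) -> None:
    """Log in to the registry, passing the personal access token on stdin."""
    subprocess.run(
        build_login_command(email),
        input=token.encode("utf-8"),  # the token never appears in the argv
        check=True,                   # raise CalledProcessError if login fails
    )
```

A call such as docker_login(os.environ["YOUR_EMAIL"], os.environ["FRIENDLI_PAT"]) would reuse the variables exported above (it needs import os as well).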
💰 4-Week Free Trial
During the 4-week free trial period, you can use only the registry.friendli.ai/trial image, which can be pulled with docker pull registry.friendli.ai/trial.
Running Friendli Container with Hugging Face Models
If your model is in the safetensors format, which is compatible with Hugging Face transformers, you can serve the model directly with Friendli Containers.
The current version of Friendli Containers supports direct loading of safetensors checkpoints for the following models (and corresponding Hugging Face transformers classes):
- GPT (GPT2LMHeadModel)
- GPT-J (GPTJForCausalLM)
- MPT (MPTForCausalLM)
- OPT (OPTForCausalLM)
- BLOOM (BloomForCausalLM)
- GPT-NeoX (GPTNeoXForCausalLM)
- Llama (LlamaForCausalLM)
- Falcon (FalconForCausalLM)
- Mistral (MistralForCausalLM)
- Mixtral (MixtralForCausalLM)
- Qwen2 (Qwen2ForCausalLM)
- Gemma (GemmaForCausalLM)
- Starcoder2 (Starcoder2ForCausalLM)
- Cohere (CohereForCausalLM)
- DBRX (DbrxForCausalLM)
If your model does not belong to one of the above model types, please ask for support by sending an email to Support.
Here are the instructions to run Friendli Container to serve a Hugging Face model:
# Fill in the values of the following variables.
export HF_MODEL_NAME="" # Hugging Face model name (e.g., "meta-llama/Llama-2-7b-chat-hf")
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
[LAUNCH_OPTIONS]
The [LAUNCH_OPTIONS] should be replaced with Launch Options for Friendli Container.
Running the above command starts a Docker container that exposes an HTTP endpoint for handling inference requests.
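When the container is launched from a script rather than an interactive shell, assembling the argument list programmatically avoids the nested quoting that --gpus requires. The following is a minimal sketch that mirrors the command above. The function build_run_command is hypothetical (it is not part of Friendli's tooling) and only assembles the arguments shown in this section.

```python
import os
import shlex


def build_run_command(image, secret, hf_model_name,
                      gpus='"device=0"', launch_options=()):
    """Assemble the `docker run` argv shown in this section.

    `gpus` keeps its inner double quotes because Docker expects them when the
    device list contains commas (e.g. '"device=0,1"').
    """
    hf_cache = os.path.expanduser("~/.cache/huggingface")
    return [
        "docker", "run",
        "--gpus", gpus,
        "-p", "8000:8000",
        "-v", f"{hf_cache}:/root/.cache/huggingface",
        "-e", f"FRIENDLI_CONTAINER_SECRET={secret}",
        image,
        "--hf-model-name", hf_model_name,
        *launch_options,  # any launch options, e.g. ["--num-devices", "2"]
    ]


command = build_run_command(
    image="registry.friendli.ai/trial",
    secret="YOUR CONTAINER SECRET",
    hf_model_name="meta-llama/Llama-2-7b-chat-hf",
    gpus='"device=0,1"',
    launch_options=["--num-devices", "2"],
)
# Print a shell-quoted form for inspection before running it.
print(shlex.join(command))
```

To start the container, the list can be passed to subprocess.run(command, check=True).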
Examples: Llama 2 7B Chat
This example runs the Llama 2 7B Chat model on a single GPU.
docker run \
--gpus '"device=0"' \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \
-e HF_TOKEN="YOUR HUGGING FACE TOKEN" \
registry.friendli.ai/trial \
--hf-model-name meta-llama/Llama-2-7b-chat-hf
Since downloading meta-llama/Llama-2-7b-chat-hf is allowed only for authorized users, you need to provide your Hugging Face User Access Token through the HF_TOKEN environment variable.
The same applies to all private repositories.
Multi-GPU Serving
Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference.
Tensor Parallelism
Tensor parallelism is employed when serving large models that exceed the memory capacity of a single GPU, by distributing parts of the model's weights across multiple GPUs. To leverage tensor parallelism with the Friendli Container:
- Specify multiple GPUs for $GPU_ENUMERATION (e.g., '"device=0,1,2,3"').
- Use the --num-devices (or -d) option to specify the tensor parallelism degree (e.g., --num-devices 4).
Pipeline Parallelism
Pipeline parallelism splits a model into multiple segments that are processed on different GPUs, enabling the deployment of larger models that would not otherwise fit on a single GPU. To exploit pipeline parallelism with the Friendli Container:
- Specify multiple GPUs for $GPU_ENUMERATION (e.g., '"device=0,1,2,3"').
- Use the --num-workers (or -n) option to specify the pipeline parallelism degree (e.g., --num-workers 4).
🆚 Choosing between Tensor Parallelism and Pipeline Parallelism
When deploying models with the Friendli Container, you have the flexibility to combine tensor parallelism and pipeline parallelism. We recommend exploring a balance between the two, based on their distinct characteristics. While tensor parallelism involves "expensive" all-reduce operations to aggregate partial results across all devices, pipeline parallelism relies on "cheaper" peer-to-peer communication. Thus, in a setup with limited interconnect bandwidth, such as PCIe, pipeline parallelism is preferable. Conversely, in a high-bandwidth setup such as NVLink, tensor parallelism is recommended due to its superior parallel computation efficiency.
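To make the combination concrete, the sketch below derives the --gpus value and the launch options for a given pair of degrees. It rests on an assumption not stated in this guide: that a combined deployment uses tensor degree × pipeline degree GPUs in total. This matches the single-mode examples above (4 GPUs for --num-devices 4 or for --num-workers 4), but should be confirmed before relying on it. The helper name parallel_launch_args is hypothetical.

```python
def parallel_launch_args(tensor_degree: int, pipeline_degree: int,
                         first_gpu: int = 0) -> tuple[str, list[str]]:
    """Return (--gpus value, launch options) for the requested degrees.

    ASSUMPTION: the total GPU count is tensor_degree * pipeline_degree.
    """
    if tensor_degree < 1 or pipeline_degree < 1:
        raise ValueError("parallelism degrees must be positive integers")

    gpu_count = tensor_degree * pipeline_degree  # assumed, see docstring
    device_ids = ",".join(str(first_gpu + i) for i in range(gpu_count))
    gpus = f'"device={device_ids}"'  # inner quotes are required by Docker

    options = [
        "--num-devices", str(tensor_degree),    # tensor parallelism degree
        "--num-workers", str(pipeline_degree),  # pipeline parallelism degree
    ]
    return gpus, options


# Example: 2-way tensor parallelism combined with 2-way pipeline parallelism.
gpus, options = parallel_launch_args(tensor_degree=2, pipeline_degree=2)
print("--gpus", gpus, *options)
```

On a PCIe-only machine one might shift the balance toward the pipeline degree (for example 1 × 4), and on an NVLink machine toward the tensor degree (4 × 1), following the guidance above.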
Advanced: Serving AWQ Models
Running quantized models requires an additional step to search for an execution policy. See Serving AWQ Models to learn how to create an inference endpoint for an AWQ model.
Advanced: Serving MoE Models
Running MoE (Mixture of Experts) models requires an additional step to search for an execution policy. See Serving MoE Models to learn how to create an inference endpoint for an MoE model.
Sending Inference Requests
We can now send inference requests to the running Friendli Container. For information on all parameters that can be used in an inference request, please refer to this document.
Examples
cURL:
curl -X POST http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Python is a popular", "max_tokens": 30, "stream": true}'
Python SDK:
from friendli import Friendli

client = Friendli(base_url="http://0.0.0.0:8000")
stream = client.completions.create(
    prompt="Python is a popular",
    max_tokens=30,
    stream=True,
)
for chunk in stream:
    print(chunk.text, end="", flush=True)
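If the SDK is not installed, the same request can be sent with the Python standard library alone. The sketch below posts to the /v1/completions endpoint used in the cURL example, with "stream" set to false so that the reply arrives as a single body. It assumes that the endpoint accepts "stream": false and answers with one JSON document. The response schema is not described in this guide, so the parsed JSON is returned as-is. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://0.0.0.0:8000"  # endpoint published by `-p 8000:8000`


def build_completion_request(prompt: str, max_tokens: int = 30,
                             base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for the completions endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "stream": False}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(prompt: str, max_tokens: int = 30, timeout: float = 60.0):
    """Send the request and return the parsed JSON body unchanged."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, print(send_completion("Python is a popular")) shows the raw response, which is a convenient way to inspect the fields before writing code against them.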
Options for Running Friendli Container
General Options
Options | Type | Summary | Default | Required |
---|---|---|---|---|
--version | - | Print Friendli Container version. | - | ❌ |
--help | - | Print Friendli Container help message. | - | ❌ |
Launch Options
Options | Type | Summary | Default | Required |
---|---|---|---|---|
--web-server-port | INT | Web server port. | 8000 | ❌ |
--metrics-port | INT | Prometheus metrics export port. | 8281 | ❌ |
--hf-model-name | TEXT | Hugging Face model name to serve (e.g., meta-llama/Llama-2-7b-chat-hf ). Friendli Container will download the model from the Hugging Face Models Hub before creating the inference endpoint. This option is available only for models with safetensors format. | - | ❌ |
--ckpt-path | TEXT | Absolute path of model checkpoint. If not specified, use uninitialized (garbage) values for model parameters. | - | ❌ |
--ckpt-type | TEXT | Checkpoint file type. Choose one of {hdf5|safetensors|hf_safetensors}. If not specified, guess the type from the filename extension of the ckpt, or use HDF5. | hdf5 | ❌ |
--tokenizer-file-path | TEXT | Absolute path of tokenizer file. This option is not needed when tokenizer.json is located under the path specified at --ckpt-path . | - | ❌ |
--tokenizer-add-special-tokens | BOOLEAN | Whether or not to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer's add_special_tokens argument. | false | ❌ |
--tokenizer-skip-special-tokens | BOOLEAN | Whether or not to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer's skip_special_tokens argument. | true | ❌ |
--dtype | CHOICE: [bf16, fp16, fp32] | Checkpoint data type. Choose one of {fp16|bf16|fp32}. Must be equal to the data type of the model checkpoint being used. This option is not needed when config.json is located under the path specified at --ckpt-path . | fp16 | ❌ |
--bad-stop-file-path | TEXT | JSON file path that contains stop sequences or bad words/tokens. | - | ❌ |
--num-request-threads | INT | Thread pool size for handling HTTP requests. | 4 | ❌ |
--timeout-microseconds | INT | Server-side timeout for client requests, in microseconds. | 0 (no timeout) | ❌ |
--ignore-nan-error | BOOLEAN | If set to True, ignore NaN error. Otherwise, respond with a 400 status code if NaN values are detected while processing a request. | - | ❌ |
--max-batch-size | INT | Max number of sequences that can be processed in a batch. | 384 | ❌ |
--num-devices , -d | INT | Number of devices to use for tensor parallelism (i.e., tensor parallelism degree). | 1 | ❌ |
--num-workers , -n | INT | Number of workers to use in a pipeline (i.e., pipeline parallelism degree). | 1 | ❌ |
--search-policy | BOOLEAN | Search for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at Optimizing Inference with Policy Search. | - | ❌ |
--algo-policy-dir | TEXT | Path to directory containing the policy file. Learn more about policy search at Optimizing Inference with Policy Search. | - | ❌ |