This guide walks you through running a gRPC inference server with Friendli Container and interacting with it through the friendli-client SDK.


Install friendli-client to use the gRPC client SDK:

pip install friendli-client

Ensure you have the friendli-client SDK version 1.4.1 or higher installed.
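The version requirement can be checked from Python before you start. The following is a minimal sketch that uses only the standard library; the helper names `parse_version` and `meets_minimum` are made up for this illustration, and the simple parser only understands plain dotted numeric versions.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (1, 4, 1)  # minimum friendli-client version stated in this guide


def parse_version(text):
    """Turn '1.4.1' into (1, 4, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_VERSION):
    """Return True when the installed version string is at least `minimum`."""
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    try:
        installed = version("friendli-client")  # PyPI distribution name
    except PackageNotFoundError:
        print("friendli-client is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old, please upgrade"
        print(f"friendli-client {installed}: {status}")
```

If the check reports an older version, `pip install --upgrade friendli-client` should bring it up to date.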

Starting the Friendli Container with gRPC

To run the Friendli Container with a gRPC server for completions, add the --grpc true option to the command arguments. The server supports response-streaming gRPC, and you can send requests to it with the friendli-client SDK. To start the Friendli Container with gRPC support, use the following command:

# Fill in the values of the following variables.
export HF_MODEL_NAME=""  # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""  # Friendli container image (e.g., "")
export GPU_ENUMERATION=""  # GPUs (e.g., '"device=0,1"')

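docker run \
  -p 8000:8000 \
  --gpus $GPU_ENUMERATION \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name $HF_MODEL_NAME \
    --grpc true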

You can change the server port with the --web-server-port argument.

Sending Requests with the Client SDK

Here is how to use the friendli-client SDK to interact with the gRPC server. This example assumes that the gRPC server is running on
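A minimal sketch of such a request follows. It assumes, for illustration only, that the server is reachable at `localhost:8000` (matching the `-p 8000:8000` mapping in the docker command), and that the client is constructed with `base_url` and `use_grpc=True` and yields chunks with a `.text` attribute; verify these details against the friendli-client version you have installed. The helper names `stream_completion` and `main` are hypothetical.

```python
def stream_completion(client, prompt):
    """Send one streaming completion request and return the concatenated text.

    `client` is expected to behave like a friendli-client `Friendli` instance
    created with gRPC enabled.
    """
    # The gRPC server is response-streaming, so the request asks for a stream.
    stream = client.completions.create(prompt=prompt, stream=True, top_k=1)
    pieces = []
    for chunk in stream:
        print(chunk.text, end="", flush=True)  # show tokens as they arrive
        pieces.append(chunk.text)
    return "".join(pieces)


def main():
    # Imported here so the module can be loaded without the SDK installed.
    from friendli import Friendli

    # Assumed address: adjust it to wherever your container is listening.
    client = Friendli(base_url="localhost:8000", use_grpc=True)
    stream_completion(client, "Explain what gRPC is.")
```

Call `main()` once the container is running. Keeping the request logic in a function that takes the client as an argument makes it easy to reuse with `AsyncFriendli`-style variants or to test with a stand-in client.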

Properly Closing the Client

By default, the library closes the underlying HTTP and gRPC connections when the client is garbage-collected. You can also close a Friendli or AsyncFriendli client manually with the .close() method, or use a context manager so that the client is closed when the with block exits.
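The two closing styles can be sketched as follows. This is an illustration under assumptions rather than a definitive pattern: the server address `localhost:8000`, the `use_grpc=True` constructor argument, and the chunk `.text` attribute should be checked against your installed SDK, and the helper names are hypothetical.

```python
def make_grpc_client():
    """Create a gRPC-enabled client (assumed address and constructor arguments)."""
    from friendli import Friendli  # imported lazily; requires friendli-client

    return Friendli(base_url="localhost:8000", use_grpc=True)


def complete_with_context_manager(client_factory, prompt):
    """Close the client automatically when the with block exits."""
    with client_factory() as client:
        stream = client.completions.create(prompt=prompt, stream=True)
        return "".join(chunk.text for chunk in stream)


def complete_with_manual_close(client_factory, prompt):
    """Close the client explicitly, even if the request raises."""
    client = client_factory()
    try:
        stream = client.completions.create(prompt=prompt, stream=True)
        return "".join(chunk.text for chunk in stream)
    finally:
        client.close()
```

For example, `complete_with_context_manager(make_grpc_client, "Hello")` opens a client, consumes the stream, and closes the client before returning. The try/finally form is the manual equivalent and is useful when the client's lifetime does not fit a single block.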