Chat completion

POST

chat

completions

object

model

string

messages

array

frequency_penalty

number

presence_penalty

number

repetition_penalty

number

max_tokens

integer

min_tokens

integer

stop

array

stream

boolean

stream_options

object

include_usage

boolean

temperature

number

top_p

number

top_k

integer

timeout_microseconds

integer

seed

array

eos_token

array

tools

array

tool_choice

string · string

parallel_tool_calls

boolean

response_format

object

type

enum<string>

schema

string

See available models at this pricing table.

To successfully run an inference request, it is mandatory to enter a Friendli Token (e.g. flp_XXX) value in the Bearer Token field. Refer to the authentication section on our introduction page to learn how to acquire this variable and visit here to generate your token.

You can explore examples on the Friendli Serverless Endpoints playground and adjust settings with just a few clicks.

Authorizations

Authorization

string

headerrequired

When using Friendli Endpoints API for inference requests, you need to provide a Friendli Token for authentication and authorization purposes.

For more detailed information, please refer here.

Headers

X-Friendli-Team

string

ID of team to run requests as (optional parameter).

Body

application/json

model

string

required

Code of the model to use. See available model list.

messages

object[]

required

A list of messages comprising the conversation so far.

frequency_penalty

number | null

Number between -2.0 and 2.0. Positive values penalizes tokens that have been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim.

presence_penalty

number | null

Number between -2.0 and 2.0. Positive values penalizes tokens that have been sampled at least once in the existing text.

repetition_penalty

number | null

Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be greater than or equal to 1.0 (1.0 means no penalty). See keskar et al., 2019 for more details. This is similar to Hugging Face's repetition_penalty argument.

max_tokens

integer | null

The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus max_tokens should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, max_tokens should not exceed the model's maximum output length. This is similar to Hugging Face's max_new_tokens argument.

min_tokens

integer | null

default: 0

The minimum number of tokens to generate. Default value is 0. This is similar to Hugging Face's min_new_tokens argument.

This field is unsupported when tools are specified.

integer | null

default: 1

The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's num_return_sequences argument.

stop

string[] | null

When one of the stop phrases appears in the generation result, the API will stop generation. The stop phrases are excluded from the result. Defaults to empty list.

stream

boolean | null

Whether to stream generation result. When set true, each token will be sent as server-sent events once generated.

stream_options

object | null

Options related to stream. It can only be used when stream: true.

temperature

number | null

default: 1

Sampling temperature. Smaller temperature makes the generation result closer to greedy, argmax (i.e., top_k = 1) sampling. Defaults to 1.0. This is similar to Hugging Face's temperature argument.

top_p

number | null

default: 1

Tokens comprising the top top_p probability mass are kept for sampling. Numbers between 0.0 (exclusive) and 1.0 (inclusive) are allowed. Defaults to 1.0. This is similar to Hugging Face's top_p argument.

top_k

integer | null

default: 0

The number of highest probability tokens to keep for sampling. Numbers between 0 and the vocab size of the model (both inclusive) are allowed. The default value is 0, which means that the API does not apply top-k filtering. This is similar to Hugging Face's top_k argument.

timeout_microseconds

integer | null

Request timeout. Gives the HTTP 429 Too Many Requests response status code. Default behavior is no timeout.

seed

integer[] | null

Seed to control random procedure. If nothing is given, random seed is used for sampling, and return the seed along with the generated result. When using the n argument, you can pass a list of seed values to control all of the independent generations.

eos_token

integer[] | null

A list of endpoint sentence tokens.

tools

object[] | null

A list of tools the model may call. Currently, only functions are supported as a tool. A maximum of 128 functions is supported. Use this to provide a list of functions the model may generate JSON inputs for.

When tools are specified, min_tokens field is unsupported.

tool_choice

Determines the tool calling behavior of the model. When set to none, the model will bypass tool execution and generate a response directly. In auto mode (the default), the model dynamically decides whether to call a tool or respond with a message. Alternatively, setting required ensures that the model invokes at least one tool before responding to the user. You can also specify a particular tool by {"type": "function", "function": {"name": "my_function"}}.

parallel_tool_calls

boolean | null

Whether to enable parallel function calling.

response_format

object | null

The enforced format of the model's output.

Note that the content of the output message may be truncated if it exceeds the max_tokens. You can check this by verifying that the finish_reason of the output message is length.

Important You must explicitly instruct the model to produce the desired output format using a system prompt or user message (e.g., You are an API generating a valid JSON as output.). Otherwise, the model may result in an unending stream of whitespace or other characters.

Response

200 - application/json

choices

object[]

usage

object

created

integer

The Unix timestamp (in seconds) for when the generation completed.

Was this page helpful?

Overview Completion

API Reference

Inference

Serverless

Authorizations

Headers

Body

Response