Text completions

POST 

/v1/completions

Generate text based on the given text prompt. See the available models and their pricing in the pricing table.
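
A minimal request sketch in Python, assuming the requests library, a bearer token in the FRIENDLI_TOKEN environment variable, a hypothetical model code, and https://api.friendli.ai as the host (check your dashboard for the exact base URL):

```python
import os
import requests

resp = requests.post(
    "https://api.friendli.ai/v1/completions",  # assumed host; the path is /v1/completions
    headers={"Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}"},
    json={
        "model": "meta-llama-3.1-8b-instruct",  # hypothetical model code
        "prompt": "Say this is a test.",
        "max_tokens": 32,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```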

Request

Header Parameters

    X-Friendli-Team string

    ID of the team to run the request as.

Body

    oneOf
    prompt string required

    The prompt (i.e., input text) to generate a completion for. Either the prompt or tokens field is required.

    model string required

    Code of the model to use. See the available model list.

    stream boolean nullable

    Whether to stream the generation result. When set to true, each token is sent as a server-sent event as soon as it is generated. Not supported when using beam search.
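
    A sketch of consuming the text/event-stream response when stream is true, under the same assumptions as the request example above. The [DONE] terminator is a common SSE convention and an assumption here; inspect a raw event stream to confirm the exact payload shape:

```python
import json
import os
import requests

with requests.post(
    "https://api.friendli.ai/v1/completions",  # assumed host
    headers={"Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}"},
    json={"model": "meta-llama-3.1-8b-instruct", "prompt": "Hello,", "stream": True},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line.startswith(b"data: "):
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":  # assumed stream terminator; confirm for this API
                break
            print(json.loads(payload))
```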

    timeout_microseconds integer nullable

    Request timeout in microseconds. When the timeout elapses, the request fails with the HTTP 429 Too Many Requests status code. Default behavior is no timeout.

    max_tokens integer nullable

    The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus max_tokens should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, max_tokens should not exceed the model's maximum output length. This is similar to Hugging Face's max_new_tokens argument.

    max_total_tokens integer nullable

    The maximum number of tokens including both the generated result and the input tokens. Only allowed for decoder-only models. Only one argument between max_tokens and max_total_tokens is allowed. Default value is the model's maximum length. This is similar to Hugging Face's max_length argument.
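
    A short sketch of the mutual exclusivity described above; the values are illustrative:

```python
# Valid: cap only the newly generated tokens.
body_a = {"model": "...", "prompt": "...", "max_tokens": 128}

# Valid: cap input plus output together (decoder-only models).
body_b = {"model": "...", "prompt": "...", "max_total_tokens": 2048}

# Invalid: max_tokens and max_total_tokens must not be combined.
# body_bad = {"model": "...", "prompt": "...", "max_tokens": 128, "max_total_tokens": 2048}
```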

    min_tokens integer nullable

    The minimum number of tokens to generate. Defaults to 0. This is similar to Hugging Face's min_new_tokens argument.

    min_total_tokens integer nullable

    The minimum number of tokens including both the generated result and the input tokens. Only allowed for decoder-only models. Only one argument between min_tokens and min_total_tokens is allowed. This is similar to Hugging Face's min_length argument.

    n integer nullable

    The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's num_return_sequences argument.

    num_beams integer nullable

    Number of beams for beam search. Numbers between 1 and 31 (both inclusive) are allowed. Default behavior is no beam search. This is similar to Hugging Face's num_beams argument.

    length_penalty number nullable

    Coefficient for exponential length penalty that is used with beam search. Only allowed for beam search. Defaults to 1.0. This is similar to Hugging Face's length_penalty argument.

    early_stopping boolean nullable

    Whether to stop the beam search when at least num_beams beams are finished with the EOS token. Only allowed for beam search. Defaults to false. This is similar to Hugging Face's early_stopping argument.
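
    A hedged sketch of a beam-search request combining the three parameters above; note that stream and n are not supported together with beam search (see their descriptions):

```python
body = {
    "model": "meta-llama-3.1-8b-instruct",  # hypothetical model code
    "prompt": "Translate to French: Hello, world.",
    "num_beams": 4,          # between 1 and 31, inclusive
    "length_penalty": 1.2,   # exponential length penalty coefficient
    "early_stopping": True,  # stop once num_beams beams end with the EOS token
    "max_tokens": 64,
}
```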

    no_repeat_ngram integer nullable

    If this exceeds 1, every ngram of that size can only occur once among the generated result (plus the input tokens for decoder-only models). 1 means that this mechanism is disabled (i.e., you cannot prevent 1-gram from being generated repeatedly). Defaults to 1. This is similar to Hugging Face's no_repeat_ngram_size argument.

    encoder_no_repeat_ngram integer nullable

    If this exceeds 1, every ngram of that size occurring in the input token sequence cannot appear in the generated result. 1 means that this mechanism is disabled (i.e., you cannot prevent 1-gram from being generated repeatedly). Only allowed for encoder-decoder models. Defaults to 1. This is similar to Hugging Face's encoder_no_repeat_ngram_size argument.

    repetition_penalty number nullable

    Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be greater than or equal to 1.0; 1.0 means no penalty. See Keskar et al., 2019 for more details. This is similar to Hugging Face's repetition_penalty argument.

    encoder_repetition_penalty number nullable

    Penalizes tokens that have already appeared in the input tokens. Should be greater than or equal to 1.0; 1.0 means no penalty. Only allowed for encoder-decoder models. See Keskar et al., 2019 for more details. This is similar to Hugging Face's encoder_repetition_penalty argument.

    frequency_penalty number nullable

    Number between -2.0 and 2.0. Positive values penalize tokens that have already been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim.

    presence_penalty number nullable

    Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled at least once in the existing text.

    temperature number nullable

    Sampling temperature. A smaller temperature makes the generation result closer to greedy, argmax (i.e., top_k = 1) sampling. Defaults to 1.0. This is similar to Hugging Face's temperature argument.

    top_k integer nullable

    The number of highest probability tokens to keep for sampling. Numbers between 0 and the vocab size of the model (both inclusive) are allowed. The default value is 0, which means that the API does not apply top-k filtering. This is similar to Hugging Face's top_k argument.

    top_p number nullable

    Tokens comprising the top top_p probability mass are kept for sampling. Numbers between 0.0 (exclusive) and 1.0 (inclusive) are allowed. Defaults to 1.0. This is similar to Hugging Face's top_p argument.
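
    A sketch of a sampling-oriented request tying the parameters above together; the values are illustrative, not recommendations:

```python
body = {
    "model": "meta-llama-3.1-8b-instruct",  # hypothetical model code
    "prompt": "Write a haiku about the sea.",
    "temperature": 0.7,        # below 1.0 sharpens the distribution toward greedy
    "top_k": 50,               # keep only the 50 most probable tokens
    "top_p": 0.9,              # keep the top 90% of probability mass
    "frequency_penalty": 0.5,  # discourage frequently repeated tokens
    "presence_penalty": 0.2,   # discourage any already-sampled token
    "n": 3,                    # three independent samples (not with beam search)
    "seed": [1, 2, 3],         # one seed per sample (see the seed field below)
}
```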

    stop string[] nullable

    When one of the stop phrases appears in the generation result, the API will stop generation. The stop phrases are excluded from the result. This option is incompatible with beam search (specified by num_beams); use stop_tokens in that case instead. Defaults to an empty list.

    stop_tokens object[] nullable

    Stop generating further tokens when the generated tokens match any of the given token sequences. If beam search is enabled, all of the active beams should contain the stop token sequence for generation to terminate. A payload sketch follows the array schema below.

  • Array [
  • tokens integer[]

    A list of token IDs.

  • ]
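
    A payload sketch contrasting the two stop mechanisms; the token IDs are placeholders, not real vocabulary entries:

```python
# stop: text phrases; incompatible with beam search.
body_text_stop = {"model": "...", "prompt": "...", "stop": ["\n\n", "END"]}

# stop_tokens: token-ID sequences; works with beam search.
body_token_stop = {
    "model": "...",
    "prompt": "...",
    "num_beams": 4,
    "stop_tokens": [{"tokens": [13, 13]}],  # hypothetical token IDs
}
```
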
    seed integer[] nullable

    Seed to control the random procedure. If nothing is given, the API generates a seed randomly, uses it for sampling, and returns it along with the generated result. When using the n argument, you can pass a list of seed values to control all of the independent generations.

    token_index_to_replace integer[] nullable

    A list of token indices at which to replace the embeddings of the input tokens provided via either tokens or prompt.

    embedding_to_replace number[] nullable

    A list of flattened embedding vectors used to replace the embeddings of the tokens at the indices specified via token_index_to_replace.
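
    A hedged sketch of embedding replacement. The hidden size H is an assumption that depends on the model; because embedding_to_replace is flattened, replacing two positions requires 2 * H numbers:

```python
H = 4096  # assumed hidden size of the target model
body = {
    "model": "...",
    "prompt": "...",
    "token_index_to_replace": [4, 6],         # input positions to overwrite
    "embedding_to_replace": [0.0] * (2 * H),  # two flattened H-dimensional vectors
}
```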

    beam_search_type string nullable

    Which beam search type to use. One of DETERMINISTIC, NAIVE_SAMPLING, and STOCHASTIC. DETERMINISTIC means the standard, deterministic beam search, which is similar to Hugging Face's beam_search. Arguments for controlling random sampling, such as top_k and top_p, are not allowed for this option. NAIVE_SAMPLING is similar to Hugging Face's beam_sample. STOCHASTIC means stochastic beam search (see Kool et al. (2019) for more details). This option is ignored if num_beams is not provided. Defaults to DETERMINISTIC.

    beam_compat_pre_normalization boolean nullable
    beam_compat_no_post_normalization boolean nullable
    bad_words string[] nullable

    Text phrases that should not be generated. For a bad word phrase that contains N tokens, if the first N-1 tokens appear at the end of the generated result, the logit for the last token of the phrase is set to -inf. Before checking whether a bad word is included in the result, the word is converted into tokens. We recommend using bad_word_tokens because it is less ambiguous: for example, after tokenization, the phrases "clear" and " clear" can result in different token sequences due to the prepended space character. Defaults to an empty list.

    bad_word_tokens object[] nullable

    Same as the above bad_words field, but receives token sequences instead of text phrases. This is similar to Hugging Face's bad_words_ids argument. A payload sketch follows the array schema below.

  • Array [
  • tokens integer[]

    A list of token IDs.

  • ]
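
    A payload sketch for both fields; the token IDs are placeholders:

```python
body = {
    "model": "...",
    "prompt": "...",
    "bad_words": [" clear"],                  # the leading space changes tokenization
    "bad_word_tokens": [{"tokens": [3013]}],  # hypothetical token IDs
}
```
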
    include_output_logits boolean nullable

    Whether to include the output logits in the generation output.

    include_output_logprobs boolean nullable

    Whether to include the output logprobs in the generation output.

    forced_output_tokens integer[] nullable

    A token sequence that is enforced as the generation output. This option can be used when evaluating the model on datasets with multiple-choice problems (e.g., HellaSwag, MMLU). Use this option with include_output_logprobs to get the logprobs needed for the evaluation.
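
    A hedged sketch of the evaluation pattern described above: force one answer choice as the output and read back its logprobs to score it:

```python
body = {
    "model": "...",
    "prompt": "Question: ...\nAnswer:",
    "forced_output_tokens": [319, 29889],  # hypothetical token IDs of one answer choice
    "include_output_logprobs": True,       # sum the returned logprobs to score the choice
}
```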

    eos_token integer[] nullable

    A list of end-of-sentence (EOS) tokens.

Responses

Successfully generated completions. When streaming mode is used (i.e., stream option is set to true), the response is in MIME type text/event-stream. Otherwise, the content type is application/json.

Schema
    choices object[]
  • Array [
  • index integer

    The index of the choice in the list of generated choices.

    seed integer

    Random seed used for the generation.

    text string

    Generated text output.

    tokens integer[]

    Generated output tokens.

  • ]
    usage object
    prompt_tokens integer

    Number of tokens in the prompt.

    completion_tokens integer

    Number of tokens in the generated completion.

    total_tokens integer

    Total number of tokens used in the request (prompt_tokens + completion_tokens).
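
A sketch of reading the non-streaming JSON body described by the schema above, reusing the resp object from the first request example:

```python
data = resp.json()
for choice in data["choices"]:
    print(choice["index"], repr(choice["text"]), "seed:", choice["seed"])

usage = data["usage"]
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```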
