
Document Retrieval


Retrieval-Augmented Generation (RAG) combines a retrieval model with a generation model. In a RAG system, a retrieval module searches a database of documents for passages relevant to the input query. The retrieved passages are then passed to a generation model as context, and the model produces a response from both the query and that context. Because the model can draw on the knowledge in the retrieved passages, its responses tend to be more accurate and informative.
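The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration, not how Friendli Suite works internally: `retrieve` ranks passages by simple word overlap instead of embeddings, and `tokenize`, `build_prompt`, and the sample `passages` are hypothetical names and data introduced only for this sketch.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: list[str], k: int) -> list[str]:
    """Toy retriever: rank passages by how many query words they share."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & tokenize(p)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Place the retrieved passages before the question to condition the generator."""
    context_block = "\n".join(contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


# Made-up passages standing in for a document database.
passages = [
    "Python is a multi-paradigm programming language.",
    "Orca is a serving system for generative models.",
    "The capital of France is Paris.",
]
top = retrieve("What is Python?", passages, k=1)
prompt = build_prompt("What is Python?", top)
```

In a real system the resulting `prompt` would be sent to a generation model. The sketch stops before that step because the generator is independent of how the context was retrieved.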

Document Retrieval in Friendli Suite

In Friendli Suite, you can upload your data as PDF or TXT files and retrieve relevant contexts to feed into RAG models. When you upload data files, they are parsed, embedded, and stored in our vector database.
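To make "embedded and stored in a vector database" more concrete, here is a minimal sketch of similarity search over stored vectors. It assumes cosine similarity and hand-written three-dimensional vectors. The source does not say which embedding model or similarity metric Friendli Suite uses, and `cosine_similarity`, `top_k`, and `store` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    store: list[tuple[str, list[float]]],
    k: int,
) -> list[tuple[str, float]]:
    """Score every stored chunk against the query and return the k best."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up chunks and vectors; a real embedding has hundreds of dimensions.
store = [
    ("chunk about Python", [1.0, 0.0, 0.0]),
    ("chunk about Orca", [0.0, 1.0, 0.0]),
    ("chunk about both", [1.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
```

Each result pairs a chunk with its score, which mirrors the idea of returning contexts together with similarity scores. A production vector database would typically use an approximate index rather than scoring every stored vector.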

To retrieve contexts, you can query our REST API, which returns similar contexts from the data source along with their similarity scores. Here's an example query:

curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${USER_PAT}" \
  -d '{"query": "What is python?", "k": 3, "document_ids": [...]}'

The API response will contain an array of relevant contexts, each with a similarity score:

{
  "results": [
    {
      "content": "Python\nParadigm Multi-paradigm...",
      "score": 0.91734
    }
  ]
}

You can then use the retrieved contents to send inference requests.
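As a sketch of that step, the snippet below parses a response body with a `results` array of `content` and `score` fields and keeps only the contexts whose score clears a threshold. The `select_contexts` helper and the `min_score` cutoff are illustrative assumptions, not part of the API, and the second sample entry is made up for the demonstration.

```python
import json


def select_contexts(response_body: str, min_score: float = 0.5) -> list[str]:
    """Parse a retrieval response and keep contents scoring at least min_score."""
    data = json.loads(response_body)
    return [r["content"] for r in data["results"] if r["score"] >= min_score]


# Sample body; the low-scoring entry is invented to show the filter working.
sample_body = json.dumps(
    {
        "results": [
            {"content": "Python\nParadigm Multi-paradigm...", "score": 0.91734},
            {"content": "Unrelated passage (made-up sample)", "score": 0.12},
        ]
    }
)
contexts = select_contexts(sample_body)
context_block = "\n".join(contexts)  # ready to place into a prompt
```

Filtering by score is optional. Passing every returned context to the model also works, at the cost of a longer prompt.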

RAG Example: PDF QA

This example shows how to run RAG with Friendli Document Retrieval and Friendli Serverless Endpoints using LangChain.

!pip3 install -qU langchain langchain-community friendli-client requests
import os
import requests
from langchain_community.chat_models.friendli import ChatFriendli

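# Get your access token at
FRIENDLI_TOKEN = os.environ["FRIENDLI_TOKEN"]  # assumed to be set in the environment

# Placeholder: the retrieval endpoint URL is not given here; set it before running.
RETRIEVAL_URL = "..."

def retrieve_contexts(
    document_ids: list[str], query: str, k: int
) -> list[str]:
    resp = requests.post(
        RETRIEVAL_URL,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {FRIENDLI_TOKEN}",
        },
        json={
            "document_ids": document_ids,
            "query": query,
            "k": k,
        },
    )
    resp.raise_for_status()  # fail early on HTTP errors
    data = resp.json()
    return [r["content"] for r in data["results"]]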

document_ids = [...]
contexts = retrieve_contexts(document_ids, "What is Orca?", 2)

llm = ChatFriendli(
    model="meta-llama-3-70b-instruct", friendli_token=FRIENDLI_TOKEN
)
llm.call_as_llm(message="What is Orca?")

template = """Use the following pieces of context to answer the question at the end.
If you don’t know the answer, just say that you don’t know, don’t try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say “thanks for asking!” at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""

rag_message = template.format(
    context="\n".join(contexts), question="What is Orca?"
)
llm.call_as_llm(message=rag_message)
'ORCA is a distributed serving system for Transformer-based generative models. It is designed to provide low-latency and high-throughput inference serving for large-scale Transformer models, such as GPT-3. Thanks for asking!'