Retrieval-Augmented Generation

(Part 2 of a four-part series presented at the Masterclass “From chatbots to personalized research assistants: a journey through the new ecosystems related to Large Language Models” at the Medien Triennale Südwest 2023)

  • Journalistic use case: working with (large) documents in research, fact-checking, and customized search tools for archived content.
  • Takeaway message: The context of prompts to Large Language Models can be narrowed down to one's own data and documents. The problem of hallucinations can be minimized. Advanced techniques even guarantee the extraction of verbatim passages.

Enhancing content generation with retrieval-augmented techniques offers a powerful approach to enriching the context of a prompt. By incorporating retrieved data based on specific informational requirements, such as a user’s query, we can craft a more informed and precise response. This approach also opens the door to integrating knowledge that wasn’t included in the original training data of a Large Language Model, such as more recent or proprietary data. Furthermore, it allows for the tailoring of the retrieval pipeline to meet unique user needs.

The predominant method of Retrieval-Augmented Generation leverages what are known as vector stores. These databases, aptly named for their function, house vectors of embeddings along with their associated metadata. Embeddings convert various data modalities – such as text, image, audio, and more – into a high-dimensional vector representation. This transformation process encapsulates multiple layers of information, ensuring that similar data elements maintain proximity within the embedding space. To facilitate this, there are specialized embedding models available, such as Sentence Transformers or commercial solutions like OpenAI’s embeddings endpoint.
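To make the idea of "proximity within the embedding space" concrete, here is a minimal sketch of cosine similarity, a measure commonly used to compare embeddings. The three-dimensional vectors are hypothetical values chosen by hand for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-picked "embeddings" (not produced by a real model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, invoice)
print(sim_close > sim_far)  # related items should score higher
```

A vector store essentially runs this kind of comparison between a query vector and all stored vectors, usually with index structures that avoid comparing against every entry.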

Basic Question & Answering

The process of basic question and answering can be broken down into three main stages: preparation, querying, and answer generation.

During the preparation stage, documents are collected and divided into manageable chunks.
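As a rough illustration of chunking, the following sketch cuts a text into fixed-size pieces with an optional overlap. The function name `split_into_chunks` is hypothetical, and the logic is deliberately simpler than what text splitters in libraries typically do (they usually try to break at paragraph or sentence boundaries).

```python
def split_into_chunks(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Tiny values so the effect of the overlap is visible
chunks = split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

An overlap can help when a relevant passage would otherwise be cut in half at a chunk boundary.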

Each chunk is then processed to calculate its embeddings, which are stored along with relevant metadata such as the source and the plain text of the processed chunk.

The querying stage begins when a question is received from the user. The question is processed to calculate its embeddings, which are then used to find the most similar chunks in the vector store.
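The querying stage can be sketched without any external service. In the example below, `embed` is a stand-in that merely counts words; it is an assumption made so that the code runs on its own, and a real system would call an embedding model instead. The chunk texts and the helper names (`embed`, `cosine`, `top_k`) are likewise hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: word counts instead of a learned vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def top_k(question, store, k=2):
    """Return the k stored chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

# Hypothetical chunks; each entry keeps the plain text next to its vector
chunks = [
    "the archive contains radio broadcasts",
    "our mission is to support science journalism",
    "opening hours are nine to five",
]
store = [{"text": c, "vector": embed(c)} for c in chunks]

hits = top_k("what is the mission", store, k=1)
print(hits)
```

Storing the plain text alongside the vector is what later allows the retrieved chunks to be written into the prompt.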

Finally, in the answer generation stage, the plain text of the retrieved chunks is written into the context window of the prompt template. This template is then used to generate the answer to the user’s question.
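The answer generation stage amounts to filling a template. The wording of `PROMPT_TEMPLATE` below is an assumption for illustration and not the template that any particular library uses; the example chunk is also made up.

```python
# Hypothetical prompt template with placeholders for context and question
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say that you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    """Write the plain text of the retrieved chunks into the template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "What is the mission?",
    ["Our mission is to support science journalism."],
)
print(prompt)
```

The resulting string is what would be sent to the Large Language Model.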

In LangChain, this entire pipeline can be written in a few lines of code. In this example, we load the content of a website, split it into chunks, and calculate embeddings for them. LangChain also provides a variety of integrations for loading data from many other sources, such as PDF files, databases, or cloud services. In addition, LangChain abstracts from the specific vector store used, i.e. this code example with a local vector store will also work without major changes with vector stores intended for production use in larger applications.

from langchain.chat_models import ChatOpenAI

from langchain.document_loaders import WebBaseLoader
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

def preprocess_website(website_url):
    # load text from website
    loader = WebBaseLoader(website_url)
    data = loader.load()
    # split website text into chunks of 1000 characters
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(data)
    # calculate embeddings for chunks and store them in vector store
    vectorstore = Chroma.from_documents(documents=docs, embedding=OpenAIEmbeddings())
    return VectorStoreIndexWrapper(vectorstore=vectorstore)

# this is the large language model needed for generating the final answer
llm = ChatOpenAI(model_name="gpt-3.5-turbo-0613", temperature=0.0)
# preprocess text from the English language introduction page of the SMC
index = preprocess_website("")
# define the question, this would be the user's input
question = "What is the mission of the SMC?"

# query vector store and print result
# i.e. embedding the question
# retrieve similar chunks from vector store
# add retrieved chunks to the context of the answer-generating prompt template
# generate answer
print(index.query(question, llm=llm))
As an intermediary, we want to promote the exchange of information and dialogue between science, the media and the public.

Question & Answering with sources and supporting evidence

Retrieval-Augmented Generation is an attempt to narrow down the information used to generate an answer to the retrieved context. This lowers the risk of hallucinations and, as mentioned earlier, allows the use of more recent or proprietary data in LLM-based applications.

However, another requirement for these applications is the ability to assess the quality or reliability of the generated responses. For this purpose, the sources of the chunks used are often specified in the context. In the following example, we go one step further and try to extract answers as precisely as possible, supplying for each answer the closest verbatim matches from the retrieved context.

The preparation and querying phases of question and answering with sources and supporting evidence are identical to those of the basic Retrieval-Augmented Generation example discussed earlier. However, after the chunks most similar to a query have been identified in the vector store, they are passed to an extractive prompt.

This prompt’s task is to fill out a predefined data model, which includes the question and answers as a list of facts, each supported by evidence.
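Such a data model could look roughly like the following sketch. The class and field names (`Fact`, `AnswerWithEvidence`, `evidence`) are hypothetical and chosen for illustration; they are not the names used by LangChain. The example values are taken from the output shown later in this post.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Fact:
    """One statement of the answer plus the quote that backs it."""
    fact: str
    evidence: str  # verbatim passage from the retrieved context

@dataclass
class AnswerWithEvidence:
    """The question together with its answer as a list of facts."""
    question: str
    facts: List[Fact] = field(default_factory=list)

example = AnswerWithEvidence(
    question="What is the mission of the SMC?",
    facts=[
        Fact(
            fact="Our motto: We love enlightenment!",
            evidence="Our motto: We love enlightenment!",
        )
    ],
)
print(example.facts[0].evidence)
```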

To extract the information from the context of the prompt, the function calling capability of OpenAI’s Large Language Models is used. Strictly speaking, this capability is designed to determine from a prompt which function (which tool, see the post on agents in this series) should be called with which parameters, but it also works very well for extracting data into a given data model.
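To illustrate how function calling doubles as data extraction, the sketch below describes a function whose parameters are a JSON Schema of the desired data model. The function name `record_answer` and its fields are hypothetical, the returned arguments are hand-written rather than produced by a model, and the exact request format depends on the API version in use.

```python
import json

# Hypothetical function description; name and fields are illustrative only
record_answer = {
    "name": "record_answer",
    "description": "Record the answer to a question as facts with verbatim evidence.",
    "parameters": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "facts": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "fact": {"type": "string"},
                        "evidence": {"type": "string"},
                    },
                    "required": ["fact", "evidence"],
                },
            },
        },
        "required": ["question", "facts"],
    },
}

# Hand-written stand-in for the JSON arguments a model might return
raw_arguments = json.dumps({
    "question": "What is the mission of the SMC?",
    "facts": [{
        "fact": "Our motto: We love enlightenment!",
        "evidence": "Our motto: We love enlightenment!",
    }],
})

# Parse the arguments and check that all required top-level keys are present
arguments = json.loads(raw_arguments)
missing = [k for k in record_answer["parameters"]["required"] if k not in arguments]
print(missing)
```

Because the model is asked to "call" the function, its reply arrives as structured arguments instead of free text, which is what makes the result easy to process further.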

The final phase is, as before, the generation of the answer. Here, we create an answer using the list of extracted facts, each backed by evidence. The answer, along with its supporting evidence, is then presented to the user.
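To present evidence to the user, the quoted passage has to be located in the context. The following is a minimal sketch that uses exact string matching; the helper name `find_spans` is hypothetical, the context string is a shortened illustration, and a production implementation would likely tolerate small deviations between the quote and the context (fuzzy matching).

```python
def find_spans(quote, context):
    """Yield (start, end) for every exact occurrence of quote in context."""
    if not quote:
        return
    start = context.find(quote)
    while start != -1:
        yield (start, start + len(quote))
        start = context.find(quote, start + 1)

# Shortened, illustrative context
context = "But we do even more. Our motto: We love enlightenment! We are not alone."
quote = "Our motto: We love enlightenment!"

spans = list(find_spans(quote, context))
for start, end in spans:
    print(context[start:end])
```

If no span is found, the quote does not occur verbatim in the context, which is a useful signal that the extracted evidence should not be trusted.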

Since the logic of data extraction is already encapsulated in its own chain, this use case can also be implemented in a few lines of code in LangChain.

# data preparation and definition of llm as above
from langchain.chains import create_citation_fuzzy_match_chain

# helper function for highlighting parts of a text
def highlight(text, span):
    # show up to 20 characters of context on either side of the span
    return (
        "..."
        + text[max(span[0] - 20, 0) : span[0]]
        + "*"
        + "\033[91m"
        + text[span[0] : span[1]]
        + "\033[0m"
        + "*"
        + text[span[1] : span[1] + 20]
        + "..."
    )

question = "What is the mission of the SMC?"

# prepare context
docs = index.vectorstore.similarity_search(question, k=4)
context = "\n\n".join([doc.page_content for doc in docs])

# create and run fuzzy match chain
fuzzy_match_chain = create_citation_fuzzy_match_chain(llm)
results = fuzzy_match_chain.run(question=question, context=context)

# process and display structured response
for fact in results.answer:
    print("Answer:", fact.fact)
    for span in fact.get_spans(context):
        print("Source:", highlight(context, span))

Answer: As an intermediary, we want to promote the exchange of information and dialogue between science, the media and the public.
Source: …ut we do even more. *As an intermediary, we want to promote the exchange of information and dialogue between science, the media and the public.*
For our work, we re…

Answer: Our motto: We love enlightenment!
Source: …g example of this”.
*Our motto: We love enlightenment!*
We are not alone an…