What are embeddings and embedding models?
Embeddings are vector representations of data such as text, images, audio, and video. These arrays of numbers capture the semantics and important features of the data, mapping similar entities close together in vector space while placing dissimilar entities farther apart.
Embedding models are algorithms designed to learn these embeddings. They are typically large neural networks trained on massive datasets to generate embeddings that capture the complex relationships between data points. In natural language processing, for example, an embedding model creates vector representations of words or sentences that place semantically similar text near each other in vector space.
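To make this concrete, here is a minimal sketch using the open-source sentence-transformers library; the model name below is just an example of a small embedding model.

```python
# A minimal sketch of generating embeddings and comparing them.
# Assumes the sentence-transformers package is installed; the model
# name is only an example of a small open-source model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover a forgotten account password",
    "Best hiking trails near Denver",
]

# Each sentence becomes a fixed-length vector (384 dimensions for this model).
embeddings = model.encode(sentences)

# Cosine similarity: semantically similar sentences score closer to 1.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity
```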
What is RAG (briefly)
Retrieval Augmented Generation (RAG) aims to improve the quality of pre-trained language model (LM) generation using data retrieved from a knowledge base. The success of RAG relies heavily on retrieving the most relevant results from the knowledge base.
A common approach used for retrieval in RAG applications is semantic search. In semantic search, an embedding model is used to create vector representations of the user query and information in the knowledge base. Given a user query embedding, the system can retrieve the most relevant source documents from the knowledge base.
The retrieved documents, along with the original user query and any prompts, are then passed as context to an LM to generate an answer to the user’s question. Embeddings thus enable effective retrieval, which improves the context the LM sees and, in turn, the quality of its generation in RAG.
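Here is a hedged sketch of that retrieval step over a tiny in-memory knowledge base, again using sentence-transformers as an example; in a real application the document embeddings would live in a vector database, and the final LM call is only indicated in a comment.

```python
# Sketch of the retrieval step in RAG: embed the query, rank documents
# by similarity, and pass the top hits to the LM as context.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

knowledge_base = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available 24/7 via chat and email.",
    "Enterprise plans include a dedicated account manager.",
]
doc_embeddings = model.encode(knowledge_base, convert_to_tensor=True)

query = "Can I return an item I bought last week?"
query_embedding = model.encode(query, convert_to_tensor=True)

# semantic_search returns the top-k most similar documents for the query.
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
context = "\n".join(knowledge_base[hit["corpus_id"]] for hit in hits)

# The retrieved context plus the user query would then go into the LM prompt,
# e.g. f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(context)
```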
Choosing the right embedding model for your RAG application
But with so many models out there, how do we choose the best one for our use case?
A good place to start when looking for embedding models is the MTEB Leaderboard on Hugging Face. It is a regularly updated ranking of proprietary and open-source text embedding models, with scores showing how each model performs on embedding tasks such as retrieval, classification, and summarization.
Benchmarks are a good place to begin but bear in mind that these results are self-reported and have been benchmarked on datasets that might not accurately represent the data you are dealing with. It is also possible that some models may include the MTEB datasets in their training data since they are publicly available. So even if you choose a model based on benchmark results, we recommend evaluating it on your dataset.
When reviewing the leaderboard, focus on metrics related to retrieval performance, such as NDCG. Also consider model size, max tokens, and embedding dimensions: larger models often perform better but add latency, while smaller ones are cheaper to store and faster at inference. The goal is to find the right balance for your use case.
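If a candidate model is available through the sentence-transformers library, you can inspect two of these properties directly; a quick sketch (the model name is only an example):

```python
# Inspect a candidate model's max input length and embedding size.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print("Max tokens:", model.get_max_seq_length())                     # e.g. 256
print("Embedding dims:", model.get_sentence_embedding_dimension())   # e.g. 384
```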
Compare model sizes
Model size is one of the key factors to consider when choosing an embedding model. As seen in the leaderboard, model sizes can range from under 1 GB to over 10 GB.
Generally, larger models lead to better performance on tasks like retrieval, because they can capture more semantic relationships and nuances in the data during training. On the MTEB leaderboard, the strongest retrieval scores tend to come from the larger models, while compact models like sentence-transformers/msmarco-MiniLM-L6-v2 give up some accuracy in exchange for speed and a smaller footprint.
However, larger model size also means higher inference latency, because larger models require more compute to process an input and generate an embedding. In a production environment, that latency directly affects user experience: long wait times frustrate users.
So there is a tradeoff between performance and latency that must be evaluated for your specific use case. If real-time response is critical, smaller models like MiniLM may be preferable despite modestly lower accuracy; for offline or batch workloads, a larger model can be used to maximize accuracy.
The choice also depends on your computational budget. Larger models are more expensive to deploy because they require more RAM and often GPUs. So that is another aspect to consider when choosing between small and large embedding models.
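One practical way to measure the latency side of this tradeoff is to time a batch of encodes for each candidate on your own hardware. The sketch below compares two open-source models chosen purely as examples; the numbers you see will depend entirely on your machine and batch sizes.

```python
# Rough latency comparison between a smaller and a larger embedding model.
# Treat this as a sketch, not a rigorous benchmark.
import time
from sentence_transformers import SentenceTransformer

texts = ["What is the refund policy for annual plans?"] * 100

for name in [
    "sentence-transformers/all-MiniLM-L6-v2",   # small, 384-dim
    "sentence-transformers/all-mpnet-base-v2",  # larger, 768-dim
]:
    model = SentenceTransformer(name)
    model.encode(texts[:8])  # warm-up pass

    start = time.perf_counter()
    model.encode(texts, batch_size=32)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f}s for {len(texts)} texts")
```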
Compare proprietary vs open-source
Proprietary models served through hosted APIs generally offer strong performance out of the box, with no infrastructure for you to manage. The tradeoff is that their training data, architectures, and weights are not publicly available, so you have limited visibility into how they were built and no option to run them on your own hardware.
However, open-source models such as those in the Sentence Transformers library are free to use and customizable. They are built on top of PyTorch and Hugging Face Transformers, which lets developers fine-tune them on their own datasets to improve performance in niche domains. The open-source code also makes it possible to inspect the model architecture and training routines.
Overall, proprietary models offer strong out-of-the-box performance, while open-source models provide more transparency and customization. Depending on your priorities and resources, you may prefer one over the other; testing both types on your dataset can help determine which is better suited.
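One concrete advantage of the open-source route is fine-tuning on your own query-passage pairs. Below is a minimal sketch using the sentence-transformers training API; the training pairs are made-up placeholders, and a real run would use thousands of examples from your domain.

```python
# Minimal fine-tuning sketch for an open-source embedding model.
# The training pairs here are illustrative placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# (query, relevant passage) pairs from your own domain.
train_examples = [
    InputExample(texts=["reset 2FA token", "How to reset your two-factor device"]),
    InputExample(texts=["invoice history", "Where to find past billing statements"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats other in-batch passages as negatives,
# a common choice for retrieval-style fine-tuning.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("finetuned-embedding-model")
```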
Compare max tokens
When looking at the max tokens that a model can handle, you typically don’t want to put more than a single paragraph of text (~100 tokens) into a single embedding. So even models with max tokens of 512 should be more than enough for most use cases.
For example, many models on the leaderboard accept up to 512 tokens per input, which is far more room than a single paragraph needs. Others cap out at 256 or even 128 tokens, which is still perfectly adequate for paragraph-sized chunks.
The takeaway is that a max token limit anywhere in the 128-512 range leaves plenty of room for full-paragraph embeddings. Unless your use case specifically requires embedding very long passages of text, there is little reason to prioritize models with much larger context windows.
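To check whether your chunks actually fit a model's limit, you can count tokens with the model's own tokenizer. A deliberately simple sketch (the fixed 100-token chunking is just for illustration):

```python
# Count tokens with the embedding model's tokenizer and split long text
# into paragraph-sized chunks before embedding.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = model.tokenizer  # the underlying Hugging Face tokenizer

text = "Long document text goes here... " * 50

token_ids = tokenizer.encode(text, add_special_tokens=False)
print("Total tokens:", len(token_ids), "| model limit:", model.get_max_seq_length())

# Naive chunking: roughly 100 tokens per chunk, no overlap.
chunk_size = 100
chunks = [
    tokenizer.decode(token_ids[i : i + chunk_size])
    for i in range(0, len(token_ids), chunk_size)
]
embeddings = model.encode(chunks)
print(len(chunks), "chunks embedded")
```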
Compare embedding dimensions
The length of the embedding vector is another important factor to consider. Smaller embedding dimensions offer faster inference and more storage efficiency, while more dimensions allow the model to capture more nuanced details and complex relationships in the data.
Ultimately, we want to strike the right balance between capturing the full complexity of our data and maintaining operational efficiency. The ideal embedding size depends on your specific use case and dataset; models in the 768-1024 dimension range are a reasonable default, and you can test larger or smaller sizes from there.
The goal is to choose the smallest size that still captures the key semantic relationships in your data. This ensures maximum efficiency without sacrificing model capability.
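The storage side of this tradeoff is easy to estimate up front: a float32 vector costs 4 bytes per dimension, so a corpus of N chunks needs roughly N × dimensions × 4 bytes before any vector-database overhead. A quick back-of-the-envelope sketch:

```python
# Back-of-the-envelope storage estimate for an embedding index
# (float32 vectors, ignoring metadata and vector-database overhead).
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dimensions * bytes_per_value / 1024**3

for dims in (384, 768, 1536):
    print(f"{dims:>4} dims, 10M chunks: {index_size_gb(10_000_000, dims):.1f} GB")
# 384 dims  -> ~14.3 GB
# 768 dims  -> ~28.6 GB
# 1536 dims -> ~57.2 GB
```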
Recommend top models
When choosing an embedding model for your RAG application, you’ll need to make tradeoffs between performance, cost, and operational constraints. Here are some top models to consider:
- Large proprietary models – The biggest hosted embedding models offer strong retrieval performance out of the box, but come with per-request API costs and no option to self-host. They are a good choice if you need maximum accuracy and the usage costs fit your budget.
- T5-based models – Open-source embedding models built on Google’s T5 encoder, such as the Sentence-T5 and GTR families, achieve strong retrieval results while being more lightweight than the largest proprietary models. They are a good option if you want high performance without the compute burden of massive models.
- Sentence Transformers – An open-source framework and collection of pretrained sentence embedding models. Many of them are small and efficient while still offering solid accuracy. Choose these if you need a lightweight model that can be deployed easily.
- GenAI’s Leonardo – Our new medium-sized proprietary model balances retrieval performance and efficiency. Leonardo is a cost-effective option if you need better accuracy than lightweight models but can’t afford massive models like Claude.
The choice depends on your priorities: if compute resources and costs are no concern, a large proprietary model may be the best performer, while T5-based models, Sentence Transformers models, or Leonardo offer a good blend of accuracy and efficiency for many applications. Evaluate models on your own dataset to choose the best fit.
How to evaluate embedding models
Evaluating embedding models on your own dataset is crucial before deploying them in production, even if they perform well on public benchmarks. There are a few key things to consider:
Importance of evaluating on your dataset
Every dataset has its own characteristics and complexity. The performance of an embedding model can vary significantly across different datasets. Models that perform well on public benchmarks may not work as well on your custom dataset. Hence, benchmark scores provide only a starting point. The true test of an embedding model’s efficacy is evaluating it on a sample of your own data.
Metrics for evaluation
Some key metrics used to evaluate embeddings for semantic search and retrieval tasks are listed below, with a small worked example after the list:
- Precision and Recall: Measure the model’s ability to retrieve all relevant results (recall) while minimizing irrelevant results (precision).
- Mean Reciprocal Rank (MRR): Averages the reciprocal of the rank at which the first relevant result appears, so models that surface a relevant document near the top of the list score higher.
- Normalized Discounted Cumulative Gain (NDCG): Assesses if relevant results are ranked higher than irrelevant ones. NDCG also accounts for the position of results in the ranked list.
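As a small worked example, here is a self-contained sketch that computes reciprocal rank and NDCG for a single query's ranked results, using made-up relevance labels:

```python
# Compute reciprocal rank and NDCG@k for one query's ranked results.
# The relevance labels are made-up examples (1 = relevant, 0 = not relevant).
import math

def reciprocal_rank(relevances):
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg(relevances, k):
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the top-5 retrieved documents, in ranked order.
retrieved_relevance = [0, 1, 1, 0, 1]

print("Reciprocal rank:", reciprocal_rank(retrieved_relevance))   # 0.5
print("NDCG@5:", round(ndcg(retrieved_relevance, 5), 3))          # ~0.712
```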
Annotation schema
To calculate these metrics, human judgments are required on the relevance of retrieved results for sample queries. This is facilitated by creating an annotation schema to consistently judge result relevance across queries.
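A lightweight way to capture these judgments is a TREC-style qrels structure: one graded relevance label per (query, document) pair. A hypothetical example, with the grading scale defined in the comment:

```python
# Hypothetical annotation schema: graded relevance judgments keyed by
# query ID and document ID (0 = irrelevant, 1 = partially relevant,
# 2 = highly relevant). Annotators fill this in for a sample of queries.
relevance_judgments = {
    "q1": {"doc_17": 2, "doc_42": 1, "doc_93": 0},
    "q2": {"doc_08": 0, "doc_51": 2, "doc_64": 0},
}
```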
Evaluators
Open-source toolkits like Pyserini and Facebook AI's DPR provide building blocks for indexing a sample of your data, generating embeddings, retrieving results, and computing evaluation metrics against your relevance judgments.
Evaluating on your own data is key to choosing the right embedding model before production deployment. Metrics, annotation schema, and open-source evaluators help streamline this process.
Choosing the right embedding model is crucial for building effective RAG applications. In this article, we covered the key points around embeddings and RAG:
- Embeddings are vector representations of data that capture semantic meaning. Embedding models are algorithms to generate these vectors.
- In RAG applications, embeddings enable semantic search to find the most relevant documents for a given query. The retrieved documents provide knowledge context for the LLM to generate high quality responses.
- There are many embedding models to choose from. Look at benchmarks like the MTEB leaderboard but evaluate models on your own data as well.
- Consider model size, proprietary vs. open-source, max tokens, and embedding dimensions when comparing models. Choose a model that balances performance and operational constraints.
- Outline a plan to evaluate shortlisted models on your dataset using metrics like NDCG, latency, etc. The best model for you will depend on your specific use case and data.
Selecting the optimal embedding model lays the foundation for an effective RAG system. With the right model, you can build AI applications that leverage external knowledge to provide users with accurate, relevant and helpful information.