Advanced RAG: Best Practices for Production Ready Systems

Written by
Karl Simon
Published on
January 1, 2025

If you’ve been familiarizing yourself with techniques to augment generative AI model knowledge with your additional data beyond a model’s training cutoff date (or including your own, private data), you’re aware of Retrieval Augmented Generation (RAG) based systems by now. It’s far from a new concept, and in fact, some in the community have already started predicting its demise as a key technique for Q&A systems for your data, whether via chatbots or extended workflow-based applications.

The reason for such optimism about the demise of RAG? Many point to models that are effectively increasing their context windows (e.g. Claude 3 models allow 200k tokens) to enable documents to be included within the API call alongside your prompt(ed) instructions to answer questions about the same.

However, the demise of RAG is not so imminent - at least not until models can prove that it can handle an extended, lengthy context (with all of our private data) without failing to miss important details, including the well-known “lost in the middle” fact(s). This is especially true for critically important, fact-driven Q&A systems supporting domains (such as Legal, Finance, and Supply Chain Management) that rely on complete comprehensive, high-fidelity information. In “fact”, RAG implementations with only a Single-Agent model architecture where the model is assigned all roles as a singular agent to perform extended reasoning duties, such as ‘write’, ‘critique’, ‘edit’, and ‘publish’ content (all with specific and extended prompt instructions comprising guiding examples, known as “few shot instructions”) risks agent execution completeness and accuracy degrading, or at least not performing reliably across multiple retries of the same inquiry.

Naturally, architects have already begun solving this challenge months back. Many have heard the calls for introducing Multi-Agent architectures with each agent defined for a specific subtask guided by a subtask-specific prompt. And as a natural pairing to multi-agent architectures, many, including langchain, have strongly recommended and enabled through its framework the ability to engineer flows to extend the Directed Acyclic Graph (DAG) with cycles (i.e. “loops”).  In Flow Engineering, we can create supervising agents to select the next action subtask (and sub-agent call), and include 1 or more repeated cycles for specific subtasks as necessary within a multi-step workflow.

And guess what - we echo these calls! These new capabilities have absolutely enabled generative AI-based application workflows to better emulate true processes, and with improved accuracy, if constructed properly. For example, langchain demonstrated that it could implement code generation for the following decision points, based on a flow for AlphaCodium:

Through Langraph-based flow engineering, the langchain team effectively built out the following workflow, with check nodes driving evaluation and potential iterative cycles as shown below:

Flow Engineering + Multi-Agent Architecture FTW?

But implementing flow engineering supporting a multi-agent architecture isn’t “magic” - you can’t simply implement these techniques and expect better results. As you might have already guessed, optimal implementations still require employing the proper chunking, indexing, retrieval, and post-retrieval sorting strategies employed. Depending upon the use case, you might find yourself employing different retrievers within your end-to-end workflow, or otherwise setting different score thresholds at different moments as well.

In short, there are many pitfalls, leading to incorrect outputs, and the earlier a pitfall occurs within the process, the probability increases that the likelihood and size of the inaccuracy will have further increased by the time your workflow ends. Even other considerations, such as deciding between chat history preservation vs. curtailment at different process steps can dramatically alter the forward-progressing flow and size of the context assimilated by a later agent, which may or may not lead to expected or even required results.

These lessons are learned through experience, with preparing a solution to be “Production ready”. My goal in the upcoming weeks is to share many of these lessons, and arm you with the foresight to plan and execute accordingly.

In the upcoming weeks, I’ll share use case examples that we’ve experienced at SOVA, and provide insight to some of the identified opportunities to correct or otherwise optimize outcomes. Next time, we’ll begin with a typical knowledge retrieval-based, Q&A system supporting a Legal domain that must be accurate to support contract generation and existing contract legal guidance review for the internal Legal Team. I’ll highlight the challenges, the opportunities identified, and the employed solution updates that ultimately met clients’ expected outcomes.

Stay tuned.

Share This Post
Karl Simon
CTO