Putting very large files into context and prompting over them is usually expensive and slow. Although this is still much better than reviewing all the documents on your own, another approach can return similar results for many prompts while spending far fewer coins. This is where Retrieval-Augmented Generation (RAG) comes in, and it is now available in Straico’s API.
What is RAG?
Retrieval-Augmented Generation (RAG) is an advanced AI framework designed to enhance the performance and accuracy of generative AI models, such as large language models (LLMs). This technique synergizes an information retrieval mechanism with a text generator model to produce more grounded and reliable outputs. By sourcing authoritative knowledge bases outside the model’s training data, RAG ensures that the generated responses are informed by external factual sources. This approach is particularly beneficial for improving the quality and relevance of LLM-generated content.
Here’s a summary of how a RAG-based system is typically created:
1. Data Collection: Initially, a large corpus of textual data is collected and prepared. This corpus will serve as the source from which the system retrieves relevant information.
2. Embedding and Vectorization: Each document in the corpus is converted into a vector representation. This process involves transforming the text into numerical vectors that a computer can understand and process. Embeddings, such as those generated by pre-trained models like BERT or Sentence Transformers, are used for this purpose. These embeddings capture semantic information about the text, allowing the system to understand contextual similarities between different pieces of text.
3. Indexing: Once vectorized, these embeddings are stored in a vector database or index. Advanced indexing techniques, like HNSW (Hierarchical Navigable Small World graphs), are often used to enable efficient similarity search across large datasets.
4. Query Processing: When a query is received, it is also converted into an embedding using the same model that was used for the documents. This query embedding represents the semantic content of the user’s question.
5. Retrieval: The query embedding is used to search the index for the most similar document embeddings. This process identifies the documents in the corpus that are most relevant to the query based on their vector proximity.
6. Generative Response: The retrieved documents are then passed to a generative model, such as a transformer-based language model. This model leverages the contextual information from the retrieved documents to generate a coherent and contextually enriched response to the query.
7. Output: Finally, the system responds to the user query with the generated text, which includes information seamlessly integrated from the retrieved documents.
Figure: general schema of how RAG works.
Overall, RAG systems leverage the power of both information retrieval through embedding and vectorization, and language generation, to provide responses that are both relevant and contextually rich. This approach allows for dynamic and informed dialogue systems, providing better answers than traditional retrieval or generative models alone.
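To make the steps above concrete, here is a minimal sketch of the embed, index, retrieve and prompt-building stages in plain Python. It is only an illustration: the `embed` function is a toy bag-of-words counter standing in for a real embedding model such as BERT or Sentence Transformers, the index is a plain list rather than an HNSW structure, and all function names and the sample corpus are hypothetical. It does not describe how Straico implements RAG internally.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Steps 2-3: embed every chunk and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Steps 4-5: embed the query with the same function, return the k closest chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, retrieved):
    """Step 6: hand only the retrieved chunks to the generative model."""
    context = "\n\n".join(retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical three-chunk corpus, for illustration only.
corpus = [
    "Passengers are notified of known delays of thirty minutes or more.",
    "Checked baggage must not exceed the published weight limit.",
    "Pets may travel in the cabin in an approved carrier.",
]
index = build_index(corpus)
query = "Will I be notified of delays?"
top = retrieve(index, query, k=1)
print(build_prompt(query, top))
```

The key point is that the generative model only sees the retrieved chunks, not the whole corpus, which is what keeps the prompt small.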
Why do we care about RAG?
Retrieving information from large documents using an LLM is tempting. Straico has actually set some default templates as examples: one of them is called “Advisor on airline regulations”, which has an attached PDF file of 19,121 words. The template allows you to ask any question about the document.
For example, when the core question is “What happens if the flight is delayed?”, the template generates a detailed response based on the large file given as context. When using Claude 3.5 Sonnet as the LLM, the transaction costs almost 900 coins!
You can check the detailed prompt and answer here.
It obviously gets worse when prompting again over the same context. But is it really necessary to prompt over the whole context every time you have a question that only concerns a small fraction of the document?
This is where RAG shines. Let’s take a look at how to use RAG bases in our API.
Create a RAG base using Straico’s API
If you’re still not familiar with Straico’s API, please go to the detailed documentation: https://documenter.getpostman.com/view/5900072/2s9YyzddrR
Anyway, please keep reading this post: you won’t regret it once you see the great advantages of RAG bases.
With the POST endpoint “Create RAG base” (https://api.straico.com/v0/rag), you can create a reusable entity that holds an embedded source database for future prompts. The parameters are:
- name: a name for the RAG base.
- description: a description of the RAG base.
- files: you can upload up to 4 files (pdf, docx, csv, txt, xlsx, py).
This is an example of a call from Postman, in which we upload the same PDF with the airline regulations used in Straico’s default template.
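For readers who prefer code to Postman, the same call could be sketched in Python roughly as follows. This is a sketch under assumptions, not official client code: the `Authorization: Bearer` header scheme and the multipart field name `files` should be checked against the API documentation, the helper names are hypothetical, and sending the request relies on the third-party `requests` package. The file limit and extensions come from the parameter list above.

```python
import os
from contextlib import ExitStack

# Limits taken from the parameter list in this post.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".csv", ".txt", ".xlsx", ".py"}
MAX_FILES = 4


def build_create_rag_request(api_key, name, description, file_paths,
                             url="https://api.straico.com/v0/rag"):
    """Validate the inputs and describe the request without sending it."""
    paths = list(file_paths)
    if len(paths) > MAX_FILES:
        raise ValueError(f"at most {MAX_FILES} files are allowed, got {len(paths)}")
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        if extension not in ALLOWED_EXTENSIONS:
            raise ValueError(f"unsupported file type: {path}")
    return {
        "url": url,
        # Assumption: bearer-token authentication; confirm in the API docs.
        "headers": {"Authorization": f"Bearer {api_key}"},
        "data": {"name": name, "description": description},
        "file_paths": paths,
    }


def send_create_rag_request(spec):
    """Send the multipart request (needs the third-party 'requests' package)."""
    import requests  # imported lazily so the builder works without it

    with ExitStack() as stack:
        # Assumption: every file is sent under the form field name "files".
        files = [
            ("files", (os.path.basename(p), stack.enter_context(open(p, "rb"))))
            for p in spec["file_paths"]
        ]
        response = requests.post(
            spec["url"], headers=spec["headers"], data=spec["data"], files=files
        )
    response.raise_for_status()
    return response.json()


spec = build_create_rag_request(
    api_key="YOUR_API_KEY",  # placeholder
    name="Airline regulations",
    description="RAG base on airline regulations",
    file_paths=["contract_of_carriage.pdf"],
)
print(spec["url"], spec["data"])
```

Calling `send_create_rag_request(spec)` would perform the upload; the builder is kept separate so the inputs can be checked before any coins are spent.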
The endpoint’s response looks like the following:
{
  "success": true,
  "data": {
    "user_id": "65c23272b5e653413c616e1beedf",
    "name": "Airline regulations",
    "description": "RAG base on airline regulations",
    "rag_url": "https://prompt-rack.s3.amazonaws.com/api/rag/65c31342b5e65c616e1beedf/b1481ab7-2727-4ffb-b532-28a7b26ca556/index.faiss",
    "original_filename": "contract_of_carriage.pdf",
    "chunking_method": "fixed_size",
    "chunk_size": 1000,
    "chunk_overlap": 50,
    "buffer_size": 100,
    "breakpoint_threshold_type": "percentile",
    "_id": "6743adewe2f7a16eer34c71d46",
    "createdAt": "2024-11-24T22:23:46.561Z",
    "updatedAt": "2024-11-24T22:23:46.561Z",
    "__v": 0
  },
  "total_coins": 19.12,
  "total_words": 19121
}
Notice that the attribute total_coins has a value of 19.12. This is a very affordable fee for a 19,121-word file. In fact, RAG base creation costs 0.1 coins per 100 processed words of the original files.
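The fee can be checked with a couple of lines of Python. This is just the arithmetic from the rate quoted above; how the API rounds the result is an assumption here (rounding to two decimals matches the response shown).

```python
COINS_PER_100_WORDS = 0.1  # rate quoted in this post


def rag_creation_cost(total_words):
    """Estimated coins charged for creating a RAG base from total_words words."""
    return total_words / 100 * COINS_PER_100_WORDS


# Rounding to two decimals is an assumption that matches the sample response.
print(round(rag_creation_cost(19121), 2))  # 19.12
```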
The _id is essential for prompting over the RAG base later. Other attributes, such as “chunking_method”, can be set to choose the algorithm used to create the RAG base. Refer to the API documentation and the following article for more information: https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d
Other parameters, such as chunk_size, chunk_overlap, buffer_size and breakpoint_threshold_type, are optional and apply to the different chunking methods. Please refer to the API documentation for more details.
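To give an intuition for what chunk_size and chunk_overlap mean in the “fixed_size” method, here is a minimal sketch of fixed-size chunking with overlap. It is an assumption that sizes are counted in characters (the API may count tokens or words instead), and the function is illustrative, not Straico’s actual implementation.

```python
def chunk_fixed_size(text, chunk_size=1000, chunk_overlap=50):
    """Split text into chunks of chunk_size units, each sharing
    chunk_overlap units with the previous chunk.

    Assumption: units are characters; the real API may count differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_fixed_size("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing a little duplicated text.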
NOTE: Depending on the document’s length, it might take several minutes for the RAG base to be created. Please be patient.
Prompt over RAG bases
Now that we have created a RAG base from a PDF file, let’s try prompting over it. This is similar to prompting over a large raw file, but we’ll notice some very interesting results!
Use the endpoint “RAG prompt completion” (https://stapi.straico.com/v0/rag/<rag-id>/prompt) to send a prompt over the RAG base.
The parameters to be set are:
- prompt: the actual prompt to be sent.
- model: the LLM to be used.
- <rag-id>: the _id retrieved from the previous endpoint’s response, to be set directly on the url.
In order to emulate the scenario with raw files, our prompt will be “What happens if the flight is delayed?”, and the selected LLM will be Claude 3.5 Sonnet.
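As a rough Python equivalent of this call, the sketch below builds the request from the parameters listed above. Several details are assumptions to verify against the API documentation: the bearer-token header, sending the body as form fields, and the model identifier, which is shown only as a placeholder. The base URL defaults to the one printed in this post, the helper names are hypothetical, and `k` is the optional parameter for the number of references.

```python
def build_rag_prompt_request(api_key, rag_id, prompt, model, k=None,
                             base_url="https://stapi.straico.com/v0/rag"):
    """Describe a 'RAG prompt completion' request without sending it."""
    payload = {"prompt": prompt, "model": model}
    if k is not None:
        payload["k"] = k  # optional: number of references to retrieve
    return {
        "url": f"{base_url}/{rag_id}/prompt",  # <rag-id> goes directly in the URL
        # Assumption: bearer-token authentication; confirm in the API docs.
        "headers": {"Authorization": f"Bearer {api_key}"},
        "data": payload,
    }


def send_rag_prompt(spec):
    """Send the request (needs the third-party 'requests' package)."""
    import requests  # imported lazily so the builder works without it

    # Assumption: the endpoint accepts form-encoded fields.
    response = requests.post(spec["url"], headers=spec["headers"], data=spec["data"])
    response.raise_for_status()
    return response.json()


spec = build_rag_prompt_request(
    api_key="YOUR_API_KEY",               # placeholder
    rag_id="6743adewe2f7a16eer34c71d46",  # the _id from the creation response
    prompt="What happens if the flight is delayed?",
    model="<model-id-for-claude-3.5-sonnet>",  # placeholder, see the API docs
)
print(spec["url"])
```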
The response is the following:
{
  "success": true,
  "response": {
    "answer": "According to the context, here's what happens in case of flight delays:\n\n1. JetBlue will notify passengers of known delays of 30 minutes or more. These notifications can be given through:\n- jetblue.com\n- telephone (upon request)\n- flight information display systems\n- airport announcements\n- onboard announcements\n- email\n- text message\n\n2. For ground delays at U.S. airports:\n- Passengers will receive food and drink (potable water) no later than 2 hours after the aircraft leaves the Gate\n- Access to operable restrooms will be provided\n- Medical treatment will be provided if necessary\n\n3. Regarding liability for delays:\n- JetBlue may be liable for damage caused by delays, but won't be liable if they can prove they took all reasonable measures to avoid the delay\n- The airline is not liable for delays caused by facilities or personnel not under their control (such as airport, air traffic control, or security)\n- Damages for delays are subject to the terms and limitations set forth in the Montreal Convention and Warsaw Convention\n- The airline will reimburse passengers for reasonable expenses that occur because of delays on domestic flights",
    "references": [
      {
        "page_content": "JetBlue Airways \n2. Carrier will notify Passengers of known delays of thirty (30) minutes or \nmore, cancellations and diversions. \n3. Subject to the terms of this Contract of Carriage including but not limited \nto Sections 20 (Improperly Packaged and Damaged Items; Late Items), \n25 (Failure to Operate as Scheduled), 26 (Relief for Failure to Transport \n/ Failure to Operate) and 32 (Government Laws and Regulations), \nand applicable law, Carrier will endeavor to deliver baggage on time, \nincluding making every reasonable effort to return mishandled bags \nwithin twenty-four (24) hours, reimbursing Passengers for reasonable \nexpenses that occur because of any delay on domestic flights or as \nrequired on international flights and reimbursing Passengers for any \nfees associated with transportation of a lost bag. \n4. Carrier is an instant purchase airline. Carrier does not hold reservations \nwithout payment. \n5. Carrier’s rules regarding fare refunds are set forth in Section 4 of this",
        "page": 56
      },
      {
        "page_content": "JetBlue Airways \nc. Delay of Passengers: The Carrier shall be liable for damage occasioned \nby delay in the carriage of Passengers by air, as provided in the following \nparagraphs: \n1. The Carrier shall not be liable if it proves that it and its servants and \nagents took all measures that could reasonably be required to avoid the \ndamage, or that it was impossible for it or them to take such measures. \n2. Airport, Air Traffic Control, security, and other facilities or personnel, \nwhether public or private, not under the control and direction of the \nCarrier are not servants or agents of the Carrier, and the Carrier is not \nliable to the extent the delay is caused by these kinds of facilities or \npersonnel. \n3. Damages occasioned by delay are subject to the terms, limitations \nand defenses set forth in the Montreal Convention and the Warsaw \nConvention, whichever may apply. They include foreseeable \ncompensatory damages sustained by a Passenger and do not include",
        "page": 29
      },
      {
        "page_content": "JetBlue flight in the same class of service at no additional charge or fare, \nexcept when a portion of the trip has been made. Any refund will be made in an \namount equal to the applicable one-way fare for the portion of the trip cancelled \nor not operated as scheduled by JetBlue. \nd.\n GROUND DELAYS \nAt\n all U.S. large, medium, small, and no hub airports JetBlue serves, including \nU.S. large, medium, small, and no hub diversion airports, JetBlue will provide \nPassengers experiencing a Ground Delay with food and drink (potable water) \nno later than two (2) hours after the aircraft leaves the Gate unless the \nPilot-in-Command determines there is a safety or security-related reason for not \ndoing so. JetBlue will provide Passengers with, access to operable restrooms \nand, as necessary, medical treatment. JetBlue will not permit the aircraft to \nremain on a tarmac for more than three (3) hours for domestic flights or for",
        "page": 54
      },
      {
        "page_content": "JetBlue Airways \n3. Passengers on JetBlue itineraries originating in the United Kingdom or \nin a European Community state are not eligible for the compensation \nor relief described in this Section 37, except to the extent that the \nprovision of any component of such compensation or relief is otherwise \nindependently compelled by applicable local law or regulation and/or \nclaimed consistent therewith. \nb. INFORMATION \nJetBlue\n will notify Passengers of the following: known delays of thirty (30) \nminutes or more, cancellations, and diversions. Notification will be given \nin any of the following forms: via jetblue.com, via telephone (upon request), \non flight information display systems, via airport announcement, via onboard \nannouncement, via email or via text message. \nc.\n CANCELLATIONS \nA\n Passenger whose flight is cancelled by JetBlue will receive, at the \nPassenger’s option, a full refund or reaccommodation on the next available",
        "page": 54
      }
    ],
    "file_name": "contract_of_carriage.pdf",
    "coins_used": 38.45
  }
}
First of all, notice that the attribute coins_used has a value of 38.45. This is a very small value compared to the almost 900 coins used when prompting over the raw file!
Under the key “response”, we’ve got the following elements:
- answer: this is the actual prompt response.
- references: these are the pieces of the document that were actually considered to generate the final answer.
The coin usage depends on how many words the references have and how many references are used. The number of references to be used can be set as an optional parameter “k” when calling the endpoint.
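Once the response has been parsed from JSON, the fields described above can be pulled out with a small helper. The sketch below assumes only the response shape shown in this post; the function name and the shortened sample body are illustrative.

```python
def summarize_rag_response(body):
    """Extract the answer, cited pages, reference word count and coin usage."""
    response = body["response"]
    references = response.get("references", [])
    return {
        "answer": response["answer"],
        # Distinct page numbers the references came from.
        "pages": sorted({ref["page"] for ref in references}),
        # Total words across references, which drives the coin usage.
        "reference_words": sum(len(ref["page_content"].split()) for ref in references),
        "coins_used": response["coins_used"],
    }


# Shortened, made-up body with the same shape as the response above.
sample = {
    "success": True,
    "response": {
        "answer": "Passengers are notified of known delays.",
        "references": [
            {"page_content": "Carrier will notify Passengers", "page": 56},
            {"page_content": "GROUND DELAYS", "page": 54},
            {"page_content": "INFORMATION", "page": 54},
        ],
        "file_name": "contract_of_carriage.pdf",
        "coins_used": 38.45,
    },
}
summary = summarize_rag_response(sample)
print(summary["pages"], summary["reference_words"], summary["coins_used"])
```

Checking the cited pages against the original document is a quick way to verify that the answer is grounded in the right passages.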
Comparative results between raw file prompts and RAG
After creating the RAG base and prompting over it, we can finally compare its performance with the counterpart of prompting over a raw file.
METRIC | Raw file | RAG |
FILE WORD COUNT | 19,121 | 19,121 |
COIN USAGE FOR RAG CREATION | 0 | 19.12 |
COIN USAGE PER PROMPT | 898.752 | 38.45 |
COIN USAGE PERCENTAGE COMPARED TO RAW | 100% | 4.3% |
Prompting over a RAG base uses far fewer coins than prompting over raw files. Moreover, you can send further prompts to the same RAG base without uploading the files again. You pay for RAG base creation only once, no matter how many prompts you send afterwards!
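The comparison can be extended to several prompts with a short calculation. The sketch below uses the figures from the table and assumes, for simplicity, that every prompt costs the same as the example prompt; in practice the RAG cost per prompt varies with the number and length of the references.

```python
# Figures taken from the comparison table in this post.
RAW_COINS_PER_PROMPT = 898.752
RAG_CREATION_COINS = 19.12
RAG_COINS_PER_PROMPT = 38.45


def raw_total(prompts):
    """Total coins when every prompt is sent over the raw file."""
    return prompts * RAW_COINS_PER_PROMPT


def rag_total(prompts):
    """Total coins for one-time RAG base creation plus the prompts."""
    return RAG_CREATION_COINS + prompts * RAG_COINS_PER_PROMPT


per_prompt_share = RAG_COINS_PER_PROMPT / RAW_COINS_PER_PROMPT * 100
print(f"RAG prompt cost as a share of raw: {per_prompt_share:.1f}%")

for n in (1, 10, 100):
    print(f"{n} prompts: raw={raw_total(n):.2f} coins, RAG={rag_total(n):.2f} coins")
```

Under these assumptions the RAG route is already cheaper on the very first prompt, even after including the creation fee.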
Suggested use cases
The airline regulation example is one of many use cases that we have discussed in the community. RAG has also proved useful for:
- Querying on long legal documents.
- Retrieving information from large academic papers.
- Creating chat assistants able to answer FAQs.
- Getting information from long conversations in forums or communities.
Feel free to try your own use case with Straico’s API.
Bonus: Agents
Agents are the next step in your AI journey. Start by creating your first Agent using the “Create agent” endpoint. From there, associate RAG bases with your Agent using the “Add RAG to agent” endpoint, effectively building a comprehensive knowledge base tailored to your specific requirements.
Engage with your Agent using the “Agent prompt completion” endpoint, and experience how it draws upon the associated RAG bases to provide context-aware, intelligent responses.
Continuous Innovation
While this API upgrade introduces exciting new features, we remain committed to supporting and improving our existing capabilities. Classical prompting over raw files will continue to be available outside our API, ensuring that you have a range of options to suit your diverse needs.
This is just the beginning of our journey to enhance the generative AI experience. We’re continuously exploring ways to expand our offerings, with future plans to introduce new features and functionalities.
Experience the future of generative AI today. Try our RAG and Agent features, and unlock a world of possibilities for your business or personal endeavors.