Building a RAG-Based Query Resolution System with LangChain and CrewAI

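Businesses today field a large volume of questions from customers, sales teams, and internal stakeholders. Answering them by hand is slow and often leads to delays and inconsistent answers. An AI-powered query resolution system can respond quickly, accurately, and at scale. It does this with retrieval-augmented generation (RAG): it retrieves relevant information and then generates an answer grounded in it. In this article, I share how I built a RAG-based query resolution system using LangChain, ChromaDB, and CrewAI.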

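Why Do We Need an AI-Powered Query Resolution System?

Manual replies take time, which causes delays. Customers expect immediate answers, and businesses need fast access to accurate information. An AI-powered system handles queries automatically, reducing workload and improving consistency. It can raise productivity, speed up decision-making, and give reliable answers across departments.

Such a system is especially useful in customer support, where it can automate replies and improve customer satisfaction. In sales and marketing, it can provide real-time product details and customer insights. Industries such as finance, healthcare, education, and e-commerce can all benefit from automated query handling, which supports smooth operations and a better user experience.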

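Understanding the RAG Workflow

Before diving into the implementation, let's look at how a retrieval-augmented generation (RAG) system works.

[Figure: RAG workflow]

The architecture consists of three key stages: indexing, retrieval, and generation.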

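1. Building the Vector Store (Document Processing and Storage)

The system first processes and stores the relevant documents so that they are easy to search. The indexing process works as follows:

Documents and chunking: Large documents are split into smaller text chunks for efficient retrieval.
Embedding model: An embedding model converts each chunk into a vector representation.
Vector store: The vectors are indexed and stored in a database (such as ChromaDB) for fast lookup.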

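2. Query Processing and Retrieval

When a user submits a query, the system retrieves relevant data before it generates a response. The steps are:

User query input: The user submits a question or request.
Vectorization: The same embedding model converts the query into a numeric vector.
Search and retrieval: The system searches the vector store and retrieves the most relevant chunks.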

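3. Augmenting and Generating the Response

To produce a well-grounded response, the system augments the query with the retrieved data. The steps are:

Augmented query: The retrieved chunks are combined with the original query.
LLM processing: A large language model (LLM) uses the query and the retrieved context to generate the final response.
Final reply: The system returns a factual, context-aware answer to the user.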
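The three stages can be sketched end to end in a few lines of plain Python. This is only a toy illustration, not the implementation used later in the article: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `index` list stands in for ChromaDB, and the final prompt is printed rather than sent to an LLM. All names here are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stage 1: index. Store each chunk alongside its vector.
chunks = [
    "gradient descent updates weights to reduce the loss",
    "a vector store keeps embeddings for fast lookup",
    "subtitles are stored as srt files",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stage 2: retrieve. Embed the query and rank the chunks by similarity.
def retrieve(query, k=1):
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Stage 3: augment. Combine the retrieved context with the query into an LLM prompt.
def build_prompt(query, k=1):
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("what is gradient descent"))
```

A real embedding model captures meaning rather than exact word overlap, which is why the system described below relies on OpenAI embeddings instead of word counts.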

Now that you know how a RAG system works, let's build a RAG-based query resolution system.

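Building a RAG-Based Query Resolution System

In this article, I walk you through building a RAG-based query resolution system that uses an AI agent to answer learners' queries. To keep things simple, I demonstrate a simplified version of the project and explain how it works.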

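Choosing the Right Data for Query Resolution

The most important factor to consider before building a RAG-based query resolution system is the data, specifically the kind of data that retrieval can use effectively. A well-structured knowledge base is essential because the accuracy and relevance of the responses depend on the quality of the available data. These are the main data types to consider for different purposes:

Customer support data: FAQs, troubleshooting guides, product manuals, and past customer interactions.
Sales and marketing data: Product catalogs, pricing details, competitor analyses, and customer inquiries.
Internal knowledge base: Company policies, training documents, and standard operating procedures (SOPs).
Financial and legal documents: Compliance guidelines, financial reports, and regulatory policies.
User-generated content: Forum discussions, chat logs, and feedback forms, which provide real user queries.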

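Choosing the right data source was critical for our learner query resolution system, since it determines how accurate and relevant the replies are. I started by trying different kinds of data to see which gave the best results. First I used PowerPoint slides (PPTs), but they did not produce answers as comprehensive as I expected. Next I added FAQs, which improved accuracy but lacked sufficient context. Then I tested past discussions, which made the answers more relevant by drawing on learners' earlier interactions. The most effective source, however, turned out to be the subtitles of the course videos, because they provide detailed, structured content that relates directly to what learners ask. This approach gives fast, relevant answers, which makes it well suited to e-learning platforms and educational support systems.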

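Structuring the Query Resolution System

Before writing code, it is important to lay out the system's structure. The best way to do that is to define the key tasks the system must perform.

The system handles three main tasks:

Extract and store course content from subtitles (SRT files).
Retrieve relevant course material based on the learner's query.
Generate a structured reply using an AI agent.

To do this, the system is split into three components, each responsible for one function. This keeps it efficient and scalable.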

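The system includes:

Subtitle processing: Extracts text from SRT files, processes it, and stores the embeddings in ChromaDB.
Retrieval: Searches for and retrieves relevant course material based on the learner's query.
Query-answering agent: Uses CrewAI to generate structured, accurate answers.

Together, these components provide efficient query resolution, personalized replies, and smooth content retrieval.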

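Implementation Steps

With the structure defined, we can begin the implementation.

1. Importing Libraries

To build the AI-powered learning support system, we first import the libraries we need.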

import os
import ast
import pysrt
import pandas as pd
# Note: recent LangChain releases move these classes into separate packages
# (for example langchain_openai and langchain_community); adjust to your installed version.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from crewai import Agent, Task, Crew

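Here is what each library does.

pysrt: Extracts text from SRT subtitle files.
langchain.text_splitter.RecursiveCharacterTextSplitter: Splits long text into smaller chunks for better retrieval.
langchain.schema.Document: Represents a structured text document.
langchain.embeddings.OpenAIEmbeddings: Converts text into numeric vectors for similarity search.
langchain.vectorstores.Chroma: Stores embeddings in a vector database for efficient retrieval.
crewai (Agent, Task, Crew): Defines the AI agent that handles learner queries.
pandas: Handles structured data as DataFrames.
ast: Parses string-encoded data structures into Python objects.
os: Provides system-level operations, such as walking directories and setting environment variables.
tqdm: Shows a progress bar during long-running tasks (optional; the simplified code in this article does not use it).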

2. Setting Up the Environment

To use OpenAI's API, we load the API key and configure the model settings.

Step 1: Read the API key from a local text file.

with open('/home/janvi/Downloads/openai.txt', 'r') as file:
    openai_api_key = file.read().strip()  # strip() drops a trailing newline

Step 2: Store the API key as an environment variable so that other components can access it.

os.environ['OPENAI_API_KEY'] = openai_api_key

Step 3: Specify the OpenAI chat model. CrewAI reads OPENAI_MODEL_NAME to choose the LLM that its agents use; the embedding model is configured separately through OpenAIEmbeddings.

os.environ["OPENAI_MODEL_NAME"] = 'gpt-4o-mini'

With these settings in place, both the embedding step and the agent can call the OpenAI API.
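Hard-coding a key file path works for a demo, but it is easy to break. The sketch below shows one possible alternative: prefer a key that is already in the environment and fall back to a file. The helper name `load_api_key` and its parameters are hypothetical, not part of the original project.

```python
import os
from pathlib import Path

def load_api_key(key_file=None, env_var="OPENAI_API_KEY"):
    """Return an API key from the environment, falling back to a text file."""
    key = os.environ.get(env_var, "").strip()
    if not key and key_file is not None:
        # strip() removes the trailing newline most editors add to the file
        key = Path(key_file).read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"No API key found in ${env_var} or {key_file!r}")
    return key
```

Failing loudly when no key is found tends to be easier to debug than an authentication error raised later, deep inside an API call.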

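3. Extracting and Storing Subtitle Data

Subtitles capture what is actually said in the video lectures, which makes them a rich source of structured content for an AI-based retrieval system. Extracting and processing them well lets the system search and retrieve relevant information efficiently when it answers learners' questions.

Step 1: Extract text from SRT files

We use pysrt to read the SRT files and pull out the subtitle text, so that the content is ready for further processing and storage.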

def extract_text_from_srt(srt_path):
    """Extracts text from an SRT subtitle file using pysrt."""
    subs = pysrt.open(srt_path)
    text = " ".join(sub.text for sub in subs)
    return text
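To make clear what "extracting text" means here, the following standard-library sketch strips the cue numbers and timestamps from a small, made-up SRT sample and keeps only the spoken lines. It is an illustration of the file format, not a description of how pysrt works internally, and `SAMPLE_SRT` and `srt_to_text` are hypothetical names.

```python
import re

SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,000
Welcome to the course.

2
00:00:03,500 --> 00:00:06,000
Today we cover
gradient descent.
"""

TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def srt_to_text(srt_content):
    """Keep only the spoken text: drop cue numbers, timestamps, and blank lines."""
    kept = []
    for line in srt_content.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return " ".join(kept)

print(srt_to_text(SAMPLE_SRT))  # Welcome to the course. Today we cover gradient descent.
```

One limitation of this simplification: a subtitle line made up only of digits would be mistaken for a cue number and dropped, which is one reason to prefer a dedicated parser such as pysrt.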

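Because a course can have several subtitle files, we organize the course material in predefined folders and iterate over them. This lets us extract and process the text of every file in one pass.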

# Define course names and their respective folder paths.
# Raw strings (r"...") keep the Windows backslashes literal.
course_folders = {
    "Introduction to Deep Learning using PyTorch": r"C:\M\Code\GAI\Learn_queries\Subtitle_Introduction_to_Deep_Learning_Using_Pytorch",
    "Building Production-Ready RAG systems using LlamaIndex": r"C:\M\Code\GAI\Learn_queries\Subtitle of Building Production-Ready RAG systems using LlamaIndex",
    "Introduction to LangChain - Building Generative AI Apps & Agents": r"C:\M\Code\GAI\Learn_queries\Subtitle_introduction_to_langchain_using_agentic_ai",
}
# Dictionary to store course names and their respective .srt file paths
course_srt_files = {}
# Iterate through course folder mappings
for course, folder_path in course_folders.items():
    srt_files = []
    # Walk through the directory to find .srt files
    for root, _, files in os.walk(folder_path):
        srt_files.extend(os.path.join(root, file) for file in files if file.endswith(".srt"))
    # Add to dictionary if there are .srt files
    if srt_files:
        course_srt_files[course] = srt_files

The extracted text forms the foundation of our AI-powered learning support system and is what retrieval and query resolution operate on.
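The same discovery step could also be written with pathlib, as sketched below. This is an alternative formulation rather than the article's code: `find_srt_files` is a hypothetical helper, and sorting the paths is an added assumption that makes the file order reproducible, since os.walk gives no ordering guarantee.

```python
from pathlib import Path

def find_srt_files(course_folders):
    """Map each course name to a sorted list of .srt paths found under its folder."""
    course_srt_files = {}
    for course, folder in course_folders.items():
        # rglob searches the folder and all of its subfolders
        srt_files = sorted(str(p) for p in Path(folder).rglob("*.srt"))
        if srt_files:  # skip courses that have no subtitle files
            course_srt_files[course] = srt_files
    return course_srt_files
```

A stable file order matters here because the subtitle texts are later concatenated per course, so the order of files determines the order of the content inside the chunks.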

Step 2: Store the subtitles in ChromaDB

This part covers how the course subtitles are stored in ChromaDB: text chunking, embedding generation, and persistence. Cost estimation follows in Step 3.

a. Persistent directory for ChromaDB

persist_directory is the path of the folder where the stored data is saved. It lets us keep the embeddings after the program restarts. Without it, the database would be reset after every run.

persist_directory = "./subtitles_db"

ChromaDB serves as the vector database, storing and retrieving embeddings efficiently.

b. Split the text into smaller chunks

Large documents, such as the subtitles of an entire course, exceed the token limit of the embedding model. To handle this, we use RecursiveCharacterTextSplitter to split the text into smaller, overlapping chunks, which also improves search accuracy.

# Text splitter to break documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

Each chunk is at most 1,000 characters long, so the text is divided into manageable pieces. To preserve context across chunk boundaries, up to 200 characters from the end of one chunk are repeated at the start of the next. This overlap helps keep important details intact and improves retrieval accuracy.
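The effect of chunk_size and chunk_overlap can be seen in a simplified sliding-window splitter. Note the assumption: RecursiveCharacterTextSplitter is separator-aware and prefers to break at paragraph, line, and word boundaries, so its chunks will not match this fixed-width version exactly. The function `split_with_overlap` is a hypothetical illustration of the overlap idea only.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character windows; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

With a 2,500-character input, the windows start at positions 0, 800, and 1,600, so each chunk shares its first 200 characters with the end of the one before it.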

c. Initialize the OpenAI embeddings

We need to convert text into numeric vector representations for similarity search. OpenAI's embeddings let us encode the course content in a format that can be searched efficiently.

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

Here, OpenAIEmbeddings() initializes the embedding model with the OpenAI API key (openai_api_key). Each text chunk will be converted into a high-dimensional vector.

d. Initialize ChromaDB

Now we set up the ChromaDB store that will hold these vector embeddings.

# Initialize Chroma vectorstore with persistent directory
vectorstore = Chroma(
    collection_name="course_materials",
    embedding_function=embeddings,
    persist_directory=persist_directory
)

collection_name="course_materials" creates a dedicated collection in ChromaDB for all course-related embeddings. embedding_function=embeddings specifies the OpenAI embedding function that converts text into numeric vectors. persist_directory=persist_directory ensures that all stored embeddings remain available in ./subtitles_db/ after the program restarts.

Step 3: Estimate the cost of storing the course data

Before adding documents to the vector database, it is worth estimating what the token usage will cost. OpenAI charges for embeddings per 1,000 tokens, so we compute the expected cost up front to keep expenses under control.

a. Define the pricing parameters

import time
# OpenAI Pricing (adjust based on the model being used)
COST_PER_1K_TOKENS = 0.0001  # Cost per 1K tokens for 'text-embedding-ada-002'
TOKENS_PER_CHUNK_ESTIMATE = 750  # Approximate tokens per 1000-character chunk
# Track total tokens and cost
total_tokens = 0
total_cost = 0
# Start timing
start_time = time.time()

COST_PER_1K_TOKENS = 0.0001 defines the cost in US dollars per 1,000 tokens for the OpenAI embedding model. TOKENS_PER_CHUNK_ESTIMATE = 750 assumes that each 1,000-character chunk contains about 750 tokens; this is a rough figure that likely overstates the count for English text, so the estimate errs on the high side. The total_tokens and total_cost variables accumulate the amount of data processed and the cost incurred during the run. start_time records when processing begins so that we can measure how long it takes.

b. Check for and add courses to ChromaDB

We want to avoid reprocessing courses that are already stored in the vector database. So we query ChromaDB to check whether a course already exists. If it is not found, we extract and store its subtitle data.

# Add new courses to the vectorstore if they don't already exist
for course, srt_list in course_srt_files.items():
    # Check if the course already exists in the vectorstore.
    # _collection is a private attribute of the LangChain wrapper and may change between versions.
    existing_docs = vectorstore._collection.get(where={"course": course})
    if not existing_docs['ids']:
        # Course not found, add it
        srt_texts = [extract_text_from_srt(srt) for srt in srt_list]
        course_text = "\n\n\n\n".join(srt_texts)  # Join SRT texts with four new lines
        doc = Document(page_content=course_text, metadata={"course": course})
        chunks = text_splitter.split_documents([doc])

The subtitles are extracted with the extract_text_from_srt() function. The texts of the individual subtitle files are then joined with four newlines (\n\n\n\n), so that file boundaries remain natural split points. A Document object is created to hold the full subtitle text together with its metadata (the course name). Finally, text_splitter.split_documents() splits the text into smaller chunks for efficient processing and retrieval.

c. Estimate token usage and cost

Before adding the chunks to ChromaDB, we estimate the cost. This code continues inside the if block of the loop above.

        # Estimate cost before adding documents
        chunk_count = len(chunks)
        batch_tokens = chunk_count * TOKENS_PER_CHUNK_ESTIMATE
        batch_cost = (batch_tokens / 1000) * COST_PER_1K_TOKENS
        total_tokens += batch_tokens
        total_cost += batch_cost

chunk_count is the number of chunks produced by splitting the text. batch_tokens estimates the total number of tokens from the chunk count. batch_cost is the estimated cost of processing the current course. total_tokens and total_cost accumulate these values across all courses to track overall processing and spend.

d. Add the chunks to ChromaDB

The loop body ends by storing the chunks; the else branch belongs to the existence check at the top of the loop.

        vectorstore.add_documents(chunks)
        print(f"Added course: {course} (Chunks: {chunk_count}, Cost: ${batch_cost:.4f})")
    else:
        print(f"Course already exists: {course}")

The processed chunks are stored in ChromaDB for efficient retrieval. For each course, a message reports the number of chunks added and the estimated processing cost.

Once all courses have been processed, we compute and display the final totals.

# End timing
end_time = time.time()
# Display cost and time
print("\nCourse Embeddings Update Completed! 🚀")
print(f"Total Chunks Processed: {total_tokens // TOKENS_PER_CHUNK_ESTIMATE}")
print(f"Estimated Total Tokens: {total_tokens}")
print(f"Estimated Cost: ${total_cost:.4f}")
print(f"Total Time Taken: {end_time - start_time:.2f} seconds")

The total processing time is end_time - start_time. The script then prints the number of chunks processed (recovered by dividing the token total by the per-chunk estimate), the estimated token usage, and the total cost, which together summarize the whole embedding run.

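Output:

[Figure: the final totals printed by the script]

The output shows that 739 chunks were processed in about 10 seconds, at an estimated cost of $0.0554.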
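The reported figure can be checked against the formula used in the loop. The sketch below simply replays the arithmetic with the two constants from the article and the 739 chunks from the output; `estimate_embedding_cost` is a hypothetical helper name.

```python
COST_PER_1K_TOKENS = 0.0001       # value used in the article
TOKENS_PER_CHUNK_ESTIMATE = 750   # value used in the article

def estimate_embedding_cost(chunk_count):
    """Return (estimated_tokens, estimated_cost_usd) for a number of chunks."""
    tokens = chunk_count * TOKENS_PER_CHUNK_ESTIMATE
    cost = (tokens / 1000) * COST_PER_1K_TOKENS
    return tokens, cost

tokens, cost = estimate_embedding_cost(739)
print(tokens, f"${cost:.4f}")  # 554250 $0.0554
```

So 739 chunks at 750 tokens each give 554,250 estimated tokens, and 554.25 thousand tokens at $0.0001 each come to roughly $0.0554, matching the output above.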

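4. Querying and Responding to Learner Queries

Once the subtitles are stored in ChromaDB, the system needs to retrieve relevant content whenever a learner submits a query. Retrieval is handled by similarity search, which identifies the stored text passages most relevant to the input query.

How it works

Query input: The learner submits a question about a course.
Filtering by course: The system restricts retrieval to the material of the relevant course.
Similarity search in ChromaDB: The query is converted into an embedding, and ChromaDB retrieves the most similar stored chunks.
Returning the top results: The system selects the three most relevant passages.
Formatting the output: The retrieved text is joined and presented as context for further processing.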

# Define retrieval tool with metadata filtering
def retrieve_course_materials(query: str, course: str):
    """Retrieves course materials filtered by course name."""
    filter_dict = {"course": course}
    results = vectorstore.similarity_search(query, k=3, filter=filter_dict)
    return "\n\n".join([doc.page_content for doc in results])

Example query:

course_name = "Introduction to Deep Learning using PyTorch"
question = "What is gradient descent?"
context = retrieve_course_materials(query=question, course=course_name)
print(context)

[Figure: retrieved context for the sample query]

The output contains the content retrieved from ChromaDB: it is restricted to the named course, and similarity search picks the passages most relevant to the question.

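Why use similarity search?

Semantic understanding: Unlike keyword search, similarity search finds text that is semantically related to the query.
Efficient retrieval: The system retrieves only the most relevant passages instead of scanning entire documents.
Better answers: Course filtering and relevance ranking give learners highly targeted content.

This mechanism ensures that when learners submit a question, they receive relevant, contextually accurate information from the stored course material.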
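The combination of a metadata filter and a top-k ranking can be illustrated without a vector database. In this sketch the three-dimensional vectors, the texts, and the `store` list are all made up for illustration; a real store such as ChromaDB holds model-generated vectors with many more dimensions and uses an index rather than a full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical store: (vector, text, metadata)
store = [
    ([0.9, 0.1, 0.0], "Gradient descent lowers the loss step by step.", {"course": "PyTorch"}),
    ([0.8, 0.2, 0.1], "The learning rate controls the step size.", {"course": "PyTorch"}),
    ([0.9, 0.1, 0.0], "Retrievers fetch chunks for the LLM.", {"course": "LlamaIndex"}),
    ([0.0, 0.1, 0.9], "Tensors are multi-dimensional arrays.", {"course": "PyTorch"}),
]

def similarity_search(query_vec, k=3, course=None):
    """Filter by course first, then return the k most similar texts."""
    candidates = [row for row in store if course is None or row[2]["course"] == course]
    ranked = sorted(candidates, key=lambda row: cosine(query_vec, row[0]), reverse=True)
    return [text for _, text, _ in ranked[:k]]

print(similarity_search([1.0, 0.0, 0.0], k=2, course="PyTorch"))
```

The LlamaIndex chunk has a vector identical to the best PyTorch match, yet it never appears in the results, which shows why the course filter matters when several courses cover overlapping topics.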

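5. Implementing the AI Query-Answering Agent

Once the relevant course material has been retrieved from ChromaDB, the next step is to use an AI agent to give the learner a meaningful answer. We use CrewAI to define an agent that analyzes the query and generates a well-structured reply.

Let's see how it works.

Step 1: Define the agent

The query-answering agent is created with a clear role and backstory, which guide its behavior when it answers learner queries.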

# Define the agent with a well-structured role and backstory
query_answer_agent = Agent(
    role="Learning Support Specialist",
    goal="You help learners with their queries with the best possible response",
    backstory="""You lead the Learners Query resolution department of
an Ed tech company focussed on self paced courses on topics related to
Data Science, Machine Learning and Generative AI. You respond to learner
queries related to course content, assignments, technical and administrative issues.
You are polite, diplomatic and take ownership of things which could be
improved in your organisation.
""",
    verbose=False,
)

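Here is what happens in this code. First, we set the role to "Learning Support Specialist", because the agent acts as a virtual tutor answering students' questions. Next, we define the goal, which directs the agent to give the best possible response to each query. The backstory describes the agent's department, the subject areas it covers, and the polite, diplomatic tone it should take. Finally, we set verbose=False so that execution stays quiet unless we need to debug. A clearly defined agent like this helps keep replies helpful, well organized, and in line with the tone of an education platform.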

Step 2: Define the task

With the agent defined, we need to assign it a task.

query_answering_task = Task(
    description="""
Answer the learner queries to the best of your abilities. Try to keep your response concise with less than 100 words.
Here is the query: {query}
Here is similar content from the course extracted from subtitles, which you should use only when required: {relevant_content} .
Since this content is extracted from course subtitles, there may be spelling errors, make sure to correct these, while using this information in your response.
There may be some previous discussion with the learner on this thread. Here is the python list of past discussions: {thread} .
In this thread, the content which starts with 'learner' is the question by the student and the content which starts with 'support'
is the response given by you. Use this past discussion appropriately to come with a great reply.
This is the full name of the learner: {learner_name}
Address each learner by their first name, if you are not sure what the first name is, simply start with Hi.
Also mention some appropriate and encouraging comforting lines at the end of the response, like "hope you found this helpful",
"I hope this information is useful. Keep up the great work!", "Glad to assist! Feel free to reach out anytime." etc.
If you are not sure about the answer mention - "Sorry, I am not sure about this, I will get back to you"
""",
    expected_output="A crisp accurate response to the query",
    agent=query_answer_agent,
)

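Let's break down the task given to the agent. The {query} placeholder carries the learner's question, and the reply should be concise (under 100 words) and accurate. The {relevant_content} placeholder carries the passages retrieved from the subtitles stored in ChromaDB. The agent is told to use this content only when needed and to correct any spelling errors before working it into the reply.

If there are past discussions, {thread} helps maintain continuity. Learner messages start with "learner" and earlier replies start with "support", so the agent can answer in context. {learner_name} enables personalization: the agent addresses learners by their first name and falls back to "Hi" if it is unsure.

To make replies more engaging, the agent adds a positive closing line such as "Hope you found this helpful" or "Feel free to reach out anytime." If the agent is unsure of an answer, it says so explicitly: "Sorry, I am not sure about this, I will get back to you." This approach yields polite, clear, and well-structured replies, which builds learner engagement and trust.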
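The placeholders are filled from the inputs passed when the crew is started. As a rough mental model only, the substitution resembles Python's str.format, as in the sketch below. CrewAI performs its own interpolation, so this is an analogy rather than its actual mechanism, and the shortened `TEMPLATE` and the learner name are hypothetical.

```python
TEMPLATE = (
    "Here is the query: {query}\n"
    "Relevant course content: {relevant_content}\n"
    "Past discussion: {thread}\n"
    "Learner name: {learner_name}"
)

inputs = {
    "query": "What is gradient descent?",
    "relevant_content": "Gradient descent minimises the loss iteratively.",
    "thread": [],                 # no earlier messages on this thread
    "learner_name": "Asha Rao",   # hypothetical learner
}

prompt = TEMPLATE.format(**inputs)
print(prompt)
```

Every placeholder named in the task description needs a matching key in the inputs, which is why the article's code passes query, relevant_content, thread, and learner_name together.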

Step 3: Initialize the CrewAI instance

Now that we have an agent and a task, we can initialize a Crew so that the agent can process queries dynamically.

# Create the Crew
response_crew = Crew(
    agents=[query_answer_agent],
    tasks=[query_answering_task],
    verbose=False
)

The agents=[query_answer_agent] argument adds the Learning Support Specialist agent to the crew. tasks=[query_answering_task] assigns the query-answering task to it. Setting verbose=False keeps output to a minimum unless debugging is needed. Because the crew is reusable, the same agent and task can be run for each learner query in turn, which makes the setup practical for handling many queries.

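Why use CrewAI for query answering?

Structured replies: Each reply is well organized and informative.
Context awareness: Retrieved course material and past discussions are used to improve reply quality.
Scalability: The same crew can be run once per query, so many queries can be processed with a single definition.
Efficiency: A streamlined query resolution workflow shortens response time.

With this AI-powered answering system, learners receive well-informed replies tailored to their specific queries.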

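Step 4: Generate replies for multiple learner queries

With the agent in place, it needs to process learner queries stored in a structured dataset.

The code below processes learner queries stored in a CSV file and generates replies with the agent. It first loads the dataset, which contains the learners' queries, course details, and discussion threads. The reply_to_query function extracts the relevant details, such as the learner's name, the course name, and the current query. If earlier discussion exists, it loads that thread as context. If the query contains an image, it is skipped. The function then fetches relevant course material from ChromaDB and sends the query, the relevant content, and the past discussion to the agent to generate a structured reply.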

df = pd.read_csv(filepath_or_buffer=r'C:\M\Code\GAI\Learn_queries/filtered_data_top3_courses.csv')
def reply_to_query(df_new, index=1):
    """Generate and print a reply for the query stored at the given row index."""
    row = df_new.iloc[index]
    learner_name = row["thread_starter"]
    course_name = row["course"]
    # Load earlier messages only when the thread has previous replies
    if row['number_of_replies'] > 1:
        thread = ast.literal_eval(row["modified_thread"])
    else:
        thread = []
    question = row["current_query"]
    # Skip queries that include an image; the text-only pipeline cannot read it
    if row['has_image'] == True:
        return " "
    context = retrieve_course_materials(query=question, course=course_name)
    response_result = response_crew.kickoff(inputs={"query": question, "relevant_content": context, "thread": thread, "learner_name": learner_name})
    print('Q: ', question)
    print('\n')
    print('A: ', response_result)
    print('\n\n')
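The thread column holds a Python list that was saved as a string, which is why the function uses ast.literal_eval rather than eval: it only accepts literals and will not execute code. A slightly more defensive version of that step might look like the sketch below. The helper name `parse_thread` and the choice to fall back to an empty list on malformed cells are assumptions, not part of the original code.

```python
import ast

def parse_thread(raw_thread, number_of_replies):
    """Turn the stored string form of a thread into a list; empty list if unusable."""
    if number_of_replies <= 1 or not isinstance(raw_thread, str):
        return []
    try:
        parsed = ast.literal_eval(raw_thread)
    except (ValueError, SyntaxError):
        return []  # malformed cell: fall back to no history
    return parsed if isinstance(parsed, list) else []

print(parse_thread("['learner: What is a tensor?', 'support: An n-dimensional array.']", 2))
```

Falling back to an empty history means a single badly formatted row costs the reply some context instead of raising an exception for that query.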

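Test the function by running it on a single query (index=1):

reply_to_query(df, index=1)

[Figure: generated reply for a single learner query]

As we can see, it works for a single index.

Now we iterate over all queries, handling potential errors as each one is processed. This automates query resolution across the whole dataset.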

for i in range(len(df)):
    try:
        reply_to_query(df, index=i)
    except Exception as exc:
        # Report the failure and move on to the next query
        print("Error in index number: ", i, "-", exc)
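When the dataset is large, printing errors as they occur makes it hard to see afterwards which queries still need attention. One possible refinement, sketched here with hypothetical names (`process_all`, `flaky_handler`), is to collect the failures and report them at the end so that they can be retried.

```python
def process_all(items, handler):
    """Run handler on every item; return (succeeded_indices, failures)."""
    succeeded, failures = [], []
    for i, item in enumerate(items):
        try:
            handler(item)
            succeeded.append(i)
        except Exception as exc:  # keep going, but remember what failed and why
            failures.append((i, repr(exc)))
    return succeeded, failures

def flaky_handler(item):
    """Hypothetical stand-in for reply_to_query that fails on empty input."""
    if not item:
        raise ValueError("empty query")

ok, failed = process_all(["q1", "", "q3"], flaky_handler)
print(ok, failed)
```

In the article's setting, the handler would wrap the call to reply_to_query for one row, and the failures list would identify the rows to rerun.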

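Why does this step matter?

Automated query handling: The system works through many learner queries without manual effort.
Contextual relevance: Replies are generated from the retrieved course material and past discussions.
Scalability: The same loop lets the agent process and answer thousands of queries.
Better learning support: Learners receive personalized, data-driven replies.

This step ensures that every learner query is analyzed and answered in context, which improves the overall learning experience.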

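Output:

[Figure: automated replies to learner queries]

The output shows that query handling is now automated: each question is printed, followed by its answer.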

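Future Improvements

Several improvements could take this RAG-based query resolution system further:

FAQs and their solutions: A structured FAQ component within the query resolution framework would answer common questions instantly and reduce reliance on live support.
Image processing: The ability to analyze and extract information from images (such as screenshots, diagrams, or scanned documents) would make the system more versatile and more useful in education and customer support.
Better image-column flag: The logic behind the boolean image column could be improved so that image-based queries are identified and handled more accurately.
Semantic chunking and other chunking techniques: Experimenting with strategies such as semantic chunking, fixed-length splitting, and hybrid approaches could improve retrieval accuracy and the contextual quality of responses.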

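Summary

This RAG-based query resolution system uses LangChain, ChromaDB, and CrewAI to automate learner support. It extracts subtitles, stores them as embeddings in ChromaDB, and retrieves relevant content with similarity search. A CrewAI agent processes each query, takes past discussions into account, and generates a structured reply that is accurate and personalized.

The system improves scalability, retrieval efficiency, and reply quality, making self-paced learning more interactive. Future improvements include multimodal support, better retrieval optimization, and enhanced reply generation. By automating query resolution, the system streamlines learning support, gives learners faster, context-aware responses, and raises overall engagement.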

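Frequently Asked Questions

Q1. What is LangChain, and why is it used in this project?

A. LangChain is a framework for building applications powered by large language models (LLMs). It helps with processing text, retrieving it, and generating responses from it. In this project, LangChain is used to split text into chunks, generate embeddings, and retrieve course material efficiently.

Q2. How does ChromaDB store and retrieve course content?

A. ChromaDB is a vector database that stores and retrieves embeddings. The course material is stored as numeric vectors, so that when a learner submits a query, similarity-based search can find the relevant content.

Q3. What role does CrewAI play in answering learner queries?

A. CrewAI lets us create AI agents that carry out tasks. In this project, it powers the Learning Support Specialist agent, which receives the retrieved course material and the past discussion and generates a structured reply to the learner's query.

Q4. Why are OpenAI embeddings used for text processing?

A. OpenAI embeddings convert text into numeric vectors, which makes similarity search straightforward. This helps retrieve the course material relevant to a learner's query.

Q5. How does the system process subtitles (SRT files)?

A. The system uses pysrt to extract text from subtitle (SRT) files. The extracted content is then chunked, embedded with OpenAI embeddings, and stored in ChromaDB so that it can be retrieved when needed.

Q6. Can the system handle multiple queries?

A. Yes. The same crew is reused for every query, so the system can work through a whole dataset of learner queries automatically. In the code shown here, the queries are processed one after another in a loop.

Q7. What future improvements are planned for this system?

A. Future improvements include multimodal support for images and video, better retrieval optimization, and improved reply generation techniques, all aimed at more accurate and more contextual replies.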
