Building a Financial Report Retrieval System with LlamaIndex and Gemini 2.0

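Financial reports are essential for assessing a company's health. They often run to hundreds of pages, which makes it hard to extract specific insights efficiently. Analysts and investors spend hours combing through balance sheets, income statements, and footnotes just to answer simple questions such as: what was the company's revenue in 2024? With recent advances in LLMs and vector search, we can automate financial report analysis using LlamaIndex and related frameworks. This post explores how to build a financial RAG system with LlamaIndex, ChromaDB, Gemini 2.0, and Ollama that accurately answers queries against lengthy reports.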

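Learning Objectives

Understand why efficient analysis calls for a financial report retrieval system.
Learn how to preprocess and vectorize financial reports with LlamaIndex.
Explore ChromaDB as a vector database for document retrieval.
Implement query engines for financial data analysis with Gemini 2.0 and Llama 3.2.
Explore advanced query routing with LlamaIndex for richer insights.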

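Why Do We Need a Financial Report Retrieval System?

Financial reports contain key information about a company's performance, including revenue, expenses, liabilities, and profitability. They are also long and full of specialist terminology, so extracting the relevant information by hand is time-consuming for analysts, investors, and executives.

A financial report retrieval system automates this process through natural-language queries. Instead of searching through PDF files, a user can simply ask "What was the revenue in 2023?" or "Summarize the liquidity concerns for 2023". The system retrieves and summarizes the relevant sections, saving the time otherwise spent on manual lookup.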

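Project Implementation

To implement the project, we first set up the environment and install the required libraries.

Step 1: Set Up the Environment

First, create a conda environment for development.

$ conda create --name finrag python=3.12
$ conda activate finrag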

Step 2: Install the Required Python Libraries

Install the libraries the project depends on:

$ pip install llama-index llama-index-vector-stores-chroma chromadb
$ pip install llama-index-llms-gemini llama-index-llms-ollama
$ pip install llama-index-embeddings-gemini llama-index-embeddings-ollama
$ pip install python-dotenv nest-asyncio pypdf

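Step 3: Create the Project Directory

Create a project directory, and inside it a file named .env that holds your API keys, so they stay out of the source code.

# in the .env file
GOOGLE_API_KEY="<your-api-key>"

We load environment variables from the .env file so that sensitive API keys are never hard-coded in the notebook. This keeps the Gemini API (Google API) key out of the code itself.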
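
To make the role of the .env file concrete, here is a minimal standard-library sketch of the kind of parsing involved. It is a simplification for illustration, not python-dotenv's actual implementation (which handles many more cases), and the function name `parse_env_text` is hypothetical.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments (simplified)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace, then one pair of matching quotes.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


sample = '# in the .env file\nGOOGLE_API_KEY="<your-api-key>"\n'
print(parse_env_text(sample))
```

In the project itself, `load_dotenv()` does this work and places the values in the process environment.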

We will work in a Jupyter Notebook. Create a notebook file and follow the steps below.

Step 4: Load the API Key

Now load the API key:

import os
from dotenv import load_dotenv
load_dotenv()
GEMINI_API_KEY = os.getenv("GOOGLE_API_KEY")
# Uncomment only to check that .env is being read correctly.
# print(f"GEMINI_API_KEY: {GEMINI_API_KEY}")
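
Printing the full key, even temporarily, risks leaving it in notebook output. As an alternative, the sketch below fails early when the key is missing and prints only a masked form. The helper names `mask_secret` and `require_key` are hypothetical, and the demo key is a placeholder.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Return the secret with all but the last `visible` characters hidden."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


def require_key(env, name: str = "GOOGLE_API_KEY") -> str:
    """Return the key from `env`, raising a clear error if it is missing or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value


# Placeholder mapping for illustration; in the notebook, pass os.environ instead.
demo_env = {"GOOGLE_API_KEY": "not-a-real-key-1234"}
key = require_key(demo_env)
print(mask_secret(key))
```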

With the environment ready, we can move on to the next and most important stage.

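Processing Documents with LlamaIndex

Collect the financial report of Motorsport Games Inc. from the AnnualReports website, where it is available for download.

The first page looks like this:

[Figure: first page of the Motorsport Games financial report. Source: Report]

The full report runs to 123 pages, but we only need the financial statements, so I extracted those pages into a new PDF for this project.

How? It is straightforward with the pypdf library.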

from pypdf import PdfReader, PdfWriter
reader = PdfReader("NASDAQ_MSGM_2023.pdf")
writer = PdfWriter()
# reader.pages is zero-indexed: indices 66 to 103 hold the financial statements.
pages_to_extract = range(66, 104)
for page_num in pages_to_extract:
    writer.add_page(reader.pages[page_num])
output_pdf = "Motorsport_Games_Financial_report.pdf"
with open(output_pdf, "wb") as outfile:
    # Write to the open file handle rather than passing the path a second time.
    writer.write(outfile)
print(f"New PDF created: {output_pdf}")

The new report file has only 38 pages, which makes embedding it much faster.
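
Because `reader.pages` in pypdf is indexed from zero, it is easy to be off by one when translating page numbers seen in a PDF viewer into a `range`. The small helper below (a hypothetical name, `pages_to_indices`) shows the arithmetic and confirms the page count.

```python
def pages_to_indices(first_page: int, last_page: int) -> range:
    """Map an inclusive, 1-based page span to 0-based indices for reader.pages."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return range(first_page - 1, last_page)


indices = pages_to_indices(67, 104)
print(indices, len(indices))  # range(66, 104) 38
```

So `range(66, 104)` selects the 67th through 104th pages when counting from one, 38 pages in total.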

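Loading and Splitting the Financial Report

Place the newly created Motorsport_Games_Financial_report.pdf in the project's data directory; this is the file we will index.

Financial reports usually come as PDFs containing large amounts of tabular data, footnotes, and legal statements. We use LlamaIndex's SimpleDirectoryReader to load the file and convert it into Document objects.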

from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()

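Because the report is too long to process as a single document, we split it into smaller chunks, or nodes. Each chunk corresponds to a page or section, which makes retrieval more efficient.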

from copy import deepcopy
from llama_index.core.schema import TextNode
def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes, by separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            # Each node gets its own copy of the document metadata.
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)
    return nodes
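
To see what this splitting does without loading a PDF, the sketch below applies the same idea to a plain string, using dictionaries in place of TextNode objects. The function name `split_pages` and the sample text are illustrative assumptions. It also shows why the metadata is deep-copied: changing one chunk's metadata leaves the others untouched.

```python
from copy import deepcopy


def split_pages(text: str, metadata: dict, separator: str = "\n---\n") -> list[dict]:
    """Split text on the separator; each chunk gets its own copy of the metadata."""
    return [
        {"text": chunk, "metadata": deepcopy(metadata)}
        for chunk in text.split(separator)
    ]


sample = "Balance sheet\n---\nIncome statement\n---\nCash flows"
chunks = split_pages(sample, {"file_name": "report.pdf", "tags": []})
chunks[0]["metadata"]["tags"].append("assets")  # mutate only the first copy
print(len(chunks), chunks[1]["metadata"]["tags"])  # 3 []
```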

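The figure below illustrates the document ingestion process.

[Figure: document ingestion process]

Our financial data is now ready to be vectorized and stored for retrieval.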

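Building the Vector Database with ChromaDB

We will use ChromaDB to build a local vector database. The embeddings of our financial text will be stored in ChromaDB.

We initialize the vector database and configure the nomic-embed-text model through Ollama to generate embeddings locally.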

import chromadb
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import Settings
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("financial_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

Finally, we create a vector index with LlamaIndex's VectorStoreIndex. This index connects our vector database to LlamaIndex's query engine.

from llama_index.core import VectorStoreIndex, StorageContext
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex.from_documents(documents=documents, storage_context=storage_context, embed_model=embed_model)

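The code above builds a vector index from the financial documents using nomic-embed-text. This takes some time, depending on your local machine's specifications.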
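
For intuition about what the vector index does at query time, here is a toy, standard-library sketch of nearest-neighbour search over embeddings. The three-dimensional vectors and note titles are made up for illustration. Cosine similarity is one common measure; the metric actually used depends on how the Chroma collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up embeddings; a real model such as nomic-embed-text produces far longer vectors.
stored = [
    ("Revenue note", [0.9, 0.1, 0.0]),
    ("Lease note", [0.0, 0.2, 0.9]),
    ("Net loss note", [0.7, 0.6, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], stored)
for score, text in results:
    print(f"{score:.3f}  {text}")
```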

Once indexing is complete, you can reuse the stored embeddings whenever needed with the code below, without re-indexing.

vector_index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, embed_model=embed_model
)

This lets you load the embeddings that ChromaDB has persisted to disk.
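
One way to choose between building the index and reloading it is to check whether the collection already holds any embeddings. The sketch below assumes only that the collection object exposes a `count()` method, as Chroma collections do; `needs_indexing` and `FakeCollection` are hypothetical names, and the fake class exists purely so the example runs on its own.

```python
def needs_indexing(collection) -> bool:
    """True when the collection holds no embeddings yet."""
    return collection.count() == 0


class FakeCollection:
    """Stand-in for a Chroma collection, used only to make this sketch runnable."""

    def __init__(self, n_items: int):
        self._n_items = n_items

    def count(self) -> int:
        return self._n_items


print(needs_indexing(FakeCollection(0)), needs_indexing(FakeCollection(38)))  # True False
```

In the notebook, you could call `needs_indexing(chroma_collection)` and run `VectorStoreIndex.from_documents` only when it returns True, falling back to `VectorStoreIndex.from_vector_store` otherwise.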

The heavy lifting is done; now it is time to query the report and relax.

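Querying Financial Data with Gemini 2.0

Once the financial data is indexed, we can ask natural-language questions and get answers grounded in the report. We use the Gemini 2.0 Flash model for querying: the query engine retrieves the relevant sections from the vector database, and the model generates a response from them.

Setting Up Gemini 2.0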

from llama_index.llms.gemini import Gemini
llm = Gemini(api_key=GEMINI_API_KEY, model_name="models/gemini-2.0-flash")

Creating the Query Engine with Gemini 2.0 and the Vector Index

query_engine = vector_index.as_query_engine(llm=llm, similarity_top_k=5)
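
Conceptually, the query engine retrieves the `similarity_top_k` most similar chunks and passes them to the LLM together with the question. The sketch below shows that assembly step in its simplest form. The prompt wording and the function name `build_prompt` are assumptions for illustration, not LlamaIndex's actual prompt template, and the chunks are placeholders.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks into a context block followed by the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What was the revenue in 2022?",
    ["Placeholder chunk about revenues.", "Placeholder chunk about net loss."],
)
print(prompt)
```

A larger `similarity_top_k` gives the model more context to draw on, at the cost of a longer prompt.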

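Example Queries and Responses

Below are several queries and their responses.

Query 1

response = query_engine.query("What is the revenue for the year ended December 31, 2022?")
print(str(response))

Response

[Figure: response to Query 1, based on the report]

Corresponding image from the report:

[Figure: source in the report for the response to Query 1]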
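
To check an answer against the report yourself, it helps to look at which chunks were retrieved. LlamaIndex responses expose these as `response.source_nodes`, where each item carries a `node` and a `score`. The sketch below formats them. The helper name `format_sources` is hypothetical, the `page_label` metadata key is an assumption about what the PDF loader records, and the stand-in objects exist only so the example runs without a model.

```python
from types import SimpleNamespace


def format_sources(response) -> list[str]:
    """Describe each retrieved chunk: similarity score, page label, text preview."""
    lines = []
    for item in response.source_nodes:
        page = item.node.metadata.get("page_label", "?")
        preview = item.node.get_content()[:60].replace("\n", " ")
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"score={score} page={page} text={preview!r}")
    return lines


# Stand-in objects that mimic the attributes used above, for illustration only.
fake_node = SimpleNamespace(
    metadata={"page_label": "3"},
    get_content=lambda: "Placeholder text\nfrom a retrieved page.",
)
fake_response = SimpleNamespace(
    source_nodes=[SimpleNamespace(node=fake_node, score=0.8123)]
)
print(format_sources(fake_response))
```

In the notebook, `format_sources(response)` can be called right after any `query_engine.query(...)` call.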

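Query 2

response = query_engine.query(
    "What is the Net Loss Attributable to Motorsport Games Inc. for the year ended December 31, 2022?"
)
print(str(response))

Response

[Figure: response to Query 2, based on the report]

Corresponding image from the report:

[Figure: source in the report for the response to Query 2]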

Query 3

response = query_engine.query(
    "What are the Liquidity and Going concern for the Company on December 31, 2023"
)
print(str(response))

Response

[Figure: response to Query 3, based on the report]

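Query 4

response = query_engine.query(
    "Summarise the Principal versus agent considerations of the company?"
)
print(str(response))

Response

[Figure: response to Query 4, based on the report]

Corresponding image from the report:

[Figure: source in the report for the response to Query 4]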

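Query 5

response = query_engine.query(
    "Summarise the Net Loss Per Common Share of the company with financial data?"
)
print(str(response))

Response

[Figure: response to Query 5, based on the report]

Corresponding image from the report:

[Figure: source in the report for the response to Query 5]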

Query 6

response = query_engine.query(
    "Summarise Property and equipment consist of the following balances as of December 31, 2023 and 2022 of the company with financial data?"
)
print(str(response))

Response

[Figure: response to Query 6, based on the report]

Corresponding image from the report:

[Figure: source in the report for the response to Query 6]

Query 7

response = query_engine.query(
    "Summarise The Intangible Assets on December 31, 2023 of the company with financial data?"
)
print(str(response))

Response

[Figure: response to Query 7, based on the report]

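Query 8

response = query_engine.query(
    "What are leases of the company with yearwise financial data?"
)
print(str(response))

Response

[Figure: response to Query 8, based on the report]

Corresponding image from the report:

[Figure: source in the report for the response to Query 8]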

Local Querying with Llama 3.2

We can also query the financial report locally with Llama 3.2, without relying on a cloud-hosted model.

Setting Up Llama 3.2:1b

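from llama_index.llms.ollama import Ollama
# A generous timeout, since a local model can be slow to respond.
local_llm = Ollama(model="llama3.2:1b", request_timeout=1000.0)
local_query_engine = vector_index.as_query_engine(llm=local_llm, similarity_top_k=3)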
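
Local models can be much slower than a hosted API, which is why the timeout above is so generous. If you want to compare the two engines on your own machine, a small timing wrapper is enough. The sketch below uses a stand-in function in place of a real query so that it runs without a model; `timed` and `fake_query` are hypothetical names.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for a query engine's query method, so the sketch runs without a model.
def fake_query(question: str) -> str:
    time.sleep(0.01)
    return f"answer to: {question}"


answer, seconds = timed(fake_query, "What are the leases?")
print(answer, f"({seconds:.2f}s)")
```

In the notebook, `timed(local_query_engine.query, "...")` and `timed(query_engine.query, "...")` would give comparable measurements for the same question.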

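Query 9

response = local_query_engine.query(
    "Summary of chart of Accrued expenses and other liabilities using the financial data of the company"
)
print(str(response))

Response

[Figure: response to Query 9, based on the report]

Corresponding image from the report:

[Figure: source in the report for the response to Query 9]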

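Advanced Query Routing with LlamaIndex

Sometimes we need both detailed retrieval and summarized insights. We can get both by combining a vector index with a summary index.

The vector index is for precise document retrieval.
The summary index is for concise financial summaries.

We have already built the vector index. Now we create a summary index, which summarizes the financial statements hierarchically.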

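from llama_index.core import SummaryIndex
# Build the page-level nodes with get_page_nodes (defined earlier) before indexing.
page_nodes = get_page_nodes(documents)
summary_index = SummaryIndex(nodes=page_nodes)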

Next we bring in RouterQueryEngine, which decides for each query whether to answer from the summary index or from the vector index.

from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

Now create the summary query engine:

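# use_async=True needs nested event loops inside Jupyter; nest_asyncio enables them.
import nest_asyncio
nest_asyncio.apply()
summary_query_engine = summary_index.as_query_engine(
    llm=llm, response_mode="tree_summarize", use_async=True
)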
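
The idea behind hierarchical summarization can be sketched in a few lines: summarize small groups of chunks, then summarize the summaries, until a single text remains. The version below is a simplification under stated assumptions. It groups by a fixed `fan_in`, whereas a real implementation would typically pack text by token budget, and `fake_summarize` stands in for an LLM call.

```python
def tree_summarize(chunks: list[str], summarize, fan_in: int = 2) -> str:
    """Repeatedly summarize groups of `fan_in` texts until one summary remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        level = [
            summarize(level[i : i + fan_in]) for i in range(0, len(level), fan_in)
        ]
    return level[0]


# Stand-in summarizer: a real one would call an LLM on the grouped texts.
def fake_summarize(texts: list[str]) -> str:
    return "(" + " + ".join(texts) + ")"


print(tree_summarize(["p1", "p2", "p3", "p4", "p5"], fake_summarize))
```

The nesting of parentheses in the output mirrors the tree: pages are merged pairwise, and the merged results are merged again.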

The summary query engine is wrapped in a summary tool, and the vector query engine in a vector tool.

# Creating summary tool
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to Motorsport Games Company."
    ),
)
# Creating vector tool
vector_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    description=(
        "Useful for retrieving specific context from the Motorsport Games Company."
    ),
)

With both tools ready, we connect them through a router. When a query arrives, the router analyzes it and decides which tool to use.

# Router Query Engine
adv_query_engine = RouterQueryEngine(
    llm=llm,
    selector=LLMSingleSelector.from_defaults(llm=llm),
    query_engine_tools=[summary_tool, vector_tool],
    verbose=True,
)
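
To illustrate the routing decision itself, here is a deliberately crude stand-in. It swaps the LLM-based selector for a keyword heuristic, so it is not how LLMSingleSelector works internally; it only shows the shape of the choice between the two tools. The hint list and the function name `choose_tool` are assumptions.

```python
SUMMARY_HINTS = ("summar", "overview", "overall")


def choose_tool(query: str) -> str:
    """Return 'summary_tool' for summary-style queries, else 'vector_tool'."""
    words = query.lower().split()
    if any(word.startswith(SUMMARY_HINTS) for word in words):
        return "summary_tool"
    return "vector_tool"


print(choose_tool("Summarize the charts describing the revenue of the company."))
print(choose_tool("What is the Total Assets of the company Yearwise?"))
```

An LLM selector makes the same kind of decision, but from the tool descriptions and the meaning of the query rather than from fixed keywords.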

Our advanced query system is now fully set up, so let's query the new router-backed engine.

Query 10

response = adv_query_engine.query(
    "Summarize the charts describing the revenue of the company."
)
print(str(response))

Response

[Figure: response to Query 10, based on the report]

As you can see, the router chose the summary tool because the query asks for a summary.

Query 11

response = adv_query_engine.query("What is the Total Assets of the company Yearwise?")
print(str(response))

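Response

[Figure: response to Query 11, based on the report]

Here the router chose the vector tool, because the query asks for specific information rather than a summary.

All the code used in this article is available here.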

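Conclusion

We can analyze financial reports efficiently with LlamaIndex, ChromaDB, and modern LLMs. The system supports automated financial insights, real-time querying, and summarization. Systems like this make financial analysis more accessible and efficient, supporting better decisions in investing, trading, and business operations.

An LLM-powered document retrieval system can greatly reduce the time spent analyzing complex financial reports.
A hybrid approach that combines cloud and local LLMs offers cost efficiency, privacy, and flexibility in system design.
LlamaIndex's modular framework makes it easy to automate financial report workflows.
The same approach applies to other domains such as legal documents, medical reports, and regulatory filings, making it a general-purpose RAG solution.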

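Frequently Asked Questions

Q1. How does the system handle different financial reports?

A. The system is designed to work with any structured financial document: it splits the document into text chunks, embeds them, and stores them in ChromaDB. New reports can be added dynamically without re-indexing the existing ones.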
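
One simple way to avoid re-embedding reports that are already in the store is to keep a record of content hashes and index only files whose hash is new. The sketch below shows that bookkeeping with the standard library. The file names and byte contents are placeholders, and `select_new_reports` is a hypothetical helper rather than part of LlamaIndex or ChromaDB.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def select_new_reports(reports: dict[str, bytes], seen_hashes: set[str]) -> list[str]:
    """Return names of reports whose content has not been indexed, and record them."""
    new_names = []
    for name, data in reports.items():
        digest = content_hash(data)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_names.append(name)
    return new_names


seen: set[str] = set()
first = select_new_reports({"fy2023.pdf": b"report A"}, seen)
second = select_new_reports({"fy2023.pdf": b"report A", "fy2024.pdf": b"report B"}, seen)
print(first, second)  # ['fy2023.pdf'] ['fy2024.pdf']
```

Only the names returned by the helper would then need to be loaded, embedded, and added to the index.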

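Q2. Can this be extended to generate financial charts and visualizations?

A. Yes. By integrating Matplotlib, Pandas, and Streamlit, you can visualize trends such as revenue growth, net loss analysis, or asset distribution.

Q3. How does the query routing system improve accuracy?

A. RouterQueryEngine automatically detects whether a query calls for a summarized response or for retrieval of specific financial data. This reduces irrelevant output and helps keep responses accurate.

Q4. Is the system suitable for real-time financial analysis?

A. Yes, but that depends on how often the vector store is updated. You can use the OpenAI embeddings API with a continuous ingestion pipeline to query up-to-date financial reports dynamically.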
