
How to do question answering over CSVs

LLMs are great for building question-answering systems over various kinds of data sources. In this section we'll go over how to build Q&A systems over data stored in a CSV file. As with SQL databases, the key to working with CSV files is to give an LLM access to tools for querying and interacting with the data. The two main ways to do this are:

  • RECOMMENDED: Load the CSV(s) into a SQL database and use the approaches outlined in the SQL tutorial.
  • Give the LLM access to a Python environment where it can use libraries like Pandas to interact with the data.

We will cover both approaches in this guide.

⚠️ Security note ⚠️

Both approaches above carry significant risks. Using SQL requires executing model-generated SQL queries. Using a library like Pandas requires letting the model execute Python code. Since it is easier to tightly scope SQL connection permissions and sanitize SQL queries than it is to sandbox a Python environment, we HIGHLY recommend interacting with CSV data through SQL. For more on general security best practices, see here.
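
As an illustration of the "scope down the connection permissions" point, here is a minimal sketch (assuming the titanic.db SQLite file created later in this guide) of opening the database read-only so that model-generated queries cannot modify the data:

from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

# Open the SQLite file in read-only mode; any write (e.g. a DROP TABLE) will fail
# with an "attempt to write a readonly database" error.
ro_engine = create_engine("sqlite:///file:titanic.db?mode=ro&uri=true")
ro_db = SQLDatabase(engine=ro_engine)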

Setup

Dependencies for this guide:

%pip install -qU langchain langchain-openai langchain-community langchain-experimental pandas

Set required environment variables:

# Using LangSmith is recommended but not required. Uncomment below lines to use.
# import getpass
# import os
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

Download the Titanic dataset if you don't already have it:

!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -O titanic.csv
import pandas as pd

df = pd.read_csv("titanic.csv")
print(df.shape)
print(df.columns.tolist())
(887, 8)
['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

SQL

Using SQL to interact with CSV data is the recommended approach because it is easier to limit permissions and sanitize queries than with arbitrary Python.

Most SQL databases make it easy to load a CSV file in as a table (DuckDB, SQLite, etc.). Once you've done this, you can use all of the chain- and agent-creating techniques outlined in the SQL tutorial. Here's a quick example of how we might do this with SQLite:

from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

engine = create_engine("sqlite:///titanic.db")
df.to_sql("titanic", engine, index=False)
API Reference: SQLDatabase
887
db = SQLDatabase(engine=engine)
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM titanic WHERE Age < 2;"))
sqlite
['titanic']
[(1, 2, 'Master. Alden Gates Caldwell', 'male', 0.83, 0, 2, 29.0), (0, 3, 'Master. Eino Viljami Panula', 'male', 1.0, 4, 1, 39.6875), (1, 3, 'Miss. Eleanor Ileen Johnson', 'female', 1.0, 1, 1, 11.1333), (1, 2, 'Master. Richard F Becker', 'male', 1.0, 2, 1, 39.0), (1, 1, 'Master. Hudson Trevor Allison', 'male', 0.92, 1, 2, 151.55), (1, 3, 'Miss. Maria Nakid', 'female', 1.0, 0, 2, 15.7417), (0, 3, 'Master. Sidney Leonard Goodwin', 'male', 1.0, 5, 2, 46.9), (1, 3, 'Miss. Helene Barbara Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 3, 'Miss. Eugenie Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 2, 'Master. Viljo Hamalainen', 'male', 0.67, 1, 1, 14.5), (1, 3, 'Master. Bertram Vere Dean', 'male', 1.0, 1, 2, 20.575), (1, 3, 'Master. Assad Alexander Thomas', 'male', 0.42, 0, 1, 8.5167), (1, 2, 'Master. Andre Mallet', 'male', 1.0, 0, 2, 37.0042), (1, 2, 'Master. George Sibley Richards', 'male', 0.83, 1, 1, 18.75)]
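
The paragraph above also mentions DuckDB. As a rough sketch (assuming the duckdb package is installed), loading the same CSV into a DuckDB database is similarly short:

import duckdb

con = duckdb.connect("titanic.duckdb")
# read_csv_auto infers column names and types directly from the CSV file.
con.execute("CREATE OR REPLACE TABLE titanic AS SELECT * FROM read_csv_auto('titanic.csv')")
print(con.execute("SELECT count(*) FROM titanic").fetchone())
# To wrap this in SQLDatabase you would additionally need a SQLAlchemy dialect such as duckdb-engine.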

And create a SQL agent to interact with it:

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")
from langchain_community.agent_toolkits import create_sql_agent

agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)
API Reference: create_sql_agent
agent_executor.invoke({"input": "what's the average age of survivors"})


> Entering new SQL Agent Executor chain...

Invoking: `sql_db_list_tables` with `{}`


titanic
Invoking: `sql_db_schema` with `{'table_names': 'titanic'}`



CREATE TABLE titanic (
	"Survived" BIGINT,
	"Pclass" BIGINT,
	"Name" TEXT,
	"Sex" TEXT,
	"Age" FLOAT,
	"Siblings/Spouses Aboard" BIGINT,
	"Parents/Children Aboard" BIGINT,
	"Fare" FLOAT
)

/*
3 rows from titanic table:
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 3 Mr. Owen Harris Braund male 22.0 1 0 7.25
1 1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female 38.0 1 0 71.2833
1 3 Miss. Laina Heikkinen female 26.0 0 0 7.925
*/
Invoking: `sql_db_query` with `{'query': 'SELECT AVG(Age) AS Average_Age FROM titanic WHERE Survived = 1'}`


[(28.408391812865496,)]The average age of survivors in the Titanic dataset is approximately 28.41 years.

> Finished chain.
{'input': "what's the average age of survivors",
'output': 'The average age of survivors in the Titanic dataset is approximately 28.41 years.'}

This approach generalizes easily to multiple CSVs, since we can just load each of them into our database as its own table. See the Multiple CSVs section below.
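
As a rough sketch of what that looks like, each additional file is just another to_sql call against the same engine (the second CSV and table name here are hypothetical):

other_df = pd.read_csv("some_other_data.csv")  # hypothetical second CSV
other_df.to_sql("some_other_table", engine, index=False)

# Re-wrap the engine so the new table is picked up by the SQLDatabase utility.
db = SQLDatabase(engine=engine)
print(db.get_usable_table_names())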

Pandas

Instead of SQL, we can also use data analysis libraries like pandas together with the code-generating abilities of LLMs to interact with CSV data. Again, this approach is not fit for production use cases unless you have extensive safeguards in place. For this reason, our code-execution utilities and constructors live in the langchain-experimental package.

Most LLMs have been trained on enough pandas Python code that they can generate it just by being asked to:

ai_msg = llm.invoke(
    "I have a pandas DataFrame 'df' with columns 'Age' and 'Fare'. Write code to compute the correlation between the two columns. Return Markdown for a Python code snippet and nothing else."
)
print(ai_msg.content)
```python
correlation = df['Age'].corr(df['Fare'])
correlation
```

We can combine this ability with a Python-executing tool to create a simple data analysis chain. We'll first want to load our CSV table as a dataframe and give the tool access to this dataframe:

import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.tools import PythonAstREPLTool

df = pd.read_csv("titanic.csv")
tool = PythonAstREPLTool(locals={"df": df})
tool.invoke("df['Fare'].mean()")
32.30542018038331

To help enforce proper use of our Python tool, we'll use tool calling:

llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)
response = llm_with_tools.invoke(
    "I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
response
AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_SBrK246yUbdnJemXFC8Iod05', 'function': {'arguments': '{"query":"df.corr()[\'Age\'][\'Fare\']"}', 'name': 'python_repl_ast'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 125, 'total_tokens': 138}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3b956da36b', 'finish_reason': 'stop', 'logprobs': None}, id='run-1fd332ba-fa72-4351-8182-d464e7368311-0', tool_calls=[{'name': 'python_repl_ast', 'args': {'query': "df.corr()['Age']['Fare']"}, 'id': 'call_SBrK246yUbdnJemXFC8Iod05'}])
response.tool_calls
[{'name': 'python_repl_ast',
'args': {'query': "df.corr()['Age']['Fare']"},
'id': 'call_SBrK246yUbdnJemXFC8Iod05'}]

We'll add a tools output parser to extract the function call as a dict:

from langchain_core.output_parsers.openai_tools import JsonOutputKeyToolsParser

parser = JsonOutputKeyToolsParser(key_name=tool.name, first_tool_only=True)
(llm_with_tools | parser).invoke(
    "I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
{'query': "df[['Age', 'Fare']].corr()"}

And combine it with a prompt, so that we can just specify a question without needing to spell out the dataframe info on every invocation:

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

```
{df.head().to_markdown()}
```

Given a user question, write the Python code to answer it. \
Return ONLY the valid Python code and nothing else. \
Don't assume you have access to any libraries other than built-in Python ones and pandas."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
code_chain = prompt | llm_with_tools | parser
code_chain.invoke({"question": "What's the correlation between age and fare"})
{'query': "df[['Age', 'Fare']].corr()"}

Finally, we'll add our Python tool so that the generated code is actually executed:

chain = prompt | llm_with_tools | parser | tool
chain.invoke({"question": "What's the correlation between age and fare"})
0.11232863699941621

And just like that we have a simple data analysis chain. We can take a peek at the intermediate steps by looking at the LangSmith trace: https://smith.langchain.com/public/b1309290-7212-49b7-bde2-75b39a32b49a/r

We could add an additional LLM call at the end to generate a conversational response, so that we're not just responding with the raw tool output. For this we'll want to add a chat history MessagesPlaceholder to our prompt:

from operator import itemgetter

from langchain_core.messages import ToolMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

```
{df.head().to_markdown()}
```

Given a user question, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas.
Respond directly to the question once you have enough information to answer it."""
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            system,
        ),
        ("human", "{question}"),
        # This MessagesPlaceholder allows us to optionally append an arbitrary number of messages
        # at the end of the prompt using the 'chat_history' arg.
        MessagesPlaceholder("chat_history", optional=True),
    ]
)


def _get_chat_history(x: dict) -> list:
    """Parse the chain output up to this point into a list of chat history messages to insert in the prompt."""
    ai_msg = x["ai_msg"]
    tool_call_id = x["ai_msg"].additional_kwargs["tool_calls"][0]["id"]
    tool_msg = ToolMessage(tool_call_id=tool_call_id, content=str(x["tool_output"]))
    return [ai_msg, tool_msg]


chain = (
    RunnablePassthrough.assign(ai_msg=prompt | llm_with_tools)
    .assign(tool_output=itemgetter("ai_msg") | parser | tool)
    .assign(chat_history=_get_chat_history)
    .assign(response=prompt | llm | StrOutputParser())
    .pick(["tool_output", "response"])
)
chain.invoke({"question": "What's the correlation between age and fare"})
{'tool_output': 0.11232863699941616,
'response': 'The correlation between age and fare is approximately 0.1123.'}

Here's the LangSmith trace for this run: https://smith.langchain.com/public/14e38d70-45b1-4b81-8477-9fd2b7c07ea6/r

Agents

For complex questions it can be helpful for the LLM to iteratively execute code while keeping track of the inputs and outputs of its previous executions. This is where agents come into play. They allow the LLM to decide how many times a tool needs to be invoked and to keep track of the executions it has made so far. create_pandas_dataframe_agent is a built-in agent that makes it easy to work with dataframes:

from langchain_experimental.agents import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(
    llm, df, agent_type="openai-tools", verbose=True, allow_dangerous_code=True
)
agent.invoke(
    {
        "input": "What's the correlation between age and fare? is that greater than the correlation between fare and survival?"
    }
)


> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[['Age', 'Fare']].corr().iloc[0,1]"}`


0.11232863699941621
Invoking: `python_repl_ast` with `{'query': "df[['Fare', 'Survived']].corr().iloc[0,1]"}`


0.2561785496289603The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.

Therefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).

> Finished chain.
{'input': "What's the correlation between age and fare? is that greater than the correlation between fare and survival?",
'output': 'The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.\n\nTherefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).'}

Here's the LangSmith trace for this run: https://smith.langchain.com/public/6a86aee2-4f22-474a-9264-bd4c7283e665/r

Multiple CSVs

To handle multiple CSVs (or dataframes) we just need to pass multiple dataframes to our Python tool. Our create_pandas_dataframe_agent constructor can do this out of the box: we can pass in a list of dataframes instead of just one (a short sketch of this follows the example below). If we're constructing a chain ourselves, we can do something like:

df_1 = df[["Age", "Fare"]]
df_2 = df[["Fare", "Survived"]]

tool = PythonAstREPLTool(locals={"df_1": df_1, "df_2": df_2})
llm_with_tool = llm.bind_tools(tools=[tool], tool_choice=tool.name)
df_template = """```python
{df_name}.head().to_markdown()
>>> {df_head}
```"""
df_context = "\n\n".join(
    df_template.format(df_head=_df.head().to_markdown(), df_name=df_name)
    for _df, df_name in [(df_1, "df_1"), (df_2, "df_2")]
)

system = f"""You have access to a number of pandas dataframes. \
Here is a sample of rows from each dataframe and the python code that was used to generate the sample:

{df_context}

Given a user question about the dataframes, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas. \
Make sure to refer only to the variables mentioned above."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])

chain = prompt | llm_with_tool | parser | tool
chain.invoke(
    {
        "question": "return the difference in the correlation between age and fare and the correlation between fare and survival"
    }
)
0.14384991262954416

Here's the LangSmith trace for this run: https://smith.langchain.com/public/cc2a7d7f-7c5a-4e77-a10c-7b5420fcd07f/r
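
As noted at the start of this section, create_pandas_dataframe_agent also accepts a list of dataframes. A minimal sketch, reusing df_1 and df_2 from above:

multi_df_agent = create_pandas_dataframe_agent(
    llm, [df_1, df_2], agent_type="openai-tools", verbose=True, allow_dangerous_code=True
)
multi_df_agent.invoke(
    {"input": "What's the correlation between age and fare in the first dataframe?"}
)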

Sandboxed code execution

There are a number of tools like E2B and Bearly that provide sandboxed environments for Python code execution, to allow for safer code-executing chains and agents.

Next steps

For more advanced data analysis applications we recommend checking out:

  • SQL tutorial: Many of the challenges of working with SQL databases and CSVs are common to any structured data type, so it's useful to read the SQL techniques even if you're using Pandas for CSV data analysis.
  • Tool use: Guides on general best practices when working with chains and agents that invoke tools.
  • Agents: Understand the fundamentals of building LLM agents.
  • Integrations: Sandboxed environments like E2B and Bearly, utilities like SQLDatabase, related agents like the Spark DataFrame agent.
