ArxivRetriever

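arXiv is an open-access archive of 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

This notebook shows how to retrieve scientific articles from Arxiv.org into the Document format that is used downstream.

For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.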

Integration details

Retriever	Source	Package
ArxivRetriever	Scholarly articles on arxiv.org	langchain_community

Setup

If you want automated tracing of individual queries, you can also set your LangSmith API key by uncommenting the lines below:

import getpass
import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Installation

This retriever lives in the langchain-community package. We also need the arxiv dependency:

%pip install -qU langchain-community arxiv
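
Before moving on, it can help to confirm that both packages are importable in the current environment. The sketch below uses only the standard library. `missing_packages` is a hypothetical helper introduced here for illustration, and the code assumes the top-level import names are `langchain_community` and `arxiv`.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed top-level import names for the two packages installed above.
required = ["langchain_community", "arxiv"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, rerunning the install command in the same kernel or virtual environment is usually the first thing to try.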

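Instantiation

ArxivRetriever parameters include:

load_max_docs (optional): default = 100. Use it to limit the number of downloaded documents. Downloading all 100 documents takes time, so use a small number for experiments. There is currently a hard limit of 300.
load_all_available_meta (optional): default = False. By default, only the most important fields are downloaded: Published (the date the document was published or last updated), Title, Authors, Summary. If True, the other fields are downloaded as well.
get_full_documents: boolean, default False. Determines whether to fetch the full text of documents.

See the API reference for more details.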

from langchain_community.retrievers import ArxivRetriever

retriever = ArxivRetriever(
    load_max_docs=2,
    get_full_documents=True,
)

API reference: ArxivRetriever

Usage

ArxivRetriever supports retrieval by article identifier:

docs = retriever.invoke("1605.08386")

docs[0].metadata  # meta-information of the Document

{'Entry ID': 'http://arxiv.org/abs/1605.08386v1',
 'Published': datetime.date(2016, 5, 26),
 'Title': 'Heat-bath random walks with Markov bases',
 'Authors': 'Caprice Stanley, Tobias Windisch'}

docs[0].page_content[:400]  # the first 400 characters of the Document's content

'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a ge'
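
The metadata dictionary shown above can be used directly in downstream code. As a minimal sketch, the snippet below builds a one-line citation from it. `format_citation` is a hypothetical helper, not part of the retriever, and the dictionary is copied by hand from the output above so that the example runs without network access.

```python
import datetime

# Metadata copied from the output shown above.
metadata = {
    "Entry ID": "http://arxiv.org/abs/1605.08386v1",
    "Published": datetime.date(2016, 5, 26),
    "Title": "Heat-bath random walks with Markov bases",
    "Authors": "Caprice Stanley, Tobias Windisch",
}


def format_citation(meta: dict) -> str:
    """Build a one-line citation from the metadata fields shown above."""
    year = meta["Published"].year
    return f'{meta["Authors"]} ({year}). {meta["Title"]}. {meta["Entry ID"]}'


print(format_citation(metadata))
```

In practice the same function could be applied to `docs[0].metadata`, assuming the retrieved document carries the same four keys.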

ArxivRetriever also supports retrieval based on natural-language text:

docs = retriever.invoke("What is the ImageBind model?")

docs[0].metadata

{'Entry ID': 'http://arxiv.org/abs/2305.05665v2',
 'Published': datetime.date(2023, 5, 31),
 'Title': 'ImageBind: One Embedding Space To Bind Them All',
 'Authors': 'Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra'}
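
Because the same `invoke` call accepts both identifiers and free text, it can be useful to tell the two apart before querying, for example to log which kind of query was issued. The following is a minimal sketch and not the retriever's own logic: `looks_like_arxiv_id` and its regular expression are assumptions made for illustration, and they cover only identifiers of the form `YYMM.NNNNN` with an optional version suffix.

```python
import re

# Assumed pattern: two-digit year, two-digit month (01-12), a dot,
# four or five digits, and an optional version suffix such as "v2".
_ARXIV_ID = re.compile(r"\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?")


def looks_like_arxiv_id(query: str) -> bool:
    """Return True if the whole query matches the assumed identifier pattern."""
    return _ARXIV_ID.fullmatch(query.strip()) is not None


print(looks_like_arxiv_id("1605.08386"))
print(looks_like_arxiv_id("What is the ImageBind model?"))
```

Using `fullmatch` rather than `search` keeps a question that merely mentions an identifier from being classified as one.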

Use within a chain

Like other retrievers, ArxivRetriever can be incorporated into LLM applications via chains.

We will need an LLM or chat model:

Select a chat model:

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

API reference: StrOutputParser | ChatPromptTemplate | RunnablePassthrough

chain.invoke("What is the ImageBind model?")

'The ImageBind model is an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It shows that only image-paired data is sufficient to bind the modalities together and can leverage large scale vision-language models for zero-shot capabilities and emergent applications such as cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.'
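
To see what the `{context}` slot of the prompt receives, `format_docs` can be exercised on its own, without network access or an API key. This sketch uses `FakeDocument`, a hypothetical stand-in that exposes only the `page_content` attribute. The real chain passes the Document objects returned by the retriever instead.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing only the page_content attribute."""

    page_content: str


def format_docs(docs):
    # Same helper as in the chain above: join document bodies with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    FakeDocument("First abstract."),
    FakeDocument("Second abstract."),
]
context = format_docs(docs)
print(context)
```

The blank line between entries gives the model a visible boundary between documents. With `get_full_documents=True` each entry may be much longer, so the combined context can grow quickly as `load_max_docs` increases.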

API reference

For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.

Related

Retriever conceptual guide
Retriever how-to guides
