Source Code

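This notebook covers how to load source code files using a special approach based on language parsing: each top-level function and class in the code is loaded into its own document. Any remaining top-level code that lies outside the already loaded functions and classes is loaded into a separate document.

This approach can potentially improve the accuracy of QA models over source code.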
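To make the idea concrete, here is a minimal sketch of this kind of segmentation for Python source, built only on the standard-library `ast` module. The function name `split_top_level` and the sample string are made up for illustration, and this is not presented as how LanguageParser is implemented; it only mimics the shape of the output shown later in this notebook (one segment per top-level function or class, plus a simplified remainder with `# Code for:` placeholders).

```python
import ast

SAMPLE = '''class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")


def main():
    obj = MyClass("world")
    obj.greet()


if __name__ == "__main__":
    main()
'''


def split_top_level(source):
    """Return (segments, simplified) for a Python source string.

    segments   -- one string per top-level function or class
    simplified -- the remaining code, with a placeholder per extracted block
    """
    lines = source.splitlines()
    remainder = list(lines)
    segments = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # Decorators sit above the def/class line, so start from the earliest one.
        first = min([node.lineno] + [d.lineno for d in node.decorator_list]) - 1
        last = node.end_lineno  # used as an exclusive slice end
        segments.append("\n".join(lines[first:last]))
        # Leave a one-line placeholder where the definition used to be.
        remainder[first] = f"# Code for: {lines[node.lineno - 1]}"
        for i in range(first + 1, last):
            remainder[i] = None
    simplified = "\n".join(line for line in remainder if line is not None)
    return segments, simplified


segments, simplified = split_top_level(SAMPLE)
```

Each entry in `segments` corresponds to a document holding one function or class, and `simplified` corresponds to the document holding the leftover top-level code.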

The supported languages for code parsing are:

C (*)
C++ (*)
C# (*)
COBOL
Elixir
Go (*)
Java (*)
JavaScript (requires the esprima package)
Kotlin (*)
Lua (*)
Perl (*)
Python
Ruby (*)
Rust (*)
Scala (*)
TypeScript (*)

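Items marked with (*) require the tree_sitter and tree_sitter_languages packages. Adding support for additional languages with tree_sitter is straightforward, although it currently requires modifying LangChain.

The language used for parsing can be configured, along with the minimum number of lines a file must have before syntax-based splitting is activated.

If a language is not explicitly specified, LanguageParser infers one from the filename extension, if present.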
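The inference rule can be pictured with a small standalone sketch. The mapping `EXTENSION_TO_LANGUAGE` and the helper `infer_language` are hypothetical names invented for this example, and the mapping is only an illustrative subset; the table the library actually uses may differ.

```python
from pathlib import Path

# Illustrative subset only (assumption); not the library's real mapping.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
    ".go": "go",
    ".rs": "rust",
}


def infer_language(filename, explicit=None):
    """Return the explicit language if given, else guess from the extension.

    Returns None when there is no extension or it is not in the mapping.
    """
    if explicit is not None:
        return explicit
    return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix)
```

An explicitly configured language always wins in this sketch; the extension is consulted only as a fallback.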

%pip install -qU esprima tree_sitter tree_sitter_languages

import warnings

warnings.filterwarnings("ignore")
from pprint import pprint

from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_text_splitters import Language

API Reference: GenericLoader | LanguageParser | Language

loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".py", ".js"],
    parser=LanguageParser(),
)
docs = loader.load()

len(docs)

for document in docs:
    pprint(document.metadata)

{'content_type': 'functions_classes',
 'language': <Language.PYTHON: 'python'>,
 'source': 'example_data/source_code/example.py'}
{'content_type': 'functions_classes',
 'language': <Language.PYTHON: 'python'>,
 'source': 'example_data/source_code/example.py'}
{'content_type': 'simplified_code',
 'language': <Language.PYTHON: 'python'>,
 'source': 'example_data/source_code/example.py'}
{'content_type': 'functions_classes',
 'language': <Language.JS: 'js'>,
 'source': 'example_data/source_code/example.js'}
{'content_type': 'functions_classes',
 'language': <Language.JS: 'js'>,
 'source': 'example_data/source_code/example.js'}
{'content_type': 'simplified_code',
 'language': <Language.JS: 'js'>,
 'source': 'example_data/source_code/example.js'}

print("\n\n--8<--\n\n".join([document.page_content for document in docs]))

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

--8<--

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

--8<--

# Code for: class MyClass:


# Code for: def main():


if __name__ == "__main__":
    main()

--8<--

class MyClass {
  constructor(name) {
    this.name = name;
  }

  greet() {
    console.log(`Hello, ${this.name}!`);
  }
}

--8<--

function main() {
  const name = prompt("Enter your name:");
  const obj = new MyClass(name);
  obj.greet();
}

--8<--

// Code for: class MyClass {

// Code for: function main() {

main();

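The parser can be disabled for small files.

The parameter parser_threshold indicates the minimum number of lines that a source code file must have in order to be segmented using the parser.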
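As a rough sketch of the rule just described, the check can be thought of as a line count compared against the threshold. The helper `should_segment` is a hypothetical name, and two details are assumptions made for this example rather than facts from this notebook: the default of 0, and the choice to segment a file that has exactly parser_threshold lines.

```python
def should_segment(source, parser_threshold=0):
    """Return True when the file is long enough to be split by syntax.

    Assumption: a file with exactly parser_threshold lines is segmented.
    """
    return len(source.splitlines()) >= parser_threshold


short_file = "def f():\n    return 1\n"
```

With a threshold of 1000, a two-line file like `short_file` falls below the limit, so it would be kept as a single document rather than split into functions and classes.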

loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".py"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=1000),
)
docs = loader.load()

len(docs)

print(docs[0].page_content)

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")


def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()


if __name__ == "__main__":
    main()

Splitting

Additional splitting may be needed for functions, classes, or scripts that are too big.

loader = GenericLoader.from_filesystem(
    "./example_data/source_code",
    glob="*",
    suffixes=[".js"],
    parser=LanguageParser(language=Language.JS),
)
docs = loader.load()

from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

API Reference: Language | RecursiveCharacterTextSplitter

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)

result = js_splitter.split_documents(docs)

len(result)

print("\n\n--8<--\n\n".join([document.page_content for document in result]))

class MyClass {
  constructor(name) {
    this.name = name;

--8<--

}

--8<--

greet() {
    console.log(`Hello, ${this.name}!`);
  }
}

--8<--

function main() {
  const name = prompt("Enter your name:");

--8<--

const obj = new MyClass(name);
  obj.greet();
}

--8<--

// Code for: class MyClass {

// Code for: function main() {

--8<--

main();
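For intuition about how a size-limited splitter can arrive at chunks like these, here is a simplified sketch of recursive character splitting: try the coarsest separator first, pack pieces greedily up to the size limit, and fall back to finer separators for anything still too long. The function `recursive_split` is hypothetical and uses whitespace separators only; it is not RecursiveCharacterTextSplitter, whose language-aware separators and exact chunk boundaries may differ.

```python
JS_SOURCE = """class MyClass {
  constructor(name) {
    this.name = name;
  }

  greet() {
    console.log(`Hello, ${this.name}!`);
  }
}

function main() {
  const name = prompt("Enter your name:");
  const obj = new MyClass(name);
  obj.greet();
}

main();
"""


def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters where possible."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size or not separators:
        return [text]  # small enough, or nothing finer left to split on
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into this chunk
        else:
            if current:
                chunks.extend(recursive_split(current, finer, chunk_size))
            current = piece  # start a new chunk with the piece that did not fit
    if current:
        chunks.extend(recursive_split(current, finer, chunk_size))
    return chunks


# Coarse-to-fine separators: blank lines, then line breaks, then spaces.
chunks = recursive_split(JS_SOURCE, ["\n\n", "\n", " "], chunk_size=60)
```

Because the sketch strips whitespace at chunk boundaries, indentation at the start of a chunk is lost, which is the same effect visible in the `greet() {` and `const obj` chunks above.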

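Adding Languages using Tree-sitter Template

Expanding language support using the Tree-Sitter template involves a few essential steps:

Creating a New Language File:
- Begin by creating a new file in the designated directory (langchain/libs/community/langchain_community/document_loaders/parsers/language).
- Model this file on the structure and parsing logic of existing language files such as cpp.py.
- You will also need to create a file in the langchain directory (langchain/libs/langchain/langchain/document_loaders/parsers/language).
Parsing Language Specifics:
- Mimic the structure used in the cpp.py file, adapting it to suit the language you are adding.
- The primary change is adjusting the chunk query array to suit the syntax and structure of the language you are parsing.
Testing the Language Parser:
- For thorough validation, generate a test file specific to the new language. Create test_language.py in the designated directory (langchain/libs/community/tests/unit_tests/document_loaders/parsers/language).
- Follow the example set by test_cpp.py to establish basic tests for the parsed elements in the new language.
Integration into the Parser and Text Splitter:
- Incorporate your new language into the language_parser.py file. Make sure to update LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS, along with the docstring for LanguageParser, so that the added language is recognized and handled.
- Also confirm that your language is included in the Language class in text_splitter.py for proper parsing.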
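The steps above can be pictured with a self-contained sketch. Everything in it is an assumption made so the example runs without LangChain or tree_sitter installed: `SegmenterTemplate` is a hypothetical stand-in for whatever base class cpp.py builds on, the hook names `get_chunk_query` and `make_line_comment` are guesses at the kind of methods such a file defines, the language "mylang" and its node types are imaginary, and the two dictionaries are local objects that merely borrow the names LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS mentioned in the steps. Check cpp.py and language_parser.py for the real shapes.

```python
from abc import ABC, abstractmethod


class SegmenterTemplate(ABC):
    """Hypothetical stand-in for the shared Tree-sitter base class."""

    @abstractmethod
    def get_chunk_query(self) -> str:
        """Return the Tree-sitter query selecting nodes that become documents."""

    @abstractmethod
    def make_line_comment(self, text: str) -> str:
        """Format a placeholder as a line comment in the target language."""


# Node types below are invented for an imaginary language; a real query must
# use the node names defined by that language's Tree-sitter grammar.
CHUNK_QUERY = """
    [
        (function_definition) @function
        (class_definition) @class
    ]
""".strip()


class MyLangSegmenter(SegmenterTemplate):
    """Segmenter for the imaginary language 'mylang'."""

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        # Assumes 'mylang' uses '#' for line comments.
        return f"# {text}"


# Registration step (local illustration only): map a file extension to the
# language name, and the language name to its segmenter class.
LANGUAGE_EXTENSIONS = {"mylang": "mylang"}
LANGUAGE_SEGMENTERS = {"mylang": MyLangSegmenter}

segmenter = LANGUAGE_SEGMENTERS[LANGUAGE_EXTENSIONS["mylang"]]()
```

The part that usually needs the most care is the chunk query, since it decides which syntax nodes are pulled out as separate documents.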

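By following these steps and ensuring comprehensive testing and integration, you will successfully extend language support using the Tree-Sitter template.

Good luck!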

Related

- Document loader conceptual guide
- Document loader how-to guides
