
How to split text by tokens

Language models have a token limit. You should not exceed the token limit. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text you should use the same tokenizer as used in the language model.

tiktoken

note

tiktoken is a fast BPE tokenizer created by OpenAI.

We can use tiktoken to estimate the tokens used. It will probably be more accurate for the OpenAI models.

  1. How the text is split: by character passed in.
  2. How the chunk size is measured: by the tiktoken tokenizer.
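
Counting tokens directly is also straightforward. Below is a minimal sketch (the sample string is made up), assuming tiktoken is installed as shown further below, that uses tiktoken's encoding_for_model to count tokens the same way an OpenAI model would:

import tiktoken

# Look up the encoding used by a particular OpenAI model (gpt-4 here, as an example).
enc = tiktoken.encoding_for_model("gpt-4")

sample = "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman."
# encode() returns a list of token ids; its length is the token count.
print(len(enc.encode(sample)))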

CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter can be used with tiktoken directly.

%pip install --upgrade --quiet langchain-text-splitters tiktoken
from langchain_text_splitters import CharacterTextSplitter

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
API Reference: CharacterTextSplitter

To split with a CharacterTextSplitter and then merge chunks with tiktoken, use its .from_tiktoken_encoder() method. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer.

The .from_tiktoken_encoder() method takes either encoding_name as an argument (e.g. cl100k_base), or the model_name (e.g. gpt-4). All additional arguments like chunk_size, chunk_overlap, and separators are used to instantiate the CharacterTextSplitter:

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

To implement a hard constraint on the chunk size, we can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, where each split will be recursively split again if it has a larger size:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
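
The splitter is then applied in the same way as above; a minimal usage sketch:

texts = text_splitter.split_text(state_of_the_union)
# Each chunk now stays within the 100-token limit as measured by the gpt-4 tokenizer.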

We can also load a TokenTextSplitter splitter, which works with tiktoken directly and will ensure each split is smaller than the chunk size.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
API Reference: TokenTextSplitter
Madam Speaker, Madam Vice President, our

Some written languages (e.g. Chinese and Japanese) have characters which encode to 2 or more tokens. Using the TokenTextSplitter directly can split the tokens for a character between two chunks, causing malformed Unicode characters. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure chunks contain valid Unicode strings.
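
For example, the following sketch (using a made-up Chinese sample string) measures chunk size in tiktoken tokens while only splitting on character boundaries, so each chunk remains a valid Unicode string:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical Chinese sample text; individual characters may encode to 2+ tokens.
chinese_text = "語言模型有詞符限制。分割文本時,計算詞符數量是個好主意。" * 20

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=50, chunk_overlap=0
)
chunks = text_splitter.split_text(chinese_text)
# Splits happen on character boundaries, so no multi-token character is broken apart.
print(chunks[0])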

spaCy

note

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

LangChain implements splitters based on the spaCy tokenizer.

  1. How the text is split: by the spaCy tokenizer.
  2. How the chunk size is measured: by number of characters.
%pip install --upgrade --quiet  spacy
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
API Reference: SpacyTextSplitter
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.



Last year COVID-19 kept us apart.

This year we are finally together again.



Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.



With a duty to one another to the American people to the Constitution.



And with an unwavering resolve that freedom will always triumph over tyranny.



Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.



He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.



He met the Ukrainian people.



From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

SentenceTransformers

The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use.

To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a SentenceTransformersTokenTextSplitter. You can optionally specify:

  • chunk_overlap: integer count of token overlap;
  • model_name: sentence-transformer model name, defaulting to "sentence-transformers/all-mpnet-base-v2";
  • tokens_per_chunk: desired token count per chunk.
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])
lorem
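
The splitter's parameters can also be set explicitly; a minimal sketch, where the model name is simply the documented default repeated for illustration and 256 is an arbitrary example limit:

splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",  # documented default
    tokens_per_chunk=256,  # arbitrary example limit
    chunk_overlap=0,
)
chunks = splitter.split_text(text=text_to_split)
# No chunk exceeds 256 tokens as counted by the model's own tokenizer.
print(len(chunks))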

NLTK

note

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) written in the Python programming language.

Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.

  1. How the text is split: by the NLTK tokenizer.
  2. How the chunk size is measured: by number of characters.
# pip install nltk
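# NLTKTextSplitter relies on NLTK's sentence tokenizer, which needs the "punkt"
# data package downloaded once (an assumption based on a standard NLTK setup):
import nltk
nltk.download("punkt")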
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)
API Reference: NLTKTextSplitter
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.

Last year COVID-19 kept us apart.

This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.

He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies.

KoNLPy

note

KoNLPy: Korean NLP in Python is a Python package for natural language processing (NLP) of the Korean language.

Token splitting involves the segmentation of text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation marks. The effectiveness of token splitting largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens. Since tokenizers designed for the English language are not equipped to understand the unique semantic structures of other languages, such as Korean, they cannot be used effectively for Korean language processing.

Token splitting for Korean with KoNLPy's Kkma Analyzer

In case of Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text. It breaks down sentences into words and words into their respective morphemes, identifying parts of speech for each token. It can segment a block of text into individual sentences, which is particularly useful for processing long texts.

Usage Considerations

While Kkma is renowned for its detailed analysis, it is important to note that this precision may impact processing speed. Thus, Kkma is best suited for applications where analytical depth is prioritized over rapid text processing.

# pip install konlpy
# This is a long Korean document that we want to split up into its component sentences.
with open("./your_korean_doc.txt") as f:
    korean_document = f.read()
from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()
API Reference: KonlpyTextSplitter
texts = text_splitter.split_text(korean_document)
# The sentences are split with "\n\n" characters.
print(texts[0])
춘향전 옛날에 남원에 이 도령이라는 벼슬아치 아들이 있었다.

그의 외모는 빛나는 달처럼 잘생겼고, 그의 학식과 기예는 남보다 뛰어났다.

한편, 이 마을에는 춘향이라는 절세 가인이 살고 있었다.

춘 향의 아름다움은 꽃과 같아 마을 사람들 로부터 많은 사랑을 받았다.

어느 봄날, 도령은 친구들과 놀러 나갔다가 춘 향을 만 나 첫 눈에 반하고 말았다.

두 사람은 서로 사랑하게 되었고, 이내 비밀스러운 사랑의 맹세를 나누었다.

하지만 좋은 날들은 오래가지 않았다.

도령의 아버지가 다른 곳으로 전근을 가게 되어 도령도 떠나 야만 했다.

이별의 아픔 속에서도, 두 사람은 재회를 기약하며 서로를 믿고 기다리기로 했다.

그러나 새로 부임한 관아의 사또가 춘 향의 아름다움에 욕심을 내 어 그녀에게 강요를 시작했다.

춘 향 은 도령에 대한 자신의 사랑을 지키기 위해, 사또의 요구를 단호히 거절했다.

이에 분노한 사또는 춘 향을 감옥에 가두고 혹독한 형벌을 내렸다.

이야기는 이 도령이 고위 관직에 오른 후, 춘 향을 구해 내는 것으로 끝난다.

두 사람은 오랜 시련 끝에 다시 만나게 되고, 그들의 사랑은 온 세상에 전해 지며 후세에까지 이어진다.

- 춘향전 (The Tale of Chunhyang)

Hugging Face tokenizer

Hugging Face has many tokenizers.

We use the Hugging Face tokenizer GPT2TokenizerFast to count the text length in tokens.

  1. How the text is split: by character passed in.
  2. How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
API Reference: CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.
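
To verify how the chunk size was measured, you can count the tokens of a chunk directly with the same Hugging Face tokenizer; a minimal sketch:

# Token count of the first chunk according to GPT2TokenizerFast.
print(len(tokenizer.encode(texts[0])))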
