Amazon Textract

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents.

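It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies still extract data from scanned documents (such as PDFs, images, tables, and forms) by hand, or through simple OCR software that requires manual configuration and often has to be updated whenever a form changes. To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data without manual effort.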

This sample demonstrates how to use Amazon Textract together with LangChain as a DocumentLoader.

Textract supports the PDF, TIFF, PNG, and JPEG formats.

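Textract also places limits on document size and on the languages and characters it supports; see the Amazon Textract documentation for details.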
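The format list above can be turned into a quick pre-flight check before a file is handed to the loader. The following is a minimal sketch, not part of Textract or LangChain: the helper name `has_supported_extension` and the exact set of extension spellings are assumptions made for illustration, and the check looks only at the file name, not at the file contents.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension spellings for the formats named above (PDF, TIFF, PNG, JPEG).
# This set is an assumption for illustration, not an official list.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def has_supported_extension(source: str) -> bool:
    """Return True if a local path, HTTP(S) URL, or S3 URI ends in a supported extension."""
    # urlparse drops any query string or fragment from URLs; for a plain
    # relative path the whole string ends up in .path.
    path = urlparse(source).path
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only catches obvious mismatches early; Textract itself remains the authority on whether a given file can be processed.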

%pip install --upgrade --quiet  boto3 langchain-openai tiktoken python-dotenv
%pip install --upgrade --quiet  "amazon-textract-caller>=0.2.0"

Sample 1

The first sample uses a local file, which is sent internally to the Amazon Textract synchronous API DetectDocumentText.

Local files and URL endpoints (such as http://) can only be used for single-page documents with Textract. Multi-page documents must reside on S3. This sample file is a JPEG.

from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()

Output from the file

documents
[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No ', metadata={'source': 'example_data/alejandro_rosalez_sample-small.jpeg', 'page': 1})]

Sample 2

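The next sample loads a file from an HTTPS endpoint. It has to be a single-page document, because Amazon Textract requires all multi-page documents to be stored on S3.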

from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader(
    "https://amazon-textract-public-content.s3.us-east-2.amazonaws.com/langchain/alejandro_rosalez_sample_1.jpg"
)
documents = loader.load()
documents
[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No ', metadata={'source': 'example_data/alejandro_rosalez_sample-small.jpeg', 'page': 1})]
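The rule used in the first two samples (local files and HTTP(S) endpoints for single-page documents only, S3 for multi-page documents) can be expressed as a small classifier over the source string. This is a sketch under stated assumptions: `source_kind` and `may_be_multi_page` are hypothetical helper names, not part of `AmazonTextractPDFLoader`, and the code inspects only the URI scheme without contacting any service.

```python
from urllib.parse import urlparse


def source_kind(source: str) -> str:
    """Classify a loader input as 's3', 'http', or 'local' from its URI scheme."""
    scheme = urlparse(source).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "http"
    # Anything without an s3/http(s) scheme is treated as a local path.
    return "local"


def may_be_multi_page(source: str) -> bool:
    """Only S3 sources may hold multi-page documents, per the rule above."""
    return source_kind(source) == "s3"
```

For instance, the JPEG URL from Sample 2 classifies as `http`, so it would be expected to contain a single page.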

Sample 3

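Processing a multi-page document requires the document to be on S3. The sample document resides in a bucket in us-east-2, and Textract must be called in that same region to succeed, so we set region_name on the client and pass the client to the loader to ensure Textract is called from us-east-2. Alternatively, you can run your notebook in us-east-2, set AWS_DEFAULT_REGION to us-east-2, or, when running in a different environment, pass in a boto3 Textract client with that region name, as in the cell below.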

import boto3

textract_client = boto3.client("textract", region_name="us-east-2")

file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, client=textract_client)
documents = loader.load()

Now get the number of pages to validate the response (printing the full response would be quite long...). We expect 16 pages.

len(documents)
16
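Sample 3 names two ways to choose the region: an explicit region_name on the client, or the AWS_DEFAULT_REGION environment variable. The sketch below makes that choice explicit and fails early when it does not match the bucket's region. Both function names are hypothetical, and the lookup order shown is an assumption for illustration; it is not a description of how boto3 resolves regions internally.

```python
import os
from typing import Mapping, Optional


def resolve_region(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer an explicit region, then fall back to AWS_DEFAULT_REGION."""
    env = os.environ if env is None else env
    return explicit or env.get("AWS_DEFAULT_REGION")


def ensure_same_region(client_region: Optional[str], bucket_region: str) -> str:
    """Raise if the Textract client region differs from the bucket region."""
    if client_region != bucket_region:
        raise ValueError(
            f"Textract client region {client_region!r} does not match "
            f"bucket region {bucket_region!r}"
        )
    return bucket_region
```

The returned value can then be passed as region_name to boto3.client("textract", ...), as in the cell above.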

Sample 4

You can optionally pass an additional parameter called linearization_config to AmazonTextractPDFLoader. It determines how the parser linearizes the text output after Textract runs.

from langchain_community.document_loaders import AmazonTextractPDFLoader
from textractor.data.text_linearization_config import TextLinearizationConfig

loader = AmazonTextractPDFLoader(
    "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf",
    linearization_config=TextLinearizationConfig(
        hide_header_layout=True,
        hide_footer_layout=True,
        hide_figure_layout=True,
    ),
)
documents = loader.load()

Using the AmazonTextractPDFLoader in a LangChain chain (e.g. OpenAI)

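The AmazonTextractPDFLoader can be used in a chain the same way as other loaders. Textract itself has a Queries feature, which offers functionality similar to the QA chain in this sample and is worth checking out as well.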

# You can store your OPENAI_API_KEY in a .env file as well
# import os
# from dotenv import load_dotenv

# load_dotenv()
# Or set the OpenAI key in the environment directly
import os

os.environ["OPENAI_API_KEY"] = "your-OpenAI-API-key"
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI

chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
query = ["Who are the authors?"]

chain.run(input_documents=documents, question=query)
' The authors are Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li, Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L., Lukasz Garncarek, Powalski, R., Stanislawek, T., Topolski, B., Halama, P., Gralinski, F., Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., Harley, A.W., Ufkes, A., Derpanis, K.G., He, K., Gkioxari, G., Dollár, P., Girshick, R., He, K., Zhang, X., Ren, S., Sun, J., Kay, A., Lamiroy, B., Lopresti, D., Mears, J., Jakeway, E., Ferriter, M., Adams, C., Yarasavage, N., Thomas, D., Zwaard, K., Li, M., Cui, L., Huang,'
