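TensorFlow Datasets
TensorFlow Datasets is a collection of ready-to-use datasets for TensorFlow and other Python ML frameworks such as JAX. All datasets are exposed as tf.data.Datasets, which enables easy-to-use, high-performance input pipelines. To get started, see the guide and the list of datasets.
This notebook shows how to load TensorFlow Datasets into a Document format that we can use downstream.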
Installation
You need to install the tensorflow and tensorflow-datasets Python packages.
%pip install --upgrade --quiet tensorflow
%pip install --upgrade --quiet tensorflow-datasets
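Example
As an example, we use the mlqa/en dataset.
MLQA (Multilingual Question Answering Dataset) is a benchmark dataset for evaluating multilingual question answering performance. The dataset covers 7 languages: Arabic, German, Spanish, English, Hindi, Vietnamese, and Chinese.
- Homepage: https://github.com/facebookresearch/MLQA
- Source code: tfds.datasets.mlqa.Builder
- Download size: 72.21 MiB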
# Feature structure of `mlqa/en` dataset:
FeaturesDict(
    {
        "answers": Sequence(
            {
                "answer_start": int32,
                "text": Text(shape=(), dtype=string),
            }
        ),
        "context": Text(shape=(), dtype=string),
        "id": string,
        "question": Text(shape=(), dtype=string),
        "title": Text(shape=(), dtype=string),
    }
)
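To make the feature structure more concrete, here is a minimal sketch of a single sample written as a plain Python dict, roughly as it might look once tensors have been converted to bytes and integers. The sample values are invented for illustration and are not taken from the dataset. The sketch also assumes that answer_start follows the SQuAD-style convention of a character offset into context.

```python
# Hypothetical sample shaped like the `mlqa/en` feature structure.
# String features are UTF-8 encoded bytes; all values here are made up.
sample = {
    "answers": {
        # A Sequence of dicts is represented as a dict of parallel lists.
        "answer_start": [10],
        "text": [b"Paris"],
    },
    "context": b"The city, Paris, is the capital of France.",
    "id": b"example-0001",
    "question": b"What is the capital of France?",
    "title": b"France",
}

context = sample["context"].decode("utf-8")
answer = sample["answers"]["text"][0].decode("utf-8")
start = sample["answers"]["answer_start"][0]

# Assumption: `answer_start` is a character offset of the answer in `context`.
span = context[start : start + len(answer)]
print(span)  # Paris
```

Note that answers holds parallel lists rather than a list of dicts, which matches the element_spec printed when the dataset is loaded.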
import tensorflow as tf
import tensorflow_datasets as tfds
# Try accessing this dataset directly:
ds = tfds.load("mlqa/en", split="test")
ds = ds.take(1) # Only take a single example
ds
<_TakeDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>
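Now we have to create a custom function that converts a dataset sample into a Document.
This is a requirement: TF datasets have no standard format, so we need a custom conversion function.
Let's use the context field as Document.page_content and place the other fields in Document.metadata.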
from langchain_core.documents import Document


def decode_to_str(item: tf.Tensor) -> str:
    # Convert a scalar string tensor to a Python str.
    return item.numpy().decode("utf-8")


def mlqaen_example_to_document(example: dict) -> Document:
    # Use `context` as the page content; keep the other fields as metadata.
    return Document(
        page_content=decode_to_str(example["context"]),
        metadata={
            "id": decode_to_str(example["id"]),
            "title": decode_to_str(example["title"]),
            "question": decode_to_str(example["question"]),
            "answer": decode_to_str(example["answers"]["text"][0]),
        },
    )


for example in ds:
    doc = mlqaen_example_to_document(example)
    print(doc)
    break
API Reference: Document
page_content='After completing the journey around South America, on 23 February 2006, Queen Mary 2 met her namesake, the original RMS Queen Mary, which is permanently docked at Long Beach, California. Escorted by a flotilla of smaller ships, the two Queens exchanged a "whistle salute" which was heard throughout the city of Long Beach. Queen Mary 2 met the other serving Cunard liners Queen Victoria and Queen Elizabeth 2 on 13 January 2008 near the Statue of Liberty in New York City harbour, with a celebratory fireworks display; Queen Elizabeth 2 and Queen Victoria made a tandem crossing of the Atlantic for the meeting. This marked the first time three Cunard Queens have been present in the same location. Cunard stated this would be the last time these three ships would ever meet, due to Queen Elizabeth 2\'s impending retirement from service in late 2008. However this would prove not to be the case, as the three Queens met in Southampton on 22 April 2008. Queen Mary 2 rendezvoused with Queen Elizabeth 2 in Dubai on Saturday 21 March 2009, after the latter ship\'s retirement, while both ships were berthed at Port Rashid. With the withdrawal of Queen Elizabeth 2 from Cunard\'s fleet and its docking in Dubai, Queen Mary 2 became the only ocean liner left in active passenger service.' metadata={'id': '5116f7cccdbf614d60bcd23498274ffd7b1e4ec7', 'title': 'RMS Queen Mary 2', 'question': 'What year did Queen Mary 2 complete her journey around South America?', 'answer': '2006'}
2023-08-03 14:27:08.482983: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
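The conversion function above indexes the first answer directly, which would raise an error for any sample whose answers list is empty. Below is a minimal sketch of a more defensive variant. It is written against plain Python values (bytes and lists) rather than tensors so that it runs without TensorFlow. The names to_text and sample_to_fields are hypothetical, and the sketch assumes the returned dict would then be passed on as Document(**fields).

```python
from typing import Any, Dict, Union


def to_text(value: Union[bytes, str]) -> str:
    """Decode UTF-8 bytes to str; pass str values through unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def sample_to_fields(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Split a sample into page content and metadata (hypothetical helper)."""
    answers = sample["answers"]["text"]
    return {
        "page_content": to_text(sample["context"]),
        "metadata": {
            "id": to_text(sample["id"]),
            "title": to_text(sample["title"]),
            "question": to_text(sample["question"]),
            # Fall back to an empty string when a sample has no answers.
            "answer": to_text(answers[0]) if len(answers) > 0 else "",
        },
    }


# Made-up sample with an empty answers list, for illustration only.
sample = {
    "answers": {"answer_start": [], "text": []},
    "context": b"Some context.",
    "id": b"example-0002",
    "question": b"A question with no recorded answer?",
    "title": b"Example",
}
fields = sample_to_fields(sample)
print(fields["metadata"]["answer"] == "")  # True
```

Whether an empty string, None, or skipping the sample entirely is the right fallback depends on how the metadata is used downstream.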
from langchain_community.document_loaders import TensorflowDatasetLoader
from langchain_core.documents import Document

loader = TensorflowDatasetLoader(
    dataset_name="mlqa/en",
    split_name="test",
    load_max_docs=3,
    sample_to_document_function=mlqaen_example_to_document,
)
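API Reference: TensorflowDatasetLoader | Document
TensorflowDatasetLoader has the following arguments:
- dataset_name: the name of the dataset to load
- split_name: the name of the split to load. Defaults to "train".
- load_max_docs: a limit on the number of loaded documents. Defaults to 100.
- sample_to_document_function: a function that converts a dataset sample to a Document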
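Conceptually, load_max_docs and sample_to_document_function combine as "take at most N samples, then convert each one". The sketch below illustrates that idea with the standard library only. The function load_limited is a hypothetical stand-in and is not the loader's actual implementation.

```python
from itertools import islice
from typing import Any, Callable, Iterable, List


def load_limited(
    samples: Iterable[Any],
    sample_to_document_function: Callable[[Any], Any],
    load_max_docs: int = 100,
) -> List[Any]:
    """Convert at most `load_max_docs` samples (hypothetical stand-in)."""
    limited = islice(samples, load_max_docs)
    return [sample_to_document_function(s) for s in limited]


# Made-up samples and a trivial conversion function, for illustration only.
samples = ({"context": f"text {i}".encode("utf-8")} for i in range(10))
converted = load_limited(
    samples,
    lambda s: s["context"].decode("utf-8"),
    load_max_docs=3,
)
print(converted)  # ['text 0', 'text 1', 'text 2']
```

Because the limit is applied before conversion, the conversion function only runs on the samples that are kept.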
docs = loader.load()
len(docs)
2023-08-03 14:27:22.998964: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
3
docs[0].page_content
'After completing the journey around South America, on 23 February 2006, Queen Mary 2 met her namesake, the original RMS Queen Mary, which is permanently docked at Long Beach, California. Escorted by a flotilla of smaller ships, the two Queens exchanged a "whistle salute" which was heard throughout the city of Long Beach. Queen Mary 2 met the other serving Cunard liners Queen Victoria and Queen Elizabeth 2 on 13 January 2008 near the Statue of Liberty in New York City harbour, with a celebratory fireworks display; Queen Elizabeth 2 and Queen Victoria made a tandem crossing of the Atlantic for the meeting. This marked the first time three Cunard Queens have been present in the same location. Cunard stated this would be the last time these three ships would ever meet, due to Queen Elizabeth 2\'s impending retirement from service in late 2008. However this would prove not to be the case, as the three Queens met in Southampton on 22 April 2008. Queen Mary 2 rendezvoused with Queen Elizabeth 2 in Dubai on Saturday 21 March 2009, after the latter ship\'s retirement, while both ships were berthed at Port Rashid. With the withdrawal of Queen Elizabeth 2 from Cunard\'s fleet and its docking in Dubai, Queen Mary 2 became the only ocean liner left in active passenger service.'
docs[0].metadata
{'id': '5116f7cccdbf614d60bcd23498274ffd7b1e4ec7',
'title': 'RMS Queen Mary 2',
'question': 'What year did Queen Mary 2 complete her journey around South America?',
'answer': '2006'}