ML Notes: Filtering Data with a Custom Function in HuggingFace datasets

Author: 既智

August 10, 2024 #AI, #Data

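In HuggingFace's datasets library, dataset.map applies a custom processing function to every example in a dataset. To keep only the examples that satisfy a condition, however complex, use dataset.filter instead. The steps below show how.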

Step 1: Load the dataset

First, load the dataset you want to process. Here we use the squad dataset:

from datasets import load_dataset

# Load the squad dataset
dataset = load_dataset('squad')

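Step 2: Define the filter function

Define a function that takes a single example and returns a boolean indicating whether to keep it. For instance, to keep only the examples whose question is longer than 10 characters: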

def filter_function(example):
    return len(example['question']) > 10

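Step 3: Apply the filter function

Pass the function to dataset.filter. Because load_dataset('squad') returns a DatasetDict, the filter is applied to every split: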

filtered_dataset = dataset.filter(filter_function)
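
The threshold of 10 is hard-coded in filter_function. A minimal sketch of a parameterized variant follows. The name question_longer_than and its min_chars parameter are made up for illustration, and the sample dictionaries are hand-written stand-ins for squad examples:

```python
def question_longer_than(example, min_chars):
    """Return True when the question has more than `min_chars` characters."""
    return len(example["question"]) > min_chars


# Hand-written stand-ins for squad examples (only the field we need).
short_q = {"question": "Who?"}
long_q = {"question": "Who wrote the novel?"}

print(question_longer_than(short_q, min_chars=10))  # False
print(question_longer_than(long_q, min_chars=10))   # True
```

With the dataset from step 1, the extra argument can be supplied through the fn_kwargs parameter of filter, for example dataset.filter(question_longer_than, fn_kwargs={"min_chars": 10}).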

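Filtering on complex conditions

If the condition is more involved, add more logic to the filter function. For instance, to keep only the examples whose question is longer than 10 characters and whose first answer is shorter than 50 characters: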

def complex_filter_function(example):
    return len(example['question']) > 10 and len(example['answers']['text'][0]) < 50

filtered_dataset = dataset.filter(complex_filter_function)
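
complex_filter_function indexes answers['text'][0], which raises an IndexError on an example with no answers. squad examples come with at least one answer, but squad_v2, for instance, includes unanswerable questions whose answer list is empty. A defensive sketch, assuming such examples should simply be dropped (the function name and its two threshold parameters are illustrative):

```python
def safe_complex_filter(example, min_question_chars=10, max_answer_chars=50):
    """Keep examples with a long enough question and a short enough first answer.

    Examples without any answer text are dropped instead of raising IndexError.
    """
    answers = example["answers"]["text"]
    if not answers:
        return False
    return (
        len(example["question"]) > min_question_chars
        and len(answers[0]) < max_answer_chars
    )


# Hand-written stand-ins for dataset examples.
answered = {"question": "When did the war end?", "answers": {"text": ["1945"]}}
unanswerable = {"question": "When did the war end?", "answers": {"text": []}}

print(safe_complex_filter(answered))      # True
print(safe_complex_filter(unanswerable))  # False
```

It is passed to filter the same way as before: dataset.filter(safe_complex_filter).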

Example code

The complete example:

from datasets import load_dataset

# Load the squad dataset
dataset = load_dataset('squad')

# Keep examples with a question over 10 characters and a first answer under 50
def complex_filter_function(example):
    return len(example['question']) > 10 and len(example['answers']['text'][0]) < 50

# Apply the filter function to every split
filtered_dataset = dataset.filter(complex_filter_function)

# Print the filtered dataset (split names, columns, and row counts)
print(filtered_dataset)
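
Printing the filtered dataset shows the remaining row counts, but not how many rows were removed. A small hypothetical helper, retention_report, can compare the two. It relies only on the objects being dict-like (split name to split) and supporting len, which holds for DatasetDict and Dataset. Plain dicts of lists stand in for them here:

```python
def retention_report(original, filtered):
    """Map each split name to (rows before, rows after, fraction kept)."""
    report = {}
    for split in original:
        before = len(original[split])
        after = len(filtered[split])
        kept = after / before if before else 0.0
        report[split] = (before, after, kept)
    return report


# Stand-ins: plain dicts of lists instead of DatasetDict objects.
original = {"train": list(range(8)), "validation": list(range(4))}
filtered = {"train": list(range(6)), "validation": list(range(1))}

for split, (before, after, kept) in retention_report(original, filtered).items():
    print(f"{split}: {after}/{before} rows kept ({kept:.0%})")
# train: 6/8 rows kept (75%)
# validation: 1/4 rows kept (25%)
```

With the variables from the example above, the call would be retention_report(dataset, filtered_dataset).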

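Notes

Performance: filter visits every example, so on very large datasets the operation can take a while.
Memory: make sure the filter function itself does not exhaust memory, especially when processing large datasets.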
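
One common way to reduce the per-example overhead is batched filtering: with batched=True, filter hands the function a dictionary of column lists and expects one boolean per row back. A minimal sketch of such a function, checked on a hand-built batch (the name batch_filter_function is hypothetical):

```python
def batch_filter_function(batch):
    """Return one keep/drop boolean per row of a batch (a dict of column lists)."""
    return [
        len(question) > 10 and len(answers["text"][0]) < 50
        for question, answers in zip(batch["question"], batch["answers"])
    ]


# Hand-built batch with two rows, shaped like a batch of squad examples.
batch = {
    "question": ["Who?", "What year did the war end?"],
    "answers": [{"text": ["Nobody"]}, {"text": ["1945"]}],
}

print(batch_filter_function(batch))  # [False, True]
```

It would be applied with dataset.filter(batch_filter_function, batched=True). filter also accepts a num_proc argument to spread the work across processes. Whether either option helps depends on the dataset and the predicate, so it is worth timing both.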

With the steps above, you can use HuggingFace's datasets library to filter data by a custom condition and keep only the examples that match [1][4][6].

[1] https://blog.csdn.net/jclian91/article/details/131906508
[2] https://discuss.huggingface.co/t/data-exploration-visualisation/68602
[3] https://mskter.com/2023/08/27/huggingface-datasets-intro-zh/
[4] https://huggingface.co/docs/datasets/v1.2.0/processing.html
[5] https://developer.baidu.com/article/details/3251004
[6] https://discuss.huggingface.co/t/how-to-filter-datasets-object/90718
[7] https://www.cnblogs.com/shengshengwang/p/17510014.html
[8] https://stackoverflow.com/questions/73145394/how-can-i-take-the-unique-rows-of-a-huggingface-dataset
[9] https://discuss.huggingface.co/t/dataset-map-method-how-to-pass-argument-to-the-function/16274
[10] https://github.com/huggingface/datasets
[11] https://discuss.huggingface.co/t/datasets-filter-map-hangs-when-multithreading/36967