ML Notes: How to Filter Data with a Custom Function in HuggingFace datasets

Author: 既智

August 10, 2024 #AI, #Data

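In HuggingFace's datasets library, dataset.map is mainly used to apply a custom processing function to every sample in a dataset. If you instead want to keep only the samples that satisfy some condition, however complex, use dataset.filter. The steps are as follows: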

Step 1: Load the dataset

First, load the dataset you want to process. Here we use the squad dataset as an example:

from datasets import load_dataset

# Load the squad dataset
dataset = load_dataset('squad')

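Step 2: Define the filter function

Define a function that takes a single sample and returns a boolean indicating whether the sample should be kept. For example, suppose we only want samples whose question is longer than 10 characters: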

def filter_function(example):
    return len(example['question']) > 10
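
Before running the predicate over the whole dataset, it can help to call it on a hand-written record. The sketch below assumes the SQuAD field layout (a `question` string and an `answers` dict holding a `text` list); the two sample records are invented for illustration and are not rows from the dataset.

```python
def filter_function(example):
    # Keep samples whose question is longer than 10 characters.
    return len(example['question']) > 10


# Hand-written records that mimic the SQuAD field layout (illustrative only).
long_example = {
    'question': 'Which library provides the filter method?',
    'answers': {'text': ['datasets'], 'answer_start': [0]},
}
short_example = {
    'question': 'Who?',
    'answers': {'text': ['nobody'], 'answer_start': [0]},
}

print(filter_function(long_example))   # True: the question has more than 10 characters
print(filter_function(short_example))  # False: the question has only 4 characters
```

A predicate that behaves as expected on plain dicts like these can be passed to dataset.filter unchanged, since filter hands it one sample at a time in the same dict form.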

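Step 3: Apply the filter function

Apply the condition with dataset.filter. Because load_dataset('squad') was called without a split, dataset is a DatasetDict, and the filter is applied to every split it contains: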

filtered_dataset = dataset.filter(filter_function)

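Filtering on complex conditions

If your condition is more involved, add the extra logic to the filter function. For example, suppose we only keep samples whose question is longer than 10 characters and whose first answer is shorter than 50 characters: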

def complex_filter_function(example):
    return len(example['question']) > 10 and len(example['answers']['text'][0]) < 50

filtered_dataset = dataset.filter(complex_filter_function)
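
The expression `example['answers']['text'][0]` raises an IndexError when the answer list is empty. Every question in squad has at least one answer, but variants such as squad_v2 include unanswerable questions whose `text` list is empty. A defensive variant might look like the sketch below; the name `safe_complex_filter` and the decision to drop unanswerable samples are assumptions made for illustration, not part of the original example.

```python
def safe_complex_filter(example):
    # Guard against records whose answer list is empty before indexing into it.
    answer_texts = example['answers']['text']
    if not answer_texts:
        return False  # assumption: samples without any answer are dropped
    return len(example['question']) > 10 and len(answer_texts[0]) < 50


# Hand-written records (illustrative only).
answered = {'question': 'What does filter return?', 'answers': {'text': ['a new dataset']}}
unanswerable = {'question': 'What does filter return?', 'answers': {'text': []}}

print(safe_complex_filter(answered))      # True
print(safe_complex_filter(unanswerable))  # False
```

If unanswerable samples should be kept instead, returning True in the guard branch is the only change needed.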

Example code

Here is the complete example:

from datasets import load_dataset

# Load the squad dataset
dataset = load_dataset('squad')

# Define the filter function with the combined condition
def complex_filter_function(example):
    return len(example['question']) > 10 and len(example['answers']['text'][0]) < 50

# Apply the filter function
filtered_dataset = dataset.filter(complex_filter_function)

# Print the filtered dataset
print(filtered_dataset)
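
The thresholds 10 and 50 are hard-coded in the example above. One way to make them adjustable is to turn them into keyword arguments and bind them with `functools.partial`; filter also accepts an `fn_kwargs` dictionary for the same purpose. This is a sketch only, and the names `length_filter`, `min_question_chars`, and `max_answer_chars` are hypothetical.

```python
from functools import partial


def length_filter(example, min_question_chars=10, max_answer_chars=50):
    # Same condition as complex_filter_function, with the limits as parameters.
    question_ok = len(example['question']) > min_question_chars
    answer_ok = len(example['answers']['text'][0]) < max_answer_chars
    return question_ok and answer_ok


# Bind stricter limits; the result is a one-argument function usable with filter.
strict_filter = partial(length_filter, min_question_chars=20, max_answer_chars=15)

# Hand-written record (illustrative only): 26-character question, 13-character answer.
sample = {'question': 'How long is this question?', 'answers': {'text': ['26 characters']}}
print(length_filter(sample))  # True with the default limits
print(strict_filter(sample))  # True: 26 > 20 and 13 < 15

# With the dataset from the article, either form should work:
#   dataset.filter(strict_filter)
#   dataset.filter(length_filter, fn_kwargs={'min_question_chars': 20, 'max_answer_chars': 15})
```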

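Notes

Performance: filter iterates over the entire dataset, so on very large datasets the filtering step can take a long time.
Memory: make sure the filter function itself does not exhaust memory (for example, by accumulating samples in a global structure), especially when processing large datasets.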
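
For large datasets, filter can call the function on batches instead of single samples when `batched=True` is passed, and can spread the work across processes with `num_proc`. In batched mode the function receives a dict of columns, each a list with one entry per row, and must return one boolean per row. A minimal sketch, using the hypothetical name `batched_filter`:

```python
def batched_filter(batch):
    # batch['question'] and batch['answers'] are lists with one entry per row.
    return [
        len(question) > 10 and len(answers['text'][0]) < 50
        for question, answers in zip(batch['question'], batch['answers'])
    ]


# A hand-written two-row batch in column layout (illustrative only).
batch = {
    'question': ['What is the capital of France?', 'Why?'],
    'answers': [{'text': ['Paris']}, {'text': ['Because']}],
}
print(batched_filter(batch))  # [True, False]

# With the dataset from the article (num_proc=4 is an arbitrary choice):
#   dataset.filter(batched_filter, batched=True, num_proc=4)
```

Whether batching or multiprocessing actually helps depends on the cost of the predicate; for a cheap length check like this one, the gain may be modest.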

With the steps above, you can use HuggingFace's datasets library to filter data by custom conditions and keep only the samples that match [1][4][6].

[1] https://blog.csdn.net/jclian91/article/details/131906508
[2] https://discuss.huggingface.co/t/data-exploration-visualisation/68602
[3] https://mskter.com/2023/08/27/huggingface-datasets-intro-zh/
[4] https://huggingface.co/docs/datasets/v1.2.0/processing.html
[5] https://developer.baidu.com/article/details/3251004
[6] https://discuss.huggingface.co/t/how-to-filter-datasets-object/90718
[7] https://www.cnblogs.com/shengshengwang/p/17510014.html
[8] https://stackoverflow.com/questions/73145394/how-can-i-take-the-unique-rows-of-a-huggingface-dataset
[9] https://discuss.huggingface.co/t/dataset-map-method-how-to-pass-argument-to-the-function/16274
[10] https://github.com/huggingface/datasets
[11] https://discuss.huggingface.co/t/datasets-filter-map-hangs-when-multithreading/36967
