I'm trying to load the People's Speech dataset, but it's too large. Is there a way to download only a part of it?
from datasets import load_dataset

train = load_dataset("MLCommons/peoples_speech", "clean", split="train[:10%]")
test = load_dataset("MLCommons/peoples_speech", "clean", split="test[:10%]")
Using ("train[:10%]") doesn't help; it still tries to download the entire dataset...
Have you considered using streaming Datasets? With streaming=True, load_dataset returns iterable datasets that fetch examples on the fly instead of downloading everything up front:
from datasets import load_dataset
from torch.utils.data import DataLoader
# you get a dict of {"split": IterableDataset}
dataset = load_dataset("MLCommons/peoples_speech", "clean", streaming=True)
# your preprocessing and filtering
...
train_dataloader = DataLoader(dataset["train"], batch_size=4)
valid_dataloader = DataLoader(dataset["validation"], batch_size=4)
train_steps_per_epoch = 500
# training loop
for epoch in range(5):
    for i, batch in enumerate(train_dataloader):
        # if you only want to do a limited amount of optimization steps per epoch
        if i == train_steps_per_epoch:
            break
        # train step
        ...
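If you'd rather cap the data itself instead of counting steps in the loop, recent versions of 🤗 Datasets also let you call `take(n)` on a streaming `IterableDataset` (e.g. `dataset["train"].take(1000)`), which lazily yields only the first n examples. The underlying idea is just slicing a lazy stream; here is a minimal stand-alone sketch of that pattern, using a plain generator as a stand-in for the streaming dataset:

```python
from itertools import islice

# a stand-in for a streaming dataset: any lazy iterable of examples
stream = ({"id": i} for i in range(1_000_000))

# keep only the first 3 examples; nothing beyond them is ever produced,
# which is what makes this cheap even for a huge source
subset = list(islice(stream, 3))
print(subset)  # [{'id': 0}, {'id': 1}, {'id': 2}]
```

Because the stream is consumed lazily, only the audio files backing those first examples ever get downloaded.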
Hope that helps.