The image above shows the structure of my data.
from sklearn.model_selection import train_test_split
from datasets import Features, ClassLabel, Value, Dataset, DatasetDict

# Hold out 20% of the rows, stratified by label.
df_train, df_tmp = train_test_split(
    movie_df, stratify=movie_df["label"], test_size=0.2)
# Split the held-out rows evenly into validation and test sets.
df_val, df_test = train_test_split(
    df_tmp, stratify=df_tmp["label"], test_size=0.5)

ds_features = Features({"text": Value("string"), "label": ClassLabel(names=labels)})
dataset = DatasetDict({
    "train": Dataset.from_pandas(df_train.reset_index(drop=True), features=ds_features),
    "valid": Dataset.from_pandas(df_val.reset_index(drop=True), features=ds_features),
    "test": Dataset.from_pandas(df_test.reset_index(drop=True), features=ds_features)})
dataset
This code raises a ValueError, shown below:
I was expecting something like the following, though not with the same values:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13267
    })
    valid: Dataset({
        features: ['text', 'label'],
        num_rows: 1658
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1659
    })
})
Can anyone tell me what I am doing wrong?
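You need to drop the columns title and label_name and save the result to a new DataFrame. Then try creating the features from the new DataFrame.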
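As a rough sketch of that step, the snippet below keeps only the columns that Features declares and stores them in a new DataFrame. The toy movie_df, its values, and the names feature_columns and clean_df are made up for illustration; the real data is only visible in the question's image, so the presence of title and label_name is assumed from the answer.

```python
import pandas as pd

# Toy stand-in for movie_df. The extra columns `title` and `label_name`
# are assumptions based on the answer, not taken from the real data.
movie_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "text": ["great film", "dull plot", "loved it", "too long"],
    "label": [1, 0, 1, 0],
    "label_name": ["pos", "neg", "pos", "neg"],
})

# The column names that Features declares in the question.
feature_columns = ["text", "label"]

# Keep only the declared columns, in a new DataFrame (hypothetical name).
clean_df = movie_df[feature_columns].copy()

print(list(clean_df.columns))
```

If clean_df is then split with train_test_split and passed to Dataset.from_pandas(..., features=ds_features) as in the question, the DataFrame columns should line up with the declared features.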