SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

Question

I have already pre-cleaned the data. The first five rows look like this:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

I have called train_test_split() as follows:

     [IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
   [Note*] `X_train` and `y_train` are now pandas.core.series.Series of shape (1785,), and `X_test` and `y_test` are pandas.core.series.Series of shape (595,)

I then vectorized the X training and test data using the following TfidfVectorizer and fit/transform steps:

     [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
          X_train = v.fit_transform(X_train)
          X_test = v.transform(X_test)

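I am now at the stage where I would normally apply a classifier (if this were a balanced dataset). However, when I initialise imblearn's SMOTE() class (to perform over-sampling)...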

     [IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
          smote_model = smote_pipeline.fit(X_train, y_train)
          smote_prediction = smote_model.predict(X_test)

...it results in:

     [OUT] ValueError: "Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6.

I have tried reducing the number of n_neighbors, but to no avail. Any tips or advice would be much appreciated. Thanks for reading.

------------------------------------------------------------------------------------------------------------------------------------

Edit:

Full Traceback

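The dataset/dataframe (df) contains 2380 rows across two columns, as shown in df.head() above. X_train holds 1785 of these rows as a list of strings (df['cleaned']), and y_train holds the 1785 matching labels, also in string format (df['Year']).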

After vectorization with TfidfVectorizer(): X_train and X_test are converted from pandas.core.series.Series of shape (1785,) and (595,) to scipy.sparse.csr.csr_matrix of shape (1785, 126459) and (595, 126459) respectively.

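Regarding the number of classes: using Counter(), I calculated that there are 199 classes (years). Each instance of a class is attached to one element of the df['cleaned'] data mentioned above, which contains a list of strings extracted from a text corpus.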

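The aim of this process is to automatically determine/guess the year, decade or century of the input text data based on the vocabulary it uses (any degree of classification will do!).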

Answer 1

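Since there are approximately 200 classes and 1800 samples in the training set, you have on average 9 samples per class. The error message arises because (a) the data are probably not perfectly balanced, so some classes have fewer than 6 samples, and (b) the number of neighbors is 6: SMOTE's default k_neighbors is 5, and its nearest-neighbor query also counts the sample itself. A few solutions for your problem: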

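1. Calculate the minimum number of samples (n_samples) among the 199 classes, and choose SMOTE's k_neighbors parameter so that the n_neighbors reported in the error (k_neighbors + 1) is less than or equal to n_samples.
2. Exclude the classes with n_samples < n_neighbors from over-sampling by using SMOTE's ratio parameter (named sampling_strategy in newer imbalanced-learn releases).
3. Use the RandomOverSampler class, which has no such restriction.
4. Combine solutions 2 and 3: create a pipeline that uses both SMOTE and RandomOverSampler, so that SMOTE is applied to the classes that satisfy n_neighbors <= n_samples and random over-sampling is applied to the classes that do not.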
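
The partitioning behind solutions 2 to 4 can be sketched as follows. This is only an illustration under assumptions, not the answerer's code: the helper names (`classes_too_small`, `smote_targets`, `build_oversampling_pipeline`) are hypothetical, the toy labels stand in for `y_train`, and the pipeline builder assumes imbalanced-learn 0.4 or later, where `sampling_strategy` accepts a dict of per-class target counts.

```python
from collections import Counter


def classes_too_small(y, k_neighbors=5):
    """Return {label: count} for classes with fewer than k_neighbors + 1 samples.

    These are the classes that trigger the "Expected n_neighbors <= n_samples" error.
    """
    required = k_neighbors + 1  # the neighbor query includes the sample itself
    return {label: n for label, n in Counter(y).items() if n < required}


def smote_targets(y, k_neighbors=5):
    """Map every class SMOTE can handle to the majority-class count.

    A class qualifies when it has at least k_neighbors + 1 samples and is
    still smaller than the largest class.
    """
    counts = Counter(y)
    majority = max(counts.values())
    return {
        label: majority
        for label, n in counts.items()
        if k_neighbors + 1 <= n < majority
    }


def build_oversampling_pipeline(y, classifier, k_neighbors=5, random_state=42):
    """Hypothetical helper: SMOTE where possible, random over-sampling elsewhere."""
    # Imported lazily so the counting helpers above work without imbalanced-learn.
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.pipeline import make_pipeline

    return make_pipeline(
        # SMOTE only touches the classes listed in the dict.
        SMOTE(sampling_strategy=smote_targets(y, k_neighbors),
              k_neighbors=k_neighbors, random_state=random_state),
        # The default strategy then tops up every remaining non-majority class.
        RandomOverSampler(random_state=random_state),
        classifier,
    )


# Toy labels standing in for y_train (assumption, not the asker's data).
y_toy = [1909] * 9 + [1910] * 6 + [1911] * 2
print(classes_too_small(y_toy))  # {1911: 2}
print(smote_targets(y_toy))      # {1910: 9}
```

With the asker's data, something like `build_oversampling_pipeline(y_train, some_classifier).fit(X_train, y_train)` would be the intended use, where `some_classifier` is whatever estimator instance is being trained.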

Answer 2

Try the following code for SMOTE:

oversampler=SMOTE(kind='regular',k_neighbors=2)

This worked for me.
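
A fixed value such as k_neighbors=2 only works when every class has at least 3 samples. The sketch below shows one way to derive the value from the labels instead. The helper names `largest_valid_k` and `make_smote` are hypothetical, and the toy labels are made up. It also assumes a recent imbalanced-learn release, where, as far as I know, the `kind` argument is no longer accepted and plain `SMOTE(k_neighbors=...)` gives the regular variant.

```python
from collections import Counter


def largest_valid_k(y, preferred_k=5):
    """Return the largest k_neighbors <= preferred_k that every class supports."""
    smallest = min(Counter(y).values())
    if smallest < 2:
        # A single sample has no neighbor to interpolate towards.
        raise ValueError("SMOTE needs at least 2 samples in every class")
    # SMOTE queries k_neighbors + 1 points, so k_neighbors may be at most smallest - 1.
    return min(preferred_k, smallest - 1)


def make_smote(y, preferred_k=5, random_state=42):
    """Hypothetical helper: build a SMOTE instance sized to the smallest class."""
    # Imported lazily so largest_valid_k works without imbalanced-learn installed.
    from imblearn.over_sampling import SMOTE

    return SMOTE(k_neighbors=largest_valid_k(y, preferred_k),
                 random_state=random_state)


# Toy labels standing in for y_train (assumption, not the asker's data).
y_toy = [1909] * 9 + [1910] * 5 + [1911] * 3
print(largest_valid_k(y_toy))  # 2
```

If the smallest class has only one sample, no value of k_neighbors works, and that class has to be dropped, merged (for example by decade), or handled with random over-sampling.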
