Creating a new column for predicted clusters: SettingWithCopyWarning

Question

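Unfortunately this question will look like a duplicate, but even after reading other similar questions and their answers, I could not fix the problem in my code. I need to split my dataset into a training set and a test set. However, when I add a new column for the predicted clusters, I seem to be doing something wrong. The error I get is: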

/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until

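There are several questions about this error, but I am probably doing something wrong, because I have not solved the problem and still get the same error as above. The dataset looks like this: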

    Date    Link    Value   
0   03/15/2020  https://www.bbc.com 1
1   03/15/2020  https://www.netflix.com 4   
2   03/15/2020  https://www.google.com 10
...

I split the dataset into training and test sets as follows:

import pandas as pd
import sklearn
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import string as st

train_data=df.Link.tolist()
df_train=pd.DataFrame(train_data, columns = ['Review'])
X = df_train

X_train, X_test = train_test_split(
        X, test_size=0.4).copy()
X_test, X_val = train_test_split(
        X_test, test_size=0.5).copy()
print(X_train.isna().sum())
print(X_test.isna().sum())

stop_words = stopwords.words('english')

def preprocessor(t):
    t = re.sub(r"[^a-zA-Z]", " ", t.lower())
    words = word_tokenize(t)
    w_lemm = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return w_lemm


vect = TfidfVectorizer(tokenizer=preprocessor)
vectorized_text = vect.fit_transform(X_train['Review'])
kmeans = KMeans(n_clusters=3).fit(vectorized_text)

The lines of code that cause the error are:

cl=kmeans.predict(vectorized_text)
X_train['Cluster']=pd.Series(cl, index=X_train.index)

I thought these two questions should have helped me with the code:

How to add k-means predicted clusters in a column to a dataframe in Python

How to deal with SettingWithCopyWarning in Pandas?

But something is still wrong in my code.

Could you please take a look and help me fix this before closing it as a duplicate?

Answer 1

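IMHO, train_test_split gives you a list, and when you call copy() on it, that copy() is the list's method, not a pandas operation. So you only make a shallow copy of the list, not of the DataFrames inside it. In other words,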

X_train, X_test = train_test_split(X, test_size=0.4).copy()

is equivalent to:

train_test = train_test_split(X, test_size=0.4)
train_test_copy = train_test.copy()
X_train, X_test = train_test_copy[0], train_test_copy[1]

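Since variables holding pandas DataFrames are only references, X_train and X_test may or may not point to the same data as X. If you want to copy the DataFrames, you should explicitly call copy() on each of them: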

X_train, X_test = train_test_split(X, test_size=0.4)
X_train, X_test = X_train.copy(), X_test.copy()

or

X_train, X_test = [d.copy() for d in train_test_split(X, test_size=0.4)]

Then X_train and X_test are each a new DataFrame pointing to new data in memory.
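The difference between copying the list and copying its elements can be checked directly. The following is only a minimal sketch, assuming a small random DataFrame; the names parts, shallow, same_list, same_frames and independent are made up for illustration and do not appear in the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed toy data, only for illustration.
X = pd.DataFrame(np.random.rand(10, 3))

# train_test_split returns a list of DataFrames.
parts = train_test_split(X, test_size=0.4, random_state=0)

# list.copy() builds a new list that still holds the same DataFrame objects.
shallow = parts.copy()
same_list = shallow is parts
same_frames = all(a is b for a, b in zip(shallow, parts))

# DataFrame.copy() on each element gives independent DataFrames.
X_train, X_test = [d.copy() for d in parts]
independent = (X_train is not parts[0]) and (X_test is not parts[1])

print(same_list, same_frames, independent)
```

Here same_list is False while same_frames is True, which is why calling copy() on the result of train_test_split does not give you copied DataFrames.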

Update: I tested this code and it runs without any warning:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame(np.random.rand(100,3))
X_train, X_test = train_test_split(X, test_size=0.4)
X_train, X_test = X_train.copy(), X_test.copy()

X_train['abcd'] = 1
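Applied to the code in the question, the same idea might look like the sketch below. It rests on several assumptions: the URLs are made-up sample data, the default TfidfVectorizer tokenizer stands in for the question's NLTK preprocessor, and random_state and n_init are set only to make the run repeatable. The warning check compares the class name, so it does not rely on any particular pandas import path.

```python
import warnings

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up sample data in the shape of the question's 'Review' column.
X = pd.DataFrame({"Review": [
    "https://www.bbc.com", "https://www.netflix.com", "https://www.google.com",
    "https://www.bbc.co.uk", "https://www.netflix.com/browse",
    "https://www.google.com/maps", "https://news.bbc.com",
    "https://help.netflix.com", "https://mail.google.com",
    "https://www.bbc.com/sport",
]})

# Copy each DataFrame, not the list returned by train_test_split.
X_train, X_test = [d.copy() for d in train_test_split(X, test_size=0.4, random_state=0)]

# Default tokenizer used here instead of the question's NLTK preprocessor.
vect = TfidfVectorizer()
vectorized_text = vect.fit_transform(X_train["Review"])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectorized_text)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # predict returns one label per row, in the same order as X_train.
    X_train["Cluster"] = kmeans.predict(vectorized_text)

copy_warnings = [w for w in caught if w.category.__name__ == "SettingWithCopyWarning"]
print(X_train)
print(len(copy_warnings))
```

Because predict returns the labels in row order, they can be assigned directly without wrapping them in a pd.Series with an explicit index.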
