Problem using KMeans on filtered pandas data


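I imported a CSV file and filtered on two columns. That part is standard and works as expected. However, when I run KMeans, the results are unexpected. It looks as if I am running it either on the whole dataset (not the filtered data) or on the wrong data.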

import pandas as pd
from sklearn.cluster import KMeans

# Load data
data = pd.read_csv('ISD.csv', encoding='ISO-8859-1')

# Remove '%' from the entire DataFrame
data = data.apply(lambda x: x.str.rstrip('%') if x.dtype == 'O' else x)

# Replace '#DIV/0!' with NaN
data.replace('#DIV/0!', pd.NA, inplace=True)

# Convert relevant columns to float, replacing missing values with 0
data['Debt to Equity'] = data['Debt to Equity'].replace(pd.NA, '0').astype(float)

This loads the data as expected.

# Select relevant financial metrics
selected_metrics = data[['Debt to Equity', 'Stable Financial Postion']]

# Apply filters
filtered_data = selected_metrics[(selected_metrics['Stable Financial Postion'] == 1)].copy()

# Check counts of Stable Financial Position in the filtered data
print("Counts of Stable Financial Position in Filtered Data:")
print(filtered_data['Stable Financial Postion'].value_counts())

# Reset index of filtered_data
filtered_data.reset_index(drop=True, inplace=True)
print(filtered_data['Stable Financial Postion'].value_counts())

print(filtered_data['Debt to Equity'].max())
print(filtered_data['Debt to Equity'].min())

The results are as expected and correct:

Counts of Stable Financial Position in Filtered Data:

Stable Financial Postion
1    1316
Name: count, dtype: int64
Stable Financial Postion
1    1316
Name: count, dtype: int64
1.923278013
1.25e-09

At the next step, however, things seem to break down and the results become unexpected.

# Clustering
kmeans = KMeans(n_clusters=3, n_init=10)  # Adjust the number of clusters as needed

# Fit and predict on the filtered data
cluster_labels = kmeans.fit_predict(filtered_data[['Debt to Equity', 'Stable Financial Postion']])

# Assign cluster labels to the original DataFrame
data.loc[filtered_data.index, 'Cluster'] = cluster_labels

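Running the describe call below gives a maximum of 226149 for cluster 0, 50000 for cluster 1, and about 370 for cluster 2, even though the maximum 'Debt to Equity' in the filtered data is only about 1.92.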

# Inspect cluster labels
print("\nCluster Counts:")
print(data['Cluster'].value_counts())

# Explore cluster characteristics
cluster_stats = data.groupby('Cluster')['Debt to Equity'].describe()
print("\nCluster Characteristics:")
print(cluster_stats)

The results are as follows:

Cluster Counts:
Cluster
0.0    998
1.0    244
2.0     74
Name: count, dtype: int64

Cluster Characteristics:
         count        mean          std  ...       50%       75%            max
Cluster                                  ...                                   
0.0      968.0  524.708244  8306.283145  ...  0.056144  1.629182  226149.000000
1.0      239.0  550.843018  4978.241124  ...  0.054620  1.931508   50000.000000
2.0       71.0   13.488451    64.011376  ...  0.007957  0.373877     370.186275

Is something wrong with my filtering or with the KMeans step?

Thanks

python pandas k-means
1 Answer

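I found the answer. I was attaching the cluster labels to the original data instead of the filtered data. Because filtered_data had its index reset, filtered_data.index is just 0 to 1315, so data.loc[filtered_data.index, 'Cluster'] wrote the labels onto the first 1316 rows of the unfiltered DataFrame rather than onto the rows that passed the filter. The groupby on data then described 'Debt to Equity' for those unrelated rows.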
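
For anyone hitting the same thing, here is a minimal sketch of one way to keep the labels aligned. It does not use ISD.csv: the DataFrame below is a made-up stand-in (an assumption for illustration), and random_state is only set to make the run repeatable. The idea is to skip reset_index so the filtered rows keep their original index labels, and to store the labels on filtered_data first. It also clusters on 'Debt to Equity' alone, since 'Stable Financial Postion' is constant after filtering and adds nothing to the distances.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for ISD.csv: rows with flag 0 carry huge ratios,
# rows with flag 1 carry small ones (all below 2).
rng = np.random.default_rng(0)
flag = np.tile([0, 1], 50)  # alternating 0/1, 100 rows
ratio = np.where(flag == 1, rng.uniform(0, 2, 100), rng.uniform(100, 1000, 100))
data = pd.DataFrame({"Debt to Equity": ratio, "Stable Financial Postion": flag})

# Filter WITHOUT resetting the index, so each row keeps its original label.
filtered_data = data[data["Stable Financial Postion"] == 1].copy()

# Cluster only the filtered rows and keep the labels next to them.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
filtered_data["Cluster"] = kmeans.fit_predict(filtered_data[["Debt to Equity"]])

# The index labels still match the original frame, so this assignment
# lands on the filtered rows; all other rows get NaN in 'Cluster'.
data.loc[filtered_data.index, "Cluster"] = filtered_data["Cluster"]

# What the question effectively selected: positions 0..n-1 after reset_index.
wrong_rows = data.loc[filtered_data.reset_index(drop=True).index]

clustered = data.dropna(subset=["Cluster"])
print(clustered["Debt to Equity"].max())   # stays below 2
print(wrong_rows["Debt to Equity"].max())  # picks up unfiltered rows
print(filtered_data.groupby("Cluster")["Debt to Equity"].describe())
```

If the labels are only needed for the filtered rows, running the groupby on filtered_data directly (as in the last line) avoids touching the original DataFrame at all.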
