Could you suggest a solution for the following task? Suppose I have a DataFrame like this:
import pandas as pd

data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}
df = pd.DataFrame(data)
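I am creating 5 intervals based on the Hungarian_rate column. The goal is to create intervals whose total populations are as similar as possible. The code below only creates intervals that contain a similar number of towns.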
df["interval"] = pd.qcut(df.Hungarian_rate, q=5)
grouped = df.groupby("interval").size().reset_index(name='num')
grouped
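But my goal is to generate 5 intervals from the "Hungarian rate (%)" column so that the total population in each interval is similar. The desired result is shown in the image below.
I have tried to find a way to create such intervals through various sources, including Stack Overflow and AI tools, but without success. Could you give me some guidance on how to create intervals based on a specific criterion like this?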
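As mentioned in the comments, this is an optimization problem: you want to balance the population in each interval, essentially trying to reach
target_population_per_interval = total_population / num_intervals
in every interval. Here is a suggestion for how to do this: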
import pandas as pd

# Data setup
data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}
df = pd.DataFrame(data)

# Sort by rate and reset the index so that label `index - 1` is always the previous row
df_sorted = df.sort_values(by="Hungarian_rate").reset_index(drop=True)

num_intervals = 5
total_population = df_sorted['Population'].sum()
target_population_per_interval = total_population / num_intervals

current_population = 0
interval_number = 0
interval_start = df_sorted.iloc[0]['Hungarian_rate']
intervals = []
interval_bounds = []

for index, row in df_sorted.iterrows():
    if interval_number < num_intervals - 1:
        # Keep the town in the current interval only if that brings the running total closer to the target
        if abs((current_population + row['Population']) - target_population_per_interval) < abs(current_population - target_population_per_interval):
            current_population += row['Population']
            intervals.append(interval_number)
        else:
            # Otherwise close the current interval at the previous town and start a new one here
            interval_end = df_sorted.at[index - 1, 'Hungarian_rate'] if index > 0 else row['Hungarian_rate']
            interval_bounds.append((interval_start, interval_end))
            interval_start = row['Hungarian_rate']
            interval_number += 1
            current_population = row['Population']
            intervals.append(interval_number)
    else:
        # The last interval takes all remaining towns
        intervals.append(interval_number)

# Close the final interval at the highest rate
interval_end = df_sorted.iloc[-1]['Hungarian_rate']
interval_bounds.append((interval_start, interval_end))

df_sorted['Interval'] = intervals
interval_df = pd.DataFrame({
    'Interval_Number': range(num_intervals),
    'Range': interval_bounds,
    'Count': df_sorted.groupby('Interval').size().values,
    'Population_Sum': df_sorted.groupby('Interval')['Population'].sum().values
})
print(interval_df)
This gives you:
   Interval_Number     Range  Count  Population_Sum
0                0    (1, 5)      3              80
1                1    (8, 9)      2              80
2                2  (10, 15)      3              90
3                3  (20, 31)      4              70
4                4  (41, 75)      3             120
However, I doubt that you can reach the expected output posted in your image with simple decision logic like this.
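For a dataset as small as this sample, one alternative is to try every possible set of cut points and keep the most balanced one. The following is only a minimal sketch under my own assumptions: the helper name best_contiguous_split and the output columns Rate_Min and Rate_Max are made up for illustration, "most balanced" is taken to mean the smallest gap between the largest and smallest population sum, and cuts between towns that share the same rate are skipped because an interval cannot separate them.

```python
from itertools import combinations

import pandas as pd

data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30],
}
df = pd.DataFrame(data).sort_values("Hungarian_rate").reset_index(drop=True)


def best_contiguous_split(frame, num_intervals):
    """Hypothetical helper: brute-force the cut positions of the sorted rows
    that give the smallest max-min spread of population sums."""
    pops = frame["Population"].tolist()
    rates = frame["Hungarian_rate"].tolist()
    n = len(pops)
    # A cut at position k separates row k-1 from row k.
    # Cuts between equal rates are not allowed (assumption stated above).
    allowed = [k for k in range(1, n) if rates[k - 1] != rates[k]]
    best_cuts, best_spread = None, float("inf")
    for cuts in combinations(allowed, num_intervals - 1):
        edges = (0, *cuts, n)
        sums = [sum(pops[a:b]) for a, b in zip(edges, edges[1:])]
        spread = max(sums) - min(sums)
        if spread < best_spread:
            best_cuts, best_spread = cuts, spread
    return best_cuts, best_spread


cuts, spread = best_contiguous_split(df, 5)

# Turn the cut positions into one interval label per row
edges = (0, *cuts, len(df))
df["Interval"] = [g for g, (a, b) in enumerate(zip(edges, edges[1:])) for _ in range(b - a)]

summary = df.groupby("Interval").agg(
    Rate_Min=("Hungarian_rate", "min"),
    Rate_Max=("Hungarian_rate", "max"),
    Count=("Town", "size"),
    Population_Sum=("Population", "sum"),
).reset_index()
print(summary)
```

On the sample data this search ends up with population sums of 80, 100, 90, 90 and 80 (rate ranges 1 to 5, 8 to 10, 15 to 20, 22 to 41 and 60 to 75). The number of combinations grows quickly with the number of towns, so this is only practical for small inputs.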
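There are other ways to do this, such as greedy algorithms, but which one is practical depends on the size of your actual dataset, because they can be computationally heavy. Here is an example that lets you tune a threshold parameter. It tries to distribute the population across the 5 intervals so that the difference in population between the largest and the smallest interval is less than threshold. Be warned: this is computationally expensive: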
import pandas as pd
import numpy as np

data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}
df = pd.DataFrame(data)
df_sorted = df.sort_values(by="Hungarian_rate").reset_index(drop=True)

num_intervals = 5
interval_populations = np.zeros(num_intervals)
intervals = np.zeros(len(df_sorted), dtype=int)
# Stop once the largest and smallest totals differ by less than this.
# 25 matches the output shown below; with 20 the loop would never reach the stop condition on this data.
threshold = 25

# Initial assignment. Because every population is positive, this places every town
# in interval 1; the rebalancing loop below does the real work.
for i in range(len(df_sorted)):
    row = df_sorted.iloc[i]
    current_interval = intervals[i]
    new_population = interval_populations[current_interval] + row['Population']
    if current_interval < num_intervals - 1 and new_population > interval_populations[current_interval]:
        current_interval += 1
    intervals[i] = current_interval
    interval_populations[current_interval] += row['Population']

# Repeatedly move one town from the fullest interval to the emptiest one
while True:
    max_idx = np.argmax(interval_populations)
    min_idx = np.argmin(interval_populations)
    max_pop = interval_populations[max_idx]
    min_pop = interval_populations[min_idx]
    if max_idx == min_idx or (max_pop - min_pop) < threshold:
        break
    for i in range(len(df_sorted)):
        if intervals[i] == max_idx:
            proposed_new_max = max_pop - df_sorted.at[i, 'Population']
            proposed_new_min = min_pop + df_sorted.at[i, 'Population']
            # Accept the move only if it narrows the gap between the two intervals
            if abs(proposed_new_max - proposed_new_min) < abs(max_pop - min_pop):
                intervals[i] = min_idx
                interval_populations[max_idx] -= df_sorted.at[i, 'Population']
                interval_populations[min_idx] += df_sorted.at[i, 'Population']
                break
    else:
        # No single move narrows the gap, so stop instead of looping forever
        break

df_sorted['Interval'] = intervals
interval_summary = df_sorted.groupby('Interval').agg(
    Population_Sum=('Population', 'sum'),
    Towns=('Town', list),
    Count=('Town', 'size')
).reset_index()
print(interval_summary)
This gives you (threshold = 25):
   Interval  Population_Sum         Towns  Count
0         0             100     [F, C, E]      3
1         1              80        [D, H]      2
2         2              90  [A, B, K, G]      4
3         3              80  [N, L, J, M]      4
4         4              90        [O, I]      2
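Note that the groups in this output are balanced by population, but they are not contiguous ranges of Hungarian_rate. A small check, sketched here with the group membership copied from the output above, makes this visible. The column names Rate_Min, Rate_Max and Overlaps_Previous are my own and only serve the illustration:

```python
import pandas as pd

towns = ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"]
rate_values = [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75]
rates = dict(zip(towns, rate_values))

# Group membership copied from the printed result above
groups = {
    0: ["F", "C", "E"],
    1: ["D", "H"],
    2: ["A", "B", "K", "G"],
    3: ["N", "L", "J", "M"],
    4: ["O", "I"],
}

# Lowest and highest rate inside each group
rows = [
    {
        "Interval": g,
        "Rate_Min": min(rates[t] for t in members),
        "Rate_Max": max(rates[t] for t in members),
    }
    for g, members in groups.items()
]
ranges = pd.DataFrame(rows).sort_values("Rate_Min").reset_index(drop=True)

# A group overlaps the previous one if it starts at or before the previous group's highest rate
ranges["Overlaps_Previous"] = ranges["Rate_Min"] <= ranges["Rate_Max"].shift()
print(ranges)
```

Several of the rate ranges overlap, so if you need true intervals on the rate axis, you would have to restrict the moves to towns at the boundary between neighbouring groups.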