How can I use Python to create intervals so that each group (interval) is as similar in size as possible according to a specific criterion?

Question (votes: 0, answers: 1)

Could you suggest a solution for the following task? Suppose I have a data frame like this:

import pandas as pd

data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}

df = pd.DataFrame(data)

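I am creating 5 intervals based on the Hungarian_rate column. The goal is to create intervals whose total populations are as similar as possible. The code below only creates intervals with a similar number of towns in each interval.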

df["interval"] = pd.qcut(df["Hungarian_rate"], q=5)

grouped = df.groupby("interval").size().reset_index(name='num')

grouped

[Image: output of the code above; not preserved in the source]
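To make the mismatch concrete, the population per bin can be summed alongside the town count. This is a minimal sketch on the data frame from the question; the `summary` name and the `observed=True` argument (used only to keep the categorical group-by explicit) are additions, not part of the original code.

```python
import pandas as pd

data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}
df = pd.DataFrame(data)

# Equal-frequency bins: qcut balances the number of towns, not the population
df["interval"] = pd.qcut(df["Hungarian_rate"], q=5)

# Count the towns and sum the population in each bin
summary = (
    df.groupby("interval", observed=True)
      .agg(num=("Town", "size"), population=("Population", "sum"))
      .reset_index()
)
print(summary)
```

On this data the town counts come out equal while the population sums range from 50 to 120, which is the imbalance the question is about.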

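My goal, however, is to produce 5 intervals from the "Hungarian rate (%)" column such that the total population in each interval is similar. The desired result is shown below.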

[Image: desired result; not preserved in the source]

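I have looked for a way to create such intervals through various sources, including Stack Overflow and AI tools, without success so far. Could you give some guidance on how to create intervals according to a specific criterion?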

python pandas dataframe
1 answer
0
votes

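As mentioned in the comments, this is an optimization problem: you want the population of each interval to be as close as possible to

target_population_per_interval = total_population / num_intervals

Here is one suggestion for how to do that: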

import pandas as pd

# Data setup
data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}

df = pd.DataFrame(data)

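# Reset the index so that `index - 1` below always refers to the previous row
df_sorted = df.sort_values(by="Hungarian_rate").reset_index(drop=True)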

num_intervals = 5

total_population = df_sorted['Population'].sum()
target_population_per_interval = total_population / num_intervals

current_population = 0
interval_number = 0
interval_start = df_sorted.iloc[0]['Hungarian_rate']
intervals = []
interval_bounds = []

for index, row in df_sorted.iterrows():
    if interval_number < num_intervals - 1:
        if abs((current_population + row['Population']) - target_population_per_interval) < abs(current_population - target_population_per_interval):
            current_population += row['Population']
            intervals.append(interval_number)
        else:
            interval_end = df_sorted.at[index - 1, 'Hungarian_rate'] if index > 0 else row['Hungarian_rate']
            interval_bounds.append((interval_start, interval_end))
            interval_start = row['Hungarian_rate']
            interval_number += 1
            current_population = row['Population']
            intervals.append(interval_number)
    else:
        intervals.append(interval_number)

interval_end = df_sorted.iloc[-1]['Hungarian_rate']
interval_bounds.append((interval_start, interval_end))

df_sorted['Interval'] = intervals

interval_df = pd.DataFrame({
    'Interval_Number': range(num_intervals),
    'Range': interval_bounds,
    'Count': df_sorted.groupby('Interval').size().values,
    'Population_Sum': df_sorted.groupby('Interval')['Population'].sum().values
})

print(interval_df)

This gives you:

   Interval_Number     Range  Count  Population_Sum
0                0    (1, 5)      3              80
1                1    (8, 9)      2              80
2                2  (10, 15)      3              90
3                3  (20, 31)      4              70
4                4  (41, 75)      3             120

However, I doubt that simple decision logic like this can reproduce the expected output shown in your image.
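One inexpensive alternative, offered here as a sketch rather than as part of the original answer, is to cut the cumulative population into equal slices: each town is assigned to the slice of width `target` that contains the midpoint of its own span on the cumulative population axis. The names `cum_pop`, `midpoint`, and `summary` are illustrative.

```python
import numpy as np
import pandas as pd

data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}
df = pd.DataFrame(data).sort_values("Hungarian_rate").reset_index(drop=True)

num_intervals = 5
target = df["Population"].sum() / num_intervals

# Midpoint of each town's span on the cumulative population axis
cum_pop = df["Population"].cumsum()
midpoint = cum_pop - df["Population"] / 2

# Assign each town to the target-sized slice containing its midpoint
df["Interval"] = np.minimum(midpoint // target, num_intervals - 1).astype(int)

summary = df.groupby("Interval").agg(
    Rate_Min=("Hungarian_rate", "min"),
    Rate_Max=("Hungarian_rate", "max"),
    Count=("Town", "size"),
    Population_Sum=("Population", "sum"),
).reset_index()
print(summary)
```

On the sample data this yields population sums of 80, 100, 90, 90 and 80. Two caveats apply: towns sharing the same rate can end up in different intervals, and a single very large town can leave an interval empty, so the result should be checked on real data.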

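There are other ways to do this, such as greedy algorithms, but whether they are suitable depends on the size of the actual dataset, because they can be heavy. Here is an example that lets you tune a threshold parameter. It tries to spread the population over 5 intervals so that the difference in population between the largest and the smallest interval is less than the threshold. Note, however, that this is computationally expensive: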

import pandas as pd
import numpy as np

data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}

df = pd.DataFrame(data)
df_sorted = df.sort_values(by="Hungarian_rate").reset_index(drop=True)

num_intervals = 5

interval_populations = np.zeros(num_intervals)
intervals = np.zeros(len(df_sorted), dtype=int)
# With threshold = 20 the balancing loop below never terminates on this data
threshold = 25

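# Initial assignment. Because populations are positive, the condition below is
# always true, so every town starts in interval 1; the balancing loop that
# follows does the actual distribution.
for i in range(len(df_sorted)):
    row = df_sorted.iloc[i]
    current_interval = intervals[i]
    new_population = interval_populations[current_interval] + row['Population']
    if current_interval < num_intervals - 1 and new_population > interval_populations[current_interval]:
        current_interval += 1
        intervals[i] = current_interval
    interval_populations[current_interval] += row['Population']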

while True:
    max_idx = np.argmax(interval_populations)
    min_idx = np.argmin(interval_populations)
    max_pop = interval_populations[max_idx]
    min_pop = interval_populations[min_idx]

    # Stop once the largest and smallest intervals are close enough
    if max_idx == min_idx or (max_pop - min_pop) < threshold:
        break

    # Move the first town from the largest interval whose transfer narrows the gap
    for i in range(len(df_sorted)):
        if intervals[i] == max_idx:
            proposed_new_max = max_pop - df_sorted.at[i, 'Population']
            proposed_new_min = min_pop + df_sorted.at[i, 'Population']

            if abs(proposed_new_max - proposed_new_min) < abs(max_pop - min_pop):
                intervals[i] = min_idx
                interval_populations[max_idx] -= df_sorted.at[i, 'Population']
                interval_populations[min_idx] += df_sorted.at[i, 'Population']
                break
    else:
        # No single move improves the balance; stop instead of looping forever
        break

df_sorted['Interval'] = intervals

interval_summary = df_sorted.groupby('Interval').agg(
    Population_Sum=('Population', 'sum'),
    Towns=('Town', list),
    Count=('Town', 'size')
).reset_index()

print(interval_summary)

This gives you (threshold = 25):

   Interval  Population_Sum         Towns  Count
0         0             100     [F, C, E]      3
1         1              80        [D, H]      2
2         2              90  [A, B, K, G]      4
3         3              80  [N, L, J, M]      4
4         4              90        [O, I]      2
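Note that the groups above are no longer ranges of Hungarian_rate: F (rate 1), C (rate 15) and E (rate 22) share group 0. If contiguous intervals are required, one option is an exact dynamic program over the sorted rates. The sketch below is an assumption-laden illustration rather than the original answer's method: `best_contiguous_split` is a hypothetical helper, and the squared deviation from `total / k` is just one possible choice of objective.

```python
import pandas as pd

data = {
    "Town": ["F", "A", "N", "O", "B", "L", "C", "K", "J", "E", "G", "M", "I", "D", "H"],
    "Hungarian_rate": [1, 4, 5, 8, 9, 10, 15, 15, 20, 22, 23, 31, 41, 60, 75],
    "Population": [40, 10, 30, 50, 30, 20, 40, 30, 20, 20, 20, 10, 40, 50, 30]
}
df = pd.DataFrame(data)

def best_contiguous_split(weights, k):
    """Split weights into k contiguous, non-empty groups that minimise the sum
    of squared deviations of each group's total from total / k.
    Returns (start, stop) index pairs with stop exclusive."""
    n = len(weights)
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    target = prefix[-1] / k
    inf = float("inf")
    # cost[j][i]: best cost of splitting the first i items into j groups
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):
                c = cost[j - 1][s] + (prefix[i] - prefix[s] - target) ** 2
                if c < cost[j][i]:
                    cost[j][i], cut[j][i] = c, s
    # Walk the stored cut positions backwards to recover the groups
    bounds, i = [], n
    for j in range(k, 0, -1):
        s = cut[j][i]
        bounds.append((s, i))
        i = s
    return bounds[::-1]

# Merge towns that share a rate so a boundary never splits a tie (groupby sorts by rate)
by_rate = df.groupby("Hungarian_rate", as_index=False)["Population"].sum()

rows = []
for start, stop in best_contiguous_split(by_rate["Population"].tolist(), 5):
    chunk = by_rate.iloc[start:stop]
    rows.append({
        "Rate_Min": chunk["Hungarian_rate"].min(),
        "Rate_Max": chunk["Hungarian_rate"].max(),
        "Population_Sum": chunk["Population"].sum(),
    })
result = pd.DataFrame(rows)
print(result)
```

The triple loop costs on the order of k·n² steps for n distinct rates, which is fine for a few thousand rows but may need a faster formulation beyond that.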