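I want to assign two control plots to each project plot. In the first iteration, I want to assign the nearest control plot (the one with the minimum value of dist_project_plot_id*) to the project plot being evaluated. If that control plot has already been assigned to a project plot, we look for the next-nearest control plot.
Once every project plot has been assigned its first control plot, we assign a second control plot to each project plot following the same criterion: find the control plot with the minimum distance, as long as it has not previously been assigned to another project plot.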
I have a dataframe which looks like:
`data = {
'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
}
df = pd.DataFrame(data)`
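Where:
control_plot_id is the identifier of the control plot
dist_project_plot_id1 is the distance between the control plot and project plot 1
dist_project_plot_id2 is the distance between the control plot and project plot 2
and so on.
I have tried the first search with the following code: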
import numpy as np  # needed for np.nan below
import pandas as pd

df = pd.DataFrame(data)
# Add new columns "PP" and "dist"
df['PP'] = ''
df['dist'] = np.nan
# Get the distance columns; they start with 'dist_project_plot_id'
# (filtering on 'project_plot_id' alone matches nothing)
project_columns = [col for col in df.columns if col.startswith('dist_project_plot_id')]
# Iterate over the project plot distance columns
for col in project_columns:
    # Sort the dataframe by the current column in ascending order
    df_sorted = df.sort_values(col)
    # Find the k-nearest control plots for the current column
    k = 1  # Set the value of k
    nearest_control_plots = []
    for i in range(k):
        min_value = df_sorted.loc[~df_sorted['control_plot_id'].isin(nearest_control_plots)].head(1)[['control_plot_id', col]]
        nearest_control_plots.append(min_value['control_plot_id'].values[0])
        df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'PP'] = col
        df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'dist'] = min_value[col].values[0]
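What I can't work out how to program is this: if a control plot has already been selected for a project plot, the code should keep searching for the next-nearest control plot, which might be the third, the fourth, or even the last one. Maybe there is a library that already does what I'm looking for.
The expected output should contain the following columns: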
OK, based on the description, this is what you're looking for:
import pandas as pd

data = {
    'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
    'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
    'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
    'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
}
df = pd.DataFrame(data)
cols = [c for c in df.columns if c.startswith("dist")]
project_data = {}
assignments = []
for c in cols:
    project_data[c] = {}
    for i in range(1, 3):
        search_df = df.loc[~df.control_plot_id.isin(assignments)]
        control_plot = search_df.loc[search_df[c].idxmin()]
        project_data[c][f"control_plot{i}"] = control_plot.control_plot_id
        project_data[c][f"dist{i}"] = control_plot[c]
        assignments.append(control_plot.control_plot_id)
out_data = pd.DataFrame.from_dict(project_data, orient='index').reset_index()
out_data.rename(columns={'index': 'project'}, inplace=True)
print(out_data)
Returns:
project control_plot1 dist1 control_plot2 dist2
0 dist_project_plot_id1 1528123.0 1697.68 1539206.0 2140.41
1 dist_project_plot_id2 1507770.0 427.82 1526258.0 488.07
2 dist_project_plot_id3 1504105.0 2008.31 2019722.0 5573.02
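This matches what you asked for:
I want to assign the nearest control plot (the one with the minimum value of dist_project_plot_id*) to the project plot being evaluated. If that control plot has already been assigned to a project plot, we look for the next-nearest control plot. Once every project plot has been assigned its first control plot, we assign a second control plot to each project plot following the same criterion: find the control plot with the minimum distance, as long as it has not previously been assigned to another project plot.
But your example output differs from what you explained you wanted, and I'm not sure why...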
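One thing to be aware of: the loop above gives a project plot both of its control plots before moving on to the next project, whereas the description asks for every project to get its first control plot before any second one is handed out. Here is a minimal sketch of that round-by-round order. The helper name `assign_in_rounds` and its parameters `n_controls` and `prefix` are my own, purely for illustration, and it assumes every distance column starts with "dist" and that each control plot may be used only once.

```python
import pandas as pd

data = {
    'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
    'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
    'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
    'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31],
}
df = pd.DataFrame(data)


def assign_in_rounds(df, n_controls=2, prefix="dist"):
    """Hypothetical helper: greedy assignment, one control plot per project per round."""
    # Index by control plot id so idxmin() returns the id directly
    dist = df.set_index("control_plot_id")
    cols = [c for c in dist.columns if c.startswith(prefix)]
    taken = []  # control plots already assigned to some project
    rows = {c: {"project": c} for c in cols}
    # Outer loop over rounds, inner loop over projects:
    # every project gets its i-th control plot before anyone gets the (i+1)-th
    for rnd in range(1, n_controls + 1):
        for c in cols:
            available = dist.loc[~dist.index.isin(taken), c]
            if available.empty:
                continue  # no unassigned control plots left for this project
            best_id = available.idxmin()
            rows[c][f"control_plot{rnd}"] = best_id
            rows[c][f"dist{rnd}"] = available.loc[best_id]
            taken.append(best_id)
    return pd.DataFrame(list(rows.values()))


out_rounds = assign_in_rounds(df)
print(out_rounds)
```

On this sample data the round-by-round order happens to pick the same control plots as the loop above, but with other distances the two orders can differ, so use whichever matches the rule you actually want. This is still a greedy rule: the order of the project columns decides who gets first pick.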