I currently have this quick example to work with:
import pandas as pd
left = pd.DataFrame({"left_val": [1, 2, 3, 6, 7]}, index=pd.to_datetime([1, 2, 3, 6, 7], unit='s'))
right = pd.DataFrame({"right_val": ["a", "b", "c"]}, index=pd.to_datetime([1, 5, 10], unit='s'))
# Filter to contain samples that are within the time interval of left
right_filtered = right[(right.index >= left.index.min()) & (right.index <= left.index.max())]
output = pd.merge_asof(left, right_filtered, left_index=True, right_index=True, direction="nearest")
My output is:
left_val right_val
1970-01-01 00:00:01 1 a
1970-01-01 00:00:02 2 a
1970-01-01 00:00:03 3 a
1970-01-01 00:00:06 6 b
1970-01-01 00:00:07 7 b
But I would like the following:
left_val right_val
1970-01-01 00:00:01 1 a
1970-01-01 00:00:02 2 NaN
1970-01-01 00:00:03 3 NaN
1970-01-01 00:00:06 6 b
1970-01-01 00:00:07 7 NaN
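The main difference is that I want each value from right to appear only once in the output dataframe, with NaN filling the other rows, so that I can build a sparse dataframe and save some space. I would like to avoid iterating over the result to set the repeated values to NaN, because if right contains two consecutive identical values, that approach would delete original information. I have been looking for an input parameter or a method that does something like this, but I could not find one.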
Thanks!
You can adjust the output after calling merge_asof:
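import numpy as np
# Give each run of consecutive equal right_val values its own group id
groups = (output["right_val"] != output["right_val"].shift(1)).cumsum()
# Keep the value on the first row of each run and replace the repeats with NaN
output["right_val"] = np.where(~groups.duplicated(), output["right_val"], np.nan)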
print(output)
Prints:
left_val right_val
1970-01-01 00:00:01 1 a
1970-01-01 00:00:02 2 NaN
1970-01-01 00:00:03 3 NaN
1970-01-01 00:00:06 6 b
1970-01-01 00:00:07 7 NaN
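Note that grouping on the values themselves treats two different rows of right that hold the same value as a single run, which is the case the question is worried about. One possible way around that, as a minimal sketch rather than a built-in merge_asof option, is to carry the timestamp of the matched right row through the merge and mask repeats of that timestamp instead of repeats of the value. The helper column right_ts and the variable names below are made up for illustration, and right is changed to hold "a" twice to show the difference:

```python
import pandas as pd

left = pd.DataFrame(
    {"left_val": [1, 2, 3, 6, 7]},
    index=pd.to_datetime([1, 2, 3, 6, 7], unit="s"),
)
# Two consecutive rows of right deliberately share the value "a"
right = pd.DataFrame(
    {"right_val": ["a", "a", "c"]},
    index=pd.to_datetime([1, 5, 10], unit="s"),
)
right_filtered = right[
    (right.index >= left.index.min()) & (right.index <= left.index.max())
]

# Hypothetical helper column: remember which right row each match came from
right_tagged = right_filtered.assign(right_ts=right_filtered.index)
merged = pd.merge_asof(
    left, right_tagged, left_index=True, right_index=True, direction="nearest"
)

# Keep right_val only on the first left row matched to each right row
first_match = ~merged["right_ts"].duplicated()
merged["right_val"] = merged["right_val"].where(first_match)
sparse = merged.drop(columns="right_ts")
print(sparse)
```

With this input, both "a" rows of right survive in the result, each exactly once, whereas comparing consecutive values would keep only the first "a".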