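I want to fill data frame columns with the time difference between the current timestamp and the closest timestamp of "type A" or "not type A", i.e. type_A = 1 or type_A = 0, respectively. A small example is shown below: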
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'id':[1,2,3,4],
                   'tmstmp':[datetime(2018,5,4,13,27,10), datetime(2018,5,3,13,27,10),
                             datetime(2018,5,2,13,27,10), datetime(2018,5,1,13,27,10)],
                   'type_A':[0, 1, 0, 1],
                   'dt_A': [np.nan]*4,
                   'dt_notA': [np.nan]*4
                  })
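(A and non-A rows do not necessarily alternate, but the timestamp column is already sorted in descending order.) I compute the time differences between the timestamp of the current row and that of the next row having type_A = 1 or type_A = 0, respectively, by iterating over the integer row index and accessing elements by this integer index and the column name: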
keys = {0: 'dt_A', 1: 'dt_notA'}
ridx = 0
while ridx + 1 < df.shape[0]:
    ts1 = df.iloc[ridx]['tmstmp']
    ts2 = df.iloc[ridx + 1]['tmstmp']
    found = 0 if df.iloc[ridx + 1]['type_A'] == 0 else 1
    key = keys[found]
    df.loc[ridx, key] = (ts1 - ts2).total_seconds()/3600
    complement = 1 - found
    j = 2
    while ridx + j < df.shape[0] and df.iloc[ridx + j]['type_A'] != complement:
        j += 1
    if ridx + j < df.shape[0]:
        ts1 = df.iloc[ridx]['tmstmp']
        ts2 = df.iloc[ridx + j]['tmstmp']
        val = (ts1 - ts2).total_seconds()/3600
    else:
        val = np.nan
    df.loc[ridx, keys[complement]] = val
    ridx += 1
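For efficiency reasons, iterating over data frames is "discouraged" (see How to iterate over rows in a DataFrame in Pandas?), and using integer indexes is even less "pythonic". So my question is: in this particular case, is there a "better" (more efficient, more pythonic) way to go through the data frame to accomplish the given task? Many thanks for any suggestions or ideas!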
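As a small illustration of the column-wise style that the linked question recommends over row iteration, the first step of the loop (the difference between each row and the row directly below it) can be written without an explicit loop. This is only a sketch of that one step using `Series.diff`; it does not handle the search for the next row of a given type, and the name `hours_to_next_row` is made up for the illustration.

```python
from datetime import datetime
import pandas as pd

# Same timestamps and type flags as in the example above.
df = pd.DataFrame({
    'tmstmp': [datetime(2018, 5, 4, 13, 27, 10), datetime(2018, 5, 3, 13, 27, 10),
               datetime(2018, 5, 2, 13, 27, 10), datetime(2018, 5, 1, 13, 27, 10)],
    'type_A': [0, 1, 0, 1],
})

# diff(-1) subtracts the following row from the current one for the whole column;
# the last row has no successor, so it becomes NaT (NaN after the conversion).
hours_to_next_row = df['tmstmp'].diff(-1).dt.total_seconds() / 3600
print(hours_to_next_row.tolist())
```

On the example data this gives 24 hours for the first three rows and NaN for the last one.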
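Edit: input and output data frames for this small example. Column dt_A contains the time delta between the current row and the next row having type_A = 1; dt_notA contains the time delta to the closest row having type_A = 0. Input: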
id tmstmp type_A dt_A dt_notA
0 1 2018-05-04 13:27:10 0 NaN NaN
1 2 2018-05-03 13:27:10 1 NaN NaN
2 3 2018-05-02 13:27:10 0 NaN NaN
3 4 2018-05-01 13:27:10 1 NaN NaN
Output:
id tmstmp type_A dt_A dt_notA
0 1 2018-05-04 13:27:10 0 48.0 24.0
1 2 2018-05-03 13:27:10 1 24.0 48.0
2 3 2018-05-02 13:27:10 0 NaN 24.0
3 4 2018-05-01 13:27:10 1 NaN NaN
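I want to fill data frame columns with the time difference between the current timestamp and the closest timestamp of "type A" or "not type A", i.e. type_A = 1 or type_A = 0. ...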
import numpy as np

def next_value_index(l, i, val):
    """Return index of l where val occurs next from position i."""
    try:
        return l[(i+1):].index(val) + (i + 1)
    except ValueError:
        return np.nan

def next_value_indexes(l, val):
    """Return for each position in l next-occurrence-indexes of val in l."""
    return np.array([next_value_index(l, i, val) for i, _ in enumerate(l)])
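def nan_allowing_access(df, col, indexes):
    """Return df[col] at the given positions; NaN positions yield NaN/NaT."""
    # position 0 is a placeholder for NaN; cast to int because iloc needs integers
    idxs = np.array([idx if not np.isnan(idx) else 0 for idx in indexes]).astype(int)
    res = df[col].iloc[idxs]
    res[np.isnan(indexes)] = np.nan
    return res # NaT for timestamps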
def diff_timestamps(dfcol1, dfcol2): # element-wise subtraction of two pandas timestamp columns
    # this is not optimal in speed, but numpy did unwanted type conversions
    return [x - y for x, y in zip(list(dfcol1), list(dfcol2))]

def td2hours(timedelta): # convert timedelta to hours
    return timedelta.total_seconds() / 3600
def time_diff_to_next_val(df, tmstmp_col, col, val, converter_func):
    """
    Return time differences (timestamps are given in the tmstmp_col column
    of the pandas data frame `df`) from each row's timestamp to the
    timestamp of the next row that has the value `val` in column `col`.
    converter_func is the function used to convert each timedelta
    value.
    """
    next_val_indexes = next_value_indexes(df[col].tolist(), val)
    next_val_timestamps = nan_allowing_access(df, tmstmp_col, next_val_indexes)
    return [converter_func(x) for x in diff_timestamps(df[tmstmp_col], next_val_timestamps)]
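For comparison, the same "time to the next row with a given value" lookup can probably also be expressed with pandas' own column operations instead of list comprehensions. The following is an alternative sketch, not the approach of the functions above: the helper name `hours_to_next` is hypothetical, and it assumes the `tmstmp` and `type_A` columns and the row order of the question's example.

```python
from datetime import datetime
import pandas as pd

# Example data from the question (without the empty result columns).
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'tmstmp': [datetime(2018, 5, 4, 13, 27, 10), datetime(2018, 5, 3, 13, 27, 10),
               datetime(2018, 5, 2, 13, 27, 10), datetime(2018, 5, 1, 13, 27, 10)],
    'type_A': [0, 1, 0, 1],
})

def hours_to_next(df, val, tmstmp_col='tmstmp', col='type_A'):
    """Hours from each row to the nearest row below it where df[col] == val."""
    # Keep timestamps only on rows that match val (NaT everywhere else).
    matching = df[tmstmp_col].where(df[col] == val)
    # Shift up by one so a row never matches itself, then back-fill so each
    # row sees the timestamp of the nearest matching row below it.
    next_ts = matching.shift(-1).bfill()
    return (df[tmstmp_col] - next_ts).dt.total_seconds() / 3600

to_next_1 = hours_to_next(df, 1)
to_next_0 = hours_to_next(df, 0)
```

On the example data, `hours_to_next(df, 1)` gives 24, 48, 24, NaN and `hours_to_next(df, 0)` gives 48, 24, NaN, NaN. With the `keys` mapping used in the question's loop (0 to `dt_A`, 1 to `dt_notA`), these correspond to the `dt_notA` and `dt_A` columns of the question's output, respectively.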