How do I add a column or change the data in each group after using group by in Pandas?

Question

I am using Pandas to process some data. After I use group by in pandas, the simplified DataFrame has the format [MMSI(Vessel_ID), BaseTime, Location, Speed, Course,...].

I use

for MMSI, group in grouped_df:
    print(MMSI)
    print(group)

to print the data.

For example, one group of data looks like this:

             MMSI         BaseDateTime       LAT        LON  SOG  COG
1507  538007509.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6   
1508  538007509.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6

I want to add a column that holds the time difference between two consecutive points.

Below is the output I want:

             MMSI         BaseDateTime       LAT        LON  SOG  COG   Time-diff
1507  538007509.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6   3.0(hours)
1508  538007509.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6   Na

So I tried the following code to get that result:

for MMSI, group in grouped_df:
    group = group.sort_values(by='BaseDateTime')
    group['new-time'] = group.shift(-1)['BaseDateTime']
    group.dropna()

    for x in group.index:
      group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference


    group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum()
    # When time-diff >= 2 hours, split the rows into different groups

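Inside the loop I can use print to display group['GROUP'] and group['time-diff']. But when I access grouped_df again afterwards, the new columns are not there. A warning says that group in grouped_df is only a copy of a slice from a DataFrame, and it suggests using .loc[row_indexer,col_indexer] = value instead. In this case, though, I don't know how to use .loc to access a specific [row, column].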
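The warning comes from the fact that each group yielded by the loop is a separate object, so assigning to it never reaches the original frame. One way to follow the warning's advice is to write into the original df and use group.index as the row indexer. The following is a minimal sketch, assuming the ungrouped frame is called df; the sample rows are hypothetical and only modeled on the data above.

```python
import pandas as pd

# Hypothetical sample data modeled on the rows shown in the question.
df = pd.DataFrame({
    "MMSI": [538007509.0, 538007509.0, 538007510.0, 538007510.0],
    "BaseDateTime": pd.to_datetime([
        "2022-12-08T00:02:25", "2022-12-08T00:05:25",
        "2022-12-08T00:02:25", "2022-12-08T03:02:25",
    ]),
})

# Create the target column on the original frame first.
df["time-diff"] = float("nan")

for mmsi, group in df.groupby("MMSI"):
    group = group.sort_values("BaseDateTime")
    # Hours until the next point of the same vessel (NaN for the last row).
    hours = (
        group["BaseDateTime"].shift(-1).sub(group["BaseDateTime"])
        .dt.total_seconds().div(3600)
    )
    # Assign on df, not on group; group.index supplies the row labels.
    df.loc[group.index, "time-diff"] = hours

print(df)
```

This keeps the loop, so it is mainly useful for understanding the warning; on large data a vectorized groupby method is usually the better choice.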

At first I tried

  grouped_df['new-time'] = grouped_df.shift(-1)['BaseDateTime']
  grouped_df.dropna()

but it fails with

'DataFrameGroupBy' object does not support item assignment

My current solution is to create an empty_df and then concatenate the groups from grouped_df onto it one by one, like this:

df['time-diff'] = pd.Series(dtype='float64')
df['GROUP'] = pd.Series(dtype='int')
empty_df = pd.DataFrame()  # accumulates the processed groups
grouped_df = df.groupby('MMSI')
for MMSI, group in grouped_df:

    # ... same as the code above
    group = group.sort_values(by='BaseDateTime')
    group['new-time'] = group.shift(-1)['BaseDateTime']
    group.dropna()

    for x in group.index:
      group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference


    group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum()
    # ... same as the code above

    frames = [empty_df, group]
    empty_df = pd.concat(frames)

I am not satisfied with this solution, but I have not found the right way to change the values in grouped_df.
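If the loop-and-concatenate approach is kept, a common variation is to collect the processed groups in a list and call pd.concat once at the end, instead of concatenating inside the loop. This is only a sketch under assumptions: the sample data and the names parts and result are hypothetical, and the time difference is computed with plain datetime subtraction rather than the get_timediff helper.

```python
import pandas as pd

# Hypothetical sample data modeled on the rows shown in the question.
df = pd.DataFrame({
    "MMSI": [538007509.0, 538007509.0, 538007510.0, 538007510.0],
    "BaseDateTime": pd.to_datetime([
        "2022-12-08T00:02:25", "2022-12-08T00:05:25",
        "2022-12-08T00:02:25", "2022-12-08T03:02:25",
    ]),
})

parts = []  # processed groups, concatenated once after the loop
for mmsi, group in df.groupby("MMSI"):
    # An explicit copy makes it clear that we modify our own object.
    group = group.sort_values("BaseDateTime").copy()
    group["time-diff"] = (
        group["BaseDateTime"].shift(-1).sub(group["BaseDateTime"])
        .dt.total_seconds().div(3600)
    )
    parts.append(group)

result = pd.concat(parts)
print(result)
```

A single concat avoids rebuilding the accumulated frame on every iteration, which matters when there are many vessels.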

I am now trying to use the solution from this question to process the DataFrame before grouping.

Can anyone help me?

Answer 1

Don't use a loop; use groupby.shift or groupby.diff directly:

s = pd.to_datetime(df['BaseDateTime'])

df['Time-diff'] = (s.groupby(df['MMSI']).shift(-1)
                    .sub(s).dt.total_seconds().div(3600)
                  )

Or:

s = pd.to_datetime(df['BaseDateTime'])

df['Time-diff'] = (s.groupby(df['MMSI']).diff(-1)
                    .mul(-1).dt.total_seconds().div(3600)
                  )

Output:

             MMSI         BaseDateTime       LAT        LON  SOG  COG  Time-diff
1507  538007509.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6       0.05
1508  538007509.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6        NaN
1509  538007510.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6       0.05
1510  538007510.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6        NaN
1511  538007511.0  2022-12-08T00:02:25  49.29104 -123.19135  0.0  9.6       0.05
1523  538007511.0  2022-12-08T00:05:25  49.29102 -123.19138  0.0  9.6        NaN
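The question also asks for a GROUP column that splits each vessel's track wherever the gap is 2 hours or more. The same groupby style can likely cover that too. The sketch below rests on two assumptions that the question does not confirm: the gap is measured to the previous point (so the first point after a long gap opens the new segment), and the counter restarts at 0 for every MMSI. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "MMSI": [1.0, 1.0, 1.0, 2.0, 2.0],
    "BaseDateTime": [
        "2022-12-08T00:02:25", "2022-12-08T00:05:25", "2022-12-08T04:05:25",
        "2022-12-08T00:02:25", "2022-12-08T00:05:25",
    ],
})

df["BaseDateTime"] = pd.to_datetime(df["BaseDateTime"])
df = df.sort_values(["MMSI", "BaseDateTime"])

# Hours since the previous point of the same vessel (NaN on each vessel's first row).
gap_hours = df.groupby("MMSI")["BaseDateTime"].diff().dt.total_seconds().div(3600)

# A gap of 2 hours or more opens a new segment; count segments per vessel.
df["GROUP"] = gap_hours.ge(2).astype(int).groupby(df["MMSI"]).cumsum()

print(df)
```

NaN compares as False in ge(2), so the first row of each vessel stays in segment 0 without any fillna step.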
