Pandas: join rows if at least one cell is NaN

Question

I have a pandas DataFrame built from text extracted from a PDF file. It looks like this:

index      date         description1        description2        value1        value2
   0       18-01-2019    some                  more                1             2
   1       NaN           text                  text                NaN           NaN
   2       NaN           here                   NaN                NaN           NaN
   3       19-01-2019    some                  some                3             4
   4       NaN           text                  more                NaN           NaN
   5       NaN           here                  text                NaN           NaN
   6       NaN            NaN                  here                NaN           NaN
   .
   .
   .

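Each record always has at least one row without NaN, and that row always contains the date and the values. Only the descriptions span multiple rows.

Is there a way to join each dated row with the rows below it, up to the next row whose values are not NaN, and concatenate the descriptions?

Expected output: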

index      date         description1        description2           value1        value2
   0       18-01-2019    some text here      more text              1             2
   1       19-01-2019    some text here      some more text here    3             4
   .
   .
   .

Answer 1

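One idea is to forward-fill date (or any column that distinguishes the groups) to create a grouping key, then take the first value for numeric columns and, for all other columns, join the values after dropping the missing ones: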

f = lambda x: x.iloc[0] if np.issubdtype(x.dtype, np.number) else ' '.join(x.dropna())
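
As a minimal sketch of how such a function could be applied, the snippet below rebuilds the sample frame from the question (the dtypes are an assumption) and groups by the forward-filled date. It swaps np.issubdtype for pandas' pd.api.types.is_numeric_dtype, which is not what the answer uses but also copes with pandas extension dtypes; the names agg_col and out are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; the real frame's dtypes may differ.
df = pd.DataFrame({
    'date': ['18-01-2019', np.nan, np.nan, '19-01-2019', np.nan, np.nan, np.nan],
    'description1': ['some', 'text', 'here', 'some', 'text', 'here', np.nan],
    'description2': ['more', 'text', np.nan, 'some', 'more', 'text', 'here'],
    'value1': [1, np.nan, np.nan, 3, np.nan, np.nan, np.nan],
    'value2': [2, np.nan, np.nan, 4, np.nan, np.nan, np.nan],
})

def agg_col(x):
    # Numeric columns: keep the value from the first row of the group.
    if pd.api.types.is_numeric_dtype(x):
        return x.iloc[0]
    # Text columns: drop missing pieces, then join the rest with spaces.
    return ' '.join(x.dropna())

# Forward-filling the date gives every continuation row its record's date.
out = df.groupby(df['date'].ffill()).agg(agg_col).reset_index(drop=True)
print(out)
```

This relies on the first row of each group being the one that carries the values, which the question states is always the case.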

Or specify the aggregation for each column in a dictionary:

f1 = lambda x: ' '.join(x.dropna())

f = {'date':'first', 'description1':f1, 'description2':f1, 'value1':'first', 'value2':'first'}

To build the dictionary dynamically, create two dictionaries and merge them:

f1 = lambda x: ' '.join(x.dropna())

c =['description1','description2']
d1 = dict.fromkeys(c, f1)
d2 = dict.fromkeys(df.columns.difference(c), 'first')
f = {**d1, **d2}

df = df.groupby(df['date'].ffill()).agg(f).reset_index(drop=True)
#alternative
#df = df.groupby(df['date'].ffill(), as_index=False).agg(f)

print(df)
         date    description1         description2  value1  value2
0  18-01-2019  some text here            more text     1.0     2.0
1  19-01-2019  some text here  some more text here     3.0     4.0
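
Grouping by the forward-filled date merges two separate records if they happen to share the same date. If that can occur in the data, one possible alternative (a sketch, not part of the answer above) is to number the records by counting the rows where date is present. The names record_id and join_text, and the sample data, are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two distinct records that share the same date.
df = pd.DataFrame({
    'date': ['18-01-2019', np.nan, '18-01-2019', np.nan],
    'description1': ['some', 'text', 'other', 'line'],
    'description2': ['more', 'text', 'entry', np.nan],
    'value1': [1, np.nan, 3, np.nan],
    'value2': [2, np.nan, 4, np.nan],
})

# Every row with a date starts a new record; cumsum turns that into an id.
record_id = df['date'].notna().cumsum().rename('record_id')

def join_text(x):
    # Drop missing pieces before joining, since str.join rejects NaN.
    return ' '.join(x.dropna())

out = (
    df.groupby(record_id)
      .agg({'date': 'first', 'description1': join_text,
            'description2': join_text, 'value1': 'first', 'value2': 'first'})
      .reset_index(drop=True)
)
print(out)
```

With the forward-filled date as the key, these four rows would collapse into a single record; with record_id they stay as two.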

Answer 2

Forward-fill the date with ffill, group by it, and then join the descriptions in agg, dropping missing values first because str.join cannot handle NaN:

df['date'] = df['date'].ffill()

df_new = df.groupby('date').agg({'description1': lambda x: ' '.join(x.dropna())})

Update: to match your output format, you may need to adjust the index slightly, like this:

df_new = df.groupby('date', as_index=False).agg({'description1': lambda x: ' '.join(x.dropna())}).reset_index(drop=True)
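
A small check of why missing values matter when joining descriptions: str.join accepts only strings, so a float NaN left in the group raises a TypeError. The sketch below uses an object-dtype Series as a stand-in for one group of description1; the names s, raised and joined are illustrative.

```python
import numpy as np
import pandas as pd

# One group's descriptions, including a missing cell (object dtype assumed).
s = pd.Series(['text', 'here', np.nan], dtype=object)

try:
    ' '.join(s.values)      # NaN is a float, so join rejects it
    raised = False
except TypeError:
    raised = True

# Dropping the missing entries first makes the join safe.
joined = ' '.join(s.dropna())
print(raised, joined)
```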
