I have a pandas DataFrame built from text extracted from a PDF file. It looks like this:
index date description1 description2 value1 value2
0 18-01-2019 some more 1 2
1 NaN text text NaN NaN
2 NaN here NaN NaN NaN
3 19-01-2019 some some 3 4
4 NaN text more NaN NaN
5 NaN here text NaN NaN
6 NaN NaN here NaN NaN
.
.
.
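Each record always has at least one row without NaN, and that row always contains the date and the values. Only the descriptions span multiple rows.
Is there a way to, for example, join each dated row with the rows below it until the values are no longer NaN, and then concatenate the descriptions?
Expected output: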
index date description1 description2 value1 value2
0 18-01-2019 some text here more text 1 2
1 19-01-2019 some text here some more text here 3 4
.
.
.
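One idea is to build a grouping key by forward filling date (or whichever column distinguishes the groups), then take the first value for numeric columns and otherwise join the strings after dropping missing values:
import numpy as np
f = lambda x: x.iloc[0] if np.issubdtype(x.dtype, np.number) else ' '.join(x.dropna())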
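To make the idea concrete, here is a minimal sketch that applies the same rule to the sample rows from the question. The frame is rebuilt by hand, the helper name `combine` is hypothetical, and `pd.api.types.is_numeric_dtype` is swapped in for the `np.issubdtype` check as an assumption that it behaves the same for these columns.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; each record spans several rows.
df = pd.DataFrame({
    "date": ["18-01-2019", np.nan, np.nan, "19-01-2019", np.nan, np.nan, np.nan],
    "description1": ["some", "text", "here", "some", "text", "here", np.nan],
    "description2": ["more", "text", np.nan, "some", "more", "text", "here"],
    "value1": [1, np.nan, np.nan, 3, np.nan, np.nan, np.nan],
    "value2": [2, np.nan, np.nan, 4, np.nan, np.nan, np.nan],
})

def combine(col):
    # Hypothetical helper: first value for numeric columns,
    # space-joined non-missing text for everything else.
    if pd.api.types.is_numeric_dtype(col):
        return col.iloc[0]
    return " ".join(col.dropna())

# Forward fill the date so every continuation row carries its record's date.
key = df["date"].ffill()
result = df.groupby(key).agg(combine).reset_index(drop=True)
print(result)
```

Because the key is a separate Series, the original date column is aggregated like any other text column and ends up holding the single date of each record.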
Or specify each column in a dictionary:
f1 = lambda x: ' '.join(x.dropna())
f = {'date':'first', 'description1':f1, 'description2':f1, 'value1':'first', 'value2':'first'}
The dictionary can also be built dynamically from two parts that are then merged:
f1 = lambda x: ' '.join(x.dropna())
c =['description1','description2']
d1 = dict.fromkeys(c, f1)
d2 = dict.fromkeys(df.columns.difference(c), 'first')
f = {**d1, **d2}
df = df.groupby(df['date'].ffill()).agg(f).reset_index(drop=True).reindex(columns=df.columns)
#alternative
#df = df.groupby(df['date'].ffill(), as_index=False).agg(f)
print (df)
date description1 description2 value1 value2
0 18-01-2019 some text here more text 1.0 2.0
1 19-01-2019 some text here some more text here 3.0 4.0
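For anyone who wants to run the dictionary approach end to end, the following is a sketch under the assumption that the frame matches the sample in the question. The names `text_cols`, `join_text`, `agg_map` and `merged` are illustrative, and the final `reindex` is only there to put the columns back in their original order.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "date": ["18-01-2019", np.nan, np.nan, "19-01-2019", np.nan, np.nan, np.nan],
    "description1": ["some", "text", "here", "some", "text", "here", np.nan],
    "description2": ["more", "text", np.nan, "some", "more", "text", "here"],
    "value1": [1, np.nan, np.nan, 3, np.nan, np.nan, np.nan],
    "value2": [2, np.nan, np.nan, 4, np.nan, np.nan, np.nan],
})

text_cols = ["description1", "description2"]   # columns to concatenate
join_text = lambda s: " ".join(s.dropna())     # skip NaN, join the rest

# Text columns get the join; every other column keeps its first non-null value.
agg_map = {
    **dict.fromkeys(text_cols, join_text),
    **dict.fromkeys(df.columns.difference(text_cols), "first"),
}

merged = (
    df.groupby(df["date"].ffill())
      .agg(agg_map)
      .reset_index(drop=True)
      .reindex(columns=df.columns)   # restore the original column order
)
print(merged)
```

Adding another multi-line text column then only means extending `text_cols`; the rest of the mapping is derived from the frame.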
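Forward fill date with ffill, then group by it and join the descriptions in agg, dropping NaN first so that join only sees strings:
# same effect as fillna(method='ffill'), which newer pandas versions deprecate
df['date'] = df['date'].ffill()
df_new = df.groupby('date').agg({'description1': lambda x: ' '.join(x.dropna())})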
Update: to match your output format, you may need to adjust the index slightly, like this:
df_new = df.groupby('date', as_index=False).agg({'description1': lambda x: ' '.join(x.dropna())}).reset_index(drop=True)
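One caveat with joining raw values: `str.join` raises a `TypeError` when it meets a float NaN, and the sample data does contain NaN in the description columns. The sketch below illustrates this on a small hand-made frame (an assumption, not the asker's real data) and shows the grouped join working once missing values are dropped.

```python
import numpy as np
import pandas as pd

# A group that contains a missing description, as in the question's last row.
group = pd.Series(["some", "text", np.nan], dtype=object)
try:
    " ".join(group.values)
    naive_failed = False
except TypeError:
    naive_failed = True   # NaN is a float, and join only accepts str

# Small illustrative frame with the same shape of problem.
df = pd.DataFrame({
    "date": ["18-01-2019", np.nan, "19-01-2019", np.nan, np.nan],
    "description1": ["some", "text", "some", "text", np.nan],
})
df["date"] = df["date"].ffill()

# Dropping NaN before joining keeps the aggregation from failing.
df_new = df.groupby("date", as_index=False).agg(
    {"description1": lambda x: " ".join(x.dropna())}
)
print(df_new)
```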