I want to create a duplicate of a row whenever the row satisfies a condition. In the table below I build a cumulative count per group with groupby, then take the group-wise MAX of that count.
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
Date Completed PathID MaxPathID
1/31/17 1 3
1/31/17 2 3
1/31/17 3 3
2/1/17 1 1
2/2/17 1 2
2/2/17 2 2
In this case I only want to duplicate the record for 2/1/17, since that date has just one instance (i.e. MaxPathID == 1).
Desired output:
Date Completed PathID MaxPathID
1/31/17 1 3
1/31/17 2 3
1/31/17 3 3
2/1/17 1 1
2/1/17 1 1
2/2/17 1 2
2/2/17 2 2
Thanks in advance!
I think you need to get the unique rows by Date Completed and then concat those rows back to the original:
df1 = df.loc[~df['Date Completed'].duplicated(keep=False), ['Date Completed']]
print (df1)
Date Completed
3 2/1/17
df = pd.concat([df,df1], ignore_index=True).sort_values('Date Completed')
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
print (df)
Date Completed PathID MaxPathID
0 1/31/17 1 3
1 1/31/17 2 3
2 1/31/17 3 3
3 2/1/17 1 2
6 2/1/17 2 2
4 2/2/17 1 2
5 2/2/17 2 2
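For reference, here is the same approach as a self-contained, runnable snippet. The sample data is taken from the question's table; `duplicated(keep=False)` flags every member of a duplicate group, so negating it keeps only the singleton dates:

```python
import pandas as pd

# Sample data from the question (dates kept as plain strings).
df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17', '2/1/17', '2/2/17', '2/2/17'],
    'PathID': [1, 2, 3, 1, 1, 2],
    'MaxPathID': [3, 3, 3, 1, 2, 2],
})

# Rows whose date occurs exactly once.
df1 = df[~df['Date Completed'].duplicated(keep=False)]

# Append the singletons once more and recompute both counters.
df = pd.concat([df, df1], ignore_index=True).sort_values('Date Completed')
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
```

After this, the 2/1/17 group has two rows and its MaxPathID is 2, matching the desired output.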
EDIT:
print (df)
Date Completed a b
0 1/31/17 4 5
1 1/31/17 3 5
2 1/31/17 6 3
3 2/1/17 7 9
4 2/2/17 2 0
5 2/2/17 6 7
df1 = df[~df['Date Completed'].duplicated(keep=False)]
#alternative - boolean indexing by numpy array
#df1 = df[~df['Date Completed'].duplicated(keep=False).values]
print (df1)
Date Completed a b
3 2/1/17 7 9
df = pd.concat([df,df1], ignore_index=True).sort_values('Date Completed')
print (df)
Date Completed a b
0 1/31/17 4 5
1 1/31/17 3 5
2 1/31/17 6 3
3 2/1/17 7 9
6 2/1/17 7 9
4 2/2/17 2 0
5 2/2/17 6 7
A creative repeat approach using numpy + duplicated:
import numpy as np

dc = df['Date Completed']
# Repeat each position twice when the date is a singleton, once otherwise.
rg = np.arange(len(dc)).repeat((~dc.duplicated(keep=False).values) + 1)
df.iloc[rg]
Date Completed PathID MaxPathID
0 1/31/17 1 3
1 1/31/17 2 3
2 1/31/17 3 3
3 2/1/17 1 1
3 2/1/17 1 1
4 2/2/17 1 2
5 2/2/17 2 2
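The same repeat trick can also be written with pandas' own `Index.repeat`, avoiding the explicit `np.arange`. A sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17', '2/1/17', '2/2/17', '2/2/17'],
    'PathID': [1, 2, 3, 1, 1, 2],
    'MaxPathID': [3, 3, 3, 1, 2, 2],
})

# Repeat count per row: 2 for singleton dates, 1 otherwise.
reps = (~df['Date Completed'].duplicated(keep=False)).astype(int) + 1
out = df.loc[df.index.repeat(reps)]
```

This produces the same seven rows, with the singleton 2/1/17 row appearing twice under its original index.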
I know this may be a slightly different problem, but it does match the question's description, so people will land here from Google. I haven't thought about optimizing it or anything like that, and I'm sure there are better ways, but sometimes you just have to accept the flaws ;) So I'm posting it here in case someone runs into a similar situation and wants something quick that gets the job done. It seems to run reasonably fast.
Suppose we have a dataframe (df) like this:
We want to transform it according to the given condition, namely that field3 holds multiple entries, and we want to expand all of those entries, like so:
Here is one way to do it:
import pandas as pd
import numpy as np
from datetime import date, datetime

index = []
double_values = []
# Get the index and, for each indexed row, the list of values to expand on.
for i, r in df.iterrows():
    index.append(i)
    # Turn the multi-entry column into a list, splitting on the delimiter.
    double_values.append(str(r[2]).split(' '))

serieses = []
print('tot row to process', len(index))
count = 0
for i, dvs in zip(index, double_values):
    count += 1
    if count % 1000 == 0:
        print('elem left', len(index) - count, datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
    if len(dvs) > 1:
        for dv in dvs:
            series = df.iloc[i]
            series.loc['field3'] = dv
            serieses.append(list(series))

# Create a dataframe out of the expanded rows now collected in serieses (a list of lists).
df2 = pd.DataFrame.from_records(serieses, columns=df.columns)

# Drop the original rows with multiple entries, which have been expanded and appended already.
indexes_to_drop = []
for i, dvs in zip(index, double_values):
    if len(dvs) > 1:
        indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True)
len(df)
df = pd.concat([df, df2])  # df.append(df2) was deprecated and removed in pandas 2.0
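For what it's worth, the row-expansion loop above can usually be replaced by `Series.str.split` plus `DataFrame.explode` (available since pandas 0.25), which avoids `iterrows` entirely. A minimal sketch with hypothetical column names mirroring the description:

```python
import pandas as pd

# Hypothetical frame: field3 may hold several space-delimited entries per row.
df = pd.DataFrame({
    'field1': ['a', 'b'],
    'field2': [1, 2],
    'field3': ['x', 'y z'],
})

# Split the multi-entry column into lists, then emit one row per list element.
df['field3'] = df['field3'].astype(str).str.split(' ')
df = df.explode('field3').reset_index(drop=True)
```

Single-entry rows pass through unchanged; multi-entry rows are expanded, one copy per entry.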
Here is a simple approach that works in any situation.
condition = df["MaxPathID"] == 1
df = pd.concat([df, df[condition].copy()], ignore_index=True)
print(df)
Date Completed PathID MaxPathID
0 1/31/17 1 3
1 1/31/17 2 3
2 1/31/17 3 3
3 2/1/17 1 1
4 2/2/17 1 2
5 2/2/17 2 2
6 2/1/17 1 1
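As the output shows, the duplicated row lands at the end of the frame. If the duplicate should sit next to its original, a stable sort on the date column regroups it (`ignore_index` in `sort_values` needs pandas >= 1.0). A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17', '2/1/17', '2/2/17', '2/2/17'],
    'PathID': [1, 2, 3, 1, 1, 2],
    'MaxPathID': [3, 3, 3, 1, 2, 2],
})

condition = df['MaxPathID'] == 1
df = pd.concat([df, df[condition].copy()], ignore_index=True)

# mergesort is stable, so the appended duplicate slots in right after its original.
df = df.sort_values('Date Completed', kind='mergesort', ignore_index=True)
```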