Split date-range rows into years (ungroup) - Python Pandas

Question (votes: 1, answers: 3)

I have a data frame like this:

    Start date  end date        A    B
    01.01.2020  30.06.2020      2    3
    01.01.2020  31.12.2020      3    1
    01.04.2020  30.04.2020      6    2
    01.01.2021  31.12.2021      2    3
    01.07.2020  31.12.2020      8    2
    01.01.2020  31.12.2023      1    2
    .......

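I want to split the rows where end - start > 1 year (see the last row, with end = 2023 and start = 2020), keeping the value in column A the same while splitting the value in column B proportionally: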

    Start date  end date        A    B
    01.01.2020  30.06.2020      2    3
    01.01.2020  31.12.2020      3    1
    01.04.2020  30.04.2020      6    2
    01.01.2021  31.12.2021      2    3
    01.07.2020  31.12.2020      8    2
    01.01.2020  31.12.2020      1    2/4
    01.01.2021  31.12.2021      1    2/4
    01.01.2022  31.12.2022      1    2/4
    01.01.2023  31.12.2023      1    2/4
    .......

Any ideas?

python pandas date dataframe
3 answers
0 votes
A different approach: it adds new columns instead of new rows, but I think it does what you want.

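    import datetime as dt
    import pandas as pd

    # whole years (365-day blocks) between start and end
    df["years_apart"] = (
        (df["end_date"] - df["start_date"]).dt.days / 365
    ).astype(int)

    # one "<n>_end_date" column per year count, including the largest one
    for years in range(1, int(df["years_apart"].max()) + 1):
        df[f"{years}_end_date"] = pd.NaT
        df.loc[
            df["years_apart"] == years, f"{years}_end_date"
        ] = df.loc[
            df["years_apart"] == years, "start_date"
        ] + dt.timedelta(days=365*years)

    # share of B per year (rows with years_apart == 0 divide by zero)
    df["B_bis"] = df["B"] / df["years_apart"]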

Output

    start_date  end_date    years_apart  1_end_date  2_end_date  ...
    2018-01-01  2018-01-02  0            NaT         NaT
    2018-01-02  2019-01-02  1            2019-01-02  NaT
    2018-01-03  2020-01-03  2            NaT         2020-01-03


0 votes
I solved it by computing the date difference and a counter that adds years to the replicated rows:

    from datetime import timedelta
    import numpy as np

    # calculate difference between start and end year
    table['diff'] = (table['end'] - table['start'])//timedelta(days=365)
    table['diff'] = table['diff']+1

    # replicate rows depending on number of years
    table = table.reindex(table.index.repeat(table['diff']))

    # counter that increases for diff>1, assigns increasing years to the replicated rows
    table['count'] = table['diff'].groupby(table['diff']).cumsum()//table['diff']
    table['start'] = np.where(table['diff']>1, table['start']+table['count']-1, table['start'])
    table['end'] = table['start']

    # split B among years
    table['B'] = table['B']//table['diff']
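For anyone who wants to try the replicate-and-count idea end to end, here is a minimal self-contained sketch. It is not the answer's exact code: it assumes `start` and `end` are datetime columns, counts calendar years rather than 365-day blocks, and uses a per-row `cumcount()` as the counter. The function name `split_rows_by_year`, the helper column `row`, and the `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def split_rows_by_year(table):
    """Replicate each row once per calendar year it touches; share B equally."""
    # Number of calendar years each row touches (1 if start and end share a year).
    n_years = (table["end"].dt.year - table["start"].dt.year + 1).to_numpy()
    # Repeat every row n_years times; keep the original label in a "row" column.
    out = table.loc[table.index.repeat(n_years)].rename_axis("row").reset_index()
    # Counter 0, 1, 2, ... within each original row.
    k = out.groupby("row").cumcount()
    year = (out["start"].dt.year + k).astype(str)
    jan1 = pd.to_datetime(year + "-01-01")
    dec31 = pd.to_datetime(year + "-12-31")
    # Clip each piece to its own calendar year.
    start, end = out["start"], out["end"]
    out["start"] = start.where(start > jan1, jan1)
    out["end"] = end.where(end < dec31, dec31)
    # Equal share of B for every piece of the same original row.
    out["B"] = out["B"] / np.repeat(n_years, n_years)
    return out.drop(columns="row")

# Hypothetical demo data: one single-year row and one four-year row.
demo = pd.DataFrame({
    "start": pd.to_datetime(["2020-07-01", "2020-01-01"]),
    "end": pd.to_datetime(["2020-12-31", "2023-12-31"]),
    "A": [8, 1],
    "B": [2, 2],
})
result = split_rows_by_year(demo)
print(result)
```

The single-year row passes through unchanged, while the 2020-2023 row becomes four rows with A repeated and B divided by 4, which matches the layout asked for in the question.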


0 votes
Here is my solution. See the comments in the code:

    import io
    import pandas as pd

    # TEST DATA:
    text = """
    start end A B
    01.01.2020 30.06.2020 2 3
    01.01.2020 31.12.2020 3 1
    01.04.2020 30.04.2020 6 2
    01.01.2021 31.12.2021 2 3
    01.07.2020 31.12.2020 8 2
    31.12.2020 20.01.2021 12 12
    31.12.2020 01.01.2021 22 22
    30.12.2020 01.01.2021 32 32
    10.05.2020 28.09.2023 44 44
    27.11.2020 31.12.2023 88 88
    31.12.2020 31.12.2023 100 100
    01.01.2020 31.12.2021 200 200
    """

    # Dates are written day-first (dd.mm.yyyy); without dayfirst=True,
    # ambiguous ones such as 01.04.2020 can be parsed month-first.
    df = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])
    #print("\n----\n df:",df)

    #----------------------------------------
    # SOLUTION:

    def split_years(r):
        """
            Split row 'r' where "end"-"start" greater than 0.
            The new rows have repeated values of 'A', and 'B' divided by the number of years.
            Return: a DataFrame with rows per year.
        """
        t1,t2 = r["start"], r["end"]
        ys = t2.year - t1.year
        kk = 0 if t1.is_year_end else 1
        if ys>0:
            l1 = [t1] + [ t1+pd.offsets.YearBegin(i) for i in range(1,ys+1) ]
            l2 = [ t1+pd.offsets.YearEnd(i) for i in range(kk,ys+kk) ] + [t2]
            return pd.DataFrame({"start":l1, "end":l2, "A":r.A, "B": r.B/len(l1)})
        print("year difference <= 0!")
        return None

    # Create two groups, one for rows where 'start' and 'end' are in the same year, and one for the others:
    grps = df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups
    print("\n---- grps:\n",grps)

    # Extract the "one year" rows into a data frame:
    df1 = df.loc[grps[False]]
    #print("\n---- df1:\n",df1)

    # Extract the rows to be split:
    df2 = df.loc[grps[True]]
    print("\n---- df2:\n",df2)

    # Split the rows and put the resulting data frames into a list:
    ldfs = [ split_years(df2.loc[row]) for row in df2.index ]
    print("\n---- ldfs:")
    for fr in ldfs:
        print(fr,"\n")

    # Insert the "one year" data frame into the list, and concatenate them:
    ldfs.insert(0,df1)
    df_rslt = pd.concat(ldfs,sort=False)
    #print("\n---- df_rslt:\n",df_rslt)

    # Housekeeping:
    df_rslt = df_rslt.sort_values("start").reset_index(drop=True)
    print("\n---- df_rslt:\n",df_rslt)

Output:

    ---- grps:
     {False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}

    ---- df2:
             start        end    A    B
    5   2020-12-31 2021-01-20   12   12
    6   2020-12-31 2021-01-01   22   22
    7   2020-12-30 2021-01-01   32   32
    8   2020-10-05 2023-09-28   44   44
    9   2020-11-27 2023-12-31   88   88
    10  2020-12-31 2023-12-31  100  100
    11  2020-01-01 2021-12-31  200  200

    ---- ldfs:
           start        end   A    B
    0 2020-12-31 2020-12-31  12  6.0
    1 2021-01-01 2021-01-20  12  6.0

           start        end   A     B
    0 2020-12-31 2020-12-31  22  11.0
    1 2021-01-01 2021-01-01  22  11.0

           start        end   A     B
    0 2020-12-30 2020-12-31  32  16.0
    1 2021-01-01 2021-01-01  32  16.0

           start        end   A     B
    0 2020-10-05 2020-12-31  44  11.0
    1 2021-01-01 2021-12-31  44  11.0
    2 2022-01-01 2022-12-31  44  11.0
    3 2023-01-01 2023-09-28  44  11.0

           start        end   A     B
    0 2020-11-27 2020-12-31  88  22.0
    1 2021-01-01 2021-12-31  88  22.0
    2 2022-01-01 2022-12-31  88  22.0
    3 2023-01-01 2023-12-31  88  22.0

           start        end    A     B
    0 2020-12-31 2020-12-31  100  25.0
    1 2021-01-01 2021-12-31  100  25.0
    2 2022-01-01 2022-12-31  100  25.0
    3 2023-01-01 2023-12-31  100  25.0

           start        end    A      B
    0 2020-01-01 2020-12-31  200  100.0
    1 2021-01-01 2021-12-31  200  100.0

    ---- df_rslt:
             start        end    A      B
    0   2020-01-01 2020-06-30    2    3.0
    1   2020-01-01 2020-12-31    3    1.0
    2   2020-01-01 2020-12-31  200  100.0
    3   2020-01-04 2020-04-30    6    2.0
    4   2020-01-07 2020-12-31    8    2.0
    5   2020-10-05 2020-12-31   44   11.0
    6   2020-11-27 2020-12-31   88   22.0
    7   2020-12-30 2020-12-31   32   16.0
    8   2020-12-31 2020-12-31   12    6.0
    9   2020-12-31 2020-12-31  100   25.0
    10  2020-12-31 2020-12-31   22   11.0
    11  2021-01-01 2021-12-31  100   25.0
    12  2021-01-01 2021-12-31   88   22.0
    13  2021-01-01 2021-12-31   44   11.0
    14  2021-01-01 2021-01-01   32   16.0
    15  2021-01-01 2021-01-01   22   11.0
    16  2021-01-01 2021-01-20   12    6.0
    17  2021-01-01 2021-12-31    2    3.0
    18  2021-01-01 2021-12-31  200  100.0
    19  2022-01-01 2022-12-31   88   22.0
    20  2022-01-01 2022-12-31  100   25.0
    21  2022-01-01 2022-12-31   44   11.0
    22  2023-01-01 2023-09-28   44   11.0
    23  2023-01-01 2023-12-31   88   22.0
    24  2023-01-01 2023-12-31  100   25.0
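Note that the split above gives every piece the same share of B, even when a piece covers only a few days of its year. If "proportionally" is read as "by number of days" (an assumption about what the question intends, not something the thread confirms), a per-row variant could look roughly like this. The function name `split_row_by_days` and the sample values are hypothetical.

```python
import pandas as pd

def split_row_by_days(start, end, b):
    """Split [start, end] at year boundaries; share b by days in each piece."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    total_days = (end - start).days + 1  # both endpoints count
    rows = []
    for year in range(start.year, end.year + 1):
        # Clip the piece to the current calendar year.
        s = max(start, pd.Timestamp(year=year, month=1, day=1))
        e = min(end, pd.Timestamp(year=year, month=12, day=31))
        days = (e - s).days + 1
        rows.append({"start": s, "end": e, "B": b * days / total_days})
    return pd.DataFrame(rows)

# 35 days fall in 2020 and 20 days in 2021, so B=55 is shared 35 / 20.
pieces = split_row_by_days("2020-11-27", "2021-01-20", 55)
print(pieces)
```

The shares always add up to the original B, so nothing is lost by the split; for rows that cover whole years only, the result differs from an equal split only through leap years.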
