Aggregate a dataframe using simpler (vectorized?) operations instead of loops

Question (votes: 0, answers: 1)

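I have code that works correctly (it gives the expected answer) but is inefficient and unnecessarily complex. It uses loops that I would like to simplify and make more efficient, possibly with vectorized operations. It also converts a dataframe to a series and then back into a dataframe, which is another block of code that needs work. In other words, I want to make this code more Pythonic.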

I marked the problem spots in the code (below) with comments that start with `# TODO:`.

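The goal of the code is to summarize and aggregate the input dataframe `df`, which holds the distributions of DNA fragment lengths for two types of regions: `all` and `captured`. This is a bioinformatics problem, part of a larger project that ranks enzymes by their ability to cut certain DNA regions into fragments of a specified length. For the purposes of this question, the only relevant information is that `length` is an integer and that there are two types of DNA `regions`: `all` and `captured`. The aim is to produce the dataframe `df_pur` with `purity` vs. `length_cutoff` (the length cutoff used when purifying the DNA). The steps are: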

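  • Compute the fraction of the total length for each type of `regions` that is above each of the `length_cutoffs`.
  • Find the ratio of the `captured / all` fractions for each of the `length_cutoffs`, and store the results in a dataframe.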
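
To make the two steps concrete, here is a minimal sketch for a single cutoff (100). It assumes the six toy rows from the example data further down and uses `>=` for the comparison, as the code in this question does. The names `sum_above` and `purity` are illustrative only.

```python
import pandas as pd

# Toy rows, assumed to match the minimal example in this question.
df = pd.DataFrame({
    "length": [1, 49, 200, 20, 480, 2000],
    "regions": ["all", "all", "all", "captured", "captured", "captured"],
})
length_cutoff = 100  # a single cutoff, for illustration only

# Step 1: fraction of each region type's total length at or above the cutoff.
tot_length = df.groupby("regions")["length"].sum()
sum_above = (
    df["length"]
    .where(df["length"] >= length_cutoff, 0)  # lengths below the cutoff count as 0
    .groupby(df["regions"])
    .sum()
)
frc_tot_length = sum_above / tot_length

# Step 2: ratio of the `captured` fraction to the `all` fraction.
purity = frc_tot_length["captured"] / frc_tot_length["all"]

print(frc_tot_length)
print(purity)
```

For this cutoff the fractions are 200/250 for `all` and 2480/2500 for `captured`, so the ratio comes out at about 1.24.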
import io
import pandas as pd

# This is a minimal reproducible example. The real dataset has 2
# columns and 10s of millions of rows. Column 1 is integer, column 2
# has 2 values: 'all' and 'captured':
TESTDATA="""
1   all
49  all
200 all
20  captured
480 captured
2000    captured
"""

df = pd.read_csv(io.StringIO(TESTDATA),
                 sep=r'\s+', header=None, names='length regions'.split())

# This is a minimal reproducible example. The real list has ~10
# integer values (cutoffs):
length_cutoffs = [10, 100, 1000]

df_tot_length = pd.DataFrame(columns=['tot_length'])
df_tot_length['tot_length'] = df.groupby(['regions']).length.sum()
df_tot_length.reset_index(inplace=True)

print(df_tot_length)

#     regions  tot_length
# 0       all         250
# 1  captured        2500


df_frc_tot = pd.DataFrame(columns=['regions', 'length_cutoff', 'sum_lengths'])
regions = df['regions'].unique()
df_index = pd.DataFrame({'regions': regions}).set_index('regions')

# TODO: simplify this loop (vectorize?):
for length_cutoff in length_cutoffs:
    df_cur = (pd.DataFrame({'length_cutoff': length_cutoff,
                            'sum_lengths': df[df['length'] >= length_cutoff]
                            .groupby(['regions']).length.sum()},
                           # Prevent dropping rows where no elements
                           # are selected by the above
                           # condition. Re-insert the dropped rows,
                           # use for those sum_lengths = NaN
                           index=df_index.index)
              # Correct the above sum_lengths = NaN to 0:
              .fillna(0)).reset_index()
    # Undo the effect of `fillna(0)` above, which casts the
    # integer column as float:
    df_cur['sum_lengths'] = df_cur['sum_lengths'].astype('int')
    # TODO: simplify this loop (vectorize?):
    for region in regions:
        df_cur.loc[df_cur['regions'] == region, 'frc_tot_length'] = (
            df_cur.loc[df_cur['regions'] == region, 'sum_lengths'] /
            df_tot_length.loc[df_tot_length['regions'] == region, 'tot_length'])
    df_frc_tot = pd.concat([df_frc_tot, df_cur], axis=0)

df_frc_tot.reset_index(inplace=True, drop=True)

print(df_frc_tot)

#     regions length_cutoff sum_lengths  frc_tot_length
# 0       all            10         249           0.996
# 1  captured            10        2500           1.000
# 2       all           100         200           0.800
# 3  captured           100        2480           0.992
# 4       all          1000           0           0.000
# 5  captured          1000        2000           0.800

# TODO: simplify the next 2 statements:
ser_pur = (df_frc_tot.loc[df_frc_tot['regions'] == 'captured', 'frc_tot_length']
           .reset_index(drop=True) /
           df_frc_tot.loc[df_frc_tot['regions'] == 'all',      'frc_tot_length']
           .reset_index(drop=True))
df_pur = pd.DataFrame({'length_cutoff': length_cutoffs, 'purity': ser_pur})

print(df_pur)

#    length_cutoff    purity
# 0             10  1.004016
# 1            100  1.240000
# 2           1000       inf
python pandas group-by aggregate vectorization
1 answer (votes: 0)

IIUC, you can do this:

import numpy as np

# Assumes `df` from the question has already been created.
length_cutoffs = [10, 100, 1000]

df["bins"] = pd.cut(
    df["length"],
    pd.IntervalIndex.from_breaks([-np.inf] + length_cutoffs + [np.inf], closed="left"),
)

out = df.pivot_table(
    index=["regions", "bins"], values="length", aggfunc="sum",
    observed=False,  # keep empty bins (sum of 0) so every cutoff has a row
)

g = out.groupby(level=0)
out["frc_tot_length"] = (
    g["length"].transform(lambda x: [x.iloc[i:].sum() for i in range(len(x))])
) / g["length"].sum()
print(out)
print()

Prints:

                          length  frc_tot_length
regions  bins                                   
all      [-inf, 10.0)          1           1.000
         [10.0, 100.0)        49           0.996
         [100.0, 1000.0)     200           0.800
         [1000.0, inf)         0           0.000
captured [-inf, 10.0)          0           1.000
         [10.0, 100.0)        20           1.000
         [100.0, 1000.0)     480           0.992
         [1000.0, inf)      2000           0.800

Then:

x = out.unstack(level=0)
x = x[("frc_tot_length", "captured")] / x[("frc_tot_length", "all")]
print(x)

Prints:

bins
[-inf, 10.0)       1.000000
[10.0, 100.0)      1.004016
[100.0, 1000.0)    1.240000
[1000.0, inf)           inf
dtype: float64
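
As a possible alternative to the list comprehension inside `transform`, the "sum of everything at or above each cutoff" can be sketched as a reverse cumulative sum over per-bin totals. This is only a sketch under stated assumptions: `length_cutoffs` is sorted in ascending order, both `all` and `captured` occur in the data, and the toy rows match the question's example. The names `k`, `per_bin`, and `sum_at_or_above` are illustrative and do not come from the code above.

```python
import numpy as np
import pandas as pd

# Toy rows, assumed to match the minimal example in the question.
df = pd.DataFrame({
    "length": [1, 49, 200, 20, 480, 2000],
    "regions": ["all", "all", "all", "captured", "captured", "captured"],
})
length_cutoffs = [10, 100, 1000]  # must be sorted ascending

cutoffs = np.asarray(length_cutoffs)
# k = number of cutoffs that are <= length (0 means below every cutoff).
k = np.searchsorted(cutoffs, df["length"].to_numpy(), side="right")

# Total length per (bin, region); missing combinations become 0.
per_bin = (
    df.assign(bin=k)
    .groupby(["bin", "regions"])["length"]
    .sum()
    .unstack("regions", fill_value=0)
    .reindex(range(len(cutoffs) + 1), fill_value=0)
)

# Reverse cumulative sum: row i holds the total length in bins i, i+1, ...
sum_at_or_above = per_bin.iloc[::-1].cumsum().iloc[::-1]

# Row 0 covers every bin, so it equals the total length per region.
frc_tot_length = sum_at_or_above / sum_at_or_above.iloc[0]

# Rows 1..n correspond to the cutoffs, in order.
purity = frc_tot_length["captured"] / frc_tot_length["all"]
df_pur = pd.DataFrame({
    "length_cutoff": cutoffs,
    "purity": purity.iloc[1:].to_numpy(),
})
print(df_pur)
```

On the toy data this should reproduce the three purity values from the question's `df_pur` (about 1.004, 1.24, and inf). The data is grouped once rather than once per cutoff, which may matter with tens of millions of rows.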