How can I optimize/replace these two nested while loops in pandas?

Question

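I'm new to Python and quite pleased to have got this far! I have a data frame (df1) of 1,700,000 records with 4 columns (e, n, o, p) plus the index, which is the result of a lot of preprocessing and joins.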

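Currently, this code takes 2 hours to run. That is evidently because it intersects the 1.3m possible values of n and the 700,000 possible values of e, reduced by the step of 1000 in each direction to roughly 1,000,000 cells on the one hand, against 1.7 million records on the other.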

I have no other preprocessing step that would give me, for example, a subset of e, of n, or of their intersection.

result = ""
cp = True
ep = 0
while ep < 700000: #700000
    np = 0
    while np < 1300000: #1300000
        df2 = df1[(df1["e"] >= ep) & (df1["e"] < ep + 1000) & (df1["n"] >= np) & (df1["n"] < np + 1000)]
        if not df2.dropna().empty:
            df3 = df2[df2.o== df2.o.min()]
            df4 = df3.drop(columns = ["e", "n", "o"])
            z = df4.to_string(header = cp)
            result = result + "\n" + z
            cp = False
        np += 1000
    np = 0
    ep += 1000

Sample data:

       p     e         n         o
15646  str0  134746.0  466842.0  421.283752
15643  str1  134229.0  466923.0  502.364410
15588  str2  134023.0  467007.0  685.986880
15645  str3  133142.0  467081.0  551.112511
15649  str4  132632.0  467511.0  132.457540
32508  str5  133995.0  607803.0  580.374017
32502  str6  133750.0  607900.0  471.699057
32509  str7  133462.0  607987.0  488.480296
32532  str8  134761.0  608314.0  320.494930
32526  str9  130148.0  608801.0  463.146845

@Błotosmętek's suggestion of using df_aux (but with while loops, see the discussion below) has already been a big improvement.

Answer 1

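The line of your code that builds df2 selects the subset of df1 whose e values lie between the current ep and ep + 1000, but it does so on every pass of the inner loop, where ep never changes. Moving that part of the filter to the outer loop should speed things up considerably. Another optimization is to select the 'p' column directly instead of dropping the other columns from df3. I also took the liberty of replacing the while loops with for loops to improve readability.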

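result = ''
for ep in range(0, 700000, 1000):
    # Filter on e once per outer iteration instead of once per (ep, np) pair.
    df_aux = df1[(df1["e"] >= ep) & (df1["e"] < ep + 1000)]
    # Note: np is the question's loop variable; it shadows numpy if imported as np.
    for np in range(0, 1300000, 1000):
        df2 = df_aux[(df_aux["n"] >= np) & (df_aux["n"] < np + 1000)]
        if not df2.dropna().empty:
            # Keep the row(s) with the smallest o in this 1000 x 1000 cell.
            df3 = df2[df2["o"] == df2["o"].min()]
            z = df3["p"].to_string()
            result += "\n" + z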

Please check that the result is identical to the one produced by your original code.
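
One way to run that check on a small scale, and to try a loop-free variant at the same time, is sketched below. This is not part of the original answer: the helper names (make_demo_frame, min_o_loops, min_o_groupby) and the synthetic data are made up for illustration. The groupby variant is a different technique from the df_aux approach: it labels each row with its 1000 x 1000 cell by floor division and compares o with the per-cell minimum. It assumes there are no missing values and that every e and n falls inside the ranges the loops cover. Whether it is actually faster on the real df1 has not been measured here.

```python
import numpy as np
import pandas as pd


def make_demo_frame(rows=200, e_max=5000, n_max=7000, seed=0):
    """Small synthetic stand-in for df1 (hypothetical data, no missing values)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "p": [f"str{i}" for i in range(rows)],
        "e": rng.integers(0, e_max, size=rows).astype(float),
        "n": rng.integers(0, n_max, size=rows).astype(float),
        "o": rng.random(rows) * 1000,
    })


def min_o_loops(df1, e_max, n_max, step=1000):
    """Loop approach with the e filter hoisted; collects the selected 'p' rows."""
    parts = []
    for e0 in range(0, e_max, step):
        df_aux = df1[(df1["e"] >= e0) & (df1["e"] < e0 + step)]
        # n0 rather than np, so the numpy alias is not shadowed.
        for n0 in range(0, n_max, step):
            df2 = df_aux[(df_aux["n"] >= n0) & (df_aux["n"] < n0 + step)]
            if not df2.dropna().empty:
                parts.append(df2.loc[df2["o"] == df2["o"].min(), "p"])
    return pd.concat(parts) if parts else pd.Series(dtype=object, name="p")


def min_o_groupby(df1, step=1000):
    """Assumed alternative: one groupby over cell labels instead of nested loops."""
    cell_e = (df1["e"] // step).rename("cell_e")
    cell_n = (df1["n"] // step).rename("cell_n")
    # Per-row copy of the smallest o found in that row's cell.
    cell_min = df1.groupby([cell_e, cell_n])["o"].transform("min")
    return df1.loc[df1["o"] == cell_min, "p"]


demo = make_demo_frame()
loop_rows = min_o_loops(demo, e_max=5000, n_max=7000)
fast_rows = min_o_groupby(demo)
# The two versions return rows in a different order, so compare after sorting.
print(loop_rows.sort_index().equals(fast_rows.sort_index()))
```

Collecting the pieces in a list and concatenating once also avoids growing a string inside the loop. If the text output is still needed, to_string() can be called once on the final Series, although the layout will then differ from the per-cell strings built in the question.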
