Optimizing nested-loop performance for large-dataset processing in Python

Question

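I am writing a data analysis script in Python 3.8 that has to process a large dataset of more than one million rows. The script uses nested loops to compare each row against many other rows based on a specific condition. Performance is noticeably slow, and I suspect the nested loops are the bottleneck.

Here is a simplified version of the problematic part of the code: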

import csv

file_path = 'data.csv'


data = []
with open(file_path, 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        data.append(row)
  
matching_pairs = []  # List to store the indices of matching row pairs

for i in range(len(data)):
    for j in range(i + 1, len(data)):
        if data[i][0] == data[j][0]:
            # Append the pair of indices to the matching_pairs list
            matching_pairs.append((i, j))


output_file = 'matching_pairs.txt'
with open(output_file, 'w') as file:
    for pair in matching_pairs:
        file.write(f'{pair}\n')

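The inner loop compares the current row with every subsequent row, which is essential to my analysis. However, I would like the processing to be faster, and I am looking for a way to optimize this part of the code to reduce execution time.

What strategies can I use to improve the performance of such intensive operations in Python? Are there built-in libraries or techniques that can help optimize this nested loop?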

Answer 1

To get the rows whose value in a given column is duplicated, you can use groupby and exclude the groups of length 1.

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 1, 2, 3, 3, 4],
                   'data': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
groups = df.groupby('val', sort=False)
results = []
for _, group in groups:
    # Keep only the groups with more than one row, i.e. duplicated values
    if len(group) > 1:
        results.extend(group.index)
print(results)

[0, 2, 1, 3, 4, 5]
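
If the actual pairs of matching row indices are needed, as in the question, the same grouping idea can be sketched without pandas. This is a minimal sketch, assuming the comparison key is the first column of each row; `find_matching_pairs` and the sample `rows` are hypothetical names made up for illustration. It groups the row indices by key in one pass and then builds pairs only inside each group, so rows with different keys are never compared.

```python
from collections import defaultdict
from itertools import combinations


def find_matching_pairs(rows, key_index=0):
    """Return (i, j) index pairs, i < j, whose rows share the same key column."""
    # One pass: map each key value to the indices of the rows that carry it.
    indices_by_key = defaultdict(list)
    for i, row in enumerate(rows):
        indices_by_key[row[key_index]].append(i)

    # Pairs are generated only inside each group of equal keys.
    pairs = []
    for indices in indices_by_key.values():
        if len(indices) > 1:
            pairs.extend(combinations(indices, 2))
    return pairs


# Illustrative data mirroring the DataFrame above (values as CSV strings).
rows = [['1', 'A'], ['2', 'B'], ['1', 'C'], ['2', 'D'],
        ['3', 'E'], ['3', 'F'], ['4', 'G']]
print(find_matching_pairs(rows))
# [(0, 2), (1, 3), (4, 5)]
```

The grouping step is roughly linear in the number of rows. The number of pairs can still grow quadratically when many rows share one key, because every pair has to be emitted.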

Answer 2

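You may want to look into NumPy array operations, which let you apply a function to every item of an array efficiently. This page has information specifically about doing so with strings: https://www.geeksforgeeks.org/numpy-string-operations/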

# Python program explaining
# numpy.char.equal() function

import numpy as np

# comparing a string elementwise
# using equal() method
a = np.char.equal('geeks', 'for')

print(a)

Output:

False
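
Applied to a whole column rather than two single strings, the same idea might look like the sketch below. The `keys` array is made-up sample data standing in for the first CSV column; combining `np.unique` with its counts is one possible way to flag rows whose key repeats, not something the answer itself prescribes.

```python
import numpy as np

# Illustrative key column; in the question this would be the first CSV column.
keys = np.array(['a', 'b', 'a', 'c', 'b', 'd'])

# Elementwise comparison of the whole array against one value, no Python loop.
is_a = np.char.equal(keys, 'a')
print(is_a)

# Count how often each distinct key occurs, then flag rows whose key repeats.
unique_keys, inverse, counts = np.unique(
    keys, return_inverse=True, return_counts=True)
repeated_rows = np.flatnonzero(counts[inverse] > 1)
print(repeated_rows)  # indices of rows whose key occurs more than once
```

Here `counts[inverse]` gives, for every row, how many times that row's key appears in the column, so the comparison with 1 replaces the nested loop with a single array expression.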

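In addition, if you are working with large files, it may be more efficient to access them as NumPy memory maps. This lets you read from and write to a large file on disk as though it were in memory, so you can work on different parts of the file without holding all of it in memory.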

https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/

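import numpy as np
nrows, ncols = 1000000, 100

# mode='w+' creates the file; the default mode 'r+' requires it to exist already
f = np.memmap('memmapped.dat', dtype=np.float32, mode='w+',
              shape=(nrows, ncols))
# x is illustrative data (an assumption; it is not defined in the original snippet)
x = np.random.rand(nrows).astype(np.float32)
f[:, -1] = x  # write x into the last column
np.array_equal(f[:, -1], x)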

True

del f  ## Flush changes to disk
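
Reading the file back in chunks could then be sketched as follows. This is only an illustration under stated assumptions: the array is deliberately small so it runs quickly, and the file name `example.dat`, the chunk size, and the column sum are all arbitrary choices rather than anything from the answer.

```python
import os
import tempfile

import numpy as np

nrows, ncols = 1000, 4  # deliberately small so the sketch runs quickly
path = os.path.join(tempfile.mkdtemp(), 'example.dat')  # hypothetical file name

# Create the file and fill it, then flush and release the mapping.
writer = np.memmap(path, dtype=np.float32, mode='w+', shape=(nrows, ncols))
writer[:] = np.arange(nrows * ncols, dtype=np.float32).reshape(nrows, ncols)
writer.flush()
del writer

# Reopen read-only and process the rows one chunk at a time.
reader = np.memmap(path, dtype=np.float32, mode='r', shape=(nrows, ncols))
chunk_size = 256
total = 0.0
for start in range(0, nrows, chunk_size):
    # Slicing a memmap gives a view; data is read from disk as it is accessed.
    chunk = reader[start:start + chunk_size]
    total += float(chunk[:, -1].sum())
print(total)  # sum of the last column
del reader
```

The shape and dtype have to be passed again when reopening, because a raw memory-mapped file stores only the bytes and no metadata.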

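Finally, you could try unrolling the inner loop so that each iteration performs four operations in a row instead of one. Batching operations this way can sometimes speed things up considerably.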

# Process 4 at a time to keep the CPU supplied with work:

for i in range(0, 100, 4):
    do_something(i)
    do_something(i + 1)
    do_something(i + 2)
    do_something(i + 3)
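
When the number of items is not a multiple of 4, the unrolled loop needs a short clean-up loop for the leftovers. A minimal sketch, assuming a hypothetical helper `process_unrolled` and an arbitrary callback standing in for `do_something`, could look like this. Whether unrolling pays off in Python depends on the workload, so it is worth measuring, for example with `timeit`.

```python
def process_unrolled(items, do_something):
    """Call do_something on every item, four items per loop iteration."""
    n = len(items)
    limit = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        do_something(items[i])
        do_something(items[i + 1])
        do_something(items[i + 2])
        do_something(items[i + 3])
    # Handle the 0 to 3 leftover items one at a time.
    for i in range(limit, n):
        do_something(items[i])


seen = []
process_unrolled(list(range(10)), seen.append)
print(seen)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```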
