Extract data from files and compute their averages

Question · votes: 0 · answers: 1

I am trying to write a script that creates a file containing the per-line average of several different files. My real problem is that I don't know how to make the script efficient enough to process 300 .txt files of 30 MB (about 9,000,000 values) in a few seconds...

This is what I want (to clarify the actual goal):

Line 1: the average of line 1 of every file

Line 2: the average of line 2 of every file ...
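For example, with made-up numbers: if file1 contains 10 and 20 and file2 contains 30 and 40, the output file should contain 20 (the average of the first lines) followed by 30 (the average of the second lines).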

Here is my current script (it works, if you have time for a coffee while it runs):

import glob
import numpy as np

# Directory containing my files
directory = '/directory/*.txt'
files = glob.glob(directory)

# Buffer: one list of values per file
data = []

# Read the values from every file
for file in files:
    with open(file, 'r') as f:
        data.append([int(line.strip()) for line in f])

# Convert to a NumPy array and transpose: one row per line, one column per file
data_np = np.array(data, dtype=np.float64).T

# Average of each line across all files
average = np.mean(data_np, axis=1)

# Round off averages
round_average = np.round(average).astype(int)

# Write the averages to the new file
with open('my_new_average_file.txt', 'w') as f:
    for value in round_average:
        f.write(f"{value}\n")

print("done")
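For reference, a minimal sketch of a variant that accumulates a running per-line sum file by file instead of building one big list first. It assumes, as above, that every file has exactly the same number of lines and that there is at least one file; the directory and output name are the same placeholders as in the script above:

import glob
import numpy as np

directory = '/directory/*.txt'  # same placeholder path as above
files = glob.glob(directory)

# Running sum of each line's values across the files read so far
total = None
for file in files:
    # One integer per line -> a 1-D float array for the whole file
    values = np.loadtxt(file, dtype=np.float64)
    total = values if total is None else total + values

# Per-line average across all files, rounded to the nearest integer
round_average = np.round(total / len(files)).astype(int)

with open('my_new_average_file.txt', 'w') as f:
    for value in round_average:
        f.write(f"{value}\n")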

Edit: example of an input file:

value1
value2
value3
...

The files all have the same number of lines, and each line contains a single integer value (from 0 to 255).

Thank you very much for your help.

python dataframe numpy file
1 Answer
0 votes

Dask is well suited to this kind of thing, using Dask Bags... Here is example code using Dask Bags; it finishes the computation in 10-12 seconds on my machine (assuming the text files already exist).

One thing that is not clear is whether you want the average over all of the data or one average per file. Dask can handle the latter by treating each file as a partition and using the map_partitions function instead of just map to compute the average of each partition (see the sketch after the code below).
import os.path
import numpy as np
import dask.bag as db

num_files = 300
lines_in_file = 30_000
overwrite_files = False #set to true to overwrite existing text files
data_directory = os.path.join(os.getcwd(), 'directory')
print(data_directory)

# %%
if __name__ == "__main__":
    
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    #create 300 files with 30_000 lines each, each line contains a random number between 0 and 1000
    for i in range(num_files):
        outfile = os.path.join(data_directory, f'file_{i}.txt')
        if os.path.exists(outfile) and not overwrite_files:
            continue
        with open(outfile, 'w') as f:
            for j in range(lines_in_file):
                f.write(f"{np.random.randint(0, 1000)}\n")
                
    data = db.read_text(f"{data_directory}/file_*.txt")

    #remove whitespace and convert to int (for each line)
    data = data.map(lambda x: int(x.strip()))

    average = data.mean()  # lazily queue the mean computation
    computed_avg = average.compute() #compute result
    print(f"average: {computed_avg}")
    computed_avg = np.round(computed_avg).astype(int)

    #fill in the new file
    with open('my_new_average_file.txt', 'w') as f:
        #no need for for loop since there's a single average value
        # for average in computed_avg:
        #     f.write(f"{average}\n")
        f.write(f"{computed_avg}\n")

    print("done")
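And here is a minimal, untested sketch of the per-file variant mentioned above. It assumes the same directory of generated files as in the script, and that read_text keeps one file per partition, which is its default behaviour when no blocksize is given:

import os.path
import numpy as np
import dask.bag as db

data_directory = os.path.join(os.getcwd(), 'directory')  # same directory as above

# Re-read the raw lines; by default each text file becomes its own partition
raw = db.read_text(f"{data_directory}/file_*.txt")

# Reduce each partition (i.e. each file) to a single mean value
per_file = raw.map_partitions(
    lambda lines: [np.mean([int(line.strip()) for line in lines])]
)

per_file_averages = per_file.compute()  # one average per file
print(per_file_averages[:5])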