Fastest way to read a CSV into a list of tuples with a condition/filter and column type conversion? (Python)


I need to read a CSV into a list of tuples, keeping only the rows whose value in the third column is >= 0.75 and converting the columns to different types. Please note: I cannot use pandas, no pandas!!
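
For example, a row like

01/02/2020,5,0.8

should end up in the result as the tuple ('01/02/2020', 5, 0.8), while a row whose last value is below 0.75 should be dropped.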

I am trying to figure out the fastest way to do this.

This is how I currently do it (which I don't think is very efficient):

from csv import reader
from datetime import datetime
import timeit

def load_csv_to_list(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if float(row[2]) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float(row[2]))
            lst.append(row)
    return lst

start = timeit.default_timer()
load_csv_to_list(path)
end = timeit.default_timer()
print(end - start)

Output: 0.00013872199997422285

python csv memory-efficient timeit
1 Answer

The original code performs the same float(row[2]) conversion twice. In my tests, assigning the converted value to a variable and reusing it gives a slight performance improvement. Using the walrus operator := introduced in Python 3.8 improves things a little further. Processing the rows in batches or memory-mapping the data file gives the best performance.

def load_variable(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        float_two = float(row[2])
        if float_two >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst

def load_walrus(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if (float_two := float(row[2])) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst

Timings for loading a CSV file with 1,000,000 rows:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |

As a further experiment, I implemented a function that processes the data in batches.

def batch_walrus(path, batch_size=1000):
    lst = []
    with open(path) as csv_file:
        csv_reader = reader(csv_file)
        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list
        batch = []
        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                batch.append((date, int(row[1]), two))
            # Once the batch is full, flush it into the result list
            if len(batch) == batch_size:
                lst.extend(batch)
                batch = []
        # Flush any rows left in the final, partially filled batch
        if batch:
            lst.extend(batch)
    return lst

Updated timings:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
batch_walrus     | 5.69s   | 5.89s   | 5.79s   |

Python's mmap module provides memory-mapped file I/O. It uses lower-level operating-system facilities to read the file as if it were one large string/array. This version of the function reads the contents of the mmapped_file, decodes them to a string with decode("utf-8"), and then feeds that string to csv.reader.

from csv import reader
from datetime import datetime
import mmap

def load_mmap_walrus(path):
    lst = []
    with open(path, "r") as csv_file:
        # Memory-map the file, size 0 means the entire file
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
            # Decode the bytes-like object to a string
            content = mmapped_file.read().decode("utf-8")

        # Create a CSV reader from the decoded string
        csv_reader = reader(content.splitlines())

        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list

        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                lst.append((date, int(row[1]), two))

    return lst

Updated timings:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
batch_walrus     | 5.69s   | 5.89s   | 5.79s   |
load_mmap_walrus | 5.49s   | 5.68s   | 5.57s   |
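
These numbers are from my machine and will vary with hardware and disk caching. A rough harness along the following lines (a sketch using timeit.repeat; not necessarily the exact setup that produced the tables above) can be used to reproduce the comparison:

import timeit

# Hypothetical benchmark harness: runs each loader three times on the
# generated sample file and reports the fastest, slowest and average run.
# Assumes the functions above are defined and sample_data.csv exists.
functions = [load_csv_to_list, load_variable, load_walrus, batch_walrus, load_mmap_walrus]

for func in functions:
    times = timeit.repeat(lambda: func("sample_data.csv"), number=1, repeat=3)
    print(f"{func.__name__:<17}| fastest {min(times):.2f}s"
          f" | slowest {max(times):.2f}s | average {sum(times) / len(times):.2f}s")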

Code used to generate the 1,000,000-row CSV of sample data:

import csv
import random
from datetime import datetime, timedelta

# Function to generate a random date within a range
def random_date(start_date, end_date):
    delta = end_date - start_date
    random_days = random.randint(0, delta.days)
    return start_date + timedelta(days=random_days)

# Generate sample data
start_date = datetime(2000, 1, 1)
end_date = datetime(2023, 12, 31)

with open("sample_data.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Date", "Integer", "Float"])
    for _ in range(1_000_000):
        date = random_date(start_date, end_date).strftime("%d/%m/%Y")
        integer = random.randint(0, 100)
        float_num = round(random.uniform(0, 1), 2)
        writer.writerow([date, integer, float_num])
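
After generating sample_data.csv, any of the loaders above can be pointed at it directly, for example (the printed values will depend on the random data):

rows = load_mmap_walrus("sample_data.csv")

print(rows[0])        # header row: ['Date', 'Integer', 'Float']
print(rows[1])        # first kept row as a tuple, e.g. ('14/06/2011', 42, 0.93)
print(len(rows) - 1)  # number of rows that passed the >= 0.75 filter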