Fastest way to read a CSV into a list of tuples with a condition/filter and column type conversion? (Python)


I need to read a CSV into a list of tuples, keeping only the rows whose value in the third column is >= 0.75 and converting the columns to different types. Please note: I cannot use pandas, no pandas!!
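
For example, a row like

01/02/2020,5,0.8

should end up in the result as the tuple ('01/02/2020', 5, 0.8), while a row whose last value is below 0.75 should be dropped.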

I am trying to figure out the fastest way to do this.

This is how I currently do it (which I don't think is very efficient):

from csv import reader
from datetime import datetime
import timeit

def load_csv_to_list(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if float(row[2]) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float(row[2]))
            lst.append(row)
    return lst

start = timeit.default_timer()
load_csv_to_list(path)
end = timeit.default_timer()
print(end - start)

Output: 0.00013872199997422285

python csv memory-efficient timeit
1 Answer

The original code performs the same float(row[2]) conversion twice. In my tests, assigning the converted value to a variable and reusing it gives a slight performance improvement. Using the walrus operator := introduced in Python 3.8 improves things a little further. Processing the rows in batches or memory-mapping the data file gives the best performance.

def load_variable(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        float_two = float(row[2])
        if float_two >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst

def load_walrus(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if (float_two := float(row[2])) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst

Timings for loading a CSV file with 1,000,000 rows:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |

As a further experiment, I implemented a function that processes the data in batches.

def batch_walrus(path, batch_size=1000):
    lst = []
    with open(path) as csv_file:
        csv_reader = reader(csv_file)
        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list
        batch = []
        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                batch.append((date, int(row[1]), two))
            # Once the batch is full, flush it into the result list
            if len(batch) == batch_size:
                lst.extend(batch)
                batch = []
        # Flush any rows left in the final, partially filled batch
        if batch:
            lst.extend(batch)
    return lst

Updated timings:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
batch_walrus     | 5.69s   | 5.89s   | 5.79s   |

Python's mmap module provides memory-mapped file I/O. It uses lower-level operating-system facilities to read the file as if it were one large string/array. This version of the function reads the contents of the mmapped_file, decodes them to a string with decode("utf-8"), and then feeds that string to csv.reader.

from csv import reader
from datetime import datetime
import mmap

def load_mmap_walrus(path):
    lst = []
    with open(path, "r") as csv_file:
        # Memory-map the file, size 0 means the entire file
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
            # Decode the bytes-like object to a string
            content = mmapped_file.read().decode("utf-8")

        # Create a CSV reader from the decoded string
        csv_reader = reader(content.splitlines())

        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list

        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                lst.append((date, int(row[1]), two))

    return lst

Updated timings:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
batch_walrus     | 5.69s   | 5.89s   | 5.79s   |
load_mmap_walrus | 5.49s   | 5.68s   | 5.57s   |
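
These numbers are from my machine and will vary with hardware and disk caching. A rough harness along the following lines (a sketch using timeit.repeat; not necessarily the exact setup that produced the tables above) can be used to reproduce the comparison:

import timeit

# Hypothetical benchmark harness: runs each loader three times on the
# generated sample file and reports the fastest, slowest and average run.
# Assumes the functions above are defined and sample_data.csv exists.
functions = [load_csv_to_list, load_variable, load_walrus, batch_walrus, load_mmap_walrus]

for func in functions:
    times = timeit.repeat(lambda: func("sample_data.csv"), number=1, repeat=3)
    print(f"{func.__name__:<17}| fastest {min(times):.2f}s"
          f" | slowest {max(times):.2f}s | average {sum(times) / len(times):.2f}s")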

Code used to generate the 1,000,000-row CSV of sample data:

import csv
import random
from datetime import datetime, timedelta

# Function to generate a random date within a range
def random_date(start_date, end_date):
    delta = end_date - start_date
    random_days = random.randint(0, delta.days)
    return start_date + timedelta(days=random_days)

# Generate sample data
start_date = datetime(2000, 1, 1)
end_date = datetime(2023, 12, 31)

with open("sample_data.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Date", "Integer", "Float"])
    for _ in range(1_000_000):
        date = random_date(start_date, end_date).strftime("%d/%m/%Y")
        integer = random.randint(0, 100)
        float_num = round(random.uniform(0, 1), 2)
        writer.writerow([date, integer, float_num])
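
After generating sample_data.csv, any of the loaders above can be pointed at it directly, for example (the printed values will depend on the random data):

rows = load_mmap_walrus("sample_data.csv")

print(rows[0])        # header row: ['Date', 'Integer', 'Float']
print(rows[1])        # first kept row as a tuple, e.g. ('14/06/2011', 42, 0.93)
print(len(rows) - 1)  # number of rows that passed the >= 0.75 filter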