Matching a string across two CSV files when the second file is too large to read into a list


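The code below works fine for files of up to about 3 million records, but beyond that I run out of memory, because I read the data into lists and then loop over the lists to find the matching rows.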

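From earlier posts I gathered that I should process one row at a time in a loop, but I could not find any post showing how to pull one row at a time from a CSV file and then run it through two nested loops, as my code below does.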

Any help would be greatly appreciated. Thank you in advance.

import csv

# open two csv files and read into lists lsts and lstl
with open('small.csv') as s:
    sml = csv.reader(s)
    lsts = [tuple(row) for row in sml]

with open('large.csv') as l:
    lrg = csv.reader(l)
    lstl = [tuple(row) for row in lrg] # can be too large for memory

# find a match and print 
for rows in lsts:
    for rowl in lstl:

        if rowl[7] != rows[0]: # if no match continue
            continue
        else: 
            print(rowl[7], rowl[2]) # when matched print data required from large file
1 answer

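Assuming you are only interested in one column of the small CSV, you can turn that column into a set and compare it against the large CSV row by row. The set membership test completely replaces the outer loop, and the large file is never held in memory all at once.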

import csv

with open('small.csv') as s:
    sml = csv.reader(s)
    sml_set = set(row[0] for row in sml)

with open('large.csv') as l:
    lrg = csv.reader(l)
    for row in lrg:
        if row[7] in sml_set:
            print(row[7], row[2])
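
If the large file may contain blank or short lines, `row[7]` would raise an `IndexError`. Here is a minimal sketch of a guarded version, assuming the same column positions as in the question. The function name `print_matches`, its parameters, and the sample data are hypothetical and only there to make the example runnable without the real files.

```python
import csv
import io


def print_matches(small_file, large_file, key_col=7, value_col=2):
    """Print (key, value) for each large-file row whose key is in the small file.

    Returns the number of matches printed.
    """
    # Only the first column of the small file is kept in memory.
    keys = {row[0] for row in csv.reader(small_file) if row}

    matched = 0
    for row in csv.reader(large_file):
        # Skip blank or short rows instead of raising IndexError.
        if len(row) <= max(key_col, value_col):
            continue
        if row[key_col] in keys:
            print(row[key_col], row[value_col])
            matched += 1
    return matched


# Tiny in-memory stand-ins for small.csv and large.csv (made-up data).
small = io.StringIO("k1\nk3\n")
large = io.StringIO(
    "a,b,v1,d,e,f,g,k1\n"
    "a,b,v2,d,e,f,g,k2\n"
    "too,short\n"
    "a,b,v3,d,e,f,g,k3\n"
)
count = print_matches(small, large)
```

`csv.reader` accepts any iterable of lines, so the same function should work with real file objects opened via `open(..., newline='')`.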

You can also turn the set-based loop into a generator, like this:

def row_matches():
    with open('small.csv') as s:
        sml = csv.reader(s)
        sml_set = set(row[0] for row in sml)

    with open('large.csv') as l:
        lrg = csv.reader(l)
        for row in lrg:
            if row[7] in sml_set:
                yield row[7], row[2]
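
One way to consume such a generator without building a list is to stream its output straight into another CSV file. The sketch below is an illustration, not part of the original answer: it assumes a variant of `row_matches` that takes file paths as parameters, and the temporary files, the `matches.csv` output name, and the sample rows are all made up so that the example runs on its own.

```python
import csv
import os
import tempfile


def row_matches(small_path, large_path):
    """Lazily yield (row[7], row[2]) for large-file rows whose column 7 is in the small file."""
    with open(small_path, newline='') as s:
        sml_set = {row[0] for row in csv.reader(s) if row}

    with open(large_path, newline='') as l:
        for row in csv.reader(l):
            if len(row) > 7 and row[7] in sml_set:
                yield row[7], row[2]


# Build two small sample files (made-up data) so the sketch runs anywhere.
tmp = tempfile.mkdtemp()
small_path = os.path.join(tmp, 'small.csv')
large_path = os.path.join(tmp, 'large.csv')
out_path = os.path.join(tmp, 'matches.csv')

with open(small_path, 'w', newline='') as f:
    csv.writer(f).writerows([['k1'], ['k3']])
with open(large_path, 'w', newline='') as f:
    csv.writer(f).writerows([
        ['a', 'b', 'v1', 'd', 'e', 'f', 'g', 'k1'],
        ['a', 'b', 'v2', 'd', 'e', 'f', 'g', 'k2'],
        ['a', 'b', 'v3', 'd', 'e', 'f', 'g', 'k3'],
    ])

# writerows() accepts any iterable, so matches stream to disk one at a time.
with open(out_path, 'w', newline='') as out:
    csv.writer(out).writerows(row_matches(small_path, large_path))

# Read the output back to check what was written.
with open(out_path, newline='') as f:
    results = list(csv.reader(f))
```

Because the generator yields one match at a time, memory use depends on the size of the small file's key set rather than on the size of the large file.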