The code below works fine as long as the files stay under about 3 million records, but beyond that I run out of memory, because I read the data into lists and then loop over the lists to find matches.
From earlier posts I gather that I should process one row at a time, but I can't find any post showing how to pull one row at a time from a CSV file and then work through it with two nested loops the way my code below does.
Any help would be much appreciated. Thanks in advance.
import csv

# open two csv files and read into lists lsts and lstl
with open('small.csv') as s:
    sml = csv.reader(s)
    lsts = [tuple(row) for row in sml]
with open('large.csv') as l:
    lrg = csv.reader(l)
    lstl = [tuple(row) for row in lrg]  # can be too large for memory

# find a match and print
for rows in lsts:
    for rowl in lstl:
        if rowl[7] != rows[0]:  # if no match continue
            continue
        else:
            print(rowl[7], rowl[2])  # when matched print data required from large file
Assuming you only care about one column of the small csv, you can turn that column into a set and test the large csv against it row by row. The set-membership test replaces the outer loop entirely, so only the small file is held in memory:
import csv

with open('small.csv') as s:
    sml = csv.reader(s)
    sml_set = set(row[0] for row in sml)  # small file fits in memory as a set

with open('large.csv') as l:
    lrg = csv.reader(l)
    for row in lrg:  # large file is streamed one row at a time
        if row[7] in sml_set:
            print(row[7], row[2])
You can turn it into a generator, like:
def row_matches():
    with open('small.csv') as s:
        sml = csv.reader(s)
        sml_set = set(row[0] for row in sml)

    with open('large.csv') as l:
        lrg = csv.reader(l)
        for row in lrg:
            if row[7] in sml_set:
                yield row[7], row[2]
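To show the generator in action end to end, here is a minimal runnable sketch. The parameterized signature and the throwaway demo files are my own additions (the question hard-codes 'small.csv' and 'large.csv'); the column indices (0 in the small file, 7 and 2 in the large file) are the ones from the question.

```python
import csv
import os
import tempfile

def row_matches(small_path, large_path):
    """Yield (key, value) pairs from rows of large_path whose column 7
    appears in column 0 of small_path. The small file is loaded into a
    set; the large file is streamed one row at a time."""
    with open(small_path, newline='') as s:
        sml_set = set(row[0] for row in csv.reader(s))
    with open(large_path, newline='') as l:
        for row in csv.reader(l):
            if row[7] in sml_set:
                yield row[7], row[2]

# Build tiny demo files in a temp directory so the sketch is runnable.
with tempfile.TemporaryDirectory() as d:
    small = os.path.join(d, 'small.csv')
    large = os.path.join(d, 'large.csv')
    with open(small, 'w', newline='') as f:
        csv.writer(f).writerows([['a1'], ['b2']])
    with open(large, 'w', newline='') as f:
        # 8 columns so index 7 exists; index 2 holds the data we want back
        csv.writer(f).writerows([
            ['x', 'x', 'datum1', 'x', 'x', 'x', 'x', 'a1'],
            ['x', 'x', 'datum2', 'x', 'x', 'x', 'x', 'zz'],
            ['x', 'x', 'datum3', 'x', 'x', 'x', 'x', 'b2'],
        ])
    matches = list(row_matches(small, large))
    print(matches)  # [('a1', 'datum1'), ('b2', 'datum3')]
```

Because the function yields rather than prints, the caller decides what to do with each match (write it out, count it, stop early), still without ever holding the large file in memory.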