I have (900k, 300) records in a Mongo collection. When I try to read the data into pandas, memory consumption grows rapidly until the process is killed. I should mention that the same data fits in memory (~1.5 GB) when I read it from a CSV file.
My machine is CentOS 7 with 32 GB of RAM and 16 CPUs.
My simple code:
client = MongoClient(host,port)
collection = client[db_name][collection_name]
cursor = collection.find()
df = pd.DataFrame(list(cursor))
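One likely culprit is `list(cursor)`: it materializes every document as a Python dict (with the key strings and object overhead repeated per row) before pandas ever sees the data, so the peak footprint is far larger than the final DataFrame. A minimal sketch of a chunked alternative, assuming the documents fit chunk by chunk; `frame_from_cursor` and its `chunk_size` are illustrative names, not pandas or PyMongo API:

```python
import itertools
import pandas as pd

def frame_from_cursor(cursor, chunk_size=10000):
    """Build a DataFrame from an iterable of dicts in fixed-size chunks,
    so only one chunk of raw dicts is alive at any moment."""
    frames = []
    it = iter(cursor)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Works with any iterable of dicts, e.g. a pymongo cursor:
docs = ({'a': i} for i in range(25000))
df = frame_from_cursor(docs, chunk_size=10000)
```

Whether this lowers the peak enough depends on how much of the overhead comes from the dicts versus the concatenation itself.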
My multiprocessing code:
def read_mongo_parallel(skipses):
    print('Starting process')
    client = MongoClient(skipses[4], skipses[5])
    db = client[skipses[2]]
    collection = db[skipses[3]]
    print('range of {} to {}'.format(skipses[0], skipses[0] + skipses[1]))
    cursor = collection.find().skip(skipses[0]).limit(skipses[1])
    return list(cursor)
all_lists = []
with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
    for rows in executor.map(read_mongo_parallel, skipesess):
        all_lists.extend(rows)

df = pd.DataFrame(all_lists)
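The `skipesess` argument list is never shown in the question. A sketch of how it might be built, assuming one (skip, limit, db, collection, host, port) tuple per worker in the positional order `read_mongo_parallel` indexes into; `make_ranges` and its parameters are hypothetical names for illustration:

```python
import multiprocessing

def make_ranges(total, db_name, collection_name, host, port, workers=None):
    """Split `total` documents into one (skip, limit, ...) tuple per worker.
    The last range may overshoot `total`; Mongo's limit() simply returns fewer."""
    workers = workers or multiprocessing.cpu_count()
    chunk = -(-total // workers)  # ceiling division
    return [(i * chunk, chunk, db_name, collection_name, host, port)
            for i in range(workers)]

skipesess = make_ranges(900000, 'mydb', 'mycol', 'localhost', 27017, workers=4)
```

Note that even with this split, every worker's result list is pickled back to the parent and held in `all_lists` before the DataFrame is built, so parallelism alone does not reduce the parent's peak memory.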
In both approaches the memory growth kills the kernel.
What am I doing wrong?
This test harness creates 900k (albeit small) records and runs fine on my laptop. Give it a try.
import pymongo
import pandas as pd

db = pymongo.MongoClient()['mydatabase']
db.mycollection.drop()

operations = []
for i in range(900000):
    operations.append(pymongo.InsertOne({'a': i}))
db.mycollection.bulk_write(operations, ordered=False)

cursor = db.mycollection.find({})
df = pd.DataFrame(list(cursor))
print(df.count())
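The harness's documents have a single small integer field, which may explain why it fits comfortably while the questioner's ~300-field documents do not: string-heavy columns are stored as Python objects and dominate the footprint. A quick way to see what a frame actually costs, using only pandas' own `DataFrame.memory_usage`:

```python
import pandas as pd

# 900k rows of one int64 column, mirroring the test harness above
df = pd.DataFrame({'a': range(900000)})

# deep=True also counts the Python object overhead in object (string) columns
mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f'{mb:.1f} MB')  # a single int64 column is cheap; 300 mixed columns are not
```

Comparing this number against the ~1.5 GB CSV figure on the real data would show how much of the blow-up is column content versus the intermediate list of dicts.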