OOM when reading data from MongoDB into pandas with the pymongo client


I have (900k, 300) records in a mongo collection. When I try to read the data into pandas, memory consumption grows rapidly until the process is killed. I should mention that the same data fits in memory (~1.5GB) when I read it from a csv file.

My machine is a CentOS 7 box with 32GB RAM and 16 CPUs.

My simple code:

from pymongo import MongoClient
import pandas as pd

client = MongoClient(host, port)
collection = client[db_name][collection_name]
cursor = collection.find()
# Materializes every document as a Python dict before building the DataFrame
df = pd.DataFrame(list(cursor))

My multiprocessing code:

import concurrent.futures
import multiprocessing
import pandas as pd
from pymongo import MongoClient


def read_mongo_parallel(skipses):
    # skipses = (skip, limit, db_name, collection_name, host, port)
    print('Starting process')
    client = MongoClient(skipses[4], skipses[5])
    db = client[skipses[2]]
    collection = db[skipses[3]]
    print('range of {} to {}'.format(skipses[0], skipses[0] + skipses[1]))

    cursor = collection.find().skip(skipses[0]).limit(skipses[1])

    return list(cursor)


all_lists = []
with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
    for rows in executor.map(read_mongo_parallel, skipesess):
        all_lists.extend(rows)

df = pd.DataFrame(all_lists)

With both approaches memory keeps growing until the process is killed.

What am I doing wrong?

python pandas pymongo
1 Answer

This test harness creates 900k (admittedly small) records and runs fine on my laptop. Give it a try.

import pymongo
import pandas as pd

db = pymongo.MongoClient()['mydatabase']
db.mycollection.drop()
operations = []

for i in range(900000):
    operations.append(pymongo.InsertOne({'a': i}))

db.mycollection.bulk_write(operations, ordered=False)
cursor = db.mycollection.find({})
df = pd.DataFrame(list(cursor))
print(df.count())
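If your real documents are much wider than these test records, the intermediate list(cursor) of Python dicts can take far more memory than the final DataFrame. Below is a minimal sketch, not part of the original answer, of building the frame in fixed-size batches so only one batch of dicts is alive at a time; the database/collection names and batch size are assumptions you would replace with your own:

import pymongo
import pandas as pd

client = pymongo.MongoClient()                     # assumed local defaults
collection = client['mydatabase']['mycollection']  # hypothetical names

batch_size = 10000   # tune to your available RAM
frames = []
batch = []
for doc in collection.find({}, projection={'_id': False}):  # skip _id if not needed
    batch.append(doc)
    if len(batch) >= batch_size:
        # Convert this batch of dicts into a compact DataFrame and free the dicts
        frames.append(pd.DataFrame(batch))
        batch = []
if batch:
    frames.append(pd.DataFrame(batch))

df = pd.concat(frames, ignore_index=True)
print(df.shape)

The peak here is roughly one batch of dicts plus the per-batch DataFrames (and a temporary copy during concat), instead of all 900k documents as dicts at once.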