pandas json_normalize directly from a file

Question

I am writing a wrapper class for pymongo. My main goal is to make it compatible with pandas, so that I can do this:

from loader import PyMongoLoader
import pandas as pd

_loader = PyMongoLoader(url="my_url", port="my_port")
df = pd.read_json(_loader, orient="records")

According to the pandas documentation, pd.read_json accepts any object with a read method as its first argument. Internally, I call collection.find and serialize the result to a string with bson.dumps:

    def read(self, size: int = -1) -> str:
        assert self._is_valid, "MongoDB connection must be valid for retrieving data"

        self.logger.debug(
            f"Retrieving documents from {self._collection}. Expected size: \033[94m{self._collection.count_documents({})}\033[0m"
        )
        result = list(self._collection.find())

        self.logger.debug(
            f"Retrieved {len(result)} documents from remote. Size of json: {sum([len(document) for document in result])}"
        )
        return dumps(result)
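
A side note on the second debug message: len(document) on a dict counts its top-level keys, so the logged "Size of json" is a key count rather than a size. A small sketch of the difference, using plain sample documents (an assumption; real pymongo documents containing an ObjectId would need the same dumps that read uses):

```python
import json

# Sample documents standing in for the result of collection.find() (assumed data).
documents = [
    {"DateTime": "20/02/2017 15:00:00", "546321B": {"measure": 0}},
    {"DateTime": "20/02/2017 16:00:00", "546321B": {"measure": 5}},
]

# What the log line computes: the number of top-level keys per document, summed.
key_count = sum(len(document) for document in documents)

# Length in characters of the JSON text that would be returned.
payload = json.dumps(documents)
json_chars = len(payload)

# Size in bytes once the text is encoded as UTF-8.
json_bytes = len(payload.encode("utf-8"))

print(key_count, json_chars, json_bytes)
```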

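The problem

This works well. However, the database I am working with has complex field names, such as "26344T Control Measure [5.3-10.8]". Because of these names, the data is stored in the database as nested JSON. I would like to be able to normalize these fields, but as far as I know, pd.json_normalize does not accept a path argument, only a string.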

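I have thought of two solutions, but neither convinces me:

Change the database field names. This would solve the problem, but I would have to remember it every time I add new data to the database, which would be a hassle.
Patch pd.read_json or pd.json_normalize in the loader source file. This would also solve the problem, but I am sure it is not a good solution, and it could break other code that uses pandas.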

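Question

Is there a supported way to normalize JSON directly from a file? If not, how can I normalize the JSON passed to pd.read_json to get rid of the odd nesting that comes from the database?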

Edit

The JSON result I get from the database follows this format:

{
  "_id": {
    "$oid": "651ec788c110096a55c8d4de"
  },
  "DateTime": "20/02/2017 15:00:00",
  "546321B": {
    "measure": 0
  },
  "538612B": {
    "measure": 80
  },
  "517713B": {
    "measure": 70
  },
  "508021V": {
    "avg": 37
  }
}

I want my dataframe to be:

DateTime	546321B.measure	538612B.measure	517713B.measure	508021V.avg
"20/02/2017 15:00:00"	0	80	70	37

Ideally, I would like to get this result directly from pd.read_json(loader, orient="records").

Answer 1

If you want to normalize your data, you only have nested dictionaries, and you do not want to use pd.json_normalize, you can use something like this:

def unest_level(data, prefix=""):
    """Yield (dotted_key, value) pairs, flattening nested dictionaries."""
    for key, item in data.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(item, dict):
            # Recurse so that every nesting level is joined into the key.
            yield from unest_level(item, full_key)
        else:
            yield full_key, item

Considering the data you provided:

>>> {key: item for key, item in unest_level(data)}
{'_id.$oid': '651ec788c110096a55c8d4de',
 'DateTime': '20/02/2017 15:00:00',
 '546321B.measure': 0,
 '538612B.measure': 80,
 '517713B.measure': 70,
 '508021V.avg': 37}

You can add it to your reader like this:

def read(self, size: int = -1) -> str:
    assert self._is_valid, "MongoDB connection must be valid for retrieving data"

    self.logger.debug(
        f"Retrieving documents from {self._collection}. Expected size: \033[94m{self._collection.count_documents({})}\033[0m"
    )
    # Modified line: call unest_level for every document returned by self._collection.find()
    result = [{key: item for key, item in unest_level(data)}
              for data in self._collection.find()]

    self.logger.debug(
        f"Retrieved {len(result)} documents from remote. Size of json: {sum([len(document) for document in result])}"
    )
    return dumps(result)

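I was trying to solve a similar problem. This should also flatten deeper levels of nesting, in case the data contains dictionaries inside other dictionaries.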
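
For comparison, pd.json_normalize does accept already-parsed data (a dict or a list of dicts), so the string returned by the loader's read method can be parsed with json.loads and handed over directly. A minimal sketch, assuming read returns a JSON array of documents; FakeLoader, load_records and to_frame are hypothetical names used only for illustration:

```python
import json


class FakeLoader:
    """Hypothetical stand-in for PyMongoLoader; read() returns a JSON array string."""

    def read(self, size: int = -1) -> str:
        return json.dumps([
            {
                "_id": {"$oid": "651ec788c110096a55c8d4de"},
                "DateTime": "20/02/2017 15:00:00",
                "546321B": {"measure": 0},
                "508021V": {"avg": 37},
            }
        ])


def load_records(loader) -> list:
    """Parse the loader's JSON text into a list of (possibly nested) dicts."""
    return json.loads(loader.read())


def to_frame(loader):
    """Build a DataFrame whose nested fields become dotted column names."""
    import pandas as pd  # imported here so load_records works without pandas

    # json_normalize joins nested keys with "." by default (sep=".").
    return pd.json_normalize(load_records(loader))
```

With the sample document, to_frame(FakeLoader()) should produce columns such as 546321B.measure and 508021V.avg. This goes through pd.json_normalize rather than pd.read_json, so it does not give the single-call interface the question asks for.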
