Flatten JSON with Dask DataFrames

Question (votes: 1, answers: 1)

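I'm trying to flatten an array of JSON objects (held in memory, not in a .json file) inside a Dask DataFrame. I have a lot of data, and the continuously running process keeps consuming RAM, so I need a solution that runs in parallel.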

This is the JSON I have:

[ {
        "id": "0001",
        "name": "Stiven",
        "location": [{
                "country": "Colombia",
                "department": "Choco",
                "city": "Quibdo"
            }, {
                "country": "Colombia",
                "department": "Antioquia",
                "city": "Medellin"
            }, {
                "country": "Colombia",
                "department": "Cundinamarca",
                "city": "Bogota"
            }
        ]
    }, {
        "id": "0002",
        "name": "Jhon Jaime",
        "location": [{
                "country": "Colombia",
                "department": "Valle del Cauca",
                "city": "Cali"
            }, {
                "country": "Colombia",
                "department": "Putumayo",
                "city": "Mocoa"
            }, {
                "country": "Colombia",
                "department": "Arauca",
                "city": "Arauca"
            }
        ]
    }, {
        "id": "0003",
        "name": "Francisco",
        "location": [{
                "country": "Colombia",
                "department": "Atlantico",
                "city": "Barranquilla"
            }, {
                "country": "Colombia",
                "department": "Bolivar",
                "city": "Cartagena"
            }, {
                "country": "Colombia",
                "department": "La Guajira",
                "city": "Riohacha"
            }
        ]
    }
]

This is the DataFrame I have:

index   id    name         location
0       0001  Stiven       [{'country':'Colombia', 'department': 'Choco', 'city': 'Quibdo'}, {'country':'Colombia', 'department': 'Antioquia', 'city': 'Medellin'}, {'country':'Colombia', 'department': 'Cundinamarca', 'city': 'Bogota'}]
1       0002  Jhon Jaime   [{'country':'Colombia', 'department': 'Valle del Cauca', 'city': 'Cali'}, {'country':'Colombia', 'department': 'Putumayo', 'city': 'Mocoa'}, {'country':'Colombia', 'department': 'Arauca', 'city': 'Arauca'}]
2       0003  Francisco    [{'country':'Colombia', 'department': 'Atlantico', 'city': 'Barranquilla'}, {'country':'Colombia', 'department': 'Bolivar', 'city': 'Cartagena'}, {'country':'Colombia', 'department': 'La Guajira', 'city': 'Riohacha'}] 

I need to expand each id into one row per location, giving a DataFrame like this:

index   id    name         country   department       city
0       0001  Stiven       Colombia  Choco            Quibdo
1       0001  Stiven       Colombia  Antioquia        Medellin
2       0001  Stiven       Colombia  Cundinamarca     Bogota
3       0002  Jhon Jaime   Colombia  Valle del Cauca  Cali
4       0002  Jhon Jaime   Colombia  Putumayo         Mocoa
5       0002  Jhon Jaime   Colombia  Arauca           Arauca
6       0003  Francisco    Colombia  Atlantico        Barranquilla
7       0003  Francisco    Colombia  Bolivar          Cartagena 
8       0003  Francisco    Colombia  La Guajira       Riohacha   

The whole process must run in parallel with Dask. Any suggestions?

Thanks in advance.

dataframe dask dask-distributed dask-delayed dask-ml
1 Answer
0 votes

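I suggest first solving this problem for a single pandas DataFrame, then using the .map_partitions function to apply that function to every pandas partition inside the Dask DataFrame.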
