使用 Python 和 Pandas 从 CSV 文件中提取和规范化嵌套 JSON 数据

问题描述 投票:0回答:1

问题描述:

我正在尝试使用 Pandas 库从 Python 中的 CSV 文件中提取数据。但是,CSV 文件中的数据采用嵌套的 JSON 格式,并且由于多余的逗号和字符,结构格式不正确。以下是文件前几行的示例:

reviews
"[{'funny': '', 'posted': 'Posted November 5, 2011.', 'last_edited': '', 'item_id': '1250', 'helpful': 'No ratings yet', 'recommend': True, 'review': 'Simple yet with great replayability. In my opinion does ""zombie"" hordes and team work better than left 4 dead plus has a global leveling system. Alot of down to earth ""zombie"" splattering fun for the whole family. Amazed this sort of FPS is so rare.'}, {'funny': '', 'posted': 'Posted July 15, 2011.', 'last_edited': '', 'item_id': '22200', 'helpful': 'No ratings yet', 'recommend': True, 'review': 'It's unique and worth a playthrough.'}, {'funny': '', 'posted': 'Posted April 21, 2011.', 'last_edited': '', 'item_id': '43110', 'helpful': 'No ratings yet', 'recommend': True, 'review': 'Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game is definitely worth it and I hope they do a sequel...so buy the game so I get a sequel!'}]"
"[{'funny': '', 'posted': 'Posted June 24, 2014.', 'last_edited': '', 'item_id': '251610', 'helpful': '15 of 20 people (75%) found this review helpful', 'recommend': True, 'review': 'I know what you think when you see this title ""Barbie Dreamhouse Party"" but do not be intimidated by it\'s title, this is easily one of my GOTYs. You don\'t get any of that cliche game mechanics that all the latest games have, this is simply good core gameplay. Yes, you can\'t 360 noscope your friends, but what you can do is show them up with your bad ♥♥♥ dance moves and put them to shame as you show them what true fashion and color combinations are.I know this game says for kids but, this is easily for any age range and any age will have a blast playing this.8/8'}, {'funny': '', 'posted': 'Posted September 8, 2013.', 'last_edited': '', 'item_id': '227300', 'helpful': '0 of 1 people (0%) found this review helpful', 'recommend': True, 'review': 'For a simple (it's actually not all that simple but it can be!) truck driving Simulator, it is quite a fun and relaxing game. Playing on simple (or easy?) its just the basic WASD keys for driving but (if you want) the game can be much harder and realistic with having to manually change gears, much harder turning, etc. And reversing in this game is a ♥♥♥♥♥, as I imagine it would be with an actual truck. Luckily, you don't have to reverse park it but you get extra points if you do cause it is bloody hard. But this is suprisingly a nice truck driving game and I had a bit of fun with it.'}, {'funny': '', 'posted': 'Posted November 29, 2013.', 'last_edited': '', 'item_id': '239030', 'helpful': '1 of 4 people (25%) found this review helpful', 'recommend': True, 'review': 'Very fun little game to play when your bored or as a time passer. Very gud. Do Recommend. pls buy'}]"

由于 CSV 文件中存在不正确的结构和附加字符,我正在努力标准化这些数据。如何使用 Python 和 Pandas 正确提取和标准化这些数据?您能提供的任何指导将不胜感激!

使用此代码获取“评论”键,这是问题所在,但它不起作用

import pandas as pd
import ast

# Carga el archivo CSV en un DataFrame
df = pd.read_csv('reviews_explotadas.csv')

# Define una función para corregir los diccionarios
def fix_dicts(row):
    try:
        # Corrige las comillas y demás caracteres en los diccionarios
        row = row.replace("'", '"').replace('""', '"')
        # Parsea el string JSON a un diccionario Python
        row = ast.literal_eval(row)
        return row
    except:
        return None

# Aplica la función a la columna que contiene los diccionarios
df['reviews'] = df['reviews'].apply(fix_dicts)

# Utiliza la función explode para dividir los diccionarios en filas separadas
df_exploded = df.explode('reviews').reset_index(drop=True)

# Muestra el DataFrame resultante
print(df_exploded.head())
python json pandas dataframe
1个回答
0
投票

这个相关答案似乎正在解决与您完全相同的问题。

解决方案是将解析部分外包给python的标准库,让pandas接手。

如何读取包含 JSON 数组的 CSV 文件来获取 Pandas DataFrame?

© www.soinside.com 2019 - 2024. All rights reserved.