My DataFrame looks like this:
serial_no text
23 {'Headers': ['LA-Spanish (Español)[Change]',
'5790B/5/AF Addendum', 'Secondary menu'], 'Divs': ['',
'Document(s):5790B/5/AF Addendum (168.88 KB)5790B/5/AF
AddendumN.º de revisión:00']}
25 {'Headers': ['LA-Spanish (Español)[Change]',
'700HPPK Service Information', 'Secondary menu'],
'Paragraphs': [], 'Tables': [], 'Lists': ["
['Disclaimer', 'Declaración de privacidad', 'Terms
of Use', 'Términos y Condiciones']"], 'Divs': ['',
'Document(s):700HPPK Service Information (3.36
MB)N.º de revisión:00']}
I want to translate the text column into English, but the column is stored in a key-value format. My code is below:
import pandas as pd
from googletrans import Translator
translator = Translator()
df['text2'] = df['text'].apply(lambda x: translator.translate(x, dest='en').text)
I get the following error:
AttributeError: 'NoneType' object has no attribute 'group'
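There are two main problems here:
1. Lists contains a doubly nested list, [[]].
2. The googletrans version matters: googletrans==3.1.0a0 is the one that worked for me.
After that, the program should start with data preprocessing, as follows: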
import pandas as pd
import ast

data = {
    'serial_no': [23, 25],
    'text': [
        "{'Headers': ['LA-Spanish (Español)[Change]', '5790B/5/AF Addendum', 'Secondary menu'],"
        " 'Divs': ['', 'Document(s):5790B/5/AF Addendum (168.88 KB)5790B/5/AF AddendumN.º de revisión:00']}",
        "{'Headers': ['LA-Spanish (Español)[Change]', '700HPPK Service Information', 'Secondary menu'],"
        " 'Paragraphs': [], 'Tables': [], 'Lists': [['Disclaimer', 'Declaración de privacidad', 'Terms of Use', 'Términos y Condiciones']],"
        " 'Divs': ['', 'Document(s):700HPPK Service Information (3.36 MB)N.º de revisión:00']}"
    ]
}

def data_preprocessing(data):
    # Convert string representation of dictionary to actual dictionary
    data['text'] = [ast.literal_eval(text) for text in data['text']]
    df = pd.DataFrame(data)
    # Normalize the JSON structure within the 'text' column
    df_normalized = pd.json_normalize(df['text'])
    # Remove one level of nesting from the 'Lists' column
    df_normalized['Lists'] = [
        inner_list[0] if isinstance(inner_list, list) and len(inner_list) == 1 else inner_list
        for inner_list in df_normalized['Lists']
    ]
    return df_normalized

df_normalized = data_preprocessing(data)
print(df_normalized)
The result looks like this:
Headers \
0 [LA-Spanish (Español)[Change], 5790B/5/AF Adde...
1 [LA-Spanish (Español)[Change], 700HPPK Service...
Divs Paragraphs Tables \
0 [, Document(s):5790B/5/AF Addendum (168.88 KB)... NaN NaN
1 [, Document(s):700HPPK Service Information (3.... [] []
Lists
0 NaN
1 [Disclaimer, Declaración de privacidad, Terms ...
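Note that in the question's data, Lists holds a string that itself looks like a list (["['Disclaimer', ...]"]), not a real nested list. Here is a minimal sketch of handling both shapes for a single cell, using only the standard library. The helper name flatten_lists_field and the shortened sample cell are my own, for illustration only:

```python
import ast

def flatten_lists_field(cell):
    """Parse one stringified dict and unwrap a singly nested 'Lists' value."""
    # Accept either the raw string from the column or an already parsed dict
    record = ast.literal_eval(cell) if isinstance(cell, str) else dict(cell)
    lists = record.get("Lists")
    if isinstance(lists, list) and len(lists) == 1:
        inner = lists[0]
        if isinstance(inner, str):
            # The inner element may be a list written out as text
            try:
                inner = ast.literal_eval(inner)
            except (ValueError, SyntaxError):
                inner = lists[0]  # ordinary text, keep it unchanged
        if isinstance(inner, list):
            record["Lists"] = inner
    return record

# Shortened, hypothetical cell in the same shape as the question's data
cell = "{'Headers': ['Secondary menu'], 'Lists': [\"['Disclaimer', 'Terms of Use']\"]}"
print(flatten_lists_field(cell)["Lists"])
```

Records without a Lists key, or with a Lists value that is already flat, pass through unchanged.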
Then you can explode the list columns and format them properly:
def explode_and_create_columns(df):
    new_rows = []
    for index, row in df.iterrows():
        new_row = {}
        for column_name, column_values in row.items():
            if isinstance(column_values, list):
                # One new column per list element: Headers_1, Headers_2, ...
                for i, value in enumerate(column_values, start=1):
                    new_row[f"{column_name}_{i}"] = value
            else:
                new_row[column_name] = column_values
        new_rows.append(new_row)
    return pd.DataFrame(new_rows)

df_final = explode_and_create_columns(df_normalized)
print(df_final)
The result looks like this:
Headers_1 Headers_2 Headers_3 \
0 LA-Spanish (Español)[Change] 5790B/5/AF Addendum Secondary menu
1 LA-Spanish (Español)[Change] 700HPPK Service Information Secondary menu
Divs_1 Divs_2 Paragraphs \
0 Document(s):5790B/5/AF Addendum (168.88 KB)579... NaN
1 Document(s):700HPPK Service Information (3.36 ... NaN
Tables Lists Lists_1 Lists_2 Lists_3 \
0 NaN NaN NaN NaN NaN
1 NaN NaN Disclaimer Declaración de privacidad Terms of Use
Lists_4
0 NaN
1 Términos y Condiciones
Finally, apply the translation:
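from googletrans import Translator

# Create the client once instead of once per cell
translator = Translator()

def translate_text(text):
    # Leave NaN, non-string values, and empty strings untouched
    if not isinstance(text, str) or not text.strip():
        return text
    translated = translator.translate(text, dest='en')
    return translated.text

# Apply translation to each cell of the DataFrame
# (pandas 2.1+ deprecates applymap in favor of DataFrame.map)
df_translated = df_final.applymap(translate_text)
print(df_translated)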
This gives the translated DataFrame:
Headers_1 Headers_2 Headers_3 \
0 LA-Spanish (Spanish)[Change] 5790B/5/AF Addendum Secondary menu
1 LA-Spanish (Spanish)[Change] 700HPPK Service Information Secondary menu
Divs_1 Divs_2 Paragraphs \
0 Document(s):5790B/5/AF Addendum (168.88 KB)579... NaN
1 Document(s):700HPPK Service Information (3.36 ... NaN
Tables Lists Lists_1 Lists_2 Lists_3 \
0 NaN NaN NaN NaN NaN
1 NaN NaN Disclaimer Privacy statement Terms of Use
Lists_4
0 NaN
1 Terms and Conditions
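If you would rather keep the original key-value layout instead of exploding it into columns, one option is to walk each parsed dict and translate only the strings inside it. This is a rough sketch, not something I ran against googletrans: translate_nested and fake_translate are hypothetical names, and fake_translate is a stand-in lookup so the example runs offline. In practice you would pass a function that calls your translator.

```python
def translate_nested(value, translate_fn):
    """Return a copy of value with every non-empty string passed through translate_fn."""
    if isinstance(value, dict):
        return {key: translate_nested(item, translate_fn) for key, item in value.items()}
    if isinstance(value, list):
        return [translate_nested(item, translate_fn) for item in value]
    if isinstance(value, str) and value.strip():
        return translate_fn(value)
    # Empty strings, numbers, None, NaN and so on are returned unchanged
    return value

# Stand-in translator for demonstration only (assumption, not a real API)
fake_dictionary = {"Declaración de privacidad": "Privacy statement"}

def fake_translate(text):
    return fake_dictionary.get(text, text)

record = {
    "Headers": ["Secondary menu"],
    "Lists": ["Disclaimer", "Declaración de privacidad"],
    "Divs": [""],
}
translated_record = translate_nested(record, fake_translate)
print(translated_record)
```

The keys (Headers, Lists, Divs) are left alone on purpose, so the result has the same structure as the input.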
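I ran into this problem too: I tried to translate a single sentence and got "Limit exceeded. Come back in 23 hours."
So I think you need to choose a different translator.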
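Whichever translator you end up with, it may help to send each distinct string only once, since cells like "Secondary menu" repeat in every row. Below is a minimal sketch of a caching wrapper around any translate function. The names make_cached_translator and counting_translate are made up for illustration, and the backend here is a dummy that just uppercases text so the example runs without a network:

```python
def make_cached_translator(translate_fn):
    """Wrap translate_fn so each distinct string is translated at most once."""
    cache = {}

    def translate(text):
        if text not in cache:
            cache[text] = translate_fn(text)
        return cache[text]

    return translate

# Dummy backend that counts calls (assumption: stands in for a real service)
calls = []

def counting_translate(text):
    calls.append(text)
    return text.upper()

cached = make_cached_translator(counting_translate)
cells = ["Secondary menu", "Disclaimer", "Secondary menu", "Secondary menu"]
results = [cached(cell) for cell in cells]
print(results)
print(len(calls))
```

Because the wrapper only takes a function, swapping the backend means changing the single argument passed to make_cached_translator.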