Translate a single column to English using pandas

Question (votes: 0, answers: 2)

My DataFrame looks like this:

serial_no                                                       text
   23                                  {'Headers': ['LA-Spanish (Español)[Change]', 
                                      '5790B/5/AF Addendum', 'Secondary menu'], 'Divs': ['', 
                                      'Document(s):5790B/5/AF Addendum (168.88 KB)5790B/5/AF 
                                        AddendumN.º de revisión:00']}

   25                                  {'Headers': ['LA-Spanish (Español)[Change]', 
                                       '700HPPK Service Information', 'Secondary menu'], 
                                      'Paragraphs': [], 'Tables': [], 'Lists': [" 
                                       ['Disclaimer', 'Declaración de privacidad', 'Terms 
                                       of Use', 'Términos y Condiciones']"], 'Divs': ['', 
                                       'Document(s):700HPPK Service Information (3.36 
                                       MB)N.º de revisión:00']}

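I want to translate the text column into English, but the column stores its values in a key-value (dictionary-like) format. My code is: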

import pandas as pd
from googletrans import Translator
translator = Translator()
df['text2'] = df['text'].apply(lambda x: translator.translate(x, dest='en').text)

I get the following error:

AttributeError: 'NoneType' object has no attribute 'group'
python-3.x pandas group-by
2 answers
0
votes

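There are two main problems here:

  • The data is poorly formatted: the Lists value is doubly nested ([[...]]).
  • The googletrans package gave me the same error. As others have suggested, try googletrans==3.1.0a0 (pip install googletrans==3.1.0a0); that version worked for me.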

Start by preprocessing the data as follows:

import pandas as pd    
import ast

data = {
    'serial_no': [23, 25],
    'text': [
        "{'Headers': ['LA-Spanish (Español)[Change]', '5790B/5/AF Addendum', 'Secondary menu'],"
        " 'Divs': ['', 'Document(s):5790B/5/AF Addendum (168.88 KB)5790B/5/AF AddendumN.º de revisión:00']}",
        "{'Headers': ['LA-Spanish (Español)[Change]', '700HPPK Service Information', 'Secondary menu'],"
        " 'Paragraphs': [], 'Tables': [], 'Lists': [['Disclaimer', 'Declaración de privacidad', 'Terms of Use', 'Términos y Condiciones']],"
        " 'Divs': ['', 'Document(s):700HPPK Service Information (3.36 MB)N.º de revisión:00']}"
    ]
}

def data_preprocessing(data):
    # Convert string representation of dictionary to actual dictionary
    data['text'] = [ast.literal_eval(text) for text in data['text']]

    df = pd.DataFrame(data)
    
    # Normalize the JSON structure within the 'text' column
    df_normalized = pd.json_normalize(df['text'])

    # Remove one level of nesting from the 'Lists' column
    df_normalized['Lists'] = [inner_list[0] if isinstance(inner_list, list) and len(inner_list) == 1 else inner_list for inner_list in df_normalized['Lists']]

    return df_normalized


df_normalized = data_preprocessing(data)
print(df_normalized)

This gives:

         Headers  \
    0  [LA-Spanish (Español)[Change], 5790B/5/AF Adde...   
    1  [LA-Spanish (Español)[Change], 700HPPK Service...   
    
                                                    Divs Paragraphs Tables  \
    0  [, Document(s):5790B/5/AF Addendum (168.88 KB)...        NaN    NaN   
    1  [, Document(s):700HPPK Service Information (3....         []     []   
    
                                                   Lists  
    0                                                NaN  
    1  [Disclaimer, Declaración de privacidad, Terms ...
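Note that in the question's data the Lists value is a one-element list whose only item is a string that looks like a list, not a true nested list as in the sample above. A minimal sketch for handling both shapes, assuming a hypothetical helper named unwrap_lists (it is not part of the code above):

```python
import ast

def unwrap_lists(value):
    """Flatten a one-element list whose item is a list or a stringified list.

    Hypothetical helper: anything that does not match that shape is
    returned unchanged.
    """
    if isinstance(value, list) and len(value) == 1:
        inner = value[0]
        # A string such as " ['a', 'b']" is parsed back into a real list
        if isinstance(inner, str) and inner.strip().startswith('['):
            try:
                inner = ast.literal_eval(inner.strip())
            except (ValueError, SyntaxError):
                return value  # not a valid literal, keep the original
        if isinstance(inner, list):
            return inner
    return value
```

It could stand in for the list comprehension in data_preprocessing, for example df_normalized['Lists'] = df_normalized['Lists'].map(unwrap_lists).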

Then you can expand the lists into separate, properly named columns:

def explode_and_create_columns(df):
    new_rows = []
    for index, row in df.iterrows():
        new_row = {}
        for column_name, column_values in row.items():
            if isinstance(column_values, list):
                for i, value in enumerate(column_values, start=1):
                    new_row[f"{column_name}_{i}"] = value
            else:
                new_row[column_name] = column_values
        new_rows.append(new_row)
    return pd.DataFrame(new_rows)

Calling it on df_normalized gives:

         Headers_1                    Headers_2       Headers_3  \
0  LA-Spanish (Español)[Change]          5790B/5/AF Addendum  Secondary menu   
1  LA-Spanish (Español)[Change]  700HPPK Service Information  Secondary menu   

  Divs_1                                             Divs_2  Paragraphs  \
0         Document(s):5790B/5/AF Addendum (168.88 KB)579...         NaN   
1         Document(s):700HPPK Service Information (3.36 ...         NaN   

   Tables  Lists     Lists_1                    Lists_2       Lists_3  \
0     NaN    NaN         NaN                        NaN           NaN   
1     NaN    NaN  Disclaimer  Declaración de privacidad  Terms of Use   

                  Lists_4  
0                     NaN  
1  Términos y Condiciones

Finally, apply the translation:

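from googletrans import Translator

# Create the client once instead of once per cell
translator = Translator()

def translate_text(text):
    # Leave missing values (NaN) untouched
    if pd.isna(text):
        return text
    return translator.translate(text, dest='en').text

# Flatten the normalized frame, then translate every cell.
# (On pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap.)
df_final = explode_and_create_columns(df_normalized)
df_translated = df_final.applymap(translate_text)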

The translated DataFrame:

  Headers_1                    Headers_2       Headers_3  \
0  LA-Spanish (Spanish)[Change]          5790B/5/AF Addendum  Secondary menu   
1  LA-Spanish (Spanish)[Change]  700HPPK Service Information  Secondary menu   

  Divs_1                                             Divs_2  Paragraphs  \
0         Document(s):5790B/5/AF Addendum (168.88 KB)579...         NaN   
1         Document(s):700HPPK Service Information (3.36 ...         NaN   

   Tables  Lists     Lists_1            Lists_2       Lists_3  \
0     NaN    NaN         NaN                NaN           NaN   
1     NaN    NaN  Disclaimer  Privacy statement  Terms of Use   

                Lists_4  
0                   NaN  
1  Terms and Conditions
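Many cells repeat the same text (the headers, for instance), so it may help to translate each distinct string only once and reuse the result, which reduces the number of requests. The following is only a sketch: translate_frame is a hypothetical name, and translate_fn stands for any function that takes a string and returns its translation.

```python
import pandas as pd

def translate_frame(df, translate_fn):
    """Translate every distinct non-empty string in df exactly once."""
    # Collect the distinct strings; NaN and empty strings are skipped
    unique_texts = {
        value
        for value in df.to_numpy().ravel()
        if isinstance(value, str) and value.strip()
    }
    # One call to translate_fn per distinct string
    mapping = {text: translate_fn(text) for text in unique_texts}
    # Map the cached translations back onto the cells
    return df.apply(
        lambda col: col.map(
            lambda v: mapping.get(v, v) if isinstance(v, str) else v
        )
    )

# Possible usage, assuming the synchronous googletrans 3.x API used above:
# translator = Translator()
# df_translated = translate_frame(
#     df_final, lambda t: translator.translate(t, dest='en').text
# )
```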

0
votes

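I ran into this problem too. I tried translating a single sentence, but I got "Limit exceeded. Come back in 23 hours."

So I think you need to choose a different translation service.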
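If you do switch, one option is to keep the translation call behind a plain function so that services can be tried in order. This is only a sketch: translate_with_fallback and the backend names are hypothetical, and each backend is assumed to be a function that takes a string and either returns the translation or raises an exception.

```python
def translate_with_fallback(text, backends):
    """Try each (name, function) pair in order and return the first result.

    Hypothetical helper: raises RuntimeError if every backend fails.
    """
    errors = []
    for name, translate_fn in backends:
        try:
            return translate_fn(text)
        except Exception as exc:  # e.g. a rate-limit or network error
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all translation backends failed: " + "; ".join(errors))
```

With that in place, df.applymap(lambda t: translate_with_fallback(t, backends)) keeps working as long as at least one service responds.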
