I'm trying to translate my dataset into English. It contains data in several different languages, so I first wrote this:
import pandas as pd
from langdetect import detect
df = pd.read_csv('jobs.csv')
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # detect() raises on empty or feature-less text; treat that as unknown
        return None
df['language'] = df['job_title'].apply(detect_language)
print(df.head())
Then I tried to translate the non-English titles into English:
from googletrans import Translator
def translate_to_english(text):
    if pd.isnull(text):
        return text
    else:
        translator = Translator()
        translated = translator.translate(text, src='auto', dest='en')
        return translated.text
english_jobs = df[df['language'] == 'en']
non_english_jobs = df[df['language'] != 'en']
non_english_jobs['translated_job_title'] = non_english_jobs['job_title'].apply(translate_to_english)
translated_df = pd.concat([english_jobs, non_english_jobs])
print(translated_df.head())
But the second snippet raises this error:
AttributeError Traceback (most recent call last)
Cell In[29], line 20
18 # Translate non-English text to English
19 non_english_jobs = df[df['language'] != 'en']
---> 20 non_english_jobs['translated_job_title'] = non_english_jobs['job_title'].apply(translate_to_english)
22 # Concatenate English and translated non-English jobs
23 translated_df = pd.concat([english_jobs, non_english_jobs])
AttributeError: 'NoneType' object has no attribute 'group'
Can you help me solve this problem?
I tested this first and found that the main problem comes from googletrans, so I used deep-translator instead.
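For context, "'NoneType' object has no attribute 'group'" is the message Python gives when code calls .group() on a regex search that found nothing. A likely explanation, assumed here rather than shown by the traceback, is that googletrans extracts a token from a web response with a regex and the response no longer matches. The pattern and HTML below are made up for illustration and are not googletrans internals:

```python
import re

# Hypothetical response body that no longer contains the expected token
html = "<html><body>no token here</body></html>"

# Hypothetical pattern; re.search returns None when nothing matches
match = re.search(r"tkk:'(\d+\.\d+)'", html)

try:
    token = match.group(1)  # calling .group on None raises AttributeError
except AttributeError as exc:
    message = str(exc)
    print(message)
```

Because the failure happens inside the library, changing the DataFrame code does not help, which is why switching libraries is a reasonable workaround.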
import pandas as pd
from langdetect import detect
from deep_translator import GoogleTranslator
I created a dummy df to check my code.
data = {'job_title': ["Software Engineer", "Ingeniero de Software", "Développeur logiciel", "Data Scientist", "Gerente de Proyecto"]}
df = pd.DataFrame(data)
I used a lambda instead of detect_language(text):
df['language'] = df['job_title'].apply(lambda x: detect(x) if x is not None else None)
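One caveat with the lambda: the `x is not None` check does not catch NaN values (which pandas uses for missing cells) or blank strings, and detect() raises on text it cannot extract features from. A minimal sketch of a more defensive wrapper follows. The name safe_detect is hypothetical, and the detector is passed in as a parameter so the sketch can run offline with a stand-in:

```python
import math

def safe_detect(text, detect_fn):
    """Return detect_fn(text), or None when text is missing, blank, or undetectable."""
    # None and float NaN both count as missing
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return None
    # Whitespace-only strings carry nothing to detect
    if not str(text).strip():
        return None
    try:
        return detect_fn(str(text))
    except Exception:
        # The detector could not decide; report unknown instead of crashing
        return None

# Offline stand-in for a real detector, for demonstration only
def fake_detect(text):
    if text.isdigit():
        raise ValueError("no features in text")
    return "en"

print(safe_detect("Software Engineer", fake_detect))
print(safe_detect(float("nan"), fake_detect))
print(safe_detect("   ", fake_detect))
print(safe_detect("12345", fake_detect))
```

With the real library this would be called as df['job_title'].apply(lambda x: safe_detect(x, detect)).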
def translate_to_english(text):
    if pd.isnull(text):
        return text
    else:
        try:
            translator = GoogleTranslator(source='auto', target='en')
            translated = translator.translate(text)
            return translated
        except Exception as e:
            print(f"Error translating '{text}': {e}")
            return None
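Job titles tend to repeat, and each translation is a network request, so on a real dataset it may be worth translating each distinct title only once. A minimal sketch, assuming missing values are represented as None. The helper name translate_unique is hypothetical, and the translator is passed in so the example runs offline with a stand-in:

```python
def translate_unique(titles, translate_fn):
    """Translate each distinct title once and return results in the input order."""
    cache = {}
    results = []
    for title in titles:
        if title is None:
            results.append(None)
            continue
        if title not in cache:
            try:
                cache[title] = translate_fn(title)
            except Exception as exc:
                print(f"Error translating {title!r}: {exc}")
                cache[title] = None
        results.append(cache[title])
    return results

# Offline stand-in for a real translator; it records how often it is called
calls = []
def fake_translate(text):
    calls.append(text)
    return text.upper()

titles = ["Gerente de Proyecto", "Ingeniero de Software", "Gerente de Proyecto", None]
translated = translate_unique(titles, fake_translate)
print(translated)
print(len(calls))
```

With deep-translator, translate_fn could be GoogleTranslator(source='auto', target='en').translate, created once outside the loop instead of once per row.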
non_english_jobs = df[df['language'] != 'en']
I created a copy before modifying it to avoid the SettingWithCopyWarning.
non_english_jobs_copy = non_english_jobs.copy()
non_english_jobs_copy['translated_job_title'] = non_english_jobs_copy['job_title'].apply(translate_to_english)
translated_df = pd.concat([df[df['language'] == 'en'], non_english_jobs_copy])
print(translated_df.head())
The output is as follows:
               job_title language translated_job_title
0      Software Engineer       en                  NaN
1  Ingeniero de Software       de    Software engineer
2   Développeur logiciel       fr   Software developer
3         Data Scientist       it       Data Scientist
4    Gerente de Proyecto       es      Project Manager
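Note that rows detected as English are never translated, so their translated_job_title is NaN. If a single all-English column is wanted, one option is to fall back to the original title. A small sketch, with the frame rebuilt by hand from the printed output so it runs without any network call:

```python
import pandas as pd

# Rebuilt from the output above; None stands for the untranslated English row
translated_df = pd.DataFrame({
    "job_title": ["Software Engineer", "Ingeniero de Software",
                  "Développeur logiciel", "Data Scientist", "Gerente de Proyecto"],
    "language": ["en", "de", "fr", "it", "es"],
    "translated_job_title": [None, "Software engineer", "Software developer",
                             "Data Scientist", "Project Manager"],
})

# Where no translation exists, reuse the original (already English) title
translated_df["translated_job_title"] = (
    translated_df["translated_job_title"].fillna(translated_df["job_title"])
)

# concat puts English rows first; sort_index restores the original row order
translated_df = translated_df.sort_index()
print(translated_df)
```

The language column also shows that detection on very short strings can be off (the Spanish and English titles in rows 1 and 3 were labelled de and it), so translating with source='auto' and not relying on the detected code is the safer choice here.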