I have a string column in a dataframe whose values contain accented characters, for example:
'México', 'Albânia', 'Japão'
How can I replace the accented letters to get this:
'Mexico', 'Albania', 'Japao'
I have tried many of the solutions posted on Stack Overflow, such as this one:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
but the result is disappointing:
strip_accents('México')
>>> 'M?xico'
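For reference, the approach above does work on Python 3 when unicodedata is imported and s is a str; an output like 'M?xico' usually means the input was a byte string or was read with the wrong encoding. A minimal sanity check:

```python
import unicodedata

def strip_accents(s):
    # NFD splits 'é' into 'e' plus a combining accent (category 'Mn'),
    # and the filter drops the combining characters.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

print(strip_accents('México'))  # Mexico
```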
You can use translate:
df = spark.createDataFrame(
[
('1','Japão'),
('2','Irã'),
('3','São Paulo'),
('5','Canadá'),
('6','Tókio'),
('7','México'),
('8','Albânia')
],
["id", "Local"]
)
df.show(truncate = False)
+---+---------+
|id |Local |
+---+---------+
|1 |Japão |
|2 |Irã |
|3 |São Paulo|
|5 |Canadá |
|6 |Tókio |
|7 |México |
|8 |Albânia |
+---+---------+
from pyspark.sql import functions as F
df\
.withColumn('Loc_norm', F.translate('Local',
'ãäöüẞáäčďéěíĺľňóôŕšťúůýžÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ',
'aaousaacdeeillnoorstuuyzAOUSAACDEEILLNOORSTUUYZ'))\
.show(truncate=False)
+---+---------+---------+
|id |Local |Loc_norm |
+---+---------+---------+
|1 |Japão |Japao |
|2 |Irã |Ira |
|3 |São Paulo|Sao Paulo|
|5 |Canadá |Canada |
|6 |Tókio |Tokio |
|7 |México |Mexico |
|8  |Albânia  |Albânia  |
+---+---------+---------+
Note that Albânia comes back unchanged: 'â' is not in the first argument passed to translate, so that character is never mapped. Add any missing characters to both strings if you need them covered.
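The per-character substitution that translate performs can be sketched in plain Python with str.maketrans, which also makes its main limitation visible: any accented character missing from the first string passes through unchanged, so both strings must be extended together.

```python
# Plain-Python sketch of the same per-character substitution as F.translate.
# 'â' is included here, unlike in the Spark example above.
src = 'ãäöüáâčďéěíĺľňóôŕšťúůýž'
dst = 'aaouaacdeeillnoorstuuyz'
table = str.maketrans(src, dst)  # requires len(src) == len(dst)

print('São Paulo'.translate(table))  # Sao Paulo
print('Albânia'.translate(table))   # Albania
```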
A pandas_udf is vectorized, so it outperforms a regular udf. This seems to be the best way to do it in pandas, so we can use it to create a pandas_udf for a PySpark application.
from pyspark.sql import functions as F
import pandas as pd
@F.pandas_udf('string')
def strip_accents(s: pd.Series) -> pd.Series:
return s.str.normalize('NFKD').str.encode('ascii', 'ignore').str.decode('utf-8')
Test:
df = spark.createDataFrame([('México',), ('Albânia',), ('Japão',)], ['country'])
df = df.withColumn('country2', strip_accents('country'))
df.show()
# +-------+--------+
# |country|country2|
# +-------+--------+
# | México| Mexico|
# |Albânia| Albania|
# | Japão| Japao|
# +-------+--------+
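The normalize/encode/decode chain inside the pandas_udf can be tried in plain pandas first, with no Spark session needed. One caveat worth noting: encoding to ASCII with errors='ignore' silently drops any character that has no ASCII decomposition (e.g. 'ß').

```python
import pandas as pd

s = pd.Series(['México', 'Albânia', 'Japão'])
# NFKD decomposes accented letters into base letter + combining mark;
# encoding to ASCII with 'ignore' then drops the combining marks.
out = s.str.normalize('NFKD').str.encode('ascii', 'ignore').str.decode('utf-8')
print(out.tolist())  # ['Mexico', 'Albania', 'Japao']
```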
This solution really does work very well for pyspark, thank you.