Remove non-ASCII and special characters from a PySpark DataFrame column


I am reading data from a CSV file that has about 50 columns; a few of them (4 to 5) contain text with non-ASCII and special characters.

df = spark.read.csv(path, header=True, schema=availSchema)

I am trying to remove all non-ASCII and special characters and keep only English characters, and I tried the following:

df = df['textcolumn'].str.encode('ascii', 'ignore').str.decode('ascii')

There are no spaces in my column names. I get this error:

TypeError: 'Column' object is not callable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-1486957561378215> in <module>
----> 1 InvFilteredDF = InvFilteredDF['SearchResultDescription'].str.encode('ascii', 'ignore').str.decode('ascii')

TypeError: 'Column' object is not callable

Is there another way to accomplish this? Any help would be appreciated.
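For comparison, the attempted chain is the pandas string-accessor API, which works on a pandas Series but not on a Spark Column. A minimal pandas sketch (the column name and sample strings are illustrative, not from the original post):

```python
import pandas as pd

# Hypothetical sample data with one non-ASCII character.
pdf = pd.DataFrame({"textcolumn": ["Spärk!", "plain ascii"]})

# The .str accessor exists on a pandas Series, which is why this chain
# works here but raises TypeError on a PySpark Column object.
cleaned = pdf["textcolumn"].str.encode("ascii", "ignore").str.decode("ascii")
print(cleaned.tolist())  # ['Sprk!', 'plain ascii']
```
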

python pyspark apache-spark-sql pyspark-sql azure-databricks
1 Answer

This should work.

First, create a temporary sample DataFrame:

df = spark.createDataFrame([
    (0, "This is Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Data science is  cool"),
    (3, "This is aSA")
], ["id", "words"])

df.show()

Output:

+---+--------------------+
| id|               words|
+---+--------------------+
|  0|       This is Spark|
|  1|I wish Java could...|
|  2|Data science is  ...|
|  3|         This is aSA|
+---+--------------------+

Now write a UDF, because those string functions cannot be applied directly to a Column type; calling them on a Column is what produces the `'Column' object is not callable` error.

Solution:

from pyspark.sql.functions import udf

def ascii_ignore(x):
    # Drop every character that cannot be encoded as ASCII; pass nulls through
    # so the UDF does not crash on rows where the column is None.
    if x is None:
        return None
    return x.encode('ascii', 'ignore').decode('ascii')

ascii_udf = udf(ascii_ignore)

df.withColumn("foo", ascii_udf('words')).show()

Output:

+---+--------------------+--------------------+
| id|               words|                 foo|
+---+--------------------+--------------------+
|  0|       This is Spark|       This is Spark|
|  1|I wish Java could...|I wish Java could...|
|  2|Data science is  ...|Data science is  ...|
|  3|         This is aSA|         This is aSA|
+---+--------------------+--------------------+