Applying np.sign to a PySpark column does not work, even with a udf

Question · votes: 0 · answers: 1

I am currently trying to convert every row value to its sign using NumPy's built-in np.sign function.

My code:

import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

pd_dataframe = pd.DataFrame({'id': [i for i in range(10)],
                             'values': [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]})

sp_dataframe = spark.createDataFrame(pd_dataframe)
sign_acc_row = F.udf(lambda x: np.sign([x]), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))
sp_dataframe.show()

The error:

Py4JJavaError: An error occurred while calling o2586.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 320.0 failed 1 times, most recent failure: Lost task 0.0 in stage 320.0 (TID 3199, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

Expected output:

   id  values  sign
0   0      10     1
1   1       5     1
2   2       3     1
3   3      -1    -1
4   4       0     0
5   5     -10    -1
6   6      -4    -1
7   7      10     1
8   8       0     0
9   9      10     1

A side question, if I may:

I would also like to create another column whose value increases by 1 whenever the sign differs from that of the previous row.

Expected output:

   id  values  sign  numbering
0   0      10     1          1
1   1       5     1          1
2   2       3     1          1
3   3      -1    -1          2
4   4       0     0          3
5   5     -10    -1          4
6   6      -4    -1          4
7   7      10     1          5
8   8       0     0          6
9   9      10     1          7
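As an aside on the numbering logic the question asks for (this is not part of the posted answer): the column increments each time the sign differs from the previous row. In Spark this would typically be built with a Window, lag, and a running sum; the underlying logic can be sketched in plain Python against the sample data:

```python
values = [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]

def sign(x):
    # plain-Python equivalent of np.sign for integers
    return (x > 0) - (x < 0)

signs = [sign(v) for v in values]

# increment a counter whenever the sign differs from the previous row
numbering, counter, prev = [], 0, object()
for s in signs:
    if s != prev:
        counter += 1
    numbering.append(counter)
    prev = s

print(numbering)  # [1, 1, 1, 2, 3, 4, 4, 5, 6, 7]
```

This reproduces the `numbering` column of the expected output above.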
pyspark
1 Answer
Score: 1

You were almost there. np.sign returns a numpy.int64 object, which PySpark does not understand. To make them compatible, you can do the following:

sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
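To see why the cast is needed, the behavior can be checked with NumPy alone, no Spark required (the exact scalar type, e.g. numpy.int64, is platform-dependent):

```python
import numpy as np

raw = np.sign(-7)
print(type(raw))         # a NumPy integer scalar, not the built-in int

cast = int(np.sign(-7))  # casting yields a plain Python int,
print(type(cast))        # which IntegerType can serialize: <class 'int'>
print(cast)              # -1
```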