PySpark cosine similarity on a DataFrame

Problem description (votes: 0, answers: 1)

I have a PySpark DataFrame df1 that looks like this:

Customer1  Customer2  v_cust1   v_cust2
   1           2         0.9      0.1
   1           3         0.3      0.4
   1           4         0.2      0.9
   2           1         0.8      0.8

I want to compute the cosine similarity between v_cust1 and v_cust2 and end up with something like this:

Customer1  Customer2  v_cust1   v_cust2  cosine_sim
   1           2         0.9      0.1       0.1
   1           3         0.3      0.4       0.9
   1           4         0.2      0.9       0.15
   2           1         0.8      0.8       1

I have a Python function that takes numbers / arrays of numbers, like this:

import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

How can I create the cosine_sim column in the DataFrame using a udf? Can I pass several columns to the udf instead of just one, so that it can call the cos_sim function?
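
For reference, a udf can be given several columns as arguments. A minimal sketch of wrapping cos_sim that way (an assumption for illustration: it only gives meaningful values if v_cust1 and v_cust2 hold array-typed vectors, since with a single scalar per row the similarity is trivially 1):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# Hypothetical registration of cos_sim as a row-wise udf over two columns
cos_sim_udf = udf(cos_sim, DoubleType())
df1 = df1.withColumn("cosine_sim", cos_sim_udf(col("v_cust1"), col("v_cust2")))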

python apache-spark pyspark user-defined-functions
1 Answer

0 votes

It will be more efficient if you use a pandas_udf.

It performs better than plain Spark udfs on vectorized operations: see Introducing Pandas UDF for PySpark.

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.functions import PandasUDFType, pandas_udf

a, b = "v_cust1", "v_cust2"

# Output schema of the grouped-map UDF: one row per group
schema = "Customer1 long, Customer2 long, cosine_sim double"

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def cos_sim(pdf):
    # pdf is a pandas DataFrame containing all rows of one group
    sim = float(np.dot(pdf[a], pdf[b]) / (np.linalg.norm(pdf[a]) * np.linalg.norm(pdf[b])))
    return pdf[["Customer1", "Customer2"]].head(1).assign(cosine_sim=sim)

# Assuming that you want to group by Customer1 and Customer2 and treat each
# group's rows as the arrays
df2 = df.groupby(["Customer1", "Customer2"]).apply(cos_sim)

# But if you want to send the entire columns, make a column with the same
# value in all rows and group by it, e.g.
df3 = df.withColumn("group", F.lit("group_a")).groupby("group").apply(cos_sim)
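
If the goal is to keep the similarity as an extra column next to the original rows (as in the desired output above) rather than as a separate aggregated frame, one option, not shown in the original answer, is to join the grouped result back in:

# Hypothetical follow-up: attach the per-(Customer1, Customer2) similarity
# computed in df2 back onto the original rows of df
result = df.join(df2, on=["Customer1", "Customer2"], how="left")
result.show()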