pyspark MinHashLSH Jaccard distance: distances for some pairs are not computed


I am trying to compute the Jaccard distance between some products using pyspark's MinHashLSH.

The toy data I am using is:

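from pyspark.sql.types import StringType

# Load the toy data (assumes an existing SparkSession named `spark`) and cast ticket to string
sdf = spark.read.csv('dt.csv', header=True, sep=',', inferSchema=True)
sdf = sdf.withColumn("ticket", sdf["ticket"].cast(StringType()))
sdf.show(60)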

+------+--------+-----+
|ticket|   Brand|value|
+------+--------+-----+
|     0|     YL |    1|
|     1|     YL |    1|
|     2|     YL |    1|
|     3|     YL |    1|
|     7|     YL |    1|
|     3|   Paco |    1|
|     7|   Paco |    1|
|     0|Lacoste |    1|
|     1|Lacoste |    1|
|     2|Lacoste |    1|
|     4|Lacoste |    1|
|     5|Lacoste |    1|
|     6|Lacoste |    1|
|     0|   Dior |    1|
|     4|   Dior |    1|
|     4|   Boss |    1|
|     5|   Boss |    1|
|     6|   Boss |    1|
|     0|Channel |    1|
|     1|Channel |    1|
|     4|Channel |    1|
|     8|   Boss |    1|
|     8|Channel |    1|
+------+--------+-----+

I want to compute the Jaccard distance between these brands, so I followed the example in the documentation:

from pyspark.sql import functions as F

# Collect the set of tickets in which each brand appears
order_matrix_jaccard_sdf = sdf.groupby('Brand').agg(F.collect_set('ticket').alias('ticket'))
order_matrix_jaccard_sdf.show()

Output:

+--------+------------------+
|   Brand|            ticket|
+--------+------------------+
|Channel |      [1, 8, 4, 0]|
|   Paco |            [3, 7]|
|Lacoste |[1, 2, 5, 4, 6, 0]|
|     YL |   [3, 1, 2, 7, 0]|
|   Boss |      [5, 8, 4, 6]|
|   Dior |            [4, 0]|
+--------+------------------+

Then I apply the model:

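from pyspark.ml.feature import CountVectorizer, MinHashLSH

# Turn each ticket set into a sparse feature vector
cv = CountVectorizer(inputCol='ticket', outputCol="features")
model = cv.fit(order_matrix_jaccard_sdf)
result = model.transform(order_matrix_jaccard_sdf)
# Fit MinHash LSH on those vectors
mh = MinHashLSH(inputCol="features", outputCol="hashes")
model_mh = mh.fit(result)
model_mh.transform(result)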
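To make the MinHash step a little more concrete, here is a minimal pure-Python sketch of the general idea. It is not Spark's actual implementation: the helper names (`make_hash_functions`, `minhash_signature`, `is_candidate_pair`), the prime, the seed and the number of hash functions are all assumptions chosen for illustration. The point it shows is that an LSH-based join only compares pairs whose signatures agree in at least one position, so it is approximate by design. In Spark, the number of hash tables is set through the `numHashTables` parameter of `MinHashLSH`.

```python
import random

PRIME = 2038074743  # a large prime; an assumed choice for this sketch


def make_hash_functions(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a * (x + 1) + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(items, hash_functions):
    """For every hash function, keep the smallest hash value over the set."""
    return [min((a * (x + 1) + b) % PRIME for x in items) for a, b in hash_functions]


def is_candidate_pair(sig_a, sig_b):
    """Two sets become a candidate pair if any signature position matches."""
    return any(u == v for u, v in zip(sig_a, sig_b))


# Ticket sets taken from the toy data above
yl = {0, 1, 2, 3, 7}
paco = {3, 7}
boss = {4, 5, 6, 8}

hash_functions = make_hash_functions(num_hashes=5)
sig_yl = minhash_signature(yl, hash_functions)
sig_paco = minhash_signature(paco, hash_functions)
sig_boss = minhash_signature(boss, hash_functions)

# Whether an overlapping pair collides depends on the random hash functions drawn
print("YL vs Paco candidate:", is_candidate_pair(sig_yl, sig_paco))
# Disjoint sets cannot share a minimum here, because each hash is injective on small ids
print("Paco vs Boss candidate:", is_candidate_pair(sig_paco, sig_boss))
```

With more hash functions, overlapping sets are more likely to agree in at least one position, at the cost of more work per row.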

Then I compute the Jaccard distance:

# Self-join; the threshold of 2 is above the maximum Jaccard distance of 1
jaccard_dist = model_mh.approxSimilarityJoin(result.select(['Brand', 'features']), result.select(['Brand', 'features']), 2)
jaccard_dist.show(20)

+--------------------+--------------------+------------------+
|            datasetA|            datasetB|           distCol|
+--------------------+--------------------+------------------+
|[Lacoste , (9,[0,...|[Dior , (9,[0,1],...|0.6666666666666667|
|[Dior , (9,[0,1],...|[Lacoste , (9,[0,...|0.6666666666666667|
|[Lacoste , (9,[0,...|[Lacoste , (9,[0,...|               0.0|
|[Dior , (9,[0,1],...|[Channel , (9,[0,...|               0.5|
|[YL , (9,[0,2,3,4...|[YL , (9,[0,2,3,4...|               0.0|
|[YL , (9,[0,2,3,4...|[Channel , (9,[0,...|0.7142857142857143|
|[Boss , (9,[1,5,6...|[Boss , (9,[1,5,6...|               0.0|
|[YL , (9,[0,2,3,4...|[Lacoste , (9,[0,...|             0.625|
|[Lacoste , (9,[0,...|[YL , (9,[0,2,3,4...|             0.625|
|[Channel , (9,[0,...|[Channel , (9,[0,...|               0.0|
|[Dior , (9,[0,1],...|[Dior , (9,[0,1],...|               0.0|
|[Dior , (9,[0,1],...|[YL , (9,[0,2,3,4...|0.8333333333333334|
|[Lacoste , (9,[0,...|[Channel , (9,[0,...|0.5714285714285714|
|[Channel , (9,[0,...|[Dior , (9,[0,1],...|               0.5|
|[Channel , (9,[0,...|[YL , (9,[0,2,3,4...|0.7142857142857143|
|[YL , (9,[0,2,3,4...|[Dior , (9,[0,1],...|0.8333333333333334|
|[Channel , (9,[0,...|[Lacoste , (9,[0,...|0.5714285714285714|
|[Paco , (9,[3,4],...|[Paco , (9,[3,4],...|               0.0|
+--------------------+--------------------+------------------+

The pairs involving 'Paco' are missing. I computed the distances with pandas and got the same results, except that the Paco pairs are missing from the Spark output.
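As a way to cross-check the Spark output, the exact Jaccard distances can be computed directly from the ticket sets. The sketch below uses plain Python rather than the pandas code from the question (which is not shown in the post); the `tickets` dictionary is copied by hand from the `collect_set` output above, with the trailing spaces in the brand names stripped, and `jaccard_distance` is a helper name introduced here.

```python
from itertools import combinations

# Ticket sets per brand, copied from the collect_set output above
tickets = {
    "Channel": {0, 1, 4, 8},
    "Paco": {3, 7},
    "Lacoste": {0, 1, 2, 4, 5, 6},
    "YL": {0, 1, 2, 3, 7},
    "Boss": {4, 5, 6, 8},
    "Dior": {0, 4},
}


def jaccard_distance(a, b):
    """Exact Jaccard distance: 1 - |A & B| / |A | B|."""
    return 1.0 - len(a & b) / len(a | b)


# One entry per unordered pair of distinct brands, keyed in alphabetical order
exact = {
    (x, y): jaccard_distance(tickets[x], tickets[y])
    for x, y in combinations(sorted(tickets), 2)
}

for (x, y), dist in exact.items():
    print(f"{x:8s} {y:8s} {dist:.4f}")
```

For this data, Paco and YL share 2 tickets out of 5 in their union, giving a distance of 0.6, while Paco shares no ticket with any other brand, giving a distance of 1.0 for those pairs.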

How can I get the Paco pairs as well? And why am I not getting them?

Thanks

pandas apache-spark pyspark distance similarity
1 Answer
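Sorry, this was my mistake: I was only showing the first 20 rows. When I collect all the results [...]. Since there is no [...], I will leave this toy example here, because [...]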
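The display limit mentioned in this answer could be lifted along the following lines. This is only a sketch: `show_all_rows` is a hypothetical helper name, and it relies solely on the `count()` and `show(n, truncate=...)` methods of a Spark DataFrame. Printing every row is reasonable for a toy dataset like this one, but probably not for a large join result.

```python
def show_all_rows(df):
    """Print every row of a DataFrame instead of only the first 20."""
    n_rows = df.count()  # total number of rows in the result
    df.show(n_rows, truncate=False)  # truncate=False prints full cell contents
    return n_rows


# Intended usage with the join result from the question:
# show_all_rows(jaccard_dist)
```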