I have a pandas DataFrame (say df) of shape (70000 x 10). The head of the DataFrame is shown below:
0_x 1_x 2_x ... 7_x 8_x 9_x
userid ...
1000010249674395648 0.000007 0.999936 0.000007 ... 0.000007 0.000007 0.000007
1000282310388932608 0.000060 0.816790 0.000060 ... 0.000060 0.000060 0.000060
1000290654755450880 0.000050 0.000050 0.000050 ... 0.000050 0.191159 0.000050
1000304603840241665 0.993157 0.006766 0.000010 ... 0.000010 0.000010 0.000010
1000600081165438977 0.000064 0.970428 0.000064 ... 0.000064 0.000064 0.000064
I want to find the pairwise cosine distances between the user IDs. For example:
cosine_distance(1000010249674395648, 1000282310388932608) = 0.9758776214797362
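To be precise about what I mean by cosine distance: 1 minus the cosine similarity of the two row vectors. A minimal helper (the function name is my own):

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity of two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# e.g. cosine_distance(df.loc[1000010249674395648], df.loc[1000282310388932608])
```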
I tried the methods below, but all of them throw a memory error while computing the cosine distances, since CPU memory is limited (the full 70000 x 70000 float64 result alone is roughly 39 GB, well beyond the 16 GB of RAM):
scikit-learn's cosine_similarity:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(df)
A faster vectorized solution found online:
import numpy as np
import pandas as pd

def get_cosine_sim_df(df):
    topic_vectors = df.values
    norm_topic_vectors = topic_vectors / np.linalg.norm(topic_vectors, axis=-1)[:, np.newaxis]
    cosine_sim = np.dot(norm_topic_vectors, norm_topic_vectors.T)
    cosine_sim_df = pd.DataFrame(data=cosine_sim, index=df.index, columns=df.index)
    return cosine_sim_df

cosine_sim = get_cosine_sim_df(df)
System hardware overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro11,4
Processor Name: Quad-Core Intel Core i7
Processor Speed: 2.2 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
I am looking for an efficient way to compute the pairwise cosine distances faster within the limits of CPU memory, e.g. with a PySpark DataFrame or a pandas batch-processing technique, rather than processing the whole DataFrame at once.
Any suggestions/approaches are appreciated.
FYI - I am using Python 3.7
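One shape the batch idea could take on the pandas/NumPy side: normalize the rows once, then compute the similarity matrix one block of rows at a time and consume each block (e.g. keep only the top-k neighbours per user) instead of materializing the whole 70000 x 70000 matrix. A sketch, with names of my own choosing and not tested at 70000 rows:

```python
import numpy as np

def cosine_sim_chunks(df, chunk_size=5000):
    """Yield (index_slice, similarity_block) pairs so that peak memory is
    only chunk_size x n_rows instead of n_rows x n_rows."""
    X = df.values
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows once
    for start in range(0, X.shape[0], chunk_size):
        block = X[start:start + chunk_size] @ X.T     # cosine sims for this chunk
        yield df.index[start:start + chunk_size], block

# usage: stream the blocks and keep e.g. only the 5 nearest neighbours per user
# for ids, block in cosine_sim_chunks(df):
#     nearest = np.argsort(-block, axis=1)[:, 1:6]
```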
I am using Spark 2.4 and Python 3.7
# build spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("cos_sim") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
Convert your pandas df to a Spark df:
# Pandas to Spark
df = spark.createDataFrame(pand_df)  # the session built above is named `spark`
I generated some random data instead:
import random
import pandas as pd
from pyspark.sql.functions import monotonically_increasing_id
def generate_random_data(num_usrs=20, num_cols=10):
    cols = [str(i) + "_x" for i in range(num_cols)]
    usrsdata = [[random.random() for _ in range(num_cols)] for _ in range(num_usrs)]
    # return pd.DataFrame(usrsdata, columns=cols)
    return spark.createDataFrame(data=usrsdata, schema=cols)
df = generate_random_data()
df = df.withColumn("uid", monotonically_increasing_id())
df.limit(5).toPandas() # just for nice display of df (df not actually changed)
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=[c for c in df.columns if c != "uid"], outputCol="features")  # keep uid out of the feature vector
assembled = assembler.transform(df).select(['uid', 'features'])
assembled.limit(2).toPandas()
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(inputCol="features", outputCol="norm")
data = normalizer.transform(assembled)
data.limit(2).toPandas()
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(data.select("uid", "norm").rdd \
    .map(lambda row: IndexedRow(row.uid, row.norm.toArray()))).toBlockMatrix()
dot = mat.multiply(mat.transpose())
dot.toLocalMatrix().toArray()[:2] # displaying first 2 users only
Reference: Calculating the cosine similarity between all the rows of a dataframe in pyspark