Using a csr_matrix of item similarities to get the most similar items to item X without converting the csr_matrix to a dense matrix

Question

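I have purchase data (df_temp). I managed to replace the Pandas DataFrame with a sparse csr_matrix, because I have a lot of products (89,000) for which I need the user-item information (purchased or not purchased) and then have to compute the similarities between products.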

First, I converted the Pandas DataFrame to a Numpy array:

 df_user_product = df_temp[['user_id','product_id']].copy()
 ar1 = np.array(df_user_product.to_records(index=False))

Second, I created a coo_matrix, because it is known for fast sparse matrix construction.

 rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)
 cols, c_pos = np.unique(ar1['user_id'], return_inverse=True)
 s = sparse.coo_matrix((np.ones(r_pos.shape,int), (r_pos, c_pos)))

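Third, for matrix computations it is better to use csr_matrix or csc_matrix, so I used csr_matrix: I have product_id in the rows, and csr_matrix gives more efficient row slicing than csc_matrix.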

    sparse_csr_mat = s.tocsr()
    sparse_csr_mat[sparse_csr_mat > 1] = 1
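To make the construction steps above concrete, here is a minimal sketch on made-up data (the ids below are hypothetical, not from df_temp). It shows that np.unique with return_inverse=True yields the row/column positions, that repeated purchases are summed when the coo_matrix is converted to CSR, and why the values are then clipped to 1. The clipping is done on the stored values directly, which is an assumption of this sketch and is meant to be equivalent to the masked assignment used above.

```python
import numpy as np
from scipy import sparse

# Hypothetical purchases: user 10 bought product "b" twice.
product_ids = np.array(["b", "a", "b", "c", "a"])
user_ids = np.array([10, 10, 10, 20, 20])

# rows/cols hold the sorted unique ids; r_pos/c_pos are positions into them.
rows, r_pos = np.unique(product_ids, return_inverse=True)
cols, c_pos = np.unique(user_ids, return_inverse=True)

s = sparse.coo_matrix(
    (np.ones(r_pos.shape, int), (r_pos, c_pos)),
    shape=(len(rows), len(cols)),
)

csr = s.tocsr()                  # duplicate (row, col) entries are summed here
counts = csr.toarray().tolist()  # [[1, 1], [2, 0], [0, 1]]

csr.data[csr.data > 1] = 1       # clip stored values to "purchased or not"
binary = csr.toarray().tolist()  # [[1, 1], [1, 0], [0, 1]]
```

Only the tiny toy matrix is densified here, purely for inspection; nothing in the construction itself needs a dense array.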

Then I computed the cosine similarity between products and stored the result in similarities:

import sklearn.preprocessing as pp
col_normed_mat = pp.normalize(sparse_csr_mat, axis=1)
similarities = col_normed_mat * col_normed_mat.T
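As a sanity check on why this product gives cosine similarities: with axis=1 each row (each product vector) is scaled to unit L2 length, so the dot product of two rows is their cosine. The sketch below reproduces that with scipy only on a hypothetical 3x3 matrix; the manual normalization via sparse.diags is a stand-in chosen for this illustration, not the code used above, and it assumes no product row is empty.

```python
import numpy as np
from scipy import sparse

# Hypothetical purchase matrix: rows = products, columns = users.
m = sparse.csr_matrix(np.array([[1, 1, 0],
                                [1, 0, 0],
                                [0, 1, 1]], dtype=float))

# L2 norm of every row (assumes no all-zero rows, otherwise this divides by 0).
row_norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())
row_normed = sparse.diags(1.0 / row_norms) @ m

# Product of the row-normalized matrix with its transpose; the result stays sparse.
sims = row_normed @ row_normed.T

# Cross-check one entry against the textbook cosine formula.
a = m[0].toarray().ravel()
b = m[1].toarray().ravel()
expected = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Products 1 and 2 in this toy matrix share no user, so that entry is simply not stored, which is what keeps the real 89447x89447 result small.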

which gives:

<89447x89447 sparse matrix of type '<type 'numpy.float64'>'
    with 1332945 stored elements in Compressed Sparse Row format>

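Now I would like to end up with a dictionary that holds, for each product, its 5 most similar products. How can I do that? Because of memory limits, I don't want to convert the sparse matrix to a dense array. But I also don't know whether a csr_matrix can be accessed the way an array can, for example by looking up index = product_id and getting the whole row for that index, so that I would have all products similar to product_id and could sort them by cosine similarity to get the 5 most similar ones.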

For example, one entry in the similarity matrix:

(product_id1, product_id2) 0.45

How can I keep only the X (in my case 5) most similar products to product_id1, without converting the matrix to an array?

Looking around Stack Overflow, I think lil_matrix could be used for this case? How?

Thanks for your help!

Answer 1

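I finally figured out how to get the 5 most similar items for each product: convert the matrix with .tolil(), then turn each row into a numpy array and use argsort to pick the 5 most similar items. I used @hpaulj's solution from this link.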

def max_n(row_data, row_indices, n):
    # argsort is ascending, so the last n positions hold the n largest values
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i]  # do the sparse indices matter?

    return top_values, top_indices, i

Then I applied it to a single row as a test (arr_ll is the matrix returned by .tolil()):

top_v, top_ind, ind = max_n(np.array(arr_ll.data[0]),np.array(arr_ll.rows[0]),5)
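The same idea can be extended to every row to build the dictionary asked for in the question. The sketch below is one possible way, not the code from the linked answer: it skips lil altogether and reads each row straight from the CSR arrays (.indptr delimits each row's slice of .indices and .data). The helper name top_n_per_row and the toy similarity values are hypothetical, and dropping the diagonal entry assumes a product should not be listed as similar to itself.

```python
import numpy as np
from scipy import sparse

def top_n_per_row(sim_csr, n=5):
    """Return {row: [(col, score), ...]} with up to n best columns per row."""
    sim_csr = sim_csr.tocsr()
    result = {}
    for i in range(sim_csr.shape[0]):
        start, end = sim_csr.indptr[i], sim_csr.indptr[i + 1]
        cols = sim_csr.indices[start:end]   # column indices stored for row i
        vals = sim_csr.data[start:end]      # matching similarity values
        keep = cols != i                    # drop the self-similarity entry
        cols, vals = cols[keep], vals[keep]
        order = np.argsort(vals)[::-1][:n]  # positions of the n largest values
        result[i] = [(int(c), float(v)) for c, v in zip(cols[order], vals[order])]
    return result

# Hypothetical 4x4 similarity matrix.
sims = sparse.csr_matrix(np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.0, 0.5],
    [0.9, 0.0, 1.0, 0.4],
    [0.0, 0.5, 0.4, 1.0],
]))

top = top_n_per_row(sims, n=2)
# top[0] -> [(2, 0.9), (1, 0.2)]
```

Only one row's stored entries are touched at a time, so memory use stays proportional to the number of stored elements. For long rows, argpartition (as in the commented-out line of max_n) avoids a full sort, at the cost of returning the top n in no particular order.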

What I need is top_indices, the indices of the 5 most similar products, but those indices are not the real product_id values. I mapped them when I built the coo_matrix:

rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)

But how do I get the real product_id back from an index?

For example, right now I have:

top_ind = [2 1 34 9 123]

How do I know which product_id 2 corresponds to, which one 1 corresponds to, and so on?
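One way to recover the ids, sketched on made-up values: np.unique returns the sorted unique product_id values in rows, and r_pos holds each purchase's position in that array, so matrix row k belongs to rows[k]. Indexing rows with the array of top indices should therefore map them back. The ids below are hypothetical.

```python
import numpy as np

# Hypothetical product ids, one per purchase record.
product_ids = np.array([501, 17, 501, 230, 17, 999])

rows, r_pos = np.unique(product_ids, return_inverse=True)
# rows  -> [ 17 230 501 999]   (matrix row k <-> rows[k])
# r_pos -> [2 0 2 1 0 3]       (row position of each purchase)

top_ind = np.array([2, 0, 3])    # row indices taken from the similarity matrix
top_product_ids = rows[top_ind]  # array indexing maps them back to real ids
# top_product_ids -> [501  17 999]
```

Because the similarity matrix is products x products, the same rows array translates both its row and its column indices.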
