Create a sparse matrix with pandas and fill it with values from one column of a .dat file, indexed [x, y] by the file's other two columns


I have a .dat file with three columns: userID, artistID, and weight. Using Python, I read the data into a pandas DataFrame with data = pd.read_table('train.dat').

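I want to create a sparse matrix (/2D array) that uses the values in the first two columns of the DataFrame ('userID', 'artistID') as indices and the values in the third column ('weight') as the entries. Index combinations that do not appear in the DataFrame should be NaN.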

I tried creating an empty numpy array and filling it with a for loop, but that takes a long time (train.dat has about 100,000 rows).

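import csv
import numpy as np

# Read the tab-separated file, skipping the header row.
with open("train.dat", "rt") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)
    data = [d for d in reader]

data = np.array(data, dtype=float)
# Rows are indexed by userID (column 0), columns by artistID (column 1).
row = int(data[:, 0].max()) + 1
col = int(data[:, 1].max()) + 1

# Start from an all-NaN array so missing combinations stay NaN.
empty = np.full((row, col), np.nan)

# Slow part: one Python-level assignment per row of the file.
for d in data:
    empty[int(d[0]), int(d[1])] = d[2]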

I also tried creating a coo_matrix and converting it to a csr_matrix (so that I can access the data by index), but the indices get reset.

import scipy.sparse as sps
import pandas as pd

data = pd.read_table('train.dat')
matrix = sps.coo_matrix((data.weight, (data.index.labels[0], data.index.labels[1])))
matrix = matrix.tocsr()

Sample data:

userID    artistID  weight
    45           7      0.7114779874213837
   204         144      0.46399999999999997
    36         650      2.4232887490165225
   140         146      1.0146699266503667
   170          31      1.4124783362218372
   240         468      0.6529992406985573
python pandas numpy matrix scipy
1 Answer (3 votes)

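Copying your sample data to a file (the session below assumes pandas is imported as pd, numpy as np, and scipy.sparse as sparse):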

In [290]: data = pd.read_csv('stack48133358.txt',delim_whitespace=True)
In [291]: data
Out[291]: 
   userID  artistID    weight
0      45         7  0.711478
1     204       144  0.464000
2      36       650  2.423289
3     140       146  1.014670
4     170        31  1.412478
5     240       468  0.652999
In [292]: M = sparse.csr_matrix((data.weight, (data.userID, data.artistID)))
In [293]: M
Out[293]: 
<241x651 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [294]: print(M)
  (36, 650)     2.42328874902
  (45, 7)       0.711477987421
  (140, 146)    1.01466992665
  (170, 31)     1.41247833622
  (204, 144)    0.464
  (240, 468)    0.652999240699
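
For readers who want to reproduce the session without a file on disk, here is a minimal self-contained sketch of the same construction. It inlines the six sample rows from the question through io.StringIO (an assumption made only to keep the example runnable) and uses sep=r"\s+" to split on whitespace. It also shows that entries which were never stored read back as 0.0, not NaN.

```python
import io

import pandas as pd
from scipy import sparse

# Sample rows from the question, inlined so the sketch runs without train.dat.
raw = """userID artistID weight
45 7 0.7114779874213837
204 144 0.46399999999999997
36 650 2.4232887490165225
140 146 1.0146699266503667
170 31 1.4124783362218372
240 468 0.6529992406985573
"""
data = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# csr_matrix((values, (row_indices, col_indices))): the IDs are used
# directly as row and column positions, so the shape is (max+1, max+1).
M = sparse.csr_matrix(
    (
        data["weight"].to_numpy(),
        (data["userID"].to_numpy(), data["artistID"].to_numpy()),
    )
)

print(M.shape)   # (241, 651)
print(M[45, 7])  # weight stored for userID 45, artistID 7
print(M[0, 0])   # 0.0: unstored entries read back as zero, not NaN
```

Because the IDs are passed as coordinates, M[userID, artistID] returns the weight for that pair. If the same (userID, artistID) pair occurs more than once in the file, this constructor sums the duplicate weights.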

I can also load the file with genfromtxt:

In [307]: data = np.genfromtxt('stack48133358.txt',dtype=None, names=True)
In [308]: data
Out[308]: 
array([( 45,   7,  0.71147799), (204, 144,  0.464     ),
       ( 36, 650,  2.42328875), (140, 146,  1.01466993),
       (170,  31,  1.41247834), (240, 468,  0.65299924)],
      dtype=[('userID', '<i4'), ('artistID', '<i4'), ('weight', '<f8')])
In [309]: M = sparse.csr_matrix((data['weight'], (data['userID'], data['artistID'])))
In [310]: M
Out[310]: 
<241x651 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
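
The question asked for NaN at index combinations that are missing from the file, whereas a scipy sparse matrix treats unstored entries as zero. If a dense array is acceptable, the for loop from the question can be replaced by a single vectorized assignment. The following is only a sketch under that assumption. The arrays user_ids, artist_ids, and weights are hypothetical stand-ins for the three columns of train.dat.

```python
import numpy as np

# Hypothetical stand-ins for the three columns of train.dat
# (values taken from the sample data in the question).
user_ids = np.array([45, 204, 36, 140, 170, 240])
artist_ids = np.array([7, 144, 650, 146, 31, 468])
weights = np.array([
    0.7114779874213837,
    0.46399999999999997,
    2.4232887490165225,
    1.0146699266503667,
    1.4124783362218372,
    0.6529992406985573,
])

# Allocate an all-NaN array large enough to hold every ID as an index.
dense = np.full((user_ids.max() + 1, artist_ids.max() + 1), np.nan)

# Integer-array indexing assigns all weights in one call, with no Python loop.
dense[user_ids, artist_ids] = weights

print(dense.shape)       # (241, 651)
print(dense[36, 650])    # weight for userID 36, artistID 650
print(dense[0, 0])       # nan: this combination is not in the data
```

The trade-off is memory: the dense array holds one float for every possible (userID, artistID) pair, so its size grows with the product of the two largest IDs, while the sparse matrix stores only the rows that are actually present.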