I have a list of all possible labels:
all_labels = ['a', 'b', 'c', 'd', 'e',
              'f', 'g', 'h', 'i', 'j',
              'k', 'l', 'm', 'n', 'o',
              'p', 'q', 'r', 's', 't',
              'u', 'v', 'w', 'z']
and a data frame in which each row holds the scores for that row's specific labels:
import pandas as pd
data = {'labels': [['b','a'],['q'],['n','j','v']], 'scores':[[0.1,0.2],[0.7],[0.3,0.5,0.1]]}
df = pd.DataFrame(data)
I am trying to create a new column in which each row's entry is a sparse matrix (a single-row vector). Here is my approach:
from scipy import sparse
from scipy.sparse import coo_matrix
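def labels_to_sparse(input_):
    # Unpack the full label list plus this row's labels and scores.
    all_, labels_, scores_ = input_
    rows = [0] * len(all_)   # every entry sits in row 0 (single-row vector)
    cols = range(len(all_))
    vals = [0] * len(all_)
    # Write each score at the position of its label in the full list.
    for i in range(len(labels_)):
        vals[all_.index(labels_[i])] = scores_[i]
    return coo_matrix((vals, (rows, cols)))

df['sparse_row'] = df.apply(
    lambda x: labels_to_sparse((all_labels, x['labels'], x['scores'])), axis=1
)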
df
This works, but it is very slow on large data. Is there a way to vectorize this function instead of using apply?
In the end, I want to use this data frame to build a matrix:
my_result = sparse.vstack(df['sparse_row'].values)
my_result.todense()
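For reference, here is a rough sketch of one direction that might avoid apply altogether: build the whole matrix in a single constructor call from flat row, column, and value arrays, skipping the per-row sparse objects. This is only a sketch under assumptions, not a benchmarked solution. It assumes every label in df['labels'] appears in all_labels and that no label repeats within a row (repeated entries would be summed by the constructor rather than overwritten, unlike in the function above). The helper name label_to_col is made up for illustration.

```python
from itertools import chain

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

all_labels = list('abcdefghijklmnopqrstuvw') + ['z']  # same 24 labels as above
data = {'labels': [['b', 'a'], ['q'], ['n', 'j', 'v']],
        'scores': [[0.1, 0.2], [0.7], [0.3, 0.5, 0.1]]}
df = pd.DataFrame(data)

# Hypothetical helper: map each label to its column once,
# instead of calling list.index for every entry.
label_to_col = {label: i for i, label in enumerate(all_labels)}

# Number of labels per row; used to repeat each row index once per entry.
lengths = df['labels'].str.len().to_numpy()
total = int(lengths.sum())

rows = np.repeat(np.arange(len(df)), lengths)
cols = np.fromiter(
    (label_to_col[label] for label in chain.from_iterable(df['labels'])),
    dtype=np.intp, count=total)
vals = np.fromiter(chain.from_iterable(df['scores']), dtype=float, count=total)

# Only the non-zero entries are stored; shape is given explicitly.
my_result = csr_matrix((vals, (rows, cols)), shape=(len(df), len(all_labels)))
```

Calling my_result.todense() on this should give the same dense matrix as the vstack version for the sample data, but I have not measured how it behaves on large inputs.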