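I am following this tutorial, which uses the following dataset from Quora:
I have cleaned and tokenized the data in the q1_clean and q2_clean columns.
I have then trained a Word2Vec model, starting from the pre-trained GoogleNews vectors, with the following code.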
import pandas as pd
from gensim.models import Word2Vec

# step 1: concatenate the Question1 and Question2 columns and initialize the word2vec model
nData = pd.Series(pd.concat([data['q1_clean'], data['q2_clean']]))
model_w2v = Word2Vec(nData, size=300)
# step 2: intersect the initialized word2vec model with the pre-trained GoogleNews word2vec vectors
model_w2v.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', lockf=1.0, binary=True)
# step 3: improve model with transfer-learning using the training data
model_w2v.train(nData, total_examples=model_w2v.corpus_count, epochs=10)
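Now I need to do feature engineering, and for that I have the following function, which computes the weighted average of the pairwise distances between the word vectors of two questions.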
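import numpy as np
from sklearn.metrics import pairwise_distances

def get_pairwise_distance(word1, word2, weight1, weight2, method='euclidean'):
    # word1, word2: arrays of word vectors; weight1, weight2: one weight per word vector
    if word1.size == 0 or word2.size == 0:
        return np.nan
    dist_matrix = pairwise_distances(word1, word2, metric=method)
    # weight each (i, j) distance by weight1[i] * weight2[j]
    return np.average(dist_matrix, weights=np.matmul(weight1.reshape(-1, 1), weight2.reshape(-1, 1).T))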
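To make the weighting concrete, here is a small NumPy-only sketch of the same computation on toy data. It assumes Euclidean distance and replaces `pairwise_distances` with an explicit broadcast; the name `weighted_mean_distance` and the toy arrays are made up for illustration and do not come from the tutorial.

```python
import numpy as np

def weighted_mean_distance(vecs1, vecs2, w1, w2):
    """Weighted mean of all pairwise Euclidean distances between two questions.

    vecs1: (n1, d) word vectors of question 1; w1: (n1,) weights, one per word.
    vecs2: (n2, d) word vectors of question 2; w2: (n2,) weights, one per word.
    """
    if vecs1.size == 0 or vecs2.size == 0:
        return np.nan
    # dist[i, j] = Euclidean distance between word i of q1 and word j of q2
    diff = vecs1[:, None, :] - vecs2[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # weights[i, j] = w1[i] * w2[j], the same outer product as the matmul above
    weights = np.outer(w1, w2)
    return np.average(dist, weights=weights)

# Toy data (hypothetical): question 1 has two words, question 2 has one.
q1_vecs = np.array([[0.0, 0.0], [3.0, 4.0]])
q2_vecs = np.array([[0.0, 0.0]])
q1_w = np.array([1.0, 3.0])
q2_w = np.array([2.0])
result = weighted_mean_distance(q1_vecs, q2_vecs, q1_w, q2_w)
print(result)
```

Note that each weight array needs exactly one entry per word vector of its question, otherwise the weight matrix will not match the shape of the distance matrix.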
Here I have computed the tf-idf values that are used as the weights:
X_train_tokens = get_tokenized_questions(data=X_train)
from sklearn.feature_extraction.text import TfidfVectorizer
pass_through = lambda x:x
tfidf = TfidfVectorizer(analyzer=pass_through)
# compute tf-idf weights for the words in the training set questions
X_tfidf = tfidf.fit_transform(X_train_tokens)
# split into two
# X1_tfidf -> tf-idf weights of first question in question pair and
# X2_tfidf -> tf-idf weights of second question in question pair
X1_tfidf = X_tfidf[:len(X_train)]
X2_tfidf = X_tfidf[len(X_train):]
I am calling this get_pairwise_distance function the same way the tutorial does.
#cosine similarities
# here X1 and X2 are the embedded versions of the first and second questions in the question-pair data
# and X1_tfidf and X2_tfidf are the tf-idf weights of the first and second questions in the question-pair data
cosine = compute_pairwise_dist(X1, X2, X1_tfidf, X2_tfidf)
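For this function I need to pass the embedded versions of q1_clean and q2_clean as X1 and X2; the weights have already been computed with tf-idf. What I don't know is how to use the pre-trained model to turn these two columns into embedding vectors and pass them to the function above.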
You can use a Keras embedding matrix. Please follow this link: Keras Embedding Layers
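As a rough sketch of that idea: an embedding matrix is a 2-D array whose row i holds the vector of the word with index i, so embedding a tokenized question comes down to a row lookup. The example below uses only NumPy. `word_vectors` is a stand-in dict for whatever word-to-vector lookup the trained model exposes (with gensim that would typically be `model_w2v.wv`), and `build_embedding_matrix` and `embed_question` are hypothetical helper names, not part of Keras or gensim.

```python
import numpy as np

def build_embedding_matrix(vocab, word_vectors, dim):
    """Return (word_index, matrix); row 0 is reserved for padding/unknown words."""
    word_index = {word: i + 1 for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab) + 1, dim))
    for word, i in word_index.items():
        # Words without a pre-trained vector keep an all-zero row.
        if word in word_vectors:
            matrix[i] = word_vectors[word]
    return word_index, matrix

def embed_question(tokens, word_index, matrix):
    """Stack the vectors of the tokens found in word_index into an (n_words, dim) array."""
    rows = [word_index[t] for t in tokens if t in word_index]
    return matrix[rows] if rows else np.empty((0, matrix.shape[1]))

# Toy stand-in for the trained model's word -> vector lookup (assumption).
word_vectors = {"how": np.array([1.0, 0.0]), "python": np.array([0.0, 1.0])}
vocab = ["how", "learn", "python"]
word_index, matrix = build_embedding_matrix(vocab, word_vectors, dim=2)

# "to" is not in the vocabulary and is dropped; "learn" has no vector and maps to zeros.
X1_example = embed_question(["how", "to", "learn", "python"], word_index, matrix)
print(X1_example.shape)
```

Applying `embed_question` to every row of q1_clean and q2_clean would give one array of word vectors per question. Because tokens missing from the vocabulary are dropped here, the tf-idf weights for a question would have to be filtered the same way so that there is still one weight per word vector. The same matrix could also be used to initialize a Keras Embedding layer.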