我正在研究word2vec要解决的多标签情感分类问题。这是我从几个教程中学到的代码。现在精度很低。大约0.02告诉我代码中有问题。但我找不到它。我尝试了TF-IDF和BOW的这段代码(显然word2vec部分除外),我的准确度得分更高,例如0.28,但似乎这是一个错误的错误:
np.set_printoptions(threshold=sys.maxsize)
wv = gensim.models.KeyedVectors.load_word2vec_format("E:\\GoogleNews-vectors-negative300.bin", binary=True)
wv.init_sims(replace=True)
#Pre-Processor Function
pre_processor = TextPreProcessor(
omit=['url', 'email', 'percent', 'money', 'phone', 'user',
'time', 'url', 'date', 'number'],
normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
'time', 'url', 'date', 'number'],
segmenter="twitter",
corrector="twitter",
unpack_hashtags=True,
unpack_contractions=True,
tokenizer=SocialTokenizer(lowercase=True).tokenize,
dicts=[emoticons]
)
#Averaging Words Vectors to Create Sentence Embedding
def word_averaging(wv, words):
all_words, mean = set(), []
for word in words:
if isinstance(word, np.ndarray):
mean.append(word)
elif word in wv.vocab:
mean.append(wv.syn0norm[wv.vocab[word].index])
all_words.add(wv.vocab[word].index)
if not mean:
logging.warning("cannot compute similarity with no input %s", words)
# FIXME: remove these examples in pre-processing
return np.zeros(wv.vector_size,)
mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
return mean
def word_averaging_list(wv, text_list):
return np.vstack([word_averaging(wv, post) for post in text_list ])
#Secondary Word-Averaging Method
def get_mean_vector(word2vec_model, words):
# remove out-of-vocabulary words
words = [word for word in words if word in word2vec_model.vocab]
if len(words) >= 1:
return np.mean(word2vec_model[words], axis=0)
else:
return []
#Loading data
raw_train_tweets = pandas.read_excel('E:\\train.xlsx').iloc[:,1] #Loading all train tweets
train_labels = pandas.read_excel('E:\\train.xlsx').iloc[:,2:13] #Loading corresponding train labels (11 emotions)
raw_test_tweets = pandas.read_excel('E:\\test.xlsx').iloc[:300,1] #Loading 300 test tweets
test_gold_labels = pandas.read_excel('E:\\test.xlsx').iloc[:300,2:13] #Loading corresponding test labels (11 emotions)
print("please wait")
#Pre-Processing
train_tweets=[]
test_tweets=[]
for tweets in raw_train_tweets:
train_tweets.append(pre_processor.pre_process_doc(tweets))
for tweets in raw_test_tweets:
test_tweets.append(pre_processor.pre_process_doc(tweets))
#Vectorizing
train_array = word_averaging_list(wv,train_tweets)
test_array = word_averaging_list(wv,test_tweets)
#Secondary Vectorizing
train_array=[]
test_array=[]
for tweet in train_tweets:
vec = get_mean_vector(wv, tweet)
if len(vec) > 0:
train_array.append(vec)
for tweet in test_tweets:
vec = get_mean_vector(wv, tweet)
if len(vec) > 0:
test_array.append(vec)
clf = LabelPowerset(LogisticRegression(C=1, class_weight=None))
clf.fit(train_array,train_labels)
predicted= clf.predict(test_array)
print("Logistic Regression Accuracy = ",accuracy_score(test_gold_labels,predicted))
print("F1 score = ",f1_score(test_gold_labels,predicted, average="micro"))
print("Hamming loss = ",hamming_loss(test_gold_labels,predicted))
我的结果,请逐步进行。首先,我用它来查看Google预训练模型的一部分。我从this link下载了Google预先训练的模型。
from itertools import islice
print(list(islice(wv.vocab, 13030, 13050)))
结果:
['Memorial_Hospital', 'Seniors', 'memorandum', 'elephant', 'Trump', 'Census', 'pilgrims', 'De', 'Dogs', '###-####_ext', 'chaotic', 'forgive', 'scholar', 'Lottery', 'decreasing', 'Supervisor', 'fundamentally', 'Fitness', 'abundance', 'Hold']
然后,raw_train_tweets:
0 “Worry is a down payment on a problem you may ...
1 Whatever you decide to do make sure it makes y...
2 @Max_Kellerman it also helps that the majorit...
3 Accept the challenges so that you can literall...
4 My roommate: it's okay that we can't spell bec...
...
6833 @nicky57672 Hi! We are working towards your hi...
6834 @andreamitchell said @berniesanders not only d...
6835 @isthataspider @dhodgs i will fight this guy! ...
6836 i wonder how a guy can broke his penis while h...
6837 I'm highly animated even though I'm decomposing.
Name: Tweet, Length: 6838, dtype: object
train_labels:
anger anticipation disgust fear joy love optimism pessimism \
0 0 1 0 0 0 0 1 0
1 0 0 0 0 1 1 1 0
2 1 0 1 0 1 0 1 0
3 0 0 0 0 1 0 1 0
4 1 0 1 0 0 0 0 0
... ... ... ... ... ... ... ... ...
6833 0 0 0 0 0 0 0 0
6834 0 1 0 0 0 0 0 0
6835 1 0 1 0 0 0 0 1
6836 0 0 0 0 0 0 0 0
6837 0 0 0 0 0 0 0 1
sadness surprise trust
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
... ... ... ...
6833 0 0 0
6834 0 1 0
6835 0 0 0
6836 0 1 0
6837 0 0 0
[6838 rows x 11 columns]
raw_test_tweets:
0 @Adnan__786__ @AsYouNotWish Dont worry Indian ...
1 Academy of Sciences, eschews the normally sobe...
2 I blew that opportunity -__- #mad
3 This time in 2 weeks I will be 30... 😥
4 #Deppression is real. Partners w/ #depressed p...
...
295 you begin to irritate me, primitive
296 @Malala Happy day to you Malala 👏🏻😃I seem to s...
297 Hey The Success Novel thanks for the follow! (...
298 Easy day at school today; last rehearsal befor...
299 @bookish_yogi It's been a little Twitter saga ...
Name: Tweet, Length: 300, dtype: object
test_gold_labels:
anger anticipation disgust fear joy love optimism pessimism \
0 1 1 0 0 0 0 1 0
1 0 0 1 0 0 0 0 0
2 1 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0 0
4 0 0 0 1 0 0 0 0
.. ... ... ... ... ... ... ... ...
295 1 0 1 0 0 0 0 0
296 0 0 0 0 1 1 1 0
297 0 0 0 0 1 0 1 0
298 0 0 0 0 1 0 1 0
299 1 0 1 1 0 0 0 0
sadness surprise trust
0 0 0 1
1 0 0 0
2 1 0 0
3 1 0 0
4 1 0 0
.. ... ... ...
295 0 0 0
296 0 0 0
297 0 0 1
298 0 0 0
299 0 0 0
[300 rows x 11 columns]
train_tweets(在For循环之后,在转换为Series之前:
[['“', 'worry', 'is', 'a', 'down', 'payment', 'on', 'a', 'problem', 'you', 'may', 'never', 'have', "'", '.', 'joyce', 'meyer', '.', 'motivation', 'leadership', 'worry'], ['whatever', 'you', 'decide', 'to', 'do', 'make', 'sure', 'it', 'makes', 'you', 'happy', '.'], ['it', 'also', 'helps', 'that', 'the', 'majority', 'of', 'nfl', 'coaching', 'is', 'inept', '.', 'some', 'of', 'bill', 'o', "'", 'brien', "'", 's', 'play', 'calling', 'was', 'wow', ',', '!', 'gopats'],...........['i', 'will', 'fight', 'this', 'guy', '!', 'do', 'not', 'insult', 'the', 'lions', 'like', 'that', '!', 'but', 'seriously', 'they', 'kinda', 'are', '.', 'wasted', 'some', 'of', 'the', 'best', 'players'], ['i', 'wonder', 'how', 'a', 'guy', 'can', 'broke', 'his', 'penis', 'while', 'having', 'sex', '?', 'serious'], ['i', 'am', 'highly', 'animated', 'even', 'though', 'i', 'am', 'decomposing', '.']]
train_tweets(在For循环之后并转换为Series之后:
0 [“, worry, is, a, down, payment, on, a, proble...
1 [whatever, you, decide, to, do, make, sure, it...
2 [it, also, helps, that, the, majority, of, nfl...
3 [accept, the, challenges, so, that, you, can, ...
4 [my, roommate, :, it, ', s, okay, that, we, ca...
...
6833 [hi, !, we, are, working, towards, your, highl...
6834 [said, not, only, did, not, play, up, hrc, in,...
6835 [i, will, fight, this, guy, !, do, not, insult...
6836 [i, wonder, how, a, guy, can, broke, his, peni...
6837 [i, am, highly, animated, even, though, i, am,...
Length: 6838, dtype: object
test_tweets具有相同的格式。train_array.shape:(6838,300)。打印(train_array):
[[ 4.93713953e-02 -3.80963739e-03 2.69042160e-02 1.20123737e-01
-6.49027973e-02 3.87404412e-02 1.04131535e-01 -6.84523731e-02
6.69738054e-02 7.30312392e-02 -4.77092117e-02 -9.83719155e-02
2.29239091e-03 2.25124136e-02 -7.63789043e-02 6.14396073e-02
9.35590491e-02 1.06499337e-01 -2.69374326e-02 -1.41343296e-01
.........
test_array.shape:(300,300)
Logistic Regression Accuracy = 0.25
F1 score = 0.5894897182025896
Hamming loss = 0.16333333333333333
尚不清楚您的TextPreProcessor
或SocialTokenizer
类可能会做什么。您应该编辑问题以显示其代码,或显示结果文本的一些示例,以确保其按预期运行。 (例如:显示all_tweets
的前几项和后几项。)
[all_tweets = train_tweets.append(test_tweets)
行不可能达到您的期望。 (它将整个列表test_tweets
放置为all_tweets
的最后一个元素–但返回分配给None
的all_tweets
。然后,Word2Vec
模型可能为空-您应启用INFO日志记录以观察其进度并查看输出是否有异常,并添加代码后训练以打印有关模型的一些详细信息,从而确认进行了有用的训练。)
您确定train_tweets
是通往.fit()
的管道的正确格式吗? (发送到Word2Vec
培训的文本似乎已经通过.split()
被标记化了,但是pandas.Series
train_tweets
中的文本可能从未被标记化。)
通常,一个好主意是启用日志记录,并在每个步骤之后添加更多代码,通过检查属性值或打印较长集合的摘录来确认每个步骤均具有预期的效果。