Mallet outputs topic weights of 0.0 or 1.0 and nothing in between


I wrote a small program using Mallet's API, following this example in the developer's guide. However, I don't understand the final weight output.

While the program is running, it prints reasonable-looking weights for each topic (see below):

Mallet LDA: 20 topics, 5 topic bits, 11111 topic mask
max tokens: 5179
total tokens: 31712
<10> LL/token: -7,88809
<20> LL/token: -7,54327
<30> LL/token: -7,44727
<40> LL/token: -7,3755

0   0,5 parses files browser creates selects docking entity 
1   0,5 boolean listener handles enabled directory mouse lines 
2   0,5 text area selected inserts creates deletes user 
3   0,5 int line offset caret screen moves end 
4   0,5 creates node container widget namespace block grid 
5   0,5 selection key event processes shows word start 
6   0,5 boolean search index indent hyper bundle dialog 
7   0,5 string element adds starts ends reader map 
8   0,5 handles changed message properties mode content loads 
9   0,5 creates fold plugin list marker model handler 
10  0,5 action set invokes edit creates char token 
11  0,5 pane option saves inits error save creates 
12  0,5 component adds size layout removes dockable window 
13  0,5 converts type view tostring rule parser closes 
14  0,5 buffer update updates handles status invalidates byte 
15  0,5 evals creates menu callstack eval inits document 
16  0,5 class manager path url bsh impl chunk 
17  0,5 handles variable expression color property primitive icon 
18  0,5 file creates vfs request literal parent runs 
19  0,5 string parse editor.getexpansion preferredlayoutsize(parent preprocesskeyevent startlinecomment getstringproperty 

[...]

0   0,07447 parses string files entity selected decl lists 
1   0,09965 handles boolean listener adds mouse enabled drag 
2   0,09124 text area selected selects user input int 
3   0,14501 int line offset screen start count end 
4   0,07821 node creates container widget closes namespace grid 
5   0,05882 key event selection processes viewer extends handles 
6   0,16431 boolean indent list index equals updates modifiers 
7   0,08873 element string starts ends adds document map 
8   0,14141 handles changed message properties mode content loads 
9   0,12078 fold creates plugin marker model handler list 
10  0,11112 action creates invokes edit set token stream 
11  0,11896 option pane inits saves view creates color 
12  0,11379 component layout size adds dockable window removes 
13  0,11022 string converts tostring type char marks segment 
14  0,10636 buffer update handles updates byte status edit 
15  0,11183 evals creates menu callstack error eval reader 
16  0,09098 class path url manager impl classes loader 
17  0,09077 handles variable expression property creates bsh primitive 
18  0,12605 file string search vfs dialog creates literal 
19  0,02491 string parse setvalueat disposedockablewindow getpreviousbuffer buffered rewinds 

[beta: 0,02113] 
<500> LL/token: -6,90397

Total time: 16 seconds

However, when it gets to the final output, this is what comes out:

0   0.000   parses (115) string (90) files (53) entity (33) selected (29) 
1   0.000   handles (110) boolean (82) listener (71) mouse (48) adds (44) 
2   0.000   text (230) area (126) selected (61) user (28) selects (27) 
3   0.000   int (588) line (295) offset (67) screen (54) start (49) 
4   0.000   node (71) creates (48) widget (34) closes (33) container (32) 
5   0.000   key (130) event (110) selection (81) processes (67) viewer (17) 
6   0.000   boolean (586) indent (55) index (51) list (51) updates (23) 
7   0.000   element (99) string (76) starts (48) ends (46) adds (43) 
8   0.000   handles (464) changed (153) message (150) properties (96) mode (96) 
9   0.000   fold (108) creates (107) plugin (97) marker (56) model (55) 
10  0.000   action (132) creates (89) invokes (64) set (61) edit (58) 
11  0.000   option (119) pane (118) inits (114) saves (77) view (68) 
12  0.000   component (128) adds (89) layout (87) size (76) dockable (63) 
13  0.000   string (488) converts (114) tostring (65) type (41) char (30) 
14  0.000   buffer (289) update (89) handles (71) updates (49) byte (30) 
15  0.000   evals (157) creates (121) menu (102) callstack (92) error (66) 
16  0.000   class (243) path (76) url (47) manager (42) impl (28) 
17  0.000   handles (134) variable (79) expression (73) creates (47) property (46) 
18  0.000   file (126) string (111) search (89) vfs (64) int (52) 
19  1.000   string (2705) parse (2605) parser.reinittokeninput(in (1) image (1) candidates[i (1) 
0   0.930564405720232

Every weight is reported as 0 except one, which is reported as 1. Can anyone explain what is going on here?

java nlp topic-modeling mallet
1 Answer

The code you point to prints the topic distribution of the first document, which is assigned almost 100% to topic 19.
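To see why a single document can show up as 1.000 against a column of 0.000, here is a minimal sketch of the arithmetic. It assumes the per-document proportion has the usual smoothed form (count[k] + alpha[k]) / (docLength + sum(alpha)); the class name, the token counts, and the alpha values are hypothetical and are not taken from Mallet or from the run above.

```java
import java.util.Arrays;
import java.util.Locale;

final class TopicProportionSketch {

    /** Smoothed proportions: (count[k] + alpha[k]) / (docLength + sum(alpha)). */
    static double[] smoothedProportions(int[] topicCounts, double[] alpha) {
        double[] p = new double[topicCounts.length];
        double total = 0.0;
        for (int k = 0; k < topicCounts.length; k++) {
            p[k] = topicCounts[k] + alpha[k];
            total += p[k];
        }
        for (int k = 0; k < p.length; k++) {
            p[k] /= total;
        }
        return p;
    }

    public static void main(String[] args) {
        int numTopics = 20;

        // Hypothetical counts: all 5,000 tokens of one document sit in topic 19.
        int[] counts = new int[numTopics];
        counts[19] = 5000;

        // Hypothetical symmetric prior of 0.1 per topic.
        double[] alpha = new double[numTopics];
        Arrays.fill(alpha, 0.1);

        double[] p = smoothedProportions(counts, alpha);
        for (int k = 0; k < numTopics; k++) {
            // Three decimals, as in the final listing of the question.
            System.out.println(String.format(Locale.US, "%d\t%.3f", k, p[k]));
        }
    }
}
```

With these numbers topic 19 gets 5000.1 / 5002, which rounds to 1.000, and every other topic gets 0.1 / 5002, which rounds to 0.000. The weights are not literally 0 and 1; three decimals are just not enough to show the difference.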

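It looks like the collection is small (about 30k words) and the documents are fairly large (up to 5k words). If there are more topics than documents, the model can maximize its objective by putting each document in its own topic.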

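You will get better results with more documents, and you may want to consider splitting the documents into smaller segments. LDA works best when each segment is short enough that it is reasonable to assume it has a homogeneous mix of topics. In other words, you would not expect the beginning of the segment to differ from its end. 200-500 words is a typical range. A total of about 300,000 tokens is also probably the minimum at which you can expect good results.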
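As a rough sketch of the splitting step, the following breaks a text into segments of at most a fixed number of whitespace-separated words. The class name `DocumentChunker`, the method `chunk`, and the 300-word size are assumptions chosen for illustration; a real pipeline might prefer to cut at method, paragraph, or sentence boundaries instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DocumentChunker {

    /** Splits text on whitespace and groups the words into chunks of at most chunkSize words. */
    static List<String> chunk(String text, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Synthetic 1,000-word document, used only to exercise the splitter.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("word").append(i).append(' ');
        }
        List<String> chunks = chunk(sb.toString(), 300);
        System.out.println(chunks.size() + " chunks");
    }
}
```

Each resulting segment would then be fed to the topic model as its own document, so that a 5k-word file contributes many short instances rather than one long one.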
