Why does my bag-of-words Naive Bayes classifier perform worse than Weka's StringToWordVector?


I am trying to build a Naive Bayes classifier for 1000 positive and 1000 negative labeled IMDB reviews (txt_sentoken) using the Weka API for Java.

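Because I did not know about StringToWordVector, which basically provides a bag-of-words model that reaches 80% accuracy, I did the vocabulary building and vector creation myself and only reach 75% accuracy :(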

Now I am wondering why my solution performs worse.

1) From the 2000 reviews, I build the bag of words:

Pipeline<String, Void> bagOfWordsChain = Pipeline
        .start(Preprocessing.luceneTokenizer)
        .append(Preprocessing.stopwordFilter)
        .append(Preprocessing.vocabularyBuilder);

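The vocabularyBuilder takes the whitespace-tokenized review, puts every unigram into a HashMap and counts its occurrences. Only words with an absolute polarity of 0.25 or more (according to SentiWordNet) are kept: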

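import static java.util.stream.Collectors.toMap;

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Pipe and SentiAnalysis are project classes that are not shown in this post.
public class VocabularyBuilder implements Pipe<List<String>, Void> {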
    public Map<String, Integer> _vocab = new ConcurrentHashMap<String,Integer>();

    @Override
    public Void process(List<String> input)
    {
        for(String token: input){
            double pol = SentiAnalysis.sentiWordNet.getWordPolarity(token);

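            // Keep only tokens whose polarity is at least 0.25
            if (pol >= 0.25) {
                // merge() updates the count atomically, unlike a separate get followed by put
                this._vocab.merge(token, 1, Integer::sum);
            }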

        }
        return null;
    }

    public void sortAndLimit(int n){
        this._vocab = this._vocab.entrySet()
                .stream()
                .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                .limit(n)
                .collect(
                        toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e2,
                                LinkedHashMap::new));
    }

}

The resulting HashMap is sorted (descending) and limited to the top ~1500 words. This gives me the following vocabulary (word=count):

{bad=43245, best=41292, better=28551, wrong=11935, unfortunately=11842, worst=9548, i=7526, wonderful=6727, excellent=5704, scary=5395, all=4361, greatest=4216, hate=4154, like=3688, out=3635, boss=3627, disturbing=3193, inspired=3162, some=2983, poorly=2914, just=2903, creepy=2883, worthy=2883, superb=2883, disappointing=2728, loving=2666, plain=2542, fake=2480, engaging=2449, good=2409, nasty=2387, lucky=2387, cheesy=2294, other=1931, well=1906, very=1863, unfortunate=1860, dirty=1829, mom=1736, delightful=1736, embarrassing=1705, make=1641, outrageous=1581, off=1579, really=1557, gorgeous=1550, alas=1550, troubled=1519, little=1500, over=1446, never=1373, downright=1364, horribly=1364, supernatural=1333, infamous=1333, psychotic=1271, know=1213, doom=1178, great=1148, love=1112, go=1109, awesome=1085, pseudo=1054, still=1047, world=994, paranoid=992, bastard=992, hapless=992, nifty=961, menacing=961, miserable=961, phony=930, worried=899, unrealistic=899, going=887, masterful=868, respected=868, pity=837, admirable=837, ruthless=837, chilling=806, horrific=806, spite=806, right=798, find=782, understated=775, damage=775, bully=744, fright=682, paranoia=682, unsatisfying=682, kudos=651, notorious=651, overacting=651, interesting=638, unoriginal=620, nostalgic=620, must=615, wounded=589, abusive=589, atrocious=589, hard=587, times=568, trying=566, kind=559, complaining=558, trashy=558, hollywood=532, cracking=527, glowing=527, sure=523, together=522, black=521, heartbreaking=496, bonnie=496, letdown=496, regret=496, disastrous=496, faux=496, nerve=496, imitation=496, whole=494, sex=476, less=475, wondrous=465, whiny=465, groan=465, completely=440, wretched=434, sumptuous=434, punk=434, botched=434, smashing=434, small=429, dead=418, evil=417, humor=410, true=410, lost=409, salt=403, idyllic=403, schlock=403, healing=403, momma=403, second=400, problem=396, comic=389, alien=377, reverend=372, inviting=372, horrid=372, righteous=372, props=372, lucrative=372, 
impressively=372, soil=372, autistic=372, entrapment=372, certainly=361, face=357, despite=352, nice=344, perfect=343, gusto=341, crappy=341, messy=341, undeniable=341, shoddy=341, seeing=338, game=321, entertaining=314, dark=312, protecting=310, frivolous=310, troubling=310, dandy=310, fortunate=310, worrying=310, bush=310, unwilling=310, tingle=310, ashamed=310, short=309, beautiful=299, obvious=292, known=292, classic=289, fine=284, eyes=282, compassionate=279, amusingly=279, formidable=279, infatuated=279, topping=279, dreck=279, protected=279, underdog=279, woefully=279, howling=279, weep=279, question=277, truly=274, boring=270, killer=270, 1=269, fiction=257, yes=256, novel=255, possible=253, stupid=253, murder=250, demonic=248, infuriating=248, joking=248, bullshit=248, faint=248, extremely=248, fumbling=248, insomnia=248, respectability=248, levity=248, disconcerting=248, handicapped=248, wonder=247, sound=245, worse=245, taking=242, none=242, emotional=239, meet=236, elements=236, hell=234, interest=232, battle=229, obviously=228, enjoy=227, needs=221, poor=219, bogus=217, virtuoso=217, swell=217, bliss=217, droll=217, incredulous=217, redneck=217, scummy=217, otherworldly=217, amuse=217, bum=217, success=216, happy=215, important=215, giving=214, recent=211, basically=211, easily=209, apparently=209, cop=201, difficult=200, complete=197, due=196, gone=195, quality=195, dramatic=194, annoying=189, smart=189, dogged=186, crushing=186, finer=186, brag=186, upsetting=186, regrettably=186, nonexistent=186, fab=186, pleaser=186, delayed=186, tumultuous=186, enamored=186, idealized=186, botch=186, absolutely=186, imaginary=186, incongruity=186, sickly=186, mono=186, depraved=186, prejudice=186, successful=186, read=185, intelligent=184, clear=183, amazing=181, low=179, animated=179, villain=178, definitely=177, clever=177, add=177, giant=176, brilliant=176, solid=174, thinking=174, potential=171, unlike=167, perfectly=166, sweet=166, non=166, enjoyable=164, 
decent=164, cold=161, impossible=161, exciting=160, wanted=159, neither=158, trouble=157, otherwise=156, opulent=155, penniless=155, liked=155, esteemed=155, reprehensible=155, stink=155, congratulations=155, catastrophic=155, battered=155, sterling=155, majestic=155, mangled=155, researcher=155, wonderment=155, outraged=155, outdated=155, forged=155, thorough=155, hurting=155, dumb=155, crummy=155, filthy=155, talented=154, silly=152, truth=151, joke=150, totally=146, terrible=143, fear=142, entirely=142, ability=140, killing=138, cheap=137, background=137, lots=137, disaster=136, highly=136, amusing=135, master=135, beauty=135, further=134, cute=134, wish=134, awful=132, plenty=131, believable=130, charm=130, imagine=128, choice=128, tough=128, somewhere=127, famous=126, okay=125, resentful=124, perplexed=124, deplorable=124, hamming=124, incredibly=124, honored=124, melancholic=124, woeful=124, amiss=124, indignity=124, disabled=124, improving=124, disembodied=124, repugnant=124, cringing=124, workmanlike=124, maddening=124, stinky=124, opportunity=124, seemingly=124, fugly=124, enthralled=124, reputable=124, discernable=124, elegance=124, interested=123, alive=122, ready=120, average=119, apart=118, wasted=118, becoming=117, missing=116, bizarre=115, waiting=114, saving=112, adventure=111, lame=111, camp=109, sad=108, tried=107, gay=106, wise=106, charming=105, rescue=105, fair=105, decided=104, fox=104, sadly=103, horrible=103, winning=102, exception=102, older=102, accident=102, younger=102, bright=101, desperate=101, surprising=100, revenge=100, lose=100, opinion=100, weird=100, apparent=100, epic=99, nicely=98, nowhere=98, crazy=98, buy=98, loved=97, witty=97, please=97, pathetic=97, fantasy=96, kept=96, cameo=96, fascinating=96, spirit=95, superior=95, shame=94, confusing=94, confused=94, untamed=93, shapely=93, bias=93, masterpiece=93, intriguing=93, fiendish=93, superlative=93, jerking=93, indignant=93, mournful=93, despondent=93, troublesome=93, 
malice=93, sitting=93, speaking=93, sheltered=93, bollocks=93, alienating=93, wail=93, rowdy=93, meddling=93, generalized=93, motormouth=93, swoon=93, banner=93, moralizing=93, propelling=93, damaging=93, smelly=93, crap=93, overjoyed=93, interfering=93, mischief=93, mommy=93, disenchanted=93, crotchety=93, abhorrent=93, squalor=93, condemning=93, matt=92, necessary=92, magic=92, pointless=92, respect=92, pain=90, sub=90, moral=89, touching=89, rob=89, drawn=89, floor=88, color=88, keeping=88, ed=88, phantom=88, unfunny=88, damn=88, suspect=88, criminal=88, emotion=88, anti=87, blame=87, grand=86, ok=86, wonderfully=85, frightening=84, walking=84, utterly=84, attractive=83, rocky=83, honest=82, difference=82, feelings=82, trust=82, safe=82, dying=81, plus=81, dangerous=81, visually=81, shock=80, sexy=80, fail=80, humour=79, goofy=79, constant=79, hidden=78, welcome=78, content=77, fantastic=77, contrived=77, genius=77, sick=77, fit=76, badly=76, forgotten=75, won=75, excuse=75, guilty=75, instance=75, offensive=75, fault=75, super=74, outstanding=74, haunting=74, mediocre=74, laughable=74, nudity=73, bloody=73, slapstick=73, model=73, zero=73, ted=72, responsible=72, hurt=72, fly=72, wondering=72, eccentric=71, beast=71, thoroughly=71, ended=71, ill=71, inevitable=70, extra=70, steal=70, sorry=70, lacking=70, missed=70, ugly=70, mistake=69, hip=69, veteran=69, animal=68, chief=68, fill=68, suit=67, unexpected=67, tragedy=67, disappointment=67, warm=66, armageddon=66, cliched=66, positive=65, learned=65, genuinely=64, moved=64, aware=64, international=63, greater=63, freeman=63, elaborate=63, roughneck=62, unforgiving=62, unappetizing=62, stench=62, premium=62, psychological=62, danger=62, ravaging=62, flash=62, shlock=62, constipated=62, disability=62, woe=62, unworkable=62, lamentable=62, plaintive=62, ornery=62, patchy=62, disingenuous=62, unredeemable=62, remorseful=62, disrespect=62, terror=62, assimilating=62, satisfying=62, bitchiness=62, disappointed=62, 
derogatory=62, venomous=62, straying=62, stinking=62, woozy=62, conceited=62, ungodly=62, abysmally=62, soiled=62, scowl=62, grime=62, amicable=62, reconstructive=62, jobless=62, disadvantage=62, wearing=62, lugubrious=62, essentially=62, inexorable=62, painful=61, afraid=61, joy=61, threatening=61, drunken=60, suspenseful=60, psycho=60, passion=60, surely=59, suffering=59, warning=59, oddly=59, sole=59, nevertheless=59, climactic=59, lovely=59, bitter=58, wealthy=58, deadly=58, ape=58, patient=58, complicated=57, understanding=57, effectively=57, scare=57, unlikely=57, sudden=56, gag=56, extraordinary=56, quirky=56, beloved=56, 0=55, destroyed=55, frankly=55, chicken=55, unable=55, irritating=55, tragic=55, hype=55, unknown=54, nonetheless=54, promising=54, heroine=54, paid=53, obligatory=53, struggling=53, loser=53, inventive=52, erotic=52, absurd=52, provoking=52, remaining=52, hint=52, artistic=51, darkness=51, magical=51, beautifully=51, hopefully=51, notable=50, homage=50, inept=50, bleak=50, depressing=50, bother=49, cutting=49, flawed=49, obnoxious=49, cole=49, pleasant=49, instinct=49, brutal=49, trick=48, luckily=48, accomplished=48, deserve=48, disbelief=48, trite=48, fortunately=48, corny=47, intellectual=47, spoken=47, stylish=47, cruel=47, innocence=47, delight=47, expensive=47, mighty=47, favor=47, false=47, stereotypical=47, refreshing=47, training=46, honor=46, awkward=46, thrilling=46, legendary=46, gory=45, spots=45, virtual=45, consistently=45, base=45, deserved=45, wan=45, clueless=45, strangely=45, shut=45, endearing=45, expert=45, glory=45, energetic=44, handsome=44, insult=44, blind=44, crucial=44, everywhere=44, negative=44, received=43, bats=43, spirited=43, lifeless=43, fool=43, ironically=43, appropriately=42, confrontation=42, invisible=42, randy=42, legs=41, terrifying=41, shining=41, specifically=41, compare=41, crazed=41, ludicrous=41, ideal=41, charismatic=41, stunt=41, torn=41, diaz=41, eerie=41, vicious=41, campy=41, blank=41, 
reputation=41, searching=41, caring=41, flashy=41, inane=41, instantly=41, utter=41, flawless=40, nonsense=40, controlled=40, stab=40, wanting=40, foul=40, pat=40, pet=40, confusion=40, uninspired=40, experienced=40, affection=40, eager=39, melodramatic=39, absolute=39, detailed=39, separate=39, correct=39, craven=39, loyal=39, subtlety=39, superficial=39, enjoyment=39, sentimental=39, happiness=39, stranger=39, worthwhile=39, stupidity=38, paying=38, fell=38, altogether=38, advantage=38, honestly=38, sinister=38, manipulative=38, sleazy=38, credible=38, redeeming=38, ruin=38, excessive=38, protect=38, suspicious=38, crack=37, importance=37, cynical=37, shark=37, magnificent=37, sophisticated=37, showdown=37, regardless=37, harsh=37, corrupt=37, wreck=36, controversial=36, mild=36, proper=36, prepared=36, attempted=36, gem=36, changing=36, knight=36, ace=36, duke=36, devoted=36, macho=36, dozens=36, express=36, satan=35, wacky=35, struck=35, elderly=35, reluctant=35, proud=35, combat=35, uncomfortable=35, horrifying=35, nostalgia=35, evident=35, accurate=35, principal=34, uma=34, authentic=34, respective=34, insane=34, nearby=34, grim=34, shine=34, slight=34, busy=34, definite=34, unpleasant=34, firm=34, starred=34, rely=34, caliber=33, absent=33, acceptable=33, brutally=33, rough=33, unconvincing=33, guilt=33, fancy=33, astonishing=33, needless=33, sin=33, sly=33, composed=33, raging=32, lovable=32, nude=32, decidedly=32, aged=32, neat=32, neck=32, clumsy=32, triumph=32, comical=32, admire=32, popularity=32, unpredictable=32, mistaken=32, lively=32, depressed=32, dignity=32, marvelous=32, fable=32, raw=32, tiresome=32, bumbling=32, useless=32, immensely=32, torturer=31, officious=31, detrimental=31, mongrel=31, crackerjack=31, soured=31, unethical=31, mournfulness=31, asthmatic=31, upset=31, untraditional=31, cheating=31, undependable=31, contemptible=31, worn=31, irreplaceable=31, unopposed=31, befouled=31, gristly=31, softhearted=31, paucity=31, alleviated=31, 
malnutrition=31, crude=31, imponderable=31, secular=31, inconvenience=31, financial=31, singable=31, uncommercial=31, unmarried=31, indigestion=31, unfeasible=31, gloating=31, remarry=31, groveling=31, selfless=31, entranced=31, lecherous=31, trustworthiness=31, ruffian=31, heartbroken=31, interrogator=31, assured=31, placate=31, vertiginous=31, plausible=31, unhappy=31, surly=31, humbug=31, whiney=31, limitlessness=31, dispirited=31, insufficient=31, bungled=31, lovelorn=31, wounding=31, sold=31, irrefutable=31, convenient=31, xenophobic=31, rootless=31, preferable=31, jocular=31, pong=31, pneumonia=31, malaise=31, malevolence=31, detest=31, highbrow=31, cheapjack=31, reek=31, calm=31, egregious=31}

2) For each review, I create a vector containing the number of occurrences of each of these 1500 words:

{exception=1, nicely=0, crappy=0, unconvincing=0, desperate=0, awful=0, wreck=0, satan=0, fumbling=0, ted=0, protected=0, poor=0, wasted=0, legs=0, understanding=0, absent=0, neat=0, inept=0, ashamed=0, unlikely=0, solid=0, inviting=0, excellent=0, younger=0, opulent=0, trashy=0, raw=0, inspired=1, compassionate=0, charismatic=1, apparent=0, 0=0, 1=0, bollocks=0, amusing=0, placate=0, poorly=0, bogus=0, notable=0, vertiginous=0, alienating=0, sentimental=0, plausible=0, catastrophic=0, salt=0, superlative=0, i=0, artistic=0, neck=0, weird=0, stunt=0, destroyed=0, corny=0, exciting=0, obvious=0, dogged=0, sweet=0, novel=0, malaise=0, acceptable=0, ace=0, eager=0, correct=0, moved=0, melancholic=0, jerking=0, woeful=0, good=4, fortunately=0, wish=0, deadly=0, wise=0, tiresome=0, roughneck=0, faint=0, nonexistent=0, add=0, murder=0, unopposed=0, pat=0, fantasy=0, obligatory=0, vicious=0, cruel=0, befouled=0, gristly=0, respect=0, gone=0, faux=0, gorgeous=0, softhearted=0, success=0, indignant=0, wacky=0, smashing=0, cynical=0, trust=0, raging=0, searching=0, wonderment=0, paucity=0, fugly=0, nowhere=0, disturbing=1, sorry=0, spirited=0, happiness=0, responsible=0, hard=0, mistake=0, redneck=0, malevolence=0, sexy=0, caliber=0, lucrative=0, better=0, woefully=0, crap=0, alleviated=0, truth=0, well=0, detest=0, creepy=0, taking=1, terrifying=0, wanting=0, resentful=0, invisible=0, changing=0, moral=0, tried=0, disappointment=0, loved=0, strangely=0, cameo=0, struck=0, hate=0, darkness=0, pet=0, gory=0, protecting=0, disrespect=0, tough=0, loving=0, malnutrition=0, unhappy=0, flawed=0, charming=0, erotic=0, spots=0, demonic=0, animated=0, crazy=0, mighty=0, homage=0, other=2, magnificent=0, highbrow=0, swell=0, crude=0, frankly=0, surly=0, amiss=0, melodramatic=0, wail=0, unforgiving=0, energetic=0, shark=0, famous=0, thoroughly=0, stupidity=0, question=0, honestly=0, worrying=0, spirit=0, imponderable=0, intellectual=0, cheap=0, humbug=0, sickly=0, torturer=0, 
officious=0, infamous=0, heartbreaking=0, kudos=0, duke=0, cop=0, cheapjack=0, honor=0, supernatural=0, rowdy=0, nasty=0, respectability=0, all=7, terror=0, read=0, plenty=0, less=0, alas=0, adventure=0, idyllic=0, secular=0, scowl=0, shining=0, evil=0, inconvenience=0, infatuated=0, badly=0, shame=0, overjoyed=0, torn=0, chicken=0, entertaining=0, rob=0, interfering=0, assimilating=0, bush=0, elderly=0, financial=0, dumb=0, combat=0, respective=0, trick=0, maddening=0, times=0, extra=0, busy=0, talented=1, detrimental=0, hapless=0, floor=0, idealized=0, wounded=0, guilt=0, stinky=0, chief=0, satisfying=0, despite=4, indignity=0, super=0, groan=0, caring=0, botch=0, fantastic=0, spoken=0, interested=1, bitchiness=0, lame=0, clumsy=0, meddling=0, accurate=0, pity=0, flawless=0, infuriating=0, decided=0, beautiful=0, whiney=0, generalized=0, limitlessness=0, botched=0, ape=0, lovable=0, welcome=0, devoted=0, reek=0, cheesy=0, wanted=1, pathetic=0, untamed=0, difference=0, must=1, deserved=0, flash=0, unoriginal=0, sophisticated=0, perfectly=0, goofy=0, nudity=0, dandy=0, killing=1, penniless=0, singable=0, giving=0, accident=0, excuse=0, drunken=0, humour=0, knight=0, disabled=0, mournful=0, insane=0, worried=0, unappetizing=0, stench=0, pointless=0, triumph=0, perplexed=0, silly=0, black=0, bliss=0, lacking=0, fortunate=0, entirely=0, boring=1, mongrel=0, calm=0, crackerjack=0, classic=1, charm=0, tragedy=0, absolute=0, contrived=0, feelings=0, battered=0, shapely=0, surely=0, becoming=0, wealthy=0, genuinely=0, sterling=0, unable=0, disappointed=0, dispirited=0, dying=0, paying=0, bias=0, sinister=0, brutally=0, basically=0, menacing=0, uncommercial=0, imagine=0, attractive=0, egregious=0, definite=0, superb=2, flashy=0, insufficient=0, uncomfortable=0, unmarried=0, surprising=0, worse=1, camp=0, improving=0, warm=0, guilty=0, embarrassing=0, everywhere=0, worst=0, despondent=0, derogatory=0, blind=0, color=0, hidden=0, indigestion=0, impossible=0, soured=0, 
showdown=0, complaining=0, non=0, disaster=0, mono=0, negative=0, chilling=0, venomous=0, outrageous=0, painful=0, pain=0, learned=0, wan=0, yes=0, effectively=0, appropriately=0, manipulative=0, stylish=0, genius=0, detailed=1, hype=0, delightful=0, motormouth=0, paid=0, short=0, stranger=0, attempted=0, horrifying=0, fancy=0, notorious=0, innocence=0, fab=0, happy=1, pleaser=0, overacting=0, nearby=0, unfeasible=0, grime=0, struggling=0, specifically=0, controversial=0, truly=0, greater=0, promising=0, okay=0, swoon=0, shlock=0, epic=0, horrid=0, saving=0, rely=0, apparently=0, bungled=0, excessive=0, completely=0, suit=0, bastard=0, damage=0, boss=0, masterful=0, bright=0, harsh=0, clueless=0, alien=0, smart=0, anti=0, unknown=0, diaz=0, bleak=0, premium=0, frivolous=0, gloating=0, low=0, droll=0, tragic=0, amusingly=0, older=0, confusing=0, protect=0, levity=0, mischief=0, comical=0, touching=0, inane=0, unfortunately=0, freeman=0, great=2, wrong=0, beautifully=0, disembodied=0, impressively=0, constipated=0, incredulous=0, choice=0, grim=0, small=3, crushing=0, shut=0, fiction=1, doom=0, disability=0, amicable=0, straying=0, hip=0, wondering=0, totally=0, potential=0, unethical=0, otherwise=0, kind=1, repugnant=0, lifeless=0, important=0, veteran=0, nerve=0, absolutely=0, affection=0, campy=0, psycho=0, wondrous=0, game=0, mommy=0, mournfulness=0, unexpected=0, crucial=0, rocky=0, principal=0, joy=0, patient=0, sad=0, phony=0, imitation=0, visually=0, depressing=0, nostalgia=0, deserve=0, revenge=0, nostalgic=0, clear=0, banner=0, armageddon=0, craven=0, slapstick=0, momma=0, shine=0, favor=0, neither=0, further=0, stupid=0, bad=0, luckily=0, depraved=0, fit=0, crack=0, unsatisfying=0, disenchanted=0, honest=0, giant=0, asthmatic=0, bumbling=0, killer=0, pseudo=0, sure=0, otherworldly=0, going=0, shock=0, loyal=0, mild=0, opportunity=1, reconstructive=0, downright=0, astonishing=0, trying=0, finer=0, stinking=0, hurt=0, average=0, compare=0, unwilling=0, 
admire=0, dead=0, soil=0, eyes=0, amuse=0, sudden=0, fool=0, unlike=0, popularity=1, brag=0, topping=0, bully=0, crummy=0, outstanding=0, keeping=0, sex=0, emotional=0, outraged=0, right=1, possible=0, battle=0, awesome=0, fly=0, glowing=0, meet=0, complicated=0, masterpiece=0, jobless=0, lovelorn=0, hollywood=0, beauty=0, scare=0, woe=0, needless=0, wounding=0, wretched=0, outdated=0, absurd=0, accomplished=0, unworkable=0, won=0, forgotten=0, useless=0, warning=0, scary=0, ed=0, needs=0, disadvantage=0, sumptuous=0, unpredictable=0, intriguing=0, suspicious=0, confrontation=0, inventive=0, horrific=0, never=0, phantom=0, oddly=0, blame=0, macho=0, nude=0, confusion=0, little=1, lucky=0, some=6, virtual=0, subtlety=0, blank=0, waiting=0, importance=0, uma=0, worthy=0, lamentable=0, training=0, mistaken=0, fox=0, content=0, legendary=0, woozy=0, trouble=0, conceited=0, sin=0, just=0, bloody=0, remarry=0, over=2, sole=0, sold=0, brilliant=0, crazed=0, abusive=0, go=0, wearing=0, false=0, obviously=0, sleazy=0, kept=0, grand=1, insomnia=0, disconcerting=0, endearing=0, decidedly=0, fiendish=0, atrocious=0, ludicrous=0, elaborate=0, very=6, expert=0, irrefutable=0, deplorable=0, provoking=0, delayed=0, sick=0, foul=0, superficial=0, easily=0, model=0, believable=0, autistic=0, fear=0, bonnie=0, disbelief=0, understated=0, letdown=0, plaintive=0, lively=0, crotchety=0, whiny=0, annoying=1, sly=0, ornery=0, upset=0, alive=0, unpleasant=0, majestic=0, abhorrent=0, lugubrious=0, ruthless=0, thinking=0, world=1, known=1, handicapped=0, composed=0, mangled=0, prejudice=0, hopefully=0, ability=0, together=0, delight=0, sadly=0, missed=0, positive=0, obnoxious=0, joking=0, off=0, joke=0, virtuoso=0, scummy=0, troublesome=0, complete=0, undeniable=0, forged=0, constant=0, instance=0, dreck=0, liked=0, second=0, confused=0, esteemed=0, fine=0, find=0, patchy=0, international=0, regardless=0, terrible=0, untraditional=0, ideal=0, pleasant=0, hamming=0, difficult=0, fill=0, 
cheating=0, plus=0, convenient=0, background=0, true=0, uninspired=0, malice=0, nonetheless=0, handsome=0, dozens=0, dangerous=0, groveling=0, best=2, decent=0, nonsense=0, eerie=0, troubled=0, loser=0, ok=0, make=1, rescue=0, experienced=0, reprehensible=0, highly=0, certainly=0, unfortunate=0, interesting=0, enthralled=0, cringing=0, intelligent=0, master=0, fright=0, extraordinary=0, selfless=0, due=0, howling=0, evident=0, authentic=0, essentially=0, heroine=0, worthwhile=0, undependable=0, sitting=0, psychological=0, credible=0, threatening=0, moralizing=0, bullshit=0, danger=0, somewhere=0, firm=0, extremely=0, speaking=0, starred=0, clever=0, reputable=0, horrible=0, drawn=0, recent=0, xenophobic=0, inevitable=0, horribly=0, unrealistic=0, underdog=0, miserable=0, wonderful=0, received=1, cracking=0, remaining=1, quality=0, glory=0, disastrous=0, propelling=0, disingenuous=0, animal=0, consistently=0, psychotic=0, sub=0, nevertheless=0, entranced=0, workmanlike=0, cute=0, ravaging=0, fell=0, unfunny=0, frightening=0, wonderfully=0, lecherous=0, apart=0, dirty=0, offensive=0, bother=0, righteous=0, necessary=0, thorough=0, beloved=0, reverend=0, controlled=0, face=0, definitely=0, stab=0, afraid=0, marvelous=0, bum=0, respected=0, randy=0, separate=0, suffering=0, instinct=0, buy=0, reputation=0, express=0, zero=0, amazing=0, trustworthiness=0, instantly=0, climactic=0, awkward=0, reluctant=0, passion=0, redeeming=0, ruin=0, formidable=0, admirable=0, please=0, troubling=0, punk=0, hint=0, know=0, eccentric=0, rough=0, proper=0, ruffian=0, cold=0, beast=0, cole=0, gag=0, bitter=0, contemptible=0, bats=0, shoddy=0, interest=0, damaging=0, hurting=0, missing=0, wonder=0, ungodly=0, gay=0, successful=0, lovely=0, brutal=0, corrupt=0, slight=0, winning=0, tumultuous=0, discernable=0, dramatic=0, damn=0, mediocre=0, superior=0, incredibly=0, imaginary=0, base=0, immensely=0, mom=0, whole=0, tingle=0, fable=0, schlock=0, none=0, fair=0, hell=0, quirky=0, humor=0, 
problem=0, lost=1, depressed=0, still=0, researcher=0, worn=0, lose=0, matt=0, ironically=0, props=0, fail=0, enjoyment=0, enjoyable=0, unredeemable=0, irritating=0, love=0, enjoy=0, gem=0, out=0, laughable=0, seeing=0, dark=0, witty=0, suspenseful=0, gusto=0, rootless=0, entrapment=0, aged=1, fascinating=0, suspect=0, nice=0, stink=0, opinion=0, lots=0, elegance=0, inexorable=0, altogether=0, emotion=0, elements=1, ended=0, cutting=0, fake=0, remorseful=0, squalor=0, upsetting=0, insult=0, magic=0, regrettably=0, villain=0, bizarre=0, perfect=0, utter=0, heartbroken=0, prepared=0, sound=0, preferable=0, healing=0, utterly=0, spite=0, abysmally=0, plain=0, criminal=0, incongruity=0, smelly=0, proud=0, like=5, ill=0, enamored=0, ugly=0, paranoia=0, messy=0, condemning=0, cliched=0, jocular=0, paranoid=0, sheltered=0, safe=0, interrogator=0, honored=0, thrilling=0, trite=0, regret=0, steal=0, irreplaceable=0, congratulations=0, stereotypical=0, pong=0, weep=0, engaging=0, seemingly=0, aware=0, filthy=0, soiled=0, pneumonia=0, ready=0, walking=0, disappointing=0, greatest=0, haunting=0, advantage=0, fault=0, really=0, nifty=0, expensive=0, magical=0, refreshing=0, assured=0, dignity=0, comic=0}
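For illustration, step 2 could be sketched roughly as below. This is a minimal stand-in, not the actual pipeline code: `ReviewVectorizer` and `countVector` are hypothetical names, and the input is assumed to be already tokenized and stop-word filtered.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original project.
class ReviewVectorizer {

    /**
     * Counts how often each vocabulary word occurs in one tokenized review.
     * Tokens outside the vocabulary are ignored; vocabulary words that never
     * occur keep an explicit 0 so every review yields the same feature set.
     */
    static Map<String, Integer> countVector(List<String> tokens, List<String> vocabulary) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String word : vocabulary) {
            vector.put(word, 0);
        }
        for (String token : tokens) {
            // computeIfPresent skips out-of-vocabulary tokens
            vector.computeIfPresent(token, (word, count) -> count + 1);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("good", "bad", "great");
        List<String> review = Arrays.asList("good", "plot", "good", "acting", "bad", "ending");
        System.out.println(countVector(review, vocabulary));
    }
}
```

Running `main` prints `{good=2, bad=1, great=0}`: the two out-of-vocabulary tokens are dropped and the unseen vocabulary word stays at zero.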

3) I am using Weka's Naive Bayes, evaluated with 10-fold cross-validation (k = 10):

Instances trainingSet = new Instances("Data", features.getAttributes(), 2000);         
trainingSet.setClassIndex(0); 


    // process every review, extract the ~1500 feature values and add them to the training-set
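    // NOTE: Instances.add() is not synchronized, so adding from a parallel stream can race;
    // a sequential forEach would avoid that.
    rvw.getReviews().parallelStream().forEach((review) -> {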
        Instance inst = new DenseInstance(features.getNumberOfFeatures()); // = 1501 features

        // get the word vector for this review
        HashMap<String,Integer> wordVector = stringToWordVectorChain.run(review.getReviewText());
        // set the sentiment class to positive or negative label
        features.setClass(inst, review.getPositiveOrNegative()); // sets the class attribute to positive or negative
        features.setFeatureValues(inst, wordVector); // for each feature it will do "setValue" on the instance

        trainingSet.add(inst);
    });


Classifier cModel = (Classifier) new NaiveBayes();
cModel.buildClassifier(trainingSet);

// Test the model
Evaluation eTest = new Evaluation(trainingSet);
eTest.crossValidateModel(cModel, trainingSet, 10, new Random(1));

// print results
String strSummary = eTest.toSummaryString();
System.out.println(strSummary);

I tried different numbers of words (StringToWordVector uses ~1200) and different polarity thresholds; 75% is the best accuracy my solution reaches.

java machine-learning weka
1 Answer

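Reading through Weka's StringToWordVector documentation, there seem to be a few implementation details that differ from yours. Here are the top two, ordered by how likely I think they are to explain the performance difference you are seeing: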

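  • By default, the resulting vector appears to be boolean (i.e. it records whether a word is present, not how many times it occurs).
  • If the class attribute is set before the text is vectorized, a separate dictionary is built for each class, and then all dictionaries are merged.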
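The two points above can be sketched in plain Java, independent of Weka. This is an illustration under assumptions, not Weka's actual implementation: the class and method names are hypothetical, and "top n words per class" is a simplified reading of how the per-class dictionaries are limited before they are merged.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration, not Weka code.
class DictionarySketch {

    /** Point 1: replace each occurrence count by a 0/1 presence flag. */
    static Map<String, Integer> toPresence(Map<String, Integer> counts) {
        Map<String, Integer> presence = new LinkedHashMap<>();
        counts.forEach((word, n) -> presence.put(word, n > 0 ? 1 : 0));
        return presence;
    }

    /** Returns the n most frequent words of one class, most frequent first. */
    static Set<String> topWords(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    /** Point 2: pick the top n words per class, then merge the dictionaries. */
    static Set<String> mergedPerClassDictionary(Map<String, Map<String, Integer>> countsByClass, int n) {
        Set<String> merged = new LinkedHashSet<>();
        for (Map<String, Integer> classCounts : countsByClass.values()) {
            merged.addAll(topWords(classCounts, n));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> positive = new LinkedHashMap<>();
        positive.put("great", 10);
        positive.put("good", 8);
        positive.put("film", 5);
        Map<String, Integer> negative = new LinkedHashMap<>();
        negative.put("bad", 9);
        negative.put("film", 7);
        negative.put("awful", 6);
        Map<String, Map<String, Integer>> countsByClass = new LinkedHashMap<>();
        countsByClass.put("pos", positive);
        countsByClass.put("neg", negative);
        // Top 2 per class, merged: great, good, bad, film
        System.out.println(mergedPerClassDictionary(countsByClass, 2));
    }
}
```

In this toy data, a single global dictionary limited to the top 2 words by summed count would keep only `film` (12) and `great` (10), dropping `good` and `bad`. The per-class merge keeps the strongest words of each class, which is one plausible way the two approaches end up with different feature sets.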

While either of them (or other, more minor differences) could be the culprit, my bet is on the second point.

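The built-in class lets you switch each of these options on and off; you could re-run the 80% version using StringToWordVector with the -C option to use occurrence counts instead of booleans, and with -O to use a single dictionary across both classes.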

This should let you verify whether either of these is indeed the culprit.

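Edit: Regarding the first point, i.e. counting occurrences vs. noting word presence (also known as the multinomial and Bernoulli models, respectively), several academic papers from the '90s studied these differences, e.g. here and here. While the multinomial model usually works better, there are cases where the opposite holds, depending on the corpus and the classification problem.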
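For reference, here is a sketch of the usual textbook form of the two event models. The notation is an assumption introduced here, not taken from the answer or from Weka: $V$ is the vocabulary, $n_w(d)$ the number of times word $w$ occurs in document $d$, $b_w(d) \in \{0, 1\}$ whether it occurs at all, $\theta_{w \mid c}$ the per-position word probability in class $c$ (summing to 1 over $V$), and $p_{w \mid c}$ the probability that a document of class $c$ contains $w$.

```latex
% Multinomial model: each word contributes once per occurrence
P(d \mid c) \;\propto\; \prod_{w \in V} \theta_{w \mid c}^{\,n_w(d)}

% Bernoulli model: each vocabulary word contributes once, whether present or absent
P(d \mid c) \;=\; \prod_{w \in V} p_{w \mid c}^{\,b_w(d)} \,\bigl(1 - p_{w \mid c}\bigr)^{1 - b_w(d)}
```

Note that the Bernoulli form also penalizes a class for vocabulary words that are absent from the document, while the multinomial form ignores absent words entirely.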
