Naive Bayes computation in Perl / Moose


Below is some code I wrote to compute the probability of a label given some observed features, using a Naive Bayes classifier. It implements the Naive Bayes formula without smoothing, and it is meant to compute actual probabilities, so it uses the usually-omitted denominator. My problem is that for the example below, the probability of the "good" label is > 1 (1.30612245). Can anyone help me understand what that means? Is it a by-product of the naive independence assumption?

package NaiveBayes;

use Moose;

has class_counts => (is => 'ro', isa => 'HashRef[Int]', default => sub {{}});
has class_feature_counts => (is => 'ro', isa => 'HashRef[HashRef[HashRef[Num]]]', default => sub {{}});
has feature_counts => (is => 'ro', isa => 'HashRef[HashRef[Num]]', default => sub {{}});
has total_observations => (is => 'rw', isa => 'Num');

sub insert {
    my( $self, $class, $data ) = @_;
    $self->class_counts->{$class}++;
    $self->total_observations( ($self->total_observations||0) + 1 );
    for( keys %$data ){
        $self->feature_counts->{$_}->{$data->{$_}}++;
        $self->class_feature_counts->{$_}->{$class}->{$data->{$_}}++;
    }
    return $self;
}

sub classify {
    my( $self, $data ) = @_;
    my %probabilities;
    for my $class ( keys %{ $self->class_counts } ) {
        my $class_count = $self->class_counts->{$class};
        my $class_probability = $class_count / $self->total_observations;
        my $feature_probability     = 1;
        my $conditional_probability = 1;
        for( keys %$data ){
            my $feature_count = $self->feature_counts->{$_}->{$data->{$_}};
            my $class_feature_count = $self->class_feature_counts->{$_}->{$class}->{$data->{$_}} || 0;
            next unless $feature_count;
            $feature_probability     *= $feature_count / $self->total_observations;
            $conditional_probability *= $class_feature_count / $class_count;
        }
        $probabilities{$class} = $class_probability * $conditional_probability / $feature_probability;
    }
    return %probabilities;
}

__PACKAGE__->meta->make_immutable;
1;

Example:

#!/usr/bin/env perl

use Moose;
use NaiveBayes;

my $nb = NaiveBayes->new;

$nb->insert('good' , {browser => 'chrome'   ,host => 'yahoo'    ,country => 'us'});
$nb->insert('bad'  , {browser => 'chrome'   ,host => 'slashdot' ,country => 'us'});
$nb->insert('good' , {browser => 'chrome'   ,host => 'slashdot' ,country => 'uk'});
$nb->insert('good' , {browser => 'explorer' ,host => 'google'   ,country => 'us'});
$nb->insert('good' , {browser => 'explorer' ,host => 'slashdot' ,country => 'ca'});
$nb->insert('good' , {browser => 'opera'    ,host => 'google'   ,country => 'ca'});
$nb->insert('good' , {browser => 'firefox'  ,host => '4chan'    ,country => 'us'});
$nb->insert('good' , {browser => 'opera'    ,host => '4chan'    ,country => 'ca'});

my %classes = $nb->classify({browser => 'opera', host => '4chan', country =>'uk'});

my @classes = sort { $classes{$a} <=> $classes{$b} } keys %classes;

for( @classes ){
    printf( "%-20s : %5.8f\n", $_, $classes{$_} );
}

Prints:

bad                  : 0.00000000
good                 : 1.30612245

I'm not too worried about the 0 probability, but rather about the "probability" of good being > 1. I believe this is an implementation of the classic Naive Bayes definition:

p(C | F_1, ..., F_n) = p(C) p(F_1 | C) ... p(F_n | C) / ( p(F_1) ... p(F_n) )

How can that be > 1?

machine-learning classification naivebayes
1 Answer

It's been too long since I last used Perl for me to debug this properly, but I think I can see where the problem is. The marginal probability of the feature vector, p(f_1 ... f_n), is not computed the way you're computing it, as a separate calculation with separately estimated parameters. Instead, if your classes c_1 and c_2 have priors p(c_1) and p(c_2), and likelihood terms p(f | c_1) and p(f | c_2), then the marginal probability of f is:

p(c_1) p(f | c_1) + p(c_2) p(f | c_2)

This is why the denominator is so often dropped: it is just the sum of quantities you have already computed. Anything you want to know about relative probabilities can be computed as a ratio of the unnormalized scores, so computing the proportionality constant is only useful when you explicitly need a number between 0 and 1.
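To illustrate, here is a minimal plain-hash sketch (no Moose, using the question's training data) that computes the unnormalized score p(c) * prod p(f|c) for each class and then normalizes by the sum of those scores, instead of by an independently estimated product of feature marginals:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Unnormalized Naive Bayes score: score(c) = p(c) * prod_f p(f|c).
# Dividing each score by the sum of all scores gives a true posterior,
# because p(f) = sum_c p(c) * p(f|c), not a product of marginals.

my (%class_counts, %cfc, $total);
sub insert {
    my ($class, $data) = @_;
    $class_counts{$class}++;
    $total++;
    $cfc{$_}{$class}{ $data->{$_} }++ for keys %$data;
}

insert('good', {browser => 'chrome',   host => 'yahoo',    country => 'us'});
insert('bad',  {browser => 'chrome',   host => 'slashdot', country => 'us'});
insert('good', {browser => 'chrome',   host => 'slashdot', country => 'uk'});
insert('good', {browser => 'explorer', host => 'google',   country => 'us'});
insert('good', {browser => 'explorer', host => 'slashdot', country => 'ca'});
insert('good', {browser => 'opera',    host => 'google',   country => 'ca'});
insert('good', {browser => 'firefox',  host => '4chan',    country => 'us'});
insert('good', {browser => 'opera',    host => '4chan',    country => 'ca'});

my %query = (browser => 'opera', host => '4chan', country => 'uk');

my %score;
for my $class (keys %class_counts) {
    my $s = $class_counts{$class} / $total;           # prior p(c)
    for my $f (keys %query) {
        my $n = $cfc{$f}{$class}{ $query{$f} } || 0;  # count of (f = v, c)
        $s *= $n / $class_counts{$class};             # likelihood p(f|c)
    }
    $score{$class} = $s;
}

my $evidence = 0;
$evidence += $_ for values %score;                    # p(f) = sum of scores

for my $class (sort keys %score) {
    my $p = $evidence ? $score{$class} / $evidence : 0;
    printf "%-20s : %.8f\n", $class, $p;
}
```

With this data the "bad" score is 0 (no "bad" observation has browser => 'opera'), so the normalized posterior for "good" comes out as exactly 1 and for "bad" as 0; the values always sum to 1 rather than exceeding it.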
