AIC & BIC of a PyMC mixture model

Question

I am fitting a straight line to some data with PyMC. The data contain outliers, so I adapted some code that Jake Vanderplas wrote for his textbook (the third example at the link). The method uses a vector-valued variable qi that encodes, for each individual data point, whether it belongs to the foreground model (the line we are fitting) or to a background model that we do not care about.

import numpy as np
import pymc  # this code targets the PyMC 2.x API (pymc.stochastic, pymc.MCMC, pymc.MAP)
import matplotlib.pyplot as plt
from corner import corner


class lin_fit_ol(object):

    '''
    fit a straight line to one independent variable
        (`xi`, with zero errors) and one dependent variable
        (`yi`, with possibly heteroscedastic errors `dyi`)
    Outliers in `yi` are permitted

    Intended to be a complement to a straight-line fit, for model
        testing purposes

    Modified from Vanderplas's code
        (found at http://www.astroml.\
        org/book_figures/chapter8/fig_outlier_rejection.html)
    '''

    def __init__(self, xi, yi, dyi, value):

        self.xi, self.yi, self.dyi, self.value = xi, yi, dyi, value

        @pymc.stochastic
        def beta(value=np.array([0.5, 1.0])):
            """Slope and intercept parameters for a straight line.
            The likelihood corresponds to the prior probability of the parameters."""
            slope, intercept = value
            prob_intercept = 1 + 0 * intercept
            # uniform prior on theta = arctan(slope)
            # d[arctan(x)]/dx = 1 / (1 + x^2)
            prob_slope = np.log(1. / (1. + slope ** 2))
            return prob_intercept + prob_slope

        @pymc.deterministic
        def model(xi=xi, beta=beta):
            slope, intercept = beta
            return slope * xi + intercept

        # uniform prior on Pb, the fraction of bad points
        Pb = pymc.Uniform('Pb', 0, 1.0, value=0.1)

        # uniform prior on Yb, the centroid of the outlier distribution
        Yb = pymc.Uniform('Yb', -10000, 10000, value=0)

        # uniform prior on log(sigmab), the spread of the outlier distribution
        log_sigmab = pymc.Uniform('log_sigmab', -10, 10, value=5)

        # qi is bernoulli distributed
        # Note: this syntax requires pymc version 2.2
        qi = pymc.Bernoulli('qi', p=1 - Pb, value=np.ones(len(xi)))

        @pymc.deterministic
        def sigmab(log_sigmab=log_sigmab):
            return np.exp(log_sigmab)

        def outlier_likelihood(yi, mu, dyi, qi, Yb, sigmab):
            """likelihood for full outlier posterior"""
            Vi = dyi ** 2
            Vb = sigmab ** 2

            root2pi = np.sqrt(2 * np.pi)

            logL_in = -0.5 * np.sum(
                qi * (np.log(2 * np.pi * Vi) + (yi - mu) ** 2 / Vi))

            logL_out = -0.5 * np.sum(
                (1 - qi) * (np.log(2 * np.pi * (Vi + Vb)) +
                            (yi - Yb) ** 2 / (Vi + Vb)))

            return logL_out + logL_in

        OutlierNormal = pymc.stochastic_from_dist(
            'outliernormal', logp=outlier_likelihood, dtype=float,
            mv=True)

        y_outlier = OutlierNormal(
            'y_outlier', mu=model, dyi=dyi, Yb=Yb, sigmab=sigmab, qi=qi,
            observed=True, value=yi)

        self.M = dict(y_outlier=y_outlier, beta=beta, model=model,
                      qi=qi, Pb=Pb, Yb=Yb, log_sigmab=log_sigmab,
                      sigmab=sigmab)

        self.sample_invoked = False

    def sample(self, iter, burn, calc_deviance=True):
        self.S0 = pymc.MCMC(self.M)
        self.S0.sample(iter=iter, burn=burn)
        # columns follow the `beta` ordering: [slope (m), intercept (b)]
        self.trace = self.S0.trace('beta')
        self.mtrace = self.trace[:, 0]
        self.btrace = self.trace[:, 1]

        self.sample_invoked = True

    def triangle(self):
        assert self.sample_invoked == True, \
            'Must sample first! Use sample(iter, burn)'

        corner(self.trace[:], labels=['$m$', '$b$'])

    def plot(self, xlab='$x$', ylab='$y$'):
        # plot the data points
        plt.errorbar(self.xi, self.yi, yerr=self.dyi, fmt='.k')

        # do some shimmying to get quantile bounds
        xa = np.linspace(self.xi.min(), self.xi.max(), 100)
        A = np.vander(xa, 2)
        # generate all possible lines
        lines = np.dot(self.trace[:], A.T)
        quantiles = np.percentile(lines, [16, 84], axis=0)
        plt.fill_between(xa, quantiles[0], quantiles[1],
                         color="#8d44ad", alpha=0.5)

        # plot circles around points identified as outliers
        qi = self.S0.trace('qi')[:]
        Pi = qi.astype(float).mean(0)
        outlier_x = self.xi[Pi < 0.32]
        outlier_y = self.yi[Pi < 0.32]
        plt.scatter(outlier_x, outlier_y, lw=1, s=400, alpha=0.5,
                    facecolors='none', edgecolors='red')

        plt.xlabel(xlab)
        plt.ylabel(ylab)

    def ICs(self):
        self.MAP = pymc.MAP(self.M)
        self.MAP.fit()

        self.BIC = self.MAP.BIC
        self.AIC = self.MAP.AIC
        self.logp = self.MAP.logp
        self.logp_at_max = self.MAP.logp_at_max
        return self.AIC, self.BIC
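
For context, here is a minimal usage sketch (the synthetic data and iteration counts are illustrative, not from the question; note that the `value` constructor argument is stored but never used by the class as written, so I simply pass `yi`):

import numpy as np

# synthetic data with a few injected outliers (illustrative only)
np.random.seed(0)
xi = np.linspace(0, 10, 50)
yi = 2.0 * xi + 1.0 + np.random.normal(0, 0.5, xi.size)
yi[[5, 20, 40]] += 15.0
dyi = 0.5 * np.ones_like(xi)

fit = lin_fit_ol(xi, yi, dyi, value=yi)
fit.sample(iter=20000, burn=5000)
aic, bic = fit.ICs()
print(aic, bic)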

So, when we compute the BIC and AIC with this model, we get very large values (because there are many points). That makes sense. However, it penalizes having many data points, which bothers me. Moreover, the large AIC and BIC would lead a casual observer to believe that the other model (which fits poorly as a consequence of the outliers) is actually the better model.

Am I missing some subtlety of BIC and AIC here, or is it a harsh reality of using mixture models that you always have to pay for a bunch of extra binary parameters denoting the membership of each of your data points?
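
To make the concern concrete, here is a rough illustration (my own sketch, not from the question) of how the parameter count scales if the membership flags are counted: the model has 2 line parameters plus Pb, Yb and log_sigmab, plus one Bernoulli qi per data point, so with the textbook definitions AIC = 2k - 2 ln L and BIC = k ln N - 2 ln L the penalty terms grow with the number of points N:

import numpy as np

def penalty_terms(n_points, n_line_params=2, n_mixture_params=3):
    # k = line parameters + (Pb, Yb, log_sigmab) + one qi flag per point
    k = n_line_params + n_mixture_params + n_points
    return 2 * k, k * np.log(n_points)  # AIC penalty, BIC penalty

for n in (10, 100, 1000):
    print(n, penalty_terms(n))

Whether pymc.MAP actually counts the discrete qi variables this way is exactly what the question is about.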

python bayesian pymc mcmc model-comparison
1 Answer

I recommend the book "Introduction to Statistical Learning".

On page 212 you can find the formulas for AIC and BIC. In each of these formulas the number of samples appears in the denominator, so the result should not be affected by the sample size. At least not in an obvious way.
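
For reference, the standard maximum-likelihood definitions are (a sketch from general knowledge, not a quote from the book; the book's least-squares versions are rescaled by the sample size n, which is the denominator referred to above):

\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}

where k is the number of fitted parameters, n is the number of observations, and \hat{L} is the maximized likelihood.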
