Conditional sampling from a multivariate kernel density estimate in Python

Question · 0 votes · 3 answers

A multivariate kernel density estimate (KDE) can be created with scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity) and scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html).

Both allow random sampling from the estimated distribution. Is there a way to sample conditionally in either library (or any other library)? In the 2-variable case (x, y), that would mean sampling from P(x|y) (or P(y|x)), i.e. from a cross-section of the probability density (and that cross-section has to be rescaled to unit area under its curve).

import numpy as np
from scipy import stats

x = np.random.random(100)
y = np.random.random(100)
kde = stats.gaussian_kde([x, y])
# sampling from the whole pdf:
kde.resample()

I am looking for something like

# sampling y, conditional on x
kde.sample_conditional(x=1.5)  # does not exist
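Neither library exposes such a method, but the cross-section idea can be sketched directly with scipy's gaussian_kde: evaluate the joint density along a line of fixed x, renormalize it to unit mass over a grid, and draw from the resulting discrete distribution. The grid size and the conditioning value x = 0.5 below are illustrative, not part of any library API:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)
kde = stats.gaussian_kde([x, y])

# evaluate the joint pdf along the cross-section x = 0.5
y_grid = np.linspace(y.min(), y.max(), 2000)
density = kde(np.vstack([np.full_like(y_grid, 0.5), y_grid]))

# rescale the cross-section so it sums to 1, then sample y from it
p = density / density.sum()
samples = rng.choice(y_grid, p=p, size=100)
```

This is discretized, so the samples can only land on grid points, but for a smooth KDE a fine grid makes that error negligible.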
python scikit-learn scipy
3 Answers

1 vote

This function samples one column conditionally on all the other columns.

import numpy as np

def conditional_sample(kde, training_data, columns, values, samples=100):
    # `kde` is a fitted sklearn KernelDensity, `training_data` a pandas
    # DataFrame; all columns except one are fixed via `columns`/`values`
    if len(values) != len(columns):
        raise ValueError("lengths of columns and values should be equal")
    if training_data.shape[1] - len(columns) != 1:
        raise ValueError(f'Expected {training_data.shape[1] - 1} columns/values '
                         f'but {len(columns)} have been given')
    cols_values_dict = dict(zip(columns, values))

    # create an array of candidate points to sample from
    steps = 10000
    X_test = np.zeros((steps, training_data.shape[1]))
    for i in range(training_data.shape[1]):
        col_data = training_data.iloc[:, i]
        X_test[:, i] = np.linspace(col_data.min(), col_data.max(), steps)
    # overwrite the conditioned columns with their fixed values
    for col in columns:
        idx_col_training_data = list(training_data.columns).index(col)
        X_test[:, idx_col_training_data] = cols_values_dict[col]

    # probability of each candidate point (log-densities shifted before
    # exponentiating for numerical stability), normalized to sum to 1
    log_prob = kde.score_samples(X_test)
    prob_dist = np.exp(log_prob - log_prob.max())
    prob_dist /= prob_dist.sum()

    # draw candidate points proportionally to the conditional density
    return X_test[np.random.choice(range(len(prob_dist)), p=prob_dist, size=samples)]
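The same grid-and-renormalize idea can be exercised end to end with sklearn's KernelDensity; the data, the bandwidth, and the conditioning value x = 1.0 in this self-contained sketch are illustrative:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x + rng.normal(size=500)
kde = KernelDensity(bandwidth=0.3).fit(np.column_stack([x, y]))

# candidate points along the cross-section x = 1.0
steps = 5000
X_test = np.column_stack([np.full(steps, 1.0),
                          np.linspace(y.min(), y.max(), steps)])

# score_samples returns log-density; shift, exponentiate, renormalize
log_prob = kde.score_samples(X_test)
prob = np.exp(log_prob - log_prob.max())
prob /= prob.sum()

# draw rows of X_test proportionally to the conditional density
samples = X_test[rng.choice(steps, p=prob, size=100)]
```

Since the data here have y ≈ x + noise, the sampled y values should cluster around 1, which is a quick plausibility check.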

0 votes

This function works for use cases with a one-dimensional condition only. It is built so that the condition does not have to be met exactly, but within some tolerance.

from scipy import stats
import numpy as np

x = np.random.random(1000)
y = np.random.random(1000)
kde = stats.gaussian_kde([x,y])

def conditional_sample(kde, condition, precision=0.5, sample_size=1000):
    # oversample so that roughly `sample_size` points survive the filter below
    block_size = int(sample_size / (2 * precision / 100))
    sample = kde.resample(block_size)

    # keep only points whose last coordinate falls within ±`precision`
    # percentile points of the conditioning value
    lower_limit = np.percentile(sample[-1], stats.percentileofscore(sample[-1], condition) - precision)
    upper_limit = np.percentile(sample[-1], stats.percentileofscore(sample[-1], condition) + precision)

    conditional_sample = sample[:, (sample[-1] > lower_limit) & (sample[-1] < upper_limit)]

    # drop the conditioned (last) row, returning the remaining coordinates
    return conditional_sample[0:-1]

conditional_sample(kde=kde, condition=0.6)

Not a general solution, but it works well for my use case.


0 votes

statsmodels' KDEMultivariateConditional implements a conditional KDE. It can evaluate the conditional pdf and cdf. While it has no built-in sampling method, the generic method of inverse transform sampling can be used (see this answer). Note that this is probably not the most efficient approach. Here is a 2D example:

import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariateConditional
import matplotlib.pyplot as plt
from scipy.optimize import brentq

# generate data with some autocorrelation to make it interesting
x_train = np.random.normal(size=1000)
y_train = x_train + np.random.normal(size=1000)

# train conditional KDE
kde = KDEMultivariateConditional(endog=y_train, exog=x_train, dep_type='c', 
                                 indep_type='c', bw='normal_reference')

def cdf_conditional(x, y_target):
    # cdf of kde conditional on x, evaluated at y_target
    return kde.cdf(np.array(y_target).reshape((-1,1)), np.array(x).reshape((-1,1)))

# inverse-transform-sampling
def sample_conditional_single(x):
    # sample conditional on x
    u = np.random.random()
    # 1-d root-finding
    def func(y):
        return cdf_conditional(x, y) - u
    # the bracket endpoints must give func opposite signs;
    # see the brentq docs about choosing these constants
    sample_y = brentq(func, -99999999, 99999999)
    return sample_y

def sample_conditional(x, n):
    return np.array([sample_conditional_single(x) for _ in range(n)])


# now sample at 2 different x values
x1 = 0
x2 = 2
samples1 = sample_conditional(x1, 100)
samples2 = sample_conditional(x2, 100)

plt.figure()
plt.scatter(x_train, y_train)
plt.xlabel('x')
plt.ylabel('y')
plt.title("training data")
plt.figure()
plt.scatter(np.repeat(np.array(x1), len(samples1)), samples1)
plt.scatter(np.repeat(np.array(x2), len(samples2)), samples2)
plt.xlabel('x')
plt.ylabel('y')
plt.title('sampled conditionally')
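The inverse-transform step itself can be sanity-checked in isolation against a distribution with a known cdf; this standalone sketch draws standard-normal samples using only brentq and norm.cdf (the bracket [-10, 10] covers essentially all the normal's mass):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

rng = np.random.default_rng(0)

def sample_via_inverse_transform(n):
    # draw u ~ Uniform(0, 1) and solve cdf(y) = u for y by 1-d root-finding
    samples = []
    for u in rng.random(n):
        samples.append(brentq(lambda y: norm.cdf(y) - u, -10, 10))
    return np.array(samples)

samples = sample_via_inverse_transform(2000)
# sample mean ≈ 0 and sample std ≈ 1 if the transform is correct
```

The conditional version above works the same way, with the KDE's conditional cdf in place of norm.cdf.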

