使用加权 KDE（核密度估计）绘制加权直方图

Question

我想将两个数据分布绘制为加权直方图，并带有加权核密度估计 (KDE) 图。

数据（DNA 片段的

length

，按分类变量

regions

分割）是

(0, 1e8)

区间内的整数。我可以使用下面的 python 代码毫无问题地绘制默认的、未加权的直方图和 KDE。该代码为

testdata

变量中的输入数据的小示例绘制直方图。请参阅下面的未加权（默认）直方图。

我想生成一个不同的图，其中直方图中的数据由 length（= X 轴数值变量）进行

加权

。我使用了

weights

选项（seaborn.histplot —seaborn 文档）：

weights
：
data
中的向量或键如果提供，请通过这些因素对相应数据点对每个箱中计数的贡献进行加权。

直方图按预期变化（参见下面的加权直方图）。 但是 KDE（核密度估计）线没有改变。

问题：如何更改核密度估计 (KDE) 以反映我正在使用加权直方图的事实？

未加权（默认）直方图：

加权直方图：

具有最小可重现示例的代码：

import io
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def plot_restriction_digest(df, out_file_base, weights):
     
    # Prevent the python icon from showing in the dock when the script is
    # running:
    matplotlib.use('Agg')
    
    sns.set_theme(style='ticks')
    f, ax = plt.subplots(figsize=(7, 5))
    sns.despine(f)

    hist = sns.histplot(data=df,
                        x='length',
                        hue='regions',

                        weights=weights,
                        
                        # Normalize such that the total area of the histogram
                        # equals 1:
                        stat='density',
                        
                        # stat='count',
                        
                        # Make all histograms visible, otherwise 'captured'
                        # regions histogram is much smaller than 'all' regions
                        # one:
                        common_norm=False,
                        
                        # Default plots too many very thin bins, which are poorly
                        # visible in pdf format (OK in png). Note that 10 bins is
                        # too crude, and 1000 bins makes too many thin bins:
                        bins=100,
                        
                        # X axis log scale:
                        log_scale=True,
                        
                        # Compute a kernel density estimate to smooth the
                        # distribution and show on the plot as lines:
                        kde=True,
                        )
    sns.move_legend(hist, 'upper left')
    plt.savefig(f'{out_file_base}.pdf')
    return

testdata="""
1   all
1   all
2   all
2   all
2   all
3   all
4   captured
4   captured
5   captured
5   captured
5   captured
8   captured
"""

# Default histograms:
df = pd.read_csv(io.StringIO(testdata), sep='\s+', header=None, names='length regions'.split())
plot_restriction_digest(df, 'test_tiny', None)

# Weighted histograms:
df = pd.read_csv(io.StringIO(testdata), sep='\s+', header=None, names='length regions'.split())
plot_restriction_digest(df, 'test_tiny_weighted', 'length')

print('Done.')

备注：

数据的两种分布是两种类型基因组区域的 DNA 片段长度：“全部”和“捕获”，但这与这个具体问题无关。
最小的可重现示例说明了这个问题。真实的数据框有数千万行，因此直方图和 KDE 图更加平滑且有意义。实际数据需要对 X 轴进行对数转换，以更好地区分两个广泛的分布。
我正在使用这些软件包和版本：

Python 3.11.6

matplotlib-base           3.8.2           py311hfdba5f6_0    conda-forge
numpy                     1.26.3          py311h7125741_0    conda-forge
pandas                    2.2.0           py311hfbe21a1_0    conda-forge
seaborn                   0.13.1               hd8ed1ab_0    conda-forge
seaborn-base              0.13.1             pyhd8ed1ab_0    conda-forge

Answer 1

样本量非常小的直方图和 kde 图通常不能很好地指示更合适的样本量情况下事物的表现。在这种情况下，默认带宽对于该特定数据集来说似乎太宽。带宽可以通过

bw_adjust

的

kdeplot

参数进行缩放。创建

histplot

时，可以通过

kde

字典提供

kde_kws

的参数（“关键字”）。

在我的测试中，对于测试权重和对数标度，我使用了一些

值，并给每个值赋予了 1，除了其中两个之外，给了它们一个非常高的权重。

可以通过取 x 值的对数来模拟对数刻度。轴将显示对数值，这将不如原始值直观。对于小数据集和整数权重，

np.repeat

可以创建每个值重复的数据集。由于内存限制，这不适用于非常大的数据集。

为了快速测试真实数据集的设置和行为，可以通过随机采样来减少等待时间（例如

df.sample(10000)

）。

下面的测试代码表明权重和对数刻度似乎都按预期工作。一个区别是两种情况下的默认带宽不同；使用不同的

bw_scale

进行补偿。

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 10, 20, 30, 40, 50, 60],
                   'w': [1, 10, 1, 1, 1, 10, 1]})
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(9, 6))

sns.histplot(df, x='x', bins=10, weights='w', kde=True, log_scale=True, stat='density',
             kde_kws={'bw_adjust': 0.3}, ax=ax1)
ax1.set_xticks(df['x'].to_numpy(), df['x'].to_numpy())
ax1.set_title('sns.histplot with log scale, weights and kde')

sns.histplot(x=np.log(np.repeat(df['x'], df['w'])), bins=10, kde=True, stat='density',
             kde_kws={'bw_adjust': 0.55}, ax=ax2)
ax2.set_title('mimicing scale and weights')

plt.tight_layout()
plt.show()

使用加权 KDE（核密度估计）绘制加权直方图

问题描述投票：0回答：1

1个回答

最新问题

使用加权 KDE（核密度估计）绘制加权直方图

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1