How can I use principal component analysis (PCA) to analyze a dataset of 300 samples of model temperature data that vary with altitude?

Question

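I have 300 samples of temperature and altitude data, stored in arrays of size 20x300 (20 altitude levels by 300 samples). Each sample was generated using different characteristics. The 300 temperature profiles vary with altitude, which makes it hard to visualize the data and to determine which of the 300 samples causes the largest variation and which sample number corresponds to the smallest deviation from the nominal temperature profile. I also need to identify samples that show similar patterns or temperature curves. I tried using PCA to reduce the dimensionality to 2 components, but I am not sure how to interpret the result or whether I am heading in the right direction. In the example code below, I reproduce the 300 samples of temperature and altitude data.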

import numpy as np
import matplotlib.pyplot as plt


"""  Create 300 Sample data for Altitude and Temperature"""
# Define the number of rows and columns
num_rows, num_columns = 20, 300

# Create an array of altitudes ranging from 100 km to 500 km
altitudes = np.linspace(100, 500, num_rows)

# Repeat the altitude values across all columns
altitude_array = np.tile(altitudes, (num_columns, 1)).T

# Parameters for generating temperature profiles
altitude_midpoint = (altitudes.min() + altitudes.max()) / 2
temperature_amplitude, temperature_frequency = 250, 0.02

# Create a sine wave to represent temperatures with altitude
temperatures = temperature_amplitude * np.sin(temperature_frequency * (altitudes - altitude_midpoint)) + 750

# Create random noise for slight temperature variations
max_variation = 50
noise = np.random.uniform(-max_variation, max_variation, (num_rows, num_columns))

# Add the noise to the temperatures to create temperature profiles
temperature_array = temperatures[:, np.newaxis] + noise


""" Principal Component Analysis"""

from sklearn.decomposition import PCA

run_number = np.arange(1, 301).reshape(300, 1)

pca = PCA(2)  
principal_components = pca.fit_transform(temperature_array.T)
print(temperature_array.shape)
print(principal_components.shape)


# Create a figure with three subplots
fig, axs = plt.subplots(1, 3, figsize=(12, 5))

# First subplot
# Plot temperature vs. altitude for all 300 profiles
for i in range(num_columns):
    sc1 = axs[0].plot(temperature_array[:, i], altitude_array[:, i], label=f'Profile {i+1}', alpha=0.8)

# Plot the nominal temperature profile with marked lines
nominal_temperature = temperature_amplitude * np.sin(temperature_frequency * (altitudes - altitude_midpoint)) + 750
axs[0].plot(nominal_temperature, altitudes, 'k', linewidth=2, label='Nominal Profile', linestyle='--')

# Set plot labels and title
axs[0].set_ylabel('Altitude (km)')
axs[0].set_xlabel('Temperature (K)')
axs[0].set_title('Temperature vs. Altitude Profiles')

# Second subplot
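# Color each point by run number; the colormap is passed by name because
# plt.cm.get_cmap has been removed from recent Matplotlib releases
sc2 = axs[1].scatter(principal_components[:, 0], principal_components[:, 1],
                     c=run_number.ravel(), edgecolor='none', alpha=0.5,
                     cmap='Accent')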
axs[1].set_xlabel('component 1')
axs[1].set_ylabel('component 2')
fig.colorbar(sc2, ax=axs[1])

# Third subplot
sc3 = axs[2].plot(np.cumsum(pca.explained_variance_ratio_))
axs[2].set_xlabel('number of components')
axs[2].set_ylabel('cumulative explained variance')


plt.tight_layout()
plt.show()

Answer 1

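You are definitely on the right track. Given your goals, using PCA as a starting point to better understand and analyze your data seems like a reasonable strategy.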

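From your code and the resulting figures, it looks like you applied the dimensionality reduction along the altitude dimension. However, you probably intended to perform PCA along the profile dimension. Essentially, your goal is to understand your profiles by identifying a few "latent" profiles that capture most of the variance observed along the altitude dimension. In sklearn terms, altitude is your sample dimension and your temperature profiles are the feature dimension.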

The key part to change is this line:

principal_components = pca.fit_transform(temperature_array)  # <-- no transpose!
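
To see what that change produces, here is a minimal sketch that rebuilds the question's synthetic data and fits the PCA without the transpose. The fixed seed and the names `rng`, `nominal`, `scores` and `corr` are my own additions for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fixed seed: an assumption added here so the sketch is reproducible
rng = np.random.default_rng(0)

# Rebuild the question's synthetic data: 20 altitudes x 300 profiles
altitudes = np.linspace(100, 500, 20)
nominal = 250 * np.sin(0.02 * (altitudes - 300)) + 750
temperature_array = nominal[:, np.newaxis] + rng.uniform(-50, 50, (20, 300))

# No transpose: rows (altitudes) are samples, columns (profiles) are features
pca = PCA(n_components=2)
scores = pca.fit_transform(temperature_array)

print(scores.shape)           # (20, 2): one score per altitude and component
print(pca.components_.shape)  # (2, 300): one loading per profile and component

# The PC1 scores should trace the shape of the nominal profile,
# up to an arbitrary sign and scale
corr = np.corrcoef(scores[:, 0], nominal)[0, 1]
print(f"|corr(PC1 scores, nominal profile)| = {abs(corr):.4f}")
```

Plotting `scores[:, 0]` against `altitudes` would then show the shared profile shape, while `pca.components_[0]` holds one weight per profile.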

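By the way, you can actually see that you used the "wrong" dimension by looking at the principal component scores and the explained variance ratio.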

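PC scores: With the correct orientation, the first principal component shows your mean temperature profile, which indicates a meaningful reduction. Your current formulation, by contrast, mostly reveals noise.
Explained variance ratio: With the dimensions aligned correctly, the first component should account for almost all of the variance, effectively summarizing the data. Subsequent components contribute only minimal additional variance and essentially represent noise. Your current setup, however, shows that all modes are dominated by noise.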
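
The explained variance comparison can be checked with a short sketch along these lines. It assumes the same synthetic data as in the question, and the fixed seed and the variable names `ratio_altitude_rows` and `ratio_profile_rows` are hypothetical. The underlying reason for the difference is that PCA centers each feature across samples: with profiles as rows, centering subtracts the mean over the 300 profiles at every altitude, which removes the nominal curve and leaves only the noise.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fixed seed: an assumption added here so the sketch is reproducible
rng = np.random.default_rng(0)

# Same synthetic data as in the question: 20 altitudes x 300 profiles
altitudes = np.linspace(100, 500, 20)
nominal = 250 * np.sin(0.02 * (altitudes - 300)) + 750
temperature_array = nominal[:, np.newaxis] + rng.uniform(-50, 50, (20, 300))

# Altitudes as samples (no transpose): the shared profile shape survives centering
ratio_altitude_rows = PCA(n_components=2).fit(temperature_array).explained_variance_ratio_

# Profiles as samples (transposed, as in the question): centering removes
# the nominal profile, so only the uniform noise is left to decompose
ratio_profile_rows = PCA(n_components=2).fit(temperature_array.T).explained_variance_ratio_

print("altitudes as samples:", np.round(ratio_altitude_rows, 3))
print("profiles as samples: ", np.round(ratio_profile_rows, 3))
```

In the first case the leading ratio should be close to 1, while in the second the variance is spread thinly over all 20 components.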
