如何将线性回归应用于包含 NaN 的大型多维数组中的每个像素?

问题描述 投票:0回答:5

我有一个自变量值的 1D 数组 (

x_array
),它与具有多个时间步长的空间数据 3D numpy 数组中的时间步长相匹配 (
y_array
)。我的实际数据要大得多:300+ 时间步长和高达 3000 * 3000 像素:

import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

我想计算每个像素的线性回归,并获得

xy
中每个
y_array
像素的 R 平方、P 值、截距和斜率,并将
x_array
中每个时间步长的值作为我的自变量。

我可以重塑以获取表单中的数据,然后将其输入到矢量化且快速的

np.polyfit
中:

# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)

# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)

但是,这会忽略包含任何

NaN
值的像素(
np.polyfit
不支持
NaN
值),并且不会计算我需要的统计数据(R 平方、P 值、截距和斜率)。

此处的答案使用

scipy.stats import linregress
计算我需要的统计数据,并建议通过屏蔽这些
NaN
值来避免
NaN
问题。但是,此示例适用于两个一维数组,我无法弄清楚如何将类似的屏蔽方法应用于我的情况,其中
y_array_reshaped
中的每一列将具有一组不同的
NaN
值。

我的问题: 如何以相当快的矢量化方式计算包含许多

NaN
值的大型多维数组 (300 x 3000 x 3000) 中每个像素的回归统计数据?

期望的结果:

y_array
中每个像素的回归统计值(例如R平方)的3 x 3数组,即使该像素在时间序列中的某个点包含
NaN

python arrays numpy scipy python-xarray
5个回答
10
投票

上面评论中提到的这篇博客文章包含一个令人难以置信的快速矢量化函数,用于 Python 中多维数据的互相关、协方差和回归。它生成我需要的所有回归输出,并且在几毫秒内完成,因为它完全依赖于

xarray
中的简单向量化数组运算。

https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

我做了一个小小的更改(

#3
之后的第一行)以确保函数正确地考虑每个像素中不同数量的 NaN 值:

def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time. 
Thus the input data could be a 1D time series, or for example, have three 
dimensions (time, lat, lon). 
Datasets can be provided in any order, but note that the regression slope 
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value, 
and standard error on regression between the two datasets along their 
aligned time dimension.  
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount. 
""" 
#1. Ensure that the data are properly alinged to each other. 
x, y = xr.align(x, y)

#2. Add lag information if any, and shift the data accordingly
if lagx != 0:

    # If x lags y by 1, x must be shifted 1 step backwards. 
    # But as the 'zero-th' value is nonexistant, xr assigns it as invalid 
    # (nan). Hence it needs to be dropped
    x   = x.shift(time=-lagx).dropna(dim='time')

    # Next important step is to re-align the two datasets so that y adjusts
    # to the changed coordinates of x
    x, y = xr.align(x, y)

if lagy!=0:
    y   = y.shift(time=-lagy).dropna(dim='time')
    x, y = xr.align(x, y)

#3. Compute data length, mean and standard deviation along time axis: 
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd  = x.std(axis=0)
ystd  = y.std(axis=0)

#4. Compute covariance along time axis
cov   =  np.sum((x - xmean) * (y - ymean), axis=0) / n

#5. Compute correlation along time axis
cor   = cov / (xstd * ystd)

#6. Compute regression slope and intercept:
slope     = cov / (xstd**2)
intercept = ymean - xmean * slope  

#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor * np.sqrt(n-2) / np.sqrt(1 - cor**2)
stderr = slope / tstats

from scipy.stats import t
pval = t.sf(tstats, n-2) * 2
pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

return cov, cor, slope, intercept, pval, stderr

6
投票

此处提供的答案https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html绝对好,因为它主要利用了

numpy
矢量化和广播的强大功能,但它假设要分析的数据是完整的,但在实际研究周期中通常情况并非如此。上面的一个答案旨在解决丢失数据的问题,但我个人认为需要更新更多代码,因为如果数据中存在 nan,
np.mean()
将返回 nan。幸运的是,
numpy
提供了
nanmean()
nanstd()
等,让我们可以通过忽略数据中的nan来计算平均值、标准误差等。同时,原始博客中的程序针对的是 netCDF 格式的数据。有些人可能不知道这一点,但更熟悉原始的
numpy.array
格式。因此,我在这里提供一个代码示例,展示如何计算两个 3D 维数组(n 维具有相同逻辑)之间的协方差、相关系数等。请注意,为了方便起见,我将
x_array
设为
y_array
第一维的索引,但在实际分析中,
x_array
肯定可以从外部读取。

代码

def linregress_3D(y_array):
    # y_array is a 3-D array formatted like (time,lon,lat)
    # The purpose of this function is to do linear regression using time series of data over each (lon,lat) grid box with consideration of ignoring np.nan
    # Construct x_array indicating time indexes of y_array, namely the independent variable.
    x_array=np.empty(y_array.shape)
    for i in range(y_array.shape[0]): x_array[i,:,:]=i+1 # This would be fine if time series is not too long. Or we can use i+yr (e.g. 2019).
    x_array[np.isnan(y_array)]=np.nan
    # Compute the number of non-nan over each (lon,lat) grid box.
    n=np.sum(~np.isnan(x_array),axis=0)
    # Compute mean and standard deviation of time series of x_array and y_array over each (lon,lat) grid box.
    x_mean=np.nanmean(x_array,axis=0)
    y_mean=np.nanmean(y_array,axis=0)
    x_std=np.nanstd(x_array,axis=0)
    y_std=np.nanstd(y_array,axis=0)
    # Compute co-variance between time series of x_array and y_array over each (lon,lat) grid box.
    cov=np.nansum((x_array-x_mean)*(y_array-y_mean),axis=0)/n
    # Compute correlation coefficients between time series of x_array and y_array over each (lon,lat) grid box.
    cor=cov/(x_std*y_std)
    # Compute slope between time series of x_array and y_array over each (lon,lat) grid box.
    slope=cov/(x_std**2)
    # Compute intercept between time series of x_array and y_array over each (lon,lat) grid box.
    intercept=y_mean-x_mean*slope
    # Compute tstats, stderr, and p_val between time series of x_array and y_array over each (lon,lat) grid box.
    tstats=cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
    stderr=slope/tstats
    from scipy.stats import t
    p_val=t.sf(tstats,n-2)*2
    # Compute r_square and rmse between time series of x_array and y_array over each (lon,lat) grid box.
    # r_square also equals to cor**2 in 1-variable lineare regression analysis, which can be used for checking.
    r_square=np.nansum((slope*x_array+intercept-y_mean)**2,axis=0)/np.nansum((y_array-y_mean)**2,axis=0)
    rmse=np.sqrt(np.nansum((y_array-slope*x_array-intercept)**2,axis=0)/n)
    # Do further filteration if needed (e.g. We stipulate at least 3 data records are needed to do regression analysis) and return values
    n=n*1.0 # convert n from integer to float to enable later use of np.nan
    n[n<3]=np.nan
    slope[np.isnan(n)]=np.nan
    intercept[np.isnan(n)]=np.nan
    p_val[np.isnan(n)]=np.nan
    r_square[np.isnan(n)]=np.nan
    rmse[np.isnan(n)]=np.nan
    return n,slope,intercept,p_val,r_square,rmse

样本输出

我使用这个程序测试了两个227x3601x6301像素的3D阵列,它在20分钟内完成了工作,每个阵列不到10分钟。


5
投票

我不确定这将如何扩展(也许你可以使用 dask),但这里有一个非常简单的方法,使用 pandas

DataFrame
使用 apply 方法来做到这一点:

import pandas as pd
import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

def lin_regress(col):
    "Mask nulls and apply stats.linregress"
    col = col.loc[~pd.isnull(col)]
    return linregress(col.index.tolist(), col)

# Build the DataFrame (each index represents a pixel)
df = pd.DataFrame(y_array.reshape(len(y_array), -1), index=x_array.tolist())

# Apply a our custom linregress wrapper to each function, split the tuple into separate columns
final_df = df.apply(lin_regress).apply(pd.Series)

# Name the index and columns to make this easier to read
final_df.columns, final_df.index.name = 'slope, intercept, r_value, p_value, std_err'.split(', '), 'pixel_number'

print(final_df)

输出:

                 slope  intercept   r_value       p_value   std_err
pixel_number                                                       
0             0.071429  -0.192857  0.188982  8.789623e-01  0.371154
1            -0.071429  -0.207143 -0.188982  8.789623e-01  0.371154
2             0.357143  -0.464286  0.944911  2.122956e-01  0.123718
3             0.105263  -0.289474  0.229416  7.705843e-01  0.315789
4             1.000000  -0.700000  1.000000  9.003163e-11  0.000000
5            -0.285714  -0.328571 -0.188982  8.789623e-01  1.484615
6             0.105263  -0.289474  0.132453  8.675468e-01  0.557000
7            -0.285714  -0.228571 -0.755929  4.543711e-01  0.247436
8             0.071429  -0.392857  0.188982  8.789623e-01  0.371154

4
投票

在 numpy 级别,您可以使用 np.vectorize

首先定义每组数据的棘手部分:

def compute(x,y):
        mask=~np.isnan(y)
        return linregress(x[mask],y[mask])

然后定义将处理数据的函数:

comp = np.vectorize(compute,signature="(k),(k)->(),(),(),(),()")

然后应用,重新组织数据以遵循广播规则:

res = comp(x_array,rollaxis(y_array,0,3))

最后,

final=np.dstack(res) 

现在

final[i,j]
包含由
linregress
返回的像素
(i,j)
的五个参数。

它与 pandas 方法答案大致相当,但速度快 2.5 倍。
300x(100x100 图像)的问题大约需要 5 秒,所以为你的问题计算一个小时。我认为做得更好并不容易,因为时间基本上都花在了

linregress
函数上。


0
投票

我读了罗比的回答,非常棒;它与我的项目配合得很好。然而,P 值的计算可能需要进行一些小的改进,如下所示:

pval = t.sf(np.abs(tstats['var'].values), n['var'].values - 2) * 2.

这个调整是必要的,因为在较新版本的 xarray 中会出现错误。强调 np.abs(tstats) 对于 P 值的正确计算可能非常重要!不管怎样,这样的修改对我来说效果很好。

© www.soinside.com 2019 - 2024. All rights reserved.