根据时间间隔的观测数据进行合成

Question

我需要根据热力学检索数据制作复合时间高度图。这些检索是十分钟的平均剖面（电位温度、水蒸气混合比等），但偶尔会出现我需要考虑的数据间隙。高度都是相等的，所以这不是问题。

我正在尝试制作一个 4 小时长的合成图，以海风通道为中心。因此，对于每种情况，这个时间都不同，但合成时间应该是 4 小时。由于时间不同，我不能简单地制定一个理想的时间变量（比如 18-23UTC），而是需要一个相对时间变量（-2hr 到 +2hr）。

我目前在处理丢失数据时遇到问题。目前的数据情况是，如果有缺失数据，那么它只会在时间上跳跃（例如数据 16-18，然后是 19-24）。

我相信我需要创建一个虚拟变量，循环访问数据，并用正确的时间索引中的数据填充虚拟变量，但考虑到时间会根据日期而变化，我不确定如何做到这一点。

数据文件示例可在以下位置找到：https://data.nssl.noaa.gov/thredds/catalog/FRDD/CLAMPS/clamps/clamps1/processed/clampstropoe10.aeri_mwr.v1.C1/catalog.html

以下是代码片段，包括读取文件/数据、创建开始/结束时间、查找开始/结束时间的索引以及提取相应的数据。

但是，如果在开始/结束时间发生时出现数据间隙，这就会成为问题。

`"""
Testing for composites
"""

from ipywidgets import interact
from datetime import datetime, date, timedelta
from siphon.catalog import TDSCatalog
import cmocean
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
from netCDF4 import Dataset
import os
import xarray as xr
import glob
import pandas as pd

def find_closest(A, target):
    """
    This function finds the index of the closest value to a 
    target value with a specified array.
    Source: Jeremy Gibbs (NOAA-NSSL)
    
    Args:
        A (array): Data to search for closest value
        target (float): Value being searched for
    Returns:
        idx (int): Index of A with value closest to target.
    """

    idx = A.searchsorted(target) # A must be sorted
    idx = np.clip(idx, 1, len(A)-1)
    left = A[idx-1]
    right = A[idx]
    idx -= target - left < right - target
    
    return idx

def decimal_hours_to_datetime(decimal_hours, years, months, days):
    datetime_list = []
    for i, hours in enumerate(decimal_hours):
        start_time = datetime(year=years[i], month=months[i], day=days[i])
        
        # Calculate the timedelta corresponding to the decimal hours
        timedelta_hours = timedelta(hours=hours)
        
        # Add the timedelta to the start time to get the datetime object
        datetime_obj = start_time + timedelta_hours
        
        datetime_list.append(datetime_obj)
    
    return datetime_list

savelocation = "/Users/michellerosespencer/Documents/phd/"

    # Make lists with all the data needed for c1
input_files_c1 = glob.glob('/Users/michellerosespencer/Documents/phd/pblh_invest/data/sea_breeze_days/clampstropoe10.aeri.v1.C1.*.nc')
input_files_c1.sort()

pblh_files_c1 = glob.glob('/Users/michellerosespencer/Documents/phd/pblh_invest/data/sea_breeze_days/*C1fuz*.nc')
pblh_files_c1.sort()

#%%
    # Create a list of dates for C1 sea-breeze days
    # This list INCLUDES the days after
dt_c11 = [] 
for a in range(5,7):
    dat = date(2022, 6, a).isoformat()
    dt_c11.append(dat)
    
for a in range(8,13):
    dat = date(2022, 6, a).isoformat()
    dt_c11.append(dat)
#%%

    # Turn decimal hours/days into datetime objects for c1
decimal_hours_c1 = [1]
years_c1 = [2022]
months_c1 = [6]
days_c1 = [11]

datetime_objects_c1 = decimal_hours_to_datetime(decimal_hours_c1, years_c1, months_c1, days_c1)
sb_times_c1 = datetime_objects_c1

#%%
    # Create emply lists so we can append our daily, sea-breeze sliced 6 hour window variables to a master 
    # array, which we can then average to create our composites of each variable
comp_temp = []
comp_wv = []
comp_pblh = []

#%%
    # Make common time grid - Master variable
    t_bin = timedelta(minutes=10)
    # t_bin = np.arange(-7195.0, 7805.0, 600)

    # Initialize empty arrays for composites based on dimensions of master time and height arrays
    composite_pblh = np.zeros((len(t_bin), 55))
    composite_temp = np.zeros((len(t_bin), 55))
    composite_wv = np.zeros((len(t_bin), 55))

    for j in range(0, len(dt_c11)):       
        if j == 5:
            # Open the dataset
            nc_thermo1 = Dataset(input_files_c1[j])
            nc_thermo2 = Dataset(input_files_c1[j+1])
            nc_pblh1 = Dataset(pblh_files_c1[j])
            nc_pblh2 = Dataset(pblh_files_c1[j+1])
        
            t_1 = nc_thermo1['base_time'][:] + nc_thermo1['time_offset'][:]
            t_2 = nc_thermo2['base_time'][:] + nc_thermo2['time_offset'][:]
            time_1 = np.append(t_1, t_2)
            hour = np.array([datetime.utcfromtimestamp(e) for e in time_1])
        
            h_1 = nc_thermo1.variables['height'][:]
            h_2 = nc_thermo2.variables['height'][:]
            height = h_1
            temp1 = nc_thermo1.variables['theta'][:,:]
            temp2 = nc_thermo2.variables['theta'][:,:]
            temp = np.append(temp1, temp2, axis = 0)
            wv1 = nc_thermo1.variables['waterVapor'][:,:]
            wv2 = nc_thermo2.variables['waterVapor'][:,:]
            wv = np.append(wv1, wv2, axis = 0)
            cbh1 = nc_thermo1.variables['cbh'][:]
            cbh2 = nc_thermo2.variables['cbh'][:]
            cbh = np.append(cbh1, cbh2, axis = 0)
        
        t_1 = nc_pblh1['t_epoch'][:]
        t_2 = nc_pblh2['t_epoch'][:]
        time_2 = np.append(t_1,t_2)
        hour2 = np.array([datetime.utcfromtimestamp(e) for e in time_2])
        
        pblh1 = nc_pblh1.variables['pblh'][:]
        pblh2 = nc_pblh2.variables['pblh'][:]
        pblh = np.append(pblh1, pblh2, axis = 0)
        
        t1 = nc_thermo1['hour'][:]
        t2 = nc_thermo2['hour'][:]
        t2 = t2 + 24
        t = np.append(t1, t2)
        
        nc_thermo1.close()
        nc_thermo2.close()
        nc_pblh1.close()
        nc_pblh2.close()
        # Create start and end times, so we can index the variables
        t_delta = timedelta(hours=2)
        start_time = sb_times_c1 - t_delta
        end_time = sb_times_c1 + t_delta
        
        # Find indices for start and end times
        start_idx = np.argmin(np.abs(hour2 - start_time))
        end_idx = np.argmin(np.abs(hour2 - end_time))
        
        start_idx_thermo = np.argmin(np.abs(hour - start_time))
        end_idx_thermo = np.argmin(np.abs(hour - end_time))
        
        # Pull data corresponding to the indices
        temp = temp[start_idx_thermo:end_idx_thermo, :]
        wv = wv[start_idx_thermo:end_idx_thermo, :]
        pblh = pblh[start_idx:end_idx]
        cbh = cbh[start_idx_thermo:end_idx_thermo]
        
        # Create ideal time axis/window
        window_time = hour[start_idx_thermo:end_idx_thermo]
        rel_time = window_time - sb_times_c1
        
        # Convert relative time window into integers (seconds)
        for i in range(0, len(rel_time)):
            rel_time[i] = rel_time[i].total_seconds()`

Answer 1

由于代码使用 pandas 模块，因此它将能够利用数据帧原生的方法。

这里有一些您可以立即查看的内容...

df = df.fillna(0) # fills all blanks with 0

# Fill with a specific value based on column
df['column1'] = df['column1'].fillna('missing')

df = df.fillna(method='ffill') # use last valid value
df = df.fillna(method='bfill') # use next valid value

# Fill with mean, median or mode of column
df['column2'].fillna(df['column2'].mean(), inplace=True)

# Fill blanks with value from previous row
df = df.fillna(method='pad')

# Fill based on different methods per column
df = df.fillna({'col1': 'missing', 'col2': df['col2'].median()})

上面还有很多变体......

一个好的建议是将它们放入单独的“数据清理”函数或方法中，该函数或方法接受数据帧对象并返回清理后的数据帧对象...

看起来像这样：

def clean_data(df:pd.DataFrame) -> pd.DataFrame:
    ''' function to clean data and return it neatly '''

    # some code from the above cleaning methods.
    ...

    return df

根据时间间隔的观测数据进行合成

问题描述投票：0回答：1

1个回答

最新问题

根据时间间隔的观测数据进行合成

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1