Why is accessing values in xarray so slow?

Question

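I need to look up a large amount of historical weather data from the ERA5 dataset (more than 1 million points, each a specific location at a specific timestamp).

I access it through downloaded GRIB files and the xarray package. Unfortunately, the script runs very slowly: each point takes about 250 to 300 ms to process. Most of that time (130 to 200 ms) is spent accessing the data and computing the daily values.

Do you know why this takes so long? Can I optimize it somehow (other than by parallelizing)?

I have already tried making a deep copy of weatherdata_oneday, which changed nothing (it was neither faster nor slower).

My code is: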

import xarray as xr
import pandas as pd
import zoneinfo
import datetime

KELVIN_TO_DEGREE_CELSIUS = -273.15

ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib', 
                     backend_kwargs={'filter_by_keys': {'typeOfLevel':'surface', 'edition': 1}})

coords_times = pd.read_csv("coords_times.csv")


all_results = {}


def to_naive(timestamp_with_timezone):
    timestamp_in_utc = timestamp_with_timezone.astimezone(datetime.timezone.utc)
    naive_timestamp = timestamp_in_utc.replace(tzinfo = None)
    return(naive_timestamp)


for index, row in coords_times.iterrows():
    loc = ds.sel(longitude=[row["lon"]], latitude=[row["lat"]], method="nearest")

    # some timestamp handling - just included for the sake of completeness
    timestamp_exact = datetime.datetime.fromisoformat(row["timestamp"])
    timestamp_start_hour = datetime.datetime(timestamp_exact.year, timestamp_exact.month, timestamp_exact.day, timestamp_exact.hour, 0, 0, tzinfo=timestamp_exact.tzinfo)
    timestamp_current_day_begin = datetime.datetime(timestamp_exact.year, timestamp_exact.month, timestamp_exact.day, 0, 0, 0, tzinfo=timestamp_exact.tzinfo)
    timestamp_current_day_end = timestamp_current_day_begin + datetime.timedelta(hours = 23)

    weatherdata_oneday = loc.sel(time = slice(to_naive(timestamp_current_day_begin), to_naive(timestamp_current_day_end)))

    # critical code part
    # this takes between 130 and 200 ms
    this_day_results = {
            "day_temperature_min" : weatherdata_oneday.t2m.min().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_temperature_max" : weatherdata_oneday.t2m.max().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_temperature_mean" : weatherdata_oneday.t2m.mean().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_cloudcover_mean" : weatherdata_oneday.tcc.mean().values,
            "day_cloudcover_hours_low" : (weatherdata_oneday.tcc.values < 0.2).sum()
    }

    this_results = {**this_day_results}
    all_results[row["id"]] = this_results

Thanks a lot!

Answer 1

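OK, I found the reason. Accessing values in xarray can be very slow because the data is loaded lazily: opening the dataset does not read the values into memory, so they are only read from the file when they are actually accessed.

Adding the following before any access reduced the time needed to build the results dictionary to almost 0.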

ds.load()

So the resulting code is:


[...]

ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib',
                     backend_kwargs={'filter_by_keys': {'typeOfLevel':'surface', 'edition': 1}})

ds.load()

[...]
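As a rough illustration of the pattern above, here is a minimal sketch that loads a dataset once and then computes the daily values for one point from memory. The synthetic dataset is only a stand-in for the GRIB file (so `load()` has nothing to read here), and the `daily_stats` helper and the coordinate values are hypothetical names chosen for this example, not part of the original script.

```python
import numpy as np
import pandas as pd
import xarray as xr

KELVIN_TO_DEGREE_CELSIUS = -273.15

# Assumption: a small synthetic dataset standing in for the opened GRIB file
# (two days of hourly values on a 3 x 4 grid).
times = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
dims = ("time", "latitude", "longitude")
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {
        "t2m": (dims, 280.0 + 5.0 * rng.random((48, 3, 4))),
        "tcc": (dims, rng.random((48, 3, 4))),
    },
    coords={
        "time": times,
        "latitude": [50.0, 50.25, 50.5],
        "longitude": [8.0, 8.25, 8.5, 8.75],
    },
)

# For a file-backed dataset this reads all values into memory once.
# On this in-memory stand-in there is nothing left to read.
ds = ds.load()


def daily_stats(ds, lat, lon, day_begin):
    """Daily values for the grid cell nearest to (lat, lon). Hypothetical helper."""
    point = ds.sel(latitude=lat, longitude=lon, method="nearest")
    one_day = point.sel(time=slice(day_begin, day_begin + pd.Timedelta(hours=23)))
    # Pull the NumPy arrays out once and reuse them for every statistic.
    t2m = one_day["t2m"].values
    tcc = one_day["tcc"].values
    return {
        "day_temperature_min": float(t2m.min()) + KELVIN_TO_DEGREE_CELSIUS,
        "day_temperature_max": float(t2m.max()) + KELVIN_TO_DEGREE_CELSIUS,
        "day_temperature_mean": float(t2m.mean()) + KELVIN_TO_DEGREE_CELSIUS,
        "day_cloudcover_mean": float(tcc.mean()),
        "day_cloudcover_hours_low": int((tcc < 0.2).sum()),
    }


stats = daily_stats(ds, lat=50.2, lon=8.3, day_begin=pd.Timestamp("2020-01-02"))
print(stats)
```

Extracting `.values` once per variable, instead of once per statistic, keeps the number of array accesses per point small. Whether that matters after `load()` was not measured in this thread.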

These resources helped me:

Answer 2


Does ds.load() take a long time to run?
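One way to find out is to time the call directly, and to load only the variables the script uses so that less data has to be read. The sketch below assumes a synthetic stand-in dataset with a hypothetical extra variable `sp`; the timing it prints says nothing about a real GRIB file, where the cost depends on the file size and the storage it sits on.

```python
import time

import numpy as np
import pandas as pd
import xarray as xr

# Assumption: a synthetic stand-in for the opened GRIB dataset, with one
# extra variable ("sp") that the script does not need.
times = pd.date_range("2020-01-01", periods=24, freq=pd.Timedelta(hours=1))
dims = ("time", "latitude", "longitude")
ds = xr.Dataset(
    {name: (dims, np.zeros((24, 3, 4))) for name in ("t2m", "tcc", "sp")},
    coords={
        "time": times,
        "latitude": [50.0, 50.25, 50.5],
        "longitude": [8.0, 8.25, 8.5, 8.75],
    },
)

# Keep only the variables that are actually used before loading.
needed = ds[["t2m", "tcc"]]

start = time.perf_counter()
needed = needed.load()
elapsed = time.perf_counter() - start

print(f"load() took {elapsed:.3f} s for {needed.nbytes / 1e6:.3f} MB")
```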
