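I want to parse the time from a column of a CSV file that holds a sub-second-precision time series, but pd.to_datetime returns NaT for some of the timestamps.
One quirk of the dataset is that every non-whole second is written as %Y-%m-%d %H:%M:%S.%f, while every whole second is written as %Y-%m-%d %H:%M:%S.
What I observe is that whichever format appears in the first row gets converted, and every row in the other format becomes NaT.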
import pandas as pd
# Example data
timestamps_full_first = [
    "2023-12-30 00:00:00",
    "2023-12-30 00:00:00.1",
    "2023-12-30 00:00:00.9",
    "2023-12-30 00:00:01"
]
timestamps_sub_first = [
    "2023-12-30 00:00:00.1",
    "2023-12-30 00:00:00.9",
    "2023-12-30 00:00:01",
    "2023-12-30 00:00:01.1"
]
# Convert to datetime
datetime_series_full_first = pd.to_datetime(timestamps_full_first, errors='coerce', utc=True)
datetime_series_sub_first = pd.to_datetime(timestamps_sub_first, errors='coerce', utc=True)
print(datetime_series_full_first)
print(datetime_series_sub_first)
Output:
DatetimeIndex(['2023-12-30 00:00:00+00:00', 'NaT', 'NaT',
'2023-12-30 00:00:01+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
DatetimeIndex(['2023-12-30 00:00:00.100000+00:00',
'2023-12-30 00:00:00.900000+00:00',
'NaT',
'2023-12-30 00:00:01.100000+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
My initial solution was to write a custom parse_date function that accepts a list of formats and tries them one by one.
def parse_date(self, date_str, formats = ["none"]):
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt, utc=True)
        except ValueError:
            continue
    return pd.NaT  # Return 'Not a Time' for unrecognized formats
Usage:
data[self.timestamp_col] = data[self.timestamp_col].apply(lambda x: self.parse_date(x, formats = self.timestamp_formats))
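It works, but it is very, very slow compared to pandas' internal parsing, because it calls pd.to_datetime once per value instead of once per column.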
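To give a rough sense of the gap, here is a minimal timing sketch. It assumes a column in a single uniform format so that one vectorized call is possible; the sample size and the names `FMT`, `parse_one`, `slow` and `fast` are made up for illustration, and the actual timings will vary by machine and pandas version.

```python
import time
import pandas as pd

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Hypothetical sample: 5,000 timestamps, all in the fractional-second format
s = pd.Series(
    pd.date_range("2023-12-30", periods=5_000, freq="100ms").strftime(FMT)
)

def parse_one(x):
    # Element-wise parsing: one pd.to_datetime call per value
    return pd.to_datetime(x, format=FMT, utc=True)

t0 = time.perf_counter()
slow = s.apply(parse_one)
t1 = time.perf_counter()
fast = pd.to_datetime(s, format=FMT, utc=True)  # one call for the whole column
t2 = time.perf_counter()

print(f"apply:      {t1 - t0:.3f} s")
print(f"vectorized: {t2 - t1:.3f} s")
```

Both variants should produce the same values; only the number of pd.to_datetime calls differs.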
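GPT suggested a vectorized approach over the column: parse everything once, then use a mask to re-parse only the rows that are still NaT with the alternative format, which should improve performance.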
import pandas as pd

def vectorized_parse_date(date_series, formats):
    # Start from an all-NaT, UTC-aware result aligned to the input index
    result_series = pd.Series(pd.NaT, index=date_series.index, dtype="datetime64[ns, UTC]")
    for fmt in formats:
        # Only try to parse where we don't have a result and the date is not NaN
        mask = result_series.isna() & ~date_series.isna()
        if not mask.any():
            break
        # errors='coerce' leaves non-matching rows as NaT for the next format;
        # errors='raise' would abort the whole batch on the first mismatch
        result_series[mask] = pd.to_datetime(date_series[mask], format=fmt, errors='coerce', utc=True)
    return result_series
# Usage
data[self.timestamp_col] = vectorized_parse_date(data[self.timestamp_col], formats=self.timestamp_formats)
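I haven't tried it yet, because I feel GPT is somewhat biased toward my own approach and is only trying to work out how to do this within my artificial constraints.
So maybe some of you can see another solution that relies on built-in pandas functionality.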
Just use format='mixed' (available since pandas 2.0), which infers the format for each element individually:
out = pd.to_datetime(timestamps_full_first, format='mixed', errors='coerce', utc=True)
Output:
DatetimeIndex([ '2023-12-30 00:00:00+00:00',
'2023-12-30 00:00:00.100000+00:00',
'2023-12-30 00:00:00.900000+00:00',
'2023-12-30 00:00:01+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
Alternatively, parse the column once per format and combine the results with fillna:
s = pd.Series(timestamps_full_first)
out = (pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce', utc=True)
.fillna(pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S', errors='coerce', utc=True))
)
Output:
0 2023-12-30 00:00:00+00:00
1 2023-12-30 00:00:00.100000+00:00
2 2023-12-30 00:00:00.900000+00:00
3 2023-12-30 00:00:01+00:00
dtype: datetime64[ns, UTC]
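As for why the plain call produces NaT in the first place: as far as I understand, pandas 2 guesses a single format from the first element and then applies it strictly to every other element. The sketch below is only an illustration of that behaviour; it assumes the guessed format for the first list is "%Y-%m-%d %H:%M:%S", and the names `inferred` and `explicit` are made up.

```python
import pandas as pd

timestamps_full_first = [
    "2023-12-30 00:00:00",
    "2023-12-30 00:00:00.1",
    "2023-12-30 00:00:00.9",
    "2023-12-30 00:00:01",
]

# No format given: pandas guesses one from the first element
inferred = pd.to_datetime(timestamps_full_first, errors="coerce", utc=True)

# Assumption: this is the format pandas guesses for the first element
explicit = pd.to_datetime(
    timestamps_full_first, format="%Y-%m-%d %H:%M:%S", errors="coerce", utc=True
)

# Both calls should flag the same rows (the fractional-second ones) as NaT
print(inferred.isna().tolist())
print(explicit.isna().tolist())
```

If both lists match, the NaT values come from the strict single-format parse rather than from the data being invalid.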
Using format="ISO8601" (a pandas v2 feature) works well for me:
datetime_series_full_first = pd.to_datetime(timestamps_full_first, format="ISO8601",
utc=True, errors='coerce')
datetime_series_sub_first = pd.to_datetime(timestamps_sub_first, format="ISO8601",
utc=True, errors='coerce')
print(datetime_series_full_first)
print(datetime_series_sub_first)
DatetimeIndex([ '2023-12-30 00:00:00+00:00',
'2023-12-30 00:00:00.100000+00:00',
'2023-12-30 00:00:00.900000+00:00',
'2023-12-30 00:00:01+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
DatetimeIndex(['2023-12-30 00:00:00.100000+00:00',
'2023-12-30 00:00:00.900000+00:00',
'2023-12-30 00:00:01+00:00',
'2023-12-30 00:00:01.100000+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
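Since the data comes from a CSV file, the same format can probably be applied at read time. This is a minimal sketch, assuming pandas >= 2.0 (where read_csv has a date_format parameter); the CSV content and the column names "timestamp" and "value" are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content mixing whole-second and fractional-second rows
csv_text = """timestamp,value
2023-12-30 00:00:00,1
2023-12-30 00:00:00.1,2
2023-12-30 00:00:00.9,3
2023-12-30 00:00:01,4
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["timestamp"],
    date_format="ISO8601",  # requires pandas >= 2.0
)

# The parsed column is timezone-naive here; localize it to match utc=True above
df["timestamp"] = df["timestamp"].dt.tz_localize("UTC")
print(df.dtypes)
```

This avoids a separate conversion pass over the column after loading.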