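I have a basic pandas dataframe in Python that takes in data and plots a line graph. Each data point has a timestamp. When the data file is behaving, consecutive timestamps are ideally about 30 minutes apart. Sometimes, though, no data arrives for more than an hour. When that happens I want to mark the time span as "missing" and draw a discontinuous line plot that clearly shows where data was lost.
I'm having a hard time working out how to do this, or even how to search for a solution, because the problem is so specific. The data is "live" and keeps updating, so I can't just pick out one region and edit it by hand as a workaround.
Something like this: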
The code used to create the datetime column:
# combine the separate time columns into one datetime column
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])
I've already figured out how to compute the time difference, which involves creating a new column. Here is the code:
# diff() subtracts the previous row's timestamp; the first row has no predecessor and gets NaT
df['timediff'] = df['datetime'].diff()
A basic look at the dataframe:
datetime l1 l2 l3
2019-02-03 01:52:16 0.1 0.2 0.4
2019-02-03 02:29:26 0.1 0.3 0.6
2019-02-03 02:48:03 0.1 0.3 0.6
2019-02-03 04:48:52 0.3 0.8 1.4
2019-02-03 05:25:59 0.4 1.1 1.7
2019-02-03 05:44:34 0.4 1.3 2.2
I'm just not sure how to create a discontinuous "live" plot that takes the time difference into account.
Thanks in advance.
It's not exactly what you asked for, but a quick and elegant solution is to resample your data.
df = df.set_index('datetime')
df
l1 l2 l3
datetime
2019-02-03 01:52:16 0.1 0.2 0.4
2019-02-03 02:29:26 0.1 0.3 0.6
2019-02-03 02:48:03 0.1 0.3 0.6
2019-02-03 04:48:52 0.3 0.8 1.4
2019-02-03 05:25:59 0.4 1.1 1.7
2019-02-03 05:44:34 0.4 1.3 2.2
# '30min' is the current spelling of the older '30T' frequency alias
df['l1'].resample('30min').mean().plot(marker='*')
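To see why resampling produces the break, it may help to look at the resampled values rather than the plot. The sketch below rebuilds the sample data from the question (the construction code is my own, not from the original post) and prints which 30-minute bins end up empty.

```python
import pandas as pd

# Sample data copied from the question
index = pd.to_datetime(
    [
        "2019-02-03 01:52:16",
        "2019-02-03 02:29:26",
        "2019-02-03 02:48:03",
        "2019-02-03 04:48:52",
        "2019-02-03 05:25:59",
        "2019-02-03 05:44:34",
    ]
).rename("datetime")
df = pd.DataFrame(
    {
        "l1": [0.1, 0.1, 0.1, 0.3, 0.4, 0.4],
        "l2": [0.2, 0.3, 0.3, 0.8, 1.1, 1.3],
        "l3": [0.4, 0.6, 0.6, 1.4, 1.7, 2.2],
    },
    index=index,
)

# Average each 30-minute bin; bins that received no samples come out as NaN
resampled = df["l1"].resample("30min").mean()
print(resampled)

# The empty bins are where the plotted line will be interrupted
empty_bins = resampled[resampled.isna()].index
print(empty_bins)
```

For this sample the bins starting at 03:00, 03:30 and 04:00 are empty, so the line is not drawn across them. The trade-off is that the plotted points sit on the 30-minute grid, not at the original timestamps.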
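If you absolutely need to plot every sample at its exact time, you can split the data wherever the difference between consecutive timestamps exceeds some threshold, and plot each chunk separately.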
from datetime import timedelta
import numpy as np
import matplotlib.pyplot as plt

# get difference between consecutive timestamps
dt = df.index.to_series()
td = dt - dt.shift()
# generate a new group index every time the time difference exceeds
# an hour
gp = np.cumsum(td > timedelta(hours=1))
# get current axes, plot all groups on the same axes
ax = plt.gca()
for _, chunk in df.groupby(gp):
    chunk['l1'].plot(marker='*', ax=ax)
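The grouping trick is easier to follow with the intermediate values printed. This is a small sketch on the sample timestamps from the question (the data construction is mine); it only computes the group labels and chunk sizes and leaves the plotting out.

```python
import pandas as pd

# Timestamps and l1 values copied from the question
index = pd.to_datetime(
    [
        "2019-02-03 01:52:16",
        "2019-02-03 02:29:26",
        "2019-02-03 02:48:03",
        "2019-02-03 04:48:52",
        "2019-02-03 05:25:59",
        "2019-02-03 05:44:34",
    ]
).rename("datetime")
df = pd.DataFrame({"l1": [0.1, 0.1, 0.1, 0.3, 0.4, 0.4]}, index=index)

# Time since the previous sample (NaT for the first row, which compares as False)
td = df.index.to_series().diff()

# Every gap longer than an hour bumps the running count, starting a new group
gp = (td > pd.Timedelta(hours=1)).cumsum()
print(gp.tolist())  # [0, 0, 0, 1, 1, 1]

# Each chunk is a contiguous run of samples that can be drawn as its own line
chunks = [chunk for _, chunk in df.groupby(gp)]
print([len(chunk) for chunk in chunks])  # [3, 3]
```

Because the comparison yields booleans, the cumulative sum only increases on rows that follow a long gap, so all rows between two gaps share one label.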
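Alternatively, you can inject "holes" into the data. A NaN row interrupts the plotted line, so placing one just before each late sample produces the gap.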
# find samples which occurred more than an hour after the previous
# sample (td is the series of time differences computed above)
holes = df.loc[td > timedelta(hours=1)].copy()
# "holes" occur just before these samples
holes.index -= timedelta(microseconds=1)
# add the holes to the data in time order, set values to NaN
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, holes]).sort_index()
df.loc[holes.index] = np.nan
# plot series
df['l1'].plot(marker='*')
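Since the data in the question is "live", it may be convenient to wrap the hole-injection step in a function that can be re-applied each time the data refreshes. The helper below is a sketch under that assumption; `insert_gaps` and its `max_gap` parameter are hypothetical names of my own, not part of pandas, and it assumes the frame has a sorted `DatetimeIndex`.

```python
import numpy as np
import pandas as pd


def insert_gaps(df, max_gap=pd.Timedelta(hours=1)):
    """Return a copy of df with an all-NaN row placed just before every
    sample that arrives more than max_gap after the previous one.

    Hypothetical helper (not a pandas API); assumes a sorted DatetimeIndex.
    """
    td = df.index.to_series().diff()
    late = df.index[(td > max_gap).to_numpy()]
    if len(late) == 0:
        return df.copy()
    # One all-NaN row per gap, stamped a microsecond before the late sample
    holes = pd.DataFrame(
        np.nan,
        index=late - pd.Timedelta(microseconds=1),
        columns=df.columns,
    )
    return pd.concat([df, holes]).sort_index()


# Sample data copied from the question
index = pd.to_datetime(
    [
        "2019-02-03 01:52:16",
        "2019-02-03 02:29:26",
        "2019-02-03 02:48:03",
        "2019-02-03 04:48:52",
        "2019-02-03 05:25:59",
        "2019-02-03 05:44:34",
    ]
).rename("datetime")
df = pd.DataFrame(
    {"l1": [0.1, 0.1, 0.1, 0.3, 0.4, 0.4], "l2": [0.2, 0.3, 0.3, 0.8, 1.1, 1.3]},
    index=index,
)

gapped = insert_gaps(df)
print(gapped)
```

The function returns a new frame and leaves the input untouched, so it can be called on the freshly loaded data before every redraw, for example `insert_gaps(df)['l1'].plot(marker='*')`.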
Edit: @Igor Raush gave a better answer, but I'm leaving this one up anyway because the visualization is a bit different.
See if this helps you:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Track the elapsed time in seconds since the first sample
# total_seconds() is used instead of .seconds because .seconds wraps around after one day
df['timediff'] = (df['datetime'] - df['datetime'].shift(1)).dt.total_seconds().cumsum().fillna(0)
# Create a dataframe with one row per second in the time range (+1 so the last sample is included)
all_times_df = pd.DataFrame(np.arange(df['timediff'].min(), df['timediff'].max() + 1), columns=['timediff']).set_index('timediff')
# Join the dataframes and forward-fill the nulls, so the values change only where data has been received
live_df = all_times_df.join(df.set_index('timediff')).ffill()
# Plot only your desired columns
live_df[['l1', 'l3']].plot()
plt.show()
I solved the problem using my new timediff column and df.loc.
# diff() leaves NaT on the first row, so the first row is never flagged as following a gap
df['timediff'] = df['datetime'].diff()
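With this, I can get the time difference for every row.
Then, using df.loc, I find the rows where timediff is greater than one hour and set the l1 and l2 values there to NaN. The result is a missing line segment at that point in the plot, which is exactly what I wanted.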
# flag rows that arrive more than an hour after the previous one, then blank l1 and l2 there
gap = df['timediff'] > pd.Timedelta(hours=1)
df.loc[gap, ['l1', 'l2']] = np.nan
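One thing to be aware of with this approach is that it overwrites the measurement on the first row after each gap. If the original values need to be kept, a possible variant is to build a masked copy just for plotting. The sketch below uses the sample data from the question; the variable names `late` and `masked` are my own.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2019-02-03 01:52:16",
                "2019-02-03 02:29:26",
                "2019-02-03 02:48:03",
                "2019-02-03 04:48:52",
                "2019-02-03 05:25:59",
                "2019-02-03 05:44:34",
            ]
        ),
        "l1": [0.1, 0.1, 0.1, 0.3, 0.4, 0.4],
        "l2": [0.2, 0.3, 0.3, 0.8, 1.1, 1.3],
    }
)

# Time since the previous row; the first row has no predecessor, so it is NaT
df["timediff"] = df["datetime"].diff()

# Rows that arrived more than an hour after the one before them
late = df["timediff"] > pd.Timedelta(hours=1)

# Series.mask replaces values with NaN where the condition is True;
# assign() returns a new frame, so df itself keeps its measurements
masked = df.assign(l1=df["l1"].mask(late), l2=df["l2"].mask(late))
print(masked)
```

Plotting `masked` instead of `df` gives the same broken line, while `df` still holds every measurement for later use.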