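I'm trying to create a column in a CSV file in Colab that counts the occurrences of each class, based on the date in the timestamp column.
timestamp / class_id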
2021-09-27 06:00:00 / A
2021-09-27 03:00:00 / A
2021-09-27 01:00:00 / A
2021-09-27 08:29:00 / C
2021-05-23 08:08:49 / B
2021-05-23 03:21:49 / B
2021-05-23 01:22:11 / C
Expected result:
count / timestamp / class_id
1 / 2021-09-27 06:00:00 / A
2 / 2021-09-27 03:00:00 / A
3 / 2021-09-27 01:00:00 / A
1 / 2021-09-27 08:29:00 / C
1 / 2021-05-23 08:08:49 / B
2 / 2021-05-23 03:21:49 / B
1 / 2021-05-23 01:22:11 / C
from google.colab import drive
drive.mount('/content/gdrive')
import pandas as pd
data = pd.read_csv('gdrive/My Drive/Colab_Notebooks/capstoneproject/capstonedata.csv', parse_dates=['visit_date'], index_col='visit_date')
class_id = data['class_id']
count = 0
data['count'] = count
index = -1
for row in data:
    index = index + 1
    if class_id[index] == class_id[index-1]:
        data['count'] = count + 1
    elif count == 0:
        data['count'] = count + 1
    else:
        data['count'] = 1
data.head()
This is the code I have so far, but this is the output it produces:
timestamp / class_id / count
2021-09-27 06:00:00 / A / 1
2021-09-27 03:00:00 / A / 1
2021-09-27 01:00:00 / A / 1
2021-09-27 08:29:00 / C / 1
2021-05-23 08:08:49 / B / 1
2021-05-23 03:21:49 / B / 1
2021-05-23 01:22:11 / C / 1
from typing import Generator
import datetime as dt
import pandas as pd


def _get_df() -> pd.DataFrame:
    # Build a small sample frame: seven hourly timestamps, newest first.
    start = dt.datetime(2021, 5, 23, 7, 0)
    df = pd.DataFrame(
        {
            "stamp": [start + dt.timedelta(hours=-i) for i in range(7)],
            "class_id": list("AAACBBC"),
        }
    )
    df["count"] = list(_get_counts(df.class_id))
    return df


def _get_counts(s: pd.Series) -> Generator[int, None, None]:
    # Yield a running count that restarts at 1 whenever the value changes.
    prev = None
    count = 1
    for val in s:
        if val == prev:
            count += 1
        else:
            count = 1
        yield count
        prev = val


if __name__ == "__main__":
    print(_get_df())
Output:
                stamp class_id  count
0 2021-05-23 07:00:00        A      1
1 2021-05-23 06:00:00        A      2
2 2021-05-23 05:00:00        A      3
3 2021-05-23 04:00:00        C      1
4 2021-05-23 03:00:00        B      1
5 2021-05-23 02:00:00        B      2
6 2021-05-23 01:00:00        C      1
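
Note that the generator above counts consecutive runs of the same class_id and does not look at the date. If the count should restart per date as well as per class, a groupby with cumcount may be a shorter route. The following is only a sketch, under these assumptions: the seven rows are copied from the question, the timestamp is kept as an ordinary column named visit_date rather than as the index, and the helper names day, run_id and run_count are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question. Keeping visit_date as a regular
# column (not the index) is an assumption made for this sketch.
df = pd.DataFrame(
    {
        "visit_date": pd.to_datetime(
            [
                "2021-09-27 06:00:00",
                "2021-09-27 03:00:00",
                "2021-09-27 01:00:00",
                "2021-09-27 08:29:00",
                "2021-05-23 08:08:49",
                "2021-05-23 03:21:49",
                "2021-05-23 01:22:11",
            ]
        ),
        "class_id": list("AAACBBC"),
    }
)

# Strip the time of day so that rows from the same calendar day share a key.
day = df["visit_date"].dt.normalize().rename("day")

# Number each row within its (day, class_id) group, starting at 1.
df["count"] = df.groupby([day, "class_id"]).cumcount() + 1

# Loop-free version of the run counter: a new run starts wherever
# class_id differs from the previous row.
run_id = (df["class_id"] != df["class_id"].shift()).cumsum()
df["run_count"] = df.groupby(run_id).cumcount() + 1

print(df)
```

On this sample both columns come out as 1, 2, 3, 1, 1, 2, 1, which matches the expected result in the question. They would differ if the same class appeared again later on the same day after a different class: count would keep increasing, while run_count would restart at 1.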