Subsetting multi-year time series data


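I have multi-year time series data at 15-minute resolution from 2007 to 2022 (16 years in total). The data looks like this. I want to extract all possible subsets from this data. Each subset should contain one year's worth of values, so it should have 4 (15-minute intervals per hour) x 24 (hours) x 365 days = 35,040 rows, or 4 x 24 x 366 = 35,136 rows for a leap year.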
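The row counts above can be double-checked with a few lines of Python. This is only a sketch: `rows_per_year` is a hypothetical helper, and it assumes a gap-free series with exactly four readings per hour.

```python
import calendar

def rows_per_year(year: int, readings_per_hour: int = 4) -> int:
    """Expected number of rows for one calendar year of evenly spaced readings."""
    days = 366 if calendar.isleap(year) else 365
    return readings_per_hour * 24 * days

print(rows_per_year(2021))  # 35040 (non-leap year)
print(rows_per_year(2020))  # 35136 (leap year)
```

In a subset that mixes months from different years, the total depends only on whether the chosen February comes from a leap year.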

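Each subset should be built from 12 months taken from different years. For example:

January from 2021 (all 15-minute values of that month should be in the subset)
February from 2018
March from 2012
April from 2015
May from 2009
June from 2014
July from 2022
August from 2010
September from 2015
October from 2020
November from 2018
December from 2007

Having two months from the same year is also fine.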
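For a sense of scale, if "all possible subsets" is read literally as one independent choice of year per month, the number of combinations is very large. The sketch below counts them and shows how `itertools.product` can enumerate them lazily; the variable names are illustrative only.

```python
from itertools import islice, product

years = range(2007, 2023)   # 16 years, 2007 through 2022
months = range(1, 13)

# One independent choice of year per month: 16 ** 12 possible subsets.
n_subsets = len(years) ** len(months)
print(n_subsets)  # 281474976710656

# product() is lazy, so a few combinations can be inspected
# without materializing all of them.
all_year_choices = product(years, repeat=len(months))
first_three = [dict(zip(months, choice)) for choice in islice(all_year_choices, 3)]
print(first_three[0][1], first_three[0][12])  # 2007 2007
print(first_three[1][12])                     # 2008
```

Building a full-year DataFrame for every one of these combinations is not practical, which is why sampling a limited number of subsets is the more realistic reading of the requirement.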

Please help me with how to proceed.

Here is my code so far for reading the data:

import pandas as pd
import numpy as np

columns_to_read = ['DateTime', 'PLANT ENERGY MWh']

df = pd.read_excel(r'C:/Users/97150/Data - 15 mins multiyear -R2.xlsx', skiprows=0, usecols=columns_to_read)

df['DateTime'] = pd.to_datetime(df['DateTime'])

df.dropna(subset=['DateTime'], inplace=True)

df['Month'] = df['DateTime'].dt.month.astype(int)
df['Year'] = df['DateTime'].dt.year.astype(int)


#df['Month'] = df['DateTime'].dt.month
#df['Year'] = df['DateTime'].dt.year

df.set_index('DateTime', inplace=True)
python dataframe subset permutation
1 Answer

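Here is my simple solution, without any frills or overly complicated Pythonic tweaks...
I assume you want random years in each subset, but every "month + year" pair must be used, and each pair must appear in exactly one sub-dataset.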
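That assumption can be illustrated without any data at all. The following is a minimal sketch of the assignment idea only, using a slightly different technique from the answer's code (one shuffled ordering of the years per month, rather than popping from a single shuffled list); `assign_years` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random

def assign_years(years, n_months=12, seed=None):
    """Return one {month: year} mapping per subset so that every
    (month, year) pair is used exactly once across all subsets."""
    years = list(years)
    rng = random.Random(seed)
    # For each month, draw an independent random ordering of the years.
    shuffled = {m: rng.sample(years, k=len(years)) for m in range(1, n_months + 1)}
    # Subset i takes the i-th year from each month's ordering.
    return [{m: shuffled[m][i] for m in shuffled} for i in range(len(years))]

plan = assign_years(range(2007, 2023), seed=0)
print(len(plan))      # 16 subsets
print(len(plan[0]))   # 12 months in each subset
```

Each mapping in `plan` says which year supplies each month of one subset, so 16 years yield exactly 16 subsets covering all 192 month and year pairs.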

The code below stores its result in a list of 16 pandas DataFrames and prints them to a file.

I hope this is what you are looking for!
Let me know if anything is unclear. Ciao!

import pandas as pd
import random
from itertools import product

# Define the columns to be read from the Excel file
columns_to_read = ['DateTime', 'PLANT ENERGY MWh']
# Read data from the Excel file into a DataFrame (path changed for me) 
df = pd.read_excel(r'./Data_15_mins_multiyear-R2.xlsx', skiprows=0,  
                  usecols=columns_to_read)

# Convert 'DateTime' column to datetime type
df['DateTime'] = pd.to_datetime(df['DateTime'])
# Drop rows where 'DateTime' is missing
df.dropna(subset=['DateTime'], inplace=True)

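## Define start and end dates for data filtering
start_date = pd.Timestamp('2007-01-01')
# Note: this end date is midnight at the start of 2022-12-31, so readings
# later on that day are excluded by the '<=' filter below; compare with
# '< pd.Timestamp("2023-01-01")' instead to keep the whole last day
end_date = pd.Timestamp('2022-12-31')
# Filter the DataFrame to include only data within the specified date range
# (.copy() can avoid a SettingWithCopyWarning when columns are added below)
df = df[(df['DateTime'] >= start_date) & (df['DateTime'] <= end_date)].copy()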

## Extract 'Month' and 'Year' from the 'DateTime' column
df['Month'] = df['DateTime'].dt.month
df['Year'] = df['DateTime'].dt.year
# Group the DataFrame by 'Month' and 'Year'
grouped = df.groupby(['Month', 'Year'])

# Get the unique years present in the DataFrame
unique_years = df['Year'].unique()

# Create a dictionary to hold data subsets for each year
yearly_data = {}

### Populate the yearly_data dictionary with subsets of data for each year
for year in unique_years:
    subset_df = df[df['Year'] == year].reset_index(drop=True)
    yearly_data[year] = subset_df

N = len(yearly_data)

## Get all possible combinations of months and years and shuffle
all_combinations = list(product(range(1, 13), unique_years))
random.shuffle(all_combinations)

# Create an empty list to hold the final datasets
datasets_list = []

#################### Build all datasets:
for _ in range(N):
    # Create an empty DataFrame to store the current dataset
    curr_dataset = pd.DataFrame()
    
    # Loop through months 1 to 12
    for month in range(1, 13):
        # Find the position of the first remaining (month, year) combination
        # for the current month; the year is ignored during the lookup
        index_to_pop = next((i for i, (m, _) in enumerate(all_combinations)
                             if m == month), None)
        
        if index_to_pop is not None:
            # Remove the combination from the list and get the associated year
            _, year = all_combinations.pop(index_to_pop)
            
            # Get the df for the selected year
            subset_df = yearly_data[year]
            
            # Filter the subsets to include only rows with the current month
            subset_month = subset_df[subset_df['Month'] == month]
            
            ## Concatenate the subset_month df to the current one.
            # Use pd.concat rather than DataFrame.append, since the latter is deprecated
            curr_dataset = pd.concat([curr_dataset, subset_month], ignore_index=True)
    
    # Append the subdf to the datasets_list
    datasets_list.append(curr_dataset)

########## Print the result into a file 
## Indicate that all columns and rows must be displayed, without any truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Open the file once in append mode and write every sub-dataset to it
with open('my_output_file.txt', 'a') as file:
    for i, dataset in enumerate(datasets_list, start=1):
        print(f"SubDataset {i}:", file=file)
        print(dataset, file=file)
        # Blank separator line, written to the file rather than to stdout
        print(file=file)

==> Output <==
My output file is:

SubDataset 1:
                 DateTime  PLANT ENERGY MWh  Month  Year
0     2013-01-01 00:07:00          0.000000      1  2013
1     2013-01-01 00:22:00          0.000000      1  2013
2     2013-01-01 00:37:00          0.000000      1  2013
3     2013-01-01 00:52:00          0.000000      1  2013
.....     .........      ........      .......... ......
34941 2022-12-30 23:22:00          0.000000     12  2022
34942 2022-12-30 23:37:00          0.000000     12  2022
34943 2022-12-30 23:52:00          0.000000     12  2022

SubDataset 2:
             DateTime  PLANT ENERGY MWh  Month  Year
0     2019-01-01 00:07:00          0.000000      1  2019
1     2019-01-01 00:22:00          0.000000      1  2019
2     2019-01-01 00:37:00          0.000000      1  2019
3     2019-01-01 00:52:00          0.000000      1  2019
4     2019-01-01 01:07:00          0.000000      1  2019
.....     .........      ........      .......... ......
20446 2014-07-31 23:37:00          0.000000      7  2014
20447 2014-07-31 23:52:00          0.000000      7  2014
20448 2015-08-01 00:07:00          0.000000      8  2015
20449 2015-08-01 00:22:00          0.000000      8  2015
20450 2015-08-01 00:37:00          0.000000      8  2015
.....     .........      ........      .......... ......

A quick check of which year was picked for a given month x in each dataset...

month_x = []
x = 1
# Loop through each dataset
for dataset in datasets_list:
    # Take the first row with Month equal to x
    row_to_append = dataset.loc[dataset['Month'] == x].iloc[0]
    month_x.append(row_to_append)

month_x is:

[DateTime            2013-01-01 00:07:00
 PLANT ENERGY MWh                    0.0
 Month                                 1
 Year                               2013
 Name: 0, dtype: object,
 DateTime            2019-01-01 00:07:00
 PLANT ENERGY MWh                    0.0
 Month                                 1
 Year                               2019
 Name: 0, dtype: object,
 DateTime            2018-01-01 00:07:00
 PLANT ENERGY MWh                    0.0
 Month                                 1
 Year                               2018
 Name: 0, dtype: object,
........................................
........................................
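The spot check above looks at one month at a time. A fuller check, sketched below under stated assumptions, verifies that every subset contains all 12 months, that each month comes from a single year, and that no month and year pair is reused across subsets. `check_subsets` and `make_subset` are hypothetical helpers, and the toy subsets are plain dicts of lists standing in for DataFrames with 'Month' and 'Year' columns.

```python
def check_subsets(datasets):
    """Raise AssertionError if a subset lacks a month, mixes years within a
    month, or reuses a (Month, Year) pair already seen in another subset.
    Returns the total number of distinct pairs."""
    seen = set()
    for ds in datasets:
        pairs = set(zip(ds['Month'], ds['Year']))
        assert {m for m, _ in pairs} == set(range(1, 13)), "missing month"
        assert len(pairs) == 12, "a month draws from more than one year"
        assert not (pairs & seen), "pair reused across subsets"
        seen |= pairs
    return len(seen)

def make_subset(month_to_year):
    # Two rows per month stand in for the many 15-minute readings.
    months = [m for m in month_to_year for _ in range(2)]
    years = [month_to_year[m] for m in months]
    return {'Month': months, 'Year': years}

# Subset A takes odd months from 2021 and even months from 2022; B the opposite.
subset_a = make_subset({m: 2021 if m % 2 else 2022 for m in range(1, 13)})
subset_b = make_subset({m: 2022 if m % 2 else 2021 for m in range(1, 13)})

print(check_subsets([subset_a, subset_b]))  # 24
```

Because the function only indexes `ds['Month']` and `ds['Year']`, it should also accept the DataFrames in `datasets_list`; if the input file covers every month of all sixteen years, the expected return value there would be 192.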