I'm trying to classify accelerometer data (sampled at 100 Hz) into 4 different transportation modes (0, 1, 2, 3). I have 41 different CSV files, each representing one time series, and I stored each file in a list called `subjects`. Each CSV file looks like this:
# Check if the label mapping worked
test = subjects[0]
print(test.head())
print(test.info())
print(len(test))
x y z label
0 -0.154881 0.383397 -0.653029 0
1 -0.189302 0.410185 -0.597840 0
2 -0.202931 0.408217 -0.490296 0
3 -0.205011 0.407853 -0.360820 0
4 -0.196665 0.430047 -0.147033 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128628 entries, 0 to 128627
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 x 128628 non-null float64
1 y 128628 non-null float64
2 z 128628 non-null float64
3 label 128628 non-null int64
dtypes: float64(3), int64(1)
memory usage: 3.9 MB
None
128628
First, I'd like to start by implementing a random forest. But I'm not sure how to create the training and test datasets for this, since my data is spread across different CSV files.
How should I create the training and test sets for this task? My first thought was to concatenate all the CSV files, but since each file represents a separate time series, I'm not sure that's the right approach.
Thanks in advance for your help!
Here is a rough example of what you want to do:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Concatenate your list of DataFrames into a single DataFrame
df = pd.concat(subjects, ignore_index=True)
# Split the data into features (X) and target labels (y)
X = df[['x', 'y', 'z']]
y = df['label']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the classifier to the training data
clf.fit(X_train, y_train)
# Make predictions on the test data
y_pred = clf.predict(X_test)
# Evaluate the classifier's performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", report)
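Your concern about concatenating separate time series is valid: a random row-level split puts neighboring samples from the same recording in both train and test, which inflates accuracy. One common fix is to split at the file level instead, e.g. with scikit-learn's `GroupShuffleSplit`. Here is a sketch under that assumption; the synthetic `subjects` list below is just a stand-in for your own list of 41 DataFrames:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Stand-in for your list of DataFrames; replace with your `subjects`
rng = np.random.default_rng(0)
subjects = [
    pd.DataFrame({'x': rng.normal(size=50), 'y': rng.normal(size=50),
                  'z': rng.normal(size=50), 'label': rng.integers(0, 4, 50)})
    for _ in range(10)
]

# Tag every row with the index of the file it came from
df = pd.concat([s.assign(subject_id=i) for i, s in enumerate(subjects)],
               ignore_index=True)

X = df[['x', 'y', 'z']]
y = df['label']
groups = df['subject_id']

# Split so that no file contributes rows to both train and test
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Sanity check: no subject id appears on both sides of the split
assert set(groups.iloc[train_idx]).isdisjoint(set(groups.iloc[test_idx]))
```

You can then fit the `RandomForestClassifier` on `X_train`/`y_train` exactly as above; the held-out files give a more honest estimate of how the model generalizes to a new recording.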