无法让我的机器学习算法处理我的数据

问题描述 投票:0回答:0

我正在尝试构建一个机器学习算法以根据我的数据进行预测,但我无法让它工作(以退出代码 0 结束,但没有输出)。我正在使用 Python 和 scikit-learn,我的数据在 CSV 文件中,格式如下:

数据格式是一个表格,包含大学篮球队的信息、他们的表现和其他相关统计数据。表格的标题位于第一行,第二行提供有关每个标题的附加信息。其余行包含每个团队的实际数据。

我也尝试过使用 HistGradientBoostingRegressor。我想知道我在数据预处理中是否遗漏了什么,或者我是否应该使用更好的算法。这是相关代码:

def train_predict_model():
    # Read csv files into a list of dataframes
    dir_path = os.path.join(os.path.expanduser("~"), "Downloads", "sportsdata")
    dfs = []
    for year in range(2013, 2022):
        file_name = f"all_ncaa_data.csv"
        file_path = os.path.join(dir_path, file_name)
        df = pd.read_csv(file_path)
        dfs.append(df)
    df = pd.concat(dfs, ignore_index=True)

    headers_1 = ['', '', 'Overall', 'Overall', 'Overall', 'Overall', 'Overall', 'Overall', '', 'Conf.', 'Conf.', '', 'Home', 'Home', '', 'Away', 'Away', '', 'Points', 'Points', '', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'School_Advanced', 'year']
    headers_2 = ['Rk', 'School', 'G', 'W', 'L', 'W-L%', 'SRS', 'SOS', '', 'W', 'L', '', 'W', 'L', '', 'W', 'L', '', 'Tm.', 'Opp.', '', 'Pace', 'ORtg', 'FTr', '3PAr', 'TS%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'eFG%', 'TOV%', 'ORB%', 'FT/FGA', 'year']
    headers = headers_1 + headers_2

    columns = pd.MultiIndex.from_arrays([headers_1, headers_2])
    df.columns = columns

    # Assuming 'df' is your DataFrame
    df = df.sort_index(axis=1, level=[0, 1])

    # Reset the index of the dataframe to the default integer index
    df = df.reset_index(drop=True)

    target = df[('Points', 'Tm.')]
    features = df.drop(['Tm.'], axis=1)

    # One-hot encode the remaining features
    features = pd.get_dummies(features)

    x_train = features.head(2000)
    y_train = target.head(2000)

    x_test = features.tail(5000)
    y_test = target.tail(5000)

    # Define the imputer and model
    imputer = SimpleImputer()
    model = RandomForestRegressor()

    # Create a pipeline to preprocess the data and fit the model
    pipe = make_pipeline(imputer, model)
    pipe.fit(x_train, y_train)

    x_test = x_test.fillna(0)
    x_test = x_test.dropna()
    y_test = y_test[x_test.index]

    # Train the model
    RF = RandomForestRegressor()
    RF.fit(x_train, y_train)

    # Make predictions on the training set
    RF_predict_train = pipe.predict(x_train)
    RMSE_train = sqrt(mean_squared_error(y_train, RF_predict_train))
    print(f"RMSE (training) for RF: {RMSE_train:.10f}")

    # Make predictions on the test set
    RF_predict_test = pipe.predict(x_test)
    RMSE_test = sqrt(mean_squared_error)
    print(f"RMSE (test) for RF: {RMSE_test:.10f}")
python machine-learning regression random-forest data-preprocessing
© www.soinside.com 2019 - 2024. All rights reserved.