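I am trying to predict UFC fights from a dataset. My model expects 51 features but receives 3099.
I tried to extract the numerical and categorical features from the dataset and then combine them. The model I ended up with expects 51 features, but when I try to train it, it gets 3099 features.
Below I show my csv file, my code, and the error message. The actual error is at the bottom: first come some warnings, then some test output that may be helpful to you, then the error message.
Code: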
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('ufc_data.csv')
# defines a logistic regression model
model = tf.keras.Sequential([
    # in input_shape, we have the number of our features
    tf.keras.layers.Dense(units=1, input_shape=(51,), activation='sigmoid')
])
#compiles the model and specifies our metrics for the optimizer, loss, and accuracy metrics
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# add finish, finish_details, finish round, etc. later. columns DE in the csv file
#X_train = df[['R_fighter', 'B_fighter', 'R_odds', 'B_odds', 'weight_class','gender','no_of_rounds', 'B_current_lose_streak','B_current_win_streak','B_avg_SIG_STR_landed','B_avg_SIG_STR_pct','B_avg_SUB_ATT','B_avg_TD_landed','B_avg_TD_pct','B_longest_win_streak', 'B_losses','B_total_rounds_fought','B_total_title_bouts','B_win_by_Decision_Majority','B_win_by_Decision_Split','B_win_by_Decision_Unanimous','B_win_by_KO/TKO','B_win_by_Submission','B_win_by_TKO_Doctor_Stoppage','B_wins','B_Stance','B_Height_cms','B_Reach_cms', 'R_current_lose_streak','R_current_win_streak','R_avg_SIG_STR_landed','R_avg_SIG_STR_pct','R_avg_SUB_ATT','R_avg_TD_landed','R_avg_TD_pct','R_longest_win_streak', 'R_losses','R_total_rounds_fought','R_total_title_bouts','R_win_by_Decision_Majority','R_win_by_Decision_Split','R_win_by_Decision_Unanimous','R_win_by_KO/TKO','R_win_by_Submission','R_win_by_TKO_Doctor_Stoppage','R_wins','R_Stance','R_Height_cms','R_Reach_cms', 'B_match_weightclass_rank','R_match_weightclass_rank']]
#y_train =df[['Winner']]
numerical_features = df[['R_odds', 'B_odds', 'no_of_rounds',
'B_current_lose_streak', 'B_current_win_streak', 'B_avg_SIG_STR_landed', 'B_avg_SIG_STR_pct', 'B_avg_SUB_ATT', 'B_avg_TD_landed', 'B_avg_TD_pct', 'B_longest_win_streak', 'B_losses', 'B_total_rounds_fought', 'B_total_title_bouts', 'B_Height_cms', 'B_Reach_cms',
'R_current_lose_streak', 'R_current_win_streak', 'R_avg_SIG_STR_landed', 'R_avg_SIG_STR_pct', 'R_avg_SUB_ATT', 'R_avg_TD_landed', 'R_avg_TD_pct', 'R_longest_win_streak', 'R_losses', 'R_total_rounds_fought', 'R_total_title_bouts', 'R_Height_cms', 'R_Reach_cms',
'B_Weight_lbs', 'R_Weight_lbs']]
categorical_features = ['R_fighter', 'B_fighter', 'weight_class', 'gender',
'B_Stance', 'R_Stance',
'B_win_by_Decision_Majority', 'B_win_by_Decision_Split', 'B_win_by_Decision_Unanimous', 'B_win_by_KO/TKO', 'B_win_by_Submission', 'B_win_by_TKO_Doctor_Stoppage',
'R_win_by_Decision_Majority', 'R_win_by_Decision_Split', 'R_win_by_Decision_Unanimous', 'R_win_by_KO/TKO', 'R_win_by_Submission', 'R_win_by_TKO_Doctor_Stoppage']
encoded_data = pd.DataFrame() # A placeholder for encoded data
for col in categorical_features:
    encoder = OneHotEncoder(handle_unknown='ignore')
    encoded_col = encoder.fit_transform(df[[col]]).toarray()
    encoded_data = pd.concat([encoded_data, pd.DataFrame(encoded_col)], axis=1)
# Combine encoded and numerical features
X = pd.concat([encoded_data, numerical_features], axis=1)
y = df['Winner'].replace({'Red': 0, 'Blue': 1})
#
# Train-Test Split
# this splits the data into training and testing sets. test_size = 0.2 would be an 80/20 split of training to testing data. random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The 'fit' method trains the model on the data.
# Its arguments here are the training inputs, the training labels, the number of epochs, and the batch size.
# An epoch is one complete pass through the entire dataset during the training process.
'''
Batch size is the number of training examples used in one iteration. Instead of updating the model's parameters
after every single example (which would be extremely slow), we update them after a batch of examples. The batch size determines how many examples are processed together before the model's parameters are updated.
'''
# train_test_split returns X as a DataFrame and y as a Series, but the model is expecting NumPy arrays
# So, we convert them to NumPy arrays
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values
print(X_train.shape) # Should output something like (Num_Samples, 51 + Num_Encoded_Features)
print(y_train.shape) # Should output something like (Num_Samples,)
print("\n\nnumerical features print: \n")
print(numerical_features)
print("\n\n df.head()print: \n")
print(df.head())
model.fit(X_train, y_train, epochs=20, batch_size=128)
Warnings:
2024-02-17 19:39:24.645119: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING:tensorflow:From C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
WARNING:tensorflow:From C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\backend.py:873: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2024-02-17 19:39:28.791632: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\optimizers\__init__.py:309: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
c:\coding_n_such\UFC_ELO_ML\3rd_try\lesgo.py:47: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
y = df['Winner'].replace({'Red': 0, 'Blue': 1})
(3916, 3099)
(3916,)
numerical features print:
R_odds B_odds no_of_rounds B_current_lose_streak ... R_Height_cms R_Reach_cms B_Weight_lbs R_Weight_lbs
0 -150.0 130 5 0 ... 187.96 193.04 205 205
1 170.0 -200 3 2 ... 180.34 193.04 170 170
2 110.0 -130 3 1 ... 190.50 195.58 185 205
3 -675.0 475 3 1 ... 175.26 182.88 155 155
4 -135.0 115 3 0 ... 175.26 177.80 145 155
... ... ... ... ... ... ... ... ... ...
4891 -155.0 135 3 0 ... 177.80 177.80 145 170
4892 -210.0 175 3 0 ... 170.18 180.34 170 170
4893 -260.0 220 3 1 ... 193.04 198.12 265 245
4894 -420.0 335 3 0 ... 172.72 177.80 170 170
4895 140.0 -160 3 1 ... 190.50 190.50 205 185
[4896 rows x 31 columns]
df.head()print:
R_fighter B_fighter R_odds B_odds R_ev ... r_sub_odds b_sub_odds r_ko_odds b_ko_odds fight_id
0 Thiago Santos Johnny Walker -150.0 130 66.666667 ... 2000.0 1600.0 -110.0 175.0 4896
1 Alex Oliveira Niko Price 170.0 -200 170.000000 ... 700.0 1100.0 550.0 175.0 4895
2 Misha Cirkunov Krzysztof Jotko 110.0 -130 110.000000 ... 275.0 1400.0 600.0 175.0 4894
3 Alexander Hernandez Mike Breeden -675.0 475 14.814815 ... 500.0 3500.0 110.0 175.0 4893
4 Joe Solecki Jared Gordon -135.0 115 74.074074 ... 400.0 1200.0 900.0 175.0 4892
[5 rows x 120 columns]
Error:
Epoch 1/20
Traceback (most recent call last):
File "c:\coding_n_such\UFC_ELO_ML\3rd_try\lesgo.py", line 83, in <module>
model.fit(X_train, y_train, epochs=20, batch_size=128)
File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\grine\AppData\Local\Temp\__autograph_generated_filep7si5w_l.py", line 15, in tf__train_function
retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
^^^^^
ValueError: in user code:
File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1401, in train_function *
return step_function(self, iterator)
File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1384, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1373, in run_step **
outputs = model.train_step(data)
File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1150, in train_step
y_pred = self(x, training=True)
File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\input_spec.py", line 298, in assert_input_compatibility
raise ValueError(
ValueError: Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 51), found shape=(None, 3099)
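The main problem is that you use OneHotEncoder to transform the categorical variables, and that changes their shape. For example, R_fighter has 1348 unique values, so OneHotEncoder encodes this feature as a vector of size 1348, which gives the feature 1348 columns instead of 1. The same happens for every other categorical variable, and that is what inflates the number of features to 3099. You need to understand what each encoder does and pick one that suits your task.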
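To make the column blow-up concrete, here is a minimal sketch on a toy frame. The fighter names and stances are made up for illustration and are not taken from ufc_data.csv; only the encoder call mirrors the question's loop.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the UFC data (hypothetical values).
toy = pd.DataFrame({
    "R_fighter": ["Ann", "Bea", "Cy", "Dee", "Ann"],
    "R_Stance": ["Orthodox", "Southpaw", "Orthodox", "Orthodox", "Southpaw"],
})

widths = {}
for col in toy.columns:
    encoder = OneHotEncoder(handle_unknown="ignore")
    encoded = encoder.fit_transform(toy[[col]]).toarray()
    # One output column per distinct value seen in this feature.
    widths[col] = encoded.shape[1]
    print(col, "unique values:", toy[col].nunique(), "-> encoded columns:", encoded.shape[1])

total_width = sum(widths.values())
print("total one-hot columns:", total_width)
```

Two input columns already become six here, so a column with 1348 distinct fighters contributes 1348 columns on its own.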
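A simple solution is to use LabelEncoder, which replaces each categorical value with an integer, so every feature stays a single column. I also fill None values with 'na'.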
# import LabelEncoder
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
# fill None values
df.fillna('na', inplace=True)
# transform features: one integer-coded column per categorical feature
for col in categorical_features:
    # encoder = OrdinalEncoder(handle_unknown='error')
    encoder = LabelEncoder()
    # LabelEncoder expects 1-D input, so pass the Series df[col] rather than df[[col]]
    encoded_col = encoder.fit_transform(df[col])
    encoded_data = pd.concat([encoded_data, pd.DataFrame(encoded_col, columns=[col])], axis=1)
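One caveat about filling missing values: calling fillna('na') on the whole frame also writes the string 'na' into numeric columns that contain NaN, which can turn them into object columns. The sketch below uses a toy frame with hypothetical values and a hypothetical `text_cols` list to show filling only the text columns. Using the median for the numeric gap is just one possible choice, not something the original answer prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and one numeric column (hypothetical values).
toy = pd.DataFrame({
    "R_Stance": ["Orthodox", None, "Southpaw"],
    "R_Reach_cms": [193.04, np.nan, 182.88],
})
text_cols = ["R_Stance"]  # hypothetical list of text-valued columns

# Filling everything puts the string 'na' into the numeric column as well.
filled_all = toy.fillna("na")
print(filled_all.dtypes)

# Filling only the text columns leaves the numeric column as float64.
filled_some = toy.copy()
filled_some[text_cols] = filled_some[text_cols].fillna("na")
# Assumption for illustration: replace numeric gaps with the column median.
filled_some["R_Reach_cms"] = filled_some["R_Reach_cms"].fillna(
    filled_some["R_Reach_cms"].median()
)
print(filled_some.dtypes)
```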
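You can read about all the ways of encoding categorical variables here. The most important ones are OneHotEncoder, LabelEncoder, and OrdinalEncoder. The last one should be used when the feature is ordinal.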
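As a rough comparison, OrdinalEncoder can encode several feature columns in one call and, like the LabelEncoder loop, keeps one output column per input column. This is a minimal sketch on toy data with hypothetical values. Note that with the default settings the integer codes follow the sorted order of the category names, so they carry no real ranking unless you supply one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns (hypothetical values).
toy = pd.DataFrame({
    "weight_class": ["Lightweight", "Heavyweight", "Lightweight", "Welterweight"],
    "R_Stance": ["Orthodox", "Southpaw", "Orthodox", "Switch"],
})

encoder = OrdinalEncoder()
# fit_transform takes the 2-D frame and returns one float column per input column.
encoded = pd.DataFrame(
    encoder.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(encoded)
print(encoder.categories_)  # categories learned for each column, in code order
```

The result has the same number of columns as the input, which is why this family of encoders keeps the feature count fixed.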
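I also found that X_train has 49 features (18 encoded categorical columns plus 31 numerical columns), not 51. So your code should look like this: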
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('ufc_data.csv')
# defines a logistic regression model
model = tf.keras.Sequential([
    # in input_shape, we have the number of our features
    tf.keras.layers.Dense(units=1, input_shape=(49,), activation='sigmoid')
])
#compiles the model and specifies our metrics for the optimizer, loss, and accuracy metrics
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# add finish, finish_details, finish round, etc. later. columns DE in the csv file
#X_train = df[['R_fighter', 'B_fighter', 'R_odds', 'B_odds', 'weight_class','gender','no_of_rounds', 'B_current_lose_streak','B_current_win_streak','B_avg_SIG_STR_landed','B_avg_SIG_STR_pct','B_avg_SUB_ATT','B_avg_TD_landed','B_avg_TD_pct','B_longest_win_streak', 'B_losses','B_total_rounds_fought','B_total_title_bouts','B_win_by_Decision_Majority','B_win_by_Decision_Split','B_win_by_Decision_Unanimous','B_win_by_KO/TKO','B_win_by_Submission','B_win_by_TKO_Doctor_Stoppage','B_wins','B_Stance','B_Height_cms','B_Reach_cms', 'R_current_lose_streak','R_current_win_streak','R_avg_SIG_STR_landed','R_avg_SIG_STR_pct','R_avg_SUB_ATT','R_avg_TD_landed','R_avg_TD_pct','R_longest_win_streak', 'R_losses','R_total_rounds_fought','R_total_title_bouts','R_win_by_Decision_Majority','R_win_by_Decision_Split','R_win_by_Decision_Unanimous','R_win_by_KO/TKO','R_win_by_Submission','R_win_by_TKO_Doctor_Stoppage','R_wins','R_Stance','R_Height_cms','R_Reach_cms', 'B_match_weightclass_rank','R_match_weightclass_rank']]
#y_train =df[['Winner']]
numerical_features = df[['R_odds', 'B_odds', 'no_of_rounds',
'B_current_lose_streak', 'B_current_win_streak', 'B_avg_SIG_STR_landed', 'B_avg_SIG_STR_pct', 'B_avg_SUB_ATT', 'B_avg_TD_landed', 'B_avg_TD_pct', 'B_longest_win_streak', 'B_losses', 'B_total_rounds_fought', 'B_total_title_bouts', 'B_Height_cms', 'B_Reach_cms',
'R_current_lose_streak', 'R_current_win_streak', 'R_avg_SIG_STR_landed', 'R_avg_SIG_STR_pct', 'R_avg_SUB_ATT', 'R_avg_TD_landed', 'R_avg_TD_pct', 'R_longest_win_streak', 'R_losses', 'R_total_rounds_fought', 'R_total_title_bouts', 'R_Height_cms', 'R_Reach_cms',
'B_Weight_lbs', 'R_Weight_lbs']]
categorical_features = ['R_fighter', 'B_fighter', 'weight_class', 'gender',
'B_Stance', 'R_Stance',
'B_win_by_Decision_Majority', 'B_win_by_Decision_Split', 'B_win_by_Decision_Unanimous', 'B_win_by_KO/TKO', 'B_win_by_Submission', 'B_win_by_TKO_Doctor_Stoppage',
'R_win_by_Decision_Majority', 'R_win_by_Decision_Split', 'R_win_by_Decision_Unanimous', 'R_win_by_KO/TKO', 'R_win_by_Submission', 'R_win_by_TKO_Doctor_Stoppage']
encoded_data = pd.DataFrame() # A placeholder for encoded data
# fill None values
df.fillna('na', inplace=True)
for col in categorical_features:
    encoder = LabelEncoder()
    # LabelEncoder expects 1-D input, so pass the Series df[col] rather than df[[col]]
    encoded_col = encoder.fit_transform(df[col])
    encoded_data = pd.concat([encoded_data, pd.DataFrame(encoded_col, columns=[col])], axis=1)
# Combine encoded and numerical features
X = pd.concat([encoded_data, numerical_features], axis=1)
y = df['Winner'].replace({'Red': 0, 'Blue': 1})
#
# Train-Test Split
# this splits the data into training and testing sets. test_size = 0.2 would be an 80/20 split of training to testing data. random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The 'fit' method trains the model on the data.
# Its arguments here are the training inputs, the training labels, the number of epochs, and the batch size.
# An epoch is one complete pass through the entire dataset during the training process.
'''
Batch size is the number of training examples used in one iteration. Instead of updating the model's parameters
after every single example (which would be extremely slow), we update them after a batch of examples. The batch size determines how many examples are processed together before the model's parameters are updated.
'''
# train_test_split returns X as a DataFrame and y as a Series, but the model is expecting NumPy arrays
# So, we convert them to NumPy arrays
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values
print(X_train.shape) # Should output (Num_Samples, 49): 18 encoded categorical + 31 numerical columns
print(y_train.shape) # Should output something like (Num_Samples,)
print("\n\nnumerical features print: \n")
print(numerical_features)
print("\n\n df.head()print: \n")
print(df.head())
model.fit(X_train, y_train, epochs=20, batch_size=128)
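A way to avoid this kind of mismatch altogether is to read the input width from the data instead of hard-coding it. The sketch below is one possible arrangement, not part of the original answer: `build_model` is a hypothetical helper, and the random array only stands in for the real X_train.

```python
import numpy as np

# Stand-in for the X_train array produced above (hypothetical random data).
X_train = np.random.rand(8, 49).astype("float32")
n_features = X_train.shape[1]


def build_model(n_features):
    """Logistic-regression model whose input width follows the data."""
    import tensorflow as tf  # imported here so the shape check above runs without it

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(units=1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


print("model input width should be", n_features)
```

With the real data you would build the model after the train/test split, for example `model = build_model(X_train.shape[1])`, and then call `model.fit(X_train, y_train, epochs=20, batch_size=128)` as before.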