Trying to create an ML model with 51 features, but ended up with a model that has 3099


I'm trying to predict UFC fights from a dataset. My model wants 51 features, but it is getting 3099.

I tried to extract the numerical and categorical features from the dataset and then combine them. The model I ended up with expects 51 features, but when I try to train it, it gets 3099.

Below I show my CSV file, my code, and the error message. The actual error is at the bottom: there are some warnings first, then some test output that might be helpful, and then the error message itself.

CSV file: https://www.dropbox.com/scl/fi/5ozqq3dla0jgsg8fc1q3s/ufc_data.csv?rlkey=jx98vw0hwp3ipyxvageaw6caj&dl=0

Code:

import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('ufc_data.csv')

#defines a logistic regression model
model = tf.keras.Sequential([
    # in input_shape, we have the number of our features
    tf.keras.layers.Dense(units=1, input_shape=(51,), activation='sigmoid')
])

#compiles the model and specifies our metrics for the optimizer, loss, and accuracy metrics
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# add finish, finish_details, finish round, etc. later.  columns DE in the csv file 
#X_train = df[['R_fighter', 'B_fighter', 'R_odds', 'B_odds', 'weight_class','gender','no_of_rounds', 'B_current_lose_streak','B_current_win_streak','B_avg_SIG_STR_landed','B_avg_SIG_STR_pct','B_avg_SUB_ATT','B_avg_TD_landed','B_avg_TD_pct','B_longest_win_streak', 'B_losses','B_total_rounds_fought','B_total_title_bouts','B_win_by_Decision_Majority','B_win_by_Decision_Split','B_win_by_Decision_Unanimous','B_win_by_KO/TKO','B_win_by_Submission','B_win_by_TKO_Doctor_Stoppage','B_wins','B_Stance','B_Height_cms','B_Reach_cms', 'R_current_lose_streak','R_current_win_streak','R_avg_SIG_STR_landed','R_avg_SIG_STR_pct','R_avg_SUB_ATT','R_avg_TD_landed','R_avg_TD_pct','R_longest_win_streak', 'R_losses','R_total_rounds_fought','R_total_title_bouts','R_win_by_Decision_Majority','R_win_by_Decision_Split','R_win_by_Decision_Unanimous','R_win_by_KO/TKO','R_win_by_Submission','R_win_by_TKO_Doctor_Stoppage','R_wins','R_Stance','R_Height_cms','R_Reach_cms', 'B_match_weightclass_rank','R_match_weightclass_rank']]
#y_train =df[['Winner']]


numerical_features = df[['R_odds', 'B_odds', 'no_of_rounds', 
                         'B_current_lose_streak', 'B_current_win_streak', 'B_avg_SIG_STR_landed', 'B_avg_SIG_STR_pct', 'B_avg_SUB_ATT', 'B_avg_TD_landed', 'B_avg_TD_pct', 'B_longest_win_streak', 'B_losses', 'B_total_rounds_fought', 'B_total_title_bouts', 'B_Height_cms', 'B_Reach_cms',
                         'R_current_lose_streak', 'R_current_win_streak', 'R_avg_SIG_STR_landed', 'R_avg_SIG_STR_pct', 'R_avg_SUB_ATT', 'R_avg_TD_landed', 'R_avg_TD_pct', 'R_longest_win_streak', 'R_losses', 'R_total_rounds_fought', 'R_total_title_bouts',  'R_Height_cms', 'R_Reach_cms',
                         'B_Weight_lbs', 'R_Weight_lbs']]

categorical_features = ['R_fighter', 'B_fighter', 'weight_class', 'gender', 
                        'B_Stance', 'R_Stance',
                        'B_win_by_Decision_Majority', 'B_win_by_Decision_Split', 'B_win_by_Decision_Unanimous', 'B_win_by_KO/TKO', 'B_win_by_Submission', 'B_win_by_TKO_Doctor_Stoppage',
                        'R_win_by_Decision_Majority', 'R_win_by_Decision_Split', 'R_win_by_Decision_Unanimous', 'R_win_by_KO/TKO', 'R_win_by_Submission', 'R_win_by_TKO_Doctor_Stoppage']

encoded_data = pd.DataFrame() # A placeholder for encoded data

for col in categorical_features:
    encoder = OneHotEncoder(handle_unknown='ignore')
    encoded_col = encoder.fit_transform(df[[col]]).toarray()
    encoded_data = pd.concat([encoded_data, pd.DataFrame(encoded_col)], axis=1)


# Combine encoded and numerical features 
X = pd.concat([encoded_data, numerical_features], axis=1) 

y = df['Winner'].replace({'Red': 0, 'Blue': 1})

#
# Train-Test Split
# this splits the data into training and testing sets. test_size = 0.2 would be an 80/20 split of training to testing data. random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



#'fit' method trains the model on the data.
#first param is the training features (X_train), second is the labels (y_train), epochs is the number of passes over the data, batch_size is the number of samples per weight update
#An epoch is one complete pass through the entire dataset during the training process.

'''
Batch size == to the number of training examples utilized in one iteration. Instead of updating the model's parameters 
after every single example (which would be extremely slow), we update them after a batch of examples. The batch size determines how many examples are processed simultaneously before the model's parameters are updated.
'''

#train_test_split outputs both X and y features as dataFrames, but the model is expecting NumPy arrays
# So, we convert DataFrames to NumPy arrays
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

print(X_train.shape)  # Should output something like (Num_Samples, 51 + Num_Encoded_Features)
print(y_train.shape)  # Should output something like (Num_Samples,)
print("\n\nnumerical features print: \n")
print(numerical_features)

print("\n\n df.head()print: \n")
print(df.head())


model.fit(X_train, y_train, epochs=20, batch_size=128)

Warnings:

2024-02-17 19:39:24.645119: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING:tensorflow:From C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

WARNING:tensorflow:From C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\backend.py:873: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2024-02-17 19:39:28.791632: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\optimizers\__init__.py:309: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

c:\coding_n_such\UFC_ELO_ML\3rd_try\lesgo.py:47: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  y = df['Winner'].replace({'Red': 0, 'Blue': 1})
(3916, 3099)
(3916,)

numerical features print:

      R_odds  B_odds  no_of_rounds  B_current_lose_streak  ...  R_Height_cms  R_Reach_cms  B_Weight_lbs  R_Weight_lbs
0     -150.0     130             5                      0  ...        187.96       193.04           205           205
1      170.0    -200             3                      2  ...        180.34       193.04           170           170
2      110.0    -130             3                      1  ...        190.50       195.58           185           205
3     -675.0     475             3                      1  ...        175.26       182.88           155           155
4     -135.0     115             3                      0  ...        175.26       177.80           145           155
...      ...     ...           ...                    ...  ...           ...          ...           ...           ...
4891  -155.0     135             3                      0  ...        177.80       177.80           145           170
4892  -210.0     175             3                      0  ...        170.18       180.34           170           170
4893  -260.0     220             3                      1  ...        193.04       198.12           265           245
4894  -420.0     335             3                      0  ...        172.72       177.80           170           170
4895   140.0    -160             3                      1  ...        190.50       190.50           205           185

[4896 rows x 31 columns]

df.head() print:

             R_fighter        B_fighter  R_odds  B_odds        R_ev  ...  r_sub_odds b_sub_odds r_ko_odds b_ko_odds fight_id
0        Thiago Santos    Johnny Walker  -150.0     130   66.666667  ...      2000.0     1600.0    -110.0     175.0     4896  
1        Alex Oliveira       Niko Price   170.0    -200  170.000000  ...       700.0     1100.0     550.0     175.0     4895  
2       Misha Cirkunov  Krzysztof Jotko   110.0    -130  110.000000  ...       275.0     1400.0     600.0     175.0     4894  
3  Alexander Hernandez     Mike Breeden  -675.0     475   14.814815  ...       500.0     3500.0     110.0     175.0     4893  
4          Joe Solecki     Jared Gordon  -135.0     115   74.074074  ...       400.0     1200.0     900.0     175.0     4892  

[5 rows x 120 columns]

Error:

Epoch 1/20
Traceback (most recent call last):
  File "c:\coding_n_such\UFC_ELO_ML\3rd_try\lesgo.py", line 83, in <module>
    model.fit(X_train, y_train, epochs=20, batch_size=128)
  File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\grine\AppData\Local\Temp\__autograph_generated_filep7si5w_l.py", line 15, in tf__train_function
    retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
    ^^^^^
ValueError: in user code:

    File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1401, in train_function  *
        return step_function(self, iterator)
    File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1384, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1373, in run_step  **
        outputs = model.train_step(data)
    File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1150, in train_step
        y_pred = self(x, training=True)
    File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\utils\traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "C:\Users\grine\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\input_spec.py", line 298, in assert_input_compatibility
        raise ValueError(

    ValueError: Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 51), found shape=(None, 3099)
Tags: pandas, tensorflow, machine-learning, scikit-learn, artificial-intelligence

1 Answer

The main problem is that you use OneHotEncoder to transform the categorical variables, which changes their shape. For example, R_fighter has 1348 unique values, so OneHotEncoder encodes this feature as a vector of size 1348, turning that single feature into 1348 columns instead of 1. The same happens with every other categorical variable, which inflates the number of features. It is worth understanding what each encoder does and choosing the one that fits your task.
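A quick way to see where the 3099 comes from is to count how many one-hot columns each categorical feature would contribute. This is a minimal sketch that assumes df, categorical_features and numerical_features are defined exactly as in your script:

# Number of one-hot columns each categorical feature would produce.
# nunique(dropna=False) also counts NaN, which recent scikit-learn
# versions one-hot encode as a category of their own.
for col in categorical_features:
    print(col, df[col].nunique(dropna=False))

encoded_width = sum(df[col].nunique(dropna=False) for col in categorical_features)
print("encoded columns:", encoded_width)
print("total features :", encoded_width + numerical_features.shape[1])  # should land at (or very near) 3099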

A simple solution is to use LabelEncoder, which replaces the categorical values with numbers. I also fill the None values with 'na'.

#import LabelEncoder
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# fill None variables
df.fillna('na', inplace = True)

# transform features
for col in categorical_features:
    # encoder = OrdinalEncoder(handle_unknown='error')
    encoder = LabelEncoder()
    encoded_col = encoder.fit_transform(df[col])  # LabelEncoder expects 1-D input, so pass the column as a Series
    encoded_data = pd.concat([encoded_data, pd.DataFrame(encoded_col, columns=[col])], axis=1)

You can read about all the types of categorical encoders here, but the most important ones are OneHotEncoder, LabelEncoder and OrdinalEncoder. The last one should be used when the feature is ordinal.
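A minimal sketch of the difference between the three encoders, using a made-up stance column (not taken from your CSV) just to show the output shapes:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

stance = pd.DataFrame({'Stance': ['Orthodox', 'Southpaw', 'Switch', 'Orthodox']})

# OneHotEncoder: one column per category -> shape (4, 3)
print(OneHotEncoder().fit_transform(stance).toarray().shape)

# OrdinalEncoder: a single column of integer codes -> shape (4, 1)
print(OrdinalEncoder().fit_transform(stance).shape)

# LabelEncoder: the same integer codes, but it is designed for 1-D targets -> shape (4,)
print(LabelEncoder().fit_transform(stance['Stance']).shape)

Both OrdinalEncoder and LabelEncoder keep one column per variable, which is why the feature count below stays at 49.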

I also found that X_train has 49 features, so your code should look like this:

import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('ufc_data.csv')

#defines a logistic regression model
model = tf.keras.Sequential([
    # in input_shape, we have the number of our features
    tf.keras.layers.Dense(units=1, input_shape=(49,), activation='sigmoid')
])

#compiles the model and specifies our metrics for the optimizer, loss, and accuracy metrics
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# add finish, finish_details, finish round, etc. later.  columns DE in the csv file 
#X_train = df[['R_fighter', 'B_fighter', 'R_odds', 'B_odds', 'weight_class','gender','no_of_rounds', 'B_current_lose_streak','B_current_win_streak','B_avg_SIG_STR_landed','B_avg_SIG_STR_pct','B_avg_SUB_ATT','B_avg_TD_landed','B_avg_TD_pct','B_longest_win_streak', 'B_losses','B_total_rounds_fought','B_total_title_bouts','B_win_by_Decision_Majority','B_win_by_Decision_Split','B_win_by_Decision_Unanimous','B_win_by_KO/TKO','B_win_by_Submission','B_win_by_TKO_Doctor_Stoppage','B_wins','B_Stance','B_Height_cms','B_Reach_cms', 'R_current_lose_streak','R_current_win_streak','R_avg_SIG_STR_landed','R_avg_SIG_STR_pct','R_avg_SUB_ATT','R_avg_TD_landed','R_avg_TD_pct','R_longest_win_streak', 'R_losses','R_total_rounds_fought','R_total_title_bouts','R_win_by_Decision_Majority','R_win_by_Decision_Split','R_win_by_Decision_Unanimous','R_win_by_KO/TKO','R_win_by_Submission','R_win_by_TKO_Doctor_Stoppage','R_wins','R_Stance','R_Height_cms','R_Reach_cms', 'B_match_weightclass_rank','R_match_weightclass_rank']]
#y_train =df[['Winner']]


numerical_features = df[['R_odds', 'B_odds', 'no_of_rounds', 
                         'B_current_lose_streak', 'B_current_win_streak', 'B_avg_SIG_STR_landed', 'B_avg_SIG_STR_pct', 'B_avg_SUB_ATT', 'B_avg_TD_landed', 'B_avg_TD_pct', 'B_longest_win_streak', 'B_losses', 'B_total_rounds_fought', 'B_total_title_bouts', 'B_Height_cms', 'B_Reach_cms',
                         'R_current_lose_streak', 'R_current_win_streak', 'R_avg_SIG_STR_landed', 'R_avg_SIG_STR_pct', 'R_avg_SUB_ATT', 'R_avg_TD_landed', 'R_avg_TD_pct', 'R_longest_win_streak', 'R_losses', 'R_total_rounds_fought', 'R_total_title_bouts',  'R_Height_cms', 'R_Reach_cms',
                         'B_Weight_lbs', 'R_Weight_lbs']]

categorical_features = ['R_fighter', 'B_fighter', 'weight_class', 'gender', 
                        'B_Stance', 'R_Stance',
                        'B_win_by_Decision_Majority', 'B_win_by_Decision_Split', 'B_win_by_Decision_Unanimous', 'B_win_by_KO/TKO', 'B_win_by_Submission', 'B_win_by_TKO_Doctor_Stoppage',
                        'R_win_by_Decision_Majority', 'R_win_by_Decision_Split', 'R_win_by_Decision_Unanimous', 'R_win_by_KO/TKO', 'R_win_by_Submission', 'R_win_by_TKO_Doctor_Stoppage']

encoded_data = pd.DataFrame() # A placeholder for encoded data

# fill None values
df.fillna('na', inplace = True)


for col in categorical_features:
    encoder = LabelEncoder()
    encoded_col = encoder.fit_transform(df[col])  # LabelEncoder expects 1-D input, so pass the column as a Series
    encoded_data = pd.concat([encoded_data, pd.DataFrame(encoded_col, columns=[col])], axis=1)


# Combine encoded and numerical features 
X = pd.concat([encoded_data, numerical_features], axis=1) 

y = df['Winner'].replace({'Red': 0, 'Blue': 1})

#
# Train-Test Split
# this splits the data into training and testing sets. test_size = 0.2 would be an 80/20 split of training to testing data. random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



#'fit' method trains the model on the data.
#first param is the training features (X_train), second is the labels (y_train), epochs is the number of passes over the data, batch_size is the number of samples per weight update
#An epoch is one complete pass through the entire dataset during the training process.

'''
Batch size == to the number of training examples utilized in one iteration. Instead of updating the model's parameters 
after every single example (which would be extremely slow), we update them after a batch of examples. The batch size determines how many examples are processed simultaneously before the model's parameters are updated.
'''

#train_test_split outputs both X and y features as dataFrames, but the model is expecting NumPy arrays
# So, we convert DataFrames to NumPy arrays
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

print(X_train.shape)  # Should output something like (Num_Samples, 49)
print(y_train.shape)  # Should output something like (Num_Samples,)
print("\n\nnumerical features print: \n")
print(numerical_features)

print("\n\n df.head()print: \n")
print(df.head())


model.fit(X_train, y_train, epochs=20, batch_size=128)