I'm new to programming — I've only just started learning, and I'm using various free tools to do it, so I don't know much about programming yet.
I'm trying to write a self-learning neural network.
Here is the idea: I have 3 files. The first (categories) has 1 column named categ containing 37 values.
The second (examples) has 2 columns. The first column is called categ and has 785 rows; the second is called "fix" and also has 785 rows.
The third file (match) has 1 column named match containing 3543 rows.
I need to take the match file and, for each of its values, assign a value from the categories file, based on the data in the examples Excel file.
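Before reaching for a neural network, it may help to see the matching task itself in plain pandas. This is only a sketch with made-up miniature data (the column names categ, fix, and match are taken from your description; the values are invented), assuming every match value also appears in the fix column:

```python
import pandas as pd

# Hypothetical miniature versions of the three files
# (column names from the question; values are made up).
df_categories = pd.DataFrame({"categ": ["fruit", "tool"]})
df_examples = pd.DataFrame({"categ": ["fruit", "fruit", "tool"],
                            "fix": ["apple", "pear", "hammer"]})
df_to_distribute = pd.DataFrame({"match": ["pear", "hammer", "apple"]})

# A left merge attaches a category to each row of the match file.
result = df_to_distribute.merge(df_examples, left_on="match",
                                right_on="fix", how="left")

# Sanity check: every assigned category exists in the categories file.
assert result["categ"].isin(df_categories["categ"]).all()
print(result[["match", "categ"]])
```

If the real match values are only similar to (not identical to) the fix values, a learned model makes more sense — but the merge shows the shape of the target: one category per match row.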
Here is my current code:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.utils import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
# Reading downloaded Excel files
# File with categories
from google.colab import files
upload = files.upload()
!ls
df_categories = pd.read_excel('categ.xlsx', index_col=None)
print(df_categories.columns)
# File with examples
upload = files.upload()
!ls
df_examples = pd.read_excel('ex.xlsx', index_col=None)
print(df_examples.columns)
print(df_examples.columns)
# File with values for distribution
upload = files.upload()
!ls
df_to_distribute = pd.read_excel('match.xlsx', index_col=None)
print(df_to_distribute.columns)
print(df_to_distribute.columns)
# Data preprocessing
categories = df_categories['categ'].tolist()
values = df_examples['fix'].tolist()
to_distribute = df_to_distribute['match'].tolist()
categories = [str(category) for category in categories]
values = [str(value) for value in values]
to_distribute = [str(item) for item in to_distribute]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(categories + values + to_distribute)
category_sequences = tokenizer.texts_to_sequences(categories)
value_sequences = tokenizer.texts_to_sequences(values)
to_distribute_sequences = tokenizer.texts_to_sequences(to_distribute)
max_length = max(len(seq) for seq in category_sequences + value_sequences + to_distribute_sequences)
padded_category_sequences = pad_sequences(category_sequences, maxlen=max_length, padding='post')
padded_value_sequences = pad_sequences(value_sequences, maxlen=max_length, padding='post')
padded_to_distribute_sequences = pad_sequences(to_distribute_sequences, maxlen=max_length, padding='post')
# Creating a model
input_layer = Input(shape=(max_length,))
embedding_layer = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64)(input_layer)
lstm_layer = LSTM(64)(embedding_layer)
output_layer = Dense(36, activation='softmax')(lstm_layer)
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Model training
#model.fit(padded_to_distribute_sequences, padded_category_sequences, epochs=10, batch_size=32, validation_split=0.2)
#model.fit(np.array(data_list), np.array(y), verbose=0, epochs=100)
model.fit(np.array(padded_to_distribute_sequences), np.array(padded_category_sequences), verbose=0, epochs=100)
At the moment I get the following error, and I don't know how to fix it:
ValueError Traceback (most recent call last)
<ipython-input-18-4e982bc70a7f> in <cell line: 37>()
35 #model.fit(padded_to_distribute_sequences, padded_category_sequences, epochs=10, batch_size=32, validation_split=0.2)
36 #model.fit(np.array(data_list), np.array(y), verbose=0, epochs=100)
---> 37 model.fit(np.array(padded_to_distribute_sequences), np.array(padded_category_sequences), verbose=0, epochs=100)
1 frames
/usr/local/lib/python3.10/dist-packages/keras/src/engine/data_adapter.py in _check_data_cardinality(data)
1958 )
1959 msg += "Make sure all arrays contain the same number of samples."
-> 1960 raise ValueError(msg)
1961
1962
ValueError: Data cardinality is ambiguous:
x sizes: 3549
y sizes: 36
Make sure all arrays contain the same number of samples.
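The error itself is not mysterious: Keras counts samples along the first axis of x and y and refuses to train when they differ. A minimal numpy sketch of that check (my own illustration of the rule, not Keras internals), using the shapes from the traceback:

```python
import numpy as np

# x: 3549 padded sequences, y: only 36 padded category
# sequences -- the sizes reported in the traceback.
x = np.zeros((3549, 5))
y = np.zeros((36, 5))

# Keras compares the first dimension of every input/target array;
# training requires exactly one target row per input row.
print(x.shape[0], y.shape[0], x.shape[0] == y.shape[0])
```

So the fix has to make y contain one target per row of x, i.e. 3549 targets.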
I've tried changing lines of code following advice from websites and forums, but nothing has helped so far. I'd be glad for your help!
I'm writing the code in Google Colab.
Unfortunately I can't share the original files I'm using, because they contain personal data, but I can share a short summary so that the logic of what I'm doing is clear. I've attached it at the end of the description.
I think the problem is that your target data padded_category_sequences and your input data padded_to_distribute_sequences have different numbers of samples, which causes the ValueError.
After the "Data preprocessing" step, add:
target_data = np.tile(padded_category_sequences, (len(padded_to_distribute_sequences) // len(padded_category_sequences), 1))
I'm assuming padded_category_sequences is your target data.
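One caveat on the np.tile approach: going by the sizes in the traceback (3549 and 36), 3549 is not a multiple of 36, so tiling the 36 category rows still leaves a mismatch. Also, since the model is compiled with sparse_categorical_crossentropy and a Dense(36, softmax) head, Keras expects the targets to be a 1-D array of integer class indices, one per input sample, not padded token sequences. A sketch of that target shape (the random labels here are placeholders; in practice each entry would be the category index of the corresponding match-file row):

```python
import numpy as np

n_samples = 3549   # rows of padded_to_distribute_sequences (from the traceback)
n_classes = 36     # units in the Dense softmax output layer

# 3549 is not a multiple of 36, so tiling the 36 category rows
# can never reach exactly 3549 targets.
print(3549 % 36)   # non-zero remainder

# Placeholder labels: one integer class id per input sample, which is
# the shape sparse_categorical_crossentropy expects.
rng = np.random.default_rng(0)
labels = rng.integers(0, n_classes, size=n_samples)
print(labels.shape)
```

With labels shaped (3549,), model.fit(padded_to_distribute_sequences, labels, ...) would pass the cardinality check; the real work is then producing the correct label for each row from the examples file.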