I am trying to train a model for text classification. The model takes a list of up to 300 integers embedded from the article text. It trains without any problem, but its accuracy barely improves.

The targets consist of 41 categories, encoded as ints from 0 to 41 and then normalized.

The table looks like this:
<img src="https://image.soinside.com/eyJ1cmwiOiAiaHR0cHM6Ly9pLmltZ3VyLmNvbS9LZTNwNWY5LnBuZyJ9" alt="Table1">
Also, I don't know what my model should look like, because I have been referring to the two different examples below.

I have already tried modifying my model based on both of them, but the accuracy doesn't change, and it even decreases every epoch.

Should I add more layers to my model, or am I doing something stupid that I haven't realized?
Note: if the 'df.pickle' download link is broken, use this link
```python
from sklearn.model_selection import train_test_split
from urllib.request import urlopen
from os.path import exists
from os import mkdir
import tensorflow as tf
import pandas as pd
import pickle

# Define dataframe path
df_path = 'df.pickle'

# Check if local dataframe exists
if not exists(df_path):
    # Download binary from dropbox
    content = urlopen('https://ucd92a22d5e0d4d29b8edb608305.dl.dropboxusercontent.com/cd/0/get/Askx_25n3JI-jmnZsWXmMmRgd4O2EH1w9l0U6zCMq7xdSXs_IN_i2zuUviseqa9N7-WrReFbGhQi8CeseV5cNsFTO8dzRmSdxjr-MWEDQNpPaZ8Ik29E_58YAjY57qTc4CA/file#').read()
    # Write to file
    with open(df_path, 'wb') as file:
        file.write(content)
    # Load the dataframe from bytes
    df = pickle.loads(content)
# If the file exists (aka. downloaded)
else:
    # Load the dataframe from file
    df = pickle.load(open(df_path, 'rb'))

# Normalize the category
df['Category_Code'] = df['Category_Code'].apply(lambda x: x / 41)

train_df, test_df = [pd.DataFrame() for _ in range(2)]
x_train, x_test, y_train, y_test = train_test_split(df['Content_Parsed'], df['Category_Code'], test_size=0.15, random_state=8)
train_df['Content_Parsed'], train_df['Category_Code'] = x_train, y_train
test_df['Content_Parsed'], test_df['Category_Code'] = x_test, y_test

# Number of words we want to keep in our vocabulary
NUM_WORDS = 10000
# Input/Token length
SEQ_LEN = 300

# Create tokenizer for our data
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=NUM_WORDS, oov_token='<UNK>')
tokenizer.fit_on_texts(train_df['Content_Parsed'])

# Convert text data to numerical indexes
train_seqs = tokenizer.texts_to_sequences(train_df['Content_Parsed'])
test_seqs = tokenizer.texts_to_sequences(test_df['Content_Parsed'])

# Pad data up to SEQ_LEN (note that we truncate if there are more than SEQ_LEN tokens)
train_seqs = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, maxlen=SEQ_LEN, padding="post")
test_seqs = tf.keras.preprocessing.sequence.pad_sequences(test_seqs, maxlen=SEQ_LEN, padding="post")

# Create Models folder if it does not exist
if not exists('Models'):
    mkdir('Models')

# Define local model path
model_path = 'Models/model.pickle'

# Check if model exists/pre-trained
if not exists(model_path):
    # Define word embedding size
    EMBEDDING_SIZE = 16

    # Create new model
    '''
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
        # tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    '''
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
        # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compile the model
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    # Stop training when a monitored quantity has stopped improving
    es = tf.keras.callbacks.EarlyStopping(monitor='val_acc', mode='max', patience=1)

    # Define batch size (can be tuned to improve model accuracy)
    BATCH_SIZE = 16
    # Define number of epochs to train
    EPOCHS = 20

    # Using GPU (if this errors, you don't have a GPU; use CPU instead)
    with tf.device('/GPU:0'):
        # Train/Fit the model
        history = model.fit(
            train_seqs,
            train_df['Category_Code'].values,
            batch_size=BATCH_SIZE,
            epochs=EPOCHS,
            validation_split=0.2,
            validation_steps=30,
            callbacks=[es]
        )

        # Evaluate the model
        model.evaluate(test_seqs, test_df['Category_Code'].values)

    # Save the model into a file
    with open(model_path, 'wb') as file:
        file.write(pickle.dumps(model))
else:
    # Load the model
    model = pickle.load(open(model_path, 'rb'))

# Check the model
model.summary()
```
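One incidental issue in the training code above: with `metrics=['accuracy']`, TF 2.x records the validation metric under the key `val_accuracy`, not `val_acc`, so the `EarlyStopping` callback only emits a warning and never actually stops training. A minimal correction, assuming TF 2.x:

```python
# TF 2.x logs the metric as 'val_accuracy'; monitoring the missing
# 'val_acc' key makes EarlyStopping a silent no-op (it only warns).
es = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', mode='max', patience=1)
```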
After 2 days of tweaking and studying more examples, I found this website, which explains multi-class classification well.

The details of the changes I made include:
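A minimal sketch of that multi-class setup, assuming the usual ingredients rather than the author's exact edits: keep `Category_Code` as raw integer labels (drop the `/ 41` normalization), give the final `Dense` layer one unit per class with softmax, and train with `sparse_categorical_crossentropy`:

```python
NUM_CLASSES = 41  # one output unit per category

# Same embedding + pooling front end as before, but with a proper
# multi-class head instead of a single sigmoid unit
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

# sparse_categorical_crossentropy consumes integer labels directly,
# so the targets must stay as ints rather than being scaled into [0, 1]
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
```

With a single sigmoid output and labels squashed into [0, 1], the original model was effectively doing regression with a loss that cannot separate 41 classes, which is why its accuracy barely moved.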