My text classifier model fails to improve across multiple classes

Problem description · Votes: 1 · Answers: 1

I am trying to train a model for text classification. It takes a list of up to 300 integers embedded from an article. The model trains without problems, but its accuracy barely improves.

The target consists of 41 categories, encoded as ints from 0 to 41 and then normalized.
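
Concretely, that normalization just divides each integer class code by the number of classes, squeezing the labels into [0, 1]. A minimal illustration with hypothetical code values, using the same scaling as the script below:

import pandas as pd

codes = pd.Series([0, 5, 40])   # hypothetical Category_Code values
normalized = codes / 41         # same scaling as in the code below
print(normalized.tolist())      # [0.0, 0.1219..., 0.9756...]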

The dataframe looks like this:

[Table1: https://image.soinside.com/eyJ1cmwiOiAiaHR0cHM6Ly9pLmltZ3VyLmNvbS9LZTNwNWY5LnBuZyJ9]

Also, I don't know what my model should look like, because I have been following these two different examples:

  • A binary classifier with one input column and one output column (Example 1)
  • A multi-class classifier with multiple columns as input (Example 2)

I have tried modifying my model based on both examples, but the accuracy does not change, and it even decreases with every epoch.

Should I add more layers to my model, or am I doing something silly that I have not realized?

Note: if the 'df.pickle' download link is broken, use this link instead

from sklearn.model_selection import train_test_split
from urllib.request import urlopen
from os.path import exists
from os import mkdir
import tensorflow as tf
import pandas as pd
import pickle

# Define dataframe path
df_path = 'df.pickle'

# Check if local dataframe exists
if not exists(df_path):
  # Download binary from dropbox
  content = urlopen('https://ucd92a22d5e0d4d29b8edb608305.dl.dropboxusercontent.com/cd/0/get/Askx_25n3JI-jmnZsWXmMmRgd4O2EH1w9l0U6zCMq7xdSXs_IN_i2zuUviseqa9N7-WrReFbGhQi8CeseV5cNsFTO8dzRmSdxjr-MWEDQNpPaZ8Ik29E_58YAjY57qTc4CA/file#').read()

  # Write to file
  with open(df_path, 'wb') as file: file.write(content)

  # Load the dataframe from bytes
  df = pickle.loads(content)
# If the file already exists (i.e. it was downloaded before)
else:
  # Load the dataframe from file
  df = pickle.load(open(df_path, 'rb'))

# Normalize the category
df['Category_Code'] = df['Category_Code'].apply(lambda x: x / 41)

train_df, test_df = [pd.DataFrame() for _ in range(2)]

x_train, x_test, y_train, y_test = train_test_split(df['Content_Parsed'], df['Category_Code'], test_size=0.15, random_state=8)
train_df['Content_Parsed'], train_df['Category_Code'] = x_train, y_train
test_df['Content_Parsed'], test_df['Category_Code'] = x_test, y_test

# Variable containing the number of words we want to keep in our vocabulary
NUM_WORDS = 10000
# Input/Token length
SEQ_LEN = 300

# Create tokenizer for our data
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=NUM_WORDS, oov_token='<UNK>')
tokenizer.fit_on_texts(train_df['Content_Parsed'])

# Convert text data to numerical indexes
train_seqs=tokenizer.texts_to_sequences(train_df['Content_Parsed'])
test_seqs=tokenizer.texts_to_sequences(test_df['Content_Parsed'])

# Pad data up to SEQ_LEN (note that we truncate if there are more than SEQ_LEN tokens)
train_seqs=tf.keras.preprocessing.sequence.pad_sequences(train_seqs, maxlen=SEQ_LEN, padding="post")
test_seqs=tf.keras.preprocessing.sequence.pad_sequences(test_seqs, maxlen=SEQ_LEN, padding="post")

# Create Models folder if not exists
if not exists('Models'): mkdir('Models')

# Define local model path
model_path = 'Models/model.pickle'

# Check if model exists/pre-trained
if not exists(model_path):
  # Define word embedding size
  EMBEDDING_SIZE = 16

  # Create new model
  '''
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
    # tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  '''
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
      # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
      tf.keras.layers.GlobalAveragePooling1D(),
      tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
      tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  # Compile the model
  model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
  )

  # Stop training when a monitored quantity has stopped improving.
  es = tf.keras.callbacks.EarlyStopping(monitor='val_acc', mode='max', patience=1)

  # Define batch size (Can be tuned to improve model accuracy)
  BATCH_SIZE = 16
  # Define the number of epochs (training cycles)
  EPOCHS = 20

  # Use the GPU (if this raises an error, no GPU is available; use the CPU instead)
  with tf.device('/GPU:0'):
    # Train/Fit the model
    history = model.fit(
      train_seqs, 
      train_df['Category_Code'].values, 
      batch_size=BATCH_SIZE, 
      epochs=EPOCHS, 
      validation_split=0.2,
      validation_steps=30,
      callbacks=[es]
    )

  # Evaluate the model
  model.evaluate(test_seqs, test_df['Category_Code'].values)

  # Save the model into a file
  with open(model_path, 'wb') as file: file.write(pickle.dumps(model))
else:
  # Load the model
  model = pickle.load(open(model_path, 'rb'))

# Check the model
model.summary()


python pandas tensorflow machine-learning text-classification
1 Answer

0 votes

After 2 days of tweaking and studying more examples, I found this site, which explains multi-class classification nicely.

The details of the changes I made include:
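
In outline, converting the question's binary setup into a true 41-class classifier involves keeping the integer labels un-normalized, widening the output layer to one softmax unit per class, and switching the loss to sparse categorical cross-entropy. A minimal sketch under those assumptions (illustrative, not the answerer's exact code; vocabulary and embedding sizes taken from the question):

import tensorflow as tf

NUM_WORDS = 10000       # vocabulary size, as in the question
NUM_CLASSES = 41        # one output unit per category
EMBEDDING_SIZE = 16

# Multi-class variant of the question's model: the final layer has
# NUM_CLASSES units with softmax instead of a single sigmoid unit.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

# sparse_categorical_crossentropy expects the raw integer labels
# (0..40), so the `x / 41` normalization step should be dropped.
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

With a single sigmoid unit and binary cross-entropy, the original model was effectively regressing one number towards a scaled label rather than choosing among 41 classes, which is why its accuracy stayed flat.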
