import re
import tensorflow as tf

token_ids = []
for tweet in tweets:
    # Remove unwanted characters and symbols
    tweet = re.sub(r'[^\w\s]', '', tweet)
    # Tokenize the tweet
    tokens = bert_tokenizer.tokenize([tweet])
    # Convert tokens to token IDs
    ids = tf.squeeze(bert_tokenizer.convert_tokens_to_ids(tokens))
    token_ids.append(ids)
input_ids = tf.ragged.constant(token_ids)
I'm trying to preprocess and tokenize the tweets, but it raises:
TypeError: expected string or bytes-like object
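That TypeError is raised by re.sub, not by the tokenizer: the re module only accepts strings (or bytes), so it usually means `tweets` contains a non-string element, such as a NaN float picked up when the data was loaded with pandas. A minimal sketch of a guard that coerces or skips non-string entries before cleaning (the sample data here is hypothetical):

```python
import re

# Hypothetical input: one entry is a NaN float, as pandas produces
# for missing values, which would make re.sub raise the TypeError.
tweets = ["Hello, world!", float("nan"), "BERT #NLP rocks"]

cleaned = []
for tweet in tweets:
    if not isinstance(tweet, str):
        # Skip (or alternatively coerce with str(tweet)) non-string rows
        continue
    # Remove unwanted characters and symbols, as in the original loop
    cleaned.append(re.sub(r"[^\w\s]", "", tweet))

print(cleaned)
```

If every entry really is a string, the other suspect is `bert_tokenizer.tokenize([tweet])`: a Hugging Face-style tokenizer's `tokenize` method expects a single string, so passing a one-element list can produce the same error inside the tokenizer. In that case dropping the brackets, `bert_tokenizer.tokenize(tweet)`, is the fix.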