I have a text classification problem.
I am particularly interested in this embedding model from sparknlp because I have an "sq" (Albanian) language dataset taken from Wikipedia, and I need to convert the sentences of my dataset into embeddings.
I do this with WordEmbeddingsModel, but after generating the embeddings I don't know how to prepare them as input for an RNN model built with Keras and TensorFlow.
My dataset has two columns, "text" and "label", and so far I have been able to perform the following steps:
# start spark session
spark = sparknlp.start(gpu=True)
# convert train df into spark df
spark_train_df = spark.createDataFrame(train)
+--------------------+-----+
| text|label|
+--------------------+-----+
|Joy Adowaa Buolam...| 0|
|Ajo themeloi "Alg...| 1|
|Buolamwini lindi ...| 1|
|Kur ishte 9 vjeç,...| 0|
|Si një studente u...| 1|
+--------------------+-----+
# define sparknlp pipeline
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
embeddings = WordEmbeddingsModel\
    .pretrained("w2v_cc_300d", "sq")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")
pipeline = Pipeline(stages=[document, tokenizer, embeddings])
# fit the pipeline to the training data
model = pipeline.fit(spark_train_df)
# apply the pipeline to the training data
result = model.transform(spark_train_df)
result.show()
+--------------------+-----+--------------------+--------------------+--------------------+
| text|label| document| token| embeddings|
+--------------------+-----+--------------------+--------------------+--------------------+
|Joy Adowaa Buolam...| 0|[{document, 0, 13...|[{token, 0, 2, Jo...|[{word_embeddings...|
|Ajo themeloi "Alg...| 1|[{document, 0, 13...|[{token, 0, 2, Aj...|[{word_embeddings...|
|Buolamwini lindi ...| 1|[{document, 0, 94...|[{token, 0, 9, Bu...|[{word_embeddings...|
|Kur ishte 9 vjeç,...| 0|[{document, 0, 12...|[{token, 0, 2, Ku...|[{word_embeddings...|
|Si një studente u...| 1|[{document, 0, 15...|[{token, 0, 1, Si...|[{word_embeddings...|
|Buolamwini diplom...| 1|[{document, 0, 11...|[{token, 0, 9, Bu...|[{word_embeddings...|
+--------------------+-----+--------------------+--------------------+--------------------+
The schema of the result is:
result.printSchema()
root
|-- text: string (nullable = true)
|-- label: long (nullable = true)
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
|-- token: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
|-- embeddings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
The output I get from
result.schema["embeddings"].dataType
is:
ArrayType(StructType([StructField('annotatorType', StringType(), True), StructField('begin', IntegerType(), False), StructField('end', IntegerType(), False), StructField('result', StringType(), True), StructField('metadata', MapType(StringType(), StringType(), True), True), StructField('embeddings', ArrayType(FloatType(), False), True)]), True)
To feed an RNN model in Keras and TensorFlow with the embeddings produced by the SparkNLP WordEmbeddingsModel, you can convert the Spark DataFrame to a Pandas DataFrame, extract the embeddings as numpy arrays, split the dataset into training and test sets, define and train the RNN model, and evaluate it on the test set. Here is how you can do that:
Convert the Spark DataFrame to a Pandas DataFrame:
pandas_df = result.toPandas()
Extract the embeddings from the `embeddings` column of the Pandas DataFrame. Each row holds one annotation struct per token, so pull out the `embeddings` field of each annotation:
embeddings = pandas_df['embeddings'].apply(lambda anns: [ann['embeddings'] for ann in anns])
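To see what this extraction yields, here is a toy example with the same nested structure. Plain dicts stand in for the Row objects that `toPandas()` produces (both support `['embeddings']` indexing), and the 300-d vectors are shortened to 3-d for readability:

```python
import pandas as pd

# Toy frame mimicking result.toPandas(): each row holds a list of
# annotation structs, each carrying an 'embeddings' field.
pandas_df = pd.DataFrame({
    'embeddings': [
        [{'result': 'Joy',    'embeddings': [0.1, 0.2, 0.3]},
         {'result': 'Adowaa', 'embeddings': [0.4, 0.5, 0.6]}],
        [{'result': 'Ajo',    'embeddings': [0.7, 0.8, 0.9]}],
    ]
})

# One list of per-token vectors per sentence.
embeddings = pandas_df['embeddings'].apply(
    lambda anns: [ann['embeddings'] for ann in anns])

print(embeddings[0])  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
```

Note that the sentences end up with different numbers of token vectors, which is why padding is needed before training.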
Convert the labels to one-hot encoded vectors:
from keras.utils import to_categorical
labels = to_categorical(pandas_df['label'])
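For reference, `to_categorical` turns the integer class ids into rows of a one-hot matrix; the equivalent construction in plain numpy (shown here so the shape is clear without needing Keras installed) is:

```python
import numpy as np

labels_int = np.array([0, 1, 1, 0])     # integer class ids
num_classes = labels_int.max() + 1

# One-hot encoding: index an identity matrix by the labels.
labels = np.eye(num_classes)[labels_int]
print(labels)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]
```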
Train a Keras RNN model on the extracted embeddings. Since the token vectors are already embeddings, an Embedding layer is not needed; instead, pad the variable-length sequences to a fixed shape and feed them straight into the LSTM:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import pad_sequences  # keras.preprocessing.sequence in older Keras

# pad each sentence's list of 300-d vectors to shape (num_samples, maxlen, 300)
X = pad_sequences(embeddings, maxlen=maxlen, dtype='float32', padding='post')

model = Sequential()
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2, input_shape=(maxlen, 300)))
model.add(Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, labels, epochs=10, batch_size=32)
Note that you need to set maxlen to the maximum length of your input sequences and tune the RNN model's parameters to your needs. Also, you may need to install the required libraries (Keras, TensorFlow, etc.) if you have not done so already.
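The train/test split and evaluation mentioned above can be done with a simple shuffled index split. This is a minimal numpy-only sketch; `X_demo` and `y_demo` are random stand-ins for the padded embedding array and the one-hot label matrix from the steps above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, maxlen, dim = 100, 20, 300
X_demo = rng.normal(size=(n, maxlen, dim)).astype('float32')  # stand-in for padded embeddings
y_demo = np.eye(2)[rng.integers(0, 2, size=n)]                # stand-in one-hot labels

# Shuffled 80/20 train/test split over row indices.
idx = rng.permutation(n)
cut = int(0.8 * n)
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, X_test = X_demo[train_idx], X_demo[test_idx]
y_train, y_test = y_demo[train_idx], y_demo[test_idx]

# Then: model.fit(X_train, y_train, ...) and model.evaluate(X_test, y_test)
```

In practice you may prefer `sklearn.model_selection.train_test_split`, which also supports stratification by label.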