How to extract the embeddings generated by a Spark NLP WordEmbeddingsModel and feed them to an RNN model using Keras and TensorFlow

Problem description · votes: 0 · answers: 2

I have a text classification problem.

I am particularly interested in this embeddings model in Spark NLP because I have a dataset from Wikipedia in the "sq" (Albanian) language. I need to convert the sentences of my dataset into embeddings.

I do this via WordEmbeddingsModel; however, after generating the embeddings, I don't know how to prepare them so they are ready as input for an RNN model using Keras and TensorFlow.

My dataset has two columns, "text" and "label", and so far I have been able to perform the following steps:

import sparknlp

# start spark session
spark = sparknlp.start(gpu=True)

# convert train df into spark df
spark_train_df = spark.createDataFrame(train)

+--------------------+-----+
|                text|label|
+--------------------+-----+
|Joy Adowaa Buolam...|    0|
|Ajo themeloi "Alg...|    1|
|Buolamwini lindi ...|    1|
|Kur ishte 9 vjeç,...|    0|
|Si një studente u...|    1|
+--------------------+-----+

# define sparknlp pipeline

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel \
    .pretrained("w2v_cc_300d", "sq") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document, tokenizer, embeddings])

# fit the pipeline to the training data

model = pipeline.fit(spark_train_df)

# apply the pipeline to the training data

result = model.transform(spark_train_df)
result.show()


+--------------------+-----+--------------------+--------------------+--------------------+
|                text|label|            document|               token|          embeddings|
+--------------------+-----+--------------------+--------------------+--------------------+
|Joy Adowaa Buolam...|    0|[{document, 0, 13...|[{token, 0, 2, Jo...|[{word_embeddings...|
|Ajo themeloi "Alg...|    1|[{document, 0, 13...|[{token, 0, 2, Aj...|[{word_embeddings...|
|Buolamwini lindi ...|    1|[{document, 0, 94...|[{token, 0, 9, Bu...|[{word_embeddings...|
|Kur ishte 9 vjeç,...|    0|[{document, 0, 12...|[{token, 0, 2, Ku...|[{word_embeddings...|
|Si një studente u...|    1|[{document, 0, 15...|[{token, 0, 1, Si...|[{word_embeddings...|
|Buolamwini diplom...|    1|[{document, 0, 11...|[{token, 0, 9, Bu...|[{word_embeddings...|
+--------------------+-----+--------------------+--------------------+--------------------+

The schema of the result is:

result.printSchema()



root
 |-- text: string (nullable = true)
 |-- label: long (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

The output I get from:

result.schema["embeddings"].dataType

is:

ArrayType(StructType([StructField('annotatorType', StringType(), True), StructField('begin', IntegerType(), False), StructField('end', IntegerType(), False), StructField('result', StringType(), True), StructField('metadata', MapType(StringType(), StringType(), True), True), StructField('embeddings', ArrayType(FloatType(), False), True)]), True)
Tags: tensorflow · keras · pyspark · embedding · johnsnowlabs-spark-nlp
2 Answers

0 votes

To extract the embeddings generated by the Spark NLP WordEmbeddingsModel for an RNN model in Keras and TensorFlow: convert the Spark DataFrame to a Pandas DataFrame, retrieve the embeddings (e.g. with iloc), convert them to numpy arrays, split the dataset into training and test sets, define an RNN model with Keras and TensorFlow, train the model on the training set, and evaluate its performance on the test set.
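A minimal end-to-end sketch of the steps above follows. The tiny stand-in DataFrame (replacing `result.toPandas()` from the question), the layer sizes, and the `maxlen` choice are illustrative assumptions, not part of the original answer:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import pad_sequences, to_categorical

# In the real pipeline: pandas_df = result.toPandas()
# Stand-in with the same shape: each row of "embeddings" is a list of
# annotation structs, each carrying one 300-d vector in its "embeddings" field.
dim = 300
pandas_df = pd.DataFrame({
    "embeddings": [
        [{"embeddings": [0.1] * dim} for _ in range(5)],
        [{"embeddings": [0.2] * dim} for _ in range(3)],
        [{"embeddings": [0.3] * dim} for _ in range(4)],
        [{"embeddings": [0.4] * dim} for _ in range(2)],
    ],
    "label": [0, 1, 1, 0],
})

# pull the per-token vectors out of the annotation structs
sequences = [[tok["embeddings"] for tok in row] for row in pandas_df["embeddings"]]

# pad to a common length and stack into an array of shape (n, maxlen, dim)
maxlen = max(len(s) for s in sequences)
X = pad_sequences(sequences, maxlen=maxlen, dtype="float32", padding="post")
y = to_categorical(pandas_df["label"])

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# the embeddings are already dense vectors, so the RNN consumes them directly
model = Sequential([
    LSTM(16, input_shape=(maxlen, dim)),
    Dense(y.shape[1], activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=1, batch_size=2, verbose=0)
loss, acc = model.evaluate(X_test, y_test, verbose=0)
```

With real data you would of course train for more epochs and pick the network sizes to suit your dataset.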


0 votes

To extract the embeddings generated by the Spark NLP WordEmbeddingsModel and feed an RNN model with Keras and TensorFlow, you can convert the Spark DataFrame to a Pandas DataFrame and then use it to train a Keras RNN model. Here is how you can do it:

Convert the Spark DataFrame to a Pandas DataFrame:

pandas_df = result.toPandas()

Extract the embeddings from the embeddings column of the Pandas DataFrame:

# each row holds one annotation struct per token; keep its vector
embeddings = pandas_df['embeddings'].apply(lambda row: [tok.embeddings for tok in row])

Convert the labels to one-hot encoded vectors:

from keras.utils import to_categorical

labels = to_categorical(pandas_df['label'])

Train a Keras RNN model using the extracted embeddings:

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import pad_sequences

# the extracted vectors are already dense embeddings, so no trainable
# Embedding layer is needed; pad the variable-length sentences to maxlen
# and feed the float sequences straight to the LSTM
X = pad_sequences(embeddings, maxlen=maxlen, dtype='float32', padding='post')

model = Sequential()
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2, input_shape=(maxlen, 300)))
model.add(Dense(units=2, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, labels, epochs=10, batch_size=32)

Note that you need to set maxlen to the maximum length of your input sequences and adjust the parameters of the RNN model to your needs. Also, you may need to install the required libraries (Keras, TensorFlow, etc.) if you have not done so already.
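Since the sentences differ in length, they have to be brought to a common maxlen before being stacked into one tensor. A small self-contained illustration of the padding step, using made-up 3-dimensional vectors in place of the real 300-dimensional ones:

```python
from tensorflow.keras.utils import pad_sequences

# two "sentences" of different lengths; each token is a 3-d vector here
seqs = [
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
    [[3.0, 3.0, 3.0]],
]
maxlen = max(len(s) for s in seqs)
X = pad_sequences(seqs, maxlen=maxlen, dtype="float32", padding="post")
# X has shape (2, 2, 3); the short sentence is padded with a zero vector
```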
