I'm working on a project where I need to run K-means clustering on a large dataset with PySpark. The dataset consists of millions of rows and has thousands of feature columns. I've successfully loaded the data into a PySpark DataFrame, but I'm running into problems preparing it for K-means.
I've followed various tutorials and examples, but I keep hitting errors related to column parsing, column indexing, or other transformations. The main challenge seems to be converting the feature columns into the format K-means expects. The dimensions of the data are (6883, 9995). As you can see, this is a very wide dataset, and I'm puzzled about why the K-means fit and predict steps fail.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.getActiveSession()

# Tab-separated file, loaded through the CSV reader
file_location = "dbfs:/user/dorwayne/bgc_features_part0001.tsv"

bgc = (
    spark.read.format("csv")
    .option("inferSchema", "false")
    .option("delimiter", "\t")
    .option("header", "true")
    .load(file_location)
)

assembler = VectorAssembler(inputCols=bgc.columns, outputCol="features")
assembled_data = assembler.transform(bgc)

kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(assembled_data)
predictions = model.transform(assembled_data)
But it fails with this error:
Py4JJavaError: An error occurred while calling o8473.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 43.0 failed 4 times, most recent failure: Lost task 2.3 in stage 43.0 (TID 147) (ip-10-131-129-26.ec2.internal executor driver): java.lang.AssertionError: assertion failed