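This is my code below. I am getting the error "pyspark.errors.exceptions.captured.IllegalArgumentException: Output column features already exists". I have checked other posts, but I am not sure what I need to do. Can anyone help here?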
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LinearSVC
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Create a SparkSession
spark = SparkSession.builder.appName("").getOrCreate()
# Load your TSV file into a DataFrame
data = spark.read.csv("sleep.tsv", sep='\t', header=True, inferSchema=True)
input_cols = ["V0", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12"]
# Concatenate input columns into a single column named "features"
assembler = VectorAssembler(inputCols=input_cols, outputCol= "features")
data_assembled = assembler.transform(data)
# Renamed target column to label
data_assembled = data_assembled.withColumnRenamed("target", "label")
### Split data into training and testing sets
(trainingData, testData) = new_data.randomSplit([0.8, 0.2], seed=16)
### Fit the pipeline to training data
model = pipeline.fit(trainingData)
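The error "IllegalArgumentException: Output column features already exists" occurs because the "features" column is created more than once during the transformations. This typically happens when assembler.transform(data) is called by hand and the same assembler is also a stage of the Pipeline: pipeline.fit() then applies the assembler again to a DataFrame that already contains a "features" column. To fix this, make sure the "features" column is created only once.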
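A quick way to confirm this diagnosis is to compare the DataFrame's columns with the output columns of the pipeline stages before fitting. The helper below is a minimal sketch in plain Python; the name output_col_conflicts and the sample column lists are made up for illustration and are not part of PySpark.

```python
def output_col_conflicts(existing_columns, stage_output_cols):
    """Return the stage output columns already present in the DataFrame."""
    existing = set(existing_columns)
    return [col for col in stage_output_cols if col in existing]


# Hypothetical column lists: a DataFrame that was already passed through
# assembler.transform(...), and the raw DataFrame before any transformation.
already_assembled = ["V0", "V1", "label", "features"]
raw = ["V0", "V1", "label"]

print(output_col_conflicts(already_assembled, ["features"]))  # ['features']
print(output_col_conflicts(raw, ["features"]))  # []
```

With real Spark objects, the same check would be output_col_conflicts(trainingData.columns, [assembler.getOutputCol()]). A non-empty result means that fitting a pipeline containing that assembler will most likely raise the error above.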
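You can update your code as follows:
1- Remove the line that uses the VectorAssembler to transform the data into the "features" column yourself, since the pipeline performs that step when it is fitted: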
data_assembled = assembler.transform(data)
2- Update the VectorAssembler output column name to something unique, for example "assembled_features":
assembler = VectorAssembler(inputCols=input_cols, outputCol="assembled_features")
3- Update the subsequent code to reflect this change:
# Rename the target column on the raw DataFrame; do not call assembler.transform() here,
# otherwise the pipeline would try to create "assembled_features" a second time
data = data.withColumnRenamed("target", "label")
# Define your pipeline including the VectorAssembler step
pipeline = Pipeline(stages=[assembler, ...]) # Add your other stages; a classifier must read featuresCol="assembled_features"
# Split data into training and testing sets
(trainingData, testData) = data.randomSplit([0.8, 0.2], seed=16)
# Fit the pipeline to training data
model = pipeline.fit(trainingData)
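Putting the pieces together, an end-to-end version might look like the sketch below. It rests on several assumptions that are not stated in the question: that the classifier stage is the LinearSVC already imported there, that the "target" column holds numeric 0/1 labels (LinearSVC is a binary classifier), and that accuracy is the metric of interest. The function name train_linear_svc and the app name are invented for the example.

```python
# Input columns as listed in the question: V0 ... V12
INPUT_COLS = [f"V{i}" for i in range(13)]


def train_linear_svc(tsv_path, input_cols=INPUT_COLS, seed=16):
    """Fit assembler + LinearSVC in one pipeline and return (model, accuracy)."""
    # Imported here so that defining the function does not require pyspark
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LinearSVC
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("linear-svc-sketch").getOrCreate()

    # Load the raw data and rename the target column; no manual assembling
    data = spark.read.csv(tsv_path, sep="\t", header=True, inferSchema=True)
    data = data.withColumnRenamed("target", "label")

    # The assembler runs only inside the pipeline, so "features" is created once
    assembler = VectorAssembler(inputCols=input_cols, outputCol="features")
    # Assumption: "label" contains 0/1 values, as LinearSVC expects
    svc = LinearSVC(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, svc])

    training_data, test_data = data.randomSplit([0.8, 0.2], seed=seed)
    model = pipeline.fit(training_data)

    # The fitted pipeline assembles the test rows itself before predicting
    predictions = model.transform(test_data)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return model, evaluator.evaluate(predictions)
```

Calling train_linear_svc("sleep.tsv") would run the whole flow. Because the raw DataFrame is what gets split and fitted, neither the training nor the test set carries a pre-built "features" column, which is what avoids the "already exists" error.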