pyspark.errors.exceptions.captured.IllegalArgumentException: Output column features already exists


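Below is my code. I get the error "pyspark.errors.exceptions.captured.IllegalArgumentException: Output column features already exists". I have looked at other posts, but I am not sure what I need to do. Can anyone help?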

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LinearSVC
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Create a SparkSession
spark = SparkSession.builder.appName("").getOrCreate()

# Load your TSV file into a DataFrame
data = spark.read.csv("sleep.tsv", sep='\t', header=True, inferSchema=True)

input_cols = ["V0", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12"]

# Concatenate input columns into a single column named "features"
assembler = VectorAssembler(inputCols=input_cols, outputCol= "features")
data_assembled = assembler.transform(data)

# Renamed target column to label
data_assembled = data_assembled.withColumnRenamed("target", "label")

### Split data into training and testing sets
(trainingData, testData) = new_data.randomSplit([0.8, 0.2], seed=16)

### Fit the pipeline to training data
model = pipeline.fit(trainingData)
python apache-spark pyspark
1 Answer

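The error "IllegalArgumentException: Output column features already exists" occurs because the "features" column is created more than once. Calling assembler.transform(data) already adds it, and if the pipeline contains the same assembler, fitting the pipeline on that transformed data tries to add it a second time. To fix this, make sure the "features" column is created only once.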

You can update your code as follows:

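1- Remove the line that applies the VectorAssembler to the data directly. You will not need its result in the later steps, because the pipeline runs the assembler itself: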

   data_assembled = assembler.transform(data)

2- Optionally, give the VectorAssembler output column a unique name, for example "assembled_features". Do not call assembler.transform(data) here, otherwise the column will already exist when the pipeline runs the assembler:

assembler = VectorAssembler(inputCols=input_cols, outputCol="assembled_features")

3- Update the rest of the code to match, working from the untransformed data:

# Rename the target column to label
data = data.withColumnRenamed("target", "label")

# Define your pipeline including the VectorAssembler step.
# Later stages must read from "assembled_features" (e.g. via featuresCol).
pipeline = Pipeline(stages=[assembler, ...])  # Add your other stages

# Split data into training and testing sets
(trainingData, testData) = data.randomSplit([0.8, 0.2], seed=16)

# Fit the pipeline to training data
model = pipeline.fit(trainingData)