I am trying to solve the problem described in "Access element of a vector in a Spark DataFrame (Logistic Regression probability vector)", but in PySpark and without using a UDF.
I have seen plenty of options for Scala, but none for PySpark.
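from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors

# Reuses the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (Vectors.dense([-2.0, 2.3, 0.0, 0.0, 1.0]),),
    (Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0]),),
    (Vectors.dense([0.6, -1.1, -3.0, 4.5, 3.3]),)], ["features"])

# Keep only the elements at indices 1 and 4 of each vector
vs = VectorSlicer(inputCol="features", outputCol="sliced", indices=[1, 4])
print(vs.transform(df).head().sliced)
# Prints: DenseVector([2.3, 1.0])
# i.e. the elements at indices 1 and 4 of the first 'features' vector in the DataFrame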