Find the list of all columns whose values fall between specific columns in a PySpark DataFrame

Problem description (Votes: 1, Answers: 1)

I have a Spark DataFrame with 20 columns, and I want to find which columns have values lying between the values of the High and Low columns.

Time,8,7,6,5,4,3,2,1,0,-1,-2,-3,-4,-5,-6,-7,-8,High,Low
09:16,930.9476296,927.4296671,924.1894385,923.2636589,921.6898335,920.578898,919.4679625,918.171871,915.95,913.728129,912.4320375,911.321102,910.2101665,908.6363411,907.7105615,904.4703329,900.9523704,919.95,917.65

I tried the command below, but it gives an error:

joineddata.withColumn('RR', map(lambda x: [x], ((F.col(x) >= (F.col('Low')) & (F.col(x) <= (F.col('High')) for x in joineddata.columns[1:18]))))).show()

Error

TypeError: Column is not iterable

Desired result

I would like a new column containing the list of the names of the columns whose values lie between the High and Low column values.

Time,8,7,6,5,4,3,2,1,0,-1,-2,-3,-4,-5,-6,-7,-8,High,Low,RR
09:16,930.9476296,927.4296671,924.1894385,923.2636589,921.6898335,920.578898,919.4679625,918.171871,915.95,913.728129,912.4320375,911.321102,910.2101665,908.6363411,907.7105615,904.4703329,900.9523704,919.95,917.65,[2,1]
python apache-spark pyspark pyspark-sql pyspark-dataframes
1 Answer
0 votes

Use `when` to collect, into an array, the names of the columns that satisfy the condition (checked with `between`), then filter the resulting array to remove the nulls (the columns that did not satisfy the condition).

Note that the second step uses the array `filter` higher-order function, which is only available in Spark 2.4+. For older versions, a UDF can be used instead.

© www.soinside.com 2019 - 2024. All rights reserved.