I have a table like this in Hive:
A    | B    | C    | value
key1 | NULL | NULL | v1
NULL | key2 | NULL | v2
NULL | NULL | key3 | v3
NULL | key4 | NULL | v4
What is the simplest way to convert it into a key-value table like this:
key_type | key_value | value
A        | key1      | v1
B        | key2      | v2
C        | key3      | v3
B        | key4      | v4
A solution in either Hive SQL or a Spark DataFrame transformation (PySpark) would be fine. Thanks for your help.
You can use union all:
select t.key_type, t.key_value, t.value
from (select 'A' as key_type, a as key_value, value from t
      union all
      select 'B' as key_type, b as key_value, value from t
      union all
      select 'C' as key_type, c as key_value, value from t
     ) t
where t.key_value is not null  -- key_type is a constant label and is never null
order by t.value;
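To sanity-check this kind of query without a Hive cluster, here is a minimal sketch that runs the same union all shape against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive (an assumption for local testing), and the subquery alias u is a hypothetical name chosen to avoid reusing the table name. Note that the filter is on key_value: key_type is a constant label, so it can never be NULL.

```python
import sqlite3

# SQLite stands in for Hive here (an assumption, for local testing only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A TEXT, B TEXT, C TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("key1", None, None, "v1"),
        (None, "key2", None, "v2"),
        (None, None, "key3", "v3"),
        (None, "key4", None, "v4"),
    ],
)

# One SELECT per key column; each tags its rows with the column name.
query = """
SELECT u.key_type, u.key_value, u.value
FROM (
    SELECT 'A' AS key_type, A AS key_value, value FROM t
    UNION ALL
    SELECT 'B' AS key_type, B AS key_value, value FROM t
    UNION ALL
    SELECT 'C' AS key_type, C AS key_value, value FROM t
) u
WHERE u.key_value IS NOT NULL
ORDER BY u.value
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each branch of the union scans the table once, so the cost grows with the number of key columns; for three columns that is usually acceptable.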
With PySpark, you can select the key columns, map each one to its column name when its value is not null, and then apply greatest:
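from pyspark.sql import functions as F

cols = [c for c in df.columns if c != 'value']  # ['A', 'B', 'C']
# greatest skips nulls, so each call picks the single non-null entry per row
df.select(
    F.greatest(*[F.when(F.col(c).isNotNull(), c) for c in cols]).alias("key_type"),
    F.greatest(*[F.col(c) for c in cols]).alias("key_Value"), "value"
).show()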
+--------+---------+-----+
|key_type|key_Value|value|
+--------+---------+-----+
|       A|     key1|   v1|
|       B|     key2|   v2|
|       C|     key3|   v3|
|       B|     key4|   v4|
+--------+---------+-----+
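Why this works: Spark's greatest skips null inputs and returns null only when every input is null, so with exactly one non-null key column per row it simply picks that column. The plain-Python sketch below imitates that behavior to make the logic visible. It is an emulation, not PySpark code, and greatest_skip_nulls and unpivot_row are hypothetical helper names. It also illustrates a caveat: if a row had two non-null key columns, the two greatest calls would each pick the largest string independently, so the reported name and value could come from different columns.

```python
def greatest_skip_nulls(*values):
    """Imitate Spark's greatest: ignore None, return None if all inputs are None."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def unpivot_row(row, key_cols):
    # Column name where the value is present, otherwise None
    # (mirrors F.when(...) without an otherwise clause).
    names = [c if row[c] is not None else None for c in key_cols]
    return (
        greatest_skip_nulls(*names),
        greatest_skip_nulls(*[row[c] for c in key_cols]),
        row["value"],
    )


key_cols = ["A", "B", "C"]
rows = [
    {"A": "key1", "B": None, "C": None, "value": "v1"},
    {"A": None, "B": "key2", "C": None, "value": "v2"},
    {"A": None, "B": None, "C": "key3", "value": "v3"},
    {"A": None, "B": "key4", "C": None, "value": "v4"},
]
result = [unpivot_row(r, key_cols) for r in rows]
for item in result:
    print(item)

# Caveat: two non-null keys in one row. The name comes from B, the value from A.
mixed = unpivot_row({"A": "zzz", "B": "aaa", "C": None, "value": "v5"}, key_cols)
print(mixed)
```

If rows with several non-null key columns are possible in your data, the union all approach is the safer choice, since it emits one output row per non-null key.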