How to manipulate a DataFrame to turn row data into columns

Question (votes: 0, answers: 1)

I started from the example reported here: "https://stackoverflow.com/questions/19664313/how-to-have-query-return-samples-of-row-values-as-columns"

I usually have no trouble finding a solution in other posts, but not in this case.

I have the following data:

| Name | Gene  |
| A    | gene1 |
| A    | gene2 |
| A    | gene3 |
| B    | gene1 |
| B    | gene2 |
| C    | gene1 |
| C    | gene3 |

How can I transform my DataFrame into the following?

|Name | Gene1 | Gene2 | Gene3 | 
|A    |gene1  | gene2 | gene3 |
|B    |gene1  | gene2 | ----- |
|C    |gene1  | ----- | gene3 |

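It might also be useful to get the raw data for each sample.

Thanks a lot in advance.

I tried to apply the code reported in "How to have query return samples of row values as columns?", but without success.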

apache-spark-sql
1 answer
0 votes

You can use the PIVOT clause.

SELECT
    name,
    (CASE WHEN gene1 IS NULL THEN '-----' ELSE gene1 END) AS gene1,
    (CASE WHEN gene2 IS NULL THEN '-----' ELSE gene2 END) AS gene2,
    (CASE WHEN gene3 IS NULL THEN '-----' ELSE gene3 END) AS gene3
FROM table_name
PIVOT (
    FIRST(gene) AS gene
    FOR (gene) IN ('gene1' AS gene1, 'gene2' AS gene2, 'gene3' AS gene3)
);
+----+-----+-----+-----+
|name|gene1|gene2|gene3|
+----+-----+-----+-----+
|A   |gene1|gene2|gene3|
|B   |gene1|gene2|-----|
|C   |gene1|-----|gene3|
+----+-----+-----+-----+
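
To make explicit what the pivot computes, here is a minimal plain-Scala sketch that needs no Spark at all. It groups the rows by name and emits one cell per expected gene, using the placeholder when a gene is absent. The names `PivotSketch`, `rows`, `geneColumns`, and `pivot` are made up for this illustration and are not part of any Spark API.

```scala
// Plain-Scala illustration of the pivot semantics (no Spark needed).
// All names here are hypothetical and exist only for this sketch.
object PivotSketch {
  // The (name, gene) rows from the question.
  val rows: Seq[(String, String)] = Seq(
    ("A", "gene1"), ("A", "gene2"), ("A", "gene3"),
    ("B", "gene1"), ("B", "gene2"),
    ("C", "gene1"), ("C", "gene3")
  )

  // Plays the role of the IN (...) list: one output column per value.
  val geneColumns: Seq[String] = Seq("gene1", "gene2", "gene3")

  // Group by name; for each expected gene, emit it if present, else "-----".
  def pivot(data: Seq[(String, String)]): Seq[(String, Seq[String])] =
    data.groupBy(_._1).toSeq.sortBy(_._1).map { case (name, group) =>
      val present = group.map(_._2).toSet
      name -> geneColumns.map(g => if (present(g)) g else "-----")
    }

  def main(args: Array[String]): Unit =
    pivot(rows).foreach { case (name, cells) =>
      println((name +: cells).mkString("|", "|", "|"))
    }
}
```

Running `PivotSketch.main` prints one pipe-delimited line per name, in the same row order as the table above.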

The same pivot with the DataFrame API in Scala:

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- gene: string (nullable = true)
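import org.apache.spark.sql.functions.{first, when}

// Group by name, spread the gene values into columns,
// and substitute "-----" when the aggregated value is null.
df
  .groupBy("name")
  .pivot("gene")
  .agg(
    when(first("gene").isNotNull, first("gene"))
      .otherwise("-----")
  )
  .show(false)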

+----+-----+-----+-----+
|name|gene1|gene2|gene3|
+----+-----+-----+-----+
|A   |gene1|gene2|gene3|
|B   |gene1|gene2|-----|
|C   |gene1|-----|gene3|
+----+-----+-----+-----+
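
As a possible variation, the placeholder can be applied after the pivot with `na.fill` instead of inside the aggregate. The sketch below assumes Spark is on the classpath and runs against a local `SparkSession`. The object name `PivotWithFill`, the helper `pivotWithFill`, and the app name are illustrative choices, not something from the original answer. Passing the pivot values explicitly also spares Spark the extra pass it otherwise makes to discover the distinct values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.first

// Hypothetical wrapper object for this sketch.
object PivotWithFill {
  // Pivot on an explicit list of gene values, then replace nulls
  // in string columns with the "-----" placeholder.
  def pivotWithFill(df: DataFrame): DataFrame =
    df.groupBy("name")
      .pivot("gene", Seq("gene1", "gene2", "gene3"))
      .agg(first("gene"))
      .na.fill("-----")
      .orderBy("name")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-with-fill")
      .getOrCreate()
    import spark.implicits._

    // The (name, gene) rows from the question.
    val df = Seq(
      ("A", "gene1"), ("A", "gene2"), ("A", "gene3"),
      ("B", "gene1"), ("B", "gene2"),
      ("C", "gene1"), ("C", "gene3")
    ).toDF("name", "gene")

    pivotWithFill(df).show(false)
    spark.stop()
  }
}
```

Because `na.fill` with a string argument only touches string columns, this approach would need adjusting if the pivoted cells held another type.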