Converting a table to a one-hot encoding of a single column's values

Problem description  Votes: 0  Answers: 4

I have a table with two columns:

+---------+--------+
| keyword | color  |
+---------+--------+
| foo     | red    |
| bar     | yellow |
| fobar   | red    |
| baz     | blue   |
| bazbaz  | green  |
+---------+--------+

I need to do some kind of one-hot encoding in PostgreSQL and transform the table into this:

+---------+-----+--------+-------+------+
| keyword | red | yellow | green | blue |
+---------+-----+--------+-------+------+
| foo     |   1 |      0 |     0 |    0 |
| bar     |   0 |      1 |     0 |    0 |
| fobar   |   1 |      0 |     0 |    0 |
| baz     |   0 |      0 |     0 |    1 |
| bazbaz  |   0 |      0 |     1 |    0 |
+---------+-----+--------+-------+------+

How can I do this transformation using only SQL?

sql postgresql pivot-table
4 Answers
30
votes

If I understand correctly, you need conditional aggregation:

select keyword,
       count(case when color = 'red' then 1 end) as red,
       count(case when color = 'yellow' then 1 end) as yellow,
       count(case when color = 'green' then 1 end) as green,
       count(case when color = 'blue' then 1 end) as blue
from t
group by keyword;
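
As a quick way to try the conditional-aggregation pattern without a PostgreSQL server, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The table name `t` and the sample rows come from the question; using SQLite instead of PostgreSQL is an assumption made only to keep the example self-contained.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL so the sketch runs anywhere;
# the COUNT(CASE ...) pattern itself is standard SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (keyword TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("foo", "red"), ("bar", "yellow"), ("fobar", "red"),
     ("baz", "blue"), ("bazbaz", "green")],
)

# CASE without ELSE yields NULL when the color does not match,
# and COUNT skips NULLs, so each column counts matching rows only.
query = """
SELECT keyword,
       COUNT(CASE WHEN color = 'red'    THEN 1 END) AS red,
       COUNT(CASE WHEN color = 'yellow' THEN 1 END) AS yellow,
       COUNT(CASE WHEN color = 'green'  THEN 1 END) AS green,
       COUNT(CASE WHEN color = 'blue'   THEN 1 END) AS blue
FROM t
GROUP BY keyword
ORDER BY keyword
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the columns are counts, a keyword that appears several times with the same color would produce a value greater than 1; wrapping each expression in MAX instead of COUNT (with ELSE 0) is one way to force strict 0/1 values if that matters.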

2
votes

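Another way to reach the goal for this test case is to use the tablefunc extension (enabled with CREATE EXTENSION tablefunc) together with COALESCE() to fill in all the NULL fields: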

postgres=# create table t(keyword varchar,color varchar);
CREATE TABLE
postgres=# insert into t values ('foo','red'),('bar','yellow'),('fobar','red'),('baz','blue'),('bazbaz','green');
INSERT 0 5
postgres=# SELECT keyword, COALESCE(red,0) red, 
 COALESCE(blue,0) blue, COALESCE(green,0) green, 
 COALESCE(yellow,0) yellow 
 FROM crosstab(                         
  $$select keyword, color, COALESCE('1',0) as onehot from t
    group by 1, 2 order by 1, 2$$,
  $$select distinct color from t order by 1$$)
 AS result(keyword varchar, blue int, green int, red int, yellow int);
 keyword | red | blue | green | yellow 
---------+-----+------+-------+--------
 bar     |   0 |    0 |     0 |      1
 baz     |   0 |    1 |     0 |      0
 bazbaz  |   0 |    0 |     1 |      0
 fobar   |   1 |    0 |     0 |      0
 foo     |   1 |    0 |     0 |      0
(5 rows)

postgres=# 

If you only need to see the result in psql:

postgres=# select keyword, color, COALESCE('1',0) as onehot from t
  --group by 1, 2 order by 1, 2
  \crosstabview keyword color
 keyword | red | yellow | blue | green 
---------+-----+--------+------+-------
 foo     |   1 |        |      |      
 bar     |     |      1 |      |      
 fobar   |   1 |        |      |      
 baz     |     |        |    1 |      
 bazbaz  |     |        |      |     1
(5 rows)

postgres=# 

1
votes

To use this code on a table with a large number of columns, generate the query with Python:

1) Create a list of the unique values you want as column names and bring it into Python, like this:

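colors = ['red', 'yellow', 'green', 'blue']  # the unique values that become columns

for item in colors:
    # print one conditional-count column per value; the value is quoted as a SQL string
    print("count(case when color = '" + str(item) + "' then 1 end) as is_" + str(item) + ",")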

2) Copy the output (minus the trailing comma on the last line)

3) Then:

select keyword,
<OUTPUT FROM PYTHON>
from t
group by keyword
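
The copy-and-paste steps above can also be folded into one helper that returns the finished query. This is only a sketch: the function name `build_one_hot_query` and its parameters are hypothetical, and the `is_` prefix follows the answer above. Joining the generated columns with a comma removes the need to delete the last comma by hand.

```python
def build_one_hot_query(values, table="t", key="keyword", column="color"):
    """Return a conditional-aggregation query with one column per value.

    Hypothetical helper: table, key and column are interpolated as-is,
    so they must come from trusted code, not from user input.
    """
    cases = []
    for value in values:
        # Double any single quote so the value is a valid SQL string literal.
        literal = str(value).replace("'", "''")
        # Double-quote the alias so values with spaces or capitals still work.
        alias = '"is_' + str(value).replace('"', '""') + '"'
        cases.append(
            f"count(case when {column} = '{literal}' then 1 end) as {alias}"
        )
    # Joining with ",\n" leaves no trailing comma to clean up.
    return (
        f"select {key},\n"
        + ",\n".join(cases)
        + f"\nfrom {table}\ngroup by {key}"
    )


print(build_one_hot_query(["red", "yellow", "green", "blue"]))
```

The printed text can be pasted into psql directly, or passed to whatever database driver you already use.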

0
votes

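You can do the same thing but generate the CASE expressions programmatically. Your example has a short list of colors, but if you have many distinct categories you can run something like the following (note that this is MySQL syntax, with GROUP_CONCAT and user variables, so it will not run on PostgreSQL as written):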

SET SESSION group_concat_max_len = 10000;
SET @cases = NULL;

SELECT 
GROUP_CONCAT(
    DISTINCT
    CONCAT(
        0xd0a,
        "COUNT(CASE WHEN color = '",
        color,
        "' THEN 1 END) AS ",
        color
    )
    SEPARATOR ','
) INTO @cases
FROM table_name t;


SET @sql = CONCAT(
"SELECT 
t.keyword,",
@cases,
"
FROM table_name t
GROUP BY keyword
"
);

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
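
The same idea (read the distinct categories, build one COUNT(CASE ...) column for each, then run the generated statement) can be sketched from client code when the database has no GROUP_CONCAT or user variables. The example below is an assumption-laden illustration: it uses Python with an in-memory SQLite database so it is self-contained, and the helpers `quote_literal` and `quote_ident` are hypothetical names, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (keyword TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("foo", "red"), ("bar", "yellow"), ("fobar", "red"),
     ("baz", "blue"), ("bazbaz", "green")],
)


def quote_literal(text):
    # Hypothetical helper: SQL string literal with single quotes doubled.
    return "'" + text.replace("'", "''") + "'"


def quote_ident(text):
    # Hypothetical helper: SQL identifier with double quotes doubled.
    return '"' + text.replace('"', '""') + '"'


# Step 1: discover the categories from the data itself.
colors = [r[0] for r in conn.execute("SELECT DISTINCT color FROM t ORDER BY color")]

# Step 2: build one conditional-count column per category.
cases = ",\n       ".join(
    f"COUNT(CASE WHEN color = {quote_literal(c)} THEN 1 END) AS {quote_ident(c)}"
    for c in colors
)
sql = f"SELECT keyword,\n       {cases}\nFROM t\nGROUP BY keyword\nORDER BY keyword"

# Step 3: run the generated statement and show header plus rows.
cursor = conn.execute(sql)
header = [d[0] for d in cursor.description]
result = cursor.fetchall()
print(header)
for row in result:
    print(row)
```

Building the column list in client code keeps the generated SQL portable, at the cost of two round trips to the database instead of one.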

 