SQL/BigQuery text classification

Question (votes: 0, answers: 4)

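I need to implement a simple text classification using regular expressions. I wanted to use a simple CASE WHEN statement, but I need every CASE branch to be evaluated rather than stopping at the first condition that matches.

For example: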

    with `table` as (
      SELECT 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
    )
    SELECT
      CASE
        WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI'
        WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering'
        WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning'
      END as topic,
      text
    FROM `table`

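With this query the text is classified as AI, because that is the first condition that matches. It should instead be classified as AI, Engineering, and Deep Learning, either in an array or as three separate rows, because all three conditions are met.

How can I apply all of the regular expressions/conditions to classify the text?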

sql text google-bigquery text-mining mining
4 answers
1 vote
I think the following is the most generic and reusable solution (BigQuery Standard SQL):

    #standardSQL
    with `table` as (
      select 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
    ), classification as (
      select 'ai' term, 'AI' topic union all
      select 'computational power', 'Engineering' union all
      select 'deep learning', 'Deep Learning'
    ), pattern as (
      select r'(?i)' || string_agg(term, '|') as regexp_pattern
      from classification
    )
    select
      array_to_string(array(
        select distinct topic
        from unnest(regexp_extract_all(lower(text), regexp_pattern)) term
        join classification using(term)
      ), ', ') topics,
      text
    from `table`, pattern

The original post showed the output here as an image, which is not preserved in this copy.
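The question also mentions returning the topics as separate rows. A minimal sketch of that variant, reusing the `classification` lookup from the answer above: it cross joins every text with every term and keeps the pairs where the term matches. The one-sentence sample text and the aliases `t` and `c` are assumptions made for brevity, not part of the original answer.

```sql
#standardSQL
-- Sketch: one output row per matching topic.
-- The short sample text is a stand-in for the question's paragraph (assumption).
WITH `table` AS (
  SELECT 'AI needs computational power for Deep Learning.' AS text
),
classification AS (
  SELECT 'ai' AS term, 'AI' AS topic UNION ALL
  SELECT 'computational power', 'Engineering' UNION ALL
  SELECT 'deep learning', 'Deep Learning'
)
SELECT
  c.topic,
  t.text
FROM `table` AS t
CROSS JOIN classification AS c
-- Each term is used as a case-insensitive regular expression.
WHERE REGEXP_CONTAINS(t.text, CONCAT('(?i)', c.term))
```

Because each term is interpreted as a regular expression, terms containing metacharacters such as `.` or `(` would need escaping. A bare `ai` also matches inside longer words such as "said", so a stricter pattern like `\bai\b` may be preferable on real data.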


1 vote
One approach is string concatenation:

    SELECT
      CONCAT(
        CASE WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI;' ELSE '' END,
        CASE WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering;' ELSE '' END,
        CASE WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning;' ELSE '' END
      ) as topics,
      text
    FROM `table`;

In effect, this builds a string. You can use similar logic to build an array.
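One possible way to build an array with the same conditions is sketched below. Each condition yields either its label or NULL, and an `ARRAY` subquery over `UNNEST` drops the NULLs. The shortened sample text and the `pos` alias are assumptions for illustration; this is not necessarily what the answer's author had in mind.

```sql
#standardSQL
-- Sketch: collect every matching label into an ARRAY<STRING>.
-- The short sample text is a stand-in for the question's paragraph (assumption).
WITH `table` AS (
  SELECT 'AI needs computational power for Deep Learning.' AS text
)
SELECT
  ARRAY(
    SELECT topic
    FROM UNNEST([
      IF(REGEXP_CONTAINS(text, r'(?i)ai'), 'AI', NULL),
      IF(REGEXP_CONTAINS(text, r'(?i)computational power'), 'Engineering', NULL),
      IF(REGEXP_CONTAINS(text, r'(?i)deep learning'), 'Deep Learning', NULL)
    ]) AS topic WITH OFFSET AS pos
    WHERE topic IS NOT NULL   -- keep only the conditions that matched
    ORDER BY pos              -- preserve the order in which labels are listed
  ) AS topics,
  text
FROM `table`
```

Filtering the NULLs inside the subquery matters because BigQuery does not allow NULL elements in an array returned as a query result. If no condition matches, `topics` is an empty array.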


1 vote
The following is for BigQuery Standard SQL:

    #standardSQL
    select
      array_to_string(array(
        select distinct lower(topic)
        from unnest(regexp_extract_all(text, r'(?i)ai|computational power|deep learning')) topic
      ), ', ') topics,
      text
    from `table`

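The original post showed the output for the sample data from your question here as an image, which is not preserved in this copy.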


0 votes
How would I write similar SQL for Redshift?
