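I need to implement a simple text classification using regular expressions. My first thought was a simple CASE WHEN statement, but I want every WHEN branch to be evaluated, not just the first one that matches.
For example: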
with `table` as(
SELECT 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
)
SELECT
CASE
WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI'
WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering'
WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning'
END as topic,
text
FROM `table`
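With this query the text is classified as AI, because that is the first condition that matches. It should instead be classified as AI, Engineering, and Deep Learning, either in an array or in 3 separate rows, because all 3 conditions are met. How can I apply all of the regexes/conditions to classify the text?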
#standardSQL
with `table` as(
select 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
), classification as (
  -- lookup table: lowercase search term -> topic label
  select 'ai' term, 'AI' topic union all
  select 'computational power', 'Engineering' union all
  select 'deep learning', 'Deep Learning'
), pattern as (
  -- one alternation over all terms, e.g. (?i)ai|computational power|deep learning
  select r'(?i)' || string_agg(term, '|') as regexp_pattern
  from classification
)
select
  array_to_string(array(
    -- extract every match from the lowercased text, then map each match back to its topic
    select distinct topic
    from unnest(regexp_extract_all(lower(text), regexp_pattern)) term
    join classification using(term)
  ), ', ') topics,
  text
from `table`, pattern
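For the sample text, the topics column contains all three labels: AI, Engineering, and Deep Learning (their order is not guaranteed, since the subquery has no ORDER BY).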
SELECT CONCAT(CASE WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI;' ELSE '' END,
CASE WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering;' ELSE '' END,
CASE WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning;' ELSE '' END
) as topics, text
FROM `table`;
In effect, this builds a string. You can use similar logic to build an array.
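As a rough sketch of that array variant, one option is to turn each condition into a value-or-NULL element and keep only the non-NULL ones. The shortened sample text and the `pos` alias are illustrative assumptions, not part of the original question.

```sql
#standardSQL
-- Sketch only: the one-row CTE stands in for the question's `table`.
WITH `table` AS (
  SELECT 'AI keeps advancing thanks to computational power and Deep Learning.' AS text
)
SELECT
  ARRAY(
    SELECT topic
    FROM UNNEST([
      -- each element is the topic label if its regex matches, otherwise NULL
      IF(REGEXP_CONTAINS(text, r'(?i)ai'), 'AI', NULL),
      IF(REGEXP_CONTAINS(text, r'(?i)computational power'), 'Engineering', NULL),
      IF(REGEXP_CONTAINS(text, r'(?i)deep learning'), 'Deep Learning', NULL)
    ]) AS topic WITH OFFSET AS pos
    WHERE topic IS NOT NULL  -- drop the conditions that did not match
    ORDER BY pos             -- keep the order in which the conditions are listed
  ) AS topics,
  text
FROM `table`;
```

If one row per topic is preferred over an array, the same array can be flattened by cross joining the row with UNNEST of that array expression.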