I have a ClickHouse table in which one of the fields contains a textual description (about 300 words).
For example, reviews:
Rev_id  Place_id  Stars  Category  Text
1       12        3      Food      Nice food but a bad dirty place.
2       31        4      Sport     Not bad, they have everything.
3       55        1      Bar       Poor place,bad audience.
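I would like to run some word-count analysis on it, such as overall word frequency (how many times each word appears) or the top-K words per category.
For the example above: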
word count
bad 3
place 2
...
Is there a way to do this entirely in ClickHouse, without involving a programming language?
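SELECT
    -- Replace '.' and ',' with spaces, split on single spaces,
    -- then expand the resulting array into one row per token.
    -- Consecutive or trailing spaces produce empty tokens, which are counted too.
    arrayJoin(splitByChar(' ', replaceRegexpAll(x, '[.,]', ' '))) AS w,
    count()
FROM
(
    -- Sample data: the three review texts from the question.
    SELECT 'Nice food but a bad dirty place.' AS x
    UNION ALL
    SELECT 'Not bad, they have everything.'
    UNION ALL
    SELECT 'Poor place,bad audience.'
)
GROUP BY w
ORDER BY count() DESC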
┌─w──────────┬─count()─┐
│            │       4 │
│ bad        │       3 │
│ place      │       2 │
│ have       │       1 │
│ Poor       │       1 │
│ food       │       1 │
│ Not        │       1 │
│ they       │       1 │
│ audience   │       1 │
│ Nice       │       1 │
│ but        │       1 │
│ dirty      │       1 │
│ a          │       1 │
│ everything │       1 │
└────────────┴─────────┘
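
The result above still contains an empty token (the first row), treats "Poor" and "poor" as different words, and does not cover the per-category part of the question. The following is a minimal sketch of one way to handle all three, not a definitive solution. It assumes the text is plain ASCII English, so that alphaTokens (which extracts runs of a-z and A-Z) is an acceptable tokenizer, and it inlines the three sample rows from the question in place of the real table. The column names Category and Text are taken from the question. K is set to 2 purely for illustration.

```sql
SELECT
    Category,
    w,
    count() AS c
FROM
(
    SELECT
        Category,
        -- Lowercase first so 'Poor' and 'poor' are counted together.
        -- alphaTokens returns only runs of ASCII letters, so punctuation
        -- and empty tokens never appear in the result.
        arrayJoin(alphaTokens(lower(Text))) AS w
    FROM
    (
        -- Stand-in for the real reviews table (sample rows from the question).
        SELECT 'Food' AS Category, 'Nice food but a bad dirty place.' AS Text
        UNION ALL
        SELECT 'Sport', 'Not bad, they have everything.'
        UNION ALL
        SELECT 'Bar', 'Poor place,bad audience.'
    )
)
GROUP BY Category, w
-- Sort by count within each category; ties are broken alphabetically.
ORDER BY Category ASC, c DESC, w ASC
-- Keep only the first 2 rows of each category (K = 2, chosen for illustration).
LIMIT 2 BY Category
```

Dropping Category from the SELECT list, the GROUP BY, and the ORDER BY, and replacing LIMIT 2 BY Category with a plain LIMIT, would give the overall word frequency instead. For text that is not ASCII English, a different tokenizer would be needed, since alphaTokens ignores every byte outside a-z and A-Z.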