ClickHouse error when reading Parquet with the s3Cluster function instead of s3

Question

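I am transferring data from BigQuery to ClickHouse via Parquet files in Google Cloud Storage. On the BQ side I export to Parquet with the EXPORT DATA statement, as shown below (variable values are hidden and the script is for demonstration only; the export itself works without problems):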

    EXPORT DATA OPTIONS (
        uri = '{gs_uri}*{full_parquet_filename}.{gs_format.lower()}',
        format = '{gs_format}',
        overwrite = true)
    AS (
        SELECT *
        FROM `{bq_project_id}.{bq_dataset}.{bq_table}_intraday_{ts_date_str_bq_format}`
        WHERE event_date = '{ts_date_str_bq_format}');

After the export, I can list the columns inside the Parquet file on the ClickHouse side with the following SQL:

    DESCRIBE TABLE s3(
        's3url',
        'access_key_id',
        'secret_access_key'
    )

One of the columns in the Parquet file has the following type:

Column name: user_properties

Column type:

Array(Tuple(key Nullable(String), value Tuple(string_value Nullable(String), int_value Nullable(Int64), float_value Nullable(Float64), double_value Nullable(Float64), set_timestamp_micros Nullable(Int64))))

With the following query I build a String column that can be parsed as JSON on the ClickHouse side, so that I can extract any value using only ClickHouse's visitParamExtractString function. The query is:

    SELECT
        arrayMap(x -> 'user_properties_' || tupleElement(x, 1), user_properties) AS us_pr_key,
        arrayMap(x -> tupleElement(tupleElement(x, 2), 1), user_properties) AS us_pr_value_string,
        arrayMap(x -> tupleElement(tupleElement(x, 2), 2), user_properties) AS us_pr_value_int,
        arrayMap(x -> tupleElement(tupleElement(x, 2), 3), user_properties) AS us_pr_value_float,
        arrayMap(x -> tupleElement(tupleElement(x, 2), 4), user_properties) AS us_pr_value_double,
        arrayMap((a, b, c, d) -> coalesce(toString(a), toString(b), toString(c), toString(d)),
                 us_pr_value_string,
                 us_pr_value_int,
                 us_pr_value_float,
                 us_pr_value_double) AS us_pr_filled_value,
        arrayMap((a, b) -> ('{' || '"' || toString(a) || '"' || ':' || '"' || toString(b) || '"' || '}'),
                 us_pr_key,
                 us_pr_filled_value) AS us_pr_key_value,
        '{' || arrayStringConcat(us_pr_key_value, ', ') || '}' AS us_pr_json
    FROM s3(
        's3url',
        'access_key_id',
        'secret_access_key'
    )

An example of a generated value:

{{"user_properties_user_pseudo_id":"123122131241234"},{"user_properties_custom_client_id":"23123124332432"}}

The problem is that when I adapt the query for our ClickHouse cluster, using the s3Cluster function instead of s3 to speed up inserts, I get this error:

SQL Error [8] [07000]: Code: 8. DB::Exception: Received from server_name.com:9000. DB::Exception: Column 'user_properties.key' is not presented in input data.: While executing ParquetBlockInputFormat: While executing S3. (THERE_IS_NO_COLUMN) (version 23.7.3.14 (official build))
, server ClickHouseNode [uri=http://server_name.com:8123/default, options={socket_timeout=30000000,use_server_time_zone=false,use_time_zone=false}]@-1742689150

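The same query works fine with the plain s3 function on millions of rows and has never caused a problem. Is there a solution that does not require editing the script?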

I use s3Cluster in the same way as in the query above, except that I specify the cluster name, as shown here:

..... FROM
    s3Cluster(
    'clickhouse_cluster_name',
    's3url',
    'access_key_id',
    'secret_access_key' 
    )

Answer 1

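Try disabling the flatten_nested setting. In versions before 23.8, Array(Tuple) could in some cases be flattened into a Nested type, which can cause problems. It is also worth trying a newer version, 23.8 or later.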
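
As a minimal sketch of that suggestion, flatten_nested can be turned off for a single query with a SETTINGS clause, or for the whole session with SET. The cluster name, URL and credentials are the placeholders from the question, and the SELECT list is shortened to one column for illustration. This assumes the setting is forwarded to the other cluster nodes along with the query; whether it removes the error on 23.7.3.14 has not been verified here.

    -- Per-query: disable flattening of Array(Tuple) into Nested for this read only
    SELECT
        arrayMap(x -> tupleElement(x, 1), user_properties) AS us_pr_key
    FROM s3Cluster(
        'clickhouse_cluster_name',
        's3url',
        'access_key_id',
        'secret_access_key'
    )
    SETTINGS flatten_nested = 0;

    -- Per-session alternative: applies to every following query in the session
    SET flatten_nested = 0;

If the error persists with the setting disabled, upgrading to 23.8 or later, as the answer suggests, is the other option to try.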
