Is it possible to have a dynamic "data_path_template" in the Cloud Storage BigQuery Data Transfer Service?

Question

I have a Google Cloud Storage bucket containing timestamped JSONL files that are uploaded using the following path pattern:

<year>/<month>/<day>/<hour>/<id>/<start_minute>:<start_second>-<end_minute>:<end_second>.jsonl

For example:

2023/05/19/21/A9887e4d2f6fc1acb01/15:54-16:04.jsonl

I would like to load these into BigQuery, into an hourly partitioned table.

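I am trying to set up a scheduled Cloud Storage BigQuery Data Transfer Service configuration for this (using Terraform, see below), but I am a bit confused about how to specify data_path_template so that it only globs the files from the previous hour.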

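Or does it perhaps not matter, given that the last modification time is used (as explained in the Cloud Storage transfer documentation; the relevant paragraph is pasted below)?

Transfers from Cloud Storage set the Write preference parameter to APPEND by default. In this mode, an unmodified file can only be loaded into BigQuery once. If the file's last modification time property is updated, then the file will be reloaded.

I am a bit worried about the extra scanning as the bucket grows. Do I understand correctly that the BigQuery Data Transfer Service is free of charge for Cloud Storage anyway? Or will I still pay ever-increasing operation fees for these scans?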

resource "google_bigquery_data_transfer_config" "query_config" {
  display_name           = "sensor-data-ingestion"
  location               = var.region
  data_source_id         = "google_cloud_storage"
  schedule               = "every 1 hours from 00:10 to 23:10"
  destination_dataset_id = google_bigquery_dataset.main.dataset_id
  params = {
    data_path_template              = "gs://${google_storage_bucket.sensor-data.name}/*.jsonl"
    destination_table_name_template = "sensor_data_partitioned"
    file_format                     = "JSON"
    write_disposition               = "WRITE_APPEND"
  }
}

Answer 1

I spent far too much time on this, but it turns out the answer is yes!

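Apparently there is a templating system for both the bucket URI and the table partition, described in the BigQuery Data Transfer Service documentation on runtime parameters. For Cloud Storage (it is also available for S3 and Blob Storage) it looks like this: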

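Data source	Parameterized URI or data path	Parameterized destination table name	Evaluated URI or data path	Evaluated destination table name
Cloud Storage	gs://bucket/events-{run_time|"%Y%m%d"}/*.csv	mytable${run_time|"%Y%m%d"}	gs://bucket/events-20180215/*.csv	mytable$20180215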

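Even better, it also supports time offsets relative to the transfer run time! The documentation describes both the parameter format and the date and time format elements. Some examples: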

Run time (UTC)	Templated parameter	Output destination table name
2018-02-15 00:00:00	mytable	mytable
2018-02-15 00:00:00	mytable_{run_time|"%Y%m%d"}	mytable_20180215
2018-02-15 00:00:00	mytable_{run_time+25h|"%Y%m%d"}	mytable_20180216
2018-02-15 00:00:00	mytable_{run_time-1h|"%Y%m%d"}	mytable_20180214
2018-02-15 00:00:00	mytable_{run_time+1.5h|"%Y%m%d%H"} or mytable_{run_time+90m|"%Y%m%d%H"}	mytable_2018021501
2018-02-15 00:00:00	{run_time+97s|"%Y%m%d"}_mytable_{run_time+97s|"%H%M%S"}	20180215_mytable_000137
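
To make the offset arithmetic in these examples concrete, here is a small Python sketch that mimics how such a template might be evaluated. It is only an illustration and not the transfer service's implementation: the render function and its regular expression are hypothetical, and it handles only run_time with an optional offset in hours, minutes or seconds.

```python
import re
from datetime import datetime, timedelta

# Hypothetical helper: NOT the transfer service's code, only an emulation
# of the {run_time[+/-offset]|"format"} syntax shown in the examples.
_PARAM = re.compile(r'\{run_time(?:([+-])(\d+(?:\.\d+)?)([hms]))?\|"([^"]*)"\}')
_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def render(template: str, run_time: datetime) -> str:
    """Replace every {run_time...|"fmt"} parameter with a formatted timestamp."""

    def substitute(match):
        sign, amount, unit, fmt = match.groups()
        moment = run_time
        if sign:  # an offset such as -1h, +90m or +97s was given
            delta = timedelta(**{_UNITS[unit]: float(amount)})
            moment = run_time + delta if sign == "+" else run_time - delta
        return moment.strftime(fmt)

    return _PARAM.sub(substitute, template)


run = datetime(2018, 2, 15, 0, 0, 0)
print(render('mytable_{run_time-1h|"%Y%m%d"}', run))      # mytable_20180214
print(render('mytable_{run_time+1.5h|"%Y%m%d%H"}', run))  # mytable_2018021501
```

Subtracting one hour from a run at midnight lands on the previous day, which is why an hourly transfer that should pick up the previous hour's files uses run_time-1h with the "%Y%m%d%H" format.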

So a working Terraform example would be (note that both " and $ need to be escaped):

resource "google_bigquery_data_transfer_config" "sensor-data-ingestion" {
  depends_on = [
    google_storage_bucket.sensor-data,
    google_project_iam_member.sensor-data-ingestion-token-creator,
    google_project_iam_member.sensor-data-ingestion-data-editor,
    google_bigquery_table.sensor-data-main-partitioned,
    google_service_account.sensor-data-ingestion-service-account,
  ]
  display_name           = "sensor-data-transfer"
  location               = var.region
  data_source_id         = "google_cloud_storage"
  schedule               = "every 1 hours from 00:10 to 23:10"
  destination_dataset_id = google_bigquery_dataset.main.dataset_id
  service_account_name   = google_service_account.sensor-data-ingestion-service-account.email
  params = {
    # {run_time-1h|...} is expanded by the transfer service at run time, not by Terraform;
    # \" escapes the quotes inside the Terraform string.
    data_path_template              = "gs://${google_storage_bucket.sensor-data.name}/{run_time-1h|\"%Y%m%d%H\"}*.jsonl"
    # $${ is Terraform's escape for a literal ${: the "$" partition decorator followed by the {run_time...} parameter.
    destination_table_name_template = "${google_bigquery_table.sensor-data-main-partitioned.table_id}$${run_time-1h|\"%Y%m%d%H\"}"
    file_format                     = "JSON"
    write_disposition               = "APPEND"
  }
}

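I wasted a few days trying to make the wildcard match paths that use "/", but had to remove the slashes and use "_" as the separator, so I ended up with file names like

2023060615_A9887e4d2f6fc1acb01_50_51-50_51.jsonl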
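
As a rough illustration of that renaming, the Python sketch below flattens an object path of the original form into a single file name whose prefix matches the "%Y%m%d%H" output of the template. The flatten_name function is hypothetical, and the target layout is inferred from the example file name above rather than confirmed by the author.

```python
def flatten_name(path: str) -> str:
    """Turn <year>/<month>/<day>/<hour>/<id>/<span>.jsonl into one flat name."""
    year, month, day, hour, sensor_id, span = path.split("/")
    # Prefix matches what {run_time-1h|"%Y%m%d%H"} evaluates to for that hour.
    prefix = f"{year}{month}{day}{hour}"
    # Replace ":" in the minute:second span with "_".
    return f"{prefix}_{sensor_id}_{span.replace(':', '_')}"


print(flatten_name("2023/05/19/21/A9887e4d2f6fc1acb01/15:54-16:04.jsonl"))
# 2023051921_A9887e4d2f6fc1acb01_15_54-16_04.jsonl
```

With names like this, a data_path_template ending in {run_time-1h|"%Y%m%d%H"}*.jsonl only needs the wildcard to match the rest of a single file name.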
