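I need to use an AWS Glue crawler to create and populate a Glue Catalog table by crawling an AWS RDS MariaDB database. Once the table exists, I want to attach an AWS Glue Data Quality ruleset to it. I use Terraform to build the infrastructure. How can I set this up so that the crawler and the data quality ruleset are both created and ready to use in a single apply?
This is what I have so far (in a module):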
variable "glue_connection_name" {
  type        = string
  description = "Name of the Glue Connection to use."
}

variable "crawler_iam_role_arn" {
  type        = string
  description = "ARN of the IAM role to use for the table Crawlers."
}

variable "database_name" {
  type        = string
  description = "Name of the Glue Catalog Database to use."
}

variable "table_name" {
  type        = string
  description = "The name of the table to check as in the RDS database."
}

variable "ruleset" {
  type        = list(string)
  description = "The rules to check. Quotes must be escaped: `\\\"`"
}

variable "qa_crawlers_prefix" {
  default     = "qa_"
  type        = string
  description = "The prefix QA Glue Crawlers shall set before the automatically generated Glue Catalog table name."
}

resource "aws_glue_crawler" "qa_table_crawler" {
  database_name = var.database_name
  name          = "qa_${var.database_name}-${var.table_name}"
  role          = var.crawler_iam_role_arn
  description   = "Loads the ${var.database_name}.${var.table_name} table into the Glue Catalog."
  table_prefix  = var.qa_crawlers_prefix

  jdbc_target {
    connection_name = var.glue_connection_name
    path            = "${var.database_name}/${var.table_name}"
  }

  tags = {
    Name     = "glue"
    Function = "data processing"
  }
}

resource "aws_glue_data_quality_ruleset" "table_rules" {
  name        = var.table_name
  description = "Checks for the ${var.table_name} table in the ${var.database_name} database"
  ruleset     = "Rules = [${join(",", var.ruleset)}]"

  target_table {
    database_name = var.database_name
    table_name    = "${var.qa_crawlers_prefix}${var.database_name}_${var.table_name}" # This is the name of the table as it will be created by the Glue Crawler
  }

  tags = {
    Name     = "glue-qa-${var.table_name}"
    Function = "data processing"
  }

  depends_on = [aws_glue_crawler.qa_table_crawler]
}
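I have tried both the aws_glue_catalog_table resource and the data source of the same name. With the resource, the crawler creates a second table with the same name plus a random suffix, which makes the data quality ruleset configuration useless. With the data source, terraform plan throws an error because the table does not exist yet.
Setting the name as shown above lets terraform plan succeed, but terraform apply fails because the table cannot be found.
I also tried leaving the target_table argument out of the aws_glue_data_quality_ruleset resource, but then I cannot tell where the ruleset ends up in the console, which makes it rather useless.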
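This is what I ended up doing, and it seems to work well, at least when setting up the crawler and the ruleset for the first time.
I use a null_resource to first create a placeholder table for the data quality ruleset, and then delete that table again once the ruleset has been created for it.
After applying the Terraform plan, I go into the AWS console and run the newly created crawler.
It recreates the table that the data quality ruleset was created for.
Once the crawler finishes, the ruleset shows up on the newly created and populated table.
Here is the code I use: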
# Referenced by the crawler's jdbc_target below; the declaration was missing from the snippet.
variable "glue_connection_name" {
  description = "Name of the Glue Connection to use."
  type        = string
}

variable "glue_data_quality_iam_role_arn" {
  description = "ARN of the IAM role to use for Glue Data Quality runs."
  type        = string
}

variable "database_name" {
  description = "Name of the Glue Catalog Database to use."
  type        = string
}

variable "table_name" {
  description = "The name of the table to check as in the RDS database."
  type        = string
}

variable "ruleset" {
  description = "The rules to check. Quotes must be escaped: `\\\"`"
  type        = list(string)
}

variable "qa_crawlers_prefix" {
  description = "The prefix QA Glue Crawlers shall set before the automatically generated Glue Catalog table name."
  type        = string
  default     = "qa_"
}

locals {
  # The table name the crawler is expected to generate: <prefix><database>_<table>
  crawler_table_name = "${aws_glue_crawler.jdbc_qa.table_prefix}${var.database_name}_${var.table_name}"
}

resource "aws_glue_crawler" "jdbc_qa" {
  database_name = var.database_name
  name          = "qa_${var.database_name}-${var.table_name}"
  role          = var.glue_data_quality_iam_role_arn
  table_prefix  = var.qa_crawlers_prefix

  jdbc_target {
    connection_name = var.glue_connection_name
    path            = "${var.database_name}/${var.table_name}"
  }

  tags = {
    Name     = "${var.database_name}.${var.table_name} Crawler"
    Function = "data processing"
  }
}

# Step 1: create an empty placeholder table so the ruleset has a target to attach to.
resource "null_resource" "create_table" {
  triggers = {
    crawler_prefix = aws_glue_crawler.jdbc_qa.table_prefix
    crawler_target = aws_glue_crawler.jdbc_qa.jdbc_target[0].path
  }

  provisioner "local-exec" {
    command = "aws glue create-table --database-name ${var.database_name} --table-input '{\"Name\": \"${local.crawler_table_name}\"}'"
  }
}

# Step 2: create the ruleset against the placeholder table.
resource "aws_glue_data_quality_ruleset" "qa_table" {
  name    = "qa_${var.table_name}"
  ruleset = "Rules = [\n\t${join(",\n\t", var.ruleset)}\n]" # newlines and tabs for readability in the AWS Console

  target_table {
    database_name = var.database_name
    table_name    = local.crawler_table_name
  }

  depends_on = [null_resource.create_table]
}

# Step 3: delete the placeholder so the crawler can create the real table under the same name.
resource "null_resource" "delete_table" {
  triggers = {
    crawler_prefix = aws_glue_crawler.jdbc_qa.table_prefix
    crawler_target = aws_glue_crawler.jdbc_qa.jdbc_target[0].path
  }

  provisioner "local-exec" {
    command = "aws glue delete-table --database-name ${var.database_name} --name ${local.crawler_table_name}"
  }

  depends_on = [aws_glue_data_quality_ruleset.qa_table]
}
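The manual console step described above could probably be automated as well. The following is an untested sketch, not part of my actual setup: it assumes the AWS CLI is available wherever Terraform runs and that its credentials are allowed to call glue:StartCrawler. The resource name start_crawler is made up for illustration, and the block is meant to be added to the same module, since it refers to the crawler and the delete_table resource defined there.

```hcl
# Hypothetical addition: kick off the crawler once the placeholder table is gone,
# instead of starting it by hand in the AWS console.
resource "null_resource" "start_crawler" {
  triggers = {
    # Re-run only when the crawler itself is replaced or renamed.
    crawler_name = aws_glue_crawler.jdbc_qa.name
  }

  provisioner "local-exec" {
    # Assumption: the AWS CLI is installed and authenticated on this machine.
    command = "aws glue start-crawler --name ${aws_glue_crawler.jdbc_qa.name}"
  }

  # The placeholder must be deleted first so the crawler can recreate the table.
  depends_on = [null_resource.delete_table]
}
```

Keep in mind that start-crawler only requests the run and returns immediately, so the table will most likely not exist yet when terraform apply finishes.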
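Note that the table name has to be exactly the name the Glue crawler will create, which AWS does not document. I simply tried it out to see what happens and use the resulting, consistent naming pattern.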
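For illustration, calling the module might look roughly like the sketch below. The module path, connection name, role ARN, database name, table name and the three rules are all placeholders I made up, not values from my project. The rules are written in DQDL, with the inner quotes escaped as the ruleset variable requires.

```hcl
module "orders_quality" {
  source = "./modules/glue_table_quality" # hypothetical module path

  glue_connection_name           = "mariadb-connection"                               # placeholder
  glue_data_quality_iam_role_arn = "arn:aws:iam::123456789012:role/glue-data-quality" # placeholder
  database_name                  = "shop"                                             # placeholder
  table_name                     = "orders"                                           # placeholder

  # Each list entry is one DQDL rule; quotes inside a rule are escaped for HCL.
  ruleset = [
    "IsComplete \"id\"",
    "IsUnique \"id\"",
    "RowCount > 0",
  ]
}
```

With the default prefix, the locals block would resolve the target table name to qa_shop_orders for these inputs, which is the name the crawler then has to produce for the ruleset to end up on the right table.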