How do I set up a data quality ruleset that runs on a catalog table created by a crawler, in a single Terraform configuration?

Question

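I have to use an AWS Glue crawler to create and populate a Glue Catalog table by crawling an AWS RDS MariaDB database. Once the table exists, I want to add an AWS Glue Data Quality ruleset to it. I use Terraform to build the infrastructure. How can I set this up so that the crawler and the data quality ruleset are both created and ready to use in a single apply?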

This is what I have so far (in a module):

variable "glue_connection_name" {
  type = string
  description = "Name of the Glue Connection to use."
}

variable "crawler_iam_role_arn" {
  type = string
  description = "ARN of the IAM role to use for the table Crawlers."
}

variable "database_name" {
  type = string
  description = "Name of the Glue Catalog Database to use."
}

variable "table_name" {
  type    = string
  description = "The name of the table to check as in the RDS database."
}

variable "ruleset" {
  type = list(string)
  description = "The rules to check. Quotes must be escaped: `\\\"`"
}

variable "qa_crawlers_prefix" {
  default     = "qa_"
  type        = string
  description = "The prefix QA Glue Crawlers shall set before the automatically generated Glue Catalog table name."
}

resource "aws_glue_crawler" "qa_table_crawler" {
  database_name = var.database_name
  name          = "qa_${var.database_name}-${var.table_name}"
  role          = var.crawler_iam_role_arn
  description   = "Loads the ${var.database_name}.${var.table_name} table into the Glue Catalog."
  table_prefix  = var.qa_crawlers_prefix

  jdbc_target {
    connection_name = var.glue_connection_name
    path            = "${var.database_name}/${var.table_name}"
  }

  tags = {
    Name     = "glue"
    Function = "data processing"
  }
}

resource "aws_glue_data_quality_ruleset" "table_rules" {
  name        = var.table_name
  description = "Checks for the ${var.table_name} table in the ${var.database_name} database"

  ruleset = "Rules = [${join(",", var.ruleset)}]"

  target_table {
    database_name = var.database_name
    table_name    = "${var.qa_crawlers_prefix}${var.database_name}_${var.table_name}"  # This is the name of the table as it will be created by the Glue Crawler
  }

  tags = {
    Name     = "glue-qa-${var.table_name}"
    Function = "data processing"
  }
  depends_on = [aws_glue_crawler.qa_table_crawler]
}

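I have tried both the resource and the data source aws_glue_catalog_table. With the resource, the crawler creates a second table with the same name plus a random suffix, which makes the data quality ruleset configuration useless. With the data source, terraform plan throws an error because the table does not exist yet.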

Setting the name as shown above makes terraform plan work, but terraform apply fails because the table cannot be found.

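I also tried leaving the target_table argument out of the aws_glue_data_quality_ruleset resource, but then I cannot tell where the ruleset ends up in the console, which makes it rather useless.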

Answer 1

This is what I ended up doing. It seems to work well, at least when setting up the crawler and the ruleset for the first time.

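I use null_resource to first create a placeholder table for the data quality ruleset, and then delete that table once the ruleset has been created for it. After applying the Terraform plan, I go to the AWS Console and run the newly created crawler. It re-creates the table that I created the data quality ruleset for. Once the crawler has finished, the ruleset shows up on the newly created and populated table.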

This is the code I used:

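# Referenced by the crawler's jdbc_target below; declared here so the module is complete.
variable "glue_connection_name" {
  type        = string
  description = "Name of the Glue Connection to use."
}

variable "glue_data_quality_iam_role_arn" {
  type        = string
  description = "ARN of the IAM role to use for Glue Data Quality runs."
}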

variable "database_name" {
  description = "Name of the Glue Catalog Database to use."
  type        = string
}

variable "table_name" {
  description = "The name of the table to check as in the RDS database."
  type        = string
}

variable "ruleset" {
  description = "The rules to check. Quotes must be escaped: `\\\"`"
  type        = list(string)
}

variable "qa_crawlers_prefix" {
  description = "The prefix QA Glue Crawlers shall set before the automatically generated Glue Catalog table name."
  type        = string
  default     = "qa_"
}

locals {
  crawler_table_name = "${aws_glue_crawler.jdbc_qa.table_prefix}${var.database_name}_${var.table_name}"
}

resource "aws_glue_crawler" "jdbc_qa" {
  database_name = var.database_name
  name          = "qa_${var.database_name}-${var.table_name}"
  role          = var.glue_data_quality_iam_role_arn
  table_prefix  = var.qa_crawlers_prefix

  jdbc_target {
    connection_name = var.glue_connection_name
    path            = "${var.database_name}/${var.table_name}"
  }

  tags = {
    Name     = "${var.database_name}.${var.table_name} Crawler"
    Function = "data processing"
  }
}

resource "null_resource" "create_table" {
  triggers = {
    crawler_prefix = aws_glue_crawler.jdbc_qa.table_prefix
    crawler_target = aws_glue_crawler.jdbc_qa.jdbc_target[0].path
  }
  provisioner "local-exec" {
    command = "aws glue create-table --database-name ${var.database_name} --table-input '{\"Name\": \"${local.crawler_table_name}\"}'"
  }
}

resource "aws_glue_data_quality_ruleset" "qa_table" {
  name        = "qa_${var.table_name}"

  ruleset = "Rules = [\n\t${join(",\n\t", var.ruleset)}\n]" # newlines and tabs for readability in the AWS Console

  target_table {
    database_name = var.database_name
    table_name    = local.crawler_table_name
  }
  depends_on = [null_resource.create_table]
}

resource "null_resource" "delete_table" {
  triggers = {
    crawler_prefix = aws_glue_crawler.jdbc_qa.table_prefix
    crawler_target = aws_glue_crawler.jdbc_qa.jdbc_target[0].path
  }
  provisioner "local-exec" {
    command = "aws glue delete-table --database-name ${var.database_name} --name ${local.crawler_table_name}"
  }
  depends_on = [aws_glue_data_quality_ruleset.qa_table]
}
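
The manual crawler run described above could in principle be scripted in the same style as the create and delete steps. The following is a minimal, untested sketch, assuming the AWS CLI is available wherever Terraform runs. The resource name start_crawler is hypothetical and not part of the original setup.

```hcl
# Hypothetical addition: start the crawler once the placeholder table is gone.
resource "null_resource" "start_crawler" {
  # Same triggers as the create/delete steps, so all three re-run together.
  triggers = {
    crawler_prefix = aws_glue_crawler.jdbc_qa.table_prefix
    crawler_target = aws_glue_crawler.jdbc_qa.jdbc_target[0].path
  }
  provisioner "local-exec" {
    # start-crawler only requests the run; the crawl itself happens asynchronously.
    command = "aws glue start-crawler --name ${aws_glue_crawler.jdbc_qa.name}"
  }
  depends_on = [null_resource.delete_table]
}
```

Because the crawl runs asynchronously, the table would most likely still be missing for a short while after terraform apply returns.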

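Note that the table name must be exactly the name the Glue crawler will create, and AWS does not document this. I simply tried out what happens and used a consistent naming pattern.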
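
To make the naming pattern concrete, here is a sketch of how the module might be called. The module name, source path, connection name, role ARN and rules are all hypothetical placeholders, and it assumes the module declares glue_connection_name, which the crawler block references.

```hcl
# Hypothetical module call; every value below is a placeholder.
module "qa_orders" {
  source = "./modules/glue_table_qa"

  glue_connection_name           = "mariadb-connection"
  glue_data_quality_iam_role_arn = "arn:aws:iam::123456789012:role/glue-qa"
  database_name                  = "shop"
  table_name                     = "orders"

  # DQDL rules; quotes inside each rule are escaped for HCL.
  ruleset = [
    "IsComplete \"id\"",
    "RowCount > 0",
  ]
}
```

With these inputs and the default prefix, local.crawler_table_name evaluates to qa_shop_orders, which follows the prefix, database name, underscore, table name pattern observed above.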
