How can I implement cross-account schema updates in a Glue ETL job?

Problem description

I have a simple Glue ETL job:

  • Source = a relational database table (via a JDBC Glue connection)
  • Target = an S3 bucket
  • Update option = "Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions"

This ETL job works when everything is in the same account. However, I want to modify it so that the S3 bucket and the target Data Catalog are in a separate account. How can I do this?

I tried attaching resource policies to the S3 bucket and the Glue Data Catalog. Files are created successfully in the target S3 bucket, but I don't see a new table created in the Glue database. The run status is "Succeeded".

Relevant logs

24/01/17 16:55:17 INFO HadoopDataSink: Failed to create table customer in database my_database after job run with catalogId 
com.amazonaws.services.glue.model.EntityNotFoundException: Database my_database not found. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException;...

ETL script

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source = Relational Database Table
source_db_table = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": "public.customer",
        "connectionName": "my_db_connection",
    },
    transformation_ctx="source_db_table",
)

# Target = S3 Bucket
target_s3_bucket = glueContext.getSink(
    path="s3://my-bucket/data/customer/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="target_s3_bucket",
)
target_s3_bucket.setCatalogInfo(
    catalogDatabase="my_database", catalogTableName="customer"
)
target_s3_bucket.setFormat("glueparquet", compression="snappy")
target_s3_bucket.writeFrame(source_db_table)
job.commit()

Glue Data Catalog resource policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/my-role"
      },
      "Action": [
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:UpdateTable"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:222222222222:catalog",
        "arn:aws:glue:us-east-1:222222222222:database/my_database",
        "arn:aws:glue:us-east-1:222222222222:table/my_database/*"
      ]
    }
  ]
}
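For cross-account catalog access, the resource policy above covers the target account's side only; the job role in the source account (111111111111) also needs its own identity policy allowing the same Glue actions on the target account's resources. A minimal sketch, assuming the actions and ARNs from the question's resource policy (this statement is illustrative, not taken from the original setup):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:UpdateTable"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:222222222222:catalog",
        "arn:aws:glue:us-east-1:222222222222:database/my_database",
        "arn:aws:glue:us-east-1:222222222222:table/my_database/*"
      ]
    }
  ]
}
```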
aws-glue
1 Answer

You must set catalogId for cross-account access:

target_s3_bucket.setCatalogInfo(
    catalogDatabase="my_database", catalogTableName="customer", catalogId="222222222222"
)

According to AWS:

Some AWS Glue PySpark and Scala APIs have a catalog ID field. If all the permissions required to enable cross-account access have been granted, an ETL job can make PySpark and Scala calls to API operations across accounts by passing the target AWS account ID in the catalog ID field, to access Data Catalog resources in the target account.

If no catalog ID value is provided, AWS Glue uses the caller's own account ID by default, and the call is not cross-account.
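That default explains the EntityNotFoundException in the logs: without catalogId, Glue looked for my_database in the job's own account. Before rerunning the job, you can confirm the role actually sees the target database when the catalog ID is explicit. A minimal sketch (the helper and the commented boto3 calls are illustrative, not part of the original job):

```python
# Target catalog account, taken from the ARNs in the question's policy.
TARGET_ACCOUNT_ID = "222222222222"

def catalog_params(database, table=None, catalog_id=TARGET_ACCOUNT_ID):
    # Build the kwargs Glue API calls need for cross-account access.
    # Omitting CatalogId makes Glue default to the caller's own account,
    # which is exactly the failure mode shown in the logs above.
    if table is None:
        return {"CatalogId": catalog_id, "Name": database}
    return {"CatalogId": catalog_id, "DatabaseName": database, "Name": table}

# Sanity check (needs AWS credentials, so left commented out here):
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# glue.get_database(**catalog_params("my_database"))
# glue.get_table(**catalog_params("my_database", "customer"))
```

If get_database fails from the job role's credentials, the problem is still permissions rather than the ETL script itself.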
