I have a simple Glue ETL job:
The job works when everything is in the same account. However, I want to modify it so that the S3 bucket and the target Data Catalog live in a separate account. How can I do that?
I tried attaching resource policies to both the S3 bucket and the Glue Data Catalog. Files are created successfully in the target S3 bucket, but I don't see a new table created in the Glue database. The run status is "Succeeded".
Relevant logs
24/01/17 16:55:17 INFO HadoopDataSink: Failed to create table customer in database my_database after job run with catalogId
com.amazonaws.services.glue.model.EntityNotFoundException: Database my_database not found. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException;...
ETL script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Source = Relational Database Table
source_db_table = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": "public.customer",
        "connectionName": "my_db_connection",
    },
    transformation_ctx="source_db_table",
)

# Target = S3 Bucket
target_s3_bucket = glueContext.getSink(
    path="s3://my-bucket/data/customer/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="target_s3_bucket",
)
target_s3_bucket.setCatalogInfo(
    catalogDatabase="my_database", catalogTableName="customer"
)
target_s3_bucket.setFormat("glueparquet", compression="snappy")
target_s3_bucket.writeFrame(source_db_table)
job.commit()
Glue Data Catalog resource policy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111111111111:role/my-role"
            },
            "Action": [
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:222222222222:catalog",
                "arn:aws:glue:us-east-1:222222222222:database/my_database",
                "arn:aws:glue:us-east-1:222222222222:table/my_database/*"
            ]
        }
    ]
}
You must set catalogId for the call to be cross-account:
target_s3_bucket.setCatalogInfo(
    catalogDatabase="my_database", catalogTableName="customer", catalogId="222222222222"
)
Some AWS Glue PySpark and Scala APIs have a catalog ID field. If all the permissions required for cross-account access have been granted, an ETL job can pass the target AWS account ID in that field when calling these APIs, and the calls will operate on Data Catalog resources in the target account.
If no catalog ID value is provided, AWS Glue defaults to the caller's own account ID, and the call is not cross-account.
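Note that the resource policy on the target catalog is only half of the cross-account handshake: the job role in the source account also needs an identity-based IAM policy that allows the same Glue actions on the target account's resources. A minimal sketch, reusing the account IDs, database, and actions from the question (glue:GetDatabase is added here on the assumption that creating a table first looks up its parent database):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:GetDatabase",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:222222222222:catalog",
                "arn:aws:glue:us-east-1:222222222222:database/my_database",
                "arn:aws:glue:us-east-1:222222222222:table/my_database/*"
            ]
        }
    ]
}
```

This policy would be attached to arn:aws:iam::111111111111:role/my-role in the source account; without it, the resource policy in account 222222222222 grants access that the role is never allowed to exercise.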