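I have a Glue Spark job written in Scala. I need to read from an RDS database (PostgreSQL) as a data source. I created the connection in the AWS UI and tested it. It works, so I can confirm that the Glue connection to RDS is set up correctly (role, security groups).
When I add this source to the Glue Spark job, I get this error in the console: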
INFO 2024-04-15T07:26:25,251 245857 com.amazonaws.services.glue.connectors.NativeConnectorService$ [main] Glue connectors: Copy connector /connectors/redshift/new/redshift-jdbc42-2.1.0.16.jar to /opt/aws_glue_connectors/selected/redshift/redshift-jdbc42-2.1.0.16.jar
INFO 2024-04-15T07:26:25,251 245857 com.amazonaws.services.glue.connectors.NativeConnectorService$ [main] Glue connectors: Copy is finished
Glue ETL Marketplace - Start ETL connector activation process...
Glue ETL Marketplace - downloading jars for following connections: List(my_glue_connection) using command: List(python3, -u, -m, docker.unpack_docker_image, --connections, my_glue_connection, --result_path, jar_paths, --region, eu-west-1, --endpoint, https://glue.eu-west-1.amazonaws.com, --proxy, xx.xx.xx.xx:8888)
2024-04-15 07:26:31,431 - __main__ - INFO - Glue ETL Marketplace - Start downloading connector jars for connection: my_glue_connection
2024-04-15 07:26:32,492 - __main__ - INFO - Glue ETL Marketplace - using region: eu-west-1, proxy: xx.xx.xx.xx:8888 and glue endpoint: https://glue.eu-west-1.amazonaws.com to get connection: my_glue_connection
2024-04-15 07:26:32,651 - __main__ - WARNING - Glue ETL Marketplace - Connection my_glue_connection is not a CUSTOM or Marketplace connection, skip jar downloading for it
2024-04-15 07:26:32,651 - __main__ - INFO - Glue ETL Marketplace - successfully wrote jar paths to "jar_paths"
Glue ETL Marketplace - Retrieved no ETL connector jars, this may be due to no marketplace/custom connection attached to the job or failure of downloading them, please scroll back to the previous logs to find out the root cause. Container setup continues.
Glue ETL Marketplace - ETL connector activation process finished, container setup continues...
...
SdkClientException occurred : com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to aws-glue-assets-xxxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com:443 [aws-glue-assets-XXXXX-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx0, aws-glue-assets-xxxxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx] failed: connect timed out
3 Retry(s) left
My Spark job tries to connect like this:
val jdbcUrl = s"jdbc:postgresql://$jdbcHostname:$jdbcPort/$jdbcDatabase"
val connectionProperties = new java.util.Properties()
connectionProperties.put("Driver", "org.postgresql.Driver")
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)
val dataFrame = spark.read.jdbc(jdbcUrl, "table-name", connectionProperties)
dataFrame.show()
One odd message in the logs is:
copy connector /connectors/redshift/new/redshift-jdbc42-2.1.0.16.jar to /opt/aws_glue_connectors/selected/redshift/redshift-jdbc42-2.1.0.16.jar
But I never set up any Redshift-related connection or Glue Spark job. My Glue connection (written in Terraform) is a JDBC connection:
resource "aws_glue_connection" "my_glue_connection" {
  name            = "my_glue_connection"
  connection_type = "JDBC"
  connection_properties = {
    JDBC_CONNECTION_URL = "jdbc:postgresql://${var.rds_jdbc_hostname}:${var.rds_jdbc_port}/${var.rds_jdbc_db}"
    PASSWORD            = var.rds_jdbc_password
    USERNAME            = var.rds_jdbc_username
  }
  physical_connection_requirements {
    subnet_id              = "subnet-xxxx"
    availability_zone      = "xx-xx-xx"
    security_group_id_list = [aws_security_group.my_glue_connection_sg.id]
  }
}
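The closest question I could find is "Error downloading Glue ETL Marketplace connectors in AWS Glue: 'LAUNCH ERROR'", but it has no answer yet.
I checked this page https://repost.aws/knowledge-center/glue-marketplace-connector-errors and added the AmazonEC2ContainerRegistryReadOnly policy, but it had no effect.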
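I solved the problem and am sharing the fix here for completeness. I was missing 2 pieces of configuration; the Terraform for each is included.
1 - The Glue connection to RDS needs a VPC endpoint, in this case an Interface endpoint. Note that the service name in the resource below is the Glue service (com.amazonaws.${var.aws_region}.glue). See https://repost.aws/knowledge-center/glue-connect-time-out-error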
resource "aws_vpc_endpoint" "my_glue_connection_endpoint" {
  vpc_id              = "vpc-XXXXX"
  service_name        = "com.amazonaws.${var.aws_region}.glue"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = ["subnet-XXXXX"]
  security_group_ids  = [aws_security_group.my_glue_connection_sg.id]
}
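2 - The security group also needs to allow all egress traffic. I had only allowed traffic from the self-referencing SG, but it is also necessary to allow all outbound traffic.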
resource "aws_vpc_security_group_egress_rule" "my_glue_connection_sg_egress_all" {
  description       = "security group egress rule to allow all traffic from Glue connection to RDS"
  security_group_id = aws_security_group.my_glue_connection_sg.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1" # all traffic
}
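To check whether the networking changes took effect, one option is to probe the relevant endpoints from inside the job before attempting the JDBC read. The following is only a minimal sketch, not part of the original fix: the object name ConnectivityProbe, the helper canConnect, and the 5-second default timeout are all hypothetical, and it assumes a plain TCP connect is a reasonable stand-in for "reachable".

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper (not a Glue or Spark API): plain TCP reachability check.
object ConnectivityProbe {
  // Returns true if a TCP connection to host:port opens within timeoutMs,
  // false on timeout, refusal, or an unresolvable host name.
  def canConnect(host: String, port: Int, timeoutMs: Int = 5000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: IOException => false
    } finally {
      socket.close()
    }
  }
}

// Example calls from the job's driver code. The S3 host is the regional
// endpoint for eu-west-1 (the region in the logs); jdbcHostname and jdbcPort
// are the job's own variables, with jdbcPort assumed to be an Int here.
// println(ConnectivityProbe.canConnect("s3.eu-west-1.amazonaws.com", 443))
// println(ConnectivityProbe.canConnect(jdbcHostname, jdbcPort))
```

A false result for the S3 endpoint would point at the same routing or security group problem as the "connect timed out" error in the logs. A true result only shows that the TCP path is open; it does not validate credentials or IAM permissions.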