Glue Spark (Scala) job does not connect to PostgreSQL RDS


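I have a Glue Spark job written in Scala. I need to read a data source from an RDS database (PostgreSQL). I created the connection in the AWS UI and tested it. It works, so I can confirm that the Glue connection to RDS is set up correctly (role, security groups).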

When I add this source in the Glue Spark job, I get this error on the console:

    INFO 2024-04-15T07:26:25,251 245857  com.amazonaws.services.glue.connectors.NativeConnectorService$  [main]  Glue connectors: Copy connector /connectors/redshift/new/redshift-jdbc42-2.1.0.16.jar to /opt/aws_glue_connectors/selected/redshift/redshift-jdbc42-2.1.0.16.jar
    INFO 2024-04-15T07:26:25,251 245857  com.amazonaws.services.glue.connectors.NativeConnectorService$  [main]  Glue connectors: Copy is finished
    Glue ETL Marketplace - Start ETL connector activation process...
    Glue ETL Marketplace - downloading jars for following connections: List(my_glue_connection) using command: List(python3, -u, -m, docker.unpack_docker_image, --connections, my_glue_connection, --result_path, jar_paths, --region, eu-west-1, --endpoint, https://glue.eu-west-1.amazonaws.com, --proxy, xx.xx.xx.xx:8888)
    2024-04-15 07:26:31,431 - __main__ - INFO - Glue ETL Marketplace - Start downloading connector jars for connection: my_glue_connection
    2024-04-15 07:26:32,492 - __main__ - INFO - Glue ETL Marketplace - using region: eu-west-1, proxy: xx.xx.xx.xx:8888 and glue endpoint: https://glue.eu-west-1.amazonaws.com to get connection: my_glue_connection
    2024-04-15 07:26:32,651 - __main__ - WARNING - Glue ETL Marketplace - Connection my_glue_connection is not a CUSTOM or Marketplace connection, skip jar downloading for it
    2024-04-15 07:26:32,651 - __main__ - INFO - Glue ETL Marketplace - successfully wrote jar paths to "jar_paths"
    Glue ETL Marketplace - Retrieved no ETL connector jars, this may be due to no marketplace/custom connection attached to the job or failure of downloading them, please scroll back to the previous logs to find out the root cause. Container setup continues.
    Glue ETL Marketplace - ETL connector activation process finished, container setup continues...
    ...
    SdkClientException occurred : com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to aws-glue-assets-xxxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com:443 [aws-glue-assets-XXXXX-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx0, aws-glue-assets-xxxxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx, aws-glue-assets-xxxxxxx-eu-west-1.s3.eu-west-1.amazonaws.com/xx.xx.xx.xx] failed: connect timed out
    3 Retry(s) left

My Spark job attempts to connect as follows:

    val jdbcUrl = s"jdbc:postgresql://$jdbcHostname:$jdbcPort/$jdbcDatabase"
    val connectionProperties = new java.util.Properties()
    connectionProperties.put("Driver", "org.postgresql.Driver")
    connectionProperties.put("user", jdbcUsername)
    connectionProperties.put("password", jdbcPassword)

    val dataFrame = spark.read.jdbc(jdbcUrl, "table-name", connectionProperties)
    dataFrame.show()

One strange message in the log is "Copy connector /connectors/redshift/new/redshift-jdbc42-2.1.0.16.jar to /opt/aws_glue_connectors/selected/redshift/redshift-jdbc42-2.1.0.16.jar", but I never set up any Redshift-related connection or Glue Spark job. My Glue connection (written in Terraform) is a JDBC connection:

    resource "aws_glue_connection" "my_glue_connection" {
      name                  = "my_glue_connection"
      connection_type       = "JDBC"
      connection_properties = {
        JDBC_CONNECTION_URL = "jdbc:postgresql://${var.rds_jdbc_hostname}:${var.rds_jdbc_port}/${var.rds_jdbc_db}"
        PASSWORD            = var.rds_jdbc_password
        USERNAME            = var.rds_jdbc_username
      }

      physical_connection_requirements {
        subnet_id              = "subnet-xxxx"
        availability_zone      = "xx-xx-xx"
        security_group_id_list = [aws_security_group.my_glue_connection_sg.id]
      }
    }

The closest question I could find is "Error on downloading Glue ETL Marketplace connector in AWS Glue: 'LAUNCH ERROR'", but it has no answer yet.

I checked this page, https://repost.aws/knowledge-center/glue-marketplace-connector-errors, and added the AmazonEC2ContainerRegistryReadOnly policy, but it had no effect.
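For reference, attaching that AWS managed policy to the job's role in Terraform could look like the sketch below. The role resource `aws_iam_role.glue_job_role` and the attachment label are hypothetical placeholders, not names taken from the question.

```hcl
# Sketch only: aws_iam_role.glue_job_role is a hypothetical name for the
# IAM role that the Glue job runs as.
resource "aws_iam_role_policy_attachment" "glue_ecr_read_only" {
  role       = aws_iam_role.glue_job_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}
```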

aws-glue aws-glue-connection
1 answer

0 votes

I solved the problem and am sharing the fix here for completeness. I was missing two pieces of configuration; the Terraform for each follows.

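1 - The Glue connection needs a VPC endpoint in order to reach RDS, in this case an interface endpoint (per its service_name, the resource below is created for the Glue service). See https://repost.aws/knowledge-center/glue-connect-time-out-error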

    resource "aws_vpc_endpoint" "my_glue_connection_endpoint" {
      vpc_id              = "vpc-XXXXX"
      service_name        = "com.amazonaws.${var.aws_region}.glue"
      vpc_endpoint_type   = "Interface"
      private_dns_enabled = true
      subnet_ids          = ["subnet-XXXXX"]
      security_group_ids  = [aws_security_group.my_glue_connection_sg.id]
    }

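2 - The security group also needs to allow all egress traffic. I was only allowing traffic from the self-referencing SG, but it turned out to be necessary to allow all traffic as well.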

    resource "aws_vpc_security_group_egress_rule" "my_glue_connection_sg_egress_all" {
      description       = "security group egress rule to allow all traffic from Glue connection to RDS"
      security_group_id = aws_security_group.my_glue_connection_sg.id
      cidr_ipv4         = "0.0.0.0/0"
      ip_protocol       = "-1" # all traffic
    }
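
For context, a self-referencing rule of the kind mentioned in step 2 might be declared as in the sketch below. It is not part of the original answer. It assumes the same `aws_security_group.my_glue_connection_sg` resource and assumes the rule is an inbound one, which the answer does not state. The resource label `my_glue_connection_sg_ingress_self` is made up for illustration.

```hcl
# Sketch only: lets members of the security group receive traffic from
# other members of the same group (a self-referencing inbound rule).
resource "aws_vpc_security_group_ingress_rule" "my_glue_connection_sg_ingress_self" {
  description                  = "self-referencing rule: allow all traffic between members of this security group"
  security_group_id            = aws_security_group.my_glue_connection_sg.id
  referenced_security_group_id = aws_security_group.my_glue_connection_sg.id
  ip_protocol                  = "-1" # all protocols and ports
}
```

Pointing `referenced_security_group_id` at the group itself is what makes the rule self-referencing. It only covers traffic between resources in that group, which fits the answer's finding that a separate rule for all egress was also needed.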