Inserting into Snowflake with Glue throws "IllegalArgumentException: No group with name <host>"


I have a Glue job that loads data from RDS into Snowflake.

This job used to insert into S3, before this Snowflake instance existed. Now, when I try to run it with Snowflake as the sink, it fails with the following error: "IllegalArgumentException: No group with name <host>"

From the driver logs:

23/03/29 09:45:32 ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis] Last Executed Line number from script job-rds-to-snowflake-visual.py: 50
23/03/29 09:45:32 ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis] {"Event":"GlueETLJobExceptionEvent","Timestamp":1680083132028,"Failure Reason":"Traceback (most recent call last):\n  File \"/tmp/job-rds-to-snowflake-visual.py\", line 50, in <module>\n    transformation_ctx=\"SnowflakeDataCatalog_node1680082896733\",\n  File \"/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py\", line 819, in from_catalog\n    return self._glue_context.write_dynamic_frame_from_catalog(frame, db, table_name, redshift_tmp_dir, transformation_ctx, additional_options, catalog_id)\n  File \"/opt/amazon/lib/python3.6/site-packages/awsglue/context.py\", line 386, in write_dynamic_frame_from_catalog\n    makeOptions(self._sc, additional_options), catalog_id)\n  File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\", line 1305, in __call__\n    answer, self.gateway_client, self.target_id, self.name)\n  File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 117, in deco\n    raise converted from None\npyspark.sql.utils.IllegalArgumentException: No group with name <host>","Stack Trace":[{"Declaring Class":"deco","Method Name":"raise converted from None","File Name":"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py","Line Number":117},{"Declaring Class":"__call__","Method Name":"answer, self.gateway_client, self.target_id, self.name)","File Name":"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py","Line Number":1305},{"Declaring Class":"write_dynamic_frame_from_catalog","Method Name":"makeOptions(self._sc, additional_options), catalog_id)","File Name":"/opt/amazon/lib/python3.6/site-packages/awsglue/context.py","Line Number":386},{"Declaring Class":"from_catalog","Method Name":"return self._glue_context.write_dynamic_frame_from_catalog(frame, db, table_name, redshift_tmp_dir, transformation_ctx, additional_options, catalog_id)","File Name":"/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py","Line Number":819},{"Declaring Class":"<module>","Method Name":"transformation_ctx=\"SnowflakeDataCatalog_node1680082896733\",","File Name":"/tmp/job-rds-to-snowflake-visual.py","Line Number":50}],"Last Executed Line number":50,"script":"job-rds-to-snowflake-visual.py"}
23/03/29 09:45:32 ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
  File "/tmp/job-rds-to-snowflake-visual.py", line 50, in <module>
    transformation_ctx="SnowflakeDataCatalog_node1680082896733",
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 819, in from_catalog
    return self._glue_context.write_dynamic_frame_from_catalog(frame, db, table_name, redshift_tmp_dir, transformation_ctx, additional_options, catalog_id)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 386, in write_dynamic_frame_from_catalog
    makeOptions(self._sc, additional_options), catalog_id)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.IllegalArgumentException: No group with name <host>
23/03/29 09:45:31 INFO GlueContext: getCatalogSink: catalogId: null, nameSpace: sf_audit_db, tableName: auditlog_dev_public_rds_auditlog, isRegisteredWithLF: false
23/03/29 09:45:26 WARN SharedState: URL.setURLStreamHandlerFactory failed to set FsUrlStreamHandlerFactory
23/03/29 09:45:24 INFO GlueContext: The DataSource in action : com.amazonaws.services.glue.JDBCDataSource
23/03/29 09:45:24 INFO GlueContext: Glue secret manager integration: secretId is not provided.
23/03/29 09:45:24 INFO GlueContext: nameSpace: pg_audit_db, tableName: supportdatabase_public_audit_log_condensed, connectionName conn-rds-pg-auditdb, vendor: postgresql
23/03/29 09:45:24 INFO GlueContext: getCatalogSource: transactionId: <not-specified> asOfTime: <not-specified> catalogPartitionIndexPredicate: <not-specified> 
23/03/29 09:45:24 INFO GlueContext: getCatalogSource: catalogId: null, nameSpace: pg_audit_db, tableName: supportdatabase_public_audit_log_condensed, isRegisteredWithLF: false, isGoverned: false, isRowFilterEnabled: false, useAdvancedFiltering: false, isTableFromSchemaRegistry: false
23/03/29 09:45:22 INFO GlueContext: GlueMetrics configured and enabled
23/03/29 09:45:19 INFO Utils: Successfully started service 'sparkDriver' on port 42465.

I haven't touched the generated script, since we want to keep the job in visual mode. Here is the script, in case it helps:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame


def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)


args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node RDS (Data Catalog)
RDSDataCatalog_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="pg_audit_db",
    table_name="supportdatabase_public_audit_log_condensed",
    transformation_ctx="RDSDataCatalog_node1",
)

# Script generated for node SQL Query
SqlQuery0 = """
SELECT 
    *
FROM
      webapirequestlog
"""
SQLQuery_node1679649943271 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"webapirequestlog": RDSDataCatalog_node1},
    transformation_ctx="SQLQuery_node1679649943271",
)

# Script generated for node Snowflake (Data Catalog)
SnowflakeDataCatalog_node1680082896733 = glueContext.write_dynamic_frame.from_catalog(
    frame=SQLQuery_node1679649943271,
    database="sf_audit_db",
    table_name="auditlog_dev_public_rds_auditlog",
    transformation_ctx="SnowflakeDataCatalog_node1680082896733",
)

job.commit()

I've tried googling the error, but nothing useful came up. Any ideas on what to check?

python amazon-web-services pyspark snowflake-cloud-data-platform aws-glue
1 Answer

The problem is that a JDBC connection defined for Snowflake can be used as a data source for crawlers, but not in your ETL job. In the ETL job you have to use the Snowflake connection type, which unfortunately cannot be used as a data source for crawlers, at least not yet.

Here is a link to the documentation: https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html
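
As a rough sketch, the sink could then be written through the Snowflake connection type with write_dynamic_frame.from_options instead of the Data Catalog. This is only an illustration under assumptions: the connection name, database, schema, and table below are placeholders, and the exact connection_options keys depend on your Glue version and connector, so verify them against the documentation above.

# Hypothetical replacement for the SnowflakeDataCatalog_node1680082896733 sink,
# writing through a Glue connection of type Snowflake instead of the Data Catalog.
glueContext.write_dynamic_frame.from_options(
    frame=SQLQuery_node1679649943271,
    connection_type="snowflake",
    connection_options={
        "connectionName": "conn-snowflake-auditdb",  # placeholder: Glue connection of type Snowflake
        "sfDatabase": "AUDITLOG_DEV",                # placeholder: target Snowflake database
        "sfSchema": "PUBLIC",                        # placeholder: target schema
        "dbtable": "RDS_AUDITLOG",                   # placeholder: target table
    },
    transformation_ctx="SnowflakeWrite_node1",
)

If you rebuild the target node in the visual editor instead, the generated code may look different, but the key point is the same: the write has to go through a Snowflake-type connection rather than the JDBC connection registered in the Data Catalog.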
