Amazon Glue - Loading into Redshift fails with decimal fields


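I have a very simple Glue job that loads data from S3 into Redshift, with a Transform in between that renames the fields and changes their types.

The first run works (almost) without problems: the data is loaded into Redshift. Every subsequent run fails. The reason is that Glue creates the Redshift table correctly on the first load, but handles it incorrectly once it already exists.

This happens for every field that is converted to decimal (I have not tested all the other types).

CSV file: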

Text value,Average whatever,Another string,Just a number
A1,2.2,test,5
A2,5,test2,7

Transform (Change Schema):

Generated code (I did not edit the code, it is still a "visual" job):

...
# Script generated for node Amazon S3
AmazonS3_node1710618800725 = glueContext.create_dynamic_frame.from_options(format_options={"quoteChar": "\"", "withHeader": True, "separator": ","}, connection_type="s3", format="csv", connection_options={"paths": ["s3://<source-s3-bucket>/test/gonna_fail/data.csv"]}, transformation_ctx="AmazonS3_node1710618800725")

# Script generated for node Change Schema
ChangeSchema_node1710691042153 = ApplyMapping.apply(frame=AmazonS3_node1710618800725, mappings=[("Text value", "string", "text_value", "string"), ("Average whatever", "string", "average_whatever", "decimal"), ("Another string", "string", "another_string", "string"), ("Just a number", "string", "just_a_number", "decimal")], transformation_ctx="ChangeSchema_node1710691042153")

# Script generated for node Amazon Redshift
AmazonRedshift_node1710618808047 = glueContext.write_dynamic_frame.from_options(frame=ChangeSchema_node1710691042153, connection_type="redshift", connection_options={"redshiftTmpDir": "s3://aws-glue-assets-xxx-eu-central-1/temporary/", "useConnectionProperties": "true", "dbtable": "raw_data.gonna_fail", "connectionName": "serverless-redshift", "preactions": "DROP TABLE IF EXISTS raw_data.gonna_fail; CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL);"}, transformation_ctx="AmazonRedshift_node1710618808047")
  1. First execution (all queries are taken from the Redshift log):
7:26:26 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2)) DISTSTYLE EVEN

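The table is created correctly. After that an error occurs, "Expected command status BEGIN, got CREATE TABLE" (how can I avoid it?), but the job is retried 30 seconds later and succeeds: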

7:26:56 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2)) DISTSTYLE EVEN 
7:26:56 PM  DROP TABLE IF EXISTS raw_data.gonna_fail 
7:26:56 PM   CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL) 
7:26:56 PM  COPY "raw_data"."gonna_fail" ("text_value","average_whatever","another_string","just_a_number") FROM 's3://aws-glue-assets-xxx-eu-central-1/temporary/63e30430-67f0-4ab2-b539-22180ae2920b/manifest.json' FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS '' 
  2. Second execution:

For every decimal field, a new column is added whose name is the field name concatenated with its data type:

7:29:19 PM  ALTER TABLE raw_data.gonna_fail add "average_whatever_decimal(10,2)" DECIMAL(10,2) default NULL; 
7:29:19 PM   ALTER TABLE raw_data.gonna_fail add "just_a_number_decimal(10,2)" DECIMAL(10,2) default NULL; 
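
The naming pattern in these ALTER TABLE statements is regular enough to check for mechanically. Below is a minimal sketch, assuming the table's column names have already been fetched into a Python list by some other means. The helper name `find_type_suffixed_columns` is hypothetical and not part of Glue; it only recognises the pattern seen in the log and says nothing about why Glue produces it.

```python
import re

# Matches names such as 'just_a_number_decimal(10,2)': a base name, an
# underscore, then a decimal type with precision and scale.
_SUFFIXED = re.compile(r"^(?P<base>.+)_decimal\((?P<precision>\d+),(?P<scale>\d+)\)$")


def find_type_suffixed_columns(columns):
    """Return {suffixed_name: base_name} for columns whose base name also exists."""
    present = set(columns)
    found = {}
    for name in columns:
        match = _SUFFIXED.match(name)
        # Only report a column when the plain base column is present too,
        # which is the situation shown in the log.
        if match and match.group("base") in present:
            found[name] = match.group("base")
    return found


# Column list taken from the second CREATE TABLE statement in the log.
columns = [
    "text_value",
    "average_whatever",
    "another_string",
    "just_a_number",
    "just_a_number_decimal(10,2)",
    "average_whatever_decimal(10,2)",
]
print(find_type_suffixed_columns(columns))
```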

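This load also fails (I did not check why) and is retried after 30 seconds.

The CREATE TABLE is executed (I do not know why the create statement runs twice, once "automatically" and once through the preactions):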

7:29:54 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2), "just_a_number_decimal(10,2)" DECIMAL(10,2), "average_whatever_decimal(10,2)" DECIMAL(10,2)) DISTSTYLE EVEN 

Preactions:

7:29:54 PM  DROP TABLE IF EXISTS raw_data.gonna_fail 
7:29:54 PM   CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL) 

Incorrect COPY statement:

7:29:54 PM  COPY "raw_data"."gonna_fail" ("text_value","average_whatever","another_string","just_a_number","just_a_number_decimal(10,2)","average_whatever_decimal(10,2)") FROM 's3://aws-glue-assets-xxx-eu-central-1/temporary/d46ca4ae-86cc-4444-addd-6c54c376a2a1/manifest.json' FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS ''

This statement fails, Spark retries it 3 times, and the load fails. The error visible in Glue:

Caused by: com.amazon.redshift.util.RedshiftException: ERROR: column "just_a_number_decimal(10,2)" of relation "gonna_fail" does not exist

I could not find those additional/incorrect fields in the frame's .schema().fields.
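
To make that comparison repeatable, the column list of a logged COPY statement can be diffed against the field names of the frame. This is only a sketch under assumptions: the statement is passed in as a string copied from the Redshift log, the field names are passed in as a plain list (for example, collected from .schema().fields beforehand), and the helper names `copy_columns` and `unexpected_columns` are hypothetical.

```python
import re


def copy_columns(statement):
    """Extract the quoted column names from a Redshift COPY statement."""
    # The column list is the parenthesised group between the table name and
    # FROM. Column names may themselves contain parentheses, so the group
    # ends only at a ')' that is followed by whitespace and FROM.
    match = re.search(
        r"COPY\s+\S+\s+\((.*?)\)\s+FROM\s", statement, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("no column list found in COPY statement")
    return re.findall(r'"([^"]+)"', match.group(1))


def unexpected_columns(statement, frame_fields):
    """Return COPY columns that are not among the frame's field names."""
    expected = set(frame_fields)
    return [name for name in copy_columns(statement) if name not in expected]


# COPY statement from the failing second execution, as shown in the log.
statement = (
    'COPY "raw_data"."gonna_fail" ("text_value","average_whatever",'
    '"another_string","just_a_number","just_a_number_decimal(10,2)",'
    '"average_whatever_decimal(10,2)") FROM '
    "'s3://aws-glue-assets-xxx-eu-central-1/temporary/"
    "d46ca4ae-86cc-4444-addd-6c54c376a2a1/manifest.json' "
    "FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS ''"
)
frame_fields = ["text_value", "average_whatever", "another_string", "just_a_number"]
print(unexpected_columns(statement, frame_fields))
```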

amazon-redshift aws-glue amazon-redshift-serverless
1 Answer

I don't think the visual tool can do much here. Clone the job and update the script as follows:

# Write to a staging table first, then copy its rows into the target table in the postactions.
AmazonRedshift_node1710618808047 = glueContext.write_dynamic_frame.from_options(
    frame=ChangeSchema_node1710691042153,
    connection_type="redshift",
    connection_options={
        "redshiftTmpDir": "s3://aws-glue-assets-xxx-eu-central-1/temporary/",
        "dbtable": "raw_data.temp_gonna_fail",
        "connectionName": "serverless-redshift",
        "preactions": "DROP TABLE IF EXISTS raw_data.gonna_fail; CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL);",
        "postactions": "BEGIN; INSERT INTO raw_data.gonna_fail SELECT * FROM raw_data.temp_gonna_fail; DROP TABLE IF EXISTS raw_data.temp_gonna_fail; END;",
    },
    transformation_ctx="AmazonRedshift_node1710618808047",
)

I added "postactions" to connection_options and removed "useConnectionProperties": "true" (add it back if the script fails because of that).
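
One further thing that may be worth trying, as an assumption rather than something confirmed in this thread: the preactions declare the columns as plain DECIMAL, which Redshift treats as DECIMAL(18,0), while the table Glue creates on its own uses DECIMAL(10,2), as the log shows. Writing the precision and scale out explicitly keeps the two definitions identical and avoids losing the fractional part of values such as 2.2. The sketch below only builds the option strings for the staging-table approach. `build_actions` and `COLUMNS` are hypothetical names, not part of Glue.

```python
# Column definitions with explicit precision and scale, mirroring the
# types that appear in the CREATE TABLE statement in the Redshift log.
COLUMNS = [
    ("text_value", "VARCHAR(MAX)"),
    ("average_whatever", "DECIMAL(10,2)"),
    ("another_string", "VARCHAR(MAX)"),
    ("just_a_number", "DECIMAL(10,2)"),
]


def build_actions(schema, table, columns):
    """Build dbtable/preactions/postactions strings for a staging-table load."""
    target = f"{schema}.{table}"
    staging = f"{schema}.temp_{table}"
    column_sql = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    names = ", ".join(name for name, _ in columns)
    preactions = (
        f"DROP TABLE IF EXISTS {target}; "
        f"CREATE TABLE IF NOT EXISTS {target} ({column_sql});"
    )
    # Naming the columns in the INSERT avoids relying on column order.
    postactions = (
        f"BEGIN; INSERT INTO {target} ({names}) SELECT {names} FROM {staging}; "
        f"DROP TABLE IF EXISTS {staging}; END;"
    )
    return {"dbtable": staging, "preactions": preactions, "postactions": postactions}


options = build_actions("raw_data", "gonna_fail", COLUMNS)
for key, value in options.items():
    print(f"{key}: {value}")
```

The resulting three entries could be merged into the connection_options dictionary of the script above in place of the hand-written strings.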
