AWS Glue job with DynamoDB and MySQL succeeds, but the data is not there


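I have data in DynamoDB that I want to extract and load into a MySQL database (AWS RDS). I have set up AWS Glue with one crawler for the DynamoDB table and one crawler for the MySQL database table. Both crawlers run successfully and create tables in the Data Catalog.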

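When I run an ETL job that takes the DynamoDB table from the Data Catalog, maps it, and writes it to the MySQL database, the job appears to succeed (logs below), but when I look in the MySQL database, nothing is there.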

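The table I am using as a test has c. 20 rows, so the logs suggest that the job is able to collect the DynamoDB data and attempts to insert it into the MySQL database, but without success.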

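Concerned that there might be rogue values in the data, I removed every column except the most mundane one ("title", which contains short strings), but the result was the same.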

If anyone can offer advice or any insight from the logs below, I would be grateful. I have removed IP addresses and anything resembling a hash or identifier.

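The only other observation is that when I check the MySQL instance's error log, I see line after line that seems to hint at an out-of-memory problem: "(oscar_oom.cc:1210)". So I restarted the instance and upgraded it to a larger size just to check. That made no difference either.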

23/08/07 16:29:05 INFO LogPusher: stopping
23/08/07 16:29:05 INFO ProcessLauncher: postprocessing
23/08/07 16:29:05 INFO DAGScheduler: Job 0 finished: save at JDBCUtils.scala:978, took 23.890680 s
23/08/07 16:29:05 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
23/08/07 16:29:05 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
23/08/07 16:29:05 INFO TaskSetManager: Finished task 11.0 in stage 1.0 (TID 12) in 5626 ms on x.x.x.48 (executor 3) (20/20)
23/08/07 16:29:05 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
23/08/07 16:29:05 INFO DAGScheduler: ResultStage 1 (save at JDBCUtils.scala:978) finished in 5.692 s
23/08/07 16:29:05 INFO TaskSetManager: Finished task 15.0 in stage 1.0 (TID 16) in 5621 ms on x.x.x.48 (executor 3) (19/20)
23/08/07 16:29:05 INFO TaskSetManager: Finished task 7.0 in stage 1.0 (TID 8) in 5624 ms on x.x.x.48 (executor 3) (18/20)
23/08/07 16:29:05 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 4) in 5624 ms on x.x.x.48 (executor 3) (17/20)
23/08/07 16:29:04 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 10) in 5248 ms on x.x.x.78 (executor 4) (16/20)
23/08/07 16:29:04 INFO TaskSetManager: Finished task 13.0 in stage 1.0 (TID 14) in 5244 ms on x.x.x.78 (executor 4) (15/20)
23/08/07 16:29:04 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID 6) in 5245 ms on x.x.x.78 (executor 4) (14/20)
23/08/07 16:29:04 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 5245 ms on x.x.x.78 (executor 4) (13/20)
23/08/07 16:29:04 INFO TaskSetManager: Finished task 12.0 in stage 1.0 (TID 13) in 5056 ms on x.x.x.75 (executor 2) (12/20)
23/08/07 16:29:04 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 5) in 5053 ms on x.x.x.75 (executor 2) (11/20)
23/08/07 16:29:04 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 5055 ms on x.x.x.75 (executor 2) (9/20)
23/08/07 16:29:04 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 9) in 5049 ms on x.x.x.75 (executor 2) (10/20)
23/08/07 16:29:02 INFO TaskSetManager: Finished task 18.0 in stage 1.0 (TID 19) in 185 ms on x.x.x.100 (executor 1) (8/20)
23/08/07 16:29:02 INFO TaskSetManager: Finished task 19.0 in stage 1.0 (TID 20) in 147 ms on x.x.x.100 (executor 1) (7/20)
23/08/07 16:29:02 INFO TaskSetManager: Finished task 17.0 in stage 1.0 (TID 18) in 185 ms on x.x.x.100 (executor 1) (6/20)
23/08/07 16:29:02 INFO TaskSetManager: Finished task 16.0 in stage 1.0 (TID 17) in 180 ms on x.x.x.100 (executor 1) (5/20)
23/08/07 16:29:02 INFO TaskSetManager: Finished task 10.0 in stage 1.0 (TID 11) in 3130 ms on x.x.x.100 (executor 1) (4/20)
23/08/07 16:29:02 INFO TaskSetManager: Starting task 19.0 in stage 1.0 (TID 20) (x.x.x.100, executor 1, partition 19, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:29:02 INFO TaskSetManager: Finished task 14.0 in stage 1.0 (TID 15) in 3097 ms on x.x.x.100 (executor 1) (3/20)
23/08/07 16:29:02 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 7) in 3100 ms on x.x.x.100 (executor 1) (2/20)
23/08/07 16:29:02 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 3101 ms on x.x.x.100 (executor 1) (1/20)
23/08/07 16:29:02 INFO TaskSetManager: Starting task 18.0 in stage 1.0 (TID 19) (x.x.x.100, executor 1, partition 18, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:29:02 INFO TaskSetManager: Starting task 17.0 in stage 1.0 (TID 18) (x.x.x.100, executor 1, partition 17, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:29:02 INFO TaskSetManager: Starting task 16.0 in stage 1.0 (TID 17) (x.x.x.100, executor 1, partition 16, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:29:01 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to x.x.x.48:33922
23/08/07 16:29:01 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to x.x.x.78:35480
23/08/07 16:29:01 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to x.x.x.75:48354
23/08/07 16:29:00 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to x.x.x.100:52348
23/08/07 16:28:59 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on x.x.x.48:46011 (size: 25.9 KiB, free: 5.8 GiB)
23/08/07 16:28:59 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on x.x.x.75:38021 (size: 25.9 KiB, free: 5.8 GiB)
23/08/07 16:28:59 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on x.x.x.78:46395 (size: 25.9 KiB, free: 5.8 GiB)
23/08/07 16:28:59 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on x.x.x.100:41023 (size: 25.9 KiB, free: 5.8 GiB)
23/08/07 16:28:59 INFO TaskSetManager: Starting task 15.0 in stage 1.0 (TID 16) (x.x.x.48, executor 3, partition 15, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 13.0 in stage 1.0 (TID 14) (x.x.x.78, executor 4, partition 13, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 14.0 in stage 1.0 (TID 15) (x.x.x.100, executor 1, partition 14, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 12) (x.x.x.48, executor 3, partition 11, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 12.0 in stage 1.0 (TID 13) (x.x.x.75, executor 2, partition 12, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 10) (x.x.x.78, executor 4, partition 9, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 10.0 in stage 1.0 (TID 11) (x.x.x.100, executor 1, partition 10, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 8) (x.x.x.48, executor 3, partition 7, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 9) (x.x.x.75, executor 2, partition 8, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 7) (x.x.x.100, executor 1, partition 6, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5) (x.x.x.75, executor 2, partition 4, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 6) (x.x.x.78, executor 4, partition 5, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3) (x.x.x.100, executor 1, partition 2, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4) (x.x.x.48, executor 3, partition 3, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (x.x.x.75, executor 2, partition 0, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2) (x.x.x.78, executor 4, partition 1, PROCESS_LOCAL, 4465 bytes) taskResourceAssignments Map()
23/08/07 16:28:59 INFO DAGScheduler: Submitting 20 missing tasks from ResultStage 1 (MapPartitionsRDD[24] at save at JDBCUtils.scala:978) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
23/08/07 16:28:59 INFO TaskSchedulerImpl: Adding task set 1.0 with 20 tasks resource profile 0
23/08/07 16:28:59 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1570
23/08/07 16:28:59 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on x.x.x.65:39309 (size: 25.9 KiB, free: 5.8 GiB)
23/08/07 16:28:59 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 25.9 KiB, free 5.8 GiB)
23/08/07 16:28:59 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 56.5 KiB, free 5.8 GiB)
23/08/07 16:28:59 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[24] at save at JDBCUtils.scala:978), which has no missing parents
23/08/07 16:28:59 INFO DAGScheduler: waiting: Set(ResultStage 1)
23/08/07 16:28:59 INFO DAGScheduler: failed: Set()
23/08/07 16:28:59 INFO DAGScheduler: running: Set()
23/08/07 16:28:59 INFO DAGScheduler: ShuffleMapStage 0 (rdd at DynamicFrame.scala:1948) finished in 18.056 s
23/08/07 16:28:59 INFO DAGScheduler: looking for newly runnable stages
23/08/07 16:28:59 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
23/08/07 16:28:59 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2650 ms on x.x.x.100 (executor 1) (1/1)
23/08/07 16:28:59 INFO MultipartUploadOutputStream: close closed:false s3://aws-glue-assets-hash-eu-west-2/sparkHistoryLogs/spark-application-hash.inprogress
23/08/07 16:28:59 INFO BlockManagerMasterEndpoint: Registering block manager x.x.x.75:38021 with 5.8 GiB RAM, BlockManagerId(2, x.x.x.75, 38021, None)
23/08/07 16:28:59 INFO ExecutorTaskManagement: connected executor 2
23/08/07 16:28:59 INFO JESSchedulerBackend$JESAsSchedulerBackendEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (x.x.x.75:48354) with ID 2,  ResourceProfileId 0
23/08/07 16:28:59 INFO ExecutorEventListener: Got executor added event for 2 @ 1691425739067
23/08/07 16:28:58 INFO LogPusher: uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-hash-eu-west-2/sparkHistoryLogs/
23/08/07 16:28:57 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on x.x.x.100:41023 (size: 9.3 KiB, free: 5.8 GiB)
23/08/07 16:28:57 INFO BlockManagerMasterEndpoint: Registering block manager x.x.x.48:46011 with 5.8 GiB RAM, BlockManagerId(3, x.x.x.48, 46011, None)
23/08/07 16:28:56 INFO ExecutorTaskManagement: connected executor 3
23/08/07 16:28:56 INFO ExecutorEventListener: Got executor added event for 3 @ 1691425736989
23/08/07 16:28:56 INFO JESSchedulerBackend$JESAsSchedulerBackendEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (x.x.x.48:33922) with ID 3,  ResourceProfileId 0
23/08/07 16:28:56 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (x.x.x.100, executor 1, partition 0, PROCESS_LOCAL, 4300 bytes) taskResourceAssignments Map()
23/08/07 16:28:56 INFO BlockManagerMasterEndpoint: Registering block manager x.x.x.78:46395 with 5.8 GiB RAM, BlockManagerId(4, x.x.x.78, 46395, None)
23/08/07 16:28:56 INFO BlockManagerMasterEndpoint: Registering block manager x.x.x.100:41023 with 5.8 GiB RAM, BlockManagerId(1, x.x.x.100, 41023, None)
23/08/07 16:28:56 INFO ExecutorEventListener: Got executor added event for 4 @ 1691425736625
23/08/07 16:28:56 INFO ExecutorTaskManagement: connected executor 4
23/08/07 16:28:56 INFO JESSchedulerBackend$JESAsSchedulerBackendEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (x.x.x.78:35480) with ID 4,  ResourceProfileId 0
23/08/07 16:28:56 INFO ExecutorTaskManagement: connected executor 1
23/08/07 16:28:56 INFO ExecutorEventListener: Got executor added event for 1 @ 1691425736605
23/08/07 16:28:56 INFO JESSchedulerBackend$JESAsSchedulerBackendEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (x.x.x.100:52348) with ID 1,  ResourceProfileId 0
23/08/07 16:28:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/08/07 16:28:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
23/08/07 16:28:41 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[7] at rdd at DynamicFrame.scala:1948) (first 15 tasks are for partitions Vector(0))
23/08/07 16:28:41 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1570
23/08/07 16:28:41 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on x.x.x.65:39309 (size: 9.3 KiB, free: 5.8 GiB)
23/08/07 16:28:41 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 5.8 GiB)
23/08/07 16:28:41 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 19.6 KiB, free 5.8 GiB)
23/08/07 16:28:41 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[7] at rdd at DynamicFrame.scala:1948), which has no missing parents
23/08/07 16:28:41 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
23/08/07 16:28:41 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
23/08/07 16:28:41 INFO DAGScheduler: Got job 0 (save at JDBCUtils.scala:978) with 20 output partitions
23/08/07 16:28:41 INFO DAGScheduler: Final stage: ResultStage 1 (save at JDBCUtils.scala:978)
23/08/07 16:28:41 INFO DAGScheduler: Registering RDD 7 (rdd at DynamicFrame.scala:1948) as input to shuffle 0
23/08/07 16:28:41 INFO SparkContext: Starting job: save at JDBCUtils.scala:978
23/08/07 16:28:41 INFO CodeGenerator: Code generated in 28.144541 ms
23/08/07 16:28:41 INFO GlueJDBCSink: Use batch insert with batchSize of 1000
23/08/07 16:28:41 INFO GlueCloudWatchReporter: About to enter executor
23/08/07 16:28:41 INFO JDBCWrapper$: INFO: using ssl properties: Map(trustCertificateKeyStoreUrl -> file:/opt/amazon/certs/RDSTrustStore.jks, useSSL -> true, trustCertificateKeyStorePassword -> , verifyServerCertificate -> true)
23/08/07 16:28:41 INFO JDBCWrapper$: enforceSSL = true, from connection properties, will only attempt SSL with CN matching
23/08/07 16:28:40 INFO GlueContext: The DataSink in action for the given format/connectionType (mysql) is com.amazonaws.services.glue.sinks.MySqlDataSink
23/08/07 16:28:40 INFO GlueContext: Glue secret manager integration: secretId is not provided.
23/08/07 16:28:40 INFO DataCatalogWrapper: Encrypted Catalog password  empty, using value of unencrypted Catalog password
23/08/07 16:28:40 INFO GlueContext: Using location: databasename.tablename
23/08/07 16:28:40 INFO GlueContext: getCatalogSink: catalogId: null, nameSpace: databasename, tableName: aur_databasename_tablename, isRegisteredWithLF: false
23/08/07 16:28:40 INFO LakeformationRetryWrapper$: Lakeformation: API call succeeded
23/08/07 16:28:40 INFO CodeGenerator: Code generated in 183.305423 ms
23/08/07 16:28:35 INFO SharedState: Warehouse path is 'file:/tmp/spark-warehouse'.
23/08/07 16:28:35 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
23/08/07 16:28:35 INFO GlueCloudWatchReporter: About to enter executor
23/08/07 16:28:35 INFO JDBCWrapper$: INFO: using ssl properties: Map(trustCertificateKeyStoreUrl -> file:/opt/amazon/certs/RDSTrustStore.jks, useSSL -> true, trustCertificateKeyStorePassword -> , verifyServerCertificate -> true)
23/08/07 16:28:35 INFO JDBCWrapper$: enforceSSL = true, from connection properties, will only attempt SSL with CN matching
23/08/07 16:28:34 INFO JDBCJobBookmarkUtil$: Skip JDBC Bookmark, Bookmark is not enabled or transformationContext is empty.
23/08/07 16:28:34 INFO GlueContext: The DataSource in action : com.amazonaws.services.glue.JDBCDataSource
23/08/07 16:28:34 INFO GlueContext: Glue secret manager integration: secretId is not provided.
23/08/07 16:28:34 INFO GlueContext: nameSpace: databasename, tableName: aur_databasename_tablename, connectionName aurora-databasestaging, vendor: mysql
23/08/07 16:28:34 INFO DataCatalogWrapper: Encrypted Catalog password  empty, using value of unencrypted Catalog password
23/08/07 16:28:34 INFO GlueContext: getCatalogSource: transactionId: <not-specified> asOfTime: <not-specified> catalogPartitionIndexPredicate: <not-specified> 
23/08/07 16:28:34 INFO GlueContext: getCatalogSource: catalogId: null, nameSpace: databasename, tableName: aur_databasename_tablename, isRegisteredWithLF: false, isGoverned: false, isRowFilterEnabled: false, useAdvancedFiltering: false
23/08/07 16:28:34 INFO LakeformationRetryWrapper$: Lakeformation: API call succeeded
23/08/07 16:28:34 INFO AmazonHttpClient: Configuring Proxy. Proxy Host: x.x.x.0 Proxy Port: 8888
23/08/07 16:28:34 INFO BlockManagerInfo: Removed broadcast_0_piece0 on x.x.x.65:39309 in memory (size: 33.7 KiB, free: 5.8 GiB)
23/08/07 16:28:34 INFO BlockManagerInfo: Removed broadcast_1_piece0 on x.x.x.65:39309 in memory (size: 33.7 KiB, free: 5.8 GiB)
23/08/07 16:28:33 INFO AWSConnectionUtils$: AWSConnectionUtils: use proxy in glue client configuration. Host: x.x.x.0, Port: 8888
23/08/07 16:28:33 INFO AmazonHttpClient: Configuring Proxy. Proxy Host: x.x.x.0 Proxy Port: 8888
23/08/07 16:28:33 INFO AWSGlueJobBookmarkService: AWSGlueJobBookmarkService: create JES client with proxy: host x.x.x.0, port 8888
23/08/07 16:28:33 INFO SparkContext: Created broadcast 1 from broadcast at DynamoConnection.scala:53
23/08/07 16:28:33 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on x.x.x.65:39309 (size: 33.7 KiB, free: 5.8 GiB)
23/08/07 16:28:33 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 33.7 KiB, free 5.8 GiB)
23/08/07 16:28:33 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 361.7 KiB, free 5.8 GiB)
23/08/07 16:28:33 INFO SparkContext: Created broadcast 0 from broadcast at DynamoConnection.scala:53
23/08/07 16:28:33 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on x.x.x.65:39309 (size: 33.7 KiB, free: 5.8 GiB)
23/08/07 16:28:33 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.7 KiB, free 5.8 GiB)
23/08/07 16:28:33 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 361.7 KiB, free 5.8 GiB)
23/08/07 16:28:33 INFO AvroReaderUtil$: Creating default Avro field parser for version 1.7.
23/08/07 16:28:33 INFO FileListPersistence: create FileListPersistence with conf: fs.s3.serverSideEncryption.kms.keyId: None
23/08/07 16:28:33 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev hash]
23/08/07 16:28:33 INFO GPLNativeCodeLoader: Loaded native gpl library
23/08/07 16:28:33 INFO GlueContext: GlueMetrics configured and enabled
23/08/07 16:28:32 INFO ExecutorTaskManagement: executor task g-hash created for executor 4
23/08/07 16:28:32 INFO TaskGroupInterface: createChildTask API response code 200
23/08/07 16:28:32 INFO TaskGroupInterface: creating executor task for executor 4; clientToken gr_hash_e_4_a_spark-application-hash
23/08/07 16:28:32 INFO ExecutorTaskManagement: executor task g-hash created for executor 3
23/08/07 16:28:32 INFO JESSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
23/08/07 16:28:32 INFO log: Logging initialized @9834ms to org.sparkproject.jetty.util.log.Slf4jLog
23/08/07 16:28:32 INFO ExecutorTaskManagement: executor task g-hash created for executor 2
23/08/07 16:28:32 INFO TaskGroupInterface: creating executor task for executor 3; clientToken gr_hash_e_3_a_spark-application-hash
23/08/07 16:28:32 INFO TaskGroupInterface: creating executor task for executor 2; clientToken gr_hash_e_2_a_spark-application-hash
23/08/07 16:28:32 INFO ExecutorTaskManagement: executor task g-hash created for executor 1
23/08/07 16:28:32 INFO SingleEventLogFileWriter: Logging events to file:/tmp/spark-event-logs/spark-application-hash.inprogress
23/08/07 16:28:32 INFO GlueCloudwatchSink: CloudwatchSink: jobName: databasename_dyn2aur jobRunId: jr_hash
23/08/07 16:28:32 INFO AmazonHttpClient: Configuring Proxy. Proxy Host: x.x.x.0 Proxy Port: 8888
23/08/07 16:28:32 INFO GlueCloudwatchSink: CloudwatchSink: Obtained credentials from the Instance Profile
23/08/07 16:28:32 INFO GlueCloudwatchSink: GlueCloudwatchSink: get cloudwatch client using proxy: host x.x.x.0, port 8888
23/08/07 16:28:32 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, x.x.x.65, 39309, None)
23/08/07 16:28:32 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, x.x.x.65, 39309, None)
23/08/07 16:28:32 INFO BlockManagerMasterEndpoint: Registering block manager x.x.x.65:39309 with 5.8 GiB RAM, BlockManagerId(driver, x.x.x.65, 39309, None)
23/08/07 16:28:32 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, x.x.x.65, 39309, None)
23/08/07 16:28:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/08/07 16:28:31 INFO NettyBlockTransferService: Server created on x.x.x.65:39309
23/08/07 16:28:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39309.
23/08/07 16:28:31 INFO TaskGroupInterface: creating executor task for executor 1; clientToken gr_hash_e_1_a_spark-application-hash
23/08/07 16:28:31 INFO AmazonHttpClient: Configuring Proxy. Proxy Host: x.x.x.0 Proxy Port: 8888
23/08/07 16:28:31 INFO JESSchedulerBackend: JESClusterManager: Initializing JES client with proxy: host: x.x.x.0, port: 8888
23/08/07 16:28:31 INFO JESSchedulerBackend: JESSchedulerBackend
23/08/07 16:28:31 INFO JESSchedulerBackend$JESAsSchedulerBackendEndpoint: JESAsSchedulerBackendEndpoint
23/08/07 16:28:31 INFO SubResultCacheManager: Sub-result caches are disabled.
23/08/07 16:28:31 INFO SparkEnv: Registering OutputCommitCoordinator
23/08/07 16:28:31 INFO MemoryStore: MemoryStore started with capacity 5.8 GiB
23/08/07 16:28:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-hash
23/08/07 16:28:31 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/08/07 16:28:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/08/07 16:28:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/08/07 16:28:31 INFO SparkEnv: Registering BlockManagerMaster
23/08/07 16:28:31 INFO SparkEnv: Registering MapOutputTracker
23/08/07 16:28:31 INFO Utils: Successfully started service 'sparkDriver' on port 45977.
23/08/07 16:28:30 INFO SecurityManager: Changing modify acls groups to: 
23/08/07 16:28:30 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
23/08/07 16:28:30 INFO SecurityManager: Changing view acls groups to: 
23/08/07 16:28:30 INFO SecurityManager: Changing modify acls to: spark
23/08/07 16:28:30 INFO SecurityManager: Changing view acls to: spark
23/08/07 16:28:30 INFO ResourceProfileManager: Added ResourceProfile id: 0
23/08/07 16:28:30 INFO ResourceProfile: Limiting resource is cpus at 4 tasks per executor
23/08/07 16:28:30 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 10240, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/08/07 16:28:30 INFO SparkContext: Submitted application: nativespark-databasename_dyn2aur-jr_hash
23/08/07 16:28:30 INFO ResourceUtils: No custom resources configured for spark.driver.
23/08/07 16:28:30 INFO ResourceUtils: ==============================================================
23/08/07 16:28:30 INFO SparkContext: Running Spark version 3.3.0-amzn-1
23/08/07 16:28:28 INFO SafeLogging: Initializing logging subsystem
23/08/07 16:28:26 INFO PlatformInfo: Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
23/08/07 16:28:26 INFO PlatformInfo: Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
23/08/07 16:28:26 INFO PlatformInfo: Unable to read clusterId from http://localhost:8321/configuration, trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
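One thing a reader can check directly in the log above is which catalog tables the job names as its source and its sink. In the lines shown, both `getCatalogSource` and `getCatalogSink` report `aur_databasename_tablename`; whether that matches the intended job graph is not clear from the post. The following is a minimal sketch for pulling those names out of a saved log, assuming the log is available as plain text. The function name `catalog_tables` and the regular expression are illustrative assumptions, not part of AWS Glue.

```python
import re

# Matches GlueContext lines such as:
#   "... getCatalogSource: catalogId: null, nameSpace: db, tableName: tbl, ..."
#   "... getCatalogSink: catalogId: null, nameSpace: db, tableName: tbl, ..."
# The pattern is an assumption based only on the log lines quoted in the post.
CATALOG_RE = re.compile(
    r"getCatalog(Source|Sink): .*?nameSpace: ([^,\s]+), tableName: ([^,\s]+)"
)


def catalog_tables(log_text):
    """Return 'namespace.table' names grouped by role ('Source' or 'Sink')."""
    found = {"Source": [], "Sink": []}
    for line in log_text.splitlines():
        match = CATALOG_RE.search(line)
        if match:
            role, namespace, table = match.groups()
            found[role].append(f"{namespace}.{table}")
    return found


# Two lines copied verbatim from the log in the question.
sample = """\
23/08/07 16:28:40 INFO GlueContext: getCatalogSink: catalogId: null, nameSpace: databasename, tableName: aur_databasename_tablename, isRegisteredWithLF: false
23/08/07 16:28:34 INFO GlueContext: getCatalogSource: catalogId: null, nameSpace: databasename, tableName: aur_databasename_tablename, isRegisteredWithLF: false, isGoverned: false, isRowFilterEnabled: false, useAdvancedFiltering: false
"""

tables = catalog_tables(sample)
# Tables that appear as both a source and a sink in the same run.
overlap = sorted(set(tables["Source"]) & set(tables["Sink"]))

print("sources:", tables["Source"])
print("sinks:", tables["Sink"])
print("named as both source and sink:", overlap)
```

If the same table shows up in both roles, it may be worth comparing the job's source node against the DynamoDB catalog table that was meant to be read.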
aws-glue
1 Answer

Note: this is not an answer. I tried to add this as a comment, but I do not have enough reputation points.

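I am commenting to note that I see a similar case with a Glue job that converts json <> parquet. The job completes with a succeeded run status and prints the expected output logs, but the expected file objects are not in the expected S3 target.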

INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''

INFO PlatformInfo: Unable to read clusterId from http://localhost:8321/configuration, trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json

INFO PlatformInfo: Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json

INFO PlatformInfo: Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look

INFO CoarseGrainedExecutorBackend: eagerFSInit: Eagerly initialized FileSystem at s3://does/not/exist in 2403 ms