这是我的要求: 我在 AWS Glue 中有一个爬虫和一个 pyspark 作业。我必须使用步骤功能设置工作流程。
问题:
参考资料:
晚了几个月才回答这个问题,但这可以在步骤函数内完成。 您可以创建以下状态来实现它:
TriggerCrawler
:任务状态:触发 Lambda 函数,在此 Lambda 函数中,您可以使用任何 aws-sdk 编写用于触发 AWS Glue Crawler 的代码PollCrawlerStatus
:任务状态:轮询 Crawler 状态并将其作为 lambda 响应返回的 Lambda 函数。IsCrawlerRunSuccessful
:选择状态:根据 Glue 爬虫的状态,您可以将下一个状态设为选择状态,该状态将进入触发您的 Glue 作业的下一个状态(一旦 Glue 爬虫状态为“就绪”)或在再次轮询之前,先转到 Wait State
几秒钟。RunGlueJob
:任务状态:触发粘合作业的 Lambda 函数。WaitForCrawler
:等待状态:在再次轮询状态之前等待“n”秒。Finish
:成功状态。此 Step Function 的外观如下:
{
"StartAt": "crawler",
"States": {
"crawler_name": {
"Type": "Task",
"Parameters": {
"Name": "crawler"
},
"Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
"Next": "crawler_info",
"Retry": [
{
"ErrorEquals": [
"States.ALL"
],
"BackoffRate": 2,
"IntervalSeconds": 10,
"MaxAttempts": 2
}
]
},
"crawler_info": {
"Type": "Task",
"Next": "crawler_status",
"Parameters": {
"Name": "crawler"
},
"Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
"Retry": [
{
"ErrorEquals": [
"States.ALL"
],
"BackoffRate": 2,
"IntervalSeconds": 10,
"MaxAttempts": 3
}
]
},
"crawler_status": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.Crawler.State",
"StringEquals": "FAILED",
"Next": "crawler_failed"
},
{
"Variable": "$.Crawler.State",
"StringEquals": "RUNNING",
"Next": "crawler_finish_wait"
},
{
"Variable": "$.Crawler.State",
"StringEquals": "STOPPING",
"Next": "crawler_finish_wait"
},
{
"Variable": "$.Crawler.State",
"StringEquals": "SUCCESS",
"Next": "glue_job"
}
],
"Default": "glue_job"
},
"crawler_finish_wait": {
"Type": "Wait",
"Seconds": 10,
"Next": "crawler_info"
},
"crawler_failed": {
"Type": "Fail"
},
"glue_job": {
"Type": "Task",
etc etc....
要进行安排,请使用 Eventbridge 调度程序 :)