rq (Redis Queue) work-horse terminated unexpectedly: how should I debug it?

Question

I am using RQ workers to process a large number of jobs, and I am running into a problem.

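Observations

The job returns: work-horse terminated unexpectedly; waitpid returned None
The job connects to a database and simply runs a few SQL statements, such as a simple insert or delete.
The error message appears almost immediately, within a few seconds of the job starting.
Sometimes the jobs run fine with no problems.
On one of the jobs, I can see that it performed an insert but then just returned the error message.
On the rq worker, I see the following log entries.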

{"message": "my_queue: my_job() (dcf797c4-1434-4b77-a344-5bbb1f775113)"}
{"message": "Killed horse pid 8451"}
{"message": "Moving job to FailedJobRegistry (work-horse terminated unexpectedly; waitpid returned None)"}

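Digging into the rq code (https://github.com/rq/rq), the "Killed horse pid ..." line indicates that RQ itself is deliberately killing the job. The only place the job-killing code runs is in the snippet below. To reach the self.kill_horse() line, a HorseMonitorTimeoutException must be raised and the difference utcnow() - job.started_at must be greater than job.timeout + 1 (the timeout is huge, btw).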

        while True:
            try:
                with UnixSignalDeathPenalty(self.job_monitoring_interval, HorseMonitorTimeoutException):
                    retpid, ret_val = os.waitpid(self._horse_pid, 0)
                break
            except HorseMonitorTimeoutException:
                # Horse has not exited yet and is still running.
                # Send a heartbeat to keep the worker alive.
                self.heartbeat(self.job_monitoring_interval + 5)

                # Kill the job from this side if something is really wrong (interpreter lock/etc).
                if job.timeout != -1 and (utcnow() - job.started_at).total_seconds() > (job.timeout + 1):
                    self.kill_horse()
                    break

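Sometimes the jobs sit in the queue for a long time before a worker actually picks them up. I would expect started_at to be reset when that happens, though. That assumption may be wrong.
The jobs are created with rq_scheduler and run periodically from a cron string (every night at 11 PM, and so on).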
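
To make the started_at hypothesis concrete, here is a minimal sketch that mirrors only the kill condition from the monitoring loop quoted above. The function name should_kill_horse and all timestamps and timeout values are hypothetical and chosen for illustration; this is not RQ code, and it does not show that RQ actually leaves started_at stale.

```python
from datetime import datetime, timedelta, timezone


def should_kill_horse(started_at, timeout, now):
    """Mirror the check in the quoted monitoring loop.

    A timeout of -1 means "no timeout" in that snippet.
    """
    if timeout == -1:
        return False
    return (now - started_at).total_seconds() > (timeout + 1)


# Hypothetical "current time" at which the monitor wakes up.
now = datetime(2020, 6, 1, 23, 0, 30, tzinfo=timezone.utc)

# Hypothetical case 1: started_at was set 30 seconds ago, timeout is one hour.
fresh = should_kill_horse(now - timedelta(seconds=30), 3600, now)

# Hypothetical case 2: started_at is left over from a run one day earlier.
stale = should_kill_horse(now - timedelta(days=1), 3600, now)

print(fresh, stale)  # False True
```

If started_at were carried over from an earlier run of the same scheduled job, the condition would already be true the first time the monitor checks, which would be consistent with a job being killed a few seconds after it starts despite a large timeout.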

What should my next step be in debugging this?

Answer 1

I think the latest RQ release (https://github.com/rq/rq/releases/tag/v1.4.0) has the fix. Its release notes say:

Fixed a bug that may cause early termination of scheduled or requeued jobs. Thanks @rmartin48!
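
One quick way to see whether an environment already has that release is to compare the installed rq version against 1.4.0. The sketch below is only an illustration: the helper names version_tuple and includes_fix are hypothetical, and the comparison is a simplified numeric one that ignores pre-release suffixes.

```python
import re
from importlib import metadata


def version_tuple(text):
    """Turn a version string like '1.4.0' into a 3-item tuple of ints."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def includes_fix(installed, fixed_in="1.4.0"):
    """Return True if the installed version is at least the fixed release."""
    return version_tuple(installed) >= version_tuple(fixed_in)


try:
    installed = metadata.version("rq")
    print("rq", installed, "includes fix:", includes_fix(installed))
except metadata.PackageNotFoundError:
    print("rq is not installed in this environment")
```

If the check reports an older version, upgrading rq in the worker environment and re-running the scheduled jobs would be a reasonable next experiment.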
