OR-Tools, CP-SAT solver: with a large number of workers (num_search_workers), the model does not stop after stop_search is called

Question (votes: 0, answers: 1)

I am solving a standard job-shop scheduling problem. The solver is launched from Airflow inside a Docker container. The machine parameters are:

  • CPU: Intel Xeon Gold 6230
  • RAM: 300 GB or more

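When the number of operations per run reaches about 1000, the solver crashes with a memory error (the task exits with return code -9) without finding a solution; increasing the amount of RAM did not help. During testing I found that the problem can be worked around by specifying a large number of workers (100 or more) in the model settings. However, the more workers I add, the more likely it is that the model does not exit and everything freezes once the required number of solutions has been found, or the timer has fired, and stop_search has been called.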

My callback looks like this:

class ObjectiveEarlyStopping(cp_model.CpSolverSolutionCallback):
    def __init__(self, solution_count_limit=GlobalVariables.SOLUTION_COUNT_LIMIT,
                 max_execution_time=GlobalVariables.MIN_EXECUTION_WORKTIME):
        super(ObjectiveEarlyStopping, self).__init__()
        self._solution_count = 0
        self._solution_limit = max(2, solution_count_limit)
        self._max_execution_time = max_execution_time

        self._logger = LocalLogger().logger
        self._timer = None
        self._no_improvement_timer_limit = GlobalVariables.SOLUTION_IMPROVEMENT_TIMEOUT_SEC
        self._total_execution_time = 0

    def on_solution_callback(self):
        self._solution_count += 1
        self._logger.info(f"Feasible solution #{self._solution_count} found.")

        if self._solution_count >= self._solution_limit:
            self._logger.debug(f"Stopping search after {self._solution_count} solutions")
            self._stop_timer()
            self._timer = None
            self._logger.info(f"Stopping search __3")
            super().StopSearch()
            self._logger.info(f"Stopping search __4")
        else:
            self._reset_timer()

    def _stop_timer(self):
        if self._timer:
            self._timer.cancel()

    def _reset_timer(self):
        self._total_execution_time += self._no_improvement_timer_limit
        self._stop_timer()
        self._timer = Timer(self._no_improvement_timer_limit, self.StopSearch)
        self._timer.start()

    def StopSearch(self):
        self._logger.debug(f"{self._no_improvement_timer_limit} seconds without improvement")
        self._timer = None

        if self._solution_count >= 2 or self._total_execution_time > self._max_execution_time:
            self._logger.info(f"Stopping search __1")
            super().StopSearch()
            self._logger.info(f"Stopping search __2")
        else:
            self._logger.debug("Not enough solutions, continue search")
            self._reset_timer()

For example, my log looks like this:

[2024-04-12, 13:46:17 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:17 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:46:22 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:22 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:46:26 UTC] {solution_callback.py:23} INFO - Feasible solution #1 found.
[2024-04-12, 13:46:27 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:27 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:46:29 UTC] {solution_callback.py:23} INFO - Feasible solution #2 found.
[2024-04-12, 13:46:32 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:32 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:46:37 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:37 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:46:42 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:43 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:46:48 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:48 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:46:49 UTC] {solution_callback.py:46} DEBUG - 20 seconds without improvement
[2024-04-12, 13:46:49 UTC] {solution_callback.py:50} INFO - Stopping search __1
[2024-04-12, 13:46:49 UTC] {solution_callback.py:52} INFO - Stopping search __2
[2024-04-12, 13:46:53 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:53 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:46:58 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:46:58 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:47:03 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:47:03 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:47:08 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:47:08 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:47:13 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:47:13 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:47:18 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:47:18 UTC] {job.py:216} DEBUG - [heartbeat]
[2024-04-12, 13:47:23 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: production_capacity_balancing.production_shceduling_optimizer manual__2024-04-12T13:34:02+00:00 [running]> from DB
[2024-04-12, 13:47:23 UTC] {job.py:216} DEBUG - [heartbeat]

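This heartbeat message keeps appearing until I stop the task manually. I also tried replacing super().StopSearch() with super().stop_search(), which includes a has_response() check, but that did not help either.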

How can I avoid the freeze when the model is supposed to exit?

What is the best way to choose the number of workers based on the input parameters?

python optimization or-tools constraint-programming cp-sat
1 answer
0 votes

First, see: https://github.com/google/or-tools/blob/stable/ortools/sat/docs/troubleshooting.md#improving-performance-with-multiple-workers

I recommend using between 32 and 64 workers, and no more than the number of cores. Each worker you add roughly adds one more copy of the model in memory.
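As a rough sketch of the advice above, the worker count can be capped by the number of cores the process is actually allowed to use. The helper name choose_num_workers and the default cap of 64 are assumptions made for this example and are not part of OR-Tools. Note that CPU quotas applied to a Docker container through cgroups are not visible to this check.

```python
import os


def choose_num_workers(cap: int = 64) -> int:
    """Return a worker count no larger than `cap` or the usable core count.

    Hypothetical helper: the name and the default cap are assumptions
    made for this example.
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    try:
        # Linux only: the set of cores this process may run on.
        cores = len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the logical core count.
        cores = os.cpu_count() or 1
    return max(1, min(cap, cores))
```

The result would then be assigned to the solver before solving, for example solver.parameters.num_search_workers = choose_num_workers().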

The best way to help the solver is to give it a feasible solution as a hint (if one is easy to construct).
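For a job-shop problem, one simple way to obtain such a hint is a greedy schedule that places operations job by job. The sketch below is only an illustration: the data layout (each job as a list of (machine, duration) pairs), the function names, and the start_vars mapping are assumptions, not the asker's model. The hints are passed with the CP-SAT model's AddHint method.

```python
from typing import Dict, List, Tuple

# A job is an ordered list of (machine_id, duration) operations.
Job = List[Tuple[int, int]]


def greedy_schedule(jobs: List[Job]) -> Dict[Tuple[int, int], int]:
    """Build a feasible (not optimal) job-shop schedule.

    Operations are placed job by job; each one starts as soon as both its
    machine and the previous operation of the same job are free.
    Returns {(job_index, op_index): start_time}.
    """
    machine_free: Dict[int, int] = {}
    starts: Dict[Tuple[int, int], int] = {}
    for j, job in enumerate(jobs):
        job_free = 0
        for k, (machine, duration) in enumerate(job):
            start = max(job_free, machine_free.get(machine, 0))
            starts[(j, k)] = start
            job_free = start + duration
            machine_free[machine] = job_free
    return starts


def add_hints(model, start_vars, starts) -> None:
    """Pass the greedy start times to CP-SAT as hints.

    Assumes `model` is a cp_model.CpModel and `start_vars` maps
    (job_index, op_index) to the model's start variables
    (a hypothetical layout chosen for this example).
    """
    for key, value in starts.items():
        model.AddHint(start_vars[key], value)
```

The greedy schedule is usually far from optimal, but it gives the solver a valid starting point.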

Also, please state which version you are using.

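Finally, on very large models we sometimes do not check the time limit often enough. That said, we have made good progress on these checks, which is why I am asking about the version.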
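The time limit mentioned here is set through the solver's max_time_in_seconds parameter. A minimal sketch of using it as a backstop alongside the callback, with search logging switched on so it is visible whether the search is still running after the limit; the helper name apply_limits and the 600-second value in the usage comment are assumptions, and this is not a confirmed fix for the freeze:

```python
def apply_limits(solver, time_limit_s: float) -> None:
    """Set a wall-clock limit and enable search logging on a CP-SAT solver.

    `solver` is expected to be a cp_model.CpSolver; only its `parameters`
    attribute is used. The function name is hypothetical.
    """
    if time_limit_s <= 0:
        raise ValueError("time_limit_s must be positive")
    # Overall limit enforced by the solver itself.
    solver.parameters.max_time_in_seconds = float(time_limit_s)
    # Print the solver's search log to help diagnose a search that
    # does not terminate.
    solver.parameters.log_search_progress = True


# Possible usage (requires OR-Tools and an existing `model`):
# from ortools.sat.python import cp_model
# solver = cp_model.CpSolver()
# apply_limits(solver, time_limit_s=600)
# status = solver.Solve(model, ObjectiveEarlyStopping())
```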
