Python 3 multiprocessing child process sometimes hangs in futex_wait_queue_me on its first network access


I have a main program written in Python 3.9 that will eventually use multiprocessing to load and control modules.

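The modules do some short-lived work and terminate. Sometimes a module starts running (it sends debug logs to stdout) but gets stuck on its first network access (usually a requests call, but IIRC modules that do a pymssql connect first have failed the same way).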

The child process running the module then hangs forever in futex_wait_queue_me and cannot be killed with SIGTERM; SIGKILL is required to stop the process. The stack shows

cat /proc/12345/stack
[<0>] futex_wait_queue_me+0xb6/0x110
[<0>] futex_wait+0xe9/0x240
[<0>] do_futex+0x174/0xbc0
[<0>] __x64_sys_futex+0x146/0x1c0
[<0>] do_syscall_64+0x33/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
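
Since the parent only polls sub.is_alive(), a hung child looks the same as a busy one. A minimal sketch of how a monitor could recognise this particular kernel wait, assuming Linux and read access to /proc/<pid>/stack (which normally requires root). The helper names kernel_frames, looks_like_futex_wait and read_kernel_stack are hypothetical and not part of any library:

```python
import re

# Matches lines such as "[<0>] futex_wait+0xe9/0x240" and captures the symbol name.
_FRAME_RE = re.compile(r"^\[<[0-9a-fx]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def kernel_frames(stack_text):
    """Return the kernel symbol names found in /proc/<pid>/stack text, innermost first."""
    frames = []
    for line in stack_text.splitlines():
        match = _FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(1))
    return frames


def looks_like_futex_wait(stack_text):
    """True if the innermost kernel frame is one of the futex_wait* functions."""
    frames = kernel_frames(stack_text)
    return bool(frames) and frames[0].startswith("futex_wait")


def read_kernel_stack(pid):
    """Read the kernel stack of a process (assumption: Linux, sufficient privileges)."""
    with open(f"/proc/{pid}/stack") as fh:
        return fh.read()
```

A futex wait on its own is normal for any thread blocked on a lock, so a check like this only becomes meaningful when the same frame persists well beyond the module's expected runtime.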

I suspected some locking on handles inherited from the parent controlling process, so I added os.closerange(3,100) to close all handles except stdin/stdout/stderr, but that didn't help.

import logging
import os
from multiprocessing import Process

import requests
from setproctitle import setproctitle

log = logging.getLogger(__name__)

def startSub(params, config):
  # sub will run independently until it terminates; no IPC, only sub.is_alive() from here
  sub = Process(target=SubModule, args=(params, config))
  sub.daemon = True
  sub.start()
  return sub

class SubModule:
  def __init__(self, params, config):
    os.closerange(3, 100)                          # close inherited handles, keep stdin/stdout/stderr
    setproctitle("Submodule")                      # identify this process
    log.debug("Starting stuff")                    # this will show up correctly
    result = requests.get("https://somewhere.org") # sometimes this will not come back
    log.debug("came back from requests")           # not reached when the hang occurs
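
One experiment that might help narrow this down is to launch the child with the "spawn" start method instead of the Linux default "fork". A forked child gets a copy of the parent's memory, including any lock that another parent thread happened to hold at the moment of the fork, while a spawned child starts in a fresh interpreter. Whether that is what happens here is only an assumption; the traces do not confirm it. The helper name make_spawned_sub is hypothetical:

```python
import multiprocessing


def make_spawned_sub(target, params, config):
    """Build (but do not start) a daemon child that uses the 'spawn' start method.

    With 'spawn', target and its arguments must be picklable, and target must be
    importable from the child (for example a class defined at module top level).
    """
    ctx = multiprocessing.get_context("spawn")
    sub = ctx.Process(target=target, args=(params, config))
    sub.daemon = True
    return sub
```

Usage would mirror startSub: sub = make_spawned_sub(SubModule, params, config), then sub.start(). If the hang disappears under "spawn", that points toward state inherited through fork rather than toward the network or the resolver configuration.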

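The program runs in a python3.9-slim-bullseye container. After adding gdb to the container, I was able to capture a backtrace of the problem in the original container: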

(gdb) py-bt
Traceback (most recent call first):
  <built-in method getaddrinfo of module object at remote 0x7f4fd9634350>
  File "/usr/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 329, in create_connection
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 681, in _new_conn
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1012, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1467, in urlopen
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 951, in send
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 1310, in request
  File "/usr/lib/python3/dist-packages/requests/api.py", line 61, in request

It is apparent that the program is stuck in _socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM, 0):

(gdb) bt
#0  0x00007f4fd9c19d66 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4fd9cb4bed in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f4fd9cb4fcb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f4fd9cb3955 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f4fd9cb474b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f4fd9cb48d3 in __resolv_context_get () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f4fd9ca54e9 in gethostbyname2_r () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007f4fd9c7cfac in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00007f4fd9c7db25 in getaddrinfo () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x0000000000615b1e in socket_getaddrinfo (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at ../Modules/socketmodule.c:6574
#10 0x000000000065e4fd in cfunction_call (func=<built-in method getaddrinfo of module object at remote 0x7f4fd9634350>, args=<optimized out>, kwargs=<optimized out>)
    at ../Objects/methodobject.c:539
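
For catching the Python-level stack without attaching gdb, the standard library's faulthandler module can dump all thread tracebacks if a call has not returned after a timeout. A minimal sketch, assuming a hang on the order of tens of seconds is already abnormal for these modules. The wrapper name run_with_watchdog and the 30 second default are hypothetical:

```python
import faulthandler
import sys


def run_with_watchdog(func, timeout_s=30.0, file=None):
    """Call func(); if it has not returned after timeout_s seconds, dump the
    tracebacks of all threads to `file` (default: sys.stderr) and keep waiting."""
    out = file if file is not None else sys.stderr
    # The dump is written by a separate watchdog thread, so it still fires
    # while the calling thread is blocked inside a C function.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
    try:
        return func()
    finally:
        # Disarm the watchdog once the call returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

In the child this could wrap the first network call, e.g. run_with_watchdog(lambda: requests.get("https://somewhere.org")). Passing exit=True to dump_traceback_later would instead terminate the stuck process after the dump, which may be acceptable for short-lived modules but hides the state that gdb would otherwise inspect.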

This leaves me clueless. Can anyone shed light on what is going on?

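Note: this is apparently not the problem described at https://dzone.com/articles/how-to-deadlock-your-python-with-getaddrinfo, because I am not stuck waiting on a lock around the call to libc's getaddrinfo(); the lock wait happens deep inside getaddrinfo() itself.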

python-3.x linux multiprocessing locking