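I have a main program written in Python 3.9 that loads and controls modules using multiprocessing.
Each module does some short-lived work and then terminates. Sometimes a module starts running (it sends debug logs to stdout) but gets stuck on its first network access (usually a requests call, but IIRC modules that do a pymssql connect first have failed the same way).
The child process running the module then hangs forever in futex_wait_queue_me and cannot be killed with SIGTERM; SIGKILL is required to stop the process. The stack shows: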
cat /proc/12345/stack
[<0>] futex_wait_queue_me+0xb6/0x110
[<0>] futex_wait+0xe9/0x240
[<0>] do_futex+0x174/0xbc0
[<0>] __x64_sys_futex+0x146/0x1c0
[<0>] do_syscall_64+0x33/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
I suspected some locking on handles inherited from the parent control process, so I added os.closerange(3,100) to close all handles except stdin/stdout/stderr, but that did not help.
import logging
import os
from multiprocessing import Process

import requests
from setproctitle import setproctitle

log = logging.getLogger(__name__)

def startSub(params, config):
    sub = Process(target=SubModule, args=(params, config))
    sub.daemon = True
    sub.start()
    return sub

# sub will run independently until it terminates; no IPC, only sub.is_alive() from here

class SubModule:
    def __init__(self, params, config):
        os.closerange(3, 100)  # close inherited handles above stdin/stdout/stderr
        setproctitle("Submodule")  # identify this process
        log.debug("Starting stuff")  # this will show up correctly
        result = requests.get("https://somewhere.org")  # sometimes this will not come back
        log.debug("came back from requests")  # not reached when the call hangs
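Since the parent only polls sub.is_alive() and a stuck child does not react to SIGTERM, the cleanup I currently do by hand could be sketched as below. This is a minimal sketch, not the real controller: the names stop_process and run_with_deadline and the deadline and grace values are made up for illustration, and the fork context is chosen explicitly on the assumption that it matches the Linux default used above.

```python
import multiprocessing as mp

# Assumption: same start method as the Python 3.9 default on Linux.
_ctx = mp.get_context("fork")


def stop_process(proc, grace=2.0):
    """Ask proc to exit with SIGTERM, then fall back to SIGKILL."""
    proc.terminate()  # sends SIGTERM on POSIX
    proc.join(grace)
    if proc.is_alive():
        proc.kill()  # sends SIGKILL (available since Python 3.7)
        proc.join()
    return proc.exitcode


def run_with_deadline(target, args=(), deadline=30.0, grace=2.0):
    """Run target in a daemon child and force-stop it after deadline seconds.

    Returns (finished_in_time, exitcode).
    """
    proc = _ctx.Process(target=target, args=args, daemon=True)
    proc.start()
    proc.join(deadline)
    if not proc.is_alive():
        return True, proc.exitcode
    return False, stop_process(proc, grace)
```

A negative exitcode is the negated number of the signal that ended the child, so the second element shows whether SIGTERM was enough or SIGKILL was needed. This only works around the hang; it does not explain it.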
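The program runs in a python3.9-slim-bullseye container. After adding gdb to the container, I was able to capture a backtrace of the problem in the original container: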
(gdb) py-bt
Traceback (most recent call first):
<built-in method getaddrinfo of module object at remote 0x7f4fd9634350>
File "/usr/lib/python3.9/socket.py", line 953, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 329, in create_connection
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 681, in _new_conn
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
conn = self._new_conn()
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1012, in _validate_conn
conn.connect()
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1467, in urlopen
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 951, in send
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 1310, in request
File "/usr/lib/python3/dist-packages/requests/api.py", line 61, in request
So the program is clearly stuck in _socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM, 0):
(gdb) bt
#0 0x00007f4fd9c19d66 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f4fd9cb4bed in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f4fd9cb4fcb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f4fd9cb3955 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f4fd9cb474b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f4fd9cb48d3 in __resolv_context_get () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007f4fd9ca54e9 in gethostbyname2_r () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007f4fd9c7cfac in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x00007f4fd9c7db25 in getaddrinfo () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000615b1e in socket_getaddrinfo (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at ../Modules/socketmodule.c:6574
#10 0x000000000065e4fd in cfunction_call (func=<built-in method getaddrinfo of module object at remote 0x7f4fd9634350>, args=<optimized out>, kwargs=<optimized out>)
at ../Objects/methodobject.c:539
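This leaves me clueless for now. Can anyone shed some light on what is going on?
Note: this is apparently not the problem described at https://dzone.com/articles/how-to-deadlock-your-python-with-getaddrinfo, because I am not stuck waiting on a lock around the call into libc's getaddrinfo(); the lock is taken deep inside getaddrinfo() itself.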
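To narrow this down without requests or urllib3 in the picture, the lookup could be run on its own in a child process with a timeout. The following is only a diagnostic sketch: resolves_in_child and _resolve are hypothetical helper names, and comparing start_method="fork" with "spawn" rests on my unconfirmed suspicion that state inherited from the parent is involved.

```python
import multiprocessing as mp
import socket


def _resolve(host, port):
    # Same kind of lookup the traceback shows urllib3 issuing.
    socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)


def resolves_in_child(host, port=443, timeout=10.0, start_method="fork"):
    """Return True if a child process gets out of getaddrinfo within timeout.

    A lookup that fails with an error still counts as True here; only a
    child that is still running after timeout seconds counts as hung.
    """
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=_resolve, args=(host, port), daemon=True)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.kill()  # SIGTERM was not enough in my case, so use SIGKILL
        proc.join()
        return False
    return True
```

Calling resolves_in_child("somewhere.org") repeatedly from the real controller, once per start method, would show whether only forked children hang. With "spawn" the call has to sit under an `if __name__ == "__main__":` guard in the main script.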