What makes GNU sed so inefficient at substitutions using `.*`?


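SLES has a `supportconfig` command, which turned out to be very slow when creating an "update". A first check showed `sed` using 100% CPU for several minutes, so I suspected something rather inefficient, because `sed` is usually very efficient.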

Here is my standalone test case (the file was copied from a temporary file created by `supportconfig`):

# time sed -i -e '   s/\(.*[P|p]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(.*[P|p]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;     s/\(.*[P|p]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;        s/\(.*[P|p]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;    s/\(.*PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g; s/\(.*_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;      s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g;   s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;      s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;    s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g;  s/\(.*password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;        s/\(.*password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;     s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g' /tmp/uuu

real    6m53.283s
user    6m45.909s
sys     0m0.129s
# wc /tmp/uuu
  12234  123937 2711538 /tmp/uuu

So processing a 2.7 MB text file took almost 7 minutes. The only unusual thing may be that the text lines are rather long.
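
To check whether the lines really are unusually long, a small helper like the following could be used. It is only a sketch: `line_stats` is a hypothetical name, not part of `supportconfig`, and `/tmp/uuu` is simply the path from the test case above.

```shell
#!/bin/sh
# Sketch: report line count, average and maximum line length of a text file.
# line_stats is a hypothetical helper name introduced for illustration.
# Note: awk's length() counts characters or bytes depending on the awk
# implementation and locale.
line_stats() {
    awk '{ n = length($0); total += n; if (n > max) max = n }
         END { if (NR) printf "lines=%d avg=%d max=%d\n", NR, total / NR, max + 0 }' "$1"
}

# Run on the file from the post (or on $1) if it exists.
if [ -r "${1:-/tmp/uuu}" ]; then
    line_stats "${1:-/tmp/uuu}"
fi
```

The `wc` output above (12234 lines, 2711538 bytes) already implies an average of about 220 bytes per line; the maximum is what this helper adds.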

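I suspect that `.*` causes a lot of backtracking during matching, and maybe the programmer was just a bit lazy and did not provide a better regular expression. While `sed` was running I also ran `strace` on it, but that basically showed only `brk` system calls (memory allocation).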

So what causes such poor performance, and is there a way to improve it?

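Unfortunately I cannot provide the original input file, as it seems to contain URLs that are not freely downloadable. But I can try different `sed` invocations on the file I saved.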

The version of `sed` in use is `sed-4.2.2-7.3.1.x86_64`.

Transformed for readability

With `[P|p]` fixed to `[Pp]`, a more readable version of the `sed` command looks like this:

sed -i -e '
s/\(.*[Pp]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g;
s/\(.*_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g;
s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g;
s/\(.*password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g'

Obviously the purpose is to remove sensitive information such as passwords from the log files.

Note: In some cases the code block does seem to parse `*` as markup...

Attempted solution

Applying the modification suggested in https://stackoverflow.com/a/77820459/6607497 gave these results:

### original version of regex
# /tmp/uuu.sh

real    6m39.697s
user    6m39.092s
sys     0m0.032s
### modified version of regex
# /tmp/uuu.sh

real    0m0.237s
user    0m0.226s
sys     0m0.009s

When the original input file is reduced to a smaller test case (about 9 kB), the effect is still there, but less dramatic:

### original regexes
# /tmp/uuu.sh /tmp/eeee

real    0m0.119s
user    0m0.115s
sys     0m0.004s
### improved regexes
# /tmp/uuu.sh /tmp/eeee

real    0m0.007s
user    0m0.007s
sys     0m0.000s
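
Since the original input cannot be shared, a comparison of this kind can also be set up on synthetic data. The following is a minimal sketch under stated assumptions: the line count, the line length and the `fieldN` filler are invented, and only the two `sed` expressions are taken from the post. How large the gap is will depend on the `sed` build, so no timings are claimed here.

```shell
#!/bin/sh
# Sketch: run one redaction rule with and without the leading `.*` on
# synthetic data. Line count, line length and the "fieldN" filler are
# arbitrary assumptions; only the two sed expressions come from the post.
sample=$(mktemp)

# 200 long lines, each ending in a single "password = secretN" entry.
awk 'BEGIN {
    for (i = 1; i <= 200; i++) {
        line = ""
        for (j = 1; j <= 250; j++) line = line "field" (j % 10) " "
        print line "password = secret" i
    }
}' > "$sample"

with_prefix='s/\(.*password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g'
without_prefix='s/\(password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g'

sed -e "$with_prefix" "$sample" > "$sample.with"
sed -e "$without_prefix" "$sample" > "$sample.without"

# With one keyword per line, both variants should produce the same output.
if cmp -s "$sample.with" "$sample.without"; then
    echo "outputs identical"
else
    echo "outputs differ"
fi
```

Prefixing each of the two `sed` calls with `time`, as in the listings above, gives the duration of each variant on the machine at hand.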

Here is the obfuscated and trimmed test input file as BASE64 (because the lines are very long); pipe it into `base64 -d | gzip -d` to decode:

# gzip < /tmp/eeee | base64
H4sIAH/cp2UAA9VZbVPbuBb+vP4VZ9yZ7m6JnThQyk0J99JAu2x5yRDKbKfTySi2kmiQLY8kA7nL
7m+/R/ILSYAUmnRv6zKNrJfjo+ccPedIetZuf4KOiGOSRPC5/ajnmfMM6pmS9QFL6v+dpCmV4HmJ
SDyWaCpJqNkltTXeKB154ZiGFwpSKaIs1Mo5pUNJ1ZglI1BUXrKQws+9D739/iFLsuv+vpGRSqZo
v4fNVPaHQvZ7u93+bppyFhLNRKL6QbPf677sX29t9jc3fvadQ0EiI1LSVCimhZxARDTxfR8/mDex
RGnCOY0gJeEFGVFlWnsAN3B6O6p8buDAaJIQDsckpkVdVXzqcwPnVCpUHUu7MhznX1DwhijqeJ63
5t3z3Fd5b8fHPWt3S1XBYWuo0H96E6VpPKu44tSL0Xj4w+mIhBOsO8wLR7b6aTgETa+Rl3LjGVSF
bXNsbe9w3/ttF3uhfb2uELzUYUxmBRmnAes0cOs08BsbjWH3kjBOBowzPYH9a00TC33QBBRqdfBf
Gi3u0cHigDoUCqDj5UrYyp5xxC/rkDsuoOOC6T/tuAt1+EjVDA69GRTKynlAF+pQfO5BWzyIg/Ps
H6MGz7uOucj0LUds/3soxIBIZ1tpSUm842zHVClcs6AnKW27LBkKd+ceKnlOcCW/XpJPciH+dr34
6AOff5hzvjRyISVNDy4Q8ThT+vYVEqShtpsvChcuc2Zpu8acLqrDKbJK2224gJqFY1si0hTy+Zkh
SSRk281957ADz7l+PdY6Va16/erqyleZon4o4vrzkX7tgsrQD+SkHPD0RedalPLxS1imdYcd3NJr
OEtwzuY7IwSLygoG1H4spM4xw/HlcBeGnFwaFLonJ4cuMDWw/YeEK+re2qas2dk2oA05G6IxWUz7
GhF/td5YX29uNRBhTa+xptlobnhBw1sPzhpBq9HAv7VG4NZ3tkvF8q/u0BKh7fpcy3ZEVShZaua/
82TA0X+mx2/XC3zuuE/JZ9+DA32ZMVfoQHPU/nQP8m4l/H+dqH6frxwcf/jj4UhwH8RP8ZjvyFtW
7hi9pXiltzpaqRyh8UibzyMkhkO0EhBATFNJx4YaLqlTRn6VMW06VREXBhnjGpCuCSiMTJjUhSKi
YFT3q2FnYwopJxrRjIFEEQZfRc1YHIJFSCiNFAyliKshGoeo2ExYaXxhiRdyRhMNETXxWoEWts+V
kDzK466qBscCB6XiCqeecRgj53mYUZivkySkdm6ZRm2rASZViQnGWGnydGVtr/z7M6RqUIEVSotx
/igaQ29sVNToBsrK1DQcJ4KL0aQaFVKp2bAiKRJKgQjoKYRqZmjVnxJM+svlxBTQShFEhCj1z9P2
Ckm52DfMrRnMT9TXsune+d7t0tEym105tuKHptI7m6oZcy5lzJm92RctNK/FYnPlsr1S9mOZ7kF7
LSK6nVyzxemRCSSpCRrOLUXNbk7HlKcKJiKDmI0kMXwBZNqVDV9NUUQhaMrqQcOSQBDYH4FrXIKy
cFr6un/xBM0aDEq6QNAvmc34yzwfrsYMCYFIChFToUhQrwz3AchAi9gqd8AaGtsMVi3kuCRCxuM1
oxESlJeManDw5gh+J5dk01IQUnqSxQNUWgwLMZwNJJEM1fjFODG9JnGKUIkU44TiXsP/l7/1q+/8
ZNCt5/D+NI/0bhgazi/4exZz5DeWhDyLcEYsMeDLR0wLZzWoLOtbU8aFDYmJZREzTG2Y2fhQOAmx
RY9JstB4hWimFeVDH7ol/qiW0RtDkiq/YoyRqQx9trSbytIUfR9nYWAiOo9IeiwphQklNrbkXXIy
KIxhUS+NAJuFsCvGOdK+RaRHU3QfY5NmI3hlNcGogTFyBvSHyaQ+tzGsl/vkfNt+TmUVmx6xdzfb
9tPuEe79NRyUq7YFXA2qY4BkyEaZzAW+ZQjWIqn2GIDq0B4B2P8wt0uGzicTmz87ccbNOUB+HNYu
FgdVren6Xy6oTCj/daazn1cqHJSjVct/vKAmsyQxmcAqFA4ljdDLGNKYH9WPO52pCgcTVWm5EF6c
7h+dnO/vwZuP6IHd7snpWefk+O3BuxdOigEdM5poYaevt5VMY/DOrdvjRBOKftDw1/31TW/dD7b8
DT+PEE7Pf+njc+Yb97PnMrjw6zl1RcVv3aTeYSHlgtKUcITbNwQt8VMz2vU00RmSzl6xEE3+9lZk
mOWs0u1kNpg0/cAzv5iJeUZBr9BwyWMpZc6lpmADz1N2Ts6nP11mbTxkVLqt2yyr5hbOh5U2z6q5
JgrjWxGGa24uAmtOi0BFI6w1TmIEPTW1qpmzOG9zowaBh3mGCC8o5udXTI/hQ8JZzAwdnTOpkajg
CLNKk3vXoCuZkOYgoDdFojVkaVw9uUHe0wmqhcHUZPZG/5knn4fUqk80NhpaMolZIwCbM+EffDjr
YC96nTJM/MtuzZdeYwuzq/lu01zerxDa7ZwdnO9jszkNw3fM67n7V20O/LtpyawVFtnAONOUHf76
vCQjTHmLY8/JE1wumaQtsGmOk0negnIrq8Kw2sp+AyrqfSsqyqmgr8UFTR7DWauc03Lb9tXz8Tez
19/ft65fZ4f1H8IOXze3jR9hbt86PVnmBqjYAZI8BuAmwCQMnsIuN7DLGVH330s96lniLraSsJ+Q
gbl8uYF33XfQMbdR9jLY3iph6QyDFDhL3bneuXpdXsLaolonKG4El+DUlUgo7jKxZNQqEb+tRYEA
TrO48J06ki/vJJ9gx+UlFLeec9qW5RuYEKWbK81yTZpQ5ikrWGo0uUQ1R5Km4DGzpbqeVO+X0Dna
Ozw43neOT/rd05M/Pra5CAkf42a2FjRf+Q38F9Re/P7h+P0Lx+QyfSugbYqY1eQNrfWguVV3hlWr
0zvpvO8VAh2FiaoqW0YiHaOjFG82O1ogMhFF4wKt/gfLVc0SqSIAAA==

Tags: regex, performance, sed, substitution

1 answer

As suspected, those leading `.*` matches perform very poorly. After removing them, performance improved dramatically (by a factor of about 1500):

real    0m0.230s
user    0m0.225s
sys     0m0.005s

For reference, this is the canonical command used (I also dropped `-i` to keep the input file unchanged for repeatable tests):

time sed -e '
s/\([Pp]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\([Pp]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\([Pp]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\([Pp]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g;
s/\(_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g;
s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g;
s/\(password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g
' "${1:-/tmp/uuu}" >"${2:-/dev/null}"
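
A note on why the leading `.*` can be dropped, and on the one case where the two forms behave differently. An unanchored regular expression is already tried at every position in the line, so the prefix is not needed to find the keyword. However, a greedy leading `.*` extends the first group up to the last keyword on the line, while the shortened rule matches from the first keyword. The sketch below shows this on a made-up line with two `password =` entries; the sample line and the shortened `*REMOVED*` marker are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: compare both rule variants on a line with two "password =" entries.
# The sample line is made up for illustration.
line='password = one; other_password = two'

with_prefix=$(printf '%s\n' "$line" |
    sed -e 's/\(.*password[[:space:]]*=\).*/\1 *REMOVED*/')
without_prefix=$(printf '%s\n' "$line" |
    sed -e 's/\(password[[:space:]]*=\).*/\1 *REMOVED*/')

printf 'with leading .*:    %s\n' "$with_prefix"
printf 'without leading .*: %s\n' "$without_prefix"
```

With the leading `.*` only the text after the last keyword is replaced, so `one` stays visible; without it everything after the first keyword is replaced. For lines with a single keyword the two forms give the same result.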