Is there a Python script or tool available that can remove comments and docstrings from Python source code?
It should handle cases like these:
"""
aas
"""
def f():
m = {
u'x':
u'y'
} # faake docstring ;)
if 1:
'string' >> m
if 2:
'string' , m
if 3:
'string' > m
So far I have come up with a simple script that uses the tokenize module and removes comment tokens. It seems to work quite well, except that I am not able to remove docstrings in all cases. See if you can improve it to remove docstrings as well.
import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between.
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go through all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok
        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()
            elif t_type == tokenize.STRING:
                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass
        except SkipException:
            pass
        else:
            processed_tokens.append(tok)
            last_token = tok
    return tokenize.untokenize(processed_tokens)
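For completeness, a minimal driver for the script above (a sketch only; it assumes Python 2, since cStringIO is imported, and takes the file to strip as the first command-line argument):

import sys

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        print(remove_comments(f.read()))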
I would also like to test it on a larger body of scripts with good unit-test coverage. Can you recommend such an open source project?
I am the author of the "mygod, he has written a python interpreter using regex..." (i.e. pyminifier) mentioned at that link below =).
I just wanted to chime in and say that I have improved the code quite a bit using the tokenizer module (which I discovered thanks to this question =) ).
You'll be happy to note that the code no longer relies so much on regular expressions and uses the tokenizer to great effect. Anyway, here is the remove_comments_and_docstrings() function from pyminifier:

import cStringIO, tokenize
def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that end a statement
                    # and newlines inside of operators such as parens,
                    # brackets, and curly braces. Newlines inside of
                    # operators are NL and newlines that end a statement
                    # are NEWLINE.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
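For reference, applying it is just a matter of feeding it the source as a string (a minimal usage sketch; the file name example.py is only illustrative):

with open('example.py') as f:
    print(remove_comments_and_docstrings(f.read()))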
This gets the job done:
""" Strip comments and docstrings from a file.
"""
import sys, token, tokenize
def do_file(fname):
""" Run on just one file.
"""
source = open(fname)
mod = open(fname + ",strip", "w")
prev_toktype = token.INDENT
first_line = None
last_lineno = -1
last_col = 0
tokgen = tokenize.generate_tokens(source.readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
if 0: # Change to if 1 to see the tokens fly by.
print("%10s %-14s %-20r %r" % (
tokenize.tok_name.get(toktype, toktype),
"%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
ttext, ltext
))
if slineno > last_lineno:
last_col = 0
if scol > last_col:
mod.write(" " * (scol - last_col))
if toktype == token.STRING and prev_toktype == token.INDENT:
# Docstring
mod.write("#--")
elif toktype == tokenize.COMMENT:
# Comment
mod.write("##\n")
else:
mod.write(ttext)
prev_toktype = toktype
last_col = ecol
last_lineno = elineno
if __name__ == '__main__':
do_file(sys.argv[1])
I leave stub comments in place of the docstrings and comments since it simplifies the code. If you remove them completely, you also have to get rid of the indentation before them.
Here is a modification of Dan's solution to make it work for Python 3, remove empty lines, and make it ready to use:
import io, tokenize

def remove_comments_and_docstrings(source):
    io_obj = io.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        if token_type == tokenize.COMMENT:
            pass
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                if prev_toktype != tokenize.NEWLINE:
                    if start_col > 0:
                        out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    out = '\n'.join(l for l in out.splitlines() if l.strip())
    return out

with open('test.py', 'r') as f:
    print(remove_comments_and_docstrings(f.read()))
I found a simpler way to do this with the ast and astunparse modules (available from pip). It converts the code text into a syntax tree, and then the astunparse module prints the code back out without the comments. I had to strip the docstrings with a simple matching step, but it seems to work. I have been looking over the output, and so far the only downside of this method is that it strips all the newlines from the code.
import ast, astunparse

with open('my_module.py') as f:
    lines = astunparse.unparse(ast.parse(f.read())).split('\n')
    for line in lines:
        if line.lstrip()[:1] not in ("'", '"'):
            print(line)
Try testing each chunk of tokens ending in NEWLINE. Then the correct pattern for a docstring (including the case where it serves as a comment but isn't assigned to __doc__) is, I believe (assuming the match is performed from the start of the file or right after a NEWLINE):
( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE
This should handle all the tricky cases: string concatenation, line continuation, module/class/function docstrings, and a comment on the same line after the string. Note that there is a difference between NL and NEWLINE tokens, so we don't need to worry about a lone string on a line inside an expression.
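A minimal detection sketch of that pattern, assuming Python 3's tokenize module (find_docstrings is a name invented for this example; it only reports the spans and leaves the actual removal to the caller):

import io, tokenize

def find_docstrings(source):
    """Yield (start, end) positions of logical lines matching
    ( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE."""
    block = []  # tokens accumulated since the last NEWLINE
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        block.append(tok)
        if tok.type in (tokenize.NEWLINE, tokenize.ENDMARKER):
            # Drop structural tokens; a docstring statement leaves STRING+ only.
            kinds = [t.type for t in block
                     if t.type not in (tokenize.INDENT, tokenize.DEDENT,
                                       tokenize.COMMENT, tokenize.NL,
                                       tokenize.NEWLINE, tokenize.ENDMARKER)]
            if kinds and all(k == tokenize.STRING for k in kinds):
                strings = [t for t in block if t.type == tokenize.STRING]
                yield strings[0].start, strings[-1].end
            block = []

for start, end in find_docstrings(open('test.py').read()):
    print('docstring from', start, 'to', end)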
I just used the code given by Dan McDougall and found two problems.
I think I have solved both of them by adding a few lines (before the return):
# Removing unneeded newlines from string
buffered_content = cStringIO.StringIO(content)  # Takes the string generated by Dan McDougall's code as input
content_without_newlines = ""
previous_token_type = tokenize.NEWLINE
for tokens in tokenize.generate_tokens(buffered_content.readline):
    token_type = tokens[0]
    token_string = tokens[1]
    if previous_token_type == tokenize.NL and token_type == tokenize.NL:
        pass
    else:
        # add necessary spaces
        prev_space = ''
        next_space = ''
        if token_string in ['and', 'as', 'or', 'in', 'is']:
            prev_space = ' '
        if token_string in ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'if', 'yield', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'is', 'return', 'def', 'for', 'lambda']:
            next_space = ' '
        content_without_newlines += prev_space + token_string + next_space  # This will be our new output!
    previous_token_type = token_type
I was trying to create a program that would count all the lines in a Python file, ignoring blank lines, comment lines, and docstrings. Here is my solution:
from collections import Counter

with open(file_path, 'r', encoding='utf-8') as pyt_file:
    count = 0
    docstring = False
    for i_line in pyt_file.readlines():
        cur_line = i_line.rstrip().replace(' ', '')
        if cur_line.startswith('"""') and not docstring:
            marks_counter = Counter(cur_line)
            if marks_counter['"'] == 6:
                count -= 1
            else:
                docstring = True
        elif cur_line.startswith('"""') and docstring:
            count -= 1
            docstring = False
        if len(cur_line) > 0 and not cur_line.startswith('#') and not docstring:
            count += 1
My problem was detecting the docstrings (both one-line and multi-line ones), so if you want to remove them, I suppose you could try the same flag-based solution (a sketch follows below).
P.S. I know this is an old question, but while dealing with my problem I couldn't find anything simple and effective.
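For illustration, the same flag idea rearranged to drop docstring lines instead of counting them (a hypothetical sketch; strip_docstring_lines is a made-up name, and like the counter above it only recognizes """-style docstrings and can be fooled by triple quotes inside ordinary strings):

def strip_docstring_lines(lines):
    kept = []
    docstring = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith('"""') and not docstring:
            # a one-line docstring opens and closes on the same line
            if stripped.count('"""') < 2:
                docstring = True
            continue  # drop the opening (or one-line) docstring line
        if docstring:
            if stripped.endswith('"""'):
                docstring = False
            continue  # drop lines inside a multi-line docstring
        kept.append(line)
    return kept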
I suggest this code (based on @SurpriseDog's answer):
from typing import Any
import ast
from ast import Constant
import astunparse  # pip install astunparse

class NewLineProcessor(ast.NodeTransformer):
    """class for keeping '\n' chars inside python strings during ast unparse"""
    def visit_Constant(self, node: Constant) -> Any:
        if isinstance(node.value, str):
            node.value = node.value.replace('\n', '\\n')
        return node

with open(file_from) as f:
    tree = ast.parse(f.read())
tree = NewLineProcessor().visit(tree)
lines = astunparse.unparse(tree).split('\n')
print(lines)
Although this question was asked more than ten years ago, I wrote this to solve the same problem: stripping docstrings and comments out before compilation.
import ast
import astor
import re

def remove_docs_and_comments(file):
    with open(file, "r") as f:
        code = f.read()
    parsed = ast.parse(code)
    for node in ast.walk(parsed):
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Str):
            # set value to empty string
            node.value = ast.Constant(value='')
    formatted_code = astor.to_source(parsed)
    pattern = r'^.*"""""".*$'  # remove empty """"""
    formatted_code = re.sub(pattern, '', formatted_code, flags=re.MULTILINE)
    return formatted_code

remove_docs_and_comments("your_script.py")
It will return the compacted code without docstrings and comments.
This is not a trivial task. Consider the code below: which parts would you want to remove?
# bug.py
"""\
"""
''' #
abcaldskf\
'''
a = ''' #hello"\n
# , "\n world!
'''
b = '#' # "asdf"
#\n"y = ''' #hello"\n
c = '"' """''"""'''"'''
'''"''' + '\n' # '''''""""''"""''
d = \
"""aaa""" + 'zzz' # aa
'''"zz''' '"' "'" '' # '"""
"" + a #
''
Probably only a DFA is the right approach here. This is my solution:
import re

codebook = { '\\': 0, "'": 1, '"': 2, '#': 3, '\n': 4 }
dfa = [[ -1, -1, -1, -1, -1, -1 ],
       [ -1,  2,  9, 18,  1,  1 ],
       [ 16,  3,  8,  8, -1,  8 ],
       [ -1,  4,  9, 18,  1,  1 ],
       [  4,  5,  4,  4,  4,  4 ],
       [  4,  6,  4,  4,  4,  4 ],
       [  4,  7,  4,  4,  4,  4 ],
       [ -1,  2,  9, 18,  1,  1 ],
       [ 16,  1,  8,  8, -1,  8 ],
       [ 17, 15, 10, 15, -1, 15 ],
       [ -1,  2, 11, 18,  1,  1 ],
       [ 11, 11, 12, 11, 11, 11 ],
       [ 11, 11, 13, 11, 11, 11 ],
       [ 11, 11, 14, 11, 11, 11 ],
       [ -1,  2,  9, 18,  1,  1 ],
       [ 17, 15,  1, 15, -1, 15 ],
       [  8,  8,  8,  8, -1,  8 ],
       [ 15, 15, 15, 15, -1, 15 ],
       [ 18, 18, 18, 18,  1, 18 ]]

def run_dfa(text, callback):
    s = 1
    for i, c in enumerate(text):
        s = dfa[s][codebook.get(c, 5)]
        if s <= 0:
            break
        callback(i, s)
    return s == 1

class CommentDetector:
    def __init__(self, text: str):
        self.text = text
        self.prev_cmt = -1
        self.mls_cmts, self.num_cmts = [], []

    def __call__(self, i, s):
        cmt = 0
        if (s >= 4 and s <= 7) or (s >= 11 and s <= 14):
            cmt = 1
        elif s == 18:
            cmt = 2
        if self.prev_cmt != 1 and cmt == 1:
            assert not self.mls_cmts or self.mls_cmts[-1][1] >= 0
            self.mls_cmts.append([i - 2, -1])
        elif self.prev_cmt == 1 and cmt != 1:
            assert self.mls_cmts and i - self.mls_cmts[-1][0] > 6 and self.mls_cmts[-1][-1] == -1
            self.mls_cmts[-1][1] = i
        elif self.prev_cmt != 2 and cmt == 2:
            assert not self.num_cmts or self.num_cmts[-1][1] >= 0
            self.num_cmts.append([i, -1])
        elif self.prev_cmt == 2 and cmt != 2:
            assert self.num_cmts and self.num_cmts[-1][-1] == -1
            self.num_cmts[-1][1] = i
        self.prev_cmt = cmt

    def clean_effective_mls_cmts(self):
        text = self.text[:]
        for beg, end in self.num_cmts:
            # text[beg:end] = [' '] * (end - beg)
            text = text[:beg] + ' ' * (end - beg) + text[end:]
        for i in range(len(self.mls_cmts) - 1, -1, -1):
            beg = text.rfind('\n', 0, self.mls_cmts[i][0]) + 1
            end = text.find('\n', self.mls_cmts[i][1])
            subtext = text[beg:self.mls_cmts[i][0]] + text[self.mls_cmts[i][1]:end]
            if re.search(r'[^\s]+', subtext):
                del self.mls_cmts[i]

    def remove_all_comments(self):
        # Delete spans back to front so earlier deletions don't shift
        # the offsets of later ones.
        text = self.text
        for beg, end in sorted(self.mls_cmts + self.num_cmts, reverse=True):
            text = text[:beg] + text[end:]
        return text

    def get_comments_mask(self):
        mask = [0] * len(self.text)
        for beg, end in self.mls_cmts:
            mask[beg:end] = [1] * (end - beg)
        for beg, end in self.num_cmts:
            mask[beg:end] = [2] * (end - beg)
        return mask

text = open('bug.py').read().strip('\n\t ') + '\n'
text = text.replace('\r\n', '\n').replace('\r', '\n').replace('\\\n', '')
cmt_det = CommentDetector(text)
assert run_dfa(text, cmt_det)
cmt_det.clean_effective_mls_cmts()

mask = cmt_det.get_comments_mask()
ind_chars = [' ', '^', '*']
for i, c in enumerate(text):
    mask[i] = '\n' if c == '\n' else ind_chars[mask[i]]
mask_lines = ''.join(mask).splitlines()
for i, line in enumerate(text.splitlines()):
    print(f'[{i:04d}] {line}')
    print(f'[{i:04d}] {mask_lines[i]}')
Output:
[0000] # bug.py
[0000] ********
[0001] """
[0001] ^^^
[0002] """
[0002] ^^^
[0003] ''' #
[0003] ^^^^^
[0004] abcaldskf'''
[0004] ^^^^^^^^^^^^
[0005] a = ''' #hello"\n
[0005]
[0006] # , "\n world!
[0006]
[0007] '''
[0007]
[0008] b = '#' # "asdf"
[0008] ********
[0009] #\n"y = ''' #hello"\n
[0009] *********************
[0010] c = '"' """''"""'''"'''
[0010]
[0011] '''"''' + '\n' # '''''""""''"""''
[0011] ******************
[0012] d = """aaa""" + 'zzz' # aa
[0012] ****
[0013] '''"zz''' '"' "'" '' # '"""
[0013] ******
[0014] "" + a #
[0014] *
[0015] ''
[0015]
In the output, each line of code is represented by two lines: the first is the original code, and the second uses ^ and * to mark which parts of the line above are triple-quoted string comments (^) and which are ordinary # comments (*).
CommentDetector can reasonably be modified to fit your needs.
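For example, to get the stripped source itself rather than the mask, the driver above can simply continue with (a short usage sketch):

# drop every detected comment span and print what remains
print(cmt_det.remove_all_comments())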