Script to remove Python comments/docstrings


Is there a Python script or tool available that can remove comments and docstrings from Python source code?

It should handle cases like this:

"""
aas
"""
def f():
    m = {
        u'x':
            u'y'
        } # faake docstring ;)
    if 1:
        'string' >> m
    if 2:
        'string' , m
    if 3:
        'string' > m

In the end I came up with a simple script that uses the tokenize module and removes the comment tokens. It seems to work pretty well, except that I cannot remove docstrings in all cases. See if you can improve it to remove docstrings as well.

import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go thru all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok

        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()

            elif t_type == tokenize.STRING:

                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass

        except SkipException:
            pass
        else:
            processed_tokens.append(tok)

        last_token = tok

    return tokenize.untokenize(processed_tokens)

I would also like to test it on a larger set of scripts that have good unit-test coverage. Can you recommend such an open source project?

10 Answers

28 votes

I'm the author of the "my god, he has written a Python interpreter using regular expressions..." project (i.e. pyminifier) mentioned at that link below =).
I just wanted to chime in and say that I've improved the code quite a bit using the tokenizer module (which I discovered thanks to this question =) ).

You'll be happy to note that the code no longer relies so heavily on regular expressions and uses tokenize to great effect. Anyway, here is pyminifier's remove_comments_and_docstrings() function (note: it works properly in the edge cases where the previously posted code breaks):

import cStringIO, tokenize
def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that start a new statement
                    # and newlines inside of operators such as parens, brackets,
                    # and curly braces.  Newlines inside of operators are
                    # NL and newlines that start new code are NEWLINE.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out

13 votes

This gets the job done:

""" Strip comments and docstrings from a file.
"""

import sys, token, tokenize

def do_file(fname):
    """ Run on just one file.

    """
    source = open(fname)
    mod = open(fname + ",strip", "w")

    prev_toktype = token.INDENT
    first_line = None
    last_lineno = -1
    last_col = 0

    tokgen = tokenize.generate_tokens(source.readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
        if 0:   # Change to if 1 to see the tokens fly by.
            print("%10s %-14s %-20r %r" % (
                tokenize.tok_name.get(toktype, toktype),
                "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
                ttext, ltext
                ))
        if slineno > last_lineno:
            last_col = 0
        if scol > last_col:
            mod.write(" " * (scol - last_col))
        if toktype == token.STRING and prev_toktype == token.INDENT:
            # Docstring
            mod.write("#--")
        elif toktype == tokenize.COMMENT:
            # Comment
            mod.write("##\n")
        else:
            mod.write(ttext)
        prev_toktype = toktype
        last_col = ecol
        last_lineno = elineno

if __name__ == '__main__':
    do_file(sys.argv[1])

I leave stub comments in the place of docstrings and comments because it simplifies the code. If you remove them completely, you also have to remove the indentation before them.
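
A rough before/after sketch (not part of the original answer) of what the script writes, and of the stray indentation that dropping the docstring outright would leave behind:

# Input file:
def f():
    """Docstring."""
    return 1

# Output written to the ",strip" file -- the stub keeps the layout intact:
def f():
    #--
    return 1

# If the STRING token were simply skipped, the four spaces already written
# for the INDENT token would remain as a stray whitespace-only line, so the
# writer would also need extra logic to suppress that indentation.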


10 votes

Here is a modification of Dan's solution that makes it work for Python 3, also removes empty lines, and makes it ready to use:

import io, tokenize, re
def remove_comments_and_docstrings(source):
    io_obj = io.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        if token_type == tokenize.COMMENT:
            pass
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                if prev_toktype != tokenize.NEWLINE:
                    if start_col > 0:
                        out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    out = '\n'.join(l for l in out.splitlines() if l.strip())
    return out
with open('test.py', 'r') as f:
    print(remove_comments_and_docstrings(f.read()))

5 votes

I found a simpler way to do this using the ast and astunparse modules (available from pip). It converts the code text into a syntax tree, and then the astunparse module prints the code back out without the comments. I had to strip the docstrings with a simple match, but it seems to work. I've been looking through the output, and so far the only downside of this method is that it strips all the newlines from the code.

import ast, astunparse

with open('my_module.py') as f:
    lines = astunparse.unparse(ast.parse(f.read())).split('\n')
    for line in lines:
        if line.lstrip()[:1] not in ("'", '"'):
            print(line)

1 vote

Try testing each block of tokens ending with NEWLINE. Then I believe the correct docstring pattern (including the case where it acts as a comment but is not assigned to __doc__) is (assuming the match is performed from the start of the file or right after a NEWLINE):

( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE

This should handle all the tricky cases: string concatenation, line continuation, module/class/function docstrings, and a comment on the same line after the string. Note that there is a difference between the NL and NEWLINE tokens, so we don't need to worry about a single string on its own line inside an expression.
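
A rough sketch (not from the original answer) of checking this pattern directly against the token types of each NEWLINE-terminated chunk. The helper names matches_docstring_pattern and docstring_spans are illustrative only; NL tokens are filtered out before matching, and a comment-only line immediately before the docstring is not handled:

import io
import tokenize

def matches_docstring_pattern(token_types):
    """Match ( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE against a list of token types."""
    i = 0
    if token_types[:1] == [tokenize.DEDENT]:
        while i < len(token_types) and token_types[i] == tokenize.DEDENT:
            i += 1
    elif token_types[:1] == [tokenize.INDENT]:
        i = 1
    if i >= len(token_types) or token_types[i] != tokenize.STRING:
        return False
    while i < len(token_types) and token_types[i] == tokenize.STRING:
        i += 1
    if i < len(token_types) and token_types[i] == tokenize.COMMENT:
        i += 1
    return i == len(token_types) - 1 and token_types[i] == tokenize.NEWLINE

def docstring_spans(source):
    """Yield (start, end) positions of the string tokens of chunks that match the pattern."""
    chunk = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        chunk.append(tok)
        if tok.type == tokenize.NEWLINE:
            # NL tokens (blank lines, newlines inside brackets) are not part of the pattern.
            types = [t.type for t in chunk if t.type != tokenize.NL]
            if matches_docstring_pattern(types):
                strings = [t for t in chunk if t.type == tokenize.STRING]
                yield strings[0].start, strings[-1].end
            chunk = []

# Example usage:
# for (srow, _), (erow, _) in docstring_spans(open('test.py').read()):
#     print(f"docstring spanning lines {srow}-{erow}")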


0 votes

I just used the code given by Dan McDougall and found two problems.

  1. There were too many empty new lines, so I decided to remove a line whenever there were two consecutive empty lines.
  2. When the Python code was processed, all the spaces were lost (except the indentation), so things like "import Anything" turned into "importAnything", which caused problems. I added spaces before and after the reserved Python words that needed them. I hope I didn't make any mistakes there.

I think I fixed both problems by adding (just before the return) a few lines:

# Removing unneeded newlines from string
buffered_content = cStringIO.StringIO(content) # Takes the string generated by Dan McDougall's code as input
content_without_newlines = ""
previous_token_type = tokenize.NEWLINE
for tokens in tokenize.generate_tokens(buffered_content.readline):
    token_type = tokens[0]
    token_string = tokens[1]
    if previous_token_type == tokenize.NL and token_type == tokenize.NL:
        pass
    else:
        # add necessary spaces
        prev_space = ''
        next_space = ''
        if token_string in ['and', 'as', 'or', 'in', 'is']:
            prev_space = ' '
        if token_string in ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'if', 'yield', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'is', 'return', 'def', 'for', 'lambda']:
            next_space = ' '
        content_without_newlines += prev_space + token_string + next_space # This will be our new output!
    previous_token_type = token_type

0 votes

I was trying to create a program that counts all the lines in a Python file, ignoring blank lines, comment lines, and docstrings. Here is my solution:

from collections import Counter  # needed for the quote-mark count below

with open(file_path, 'r', encoding='utf-8') as pyt_file:
  count = 0
  docstring = False

  for i_line in pyt_file.readlines():

    cur_line = i_line.rstrip().replace(' ', '')

    if cur_line.startswith('"""') and not docstring:
      marks_counter = Counter(cur_line)
      if marks_counter['"'] == 6:
        count -= 1
      else:
        docstring = True

    elif cur_line.startswith('"""') and docstring:
      count -= 1
      docstring = False

    if len(cur_line) > 0 and not cur_line.startswith('#') and not docstring:
      count += 1

My problem was detecting the docstrings (both single-line and multi-line), so I suppose that if you want to remove them, you could try the same flag-based solution, as sketched below.
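
A rough sketch (not from the original answer) reusing the same flag idea, this time dropping the lines instead of just not counting them. Like the counter above it is purely line-based and heuristic, and only handles """-style docstrings that start a line:

from collections import Counter

def strip_lines(file_path):
    kept = []
    docstring = False
    with open(file_path, 'r', encoding='utf-8') as pyt_file:
        for i_line in pyt_file.readlines():
            cur_line = i_line.rstrip().replace(' ', '')
            skip = False
            if cur_line.startswith('"""') and not docstring:
                skip = True
                # A line holding both the opening and closing quotes
                # ('"""..."""' -> six quote marks) does not toggle the flag.
                if Counter(cur_line)['"'] != 6:
                    docstring = True
            elif cur_line.startswith('"""') and docstring:
                skip = True
                docstring = False
            if docstring or skip or not cur_line or cur_line.startswith('#'):
                continue
            kept.append(i_line)
    return ''.join(kept)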

P.S. I know this is an old question, but while dealing with my problem I couldn't find anything simple and effective.


0 votes

I suggest using this code (based on @SurpriseDog):

from typing import Any

import ast
from ast import Constant
import astunparse  # pip install astunparse

class NewLineProcessor(ast.NodeTransformer):
    """class for keeping '\n' chars inside python strings during ast unparse"""
    def visit_Constant(self, node: Constant) -> Any:
        if isinstance(node.value, str):
            node.value = node.value.replace('\n', '\\n')
        return node

file_from = 'my_module.py'  # placeholder: path of the source file to process

with open(file_from) as f:
    tree = ast.parse(f.read())
    tree = NewLineProcessor().visit(tree)
    lines = astunparse.unparse(tree).split('\n')
    print(lines)


0 votes

Although this question was asked more than a decade ago, I wrote this to solve the same problem: I wanted them removed for compilation.

import ast
import astor
import re

def remove_docs_and_comments(file):
    with open(file,"r") as f:
        code = f.read() 
    parsed = ast.parse(code)
    for node in ast.walk(parsed):
        # Note: ast.Str is deprecated since Python 3.8 and removed in 3.12;
        # on newer versions, check for ast.Constant with a str value instead.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Str):
            # set value to empty string
            node.value = ast.Constant(value='') 
    formatted_code = astor.to_source(parsed)  
    pattern = r'^.*"""""".*$' # remove empty """"""
    formatted_code = re.sub(pattern, '', formatted_code, flags=re.MULTILINE) 
    return formatted_code 

remove_docs_and_comments("your_script.py")

It will return the compressed code without docstrings and comments.


0 votes

This is not an easy task. Consider the code below: which parts would you want to remove?

# bug.py
"""\

"""
''' #
abcaldskf\
'''
a = ''' #hello"\n
# , "\n world!
'''
b = '#' # "asdf"
#\n"y = ''' #hello"\n
c = '"' """''"""'''"'''
'''"''' + '\n' # '''''""""''"""''
d = \
"""aaa""" + 'zzz' # aa
'''"zz''' '"' "'" '' # '"""
"" + a #
''

Probably only a DFA is the right way to do this. Here is my solution:

import re  # used in CommentDetector.clean_effective_mls_cmts() below

codebook = { '\\': 0, "'": 1, '"': 2, '#': 3, '\n': 4 }
dfa = [[ -1, -1, -1, -1, -1, -1 ],
       [ -1,  2,  9, 18,  1,  1 ],
       [ 16,  3,  8,  8, -1,  8 ],
       [ -1,  4,  9, 18,  1,  1 ],
       [  4,  5,  4,  4,  4,  4 ],
       [  4,  6,  4,  4,  4,  4 ],
       [  4,  7,  4,  4,  4,  4 ],
       [ -1,  2,  9, 18,  1,  1 ],
       [ 16,  1,  8,  8, -1,  8 ],
       [ 17, 15, 10, 15, -1, 15 ],
       [ -1,  2, 11, 18,  1,  1 ],
       [ 11, 11, 12, 11, 11, 11 ],
       [ 11, 11, 13, 11, 11, 11 ],
       [ 11, 11, 14, 11, 11, 11 ],
       [ -1,  2,  9, 18,  1,  1 ],
       [ 17, 15,  1, 15, -1, 15 ],
       [  8,  8,  8,  8, -1,  8 ],
       [ 15, 15, 15, 15, -1, 15 ],
       [ 18, 18, 18, 18,  1, 18 ]]

def run_dfa(text, callback):
    s = 1
    for i, c in enumerate(text):
        s = dfa[s][codebook.get(c, 5)]
        if s <= 0:
            break
        callback(i, s)
    return s == 1

class CommentDetector:
    def __init__(self, text: str):
        self.text = text
        self.prev_cmt = -1
        self.mls_cmts, self.num_cmts = [], []

    def __call__(self, i, s):
        cmt = 0
        if (s >= 4 and s <= 7) or (s >= 11 and s <= 14):
            cmt = 1
        elif s == 18:
            cmt = 2
        if self.prev_cmt != 1 and cmt == 1:
            assert not self.mls_cmts or self.mls_cmts[-1][1] >= 0
            self.mls_cmts.append([i - 2, -1])
        elif self.prev_cmt == 1 and cmt != 1:
            assert self.mls_cmts and i - self.mls_cmts[-1][0] > 6 and self.mls_cmts[-1][-1] == -1
            self.mls_cmts[-1][1] = i
        elif self.prev_cmt != 2 and cmt == 2:
            assert not self.num_cmts or self.num_cmts[-1][1] >= 0
            self.num_cmts.append([i, -1])
        elif self.prev_cmt == 2 and cmt != 2:
            assert self.num_cmts and self.num_cmts[-1][-1] == -1
            self.num_cmts[-1][1] = i
        self.prev_cmt = cmt

    def clean_effective_mls_cmts(self):
        text = self.text[:]
        for beg, end in self.num_cmts:
            # text[beg:end] = [' '] * (end - beg)
            text = text[:beg] + ' ' * (end - beg) + text[end:]
        for i in range(len(self.mls_cmts) - 1, -1, -1):
            beg = text.rfind('\n', 0, self.mls_cmts[i][0]) + 1
            end = text.find('\n', self.mls_cmts[i][1])
            subtext = text[beg:self.mls_cmts[i][0]] + text[self.mls_cmts[i][1]:end]
            if re.search(r'[^\s]+', subtext):
                del self.mls_cmts[i]

    def remove_all_comments(self):
        text = self.text
        for beg, end in self.mls_cmts:
            text = text[:beg] + text[end:]
        for beg, end in self.num_cmts:
            text = text[:beg] + text[end:]
        return text

    def get_comments_mask(self):
        mask = [0] * len(self.text)
        for beg, end in self.mls_cmts:
            mask[beg:end] = [1] * (end - beg)
        for beg, end in self.num_cmts:
            mask[beg:end] = [2] * (end - beg)
        return mask

text = open('bug.py').read().strip('\n\t ') + '\n'
text = text.replace('\r\n', '\n').replace('\r', '\n').replace('\\\n', '')
cmt_det = CommentDetector(text)
assert run_dfa(text, cmt_det)
cmt_det.clean_effective_mls_cmts()
mask = cmt_det.get_comments_mask()
ind_chars = [' ', '^', '*']
for i, c in enumerate(text):
    mask[i] = '\n' if c == '\n' else ind_chars[mask[i]]
mask_lines = ''.join(mask).splitlines()
for i, line in enumerate(text.splitlines()):
    print(f'[{i:04d}] {line}')
    print(f'[{i:04d}] {mask_lines[i]}')

Output:

[0000] # bug.py
[0000] ********
[0001] """
[0001] ^^^
[0002] """
[0002] ^^^
[0003] ''' #
[0003] ^^^^^
[0004] abcaldskf'''
[0004] ^^^^^^^^^^^^
[0005] a = ''' #hello"\n
[0005]                  
[0006] # , "\n world!
[0006]               
[0007] '''
[0007]    
[0008] b = '#' # "asdf"
[0008]         ********
[0009] #\n"y = ''' #hello"\n
[0009] *********************
[0010] c = '"' """''"""'''"'''
[0010]                        
[0011] '''"''' + '\n' # '''''""""''"""''
[0011]                ******************
[0012] d = """aaa""" + 'zzz' # aa
[0012]                       ****
[0013] '''"zz''' '"' "'" '' # '"""
[0013]                      ******
[0014] "" + a #
[0014]        *
[0015] ''
[0015]   

In the output, each line of code is shown as two lines: the first is the original code, and the second uses ^ and * to mark which parts of that line are triple-quoted string comments (docstrings) and which parts are ordinary # comments.

CommentDetector can be adapted as needed to fit your requirements.
