Splitting text into tokens using regex in Python

Question

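Hi, I have a question about splitting a string into tokens.

Here is an example string:

string = "As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham-plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests."

I am trying to split string correctly into its tokens.

Here is my function count_words: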

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split("[\s.,!?:;'\"-]+",lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary

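And here is the resulting split:

['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out', 'of', 'a', 'side', 'room', 'and', 'at', 'a', 'glance', 'i', 'was', 'sure', 'he', 'must', 'be', 'long', 'john', 'his', 'left', 'leg', 'was', 'cut', 'off', 'close', 'by', 'the', 'hip', 'and', 'under', 'the', 'left', 'shoulder', 'he', 'carried', 'a', 'crutch', 'which', 'he', 'managed', 'with', 'wonderful', 'dexterity', 'hopping', 'about', 'upon', 'it', 'like', 'a', 'bird', 'he', 'was', 'very', 'tall', 'and', 'strong', 'with', 'a', 'face', 'as', 'big', 'as', 'a', 'ham-plain', 'and', 'pale', 'but', 'intelligent', 'and', 'smiling', 'indeed', 'he', 'seemed', 'in', 'the', 'most', 'cheerful', 'spirits', 'whistling', 'as', 'he', 'moved', 'about', 'among', 'the', 'tables', 'with', 'a', 'merry', 'word', 'or', 'a', 'slap', 'on', 'the', 'shoulder', 'for', 'the', 'more', 'favoured', 'of', 'his', 'guests', '']

As you can see, there is an empty string '' at the last index of the split list.

Please help me understand why that empty string is in the list, and how to split this example string correctly.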

Answer 1

You can use a list comprehension to iterate over the items produced by re.split and keep them only if they are not empty strings:

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation 
    # (Hint: Use regex to split on non-alphanumeric characters) 

    split = re.split(r"[\s.,!?:;'\"-]+", lowerText)
    split = [x for x in split if x != '']  # <- list comprehension
    print(split)

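You should also consider returning the data from the function and printing it from the caller, rather than printing inside the function. That will give you more flexibility later on.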
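A minimal sketch of that suggestion, assuming the dictionary aggregation left as a TODO in the question is done with dict.get. The function returns the counts and the caller decides whether to print them; the short sample sentence is made up for illustration.

```python
import re

def count_words(text):
    """Return a dict mapping each lowercase word in text to its count."""
    counts = {}
    tokens = re.split(r"[\s.,!?:;'\"-]+", text.lower())
    for token in tokens:
        if token != '':  # skip the empty strings re.split leaves at the ends
            counts[token] = counts.get(token, 0) + 1
    return counts

# The caller decides what to do with the result (print, log, test, ...).
result = count_words("A cat. A dog!")
print(result)  # {'a': 2, 'cat': 1, 'dog': 1}
```

Because nothing is printed inside the function, the same function can be reused in tests or in a larger program without producing unwanted output.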

Answer 2

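This happens because the string ends with ., which is part of the split pattern. re.split returns the text on both sides of every match, and after the final . there is nothing left but the empty string, which is why you see ''.

I suggest using re.findall instead, which works the other way around by matching the tokens rather than the separators: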

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.findall(r"[a-z\-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
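The difference between the two calls can be seen on a short input. This is only an illustrative sketch; the sample strings are made up and are not part of the question.

```python
import re

text = "hello, world."

# re.split returns the pieces between separators, including the empty
# piece that follows the final "."
pieces = re.split(r"[\s.,!?:;'\"-]+", text)
print(pieces)   # ['hello', 'world', '']

# re.findall returns only the matches, so there is nothing empty to drop
words = re.findall(r"[a-z\-]+", text)
print(words)    # ['hello', 'world']

# Caveat: [a-z\-] keeps hyphens inside tokens and skips digits entirely
mixed = re.findall(r"[a-z\-]+", "well-known room 101")
print(mixed)    # ['well-known', 'room']
```

Note that the findall pattern is not an exact inverse of the split pattern: it keeps hyphenated words together and ignores digits, which may or may not be what you want.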

Answer 3

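Python's documentation for re.split explains this behavior:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string.

Although your pattern does not actually contain a capturing group, the effect is the same. Note that it can happen at the beginning as well as at the end (for example, if your string starts with whitespace).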
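The four possible cases can be checked directly with the pattern from the question. The inputs below are made-up samples chosen only to show where the empty strings appear.

```python
import re

pattern = r"[\s.,!?:;'\"-]+"

no_edges = re.split(pattern, "one two")      # ['one', 'two']
trailing = re.split(pattern, "one two.")     # ['one', 'two', '']
leading = re.split(pattern, " one two")      # ['', 'one', 'two']
both = re.split(pattern, " one two.")        # ['', 'one', 'two', '']

# A run of separators is a single match because of the +, so no empty
# string appears in the middle of the list.
interior = re.split(pattern, "one,  two")    # ['one', 'two']

for result in (no_edges, trailing, leading, both, interior):
    print(result)
```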

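The two solutions that others have already (more or less) proposed are these:

Solution 1: `findall`

As other users pointed out, you can use findall and invert the logic of the pattern. With yours, you can simply negate your character class: [^\s\.,!?:;'\"-]+.

This depends on your regex pattern, though, because inverting it is not always that easy.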
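As a quick check, the negated class with findall should give the same tokens as the original split with the empty strings removed. The sample sentence below is an assumption for illustration.

```python
import re

separators = r"[\s.,!?:;'\"-]+"    # what the question splits on
tokens = r"[^\s.,!?:;'\"-]+"       # the negated class: everything else

text = "as i was waiting, a man came out."

by_split = [t for t in re.split(separators, text) if t]
by_findall = re.findall(tokens, text)

print(by_findall)
# ['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out']
print(by_split == by_findall)  # True
```

Negation is this simple only because the separator is a single character class. A separator made of a multi-character sequence, for example, has no equally direct complement.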

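Solution 2: check the first and last tokens

Instead of checking whether every token is != '', you only need to look at the first and last tokens, because the pattern greedily consumes every run of separator characters, so empty strings can only appear at the ends.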

split = re.split(r"[\s\.,!?:;'\"-]+", lowerText)

# Guard against an empty list so the index lookups cannot fail
if split and split[0] == '':
    split = split[1:]

if split and split[-1] == '':
    split = split[:-1]

Answer 4

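You get an empty string because the dot at the end of string is also matched as a split point, and nothing comes after it. You can use the filter function to drop the empty strings and so complete your function: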

import re
import collections


def count_words(text):
    """Count how many times each unique word occurs in text."""

    lowerText = text.lower()

    split = re.split(r"[ .,!?:;'\"\-]+", lowerText)
    # filter out empty strings and count words:

    return collections.Counter(filter(None, split))


count_words(text=string)
# Counter({'a': 9, 'he': 6, 'the': 6, 'and': 5, 'as': 4, 'was': 4, 'with': 3, 'his': 2, 'about': 2, 'i': 2, 'of': 2, 'shoulder': 2, 'left': 2, 'dexterity': 1, 'seemed': 1, 'managed': 1, 'among': 1, 'indeed': 1, 'favoured': 1, 'moved': 1, 'it': 1, 'slap': 1, 'cheerful': 1, 'at': 1, 'in': 1, 'close': 1, 'glance': 1, 'face': 1, 'pale': 1, 'smiling': 1, 'out': 1, 'tables': 1, 'cut': 1, 'ham': 1, 'for': 1, 'long': 1, 'intelligent': 1, 'waiting': 1, 'wonderful': 1, 'which': 1, 'under': 1, 'must': 1, 'bird': 1, 'guests': 1, 'more': 1, 'hip': 1, 'be': 1, 'sure': 1, 'leg': 1, 'very': 1, 'big': 1, 'spirits': 1, 'upon': 1, 'but': 1, 'like': 1, 'most': 1, 'carried': 1, 'whistling': 1, 'merry': 1, 'tall': 1, 'word': 1, 'strong': 1, 'by': 1, 'on': 1, 'john': 1, 'off': 1, 'room': 1, 'hopping': 1, 'or': 1, 'crutch': 1, 'man': 1, 'plain': 1, 'side': 1, 'came': 1})
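A smaller sketch of the same idea may make the result easier to inspect. The sample sentence is made up; most_common and zero counts for missing keys are standard collections.Counter behavior.

```python
import re
from collections import Counter

sample = "The cat and the hat. The end."
tokens = re.split(r"[\s.,!?:;'\"-]+", sample.lower())

# filter(None, ...) drops falsy items, which here means the '' at the end
counts = Counter(filter(None, tokens))

print(counts.most_common(1))  # [('the', 3)]
print(counts["cat"])          # 1
print(counts["dog"])          # 0, missing keys count as zero
```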

Answer 5

import string

def count_words(text):
    counts = dict()

    # Delete all ASCII punctuation, then lowercase
    text = text.translate(text.maketrans('', '', string.punctuation))
    text = text.lower()

    # Splitting on whitespace never produces empty strings
    words = text.split()
    print(words)

    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

    return counts

This works.
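One difference from the regex answers is worth keeping in mind: deleting punctuation joins the characters on either side of it, while splitting on punctuation separates them. The sketch below uses a made-up sentence to show both behaviors side by side.

```python
import re
import string

text = "a ham-plain face, isn't it?"

# Deleting punctuation joins whatever was on either side of it
table = str.maketrans('', '', string.punctuation)
deleted = text.translate(table).split()
print(deleted)
# ['a', 'hamplain', 'face', 'isnt', 'it']

# Splitting on punctuation separates the two sides instead
separated = [t for t in re.split(r"[\s.,!?:;'\"-]+", text) if t]
print(separated)
# ['a', 'ham', 'plain', 'face', 'isn', 't', 'it']
```

Which result is preferable depends on how hyphenated words and contractions should be counted.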
