How to extract all list elements in Python that start with the same set of characters

Question (votes: 0, answers: 1)

I have to split a list of POS-tagged words into sublists according to the POS tag used. My list looks like this:

List=[", -> ','", ". -> '!'", ". -> '.'", ". -> '?'", "CC -> 'but'", "CD -> 'hundred'",
      "CD -> 'one'", "DT -> 'the'", "EX -> 'There'","IN -> 'as'", "IN -> 'because'",
      "IN -> 'if'", "IN -> 'in'", "JJ -> 'Sure'", 'MD -> "\'ll"', "MD -> 'ca'",
      "MD -> 'can'", "MD -> 'will'", "MD -> 'would'", "NN -> 'Applause'",
      "NN -> 'anybody'", "NN -> 'doubt'", "NNP -> 'Syria'",
      "NNS -> 'Generals'", "NNS -> 'people'", "NNS -> 'states'",  "PRP -> 'it'",
      "PRP$ -> 'our'",  "RB -> 'there'", "RBR -> 'more'", "RP -> 'out'", "TO -> 'to'",
      "UH -> 'Oh'", "UH -> 'Wow'", "VB -> 'stop'", "VB -> 'want'", "VBD -> 'knew'",
      "VBD -> 'was'", "VBG -> 'allowing'", "VBG -> 'doing'", "VBG -> 'going'",
      "VBN -> 'called'", "VBP -> 'take'", 'VBZ -> "\'s"', "VBZ -> 'is'", 
      "WDT -> 'that'", "WP -> 'what'"]

The output I want looks something like this:

[["IN -> 'as'", "IN -> 'because'", "IN -> 'if'", "IN -> 'in'"],["UH -> 'Oh'", "UH -> 'Wow'"]]

or, even better:

CC = ['but']
CD = ['hundred', 'one']

I have searched a lot, but the only thing I found that works, at least partially, is this:

from itertools import groupby
print([list(g) for k, g in groupby(List, key=lambda x: x[0])])

I have played around with the key (the x[0] part), but nothing I tried gave a good result.
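One likely reason the attempt above falls short is that `x[0]` compares only the first character, so tags such as NN, NNP and NNS end up in one group. A minimal sketch of a fix, assuming every item has the form `TAG -> 'word'`: use the text before `" -> "` as the key. The names `tagged` and `tag_of` are made up for this illustration, and the list is a shortened sample of the one in the question.

```python
from itertools import groupby

# Shortened sample of the list from the question.
tagged = ["CC -> 'but'", "CD -> 'hundred'", "CD -> 'one'",
          "IN -> 'as'", "IN -> 'because'", "IN -> 'if'", "IN -> 'in'",
          "UH -> 'Oh'", "UH -> 'Wow'"]

def tag_of(item):
    # Everything before the first " -> " is the POS tag.
    return item.split(" -> ", 1)[0]

# groupby only merges adjacent items, so sort by the same key first.
groups = [list(g) for _, g in groupby(sorted(tagged, key=tag_of), key=tag_of)]
print(groups)
```

Sorting first is not strictly needed for the list in the question, which is already ordered by tag, but it keeps the grouping correct if the input order ever changes.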

I have also considered using something like this:

import re

RB = []
for item in List:
    if item.startswith('RB'):
        # re.findall returns a list, so each appended entry is itself a list
        g = re.findall('-> (.*)', item)
        RB.append(g)

This certainly works, but doing it for roughly 40 different POS tags would be painful. There must be a simpler way.
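Apart from the repetition, the prefix test has a subtle catch: `startswith('RB')` also matches entries tagged RBR. A small sketch of the difference, using three items taken from the list above (the variable names are only for illustration):

```python
tagged = ["RB -> 'there'", "RBR -> 'more'", "RP -> 'out'"]

# Prefix test: also catches the RBR entry.
by_prefix = [item for item in tagged if item.startswith("RB")]

# Exact test: split off the tag and compare it as a whole.
by_tag = [item for item in tagged if item.partition(" -> ")[0] == "RB"]

print(by_prefix)
print(by_tag)
```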

python nltk pos-tagger
1 Answer
0 votes

Use a defaultdict:

from collections import defaultdict

List = [", -> ','", ". -> '!'", ". -> '.'", ". -> '?'", "CC -> 'but'", "CD -> 'hundred'",
      "CD -> 'one'", "DT -> 'the'", "EX -> 'There'","IN -> 'as'", "IN -> 'because'",
      "IN -> 'if'", "IN -> 'in'", "JJ -> 'Sure'", 'MD -> "\'ll"', "MD -> 'ca'",
      "MD -> 'can'", "MD -> 'will'", "MD -> 'would'", "NN -> 'Applause'",
      "NN -> 'anybody'", "NN -> 'doubt'", "NNP -> 'Syria'",
      "NNS -> 'Generals'", "NNS -> 'people'", "NNS -> 'states'",  "PRP -> 'it'",
      "PRP$ -> 'our'",  "RB -> 'there'", "RBR -> 'more'", "RP -> 'out'", "TO -> 'to'",
      "UH -> 'Oh'", "UH -> 'Wow'", "VB -> 'stop'", "VB -> 'want'", "VBD -> 'knew'",
      "VBD -> 'was'", "VBG -> 'allowing'", "VBG -> 'doing'", "VBG -> 'going'",
      "VBN -> 'called'", "VBP -> 'take'", 'VBZ -> "\'s"', "VBZ -> 'is'", 
      "WDT -> 'that'", "WP -> 'what'"]

data = defaultdict(set)
for key, value in (item.split('->') for item in List):
    # Strip surrounding whitespace and remove both kinds of quote characters
    data[key.strip()].add(value.strip().replace("'", '').replace('"', ''))
print(dict(data))

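This results in the following (sets are unordered, so the words within each tag may print in a different order):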

{',': {','}, '.': {'.', '!', '?'}, 'CC': {'but'}, 'CD': {'hundred','one'}, 'DT': {'the'}, 'EX': {'There'}, 'IN': {'in', 'because', 'as', 'if'}, 'JJ': {'Sure'}, 'MD': {'will', 'ca', 'll', 'can', 'would'}, 'NN': {'Applause', 'doubt', 'anybody'}, 'NNP': {'Syria'}, 'NNS': {'Generals', 'states', 'people'}, 'PRP': {'it'}, 'PRP$': {'our'}, 'RB': {'there'}, 'RBR': {'more'}, 'RP': {'out'}, 'TO': {'to'}, 'UH': {'Wow', 'Oh'}, 'VB': {'want', 'stop'}, 'VBD': {'was', 'knew'}, 'VBG': {'going', 'allowing', 'doing'}, 'VBN': {'called'}, 'VBP': {'take'}, 'VBZ': {'s', 'is'}, 'WDT': {'that'}, 'WP': {'what'}}
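If lists like `CD = ['hundred', 'one']` are preferred over sets, and the apostrophe in words such as `'ll` and `'s` should survive, a variation along these lines may help. It is only a sketch: it assumes every right-hand side is a valid Python string literal (true for the data shown in the question), and `tagged` and `words_by_tag` are names invented for this example.

```python
import ast
from collections import defaultdict

# Shortened sample of the list from the question.
tagged = ["CC -> 'but'", "CD -> 'hundred'", "CD -> 'one'",
          'MD -> "\'ll"', "MD -> 'ca'", 'VBZ -> "\'s"', "VBZ -> 'is'"]

words_by_tag = defaultdict(list)
for item in tagged:
    tag, _, quoted = item.partition(" -> ")
    # Assumption: the right-hand side is a valid Python string literal,
    # so literal_eval removes only the outer quotes.
    words_by_tag[tag].append(ast.literal_eval(quoted))

print(dict(words_by_tag))
```

A list keeps the words in their original order and keeps duplicates; the set used above removes duplicates but does not preserve order.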