Creating consecutive two-word phrases from a string

Question

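I've spent an unreasonable amount of time trying to find a way to use itertools to turn a sentence into a list of two-word phrases.

I want to take this: "the quick brown fox"

and turn it into this: "the quick", "quick brown", "brown fox"

Everything I've tried returns lists containing everything from single words up to four-word phrases, but nothing returns only the pairs.

I've tried many different combinations of itertools functions. I know it's doable, but I just can't work out the right combination, and I don't want to define a function for this. I know it can be done in two lines of code or fewer. Can anyone help?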

Answer 1

Try:

s = "the quick brown fox"
words = s.split()
result = [' '.join(pair) for pair in zip(words, words[1:])]
print(result)

Output:

['the quick', 'quick brown', 'brown fox']

Explanation

Create an iterator of word pairs with zip:

zip(words, words[1:])

Iterate over the pairs:

for pair in zip(words, words[1:])

Join each pair into a result string:

[' '.join(pair) for ...]
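
Since the question asks specifically for itertools, here is a minimal sketch of the same pairing built on itertools.tee, in the spirit of the pairwise recipe from the itertools documentation. The variable names first and second are illustrative only. On Python 3.10 and later, itertools.pairwise should cover this directly.

```python
import itertools

s = "the quick brown fox"
# tee returns two independent iterators over the same sequence of words
first, second = itertools.tee(s.split())
next(second, None)  # advance the second iterator by one word
result = [" ".join(pair) for pair in zip(first, second)]
print(result)  # ['the quick', 'quick brown', 'brown fox']
```

This still builds the word list with split(), so it is no more memory-efficient than the zip over slices; it only avoids creating the words[1:] copy.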

Answer 2

@DarrylG's answer seems like the way to go, but you can also use:

s = "the quick brown fox"
p  = s.split()
ns = [f"{w} {p[n+1]}" for n, w in enumerate(p) if n<len(p)-1 ]
# ['the quick', 'quick brown', 'brown fox']


Answer 3

If you want a pure iterator solution for large strings with constant memory usage:

import itertools
import re

input = "the quick brown fox"
# Two independent lazy iterators over the whitespace-delimited tokens
input_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
input_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
next(input_iter2)  # skip first
output = itertools.starmap(
    lambda a, b: f"{a} {b}",
    zip(input_iter1, input_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']

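If you can spare roughly 3x the string's size in extra memory to hold the split() list and the doubled output list, skipping itertools is probably faster and simpler: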

inputs = "the quick brown fox".split(' ')    

output = [ f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs)-1) ] 
#  ['the quick', 'quick brown', 'brown fox']

Update

A generic solution that supports arbitrary n-gram sizes:

from typing import Iterable
import itertools
import re

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Skip first words
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))  

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

Output:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']
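
As an alternative sketch (not the method used in this answer), the same n-grams can be produced with a single regex pass and a collections.deque acting as a sliding window, so the string is scanned once rather than once per n-gram position. The function name ngrams_window is hypothetical, and the sketch assumes ngram_size is at least 1.

```python
from collections import deque
from typing import Iterator
import re

def ngrams_window(text: str, ngram_size: int, token_regex: str = r"\S+") -> Iterator[str]:
    # Hypothetical helper: only the last ngram_size tokens are kept in memory
    window = deque(maxlen=ngram_size)
    for match in re.finditer(token_regex, text):
        window.append(match.group(0))  # the oldest token drops out automatically
        if len(window) == ngram_size:
            yield " ".join(window)

print(list(ngrams_window("the quick brown fox", 2)))
# ['the quick', 'quick brown', 'brown fox']
```

If the text has fewer tokens than ngram_size, the generator simply yields nothing.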

You may also find this question relevant: n-grams in python, four, five, six grams?
