Punctuation preservation was added in commit b3120f0, but the PyPI version that pip currently installs is behind the GitHub repository and does not include it.
To add this feature manually, find the installation directory of the wordninja module:
pip show wordninja
Navigate to the directory listed in the output of the command above, then edit the split function in the wordninja.py file to:
def split(self, s):
    """Uses dynamic programming to infer the location of spaces in a string without spaces."""
    punctuations = _SPLIT_RE.findall(s)
    texts = _SPLIT_RE.split(s)
    assert len(punctuations) + 1 == len(texts)
    new_texts = [self._split(x) for x in texts]
    for i, punctuation in enumerate(punctuations):
        new_texts.insert(2*i+1, punctuation)
    return [item for sublist in new_texts for item in sublist]
instead of:
def split(self, s):
    """Uses dynamic programming to infer the location of spaces in a string without spaces."""
    l = [self._split(x) for x in _SPLIT_RE.split(s)]
    return [item for sublist in l for item in sublist]
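To see why the separators end up at odd positions, here is a minimal standalone sketch of the same interleaving that does not need wordninja installed. The pattern assigned to SPLIT_RE is an assumption about what the library's _SPLIT_RE looks like, and fake_split is a hypothetical stand-in for self._split that leaves each chunk unsplit.

```python
import re

# Assumption: approximates wordninja's separator pattern (runs of characters
# that are not letters, digits, or apostrophes).
SPLIT_RE = re.compile(r"[^a-zA-Z0-9']+")

def fake_split(chunk):
    """Hypothetical stand-in for self._split: returns the chunk unsplit."""
    return [chunk] if chunk else []

def split_keep_punct(s):
    punctuations = SPLIT_RE.findall(s)
    texts = SPLIT_RE.split(s)
    # n separators always yield n + 1 text chunks (some may be empty).
    assert len(punctuations) + 1 == len(texts)
    new_texts = [fake_split(x) for x in texts]
    for i, punctuation in enumerate(punctuations):
        # Odd indices 1, 3, 5, ... fall between consecutive text chunks.
        new_texts.insert(2 * i + 1, punctuation)
    # Flattening also iterates over each inserted string character by
    # character, so ". " comes out as "." and " ".
    return [item for sublist in new_texts for item in sublist]

print(split_keep_punct("ab. cd!"))  # ['ab', '.', ' ', 'cd', '!']
```

The character-by-character flattening explains why a period and the space after it show up as two separate list items in the patched output.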
Example:
import wordninja
text = "This isasentence. I am trying to splitthesewords."
split_text = wordninja.split(text)
print(split_text)
Output before updating the split function:
['This', 'is', 'a', 'sentence', 'I', 'am', 'trying', 'to', 'split', 'these', 'words']
Output after updating the split function:
['This', ' ', 'is', 'a', 'sentence', '.', ' ', 'I', ' ', 'am', ' ', 'trying', ' ', 'to', ' ', 'split', 'these', 'words', '.']
With this feature enabled, you have to handle the spaces and punctuation in the list yourself.
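One possible way to do that handling is sketched below. join_tokens is a hypothetical helper, not part of wordninja; it assumes that two consecutive word tokens came from text with no separator between them, and inserts a single space there while leaving the original spaces and punctuation untouched.

```python
def join_tokens(tokens):
    """Rebuild a readable string from the list returned by the patched split."""
    out = []
    prev_was_word = False
    for tok in tokens:
        # Treat any token containing a letter or digit as a word.
        is_word = any(ch.isalnum() for ch in tok)
        # Two words in a row had no separator in the input, so add a space.
        if is_word and prev_was_word:
            out.append(" ")
        out.append(tok)
        prev_was_word = is_word
    return "".join(out)

tokens = ['This', ' ', 'is', 'a', 'sentence', '.', ' ', 'I', ' ', 'am', ' ',
          'trying', ' ', 'to', ' ', 'split', 'these', 'words', '.']
print(join_tokens(tokens))  # This is a sentence. I am trying to split these words.
```

If only the words are needed, filtering the list with the same isalnum check reproduces the pre-patch behavior.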