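I want to create a function that splits a string containing multiple sentences on dots while handling abbreviations. For example, it should not split after "Univ." and "Dept.". It is a bit hard to explain, so I will show test cases. I have seen this post (Split string with "."(dot) while handling abbreviations), but the answer there removes the non-punctuation dots (U.S.A. becomes USA), and I want to keep the dots.
Here is my function: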
import re

def split_string_by_punctuation(line: str) -> list[str]:
    """
    Splits a given string into a list of strings using terminal punctuation marks (., !, ?, or :) as delimiters.
    This function utilizes regular expression patterns to ensure that abbreviations, honorifics,
    and certain special cases are not considered as sentence delimiters.

    Args:
        line (str): The input string to be split into sentences.

    Returns:
        list: A list of strings representing the sentences obtained after splitting the input string.

    Notes:
        - Negative lookbehind is used to exclude abbreviations (e.g., "e.g.", "i.e.", "U.S.A."),
          which might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude honorifics (e.g., "Mr.", "Mrs.", "Dr.")
          that might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude some abbreviations (e.g., "Dept.", "Univ.", "et al.")
          that might have a period but are not the end of a sentence.
        - Positive lookbehind is used to match a whitespace character following a terminal
          punctuation mark (., !, ?, or :).
    """
    punct_regex = re.compile(r"(?<=[.!?;:])(?:(?<!Prof\.)|(?<!Dept\.)|(?<!Univ\.)|(?<!et\sal\.))(?<!\w\.\w.)(?<![A-Z][a-z]\.)\s")
    return re.split(punct_regex, line)
These are my test cases:
def test_split_string_by_punctuation(self):
    # Test case 1
    text1 = "I am studying at Univ. of California, Dept. of Computer Science. The research team includes " \
            "Prof. Smith, Dr. Johnson, and Ms. Adams et al. so we are working on a new project."
    result1 = split_string_by_punctuation(text1)
    assert result1 == ['I am studying at Univ. of California, Dept. of Computer Science.',
                       'The research team includes Prof. Smith, Dr. Johnson, and Ms. Adams et al. '
                       'so we are working on a new project.'], "Test case 1 failed"
    # Test case 2
    text2 = "This is a city in U.S.A.. This is i.e. one! What about this e.g. one? " \
            "Finally, here's the last one:"
    result2 = split_string_by_punctuation(text2)
    assert result2 == ['This is a city in U.S.A..', 'This is i.e. one!', 'What about this e.g. one?',
                       "Finally, here's the last one:"], "Test case 2 failed"
    # Test case 3
    text3 = "This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return as a single element list"
    result3 = split_string_by_punctuation(text3)
    assert result3 == [
        'This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return as a single element list'], \
        "Test case 3 failed"
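For instance, the result of test case 1 is ['I am studying at Univ.', 'of California, Dept.', 'of Computer Science.', 'The research team includes Prof.', 'Smith, Dr. Johnson, and Ms. Adams et al.', 'so we are working on a new project.'], so the string is split after "Univ.", "Dept.", "Prof." and "et al.".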
I would suggest using findall to capture the sentences, rather than split to identify the sentence breaks.
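To illustrate the difference between the two approaches, here is a minimal sketch on a deliberately simplified pattern (both patterns and the sample text are made up for illustration and ignore abbreviations entirely):

```python
import re

text = "First one. Second one! Third one?"

# split: the pattern describes the BREAK between sentences
# (whitespace that is preceded by a terminal punctuation mark)
by_split = re.split(r"(?<=[.!?])\s+", text)

# findall: the pattern describes the SENTENCE itself
# (a non-space start, then non-terminators, then an optional terminator)
by_findall = re.findall(r"[^.!?\s][^.!?]*[.!?]?", text)

print(by_split)
print(by_findall)
```

With findall, the exceptions (abbreviations) can be written as ordinary alternatives inside the sentence pattern instead of being stacked up as lookbehind assertions in front of the break.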
Other remarks:
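There is no point in calling re.compile and then passing the resulting regex object as an argument to re.split (or any other module-level re function): that only routes the call back to the compiled object through an extra layer. Instead, call the method on the regex object itself, as in punct_regex.split(line). However, since this regex is used only once, you might as well skip the call to compile and pass the pattern string directly; compilation then happens when the re function is called.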
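As a small illustration of the two call styles (the pattern here is a simplified placeholder chosen for the example, not the final regex):

```python
import re

line = "One. Two! Three?"
pattern = r"(?<=[.!?])\s+"

# Style 1: compile once, then call the method on the compiled object
punct_regex = re.compile(pattern)
parts_from_object = punct_regex.split(line)

# Style 2: skip compile and hand the pattern string to the module-level function
parts_from_module = re.split(pattern, line)

print(parts_from_object)
print(parts_from_module)
```

Both calls return the same list; the first style pays off mainly when the same pattern is reused many times.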
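Listing all possible abbreviations would be a never-ending task! Unless you are sure you have caught them all, I would suggest a heuristic: if a dot is not followed by whitespace and a capital letter, then the preceding word is an abbreviation. If a word starts with a capital letter, has at most 4 letters and is followed by a dot, then it is an abbreviation as well. In all other cases the dot is interpreted as ending a sentence.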
There are some errors in your test cases.
Once the test cases are fixed, this function passes the tests:
import re

def split_string_by_punctuation(line):
    # Capture whole sentences instead of looking for the breaks between them
    punct_regex = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"
    return re.findall(punct_regex, line)
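For readability, the same pattern can be spelled out with re.VERBOSE so that each branch of the heuristic carries its own comment. This is only a sketch: the name SENTENCE_RE and the short "Ann" sample are chosen here for illustration, while the longer sample text is the one from test case 1.

```python
import re

SENTENCE_RE = re.compile(r"""
    (?=\S)                  # a sentence starts at a non-whitespace character
    (?:
        [A-Z][a-z]{0,3}\.   # capitalised word of at most 4 letters plus dot: abbreviation
      | [^.?!;:]            # any character that is not terminal punctuation
      | \.(?!\s+[A-Z])      # a dot NOT followed by whitespace and a capital letter
    )*
    .?                      # the terminal punctuation mark, if there is one
""", re.VERBOSE)

text = ("I am studying at Univ. of California, Dept. of Computer Science. "
        "The research team includes Prof. Smith, Dr. Johnson, and Ms. Adams et al. "
        "so we are working on a new project.")

sentences = SENTENCE_RE.findall(text)
for sentence in sentences:
    print(sentence)

# A short capitalised word at the end of a sentence looks like an abbreviation
short_name = SENTENCE_RE.findall("I saw Ann. Then I left.")
print(short_name)
```

Keep in mind that the heuristic cuts both ways: a short capitalised word that really does end a sentence, such as "Ann." in the last call, is treated as an abbreviation, so no split happens there.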