Splitting text into paragraphs based on conditions using Python


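I have converted an unstructured PDF (using tika) into readable text and split it into individual sentences (although these could be merged into one large block of text if that is easier). Here is a sample of the text: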

['guided by our core values, weve experienced first-hand how our focused actions both big and small can translate into meaningful experiences for our customers, bringing our purpose to feed and foster community to life each day.',
 'with the strength of our full system, weve worked together to build a more diverse, equitable and inclusive business, source more food responsibly, adopt more sustainable practices, and implement innovative and credible solutions in our ongoing quest to be a good neighbor in the communities where we live, work and serve.',
 'we are proud of the work we do to make a difference and will continue to help uphold this promise in all of the communities in which we operate.',
 'showing up for our communities ray kroc used to say, none of us is as good as all of us a phrase that serves as a constant reminder of mcdonalds impact on the world when we leverage thecollective strength of our system.',
 'we all felt this sentiment more deeply overthe past two years as we continued tonavigate the covid-19 pandemic andit is equally as prevalent now as we face ongoing headwinds.',
 'front of mind is the continuing humanitarian crisis resulting from the war against ukraine.',
 'our hearts and minds remain with the ukrainian people and all impacted, as the ongoing war has brought new elements of uncertainty tocommunities around the world.',
 'in moments like these, our number one priority remains our people.']

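Now I want to rebuild it into paragraphs based on a specific set of criteria. Each paragraph should:

  • be longer than two lines
  • consist of at least 50% alphabetic characters
  • contain at least one period
  • end with a period
  • contain at least 20 words, 15 of which should be distinct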

I tried looping over the sentences with if/else checks, but I got nowhere.

How should I approach this? Any help is greatly appreciated!
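One possible approach, offered as a minimal sketch rather than a definitive answer: express the criteria above as a single predicate, then greedily accumulate consecutive sentences until the predicate passes. The function names `meets_criteria` and `build_paragraphs` are hypothetical. Because the input is a list of sentences with no line breaks, the sketch assumes that "longer than two lines" means "at least three sentences", and it reads the 50% rule as "at least 50%"; adjust both if your definitions differ.

```python
import re


def meets_criteria(sentences, min_sentences=3, min_alpha_ratio=0.5,
                   min_words=20, min_distinct=15):
    """Return True if the joined sentences satisfy every paragraph criterion."""
    # Assumption: one "line" corresponds to one sentence.
    if len(sentences) < min_sentences:
        return False
    text = " ".join(s.strip() for s in sentences)
    if not text:
        return False
    # Share of alphabetic characters, measured against all characters
    # (spaces and punctuation included).
    alpha_ratio = sum(ch.isalpha() for ch in text) / len(text)
    if alpha_ratio < min_alpha_ratio:
        return False
    # Ending with a period also guarantees at least one period.
    if not text.endswith("."):
        return False
    # Words are runs of letters/digits, optionally joined by ' or -.
    words = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())
    return len(words) >= min_words and len(set(words)) >= min_distinct


def build_paragraphs(sentences, **criteria):
    """Greedily group consecutive sentences into the shortest valid paragraphs.

    Returns (paragraphs, leftover), where leftover holds trailing sentences
    that never formed a valid paragraph.
    """
    paragraphs, current = [], []
    for sentence in sentences:
        current.append(sentence)
        if meets_criteria(current, **criteria):
            paragraphs.append(" ".join(s.strip() for s in current))
            current = []
    return paragraphs, current
```

Calling `paragraphs, leftover = build_paragraphs(sentences)` on the list of sentences returns the paragraphs that passed plus any trailing sentences that did not. What to do with the leftover (drop it, or merge it into the last paragraph and re-check) depends on the data, so the sketch leaves that decision to the caller.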

python text apache-tika paragraph