Remove text from a string until a condition is met in Python

Question  Votes: -1  Answers: 1

I have these strings:

text1 = "  SPEECH Remarks at the European Economics and Financial Centre Remarks by Luis de Guindos, Vice-President of the ECB, at the European Economics and Financial Centre London, 2 March 2020 I am delighted to be here today at the European Economics and F"
text2 = "  SPEECH  The ECB’s response to the COVID-19 pandemic Remarks by Isabel Schnabel, Member of the Executive Board of the ECB, at a 24-Hour Global Webinar co-organised by the SAFE Policy Center on “The COVID-19 Crisis and Its Aftermath: Corporate Governance Implications and Policy Challenges” Frankfurt am Main, 16 April 2020 The COVID-19 pandemic is a shock of unprecedented intensity and severity. Th"

How can I remove all the text up to and including the date that appears in each string?

The expected result is:

text1 = "I am delighted to be here today at the European Economics and F"

text2 = "The COVID-19 pandemic is a shock of unprecedented intensity and severity. Th"

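Important

Note that I am processing a large number of similar documents, so it is impossible to know all the dates in advance. I think the ideal solution would recognize the date itself and use it to remove the unnecessary text at the beginning.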

python nlp text-processing
1 Answer
1
vote

Use a regular expression.

Code

import re

def remove_predate(text):
    '''Remove everything up to and including the first date.

    Matches full and abbreviated month names, e.g. 02 January 2020 and 02 Jan 2020.
    '''
    # Lazy prefix (.*?) followed by: day, month name, four-digit year
    date_pattern = r'(.*?)(\d{1,2}\s+(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{4})'

    regex_detect = re.compile(date_pattern)
    m = regex_detect.match(text)

    if m:
        return text[m.end():]  # skip the text before the date and the date itself

    return text  # no date found: leave the text unchanged

Test

print(remove_predate(text1))
print(remove_predate(text2))

Output

 I am delighted to be here today at the European Economics and F
 The COVID-19 pandemic is a shock of unprecedented intensity and severity. Th
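A possible variation on the function above, offered as a sketch rather than a definitive version: `re.match` with a leading `.*?` does not cross newlines (because `.` does not match `\n` by default), and the result keeps the space that follows the date. Using `re.search` and `str.lstrip` sidesteps both. The names `MONTHS`, `DATE_RE` and `strip_through_first_date` are made up for this illustration.

```python
import re

# Month names, either in full or abbreviated to three letters.
MONTHS = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
# Compiled once at import time; \b stops the day and year from matching
# in the middle of a longer run of digits.
DATE_RE = re.compile(r"\b\d{1,2}\s+" + MONTHS + r"\s+\d{4}\b")


def strip_through_first_date(text):
    """Return the text that follows the first date, without leading whitespace."""
    m = DATE_RE.search(text)  # search scans the whole string, newlines included
    if m is None:
        return text  # no date found: leave the text unchanged
    return text[m.end():].lstrip()
```

Called on the two strings from the question, this should return the expected results without the leading space shown in the output above.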

1
vote

Use regexps.

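import re

month_names = ('January', …, 'December')  # fill in the missing names
# day, month name, four-digit year; (?:...) groups the alternatives without capturing
date_regexp = r'\d{1,2}\s+(?:' + '|'.join(month_names) + r')\s+\d{4}'
# the lazy .*? consumes everything before the first date
rx = re.compile('.*?' + date_regexp)
# count=1 removes only the text up to and including the first date
text1 = rx.sub('', text1, count=1)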
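One way to avoid typing the month names by hand, shown here as a sketch and not as part of the original answer, is to take them from the standard `calendar` module. This assumes the process runs with the default locale, where `calendar.month_name` yields English names. The helper name `drop_prefix` is hypothetical.

```python
import calendar
import re

# calendar.month_name[0] is an empty string, so the real names start at index 1.
# Assumption: default (C) locale, which gives English month names.
month_names = tuple(calendar.month_name[1:])

date_regexp = r'\d{1,2}\s+(?:' + '|'.join(month_names) + r')\s+\d{4}'
# re.DOTALL lets the lazy prefix run across line breaks as well.
rx = re.compile('.*?' + date_regexp, re.DOTALL)


def drop_prefix(text):
    """Remove everything up to and including the first date, if there is one."""
    return rx.sub('', text, count=1)
```

Note that this matches full month names only; abbreviated forms such as "Mar" would need the optional-suffix alternation used in the first answer.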

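1
vote

First, before searching, you need to know which date formats appear in your speeches and texts; across different speeches a date might be written as 01102020, 1 October 2020, or 1-10-2020. If you can identify a fixed date format, you can use a regex to find the dates.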
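If several formats turn out to be in use, one option is to try a pattern per format and cut at the earliest match. The sketch below covers only the three formats mentioned above; the pattern list and the function names are assumptions for illustration, and the patterns are deliberately loose (for example, any eight-digit number counts as a date).

```python
import re

# Hypothetical set of formats; extend it after inspecting your own documents.
DATE_PATTERNS = [
    # 1 October 2020 (month matched loosely by its first three letters)
    re.compile(r"\b\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4}\b"),
    # 1-10-2020 or 1/10/2020
    re.compile(r"\b\d{1,2}[-/]\d{1,2}[-/]\d{4}\b"),
    # 01102020
    re.compile(r"\b\d{8}\b"),
]


def first_date_span(text):
    """Return (start, end) of the earliest date-like match, or None."""
    spans = []
    for pattern in DATE_PATTERNS:
        m = pattern.search(text)
        if m:
            spans.append(m.span())
    return min(spans) if spans else None


def text_after_first_date(text):
    """Drop everything up to and including the earliest date-like match."""
    span = first_date_span(text)
    return text if span is None else text[span[1]:].lstrip()
```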

Regex for dates with slashes, from regexlib

^\d{1,2}\/\d{1,2}\/\d{4}$

Regex for dates without slashes but with spaces, from regexlib

^((31(?!\ (Feb(ruary)?|Apr(il)?|June?|(Sep(?=\b|t)t?|Nov)(ember)?)))|((30|29)(?!\ Feb(ruary)?))|(29(?=\ Feb(ruary)?\ (((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))|(0?[1-9])|1\d|2[0-8])\ (Jan(uary)?|Feb(ruary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sep(?=\b|t)t?|Nov|Dec)(ember)?)\ ((1[6-9]|[2-9]\d)\d{2})$
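Both patterns above are anchored with `^` and `$`, so as written they match only when the whole string is a date. To locate a date inside a longer text, one approach is to drop the anchors, add word boundaries, and use `re.search`. A minimal sketch with the slashed pattern follows; the `sample` string is made up for illustration.

```python
import re

anchored = re.compile(r"^\d{1,2}\/\d{1,2}\/\d{4}$")     # pattern as quoted above
unanchored = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")   # same shape, no anchors

# Made-up example string, not taken from the question.
sample = "Frankfurt am Main, 16/04/2020 The COVID-19 pandemic"

print(anchored.search(sample))            # None: the string is more than just a date
print(unanchored.search(sample).group())  # 16/04/2020
```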