How can I filter strings when the first three sentences contain keywords?

Question (votes: -1, answers: 1)

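I have a pandas DataFrame called df. It has a column called article. From the article column, I want to keep only the articles whose first four sentences contain the keyword "COVID-19" and either "China" or "Chinese". I have not been able to find a way to do this myself.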

(In this string, sentences are separated by \n. An example article looks like this:)

\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
python pandas filter keyword-search
1 Answer

1 vote

First, we define a function that returns a boolean indicating whether your keywords appear in a given piece of text.

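def contains_covid_kwds(sentence):
    # Keywords from the question: 'COVID-19' plus either 'China' or 'Chinese'
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    # Case-sensitive substring checks
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)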

Then we create a boolean Series by applying this function (using Series.apply) to the df.article column.

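Note that we use a lambda function to truncate the string passed to contains_covid_kwds at the fifth occurrence of '\n', i.e. just after your first four sentences (in the example, every sentence is preceded by '\n'):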

series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
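One edge case in the slicing trick above: when an article has fewer than five newlines, `str.find` returns -1 and the slice silently drops the last character instead of keeping the whole text. Below is a minimal sketch of a more defensive variant. `first_n_sentences` is a hypothetical helper name, and the leading `\n` convention is assumed from the example article.

```python
def first_n_sentences(text, n=4):
    """Return the first n newline-separated sentences of text."""
    # Drop leading/trailing newlines so the first piece is a real sentence
    sentences = text.strip("\n").split("\n")
    return "\n".join(sentences[:n])


short = "\nChina reports new COVID-19 cases.\nSecond sentence."

# Original trick: fewer than five '\n' here, so find() returns -1
# and the slice becomes short[:-1], losing the final character.
cut = short[:short.replace("\n", "#", 4).find("\n")]

# Defensive variant: keeps the whole text when it is short.
safe = first_n_sentences(short)
```

For articles with at least five newlines both approaches select the same four sentences, so the difference only matters for very short articles.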

Then we pass the boolean Series to df.loc to locate the rows for which the Series evaluates to True:

filtered_df = df.loc[series]
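To see the three steps working together, here is a small self-contained sketch on a toy DataFrame. The two articles are made-up illustration data, and the keyword is spelled 'COVID-19' as in the question.

```python
import pandas as pd


def contains_covid_kwds(text):
    # 'COVID-19' must appear together with 'China' or 'Chinese'
    return "COVID-19" in text and ("China" in text or "Chinese" in text)


# Toy data (assumed for illustration): each sentence is preceded by '\n'
df = pd.DataFrame({"article": [
    "\nChina may be past the worst of the COVID-19 pandemic.\nSecond.\nThird.\nFourth.\nFifth.",
    "\nWeather today.\nSunny.\nWarm.\nDry.\nLater: COVID-19 news from China.",
]})

# Truncate each article at its fifth '\n', then test the keywords
series = df.article.apply(
    lambda s: contains_covid_kwds(s[:s.replace("\n", "#", 4).find("\n")])
)
filtered_df = df.loc[series]
```

Only the first article is kept: in the second one the keywords appear in the fifth sentence, which the truncation cuts off.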

1 vote

You can use the pandas apply method and do it as follows.

import pandas as pd

string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})

def findKeys(string):
    # Split into lowercase sentences and keep at most the first four;
    # slicing handles articles with fewer than four sentences as well
    string_list = string.strip().lower().split('\n')[:4]
    keywords = ['china', 'covid-19', 'wuhan']

    # True if any of those sentences contains any of the keywords.
    # Note: this matches ANY keyword, which is looser than the
    # "COVID-19 and (China or Chinese)" condition in the question.
    return any(key in sentence for sentence in string_list for key in keywords)

Then call the pandas apply method on df:

df['Contains Keywords?'] = df['article'].apply(findKeys)
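findKeys above returns True as soon as any one of its keywords appears, whereas the question asks for "COVID-19" together with "China" or "Chinese". A minimal sketch of a stricter check follows; `find_keys_strict` is a hypothetical name, and matching is assumed to be case-insensitive as in findKeys.

```python
def find_keys_strict(text, n_sentences=4):
    """True if the first n_sentences contain 'covid-19' and 'china' or 'chinese'."""
    # Lowercase, drop surrounding whitespace, keep only the leading sentences
    head = "\n".join(text.strip().lower().split("\n")[:n_sentences])
    return "covid-19" in head and ("china" in head or "chinese" in head)


both = "\nChina may be past the worst of the COVID-19 pandemic.\nMore text."
wuhan_only = "\nWorkers in Wuhan must take a test.\nMore text."
too_late = "\nOne.\nTwo.\nThree.\nFour.\nCOVID-19 in China."

results = [find_keys_strict(t) for t in (both, wuhan_only, too_late)]
```

It can be passed to apply in the same way, for example `df['article'].apply(find_keys_strict)`.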

0 votes

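First, I create a Series containing only the first four sentences of the original df['article'] column, converted to lowercase on the assumption that the search should be case-insensitive. The leading '\n' is stripped first; otherwise the first piece returned by split would be an empty string and only three sentences would be kept.

articles = df['article'].apply(lambda x: "\n".join(x.strip("\n").split("\n", maxsplit=4)[:4])).str.lower()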

Then use a simple boolean mask to keep only the rows whose first four sentences contain the keywords.

df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
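As a runnable illustration of the mask approach, the sketch below builds the lowercase "first four sentences" Series with vectorized `.str` methods and combines three `str.contains` masks. The toy articles are made up, and `regex=False` is my own choice so that the keywords are treated as plain substrings rather than regular expressions.

```python
import pandas as pd

# Toy data (assumed for illustration): each sentence is preceded by '\n'
df = pd.DataFrame({"article": [
    "\nChina may be past the worst of the COVID-19 pandemic.\nSecond.",
    "\nChinese officials met.\nNo virus mention.\nThree.\nFour.\nCOVID-19 appears too late.",
    "\nCOVID-19 cases rise in Italy.\nTwo.",
]})

# First four sentences of each article, lowercased
head = (
    df["article"]
    .str.strip("\n")      # drop the leading newline
    .str.split("\n")      # list of sentences per row
    .str[:4]              # keep at most four of them
    .str.join("\n")
    .str.lower()
)

has_covid = head.str.contains("covid-19", regex=False)
has_china = head.str.contains("china", regex=False)
has_chinese = head.str.contains("chinese", regex=False)

# Parentheses are required: & binds more tightly than |
mask = has_covid & (has_china | has_chinese)
filtered_df = df[mask]
```

Only the first row passes: the second mentions COVID-19 too late, and the third never mentions China.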

0 votes

Here you go.

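found = []
# Placeholder keywords; substitute the real ones from the question
s1 = "hello"
s2 = "good"
s3 = "great"
# Iterate over the article strings in the DataFrame column
for string in df['article']:
    # Keep the string if it contains s1 and at least one of s2 or s3.
    # Note: this checks the whole article, not only the first four sentences.
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)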