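I am trying to do some preprocessing on my dataset. Specifically, I am trying to remove paywall language from the text (the subscriber notice at the end of the sample below), but I keep getting an empty string as my output.
Here is the sample text:
“Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.”
And here is my custom function: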
import re
import string
import nltk
from nltk.corpus import stopwords

# function to detect paywall-related text
def detect_paywall(text):
    paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
    for keyword in paywall_keywords:
        if re.search(r'\b{}\b'.format(keyword), text, flags=re.IGNORECASE):
            return True
    return False

# function for text preprocessing
def preprocess_text(text):
    # Check if the text contains paywall-related content
    if detect_paywall(text):
        # Remove paywall-related sentences or language from the text
        sentences = nltk.sent_tokenize(text)
        cleaned_sentences = [sentence for sentence in sentences if not detect_paywall(sentence)]
        cleaned_text = ' '.join(cleaned_sentences)
        return cleaned_text.strip()  # Remove leading/trailing whitespace

    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in stripped if word.isalpha() and word not in stop_words]
    return ' '.join(words)
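I have tried modifying the list of words to detect, to no avail. I did find that removing "subscribers" from the list removes the second sentence of the paywall language, but that is not ideal because the other half is still left behind.
The function is also inconsistent: it works on the text below (it removes the paywall language) but not on the text above.
Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.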
Given the input:
import re
text = "Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription."
text
paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
Build the filter pattern:
patt = re.compile('|'.join(['.*' + e for e in paywall_keywords]))
'.*login|.*subscription|.*purchase a subscription|.*subscribers'
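Because each alternative starts with `.*`, calling `patt.match` here behaves like an unanchored search within a single line. As a minimal sketch of an equivalent pattern, assuming you also want the keywords escaped, matched as whole words, and matched case-insensitively (none of which the pattern above does; `paywall_re` is a hypothetical name):

```python
import re

paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]

# Escape each keyword so regex metacharacters are matched literally,
# then wrap the alternation in word boundaries.
paywall_re = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in paywall_keywords) + r")\b",
    flags=re.IGNORECASE,
)

# search() scans the whole string, so no leading '.*' is needed.
print(bool(paywall_re.search("Please Login here to access content")))  # True
print(bool(paywall_re.search("He won a state title")))                 # False
```

With this variant you would filter with `paywall_re.search` instead of `patt.match`.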
Split the text into sentences:
phrases = text.split(sep='.')
['Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title',
' {{Elided}} is part of that percentage',
' The Richmond junior joined that group by winning… Premium Content is available to subscribers only',
' Please login here to access content or go here to purchase a subscription',
'']
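Note that splitting on `'.'` does not treat the `…` after "winning" as a boundary, so that sentence stays glued to the paywall notice. One possible alternative, offered only as a sketch, is a regex split that also breaks after the ellipsis character. It assumes sentences end in `.`, `!`, `?` or `…` followed by whitespace, so abbreviations such as "Dr." would be split incorrectly:

```python
import re

text = "Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription."

# Split on whitespace that follows sentence-ending punctuation.
# The lookbehind keeps the punctuation attached to its sentence.
sentences = re.split(r"(?<=[.!?…])\s+", text)

for s in sentences:
    print(repr(s))
```

This yields five sentences, with "Premium Content is available to subscribers only." as a separate item.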
Find the hits:
found = list(filter(patt.match, phrases))
[' The Richmond junior joined that group by winning… Premium Content is available to subscribers only',
' Please login here to access content or go here to purchase a subscription']
Remove these and rebuild the text:
'.'.join([p for p in phrases if p not in found])
'Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage.'
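The steps above can be wrapped into a single helper. The following is a sketch rather than a definitive implementation: `strip_paywall` is a hypothetical name, and it assumes a regex sentence split that treats `…` as a boundary plus case-insensitive whole-word keyword matching, both of which differ from the steps shown above.

```python
import re

def strip_paywall(text, keywords):
    """Return text with every sentence that mentions a keyword removed."""
    # Case-insensitive, whole-word match on any keyword (assumption).
    keyword_re = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        flags=re.IGNORECASE,
    )
    # Split after '.', '!', '?' or '…' followed by whitespace (assumption).
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    kept = [s for s in sentences if not keyword_re.search(s)]
    return " ".join(kept)

text = "Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription."
paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]

cleaned = strip_paywall(text, paywall_keywords)
print(cleaned)
```

Unlike the `'.'`-based split, this keeps "The Richmond junior joined that group by winning…" and drops only the two paywall sentences.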