I am trying to extract text from general news articles, but I am new to web scraping, so I am not sure how to resolve this NameError: Deprecated argument: use output_format instead, e.g. output_format="xml".
import json

import numpy as np
import requests
import trafilatura
from bs4 import BeautifulSoup
from requests.exceptions import MissingSchema


def beautifulsoup_extract_text_fallback(response_content):
    '''
    This is a fallback function, so that we can always return a value for text content,
    even when Trafilatura is unable to extract the text from a single URL.
    '''
    # Create the BeautifulSoup object:
    soup = BeautifulSoup(response_content, 'html.parser')
    # Find every text node in the document:
    text = soup.find_all(text=True)
    # Remove unwanted tag elements:
    cleaned_text = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head',
        'input',
        'script',
        'style',
    ]
    # Loop over every item in the extracted text and keep it only if its
    # parent tag is NOT in the blacklist:
    for item in text:
        if item.parent.name not in blacklist:
            cleaned_text += '{} '.format(item)
    # Remove any tab separation and strip the text:
    cleaned_text = cleaned_text.replace('\t', '')
    return cleaned_text.strip()
def extract_text_from_single_web_page(url):
    downloaded_url = trafilatura.fetch_url(url)
    try:
        a = trafilatura.extract(downloaded_url, xml_output = True, with_metadata=True, include_comments = False, json_output=True,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    except AttributeError:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    if a:
        json_output = json.loads(a)
        return json_output['text']
    else:
        try:
            resp = requests.get(url)
            # We will only extract the text from successful requests:
            if resp.status_code == 200:
                return beautifulsoup_extract_text_fallback(resp.content)
            else:
                # This branch handles failures in both the Trafilatura and BeautifulSoup4 functions:
                return np.nan
        # Handle any URLs that don't have the correct protocol:
        except MissingSchema:
            return np.nan
Another code block, which tries the function on a few URLs:
single_url = "https://abcnews.go.com/International/wireStory/strong-earthquake-shakes-papua-new-guinea-98688556" #'https://nypost.com/2022/03/02/massive-magnitude-5-0-earthquake-shakes-alaska/'
text = extract_text_from_single_web_page(url=single_url)
print(text)
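I tried adding the xml_output = True argument to trafilatura.extract(), and I also added an output_format.xml file to my directory (I am not sure whether that is relevant).
I am not sure what is going on here, and I would appreciate any suggestions for getting this function to work with my URLs.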
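For reference, the error message itself hints at the likely cause: the installed trafilatura version appears to reject the older xml_output and json_output keyword arguments and expects a single output_format argument instead. Note also that the except AttributeError branch would not catch a NameError. Below is a minimal sketch of what the call might look like, assuming the installed version accepts output_format="json". The names EXTRACT_KWARGS, text_from_json_result and extract_text_with_output_format are hypothetical and introduced only for illustration.

```python
import json

# Keyword arguments for trafilatura.extract(). The output format is selected
# with output_format rather than the xml_output / json_output flags.
# Assumption: the installed trafilatura version accepts output_format="json".
EXTRACT_KWARGS = {
    "output_format": "json",
    "with_metadata": True,
    "include_comments": False,
    "date_extraction_params": {"extensive_search": True, "original_date": True},
}


def text_from_json_result(result):
    """Return the 'text' field of a JSON string, or None if there is no result."""
    if not result:
        return None
    return json.loads(result).get("text")


def extract_text_with_output_format(url):
    """Hypothetical replacement for the trafilatura part of the function above."""
    # Third-party import kept local so the helper above can be used without it.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download fails
    if downloaded is None:
        return None
    result = trafilatura.extract(downloaded, **EXTRACT_KWARGS)
    return text_from_json_result(result)
```

If this returns None, the existing requests and BeautifulSoup fallback can still be used as before. The output_format.xml file in the directory is probably not needed, since output_format is a keyword argument of trafilatura.extract() and not a file name.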