NameError: Deprecated argument: use output_format instead, e.g. output_format="xml"


I am trying to extract text from general news articles, but I am new to web crawling, so I am not sure how to resolve this error: NameError: Deprecated argument: use output_format instead, e.g. output_format="xml".

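import json

import numpy as np
import requests
import trafilatura
from bs4 import BeautifulSoup
from requests.exceptions import MissingSchema


def beautifulsoup_extract_text_fallback(response_content):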
    
    '''
    This is a fallback function, so that we can still return text content
    when Trafilatura is unable to extract the text from a single URL.
    '''
    
    # Create the beautifulsoup object:
    soup = BeautifulSoup(response_content, 'html.parser')
    
    # Finding the text:
    text = soup.find_all(string=True)
    
    # Remove unwanted tag elements:
    cleaned_text = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head', 
        'input',
        'script',
        'style',]

    # Loop over every extracted text node and keep it only if its parent tag
    # is NOT in the blacklist:
    for item in text:
        if item.parent.name not in blacklist:
            cleaned_text += '{} '.format(item)
            
    # Remove any tab separation and strip the text:
    cleaned_text = cleaned_text.replace('\t', '')
    return cleaned_text.strip()
    

def extract_text_from_single_web_page(url):
    
    downloaded_url = trafilatura.fetch_url(url)
    try:
        a = trafilatura.extract(downloaded_url,  xml_output = True, with_metadata=True, include_comments = False, json_output=True,
                            date_extraction_params={'extensive_search': True, 'original_date': True})
    except AttributeError:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
                            date_extraction_params={'extensive_search': True, 'original_date': True})
    if a:
        json_output = json.loads(a)
        return json_output['text']
    else:
        try:
            resp = requests.get(url)
            # We will only extract the text from successful requests:
            if resp.status_code == 200:
                return beautifulsoup_extract_text_fallback(resp.content)
            else:
                # Handle failures in both the Trafilatura and BeautifulSoup4 functions:
                return np.nan
        # Handling for any URLs that don't have the correct protocol
        except MissingSchema:
            return np.nan

Another code block that tries the function on a URL:

single_url = "https://abcnews.go.com/International/wireStory/strong-earthquake-shakes-papua-new-guinea-98688556" #'https://nypost.com/2022/03/02/massive-magnitude-5-0-earthquake-shakes-alaska/'
text = extract_text_from_single_web_page(url=single_url)
print(text)

[Image: screenshot of the error message]

I tried adding xml_output = True to trafilatura.extract(), and I also added an output_format.xml file to my directory (not sure whether that is relevant).

I am not sure what is happening here. Any suggestions for getting my set of URLs to run through this function would be appreciated.
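
For reference, the error text suggests that the installed Trafilatura release no longer accepts the boolean flags xml_output and json_output, and expects the output_format keyword argument instead. The `except AttributeError` branch passes json_output=True as well, so it would likely hit the same NameError. output_format is a keyword argument of trafilatura.extract(), not a file, so an output_format.xml file in the directory should have no effect. Below is a minimal sketch of the extraction step under those assumptions. The function name extract_main_text and the extract_fn parameter are hypothetical, added only so the logic can be exercised without a network connection. The remaining arguments are copied from the code in the question.

```python
import json


def extract_main_text(html, extract_fn=None):
    """Return the main text of an HTML document, or None if nothing is extracted.

    extract_fn is a hypothetical injection point for testing; when it is
    omitted, trafilatura.extract is used.
    """
    if extract_fn is None:
        import trafilatura  # imported lazily so the sketch loads without it
        extract_fn = trafilatura.extract

    # trafilatura.fetch_url() returns None when the download fails.
    if html is None:
        return None

    result = extract_fn(
        html,
        output_format="json",  # replaces xml_output=True / json_output=True
        with_metadata=True,
        include_comments=False,
        date_extraction_params={"extensive_search": True, "original_date": True},
    )
    if not result:
        return None

    # The JSON string is parsed the same way as in the question.
    return json.loads(result)["text"]
```

In extract_text_from_single_web_page, this would take the place of the try/except block: call extract_main_text(trafilatura.fetch_url(url)) and fall back to the BeautifulSoup function when it returns None. JSON is chosen rather than XML here because the result is parsed with json.loads.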

beautifulsoup xml-parsing web-crawler text-extraction google-news