Troubleshooting Reliable Output in Newspaper3k

Question

Current Behavior:

When using the news-aggregator package Newspaper3k, I cannot produce consistent, reliable output.

System/Environment Setup:

Windows 10
Miniconda3 4.5.12
Python 3.7.1
Newspaper3k 0.2.8

Steps (Code) to Reproduce:

import newspaper

cnn_paper = newspaper.build('http://cnn.com')
print(cnn_paper.size())

Expected Behavior/Output (varies based on current links posted on cnn):

Consecutive runs should print a consistent count of the links currently published on cnn.

Actual Behavior/Output

The first run of the code produces a different number of links than a run made immediately afterwards.

1st Run Print output: 94 (as of time of posting this question)
2nd Run Print output: 0 
3rd Run Print output: 18
4th Run Print output: 7

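Printing the actual links shows the same kind of variation as the link counts above. I have tried several different news sources and get the same unexpected variance. Do I need to change my User-Agent header? Is this a detection issue? How do I produce reliable results?

Any help would be much appreciated.

Thanks.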

Answer 1

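I solved my problem by better understanding the default caching described under the heading 6.1.3 Article caching in the user documentation.

Beyond my general ignorance, my confusion came from the fact that the Read the Docs 'Documentation' listed the caching function as a TODO.

On closer review, I found:

By default, newspaper caches all previously extracted articles and eliminates any article which it has already extracted. This feature exists to prevent duplicate articles and to increase extraction speed.

The return value of cbs_paper.size() changes from 1030 to 2 because when we first crawled cbs we found 1030 articles. However, on our second crawl, we eliminate all articles which have already been crawled. This means 2 new articles have been published since our first extraction.

You may opt out of this feature with the memoize_articles parameter. You may also pass in the lower-level "Config" objects covered in the advanced section.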

>>> import newspaper
>>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
>>> cbs_paper.size()
1030
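
To make the effect of the cache concrete, here is a minimal standard-library sketch of the memoization behaviour described above. It is not Newspaper3k's actual implementation: the `build_source` function, the `seen_urls` set, and the example.com URLs are all hypothetical stand-ins, and the in-memory set merely plays the role of whatever cache the library keeps between runs.

```python
def build_source(current_urls, seen_urls, memoize_articles=True):
    """Return the URLs a crawl would report (hypothetical stand-in for a build call).

    With memoization on, URLs already recorded in seen_urls are dropped,
    and any newly found URLs are added to the cache.
    """
    if not memoize_articles:
        # Opting out: report everything currently on the site, ignore the cache.
        return list(current_urls)
    fresh = [url for url in current_urls if url not in seen_urls]
    seen_urls.update(fresh)
    return fresh


# Made-up site with five articles and an empty cache.
site = [f"https://example.com/story-{i}" for i in range(5)]
cache = set()

first_run = build_source(site, cache)        # everything is new
site.append("https://example.com/story-5")   # one article published in between
second_run = build_source(site, cache)       # only the new article survives
uncached_run = build_source(site, cache, memoize_articles=False)

print(len(first_run), len(second_run), len(uncached_run))  # prints: 5 1 6
```

The shrinking counts in the question follow the same pattern: each run reports only what has not been seen before, so a stable count requires turning memoization off.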
