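I am trying to build a web scraper that extracts certain information based on HTML tags and puts it into a dictionary.
I have a first function that scrapes a website and returns a dictionary like this: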
{"Url": "www.test1.de", "Document Title": "test1", "Releaes Date": "January 1, 2020",...}
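My second function takes a list of links as input. It should loop over those links, call the first function on each one, and collect the returned dictionaries in one big dictionary.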
def create_dict(link_list):
    all_data_dict = {}
    count = 0
    for link in link_list:
        # scrape_doc_info returns the dictionary mentioned above
        all_data_dict[count] = scrape_doc_info(link, tag_list, selector_dict)
        print(all_data_dict)
        count += 1
    return all_data_dict
I expected the following result:
all_data_dict = { 0 = {"Url": "www.test1.de", "Document Title": "test1", "Releaes Date": "January 1, 2020",...},
1 = {"Url": "www.test2.de", "Document Title": "test2", "Releaes Date": "January 2, 2022",...},..., 20 = {"Url": "www.test20.de", "Document Title": "test20", "Releaes Date": "January 20, 2200",...}}
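However, my code always overwrites the value of every key with the values from the last link. So if I loop over 20 links, every key ends up holding the values of the last link: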
all_data_dict = { 0 = {"Url": "www.test20.de", "Document Title": "test20", "Releaes Date": "January 20, 2200",...},
1 = {"Url": "www.test20.de", "Document Title": "test20", "Releaes Date": "January 20, 2200",...},..., 20 = {"Url": "www.test20.de", "Document Title": "test20", "Releaes Date": "January 20, 2200",...}}
The console output of the print statement is as follows:
First iteration:
all_data_dict = { 0 = {"Url": "www.test1.de", "Document Title": "test1", "Releaes Date": "January 1, 2020",...}
Second iteration:
all_data_dict = { 0 = {"Url": "www.test2.de", "Document Title": "test2", "Releaes Date": "January 2, 2022",...},
1 = {"Url": "www.test2.de", "Document Title": "test2", "Releaes Date": "January 2, 2022",...}}
20th iteration:
all_data_dict = { 0 = {"Url": "www.test20.de", "Document Title": "test20", "Releaes Date": "January 20, 2200",...},
1 = {"Url": "www.test20.de", "Document Title": "test20", "Releaes Date": "January 20, 2200",...},..., 20 = {"Url": "www.test20.de", "Document Title": "test20", "Releaes Date": "January 20, 2200",...}}
There must be something wrong with your scrape_doc_info function (I am not sure what it might be).
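The following code, which uses the same loop but with ready-made dictionaries as input, produces the result you expect: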
dict1 = {"Url": "www.test1.de", "Document Title": "test1", "Releaes Date": "January 1, 2020"}
dict2 = {"Url": "www.test20.de", "Document Title": "test20", "Releaes Date": "January 20, 2020"}
list_of_dicts = [dict1, dict2]

def create_dict(link_list):
    all_data_dict = {}
    count = 0
    for link in link_list:
        # each item stands in for the dictionary that scrape_doc_info would return
        all_data_dict[count] = link
        count += 1
    return all_data_dict

my_dict = create_dict(list_of_dicts)
print(my_dict)
Console output:
{0: {'Url': 'www.test1.de', 'Document Title': 'test1', 'Releaes Date': 'January 1, 2020'}, 1: {'Url': 'www.test20.de', 'Document Title': 'test20', 'Releaes Date': 'January 20, 2020'}}
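One possible cause that would match the symptom in the question (earlier keys changing on every iteration) is that scrape_doc_info fills and returns the same dictionary object on each call, so every key in all_data_dict points at one shared object. This is only a guess, since the function body was not posted. The sketch below reproduces that behaviour with two hypothetical stand-ins, scrape_shared and scrape_fresh, which are not the real scraper, and passes the scraping function in as a parameter purely for illustration.

```python
# Hypothetical stand-in for scrape_doc_info: it fills ONE module-level dict
# and returns it, so every call hands back the very same object.
shared_result = {}

def scrape_shared(link):
    shared_result["Url"] = link
    shared_result["Document Title"] = link.split(".")[1]
    return shared_result  # same object on every call

def scrape_fresh(link):
    # Builds a brand-new dict on each call, so results stay independent.
    return {"Url": link, "Document Title": link.split(".")[1]}

def create_dict(link_list, scrape):
    # enumerate supplies the running index that `count` tracked by hand.
    return {i: scrape(link) for i, link in enumerate(link_list)}

links = ["www.test1.de", "www.test2.de"]
buggy = create_dict(links, scrape_shared)
fixed = create_dict(links, scrape_fresh)

print(buggy[0] is buggy[1])  # True: both keys reference one shared dict
print(buggy)                 # every entry shows the last link's data
print(fixed)                 # each entry keeps its own link's data
```

If this turns out to be the cause and the function itself cannot easily be changed, storing a copy, for example all_data_dict[count] = dict(scrape_doc_info(link, tag_list, selector_dict)), should also keep the entries separate. Note that dict(...) makes a shallow copy, so nested values would need copy.deepcopy.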