Unable to retrieve values from a dictionary after web scraping

Question

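I'm hoping someone here can answer what I believe is a simple question. I'm a complete beginner and have been trying to build an image web scraper for the Archdaily website. Below is my code so far, after numerous debugging attempts: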

#### - Webscraping 0.1 alpha -
#### - Archdaily - 

import requests
from bs4 import BeautifulSoup

# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'

# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content

# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)

These lines here:

img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']

are my attempt to isolate the data-images attribute, whose value is shown here:

My github upload of this portion, very long

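As you can see (or maybe I'm completely wrong here), my attempt to read the 'url_large' values from this final list of dictionaries fails with a TypeError, shown below: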

Traceback (most recent call last):
  File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 23, in <module>
    for k, v in img_list():
TypeError: 'str' object is not callable

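I believe my mistake lies in how I end up isolating data-images, which to me looks like a list of dictionaries, since the data is wrapped in square brackets and curly braces. I'm completely out of my element here, because I went into this project basically blind (I haven't even gotten past chapter 4 of Guttag's book yet).

I also looked around for ideas and tried to imitate what I found. I came across solutions other people had offered that convert the data to JSON, so I wrote the following code: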

jsonData = json.loads(img.attrs['data-images'])
print(jsonData['url_large'])

But that was a bust, as shown here:

Traceback (most recent call last):
  File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 29, in <module>
    print(jsonData['url_large'])
TypeError: list indices must be integers or slices, not str

I'm missing a step in converting these string values, but I'm not sure where to convert them. I hope someone can help me resolve this. Thank you!

Answer 1

This has to do with types.

img_list is actually not a list but a string. You try to call it with img_list(), which causes the error.
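To see the type problem in isolation, here is a minimal sketch that needs no network access. The string below is a made-up stand-in for img.attrs['data-images'] (the example.com URLs are placeholders, not the real Archdaily data); the point is only that an HTML attribute value arrives as text, and text cannot be called.

```python
# Hypothetical stand-in for img.attrs['data-images']; an attribute value is plain text.
img_list = '[{"url_large": "https://example.com/a.jpg"}, {"url_large": "https://example.com/b.jpg"}]'

# It looks like a list of dictionaries, but Python only sees a string.
print(type(img_list).__name__)

try:
    img_list()  # same mistake as "for k, v in img_list():" in the question
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The parentheses after img_list are what trigger the error: they ask Python to call the object as a function.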

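You had the right idea in using json.loads to parse it. The error here is straightforward: jsonData is a list, not a dictionary, because the page has more than one image.

You can iterate over the list. Each item in the list is a dictionary, and each of those dictionaries has a url_large key: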

images_json = img.attrs['data-images']
for image_properties in json.loads(images_json):
    print(image_properties['url_large'])
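The same idea can be tried offline. This is a rough sketch using an invented sample string in place of the real attribute value (the keys url_large and url_small and the example.com URLs are assumptions for illustration); it shows what json.loads returns and why indexing the result with a string key fails.

```python
import json

# Invented sample shaped like the data-images attribute: a JSON array of objects.
images_json = (
    '[{"url_large": "https://example.com/large-1.jpg", "url_small": "https://example.com/small-1.jpg"},'
    ' {"url_large": "https://example.com/large-2.jpg", "url_small": "https://example.com/small-2.jpg"}]'
)

parsed = json.loads(images_json)

# The outer value is a list, so parsed['url_large'] would raise TypeError.
print(type(parsed).__name__)
# Each element of that list is a dictionary, which does accept string keys.
print(type(parsed[0]).__name__)

# Collect the large URL from every dictionary in the list.
large_urls = [image_properties['url_large'] for image_properties in parsed]
print(large_urls)
```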

Answer 2

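Source of the error: img_list is a string. You have to convert it to a list using json.loads, and it then becomes a list of dictionaries that you have to loop over.

Working solution: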

import json
import requests
from bs4 import BeautifulSoup

# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'

# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content

# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
# Use a separate loop variable so the 'img' tag above is not overwritten
for image in json.loads(img_list):
    for k, v in image.items():
        if k == 'url_large':
            print(v)
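If the page layout changes, the div or the attribute may be missing, or some entries may lack url_large. The following is only a defensive sketch under those assumptions; extract_large_urls is a hypothetical helper name, not part of requests, BeautifulSoup, or the answers above.

```python
import json
from typing import List, Optional


def extract_large_urls(raw: Optional[str]) -> List[str]:
    """Return every 'url_large' value found in a data-images style JSON string.

    Hypothetical helper: returns an empty list instead of raising when the
    attribute is missing, is not valid JSON, or is not a JSON array.
    """
    if not raw:
        return []
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(entries, list):
        return []
    # Keep only dictionary entries that actually carry the key.
    return [e['url_large'] for e in entries if isinstance(e, dict) and 'url_large' in e]


# Made-up input: the second entry has no url_large and is skipped.
sample = '[{"url_large": "https://example.com/1.jpg"}, {"url_small": "https://example.com/2.jpg"}]'
print(extract_large_urls(sample))
print(extract_large_urls(None))
print(extract_large_urls('not json'))
```

With the scraper above, it could be called as extract_large_urls(img.get('data-images') if img else None), which also covers the case where soup.find returns None.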
