Get a specific value from HTML using Python BeautifulSoup

Question

I am new to scraping and I am working on a few scraping projects. I am trying to get a value from the HTML below:

<div class="buttons_zoom"><div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="לחם אחיד פרוס אנג'ל 750 גרם - פרטים נוספים"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>

I want to get this value: 379104, which is located in the onclick attribute. I am using BeautifulSoup with this code:

 for i in page_content.find_all('div', attrs={'class':'prodPrice'}):
            temp = i.parent.parent.contents[0]

temp returns a list of objects, and temp is equal to the HTML above. Can someone help me extract this id? Thanks!!

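Edit: Wow guys, thanks for the amazing explanations! But I have 2 problems. 1. The retry mechanism is not working. I set timeout=1 to make it fail, but once it fails it returns: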

requests.exceptions.RetryError: HTTPSConnectionPool(host='www.XXXXX.il', port=443): Max retries exceeded with url: /default.asp?catid=%7B2234C62C-BD68-4641-ABF4-3C225D7E3D81%7D (Caused by ResponseError('too many redirects',))

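Can you help me fix the retry mechanism code below? 2. Performance: when I set timeout=6, without the retry mechanism, the scrape takes 15 minutes. How can I improve the code's performance? The code is below: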

def get_items(self, dict):
    itemdict = {}
    for k, v in dict.items():
        boolean = True
        # here, we fetch the content from the url, using the requests library
        while (boolean):
            try:
                a = requests.Session()
                retries = Retry(total=3, backoff_factor=0.1, status_forcelist=[301, 500, 502, 503, 504])
                a.mount('https://', HTTPAdapter(max_retries=retries))
                page_response = a.get('https://www.XXXXXXX.il' + v, timeout=1)
            except requests.exceptions.Timeout:
                print("Timeout occurred")
                logging.basicConfig(level=logging.DEBUG)
            else:
                boolean = False

        # we use the html parser to parse the url content and store it in a variable.
        page_content = BeautifulSoup(page_response.content, "html.parser")
        for i in page_content.find_all('div', attrs={'class': 'prodPrice'}):
            parent = i.parent.parent.contents[0]
            getparentfunc = parent.find("a", attrs={"href": "javascript:void(0)"})
            itemid = re.search(".*'(\d+)'.*", getparentfunc.attrs['onclick']).groups()[0]
            itemName = re.sub(r'\W+', ' ', i.parent.contents[0].text)
            priceitem = re.sub(r'[\D.]+ ', ' ', i.text)
            itemdict[itemid] = [itemName, priceitem]

Answer 1

from bs4 import BeautifulSoup as bs
import re

txt = """<div class="buttons_zoom"><div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="לחם אחיד פרוס אנג'ל 750 גרם - פרטים נוספים"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>"""

soup = bs(txt,'html.parser')
a = soup.find("a", attrs={"href":"javascript:void(0)"})
data = a.attrs["onclick"] # the onclick attribute holds the id we are after
r = re.search(r".*'(\d+)'.*", data).groups()[0]
print(r) # will print '379104'
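
The one-liner above chains .groups() directly onto re.search, which raises AttributeError when the pattern does not match, because re.search then returns None. A minimal sketch of a more defensive variant follows; the helper name safe_extract_id and the second sample string are assumptions made for illustration, not part of the answer.

```python
import re

def safe_extract_id(onclick):
    """Return the quoted run of digits captured by the pattern, or None."""
    match = re.search(r".*'(\d+)'.*", onclick)
    # re.search returns None when the pattern does not match,
    # so check before asking for the captured group.
    return match.group(1) if match else None

found = safe_extract_id(
    "js:getProdID('https://www.XXXXXXX.co.il',"
    "'{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"
)
not_found = safe_extract_id("js:void(0)")  # hypothetical onclick value without an id

print(found, not_found)
```

In a scraping loop this lets you skip or log tags whose onclick value has an unexpected shape instead of aborting the whole run.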

Edit

Replaced ".*\}.*,.*'(\d+)'\).*" with ".*'(\d+)'.*". They produce the same result, but the latter is cleaner.
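
To check that the two patterns agree on the onclick value from the question, here is a small sketch using only the re module. The variable names strict and loose are mine.

```python
import re

# The onclick value from the question's HTML.
data = ("js:getProdID('https://www.XXXXXXX.co.il',"
        "'{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')")

# The longer pattern anchors on the closing brace, the comma and the ")".
strict = re.search(r".*\}.*,.*'(\d+)'\).*", data)
# The shorter pattern only requires digits wrapped in single quotes.
loose = re.search(r".*'(\d+)'.*", data)

print(strict.group(1), loose.group(1))
```

Because the leading .* is greedy, the shorter pattern picks the last quoted run of digits in the string. That is the id here, but it is worth keeping in mind if other arguments could also be purely numeric.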

Explanation : Soup

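This finds the (first) element with an <a> tag whose "href" attribute has "javascript:void(0)" as its value (see the Beautiful Soup documentation on keyword arguments for more).

a = soup.find("a", attrs={"href":"javascript:void(0)"})

This is equivalent to

a = soup.find("a", href="javascript:void(0)")

"In older versions of Beautiful Soup, which don't have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for "class" is the string (or regular expression, or whatever) you want to search for." - see the Beautiful Soup documentation about "attrs"

a points to an element of type <class 'bs4.element.Tag'>. We can access the tag's attributes through the property a.attrs, just as we would with a dictionary (see the Beautiful Soup documentation on attributes for more). That is what we do in the following statement.

a_tag_attributes = a.attrs # that's the dictionary of attributes in question...

The dictionary keys are named after the tag's attributes. Here we have the following keys/attribute names: 'href', 'onclick' and 'title'. We can check this for ourselves by printing them.

print(a_tag_attributes.keys()) # equivalent to print(a.attrs.keys())

This will output

dict_keys(['href', 'onclick', 'title']) # those are the attribute names (the keys to our dictionary)

From here, we need to get the data we are interested in. The key to our data is "onclick" (it is named after the HTML attribute where the data we seek lies).

data = a_tag_attributes["onclick"] # equivalent to data = a.attrs["onclick"]

data now holds the following string.

"js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"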
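
The attribute access described above can be tried out with the sketch below. It assumes bs4 is installed and uses a shortened copy of the question's HTML (the title and alt texts are replaced with placeholders); the "data-missing" attribute is hypothetical and is there only to show what happens when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML; title/alt texts are placeholders.
html = """<div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="details"><img alt="details"/></a></div>"""

soup = BeautifulSoup(html, "html.parser")

# Both calls locate the same tag; the keyword form is shorthand for attrs={...}.
a_attrs = soup.find("a", attrs={"href": "javascript:void(0)"})
a_kwarg = soup.find("a", href="javascript:void(0)")

# A Tag supports dictionary-style access to its attributes.
onclick_by_index = a_attrs["onclick"]    # raises KeyError if the attribute is absent
onclick_by_get = a_kwarg.get("onclick")  # returns None if the attribute is absent
missing = a_attrs.get("data-missing")    # hypothetical attribute, not in the HTML

print(onclick_by_index == onclick_by_get)
print(missing)
```

Using .get() is a reasonable choice when some of the tags on a page may lack the attribute, since it avoids an exception in the middle of a loop.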

Explanation : Regex

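Now that we have isolated the piece that contains the data we want, we are going to extract just the portion we need. We will do so by using a regular expression (regex).

To use regular expressions in Python we must import the regex module re (see the Python documentation for the re module for more).

import re

Regex lets us search for a string that matches a pattern.

Here the string is our data (the onclick value shown above), and the pattern is ".*'(\d+)'.*" (which is also a string, as you can tell from the double quotes).

You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is ^.*\.txt$.

It is best to read up on regular expressions to understand them further; a quick-start guide is a good place to begin.

Here we search for a string. We describe the string as having none or an unlimited number of characters. Those characters are followed by some digits (at least one) enclosed in single quotes. Then we have some more characters.

The parentheses are used to extract a group (called a capture in regex); we capture just the part that is a number.

"By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternation to part of the regex. Only parentheses can be used for grouping. Square brackets define a character class, and curly braces are used by a quantifier with specific limits." - Use Parentheses for Grouping and Capturing

r = re.search(".*'(\d+)'.*", data)

Defining the symbols:

.* matches any character (except line terminators); the * means there can be none or an unlimited number of them
' matches the character '
\d+ matches at least one digit (equal to [0-9]); that is the part we capture
(\d+) is a capturing group; it captures the part of the string where a digit is repeated at least once
() are used for capturing; the part that matches the pattern within the parentheses is saved.

The captured part (if any) can later be accessed by calling r.groups() on the result of re.search (r refers to the result of the re.search function call). This returns a tuple containing what was captured. If nothing matches, re.search itself returns None instead of a match object.

In our case the first (and only) item of the tuple is the digits...

captured_group = r.groups() # that's the tuple containing our data (we captured...)

We can now access our data, which is at the first index of the tuple (we only captured one group)

print(captured_group[0]) # this will print out '379104'

Answer 2

Both solutions below assume the onclick attribute has a regular/consistent structure.

If there can only be one match, then something like the following.

If there is more than one match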
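
A minimal sketch of the two cases, not the answerer's original code. It assumes bs4 is installed and that the id is always the last quoted argument of getProdID(...). The second link and its id are made up for illustration, and the names ID_PATTERN and extract_id are mine.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment with two product links; the second one is made up.
html = """
<a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')">first</a>
<a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{00000000-0000-0000-0000-000000000000}','123456')">second</a>
"""

# Assumes the id is the last quoted argument, right before the closing ")".
ID_PATTERN = re.compile(r"'(\d+)'\)\s*$")

def extract_id(onclick):
    """Return the trailing numeric id from an onclick value, or None."""
    match = ID_PATTERN.search(onclick)
    return match.group(1) if match else None

soup = BeautifulSoup(html, "html.parser")

# Single match: take the first <a> tag that has an onclick attribute.
first_tag = soup.find("a", onclick=True)
single_id = extract_id(first_tag["onclick"]) if first_tag else None

# Several matches: loop over every <a> tag that has an onclick attribute.
all_ids = [extract_id(tag["onclick"]) for tag in soup.find_all("a", onclick=True)]

print(single_id)
print(all_ids)
```

Passing onclick=True restricts the search to tags that actually carry the attribute, so the loop never indexes a tag without it.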
