re.findall doesn't find everything, only some of the matches. How is that possible?


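I have a text file containing the URLs of five websites. Each site contains several Amazon links, and my goal is to collect all of them. However, one of the five sites uses "amzn.to" instead of "amazon.com" for its Amazon links, which I initially thought I could handle with just this: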

any(re.findall(r'(amazon.com|amzn.to)', str, re.IGNORECASE))

My full list of Amazon links should include ten amzn.to links, but only two are found.

Here is my full code:

import requests
import re
from bs4 import BeautifulSoup
from collections import OrderedDict

file_name = raw_input("Enter file name: ")
filepath = "%s"%(file_name)

with open(filepath) as f:
    listoflinks = [line.rstrip('\n') for line in f]

raw_links = []
for i in listoflinks:
    html = requests.get(i).text
    bs = BeautifulSoup(html)
    possible_links = bs.find_all('a')
    for link in possible_links:
        if link.has_attr('href'):
            raw_links.append(link.attrs['href'])

amazon_links = []
for str in raw_links:
    if (any(re.findall(r'(amazon.com|amzn.to)', str, re.IGNORECASE))) and (str not in amazon_links):
        amazon_links.append(str)

for i in amazon_links:
    print i
print len(amazon_links)

I know it works, but not as well as I would like. Please help me find out what is going wrong.
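A side note on the pattern itself, as a minimal Python 3 sketch: in `r'(amazon.com|amzn.to)'` the unescaped `.` matches any character, and `any(re.findall(...))` can be replaced by a single `re.search`. The names `AMAZON_RE`, `LOOSE_RE` and `is_amazon_link`, and the third sample URL, are made up for illustration. Both patterns match every href that contains "amzn.to", so the filter on its own should not drop such links; any that are missing are either duplicates removed by the `not in amazon_links` check or were never in `raw_links`.

```python
import re

# Escaped dots match a literal "." only.
AMAZON_RE = re.compile(r'amazon\.com|amzn\.to', re.IGNORECASE)
# The pattern from the question: a bare "." matches any character.
LOOSE_RE = re.compile(r'amazon.com|amzn.to', re.IGNORECASE)


def is_amazon_link(href):
    """Return True if href contains amazon.com or amzn.to (case-insensitive)."""
    return AMAZON_RE.search(href) is not None


samples = [
    "https://amzn.to/2N9WWyn",
    "https://www.AMAZON.com/dp/B006J23OGU",
    "https://example.com/amazonXcom-review",  # hypothetical URL, not from the question
]
for href in samples:
    # Print the strict result next to the loose one to see where they differ.
    print(href, is_amazon_link(href), LOOSE_RE.search(href) is not None)
```

Only the third, hypothetical URL is treated differently by the two patterns: the loose one accepts it, the escaped one does not.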

python regex web-scraping beautifulsoup
1 Answer

A solution using simplified_scrapy.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('https://www.shifu.com/best-shower-curtain-rods/')
doc = SimplifiedDoc(html)
amazon_links = doc.getElements('a')
amazon_links = amazon_links.containsOr(['amazon.com','amzn.to'],attr='href')
print ([a.href for a in amazon_links])

Result:

['https://www.amazon.com/InterDesign-Constant-Tension-Shower-Curtain/dp/B006J23OGU/ref=as_li_ss_il?ie=UTF8&qid=1531507667&sr=8-1-spons&keywords=InterDesign+Cameo+Constant+Tension+Shower+Curtain+Rod&th=1&linkCode=li2&tag=shifu02-20&linkId=9cb3c83107c687168b9c74469d907a6a', 'https://amzn.to/2N9WWyn', 'https://www.amazon.com/Bath-Bliss-Expandable-72-inch-Curtain/dp/B00VMTKHBU/ref=as_li_ss_il?s=aps&ie=UTF8&qid=1531508002&sr=1-1-catcorr&keywords=Bath+Bliss+Expandable+42+to+72-inch+Curved+Shower+Curtain+Rod&linkCode=li2&tag=shifu02-20&linkId=3aabac324a48f7eeeae9a1b329d92f6f', 'https://amzn.to/2zC1rQh', 'https://www.amazon.com/Zenna-Home-35633SSP-NeverRust-Aluminum/dp/B00JVG5NMY/ref=as_li_ss_il?s=hi&ie=UTF8&qid=1531508744&sr=1-1&keywords=Zenna+Home+35633SSP,+NeverRust+Aluminum+Tension+Curved+Shower+Curtain+Rod&dpID=31nFuJj2r4L&preST=_SY300_QL70_&dpSrc=srch&linkCode=li2&tag=shifu02-20&linkId=d9c411023c0c33b4eef978b69932a649', 'https://amzn.to/2NeG4qg',
... and so on.

You can find examples for SimplifiedDoc here.
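For comparison, the same idea (collect every href, keep the ones that point to Amazon) can be sketched in Python 3 with the standard library only. This is an illustration under stated assumptions, not the method used above: `LinkCollector`, `is_amazon_host`, `amazon_links` and the sample HTML are made up here, and links are matched by host name rather than by substring, so an href without a scheme such as "amzn.to/abc" would be skipped. It also only sees links present in the HTML string it is given.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hosts treated as Amazon; taken from the two strings used in the question.
AMAZON_HOSTS = ("amazon.com", "amzn.to")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_amazon_host(href):
    """True if the URL's host is one of AMAZON_HOSTS or a subdomain of one."""
    host = urlparse(href).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AMAZON_HOSTS)


def amazon_links(html):
    """Return the unique Amazon hrefs found in html, keeping first-seen order."""
    collector = LinkCollector()
    collector.feed(html)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(h for h in collector.hrefs if is_amazon_host(h)))


# Hypothetical page fragment, for illustration only.
sample_html = """
<a href="https://amzn.to/2N9WWyn">A</a>
<a href="https://www.amazon.com/dp/B006J23OGU">B</a>
<a href="/about">C</a>
<a href="https://amzn.to/2N9WWyn">duplicate</a>
<a name="x">no href</a>
"""
print(amazon_links(sample_html))
```

Printing `len(collector.hrefs)` per page before filtering is one way to check whether the expected amzn.to links are present in the downloaded HTML at all.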
