In Python, what should I add to read URLs from my text file or my XML file, which contains a list of URLs?


I have the code below, and it works fine for a single link. It stores the values (availableOffers, otherpricess, currentprice, page_url) in the file prices.csv.

My problems are: First: I don't know what to write so that the URLs are read from my text file or my XML file, instead of the single URL hard-coded in this code.

from bs4 import BeautifulSoup as soup  
from urllib.request import urlopen as uReq  

page_url = "XXXXXXXXX"


uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()


availableOffers = page_soup.find("input", {"id": "availableOffers"})["value"]
otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
currentprice = page_soup.find("div", {"class": "is"}).text.strip().replace("$", "")


out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice,page_url \n"

f = open(out_filename, "w")
f.write(headers)


f.write(availableOffers + ", " + otherpricess + ", " + currentprice + ", " + page_url + "\n")

f.close()  

Second problem: when a URL has no value for otherpricess, I get this error message:

line 13, in <module> 
otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
AttributeError: 'NoneType' object has no attribute 'text'

How can I get around this error and tell the code to keep working even if a value is missing?

Thanks

python
1 Answer

To read the URLs from a text file, open the file with open() (as you already do for writing), but in "r" mode, and iterate over its lines.

For example, suppose you have the following file of URLs, named urls.txt:

http://www.google.com
http://www.yahoo.com

To read the URLs and iterate over them, do the following:

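from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice,page_url\n"

with open(out_filename, "w") as fw, open("urls.txt", "r") as fr:
    fw.write(headers)
    for line in fr:
        url = line.strip()  # strip() removes the trailing '\n'
        if not url:
            continue  # skip blank lines in urls.txt
        print(url)
        uClient = uReq(url)
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()
        # extract the fields exactly as in the question's code
        availableOffers = page_soup.find("input", {"id": "availableOffers"})["value"]
        otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
        currentprice = page_soup.find("div", {"class": "is"}).text.strip().replace("$", "")
        # write one row per URL; note that the loop variable is url, not page_url
        fw.write(availableOffers + ", " + otherpricess + ", " + currentprice + ", " + url + "\n")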
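
The question also mentions an XML file, which the loop above does not cover. Below is a minimal sketch using the standard library's xml.etree.ElementTree. The file layout is an assumption, since the question does not show it: a root element containing one <url> element per link, for example <urls><url>http://www.google.com</url></urls>. The function name read_urls_from_xml and the tag parameter are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


def read_urls_from_xml(xml_path, tag="url"):
    """Return the stripped text of every <tag> element found in the file.

    Assumes a layout such as <urls><url>...</url></urls>; the real file
    may use a different element name, so pass it via `tag`.
    """
    tree = ET.parse(xml_path)
    urls = []
    for element in tree.getroot().iter(tag):
        text = (element.text or "").strip()
        if text:  # skip empty elements
            urls.append(text)
    return urls
```

The returned list can then replace the lines of urls.txt in the loop, e.g. for url in read_urls_from_xml("urls.xml"). If the XML file declares a namespace (as sitemap files usually do), the tag would need to be namespace-qualified for iter() to match it.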

As for your second question, you can check whether page_soup.find("span", {"class": "price"}) is None, and extract the text only if it is not. For example:

price_tag = page_soup.find("span", {"class": "price"})  # look the tag up only once
otherpricess = price_tag.text.replace("$", "") if price_tag is not None else ""
# if the tag is missing, otherpricess is an empty string; you can change it to any other default value.
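
Since the same None check would be needed for every field, one option is to move it into small helpers. This is only a sketch: text_or_default and attr_or_default are hypothetical names, not part of BeautifulSoup, and they assume the argument is either None or a tag object with a .text attribute and a .get() method, which is what page_soup.find() returns.

```python
def text_or_default(tag, default=""):
    """Return the cleaned text of a tag, or `default` if the tag is None."""
    if tag is None:
        return default
    return tag.text.strip().replace("$", "")


def attr_or_default(tag, attr, default=""):
    """Return the tag's attribute, or `default` if the tag or attribute is missing."""
    if tag is None:
        return default
    return tag.get(attr, default)


# Hypothetical usage inside the loop, with page_soup as in the question:
# otherpricess = text_or_default(page_soup.find("span", {"class": "price"}))
# currentprice = text_or_default(page_soup.find("div", {"class": "is"}))
# availableOffers = attr_or_default(
#     page_soup.find("input", {"id": "availableOffers"}), "value")
```

With these helpers, a page that lacks one of the elements produces an empty field in prices.csv instead of raising AttributeError or TypeError.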