In Python, what should I add to get URLs from my text file or my XML file that contains a list of URLs?

Question

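I have the code below, and everything works for a single link. The result is written to the file prices.csv with the values (availableOffers, otherpricess, currentprice, page_url).

My first problem: I don't know what to write so that the URLs are read from my text file or my XML file, instead of the single URL hard-coded in this code.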

from bs4 import BeautifulSoup as soup  
from urllib.request import urlopen as uReq  

page_url = "XXXXXXXXX"


uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()


availableOffers = page_soup.find("input", {"id": "availableOffers"})["value"]
otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
currentprice = page_soup.find("div", {"class": "is"}).text.strip().replace("$", "")


out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice,page_url \n"

f = open(out_filename, "w")
f.write(headers)


f.write(availableOffers + ", " + otherpricess + ", " + currentprice + ", " + page_url + "\n")

f.close()

My second problem: when a URL has no value for otherpricess, I get this error message:

line 13, in <module> 
otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
AttributeError: 'NoneType' object has no attribute 'text'

How can I get past this error and make the code keep working even when a value is missing?

Thanks

Answer 1

To get the URLs from a text file, you can open the file (just as you did for writing) in "r" mode and iterate over its lines.

For example, suppose you have the following file of URLs, named urls.txt:

http://www.google.com
http://www.yahoo.com

To read the URLs and iterate over them, do the following:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice,page_url \n"

with open(out_filename, "w") as fw:
    fw.write(headers)
    with open("urls.txt", "r") as fr:
        for line in fr:
            page_url = line.strip()  # the strip is to remove the trailing '\n'
            if not page_url:
                continue  # skip blank lines in urls.txt
            print(page_url)
            uClient = uReq(page_url)
            page_soup = soup(uClient.read(), "html.parser")
            uClient.close()
            # write the rest of the logic here (extract availableOffers, otherpricess, currentprice)
            # ...
            # write to the output file
            fw.write(availableOffers + ", " + otherpricess + ", " + currentprice + ", " + page_url + "\n")
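
The question also mentions an XML file. Its layout was not shown, so the following is only a sketch under an assumption: each URL is the text of an element named loc, as in a sitemap-style file. The function name read_urls_from_xml, the tag name "loc" and the file name urls.xml are all hypothetical and would need to be adapted to the real file. It uses only the standard library module xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def read_urls_from_xml(path, tag="loc"):
    """Return the stripped text of every element whose local name is `tag`.

    Assumption: the URLs are stored as element text, e.g. <loc>http://...</loc>.
    """
    urls = []
    for elem in ET.parse(path).iter():
        if not isinstance(elem.tag, str):
            continue  # ignore anything that is not a regular element
        # A namespaced tag looks like "{http://...}loc"; keep only the local part.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == tag and elem.text and elem.text.strip():
            urls.append(elem.text.strip())
    return urls


# Possible usage in place of the urls.txt loop (urls.xml is a hypothetical name):
# for page_url in read_urls_from_xml("urls.xml"):
#     ...
```

If the URLs live in an attribute instead of the element text, elem.get("attribute_name") would be the call to use in place of elem.text.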

As for your second question, you can check whether page_soup.find("span", {"class": "price"}) is None, and extract the text only if it is not None. For example:

price_tag = page_soup.find("span", {"class": "price"})  # look the tag up only once
otherpricess = price_tag.text.replace("$", "") if price_tag is not None else ""
# in case there is no value, otherpricess will be an empty string, but you can change it to any other value.
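
Since the same None check is needed for every field that might be missing, it may be convenient to wrap it in small helpers. This is a minimal sketch, not part of the original code: text_or_default and attr_or_default are hypothetical names, and they rely only on the fact that find returns None when nothing matches and that a tag exposes .text and .get.

```python
def text_or_default(tag, default=""):
    """Return tag.text without surrounding whitespace and '$', or `default` if tag is None."""
    if tag is None:
        return default
    return tag.text.strip().replace("$", "")


def attr_or_default(tag, attr, default=""):
    """Return the attribute `attr` of tag, or `default` if the tag or attribute is missing."""
    if tag is None:
        return default
    return tag.get(attr, default)


# Possible usage inside the loop, with page_soup built as in the code above:
# availableOffers = attr_or_default(page_soup.find("input", {"id": "availableOffers"}), "value")
# otherpricess = text_or_default(page_soup.find("span", {"class": "price"}))
# currentprice = text_or_default(page_soup.find("div", {"class": "is"}))
```

With these helpers a page that lacks one of the three values still produces a row in prices.csv, with an empty field where the value would have been.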
