如何使用来自多个URL的Web抓取内容创建CSV文件?

问题描述 投票:0回答:1

我想从webscraped内容创建一个CSV文件。内容来自FinViz.com我想从20个不同的股票中删除该网站20次,并将所有内容输入CSV文件。在我的代码中,我从twitter内容中删除了一系列股票。生成的股票列表与我希望从FinViz.com表格获取信息的列表相同。

这是我的代码:

import csv
import urllib.request
from bs4 import BeautifulSoup

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage,"html.parser")

print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

for url in url_list:

    fpage = urllib.request.urlopen(url)
    fsoup = BeautifulSoup(fpage, 'html.parser')

    # scrape single page and add data to list
    # write datalist
with open('today.csv', 'a') as file:
    writer = csv.writer(file)
# write header row
writer.writerow(map(lambda e : e.text, fsoup.find_all('td', {'class':'snapshot-td2-cp'})))

# write body row
writer.writerow(map(lambda e : e.text, fsoup.find_all('td', {'class':'snapshot-td2'})))

我遇到的麻烦是我的CSV文件只包含列表中最后一项的网页图形数据。相反,我希望整个列表在一系列行中。

这是我的CSV文件的样子:

Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change

-,-,-1.75,7.94%,79.06M,-22.52%,296.48M,-,-1.74,-4.61%,72.41M,-23.16%,-85.70M,-,-0.36,62.00%,3.21%,1.63%,15.10M,19.63,-197.00%,18.05%,2.57,66.67%,-0.65,-,-8.10%,-127.70%,12.17,-6.25%,0.93,4.03,-,146.70%,2.05 - 5.86,3.59%,-,-,-,385.80%,-36.01%,-,-,1.30,-,76.50%,82.93%,0.41,100,1.30,-59.60%,-,36.98,16.13% 9.32%,Yes,-,90.00%,-,0.82,3.63,Yes,-,Nov 08,-,902.43K,3.75,2.30,-22.08%,-10.43%,11.96%,"742,414",3.31%
python csv web-scraping
1个回答
1
投票

最好先打开输出文件,而不是为每个获取的URL继续打开/关闭它。需要异常处理来捕获URL不存在的情况。

同样在输出中,您应该使用newline=''打开文件,以避免将额外的空行写入文件:

import csv
import urllib.request
from bs4 import BeautifulSoup

write_header = True

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage,"html.parser")

print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

with open('today.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    for url in url_list:
        try:
            fpage = urllib.request.urlopen(url)
            fsoup = BeautifulSoup(fpage, 'html.parser')

            # write header row (once)
            if write_header:
                writer.writerow(map(lambda e : e.text, fsoup.find_all('td', {'class':'snapshot-td2-cp'})))
                write_header = False

            # write body row
            writer.writerow(map(lambda e : e.text, fsoup.find_all('td', {'class':'snapshot-td2'})))            
        except urllib.error.HTTPError:
            print("{} - not found".format(url))

所以today.csv会像:

Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-10.85,4.60%,2.36M,11.00%,8.09M,-,-,-62.38%,1.95M,-16.14%,-14.90M,-,-,2.30%,10.00%,-44.42%,0.00M,-,21.80%,-5.24%,3.10,-38.16%,1.46,2.35,-,-155.10%,65.00,-50.47%,-,-,-,-238.40%,2.91 - 11.20,-38.29%,-,-,54.50%,-,-69.37%,1.63,-,2.20,-,-,17.87%,0.36,15,2.20,-,-,39.83,11.38% 10.28%,No,0.00,68.70%,-,1.48,3.30,Yes,0.00,Feb 28 AMC,-,62.76K,3.43,1.00,-5.21%,-25.44%,-37.33%,"93,166",3.94%
-,-,-0.26,1.50%,268.98M,3.72%,2.25B,38.05,0.22,-0.64%,263.68M,-9.12%,-55.50M,-,0.05,-,9.96%,-12.26%,1.06B,2.12,-328.10%,25.95%,2.32,17.72%,12.61,0.66,650.00%,-0.90%,12.64,-38.73%,0.03,264.87,-,-1.90%,6.69 - 15.27,-0.48%,-,-,-28.70%,0.00%,-45.17%,2.20,-,0.70,16.40%,67.80%,25.11%,0.41,477,0.80,71.90%,5.30%,52.71,4.83% 5.00%,Yes,0.80,7.80%,-5.20%,0.96,7.78,Yes,0.80,Feb 27 AMC,-,11.31M,8.37,2.20,0.99%,-1.63%,-4.72%,"10,843,026",7.58%

如果您只希望文件包含一次运行脚本的数据,则不需要附加a,只需使用w

© www.soinside.com 2019 - 2024. All rights reserved.