Beautiful Soup web scraper

Question (votes: -2, answers: 1)

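I am trying to scrape the web page at https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00

I want to scrape the table in the HTML below. I have tried a few things, but I have not managed to get the table into a CSV file. The <tr> tags for the data rows are never closed, so splitting the data into separate rows is a problem.

Thanks for your help --J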

<table border='0' width='900' align='center' cellspacing='1' cellpadding='4'>
                <tr>
                    <td class='innertable_header1' rowspan='3'>Category of shareholder</td>
                    <td class='innertable_header1' rowspan='3'>Nos. of shareholders</td>
                    <td class='innertable_header1' rowspan='3'>No. of fully paid up equity shares held</td>
                    <td class='innertable_header1' rowspan='3'>No. of shares underlying Depository Receipts</td>
                    <td class='innertable_header1' rowspan='3'>Total nos. shares held</td>
                    <td class='innertable_header1' rowspan='3'>Shareholding as a % of total no. of shares (calculated as per SCRR, 1957)As a % of (A+B+C2)</td>
                    <td class='innertable_header1' rowspan='3'> Number of equity shares held in dematerialized form</td>
                </tr>
                <tr></tr>
                <tr></tr>
                <tr>
                    <td class='TTRow_left'>(A) Promoter & Promoter Group</td>
                    <td class='TTRow_right'>19</td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <td class='TTRow_right'></td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <td class='TTRow_right'>12.90</td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <tr>
                        <td class='TTRow_left'>(B) Public</td>
                        <td class='TTRow_right'>9,16,058</td>
                        <td class='TTRow_right'>1,87,81,45,362</td>
                        <td class='TTRow_right'>1,32,95,642</td>
                        <td class='TTRow_right'>1,89,14,41,004</td>
                        <td class='TTRow_right'>86.61</td>
                        <td class='TTRow_right'>1,88,74,40,959</td>
                        <tr>
                            <td class='TTRow_left'>(C1) Shares underlying DRs</td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'>0.00</td>
                            <td class='TTRow_right'></td>
                            <tr>
                                <td class='TTRow_left'>(C2) Shares held by Employee Trust</td>
                                <td class='TTRow_right'>1</td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <td class='TTRow_right'></td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <td class='TTRow_right'>0.49</td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <tr>
                                    <td class='TTRow_left'>(C) Non Promoter-Non Public</td>
                                    <td class='TTRow_right'>1</td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <td class='TTRow_right'></td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <td class='TTRow_right'>0.49</td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <tr>
                                        <td class='TTRow_left'>Grand Total</td>
                                        <td class='TTRow_right'>9,16,078</td>
                                        <td class='TTRow_right'>2,17,06,54,147</td>
                                        <td class='TTRow_right'>1,32,95,642</td>
                                        <td class='TTRow_right'>2,18,39,49,789</td>
                                        <td class='TTRow_right'>100.00</td>
                                        <td class='TTRow_right'>2,17,99,49,744</td>
                                    </tr>
            </table>
python python-2.7 web-scraping beautifulsoup
1 Answer
Votes: 1

You can try this:

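from bs4 import BeautifulSoup as soup
import urllib  # Python 2 (per the question's tag); Python 3 would use urllib.request.urlopen
import re

url = 'https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00'
s = soup(urllib.urlopen(url).read(), 'lxml')
# Select every data cell by class, strip line breaks and runs of whitespace, then drop empty cells.
cells = s.find_all('td', {'class': re.compile('TTRow_right|TTRow_left')})
results = filter(None, [re.sub(r'[\n\r]+|\s{2,}', '', i.text) for i in cells])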

Output:

[u'(A) Promoter & Promoter Group', u'19', u'28,17,02,889', u'28,17,02,889', u'12.90', u'28,17,02,889', u'(B) Public', u'9,16,058', u'1,87,81,45,362', u'1,32,95,642', u'1,89,14,41,004', u'86.61', u'1,88,74,40,959', u'(C1) Shares underlying DRs', u'0.00', u'(C2) Shares held by Employee Trust', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'(C) Non Promoter-Non Public', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'Grand Total', u'9,16,078', u'2,17,06,54,147', u'1,32,95,642', u'2,18,39,49,789', u'100.00', u'2,17,99,49,744']
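
The list above is flat and has the empty cells removed, so it cannot be split back into rows of equal length for a CSV file. One possible way around this, sketched below, is to keep the empty cells and start a new row at every `TTRow_left` cell, which does not depend on the `<tr>` tags being closed. This is only a sketch under some assumptions: it targets Python 3 rather than the Python 2 used above, it parses with the built-in `html.parser` so that lxml is not required, `SAMPLE_HTML` is a shortened copy of the table from the question, and the helper names `table_rows` and `rows_to_csv` are made up for illustration.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Shortened copy of the table from the question; the first <tr> is left unclosed.
SAMPLE_HTML = """
<table>
  <tr>
    <td class='TTRow_left'>(B) Public</td>
    <td class='TTRow_right'>9,16,058</td>
    <td class='TTRow_right'>1,87,81,45,362</td>
    <td class='TTRow_right'>1,32,95,642</td>
    <td class='TTRow_right'>1,89,14,41,004</td>
    <td class='TTRow_right'>86.61</td>
    <td class='TTRow_right'>1,88,74,40,959</td>
    <tr>
      <td class='TTRow_left'>(C1) Shares underlying DRs</td>
      <td class='TTRow_right'></td>
      <td class='TTRow_right'></td>
      <td class='TTRow_right'></td>
      <td class='TTRow_right'></td>
      <td class='TTRow_right'>0.00</td>
      <td class='TTRow_right'></td>
    </tr>
</table>
"""


def table_rows(html):
    """Group data cells into rows, starting a new row at each TTRow_left cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for td in soup.find_all("td", class_=re.compile(r"^TTRow_(left|right)$")):
        text = td.get_text(strip=True)  # empty cells become '' and are kept
        if "TTRow_left" in td["class"] or not rows:
            rows.append([text])
        else:
            rows[-1].append(text)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; values that contain commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

For the live page, the same `table_rows` function could be given the downloaded HTML instead of `SAMPLE_HTML`, and the CSV text written to a file. Because the share counts contain commas, the `csv` module quotes them rather than splitting them into extra columns.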