Parsing HTML source code with AppleScript

Question (0 votes, 4 answers)

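I am trying to parse an HTML file that I have converted to a TXT file in Automator.

I previously downloaded the HTML file from a website using Automator, and I am now struggling to parse the source code.

Ideally I want only the information in the table, and I need to repeat this for 1800 different HTML files.

Here is an example of the source code: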

</head>
<body>
<div id="header">
    <div class="wrapper">
        <span class="access">
        <div id="fb-root"></div>


    <span class="access">
     Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a>       Logged in as Edward&nbsp;&nbsp; | &nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>

    </span>
                                    </span>
    </div><!-- /wrapper -->
</div><!-- /header -->

<div id="masthead">
    <div class="wrapper">   
        <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
        <div id="navigation">
            <ul>
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li>    <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>               
        </div><!-- /navigation -->

    </div><!-- /wrapper -->     
</div><!-- /masthead -->


<div id="content">
    <div class="wrapper">
        <div id="main-content">

 <!-- per Project stuff -->
    <span class="section">
                <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
                <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
                                    <ul class="gbutton-group right">
                    <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
                    <li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752"  id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
                </ul>

                <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
                <span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
                <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
                </div>
                                    <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>

            </span>

            <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
                                                        <tr>
                    <th>Role</th>
                    <td>
                    <p>Other</p>                            </td>
                </tr>
                <tr>  
                    <th>Organisation Type</th>
                    <td>
                    <p>Asset Manager</p>                        </td>
                </tr>
                <tr>
                    <th>Email</th>
                    <td><a href="mailto:[email protected]" title="[email protected]" >[email protected]</a></td>
                </tr>
                <tr>
                    <th>Website</th>
                    <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
                </tr>
                <tr>
                    <th>Phone</th>
                    <td>41 78 616 7334</td>
                </tr>
                <tr>
                    <th>Fax</th>
                    <td></td> 
                </tr>
                <tr>
                    <th>Mailing Address</th>
                    <td>Birrenstrasse 30</td>
                </tr>
                <tr>
                    <th>City</th>
                    <td>Schindellegi</td>
                </tr>
                <tr>
                    <th>State</th>
                    <td>CH</td>
                </tr>
                <tr>
                    <th>Country</th>
                    <td>Switzerland</td>
                </tr>
                <tr>
                    <th class="lastrow" >Zip/ Postal Code</th>
                    <td class="lastrow" >8834</td>
                </tr>
        </table>
                </div><!-- /main-content -->
                    <div id="sidebar"  >
                    </div>

            <div id="similar_sidebar" class="similar_refine" >



            </div>
                            </div><!-- /wrapper -->
</div><!-- /content -->

<div id="footer">

</div>

My AppleScript attempts to extract the table using text item delimiters:

set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL

to extractBetween(SearchText, startText, endText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to startText
    set endItems to text of text item -1 of SearchText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text of text item 1 of endItems
    set AppleScript's text item delimiters to tid
    return beginningToEnd
end extractBetween

How can I parse the table out of the HTML file?

html parsing applescript delimiter automator
4 Answers

5 votes

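Rather than writing your own HTML parser, you can use the one built into Safari through the do JavaScript command. JavaScript has built-in facilities for working with HTML elements and their data.

This script gets the inner HTML of the first table on the page only: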

tell application "Safari"
    tell document 1
        set theFirstTableHTML to do JavaScript "document.getElementsByTagName('table')[0].innerHTML"
    end tell
end tell

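You can use this technique to apply basic DOM scripting to any page and grab whatever data you want to read. You could fetch just the values of the table cells, or anything else you need.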
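
To illustrate "fetch just the values of the table cells" outside Safari, here is a minimal sketch that uses Python's standard html.parser in place of Safari's DOM. This is a swapped-in technique, not part of the answer above. The names ProfileTableParser and parse_profile_table are hypothetical; the table id profile-table and the one th plus one td per row layout are taken from the sample in the question.

```python
from html.parser import HTMLParser


class ProfileTableParser(HTMLParser):
    """Collects (header, value) pairs from the <th>/<td> cells of one table."""

    def __init__(self, table_id="profile-table"):
        super().__init__(convert_charrefs=True)
        self.table_id = table_id  # assumption: id used in the question's sample
        self.in_table = False
        self.cell = None          # "th" or "td" while inside a cell, else None
        self.buffer = []          # text fragments of the current cell
        self.header = None        # text of the most recent <th>
        self.rows = []            # collected (header, value) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag in ("th", "td"):
            self.cell = tag
            self.buffer = []

    def handle_data(self, data):
        if self.cell:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif self.in_table and tag == self.cell:
            # Collapse the whitespace and line breaks left over from the markup.
            text = " ".join("".join(self.buffer).split())
            if tag == "th":
                self.header = text
            else:
                self.rows.append((self.header, text))
            self.cell = None


def parse_profile_table(html, table_id="profile-table"):
    """Return the table's rows as a list of (header, value) tuples."""
    parser = ProfileTableParser(table_id)
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling parse_profile_table on the text of one saved page returns a list of pairs, which can then be written out as one CSV row per file. Nested tags inside a cell, such as the p and a elements in the sample, contribute only their text.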


1 vote

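You're really close. The problem is your startText variable. The opening tag you search for, <table>, never appears in the HTML text in that exact form, so it can't be found. The line that starts the table is actually...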

<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">

So I modified your code to look for that tag in two steps. First...

<table

and then, separately...

>

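This way we can ignore all the attributes that come with the table tag (width, border, etc.), since I assume they vary from file to file. What remains is just the table's own code. Try this...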

set p to input
set ex to extractBetween(p, "<table", ">", "</table>")

to extractBetween(SearchText, startText1, startText2, endText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to startText1
    set endItems to text item -1 of SearchText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text item 1 of endItems
    set AppleScript's text item delimiters to startText2
    set finalText to (text items 2 thru -1 of beginningToEnd) as text
    set AppleScript's text item delimiters to tid
    return finalText
end extractBetween
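
For readers who want to check the two-step logic away from AppleScript, the same sequence can be sketched in Python. This is only an illustration of the handler above, not a replacement for it: the name extract_between mirrors the handler, and returning None when a marker is missing is an assumption of this sketch (the AppleScript version behaves differently in that case).

```python
def extract_between(search_text, start1, start2, end_text):
    """Mirror the two-step search: the last start1, then the first end_text,
    then everything after the first start2 in between."""
    # Like "text item -1" with start1 as delimiter: text after the last start1.
    _, found, tail = search_text.rpartition(start1)
    if not found:
        return None
    # Like "text item 1" with end_text as delimiter: text before the first end_text.
    body, found, _ = tail.partition(end_text)
    if not found:
        return None
    # Like "text items 2 thru -1" with start2 as delimiter: drop the tag attributes.
    _, found, inner = body.partition(start2)
    if not found:
        return None
    return inner


if __name__ == "__main__":
    page = '<p>x</p><table width="100%" id="profile-table"><tr><td>1</td></tr></table><div>'
    print(extract_between(page, "<table", ">", "</table>"))
```

Splitting on "<table" first and on ">" second is what makes the attributes irrelevant: whatever sits between the tag name and the closing bracket is discarded in the last step.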

0 votes

Try:

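-- Read the saved page, then let the shell pull out the table.
set xxx to read alias "Mac OS X:Users:paolo:Desktop:paolo.html"
-- grep matches one line at a time, so the newlines are flattened with tr first.
set yyy to do shell script "echo " & quoted form of xxx & " | tr '\\n' ' ' | grep -o '<table.*</table>'"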
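
Because grep matches one line at a time, a table that spans several lines (as in the question's sample) either has to be flattened first or matched with a pattern that is allowed to cross newlines. Below is a minimal sketch of the second option in Python, swapped in for grep, which also loops over a folder since the question mentions 1800 files. The function names, the folder argument and the *.txt pattern are assumptions for illustration.

```python
import re
from pathlib import Path

# DOTALL lets "." match newlines, so the pattern can span a multi-line table.
# The non-greedy ".*?" stops at the first closing tag.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def first_table(html):
    """Return the first <table>...</table> block in html, or None."""
    match = TABLE_RE.search(html)
    return match.group(0) if match else None


def tables_in_folder(folder, pattern="*.txt"):
    """Map each matching file name in folder to its first table (or None)."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = first_table(text)
    return results
```

A regular expression is adequate here only because each page holds a single, non-nested table; pages with nested tables would need a real HTML parser.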

-1 votes

This works wonders:

tell application "Safari" to set pageSource to (source of document 1) as string
set startPos to offset of "<table" in pageSource
set endPos to (offset of "</table>" in pageSource) + (count of "</table>") - 1
set sourceCode to text startPos thru endPos of pageSource

NB: the script retrieves only the first table.
