Parsing HTML source code with AppleScript

Question (votes: 0, answers: 4)

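I am trying to parse an HTML file that I converted to a TXT file in Automator.

Ideally, I only want the information in the table, and I need to repeat this for 1,800 different HTML files.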


<div id="header">
    <div class="wrapper">
        <span class="access">
        <div id="fb-root"></div>

    <span class="access">
     Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a>       Logged in as Edward&nbsp;&nbsp; | &nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>

    </div><!-- /wrapper -->
</div><!-- /header -->

<div id="masthead">
    <div class="wrapper">   
        <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
        <div id="navigation">
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li>    <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>               
        </div><!-- /navigation -->

    </div><!-- /wrapper -->     
</div><!-- /masthead -->

<div id="content">
    <div class="wrapper">
        <div id="main-content">

 <!-- per Project stuff -->
    <span class="section">
                <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
                <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
                                    <ul class="gbutton-group right">
                    <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
                    <li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752"  id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>

                <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
                <span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
                <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
                                    <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>


            <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
                    <p>Other</p>                            </td>
                    <th>Organisation Type</th>
                    <p>Asset Manager</p>                        </td>
                    <td><a href="mailto:[email protected]" title="[email protected]" >[email protected]</a></td>
                    <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
                    <td>41 78 616 7334</td>
                    <th>Mailing Address</th>
                    <td>Birrenstrasse 30</td>
                    <th class="lastrow" >Zip/ Postal Code</th>
                    <td class="lastrow" >8834</td>
                </div><!-- /main-content -->
                    <div id="sidebar"  >

            <div id="similar_sidebar" class="similar_refine" >

                            </div><!-- /wrapper -->
</div><!-- /content -->

<div id="footer">


My AppleScript attempt uses text item delimiters:

set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL

to extractBetween(SearchText, startText, endText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to startText
    set endItems to text of text item -1 of SearchText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text of text item 1 of endItems
    set AppleScript's text item delimiters to tid
    return beginningToEnd
end extractBetween

How can I parse the table out of the HTML file?

html parsing applescript delimiter automator

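Rather than writing your own HTML parser, you can use the one built into Safari through the do JavaScript command. JavaScript has built-in facilities for working with HTML elements and data.

This script gets only the HTML of the first table on the page: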

tell application "Safari"
    tell document 1
        set theFirstTableHTML to do JavaScript "document.getElementsByTagName('table')[0].innerHTML"
    end tell
end tell

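You can use this technique to apply basic DOM scripting to any page and scrape whatever data you want to read. You can get just the values of the table cells, or anything else you need.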


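You're really close. The problem is your startText variable. The literal opening tag "<table>" does not appear in the HTML text, so it can't be found. The line that actually opens the table is: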

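<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">

So match the opening tag in two steps: find "<table" first, then skip ahead to the ">" that closes the tag:
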
set p to input
set ex to extractBetween(p, "<table", ">", "</table>")

to extractBetween(SearchText, startText1, startText2, endText)
    set tid to AppleScript's text item delimiters
    -- keep everything after the last "<table"
    set AppleScript's text item delimiters to startText1
    set endItems to text item -1 of SearchText
    -- cut at the first "</table>"
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text item 1 of endItems
    -- drop the rest of the opening tag, up to its closing ">"
    set AppleScript's text item delimiters to startText2
    set finalText to (text items 2 thru -1 of beginningToEnd) as text
    set AppleScript's text item delimiters to tid
    return finalText
end extractBetween



set xxx to read alias "Mac OS X:Users:paolo:Desktop:paolo.html"
set yyy to do shell script "echo " & quoted form of xxx & " | grep -o \\<table.*table\\>"
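
One caveat with the grep approach: grep matches line by line, so a pattern like <table.*table> only finds a table whose opening and closing tags sit on the same line, and the table in the HTML posted in the question spans many lines. A minimal shell sketch that prints the whole line range instead, and can be looped over a folder of files, might look like the following. The function names extract_table and extract_tables_in_dir and the .table.txt output suffix are made up for illustration. It assumes a POSIX sh and awk, and it only returns the first table in each file.

```shell
#!/bin/sh
# Print the lines from the first "<table" through the first "</table>" after it.
# Whole lines are printed, so text sharing a line with either tag is kept too.
extract_table() {
    awk '
        /<table/              { inside = 1 }   # opening tag seen: start printing
        inside                { print }
        inside && /<\/table>/ { exit }         # closing tag seen: stop after this line
    ' "$1"
}

# Hypothetical batch helper: for every .html file in directory $1,
# write the extracted table next to it as NAME.table.txt.
extract_tables_in_dir() {
    for f in "$1"/*.html; do
        [ -e "$f" ] || continue                # no matches: the glob stays literal, skip it
        extract_table "$f" > "${f%.html}.table.txt"
    done
}
```

Called as extract_tables_in_dir "$HOME/Desktop/pages" (a hypothetical folder), this would leave one .table.txt file beside each HTML file, which covers the 1,800-file case without opening each page in Safari.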



tell application "Safari" to set sourceCode to characters (offset of ¬
    "<table" in (source of document 1 as string)) thru ((offset of ¬
    "/table" in (source of document 1 as string)) + (count of "/table")) ¬
    of (source of document 1 as string) as string

NB: the script only retrieves the first table.
