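I am trying to parse HTML files that I converted to TXT files with Automator.
I previously used Automator to download the HTML files from a website, and now I am struggling to parse the source code.
Ideally, I only want the information in the table, and I need to repeat this for 1800 different HTML files.
Here is an example of the source code: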
</head>
<body>
<div id="header">
<div class="wrapper">
<span class="access">
<div id="fb-root"></div>
<span class="access">
Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a> Logged in as Edward | <a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>
</span>
</span>
</div><!-- /wrapper -->
</div><!-- /header -->
<div id="masthead">
<div class="wrapper">
<a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
<div id="navigation">
<ul>
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li> <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>
</div><!-- /navigation -->
</div><!-- /wrapper -->
</div><!-- /masthead -->
<div id="content">
<div class="wrapper">
<div id="main-content">
<!-- per Project stuff -->
<span class="section">
<img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
<h1><span id="profile-name-104947" >Christian Sieling</span></h1>
<ul class="gbutton-group right">
<li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">« Back </a></li>
<li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752" id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
</ul>
<div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
<span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
<a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
</div>
<h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>
</span>
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
<tr>
<th>Role</th>
<td>
<p>Other</p> </td>
</tr>
<tr>
<th>Organisation Type</th>
<td>
<p>Asset Manager</p> </td>
</tr>
<tr>
<th>Email</th>
<td><a href="mailto:[email protected]" title="[email protected]" >[email protected]</a></td>
</tr>
<tr>
<th>Website</th>
<td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
</tr>
<tr>
<th>Phone</th>
<td>41 78 616 7334</td>
</tr>
<tr>
<th>Fax</th>
<td></td>
</tr>
<tr>
<th>Mailing Address</th>
<td>Birrenstrasse 30</td>
</tr>
<tr>
<th>City</th>
<td>Schindellegi</td>
</tr>
<tr>
<th>State</th>
<td>CH</td>
</tr>
<tr>
<th>Country</th>
<td>Switzerland</td>
</tr>
<tr>
<th class="lastrow" >Zip/ Postal Code</th>
<td class="lastrow" >8834</td>
</tr>
</table>
</div><!-- /main-content -->
<div id="sidebar" >
</div>
<div id="similar_sidebar" class="similar_refine" >
</div>
</div><!-- /wrapper -->
</div><!-- /content -->
<div id="footer">
</div>
My AppleScript attempt uses text item delimiters to extract the table, along these lines:
set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL
to extractBetween(SearchText, startText, endText)
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to startText
set endItems to text of text item -1 of SearchText
set AppleScript's text item delimiters to endText
set beginningToEnd to text of text item 1 of endItems
set AppleScript's text item delimiters to tid
return beginningToEnd
end extractBetween
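How can I parse the table out of the HTML files?
Rather than writing your own HTML parser, you can use the HTML parser in Safari through the do JavaScript command. JavaScript has built-in facilities for working with HTML elements and their data.
This script gets only the HTML of the first table in the page: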
tell application "Safari"
tell document 1
set theFirstTableHTML to do JavaScript "document.getElementsByTagName('table')[0].innerHTML"
end tell
end tell
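You can use this technique to apply basic DOM scripting to any page and grab whatever data you want to read. You can get just the values of the table cells, or anything else you need.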
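As a rough illustration of reading just the cell values, the sketch below turns a two-column header/value table such as the one in the question into a plain object. The function name tableToRecord is hypothetical, and the code assumes each row holds one label cell followed by one value cell; the id "profile-table" is taken from the sample HTML.

```javascript
// Sketch: turn a two-column <th>/<td> table into a plain object.
// `table` is anything exposing `rows`, where each row exposes `cells` and
// each cell exposes `textContent` (true of a DOM <table> element).
function tableToRecord(table) {
  const record = {};
  for (const row of Array.from(table.rows)) {
    const cells = Array.from(row.cells);
    if (cells.length < 2) continue; // skip rows without a label/value pair
    const label = cells[0].textContent.trim();
    const value = cells[1].textContent.trim();
    record[label] = value;
  }
  return record;
}

// In a browser, the table from the question could be looked up by its id.
// This part is skipped when no `document` object exists.
if (typeof document !== "undefined") {
  const table = document.getElementById("profile-table");
  if (table) console.log(JSON.stringify(tableToRecord(table)));
}
```

Serializing the result with JSON.stringify gives a single string, which is probably the easiest shape to hand back to the calling script.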
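You are really close. The problem is your startText variable. The opening tag "<table>" does not appear anywhere in the HTML text, so it can never be found. The line that starts the table is actually...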
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
So I modified your code to find that tag in two steps. First...
<table
and then, separately...
>
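This way we can ignore everything that comes with the table tag (width, border, and so on), which I assume varies from file to file. After that, we are left with just the table's code. Try this...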
set p to input
set ex to extractBetween(p, "<table", ">", "</table>")
to extractBetween(SearchText, startText1, startText2, endText)
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to startText1
set endItems to text item -1 of SearchText
set AppleScript's text item delimiters to endText
set beginningToEnd to text item 1 of endItems
set AppleScript's text item delimiters to startText2
set finalText to (text items 2 thru -1 of beginningToEnd) as text
set AppleScript's text item delimiters to tid
return finalText
end extractBetween
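For comparison, here is the same two-step idea as a JavaScript sketch: find "<table", then the ">" that closes the opening tag, then the next "</table>". The function name extractTableInner is hypothetical. Unlike the handler above, which starts from the last "<table" in the text, this version takes the first one, and it assumes no attribute value in the opening tag contains a ">" character.

```javascript
// Sketch: return the markup between the first "<table ...>" opening tag and
// the next "</table>", or null when no complete table is found.
function extractTableInner(html) {
  const open = html.indexOf("<table");
  if (open === -1) return null;

  // End of the opening tag, whatever attributes it carries.
  const openEnd = html.indexOf(">", open);
  if (openEnd === -1) return null;

  const close = html.indexOf("</table>", openEnd);
  if (close === -1) return null;

  return html.slice(openEnd + 1, close);
}
```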
Try:
set xxx to read alias "Mac OS X:Users:paolo:Desktop:paolo.html"
-- grep matches one line at a time, so flatten the line breaks before searching
set yyy to do shell script "echo " & quoted form of xxx & " | tr '\\r\\n' '  ' | grep -o '<table.*</table>'"
This works wonders:
tell application "Safari" to set sourceCode to characters (offset of ¬
"<table" in (source of document 1 as string)) thru ((offset of ¬
"/table" in (source of document 1 as string)) + (count of "/table")) ¬
of (source of document 1 as string) as string
NB: the script retrieves only the first table.