Extract string from html text

Question

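I am fetching HTML with curl and need to extract only the second table element. Note that the HTML returned by curl is a single, unformatted string. To illustrate, see the following (... stands for more HTML):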

...
<table width="100%" cellpadding="0" cellspacing="0" class="table">
...
</table>
...
#I need to extract the following table
#from here
<table width="100%" cellpadding="4">
...
</table> #to this
...

So far I have tried several sed lines, but I don't think matching the second table like this is a clean approach:

sed -n '/<table width="100%" cellpadding="4"/,/table>/p'

Answer 1

Save the script below as script.py and run it like this:

python3 script.py input.html

This script parses the HTML and checks the attributes (width and cellpadding). The advantage of this approach is that if the formatting of the HTML file changes, the script will still work, because it parses the HTML rather than relying on exact string matching.

#!/usr/bin/env python3
from html.parser import HTMLParser
import sys

if len(sys.argv) < 2:
    print("ERROR: expected argument - filename")
    sys.exit(1)

with open(sys.argv[1], 'r') as content_file:
    content = content_file.read()

# True while the parser is inside the table we want
do_print = False

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global do_print
        if tag == "table":
            # attrs is a list of (name, value) tuples
            if ("width", "100%") in attrs and ("cellpadding", "4") in attrs:
                print('<table width="100%" cellpadding="4">')
                do_print = True

    def handle_endtag(self, tag):
        global do_print
        if do_print and tag == "table":
            print('</table>')
            do_print = False

    def handle_data(self, data):
        global do_print
        if do_print:
            print(data)

parser = MyHTMLParser()
parser.feed(content)

Answer 2

An HTML parser would be better, but you can use awk like this:

awk '/<table width="100%" cellpadding="4">/ {f=1} f; /<\/table>/ {f=0}' file
<table width="100%" cellpadding="4">
...
</table> #to this

/<table width="100%" cellpadding="4">/ {f=1}: when the start is found, set flag f to true.
f;: if flag f is true, do the default action, which is to print the line.
/<\/table>/ {f=0}: when the end is found, clear flag f to stop printing.

A range pattern can also be used, but I like the flag control better:

awk '/<table width="100%" cellpadding="4">/,/<\/table>/' file
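
One caveat applies to both answers when the input really is a single unformatted line, as the question states. awk and sed work line by line, so they would print the whole line. The script in Answer 1 prints only the table's text data, not the nested tags inside it. The following is a minimal sketch, not taken from either answer, of a parser-based variant that returns the complete markup of the first matching table. The names TableExtractor and extract_table and the sample string are hypothetical, chosen only for illustration. Comments inside the table are dropped in this sketch.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the raw markup of the first <table> whose attributes include `wanted`."""

    def __init__(self, wanted):
        # convert_charrefs=False keeps entities such as &amp; intact in the output
        super().__init__(convert_charrefs=False)
        self.wanted = set(wanted.items())
        self.depth = 0      # table nesting depth inside the matching table
        self.done = False   # set once the matching table has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Not inside the target yet: only a matching <table> starts capture
            if tag == "table" and self.wanted <= set(attrs):
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "table":
            self.depth += 1  # nested table: remember to skip its closing tag
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_table(html, wanted):
    """Return the markup of the first matching table, or '' if none matches."""
    parser = TableExtractor(wanted)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical single-line input shaped like the one in the question
sample = (
    '<p>x</p><table width="100%" cellpadding="0" cellspacing="0" class="table">'
    '<tr><td>a</td></tr></table>'
    '<div><table width="100%" cellpadding="4"><tr><td>b &amp; c</td></tr></table></div>'
)
print(extract_table(sample, {"width": "100%", "cellpadding": "4"}))
```

Tracking the nesting depth means a table nested inside the target does not end the capture early, which the single do_print flag in Answer 1 would do.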
