How do I clean up an HTML string so I can parse it with lxml in Python?

Question  Votes: 0  Answers: 2

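I have a Python string containing HTML, taken from a JSON payload, that I want to parse with the lxml library. The string contains several escape sequences (such as \" and \r\n) and other special characters. How can I clean it up so that I can extract information from it with lxml? I want to run XPath selectors against the string.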

String:

<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<html>\r\n\r\n<head>\r\n    <META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\r\n</head>\r\n\r\n<body>\r\n\r\n<div>\r\n    <table width=\"640\" align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:14px\">\r\n        <tr>\r\n            <td align=\"center\">\r\n\r\n                <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"max-width:600px;text-align:left\">\r\n                    <tr>\r\n                        <td width=\"600\">\r\n                            <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\">\r\n                                <tr>\r\n                                    <td height=\"10\"></td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td align=\"center\">\r\n                                        <a href=\"#0.1_\"><img src=\"https://ns.yatracdn.com/common/images/emailers/corp-flight-hotel/yatra-logo.png\" width=\"101\" height=\"45\" alt=\"Yatra.com\" title=\"Yatra.com\" border=\"0\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:25px;color:#ea2330\" vspace=\"0\" hspace=\"0\" align=\"center\"></a>\r\n                                    </td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td height=\"10\"></td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td>\r\n                                        <table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\" style=\"border:1px solid #d8d8d8\">\r\n                                            <tr>\r\n                                               
 <td height=\"10\"></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td width=\"10\"></td>\r\n                                                <td colspan=\"3\"><b>Travel Request Details</b></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td height=\"10\"></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td width=\"10\"></td>\r\n                                                <td>\r\n                                                    <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"border:1px solid #d8d8d8\">\r\n                                                        <tbody>\r\n                                                        <tr>\r\n                                                            <td width=\"10\"></td>\r\n                                                            <td>\r\n                                                                <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\">\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Email Verification Date / Time </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n                                                                    </tr id='aaaaa'>\r\n                         
                                           <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Request Submission Date / Time </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Product </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Flight</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Journey Type </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">One way</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n  
                                                                          Adult </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">1</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Child </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Infant </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Flight Class </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Travel Class</td>\r\n                     
                                               </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Preferred Airline </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            </td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Non Stop Flight </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Airline</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Traveller Email </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">[email protected]</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n            
                                                            <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Traveller Mobile</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">9971255462</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Travel Policy Email</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">[email protected]</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Origin</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">New Delhi(DEL)</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Destination</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" 
style=\"border-bottom:1px solid #d8d8d8\">Mumbai(BOM)</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Depart Date</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">26 Jun 2020</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Time From</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">00:23</td>\r\n                                                                    </tr>\r\n\r\n                                                                </table>\r\n                                                            </td>\r\n                                                            <td width=\"10\"></td>\r\n                                                        </tr>\r\n\r\n                                                        </tbody>\r\n                                                    </table>\r\n\r\n                                                </td>\r\n                                                <td width=\"10\"></td>\r\n                                            </tr>\r\n\r\n                                            <tr>\r\n                                                <td height=\"10\"></td>\r\n                                            </tr>\r\n                             
           </table>\r\n\r\n                                    </td>\r\n                                </tr>\r\n                            </table>\r\n                        </td>\r\n                    </tr>\r\n                </table>\r\n            </td>\r\n        </tr>\r\n    </table>\r\n\r\n</div>\r\n\r\n</body>\r\n\r\n</html>

With a clean string, the parser works like this:

>>> from io import StringIO
>>> from lxml import etree
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

>>> parser = etree.HTMLParser()
>>> tree   = etree.parse(StringIO(broken_html), parser)

>>> result = etree.tostring(tree.getroot(),
...                         pretty_print=True, method="html")
>>> print(result.decode())  # tostring() returns bytes by default in Python 3
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>page title</h1>
  </body>
</html>
python nlp html-parsing lxml text-parsing
2 Answers
1
vote

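You might want to use BeautifulSoup. It parses the markup into a tree that you can iterate over, and it lets you search for specific tags, classes, and so on. P.S. One of its parser options is lxml.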

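from bs4 import BeautifulSoup

# broken_html is the HTML string from the question; 'lxml' selects the lxml-backed parser
soup = BeautifulSoup(broken_html, 'lxml')
soup.title  # the first <title> tag, or None if there is none
soup.find_all('div')  # a list-like ResultSet of all div tags
my_tag = soup.find(id="yourID")  # "yourID" is a placeholder; find() returns None if nothing matches
my_tag.find_all('div')  # every div nested inside the tag with id "yourID"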
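
For the kind of two-column table in the question, the same BeautifulSoup calls can collect label/value pairs. This is only a sketch: the `html` excerpt below is a hypothetical, already-unescaped sample modelled on the question's table, and the `details` dictionary is an illustrative name, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical, already-unescaped excerpt modelled on the table in the question.
html = """
<table>
  <tr><td align="left">Product </td><td align="right">Flight</td></tr>
  <tr><td align="left">Origin</td><td align="right">New Delhi(DEL)</td></tr>
  <tr><td align="left">Destination</td><td align="right">Mumbai(BOM)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "lxml")

details = {}
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # Keep only rows that look like "label | value"; spacer rows are skipped.
    if len(cells) == 2:
        label = cells[0].get_text(strip=True)
        value = cells[1].get_text(strip=True)
        details[label] = value

print(details)
```

Because `find_all` searches recursively, outer rows of a nested layout contain many cells and are filtered out by the `len(cells) == 2` check; that heuristic is an assumption about this particular email template.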

0
votes

It looks like you need to unescape the string first; see ChristopheD's answer:

html_unescaped_string = html_escaped_string.decode('string_escape')  # Python 2 only: Python 3 str has no .decode() and no 'string_escape' codec
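
In Python 3, one possible equivalent uses the standard `json` module, since the escapes in the question (\" and \r\n) are JSON string escapes. The sketch below assumes the fragment contains only valid JSON escapes and no unescaped double quotes; `raw` and `unescape_json_fragment` are hypothetical names chosen for illustration. If you still have the original JSON document, calling `json.loads` on the whole document already yields the unescaped HTML, and this step is unnecessary.

```python
import json

# Hypothetical sample: the text exactly as it appears inside the JSON document,
# with literal backslashes before the quotes and literal "\r\n" sequences.
raw = r'<td align=\"right\">12 May 2020 17:14</td>\r\n'


def unescape_json_fragment(fragment: str) -> str:
    """Decode JSON-style escapes by parsing the fragment as a JSON string literal."""
    return json.loads('"' + fragment + '"')


html = unescape_json_fragment(raw)
print(repr(html))
```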

Then you can indeed use BeautifulSoup and keep your fingers crossed that it also copes with the other malformed bits of the string.
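
Since the question asks for XPath, here is a minimal sketch of the same idea with lxml directly, once the string has been unescaped. The `page` excerpt is a hypothetical sample modelled on the question's table, and `value_for` is an illustrative helper, not an lxml function. It relies on lxml's support for passing XPath variables as keyword arguments to `xpath()`.

```python
from lxml import html as lxml_html

# Hypothetical, already-unescaped excerpt with real quotes and CRLF line endings.
page = (
    '<html><body><table>\r\n'
    '<tr><td align="left">\r\n    Journey Type </td>'
    '<td colspan="2" align="right">One way</td></tr>\r\n'
    '<tr><td align="left">Depart Date</td>'
    '<td colspan="2" align="right">26 Jun 2020</td></tr>\r\n'
    '</table></body></html>'
)

tree = lxml_html.fromstring(page)


def value_for(label):
    """Return the text of the second cell in the row whose first cell matches label."""
    # normalize-space() trims the label cell, so stray spaces and CR/LF do not matter.
    hits = tree.xpath(
        '//tr[td[1][normalize-space()=$label]]/td[2]/text()', label=label
    )
    return hits[0].strip() if hits else None


print(value_for("Journey Type"))
print(value_for("Depart Date"))
```

The lenient HTML parser behind `lxml.html.fromstring` tolerates many malformed fragments, but how it repairs any particular input is best checked against the real data.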
