Parsing table-row links from an HTML table in C++


I want to use C++ to parse all links (represented by table rows) from the following HTML code:

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-type" content="text/html; charset=UTF-8"/>
        <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
        <link rel="stylesheet" href="/_autoindex/assets/css/autoindex.css"/>
        <script src="/_autoindex/assets/js/tablesort.js"></script>
        <script src="/_autoindex/assets/js/tablesort.number.js"></script>
        <title>Index of /mydirectory/subdirectory/</title>
    </head>
    <body>
        <div class="content">
            <h1>Index of /mydirectory/subdirectory/</h1>
            <div id="table-list">
                <table id="table-content">
                    <thead class="t-header">
                        <tr>
                            <th class="colname" aria-sort="ascending">
                                <a class="name" href="?ND" onclick="return false"">Name</a></th><th class=" colname " data-sort-method=" number "><a href=" ?MA "  onclick=" return false"">Last Modified</a>
                            </th>
                            <th class="colname" data-sort-method="number"><a href="?SA"onclick="return false"">Size</a></th></tr></thead>
<tr data-sort-method="none "><td><a href="/mydirectory/"><img class="icon " src="/_autoindex/assets/icons/corner-left-up.svg " alt="Up ">Parent Directory</a></td><td></td><td></td></tr>
<tr><td data-sort="first.json "><a href="/mydirectory/subdirectory/first.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">first.json</a></td><td data-sort="1704288747 ">2024-01-03 13:32</td><td data-sort="4096 ">      4k</td></tr>
<tr><td data-sort="second.json "><a href="/mydirectory/subdirectory/second.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">second.json</a></td><td data-sort="1704290309 ">2024-01-03 13:58</td><td data-sort="4096 ">      4k</td></tr>
<tr><td data-sort="third.json "><a href="/mydirectory/subdirectory/third.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">third.json</a></td><td data-sort="1704290300 ">2024-01-03 13:58</td><td data-sort="4096 ">      4k</td></tr>
</table></div>
<address>Proudly Served by LiteSpeed Web Server at example.com Port 443</address></div><script>new Tablesort(document.getElementById("table-content "));</script></body></html>

This is a directory listing from an Apache web server. My expected result is a std::vector<std::string> containing the (relative) URLs of all 3 JSON files in the table.

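For the implementation, I tried using Apache xerces-c, but this library does not seem to have full XPath support. In addition, xalan-c, which promises comprehensive XPath support, is not available in my package manager vcpkg.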

How can I implement this kind of parsing with xerces-c, the way I would with Java's JSoup?

std::vector<std::string> parse_all_links(const std::string &website_content)
{
    std::vector<std::string> collected_links;
    
    try
    {
        XMLPlatformUtils::Initialize();
    }
    catch (const XMLException& exception)
    {
        auto error_message = XMLString::transcode(exception.getMessage());
        logger->error("Failed to initialize XML platform utils: " + std::string(error_message));
        XMLString::release(&error_message);

        return collected_links;
    }
    
    {   
        XercesDOMParser parser;
        parser.setValidationScheme(XercesDOMParser::Val_Never);

        const MemBufInputSource input_source(reinterpret_cast<const XMLByte*>(website_content.data()),
            website_content.size(), "dummy");
        parser.parse(input_source);

        // ...  
    }
    
    XMLPlatformUtils::Terminate();

    return collected_links;
}

Any solution based on another HTML parsing library is fine too, preferably one with a vcpkg port so that it is easier to use.

c++ html-parsing xerces-c
1 Answer

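Although it is outdated and no longer maintained, Google's gumbo-parser can still do the job. The code below needs C++20 (std::string::ends_with and the range-based for loop with an initializer):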

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <gumbo.h>
#include <boost/algorithm/string.hpp>

struct link_searcher_configuration_t
{
    std::vector<std::string> file_extensions;

    std::string server_base_url;
};

void search_for_links(const GumboNode* node, std::vector<std::string>& links,
                      const link_searcher_configuration_t &link_searcher_configuration)
{
    if (node->type != GUMBO_NODE_ELEMENT)
    {
        return;
    }

    if (node->v.element.tag == GUMBO_TAG_A)
    {
        // gumbo_get_attribute returns a null pointer when the attribute is missing
        const auto href = gumbo_get_attribute(&node->v.element.attributes, "href");
        if (href != nullptr)
        {
            // Trim once so that the stored link carries no surrounding whitespace
            const std::string link = boost::trim_copy(std::string(href->value));
            for (const auto &file_extension : link_searcher_configuration.file_extensions)
            {
                if (link.ends_with("." + file_extension))
                {
                    links.emplace_back(link_searcher_configuration.server_base_url + link);
                    break;
                }
            }
        }
    }

    const auto children = &node->v.element.children;
    for (unsigned int child_index = 0; child_index < children->length; ++child_index)
    {
        search_for_links(static_cast<GumboNode*>(children->data[child_index]),
            links, link_searcher_configuration);
    }
}

std::vector<std::string> find_links(const std::string &file_contents,
    const link_searcher_configuration_t& link_searcher_configuration)
{
    const auto output = gumbo_parse(file_contents.c_str());
    std::vector<std::string> links;
    search_for_links(output->root, links, link_searcher_configuration);
    gumbo_destroy_output(&kGumboDefaultOptions, output);

    return links;
}

int main()
{
    // Load HTML file
    const std::ifstream file_input_stream("HTMLPage.html");
    if (!file_input_stream.is_open())
    {
        std::cerr << "Could not open HTMLPage.html\n";
        return 1;
    }
    std::stringstream buffer;
    buffer << file_input_stream.rdbuf();
    const auto file_contents = buffer.str();

    // Parse all links and print them
    const link_searcher_configuration_t link_searcher_configuration
    {
        {"json"},
        "https://example.com"
    };
    for (const std::vector<std::string> links = find_links(file_contents, link_searcher_configuration);
        const auto &link : links)
    {
        std::cout << link << '\n';
    }
}
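
If pulling in gumbo (or Boost) is not an option, the same extension filter can be roughly approximated with the standard library alone. The following is only a minimal sketch under clear assumptions: it swaps the HTML parser for a plain substring scan over double-quoted href attributes, so it is not real HTML parsing and will miss single-quoted or unquoted attributes and can be fooled by scripts or comments. The names trim_ascii, scan_href_values and find_links_by_scan are hypothetical helpers invented for this illustration and are not part of gumbo or any other library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: strips leading and trailing ASCII whitespace
// (a standard-library stand-in for boost::trim_copy).
std::string trim_ascii(std::string_view text)
{
    constexpr std::string_view whitespace = " \t\r\n\f\v";
    const auto first = text.find_first_not_of(whitespace);
    if (first == std::string_view::npos)
    {
        return {};
    }
    const auto last = text.find_last_not_of(whitespace);
    return std::string(text.substr(first, last - first + 1));
}

// Hypothetical helper: collects the raw value of every double-quoted href attribute.
// This is a substring scan, not an HTML parser.
std::vector<std::string> scan_href_values(std::string_view html)
{
    constexpr std::string_view marker = "href=\"";
    std::vector<std::string> values;
    std::size_t position = 0;
    while ((position = html.find(marker, position)) != std::string_view::npos)
    {
        const auto value_begin = position + marker.size();
        const auto value_end = html.find('"', value_begin);
        if (value_end == std::string_view::npos)
        {
            break; // unterminated attribute value, stop scanning
        }
        assert(value_end >= value_begin);
        values.emplace_back(html.substr(value_begin, value_end - value_begin));
        position = value_end + 1;
    }
    return values;
}

// Hypothetical helper: keeps the trimmed href values that end in "." + extension
// and prefixes them with the base URL, mirroring the filter in search_for_links.
std::vector<std::string> find_links_by_scan(std::string_view html,
    const std::vector<std::string>& file_extensions, const std::string& server_base_url)
{
    std::vector<std::string> links;
    for (const auto& raw_value : scan_href_values(html))
    {
        const std::string link = trim_ascii(raw_value);
        for (const auto& file_extension : file_extensions)
        {
            const std::string suffix = "." + file_extension;
            if (link.size() >= suffix.size()
                && link.compare(link.size() - suffix.size(), suffix.size(), suffix) == 0)
            {
                links.push_back(server_base_url + link);
                break;
            }
        }
    }
    return links;
}
```

Calling find_links_by_scan(file_contents, {"json"}, "https://example.com") would be the counterpart of find_links above. Passing an empty base URL keeps the links relative, which is what the question asked for. For anything beyond a simple, machine-generated directory listing, a real parser such as gumbo remains the safer choice.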