Questions tagged html-parsing

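HTML parsing is the process of consuming the serialization of an HTML document and producing a representation that can be worked with programmatically, for example to extract data from it. The HTML specification defines a standard algorithm for parsing HTML, which all major browsers implement.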

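I am using a regular expression to retrieve the text of an HTML page. I strip the HTML tags with this regular expression: <[^>]+>

The problem is that this regular expression does not work correctly on a tag like this one:

    <input type="button" onclick="if (a > b) do_somthing();">

The regular expression matches <input type="button" onclick="if (a > and leaves b) do_somthing();"> behind. Which regular expression should I use to match this tag?

The better and correct way to achieve this is to parse the HTML with an HTML parser (such as Html Agility Pack) and work with the result as your requirements dictate. Parsing HTML with regular expressions is difficult and error-prone. Read more: http://www.mikesdotnetting.com/article/273/using-the-htmlagilitypack-to-parse-html-in-asp-net

As stated above, read the following link on why regular expressions are not suited to HTML -> Don't use regular expressions on HTML. As suggested in the comments, use a C# HTML parser such as CsQuery.

You can try this (Vim substitute syntax):

    :%s/<.\{-}[^ ]>

The trailing [^ ]> makes sure there is no space before the matched >.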
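To illustrate why the answers above recommend a parser over <[^>]+>, here is a minimal sketch in Python (not the C# libraries named in the answers) that uses the standard library's html.parser. The names TextExtractor and strip_tags are made up for this example. Because the parser understands quoted attribute values, the > inside onclick does not end the tag early. Note that this simple version also keeps the contents of script and style elements as text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        self.chunks.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of `html` with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = 'Before <input type="button" onclick="if (a > b) do_somthing();"> after'
print(strip_tags(sample))
```

The same idea carries over to Html Agility Pack or CsQuery: let the library tokenize the markup, then read the text nodes.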

c# .net regex html-parsing

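3 answers, 0 votes

Getting data from an HTML table into an Access database

How can I dynamically populate a database from an HTML table (for example, market data for the S&P 500)? I have a Yahoo! Finance account, and in the account I can view financial data in HTML format. I need a sim...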

html sql ms-access html-parsing

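4 answers, 0 votes

How do I find a DOM node containing a given text in JavaScript?

Given a fetched HTML page, I want to find the specific node that contains a piece of text. I suppose the hardest way would be to iterate over all nodes one by one, going as deep as possible, and then...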

javascript dom html-parsing domparser

2 answers, 0 votes

Parse HTML and isolate the integers found after a known prefix in the id attribute of qualifying tags

Simply put, I have strings prefixed with "msg" followed by some digits, which are used as list item IDs, for example:

    <li id="msg1"></li>..............<li id="msg1234567890"></li>

What is the most efficient way to get the number? In VB I would do the following:

    str = "msg1"
    str = right(str,len(str)-3)

How can I do something similar (or more efficient) in PHP?

It is the same in PHP (using substr):

    $str = "msg1";
    $str = substr($str,3);

Just use preg:

    preg_match_all('%<li id="msg(\d+)"></li>%i', $subject, $result, PREG_PATTERN_ORDER);

substr( $string, 3 ); see https://www.php.net/manual/en/function.substr.php

When parsing valid HTML, use an HTML parser. The following demonstrates how to use DomDocument and an XPath query to target specifically the id attributes of li elements whose value starts with msg, and then use sscanf() to isolate the integer after msg (cast to an integer) before pushing it into the result array.

Code: (Demo)

    $html = <<<HTML
    <ul>
    <li id="msg1"></li>
    <li id="msg1234567890"></li>
    </ul>
    HTML;

    $dom = new DomDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $result = [];
    foreach ($xpath->evaluate("//li[starts-with(@id, 'msg')]/@id") as $id) {
        sscanf($id->nodeValue, 'msg%d', $result[]);
    }
    var_export($result);

Output:

    array (
      0 => 1,
      1 => 1234567890,
    )
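The parser-based answer above can also be sketched outside PHP. The following Python version is only an illustration built on the standard library's html.parser. The names MsgIdCollector and extract_msg_ids are hypothetical, and unlike the sscanf() call it accepts an id only when everything after msg is digits.

```python
import re
from html.parser import HTMLParser


class MsgIdCollector(HTMLParser):
    """Collects the integer part of every <li id="msgNNN"> it sees."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        element_id = dict(attrs).get("id") or ""
        # Stricter than sscanf('msg%d'): the whole id must be "msg" + digits.
        match = re.fullmatch(r"msg(\d+)", element_id)
        if match:
            self.ids.append(int(match.group(1)))


def extract_msg_ids(html: str) -> list:
    parser = MsgIdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


html = '<ul><li id="msg1"></li><li id="msg1234567890"></li><li id="other7"></li></ul>'
print(extract_msg_ids(html))
```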

php arrays string html-parsing text-extraction

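4 answers, 0 votes

Converting my HTML from one form to another

I was just looking at some bad HTML markup on an old web page. I noticed a few errors that recur throughout my markup. I would like to fix them with a program, but I am not sure which API or...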

html regex parsing beautifulsoup html-parsing

3 answers, 0 votes

Parsing table row links from an HTML table in C++

I want to use C++ to parse all links (represented by table rows) in the following HTML code:

    <!DOCTYPE html>
    <html>
    <head>
    <meta http-equiv="Content-type" content="text/html; charset=UTF-8"/>
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    <link rel="stylesheet" href="/_autoindex/assets/css/autoindex.css"/>
    <script src="/_autoindex/assets/js/tablesort.js"></script>
    <script src="/_autoindex/assets/js/tablesort.number.js"></script>
    <title>Index of /mydirectory/subdirectory/</title>
    </head>
    <body>
    <div class="content">
    <h1>Index of /mydirectory/subdirectory/</h1>
    <div id="table-list">
    <table id="table-content">
    <thead class="t-header">
    <tr>
    <th class="colname" aria-sort="ascending"> <a class="name" href="?ND" onclick="return false"">Name</a></th>
    <th class=" colname " data-sort-method=" number "><a href=" ?MA " onclick=" return false"">Last Modified</a> </th>
    <th class="colname" data-sort-method="number"><a href="?SA"onclick="return false"">Size</a></th></tr></thead>
    <tr data-sort-method="none "><td><a href="/mydirectory/"><img class="icon " src="/_autoindex/assets/icons/corner-left-up.svg " alt="Up ">Parent Directory</a></td><td></td><td></td></tr>
    <tr><td data-sort="first.json "><a href="/mydirectory/subdirectory/first.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">first.json</a></td><td data-sort="1704288747 ">2024-01-03 13:32</td><td data-sort="4096 "> 4k</td></tr>
    <tr><td data-sort="second.json "><a href="/mydirectory/subdirectory/second.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">second.json</a></td><td data-sort="1704290309 ">2024-01-03 13:58</td><td data-sort="4096 "> 4k</td></tr>
    <tr><td data-sort="third.json "><a href="/mydirectory/subdirectory/third.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">third.json</a></td><td data-sort="1704290300 ">2024-01-03 13:58</td><td data-sort="4096 "> 4k</td></tr>
    </table></div>
    <address>Proudly Served by LiteSpeed Web Server at example.com Port 443</address></div>
    <script>new Tablesort(document.getElementById("table-content "));</script></body></html>

This is a directory listing from an Apache web server. My expected result is a std::vector<std::string> containing the (relative) URLs of all 3 JSON files in the table.

For the implementation I tried Apache xerces-c, but that library does not seem to have full XPath support. In addition, xalan-c, which promises full XPath support, is not available in my package manager vcpkg, among others. How can I implement this kind of parsing with xerces-c, similar to how Java's JSoup works?

    std::vector<std::string> parse_all_links(const std::string &website_content)
    {
        std::vector<std::string> collected_links;

        try
        {
            XMLPlatformUtils::Initialize();
        }
        catch (const XMLException& exception)
        {
            auto error_message = XMLString::transcode(exception.getMessage());
            logger->error("Failed to initialize XML platform utils: " + std::string(error_message));
            XMLString::release(&error_message);
            return collected_links;
        }

        {
            XercesDOMParser parser;
            parser.setValidationScheme(XercesDOMParser::Val_Never);
            const MemBufInputSource input_source(reinterpret_cast<const XMLByte*>(website_content.data()), website_content.size(), "dummy");
            parser.parse(input_source);

            // ...
        }

        XMLPlatformUtils::Terminate();
        return collected_links;
    }

Any other HTML parsing library solution would be fine too, preferably one with a vcpkg port so that it is easier to use.

Although it is outdated and unmaintained, Google's gumbo-parser still does the job:

    #include <iostream>
    #include <sstream>
    #include <fstream>
    #include <string>
    #include <vector>
    #include <gumbo.h>
    #include <boost/algorithm/string.hpp>

    struct link_searcher_configuration_t
    {
        std::vector<std::string> file_extensions;
        std::string server_base_url;
    };

    // Recursively walks the parse tree and collects every <a href> that ends
    // with one of the configured file extensions.
    void search_for_links(const GumboNode* node, std::vector<std::string>& links, const link_searcher_configuration_t &link_searcher_configuration)
    {
        if (node->type != GUMBO_NODE_ELEMENT)
        {
            return;
        }

        if (node->v.element.tag == GUMBO_TAG_A)
        {
            const auto href = gumbo_get_attribute(&node->v.element.attributes, "href");
            // gumbo_get_attribute() returns a null pointer for an <a> without href.
            if (href != nullptr)
            {
                const std::string link = href->value;
                for (const auto &file_extension : link_searcher_configuration.file_extensions)
                {
                    // std::string::ends_with requires C++20.
                    if (boost::trim_copy(link).ends_with("." + file_extension))
                    {
                        links.emplace_back(link_searcher_configuration.server_base_url + href->value);
                        break;
                    }
                }
            }
        }

        const auto children = &node->v.element.children;
        for (unsigned int child_index = 0; child_index < children->length; ++child_index)
        {
            search_for_links(static_cast<GumboNode*>(children->data[child_index]), links, link_searcher_configuration);
        }
    }

    std::vector<std::string> find_links(const std::string &file_contents, const link_searcher_configuration_t& link_searcher_configuration)
    {
        const auto output = gumbo_parse(file_contents.c_str());
        std::vector<std::string> links;
        search_for_links(output->root, links, link_searcher_configuration);
        gumbo_destroy_output(&kGumboDefaultOptions, output);
        return links;
    }

    int main()
    {
        // Load HTML file
        const std::ifstream file_input_stream("HTMLPage.html");
        std::stringstream buffer;
        buffer << file_input_stream.rdbuf();
        const auto file_contents = buffer.str();

        // Parse all links and print them
        const link_searcher_configuration_t link_searcher_configuration { {"json"}, "https://example.com" };
        for (const std::vector<std::string> links = find_links(file_contents, link_searcher_configuration); const auto &link : links)
        {
            std::cout << link << '\n';
        }
    }
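For readers who only want to see the shape of the traversal without setting up gumbo, the same filtering idea can be sketched in Python with the standard library's html.parser. This is an illustration, not a port of the answer: LinkCollector and find_links are names invented here, and unlike the C++ code the href is trimmed before the base URL is prepended.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects <a href> values that end with one of the given extensions."""

    def __init__(self, extensions, base_url):
        super().__init__()
        self.suffixes = tuple("." + ext for ext in extensions)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:  # <a> without href, or href without a value
            return
        href = href.strip()  # the listing above has trailing spaces in href
        if href.endswith(self.suffixes):
            self.links.append(self.base_url + href)


def find_links(html, extensions, base_url):
    collector = LinkCollector(extensions, base_url)
    collector.feed(html)
    collector.close()
    return collector.links


listing = (
    '<table><tr><td><a href="/mydirectory/">Parent Directory</a></td></tr>'
    '<tr><td><a href="/mydirectory/subdirectory/first.json ">first.json</a></td></tr>'
    '<tr><td><a>no href</a></td></tr></table>'
)
print(find_links(listing, ["json"], "https://example.com"))
```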

c++ html-parsing xerces-c

1 answer, 0 votes

Simple HTML DOM parser to JSON?

I am trying to group each element of a scraped website and convert it into a JSON element, but it does not seem to work.

    <?php
    // Include the php dom parser
    include_once 'simple_html_dom.php';
    header('Content-type: application/json');

    // Create DOM from URL or file
    $html = file_get_html('urlhere');

    foreach($html->find('hr ul') as $ul)
    {
        foreach($ul->find('div.product') as $li)
            $data[$count]['products'][]['li']= $li->innertext;
        $count++;
    }
    echo json_encode($data);
    ?>

This returns:

    {"":{"products":[{"li":" <a class=\"th\" href=\"\/products\/56942-haters-crewneck-sweatshirt\"> <div style=\"background-image:url('http:\/\/s0.merchdirect.com\/images\/15814\/v600_B_AltApparel_Crew.png');\"> <img src=\"http:\/\/s0.com\/images\/6398\/product-image-placeholder-600.png\"> <\/div> <\/a> <div class=\"panel panel-info\" style=\"display: none;\"> <div class=\"name\"> <a href=\"\/products\/56942-haters-crewneck-sweatshirt\"> Haters Crewneck Sweatshirt <\/a> <\/div> <div class=\"subtitle\"> $60.00 <\/div> <\/div> "}

when what I am really hoping to achieve is:

    {"products":[{
      "link":"/products/56942-haters-crewneck-sweatshirt",
      "image":"http://s0.com/images/15814/v600_B_AltApparel_Crew.png",
      "name":"Haters Crewneck Sweatshirt",
      "subtitle":"60.00"}
    ]}

How do I get rid of all the redundant information and, if possible, name each element in the reformatted JSON? Thanks!

You only need to extend the logic in the inner loop:

    $data = [];
    foreach($html->find('hr ul') as $ul)
    {
        foreach($ul->find('div.product') as $li)
        {
            $product = array();
            $product['link'] = $li->find('a.th')[0]->href;
            $product['name'] = trim($li->find('div.name a')[0]->innertext);
            $product['subtitle'] = trim($li->find('div.subtitle')[0]->innertext);
            $product['image'] = explode("'", $li->find('div')[0]->style)[1];
            // $count is never defined, so key the list by 'products' directly;
            // this also yields the {"products":[...]} shape asked for.
            $data['products'][] = $product;
        }
    }
    echo json_encode($data);

You can use this function. I wrote it for a fairly general version of the problem, so you can adapt it to your requirements. Here is the code:

    function convertToNestedJSON(htmlString) {
        var sections = htmlString.split('<h2>');
        var jsonObjects = [];
        for (var i = 1; i < sections.length; i++) {
            var section = sections[i];
            var titleEndIndex = section.indexOf('</h2>');
            var title = section.substring(0, titleEndIndex).trim();
            var contentWithLinks = section.substring(titleEndIndex + 5).trim();
            var content = contentWithLinks.replace(/<[^>]+>/g, '');
            var links = contentWithLinks.match(/<a[^>]+>([^<]+)<\/a>/g) || [];
            links = links.map(link => {
                var matches = link.match(/<a[^>]+href=['"]([^'"]+)['"][^>]*>([^<]+)<\/a>/);
                return { href: matches[1], text: matches[2] };
            });
            jsonObjects.push({ title: title, content: content, links: links });
        }
        return jsonObjects;
    }
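The "name each field" idea from the first answer can also be shown without simple_html_dom. The sketch below is a Python illustration using the standard library's html.parser, with ProductParser and products_to_json as made-up names. It assumes the markup shape quoted in the question (a div.product wrapper, an a.th link, a div with a background-image style, and div.name / div.subtitle blocks that contain no nested div). The subtitle keeps its leading "$"; call lstrip("$") on it if the bare number is wanted.

```python
import json
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Builds one dict per <div class="product"> block."""

    TEXT_FIELDS = {"name", "subtitle"}

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None  # dict for the product being read
        self._field = None    # text field currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "product" in classes:
            self._current = {}
            self.products.append(self._current)
            return
        if self._current is None:
            return
        if tag == "a" and "th" in classes:
            self._current["link"] = attrs.get("href")
        elif tag == "div":
            style = attrs.get("style") or ""
            if "background-image" in style and "'" in style:
                # Same trick as the PHP answer: take the text between quotes.
                self._current["image"] = style.split("'")[1]
            for field in self.TEXT_FIELDS.intersection(classes):
                self._field, self._buffer = field, []

    def handle_data(self, data):
        if self._field is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumes name/subtitle blocks hold no nested <div>.
        if tag == "div" and self._field is not None:
            self._current[self._field] = "".join(self._buffer).strip()
            self._field = None


def products_to_json(html: str) -> str:
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return json.dumps({"products": parser.products})


page = """
<div class="product">
  <a class="th" href="/products/56942-haters-crewneck-sweatshirt">
    <div style="background-image:url('http://s0.merchdirect.com/images/15814/v600_B_AltApparel_Crew.png');"><img src="x.png"></div>
  </a>
  <div class="panel panel-info" style="display: none;">
    <div class="name"><a href="/products/56942-haters-crewneck-sweatshirt"> Haters Crewneck Sweatshirt </a></div>
    <div class="subtitle"> $60.00 </div>
  </div>
</div>
"""
print(products_to_json(page))
```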

php json html-parsing

2 answers, 0 votes

Get the substring between two indices of an HTML fragment

What is the best way to get a substring of a URL in PowerShell? Given http://somehost.aa.com/something?id=12345 I...

powershell substring html-parsing

1 answer, 0 votes

PowerShell: get the substring between 2 indices

What is the best way to get a substring of a URL in PowerShell? Given http://somehost.aa.com/something?id=12345 I...

powershell substring html-parsing

1 answer, 0 votes

Flutter: how to parse HTML text and separate tags such as bold, p, and italic into the corresponding text widgets

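The backend API gives me text data like this:

    <p> hello <b> im bold and <i> im bold and italic </i> bold only continues.. </b> <p>

How can I parse it in Dart into plain text so that I can apply some RichText widgets to it?

You can use flutter_html, although it is a bit heavy. If you can ask the backend to send the text the way WhatsApp formats text, this utility, which uses RichText internally, covers a similar use case. Take a look; it is inspired by the way WhatsApp styles text: bold, italic, strikethrough, underline, monospace, clickable hyperlinks, and dynamic font sizes. https://pub.dev/packages/typeset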
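Whichever package is used, the underlying step is to turn nested tags into a flat list of text runs, each with its active styles, which is what a RichText widget consumes as TextSpans. The sketch below shows that step in Python rather than Dart, using the standard library's html.parser. StyledTextParser, to_spans, and the style names are assumptions made for this illustration and are not part of flutter_html or typeset. Whitespace is normalized per run, so a renderer would have to re-insert spaces between runs.

```python
from html.parser import HTMLParser


class StyledTextParser(HTMLParser):
    """Flattens nested <b>/<i> markup into (text, styles) runs."""

    STYLE_TAGS = {"b": "bold", "strong": "bold", "i": "italic", "em": "italic"}

    def __init__(self):
        super().__init__()
        self.spans = []    # list of (text, frozenset of style names)
        self._active = []  # stack of styles that are currently open

    def handle_starttag(self, tag, attrs):
        style = self.STYLE_TAGS.get(tag)
        if style:
            self._active.append(style)

    def handle_endtag(self, tag):
        style = self.STYLE_TAGS.get(tag)
        if style and style in self._active:
            # Close the most recently opened occurrence of this style.
            index = len(self._active) - 1 - self._active[::-1].index(style)
            del self._active[index]

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.spans.append((text, frozenset(self._active)))


def to_spans(html: str):
    parser = StyledTextParser()
    parser.feed(html)
    parser.close()
    return parser.spans


markup = "<p> hello <b> im bold and <i> im bold and italic </i> bold only continues.. </b> <p>"
for text, styles in to_spans(markup):
    print(sorted(styles), text)
```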

flutter dart html-parsing

1 answer, 0 votes

"No module named bs4"

I am trying to use Beautiful Soup in VSCode on Windows 10. I was told to import Beautiful Soup with the line "from bs4 import BeautifulSoup", but I keep getting an error message saying...

python visual-studio-code beautifulsoup html-parsing

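1 answer, 0 votes

How do I extract HTML elements from the "content:encoded" section of an RSS feed?

I am trying to generate a newsletter that includes, among other things, the news entries that exist on a website. The site is built with WordPress and has an RSS feed, which is not...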

html xml xml-parsing html-parsing xmlstarlet

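1 answer, 0 votes

I don't understand web page parsing at all

I have been learning how to do web page parsing online, but I do not fully understand the code I have been using. Can someone help me? I tried the code below, and it works for 1 website...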

python html html-parsing

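1 answer, 0 votes

How to parse a query in WordPress with jQuery

The JS code below parses the title and content in category.php, but I cannot parse conference_speaker_business, which is displayed using PHP code. Basically, I want to parse and display...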

javascript jquery ajax wordpress html-parsing

1 answer, 0 votes

How do I remove all HTML tags except img?

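I have some HTML text that contains various HTML tags, such as <table>, <a>, <img>, and so on. Now I want to use a regular expression to remove all HTML tags except <img ...> and </img> (and the uppercase <IMG></IMG>). How can I do this?

Update: my task is very simple. I only need to print the text content of an HTML document (including images) on the home page as a summary, so I think a regular expression is good and simple enough.

Update again: maybe an example will make my question easier to understand :) Here is some HTML text:

    <html> <head></head> <body> Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, <a href="xxx">know more</a> about me! </body> </html>

I want to keep <img> and remove the other tags. This is what I want:

    Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!

My code currently looks like this:

    html.replaceAll("<.*?>", "")

but it removes everything between < and >, whereas I want to keep <img xxx> and </img> and remove everything else between < and >. Thanks, everyone!

I tried a lot, and this regular expression seems to work for me:

    (?i)<(?!img|/img).*?>

My code is:

    html.replaceAll('(?i)<(?!img|/img).*?>', '');

Don't use regular expressions to parse HTML. See here for a convincing demonstration of why. Use an HTML parser appropriate for your language/platform. Here is one for Java (HTML Parser). For .NET, the HTML Agility Pack is recommended. For Ruby there is Nokogiri, although I am not a Ruby developer, so I do not know how good it is.

A simple answer to why you should not use regular expressions: a regexp cannot parse a recursive grammar such as

    S -> (S)
    S -> Empty

because such a grammar has an unbounded number of states. Since HTML has a recursive grammar, you cannot simply use a regular expression.

    SPAN -> <span>SPAN</span>
    SPAN -> text

In your case, however, you can express a non-recursive regular expression:

    <(img|IMG)*>*</(img|IMG)>

Here is a simple way to do it with a regular expression:

    const html = "<html>...</html>";
    return html.replace(/<.*?>/ig, function (tag) {
        if (tag.indexOf('<img ') === 0) {
            return tag;
        } else {
            return '';
        }
    })

Remove all HTML tags except the following:

    <title>(.*?)<\/title>'
    <meta name="description" content="(.*?)"/>'
    <p>(.*?)<\/p>'
    <h4 class="sc-jMKfon fhunKk">(.*?)<\/h4>'
    <h2">(.*?)<\/h2>
    <img(.*?)\/>

Find: <(?!\/?(title|meta|p|h2|img)\b)[^>]*>|<\/(?!title>|meta>|p>|h2>|>)[^>]+>

Replace with: (leave empty)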
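As a concrete counterpart to the "use a parser" answer, here is a minimal Python sketch that keeps text and img tags and drops everything else, using the standard library's html.parser. KeepImagesOnly and strip_except_img are names invented for this example. The img tag is re-emitted exactly as it appeared in the source through get_starttag_text(), and entities are passed through unchanged.

```python
from html.parser import HTMLParser


class KeepImagesOnly(HTMLParser):
    """Re-emits text and <img> tags; every other tag is dropped."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> through the default
        # handle_startendtag(). Tag names arrive lowercased, so <IMG> matches.
        if tag == "img":
            self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_except_img(html: str) -> str:
    parser = KeepImagesOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


body = 'Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, <a href="xxx">know more</a> about me!'
print(strip_except_img(body))
```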

regex html-parsing

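6 answers, 0 votes

Regular expression to extract the first link inside another tag on a page [duplicate]

I have been trying to set up a simple PHP API that basically retrieves information from another site in two steps. If a person were to do it, it would involve: searching the site, clicking...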

php html-parsing

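1 answer, 0 votes

Scraping data from view-source:https://www.youtube.com/embed/

I have a large list of embedded YouTube videos. Some of them are unavailable. I am looking for a way to identify them within a large number of URLs. The following can be checked manually...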

parsing youtube-api google-sheets-formula html-parsing google-sheets-api

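1 answer, 0 votes

How to parse an HTML string in JavaScript and convert it into tabular data (a 2D array)

I would like to parse an HTML string on the client side. We use React and TypeScript as our front-end framework. While parsing the HTML, I would also like to get the styles associated with the elements. They can be inline...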

javascript html reactjs typescript html-parsing

1 answer, 0 votes

Convert HTML table data into a transposed 2D array

I need to scrape data from an HTML table and orient the columnar data as the rows of a 2D array. My code does not produce the correct structure.

HTML table:

    <html>
    <head> </head>
    <body>
    <table>
    <tbody>
    <tr> <td>header</td> <td>header</td> <td>header</td> </tr>
    <tr> <td>content</td> <td>content</td> <td>content</td> </tr>
    <tr> <td>test</td> <td>test</td> <td>test</td> </tr>
    </tbody>
    </table>
    </body>
    </html>

PHP code:

    $DOM = new \DOMDocument();
    $DOM->loadHTML($valdat["table"]);
    $Header = $DOM->getElementsByTagName('tr')->item(0)->getElementsByTagName('td');
    $Detail = $DOM->getElementsByTagName('td');

    //#Get header name of the table
    foreach($Header as $NodeHeader) {
        $aDataTableHeaderHTML[] = trim($NodeHeader->textContent);
    }
    //print_r($aDataTableHeaderHTML); die();

    //#Get row data/detail table without header name as key
    $i = 0;
    $j = 0;
    foreach($Detail as $sNodeDetail) {
        $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
        $i = $i + 1;
        $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
    }
    //print_r($aDataTableDetailHTML); die();

    //#Get row data/detail table with header name as key and outer array index as row number
    for($j = 0; $j < count($aDataTableHeaderHTML); $j++) {
        for($i = 1; $i < count($aDataTableDetailHTML); $i++) {
            $aTempData[][$aDataTableHeaderHTML[$j]][] = $aDataTableDetailHTML[$i][$j];
        }
    }
    $aDataTableDetailHTML = $aTempData;
    echo json_encode($aDataTableDetailHTML);

My result:

    [{"header":["content"]},{"header":["test"]},{"header":["content"]},{"header":["test"]},{"header":["content"]},{"header":["test"]}]

We need a result like this:

    [
      ["header","content","test"],
      ["header","content","test"],
      ["header","content","test"]
    ]

I changed a lot of the code to (hopefully) simplify it. It works in two stages. The first extracts the <tr> elements and builds an array of all the <td> elements in each row, storing the result in $rows. The second bundles the data vertically by looping over the first row and then using array_column() to extract the corresponding data from all rows...

    $trList = $DOM->getElementsByTagName("tr");
    $rows = [];
    foreach ( $trList as $tr ) {
        $row = [];
        foreach ( $tr->getElementsByTagName("td") as $td ) {
            $row[] = trim($td->textContent);
        }
        $rows[] = $row;
    }

    $aDataTableDetailHTML = [];
    foreach ( $rows[0] as $col => $value ) {
        $aDataTableDetailHTML[] = array_column($rows, $col);
    }
    echo json_encode($aDataTableDetailHTML);

With the test data this gives...

    [["header","content","test"],["header","content","test"],["header","content","test"]]

I added some extra code that splits the array into chunks of two values and then adds the key, in this case "header":

    //There are two elements that are not "header"
    $aDataTableDetailHTML = array_chunk($aTempData, 2);

    //For every item in the array
    foreach($aDataTableDetailHTML as $key=>$tag){
        //Dynamically get the name, in this case, "header"
        $tagName = array_keys( $tag[0] )[0];
        //Start an array containing the tagname ("header")
        $tagOut = array( $tagName );
        //Add the two values onto the array
        $tagOut[] = $tag[0][$tagName][0];
        $tagOut[] = $tag[1][$tagName][0];
        //Drop the keys from the array
        $aDataTableDetailHTML[$key] = array_values( $tagOut );
    }
    echo json_encode($aDataTableDetailHTML);

This gives me the output:

    [ [ "header", "content", "test" ], [ "header", "content", "test" ], [ "header", "content", "test" ] ]

This seems to match what you need. Hope this helps. I also tested some additional values, and the pattern continued to hold.

I know this answer comes late, but I developed a package for this purpose. It is called TableDude. For your case, this PHP snippet would work.

    // Including TableDude
    require __DIR__ . "/../src/autoload.php";

    $html = "<html> <head> </head> <body> <table> <tbody> <tr> <td>header</td> <td>header</td> <td>header</td> </tr> <tr> <td>content</td> <td>content</td> <td>content</td> </tr> <tr> <td>test</td> <td>test</td> <td>test</td> </tr> </tbody> </table> </body> </html>";

    // Parses the HTML to array table
    $simpleParser = new \TableDude\Parser\SimpleParser($html);
    $parsedTables = $simpleParser->parseHTMLTables();

    if(count($parsedTables) > 0) {
        $firstTable = $parsedTables[0];
        $tableOrderedByColumn = \TableDude\Tools\ArrayTool::swapArray($firstTable);
        print_r($tableOrderedByColumn);
    }

    // This would output
    /*
    array(
        array("header", "content", "test"),
        array("header", "content", "test"),
        array("header", "content", "test")
    )
    */

To maintain the parent-child relationship between rows and cells, access the td tags in the context of their tr tag. Transposing the data structure is done by swapping the first-level keys with the second-level keys.

Code: (Demo)

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $result = [];
    foreach ($dom->getElementsByTagName('tr') as $i => $row) {
        foreach ($row->getElementsByTagName('td') as $c => $cell) {
            $result[$c][$i] = $cell->nodeValue;
        }
    }
    var_export($result);

php arrays html-table html-parsing transpose
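For the transposition question above, the two stages described in the first answer (collect rows, then read them column-wise) can be sketched in Python as follows. This is an illustration with the standard library's html.parser, not a translation of any answer. TableRows and transpose_table are hypothetical names, and zip() silently truncates to the shortest row, so ragged tables would need extra handling.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Stage 1: collect the cell texts of every <tr> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text chunks while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def transpose_table(html: str):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    # Stage 2: zip(*rows) pairs up the n-th cell of every row.
    return [list(column) for column in zip(*parser.rows)]


table = (
    "<table><tbody>"
    "<tr><td>h1</td><td>h2</td><td>h3</td></tr>"
    "<tr><td>a</td><td>b</td><td>c</td></tr>"
    "</tbody></table>"
)
print(transpose_table(table))
```

Distinct cell values are used here instead of the repeated header/content/test values from the question so that the effect of the transposition is visible.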

4 answers, 0 votes

Convert the content of a 2-column HTML table into a 2D array

I am trying to use PHP to parse the cell values of an HTML table into an indexed array of associative arrays with predetermined keys.

    $htmlContent = '<table> <tr> <th>test1</th> <td>test1-1</td> </tr> <tr> <th>test2</th> <td>test2-2</td> </tr> </table>';

I want this result:

    [
      ['name' => "test1", 'value' => "test1-1"],
      ['name' => "test2", 'value' => "test2-2"],
    ]

My current result is only:

    [ ['test1' => 'test1-1', 'test2' => 'test2-2'] ];

Here is my coding attempt:

    $DOM = new DOMDocument();
    $DOM->loadHTML($htmlContent);
    $Header = $DOM->getElementsByTagName('th');
    $Detail = $DOM->getElementsByTagName('td');

    //#Get header name of the table
    foreach($Header as $NodeHeader) {
        $aDataTableHeaderHTML[] = trim($NodeHeader->textContent);
    }
    //print_r($aDataTableHeaderHTML); die();

    //#Get row data/detail table without header name as key
    $i = 0;
    $j = 0;
    foreach($Detail as $sNodeDetail) {
        $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
        $i = $i + 1;
        $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
    }
    //print_r($aDataTableDetailHTML); die();

    //#Get row data/detail table with header name as key and outer array index as row number
    for($i = 0; $i < count($aDataTableDetailHTML); $i++) {
        for($j = 0; $j < count($aDataTableHeaderHTML); $j++) {
            $aTempData[$i][$aDataTableHeaderHTML[$j]] = $aDataTableDetailHTML[$i][$j];
        }
    }
    $aDataTableDetailHTML = $aTempData;
    unset($aTempData);
    print_r($aDataTableDetailHTML);
    die();

Your code is working too hard trying to keep the columnar data in its respective rows. To make things easier, iterate over the row (<tr>) elements and then access the elements within the given row.

Code (Demo) or (Alternative Demo)

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $result = [];
    foreach ($dom->getElementsByTagName('tr') as $row) {
        $result[] = [
            'name' => $row->getElementsByTagName('th')->item(0)->nodeValue,
            'value' => $row->getElementsByTagName('td')->item(0)->nodeValue,
        ];
    }
    var_export($result);

I did this just because using explode and str_replace is fun, with no PHP DOM parser needed. Basically, use explode( '</tr>', $table ); create an initially empty main array, loop over the rows, and add temporary arrays to it after stripping the unwanted content (i.e. <tr>, plus trimming).

    <?php
    $table = <<<HTML
    <table>
    <tr> <th>Name</th> <th>Value</th> </tr>
    <tr> <td>Name One</td> <td>Value One</td> </tr><tr> <td>Name Two</td> <td>Value Two</td> </tr><tr> <td>Name Three</td> <td>Value Three</td> </tr>
    </table>
    HTML;

    $rows = explode( '</tr>', $table );
    array_shift($rows);
    array_pop($rows);

    $main_arr = [];
    foreach ($rows as $row){
        $name = trim( str_replace(['<td>', '<tr>'], '', explode('</td>', $row)[0] ) );
        $value = trim( str_replace(['<td>', '<tr>'], '', explode('</td>', $row)[1] ) );
        $tmp_arr = [];
        $tmp_arr['name'] = $name;
        $tmp_arr['value'] = $value;
        $main_arr[] = $tmp_arr;
    }
    print_r($main_arr);

Your output should be:

    Array
    (
        [0] => Array ( [name] => Name One [value] => Value One )
        [1] => Array ( [name] => Name Two [value] => Value Two )
        [2] => Array ( [name] => Name Three [value] => Value Three )
    )

Update: here is PHP DOM code that does the same thing:

    <?php
    $DOM = new DOMDocument();
    $DOM->loadHTML("<table> <tr> <th>Name</th> <th>Value</th> </tr> <tr> <td>Name One</td> <td>Value One</td> </tr><tr> <td>Name Two</td> <td>Value Two</td> </tr><tr> <td>Name Three</td> <td>Value Three</td> </tr> </table>");

    $main_arr = [];
    $rows = $DOM->getElementsByTagName("tr");
    for ($i = 0; $i < $rows->length; $i++) {
        $cols = $rows->item($i)->getElementsByTagName("td");
        $tmp_arr = [];
        // The header row has no <td>, so check the length before reading a cell.
        if ($cols->length >= 2 && $cols->item(0)->nodeValue){
            $tmp_arr['name'] = $cols->item(0)->nodeValue;
            $tmp_arr['value'] = $cols->item(1)->nodeValue;
            $main_arr[] = $tmp_arr;
        }
    }
    print_r( $main_arr );
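The row-by-row approach recommended in the first answer translates directly to other parsers. As a minimal sketch in Python with the standard library's html.parser (NameValueRows and table_to_records are names made up here), each tr is read as a unit and its first two cells become the name and value. Rows with fewer than two cells are skipped, which is an assumption of this example and not something the answers specify.

```python
from html.parser import HTMLParser


class NameValueRows(HTMLParser):
    """Turns each two-cell <tr> into {"name": ..., "value": ...}."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._row = None   # cell texts of the <tr> being read
        self._cell = None  # text chunks of the <th>/<td> being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if len(self._row) >= 2:  # skip rows without a name/value pair
                self.records.append({"name": self._row[0], "value": self._row[1]})
            self._row = None


def table_to_records(html: str):
    parser = NameValueRows()
    parser.feed(html)
    parser.close()
    return parser.records


html_content = (
    "<table> <tr> <th>test1</th> <td>test1-1</td> </tr> "
    "<tr> <th>test2</th> <td>test2-2</td> </tr> </table>"
)
print(table_to_records(html_content))
```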

php html arrays html-parsing domdocument

2 answers, 0 votes
