Getting trimmed text in PowerShell

Question

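I am scraping version information from a website. I can get the information, but not without the surrounding formatting. The current target is the DIV tag with id j_idt19. Is there a way to get the information from a td within the table with id page_footer? I can't find the specific TD by its text.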

I want to put the result in a CSV, and then write the text to a text file in the form Num.NumNum.NumNumNum.

# Retrieve the page that shows the version number
$response = Invoke-WebRequest -Uri "https://www.somesite.com/index.xhtml"

# Select the text content of the DIV with id j_idt19
$results1 = $response.ParsedHtml.getElementsByTagName("Div") | Where-Object {$_.id -eq "j_idt19"} | Select-Object -Property TextContent
$results2 = $response.ParsedHtml.getElementsByTagName("Div") | Where-Object {$_.id -eq "j_idt19"} | Select-Object -Property TextContent | Out-String

# Show the selected object, then write it to the CSV and text files
Write-Output $results1
$results1 | Export-Csv -Path "C:\Users\ASTRTW3\Desktop\David_Scripts\URL_TEST5.csv"
$results2 | Out-File -FilePath "C:\Users\ASTRTW3\Desktop\David_Scripts\URL_TEST5.txt"

HTML being scraped

<div id="j_idt19" class="ui-layout-unit ui-widget ui-widget-content ui-corner-all ui-layout-south ui-layout-pane ui-layout-pane-south" style="position: absolute; margin: 0px; inset: auto 5px 0px; width: auto; z-index: 0; height: 26px; display: block; visibility: visible;"><div class="ui-layout-unit-content ui-widget-content" style="position: relative; height: 22px; visibility: visible;">

  <table id="page_footer" style="width: 100%; border-top: 1px solid #cbc3be !important;">
    <tbody><tr>
      <td style="width: 30%;">
        
      </td>

      <td style="width: 40%; text-align: center;"><span style="font-weight: bold;">1.14.012</span>
      </td>

      <td style="width: 15%; text-align: right;">&nbsp;</td>

      <td style="text-align: right; width: 20px; margin-top: 2px;"><div id="j_idt23" style="width:18px;height:18px;position:fixed;right:130px;bottom:2px"><div id="j_idt23_start" style="display:none"><img id="progressBar" src="/CSDB/resources/images/loader_footer.gif"></div><div id="j_idt23_complete" style="display:none"></div></div>
      </td>
    </tr>
  </tbody></table></div></div>

CSV result

#TYPE Selected.System.__ComObject
"textContent"
"

  
    
      
        
      

      1.14.012
      

      ?

      
      
    
  "

Text result

textContent                                                                           
-----------                                                                           
...                                                                                   

Expected result
CSV

#TYPE Selected.System.__ComObject
"textContent"
1.14.012

Text

1.14.012
html powershell web-scraping screen-scraping powershell-5.0
1 answer

I am assuming that what you are after is always a version contained in a <span> inside a <td>, in which case you could use:

# Start from the DIV with id j_idt19 and walk every <td> inside it
$response.ParsedHtml.getElementById('j_idt19') | ForEach-Object {
    $ver = $null
    foreach ($td in $_.getElementsByTagName('td')) {
        # Keep only the <span> elements whose text parses as a version number
        $td.getElementsByTagName('span') |
            Where-Object { [version]::TryParse($_.textContent, [ref] $ver) } |
            Select-Object textContent
    }
} | Export-Csv path\to\csv
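
The question also asks for a plain text file containing only the version. As a rough sketch of one way to do that, the whitespace-padded textContent could be reduced to the version with a regular expression before it is written out. The helper name Get-VersionText, the sample string, and the temp-file paths below are assumptions made for illustration and are not part of the original answer; the commented-out getElementById('page_footer') line is likewise untested against the real page.

```powershell
# Hypothetical helper (not from the original answer): return the first
# dotted number such as 1.14.012 found in whitespace-padded text.
function Get-VersionText {
    param([string] $Text)
    # digits.digits.digits, a loose match for the Num.NumNum.NumNumNum shape
    if ($Text -match '\d+\.\d+\.\d+') { $Matches[0] }
}

# Sample string standing in for the padded textContent shown in the CSV result
$raw = "`n`n      1.14.012`n      `n      ?`n  "

# With a live page the text might instead come from (assumed, untested here):
#   $raw = $response.ParsedHtml.getElementById('page_footer').textContent

$version = Get-VersionText -Text $raw

# Placeholder output paths in the temp directory
$csvPath = Join-Path ([IO.Path]::GetTempPath()) 'version.csv'
$txtPath = Join-Path ([IO.Path]::GetTempPath()) 'version.txt'

# Wrap the string in an object so Export-Csv gets a textContent column
[pscustomobject]@{ textContent = $version } |
    Export-Csv -Path $csvPath -NoTypeInformation

# Write the bare version string to the text file
$version | Out-File -FilePath $txtPath
```

The version is kept as a string here on purpose: casting it to [version] would turn the last component into the integer 12, so the leading zero in 012 would be lost when it is written back out. -NoTypeInformation drops the #TYPE header line; omit it if that line is wanted in the CSV.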