Extracting URLs from a web page


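I want to extract the URLs from a web page that contains many of them and save the extracted content to a txt file.

Each URL on the page is preceded by "127.0.0.1", but I want to strip the "127.0.0.1" and extract only the URL. When I run the PowerShell script below, it saves only "127.0.0.1". Please help me solve this.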

$threatFeedUrl = "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt"
    
    # Download the threat feed data
    $threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl
    
    # Define a regular expression pattern to match URLs starting with '127.0.0.1'
    $pattern = '127\.0\.0\.1(?:[^\s]*)'
    
    # Use the regular expression to find matches in the threat feed data
    $matches = [regex]::Matches($threatFeedData.Content, $pattern)
    
    # Create a list to store the matched URLs
    $urlList = @()
    
    # Populate the list with matched URLs
    foreach ($match in $matches) {
        $urlList += $match.Value
    }
    
    # Specify the output file path
    $outputFilePath = "output.txt"
    
    # Save the URLs to the output file
    $urlList | Out-File -FilePath $outputFilePath
    
    Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
html powershell uri
1 Answer
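'127\.0\.0\.1(?:[^\s]*)'
  • You mistakenly used a non-capturing group, (?:…), rather than a capturing group, (…).
  • In the downloaded content, 127.0.0.1 is followed by a space, which the pattern does not account for. Because [^\s]* cannot match that space, the match stops right after the IP address.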
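$matches = …
  • While it technically doesn't cause a problem, $matches is the name of the automatic $Matches variable (PowerShell variable names are case-insensitive) and therefore shouldn't be used for custom purposes.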
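
As a small illustration of why the name is best avoided, the sketch below shows the -match operator filling the automatic $Matches variable, using a made-up sample line; $matchList is simply one possible alternative name for your own results:

# The -match operator fills the automatic $Matches hashtable on success.
$line = '127.0.0.1 example.com'   # made-up sample line

if ($line -match '127\.0\.0\.1 (\S+)') {
    $wholeMatch = $Matches[0]   # entire matched text
    $hostName   = $Matches[1]   # text of the first capture group
}

# Keep [regex]::Matches() results under a different, non-reserved name.
$matchList = [regex]::Matches($line, '127\.0\.0\.1 (\S+)')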
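$urlList += $match.Value
  • $match.Value is the whole text that your regex matched, whereas you only want the text of the capture group.
  • $urlList += …: building an array iteratively with += is inefficient, because a new array must be allocated behind the scenes in every iteration; simply use the foreach statement as an expression and let PowerShell collect the results for you. See this answer for more information.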
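
The difference between the two collection styles can be sketched as follows. The input lines are made up, and wrapping the statement in @(...) is an optional addition of this sketch, not something the answer requires:

$lines = '127.0.0.1 a.example', '127.0.0.1 b.example', '127.0.0.1 c.example'   # made-up data

# Inefficient pattern: += builds a new array on every pass.
$slow = @()
foreach ($line in $lines) {
    $slow += ($line -split ' ')[1]
}

# Preferred: use the foreach statement as an expression and collect its output.
# @(...) guarantees an array even when there are zero or one results.
$fast = @(
    foreach ($line in $lines) {
        ($line -split ' ')[1]
    }
)

Both variables end up with the same three host names; only the way the array is built differs.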

Putting it all together:

$threatFeedUrl = 'https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt'
    
# Download the threat feed data
$threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
    
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1 ([^\s]+)'
    
# Use the regular expression to find matches in the threat feed data
$matchList = [regex]::Matches($threatFeedData, $pattern)
    
# Create and populate the list with matched URLs
$urlList = 
foreach ($match in $matchList) {
  $match.Groups[1].Value
}
    
# Specify the output file path
$outputFilePath = 'output.txt'
    
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
    
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."