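I want to extract the URLs from a web page that contains many of them and save the extracted content to a txt file.
Each URL on the page is preceded by "127.0.0.1", but I want to strip the "127.0.0.1" and extract only the URL. When I run the PowerShell script below, it saves only "127.0.0.1". Please help me solve this.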
$threatFeedUrl = "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt"
# Download the threat feed data
$threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1(?:[^\s]*)'
# Use the regular expression to find matches in the threat feed data
$matches = [regex]::Matches($threatFeedData.Content, $pattern)
# Create a list to store the matched URLs
$urlList = @()
# Populate the list with matched URLs
foreach ($match in $matches) {
$urlList += $match.Value
}
# Specify the output file path
$outputFilePath = "output.txt"
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
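'127\.0\.0\.1(?:[^\s]*)'
You mistakenly used a non-capturing group ((?:…)) rather than a capturing group ((…)).
In the downloaded content, 127.0.0.1 is followed by a space, so [^\s]* matches nothing and the match ends right after the IP address.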
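To see the difference between the two patterns, here is a minimal sketch run against a made-up two-line sample in hosts-file format. The host names bad.example and worse.example are assumptions for illustration and are not taken from the real feed.

```powershell
# Hypothetical sample lines in hosts-file format (not from the real feed).
$sample = "127.0.0.1 bad.example`n127.0.0.1 worse.example"

# Original pattern: no space after the IP, and a non-capturing group.
$old = [regex]::Matches($sample, '127\.0\.0\.1(?:[^\s]*)')
$old[0].Value            # 127.0.0.1  (the match stops at the space)
$old[0].Groups.Count     # 1          (only group 0, the whole match)

# Revised pattern: a literal space, then a capturing group for the host name.
$new = [regex]::Matches($sample, '127\.0\.0\.1 ([^\s]+)')
$new[0].Groups[1].Value  # bad.example
```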
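$matches = …
$matches is the name of the automatic $Matches variable, so it should not be used for custom purposes.
$urlList += $match.Value
$match.Value is the entire text that your regex matched, whereas you only want the text of the capture group.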
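Both points can be illustrated with a short sketch. It assumes a single made-up sample line, and it uses \S+ as shorthand for [^\s]+; the name $matchList is just one possible replacement for $matches.

```powershell
$sample = '127.0.0.1 bad.example'   # hypothetical sample line

# The -match operator populates the automatic $Matches hashtable,
# so a user variable with the same name would be overwritten here.
if ($sample -match '127\.0\.0\.1 (\S+)') {
    $Matches[0]   # 127.0.0.1 bad.example  (whole match)
    $Matches[1]   # bad.example            (capture group 1)
}

# With [regex]::Matches, store the result under a different name.
$matchList = [regex]::Matches($sample, '127\.0\.0\.1 (\S+)')
$matchList[0].Value            # 127.0.0.1 bad.example  (whole match)
$matchList[0].Groups[1].Value  # bad.example            (capture group only)
```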
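$urlList +=
Building an array iteratively with += is inefficient, because a new array must be allocated behind the scenes on every iteration; simply use the foreach statement as an expression and let PowerShell collect the results for you.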
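As a minimal sketch of using foreach as an expression, the following collects loop output directly into a variable. The input range 1..5 is an arbitrary assumption chosen only to keep the example small.

```powershell
$numbers = 1..5   # hypothetical input

# Assigning the foreach statement collects everything the loop body outputs.
$squares = foreach ($n in $numbers) { $n * $n }

$squares.Count        # 5
$squares -join ','    # 1,4,9,16,25
```

If the loop might emit zero items or only one, wrapping the statement in @( ), as in @(foreach ($n in $numbers) { $n * $n }), keeps the result an array.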
Putting it all together:
$threatFeedUrl = 'https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt'
# Download the threat feed data
$threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1 ([^\s]+)'
# Use the regular expression to find matches in the threat feed data
$matchList = [regex]::Matches($threatFeedData, $pattern)
# Create and populate the list with matched URLs
$urlList =
    foreach ($match in $matchList) {
        $match.Groups[1].Value
    }
# Specify the output file path
$outputFilePath = 'output.txt'
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
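If you want to check the extraction and file-writing steps without downloading the feed, something like the following sketch may help. The function name Get-BlockedHost, the sample text, and the temp file name hosts-sample.txt are all assumptions made up for this illustration.

```powershell
# Hypothetical helper; the name Get-BlockedHost is not from the script above.
function Get-BlockedHost {
    param([string] $Text)
    foreach ($m in [regex]::Matches($Text, '127\.0\.0\.1 ([^\s]+)')) {
        $m.Groups[1].Value   # emit only the captured host name
    }
}

# Made-up sample standing in for the downloaded feed.
$sampleFeed = "# comment line`n127.0.0.1 bad.example`n127.0.0.1 worse.example"

# Write the extracted names to a temp file, then read them back.
$tempFile = Join-Path ([System.IO.Path]::GetTempPath()) 'hosts-sample.txt'
Get-BlockedHost -Text $sampleFeed | Out-File -FilePath $tempFile
Get-Content -Path $tempFile
```

The file should contain the two sample host names, one per line, with the leading 127.0.0.1 removed.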