String.IndexOf() returns unexpected value - cannot extract substring between two search strings

Question

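I'm scripting the replacement of some proper names in a web story, to help my reading tool pronounce them correctly.

I get the content via a web request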

$webpage = (Invoke-WebRequest -URI 'https://wanderinginn.com/2018/03/20/4-20-e/').Content

This $webpage should be of type String.

Now

$webpage.IndexOf('<div class="entry-content">')

returns the correct value, while

$webpage.IndexOf("Previous Chapter")

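returns an unexpected value, and I need some explanation of why, or of how I can find the error myself.

In theory, the script should cut out the "body" of the page, run it through a list of proper names I want to replace, and push the result into an htm file. All of that works, but the value of IndexOf("Prev...") is not what I expect.

Edit: After the call to webrequest I can do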

Set-Clipboard $webrequest

and paste the result into Notepad++, where I can find both 'div class="entry-content"' and 'Previous Chapter'. If I do something like

Set-Clipboard $webpage.substring(
     $webpage.IndexOf('<div class="entry-content">'),
     $webpage.IndexOf('PreviousChapter')
   )

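I would expect PowerShell to correctly determine the first instance of each of these strings and cut between them. My clipboard should therefore now hold the content I want, but the string extends further than the first occurrence.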

Answer 1

tl;dr

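You have a misconception about how the String.Substring() method works: the second argument must be the length of the substring to extract, not the end index (character position) - see below.
As an alternative, you can use a more concise (albeit more complex) regex operation with -replace to extract the substring of interest in a single operation - see below.
Overall, it is better to use an HTML parser to extract the desired information, because string processing is brittle (HTML allows variations in whitespace, quoting styles, ...).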

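As Lee_Dailey points out, you have a misconception about how the String.Substring() method works; its arguments are:

a starting index (0-based character position),
from which a substring of the given length is returned.

Instead, you tried to pass another index as the length argument.

To fix this, you must subtract the lower index from the higher one, so as to obtain the length of the substring to extract:

A simplified example: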

# Sample input from which to extract the substring 
#   '>>this up to here' 
# or, better,
#   'this up to here'.
$webpage = 'Return from >>this up to here<<'


# WRONG (your attempt):
# *index* of 2nd substring is mistakenly used as the *length* of the
# substring to extract, which in this case even *breaks*, because a length
# that exceeds the bounds of the string is specified.
$webpage.Substring(
  $webpage.IndexOf('>>'),
  $webpage.IndexOf('<<')
)

# OK, extracts '>>this up to here'
# The difference between the two indices is the correct length
# of the substring to extract.
$webpage.Substring(
  ($firstIndex = $webpage.IndexOf('>>')),
  $webpage.IndexOf('<<') - $firstIndex
)

# BETTER, extracts 'this up to here'
$startDelimiter = '>>'
$endDelimiter = '<<'
$webpage.Substring(
  ($firstIndex = $webpage.IndexOf($startDelimiter) + $startDelimiter.Length),
  $webpage.IndexOf($endDelimiter) - $firstIndex
)
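
The index arithmetic above could be wrapped in a small helper. What follows is only a sketch: the function name Get-TextBetween and its parameter names are hypothetical, not something PowerShell provides. It additionally guards against IndexOf() returning -1 (delimiter not found) and looks for the end delimiter only after the start delimiter, using the IndexOf(string, int, StringComparison) overload.

```powershell
function Get-TextBetween {
  param(
    [Parameter(Mandatory)] [string] $Text,
    [Parameter(Mandatory)] [string] $StartDelimiter,
    [Parameter(Mandatory)] [string] $EndDelimiter
  )
  # Ordinal comparison avoids culture-sensitive matching surprises.
  $startIndex = $Text.IndexOf($StartDelimiter, [StringComparison]::Ordinal)
  if ($startIndex -lt 0) { return $null }  # start delimiter not found

  # Position of the first character *after* the start delimiter.
  $contentStart = $startIndex + $StartDelimiter.Length

  # Search for the end delimiter only from that position onward.
  $endIndex = $Text.IndexOf($EndDelimiter, $contentStart, [StringComparison]::Ordinal)
  if ($endIndex -lt 0) { return $null }    # end delimiter not found

  # 2nd argument is a *length*: end index minus start index.
  $Text.Substring($contentStart, $endIndex - $contentStart)
}

# Outputs 'this up to here'
Get-TextBetween -Text 'Return from >>this up to here<<' -StartDelimiter '>>' -EndDelimiter '<<'
```

Returning $null when a delimiter is missing is a design choice of this sketch; throwing an error instead would be equally reasonable.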

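A general caveat re .Substring():

In the following cases, this .NET method throws an exception, which PowerShell surfaces as a statement-terminating error; that is, by default the statement itself is terminated, but execution continues:

If you specify an index that is outside the bounds of the string (a 0-based character position less than 0 or greater than the length of the string):
'abc'.Substring(4) # ERROR "startIndex cannot be larger than length of string"
If you specify a length whose endpoint would fall outside the bounds of the string (if the index plus the length yields an index greater than the length of the string):
'abc'.Substring(1, 3) # ERROR "Index and length must refer to a location within the string"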
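
Because the error is statement-terminating, it can be caught with try / catch. A minimal sketch, using a made-up sample string and a fallback chosen purely for illustration:

```powershell
$text = 'abc'
try {
  # 1 + 3 exceeds the string's length of 3, so .Substring() throws.
  $result = $text.Substring(1, 3)
}
catch {
  # Illustrative fallback: take everything from the start index onward.
  $result = $text.Substring(1)
}
$result  # 'bc'
```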

That said, you can use a single regex (regular expression) to extract the substring of interest via the -replace operator:

$webpage = 'Return from >>this up to here<<'

# Outputs 'this up to here'
$webpage -replace '^.*?>>(.*?)<<.*', '$1'

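The key is to have the regex match the entire string and capture the substring of interest via a capture group ((...)), whose value ($1) can then be used as the replacement string, so that effectively only the captured substring is returned.

For more information about -replace, see this answer.

Note: In your specific case, an additional tweak is needed, because you are dealing with a multi-line string: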

$webpage -replace '(?s).*?<div class="entry-content">(.*?)Previous Chapter.*', '$1'

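Inline option ((?...)) s ensures that metacharacter . also matches newlines (so that .* matches across lines), which it does not do by default.
Note that you may have to apply escaping to the search strings to be embedded in the regex, if they happen to contain regex metacharacters (characters with special meaning in the context of a regex):
With embedded literal strings, \-escape characters as needed; e.g., escape .txt as \.txt
If a string to embed comes from a variable, apply [regex]::Escape() to its value first; e.g.:

$var = '.txt'
# [regex]::Escape() yields '\.txt', which ensures
# that '.txt' doesn't also match '_txt'
'a_txt a.txt' -replace ('a' + [regex]::Escape($var)), 'a.csv'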
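
To try out the (?s) option together with escaping, here is a small sketch that uses a made-up multi-line string in place of the real page content (the sample HTML and the variable names are assumptions for illustration). One thing worth knowing: if the regex does not match, -replace returns the input string unchanged, so the sketch tests with -match first.

```powershell
# Made-up stand-in for the downloaded page (`n is a newline).
$sample = "<html>`n<div class=`"entry-content`">`nLine 1`nLine 2`nPrevious Chapter</div>`n</html>"

# Escape both delimiters so they are matched literally.
$startDelimiter = [regex]::Escape('<div class="entry-content">')
$endDelimiter   = [regex]::Escape('Previous Chapter')

# (?s) lets . match newlines; the capture group holds the text in between.
$regex = '(?s).*?' + $startDelimiter + '(.*?)' + $endDelimiter + '.*'

if ($sample -match $regex) {
  $body = $sample -replace $regex, '$1'
}
else {
  # Without this check, a failed match would silently yield the whole input.
  $body = $null
}
$body
```

With this sample, $body contains the two lines between the delimiters, including the surrounding newlines.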
