有人能想到比我创建的更好的正则表达式模式文本吗?
我已经测试了我能找到的关于这个主题的所有相关正则表达式代码,它们都不适用于具有许多空格和文本前导和尾随的许多空白行的多行,以及每行文本之前和之后,以及多个空行行(行上有空格),我需要规范化的真正垃圾文本。我还需要删除重复的单词、空格和逗号。
在 Excel 中,我调用 RegEx() 函数来修剪前导/尾随空白字符(所有空白)以及前导/尾随空白行,并且如果文本中存在多个空白行,则在中间仅保留 1 个空白行。细胞。
A1 值是具有上述垃圾文本的多行文本。
我的目标是修剪所有空白行,包括不间断字符,并删除所有前导/尾随空白行,并只留下 1 个空白行。所以它会删除 2 个以上的空白行,使它们只剩下 1 个。
我希望代码高效,并且如果可能的话可以组合一些步骤,从而减少步骤。
此代码确实修剪了前导/尾随空白字符和空行并留下一行;它在每一行中留下多个空白,但它非常好。
A1 = see below
It's difficult to input because the multiple lines, but I will type some text that shows, imagine this below as in cell A1:
Several blank lines with spaces before this text and now this. Mark is a really nice guy. And
there is a lot of
white space all over the. place, and and , , ,,,,, lots of commas, , ,
and , text text Text before and after , commas
many blank characters on every single blank line
在F1中找到的预期结果应该是这样的:
Several blank lines with spaces before this text and now this. Mark is a really nice guy. And
there is a lot of
white space all over the. place, and, lots of commas,
and, text before and after, commas
many blank characters on every single blank line
B1 = RegExp(A1, "(\r?\n\s*){2,}", CHAR(10) & CHAR(10), TRUE, TRUE, TRUE)
C1 = RegExp(B1, "^(?:[\t\s\xA0]+)|(?:[\t\s\xA0]+)$", "", TRUE, FALSE, TRUE)
注意:此时 C1 已经修剪了文本并且只有 1 行,所以这是一个好的开始。
D1 = RegExp(C1, "[ \xA0]+", " ", TRUE, TRUE, TRUE)
注意:D1 现在已删除重复的空格。
E1 = RegExp(D1, "[ ,]{2,}", ", ", TRUE, TRUE, TRUE)
注意:E1现在已经删除了重复的空格和逗号,我的文本中有很多垃圾
F1 = RegExp(E1, "\b(\w+)\b[\s\xA0]+(?=\1\b)", "", TRUE, FALSE, TRUE)
注意:F1 现在删除了重复的单词,因此只有一个单词没有删除,2+
这是工作代码,如果有人能想到更高效、更简洁的代码,请告诉我。
Public Function RegExp(ByVal text As String, _
ByVal pattern As String, _
ByVal strReplace As String, _
Optional ByVal caseInsensitive As Boolean = True, _
Optional ByVal multiLines As Boolean = True, _
Optional ByVal allOccurrences As Boolean = True) As String
Dim regEx As New RegExp
With regEx
.pattern = pattern
.MultiLine = multiLines
.IgnoreCase = caseInsensitive
.Global = allOccurrences
End With
RegExp = regEx.Replace(text, strReplace) 'returns modified text (or original text if no modifications)
End Function
只是为了记录失败的尝试,但效果很好:
这里的代码开始工作,但中间留下了多个空行
B1 = RegExp(A1, "^[\s\xA0]+|[\s\xA0]+$", "", TRUE, FALSE, TRUE)
C1 = RegExp(B1, "^[\s\xA0]+|[\s\xA0]+$", CHAR(10), TRUE, TRUE, TRUE)
这个答案不是关于模式,而是关于执行效率和速度。如果您可以使用较小的替代品但仍然获得良好的性能,那么在我看来,这会简化整个过程。
这里有一些示例代码,展示了如何在调用之间重用 RegExp 对象,而不是为每个调用创建新的实例。
'Run a performance test
Sub Tester()
Dim n As Long, txt As String, res As String, t
txt = [a1].Value 'your example input
t = Timer
For n = 1 To 1000
res = FixText(txt)
Next n
Debug.Print Timer - t '10-11 sec with OPTIMIZE = False, 0.1 sec when True
Debug.Print res
End Sub
Function FixText(ByVal text As String)
Const OPTIMIZE As Boolean = True 'caches objects when True
Dim arr, n, re As Object, i As Long, res As String
Static col As Collection 'cache prepared RegExp objects between calls
'array of alternating pattern and repalcement text pairs
arr = Array("(\r?\n\s*){2,}", vbLf, _
"^(?:[\t\s\xA0]+)|(?:[\t\s\xA0]+)$", "", _
"[ \xA0]+", " ", _
"[ ,]{2,}", ", ")
If OPTIMIZE And col Is Nothing Then 'need to intialize the collection?
Set col = New Collection
For n = LBound(arr) To UBound(arr) Step 2
col.Add GetRegExp(arr(n), True, True, True)
Next n
End If
If OPTIMIZE Then
'using cached RegExp objects
i = 1 'index of first replacement text
For Each re In col
text = re.Replace(text, CStr(arr(i)))
i = i + 2
Next re
Else
'using freshly-created objects for each call
For n = LBound(arr) To UBound(arr) Step 2
text = GetRegExp(arr(n), True, True, True).Replace(text, arr(n + 1))
Next n
End If
FixText = text
End Function
'return a configured RegExp object
Public Function GetRegExp(ByVal pattern As String, _
Optional ByVal caseInsensitive As Boolean = True, _
Optional ByVal multiLines As Boolean = True, _
Optional ByVal allOccurrences As Boolean = True) As RegExp
Set GetRegExp = New RegExp
With GetRegExp
.MultiLine = multiLines
.IgnoreCase = caseInsensitive
.pattern = pattern
.Global = allOccurrences
End With
End Function