Excel VBA 正则表达式代码可修剪多行文本的前导/尾随空格和空行，以及重复的单词/空格/逗号

Question

有人能想到比我创建的更好的正则表达式模式文本吗？

我已经测试了我能找到的关于这个主题的所有相关正则表达式代码，它们都不适用于具有许多空格和文本前导和尾随的许多空白行的多行，以及每行文本之前和之后，以及多个空行行（行上有空格），我需要规范化的真正垃圾文本。我还需要删除重复的单词、空格和逗号。

在 Excel 中，我调用 RegEx() 函数来修剪前导/尾随空白字符（所有空白）以及前导/尾随空白行，并且如果文本中存在多个空白行，则在中间仅保留 1 个空白行。细胞。

A1 值是具有上述垃圾文本的多行文本。

我的目标是修剪所有空白行，包括不间断字符，并删除所有前导/尾随空白行，并只留下 1 个空白行。所以它会删除 2 个以上的空白行，使它们只剩下 1 个。

我希望代码高效，并且如果可能的话可以组合一些步骤，从而减少步骤。

此代码确实修剪了前导/尾随空白字符和空行并留下一行；它在每一行中留下多个空白，但它非常好。

A1 = see below
It's difficult to input because the multiple lines, but I will type some text that shows, imagine this below as in cell A1:
           
            
           
          
         Several blank lines with spaces before this text and now this. Mark     is a  really      nice guy.       And     
             there is a lot of             
               
       
            
                
       white     space all     over the.     place, and and    ,  , ,,,,,  lots of commas,  ,   , 

      
                   
and , text   text  Text before and  after , commas
          
                         
                     
            many blank characters on every single blank line

在F1中找到的预期结果应该是这样的：

Several blank lines with spaces before this text and now this. Mark is a really nice guy. And
there is a lot of      
 
white space all over the. place, and, lots of commas,
 
and, text before and after, commas
 
many blank characters on every single blank line

B1 = RegExp(A1, "(\r?\n\s*){2,}", CHAR(10) & CHAR(10), TRUE, TRUE, TRUE)

C1 = RegExp(B1, "^(?:[\t\s\xA0]+)|(?:[\t\s\xA0]+)$", "", TRUE, FALSE, TRUE)

注意：此时 C1 已经修剪了文本并且只有 1 行，所以这是一个好的开始。

D1 = RegExp(C1, "[ \xA0]+", " ", TRUE, TRUE, TRUE)

注意：D1 现在已删除重复的空格。

E1 = RegExp(D1, "[ ,]{2,}", ", ", TRUE, TRUE, TRUE)

注意：E1现在已经删除了重复的空格和逗号，我的文本中有很多垃圾

F1 = RegExp(E1, "\b(\w+)\b[\s\xA0]+(?=\1\b)", "", TRUE, FALSE, TRUE)

注意：F1 现在删除了重复的单词，因此只有一个单词没有删除，2+

这是工作代码，如果有人能想到更高效、更简洁的代码，请告诉我。

Public Function RegExp(ByVal text As String, _
    ByVal pattern As String, _
    ByVal strReplace As String, _
    Optional ByVal caseInsensitive As Boolean = True, _
    Optional ByVal multiLines As Boolean = True, _
    Optional ByVal allOccurrences As Boolean = True) As String
    Dim regEx As New RegExp
    With regEx
        .pattern = pattern
        .MultiLine = multiLines
        .IgnoreCase = caseInsensitive
        .Global = allOccurrences
    End With
    RegExp = regEx.Replace(text, strReplace) 'returns modified text (or original text if no modifications)
End Function

只是为了记录失败的尝试，但效果很好：

这里的代码开始工作，但中间留下了多个空行

B1 = RegExp(A1, "^[\s\xA0]+|[\s\xA0]+$", "", TRUE, FALSE, TRUE)
C1 = RegExp(B1, "^[\s\xA0]+|[\s\xA0]+$", CHAR(10), TRUE, TRUE, TRUE)

Answer 1

这个答案不是关于模式，而是关于执行效率和速度。如果您可以使用较小的替代品但仍然获得良好的性能，那么在我看来，这会简化整个过程。

这里有一些示例代码，展示了如何在调用之间重用 RegExp 对象，而不是为每个调用创建新的实例。


'Run a performance test
Sub Tester()
    Dim n As Long, txt As String, res As String, t
    
    txt = [a1].Value      'your example input
    t = Timer
    For n = 1 To 1000
        res = FixText(txt)
    Next n
    
    Debug.Print Timer - t  '10-11 sec with OPTIMIZE = False, 0.1 sec when True
    Debug.Print res
    
End Sub


Function FixText(ByVal text As String)
    Const OPTIMIZE As Boolean = True 'caches objects when True
    Dim arr, n, re As Object, i As Long, res As String
    Static col As Collection 'cache prepared RegExp objects between calls
    
    'array of alternating pattern and repalcement text pairs
    arr = Array("(\r?\n\s*){2,}", vbLf, _
                "^(?:[\t\s\xA0]+)|(?:[\t\s\xA0]+)$", "", _
                "[ \xA0]+", " ", _
                "[ ,]{2,}", ", ")
    
    If OPTIMIZE And col Is Nothing Then      'need to intialize the collection?
        Set col = New Collection
        For n = LBound(arr) To UBound(arr) Step 2
            col.Add GetRegExp(arr(n), True, True, True)
        Next n
    End If
    
    If OPTIMIZE Then
        'using cached RegExp objects
        i = 1 'index of first replacement text
        For Each re In col
            text = re.Replace(text, CStr(arr(i)))
            i = i + 2
        Next re
    Else
        'using freshly-created objects for each call
        For n = LBound(arr) To UBound(arr) Step 2
            text = GetRegExp(arr(n), True, True, True).Replace(text, arr(n + 1))
        Next n
    End If
    
    FixText = text
End Function

'return a configured RegExp object
Public Function GetRegExp(ByVal pattern As String, _
                          Optional ByVal caseInsensitive As Boolean = True, _
                          Optional ByVal multiLines As Boolean = True, _
                          Optional ByVal allOccurrences As Boolean = True) As RegExp
    Set GetRegExp = New RegExp
    With GetRegExp
        .MultiLine = multiLines
        .IgnoreCase = caseInsensitive
        .pattern = pattern
        .Global = allOccurrences
    End With
End Function

Excel VBA 正则表达式代码可修剪多行文本的前导/尾随空格和空行，以及重复的单词/空格/逗号

问题描述投票：0回答：1

1个回答

最新问题

Excel VBA 正则表达式代码可修剪多行文本的前导/尾随空格和空行，以及重复的单词/空格/逗号

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1