快速删除(非常大的)字符串中不必要的空格

问题描述 投票:0回答:3

我正在 VBA 中处理非常大的(45,000,000+ 个字符)字符串,我需要删除多余的 空白

一个空格(又名,ASCII Code 32)可以,但任何具有两个或多个连续空格的部分应减少到只有一个。

我发现了一个类似的问题here,尽管OP对“非常长的字符串”的定义是只有39,000个字符。接受的答案是使用

Replace
:

的循环
Function MyTrim(s As String) As String
    Do While InStr(s, "  ") > 0
        s = Replace$(s, "  ", " ")
    Loop
    MyTrim = Trim$(s)
End Function

我尝试了这个方法,它“有效”,但速度痛苦慢:

Len In:  44930886 
Len Out: 35322469
Runtime: 247.6 seconds

有没有更快的方法来从“非常大”的字符串中删除空格?

excel vba performance trim removing-whitespace
3个回答
6
投票

我怀疑性能问题是由于创建了大量的大型中间字符串造成的。因此,任何无需创建中间字符串或更少的中间字符串即可执行操作的方法都会表现更好。

正则表达式替换很有可能实现这一点。

Option Explicit

Sub Test(ByVal text As String)

  Static Regex As Object
  If Regex Is Nothing Then
    Set Regex = CreateObject("VBScript.RegExp")
    Regex.Global = True
    Regex.MultiLine = True
  End If

  Regex.Pattern = " +" ' space, one or more times

  Dim result As String: result = Regex.Replace(text, " ")
  Debug.Print Len(result), Left(result, 20)
End Sub

输入 4500 万个字符的字符串大约需要一秒钟。

跑步者:

Sub Main()

  Const ForReading As Integer = 1
  Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
  Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
  Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
  Dim text As String: text = file.ReadAll()
  Set file = Nothing
  Set fso = Nothing
  Debug.Print Len(text), Left(text, 20)

  Test (text)

End Sub

测试数据创建器(C#):

var substring = "××\n× ××   ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(@"C:\ProgramData\test.txt", text, encoding);

顺便说一句——由于 VBA(VB4、Java、JavaScript、C#、VB...)使用 UTF-16,因此空格字符是一个 UTF-16 代码单元

ChrW(32)
。 (任何与 ASCII 的相似或比较都是不必要的脑力体操,如果将其作为 ANSI [
Chr(32)
] 写入代码,则会在幕后进行不必要的转换,并且对于不同的机器、用户和时间会有不同的行为。)


1
投票

在 VBA 中,

String
的大小限制为大约 20 亿个字符。上面的“
Replace
-
Loop
”方法对于4500万个字符串花了247秒,超过4分钟。

理论上,这意味着 20 亿个字符串至少需要 3 个小时——如果它甚至没有崩溃的话——所以它并不完全实用。

Excel 有一个内置的工作表函数

Trim
,它与 VBA 的 Trim
 函数
不一样

工作表功能

Trim
删除文本中的所有空格(单词之间的单个空格除外)。

问题

Trim
,像所有用
Application.WorksheetFunction
调用的函数一样,有32,767个字符的大小限制,并且[不幸的是]这也适用即使在使用字符串从VBA调用函数时那甚至不在牢房中

但是,如果我们使用它来循环遍历我们的“巨大字符串”,我们仍然可以使用该函数,如下所示:

编辑:甚至不用理会这些废话(我的函数,在下面)!参见RegEx答案上面

Function bigTrim(strIn As String) As String

    Const maxLen = 32766
    Dim loops As Long, x As Long
    loops = Int(Len(strIn) / maxLen)
    If (Len(strIn) / maxLen) <> loops Then loops = loops + 1

    For x = 1 To loops
        bigTrim = bigTrim & _
            Application.WorksheetFunction.Trim(Mid(strIn, _
            ((x - 1) * maxLen) + 1, maxLen))
    Next x

End Function

在与“

Replace
-
Loop
”方法使用的同一字符串上运行此函数会产生更多更好的结果:

Len In:  44930886 
Len Out: 35321845
Runtime: 33.6 seconds

这比“

Replace
-
Loop
”方法快了 7 倍以上,and 成功删除了其他方法遗漏的 624 个空格。

(我想调查为什么第一种方法丢失了字符,但因为我知道我的字符串没有丢失任何东西,而且这个练习的目的是为了节省时间,那太愚蠢了!)


0
投票

这个问题比答案看起来要有趣得多,因为OP提出的解决方案应该没有任何问题,因为理论上该算法非常高效。

事实证明,这里的问题是 VBA 内置

Replace
函数在底层的实现不佳,这导致它在大量替换的大字符串上完全卡住。

可以轻松地手动实现具有线性运行时间的

Replace
函数,其性能远远优于大字符串的内置函数。这里提供了此类实现的示例:

'Works like the inbuilt 'Replace', but only allocates the buffer once and is
'therefore much, much faster on large strings with many replacements
'This function is the renamed function `ReplaceFast` from here:
'https://github.com/guwidoe/VBA-StringTools
'Note that this implementation is slightly slower than the inbuilt 'Replace'
'function for short strings with few replacements
Public Function Replace(ByRef str As String, _
                        ByRef sFind As String, _
                        ByRef sReplace As String, _
               Optional ByVal lStart As Long = 1, _
               Optional ByVal lCount As Long = -1, _
               Optional ByVal lCompare As VbCompareMethod _
                                       = vbBinaryCompare) As String
    Const methodName As String = "Replace"
    If lStart < 1 Then Err.Raise 5, methodName, _
        "Argument 'lStart' = " & lStart & " < 1, invalid"
    If lCount < -1 Then Err.Raise 5, methodName, _
        "Argument 'lCount' = " & lCount & " < -1, invalid"
    lCount = lCount And &H7FFFFFFF

    If Len(str) = 0 Or Len(sFind) = 0 Then
        Replace = Mid$(str, lStart)
        Exit Function
    End If

    Dim lenFind As Long:         lenFind = Len(sFind)
    Dim lenReplace As Long:      lenReplace = Len(sReplace)
    Dim bufferSizeChange As Long
    bufferSizeChange = CountSubstring(str, sFind, lStart, lCount, lCompare) _
                                       * (lenReplace - lenFind) - lStart + 1

    If Len(str) + bufferSizeChange < 0 Then Exit Function

    Replace = Space$(Len(str) + bufferSizeChange)

    Dim i As Long:              i = InStr(lStart, str, sFind, lCompare)
    Dim j As Long:              j = 1
    Dim lastOccurrence As Long: lastOccurrence = lStart
    Dim count As Long:          count = 1

    Do Until i = 0 Or count > lCount
        Dim diff As Long: diff = i - lastOccurrence
        If diff > 0 Then _
            Mid$(Replace, j, diff) = Mid$(str, lastOccurrence, diff)
        j = j + diff
        If lenReplace <> 0 Then
            Mid$(Replace, j, lenReplace) = sReplace
            j = j + lenReplace
        End If
        count = count + 1
        lastOccurrence = i + lenFind
        i = InStr(lastOccurrence, str, sFind, lCompare)
    Loop
    If j <= Len(Replace) Then Mid$(Replace, j) = Mid$(str, lastOccurrence)
End Function
Public Function CountSubstring(ByRef str As String, _
                               ByRef subStr As String, _
                      Optional ByVal lStart As Long = 1, _
                      Optional ByVal lLimit As Long = -1, _
                      Optional ByVal lCompare As VbCompareMethod _
                                                 = vbBinaryCompare) As Long
    Const methodName As String = "CountSubstring"
    If lStart < 1 Then Err.Raise 5, methodName, _
        "Argument 'Start' = " & lStart & " < 1, invalid"
    If lLimit < -1 Then Err.Raise 5, methodName, _
        "Argument 'lLimit' = " & lLimit & " < -1, invalid"
    If subStr = vbNullString Then Exit Function

    Dim lenSubStr As Long: lenSubStr = Len(subStr)
    Dim i As Long:         i = InStr(lStart, str, subStr, lCompare)

    CountSubstring = 0
    Do Until i = 0 Or lLimit = CountSubstring
        CountSubstring = CountSubstring + 1
        i = InStr(i + lenSubStr, str, subStr, lCompare)
    Loop
End Function

只需将此代码粘贴到项目中即可修复原始代码的性能问题,而无需对其进行任何更改,只需覆盖内置的

Replace
函数即可。

在我的测试中,当改进的

Replace
函数存在时,原始代码应该只需要大约 1.5 秒来处理类似于 OP 示例的字符串,改进超过 100 倍!:

Sub DemoMyTrim()
    Const LEN_INPUT_STR = 45000000
    
    Dim inputStr As String: inputStr = RepeatString("  aaa", LEN_INPUT_STR / 5)
    
    Dim t As Single: t = Timer()
    
    Dim outStr As String: outStr = MyTrim(inputStr)

    Debug.Print "Trimming took " & Timer() - t & " seconds."
    Debug.Print "Len Out: " & Len(outStr)
End Sub

''RepeatString' function source:
'https://github.com/guwidoe/VBA-StringTools
Private Function RepeatString(ByRef str As String, _
                     Optional ByVal repeatTimes As Long = 2) As String
    Const methodName As String = "RepeatString"
    If repeatTimes < 0 Then Err.Raise 5, methodName, _
        "Argument 'repeatTimes' = " & repeatTimes & " < 0, invalid"
    If repeatTimes = 0 Then Exit Function
    If LenB(str) = 2 Then
        RepeatString = String$(repeatTimes, str)
        Exit Function
    End If

    Dim newLength As Long: newLength = LenB(str) * repeatTimes
    RepeatString = Space$((newLength + 1) \ 2)
    If newLength Mod 2 = 1 Then RepeatString = MidB$(RepeatString, 2)
    
    MidB$(RepeatString, 1) = str
    If repeatTimes > 1 Then MidB$(RepeatString, LenB(str) + 1) = RepeatString
End Function

Public Function MyTrim(ByRef s As String) As String
    MyTrim = s
    Do While InStr(MyTrim, "  ") > 0
        MyTrim = Replace(MyTrim, "  ", " ")
    Loop
End Function

虽然这非常有趣,但它仍然比已接受的答案提出的正则表达式解决方案慢。

由于接受的答案使用 Mac 上不可用的正则表达式,我想提出另一种替代方案,它比原始算法更快,具有改进的

Replace
函数,并且仍然可以使用在任何平台上可用的 VBA 内置函数。

这可以通过 LibStringTools 库中的另一个函数实现:

'Replaces consecutive occurrences of 'substring' that repeat more than 'limit'
'times with exactly 'limit' consecutive occurrences
'Source:
'https://github.com/guwidoe/VBA-StringTools
Public Function LimitConsecutiveSubstringRepetition( _
                                           ByRef str As String, _
                                  Optional ByRef subStr As String = vbNewLine, _
                                  Optional ByVal limit As Long = 1, _
                                  Optional ByVal Compare As VbCompareMethod _
                                                          = vbBinaryCompare) _
                                           As String
    Const methodName As String = "LimitConsecutiveSubstringRepetition"

    If limit < 0 Then Err.Raise 5, methodName, _
        "Argument 'limit' = " & limit & " < 0, invalid"
    If limit = 0 Then
        LimitConsecutiveSubstringRepetition = Replace(str, subStr, _
                                                      vbNullString, , , Compare)
        Exit Function
    Else
        LimitConsecutiveSubstringRepetition = str
    End If
    If Len(str) = 0 Then Exit Function
    If Len(subStr) = 0 Then Exit Function

    Dim i As Long:                i = InStr(1, str, subStr, Compare)
    Dim j As Long:                j = 1
    Dim lenSubStr As Long:        lenSubStr = Len(subStr)
    Dim lastOccurrence As Long:   lastOccurrence = 1 - lenSubStr
    Dim copyChunkSize As Long
    Dim consecutiveCount As Long
    Dim occurrenceDiff As Long

    Do Until i = 0
        occurrenceDiff = i - lastOccurrence
        If occurrenceDiff = lenSubStr Then
            consecutiveCount = consecutiveCount + 1
            If consecutiveCount <= limit Then
                copyChunkSize = copyChunkSize + occurrenceDiff
            ElseIf consecutiveCount = limit + 1 Then
                Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
                    Mid$(str, i - copyChunkSize, copyChunkSize)
                j = j + copyChunkSize
                copyChunkSize = 0
            End If
        Else
            copyChunkSize = copyChunkSize + occurrenceDiff
            consecutiveCount = 1
        End If
        lastOccurrence = i
        i = InStr(i + lenSubStr, str, subStr, Compare)
    Loop

    copyChunkSize = copyChunkSize + Len(str) - lastOccurrence - lenSubStr + 1
    Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
        Mid$(str, Len(str) - copyChunkSize + 1)

    LimitConsecutiveSubstringRepetition = _
        Left$(LimitConsecutiveSubstringRepetition, j + copyChunkSize - 1)
End Function
Public Function CountSubstring(ByRef str As String, _
                               ByRef subStr As String, _
                      Optional ByVal lStart As Long = 1, _
                      Optional ByVal lLimit As Long = -1, _
                      Optional ByVal lCompare As VbCompareMethod _
                                                 = vbBinaryCompare) As Long
    Const methodName As String = "CountSubstring"
    If lStart < 1 Then Err.Raise 5, methodName, _
        "Argument 'Start' = " & lStart & " < 1, invalid"
    If lLimit < -1 Then Err.Raise 5, methodName, _
        "Argument 'lLimit' = " & lLimit & " < -1, invalid"
    If subStr = vbNullString Then Exit Function

    Dim lenSubStr As Long: lenSubStr = Len(subStr)
    Dim i As Long:         i = InStr(lStart, str, subStr, lCompare)

    CountSubstring = 0
    Do Until i = 0 Or lLimit = CountSubstring
        CountSubstring = CountSubstring + 1
        i = InStr(i + lenSubStr, str, subStr, lCompare)
    Loop
End Function

使用此功能,可以达到预期的效果,如下:

Dim inputStr as String
'... somehow populate input string

dim outStr as String
outStr = LimitConsecutiveSubstringRepetition(inputStr, " ", 1)
© www.soinside.com 2019 - 2024. All rights reserved.