How do I remove base64 strings from an arbitrary string?

I have a Python string, and I want to remove base64 strings from it. I have read up on the base64 specification and looked around, but I cannot find a clean way to remove them.

I tried a few hacky regular expressions, but they made my string worse; for example, one of them turned the word "problem" into "lem".

import re


def remove_base64_strings(text: str) -> str:
    """
    Remove base64 encoded strings from a string.
    """
    # Regular expression for matching potential base64 strings
    base64_pattern = r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
    # Replace found base64 strings with an empty string
    return re.sub(base64_pattern, "", text)

Or:

import re

base64_regex = r'^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$'

base64_strings = re.findall(base64_regex, text)

Is there a reliable way to remove Base64 strings?

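I am thinking of splitting the text into words on whitespace, then looking for words that match the pattern above and are at least 12 characters long. Base64 strings look like long random strings, and I want to be sure I remove them.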

I tried this:

def remove_base64_words(text: str, threshold_length: int = 24) -> str:
    """
    Remove words that are suspected to be Base64 encoded strings from a sentence.

    Args:
    text (str): The text from which to remove Base64 encoded words.
    threshold_length (int): The minimum length of a word to be considered a Base64 encoded string.

    Returns:
    str: The sentence with suspected Base64 encoded words removed.
    """
    # # Regex pattern for Base64 encoded strings
    # base64_pattern = r"\b(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?\b"
    # base64_pattern = r"^([A-Za-z0-9+/]{4}){5,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
    # base64_pattern = r"\b(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?\b"

    # # Function to replace suspected Base64 encoded words
    # def replace_base64_word(matchobj):
    #     word = matchobj.group(0)
    #     if len(word) >= threshold_length:
    #         return ""
    #     else:
    #         return word

    # # Replace words in the sentence that match the pattern and are above the threshold length
    # return re.sub(base64_pattern, replace_base64_word, sentence)
    """
    Remove words from the text whose length is at least threshold_length
    and a multiple of 4, and that are not found in the English dictionary.

    Args:
    text (str): The input text.

    Returns:
    str: The text with suspected Base64-like non-dictionary words removed.
    """
    import nltk
    nltk.download('words')
    from nltk.corpus import words

    # Set of English words
    english_words = set(words.words())

    # Split the text into words
    words_in_text = text.split()

    # Filter out words of specific length properties that are not in the English dictionary
    filtered_words = [word for word in words_in_text if not (len(word) >= threshold_length and len(word) % 4 == 0 and word.lower() not in english_words)]

    # Reassemble the text
    return ' '.join(filtered_words)

Unit tests:

    # base64
    # Unit tests
    test_sentences = [
        ("This is a test with no base64", "This is a test with no base64"),
        ("Base64 example: TWFuIGlzIGRpc3Rpbmd1aXNoZWQ=", "Base64 example: "),
        ("Short== but not base64", "Short== but not base64"),
        ("ValidBase64== but too short", "ValidBase64== but too short"),
        ("Mixed example with TWFuIGlzIGRpc3Rpbmd1aXNoZWQ= base64", "Mixed example with  base64"),
    ]
    for input_sentence, expected_output in test_sentences:
        our_output: str = remove_base64_words(input_sentence)
        print(f'Trying to remove Base64: {input_sentence=} --> {our_output=} {expected_output=}')
        # print(f'Trying to remove Base64: {input_sentence=} {expected_output=}')

Answer (0 votes)

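Isn't it obvious? Your regular expression does not implement the "at least 12 characters" requirement, so it replaces anything that matches the general pattern, regardless of length.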
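To make the effect concrete, here is a small sketch using the first pattern from the question and the word "problem" that the question mentions. The variable names are only illustrative. Because every part of the pattern is optional, it can match the empty string, and it will consume any leading run of four alphabet characters.

```python
import re

# The first pattern from the question, unchanged.
base64_pattern = r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"

# The pattern accepts the empty string, so it can "match" at every position.
matches_empty = re.fullmatch(base64_pattern, "") is not None

# At the start of "problem" the first four-character group is consumed.
first_match = re.match(base64_pattern, "problem").group(0)

# Removing every match therefore strips the first four letters of the word.
result = re.sub(base64_pattern, "", "problem")

print(matches_empty, repr(first_match), repr(result))
```

The remaining three letters cannot form another four-character group, which is why only the start of the word disappears.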

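A very small adjustment to the pattern enforces the 12-character minimum: change the first "*" to "{3,}", which requires the four-character group at the start to repeat at least 3 times.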

    base64_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
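As a quick check, the sketch below applies this adjusted pattern to one of the question's test sentences. The helper name remove_long_base64_runs and the last sample sentence are made up for illustration. One caveat: the pattern is not tied to word boundaries, so an ordinary word of 12 or more letters is still partly consumed.

```python
import re

# The adjusted pattern: at least three four-character groups (12+ characters).
adjusted_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"


def remove_long_base64_runs(text: str) -> str:
    """Hypothetical helper: delete every run matched by the adjusted pattern."""
    return re.sub(adjusted_pattern, "", text)


# Short words are now left alone.
print(repr(remove_long_base64_runs("problem")))

# The Base64 token from the question's tests is removed in full.
print(repr(remove_long_base64_runs("Base64 example: TWFuIGlzIGRpc3Rpbmd1aXNoZWQ=")))

# A 13-letter English word still loses its first 12 letters.
print(repr(remove_long_base64_runs("a distinguished guest")))
```

If long ordinary words can occur in the input, the length threshold alone is not enough, and some extra check on each word is needed.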

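Your pattern also does not accept a final group with three padding characters. Keep in mind that standard Base64 (RFC 4648) only ever ends in zero, one, or two "=" characters, so this only matters if you also want to catch malformed input; in that case the fix is just as simple: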

    base64_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]={3}|[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
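An alternative along the lines the question suggests is to work on whitespace-separated words and test each one. The following is only a sketch under stated assumptions, not a definitive solution. The names looks_like_base64 and remove_base64_tokens are hypothetical, the 12-character minimum comes from the question, and the rule that skips purely alphabetic words in ordinary casing is an added guess meant to protect long English words. It uses base64.b64decode with validate=True from the standard library to reject words containing characters outside the Base64 alphabet.

```python
import base64
import binascii
import re


def looks_like_base64(token: str, min_length: int = 12) -> bool:
    """Heuristic check; min_length and the plain-word rule are assumptions."""
    # Padded Base64 always has a length that is a multiple of 4.
    if len(token) < min_length or len(token) % 4 != 0:
        return False
    # Assumption: letters only, in lower, UPPER, or Title case, is a normal word.
    if token.isalpha() and (token.islower() or token.isupper() or token.istitle()):
        return False
    try:
        # validate=True raises on characters outside the Base64 alphabet.
        base64.b64decode(token, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def remove_base64_tokens(text: str, min_length: int = 12) -> str:
    """Blank out whitespace-delimited words that look like Base64."""
    return re.sub(
        r"\S+",
        lambda m: "" if looks_like_base64(m.group(0), min_length) else m.group(0),
        text,
    )


print(repr(remove_base64_tokens("Mixed example with TWFuIGlzIGRpc3Rpbmd1aXNoZWQ= base64")))
```

Replacing each word in place keeps the original whitespace, so the results match the expected outputs in the question's unit tests. A Base64 string with punctuation attached to it, or one embedded inside a longer word, would not be caught by this sketch.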
