How do I remove base64 strings from an arbitrary string?


I have a Python string and I want to remove any base64 strings from it. I read up on the base64 specification and looked around, but I couldn't find a clean way to remove them.

I tried a few hacky regular expressions, but they made my string worse; for example, they changed the word "problem" to "lem":

    import re

    def remove_base64_strings(text: str) -> str:
        """
        Remove base64 encoded strings from a string.
        """
        # Regular expression for matching potential base64 strings
        base64_pattern = r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
        # Replace found base64 strings with an empty string
        return re.sub(base64_pattern, "", text)

and:

    import re

    base64_regex = r'^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$'
    base64_strings = re.findall(base64_regex, text)

Is there a reliable way to remove Base64 strings?

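I'm considering splitting the text into words on whitespace, then looking for any word that matches the pattern above and is at least 12 characters long. Base64 strings look like long random strings, and I want to be sure before I remove anything.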
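The splitting idea described above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `looks_like_base64` and `remove_base64_tokens` are made up for this example, and the extra `base64.b64decode(..., validate=True)` check is an addition that was not part of the original attempt.

```python
import base64
import binascii
import re

# Shape of a complete base64 token: groups of four alphabet characters,
# optionally followed by one padded group ("xx==" or "xxx=").
BASE64_SHAPE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def looks_like_base64(word: str, min_length: int = 12) -> bool:
    """Return True if the whole word is long enough, has base64 shape, and decodes."""
    if len(word) < min_length or BASE64_SHAPE.fullmatch(word) is None:
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(word, validate=True)
    except binascii.Error:
        return False
    return True


def remove_base64_tokens(text: str, min_length: int = 12) -> str:
    """Drop whitespace-separated words that look like base64 (hypothetical helper)."""
    kept = [word for word in text.split() if not looks_like_base64(word, min_length)]
    return " ".join(kept)


print(remove_base64_tokens("Mixed example with TWFuIGlzIGRpc3Rpbmd1aXNoZWQ= base64"))
print(remove_base64_tokens("ValidBase64== but too short"))
```

Because the whole word must match, a fragment such as "prob" inside "problem" is never removed. The check is still only a heuristic: an ordinary 12-letter word such as "notification" has a valid base64 shape and would be dropped, so a higher `min_length` or an additional dictionary check may be needed.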


I tried this:

    def remove_base64_words(text: str, threshold_length: int = 24) -> str:
        """
        Remove words that are suspected to be Base64 encoded strings from a sentence.

        Args:
            sentence (str): The sentence from which to remove Base64 encoded words.
            threshold_length (int): The minimum length of a word to be considered a Base64 encoded string.

        Returns:
            str: The sentence with suspected Base64 encoded words removed.
        """
        # # Regex pattern for Base64 encoded strings
        # base64_pattern = r"\b(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?\b"
        # base64_pattern = r"^([A-Za-z0-9+/]{4}){5,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
        # base64_pattern = r"\b(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?\b"

        # # Function to replace suspected Base64 encoded words
        # def replace_base64_word(matchobj):
        #     word = matchobj.group(0)
        #     if len(word) >= threshold_length:
        #         return ""
        #     else:
        #         return word

        # # Replace words in the sentence that match the pattern and are above the threshold length
        # return re.sub(base64_pattern, replace_base64_word, sentence)
        """
        Remove words from the text that are of length 28 or more, are multiples of 4, and not found in the English dictionary.

        Args:
            text (str): The input text.

        Returns:
            str: The text with suspected Base64-like non-dictionary words removed.
        """
        import nltk
        nltk.download('words')
        from nltk.corpus import words

        # Set of English words
        english_words = set(words.words())
        # Split the text into words
        words_in_text = text.split()
        # Filter out words of specific length properties that are not in the English dictionary
        filtered_words = [
            word for word in words_in_text
            if not (len(word) >= threshold_length and len(word) % 4 == 0 and word.lower() not in english_words)
        ]
        # Reassemble the text
        return ' '.join(filtered_words)

Unit tests:

    # base64
    # Unit tests
    test_sentences = [
        ("This is a test with no base64", "This is a test with no base64"),
        ("Base64 example: TWFuIGlzIGRpc3Rpbmd1aXNoZWQ=", "Base64 example: "),
        ("Short== but not base64", "Short== but not base64"),
        ("ValidBase64== but too short", "ValidBase64== but too short"),
        ("Mixed example with TWFuIGlzIGRpc3Rpbmd1aXNoZWQ= base64", "Mixed example with base64"),
    ]
    for input_sentence, expected_output in test_sentences:
        our_output: str = remove_base64_words(input_sentence)
        print(f'Trying to remove Base64: {input_sentence=} --> {our_output=} {expected_output=}')
        # print(f'Trying to remove Base64: {input_sentence=} {expected_output=}')
    
python regex replace base64 data-cleaning
1 Answer
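Isn't it obvious? Your regex doesn't implement the "at least 12 characters" requirement, so it replaces anything that matches the general pattern, regardless of length. The "prob" in "problem" is a valid group of four base64 characters, so it gets removed and only "lem" is left.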
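A quick way to see this is to list what the original pattern actually matches in an ordinary sentence. The snippet below is only an illustration; the sample sentence is made up.

```python
import re

# The pattern from the question, unchanged.
base64_pattern = r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"

sample = "a problem here"

# The pattern can also match the empty string, so keep only non-empty matches.
matches = [m.group(0) for m in re.finditer(base64_pattern, sample) if m.group(0)]
print(matches)  # ['prob', 'here']

# Any run of four letters is treated as base64 and stripped.
print(repr(re.sub(base64_pattern, "", sample)))  # 'a lem '
```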

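A very small tweak to the pattern enforces the minimum of 12 characters: change the first * to {3,}, which requires the four-character group at the start to repeat at least 3 times.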

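    base64_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"

Your pattern also does not handle strings that end in three padding characters. Standard base64 (RFC 4648) only ever produces zero, one, or two "=" characters, so this matters only if your input contains such non-standard strings. If it does, the fix is just as simple: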

    base64_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]={3}|[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
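As a rough check, the {3,} version can be run against a few sentences like the ones in the question. The `anchored_pattern` variant below, which uses whitespace lookarounds, is an assumption added for illustration and is not part of the suggestion above. It shows one possible way to stop the pattern from eating the first 12 letters of a long ordinary word.

```python
import re

# Pattern with the {3,} tweak: at least three 4-character groups, then optional padding.
base64_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"

# Assumed variant: only match when the candidate is a whole whitespace-delimited word.
anchored_pattern = r"(?<!\S)" + base64_pattern + r"(?!\S)"


def remove_base64_strings(text: str, pattern: str = base64_pattern) -> str:
    """Replace every match of the given pattern with an empty string."""
    return re.sub(pattern, "", text)


# Short words are no longer touched.
print(repr(remove_base64_strings("a problem here")))
# The encoded string is removed, leaving the surrounding spaces behind.
print(repr(remove_base64_strings("Base64 example: TWFuIGlzIGRpc3Rpbmd1aXNoZWQ=")))
# A 13-letter word still loses its first 12 letters with the unanchored pattern...
print(repr(remove_base64_strings("a distinguished guest")))
# ...but is left alone by the anchored variant.
print(repr(remove_base64_strings("a distinguished guest", anchored_pattern)))
```

Even the anchored variant still removes ordinary words whose length is an exact multiple of four and at least 12, so the length threshold remains a heuristic.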
    