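I have a Python string and want to remove Base64 strings from it. I read up on the Base64 specification and looked around, but I could not find a clean way to remove them.
I tried a couple of hacky regular expressions, but they made my string worse; for example, the first one turns the word `problem` into `lem`: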
import re

def remove_base64_strings(text: str) -> str:
    """
    Remove base64 encoded strings from a string.
    """
    # Regular expression for matching potential base64 strings
    base64_pattern = r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
    # Replace found base64 strings with an empty string
    return re.sub(base64_pattern, "", text)
or
import re
base64_regex = r'^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$'
base64_strings = re.findall(base64_regex, text)
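To see why both attempts misbehave, here is a small sketch (the sample strings are made up for illustration). Every part of the first pattern is optional and it is not anchored, so it consumes any run of four alphabet characters inside an ordinary word. The second pattern is anchored with `^` and `$`, so it can only match when the entire text is one Base64 string.

```python
import re

unanchored = r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
anchored = r"^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$"

# "prob" is a valid 4-character group, so it is removed and "lem" is left over.
stripped = re.sub(unanchored, "", "problem")
print(stripped)  # lem

# Base64 embedded in a longer text: the anchors prevent any match.
embedded = re.search(anchored, "some text TWFuIGlz more text")
print(embedded)  # None

# The anchored pattern only succeeds when the whole string is Base64.
whole = re.search(anchored, "TWFuIGlz")
print(whole is not None)  # True
```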
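Is there a reliable way to remove Base64 strings? I am thinking of splitting the text on whitespace and then looking for words that match the pattern above and are at least 12 characters long. Base64 strings look like long random strings, and I want to be sure about removing them.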
def remove_base64_words(text: str, threshold_length: int = 24) -> str:
    """
    Remove words that are suspected to be Base64 encoded strings from a sentence.
    Args:
        sentence (str): The sentence from which to remove Base64 encoded words.
        threshold_length (int): The minimum length of a word to be considered a Base64 encoded string.
    Returns:
        str: The sentence with suspected Base64 encoded words removed.
    """
    # # Regex pattern for Base64 encoded strings
    # base64_pattern = r"\b(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?\b"
    # base64_pattern = r"^([A-Za-z0-9+/]{4}){5,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
    # base64_pattern = r"\b(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?\b"
    # # Function to replace suspected Base64 encoded words
    # def replace_base64_word(matchobj):
    #     word = matchobj.group(0)
    #     if len(word) >= threshold_length:
    #         return ""
    #     else:
    #         return word
    # # Replace words in the sentence that match the pattern and are above the threshold length
    # return re.sub(base64_pattern, replace_base64_word, sentence)
    """
    Remove words from the text that are of length 28 or more,
    are multiples of 4, and not found in the English dictionary.
    Args:
        text (str): The input text.
    Returns:
        str: The text with suspected Base64-like non-dictionary words removed.
    """
    import nltk
    nltk.download('words')
    from nltk.corpus import words
    # Set of English words
    english_words = set(words.words())
    # Split the text into words
    words_in_text = text.split()
    # Filter out words of specific length properties that are not in the English dictionary
    filtered_words = [
        word for word in words_in_text
        if not (len(word) >= threshold_length and len(word) % 4 == 0 and word.lower() not in english_words)
    ]
    # Reassemble the text
    return ' '.join(filtered_words)
Unit tests:
# base64
# Unit tests
test_sentences = [
    ("This is a test with no base64", "This is a test with no base64"),
    ("Base64 example: TWFuIGlzIGRpc3Rpbmd1aXNoZWQ=", "Base64 example: "),
    ("Short== but not base64", "Short== but not base64"),
    ("ValidBase64== but too short", "ValidBase64== but too short"),
    ("Mixed example with TWFuIGlzIGRpc3Rpbmd1aXNoZWQ= base64", "Mixed example with base64"),
]
for input_sentence, expected_output in test_sentences:
    our_output: str = remove_base64_words(input_sentence)
    print(f'Trying to remove Base64: {input_sentence=} --> {our_output=} {expected_output=}')
    # print(f'Trying to remove Base64: {input_sentence=} {expected_output=}')
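A very small adjustment to the pattern makes it require at least 12 characters: change the first `*` to `{3,}`, so that the four-character group must repeat at least 3 times at the start.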
base64_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
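As a rough illustration of what the adjusted pattern does and does not fix, the sketch below runs it on a few sample strings. The helper `strip_matches`, the sample word `characteristics`, and the lookaround-wrapped `bounded_pattern` are assumptions added here for illustration; they are not part of the answer.

```python
import re

# Pattern from the answer: at least three 4-character groups, optional padded tail.
base64_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"

# Hypothetical tightening (an assumption, not from the answer): refuse to match
# when the candidate is glued to further Base64-alphabet characters.
bounded_pattern = r"(?<![A-Za-z0-9+/=])" + base64_pattern + r"(?![A-Za-z0-9+/=])"

def strip_matches(pattern: str, text: str) -> str:
    """Delete every match of pattern from text."""
    return re.sub(pattern, "", text)

encoded = "Base64 example: TWFuIGlzIGRpc3Rpbmd1aXNoZWQ="
print(repr(strip_matches(base64_pattern, encoded)))  # 'Base64 example: '
print(repr(strip_matches(base64_pattern, "ValidBase64== but too short")))  # unchanged
# Unanchored, the pattern still eats the first 12 letters of a long word.
print(repr(strip_matches(base64_pattern, "a characteristics b")))  # 'a ics b'
# With the lookarounds, a partial match inside a word is rejected.
print(repr(strip_matches(bounded_pattern, "a characteristics b")))  # 'a characteristics b'
print(repr(strip_matches(bounded_pattern, encoded)))  # 'Base64 example: '
```

Even with the lookarounds, an ordinary word that is at least 12 characters long and whose length is a multiple of four still matches, so a length threshold alone cannot fully separate Base64 from prose.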
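Your pattern also does not match a group that ends in three padding characters. Standard Base64 (RFC 4648) never produces that, because padding is at most `==`, so this only matters if your input contains malformed or non-standard padding. If it does, the extra alternative is straightforward to add:
base64_pattern = r"(?:[A-Za-z0-9+/]{4}){3,}(?:[A-Za-z0-9+/]={3}|[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"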
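As a sanity check on the padding cases, the following sketch encodes inputs of several lengths with Python's standard `base64` module and records how many `=` characters each result ends with. With the standard encoder only 0, 1, or 2 padding characters occur, so the `={3}` alternative can only ever fire on malformed or non-standard input. The filler bytes and the range of lengths are arbitrary choices for illustration.

```python
import base64

padding_counts = set()
for n in range(1, 13):
    # Encode n arbitrary bytes and count the trailing '=' characters.
    encoded = base64.b64encode(b"x" * n).decode("ascii")
    padding_counts.add(len(encoded) - len(encoded.rstrip("=")))

print(sorted(padding_counts))  # [0, 1, 2]
```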