Python 中的霍夫曼树编码

问题描述 投票:0回答:1

我一直在尝试自学霍夫曼树,并且认为我应该做一些东西来测试我的理解。一切都很顺利,直到我的上一个版本出现问题,我不知道为什么

我一直来来回回,不断引入新的错误,或者以新的方式犯同样的错误

这是我正在运行的Python脚本:

import heapq
from collections import Counter

def read_file(filename):
    with open(filename, 'r') as file:
        text = file.read()
    return text

def write_file(filename, encoded_text, codes):
    with open(filename, 'w') as file:
        file.write(''.join(encoded_text))
        for char, code in codes.items():
            file.write(f'{char}: {code}\n') 

def build_huffman_tree(freq_dict):
    heap = [[wt, [sym, '']] for sym, wt in freq_dict.items()]
    heapq.heapify(heap)
    
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]  
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    
    return sorted(heapq.heappop(heap)[1:], key=lambda p: (len(p[-1]), p))

def huffman_encode(input_filename, output_filename):
    text = read_file(input_filename)
    if not text:
        print("Input file is empty.")
        return
    freq_dict = Counter(text)
    if '"' not in freq_dict:
        freq_dict['"'] = 0 
    codes = {sym: code for code, sym in build_huffman_tree(freq_dict)}

    print("Characters in input text:", set(text))
    print("Characters in Huffman codes dictionary:", set(codes.keys()))

    encoded_text = [codes[char] for char in text]
    write_file(output_filename, encoded_text, codes)

input_filename = '../lorem.txt'
output_filename = 'encoded.txt'
huffman_encode(input_filename, output_filename)

运行上述命令时的终端输出是:

输入文本中的字符:{'a', 'v', 'e', 'g', 'i', 'm', ' ', 'b', '?', 'h', 'l', 's'、'Q'、'x'、'n'、't'、'd'、'、'、'S'、'p'、'r'、'c'、' '、'N'、'u'、'o'、'q'、'f'、'U'、'.'、'"'} Huffman代码中的字符字典中的字符:{'0011000','00111','011011100','1110','1110','10000','011010111','000','0110111101','01101101','01101101' , '11011', '101', '1111', '011011100', '0110100', '10001', '0011001', '1100', '001101', '0100', '0101', '01101010', ' 011011101'、'0111'、'1001'、'011011111'、'011010110'、'110101'、'01101100'、'01100'} 回溯(最近一次调用最后一次): 文件“/Users/username/Documents/fizzbuzzers-and-more/huffman_tree/python/python_huffman.py”,第 51 行,位于 huffman_encode(输入文件名,输出文件名) 文件“/Users/username/Documents/fizzbuzzers-and-more/huffman_tree/python/python_huffman.py”,第 46 行,huffman_encode encoded_text = [文本中的字符的代码[字符]] ~~~~~~^^^^^^ 关键错误:'"'

我认为我犯了一个愚蠢的错误,但到目前为止我还没有找到它。我已经尝试了相同想法的一些变体,但我得到了相同的错误或更糟的错误。我还尝试记录字符、字典等。有一次我尝试记录“代码”中找不到的字符,并将所有字符按顺序排列,这让我认为我没有创建字典。当我打印日志“代码”时,我得到:

{'101': ' ', '000': 'e', '1110': 'a', '1111': 'i', '0100': 'm', '0111': 'o', ' 0010':'r','0101':'s','1001':'t','1100':'u','00111':'d','10001':'l','11011' :'n','01100':'p','10000':'q','001101':',','110101':'c','110100':'v','0110100':' ', '0011001': 'b', '0011000': 'g', '01101010': '.', '01101100': 'h', '01101101': 'x', '011010111': '"', '011011100':'?','011011101':'N','011010110':'U','011011111':'f','0110111100':'Q','0110111101':'S'}

我的原始代码,在我开始尝试修复问题之前:

import heapq
from collections import Counter

def read_file(filename):
    with open(filename, 'r') as file:
        text = file.read()
    return text

def write_file(filename, encoded_text, codes):
    with open(filename, 'w') as file:
        file.write(''.join(encoded_text))
        file.write('\n')
        for char, code in codes.items():
            file.write(f'{char}: {code}\n')

def build_huffman_tree(freq_dict):
    heap = [[wt, [sym, '']] for sym, wt in freq_dict.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return sorted(heapq.heappop(heap)[1:], key=lambda p: (len(p[-1]), p))

def huffman_encode(input_filename, output_filename):
    text = read_file(input_filename)
    freq_dict = Counter(text)
    codes = {sym: code for code, sym in build_huffman_tree(freq_dict)}
    encoded_text = [codes[char] for char in text]
    write_file(output_filename, encoded_text, codes)

input_filename = '../lorem.txt'
output_filename = 'encoded.txt'
huffman_encode(input_filename, output_filename)

该代码给了我很多相同的错误:

回溯(最近一次调用最后一次): 文件“/Users/username/Documents/fizzbuzzers-and-more/huffman_tree/python/python_huffman.py”,第 38 行,位于 huffman_encode(输入文件名,输出文件名) 文件“/Users/username/Documents/fizzbuzzers-and-more/huffman_tree/python/python_huffman.py”,第 33 行,huffman_encode encoded_text = [文本中的字符的代码[字符]] ~~~~~~^^^^^^ 关键错误:'"'

正在转换的文本(lorem.txt)包含:

“Sed ut perspiciatis undeomnis iste natus error sat voluptatem Accusantium doloremque laudantium、totam rem aperiam、eaque ipsa quae 从发明的真实性和准建筑师的人生观点出发 必须解释。 Nemo enim ipsam voluptatem quia voluptas 坐 aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui Ratione voluptatem sequi nesciunt。 Neque porro quisquam est, qui dolorem ipsum quia dolor sat amet、consectetur、adipisci velit、sed quia non numquam eius modi tempora incidunt ut laboure et dolore magnam aliquam quaerat voluptatem。 Ut enim ad minima veniam, quis 药方锻炼 ullam corporis suscipit labiosam, nisi ut aliquid ex ea commodi consequatur?是的 Reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla 帕里图尔?”

python tree binary huffman-code
1个回答
0
投票

这段代码似乎有效:

import heapq
from collections import Counter

def read_file(filename):
    with open(filename, 'r') as file:
        text = file.read()
    return text

def write_file(filename, encoded_text, codes):
    with open(filename, 'w') as file:
        file.write(''.join(encoded_text))
        file.write('\n')
        for char, code in codes.items():
            file.write(f'{char}: {code}\n')

def build_huffman_tree(freq_dict):
    heap = [[wt, [sym, '']] for sym, wt in freq_dict.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], *lo[1:], *hi[1:]])
    return sorted(heapq.heappop(heap)[1:], key=lambda p: (len(p[-1]), p))

def huffman_encode(input_filename, output_filename):
    text = read_file(input_filename)
    freq_dict = Counter(text)
    codes = {sym: code for sym, code in build_huffman_tree(freq_dict)}
    # print(codes)
    encoded_text = [codes[char] for char in text]
    write_file(output_filename, encoded_text, codes)

input_filename = '../lorem.txt'
output_filename = 'python_huffman_encoded.txt'
huffman_encode(input_filename, output_filename)

注意事项是:

heapq.heappush(heap, [lo[0] + hi[0], *lo[1:], *hi[1:]])

以及前面提到的:

codes = {sym: code for sym, code in build_huffman_tree(freq_dict)}

感谢任何看过的人,我想这个问题现在可能已经解决了

如果有人发现任何其他错误,无论是愚蠢的还是其他错误,我都会将其保持打开状态

© www.soinside.com 2019 - 2024. All rights reserved.