如何将哈夫曼树生成的字符串存储到文本文件中

Question

我正在尝试在 C++ 中实现 Huffman 编码 以实现 文本文件压缩。我能够根据文件中每个字符的频率构建霍夫曼树。当我尝试遍历树并获取不同字符的霍夫曼代码时，我将霍夫曼代码存储为字符串，因此输出字符串变得比输入字符串大。

unordered_map<char, string> encoding;

void store_huffman_codes(Node* root, string s){
    if(root == NULL) return;
    if(root->val != '$') encoding[root->val] = s;
    store_huffman_codes(root->left, s + '0');
    store_huffman_codes(root->right, s + '1');
}

unordered_map<char, int> m;
for(char c : test) m[c]++;
priority_queue<pair<int, Node*>, vector<pair<int, Node*>>, greater<pair<int, Node*>>> pq;
for(auto x : m){
    Node* temp = new Node(x.first);
    pq.push({x.second, temp});
}
while(pq.size() > 1){
    pair<int, Node*> a = pq.top(); pq.pop();
    pair<int, Node*> b = pq.top(); pq.pop();
    Node* temp = new Node('$');
    int val = a.first + b.first;
    temp->left = a.second; temp->right = b.second;
    pq.push({val, temp});
}
Node* root = pq.top().second;
store_huffman_codes(root, "");

string output = "";
for(char c : test){
    output += encoding[c];
}

如何将代码存储为二进制而不是字符串？

Answer 1

这里的重点是你使用整个字节来存储一个位。

相反，您应该将多个位压缩为一个字节；但是，出现了一个问题：如何处理因没有足够的数据而无法填充的未使用位（即数据长度不是位字节大小的倍数）？

你可以做一些类似于 utf-8 的事情来编码多字节序列：一个字节中前导一位的数量表示未使用的位数。优点：编码所需的所有信息都存储在一个字节中。缺点：您只能使用 7 位来编码最后一个字节之前的所有字节——这可能会高估优势。

或者，您将已使用或未使用的位数存储在单独的字节中；我的建议：第一个数据字节中未使用的位数，并在开头跳过未使用的字节（即输出第二个字节中的最低有效位），然后可能如下所示：

uint8_t byte = (8 - numberOfNodes % 8) % 8;
// assuming you tracked...
// second modulo maps 8 to 0, if that occurs

// output byte to file (number of unused bits)!

unsigned index = byte;
byte = 0;

auto append = [&encoded, &byte, &index](uint8_t bit)
{
    byte |= bit << index;
    if(++index == 8)
    {
        encoded.push_back(byte);
        index = 0;
        byte = 0;
    }
}

// replace s + 'X' by append(X)

此时您会注意到，除了已经编码的数据之外，您还需要将

byte

和

index

从一个递归调用转发到下一个递归调用；通过参数这样做对我来说似乎很不方便，但我建议为整个过程编写一个专用的class：

class Encoder
{
public:
    // suitable interface allowing to add new bytes

    // part of the public interface is a member function to trigger encoding,
    // e.g. one of:
    std::vector<uint8_t> const& encode();
    bool /* or some kind of error code */
    writeToFile(std::string path);

private:
    Node* root; // implementation detail, should not be accessible from outside!

    std::vector<uint8_t> encoded;
    // preferably instead of std::string – you're not storing text, but binary data!

    uint8_t byte;
    unsigned int index;

    // now instead of the lambda above I'd actually prefer a member function:
    void append(uint8_t bit)
    {
        // ...
    }
};

encode

现在将计算并附加指示未使用位数的第一个字节，并如前所示适当地初始化

byte

和

index

，然后开始在节点上递归迭代，从

root

开始，就像您所做的一样你自己也是——按照上面的说明应用最小的变化。

有了这个，解码变得非常简单：读取这个初始字节，将一些

index

初始化为这个数字并开始迭代其他字节，对于每个字节通过

(byte & 1u << index++) != 0

或

uint8_t bit = byte & 1u; ++index; byte >>= 1;

获得位（虽然构建tree top down 可能不是最有效的变体，但至少它很容易实现）。

如何将哈夫曼树生成的字符串存储到文本文件中

问题描述投票：0回答：1

1个回答

最新问题

如何将哈夫曼树生成的字符串存储到文本文件中

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1