How can I check the validity of a UTF8 string without overrunning the buffer?

Question

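While looking for a UTF8 handling library for C, I found this, which appears to be the most popular one on GitHub. It overruns the buffer you pass to its UTF8 validation function: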

#include "utf8.h"

int main(void)
{
    /* lead byte of a 4-byte sequence, one continuation byte, then the terminator */
    unsigned char c[] = { 0b11110000, 0b10000000, 0 };

    utf8_int8_t* is_valid = utf8valid((char*)c);

    return 0;
}

The part of the code that reads past the buffer is:

if (0xf0 == (0xf8 & *str)) {
    /* ensure that there's 4 bytes or more remaining */
    if (remaining < 4) {
        return (utf8_int8_t*)str;
    }
    
    /* ensure each of the 3 following bytes in this 4-byte
     * utf8 codepoint began with 0b10xxxxxx */
    if ((0x80 != (0xc0 & str[1])) || (0x80 != (0xc0 & str[2])) ||
        (0x80 != (0xc0 & str[3]))) {
        return (utf8_int8_t*)str;
    }

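When it reads str[3], it reads past the buffer I allocated, even though my string buffer is null-terminated. Is this behavior normal or expected for a UTF8 validation function? There is a utf8nvalid() function that lets you pass a maximum buffer size, but that function explicitly checks for the null terminator, so it seems to assume that this protects against buffer overruns. I think it is broken. The code is a single header, located here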

Answer 1

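With the input string from the question, this code never reads str[3], because the || chain short-circuits at (0x80 != (0xc0 & str[2])), which evaluates to true: str[2] is the null terminator, and a null byte is not a continuation byte.

Neither valgrind nor libasan finds anything wrong with this code.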
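The short-circuit argument can be checked without a sanitizer. The following is a minimal sketch, not code from utf8.h: `struct probe`, `probe_read` and `four_byte_tail_is_invalid` are hypothetical names introduced here. It evaluates the same three-part `||` condition as the quoted snippet, but routes every access through a helper that records the highest index requested.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instrumentation, not part of utf8.h: records which
 * indices of a buffer are actually requested. */
struct probe {
    const unsigned char *buf;
    size_t len;         /* bytes in buf, including the terminator */
    size_t max_index;   /* highest index requested so far */
    int out_of_bounds;  /* set to 1 if an index >= len was requested */
};

/* Bounds-checked read that updates the bookkeeping above. */
static unsigned char probe_read(struct probe *p, size_t i)
{
    assert(p != NULL && p->buf != NULL);
    if (i > p->max_index) {
        p->max_index = i;
    }
    if (i >= p->len) {
        p->out_of_bounds = 1;
        return 0; /* never touch memory outside the buffer */
    }
    return p->buf[i];
}

/* Same condition as the quoted snippet, with str[i] replaced by
 * probe_read(p, i). Returns 1 when one of the three bytes after the
 * lead byte is not of the form 0b10xxxxxx. The || operator evaluates
 * left to right and stops at the first operand that is true. */
static int four_byte_tail_is_invalid(struct probe *p)
{
    return (0x80 != (0xc0 & probe_read(p, 1))) ||
           (0x80 != (0xc0 & probe_read(p, 2))) ||
           (0x80 != (0xc0 & probe_read(p, 3)));
}
```

Running the check on the three bytes from the question (0xF0, 0x80, 0) with `len` set to 3 reports the sequence as invalid, leaves `max_index` at 2 and leaves `out_of_bounds` at 0. The general reason is that a null byte never matches 0b10xxxxxx, so on a null-terminated string the chain stops at the terminator at the latest. That guarantee disappears for buffers that are not null-terminated, which is where an explicit length check matters.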
