Reading tag content that contains character entity references in C++

Question

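P.S. Suppose the file contains only `&#110` (nothing before or after it; let's keep this as simple as possible). `readCharacter()` will return the correctly decoded `'n'` character, but it will also reach the end of the file. As a result, `getTagContent()` will return an empty string, which is wrong.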

P.S. 2: I found a solution, but it does not look good to me. The `if` inside the `while` loop of the `getTagContentLength()` method could look like this:

if (ch == '<' || is.eof())
{
    if (ch != EOF && ch != '<')
    {
        tagContent[i++] = ch;
    }

    break;
}

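I am trying to achieve the following:

We have the content of an HTML tag; for example, let the tag be `<th>some value</th>`.

When I call `getTagContent()`, `is.get()` will return the `'s'` symbol, that is, the first character of the content (I have already handled getting to that point).

I also want to be able to handle character entity references, so `some value` could be written as `&#115ome value` or `&#115&#111&#109&#101&#32value`. That is what the `readCharacter()` method is for.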

char* getTagContent(std::istream& is, int maxTagContentLength)
{
    char* tagContent = new char[maxTagContentLength + 1];
    int i = 0;

    char ch;

    while (true)
    {
        ch = readCharacter(is);

        if (ch == '<' || is.eof())
        {
            break;
        }

        tagContent[i++] = ch;
    }

    tagContent[i] = '\0';

    return tagContent;
}

char readCharacter(std::istream& is)
{
    char ch = is.get();

    if (ch == '&' && is.peek() == '#')
    {
        is.get();

        char charEntityRef;
        int number = 0;

        while (true)
        {
            charEntityRef= is.get();

            if (is.eof())
            {
                break;
            }

            if (!isDigit(charEntityRef))
            {
                is.unget();             
                break;
            }

            number = number * 10 + charEntityRef- '0';
        }

        ch = (char)(number);
    }

    return ch;
}

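I have run into a problem, though. Suppose we have the content `&#110&#105&#110&#101&#116&#101&#101&#110`, which is the string `nineteen`. My code returns `ninetee`, without the last `n`. In the last iteration of the `while` loop in `getTagContent()`, the character returned really is the final `'n'` that is missing from the result, but the eof bit was raised inside `readCharacter()`, so the character is never written to the result (the `break` statement exits the loop first).

I don't know how to fix this without messing up the logic (for example, we need to stop exactly when we encounter the start of a tag, because that is where the tag content ends, and it may be the closing tag).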

Answer 1

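Your code has a number of problems:

- You are not handling EOF correctly.
- You are not handling the terminating `;` at the end of an entity. It is part of the entity and should not be put back into the input stream.
- You only handle entities that use decimal codes, not entities that use hexadecimal codes or names.
- There is a buffer overflow in `getTagContent()` if the content is longer than `maxTagContentLength` characters.
- An entity for `'<'` (such as `&lt;`) will make `getTagContent()` stop prematurely. You need to check whether the character you read is the `'<'` that terminates the content before you decode any entities in the content.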

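#include <istream>
#include <string>

std::string readCharacterOrEntity(std::istream& is);
std::string decodeCharacterEntity(const std::string& entity);

std::string getTagContent(std::istream& is)
{
    std::string tagContent;
    std::string value;

    // Stop at end of input (empty string) or at the raw '<' that ends the content.
    while (((value = readCharacterOrEntity(is)) != "") && (value != "<"))
    {
        tagContent += decodeCharacterEntity(value);
    }

    return tagContent;
}

// Returns one raw character, or a whole numeric entity such as "&#110;".
// Returns an empty string when nothing could be read.
std::string readCharacterOrEntity(std::istream& is)
{
    std::string result;
    char ch;

    if (is.get(ch))
    {
        if (ch == '&' && is.peek() == '#')
        {
            is.get();
            std::string value;
            if (std::getline(is, value, ';'))   // consumes the terminating ';'
                result = "&#" + value + ';';
        }
        else
        {
            // TODO: handle named entity that begins with just '&' and not '&#'...
            result = ch;
        }
    }

    return result;
}

std::string decodeCharacterEntity(const std::string& entity)
{
    if (entity.compare(0, 2, "&#") == 0)
    {
        // Note: std::stoi throws if the text between "&#" and ';' is not a number.
        int i;
        if (entity[2] == 'x')
            i = std::stoi(entity.substr(3, entity.size() - 4), nullptr, 16);
        else
            i = std::stoi(entity.substr(2, entity.size() - 3), nullptr, 10);

        // TODO: handle non-ASCII characters
        if (i > 127)
            return "?";

        return std::string(1, static_cast<char>(i));
    }
    else if (entity[0] == '&')
    {
        std::string entity_name = entity.substr(1, entity.size() - 2);
        if (entity_name == "lt") return "<";
        if (entity_name == "gt") return ">";
        // TODO: look up other names as needed...
        return entity;   // unknown name: returned unchanged
    }
    else
        return entity;
}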

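That being said, this is not a good way to parse HTML. You really should use an actual HTML parser library. But if you can't or won't, then at least read the HTML into a larger in-memory buffer that you can tokenize more easily, rather than processing it one character at a time.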
