Unable to match an entire string with a PCRE regular expression in C


This regular expression

'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

matches Ġmeousrtr as expected, as can be seen at the shared link https://regex101.com/r/UR0P6T/1

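But when I try to use the PCRE library in C, I get 3 separate matches instead of 1. I understand that the Unicode character Ġ is 2 bytes wide and that the expression is matching those two bytes separately, but shouldn't this match the whole string, as in https://regex101.com/r/UR0P6T/1?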

# Output of regex expression
'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

# Matches
Match Succeeded at 0
�x
Match Succeeded at 1
�x
Match Succeeded at 2
meousrtrx

Here is the C code:

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <string.h>
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    PCRE2_SPTR expression = (PCRE2_SPTR) "'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";
    PCRE2_SPTR text = (PCRE2_SPTR) "Ġmeousrtr";
    PCRE2_SIZE eoffset;
    PCRE2_SIZE *ovector;
    pcre2_code *re;
    pcre2_match_data *match_data;
    char *c = (char *)expression;
    while (*c)
        printf("%c", (unsigned int)*c++);
    printf("\n");

    int error_number;
    int result;
    size_t start_offset = 0;
    size_t text_len;
    u_int32_t options = 0;

    text_len = strlen((char *)text);

    re = pcre2_compile(expression, PCRE2_ZERO_TERMINATED, 0, &error_number, &eoffset, NULL);
    if (re == NULL)
    {
        PCRE2_UCHAR buffer[256];
        pcre2_get_error_message(error_number, buffer, sizeof(buffer));
        cout << buffer;
        return 1;
    }
    match_data = pcre2_match_data_create_from_pattern(re, NULL);
    while (true)
    {
        result = pcre2_match(re, text, text_len, start_offset, options, match_data, NULL);
        if (result < 0)
        {
            switch (result)
            {
            case PCRE2_ERROR_NOMATCH:
                cout << "No matches found!";
                return 0;

            default:
                cout << "Matching Error" << result;
                return -1;
            }
            pcre2_match_data_free(match_data);
            pcre2_code_free(re);
        }
        ovector = pcre2_get_ovector_pointer(match_data);
        printf("Match Succeeded at %d\n", ovector[0]);
        int i;
        for (i = 0; i < result; i++)
        {
            PCRE2_SPTR substring_start = text + ovector[2 * i];
            PCRE2_SIZE substring_length = ovector[2 * i + 1] - ovector[2 * i];
            printf("%.*s\n", (int)substring_length, (char *)substring_start);
        }
        start_offset = ovector[1];
    }
}
c++ regex pcre
1 Answer

You need to pass the PCRE2_UTF option to pcre2_compile() so that it recognizes UTF-8 encoded text (which is probably how your source file is encoded):

re = pcre2_compile(expression, PCRE2_ZERO_TERMINATED, PCRE2_UTF,
                   &error_number, &eoffset, nullptr);
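
As a rough illustration of why the flag matters: Ġ is U+0120, which UTF-8 encodes as the two bytes 0xC4 0xA0. Without PCRE2_UTF, the 8-bit library treats every byte as a character of its own, which would explain the two one-byte matches printed before meousrtr. The sketch below does not use PCRE2 at all. It relies on a hypothetical helper, decode_utf8 (a name made up for this example), which assumes well-formed input and does no validation, and it only shows the difference between counting bytes and counting code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper for illustration only (not part of PCRE2).
// Decodes a UTF-8 byte string into Unicode code points.
// Assumes the input is well-formed; no validation is performed.
std::vector<std::uint32_t> decode_utf8(const std::string &bytes)
{
    std::vector<std::uint32_t> code_points;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        code_points.push_back(cp);
    }
    return code_points;
}
```

Called on the subject string from the question (written as "\xC4\xA0" followed by "meousrtr"), this yields 9 code points from 10 bytes, the first being 0x120. With PCRE2_UTF set, the matcher works on those code points, so ` ?\p{L}+` can consume Ġ together with the letters after it.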

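Your code has other problems too. For example, the printf() format for size_t values is %zu, not %d (it is best to use C++-style iostream output consistently rather than mixing and matching iostreams and stdio). But telling PCRE2 that its input is Unicode is the most relevant fix. That change makes your program output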

'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
Match Succeeded at 0
Ġmeousrtr
No matches found!
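
On the printf() point mentioned earlier, here is a minimal sketch of the two consistent ways to print a match offset. PCRE2_SIZE is defined as size_t in pcre2.h, so the same rules apply to ovector[0]. The function names offset_line_stdio and offset_line_stream are made up for this example, and both return the line as a string so the result is easy to inspect.

```cpp
#include <cstddef>
#include <cstdio>
#include <sstream>
#include <string>

// stdio style: a size_t argument needs the %zu conversion, not %d.
std::string offset_line_stdio(std::size_t offset)
{
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "Match Succeeded at %zu", offset);
    return std::string(buffer);
}

// iostream style: operator<< selects the overload from the argument type,
// so there is no format specifier to get wrong.
std::string offset_line_stream(std::size_t offset)
{
    std::ostringstream out;
    out << "Match Succeeded at " << offset;
    return out.str();
}
```

Either form is fine as long as it is used consistently. The stream version is the one the answer recommends for C++ code.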