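This regular expression
'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
matches Ġmeousrtr as expected, as can be seen at the shared link https://regex101.com/r/UR0P6T/1
But when I try to use the PCRE2 library in C++, I get 3 separate matches instead of 1. I understand that the Unicode character Ġ is 2 bytes wide and that the expression is matching those two bytes separately, but shouldn't this match the whole string, as it does at https://regex101.com/r/UR0P6T/1?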
# Output of regex expression
'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
# Matches
Match Succeeded at 0
�x
Match Succeeded at 1
�x
Match Succeeded at 2
meousrtrx
Here is the C++ code:
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <string.h>
#include <iostream>
using namespace std;
int main(int argc, char **argv)
{
    PCRE2_SPTR expression = (PCRE2_SPTR) "'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";
    PCRE2_SPTR text = (PCRE2_SPTR) "Ġmeousrtr";
    PCRE2_SIZE eoffset;
    PCRE2_SIZE *ovector;
    pcre2_code *re;
    pcre2_match_data *match_data;
    char *c = (char *)expression;
    while (*c)
        printf("%c", (unsigned int)*c++);
    printf("\n");
    int error_number;
    int result;
    size_t start_offset = 0;
    size_t text_len;
    u_int32_t options = 0;
    text_len = strlen((char *)text);
    re = pcre2_compile(expression, PCRE2_ZERO_TERMINATED, 0, &error_number, &eoffset, NULL);
    if (re == NULL)
    {
        PCRE2_UCHAR buffer[256];
        pcre2_get_error_message(error_number, buffer, sizeof(buffer));
        cout << buffer;
        return 1;
    }
    match_data = pcre2_match_data_create_from_pattern(re, NULL);
    while (true)
    {
        result = pcre2_match(re, text, text_len, start_offset, options, match_data, NULL);
        if (result < 0)
        {
            switch (result)
            {
            case PCRE2_ERROR_NOMATCH:
                cout << "No matches found!";
                return 0;
            default:
                cout << "Matching Error" << result;
                return -1;
            }
            pcre2_match_data_free(match_data);
            pcre2_code_free(re);
        }
        ovector = pcre2_get_ovector_pointer(match_data);
        printf("Match Succeeded at %d\n", ovector[0]);
        int i;
        for (i = 0; i < result; i++)
        {
            PCRE2_SPTR substring_start = text + ovector[2 * i];
            PCRE2_SIZE substring_length = ovector[2 * i + 1] - ovector[2 * i];
            printf("%.*s\n", (int)substring_length, (char *)substring_start);
        }
        start_offset = ovector[1];
    }
}
You need to pass the PCRE2_UTF option to pcre2_compile() so that the subject is treated as UTF-8 encoded text (which is probably how your source file is encoded):
re = pcre2_compile(expression, PCRE2_ZERO_TERMINATED, PCRE2_UTF,
&error_number, &eoffset, nullptr);
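To see why the flag matters, it may help to look at the bytes involved. The sketch below uses only the standard library, not PCRE2; utf8_code_points and sample are names made up for this illustration. Ġ is U+0120, which UTF-8 encodes as the two bytes 0xC4 0xA0, so the subject is 10 bytes long but holds only 9 characters. As far as I understand it, without PCRE2_UTF the 8-bit library treats every byte as a character of its own, which would explain why the matches start at offsets 0, 1 and 2.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of PCRE2): counts UTF-8 code points by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx.
std::size_t utf8_code_points(const std::string &s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        if ((byte & 0xC0) != 0x80)
            ++count; // lead byte or plain ASCII byte starts a new code point
    }
    return count;
}

// The sample text with the bytes of U+0120 spelled out, so the result does
// not depend on how this source file happens to be encoded.
const std::string sample = "\xC4\xA0" "meousrtr";
```

Here sample.size() (what strlen() would report) is 10, while utf8_code_points(sample) is 9. PCRE2 reports offsets in code units, so they stay byte offsets even in UTF mode.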
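Your code has other issues. For example, the printf() format for size_t values is %zu, not %d (it is best to stick with C++-style iostream output throughout instead of mixing and matching iostreams and stdio), but telling PCRE2 that its input is Unicode is the most relevant fix. With that change, your program outputs: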
'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
Match Succeeded at 0
Ġmeousrtr
No matches found!
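On the side point about size_t formatting, here is a minimal sketch of the two options, assuming the offset is a size_t (PCRE2_SIZE is a typedef for size_t). The function names are invented for this example. The stdio version uses %zu, and the iostream version needs no format specifier at all because operator<< picks the right overload.

```cpp
#include <cstddef>
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper: formats a match offset the stdio way.
// %zu is the conversion for size_t; %d expects an int and is wrong here.
std::string format_offset_stdio(std::size_t offset)
{
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "Match Succeeded at %zu", offset);
    return buffer;
}

// Hypothetical helper: the same message built with iostreams, where the
// type of the argument selects the formatting automatically.
std::string format_offset_stream(std::size_t offset)
{
    std::ostringstream out;
    out << "Match Succeeded at " << offset;
    return out.str();
}
```

Both helpers produce the same text for any offset. In the program above, the equivalent change would be either switching the format string to %zu or writing the offset with cout.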