Why do LF and CRLF behave differently with the /^\s*$/gm regex?

I've been seeing this issue on Windows. When I try to clear any whitespace on each blank line on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces the output I expect:

===

HELLO

WOLRD

===

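That is, if there were spaces on the empty lines, they would be removed. On Windows, on the other hand, the regex removes the blank lines entirely. To illustrate: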

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

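(Template literals in JS always contain only \n, so I had to replace it with \r\n to emulate Windows. The ? after \r is just to be safe, for those who don't believe me.) The result: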

===
HELLO
WOLRD
===

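The whole line is gone! But my regex has ^ and $ and the m flag is set, so it is effectively /^-to-$/m. What, then, is the difference between \n and \r\n that makes it produce different results?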

When I add some logging:

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

With \r\n I see:

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and with \n only:

matched
matched
matched
===

HELLO

WOLRD

===
javascript regex newline carriage-return linefeed
1 Answer

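First, let's actually examine which characters are present, and which are not, after the replacement. Start with a string that uses only line feeds: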

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')

debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')

debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

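Each line ends with char code 10, the line feed (LF) character, written as \n in a string literal. The string is the same before and after the replacement: the two not only look the same, they actually compare equal, so the replacement did nothing.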

Now let's examine the other case:

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")
console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')

debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')

debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);


function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

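This time each line ends with char code 13, the carriage return (CR) character, written as \r in a string literal, and the LF follows it. After the replacement, instead of the sequence =\r\n\r\nH there is just =\r\nH. Let's see why.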
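Before looking at the cause, the same before/after comparison can be made at a glance. The following is only a sketch: the shortened sample strings and the variable names are made up for illustration, and it relies on JSON.stringify escaping CR and LF so that they print as \r and \n.

```javascript
// Sketch: JSON.stringify escapes CR and LF, so they become visible in the log.
const crlf = "===\r\n\r\nHELLO"; // one blank line, CRLF line endings
const lf = "===\n\nHELLO";       // the same text with LF line endings

const crlfAfter = crlf.replace(/^\s+$/gm, "");
const lfAfter = lf.replace(/^\s+$/gm, "");

console.log(JSON.stringify(crlf), "->", JSON.stringify(crlfAfter));
// "===\r\n\r\nHELLO" -> "===\r\nHELLO"
console.log(JSON.stringify(lf), "->", JSON.stringify(lfAfter));
// "===\n\nHELLO" -> "===\n\nHELLO"
```

The CRLF string loses two characters, while the LF string comes back unchanged.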

This is what MDN says about the metacharacter ^:

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

And this is what MDN says about the metacharacter $:

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

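So they match either after or before a line break character. Here that means either LF or CR: in JavaScript each of them counts as a line break character on its own (as do U+2028 and U+2029). We can see this by testing strings that contain different line breaks: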

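const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m; // whitespace right after a line start
const regexEnd = /\s$/m;   // whitespace right before a line end

console.log(regexStart.exec(stringLF));   // null
console.log(regexStart.exec(stringCRLF)); // matches "\n" at index 6

console.log(regexEnd.exec(stringLF));     // null
console.log(regexEnd.exec(stringCRLF));   // matches "\r" at index 5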

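If we try to match whitespace next to a line boundary, nothing matches when the line break is a lone LF, but with CRLF there is a match: ^ matches after the CR, and $ matches before the LF. This is what each pattern matches:

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches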
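To make those positions explicit, here is a small sketch that lists every index at which ^ and $ match in multiline mode. The helper name positions and the sample strings are hypothetical, chosen only for this illustration; it uses String.prototype.matchAll, so it assumes an ES2020-capable runtime.

```javascript
// Hypothetical helper: collect the index of every match of a global regex.
function positions(regex, str) {
  return Array.from(str.matchAll(regex), (m) => m.index);
}

const lfText = "a\nb";
const crlfText = "a\r\nb";

console.log(positions(/^/gm, lfText));   // [0, 2]
console.log(positions(/$/gm, lfText));   // [1, 3]
console.log(positions(/^/gm, crlfText)); // [0, 2, 3]
console.log(positions(/$/gm, crlfText)); // [1, 2, 4]
```

In the CRLF string, index 2 (between \r and \n) appears in both lists: it is a line start and a line end at the same time, as if there were an empty line between the CR and the LF.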

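So both ^ and $ treat either character of the CRLF sequence as a line break. This makes a difference when you do a search and replace. Since your regex is ^\s+$, it does match when a line consists of nothing but \r\n. But the reason is not obvious: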

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";

console.log(re.exec(stringLF));   // null
console.log(re.exec(stringCRLF)); // matches "\n\r" at index 6

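So the regex does not match a \r\n but rather the \n\r (two whitespace characters) that sits between the two other line break characters. That is because + is greedy and consumes as much of the character sequence as it can. Here is what the regex engine tries, somewhat simplified for brevity: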

input = "hello\r\n\r\nworld"
regex = /^\s+$/m

(brackets mark what `\s+` has consumed so far)

Step 1
"hello\r[]\n\r\nworld"
    the position right after the first `\r` satisfies `^` -> continue with next symbol in regex

Step 2
"hello\r[\n]\r\nworld"
    `\n` matches `\s` -> continue matching to satisfy `+` quantifier

Step 3
"hello\r[\n\r]\nworld"
    `\r` matches `\s` -> continue matching to satisfy `+` quantifier

Step 4
"hello\r[\n\r\n]world"
    `\n` matches `\s` -> continue matching to satisfy `+` quantifier

Step 5
"hello\r[\n\r\n]world"
    `w` does not match `\s` -> stop extending

Step 6
"hello\r[\n\r\n]world"
    `\s+` is satisfied with three characters -> continue to next symbol in regex

Step 7
"hello\r[\n\r\n]world"
    `$` does not match right before `w` -> backtrack, `\s+` gives back the last `\n`

Step 8
"hello\r[\n\r]\nworld"
    `$` matches right before the last `\n` -> finish, the match is `\n\r`
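Given that explanation, one way to avoid the problem is to keep the quantifier from consuming CR and LF at all. The following is only a sketch, assuming the goal is to strip spaces and tabs from otherwise blank lines while leaving the line breaks alone. The name blankLineSpaces and the sample strings are made up for illustration and are not part of the original question.

```javascript
// Sketch: strip whitespace from blank lines without touching CR or LF.
// [^\S\r\n] reads as "a whitespace character that is neither \r nor \n".
const blankLineSpaces = /^[^\S\r\n]+$/gm;

const withLF = "===\n  \nHELLO\n\t\nWORLD";
const withCRLF = withLF.replace(/\n/g, "\r\n"); // emulate Windows line endings

console.log(JSON.stringify(withLF.replace(blankLineSpaces, "")));
// "===\n\nHELLO\n\nWORLD"
console.log(JSON.stringify(withCRLF.replace(blankLineSpaces, "")));
// "===\r\n\r\nHELLO\r\n\r\nWORLD"
```

Because the character class excludes \r and \n, the match can no longer span the LF of one line break and the CR of the next, so both inputs keep their blank lines and their original line endings.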
