Why do LF and CRLF behave differently with the /^\s*$/gm regex?

Question

I keep seeing this issue on Windows. On Unix, when I try to clear whitespace from each blank line:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces the result I expect:

===

HELLO

WOLRD

===

That is, any spaces on otherwise blank lines get removed. On Windows, on the other hand, the regex removes the WHOLE line. To illustrate:

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

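(Template literals in JS always produce only \n, so I had to replace it with \r\n to emulate Windows. The ? after \r is only there to reassure anyone who does not believe that.) The result: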

===
HELLO
WOLRD
===

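The whole line is gone! But my regex has ^ and $ with the m flag set, so it is effectively /^-to-$/m. What is the difference between \n and \r\n that produces such different results?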

When I add some logging:

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

I see this with \r\n:

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and this with only \n:

matched
matched
matched
===

HELLO

WOLRD

===

Answer 1

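First, let's actually check which characters are present before and after the replacement. Starting with the string that uses only line feeds: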

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')

debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')

debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

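Each line ends with char code 10, the line feed (LF) character, written as \n in a string literal. The two strings are the same before and after the replacement: they do not just look the same, they are actually equal to each other, so the replacement did nothing.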

Now let's examine the other case:

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")
console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')

debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')

debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);


function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

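This time each line ends with char code 13, the carriage return (CR) character, written as \r in a string literal, which is then followed by an LF. After the replacement, instead of the sequence =\r\n\r\nH there is just =\r\nH. Let's see why.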
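Before looking at the cause, the same before/after comparison can be made visible in one step. This is a minimal sketch that relies on JSON.stringify escaping control characters; the names crlfInput and crlfResult are illustrative and do not appear in the snippets above.

```javascript
// Build the same CRLF input as above without depending on the file's own line endings.
const crlfInput = ["===", "", "HELLO", "", "WOLRD", "", "==="].join("\r\n");
const crlfResult = crlfInput.replace(/^\s+$/gm, "");

// JSON.stringify escapes control characters, so CR and LF become visible.
console.log(JSON.stringify(crlfInput));
// "===\r\n\r\nHELLO\r\n\r\nWOLRD\r\n\r\n==="
console.log(JSON.stringify(crlfResult));
// "===\r\nHELLO\r\nWOLRD\r\n==="
```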

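Here is what MDN says about the metacharacter ^:

Matches beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

And this is what MDN says about the metacharacter $:

Matches end of input. If the multiline flag is set to true, also matches immediately before a line break character.

So they match after and before a line break character. Here, that means either LF or CR. This can be seen if we test strings that contain different line breaks: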
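For reference, ECMAScript treats four characters as line terminators: LF (U+000A), CR (U+000D), LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029). The sketch below counts the positions where ^ matches in multiline mode; countLineStarts is a helper name made up for this example.

```javascript
// Count the positions where `^` matches in multiline mode.
// `^` always matches at index 0, so match() never returns null here.
function countLineStarts(str) {
  return str.match(/^/gm).length;
}

console.log(countLineStarts("a\nb"));            // 2: start of input, after LF
console.log(countLineStarts("a\rb"));            // 2: start of input, after CR
console.log(countLineStarts("a\r\nb"));          // 3: start of input, after CR, after LF
console.log(countLineStarts("a\u2028b\u2029c")); // 3: start of input, after LS, after PS
```

As far as ^ and $ are concerned, a CRLF pair therefore contains an extra empty "line" between the CR and the LF.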

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m;
const regexEnd = /\s$/m;

console.log(regexStart.exec(stringLF));
console.log(regexStart.exec(stringCRLF));

console.log(regexEnd.exec(stringLF));
console.log(regexEnd.exec(stringCRLF));

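If we try to match whitespace at a line boundary, nothing matches when there is only an LF, but with CRLF it does match: ^\s matches the LF and \s$ matches the CR. So in that case the matches are here: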

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches
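The positions marked above can be checked by reading the index and char code from the match objects. This is a small sketch; the variable names are illustrative.

```javascript
const crlfSample = "hello\r\nworld";

const startMatch = /^\s/m.exec(crlfSample);
const endMatch = /\s$/m.exec(crlfSample);

// `index` is where the match starts; charCodeAt identifies the matched character.
console.log(startMatch.index, startMatch[0].charCodeAt(0)); // 6 10 (the LF)
console.log(endMatch.index, endMatch[0].charCodeAt(0));     // 5 13 (the CR)

// With a lone LF neither pattern matches.
const lfStart = /^\s/m.exec("hello\nworld");
const lfEnd = /\s$/m.exec("hello\nworld");
console.log(lfStart, lfEnd); // null null
```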

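So both ^ and $ recognise either character of the CRLF sequence as a line end. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, it does match when a line consists entirely of \r\n, but for a non-obvious reason: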

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";

console.log(re.exec(stringLF));   // null
console.log(re.exec(stringCRLF)); // matches "\n\r" at index 6

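So the regex does not match an \r\n but rather the \n\r (two whitespace characters) that sits between two other line break characters. This is because + is greedy and consumes as many characters as it can. Here is what the regex engine tries, somewhat simplified for brevity: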

input = "hello\r\n\r\nworld"
regex = /^\s+$/m

[...] marks the characters matched so far, | marks a zero-width position

Step 1
hello\r|\n\r\nworld
    `^` matches at the position right after the first \r -> continue with next symbol in regex

Step 2
hello\r[\n]\r\nworld
    \n matches `\s` -> continue matching to satisfy `+` quantifier

Step 3
hello\r[\n\r]\nworld
    \r matches `\s` -> continue matching to satisfy `+` quantifier

Step 4
hello\r[\n\r\n]world
    \n matches `\s` -> continue matching to satisfy `+` quantifier

Step 5
hello\r[\n\r\n]world
    w does not match `\s` -> stop extending the match

Step 6
hello\r[\n\r\n]world
    `\s+` is satisfied -> continue to next symbol in regex

Step 7
hello\r[\n\r\n]world
    `$` does not match because the next character is w -> backtrack, giving up the last \n

Step 8
hello\r[\n\r]\nworld
    `$` matches right before the last \n -> finish, the match is \n\r
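One way to get the same result on both kinds of input, offered as a sketch and not as the only option, is to stop the whitespace class from consuming line terminators. The class [^\S\r\n\u2028\u2029] reads as "whitespace, except line terminators"; stripBlankLineWhitespace and the sample strings are names invented for this example.

```javascript
// Remove whitespace from lines that contain nothing else, leaving the line
// terminators themselves untouched. [^\S...] is a double negative:
// "not non-whitespace, and not one of the listed line terminators".
function stripBlankLineWhitespace(str) {
  return str.replace(/^[^\S\r\n\u2028\u2029]+$/gm, "");
}

const lfText = "===\n \t\nHELLO\n\n===";
const crlfText = lfText.replace(/\n/g, "\r\n");

console.log(JSON.stringify(stripBlankLineWhitespace(lfText)));
// "===\n\nHELLO\n\n==="
console.log(JSON.stringify(stripBlankLineWhitespace(crlfText)));
// "===\r\n\r\nHELLO\r\n\r\n==="
```

If the CRLF endings do not need to be preserved, normalising them first with str.replace(/\r\n/g, "\n") is another option.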
