我有一个查找和替换脚本,当单词没有任何特殊字符时,该脚本没有问题。但是,很多时候will是特殊字符,因为它正在查找名称。到目前为止,这正在破坏脚本。
脚本查找{<some-text>}
,并尝试替换内容(以及删除花括号)。
示例:
text.rtf
Here's a name with special char {Kotouč}
script.ts
import * as fs from "fs";
// Ingest the rtf file.
const content: string = fs.readFileSync("./text.rtf", "utf8");
console.log("content::\n", content);
// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";
// Look for all text that matches the patter `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
// It correctly identifies the targeted text.
const currMatch: string = matches[i];
const isRtfMetadata: boolean = currMatch.endsWith(";}");
if (isRtfMetadata) {
continue;
}
// Here I need a way to escape `plainText` string so that it matches the source.
console.log("currMatch::", currMatch);
console.log("currMatch === plainText::", currMatch === plainText);
if (currMatch === plainText) {
const newContent: string = content.replace(currMatch, "IT_WORKS!");
console.log("newContent:", newContent);
}
}
输出
content::
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here's a name with special char \{Kotou\uc0\u269 \}.}
currMatch:: {Kotou\uc0\u269 \}
currMatch === plainText:: false
[看起来像是ANSI的转义符,我尝试使用jsesc,但是会产生不同的字符串,{Kotou\u010D}
,而不是文档产生的字符串{Kotou\uc0\u269 \}
。
如何动态转义plainText
字符串变量,使其与文档中找到的变量匹配?
我需要加深我对rtf格式以及常规文本编码的了解。
从文件中读取的原始RTF文本为我们提供了一些提示:
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600...
rtf文件元数据的这一部分告诉我们一些事情。
它使用的是RTF文件格式版本1。编码为ANSI,特别是cpg1252
,也称为Windows-1252
或CP-1252
,即:
...拉丁字母的单字节字符编码
(source)
有价值的信息是,我们知道它使用的是拉丁字母,稍后将使用。
知道使用的特定RTF版本时,我偶然发现了RTF 1.5 Spec
在该规范中快速搜索了我正在寻找的一个转义序列,发现这是RTF特定的转义控制序列,即\uc0
。因此,知道我能够解析之后的实际情况,\u269
。现在我知道它是unicode,并且很直觉\u269
代表unicode character code 269
。所以我查了一下...
\u269
(字符代码269
)shows up on this page to confirm。现在,我知道了字符集以及获取等效的纯文本(未转义)需要做些什么,并且有一个基本的SO post I used here来启动该功能。
使用所有这些知识,我就可以从那里将其拼凑在一起。这是完整的更正脚本,它的输出是:
script.ts
import * as fs from "fs";
// Match RTF unicode control sequence: http://www.biblioscape.com/rtf15_spec.htm
const unicodeControlReg: RegExp = /\\uc0\\u/g;
// Extracts the unicode character from an escape sequence with handling for rtf.
const matchEscapedChars: RegExp = /\\uc0\\u(\d{2,6})|\\u(\d{2,6})/g;
/**
* Util function to strip junk characters from string for comparison.
* @param {string} str
* @returns {string}
*/
const cleanupRtfStr = (str: string): string => {
return str
.replace(/\s/g, "")
.replace(/\\/g, "");
};
/**
* Detects escaped unicode and looks up the character by that code.
* @param {string} str
* @returns {string}
*/
const unescapeString = (str: string): string => {
const unescaped = str.replace(matchEscapedChars, (cc: string) => {
const stripped: string = cc.replace(unicodeControlReg, "");
const charCode: number = Number(stripped);
// See unicode character codes here:
// https://unicodelookup.com/#latin/11
return String.fromCharCode(charCode);
});
// Remove all whitespace.
return unescaped;
};
// Ingest the rtf file.
const content: string = fs.readFileSync("./src/TEST.rtf", "binary");
console.log("content::\n", content);
// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";
// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
const currMatch: string = matches[i];
const isRtfMetadata: boolean = currMatch.endsWith(";}");
if (isRtfMetadata) {
continue;
}
if (currMatch === plainText) {
const newContent: string = content.replace(currMatch, "IT_WORKS!");
console.log("\n\nnewContent:", newContent);
break;
}
const unescapedMatch: string = unescapeString(currMatch);
const cleanedMatch: string = cleanupRtfStr(unescapedMatch);
if (cleanedMatch === plainText) {
const newContent: string = content.replace(currMatch, "IT_WORKS_UNESCAPED!");
console.log("\n\nnewContent:", newContent);
break;
}
}
输出
content::
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here\'92s a name with special char \{Kotou\uc0\u269 \}}
newContent: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here\'92s a name with special char \IT_WORKS_UNESCAPED!}
希望对其他不熟悉字符编码/转义的人有所帮助,并且可以在rtf格式的文档中使用它!