NodeJS RTF ANSI: find and replace words with special characters

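I have a find-and-replace script that works fine when the words contain no special characters. However, there will often be special characters, because the script is looking for names. So far, this is breaking the script.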

The script looks for {<some-text>} and tries to replace the contents (as well as removing the curly braces).

Example:

text.rtf

Here's a name with special char {Kotouč}

script.ts

import * as fs from "fs";

// Ingest the rtf file.
const content: string = fs.readFileSync("./text.rtf", "utf8");
console.log("content::\n", content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {

    // It correctly identifies the targeted text.
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    // Here I need a way to escape `plainText` string so that it matches the source.
    console.log("currMatch::", currMatch);
    console.log("currMatch === plainText::", currMatch === plainText);
    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS!");
        console.log("newContent:", newContent);
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here's a name with special char \{Kotou\uc0\u269 \}.}

currMatch:: {Kotou\uc0\u269 \}

currMatch === plainText:: false

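It looks like ANSI escaping. I tried using jsesc, but that produces a different string, {Kotou\u010D}, rather than the string the document produces, {Kotou\uc0\u269 \}

How can I dynamically escape the plainText string variable so that it matches what is found in the document?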

node.js escaping rtf ansi ansi-escape
1 Answer

I needed to deepen my understanding of the RTF format and of text encoding in general.

The raw RTF text read from the file gives us some hints:

{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600...

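This part of the RTF file's metadata tells us a few things.

It uses RTF file format version 1. The encoding is ANSI, specifically code page 1252 (\ansicpg1252), also known as Windows-1252 or CP-1252, which is:

...a single-byte character encoding of the Latin alphabet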

source

The valuable piece of information here is that the file uses the Latin alphabet, which becomes useful later.
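The code page also matters for the other kind of escape that shows up in this file: the `\'92` in `Here\'92s` is a hexadecimal byte in the declared code page. A minimal sketch of decoding such escapes follows, assuming `\ansicpg1252`. The helper name `decodeHexEscapes` and the lookup table are illustrative, and the table is deliberately partial (only a few common punctuation bytes).

```typescript
// Partial Windows-1252 table for bytes 0x80-0x9F, where it differs from Latin-1.
// Only a handful of common punctuation bytes are listed (illustrative, not complete).
const cp1252Extras: { [byte: number]: string } = {
    0x80: "\u20AC", // euro sign
    0x85: "\u2026", // horizontal ellipsis
    0x91: "\u2018", // left single quotation mark
    0x92: "\u2019", // right single quotation mark
    0x93: "\u201C", // left double quotation mark
    0x94: "\u201D", // right double quotation mark
    0x96: "\u2013", // en dash
    0x97: "\u2014", // em dash
};

// Hypothetical helper: decode RTF \'hh escapes, assuming \ansicpg1252.
const decodeHexEscapes = (str: string): string =>
    str.replace(/\\'([0-9a-fA-F]{2})/g, (whole: string, hex: string) => {
        const byte: number = parseInt(hex, 16);
        if (byte < 0x80 || byte >= 0xa0) {
            // ASCII and 0xA0-0xFF map to the same Unicode code points.
            return String.fromCharCode(byte);
        }
        // Bytes missing from the partial table are left untouched.
        const mapped: string | undefined = cp1252Extras[byte];
        return mapped !== undefined ? mapped : whole;
    });

console.log(decodeHexEscapes("Here\\'92s a name"));
```

A complete decoder would need the full 0x80-0x9F mapping, or a library that knows the code page; this only shows the idea.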

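Knowing the specific RTF version in use, I stumbled upon the RTF 1.5 Spec.

A quick search of that spec for one of the escape sequences I was looking at showed that \uc0 is an RTF-specific control sequence (per the spec, \ucN sets how many fallback bytes follow each \uN keyword; here that count is zero). Knowing that, I could parse what actually follows it: \u269. Now I knew it was Unicode, and I had a strong hunch that \u269 stands for Unicode character code 269. So I looked it up...

\u269 (character code 269) shows up on this page to confirm. Now I knew the character set and what needed to be done to get the equivalent plain (unescaped) text, and there was a basic SO post I used here to get the function started.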
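This is also why jsesc produced a different-looking string: JavaScript escapes are hexadecimal, while the RTF `\uN` keyword takes a decimal number. A small sketch to check that both notations name the same character (the variable names are only illustrative):

```typescript
// JavaScript-style escapes (what jsesc emits) are hexadecimal; RTF \uN is decimal.
const fromJsEscape: string = "\u010D";
const fromRtfCode: string = String.fromCharCode(269);

console.log(0x010d === 269);               // same code point, two notations
console.log(fromJsEscape === fromRtfCode); // both are "č"

// Pull the decimal code out of the sequence as it appears in the file.
const rtfSequence: string = "\\uc0\\u269 ";
const codeMatch: RegExpMatchArray | null = rtfSequence.match(/\\u(-?\d+)/);
const decoded: string = codeMatch ? String.fromCharCode(Number(codeMatch[1])) : "";
console.log(decoded);
```

The pattern skips `\uc0` on its own because `c` is not a digit, so the first match is `\u269`.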

Using all of this knowledge, I was able to piece it together from there. Here is the complete corrected script, followed by its output:

script.ts

import * as fs from "fs";


// Match RTF unicode control sequence: http://www.biblioscape.com/rtf15_spec.htm
const unicodeControlReg: RegExp = /\\uc0\\u/g;

// Extracts the unicode character from an escape sequence with handling for rtf.
const matchEscapedChars: RegExp = /\\uc0\\u(\d{2,6})|\\u(\d{2,6})/g;

/**
 * Util function to strip junk characters from string for comparison.
 * @param {string} str
 * @returns {string}
 */
const cleanupRtfStr = (str: string): string => {
    return str
        .replace(/\s/g, "")
        .replace(/\\/g, "");
};

/**
 * Detects escaped unicode and looks up the character by that code.
 * @param {string} str
 * @returns {string}
 */
const unescapeString = (str: string): string => {
    const unescaped = str.replace(matchEscapedChars, (cc: string) => {
        // Strip the leading `\uc0\u` (or a bare `\u`) so only the decimal code remains.
        const stripped: string = cc.replace(unicodeControlReg, "").replace(/^\\u/, "");
        const charCode: number = Number(stripped);

        // See unicode character codes here:
        //  https://unicodelookup.com/#latin/11
        return String.fromCharCode(charCode);
    });

    // Whitespace and backslashes are removed separately by cleanupRtfStr.
    return unescaped;
};

// Ingest the rtf file.
const content: string = fs.readFileSync("./src/TEST.rtf", "binary");
console.log("content::\n", content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS!");
        console.log("\n\nnewContent:", newContent);
        break;
    }

    const unescapedMatch: string = unescapeString(currMatch);
    const cleanedMatch: string = cleanupRtfStr(unescapedMatch);
    if (cleanedMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS_UNESCAPED!");
        console.log("\n\nnewContent:", newContent);
        break;
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \{Kotou\uc0\u269 \}}


newContent: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \IT_WORKS_UNESCAPED!}
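Note that in the output above the replacement text is still preceded by the backslash that escaped the opening brace, because the pattern /{(.*?)}/ starts matching at the brace itself. One possible way around that, sketched below, is to match the escaped form `\{...\}` as a whole and compare the decoded inner text. The names `escapedPlaceholder`, `decodeUnicode` and `replacePlaceholder` are hypothetical, and the sketch assumes placeholders always appear with escaped braces, as they do in this file.

```typescript
// Hypothetical pattern: a placeholder written as \{ ... \} in the RTF source.
const escapedPlaceholder: RegExp = /\\\{(.*?)\\\}/g;

// Decode \uc0\uN or bare \uN escapes, consuming the single delimiter space after them.
const decodeUnicode = (s: string): string =>
    s.replace(/(?:\\uc0)?\\u(-?\d+) ?/g, (_whole: string, code: string) =>
        String.fromCharCode(Number(code)));

// Replace the whole \{name\} placeholder (backslashes included) when the decoded name matches.
const replacePlaceholder = (rtf: string, name: string, value: string): string =>
    rtf.replace(escapedPlaceholder, (whole: string, inner: string) =>
        decodeUnicode(inner) === name ? value : whole);

const rtfLine: string = "special char \\{Kotou\\uc0\\u269 \\}}";
console.log(replacePlaceholder(rtfLine, "Kotouč", "IT_WORKS!"));
```

Here the name is passed without braces ("Kotouč"), and placeholders that do not match are left untouched. The replacement value is inserted as-is, so it would need its own RTF escaping if it can contain non-ASCII characters, braces, or backslashes.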

I hope this helps others who are unfamiliar with character encoding and escaping, and that they can use it with RTF documents!
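For completeness, the question literally asked for the opposite direction: escaping plainText so that it matches the file. A rough sketch of that, assuming the writer emits every non-ASCII character as `\uc0\uN ` exactly as in this sample, is below. `escapeForRtf` is a hypothetical helper. RTF writers are free to choose other forms (for example `\'hh` for characters that exist in the code page), so unescaping and comparing, as in the script above, is likely the more robust route.

```typescript
// Hypothetical helper: escape plain text the way the sample file does.
const escapeForRtf = (text: string): string => {
    let out: string = "";
    for (let i: number = 0; i < text.length; i++) {
        const ch: string = text.charAt(i);
        const code: number = text.charCodeAt(i);
        if (ch === "\\" || ch === "{" || ch === "}") {
            // RTF syntax characters are escaped with a backslash.
            out += "\\" + ch;
        } else if (code < 0x80) {
            out += ch;
        } else {
            // \uN takes a signed 16-bit decimal, so values above 32767 wrap negative.
            const signed: number = code > 0x7fff ? code - 0x10000 : code;
            out += "\\uc0\\u" + signed + " ";
        }
    }
    return out;
};

const needle: string = escapeForRtf("{Kotouč}");
const sample: string = "special char \\{Kotou\\uc0\\u269 \\}}";
console.log(needle);
console.log(sample.indexOf(needle) !== -1);
```

The loop walks UTF-16 code units, so a character outside the Basic Multilingual Plane comes out as two `\uN` escapes, one per surrogate.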
