我有一个 Windows-1255(希伯来语)编码的文件,我希望能够在 Node.js 中访问它。
我尝试使用
fs.readFile
打开文件,它给了我一个 Buffer
,我无法对其执行任何操作。我尝试将编码设置为Windows-1255
,但无法识别。
windows-1255
包,但我无法用它进行解码,因为 fs.readFile
要么给出 Buffer
要么给出 UTF8
字符串,并且该包需要 1255 编码的字符串。
如何在 Node.js 中读取
Windows-1255
编码的文件?
看来使用 node-iconv 包是最好的方法。不幸的是,更容易包含在代码中的iconv-lite似乎没有实现CP1255的转码。
此线程和答案显示了简单的示例并简洁地演示了如何使用这两个模块。
回到 iconv,我在使用 npm 前缀的 debian 上安装时遇到了一些问题,我向维护者提交了一个问题here。我设法解决了这个问题 sudo-ing 安装,并“sudo chown”-ing 返回给我已安装的模块。
我已经测试了可以访问(西欧+东欧样本)的各种 win-xxxx 编码和代码页。
但是我无法让它与 CP1255 一起工作,尽管它列在他们支持的编码中,因为我没有在本地安装特定的代码页,并且它全部被破坏了。我尝试从此页面窃取一些希伯来语脚本,但粘贴的版本总是损坏。我不敢在我的 Windows 机器上实际安装该语言,因为担心我不会把它变砖 - 抱歉。
// sample.js
var Iconv = require('iconv').Iconv;
var fs = require('fs');
function decode(content) {
var iconv = new Iconv('CP1255', 'UTF-8//TRANSLIT//IGNORE');
var buffer = iconv.convert(content);
return buffer.toString('utf8');
};
console.log(decode(fs.readFileSync('sample.txt')));
处理文件编码以及如何通过 Node.js 缓冲区读取文件的额外(离题)解释:
// force the data to be string with the second optional argument
fs.readFile(file, {encoding:'utf8'}, function(error, string) {
console.log('raw string:', string);// autoconvert to a native string
});
或
// use the raw return buffer and do bitwise processing on the encoded bytestream
fs.readFile(file, function(error, buffer) {
console.log(buffer.toString('utf8'));// process the binary buffer
});
首先:Windows 代码页是 1255,而不是 1225。
将 CP1255 转换为 Unicode 非常简单;毕竟,所有数据都可以在线找到(我使用了 http://www.unicode.org 的表格),而且我们讨论的是最多 256 个字符的列表。要将包含“原始”字节的字符串转换为正确的 Unicode,只需为解释为 CP1255 字符的每个字节查找正确的 Unicode 代码点即可。
下面的快速而简单的 Javascript 可以做到这一点;作为额外的好处,您可以提供可选的 function 来处理“未定义”的字符代码。默认情况下,这些将被转换为
U+FFFD
(Unicode 的本机占位符代码点),但您可以使用回调将它们转换为其他任何内容 - 我的示例以十六进制插入原始值。
该函数的输出是一个正确的 Unicode 编码字符串,可以根据您的需要进一步处理为 UTF8 或按原样使用。例如,这是一个用 Win1255 编码、转换为 Unicode 并在 InDesign 中排版的网站的简短片段:
(请注意,我强制它读取 RTL,以便包含 HTML 标签内的文本。)
下面代码片段中的示例是来自 http://www.shmuelfomberg.com/perlhebtut/chap9.html 的一些适当文本;我在开头附近插入了一个
\xFF
来演示回调函数。其余的,几乎一直到底部,只是一个大的查找表和悬在最底部的 19 行短代码。
var in_text = 'çñø \xff ìðå îùäå áëì äñéôåø. äîî... òáøéú! ' +
'ìòáåã áàðâìéú æä ðçîã, àê àðçðå áéùøàì. àðçðå øåöéí ì÷øåà, ' +
'ìëúåá åìòáã òáøéú! ìøåò îæìðå, äîçùá áã"ë òåáã îùîàì ìéîéï. ' +
'ëãé ìùëðò àåúå ìòáåã îéîéï ìùîàì, ãåøù òáåãä. ëãé ìùëðò àåúå ìòáåã ' +
'ãå-ëéååðé, æä áëìì îñåëï. àáì ÷åãí ëì àúä öøéê ëîä îåùâé éñåã.';
// Without fallback routine:
out_text = win1255ToUnicode (in_text);
alert (out_text); // just to show the string here
// With fallback routine:
out_text = win1255ToUnicode (in_text, insertHexEscape);
alert (out_text); // just to show the string here
function insertHexEscape (otherCode)
{
return '\\x'+otherCode.charCodeAt(0).toString(16);
}
function win1255ToUnicode (source, undefHandler /* function! */)
{
/* From http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1255.TXT
#
# Name: cp1255 to Unicode table
# Unicode version: 2.0
# Table version: 2.01
# Table format: Format A
# Date: 1/7/2000
#
# Contact: [email protected]
#
# General notes: none
#
# Format: Three tab-separated columns
# Column #1 is the cp1255 code (in hex)
# Column #2 is the Unicode (in hex as 0xXXXX)
# Column #3 is the Unicode name (follows a comment sign, '#')
#
# The entries are in cp1255 order
*/
var win1255Encoding = {
"\x00":"\u0000", // #NULL
"\x01":"\u0001", // #START OF HEADING
"\x02":"\u0002", // #START OF TEXT
"\x03":"\u0003", // #END OF TEXT
"\x04":"\u0004", // #END OF TRANSMISSION
"\x05":"\u0005", // #ENQUIRY
"\x06":"\u0006", // #ACKNOWLEDGE
"\x07":"\u0007", // #BELL
"\x08":"\u0008", // #BACKSPACE
"\x09":"\u0009", // #HORIZONTAL TABULATION
"\x0A":"\u000A", // #LINE FEED
"\x0B":"\u000B", // #VERTICAL TABULATION
"\x0C":"\u000C", // #FORM FEED
"\x0D":"\u000D", // #CARRIAGE RETURN
"\x0E":"\u000E", // #SHIFT OUT
"\x0F":"\u000F", // #SHIFT IN
"\x10":"\u0010", // #DATA LINK ESCAPE
"\x11":"\u0011", // #DEVICE CONTROL ONE
"\x12":"\u0012", // #DEVICE CONTROL TWO
"\x13":"\u0013", // #DEVICE CONTROL THREE
"\x14":"\u0014", // #DEVICE CONTROL FOUR
"\x15":"\u0015", // #NEGATIVE ACKNOWLEDGE
"\x16":"\u0016", // #SYNCHRONOUS IDLE
"\x17":"\u0017", // #END OF TRANSMISSION BLOCK
"\x18":"\u0018", // #CANCEL
"\x19":"\u0019", // #END OF MEDIUM
"\x1A":"\u001A", // #SUBSTITUTE
"\x1B":"\u001B", // #ESCAPE
"\x1C":"\u001C", // #FILE SEPARATOR
"\x1D":"\u001D", // #GROUP SEPARATOR
"\x1E":"\u001E", // #RECORD SEPARATOR
"\x1F":"\u001F", // #UNIT SEPARATOR
"\x20":"\u0020", // #SPACE
"\x21":"\u0021", // #EXCLAMATION MARK
"\x22":"\u0022", // #QUOTATION MARK
"\x23":"\u0023", // #NUMBER SIGN
"\x24":"\u0024", // #DOLLAR SIGN
"\x25":"\u0025", // #PERCENT SIGN
"\x26":"\u0026", // #AMPERSAND
"\x27":"\u0027", // #APOSTROPHE
"\x28":"\u0028", // #LEFT PARENTHESIS
"\x29":"\u0029", // #RIGHT PARENTHESIS
"\x2A":"\u002A", // #ASTERISK
"\x2B":"\u002B", // #PLUS SIGN
"\x2C":"\u002C", // #COMMA
"\x2D":"\u002D", // #HYPHEN-MINUS
"\x2E":"\u002E", // #FULL STOP
"\x2F":"\u002F", // #SOLIDUS
"\x30":"\u0030", // #DIGIT ZERO
"\x31":"\u0031", // #DIGIT ONE
"\x32":"\u0032", // #DIGIT TWO
"\x33":"\u0033", // #DIGIT THREE
"\x34":"\u0034", // #DIGIT FOUR
"\x35":"\u0035", // #DIGIT FIVE
"\x36":"\u0036", // #DIGIT SIX
"\x37":"\u0037", // #DIGIT SEVEN
"\x38":"\u0038", // #DIGIT EIGHT
"\x39":"\u0039", // #DIGIT NINE
"\x3A":"\u003A", // #COLON
"\x3B":"\u003B", // #SEMICOLON
"\x3C":"\u003C", // #LESS-THAN SIGN
"\x3D":"\u003D", // #EQUALS SIGN
"\x3E":"\u003E", // #GREATER-THAN SIGN
"\x3F":"\u003F", // #QUESTION MARK
"\x40":"\u0040", // #COMMERCIAL AT
"\x41":"\u0041", // #LATIN CAPITAL LETTER A
"\x42":"\u0042", // #LATIN CAPITAL LETTER B
"\x43":"\u0043", // #LATIN CAPITAL LETTER C
"\x44":"\u0044", // #LATIN CAPITAL LETTER D
"\x45":"\u0045", // #LATIN CAPITAL LETTER E
"\x46":"\u0046", // #LATIN CAPITAL LETTER F
"\x47":"\u0047", // #LATIN CAPITAL LETTER G
"\x48":"\u0048", // #LATIN CAPITAL LETTER H
"\x49":"\u0049", // #LATIN CAPITAL LETTER I
"\x4A":"\u004A", // #LATIN CAPITAL LETTER J
"\x4B":"\u004B", // #LATIN CAPITAL LETTER K
"\x4C":"\u004C", // #LATIN CAPITAL LETTER L
"\x4D":"\u004D", // #LATIN CAPITAL LETTER M
"\x4E":"\u004E", // #LATIN CAPITAL LETTER N
"\x4F":"\u004F", // #LATIN CAPITAL LETTER O
"\x50":"\u0050", // #LATIN CAPITAL LETTER P
"\x51":"\u0051", // #LATIN CAPITAL LETTER Q
"\x52":"\u0052", // #LATIN CAPITAL LETTER R
"\x53":"\u0053", // #LATIN CAPITAL LETTER S
"\x54":"\u0054", // #LATIN CAPITAL LETTER T
"\x55":"\u0055", // #LATIN CAPITAL LETTER U
"\x56":"\u0056", // #LATIN CAPITAL LETTER V
"\x57":"\u0057", // #LATIN CAPITAL LETTER W
"\x58":"\u0058", // #LATIN CAPITAL LETTER X
"\x59":"\u0059", // #LATIN CAPITAL LETTER Y
"\x5A":"\u005A", // #LATIN CAPITAL LETTER Z
"\x5B":"\u005B", // #LEFT SQUARE BRACKET
"\x5C":"\u005C", // #REVERSE SOLIDUS
"\x5D":"\u005D", // #RIGHT SQUARE BRACKET
"\x5E":"\u005E", // #CIRCUMFLEX ACCENT
"\x5F":"\u005F", // #LOW LINE
"\x60":"\u0060", // #GRAVE ACCENT
"\x61":"\u0061", // #LATIN SMALL LETTER A
"\x62":"\u0062", // #LATIN SMALL LETTER B
"\x63":"\u0063", // #LATIN SMALL LETTER C
"\x64":"\u0064", // #LATIN SMALL LETTER D
"\x65":"\u0065", // #LATIN SMALL LETTER E
"\x66":"\u0066", // #LATIN SMALL LETTER F
"\x67":"\u0067", // #LATIN SMALL LETTER G
"\x68":"\u0068", // #LATIN SMALL LETTER H
"\x69":"\u0069", // #LATIN SMALL LETTER I
"\x6A":"\u006A", // #LATIN SMALL LETTER J
"\x6B":"\u006B", // #LATIN SMALL LETTER K
"\x6C":"\u006C", // #LATIN SMALL LETTER L
"\x6D":"\u006D", // #LATIN SMALL LETTER M
"\x6E":"\u006E", // #LATIN SMALL LETTER N
"\x6F":"\u006F", // #LATIN SMALL LETTER O
"\x70":"\u0070", // #LATIN SMALL LETTER P
"\x71":"\u0071", // #LATIN SMALL LETTER Q
"\x72":"\u0072", // #LATIN SMALL LETTER R
"\x73":"\u0073", // #LATIN SMALL LETTER S
"\x74":"\u0074", // #LATIN SMALL LETTER T
"\x75":"\u0075", // #LATIN SMALL LETTER U
"\x76":"\u0076", // #LATIN SMALL LETTER V
"\x77":"\u0077", // #LATIN SMALL LETTER W
"\x78":"\u0078", // #LATIN SMALL LETTER X
"\x79":"\u0079", // #LATIN SMALL LETTER Y
"\x7A":"\u007A", // #LATIN SMALL LETTER Z
"\x7B":"\u007B", // #LEFT CURLY BRACKET
"\x7C":"\u007C", // #VERTICAL LINE
"\x7D":"\u007D", // #RIGHT CURLY BRACKET
"\x7E":"\u007E", // #TILDE
"\x7F":"\u007F", // #DELETE
"\x80":"\u20AC", // #EURO SIGN
"\x82":"\u201A", // #SINGLE LOW-9 QUOTATION MARK
"\x83":"\u0192", // #LATIN SMALL LETTER F WITH HOOK
"\x84":"\u201E", // #DOUBLE LOW-9 QUOTATION MARK
"\x85":"\u2026", // #HORIZONTAL ELLIPSIS
"\x86":"\u2020", // #DAGGER
"\x87":"\u2021", // #DOUBLE DAGGER
"\x88":"\u02C6", // #MODIFIER LETTER CIRCUMFLEX ACCENT
"\x89":"\u2030", // #PER MILLE SIGN
"\x8B":"\u2039", // #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
"\x91":"\u2018", // #LEFT SINGLE QUOTATION MARK
"\x92":"\u2019", // #RIGHT SINGLE QUOTATION MARK
"\x93":"\u201C", // #LEFT DOUBLE QUOTATION MARK
"\x94":"\u201D", // #RIGHT DOUBLE QUOTATION MARK
"\x95":"\u2022", // #BULLET
"\x96":"\u2013", // #EN DASH
"\x97":"\u2014", // #EM DASH
"\x98":"\u02DC", // #SMALL TILDE
"\x99":"\u2122", // #TRADE MARK SIGN
"\x9B":"\u203A", // #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
"\xA0":"\u00A0", // #NO-BREAK SPACE
"\xA1":"\u00A1", // #INVERTED EXCLAMATION MARK
"\xA2":"\u00A2", // #CENT SIGN
"\xA3":"\u00A3", // #POUND SIGN
"\xA4":"\u20AA", // #NEW SHEQEL SIGN
"\xA5":"\u00A5", // #YEN SIGN
"\xA6":"\u00A6", // #BROKEN BAR
"\xA7":"\u00A7", // #SECTION SIGN
"\xA8":"\u00A8", // #DIAERESIS
"\xA9":"\u00A9", // #COPYRIGHT SIGN
"\xAA":"\u00D7", // #MULTIPLICATION SIGN
"\xAB":"\u00AB", // #LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
"\xAC":"\u00AC", // #NOT SIGN
"\xAD":"\u00AD", // #SOFT HYPHEN
"\xAE":"\u00AE", // #REGISTERED SIGN
"\xAF":"\u00AF", // #MACRON
"\xB0":"\u00B0", // #DEGREE SIGN
"\xB1":"\u00B1", // #PLUS-MINUS SIGN
"\xB2":"\u00B2", // #SUPERSCRIPT TWO
"\xB3":"\u00B3", // #SUPERSCRIPT THREE
"\xB4":"\u00B4", // #ACUTE ACCENT
"\xB5":"\u00B5", // #MICRO SIGN
"\xB6":"\u00B6", // #PILCROW SIGN
"\xB7":"\u00B7", // #MIDDLE DOT
"\xB8":"\u00B8", // #CEDILLA
"\xB9":"\u00B9", // #SUPERSCRIPT ONE
"\xBA":"\u00F7", // #DIVISION SIGN
"\xBB":"\u00BB", // #RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
"\xBC":"\u00BC", // #VULGAR FRACTION ONE QUARTER
"\xBD":"\u00BD", // #VULGAR FRACTION ONE HALF
"\xBE":"\u00BE", // #VULGAR FRACTION THREE QUARTERS
"\xBF":"\u00BF", // #INVERTED QUESTION MARK
"\xC0":"\u05B0", // #HEBREW POINT SHEVA
"\xC1":"\u05B1", // #HEBREW POINT HATAF SEGOL
"\xC2":"\u05B2", // #HEBREW POINT HATAF PATAH
"\xC3":"\u05B3", // #HEBREW POINT HATAF QAMATS
"\xC4":"\u05B4", // #HEBREW POINT HIRIQ
"\xC5":"\u05B5", // #HEBREW POINT TSERE
"\xC6":"\u05B6", // #HEBREW POINT SEGOL
"\xC7":"\u05B7", // #HEBREW POINT PATAH
"\xC8":"\u05B8", // #HEBREW POINT QAMATS
"\xC9":"\u05B9", // #HEBREW POINT HOLAM
"\xCB":"\u05BB", // #HEBREW POINT QUBUTS
"\xCC":"\u05BC", // #HEBREW POINT DAGESH OR MAPIQ
"\xCD":"\u05BD", // #HEBREW POINT METEG
"\xCE":"\u05BE", // #HEBREW PUNCTUATION MAQAF
"\xCF":"\u05BF", // #HEBREW POINT RAFE
"\xD0":"\u05C0", // #HEBREW PUNCTUATION PASEQ
"\xD1":"\u05C1", // #HEBREW POINT SHIN DOT
"\xD2":"\u05C2", // #HEBREW POINT SIN DOT
"\xD3":"\u05C3", // #HEBREW PUNCTUATION SOF PASUQ
"\xD4":"\u05F0", // #HEBREW LIGATURE YIDDISH DOUBLE VAV
"\xD5":"\u05F1", // #HEBREW LIGATURE YIDDISH VAV YOD
"\xD6":"\u05F2", // #HEBREW LIGATURE YIDDISH DOUBLE YOD
"\xD7":"\u05F3", // #HEBREW PUNCTUATION GERESH
"\xD8":"\u05F4", // #HEBREW PUNCTUATION GERSHAYIM
"\xE0":"\u05D0", // #HEBREW LETTER ALEF
"\xE1":"\u05D1", // #HEBREW LETTER BET
"\xE2":"\u05D2", // #HEBREW LETTER GIMEL
"\xE3":"\u05D3", // #HEBREW LETTER DALET
"\xE4":"\u05D4", // #HEBREW LETTER HE
"\xE5":"\u05D5", // #HEBREW LETTER VAV
"\xE6":"\u05D6", // #HEBREW LETTER ZAYIN
"\xE7":"\u05D7", // #HEBREW LETTER HET
"\xE8":"\u05D8", // #HEBREW LETTER TET
"\xE9":"\u05D9", // #HEBREW LETTER YOD
"\xEA":"\u05DA", // #HEBREW LETTER FINAL KAF
"\xEB":"\u05DB", // #HEBREW LETTER KAF
"\xEC":"\u05DC", // #HEBREW LETTER LAMED
"\xED":"\u05DD", // #HEBREW LETTER FINAL MEM
"\xEE":"\u05DE", // #HEBREW LETTER MEM
"\xEF":"\u05DF", // #HEBREW LETTER FINAL NUN
"\xF0":"\u05E0", // #HEBREW LETTER NUN
"\xF1":"\u05E1", // #HEBREW LETTER SAMEKH
"\xF2":"\u05E2", // #HEBREW LETTER AYIN
"\xF3":"\u05E3", // #HEBREW LETTER FINAL PE
"\xF4":"\u05E4", // #HEBREW LETTER PE
"\xF5":"\u05E5", // #HEBREW LETTER FINAL TSADI
"\xF6":"\u05E6", // #HEBREW LETTER TSADI
"\xF7":"\u05E7", // #HEBREW LETTER QOF
"\xF8":"\u05E8", // #HEBREW LETTER RESH
"\xF9":"\u05E9", // #HEBREW LETTER SHIN
"\xFA":"\u05EA", // #HEBREW LETTER TAV
"\xFD":"\u200E", // #LEFT-TO-RIGHT MARK
"\xFE":"\u200F" // #RIGHT-TO-LEFT MARK
}, i=0,l,dest = '';
l = source.length;
while (l--)
{
if (win1255Encoding[source[i]])
dest += win1255Encoding[source[i]];
else
{
if (undefHandler)
{
dest += undefHandler (source[i]);
} else
{
dest += '\uFFFD';
}
}
i++;
}
return dest;
}
这样对我有用:
const fs = require('fs');
var Iconv = require('iconv').Iconv;
var iconv = new Iconv('cp1251', 'utf-8');
const encoded = fs.readFileSync('sample.txt');
const decoded = iconv.convert(encoded).toString();