I have an EBCDIC file generated on a mainframe and need to convert it to ASCII for data processing.
Any help would be much appreciated.
As of [Ruby 2.3, an EBCDIC encoding is available][1]:
Encoding
new Encoding::IBM037 (alias ebcdic-cp-us; dummy)
So this should work:
# Read the file as IBM037 (EBCDIC) and transcode it to ASCII while reading.
src = 'out_26877296.tst'
content = File.read(src, encoding: 'IBM037:ASCII')
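If the data may contain characters that ASCII cannot represent (the EBCDIC cent sign at byte 0x4A, for example), a strict conversion can fail. The following is a minimal sketch of a lenient in-memory conversion using `String#encode`; the helper name `ebcdic_to_ascii` and the sample bytes are made up for illustration, and it assumes the input really is code page 037.

```ruby
# Hypothetical helper (not from the original answer): transcode IBM037 bytes
# to US-ASCII, substituting '?' for anything ASCII cannot represent.
def ebcdic_to_ascii(bytes)
  bytes.encode('US-ASCII', 'IBM037',
               invalid: :replace, undef: :replace, replace: '?')
end

# Sample input: the IBM037 bytes for "Hello" followed by a cent sign (0x4A).
ebcdic = "\xC8\x85\x93\x93\x96\x4A".b

puts ebcdic_to_ascii(ebcdic)
```

Dropping the `invalid:`/`undef:` options makes the conversion strict again, which may be preferable when silent data loss is not acceptable.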
To keep this up to date: for Ruby 3.1.2p20, these are all the available encodings (wrapped for readability):
irb(main):015> Encoding.name_list.sort.join ", "
=> "646, ANSI_X3.4-1968, ASCII, ASCII-8BIT, BINARY, Big5, Big5-HKSCS,
Big5-HKSCS:2008, Big5-UAO, CESU-8, CP1250, CP1251, CP1252, CP1253,
CP1254, CP1255, CP1256, CP1257, CP1258, CP437, CP50220, CP50221,
CP51932, CP65000, CP65001, CP720, CP737, CP775, CP850, CP852,
CP855, CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP869, CP874, CP878, CP932,
CP936, CP949, CP950, CP951, EUC-CN, EUC-JIS-2004, EUC-JISX0213, EUC-JP, EUC-KR, EUC-TW,
Emacs-Mule, GB12345, GB18030, GB1988, GB2312, GBK, IBM037, IBM437, IBM720, IBM737, IBM775,
IBM850, IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866,
IBM869, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-KDDI, ISO-8859-1, ISO-8859-10, ISO-8859-11,
ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16, ISO-8859-2, ISO-8859-3, ISO-8859-4,
ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO2022-JP, ISO2022-JP2, ISO8859-1,
ISO8859-10, ISO8859-11,
ISO8859-13, ISO8859-14, ISO8859-15, ISO8859-16, ISO8859-2, ISO8859-3, ISO8859-4, ISO8859-5,
ISO8859-6, ISO8859-7, ISO8859-8, ISO8859-9, KOI8-R, KOI8-U, MacJapan, MacJapanese, PCK, SJIS,
SJIS-DoCoMo, SJIS-KDDI, SJIS-SoftBank, Shift_JIS, TIS-620, UCS-2BE, UCS-4BE, UCS-4LE,
US-ASCII, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-8, UTF-8-HFS,
UTF-8-MAC, UTF8-DoCoMo, UTF8-KDDI, UTF8-MAC, UTF8-SoftBank, Windows-1250, Windows-1251,
Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257,
Windows-1258, Windows-31J, Windows-874, csWindows31J, ebcdic-cp-us, euc-jp-ms, eucCN, eucJP,
eucJP-ms, eucKR, eucTW, external, filesystem, internal, locale, macCentEuro, macCroatian,
macCyrillic, macGreek, macIceland, macRoman, macRomania, macThai, macTurkish, macUkraine,
stateless-ISO-2022-JP, stateless-ISO-2022-JP-KDDI"
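EBCDIC comes in many flavors (code pages 037, 500 and 1047 are common ones), but of the encodings listed above only IBM037 (alias ebcdic-cp-us) is EBCDIC. The other IBM-prefixed names (IBM737, IBM775, IBM850, IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866 and IBM869) are PC/DOS code pages that extend ASCII, not EBCDIC variants.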
I don't know of any way to determine which one is in use, other than noticing when the conversion goes wrong.
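One rough way to "notice when the conversion goes wrong" is to decode a sample with a candidate encoding and measure how much of the result is printable. This is only a heuristic sketch: `printable_ratio` is a hypothetical helper, the sample bytes are invented, and it will not tell apart EBCDIC code pages that differ in just a few punctuation characters.

```ruby
# Hypothetical helper: decode raw bytes with a candidate encoding and return
# the fraction of resulting characters that are printable or whitespace.
def printable_ratio(bytes, candidate)
  text = bytes.encode('UTF-8', candidate, invalid: :replace, undef: :replace)
  return 0.0 if text.empty?

  printable = text.each_char.count { |ch| ch.match?(/[[:print:]\s]/) }
  printable.to_f / text.length
end

# Sample input: "Hello world" encoded as IBM037.
sample = "\xC8\x85\x93\x93\x96\x40\xA6\x96\x99\x93\x84".b

%w[IBM037 ISO-8859-1].each do |candidate|
  puts "#{candidate}: #{printable_ratio(sample, candidate).round(2)}"
end
```

A ratio close to 1.0 suggests the candidate is plausible; a low ratio suggests the bytes are in some other encoding, or that the file contains binary fields that should not be transcoded at all.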
To convert from IBM037 to UTF-8:
File.read('some_ibm_file', encoding: 'IBM037:UTF-8')
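To save the converted text rather than just read it, the same call can be paired with `File.write`. A minimal sketch, assuming the whole file fits in memory and contains only text in code page 037; `convert_ebcdic_file` and the temporary files are illustrative, not part of the original answer.

```ruby
require 'tempfile'

# Hypothetical helper: read an IBM037 file, transcode it to UTF-8,
# write the result to dst_path, and return the converted text.
def convert_ebcdic_file(src_path, dst_path)
  text = File.read(src_path, encoding: 'IBM037:UTF-8')
  File.write(dst_path, text, encoding: 'UTF-8')
  text
end

# Demo with a throwaway input file holding the IBM037 bytes for "Hello".
Tempfile.create('ebcdic') do |src|
  src.binmode
  src.write("\xC8\x85\x93\x93\x96".b)
  src.flush

  Tempfile.create('utf8') do |dst|
    puts convert_ebcdic_file(src.path, dst.path)
  end
end
```

For very large files, or files with fixed-length records and binary fields, a record-by-record approach is likely a better fit than reading everything at once.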