All-inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?

Question

I'm creating a simple word-count program in Java that reads through the text-based files in a directory.

However, I keep getting this error:

java.nio.charset.MalformedInputException: Input length = 1

from this line of code:

BufferedReader reader = Files.newBufferedReader(file,Charset.forName("UTF-8"));

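I know I probably get this because the Charset I used doesn't include some of the characters in the text files, some of which contain characters from other languages. But I want to include those characters.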

I later learned from the JavaDocs that the Charset is optional and only used for more efficient reading of the files, so I changed the code to:

BufferedReader reader = Files.newBufferedReader(file);

But some files still throw the MalformedInputException. I don't know why.

I was wondering whether there is an all-inclusive Charset that will let me read text files with many different types of characters?

Thanks.

Answer 1

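You probably want to have a list of supported encodings. For each file, try each encoding in turn, perhaps starting with UTF-8. Every time you catch the MalformedInputException, try the next encoding.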
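The try-each-encoding idea could be sketched roughly as follows. This is only an illustration under assumptions: the class name `CharsetFallback`, the method `decodeWithFirstMatch`, and the candidate list are made up for the example, and the right candidates depend on where your files come from.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

final class CharsetFallback {

    // Hypothetical candidate list, not taken from the answer. Order matters:
    // the first charset that decodes without error wins. ISO-8859-1 accepts
    // any byte sequence, so it only makes sense as the last entry.
    static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.UTF_8,
            Charset.forName("windows-1252"),
            StandardCharsets.ISO_8859_1);

    /** Decodes the bytes with the first candidate that reports no error, if any. */
    static Optional<String> decodeWithFirstMatch(byte[] bytes, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                // A fresh decoder reports malformed and unmappable input by default,
                // so a failure surfaces as a CharacterCodingException
                // (MalformedInputException is a subclass).
                return Optional.of(cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString());
            } catch (CharacterCodingException e) {
                // This charset cannot decode the data; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

For a file, something like `CharsetFallback.decodeWithFirstMatch(Files.readAllBytes(path), CharsetFallback.CANDIDATES)` would apply it. Keep in mind that "decodes without error" is not the same as "is the right encoding": a permissive charset can accept the bytes and still produce the wrong characters.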

Answer 2

Creating a BufferedReader from Files.newBufferedReader

Files.newBufferedReader(Paths.get("a.txt"), StandardCharsets.UTF_8);

may throw the following exception when the application runs:

java.nio.charset.MalformedInputException: Input length = 1

But

new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"),"utf-8"));

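works well.

The difference is that the former uses the CharsetDecoder default action.

The default action for malformed-input and unmappable-character errors is to report them,

while the latter uses the REPLACE action.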

cs.newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)
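To get the REPLACE behaviour explicitly rather than relying on which constructor is used, a decoder configured as above can be handed to an InputStreamReader. The following is a minimal sketch; `LenientReaders`, `newLenientReader`, and `readLines` are hypothetical names chosen for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

final class LenientReaders {

    /** Wraps the stream in a reader that substitutes U+FFFD for undecodable bytes. */
    static BufferedReader newLenientReader(InputStream in, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(new InputStreamReader(in, decoder));
    }

    /** File-based variant with the same call shape as Files.newBufferedReader(path, cs). */
    static BufferedReader newLenientReader(Path file, Charset cs) throws IOException {
        return newLenientReader(Files.newInputStream(file), cs);
    }

    /** Reads every line leniently; I/O failures are rethrown unchecked. */
    static List<String> readLines(InputStream in, Charset cs) {
        try (BufferedReader reader = newLenientReader(in, cs)) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is that bad bytes are silently turned into the replacement character, so the exception goes away but the original data at those positions is lost.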

Answer 3

ISO-8859-1 is an all-inclusive charset, in the sense that it is guaranteed not to throw MalformedInputException. So it is good for debugging, even if your input is not in this charset. So:

req.setCharacterEncoding("ISO-8859-1");

I had some double-right-quote/double-left-quote characters in my input, and both US-ASCII and UTF-8 threw MalformedInputException on them, but ISO-8859-1 worked.
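The "never throws" property can be checked with a small sketch like the one below. The class and method names are invented for the illustration, and the sample bytes 0x93 and 0x94 are an assumption: they are the curly double quotes in windows-1252, which may or may not be what the input above contained.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1Demo {

    /** True if the charset decodes the bytes without reporting an error. */
    static boolean decodesStrictly(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder().decode(ByteBuffer.wrap(bytes)); // default action is REPORT
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Builds an array holding every possible byte value, 0x00 through 0xFF. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    /** Decodes as ISO-8859-1, which maps each byte to the code point of the same value. */
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Here `decodesStrictly(allByteValues(), StandardCharsets.ISO_8859_1)` succeeds, while a lone 0x93 byte fails under both UTF-8 and US-ASCII. Note what "works" means, though: ISO-8859-1 turns 0x93 into the control character U+0093, not a quotation mark, so the text can be read without an exception and still come out with the wrong characters.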

Answer 4

I also encountered this exception, with the error message:

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
at sun.nio.cs.StreamEncoder.write(Unknown Source)
at java.io.OutputStreamWriter.write(Unknown Source)
at java.io.BufferedWriter.flushBuffer(Unknown Source)
at java.io.BufferedWriter.write(Unknown Source)
at java.io.Writer.write(Unknown Source)

and found that a strange bug occurs when trying to use

BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath));

to write the String "orazg 54", which was cast from a generic type in a class.

//key is of generic type <Key extends Comparable<Key>>
writer.write(item.getKey() + "\t" + item.getValue() + "\n");

This String has length 9 and contains characters with the following code points:

111 114 97 122 103 9 53 52 10

However, if the BufferedWriter in the class is replaced with:

FileOutputStream outputStream = new FileOutputStream(filePath);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream));

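it writes this String successfully, without any exception. In addition, if I write the same String created from the individual characters, it still works fine.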

String string = new String(new char[] {111, 114, 97, 122, 103, 9, 53, 52, 10});
BufferedWriter writer = Files.newBufferedWriter(Paths.get("a.txt"));
writer.write(string);
writer.close();

Previously I had never encountered any exception when using the first BufferedWriter to write any String. It is a strange bug that occurs with a BufferedWriter created from java.nio.file.Files.newBufferedWriter(path, options).

Answer 5

Try this. I had the same issue, and the implementation below worked for me:

Reader reader = Files.newBufferedReader(Paths.get(<yourfilewithpath>), StandardCharsets.ISO_8859_1);

Then use the Reader wherever you want.

Example:

CsvToBean<anyPojo> csvToBean = null;
try {
    Reader reader = Files.newBufferedReader(Paths.get(csvFilePath),
            StandardCharsets.ISO_8859_1);
    csvToBean = new CsvToBeanBuilder(reader)
            .withType(anyPojo.class)
            .withIgnoreLeadingWhiteSpace(true)
            .withSkipLines(1)
            .build();
} catch (IOException e) {
    e.printStackTrace();
}

Answer 6

ISO_8859_1 worked for me! I was reading a text file with comma-separated values.

Answer 7

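I wrote the following to print a list of results to standard out based on the available charsets. Note that it also tells you which line fails, as a 0-based line number, in case you are troubleshooting which character is causing the problem.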

public static void testCharset(String fileName) {
    SortedMap<String, Charset> charsets = Charset.availableCharsets();
    for (String k : charsets.keySet()) {
        int line = 0; // 0-based index of the line currently being read
        boolean success = true;
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileName), charsets.get(k))) {
            while (b.ready()) {
                b.readLine();
                line++;
            }
        } catch (IOException e) {
            // Decoding errors such as MalformedInputException are IOExceptions.
            success = false;
            System.out.println(k + " failed on line " + line);
        }
        if (success) {
            System.out.println("*************************  Success " + k);
        }
    }
}

Answer 8

Well, the problem is that

Files.newBufferedReader(Path path)

is implemented like this:

public static BufferedReader newBufferedReader(Path path) throws IOException {
    return newBufferedReader(path, StandardCharsets.UTF_8);
}

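So basically there is no point in specifying UTF-8 unless you want to be descriptive in your code. If you want to try a "broader" charset you could try StandardCharsets.UTF_16, but you cannot be 100% sure of getting every possible character anyway.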

Answer 9

UTF-8 works for me for Polish characters.

Answer 10

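Adding an additional answer for the Quarkus mailer and Qute templates, since this is always the first result on Google no matter which part of my stack trace I search for: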

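If you are using the Quarkus mailer with Qute templates and get this MalformedInputException, check whether your templates folder contains files other than template files. In my case, I had a .png file that I wanted to include in the mail, and it was automatically read as a template, hence this encoding issue.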

Answer 11

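I tried using UTF-8 because the data was Vietnamese, but that was wrong.

Solution: check the actual encoding of the file you are reading. I used NPP (Notepad++), which in this case showed UTF-16 LE BOM.

So I needed to apply the same encoding in the code: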

private static List<String[]> readCsvLinesFromFile(String filePath) {
    List<String[]> lines = new ArrayList<>();
    // The file is UTF-16 LE with a BOM, so decode it with the matching charset.
    try (InputStreamReader isr = new InputStreamReader(new FileInputStream(filePath), StandardCharsets.UTF_16LE)) {
        // do your work
    } catch (IOException e) {
        e.printStackTrace();
    }
    return lines;
}
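Instead of opening each file in an editor, the byte-order mark can also be checked in code. The sketch below is an illustration only: `BomSniffer`, `detect`, and `stripBom` are hypothetical names, it covers only the UTF-8 and UTF-16 marks (not UTF-32), and a file without a BOM simply gets whatever fallback you pass in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {

    /** Returns the charset signalled by a leading byte-order mark, or the fallback. */
    static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        return fallback; // no recognised BOM
    }

    /** Compares the first bytes of data, read as unsigned values, with the prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Drops a leading U+FEFF, which the UTF_8 and UTF_16LE/BE decoders keep in the text. */
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}
```

A caller could read the file's bytes, pass them to `detect`, decode with the returned charset, and then call `stripBom` so the mark does not end up as an invisible first character of the first CSV field.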

Answer 12

You can try something like this, or just copy and paste the piece below.

// Assumes f, keyword, lines and values are declared by the surrounding code.
List<Charset> candidates = new ArrayList<>(Charset.availableCharsets().values());
Charset charset = Charset.defaultCharset(); // Try the default one first.
int index = 0;
boolean done = false;

while (!done) {
    try {
        lines = Files.readAllLines(f.toPath(), charset);
        for (String line : lines) {
            line = line.trim();
            if (line.contains(keyword)) {
                values.add(line);
            }
        }
        done = true; // No exception: this charset decoded the whole file.
    } catch (IOException e) {
        // Stop once every charset has been tried instead of looping forever.
        if (index >= candidates.size()) {
            throw new UncheckedIOException("No available charset could read " + f, e);
        }
        charset = candidates.get(index++); // Try the next charset.
    }
}
