Running into java.lang.IllegalArgumentException


I am working on a Java project that is essentially a fake-news detection application. The dataset has two columns: text (the news article) and label (0: fake / 1: real). This data is converted to a JSON file. In Java, I use a regex to replace all stop words with spaces (""). Then I started on vectorization in Java, and ran into problems with the built-in vectorization techniques in both Weka and Deeplearning4j. I am now using the "StringToWordVector" filter to vectorize the text. The code for the .java files in my application is below.

DataProcessor.java

package fnd;

import com.fasterxml.jackson.databind.ObjectMapper;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

public class DataProcessor {

    public static void main(String[] args) {
        try {
            // Specify the path to your JSON file containing news data
            String jsonFilePath = "src/main/resources/fnd_output.json";

            // Create ObjectMapper instance to read JSON
            ObjectMapper objectMapper = new ObjectMapper();

            // Deserialize JSON array into an array of News objects
            News[] newsArray = objectMapper.readValue(new File(jsonFilePath), News[].class);

            // Prepare attributes for the Instances
            ArrayList<Attribute> attributes = new ArrayList<>();
            attributes.add(new Attribute("text", (ArrayList<String>) null)); // Text attribute as string

            // Define nominal values for the label attribute
            ArrayList<String> labelValues = new ArrayList<>();
            labelValues.add("positive");
            labelValues.add("negative");
            Attribute labelAttribute = new Attribute("label", labelValues); // Label attribute as nominal
            attributes.add(labelAttribute);

            // Create an empty Instances object
            Instances instances = new Instances("TextInstances", attributes, 0);

            // Set the index of the class attribute (label attribute)
            instances.setClassIndex(attributes.size() - 1);

            // Process each News object and add to Instances
            for (News news : newsArray) {
                String processedText = TextPreprocessor.preprocessText(news.getText());

                // Vectorize the processed text
                Instances vectorizedInstance = TextVectorization.vectorizeText(processedText);

                // Create a new Instance
                Instance instance = new DenseInstance(attributes.size());

                // Set the dataset for the instance
                instance.setDataset(instances);

                // Handle text attribute (assuming it's a string attribute)
                Attribute textAttr = attributes.get(0);
                if (textAttr.isString()) {
                    instance.setValue(textAttr, vectorizedInstance.instance(0).stringValue(0));
                } else {
                    System.err.println("Text attribute is not a string attribute.");
                }

                // Handle label attribute (assuming it's a nominal attribute)
                Attribute labelAttr = labelAttribute;
                if (labelAttr.isNominal()) {
                    instance.setValue(labelAttr, news.getLabel());
                } else {
                    System.err.println("Label attribute is not a nominal attribute.");
                }

                // Add the instance to Instances
                instances.add(instance);
            }

            // Output instances to ARFF file
            ArffSaver arffSaver = new ArffSaver();
            arffSaver.setInstances(instances);
            arffSaver.setFile(new File("vectorized_text_with_labels.arff"));
            arffSaver.writeBatch();

            System.out.println("Text vectorization complete with labels. Saved as vectorized_text_with_labels.arff");

        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

TextVectorization.java

package fnd;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextVectorization {

    // Method to perform text vectorization (convert string to word vector)
    public static Instances vectorizeText(String text) throws Exception {
        // Create ArrayList to hold attributes
        ArrayList<Attribute> attributes = new ArrayList<>();

        // Create a single attribute named "text"
        Attribute textAttribute = new Attribute("text", (ArrayList<String>) null);
        attributes.add(textAttribute);

        // Create Instances object with the specified attribute
        Instances instances = new Instances("TextInstances", attributes, 0);
        instances.setClass(textAttribute); // Set the class attribute to "text"

        // Create a new Instance with the provided text
        Instance instance = new DenseInstance(instances.numAttributes());
        instance.setValue(textAttribute, text);
        instances.add(instance);

        // Apply StringToWordVector filter to vectorize the text
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(instances);
        Instances vectorizedData = Filter.useFilter(instances, filter);

        return vectorizedData;
    }

    public static void saveInstancesToArff(Instances instances, String filename) throws IOException {
        ArffSaver arffSaver = new ArffSaver();
        arffSaver.setInstances(instances);
        arffSaver.setFile(new File(filename));
        arffSaver.writeBatch();
    }
}

TextPreprocessor.java

package fnd;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextPreprocessor {

    private static final Pattern URL_PATTERN = Pattern.compile("http[s]?://\\S+|www\\.\\S+");
    private static final Pattern HTML_TAG_PATTERN = Pattern.compile("<[^>]+>");

    public static String preprocessText(String text) {
        if (text == null || text.isEmpty()) {
            return "";
        }

        // Convert text to lowercase
        text = text.toLowerCase();

        // Remove URLs and HTML tags
        text = removeUrlsAndHtmlTags(text);

        // Remove non-word characters (except spaces), digits, and newline characters
        text = removeSpecialCharacters(text);

        return text;
    }

    private static String removeUrlsAndHtmlTags(String text) {
        Matcher urlMatcher = URL_PATTERN.matcher(text);
        text = urlMatcher.replaceAll("");

        Matcher htmlTagMatcher = HTML_TAG_PATTERN.matcher(text);
        text = htmlTagMatcher.replaceAll("");

        return text;
    }

    private static String removeSpecialCharacters(String text) {
        StringBuilder processedText = new StringBuilder(text.length());

        for (char ch : text.toCharArray()) {
            if (Character.isLetter(ch) || Character.isWhitespace(ch)) {
                processedText.append(ch);
            }
        }

        return processedText.toString();
    }
}

Here are the error details, as far as I can tell:

java.lang.IllegalArgumentException: Attribute isn't nominal, string or date!
    at weka.core.AbstractInstance.stringValue(AbstractInstance.java:674)
    at weka.core.AbstractInstance.stringValue(AbstractInstance.java:644)
    at fnd.DataProcessor.main(DataProcessor.java:60)

The code runs if I comment out this line:

instance.setValue(textAttr, vectorizedInstance.instance(0).stringValue(0));

How do I vectorize the text and then feed the data into a model?

java json vectorization weka illegalargumentexception
1 Answer

You are processing the data one document at a time, re-initializing the StringToWordVector filter for every document. But on each call the filter generates a different bag of words, based on the content of the single document you just pushed through it, so the columns of the vectorized output correspond to different words every time. To fix this, you need to at least add all of your text data to a single weka.core.Instances object and then apply the filter.
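A minimal sketch of that batch approach. The class name, sample documents, and labels are made up for illustration; the "positive"/"negative" label values mirror the ones in the question's DataProcessor:

```java
import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchVectorizationSketch {

    // Vectorize ALL documents with one filter pass so every row shares
    // the same bag-of-words columns.
    public static Instances vectorizeAll(String[] texts, String[] labels) throws Exception {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("text", (ArrayList<String>) null)); // string attribute
        ArrayList<String> labelValues = new ArrayList<>();
        labelValues.add("positive");
        labelValues.add("negative");
        attrs.add(new Attribute("label", labelValues)); // nominal class attribute

        Instances data = new Instances("TextInstances", attrs, texts.length);
        data.setClassIndex(1); // the class attribute is left alone by the filter

        // Add every document to the SAME Instances object first
        for (int i = 0; i < texts.length; i++) {
            double[] vals = new double[2];
            vals[0] = data.attribute(0).addStringValue(texts[i]);
            vals[1] = data.attribute(1).indexOfValue(labels[i]);
            data.add(new DenseInstance(1.0, vals));
        }

        // Initialize the filter ONCE on the full dataset, then apply it
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances vec = vectorizeAll(
                new String[] {"aliens endorse candidate", "senate approves treaty"},
                new String[] {"negative", "positive"});
        System.out.println(vec.numInstances() + " instances, "
                + vec.numAttributes() + " attributes");
    }
}
```

Because all documents go through a single `setInputFormat`/`useFilter` pass, every row of the result is expressed over the same vocabulary.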

However...

Since you plan to perform classification, you should combine StringToWordVector with the FilteredClassifier meta-classifier and a base classifier of your choice for the actual classification. That way you can be sure that subsequent predictions are preprocessed correctly by the already-initialized StringToWordVector filter. In this setup, your text data should be the first attribute, and the label associated with that text should be the second attribute (the class attribute).
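A minimal sketch of that setup. NaiveBayes is an arbitrary choice of base classifier, and the class name, helper methods, and sample documents are hypothetical:

```java
import java.util.ArrayList;

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class FilteredClassifierSketch {

    // Dataset layout the answer describes: text first, label (class) second
    public static Instances makeDataset() {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("text", (ArrayList<String>) null)); // raw text
        ArrayList<String> labels = new ArrayList<>();
        labels.add("positive");
        labels.add("negative");
        attrs.add(new Attribute("label", labels)); // nominal class
        Instances data = new Instances("news", attrs, 0);
        data.setClassIndex(1);
        return data;
    }

    public static void addDoc(Instances data, String text, String label) {
        double[] vals = new double[2];
        vals[0] = data.attribute(0).addStringValue(text);
        vals[1] = data.attribute(1).indexOfValue(label);
        data.add(new DenseInstance(1.0, vals));
    }

    public static void main(String[] args) throws Exception {
        Instances train = makeDataset();
        addDoc(train, "shocking miracle cure revealed", "negative");
        addDoc(train, "parliament passes budget bill", "positive");

        // The meta-classifier applies StringToWordVector internally during
        // training, and reuses the same initialized filter (same vocabulary)
        // for every later prediction.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new NaiveBayes());
        fc.buildClassifier(train);

        // Predict on a new raw-text instance: no manual vectorization needed
        Instances test = makeDataset();
        addDoc(test, "miracle cure shocking", "negative"); // label here is a placeholder
        double pred = fc.classifyInstance(test.instance(0));
        System.out.println("Predicted label: " + test.classAttribute().value((int) pred));
    }
}
```

This avoids the original error entirely: the raw string attribute is never touched by hand, and the filter's vocabulary stays consistent between training and prediction.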

© www.soinside.com 2019 - 2024. All rights reserved.