MapReduce-KNN for Hadoop - Running multiple test cases from one data file

Question

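Background: [skip to the next section for the exact problem]

I am currently studying Hadoop as a small project at my university (it is not a mandatory project; I am doing it because I want to).

My plan is to use 5 PCs in one of the labs (Master + 4 Slaves) to run the KNN algorithm on a large dataset and measure things such as the running time.

I know I can find basic code on the internet, and I did find it (https://github.com/matt-hicks/MapReduce-KNN). It works for a single test case, but what I have is a very large test file containing hundreds of test cases. So I need to iterate the same code for every test case.

The problem:

tl;dr: I have a KNN program that takes only one test case at a time, but I want to make it iterative so that it can handle multiple test cases.

My solution

I am not very experienced with this, and from the basics I know, I decided to turn the variables and maps into arrays of variables and arrays of maps.

So this: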

    public static class KnnMapper extends Mapper<Object, Text, NullWritable, DoubleString>
    {
        DoubleString distanceAndModel = new DoubleString();
        TreeMap<Double, String> KnnMap = new TreeMap<Double, String>();

        // Declaring some variables which will be used throughout the mapper
        int K;

        double normalisedSAge;
        double normalisedSIncome;
        String sStatus;
        String sGender;
        double normalisedSChildren;

became this:

    DoubleString distanceAndModel = new DoubleString();
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];

    // Declaring some variables which will be used throughout the mapper
    int[] K = new int[1000];

    double[] normalisedSAge = new double[1000];
    double[] normalisedSIncome = new double[1000];
    String[] sStatus = new String[1000];
    String[] sGender = new String[1000];
    double[] normalisedSChildren = new double[1000];
    int n = 0;

And this:

    protected void setup(Context context) throws IOException, InterruptedException
    {
        if (context.getCacheFiles() != null && context.getCacheFiles().length > 0)
        {
            // Read parameter file using alias established in main()
            String knnParams = FileUtils.readFileToString(new File("./knnParamFile"));
            StringTokenizer st = new StringTokenizer(knnParams, ",");

            // Using the variables declared earlier, values are assigned to K and to the test dataset, S.
            // These values will remain unchanged throughout the mapper
            K = Integer.parseInt(st.nextToken());
            normalisedSAge = normalisedDouble(st.nextToken(), minAge, maxAge);
            normalisedSIncome = normalisedDouble(st.nextToken(), minIncome, maxIncome);
            sStatus = st.nextToken();
            sGender = st.nextToken();
            normalisedSChildren = normalisedDouble(st.nextToken(), minChildren, maxChildren);
        }
    }

became this:

    protected void setup(Context context) throws IOException, InterruptedException
    {
        if (context.getCacheFiles() != null && context.getCacheFiles().length > 0)
        {
            // Read parameter file using alias established in main()
            String knnParams = FileUtils.readFileToString(new File("./knnParamFile"));
            // Split the input file on newline or carriage-return characters (i.e. Windows line endings)
            StringTokenizer lineSt = new StringTokenizer(knnParams, "\n\r");

            // Loop over the lines, tokenizing one test case per line
            while (lineSt.hasMoreTokens())
            {
                String nextLine = lineSt.nextToken();   // The current line, i.e. one test case
                StringTokenizer st = new StringTokenizer(nextLine, ","); // Tokenize the fields of the current test case

                // Using the variables declared earlier, values are assigned to K and to the test dataset, S.
                // These values will remain unchanged throughout the mapper
                K[n] = Integer.parseInt(st.nextToken());
                normalisedSAge[n] = normalisedDouble(st.nextToken(), minAge, maxAge);
                normalisedSIncome[n] = normalisedDouble(st.nextToken(), minIncome, maxIncome);
                sStatus[n] = st.nextToken();
                sGender[n] = st.nextToken();
                normalisedSChildren[n] = normalisedDouble(st.nextToken(), minChildren, maxChildren);
                n++;
            }
        }
    }

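The same goes for the reducer class.

This is my first time using TreeMaps. I have studied and used trees before, but not maps or TreeMaps. I still tried to turn the TreeMap into an array, which turned out to be wrong: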

/home/hduser/Desktop/knn/KnnPattern.java:81: error: generic array creation TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000]; ^

/home/hduser/Desktop/knn/KnnPattern.java:198: error: incompatible types: double[] cannot be converted to double normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren); ^

/home/hduser/Desktop/knn/KnnPattern.java:238: error: generic array creation TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000]; ^

/home/hduser/Desktop/knn/KnnPattern.java:283: error: bad operand types for binary operator '>' if (KnnMap[num].size() > K) ^ first type: int second type: int[]

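Now, I think it might work if I try using a linked list of TreeMaps.

However, so far at uni I have mostly used C/C++ and Python. OOP seems to make people's lives easier here, but I am not 100% sure how to use it.

My questions:

Is it possible to make a linked list of TreeMaps?

Is there a linked-list alternative to:

TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];

Is my approach correct? Making the code iterative should help it go through all the test cases, right?

I will try to work on from there by trial and error. But this is something I have been stuck on for a few days.

I apologize if someone has already asked this before, but I could not find anything, so I had to write a question. If you think it has been answered before, please share a link to the relevant answer.

Thanks! And, on a side note: is there anything else I should keep in mind when using TreeMaps, especially a linked list of TreeMaps?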

Answer 1

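Regarding the error messages

/home/hduser/Desktop/knn/KnnPattern.java:81: error: generic array creation TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000]; ^

and

/home/hduser/Desktop/knn/KnnPattern.java:238: error: generic array creation TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000]; ^

These errors occur because you tried to create an array with a generic component type, which Java does not support because generic type information is lost at runtime. A workaround (if you really need an array) would be to create a List of TreeMap objects and then convert it to an array: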

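// TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
List<TreeMap<Double, String>> KnnMapList = new LinkedList<>();
// toArray() without an argument returns Object[], and casting that fails at runtime,
// so pass a raw TreeMap[] instead (the compiler reports an unchecked warning).
@SuppressWarnings("unchecked")
TreeMap<Double, String>[] KnnMap = KnnMapList.toArray(new TreeMap[0]);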

For details, see this question.
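If an array is not strictly required, a List of TreeMaps avoids generic array creation altogether, which also covers the "linked list of TreeMaps" part of the question. Below is a minimal sketch, assuming one map per test case. The class name KnnMapsSketch, the helper createMaps, and the sample values are made up for illustration and are not part of the GitHub code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class KnnMapsSketch {
    // Hypothetical helper: builds one empty TreeMap per test case.
    static List<TreeMap<Double, String>> createMaps(int testCases) {
        List<TreeMap<Double, String>> maps = new ArrayList<>(testCases);
        for (int i = 0; i < testCases; i++) {
            maps.add(new TreeMap<>());
        }
        return maps;
    }

    public static void main(String[] args) {
        List<TreeMap<Double, String>> knnMaps = createMaps(3);
        // Use knnMaps.get(i) where the array version would use KnnMap[i].
        knnMaps.get(0).put(0.25, "modelA");
        knnMaps.get(0).put(0.10, "modelB");
        System.out.println(knnMaps.get(0).firstKey()); // prints 0.1
        System.out.println(knnMaps.size());            // prints 3
    }
}
```

A LinkedList would work the same way, since both classes implement List; ArrayList is used here only because get(i) by index is cheaper on it.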

/home/hduser/Desktop/knn/KnnPattern.java:198: error: incompatible types: double[] cannot be converted to double normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren); ^

Looking at the source code on GitHub, I realized that you probably did not modify the following method call in the method KnnMapper#map(Object, Text, Context):

double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
                    normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren);

It should be:

double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
                    normalisedRChildren, normalisedSAge[n], normalisedSIncome[n], sStatus[n], sGender[n], normalisedSChildren[n]);

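However, I suspect these modifications will not give you the functionality you need, because KnnMapper#map(Object, Text, Context) is called only once per key/value pair, as described here, and you probably want to call it n times.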
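One way to read "call it n times" is to loop over all stored test cases inside a single map() call, so that each training record is compared against every test case. The sketch below shows that idea in plain Java, without any Hadoop classes. TopKSketch, offer, and the sample distances are hypothetical; the distance is simply passed in where the real code would call totalSquaredDistance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TopKSketch {
    // Hypothetical helper: records a candidate for one test case and
    // keeps only the k smallest distances in that test case's map.
    static void offer(TreeMap<Double, String> map, int k, double dist, String model) {
        map.put(dist, model);
        if (map.size() > k) {
            map.remove(map.lastKey()); // drop the farthest neighbour
        }
    }

    public static void main(String[] args) {
        int[] k = {2, 1}; // K per test case, as in K[n]
        List<TreeMap<Double, String>> maps = new ArrayList<>();
        for (int i = 0; i < k.length; i++) {
            maps.add(new TreeMap<>());
        }

        // Made-up data: dists[r][i] is the distance between training record r and test case i.
        double[][] dists = {{0.9, 0.4}, {0.2, 0.7}, {0.5, 0.1}};
        String[] models = {"modelA", "modelB", "modelC"};

        for (int r = 0; r < dists.length; r++) {   // stands in for one map() call per record
            for (int i = 0; i < k.length; i++) {   // every test case
                offer(maps.get(i), k[i], dists[r][i], models[r]);
            }
        }
        System.out.println(maps.get(0)); // prints {0.2=modelB, 0.5=modelC}
        System.out.println(maps.get(1)); // prints {0.1=modelC}
    }
}
```

Comparing the map size against k[i] rather than the whole array also avoids the "bad operand types" error at line 283. One thing to keep in mind with TreeMap in general: it stores a single value per key, so two records at exactly the same distance overwrite each other.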

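The specific problem

To avoid further trouble, I suggest you leave the higher-level code of the GitHub class untouched and only modify the KnnPattern#main(String[]) method so that it runs the job n times, as described in this answer.

Edit: Example

Here is a modified KnnPattern#main(String[]) method that reads your data file line by line, creates a temporary file with the current line as its content, and starts a job with the temporary file as the cache file. (This assumes you are using at least Java 7.)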

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
...
public class KnnPattern
{
  ...
    public static void main(String[] args) throws Exception {
        // Create configuration
        Configuration conf = new Configuration();

        if (args.length != 3) {
            System.err.println("Usage: KnnPattern <in> <out> <parameter file>");
            System.exit(2);
        }

        try (final BufferedReader br = new BufferedReader(new FileReader(args[2]))) {
            int n = 1;
            String line;
            while ((line = br.readLine()) != null) {
                // create temporary file with content of current line
                final File tmpDataFile = File.createTempFile("hadoop-test-", null);
                try (BufferedWriter tmpDataWriter = new BufferedWriter(new FileWriter(tmpDataFile))) {
                    tmpDataWriter.write(line);
                    tmpDataWriter.flush();
                }

                // Create job
                Job job = Job.getInstance(conf, "Find K-Nearest Neighbour #" + n);
                job.setJarByClass(KnnPattern.class);
                // Set the third parameter when running the job to be the parameter file and give it an alias
                job.addCacheFile(new URI(tmpDataFile.getAbsolutePath() + "#knnParamFile")); // Parameter file containing test data

                // Setup MapReduce job
                job.setMapperClass(KnnMapper.class);
                job.setReducerClass(KnnReducer.class);
                job.setNumReduceTasks(1); // Only one reducer in this design

                // Specify key / value
                job.setMapOutputKeyClass(NullWritable.class);
                job.setMapOutputValueClass(DoubleString.class);
                job.setOutputKeyClass(NullWritable.class);
                job.setOutputValueClass(Text.class);

                // Input (the data file) and Output (the resulting classification)
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1] + "_" + n));

                // Execute job
                final boolean jobSucceeded = job.waitForCompletion(true);

                // clean up
                tmpDataFile.delete();

                if (!jobSucceeded) {
                    // return error status if job failed
                    System.exit(1);
                }

                ++n;
            }
        }
    }

}
