I am trying out Spark with Java. I created a Maven project in Eclipse using the following typical example:
package maver4spark.exampleSpark;

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.log4j.*;

public class Main {
    public static void main(String[] args) {
        String logFile = "/folder/file.txt";
        SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("yarn-client");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Count the lines containing "a" and "b" respectively.
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
When I run this application from Eclipse, I get the following error:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Unable to load YARN support
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:399)
If I replace setMaster("yarn-client") with setMaster("local[2]"), it works fine. My pom.xml contains the following dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.10</artifactId>
        <version>1.6.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.1</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
I have also installed Hadoop and Spark. The Hadoop daemons (single node) and the YARN daemons are running, and Spark works fine from the shell. The problem is that I don't know how to connect the Java application to the Hadoop file system. I want to try the application from Eclipse, not via spark-submit.
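To illustrate, something along these lines is what I am after: the same job reading the file from HDFS instead of the local file system. The namenode address hdfs://localhost:9000 here is only a guess for a default single-node setup; it would have to match fs.defaultFS in core-site.xml.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsRead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HdfsRead").setMaster("yarn-client");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // hdfs://localhost:9000 is an assumed namenode address; use the
        // fs.defaultFS value from core-site.xml of the running cluster.
        JavaRDD<String> logData = sc.textFile("hdfs://localhost:9000/folder/file.txt").cache();
        System.out.println("Lines read: " + logData.count());
        sc.stop();
    }
}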
What is causing this stack trace? In my case, I needed to add the spark-yarn dependency.
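A sketch of the added dependency, assuming the same Scala version (2.10), Spark version (1.6.1), and scope as the artifacts already in the pom:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.10</artifactId>
    <version>1.6.1</version>
    <scope>provided</scope>
</dependency>

This matches the error: "Unable to load YARN support" is raised when Spark's YARN client classes are not on the classpath, which is exactly what spark-yarn provides.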