我正在通过读取avro文件来创建数据帧,但是在scala IDE的spark应用程序中读取文件时出现错误。
package dataFrameBasics
import org.apache.spark.sql.SparkSession
object WorkingWithAvroFile {
def main(ar : Array[String]): Unit={
val ss= SparkSession.builder().master("local")
.appName("Working with Avro File")
.getOrCreate()
val avroDF= ss.read
.format("com.databricks.spark.avro")
.load("C:/Spark_Files/userdata1.avro")
avroDF.printSchema()
avroDF.show(10)
println("Count:"+avroDF.count())
}
}
在控制台上,出现以下错误:线程“主”中的异常java.lang.ClassNotFoundException:无法找到数据源:org.apache.spark.sql.avro.AvroFileFormat。请在http://spark.apache.org/third-party-projects.html]中找到软件包
在“问题”选项卡上,给出以下错误:
spark-avro_2.11-3.2.0.jar的SparkCourseAsMavenProject构建路径与不兼容的Scala版本(2.11.0)交叉编译。如果此报告有误,可以在编译器首选项页面中禁用此检查。
在pom.xml中,添加了依赖项:
<dependency> <groupId>com.databricks</groupId> <artifactId>spark-avro_2.11</artifactId> <version>3.2.0</version> </dependency>
尝试过该库的不同版本,但仍然给出相同的错误。
我正在通过读取avro文件来创建数据帧,但是在scala IDE的spark应用程序中读取文件时出错。包dataFrameBasics导入org.apache.spark.sql.SparkSession对象...
Srinivas-下面是pom.xml
首先,您无法导入com.databricks的Avro jar,并期望其中包含org.apache.spark.sql的Avro文件格式。最好选择以下内容。注意:使用正确的scala和spark版本
@QuickSilver -- Please find below full stacktrace
******************
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/05/28 19:53:10 INFO SparkContext: Running Spark version 2.4.5
20/05/28 19:53:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/05/28 19:53:12 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:293)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at dataFrameBasics.WorkingWithAvroFile$.main(WorkingWithAvroFile.scala:9)
at dataFrameBasics.WorkingWithAvroFile.main(WorkingWithAvroFile.scala)
20/05/28 19:53:12 INFO SparkContext: Submitted application: Working with Avro File
20/05/28 19:53:12 INFO SecurityManager: Changing view acls to: santosh
20/05/28 19:53:12 INFO SecurityManager: Changing modify acls to: santosh
20/05/28 19:53:12 INFO SecurityManager: Changing view acls groups to:
20/05/28 19:53:12 INFO SecurityManager: Changing modify acls groups to:
20/05/28 19:53:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(santosh); groups with view permissions: Set(); users with modify permissions: Set(santosh); groups with modify permissions: Set()
20/05/28 19:53:16 INFO Utils: Successfully started service 'sparkDriver' on port 50965.
20/05/28 19:53:16 INFO SparkEnv: Registering MapOutputTracker
20/05/28 19:53:16 INFO SparkEnv: Registering BlockManagerMaster
20/05/28 19:53:16 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/05/28 19:53:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/05/28 19:53:16 INFO DiskBlockManager: Created local directory at C:\Users\santosh\AppData\Local\Temp\blockmgr-0efb8730-052b-474d-ad33-669ad1b5d5ed
20/05/28 19:53:16 INFO MemoryStore: MemoryStore started with capacity 351.3 MB
20/05/28 19:53:16 INFO SparkEnv: Registering OutputCommitCoordinator
20/05/28 19:53:17 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/05/28 19:53:17 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://Lenovo-PC:4040
20/05/28 19:53:18 INFO Executor: Starting executor ID driver on host localhost
20/05/28 19:53:18 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50976.
20/05/28 19:53:18 INFO NettyBlockTransferService: Server created on Lenovo-PC:50976
20/05/28 19:53:18 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/05/28 19:53:18 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, Lenovo-PC, 50976, None)
20/05/28 19:53:18 INFO BlockManagerMasterEndpoint: Registering block manager Lenovo-PC:50976 with 351.3 MB RAM, BlockManagerId(driver, Lenovo-PC, 50976, None)
20/05/28 19:53:18 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, Lenovo-PC, 50976, None)
20/05/28 19:53:18 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, Lenovo-PC, 50976, None)
20/05/28 19:53:19 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/Users/santosh/Scala/Workspace/SparkCourseAsMavenProject/spark-warehouse').
20/05/28 19:53:19 INFO SharedState: Warehouse path is 'file:/C:/Users/santosh/Scala/Workspace/SparkCourseAsMavenProject/spark-warehouse'.
20/05/28 19:53:23 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/05/28 19:53:24 INFO InMemoryFileIndex: It took 232 ms to list leaf files for 1 paths.
**root
|-- registration_dttm: string (nullable = true)
|-- id: long (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- email: string (nullable = true)
|-- gender: string (nullable = true)
|-- ip_address: string (nullable = true)
|-- cc: long (nullable = true)
|-- country: string (nullable = true)
|-- birthdate: string (nullable = true)
|-- salary: double (nullable = true)
|-- title: string (nullable = true)
|-- comments: string (nullable = true)**
20/05/28 19:53:31 INFO FileSourceStrategy: Pruning directories with:
20/05/28 19:53:31 INFO FileSourceStrategy: Post-Scan Filters:
20/05/28 19:53:31 INFO FileSourceStrategy: Output Data Schema: struct<registration_dttm: string, id: bigint, first_name: string, last_name: string, email: string ... 11 more fields>
20/05/28 19:53:31 INFO FileSourceScanExec: Pushed Filters:
20/05/28 19:53:32 INFO CodeGenerator: Code generated in 790.65082 ms
20/05/28 19:53:35 INFO CodeGenerator: Code generated in 188.480854 ms
20/05/28 19:53:36 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 220.3 KB, free 351.1 MB)
20/05/28 19:53:42 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.6 KB, free 351.1 MB)
20/05/28 19:53:42 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on Lenovo-PC:50976 (size: 20.6 KB, free: 351.3 MB)
20/05/28 19:53:42 INFO SparkContext: Created broadcast 0 from show at WorkingWithAvroFile.scala:24
20/05/28 19:53:42 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4287865 bytes, open cost is considered as scanning 4194304 bytes.
20/05/28 19:53:42 INFO SparkContext: Starting job: show at WorkingWithAvroFile.scala:24
20/05/28 19:53:42 INFO DAGScheduler: Got job 0 (show at WorkingWithAvroFile.scala:24) with 1 output partitions
20/05/28 19:53:42 INFO DAGScheduler: Final stage: ResultStage 0 (show at WorkingWithAvroFile.scala:24)
20/05/28 19:53:42 INFO DAGScheduler: Parents of final stage: List()
20/05/28 19:53:42 INFO DAGScheduler: Missing parents: List()
20/05/28 19:53:42 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at show at WorkingWithAvroFile.scala:24), which has no missing parents
20/05/28 19:53:43 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 14.1 KB, free 351.1 MB)
20/05/28 19:53:43 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.0 KB, free 351.0 MB)
20/05/28 19:53:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on Lenovo-PC:50976 (size: 6.0 KB, free: 351.3 MB)
20/05/28 19:53:43 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1163
20/05/28 19:53:43 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at show at WorkingWithAvroFile.scala:24) (first 15 tasks are for partitions Vector(0))
20/05/28 19:53:43 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/05/28 19:53:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7732 bytes)
20/05/28 19:53:43 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/05/28 19:53:44 INFO FileScanRDD: Reading File path: file:///C:/Spark_Files/userdata1.avro, range: 0-93561, partition values: [empty row]
20/05/28 19:53:44 INFO CodeGenerator: Code generated in 85.990927 ms
20/05/28 19:53:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 3182 bytes result sent to driver
20/05/28 19:53:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2830 ms on localhost (executor driver) (1/1)
20/05/28 19:53:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/05/28 19:53:46 INFO DAGScheduler: ResultStage 0 (show at WorkingWithAvroFile.scala:24) finished in 3.362 s
20/05/28 19:53:46 INFO DAGScheduler: Job 0 finished: show at WorkingWithAvroFile.scala:24, took 4.340480 s
**+--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
| registration_dttm| id|first_name|last_name| email|gender| ip_address| cc| country| birthdate| salary| title|comments|
+--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
|2016-02-03T07:55:29Z| 1| Amanda| Jordan| [email protected]|Female| 1.197.201.2|6759521864920116| Indonesia| 3/8/1971| 49756.53| Internal Auditor| 1E+02|
|2016-02-03T17:04:03Z| 2| Albert| Freeman| [email protected]| Male|218.111.175.34| null| Canada| 1/16/1968|150280.17| Accountant IV| |
|2016-02-03T01:09:31Z| 3| Evelyn| Morgan|emorgan2@altervis...|Female| 7.161.136.94|6767119071901597| Russia| 2/1/1960|144972.51| Structural Engineer| |
|2016-02-03T12:36:21Z| 4| Denise| Riley| [email protected]|Female| 140.35.109.83|3576031598965625| China| 4/8/1997| 90263.05|Senior Cost Accou...| |
|2016-02-03T05:05:31Z| 5| Carlos| Burns|cburns4@miitbeian...| |169.113.235.40|5602256255204850| South Africa| | null| | |
|2016-02-03T07:22:34Z| 6| Kathryn| White| [email protected]|Female|195.131.81.179|3583136326049310| Indonesia| 2/25/1983| 69227.11| Account Executive| |
|2016-02-03T08:33:08Z| 7| Samuel| Holmes|[email protected]| Male|232.234.81.197|3582641366974690| Portugal|12/18/1987| 14247.62|Senior Financial ...| |
|2016-02-03T06:47:06Z| 8| Harry| Howell| [email protected]| Male| 91.235.51.73| null|Bosnia and Herzeg...| 3/1/1962|186469.43| Web Developer IV| |
|2016-02-03T03:52:53Z| 9| Jose| Foster| [email protected]| Male| 132.31.53.61| null| South Korea| 3/27/1992|231067.84|Software Test Eng...| 1E+02|
|2016-02-03T18:29:47Z| 10| Emily| Stewart|estewart9@opensou...|Female|143.28.251.245|3574254110301671| Nigeria| 1/28/1997| 27234.28| Health Coach IV| |
+--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
only showing top 10 rows**
20/05/28 19:53:47 INFO FileSourceStrategy: Pruning directories with:
20/05/28 19:53:47 INFO FileSourceStrategy: Post-Scan Filters:
20/05/28 19:53:47 INFO FileSourceStrategy: Output Data Schema: struct<>
20/05/28 19:53:47 INFO FileSourceScanExec: Pushed Filters:
20/05/28 19:53:48 INFO CodeGenerator: Code generated in 50.675225 ms
20/05/28 19:53:48 INFO CodeGenerator: Code generated in 45.623914 ms
20/05/28 19:53:48 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 220.3 KB, free 350.8 MB)
20/05/28 19:53:48 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 20.6 KB, free 350.8 MB)
20/05/28 19:53:48 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on Lenovo-PC:50976 (size: 20.6 KB, free: 351.3 MB)
20/05/28 19:53:48 INFO SparkContext: Created broadcast 2 from count at WorkingWithAvroFile.scala:25
20/05/28 19:53:48 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4287865 bytes, open cost is considered as scanning 4194304 bytes.
20/05/28 19:53:48 INFO SparkContext: Starting job: count at WorkingWithAvroFile.scala:25
20/05/28 19:53:48 INFO DAGScheduler: Registering RDD 6 (count at WorkingWithAvroFile.scala:25) as input to shuffle 0
20/05/28 19:53:48 INFO DAGScheduler: Got job 1 (count at WorkingWithAvroFile.scala:25) with 1 output partitions
20/05/28 19:53:48 INFO DAGScheduler: Final stage: ResultStage 2 (count at WorkingWithAvroFile.scala:25)
20/05/28 19:53:48 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
20/05/28 19:53:48 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
20/05/28 19:53:48 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[6] at count at WorkingWithAvroFile.scala:25), which has no missing parents
20/05/28 19:53:48 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 10.9 KB, free 350.8 MB)
20/05/28 19:53:48 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 5.5 KB, free 350.8 MB)
20/05/28 19:53:48 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on Lenovo-PC:50976 (size: 5.5 KB, free: 351.2 MB)
20/05/28 19:53:48 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1163
20/05/28 19:53:48 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[6] at count at WorkingWithAvroFile.scala:25) (first 15 tasks are for partitions Vector(0))
20/05/28 19:53:48 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
20/05/28 19:53:48 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 7721 bytes)
20/05/28 19:53:48 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 27
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 17
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 14
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 20
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 22
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 28
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 12
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 13
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 30
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 21
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 9
20/05/28 19:53:49 INFO FileScanRDD: Reading File path: file:///C:/Spark_Files/userdata1.avro, range: 0-93561, partition values: [empty row]
20/05/28 19:53:49 INFO CodeGenerator: Code generated in 33.699831 ms
20/05/28 19:53:49 INFO BlockManagerInfo: Removed broadcast_1_piece0 on Lenovo-PC:50976 in memory (size: 6.0 KB, free: 351.3 MB)
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 19
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 7
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 23
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 16
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 11
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 15
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 26
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 8
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 29
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 10
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 25
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 18
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 5
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 6
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 24
20/05/28 19:53:49 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1638 bytes result sent to driver
20/05/28 19:53:49 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 618 ms on localhost (executor driver) (1/1)
20/05/28 19:53:49 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/05/28 19:53:49 INFO DAGScheduler: ShuffleMapStage 1 (count at WorkingWithAvroFile.scala:25) finished in 0.778 s
20/05/28 19:53:49 INFO DAGScheduler: looking for newly runnable stages
20/05/28 19:53:49 INFO DAGScheduler: running: Set()
20/05/28 19:53:49 INFO DAGScheduler: waiting: Set(ResultStage 2)
20/05/28 19:53:49 INFO DAGScheduler: failed: Set()
20/05/28 19:53:49 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[9] at count at WorkingWithAvroFile.scala:25), which has no missing parents
20/05/28 19:53:49 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 8.6 KB, free 350.8 MB)
20/05/28 19:53:49 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.4 KB, free 350.8 MB)
20/05/28 19:53:49 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on Lenovo-PC:50976 (size: 4.4 KB, free: 351.3 MB)
20/05/28 19:53:49 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1163
20/05/28 19:53:49 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[9] at count at WorkingWithAvroFile.scala:25) (first 15 tasks are for partitions Vector(0))
20/05/28 19:53:49 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
20/05/28 19:53:49 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, executor driver, partition 0, ANY, 7246 bytes)
20/05/28 19:53:49 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
20/05/28 19:53:49 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks including 1 local blocks and 0 remote blocks
20/05/28 19:53:49 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 36 ms
20/05/28 19:53:50 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1749 bytes result sent to driver
20/05/28 19:53:50 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 397 ms on localhost (executor driver) (1/1)
20/05/28 19:53:50 INFO DAGScheduler: ResultStage 2 (count at WorkingWithAvroFile.scala:25) finished in 0.477 s
20/05/28 19:53:50 INFO DAGScheduler: Job 1 finished: count at WorkingWithAvroFile.scala:25, took 1.384763 s
20/05/28 19:53:50 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
**Count:1000**
@QuickSilver-- continuing stack trace
**********
20/05/28 19:53:50 INFO FileSourceStrategy: Pruning directories with:
20/05/28 19:53:50 INFO FileSourceStrategy: Post-Scan Filters:
20/05/28 19:53:50 INFO FileSourceStrategy: Output Data Schema: struct<registration_dttm: string, id: bigint, first_name: string, last_name: string, email: string ... 11 more fields>
20/05/28 19:53:50 INFO FileSourceScanExec: Pushed Filters:
20/05/28 19:53:50 INFO deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
20/05/28 19:53:50 INFO AvroFileFormat: Compressing Avro output using the snappy codec
20/05/28 19:53:50 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
20/05/28 19:53:50 INFO CodeGenerator: Code generated in 22.138244 ms
20/05/28 19:53:50 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 220.3 KB, free 350.6 MB)
20/05/28 19:53:51 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 20.6 KB, free 350.6 MB)
20/05/28 19:53:51 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on Lenovo-PC:50976 (size: 20.6 KB, free: 351.2 MB)
20/05/28 19:53:51 INFO SparkContext: Created broadcast 5 from save at WorkingWithAvroFile.scala:27
20/05/28 19:53:51 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4287865 bytes, open cost is considered as scanning 4194304 bytes.
20/05/28 19:53:51 INFO SparkContext: Starting job: save at WorkingWithAvroFile.scala:27
20/05/28 19:53:51 INFO DAGScheduler: Got job 2 (save at WorkingWithAvroFile.scala:27) with 1 output partitions
20/05/28 19:53:51 INFO DAGScheduler: Final stage: ResultStage 3 (save at WorkingWithAvroFile.scala:27)
20/05/28 19:53:51 INFO DAGScheduler: Parents of final stage: List()
20/05/28 19:53:51 INFO DAGScheduler: Missing parents: List()
20/05/28 19:53:51 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[11] at save at WorkingWithAvroFile.scala:27), which has no missing parents
20/05/28 19:53:51 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 135.4 KB, free 350.4 MB)
20/05/28 19:53:51 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 48.6 KB, free 350.4 MB)
20/05/28 19:53:51 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on Lenovo-PC:50976 (size: 48.6 KB, free: 351.2 MB)
20/05/28 19:53:51 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1163
20/05/28 19:53:51 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[11] at save at WorkingWithAvroFile.scala:27) (first 15 tasks are for partitions Vector(0))
20/05/28 19:53:51 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
20/05/28 19:53:51 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, executor driver, partition 0, PROCESS_LOCAL, 7732 bytes)
20/05/28 19:53:51 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
20/05/28 19:53:51 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
**20/05/28 19:53:51 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro**
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/28 19:53:52 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/28 19:53:52 ERROR TaskSetManager: Task 0 in stage 3.0 failed 1 times; aborting job
20/05/28 19:53:52 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
20/05/28 19:53:52 INFO TaskSchedulerImpl: Cancelling stage 3
20/05/28 19:53:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 3: Stage cancelled
20/05/28 19:53:52 INFO DAGScheduler: ResultStage 3 (save at WorkingWithAvroFile.scala:27) failed in 0.880 s due to Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver)**: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro**
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
20/05/28 19:53:52 INFO DAGScheduler: Job 2 failed: save at WorkingWithAvroFile.scala:27, took 0.916810 s
20/05/28 19:53:52 ERROR FileFormatWriter: Aborting job 39b6cdd2-d53c-4cb0-afa4-8d21a34d4638.
**org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro**
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1891)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1879)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1878)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at dataFrameBasics.WorkingWithAvroFile$.main(WorkingWithAvroFile.scala:27)
at dataFrameBasics.WorkingWithAvroFile.main(WorkingWithAvroFile.scala)
Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/28 19:53:52 WARN FileUtil: Failed to delete file or dir [C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\.part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro.crc]: it still exists.
20/05/28 19:53:52 WARN FileUtil: Failed to delete file or dir [C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro]: it still exists.
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at dataFrameBasics.WorkingWithAvroFile$.main(WorkingWithAvroFile.scala:27)
at dataFrameBasics.WorkingWithAvroFile.main(WorkingWithAvroFile.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1891)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1879)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1878)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
... 21 more