Why df write console format not showing anything? - scala

I have a static dataframe, how to write it to the console instead of using df.show()
val sparkConfig = new SparkConf().setAppName("streaming-vertica").setMaster("local[2]")
val sparkSession = SparkSession.builder().master("local[2]").config(sparkConfig).getOrCreate()
val sc = sparkSession.sparkContext
val rows = sc.parallelize(Array(
Row(1,"hello", true),
Row(2,"goodbye", false)
))
val schema = StructType(Array(
StructField("id",IntegerType, false),
StructField("sings",StringType,true),
StructField("still_here",BooleanType,true)
))
val df = sparkSession.createDataFrame(rows, schema)
df.write
.format("console")
.mode("append")
This is writing nothing into console:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/04/27 00:30:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Process finished with exit code 0
On using save :
df.write
.format("console")
.mode("append")
.save()
It gives :
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/04/27 00:45:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:473)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at rep.StaticDFWrite$.main(StaticDFWrite.scala:35)
at rep.StaticDFWrite.main(StaticDFWrite.scala)
Spark version = 2.2.1
scala version = 2.11.12

You have to call save on DataFrameWriter object.
without save method it will just create DataFrameWriter object & terminate your session.
Check below code, I have checked in spark-shell.
Please note this code is working on spark version 2.4.0 but not working on 2.2.0
console format will not work with write in spark 2.2.0 - https://issues.apache.org/jira/browse/SPARK-20599
scala> df.write.format("console").mode("append")
res5: org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row] = org.apache.spark.sql.DataFrameWriter#148a3112
scala> df.write.format("console").mode("append").save()
+--------+---+
| name|age|
+--------+---+
|srinivas| 20|
+--------+---+

Related

How to fix "java.lang.reflect.InaccessibleObjectException"

I'm traying to create a simple Spark Session, that read a csv file. But it show me this error :
Exception in thread "main" java.lang.reflect.InaccessibleObjectException: Unable to make field private transient java.lang.String java.net.URI.scheme accessible: module java.base does not "opens java.net" to unnamed module #bcec361
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
at org.apache.spark.util.SizeEstimator$$anonfun$getClassInfo$3.apply(SizeEstimator.scala:336)
at org.apache.spark.util.SizeEstimator$$anonfun$getClassInfo$3.apply(SizeEstimator.scala:330)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:330)
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:222)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:201)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:69)
at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$1.weigh(FileStatusCache.scala:109)
at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$1.weigh(FileStatusCache.scala:107)
at org.spark_project.guava.cache.LocalCache$Segment.setValue(LocalCache.java:2222)
at org.spark_project.guava.cache.LocalCache$Segment.put(LocalCache.java:2944)
at org.spark_project.guava.cache.LocalCache.put(LocalCache.java:4212)
at org.spark_project.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:152)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$listLeafFiles$2.apply(InMemoryFileIndex.scala:130)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$listLeafFiles$2.apply(InMemoryFileIndex.scala:128)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:128)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:91)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:67)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$createInMemoryFileIndex(DataSource.scala:533)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:371)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:714)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:686)
at org.example.SparkSessionTest$.main(SparkSessionTest.scala:19)
at org.example.SparkSessionTest.main(SparkSessionTest.scala)
This is my code :
import org.apache.spark.sql.SparkSession
object SparkSessionTest {
def main(args:Array[String]): Unit ={
val spark = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate();
println("First SparkContext:")
println("APP Name :"+spark.sparkContext.appName);
println("Deploy Mode :"+spark.sparkContext.deployMode);
println("Master :"+spark.sparkContext.master);
val df = spark.read.text("src/data/test.txt")
}
}
I'm using :
jdk-11.0.15.1
scala 2.12.10
spark 3.1.3
And im using Intellij IDE and maven
If you are using intelliJ,
go to file => project structure => Project
check if you have SDK set to 1.8 else download it.

Running Scala Jar with Spark-Submit

I've compiled a spark-scala script to a JAR and I want to run it with spark-submit. But I'm having this error:
2020-01-07 13:03:02,190 WARN util.Utils: Your hostname, nifi resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
2020-01-07 13:03:02,192 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
2020-01-07 13:03:03,109 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-01-07 13:03:03,826 WARN deploy.SparkSubmit$$anon$2: Failed to load hello.
java.lang.ClassNotFoundException: hello
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:806)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2020-01-07 13:03:03,857 INFO util.ShutdownHookManager: Shutdown hook called
2020-01-07 13:03:03,858 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a8cc1ba6-3643-4646-82a3-4b44f4487105
This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
object hello {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("quest9")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
import spark.implicits._
val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
census.createOrReplaceTempView("census")
zip_codes.createOrReplaceTempView("zip")
val query = census.as("census").join((zip_codes.where($"City" === "Inglewood").where($"County" === "Los Angeles").as("zip")),Seq("Zip_Code"),"inner").select($"census.Total_Males".as("male"),$"census.Total_Females".as("female")).distinct()
query.show()
val queryR = query.repartition(5)
queryR.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
sc.stop()
}
}
I think my problem is that im using scala object instead of a class, but I'm not sure.
I run the spark-submit like this
spark-submit \
--class hello \
/home/hdfs/IdeaProjects/untitled/out/artifacts/quest_jar/quest.jar
Anyone solved this error before?
I think you need to specify a package name for both spark-submit and your object.
For instance :
spark-submit \
--class com.my.package.hello \
/home/hdfs/IdeaProjects/untitled/out/artifacts/quest_jar/quest.jar
and
package com.my.package
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
object hello {
...
}

Test Spark Scala with Maven Got Error: java.lang.NoClassDefFoundError

I tried to Test Spark Scala on Scala IDE (eclipse) with Maven but keep getting error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:904)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:7)
at com.SimpleApp.main(SimpleApp.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more
The program I try is the Quick Start code, from the Spark Documentation:
import org.apache.spark.sql.SparkSession
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
spark.stop()
}
}
I use Spark 2.2.0 and Scala 2.11.7. The pom.xml file is:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
I followed a solution from another thread: NoClassDefFoundError com.apache.hadoop.fs.FSDataInputStream when execute spark-shell
But it doesn't work for me. The content in my spark-env.sh file is:
# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/local/hadoop/etc/hadoop classpath)
Could anybody help me with this? Appreciate your help.
Devesh's answer solve parts of my problem. However, I got other problems:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/08/17 10:34:03 INFO SparkContext: Running Spark version 2.2.0
18/08/17 10:34:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/17 10:34:03 WARN Utils: Your hostname, toshiba0 resolves to a loopback address: 127.0.1.1; using 192.168.1.217 instead (on interface wlp2s0)
18/08/17 10:34:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/08/17 10:34:03 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:11)
at com.SimpleApp.main(SimpleApp.scala)
18/08/17 10:34:03 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.SimpleApp$.main(SimpleApp.scala:11)
at com.SimpleApp.main(SimpleApp.scala)
I don't know why Spark says my loopback address is 127.0.1.1, I checked my configuration: /etc/network/interfaces, it's auto loopback, and I ping 127.0.0.1. It works.
I followed the solution from this link Error initializing SparkContext: A master URL must be set in your configuration
and put the following code, because I use my laptop. It still doesn't work.
val conf = new SparkConf().setMaster("local[2]")
Don't know what happen to my settings. Thank you!
Just add following in maven pom.xml file
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.0</version>
</dependency>
In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark whereas in Spark 2.0 onwards the same effects can be achieved through SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, as they’re encapsulated within the SparkSession
** sample code snippet:-**
import org.apache.spark.sql.SparkSession
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // some file on system
val spark = SparkSession
.builder
.appName("Simple Application")
.master("local[2]")
.getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
}
}

Running first program in Spark

I am trying to run my first program in Spark with scala. Trying to read a csv file and display.
Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark._
import java.io._
import org.apache.spark.SparkContext._
import org.apache.log4j._
object df extends App{
val spark=SparkSession.builder().getOrCreate()
val drf=spark.read.csv("C:/Users/admin/Desktop/scala-datasets/Scala-and-
Spark-Bootcamp-master/Spark DataFrames/CitiGroup2006_2008")
drf.head(5)
}
Getting the following error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/04/29 23:10:53 INFO SparkContext: Running Spark version 2.1.0
17/04/29 23:10:56 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/04/29 23:10:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your
configuration at org.apache.spark.SparkContext.<init>
(SparkContext.scala:379)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
at df$.delayedEndpoint$df$1(df.scala:11)
at df$delayedInit$body.apply(df.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at df$.main(df.scala:9)
at df.main(df.scala)
Any suggestions would be helpful
You missed .master() function call. for example if you want to run in local mode following is the solution :
object df extends App{
val spark=SparkSession.builder().master("local").getOrCreate()
val drf=spark.read.csv("C:/Users/admin/Desktop/scala-datasets/Scala-and-
Spark-Bootcamp-master/Spark DataFrames/CitiGroup2006_2008")
drf.head(5)
}
And the error log clearly says that
17/04/29 23:10:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your
configuration at org.apache.spark.SparkContext.<init>
(SparkContext.scala:379)
Hope it helps
As previous comment said you should setup master for your spark context, in your case it should be local[1] or local[*]. Also you should set a appName.
You can avoid master and appName specification via code using spark-submit with keys.
import org.apache.spark.sql.SparkSession
object df extends App{
override def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
val drf = spark.read.csv("C:/Users/admin/Desktop/scala-datasets/Scala-and-Spark-Bootcamp-master/Spark DataFrames/CitiGroup2006_2008")
drf.head(5)
}
}

Apache Spark NoSuchMethodError using .reduceByKey()

I'm trying to run some examples in Apache Spark to learn more about it, but when I try to do it (in spark-shell) I'm receiving the error:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)V
There's the full execution and the error trace. I wish you could help me.
pcitbu#pcitbumint /usr/spark/spark-2.0.1-bin-hadoop2.7 $ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/10/25 09:52:38 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '3').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.
16/10/25 09:52:38 WARN Utils: Your hostname, pcitbumint resolves to a loopback address: 127.0.1.1; using 192.168.0.119 instead (on interface ens33)
16/10/25 09:52:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/10/25 09:52:38 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.0.119:4040
Spark context available as 'sc' (master = local[*], app id = local-1477381958561).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val file = sc.textFile("README.md")
file: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> val counts = file.flatMap(line => line.split(" "))
counts: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26
scala> .map(word => (word, 1))
res0: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:29
scala> .reduceByKey(_ + _)
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)V
at org.apache.hadoop.hdfs.HdfsConfiguration.addDeprecatedKeys(HdfsConfiguration.java:66)
at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:31)
at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:116)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1440)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:124)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:563)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$29.apply(SparkContext.scala:992)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$29.apply(SparkContext.scala:992)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:146)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:328)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:327)
... 48 elided
please check your hadoop version. More particularly the version of hadoop-common-x.x.x.jar
Couple of points:
Did you built this spark source yourself, if yes check the hadoop version you built with.
If you've a pre-built version, please see the version of hadoop-common inside your install dir i.e. /usr/spark/spark-2.0.1-bin-hadoop2.7