Header 1
I imported Spark code to run in Eclipse and I'm getting build errors.
It works fine from the terminal.
Header 2
/* SampleApp.scala:
   This application simply counts the number of lines that contain "bash"
*/
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val txtFile = "file:///home/edureka/Desktop/readme.txt"
    val conf = new SparkConf().setMaster("local[2]").setAppName("Sample Application")
    val sc = new SparkContext(conf)
    val txtFileLines = sc.textFile(txtFile, 2).cache()
    val numAs = txtFileLines.filter(line => line.contains("bash")).count()
    println("Lines with bash: %s".format(numAs))
  }
}
Header 3
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/edureka/.ivy2/cache/org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/edureka/spark-1.1.1/assembly/target/scala-2.10/spark-assembly-1.1.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/08/16 17:00:16 WARN util.Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.211.130 instead (on interface eth2)
15/08/16 17:00:16 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/08/16 17:00:16 INFO spark.SecurityManager: Changing view acls to: edureka
15/08/16 17:00:16 INFO spark.SecurityManager: Changing modify acls to: edureka
15/08/16 17:00:16 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(edureka); users with modify permissions: Set(edureka)
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
at akka.actor.ActorCell$.<init>(ActorCell.scala:305)
at akka.actor.ActorCell$.<clinit>(ActorCell.scala)
at akka.actor.RootActorPath.$div(ActorPath.scala:152)
at akka.actor.LocalActorRefProvider.<init>(ActorRefProvider.scala:465)
at akka.remote.RemoteActorRefProvider.<init>(RemoteActorRefProvider.scala:124)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:191)
at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at scala.util.Success.flatMap(Try.scala:230)
at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:550)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1504)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1495)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:153)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:204)
at SimpleApp$.main(SampleApp.scala:14)
at SimpleApp.main(SampleApp.scala)
Be careful, this kind of problem happens quite often with Spark. If you don't want other surprises, you can build Spark yourself against the right versions of the dependencies you may be using (Guava, log4j, Scala, Jackson). Also, consider using the spark.driver.userClassPathFirst and spark.executor.userClassPathFirst properties to give your classpath priority over Spark's bundled dependencies. Personally, this only worked for me when passing them as parameters to spark-submit; it does not work when setting them in SparkConf (which makes sense).
Even with these properties set to true, you may still have problems because Spark uses a separate classloader, which can lead to issues even if your dependencies have the same version number. In that case, only building Spark manually will fix it (to my knowledge).
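For example, passing the two properties on the command line looks roughly like this (the class name and jar path are placeholders for your own application, not taken from the post above):
spark-submit \
  --class SimpleApp \
  --master local[2] \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  path/to/your-app.jar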
I actually did try installing Spark with all dependencies and running the code, and it did work. The main point was to set up the directory structure correctly: create a project, create the file structure src/main/scala inside it, and put the actual program (code) file, code.scala, in there. The .sbt dependencies file should be at the root of the project directory. Thanks @Dici
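As a sketch of that layout (the project name and file name are just placeholders), the tree looks like this:
SampleProject/
  build.sbt
  src/
    main/
      scala/
        code.scala
And a minimal build.sbt, assuming Spark 1.1.1 and Scala 2.10 as in the logs above:
name := "Sample Application"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"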
Related
I installed a Hadoop + Spark cluster on some servers.
Writing Scala code in the spark-shell on the master server works fine.
I put the Spark library (the jar files) in my project and am writing my first Scala code on my own computer through IntelliJ.
When I run a simple program that just creates a SparkContext object for reading a file from HDFS through the hdfs protocol, it outputs error messages.
The test function:
import org.apache.spark.SparkContext

class SpcDemoProgram {

  def demoPrint(): Unit = {
    println("class spe demoPrint")
    test()
  }

  def test() {
    var spark = new SparkContext()
  }
}
The error messages are:
20/11/02 12:36:26 INFO SparkContext: Running Spark version 3.0.0
20/11/02 12:36:26 WARN Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:78)
at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1664)
at org.apache.hadoop.security.SecurityUtil.setConfigurationInternal(SecurityUtil.java:104)
at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:88)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:316)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:304)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1828)
at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2412)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2412)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:303)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:120)
at scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14)
at scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9)
at scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50)
at scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
... 19 more
20/11/02 12:36:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/02 12:36:27 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:380)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:120)
at scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14)
at scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9)
at scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50)
at scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala)
20/11/02 12:36:27 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:380)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:120)
at scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14)
at scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9)
at scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50)
at scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala)
Does that error message imply that Hadoop and Spark must be installed on my computer?
What configuration do I need to set up?
I assume you are trying to read a file with a path like hdfs://<FILE_PATH>. If so, yes, you need to have Hadoop installed; if it's just a local directory, you could try the file path without the "hdfs://" prefix.
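As a minimal sketch of the two cases (the master URL, namenode host, and paths below are placeholders, not from the original post):
import org.apache.spark.{SparkConf, SparkContext}

object PathDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PathDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Local file: no hdfs:// prefix and no Hadoop installation required
    val localLines = sc.textFile("/tmp/readme.txt")

    // HDFS file: needs a reachable namenode and the Hadoop client configuration
    val hdfsLines = sc.textFile("hdfs://namenode-host:9000/user/demo/readme.txt")

    println(localLines.count())
    sc.stop()
  }
}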
When I run my Spark app using sbt run with the configuration pointing to the master of a remote cluster, nothing useful gets executed by the workers, and the following warning is printed repeatedly in the sbt run log.
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
This is what my Spark config looks like:
@transient lazy val conf: SparkConf = new SparkConf()
  .setMaster("spark://master-ip:7077")
  .setAppName("HelloWorld")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "12g")
@transient lazy val sc: SparkContext = new SparkContext(conf)
val lines = sc.textFile("hdfs://master-public-dns:9000/test/1000.csv")
I know this warning usually appears when the cluster is misconfigured and the workers either don't have the resources or weren't started in the first place. However, according to my Spark UI (on master-ip:8080), the worker nodes seem to be alive with sufficient RAM and CPU cores; they even try to execute my app, but they exit and leave this in the stderr log:
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled;
users with view permissions: Set(ubuntu, myuser);
groups with view permissions: Set(); users with modify permissions: Set(ubuntu, myuser); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
...
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
... 8 more
ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Any ideas?
Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
Could you telnet to this port on this IP from the worker? Maybe your driver machine has multiple network interfaces; try setting SPARK_LOCAL_IP in $SPARK_HOME/conf/spark-env.sh.
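For example, in $SPARK_HOME/conf/spark-env.sh on the driver machine (the address below is only a placeholder; use whichever interface the workers can actually reach):
# Bind the driver to an address reachable from the worker nodes
export SPARK_LOCAL_IP=10.0.0.5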
Problem summary:
I am unable to read from nested subdirectories in my Spark program, despite setting the required Hadoop configuration (see "Attempted" below).
I get the error pasted below.
Any help is appreciated.
Version:
Spark 2.2.0
Input directory layout:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939225073/part-00000-3a44cd00-e895-4a01-9ab9-946064b739d4-c000.parquet
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939234036/part-00000-cbd47353-0590-4cc1-b10d-c18886df1c25-c000.parquet
...
Input directory parameter passed:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/*/*
Attempted (1):
Set parameter in code...
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()
//Recursive glob support & loglevel
import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.setBoolean("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", true)
I did not see the configuration in place in the Spark UI.
Attempted (2):
Passed the config from the CLI - spark-submit, and set it in code (see below).
spark-submit --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \...
I do see the configuration in the Spark UI, but I get the same error – it cannot traverse into the directory structure.
Code:
//Spark Session
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()

//Recursive glob support
val conf = new SparkConf()
val cliRecursiveGlobConf = conf.get("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive")

import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", cliRecursiveGlobConf)
Error & overall output:
Full error is at - https://gist.github.com/airawat/77fbdb821410a5a87dfd29ffaf60fdf9
17/08/18 15:59:29 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" java.io.FileNotFoundException: File /user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=*/* does not exist.
I used this code, and my error is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/03 20:39:24 INFO SparkContext: Running Spark version 2.1.0
17/02/03 20:39:25 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/02/03 20:39:25 WARN SparkConf: Detected deprecated memory fraction
settings: [spark.storage.memoryFraction]. As of Spark 1.6, execution and
storage memory management are unified. All memory fractions used in the old
model are now deprecated and no longer read. If you wish to use the old
memory management, you may explicitly enable `spark.memory.useLegacyMode`
(not recommended).
17/02/03 20:39:25 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your
configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at PCA$.main(PCA.scala:26)
at PCA.main(PCA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
17/02/03 20:39:25 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at PCA$.main(PCA.scala:26)
at PCA.main(PCA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Process finished with exit code 1
If you are running Spark standalone, then
val conf = new SparkConf().setMaster("spark://master") //missing
or you can pass the parameter when submitting the job:
spark-submit --master spark://master
If you are running Spark in local mode, then
val conf = new SparkConf().setMaster("local[2]") //missing
or you can pass the parameter when submitting the job:
spark-submit --master local
If you are running Spark on YARN, then
spark-submit --master yarn
The error message is pretty clear: you have to provide the address of the Spark master node, either via the SparkConf or via spark-submit:
val conf =
  new SparkConf()
    .setAppName("ClusterScore")
    .setMaster("spark://172.1.1.1:7077") // <--- This is what's missing
    .set("spark.storage.memoryFraction", "1")
val sc = new SparkContext(conf)
val configuration = new SparkConf()
  .setAppName("Your Application Name")
  .setMaster("local")
val sc = new SparkContext(configuration)
It will work...
Most probably you are using the Spark 2.x API in Java.
Use a code snippet like this to avoid the error. This applies when you are running Spark standalone on your computer, using the Shade plug-in to bundle all the runtime libraries on your computer.
SparkSession spark = SparkSession.builder()
.appName("Spark-Demo")//assign a name to the spark application
.master("local[*]") //utilize all the available cores on local
.getOrCreate();
A task that works in Spark local mode is not working in a standalone cluster running on the same machine.
The only difference is:
local[*]
vs
spark://<host>.local:7077
for the master
I am able to run SparkPi against the master at the above address and also use the Spark GUI, so the master address is generally working for Spark.
Here is the (normal) spark init code:
val sconf = new SparkConf().setMaster(master).setAppName("EpisCatalog")
val sc = new SparkContext(sconf)
Here is the stacktrace from running the program:
15/12/03 03:39:04.746 main WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/03 03:39:07.706 main WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/03 03:39:27.739 appclient-registration-retry-thread ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@b649f0b rejected from java.util.concurrent.ThreadPoolExecutor@5ef7a52b[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:103)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:102)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:102)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:128)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:139)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1130)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:131)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am running Spark 1.6.0-SNAPSHOT. It has been "installed" to the local Maven repo, and I have verified that the client is using the latest local Maven repo version.
I had the same problem. It could be solved by using the full host URL (which can be found on the master web UI, port 18080) instead of just the hostname or localhost.
So I had to use mymachine.mycompany.org instead of mymachine.
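In other words, something along these lines (the host name is of course specific to my setup; 7077 is the default standalone master port):
import org.apache.spark.{SparkConf, SparkContext}

val sconf = new SparkConf()
  .setMaster("spark://mymachine.mycompany.org:7077")
  .setAppName("EpisCatalog")
val sc = new SparkContext(sconf)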
I got the same problem, and in my case there was a version mismatch: my Spark driver was written against version 1.5.1 while the Spark cluster was set up on 1.6.0.
Maybe you deployed the cluster on the stable version, which at that time was 1.5.1.
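One way to check for this is to make sure the spark-core dependency in your build matches the version the cluster is actually running; for example, with sbt (the versions below are just the ones from my particular case):
// build.sbt – keep the driver's Spark version in sync with the cluster
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"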