Must the Scala-Spark developer install Spark and Hadoop on his computer?

I have installed a Hadoop + Spark cluster on the servers.
Writing Scala code in the spark-shell on the master server works fine.
I put the Spark library (the jar files) into my project and I'm writing my first Scala code on my computer in IntelliJ.
When I run a simple program that just creates a SparkContext object to read a file from HDFS over the hdfs protocol, it outputs error messages.
The test function:
import org.apache.spark.SparkContext
class SpcDemoProgram {
  def demoPrint(): Unit = {
    println("class spe demoPrint")
    test()
  }

  def test(): Unit = {
    var spark = new SparkContext()
  }
}
The error messages are:
20/11/02 12:36:26 INFO SparkContext: Running Spark version 3.0.0
20/11/02 12:36:26 WARN Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
  at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
  at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
  at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:78)
  at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1664)
  at org.apache.hadoop.security.SecurityUtil.setConfigurationInternal(SecurityUtil.java:104)
  at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:88)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:316)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:304)
  at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1828)
  at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
  at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
  at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
  at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2412)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2412)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:303)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:120)
  at scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14)
  at scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9)
  at scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50)
  at scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
  at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
  at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
  ... 19 more
20/11/02 12:36:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/02 12:36:27 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:380)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:120)
  at scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14)
  at scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9)
  at scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50)
  at scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala)
20/11/02 12:36:27 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:380)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:120)
  at scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14)
  at scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9)
  at scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50)
  at scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala)
Do these error messages imply that Hadoop and Spark must be installed on my computer?
What configuration do I need?

I assume you are trying to read the file with a path like hdfs://<FILE_PATH>. In that case, yes, you need to have Hadoop installed. If it's just a local directory, you could try without "hdfs://" in the file path.
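For illustration only (the app name, NameNode host, port, and file path below are placeholders, not taken from the question), a driver run from IntelliJ also needs a master URL before it can read from HDFS; a minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

object HdfsReadSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs everything inside the IDE JVM;
    // use spark://<master-host>:7077 or yarn to run against the cluster instead.
    val conf = new SparkConf()
      .setAppName("HdfsReadSketch")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Fully qualified HDFS URI; host and port are placeholders.
    val lines = sc.textFile("hdfs://namenode-host:9000/path/to/file.txt")
    println(lines.count())

    sc.stop()
  }
}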

Related

HiveException when running a sql example in Spark shell

A newbie in Apache Spark here! I am using Spark 2.4.0 and Scala version 2.11.12, and I'm trying to run the following code in my spark shell:
import org.apache.spark.sql.SparkSession
import spark.implicits._
var df = spark.read.json("storesales.json")
df.createOrReplaceTempView("storesales")
spark.sql("SELECT * FROM storesales")
And I get the following error -
2018-12-18 07:05:03 WARN Hive:168 - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
  at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
  at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
  at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
  at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
  at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
  at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
I also saw this question, Issues trying out example in Spark-shell, and as per the accepted answer I tried to start my spark shell like so:
~/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --conf spark.sql.warehouse.dir=file:///tmp/spark-warehouse
However, it did not help and the issue persists.
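(For reference, the same warehouse setting can also be applied programmatically when the session is built. The sketch below is illustrative only, assumes the Hive classes are on the classpath, and uses a placeholder path; it mirrors the --conf flag rather than being a confirmed fix for the metastore error.)

import org.apache.spark.sql.SparkSession

// Illustrative sketch: set the warehouse directory while building the session.
val spark = SparkSession.builder()
  .appName("StoreSalesDemo")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
  .enableHiveSupport()   // requires spark-hive on the classpath (bundled with the pre-built distribution)
  .getOrCreate()

val df = spark.read.json("storesales.json")
df.createOrReplaceTempView("storesales")
spark.sql("SELECT * FROM storesales").show()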

How to use s3 with Apache spark 2.2 in the Spark shell

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.
I have consulted the following resources:
Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
Hortonworks Spark 1.6 and S3
Cloudera
Custom s3 endpoints
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I replaced access-key and secret-key):
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
And here is the error that results:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
And here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
After that, when you try to load data from the S3 bucket in the shell, you will be able to do so.
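As a sketch of how the shell session might then look (the bucket path is taken from the question; the credential values are placeholders, and they can equally stay in conf/spark-defaults.conf as shown above):

// Inside a spark-shell started with matching hadoop-aws / aws-java-sdk jars.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "access-key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret-key")

val p = spark.read.textFile("s3a://sparkcookbook/person")
p.show(5)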

Error initializing SparkContext: A master URL must be set in your configuration

I used this code
My error is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/03 20:39:24 INFO SparkContext: Running Spark version 2.1.0
17/02/03 20:39:25 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/02/03 20:39:25 WARN SparkConf: Detected deprecated memory fraction
settings: [spark.storage.memoryFraction]. As of Spark 1.6, execution and
storage memory management are unified. All memory fractions used in the old
model are now deprecated and no longer read. If you wish to use the old
memory management, you may explicitly enable `spark.memory.useLegacyMode`
(not recommended).
17/02/03 20:39:25 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your
configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at PCA$.main(PCA.scala:26)
at PCA.main(PCA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
17/02/03 20:39:25 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at PCA$.main(PCA.scala:26)
at PCA.main(PCA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Process finished with exit code 1
If you are running Spark standalone, then
val conf = new SparkConf().setMaster("spark://master") // missing
or you can pass the parameter when submitting the job:
spark-submit --master spark://master
If you are running Spark locally, then
val conf = new SparkConf().setMaster("local[2]") // missing
or you can pass the parameter when submitting the job:
spark-submit --master local
If you are running Spark on YARN, then
spark-submit --master yarn
The error message is pretty clear: you have to provide the address of the Spark master node, either via the SparkContext or via spark-submit:
val conf = new SparkConf()
  .setAppName("ClusterScore")
  .setMaster("spark://172.1.1.1:7077")       // <--- This is what's missing
  .set("spark.storage.memoryFraction", "1")

val sc = new SparkContext(conf)
val conf = new SparkConf()
  .setAppName("Your Application Name")
  .setMaster("local")

val sc = new SparkContext(conf)
It will work...
Most probably you are using the Spark 2.x API in Java.
Use a code snippet like this to avoid the error. This applies when you are running Spark standalone on your computer using the Shade plugin, which bundles all the runtime libraries into your application jar.
SparkSession spark = SparkSession.builder()
    .appName("Spark-Demo")   // assign a name to the spark application
    .master("local[*]")      // utilize all the available cores on local
    .getOrCreate();
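Since this page is about Scala, a roughly equivalent Scala sketch (assuming Spark 2.x on the classpath; the app name is illustrative) would be:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark-Demo")   // name shown in the Spark UI
  .master("local[*]")      // use all available local cores
  .getOrCreate()

val sc = spark.sparkContext  // the underlying SparkContext, if needed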

Cannot load main class from JAR file

I have a Spark-Scala application. I tried to display a simple message, "Hello my App". When I compile it with sbt compile and run it with sbt run, it's fine: my message is displayed successfully, but it also displays an error like this:
Hello my application!
16/11/27 15:17:11 ERROR Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException
ERROR ContextCleaner: Error in cleaning thread
java.lang.InterruptedException
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:67)
16/11/27 15:17:11 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
[success] Total time: 13 s, completed Nov 27, 2016 3:17:12 PM
16/11/27 15:17:12 INFO DiskBlockManager: Shutdown hook called
I can't understand whether this is fine or not!
Also, when I try to submit my jar file after the run, it displays an error.
My command line looks like:
spark-submit "appfilms" --master local[4] target/scala-2.11/system-of-recommandation_2.11-1.0.jar
And the error is:
Error: Cannot load main class from JAR file:/root/projectFilms/appfilms
Run with --help for usage help or --verbose for debug output
16/11/27 15:24:11 INFO Utils: Shutdown hook called
Please can you answer me!
The error is due to the fact that the SparkContext is not stopped; this is required in versions higher than Spark 2.x.
It should be stopped to prevent this error by calling SparkContext.stop(), i.e. sc.stop(). Inspiration for solving this error comes from my own experience and the following sources: Spark Context, Spark Listener Bus error.
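A minimal sketch of that shutdown pattern (the object and app names are illustrative, not from the question):

import org.apache.spark.{SparkConf, SparkContext}

object HelloApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HelloApp").setMaster("local[4]"))
    println("Hello my application!")
    sc.stop()  // stop the context cleanly so the listener bus and cleaner threads shut down without errors
  }
}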
You forgot to use the --class parameter:
spark-submit "appfilms" --master local[4] target/scala-2.11/system-of-recommandation_2.11-1.0.jar
should be
spark-submit --class "appfilms" --master local[4] target/scala-2.11/system-of-recommandation_2.11-1.0.jar
Please note that if appfilms belongs to a package, don't forget to add the package name, as below:
packagename.appfilms
I believe this will suffice.

java.util.concurrent.RejectedExecutionException in Spark although driver/client has precisely same version as Server

A task that works in Spark local mode is not working on a standalone cluster running on the same machine.
The only difference is:
local[*]
vs
spark://<host>.local:7077
for the master
I am able to run SparkPi against the master at the above address and also use the Spark GUI, so the master address is generally working for Spark.
Here is the (normal) spark init code:
val sconf = new SparkConf().setMaster(master).setAppName("EpisCatalog")
val sc = new SparkContext(sconf)
Here is the stacktrace from running the program:
15/12/03 03:39:04.746 main WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/03 03:39:07.706 main WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/03 03:39:27.739 appclient-registration-retry-thread ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@b649f0b rejected from java.util.concurrent.ThreadPoolExecutor@5ef7a52b[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:103)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:102)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:102)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:128)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:139)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1130)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:131)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am running Spark 1.6.0-SNAPSHOT. It has been "installed" to the local Maven repo, and I have verified that the client is using the latest local Maven repo version.
I had the same problem. It could be solved by using the full host URL (which can be found on the Master Web UI, port 18080) instead of just the hostname or localhost.
So I had to use mymachine.mycompany.org instead of mymachine.
I got the same problem, and in my case it was a version mismatch: my Spark driver was written against version 1.5.1 and the Spark cluster was set up with 1.6.0.
Maybe you deployed the cluster on the stable version, which at that time was 1.5.1.
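If the driver is built with sbt, one way to avoid this kind of mismatch is to pin the Spark dependency to the cluster's exact version; a minimal build.sbt sketch (the version numbers are illustrative, not from the question):

// build.sbt -- keep the Spark dependency in sync with the version running on the cluster
scalaVersion := "2.10.6"   // Scala version matching the cluster's Spark build

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"   // must match the cluster's Spark version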