External dependency for spark job

External dependency for spark job - pyspark

I am new to big data technologies. I have to run a Spark job in cluster mode on EMR. The job is written in Python and depends on several libraries and some other tools. I have already written the script and run it in local client mode, but it fails with a dependency issue when I try to run it on YARN. How do I manage these dependencies?
Log:
"/mnt/yarn/usercache/hadoop/appcache/application_1511680510570_0144/container_1511680510570_0144_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport
__import__(name)
ImportError: ('No module named boto3', <function subimport at 0x7f8c3c4f9c80>, ('boto3',))
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

It seems you have not installed the boto3 library.
Download a compatible version and install it using one of the following:
$ pip install boto3
or
$ python -m pip install --user boto3
Hope this helps. You can refer to https://github.com/boto/boto3
Otherwise, it seems you have not installed boto3 on all executors (nodes). Since you are running Spark, the Python code runs partly on the driver and partly on the executors. On YARN, you need to install the library on all nodes.
To install it on every node, please refer to: How to bootstrap installation of Python modules on Amazon EMR?
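To confirm which nodes are missing a module before changing the cluster, the import check can be run inside Spark tasks. This is a minimal sketch, assuming Python 3 on the cluster and an existing SparkContext passed in as `sc`; the helper names `missing_modules` and `check_executors` and the partition count are hypothetical, and nothing guarantees that every node receives a task.

```python
import importlib.util
import socket


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def check_executors(sc, names, num_partitions=8):
    """Run the import check inside Spark tasks.

    `sc` is an existing SparkContext. Returns sorted (hostname, missing)
    pairs; `num_partitions` is an arbitrary choice for illustration.
    """
    def probe(_partition):
        # Executed on an executor: report where we are and what is missing.
        yield (socket.gethostname(), tuple(missing_modules(names)))

    pairs = (
        sc.parallelize(range(num_partitions), num_partitions)
        .mapPartitions(probe)
        .collect()
    )
    return sorted(set(pairs))
```

Calling `check_executors(sc, ["boto3"])` from the driver script would list each responding host together with the modules it could not find; an empty tuple means the import should succeed there.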

Yes, you can:
aws emr create-cluster --bootstrap-actions Path=<>,Name=BootstrapAction1,Args=[arg1,arg2].. --auto-terminate
Please refer to http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-bootstrap.html#bootstrapUses
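The same bootstrap action can also be described from Python. This is a minimal sketch, assuming the cluster is created through boto3's EMR `run_job_flow` call, which accepts a `BootstrapActions` list; the action name, S3 script path, and argument below are hypothetical placeholders, and the block only builds the data structure without contacting AWS.

```python
def bootstrap_action(name, script_path, args=()):
    """Build one entry for the BootstrapActions list of boto3's EMR run_job_flow."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_path, "Args": list(args)},
    }


# Hypothetical script location and package name; replace with your own.
actions = [
    bootstrap_action(
        "InstallPythonDeps",
        "s3://my-bucket/bootstrap/install_deps.sh",
        ["boto3"],
    )
]

# With boto3 installed and AWS credentials configured, the list would be passed as
#   boto3.client("emr").run_job_flow(..., BootstrapActions=actions, ...)
# The other required arguments (Name, Instances, roles, ...) are omitted here.
```

The script at the given path runs on every node while the cluster is being provisioned, which is what makes the library available to the executors as well as the driver.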

Related

spark & zeppelin problems with integration

I want to connect my locally installed Zeppelin 0.10.0 to a locally installed Spark 3.2.0 (the same procedure worked with Spark 2.3.0). But it looks like Zeppelin has its own internal Spark and uses it every time I try. I have gone through the settings for the Spark interpreter, to no avail.
I just want to know whether there is any way to replace the default internal Spark that Zeppelin uses with the Spark 3.2.0 I want to use.
I set SPARK_HOME as instructed and spark.master to local[*], and received the following error:
org.apache.zeppelin.interpreter.InterpreterException: java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:76)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:833)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:741)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.FIFOScheduler.lambda$runJobInScheduler$0(FIFOScheduler.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
at org.apache.zeppelin.spark.SparkScala212Interpreter.open(SparkScala212Interpreter.scala:66)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:121)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
... 8 more

I've run into the same issue myself: you can't run Spark 3.2.0 on Zeppelin 0.10.0. Spark 3.1.2 works without any issues, and Zeppelin has Spark 2.4.5 included. This is a problem with the tool itself.
According to ticket ZEPPELIN-5565, version 0.10.0 DOES NOT support Spark 3.2.0. This should be fixed in 0.10.1 and 0.11.0 (info from the mentioned ticket; I've also checked the GitHub repo).
The pull request that fixes this issue is much longer, but Zeppelin 0.10.0 contains this key line:
public static final SparkVersion UNSUPPORTED_FUTURE_VERSION = SPARK_3_2_0;
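To illustrate what such a constant implies, here is a minimal Python sketch of a version guard. It is an assumption that the constant acts as an exclusive upper bound; the quoted line suggests this but does not show the check itself. The names `parse_version` and `is_rejected` are hypothetical and are not Zeppelin's API.

```python
def parse_version(text):
    """Turn '3.2.0' into (3, 2, 0); suffixes such as '-SNAPSHOT' are not handled."""
    return tuple(int(part) for part in text.split("."))


# Modeled on the constant quoted above; the comparison rule is an assumption.
UNSUPPORTED_FUTURE_VERSION = parse_version("3.2.0")


def is_rejected(spark_version):
    """True when the version is at or beyond the unsupported-future threshold."""
    return parse_version(spark_version) >= UNSUPPORTED_FUTURE_VERSION
```

Under this reading, Spark 3.1.2 and 2.4.5 pass the guard while 3.2.0 does not, which matches the behaviour described in this answer.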

NoSuchMethod when trying to execute HelloWorld in Scala on Zeppelin with local JAR as dependency

I have a problem running a local .jar on Zeppelin. I'm adding the dependency jar via this guide, but when I go to the notebook and try to execute
println("Hi")
I get the stack trace listed below:
java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
at scala.tools.nsc.settings.MutableSettings.loop$1(MutableSettings.scala:64)
at scala.tools.nsc.settings.MutableSettings.processArguments(MutableSettings.scala:91)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:706)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When there are no dependencies in the interpreter, everything works fine.
I know this might be a Scala dependency issue, but I've tried different versions of Scala and that didn't help.
Also, util.Properties.versionString in the Zeppelin notebook returns res1: String = version 2.11.8, which is the same version as in my test .jar file.

How to set native library path on cloudera spark yarn cluster mode

I want to use the jep library in my Spark job. Spark is running in yarn-cluster mode. I am using cdh58.
I am getting this at run time:
java.lang.UnsatisfiedLinkError: no jep in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at jep.Jep.<clinit>(Jep.java:217)
I tried passing it through spark.driver.java-opts and spark.executor.java-opts, but that didn't help. I also tried setting it in the spark-env.sh and hadoop-env.sh files; it didn't work. I tried setting it in mapreduce.map.env and mapreduce.map.child.env and restarted the CDH services; that didn't work either.
Any pointers would be very helpful. Thanks.

Cannot run sbt on redhat

I tried downloading and running sbt on RedHat using:
curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
sudo yum install sbt
and I get this error:
java.lang.ExceptionInInitializerError
at xsbt.boot.Update.settings$lzycompute(Update.scala:76)
at xsbt.boot.Update.settings(Update.scala:71)
at xsbt.boot.Update.ivyLockFile$lzycompute(Update.scala:93)
at xsbt.boot.Update.apply(Update.scala:100)
at xsbt.boot.Launch.update(Launch.scala:350)
at xsbt.boot.Launch.xsbt$boot$Launch$$retrieve$1(Launch.scala:208)
at xsbt.boot.Launch$$anonfun$3.apply(Launch.scala:216)
at scala.Option.getOrElse(Option.scala:120)
at xsbt.boot.Launch.xsbt$boot$Launch$$getAppProvider0(Launch.scala:216)
at xsbt.boot.Launch$$anon$2.call(Launch.scala:196)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at xsbt.boot.Launch.locked(Launch.scala:238)
at xsbt.boot.Launch.app(Launch.scala:147)
at xsbt.boot.Launch.app(Launch.scala:145)
at xsbt.boot.Launch$.run(Launch.scala:102)
at xsbt.boot.Launch$$anonfun$apply$1.apply(Launch.scala:35)
at xsbt.boot.Launch$.launch(Launch.scala:117)
at xsbt.boot.Launch$.apply(Launch.scala:18)
at xsbt.boot.Boot$.runImpl(Boot.scala:41)
at xsbt.boot.Boot$.main(Boot.scala:17)
at xsbt.boot.Boot.main(Boot.scala)
Caused by: java.lang.RuntimeException: The SHA1 algorithm is not available in your classpath
at org.apache.ivy.core.cache.DefaultRepositoryCacheManager.<clinit>(DefaultRepositoryCacheManager.java:86)
... 29 more
Caused by: java.security.NoSuchAlgorithmException: SHA1 MessageDigest not available
at sun.security.jca.GetInstance.getInstance(GetInstance.java:159)
at java.security.Security.getImpl(Security.java:695)
at java.security.MessageDigest.getInstance(MessageDigest.java:167)
at org.apache.ivy.core.cache.DefaultRepositoryCacheManager.<clinit>(DefaultRepositoryCacheManager.java:84)
... 29 more
Error during sbt execution: java.lang.ExceptionInInitializerError
I'm not sure where the error is coming from. Is it an error in the sbt initialization itself?
Since this is the recommended way to install sbt from the official website, what other ways of installing sbt on redhat would you recommend?

I ended up solving this by installing the Bouncy Castle jars into my JVM installation; they were not present on my RedHat VMs.

I personally prefer the latest SBT version and a proper wrapper run script for it. There is a very nice script for it written by Paul Phillips (one of the Scala lang contributors). I just have a little shell script get_sbt.sh that downloads the latest version for me:
#!/bin/bash
# Downloads the latest version of SBT runner script which in turn downloads
# SBT launcher JAR and provides lots of convenience methods.
curl -s https://raw.githubusercontent.com/paulp/sbt-extras/master/sbt > sbt && chmod 0755 sbt
You can run it like this:
./get_sbt.sh
It will download the runner script to your current directory. Afterwards, just run the runner script with:
./sbt
This in turn will download the latest SBT JAR, and whatever else is needed will be bootstrapped from there.

ClassNotFoundException: org.apache.spark.repl.SparkCommandLine

I am a newbie to Apache Zeppelin and am trying to run it locally. I run just a simple sanity check to see that sc exists and get the error below.
I compiled it for pyspark and Spark 1.5 (I use Spark 1.5). I increased the memory to 5 GB and changed the port to 8091.
I am not sure what I did wrong to get the following error, or how I should solve it.
Thanks in advance
java.lang.ClassNotFoundException: org.apache.spark.repl.SparkCommandLine
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:401)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
at org.apache.zeppelin.spark.PySparkInterpreter.getSparkInterpreter(PySparkInterpreter.java:485)
at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:174)
at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:152)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:92)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:302)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Update
The solution for me was to downgrade my Scala version from 2.11.* to 2.10.*, build Apache Spark again, and run Zeppelin.

I am making certain assumptions based on what you have answered in the comments. It sounds like the Zeppelin setup is good; when I looked at the class SparkCommandLine, it is part of Spark's core.
Zeppelin has its own minimal embedded Spark classes, which are activated if you don't set SPARK_HOME. So first, per this GitHub page, try not setting SPARK_HOME (which you are setting) and HADOOP_HOME (which I don't think you are setting), to see whether eliminating your underlying Spark install "fixes" it:
Without SPARK_HOME and HADOOP_HOME, Zeppelin uses embedded Spark and
Hadoop binaries that you have specified with mvn build option. If you
want to use system provided Spark and Hadoop, export SPARK_HOME and
HADOOP_HOME in zeppelin-env.sh. You can use any supported version of
spark without rebuilding Zeppelin.
If that works, then you know we are looking at a Java classpath issue. To try to fix this, there is one more setting that goes in the zeppelin-env.sh file:
ZEPPELIN_JAVA_OPTS
It is mentioned here on the Zeppelin mailing list. Make sure you set it to point to the actual Spark jars so that the JVM picks them up with a -classpath.
Here is what my Zeppelin process looks like, for comparison. I think the important part is the -cp argument. Run ps on your system and look through your JVM options to see whether it points to similar paths:
/usr/lib/jvm/java-8-oracle/bin/java -cp /usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar:/usr/local/spark/conf/:/usr/local/spark/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar
-Xms1g -Xmx1g -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dzeppelin.log.file=/usr/local/zeppelin/logs/zeppelin-interpreter-spark-jim-jim.log org.apache.spark.deploy.SparkSubmit --conf spark.driver.extraClassPath=:/usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar
--conf spark.driver.extraJavaOptions= -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dzeppelin.log.file=/usr/local/zeppelin/logs/zeppelin-interpreter-spark-jim-jim.log
--class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer /usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar 50309
Hope that helps. If that doesn't work, please edit your question to show your existing classpath.
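The manual check described in this answer (find the -cp argument in the ps output and see whether it lists the Spark jars) can be sketched in Python. This is a minimal sketch: the helper names are hypothetical, the sample command line is a shortened copy of the one quoted above, and the jar-name test is only a naming heuristic.

```python
import shlex


def classpath_entries(cmdline):
    """Return the entries of the first -cp/-classpath argument in a JVM command line."""
    tokens = shlex.split(cmdline)
    for i, tok in enumerate(tokens[:-1]):
        if tok in ("-cp", "-classpath"):
            return [e for e in tokens[i + 1].split(":") if e]
    return []


def spark_jars(entries):
    """Keep entries whose file name looks like a Spark jar (naming heuristic only)."""
    return [
        e for e in entries
        if e.rsplit("/", 1)[-1].startswith("spark-") and e.endswith(".jar")
    ]


# Shortened from the process listing quoted in this answer.
sample = (
    "/usr/lib/jvm/java-8-oracle/bin/java -cp "
    "/usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar:"
    "/usr/local/spark/conf/:"
    "/usr/local/spark/lib/spark-assembly-1.5.1-hadoop2.6.0.jar "
    "-Xms1g -Xmx1g"
)

print(spark_jars(classpath_entries(sample)))
```

An empty result for your own process would suggest that the interpreter JVM was started without the Spark assembly on its classpath.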

Zeppelin recently released version 0.6.1 which supports Scala 2.11 and Spark 2.0. I too was puzzled by this error message, since I could clearly see my Spark home directory in the classpath. The new version of Zeppelin works great; I'm currently running it with Spark 2.0/Scala 2.11.
