How to set native library path on cloudera spark yarn cluster mode - scala

I want to use the Jep library in my Spark job. Spark runs in yarn-cluster mode, and I am using CDH 5.8.
I am getting the following at run time:
java.lang.UnsatisfiedLinkError: no jep in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at jep.Jep.<clinit>(Jep.java:217)
I tried passing it through spark.driver.java-opts and spark.executor.java-opts, but that did not help. I also tried setting it in the spark-env.sh and hadoop-env.sh files, which did not work either. Finally, I tried setting it in mapreduce.map.env and mapreduce.map.child.env and restarted the CDH services, with no luck.
Any pointers would be very helpful. Thanks.
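For reference, the knobs that usually resolve this kind of UnsatisfiedLinkError on YARN are the extraLibraryPath properties rather than the *.java-opts ones. Below is a minimal sketch, assuming libjep.so is installed under /opt/jep/lib on every NodeManager host (the path is a placeholder); in yarn-cluster mode these properties have to reach spark-submit (via --conf or spark-defaults.conf) so they apply before the driver JVM starts, and the SparkConf form is shown only to document the property names.
import org.apache.spark.SparkConf

// Property names only; in yarn-cluster mode pass these with --conf on spark-submit
// or in spark-defaults.conf, since the driver JVM is already running by the time
// application code executes.
val conf = new SparkConf()
  .set("spark.driver.extraLibraryPath", "/opt/jep/lib")      // driver container
  .set("spark.executor.extraLibraryPath", "/opt/jep/lib")    // executor containers
  .set("spark.executorEnv.LD_LIBRARY_PATH", "/opt/jep/lib")  // env fallback for libraries loaded at run time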

Related

Native snappy library not available

I'm trying to do a lot of joins on some DataFrames using Spark in Scala. When I try to get the count of the final DataFrame I'm generating here, I get the exception below. I'm running the code from spark-shell.
I've tried some configuration parameters like the following while starting spark-shell, but none of them worked. Is there anything I'm missing here?
--conf "spark.driver.extraLibraryPath=/usr/hdp/2.6.3.0-235/hadoop/lib/native/"
--jars /usr/hdp/current/hadoop-client/lib/snappy-java-1.0.4.1.jar
Caused by: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:193)
Try updating the Hadoop jars from 2.6.3 to 2.8.0 or 3.0.0. There was a bug in the earlier Hadoop versions: the native Snappy library was not available.
After replacing the Hadoop core jar, you should be able to perform Snappy compression/decompression.
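To check whether the Hadoop native libraries (and their Snappy support) are actually visible to the JVM, here is a quick sketch that can be pasted into spark-shell; the hadoop checknative -a command reports the same information from the shell.
import org.apache.hadoop.util.NativeCodeLoader
import org.apache.hadoop.io.compress.SnappyCodec

// true only if libhadoop.so was found on java.library.path
println(s"libhadoop loaded: ${NativeCodeLoader.isNativeCodeLoaded}")

// throws the same "native snappy library not available" RuntimeException as in
// the question if libhadoop was built without Snappy support
SnappyCodec.checkNativeCodeLoaded()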

spark-submit netty NoSuchMethodError when using jdbc

I have been trying to run my job using spark-submit, and I have a problem when I use JDBC to fetch a DataFrame from a PostgreSQL database.
First of all, the JDBC driver is inside my job jar, but I had to load the driver like this inside my code:
sparkSession.read.option("driver", "org.postgresql.Driver").jdbc(jdbcdn, query, props)
This works fine and the connection to the database is made; I know this because if the server is not found, I receive the appropriate exception from the driver.
But if the connection succeeds, I always receive the following exception and the job hangs:
17/05/31 10:56:16 ERROR server.TransportRequestHandler: Error sending result StreamResponse{streamId=/jars/bibi-1.0.0-spark.jar, byteCount=3345077, body=FileSegmentManagedBuffer{file=/srv/jobs/bibi-1.0.0-spark.jar, offset=0, length=3345077}} to /127.0.0.1:50087; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:658)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:716)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:651)
at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:658)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:716)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:150)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
at org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:133)
at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
... 34 more
I tried the following (I am using Gradle):
Exclude netty dependencies from my project
Include a version of netty in my shadowJar
Relocate the included netty
But nothing I tried had any effect.
What I am wondering about is the problem I had with registering the driver: none of the standard approaches you can find online work with Spark/Scala/JDBC, and I had to use the code above.
It seems to me that the JDBC call runs in its own environment, and whatever I do in my project's Gradle build has no effect on that environment.
Since that option("driver", "org.postgresql.Driver") was hard to find, I wonder if there is something undocumented here and whether I have to find a way to tell the JDBC runtime which Netty version to use.
OK, so I continued my search and finally found what was going on.
I installed the Spark master and the Hadoop server myself; since the Hadoop jars were going to be on the same server, I installed the Spark build without Hadoop.
The Hadoop jars were added to the Spark classpath using the hadoop classpath command.
The thing is that Hadoop 2.7.3 ships with Netty 3.6.2/4.0.23.Final, while Spark ships with Netty 3.8.0/4.0.42.Final.
Both ended up on the classpath, which caused the problem.
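One way to confirm which copy of Netty wins on a given JVM (a diagnostic sketch, not something from the original answer) is to ask where the conflicting class was loaded from, for example in spark-shell or from the driver:
// Prints the jar that io.netty.channel.DefaultFileRegion came from, which shows
// whether the Hadoop or the Spark copy of Netty is actually being used.
val source = classOf[io.netty.channel.DefaultFileRegion]
  .getProtectionDomain.getCodeSource.getLocation
println(s"DefaultFileRegion loaded from: $source")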
What I did was copy both Netty jars from Spark to every place they appear in Hadoop, essentially upgrading the Netty version used by Hadoop.
I haven't seen a problem so far, but I only use a fraction of what Hadoop can do, so issues may still arise.
EDIT: Another quick fix is to use the Spark-with-Hadoop tarball and NOT add the Hadoop classpath; that way each uses its own jars without conflicting with the other.
This is actually what I ended up doing, because I had another jar conflict when accessing the Spark UI, and it could not be fixed by copying jars the way I did with Netty.
The conclusion: never use the spark-without-hadoop download.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am a better data miner than programmer.
I am trying to run the examples from the book "Advanced Analytics with Spark" by Sandy Ryza (the code examples can be downloaded from https://github.com/sryza/aas), and I run into the following problem.
When I open this project in IntelliJ IDEA and try to run it, I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD".
Does anyone know how to solve this issue?
Does this mean I am using the wrong version of Spark?
When I first tried to run this code, I got the error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product", but I solved it by setting scala-lib to compile scope in Maven.
I use Maven 3.3.9, Java 1.7.0_79, Scala 2.11.7, and Spark 1.6.1. I tried both IntelliJ IDEA 14 and 15 and different versions of Java (1.7), Scala (2.10), and Spark, but with no success.
I am also using Windows 7.
My SPARK_HOME and Path variables are set, and I can execute spark-shell from the command line.
The examples in this book assume a --master argument to spark-shell, but you will need to specify arguments as appropriate for your environment. If you don't have Hadoop installed, you need to start spark-shell locally. To execute the samples, you can simply pass paths with a local file reference (file:///) rather than an HDFS reference (hdfs://).
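For example (a sketch; the paths are placeholders for wherever the book's sample data has been unpacked):
// Running spark-shell locally: read the sample data from the local filesystem...
val localBlocks = sc.textFile("file:///home/me/linkage/block_1.csv")
// ...instead of from HDFS, which would require a running Hadoop cluster.
val hdfsBlocks = sc.textFile("hdfs:///user/me/linkage/block_1.csv")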
The author suggests a hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the sample code is treated as a compiled library rather than a standalone application. You can make the compiled JAR available to spark-shell by passing it via the --jars option, while Maven is used for compiling and managing dependencies.
In the book, the author describes how the simplesparkproject can be executed:
use maven to compile and package the project
cd simplesparkproject/
mvn package
start the spark-shell with the jar dependencies
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access your object within the spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However, if you want to execute the sample code as a standalone application and run it within IDEA, you need to modify the pom.xml.
Some of the dependencies are required for compilation but are available in a Spark runtime environment. Therefore these dependencies are marked with scope provided in the pom.xml:
<!--<scope>provided</scope>-->
If you remove (comment out) the provided scope as shown, you will be able to run the samples within IDEA, but you can no longer pass that jar as a dependency to the spark-shell.
Note: I am using Maven 3.0.5 and Java 7+. I had problems with Maven 3.3.X and the plugin versions.

ClassNotFoundException: org.apache.spark.repl.SparkCommandLine

I am a newbie with Apache Zeppelin and I am trying to run it locally. I am running just a simple sanity check to verify that sc exists, and I get the error below.
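(A minimal sanity check of that kind looks something like the sketch below; the asker's exact note is not shown and may have used %pyspark rather than Scala.)
// Minimal Zeppelin %spark paragraph to verify that the SparkContext exists
println(sc.version)
sc.parallelize(1 to 1000).sum()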
I compiled Zeppelin for pyspark and Spark 1.5 (I use Spark 1.5). I increased the memory to 5 GB and changed the port to 8091.
I am not sure what I did wrong to get the following error, or how I should solve it.
Thanks in advance
java.lang.ClassNotFoundException: org.apache.spark.repl.SparkCommandLine
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:401)
    at org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
    at org.apache.zeppelin.spark.PySparkInterpreter.getSparkInterpreter(PySparkInterpreter.java:485)
    at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:174)
    at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:152)
    at org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:92)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:302)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Update
The solution for me was to downgrade my Scala version from 2.11.* to 2.10.*, build Apache Spark again, and run Zeppelin.
I am making certain assumptions based on what you have answered in the comments. It sounds like the Zeppelin setup is good; when I looked at the class SparkCommandLine, it is part of Spark's core.
Now, Zeppelin has its own minimal embedded Spark classes, which are activated if you don't set SPARK_HOME. So first, per this GitHub page, try not setting SPARK_HOME (which you are setting) and HADOOP_HOME (which I don't think you are setting), to see if eliminating your underlying Spark install "fixes" it:
Without SPARK_HOME and HADOOP_HOME, Zeppelin uses embedded Spark and Hadoop binaries that you have specified with the mvn build option. If you want to use system-provided Spark and Hadoop, export SPARK_HOME and HADOOP_HOME in zeppelin-env.sh. You can use any supported version of Spark without rebuilding Zeppelin.
If that works, then you know we are looking at a Java classpath issue. To try to fix this, there's one more setting that goes in the zeppelin-env.sh file:
ZEPPELIN_JAVA_OPTS
mentioned here on the Zeppelin mailing list; make sure you set it to point to the actual Spark jars so the JVM picks them up with -classpath.
Here's what my Zeppelin process looks like, for comparison. I think the important part is the -cp argument; run ps on your system and look through your JVM options to see if it points to something similar:
/usr/lib/jvm/java-8-oracle/bin/java -cp /usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar:/usr/local/spark/conf/:/usr/local/spark/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar
-Xms1g -Xmx1g -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dzeppelin.log.file=/usr/local/zeppelin/logs/zeppelin-interpreter-spark-jim-jim.log org.apache.spark.deploy.SparkSubmit --conf spark.driver.extraClassPath=:/usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar
--conf spark.driver.extraJavaOptions= -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dzeppelin.log.file=/usr/local/zeppelin/logs/zeppelin-interpreter-spark-jim-jim.log
--class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer /usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar 50309
Hope that helps. If that doesn't work, please edit your question to show your existing classpath.
Zeppelin recently released version 0.6.1 which supports Scala 2.11 and Spark 2.0. I too was puzzled by this error message, since I could clearly see my Spark home directory in the classpath. The new version of Zeppelin works great; I'm currently running it with Spark 2.0/Scala 2.11.

Why does running Spark job fail to find classes inside uberjar on EMR while it works locally fine?

I have a Spark job that uses some external libraries. When I run the job locally through the main method from IntelliJ, it runs without any issues. However, when I assemble the job into a jar file (I create an uber-JAR using sbt) and try to run it on EMR, it throws a ClassNotFoundException.
I have checked that the class is indeed inside the jar file, so it should be available to the job. I have also tried the spark-submit options spark.driver.extraClassPath, spark.driver.extraLibraryPath, spark.executor.extraClassPath and spark.executor.extraLibraryPath, as well as spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. I also tried calling sparkContext.addJar("/mnt/jars/myJar") in the code. None of them worked for me.
Also, when running on EMR, I can see in the log that the JAR was added (not sure whether it is loaded onto the classpath, but it should be, because other classes are being loaded properly):
15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661
I am running out of ideas about what else to try. I have been researching and I see a few tickets on the Spark JIRA board, but nothing similar to my issue.
I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.
I think when launching your job on EMR you need to provide the S3 location of your jar dependencies, as described in the manual, e.g. -u s3://sparksupport/libs. These jars will be added to the classpath when Spark runs.
It turned out to be a problem with SerializationUtils from Apache Commons Lang. There is an open issue where the class throws a ClassNotFoundException even when the class is on the classpath in a multiple-classloader environment: https://issues.apache.org/jira/browse/LANG-1049
We moved away from the library and our Spark job is working fine now. In the end, the issue was not related to Spark.
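For anyone hitting the same LANG-1049 symptom, here is a sketch of one possible replacement (not the code from the original post, and the helper names are illustrative): plain JDK serialization that resolves classes through the thread context classloader, aimed at the kind of classloader mismatch described in the linked issue.
import java.io._

// Resolve classes via the thread context classloader, so classes visible to the
// application classloader can be found (the failure mode described in LANG-1049).
def deserialize[T](bytes: Array[Byte]): T = {
  val loader = Thread.currentThread().getContextClassLoader
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
    override def resolveClass(desc: ObjectStreamClass): Class[_] =
      Class.forName(desc.getName, false, loader)
  }
  try in.readObject().asInstanceOf[T] finally in.close()
}

def serialize(obj: Serializable): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bos)
  try out.writeObject(obj) finally out.close()
  bos.toByteArray
}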