Native snappy library not available - scala

I'm doing a lot of joins on some DataFrames using Spark in Scala. When I try to get the count of the final DataFrame I'm generating here, I get the following exception. I'm running the code using spark-shell.
I've tried some configuration parameters like the following while starting spark-shell, but none of them worked. Is there anything I'm missing here?
--conf "spark.driver.extraLibraryPath=/usr/hdp/2.6.3.0-235/hadoop/lib/native/"
--jars /usr/hdp/current/hadoop-client/lib/snappy-java-1.0.4.1.jar
Caused by: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:193)

Try updating the Hadoop jars from 2.6.3 to 2.8.0 or 3.0.0. There was a bug in the earlier Hadoop version: the native snappy library was not available.
After replacing the Hadoop core jar, you should be able to perform snappy compression/decompression.
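As a quick sanity check (a sketch, reusing the HDP native path from the question), hadoop checknative reports whether libhadoop was built with snappy support, and the native directory can be passed to the executors as well as the driver:

# verify that libhadoop was built with snappy support
hadoop checknative -a

# expose the native libraries to both driver and executors
spark-shell \
  --conf "spark.driver.extraLibraryPath=/usr/hdp/2.6.3.0-235/hadoop/lib/native/" \
  --conf "spark.executor.extraLibraryPath=/usr/hdp/2.6.3.0-235/hadoop/lib/native/"

If checknative reports snappy: false, the library path alone won't help until libhadoop itself is upgraded or rebuilt with snappy support.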

Related

Spark Shell not working after adding support for Iceberg

We are doing a POC on Iceberg and evaluating it for the first time.
Spark Environment:
Spark Standalone Cluster Setup (1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg 0.13.1
As suggested in Iceberg's official documentation, to add support for Iceberg in the Spark shell, we are adding the Iceberg dependency while launching the Spark shell as below:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we are not able to use the Spark shell at all. For all commands (even non-Iceberg ones) we get the same exception as below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
Even the simple commands below throw the same exception.
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In the Spark source code, the BinaryCommand class belongs to the Spark SQL module, so we tried explicitly adding the Spark SQL dependency while launching the Spark shell as below, but we still get the same exception.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally, i.e. without the Iceberg dependency, it works properly.
Any pointers in the right direction for troubleshooting would be really helpful.
Thanks.
We were using the wrong Iceberg version: we chose the Spark 3.2 Iceberg runtime jar while running Spark 3.1. After switching to the correct dependency version (i.e. the 3.1 runtime), we are able to launch the Spark shell with Iceberg. There is also no need to specify org.apache.spark jars via --packages, since all of that is on the classpath anyway.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1
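To actually work with Iceberg tables from the shell, the Iceberg documentation also suggests registering the SQL extensions and a catalog. A minimal sketch, where the catalog name local and the warehouse path are only illustrative:

spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=/tmp/iceberg-warehouse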

ClassNotFoundException while creating Spark Session

I am trying to create a SparkSession in a unit test case using the code below
val spark = SparkSession.builder.appName("local").master("local").getOrCreate()
but while running the tests, I am getting the below error:
java.lang.ClassNotFoundException: org.apache.hadoop.fs.GlobalStorageStatistics$StorageStatisticsProvider
I have tried to add the dependency but to no avail. Can someone point out the cause and the solution to this issue?
This can happen for two reasons.
1. You may have incompatible versions of the Spark and Hadoop stacks. For example, HBase 0.9 is incompatible with Spark 2.0; such mismatches result in class/method not found exceptions.
2. You may have multiple versions of the same library because of dependency hell. You may need to inspect the dependency tree to make sure this is not the case, as shown in the sketch below.
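A sketch of how to inspect the tree, assuming a Maven build (the org.apache.hadoop filter simply narrows the output to the relevant artifacts); sbt 1.4+ has an equivalent built-in task:

# Maven: show which versions of the Hadoop artifacts are pulled in, and by whom
mvn dependency:tree -Dincludes=org.apache.hadoop

# sbt 1.4+: print the resolved dependency graph (older sbt needs the sbt-dependency-graph plugin)
sbt dependencyTree

If two different Hadoop versions show up, exclude or pin one of them so the whole build resolves to a single version.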

How to set native library path on cloudera spark yarn cluster mode

I want to use the jep library in my Spark job. Spark is running in yarn-cluster mode. I am using CDH 5.8.
I am getting this at runtime:
java.lang.UnsatisfiedLinkError: no jep in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at jep.Jep.<clinit>(Jep.java:217)
I tried passing it through spark.driver.java-opts and spark.executor.java-opts, but it was of no help. I even tried setting it in the spark-env.sh and hadoop-env.sh files; that didn't work. I also tried setting it in mapreduce.map.env and mapreduce.map.child.env and restarted the CDH services; that didn't work either.
Any pointers would be very helpful. Thanks.
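For reference, the attempts described above are usually spelled with the following spark-submit flags. This is only a sketch; /opt/jep and myjob.jar are placeholders for the real location of libjep.so on the worker nodes and the actual application jar:

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.driver.extraLibraryPath=/opt/jep \
  --conf spark.executor.extraLibraryPath=/opt/jep \
  --conf "spark.driver.extraJavaOptions=-Djava.library.path=/opt/jep" \
  --conf "spark.executor.extraJavaOptions=-Djava.library.path=/opt/jep" \
  myjob.jar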

spark-submit on standalone cluster complain about scala-2.10 jars not exist

I'm new to Spark and downloaded a pre-compiled Spark binary from Apache (spark-2.1.0-bin-hadoop2.7).
When submitting my Scala (2.11.8) uber jar, the cluster throws an error:
java.lang.IllegalStateException: Library directory '/root/spark/assembly/target/scala-2.10/jars' does not exist; make sure Spark is built
I'm not running Scala 2.10, and Spark isn't compiled (as far as I know) with Scala 2.10.
Could it be that one of my dependencies is based on Scala 2.10?
Any suggestions as to what could be wrong?
Not sure what is wrong with the pre-built spark-2.1.0, but I've just downloaded Spark 2.2.0 and it is working great.
Try setting SPARK_HOME to the location of your Spark installation on your system or in your IDE.
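A minimal sketch, assuming the pre-built distribution was unpacked under /opt and that com.example.Main and my-assembly.jar stand in for the real main class and uber jar:

# point SPARK_HOME at the unpacked distribution and put its bin/ on the PATH
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
spark-submit --class com.example.Main my-assembly.jar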

Spark running Liblinear unable to load JBLAS jar

I'm running Spark 1.4.0, Hadoop 2.7.0, and JDK 7. I'm trying to run the Liblinear example code presented here.
The liblinear jar works; however, when training the model it can't find the JBLAS library. I've tried including a JBLAS jar in the --jars option when launching Spark, as well as installing the jar with Maven (although I must add I am a newbie to Spark as well as Maven, so I probably did it wrong).
The specific error thrown is this:
java.lang.NoClassDefFoundError: org/jblas/DoubleMatrix
at tw.edu.ntu.csie.liblinear.Tron.tron(Tron.scala:323)
at tw.edu.ntu.csie.liblinear.SparkLiblinear$.tw$edu$ntu$csie$liblinear$SparkLiblinear$$train_one(SparkLiblinear.scala:32)
when running this line:
val model = SparkLiblinear.train(data, "-s 0 -c 1.0 -e 1e-2")
Thanks.
java.lang.NoClassDefFoundError: org/jblas/DoubleMatrix
It seems that you did not add the jblas jar to the classpath. The solution could be:
$ export SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/jblas-1.2.3.jar
After that, it should work fine.
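On more recent Spark versions, where SPARK_CLASSPATH is deprecated, an equivalent sketch is to ship the jar with --jars when launching the shell (the path below is wherever your local copy of jblas lives):

spark-shell --jars /path/to/jblas-1.2.3.jar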
Hope this helps,
Le Quoc Do