Testing Samza with RocksDB application with SBT - scala

I would like to run a Samza application (using the RocksDB KV store) from SBT. When I do ./sbt "run " I receive the following error:
java.lang.ExceptionInInitializerError
(snip)
Caused by: java.lang.RuntimeException: librocksdbjni-linux64.so was not found inside JAR.
(snip)
I assume that since I launch it with ./sbt run, sbt runs the classes directly, without assembling a JAR.
The dependencies are set correctly, and I've got librocksdbjni-linux64.so inside the RocksDB JAR.
Do I have to create an assembly before running?
How can I test in this case without creating an assembly?

Well, librocksdbjni-linux64.so sounds like a native library, and native libraries usually need a little extra fiddling to be recognized and loaded, even if they are already on the classpath. Check this question.
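If you want to keep using ./sbt run rather than building an assembly first, one workaround is to fork the run and point java.library.path at a directory that contains the extracted native library. This is only a minimal sketch: the ./native directory and the step of extracting the .so from the RocksDB JAR are assumptions for illustration, not something stated in the question above.

// build.sbt sketch: fork the run so the JVM option takes effect, and tell the JVM
// where to find the extracted librocksdbjni-linux64.so (./native is an assumed location)
fork in run := true
javaOptions in run += "-Djava.library.path=" + (baseDirectory.value / "native").getAbsolutePath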

Related

Unable to find akka Configurations with Spark Submit

I built a fat jar and I am trying to run it with spark-submit on an EMR cluster or locally. Here is the command:
spark-submit \
--deploy-mode client \
--class com.stash.data.omni.source.Runner myJar.jar \
<arguments>
I keep getting an error related to akka configurations:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
It seems like the jar cannot find the reference.conf files for Akka at all. Has anyone dealt with this? I am able to run it without spark-submit on my local machine.
I think the issue is bundling the application into a jar with all its dependencies, which causes problems with Akka, as described in the documentation:
Akka’s configuration approach relies heavily on the notion of every
module/jar having its own reference.conf file. All of these will be
discovered by the configuration and loaded. Unfortunately this also
means that if you put/merge multiple jars into the same jar, you need
to merge all the reference.conf files as well: otherwise all defaults
will be lost.
You can follow this documentation to package your application and merge the reference.conf resources while bundling. It covers packaging with sbt, Maven, and Gradle.
Let me know if it helps!!
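For an sbt build, a minimal sketch of that merge step with the sbt-assembly plugin might look like the following (assuming sbt-assembly is enabled in the project): concatenate every reference.conf and defer to the plugin's default strategy for everything else.

// build.sbt sketch for sbt-assembly: concatenate reference.conf files from all jars
// instead of keeping only the first copy, and use the default strategy otherwise
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case x                => MergeStrategy.defaultMergeStrategy(x)
}

Note that the plugin's default strategy already concatenates reference.conf, so a catch-all such as case _ => MergeStrategy.first hides that behavior.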
It was my merge strategy. I had a catch-all case _ => MergeStrategy.first. I changed it to case x => MergeStrategy.defaultMergeStrategy(x) and it worked.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am a better data miner than programmer.
I am trying to run the examples from the book "Advanced Analytics with Spark" by Sandy Ryza (the code examples can be downloaded from "https://github.com/sryza/aas"),
and I run into the following problem.
When I open this project in IntelliJ IDEA and try to run it, I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD".
Does anyone know how to solve this issue?
Does this mean I am using the wrong version of Spark?
First, when I tried to run this code, I got the error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product", but I solved it by setting the scala-library dependency to compile scope in Maven.
I use Maven 3.3.9, Java 1.7.0_79, Scala 2.11.7, and Spark 1.6.1. I tried both IntelliJ IDEA 14 and 15, and different versions of Java (1.7), Scala (2.10), and Spark, but with no success.
I am also using Windows 7.
My SPARK_HOME and Path variables are set, and I can execute spark-shell from the command line.
The examples in this book show a --master argument to spark-shell, but you will need to specify arguments as appropriate for your environment. If you don't have Hadoop installed, you need to start spark-shell locally. To execute the samples you can simply pass paths as local file references (file:///) rather than HDFS references (hdfs://).
The author suggests a hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the sample code is treated as a compiled library rather than a standalone application. You can make the compiled JAR available to spark-shell by passing it via the --jars option, while Maven is used for compiling and managing dependencies.
In the book the author describes how the simplesparkproject can be executed:
Use Maven to compile and package the project:
cd simplesparkproject/
mvn package
Start the spark-shell with the jar dependencies:
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access your object within the spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However, if you want to execute the sample code as a standalone application and run it within IDEA, you need to modify the pom.xml.
Some of the dependencies are required for compilation but are already available in a Spark runtime environment. Therefore these dependencies are marked with the provided scope in the pom.xml.
<!--<scope>provided</scope>-->
You can remove (comment out) the provided scope, as shown above; then you will be able to run the samples within IDEA. But you can no longer pass this jar as a dependency to the spark-shell.
Note: use Maven 3.0.5 and Java 7+. I had problems with the Maven 3.3.X versions and the plugin versions.
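To illustrate the standalone route, here is a minimal sketch of a driver that can be launched directly from IDEA once the Spark dependencies are no longer scoped as provided. The object name and the file path are made up for the example and are not taken from the book's repository.

// Hypothetical standalone driver for running inside the IDE (Spark 1.x API).
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("aas-sample")
      .setMaster("local[2]")                              // run locally, no Hadoop needed
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///path/to/README.md")  // local file reference, not hdfs://
    println("line count: " + lines.count())
    sc.stop()
  }
}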

NoSuchMethod exception in Flink when using dataset with custom object array

I have a problem with Flink:
java.lang.NoSuchMethodError: org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo.getInfoFor(Lorg/apache/flink/api/common/typeinfo/TypeInformation;)Lorg/apache/flink/api/java/typeutils/ObjectArrayTypeInfo;
at LowLevel.FlinkImplementation.FlinkImplementation$$anon$6.<init>(FlinkImplementation.scala:28)
at LowLevel.FlinkImplementation.FlinkImplementation.<init>(FlinkImplementation.scala:28)
at IRLogic.GmqlServer.<init>(GmqlServer.scala:15)
at it.polimi.App$.main(App.scala:20)
at it.polimi.App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
The line with the problem is this one:
implicit val regionTypeInformation =
api.scala.createTypeInformation[FlinkDataTypes.FlinkRegionType]
In FlinkRegionType I have an Array of a custom object.
I developed the app with the Maven plugin in the IDE and everything works fine, but when I move to the Flink version I downloaded from the website I get the error above.
I am using Flink 0.9.
I was thinking that some library may be missing, but I am using Maven to handle everything. Moreover, reading through the code of ObjectArrayTypeInfo.java, it doesn't seem to be the problem.
A NoSuchMethodError commonly indicates a version mismatch between the libraries a Flink program was compiled against and the system the program is executed on, especially if the same code works in an IDE setup where the compile-time and execution libraries are the same.
In such a case, you should check the version of the Flink dependencies, for example in the Maven POM file.
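One quick sanity check, as a sketch: print the Flink version the program actually runs against and compare it with the version declared in the POM. This assumes flink-runtime (and its EnvironmentInformation utility) is on the classpath of the environment you are checking; if it is not available in your Flink version, compare the POM with the jars in the distribution's lib/ folder instead.

// Sketch: print the runtime Flink version to compare with the POM's <flink.version>.
import org.apache.flink.runtime.util.EnvironmentInformation

object FlinkVersionCheck {
  def main(args: Array[String]): Unit = {
    println("Flink runtime version: " + EnvironmentInformation.getVersion)
  }
}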

How to fix the sigar library when I run a spray application?

I have an sbt project written in Scala. The project uses Akka and Spray. There is a class with a main function. When I run the Scala console application, sometimes I get:
[on-spray-can-akka.actor.default-dispatcher-4] [DEBUG] [2014-11-07 16:48:30,336] Sigar: no sigar-amd64-winnt.dll in java.library.path
org.hyperic.sigar.SigarException: no sigar-amd64-winnt.dll in java.library.path
I do not change anything, run it again, and it runs well. So it can run successfully or fail several times in a row. How can I fix this?
UPDATED
Also, when it starts normally, there is this message:
[INFO] [11/07/2014 17:02:36.772] [on-spray-can-akka.actor.default-dispatcher-2]
[Cluster(akka://myApp)] Cluster Node [akka.tcp://myApp#127.0.0.1:2551] - Metrics will be
retreived from MBeans, and may be incorrect on some platforms. To increase metric accuracy
add the 'sigar.jar' to the classpath and the appropriate platform-specific native libary to
'java.library.path'. Reason: java.lang.IllegalArgumentException: java.lang.UnsatisfiedLinkError:
org.hyperic.sigar.Sigar.getPid()J
Sigar is a native library for gathering performance stats, used by the Typesafe Console (atmos) Scala library. If you're not interested in hooking Typesafe Console up to your application, you can simply remove all references to the atmos library from your sbt build script and app config files without affecting your app's functionality.

Launch a mapreduce job from eclipse

I've written a mapreduce program in Java, which I can submit to a remote cluster running in distributed mode. Currently, I submit the job using the following steps:
export the mapreduce job as a jar (e.g. myMRjob.jar)
submit the job to the remote cluster using the following shell command: hadoop jar myMRjob.jar
I would like to submit the job directly from Eclipse when I try to run the program. How can I do this?
I am currently using CDH3, and an abridged version of my conf is:
conf.set("hbase.zookeeper.quorum", getZookeeperServers());
conf.set("fs.default.name","hdfs://namenode/");
conf.set("mapred.job.tracker", "jobtracker:jtPort");
Job job = new Job(conf, "COUNT ROWS");
job.setJarByClass(CountRows.class);
// Set up Mapper
TableMapReduceUtil.initTableMapperJob(inputTable, scan,
CountRows.MyMapper.class, ImmutableBytesWritable.class,
ImmutableBytesWritable.class, job);
// Set up Reducer
job.setReducerClass(CountRows.MyReducer.class);
job.setNumReduceTasks(16);
// Setup Overall Output
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.submit();
When I run this directly from Eclipse, the job is launched but Hadoop cannot find the mappers/reducers. I get the following errors:
12/06/27 23:23:29 INFO mapred.JobClient: map 0% reduce 0%
12/06/27 23:23:37 INFO mapred.JobClient: Task Id : attempt_201206152147_0645_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.mypkg.mapreduce.CountRows$MyMapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:212)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:602)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
...
Does anyone know how to get past these errors? If I can fix this, I can integrate more MR jobs into my scripts which would be awesome!
If you're submitting the Hadoop job from within the Eclipse project that defines the classes for the job, then you most probably have a classpath problem.
The job.setJarByClass(CountRows.class) call is finding the class file on the build classpath, and not in the CountRows.jar (which may or may not have been built yet, or even be on the classpath).
You should be able to verify this by printing out the result of job.getJar() after you call job.setJarByClass(..); if it doesn't display a jar filepath, then it has found the build class rather than the jar'd class.
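A minimal sketch of that check, written in Scala here to match the rest of this page (the Hadoop calls are the same ones used in the question; the object name is a placeholder):

// Sketch: verify which jar (if any) Hadoop will ship with the job. If getJar prints
// null or a build-classpath directory instead of a .jar path, the cluster will not be
// able to load the mapper/reducer classes.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object JarCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val job = new Job(conf, "COUNT ROWS")   // same constructor style as in the question
    job.setJarByClass(JarCheck.getClass)    // in the question this would be CountRows.class
    println("job.getJar = " + job.getJar)
  }
}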
What worked for me was exporting a runnable JAR (the difference between it and a plain JAR is that a runnable JAR defines the class that has the main method) and selecting the "packaging required libraries into JAR" option (choosing the "extracting..." option leads to duplicate errors, and it also has to extract the class files from the jars, which, ultimately, in my case, did not resolve the class-not-found exception).
After that, you can just set the jar, as was suggested by Chris White. For Windows it would look like this: job.setJar("C:\\MyJar.jar");
If it helps anybody, I made a presentation on what I learned from creating a MapReduce project and running it in Hadoop 2.2.0 in Windows 7 (in Eclipse Luna)
I have used the method from the following website to configure a Map/Reduce project of mine to run using Eclipse (without exporting the project as a JAR):
Configuring Eclipse to run Hadoop Map/Reduce project
Note: If you decide to debug your program, your Mapper class and Reducer class won't be debuggable.
Hope it helps. :)