Running a MapReduce jar on a Hadoop cluster - NetBeans

I'm trying to run a MapReduce implementation of the quadratic sieve algorithm on Hadoop. For this purpose I'm using the Karmasphere Hadoop community plugin with NetBeans. The program works fine through the plugin, but I'm unable to run it on an actual cluster.
I'm running this command:
bin/hadoop jar MRIF.jar 689
where MRIF.jar is the jar file built from the NetBeans project and 689 is the number to be factored. The input and output directories are hard-coded in the program itself. When running on the actual cluster, it appears that the inner Java classes are not being processed: reduce reaches 100% while map is still at 0%, and the input and output files are created with no content.
But this works fine when run through the Karmasphere plugin.

Try running it as bin/hadoop -jar MRIF.jar 689. The -jar flag forces it to run locally and displays information on the console, as well as logging to that machine. You can also check the Hadoop logs to see whether they contain any indication of why it isn't working correctly.
When using -jar you can use System.out.println(...) to print information to the console, which further helps with debugging.
You can also use Hadoop Counters (the link is a random blog post I found) to assist in troubleshooting when running (pseudo-)distributed.
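As a hedged illustration (the mapper and counter names below are invented for this example, not taken from the MRIF code), a counter is just an enum value incremented through the task context; the totals are printed with the job summary and shown in the JobTracker web UI, so they remain visible even when stdout does not:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SieveMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Counters are grouped by an enum; each value becomes a named counter.
    public enum Debug { RECORDS_SEEN, RECORDS_SKIPPED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.getCounter(Debug.RECORDS_SEEN).increment(1);

        if (value.toString().trim().isEmpty()) {
            // Count suspicious input instead of silently dropping it.
            context.getCounter(Debug.RECORDS_SKIPPED).increment(1);
            return;
        }
        // ... the actual map logic would go here ...
        context.write(new Text("factor-candidate"), value);
    }
}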
I admit this post isn't a 'solution' to the problem; without more information about what is happening and where, there is a wide range of things that could be going on. If it is, as you mention, not processing the 'inside Java classes', then the issue is most likely in your implementation, which we can't see in order to make suggestions, etc.
More data about the issue, such as logs, errors or output, will likely help you get more solution-like responses instead of debugging tips. :)
EDIT: Thanks for the link to the files. I think your call is missing a component.
I looked in the run.sh and think this might get it to work for you:
bin/hadoop jar mrif.jar com.javiertordable.mrif.MapReduceQuadraticSieve 689
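For what it's worth, the fully qualified class name is typically needed when the jar's manifest doesn't declare a Main-Class; the class named on the command line is simply the one containing main(). As a rough sketch only (not the actual MRIF source; the property name and paths are placeholders), such a driver usually looks something like this:

package com.javiertordable.mrif;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceQuadraticSieve {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("number.to.factor", args[0]); // "689" from the command line; property name is made up

        Job job = new Job(conf, "quadratic-sieve");
        job.setJarByClass(MapReduceQuadraticSieve.class);
        // The mapper/reducer classes, output types and the hard-coded
        // input/output directories mentioned in the question go here.
        FileInputFormat.addInputPath(job, new Path("input"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("output")); // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}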

Related

How to run a Scio pipeline on Dataflow from SBT (local)

I am trying to run my first Scio pipeline on Dataflow.
The code in question can be found here. However, I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirectRunner. That worked as expected.
Now I am trying to read the files from GCS, write the output to BigQuery and run the pipeline using the DataflowRunner. I already made all the necessary changes (or so I believe), but I am unable to make it run.
I have already run gcloud auth application-default login, and when I do
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table
I can see the job is submitted to Dataflow. However, after one hour the job fails with the following error message:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
(Note that the job did nothing in all that time, and since this is an experiment, the data is simply too small to take more than a couple of minutes.)
Checking Stackdriver, I can find the following error:
java.lang.ClassNotFoundException: scala.collection.Seq
Related to some Jackson thing:
java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated
And that is what is killing each executor right at the start. I really do not understand why it cannot find the Scala standard library.
I also tried to first create a template and run it later with:
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1
But after running the template, I get the same error.
Also, I noticed that there are a lot of jars in the staging folder, but scala-library.jar is not among them.
Am I missing something obvious?
It's a known issue with sbt 1.3.0, which introduced some breaking changes w.r.t. class loaders. Try 1.2.8?
Also, the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.
Fix by setting the sbt classLoaderLayeringStrategy:
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
sbt uses a new classloader for the application that is run with run. This causes other classes already loaded by the JVM (Predef for instance) to be reused, reducing startup time. See in-process classloaders for details.
This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:
Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.
So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
Hope that helps

Running an OpenCV program with Eclipse

I'm trying to run a simple OpenCV example in Eclipse [OpenCV was correctly built and installed beforehand (using CMake and MinGW); even the libraries and all the includes are in place!].
When building, I get no errors or warnings and everything seems fine, but when I try to run, I get a message as if the project had no binaries, even though all the binaries are there. I even specified the path to the ".exe" (Run -> Run Configurations -> New launch configuration -> Browse, etc.).
You can see in the attached images that the project is built and the binaries are generated.
Note: when I run a "Hello World" example in the console, it displays the message without errors.
I read a lot on the Internet before posting here, but I found nothing that matches this case.
Thank you so much,
Error Capture
Build Capture
Regards

JAI can't execute in native Spark - only in sbt and as a separate Scala function

I want to use a library (JAI) with Spark to parse some spatial raster files. Unfortunately, there are some strange issues: JAI only works when running via the build tool, i.e. sbt run, but not when executed in Spark.
When executed via spark-submit, the error is:
java.lang.IllegalArgumentException: The input argument(s) may not be null.
at javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
at org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
It looks like some native dependency is not being called correctly.
Assuming something was wrong with the classpath, I tried to run a plain Java/Scala function, but that one works just fine.
In fact, the exact same problem occurs when NiFi is calling the parse function.
Is Spark messing with the classpath? What is different between running the jar natively via java -jar and running it through Spark or NiFi? Both show the same problem, even when concurrency is disabled and they run on only a single thread.
JAI vendorname == null is somewhat similar, as it shows what can go wrong when running a jar with JAI. I could not confirm that it is the exact same problem, though.
I created a minimal example here:
https://github.com/geoHeil/jai-packaging-problem
Due to the dependency on the build process and the packaging of native libraries, I think it will not be possible to include snippets directly in this post.
Edit:
I am pretty convinced this has to do with the assembly merge strategy; so far I could not find one that works.
Below you can see that the Vectorize operation is missing from Spark's classpath.
Edit 2:
I think Spark's / NiFi's class loader will not load some of the required registry files for JAI. A plain Java app works fine with these assembly / fat-jar settings.

How to use Eclipse to debug Hadoop WordCount?

I want to use Eclipse to debug WordCount, because I want to see how the job runs in the JobTracker. But Hadoop uses a proxy, and I don't know the concrete process of how the job runs in the JobTracker. How should I debug it?
You are better off debugging "locally" against a single-node cluster (e.g. one of the sandboxes supplied by Cloudera or Hortonworks): that way you can truly step through the code as there is only one mapper/reducer in play. That's been my approach at least: usually the problems I had to debug were to do with the contents of specific files; I just copied over the relevant file to my test system and debugged there.
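If you also want breakpoints in the mapper or reducer to be hit inside Eclipse, one common trick (a sketch only, using the classic MRv1 property names; paths and the job name are placeholders) is to force local mode, so the whole job runs in a single JVM through the LocalJobRunner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalDebugDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "local"); // LocalJobRunner instead of the real JobTracker
        conf.set("fs.default.name", "file:///"); // read the copied test file from the local disk

        Job job = new Job(conf, "wordcount-debug");
        job.setJarByClass(LocalDebugDriver.class);
        // Set the same mapper/reducer and key/value classes as in the real WordCount job.
        FileInputFormat.addInputPath(job, new Path("testdata/input.txt"));
        FileOutputFormat.setOutputPath(job, new Path("testdata/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}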

MapReduce programs using Eclipse in CDH4

I am very new to Java, Eclipse and Hadoop, so pardon me if my question seems too silly.
The question is:
I have a 3-node CDH4 cluster running RHEL5 on a cloud platform. The CDH4 setup has been completed, and now I want to write some sample MapReduce programs to learn about it.
Here is my understanding of how to do it:
To write Java MapReduce programs I will have to install Eclipse on my main server, right? Which version of Eclipse should I go for?
And just installing Eclipse will not be enough; I will have to change some settings so that it can use my CDH cluster. What is needed to do this?
And last but not least, could you please suggest some sites where I can get more info on the same? Remember, I am just a beginner in all of this. :)
Thanks in advance...
pankaj
Pankaj, you can always visit the official page. Apart from that, you might find these links helpful:
http://blog.cloudera.com/blog/2013/08/how-to-use-eclipse-with-mapreduce-in-clouderas-quickstart-vm/
http://cloudfront.blogspot.in/2012/07/how-to-run-mapreduce-programs-using.html#.UgH8OWT08Vk
It is not mandatory to have Eclipse on the main server (main server = master machine???). Any of the last 3 versions of Eclipse works perfectly fine; I don't know about earlier versions. You can either run your job through Eclipse directly, or you can write your job in Eclipse and export it as a jar. You can then copy this jar to your JobTracker machine and execute it there through the shell using the hadoop jar command. If you are running your job directly through Eclipse, you need to tell it the locations of your NameNode and JobTracker machines through these properties:
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://NN_HOST:9000");
conf.set("mapred.job.tracker", "JT_HOST:9001");
(Change the hostnames and ports as per your configuration).
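For completeness, here is a hedged sketch of a full driver built around those two properties, for submitting straight from Eclipse; the host names, paths and job name are placeholders, and the real mapper/reducer classes still need to be set as in a normal WordCount:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteWordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://NN_HOST:9000"); // NameNode, as above
        conf.set("mapred.job.tracker", "JT_HOST:9001");     // JobTracker, as above

        Job job = new Job(conf, "word count");
        job.setJarByClass(RemoteWordCountDriver.class);
        // job.setMapperClass(...), job.setReducerClass(...) and the output
        // key/value classes go here, exactly as in a normal WordCount driver.
        FileInputFormat.addInputPath(job, new Path("/user/pankaj/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/pankaj/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}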
One quick suggestion, though: you can always search for these kinds of things before posting a question. A lot of info is available on the net, and it is easily accessible.
HTH