I want to use Eclipse to debug the WordCount example, because I want to see how the job runs in the JobTracker. But Hadoop uses a proxy, so I can't see the concrete process by which the job runs in the JobTracker. How should I debug this?
You are better off debugging "locally" against a single-node cluster (e.g. one of the sandboxes supplied by Cloudera or Hortonworks): that way you can truly step through the code, as there is only one mapper/reducer in play. That has been my approach at least: usually the problems I had to debug concerned the contents of specific files, so I just copied the relevant file over to my test system and debugged there.
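If it helps, here is a minimal sketch of a driver that forces Hadoop's local job runner, so the whole job runs inside the Eclipse JVM and breakpoints in your mapper/reducer are hit. It assumes Hadoop 1.x-style property names; the mapper/reducer classes and the input/output paths are placeholders for your own.

// Minimal sketch: run the job with the local job runner so it stays in one JVM
// and can be stepped through in Eclipse. Assumes Hadoop 1.x property names;
// the commented-out mapper/reducer lines stand in for your own classes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalDebugDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "local"); // local job runner, single JVM
        conf.set("fs.default.name", "file:///"); // local filesystem instead of HDFS

        Job job = new Job(conf, "wordcount-local-debug");
        job.setJarByClass(LocalDebugDriver.class);
        // job.setMapperClass(WordCountMapper.class);   // point at your own mapper
        // job.setReducerClass(WordCountReducer.class); // and your own reducer

        FileInputFormat.addInputPath(job, new Path("input"));    // local input directory
        FileOutputFormat.setOutputPath(job, new Path("output")); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Set breakpoints in the map() and reduce() methods and launch this driver with Eclipse's Debug As > Java Application; because everything runs in one JVM, the debugger stops inside your own code.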
I have just started using Eclipse CDT and would like to know why there are two build configurations, Debug and Release.
Could I use this to improve my workflow in any way? The CDT manuals just mention that there are two default configurations, but never explain why.
Thank you for your answers.
This is not specific to Eclipse; you'll encounter these two configurations in all software and web development.
You use Debug to test your application. The build will typically include debug symbols so you can step through your code, and it will usually skip some optimizations. The purpose is to make diagnosing issues easier.
The Release configuration is the one you use to publish or deploy your application; it can apply optimizations.
The configurations are also useful if you want to connect to different servers, name files differently, or even execute different code paths depending on whether you are testing locally or the end user is running what you built.
Another example is logging and tracing: in Debug mode you may want to print to the console or write to a file/log, but in Release you'll want to avoid that if it hurts performance.
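As a rough illustration of the logging point (shown in Java for brevity; in C/C++ with CDT the same idea is usually expressed with an NDEBUG/DEBUG preprocessor check), here is a minimal sketch where a debug flag, assumed to be set by the Debug configuration, gates the extra output:

// Minimal sketch: debug-only logging gated by a flag. The "app.debug" system
// property is an assumption standing in for whatever your Debug configuration defines.
public final class Log {
    private static final boolean DEBUG = Boolean.getBoolean("app.debug");

    public static void debug(String message) {
        if (DEBUG) {
            System.out.println("[DEBUG] " + message);
        }
    }
}

In a Release build the flag is simply left unset, so the extra output (and its cost) disappears without touching the calling code.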
In particular, for test-cases, I want to keep the test database separate so that the test cases don't interfere with development or production databases.
What are some good practices for separating development, test and production environments?
EDIT1: Some context
In Ruby on Rails, there are by convention different configuration files for different environments. Does Play! 2 also support that?
Or do I have to hand-craft the configuration files and then write some glue code that selects the appropriate one?
At the moment, if I run sbt test it uses the development database (configured as "default" in conf/application.conf). However, I would like Play! 2 to use a different test database.
EDIT2: On commands that play provides
For the Play! 2 framework, I observed the following.
$ help play
Welcome to Play 2.2.2!
These commands are available:
-----------------------------
...OUTPUT SKIPPED...
run <port> Run the current application in DEV mode.
test Run Junit tests and/or Specs from the command line
start <port> Start the current application in another JVM in PROD mode.
...OUTPUT SKIPPED...
There are three well-defined commands for the "test", "development" and "production" instances:
test: runs the test cases, so it should automatically select the test configuration.
run <port>: runs the development instance on the specified port, so it should automatically select the development configuration.
start <port>: runs the production instance on the specified port, so it should automatically select the production configuration.
However, all of these commands pick up the values provided in conf/application.conf. I feel there is a gap to be filled here.
Please do correct me if I am wrong.
EDIT3: Best approach is using Global.scala
Described here: How to manage application.conf in several environments with play 2.0?
Good practice is keeping separate instances of the application in separate folders and syncing them, e.g. via a git repo.
When you want to keep a single instance, you can use an alternative configuration file for each environment.
In your application.conf file there is an entry (or entries) for your database, e.g. db.default.url=jdbc:mysql://127.0.0.1:3306/devdb
The conf file can read environment variables using ${?ENV_VAR_NAME} syntax, so change that to something like db.default.url=${?DB_URL} and use environment variables.
A simpler way to get this done and manage your configuration more easily is via GlobalSettings: there is a method you can override called onLoadConfig. Check its API at API_LINK.
Basically, in your project's conf/ folder, you'll set things up similar to the layout below:
conf/application.conf --> configuration common to all environments
conf/dev/application.conf --> configurations for development environment
conf/test/application.conf --> configurations for testing environment
conf/prod/application.conf --> configurations for production environment
With this, your application knows which configuration to load for the environment mode it runs in. For a code snippet of my onLoadConfig implementation, check the article at my blog post.
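For reference, the merging itself can be sketched with the Typesafe Config library that Play's configuration is built on; onLoadConfig would return a Play Configuration wrapping a merged Config along these lines. The "mode" string and the conf/<mode>/application.conf layout follow the folders above; this is an illustration, not the exact implementation from the linked blog post.

// Sketch of layering an environment-specific file over the shared one using the
// Typesafe Config library. The mode value ("dev", "test" or "prod") and the
// conf/<mode>/application.conf resources are assumptions matching the layout above.
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class EnvConfigSketch {
    public static Config load(String mode) {
        Config envSpecific = ConfigFactory.parseResources(mode + "/application.conf");
        Config common = ConfigFactory.parseResources("application.conf");
        // Environment-specific keys win; anything missing falls back to the common file.
        return envSpecific.withFallback(common).resolve();
    }

    public static void main(String[] args) {
        Config cfg = load(System.getProperty("config.mode", "dev"));
        System.out.println(cfg.getString("db.default.url"));
    }
}

Because Play puts conf/ on the classpath, conf/dev/application.conf is visible as the resource "dev/application.conf", which is why the relative names above work.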
I hope this is helpful.
I am very new to Java, Eclipse and Hadoop, so pardon me if my question seems too silly.
The question is:
I have a 3-node CDH4 cluster of RHEL5 machines on a cloud platform. The CDH4 setup has been completed, and now I want to write some sample MapReduce programs to learn about it.
Here is my understanding of how to do it:
To write Java MapReduce programs I will have to install Eclipse on my main server, right? Which version of Eclipse should I go for?
Just installing Eclipse will not be enough; I will have to change some settings so that it can use my CDH cluster. What do I need to do for this?
Last but not least, could you please suggest some sites where I can get more info on this? Remember, I am just a beginner in all of this. :)
Thanks in advance...
pankaj
Pankaj, you can always visit the official page. Apart from that, you might find these links helpful:
http://blog.cloudera.com/blog/2013/08/how-to-use-eclipse-with-mapreduce-in-clouderas-quickstart-vm/
http://cloudfront.blogspot.in/2012/07/how-to-run-mapreduce-programs-using.html#.UgH8OWT08Vk
It is not mandatory to have Eclipse on the main server (main server = master machine???). Any of the last three versions of Eclipse works perfectly fine; I don't know about earlier versions. You can either run your job through Eclipse directly, or write your job in Eclipse and export it as a jar. You can then copy this jar to your JobTracker machine and execute it there through the shell using the hadoop jar command. If you are running your job directly through Eclipse, you need to tell it the location of your NameNode and JobTracker machines through these properties:
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://NN_HOST:9000");  // NameNode host and port
conf.set("mapred.job.tracker", "JT_HOST:9001");      // JobTracker host and port
(Change the hostnames and ports as per your configuration).
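If you take the run-directly-from-Eclipse route, note that the compiled mapper/reducer classes still need to be shipped to the cluster somehow; one common way under Hadoop 1.x is to point the job at a jar you exported from Eclipse. A hedged sketch, where the hostnames, ports and jar path are placeholders:

// Sketch of submitting from an IDE to a remote JobTracker (Hadoop 1.x-era API).
// NN_HOST/JT_HOST and the jar path are placeholders; mapred.jar is one way to make
// sure your exported classes travel with the job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://NN_HOST:9000");
        conf.set("mapred.job.tracker", "JT_HOST:9001");
        conf.set("mapred.jar", "/path/to/exported-job.jar"); // jar exported from Eclipse

        Job job = new Job(conf, "sample-job");
        // job.setMapperClass(...); job.setReducerClass(...); // your own classes
        FileInputFormat.addInputPath(job, new Path("/user/you/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/you/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}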
One quick suggestion, though: you can always search for these kinds of things before posting a question. A lot of info is available on the net and it is very easily accessible.
HTH
We currently use Ant to automate our deployment process. One of the tasks that needs carrying out when setting up a new service is implementing monitoring for it.
This involves adding the service in one of the hosts in the Nagios configuration directory.
Has anyone attempted to implement something like this in a fully automated way? It seems that the Nagios configuration is laid out with the files split up per host, as opposed to per application.
For example:
localhost.cfg
This may cause an issue for an automated solution, since I'm setting up the monitoring as I'm deploying the application to the environment (i.e. the host). It's like a jigsaw puzzle where two pieces don't quite fit together. Any suggestions?
OK, you could argue that you really only need to set up the monitoring once, but I want the developers to have the power to update the check script when the testing criteria change, without too much involvement from Operations.
Anyone have any comments on this?
Kind Regards,
Steve
The splitting of Nagios configuration files is optional; you can have it all in one file if you want to, or split it up into several files as you see fit. The cfg_dir configuration statement can be used to have Nagios pick up any .cfg files it finds in a given directory.
When configuration files have changed, you'll have to reload the configuration in Nagios. This can be done via the external commands pipe.
Nagios provides a configuration validation tool, so that you can verify that your new configuration is ok before loading it into the live environment.
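Putting those two steps together, your deployment could drop the generated .cfg file into a cfg_dir directory, validate, and then ask Nagios to re-read its configuration. Below is a minimal sketch in Java (easy to call from an Ant <java> task); the nagios binary name and the paths assume a default source install and are only assumptions, so adjust them to your layout.

// Hedged sketch: validate the Nagios configuration, then request a restart (which
// re-reads the configuration) via the external command pipe. Paths are assumptions
// for a default source install; adjust for your packaging.
import java.io.FileWriter;
import java.io.IOException;

public class NagiosReload {
    private static final String NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg";
    private static final String COMMAND_PIPE = "/usr/local/nagios/var/rw/nagios.cmd";

    public static void main(String[] args) throws IOException, InterruptedException {
        // 1. Validate the new configuration before touching the live instance.
        Process verify = new ProcessBuilder("nagios", "-v", NAGIOS_CFG).inheritIO().start();
        if (verify.waitFor() != 0) {
            throw new IllegalStateException("Nagios configuration failed validation; not reloading.");
        }
        // 2. Ask the running Nagios to restart via the external command pipe.
        long now = System.currentTimeMillis() / 1000L;
        try (FileWriter pipe = new FileWriter(COMMAND_PIPE)) {
            pipe.write("[" + now + "] RESTART_PROGRAM\n");
        }
    }
}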
I'm trying to run a MapReduce implementation of the quadratic sieve algorithm on Hadoop. For this purpose I'm using the Karmasphere Hadoop community plugin with NetBeans. The program works fine using the plugin, but I'm unable to run it on an actual cluster.
I'm running this command
bin/hadoop jar MRIF.jar 689
Where MRIF.jar is the jar file produced by building the NetBeans project and 689 is the number to be factored. The input and output directories are hard-coded in the program itself. When running on the actual cluster, it appears that the inner Java classes are not being processed, as reduce reaches 100% while map is still at 0%, and the input and output files are created with no content.
But it works fine when run with the Karmasphere plugin.
Try running it as bin/hadoop -jar MRIF.jar 689. The -jar forces it to run locally and displays information on the console, as well as logging to that machine. You can also check the Hadoop logs to see if they contain any indicators of why it's not happening correctly.
When using -jar you can use System.out.println(...); to display information on the console, further helping to debug.
You can also use Hadoop Counters (the link is a random blog post I found) to assist in troubleshooting when running (pseudo-)distributed.
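For example, a counter bumped inside your mapper will show up in the job's console output and web UI, which helps confirm whether your map code is actually being invoked on the cluster. A minimal sketch; the class name, counter names and the key/value types are placeholders for your own:

// Hedged sketch: incrementing a custom counter inside a mapper to confirm it runs.
// SieveMapper and the generic types are placeholders for your own implementation.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SieveMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Shows up under the "Debug" counter group in the job output and web UI.
        context.getCounter("Debug", "MAP_RECORDS_SEEN").increment(1);
        // ... your actual map logic here ...
    }
}

If the counter stays at zero after the job finishes, the mapper is never being called, which points at job setup rather than the sieve logic itself.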
I admit this post isn't a 'solution' to the problem; without more information about what is happening and where, there is a wide range of things that could be going on. If, as you mention, it is not processing the 'inner Java classes', then it is likely your implementation, which we can't see in order to make suggestions, etc.
More data about the issue, such as logs, errors or output, will likely lead to more solution-like responses instead of debugging tips. :)
EDIT: Thanks for the link to the files. I think your call is missing a component.
I looked in the run.sh and think this might get it to work for you:
bin/hadoop jar mrif.jar com.javiertordable.mrif.MapReduceQuadraticSieve 689