Hadoop learning development workflow with Eclipse and AWS

I have installed single-node Hadoop 2.8 on an AWS free-tier nano instance. I have a local Windows machine with Eclipse on it. What is a good learning workflow? I am not sure of the capabilities of AWS or Hadoop. Should I write code in local Eclipse, build a jar, transfer it to the AWS machine and run it there?
If I have to write the code and create the jar on my local machine, do I need Hadoop installed locally? How should I do that? And what is a good learning path from installation to being comfortable working with Hadoop?
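For the build side: you do not need a full Hadoop installation on the Windows machine just to compile and package a job; having the Hadoop 2.8 client dependencies (e.g. the hadoop-client artifact) on the Maven or sbt build path of the Eclipse project is enough. Below is a minimal sketch of the kind of job that fits the "write locally, build a jar, copy it over, submit it" workflow, written in Scala against the Hadoop MapReduce API; all class and object names are illustrative.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Splits each input line into tokens and emits (token, 1).
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { t =>
          word.set(t)
          ctx.write(word, one)
        }
      }
    }

    // Sums the counts for each token.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum))
      }
    }

    // Driver: this is the class you name when submitting the jar on the cluster.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(this.getClass)
        job.setMapperClass(classOf[TokenMapper])
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }

Once it compiles, package the jar (for example with mvn package or sbt package, keeping the Hadoop dependencies as provided), scp it to the instance, and submit it with hadoop jar <your-jar> WordCount <hdfs-input> <hdfs-output>; the input and output paths refer to HDFS on the instance, not to the local filesystem.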

Related

How to integrate Eclipse IDE with Databricks Cluster

I am trying to integrate my Scala Eclipse IDE with my Azure Databricks Cluster so that I can directly run my Spark program through Eclipse IDE on my Databricks Cluster.
I followed the official Databricks Connect documentation (https://docs.databricks.com/dev-tools/databricks-connect.html).
I have:
Installed Anaconda.
Installed Python 3.7 and the Databricks Connect library 6.0.1.
Completed the Databricks Connect configuration (the CLI part).
Added the client libraries to the Eclipse IDE.
Set the SPARK_HOME environment variable to the path returned by running 'databricks-connect get-jar-dir' in Anaconda.
I have not set any other environment variables apart from the one mentioned above.
I need help figuring out what else needs to be done to accomplish this integration, e.g. how the connection-related environment variables work when running through the IDE.
If someone has already done this successfully, please guide me.
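For reference, once databricks-connect configure has been run, the client-side program itself usually needs no Databricks-specific settings: building a plain SparkSession is enough for the work to be routed to the remote cluster. A minimal sketch (the object name is illustrative):

    import org.apache.spark.sql.SparkSession

    object DatabricksConnectSmokeTest {
      def main(args: Array[String]): Unit = {
        // With Databricks Connect configured, this SparkSession talks to the
        // cluster described by `databricks-connect configure`.
        val spark = SparkSession.builder().getOrCreate()

        // A trivial job: if this prints locally, the computation ran on the
        // Databricks cluster and the result was shipped back.
        println(spark.range(100).selectExpr("sum(id)").collect().mkString(", "))

        spark.stop()
      }
    }

If a program like this fails from the IDE, the usual suspects are SPARK_HOME pointing somewhere other than the directory returned by databricks-connect get-jar-dir, or regular Spark jars on the Eclipse build path shadowing the Databricks Connect client jars.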

Run MATLAB runtime code against files on hadoop

Here's the situation: we are doing a POC for a customer that analyses medical images for radiotherapy using MATLAB. They've automated everything with a nice framework based on micro-services, so everything is automated from the moment a file arrives at a specific (Windows) file location until everything is processed and analysed.
The customer has asked us to do a POC with their software to see if it runs faster on Hadoop. The idea we have at this time is to run everything on Windows, but keep the files stored on Hadoop. That means the MATLAB runtime has to connect to HDFS. Our Hadoop system is actually Microsoft HDInsight, as part of a Microsoft APS.
They have compiled their own runtime, but we also have the source code of the runtime. We cannot install anything on the Windows servers where Hadoop is installed. We do have a server that is able to connect to the Hadoop servers and where we have installed the MATLAB runtime.
How do we proceed with this?
Is it possible to put the files that need to be analysed on Hadoop and execute the MATLAB runtime from a Windows machine?
Are my thoughts on this matter too simple? Or do we need to write map/reduce code that needs to be integrated in the MATLAB runtime?
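As a point of reference, a client machine that can reach the cluster can read and write HDFS through the Hadoop client libraries (or WebHDFS) without installing anything on the Hadoop nodes themselves. Below is a minimal Scala sketch of staging a file from HDFS onto the server that hosts the MATLAB runtime; the namenode address, file names and local path are placeholders, and on Windows the Hadoop client typically also needs winutils.exe available.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object FetchFromHdfs {
      def main(args: Array[String]): Unit = {
        // Placeholder namenode URI; on HDInsight the actual filesystem URI comes
        // from the cluster's client configuration (often a wasb:// address).
        val fs = FileSystem.get(new URI("hdfs://namenode-host:8020"), new Configuration())

        // Copy one incoming image down to a local path the MATLAB runtime can read.
        fs.copyToLocalFile(new Path("/incoming/scan-0001.dcm"),
                           new Path("C:/staging/scan-0001.dcm"))
        fs.close()
      }
    }

Whether this is actually faster than the existing file share is a separate question; Hadoop mainly pays off when the processing itself is distributed (for example a job per image on the cluster) rather than when HDFS is used only as remote storage.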

Automating Development Environment Setup

I've just started work at a new company and I am looking to automate as much as possible of their process for setting up a development environment for new starters on their computers.
Setting up a machine with a working development environment involves:
Checking out 4 different projects
Invoking maven to build and install those projects
Starting JBoss fuse
Running various Windows .bat files
Starting JBoss Portal
At the moment I am considering writing a script in Scala to do the above, relying heavily on scala.sys.process.
I am not too clued up on sbt at the moment and was wondering whether it is better suited for this type of task, or whether I am on the right track writing my own custom setup script in Scala.
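For what it's worth, sbt is primarily a build tool, so a plain Scala program driving scala.sys.process is a reasonable fit for this kind of orchestration. A minimal sketch of that approach; the repository URLs, workspace directory and .bat file names are placeholders for whatever the real environment uses:

    import java.io.File
    import scala.sys.process._

    object DevEnvSetup {
      // Run a command in the given working directory and fail fast on a non-zero exit.
      def run(cmd: Seq[String], cwd: File): Unit = {
        val exit = Process(cmd, cwd).!
        require(exit == 0, s"Command failed ($exit): ${cmd.mkString(" ")}")
      }

      def main(args: Array[String]): Unit = {
        val workspace = new File("C:/dev/workspace")
        val projects  = Seq("project-a", "project-b", "project-c", "project-d")

        // 1. Check out the four projects (placeholder repository URLs).
        projects.foreach { p =>
          run(Seq("git", "clone", s"https://example.org/scm/$p.git"), workspace)
        }

        // 2. Build and install each project with Maven (via cmd so mvn.cmd resolves on Windows).
        projects.foreach { p =>
          run(Seq("cmd", "/c", "mvn", "clean", "install"), new File(workspace, p))
        }

        // 3. Start JBoss Fuse, run the .bat files, start JBoss Portal (placeholder scripts).
        run(Seq("cmd", "/c", "start-fuse.bat"), workspace)
        run(Seq("cmd", "/c", "various-setup.bat"), workspace)
        run(Seq("cmd", "/c", "start-portal.bat"), workspace)
      }
    }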

Yahoo hadoop tutorial

I am trying to follow the Yahoo hadoop tutorial:
http://developer.yahoo.com/hadoop/tutorial/module3.html#vm
Everything is fine until I try to connect my Eclipse IDE to the Hadoop server process according to the "Getting Started With Eclipse" section. The short story is that my "map reduce location" / DFS Location keeps coming back with "Error: null". My VM is running and I can ping it from my PC. The Hadoop server is running, as I have run the Pi example.
My PC runs Windows XP and there is no "hadoop.job.ugi" entry in the Advanced list for the Hadoop location. What does "/hadoop/mapred/system" refer to? There is no such directory in the Hadoop installation that you install from the tutorial. It seems like a pretty important directory, judging by the name of the field. I have gone into the advanced settings and switched any reference to my WinXP login (Ben) over to "hadoop-user". It is easy to find in the VM the folder locations it is looking for, like "/tmp/hadoop-hadoop-user/mapred/temp".
Am I right in thinking I can run Eclipse in the WinXP environment and connect to the VMware process via its IP address? Isn't that the point of the article? It does not work.
You read it right. The Eclipse plugin for Hadoop has a lot of caveats, and there are a couple of things that are not well documented. See the second answer by Icn on Installing Hadoop's Eclipse Plugin. Hopefully that will solve the problem.
"/hadoop/mapred/system" refers to a directory inside HDFS, so you won't see it from the terminal using ls.
I did see "hadoop.job.ugi" in the Advanced list, and succeeded in connecting to the VM by following the instructions there.
Are you using the recommended version of Eclipse (3.3.1)?

How do you build your appliances?

Virtual machines hold great promise as a way to distribute hard-to-configure applications. I have been using JeOS vmbuilder (and some bash scripts) to generate my appliances, but I'm looking for something more elegant.
In my case, I'm looking for a solution that will build a Linux-based VM with configured versions of Tomcat and MySQL as a base. Each future release would be a new WAR file and a SQL update script. It'd be really nice if already-deployed VMs could self-update and if test builds could be pushed to EC2.
In my brief search, I've found rPath rBuilder, TurnKey Linux, Vagrant, SUSE Studio, JeOS vmbuilder, and VMware Studio. Rather than try all of these, I figured I'd ask what this community uses to build and distribute appliances...
I use pungi myself.