Can Eclipse/IntelliJ IDEA be used to execute code on the cluster?

Production system: HDP 2.5.0.0 using Ambari 2.4.0.1
Plenty of demands are coming in for executing a range of code (Java MR etc., Scala, Spark, R) atop HDP, but from an IDE on a desktop Windows machine.
For Spark and R, we have RStudio set up.
The challenge lies with Java, Scala, and so on; moreover, people use a range of IDEs, from Eclipse to IntelliJ IDEA.
I am aware that the Eclipse Hadoop plugin is NOT actively maintained and also has plenty of bugs when working with the latest versions of Hadoop; for IntelliJ IDEA, I couldn't find reliable information on the official website.
I believe the Hive and HBase client APIs are a reliable way to connect from Eclipse etc., but I am skeptical about executing MR or other custom Java/Scala code.
I have referred to several threads like this and this; however, I still have the question: does any IDE like Eclipse/IntelliJ IDEA have official support for Hadoop? Even Spring Data for Hadoop seems to have lost traction; it didn't work as expected two years ago anyway ;)
As a realistic alternative, which tool/plugin/library should be used to test MR and other Java/Scala code 'locally', i.e. on the desktop machine, using a standalone version of the cluster?
Note: I do not wish to work against/in the sandbox; it's about connecting to the prod cluster directly.

I don't think there is a general solution that would work equally for all Hadoop services. Each has its own development, testing, and deployment scenarios, as they are different standalone products. For the MR case you can use MRUnit to simulate your work locally from the IDE; another option is LocalJobRunner. Both allow you to check your MR logic directly from the IDE. For Storm you can use the backtype.storm.Testing library to simulate a topology's workflow. All of these are used from the IDE without direct cluster communication, unlike the Spark and RStudio integration.
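For example, here is a minimal MRUnit sketch (Scala calling MRUnit's Java API; WordCountMapper is a hypothetical mapper, defined inline so the example is self-contained):

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.Mapper
    import org.apache.hadoop.mrunit.mapreduce.MapDriver

    // Hypothetical mapper under test: emits (word, 1) for each token in the line.
    class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      override def map(key: LongWritable, value: Text,
                       context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").foreach(w => context.write(new Text(w), new IntWritable(1)))
    }

    object WordCountMapperTest {
      def main(args: Array[String]): Unit = {
        // MapDriver feeds one record to the mapper and checks its output,
        // entirely in-process; no cluster communication happens.
        MapDriver.newMapDriver(new WordCountMapper)
          .withInput(new LongWritable(0L), new Text("hadoop hadoop"))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .runTest() // throws if the actual output differs from the expected pairs
      }
    }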
As for the MR recommendation, your job should ideally pass through the following lifecycle: write the job and test it locally using MRUnit, then run it on a development cluster with some test data (see MiniCluster as an option), and then run it on the real cluster with some custom counters, which will help you locate your malformed data and properly maintain the job.
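For the local step, the LocalJobRunner option mentioned above can run the whole job in-process. A hedged sketch (the two property keys are the standard Hadoop 2.x names; the job configuration itself is a placeholder to be filled in as usual):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    object LocalSmokeTest {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        conf.set("mapreduce.framework.name", "local") // run in-process via LocalJobRunner, not YARN
        conf.set("fs.defaultFS", "file:///")          // read/write the local filesystem, not HDFS
        val job = Job.getInstance(conf, "local-smoke-test")
        // Set the mapper, reducer, input/output formats and paths as usual, then:
        // System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }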

Related

Confluence migration from cloud to server

We have migrated a space from a cloud instance to a server instance. In the cloud instance we were using "PlantUML Diagrams for Confluence", but on the server we are using the "Confluence PlantUML Plugin", so the macro names differ: the cloud macro is "plantumlcloud" but the server macro is "plantuml". After migration, pages therefore show that "plantumlcloud" is not a valid macro. Kindly help to resolve this.
In general, migrating Confluence spaces to another instance that is not running the same plugins will cause any functionality of those plugins to break.
If you migrate hosting platforms and have the equivalent version of the plugin for your new platform, created by the same developer, in most cases you will retain functionality; however, there will often be differences between versions.
These differences are found especially when downgrading, and moving from cloud to server is a definite example of a downgrade, as cloud always runs the latest version.
In general I would recommend against a migration from cloud to server, and when it must be done, time should be spent to ensure compatibility with all plugins, and migration guides and plans should be made and followed.
As commented by #tgdavies, there seems to be an equivalent version of the plugin you were using on cloud, so hopefully that can resolve your issue.

Scala Spark IntelliJ IDEA development process

I am currently using Spark to build my dimensional data model, and we are currently uploading the jar to an AWS EMR cluster to test. However, this is tedious and time-consuming for testing and building tables.
I would like to know what others are doing to speed up their development. One possibility I came across in my research is running Spark jobs directly from the IDE with IntelliJ IDEA, and I would like to hear about other development processes that make iteration faster.
The approaches I have tried so far are:
Installing Spark and HDFS on two or three commodity PCs and testing the code before submitting it to the cluster.
Running the code on a single node to avoid silly mistakes.
Submitting the jar file to the cluster.
What the first and third approaches have in common is building the jar file, which can take a lot of time. The second is not suitable for finding and fixing the bugs and problems that only arise in distributed environments.
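A minimal sketch of the local-mode loop mentioned in the question, i.e. running Spark directly from IntelliJ IDEA (the CSV path and column name are placeholders; the point is master("local[*]"), which runs Spark in-process so there is no jar to build or upload while iterating):

    import org.apache.spark.sql.SparkSession

    object LocalDevApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("local-dev")
          .master("local[*]") // all local cores; no EMR and no spark-submit needed
          .getOrCreate()

        // Iterate against a small sample file instead of the full data set.
        val df = spark.read.option("header", "true").csv("src/test/resources/sample.csv")
        df.groupBy("some_column").count().show()

        spark.stop()
      }
    }

Keep the .master(...) call out of the production build (or make it conditional) so the same jar can still be submitted to EMR with the cluster's own master setting.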

OpenShift cloud computing configuration: is it possible to do it completely on the cloud?

Is it possible to build a big data application on the cloud with Red Hat's PaaS OpenShift? I'm looking at how to build, on the cloud, a Scala application with Hadoop (HDFS), Spark, and Apache Mahout, but I can't find anything about it. I've seen something with Hortonworks, but nothing clear about how to install it in an OpenShift environment, or how to add HDFS nodes in the cloud. Is it possible with OpenShift?
It's possible on Amazon, but my question is: is it possible on OpenShift?
It really depends on what you're ultimately trying to achieve. I know you mention building a big data application on OpenShift with Scala, but what will the application ultimately be doing?
I've gotten Hadoop running in a gear before, but if you want a better example, check out this quickstart to get an idea of how it's done: https://github.com/ryanj/flask-hbase-todos. I know it's not Scala, but here's a good article that will show you how to put together a Scala app: https://www.openshift.com/blogs/building-distributed-and-event-driven-applications-in-java-or-scala-with-akka-on-openshift.
What will the application ultimately be doing?
Forecasting football match results for several football leagues; a web application (Ruby); and statistical computation and data mining calculations in Scala with Apache frameworks (Spark & Mahout).
We get the info via CSV files, process it, and save it in a NoSQL DB (Cassandra).
And all of this on the cloud (OpenShift); that's the idea.
I've seen the info at https://github.com/ryanj/flask-hbase-todos. I'll try that way, but with Scala.

Deploying an application without undeploying the previous one, and with no downtime?

I use GlassFish with Java and JSP over MySQL for my web applications. Many people use this web application online, and the website should not go down.
When I want to deploy a new WAR file, I have to undeploy the old version and then deploy the new one on the server.
My question is:
Is there any technology that avoids undeploying my application and just changes the appropriate classes, so there is no need to redeploy it again?
There are Java technologies that allow you to replace classes on the fly (like JRebel). But since you're already using GlassFish, you should just start using clustering, which is built into GlassFish. You'll need either 2.1 or 3.1, as 3.0 does not support clustering. With a GlassFish cluster, you have a load balancer (Apache, Sun Web Server, or hardware such as BIG-IP or Coyote) distribute the load among your cluster nodes. When you want to upgrade the app, you can do it one node at a time. Setting up the cluster is not the easiest thing in the world, but it is doable, and it gets you some great benefits: you'll be able to scale the load by adding new hardware, or even by using Amazon (or another provider's) cloud services, and you'll be able to keep your site running even if the hardware fails on one of the nodes.
Personally, I'm in the middle of converting from GlassFish 2.1 to 3.1. So far I like the management of the GlassFish 3.1 cluster much better, though I can't personally vouch for how it will run in production; I have high expectations.
http://download.oracle.com/docs/cd/E18930_01/html/821-2432/gktqx.html#gktob
Jim is right: the best solution currently is to use a cluster and perform a manual rolling upgrade.
But there is actually work ongoing to address your needs. We are working on a rolling-upgrade feature for a single standalone instance. In a nutshell (the specifications have not been published yet), it will let you switch from one application version to another (see application versioning and the enable command) with no downtime. Stay tuned.

Heroku-like services for Scala?

I love Heroku but I would prefer to develop in Scala rather than Ruby on Rails.
Does anyone know of any services like Heroku that work with Scala?
UPDATE: Heroku now officially supports Scala - see answers below for links
As of October 3rd 2011, Heroku officially supports Scala, Akka and sbt.
http://blog.heroku.com/archives/2011/10/3/scala/
Update
Heroku has just announced support for Java.
Update 2
Heroku has just announced support for Scala
Also
Check out Amazon Elastic Beanstalk.
To deploy Java applications using Elastic Beanstalk, you simply:
Create your application as you normally would using any editor or IDE (e.g. Eclipse).
Package your deployable code into a standard Java Web Application Archive (WAR file).
Upload your WAR file to Elastic Beanstalk using the AWS Management Console, the AWS Toolkit for Eclipse, the web service APIs, or the Command Line Tools.
Deploy your application.
Behind the scenes, Elastic Beanstalk handles the provisioning of a load balancer and the deployment of your WAR file to one or more EC2 instances running the Apache Tomcat application server. Within a few minutes you will be able to access your application at a customized URL (e.g. http://myapp.elasticbeanstalk.com/).
Once an application is running, Elastic Beanstalk provides several management features, such as:
Easily deploy new application versions to running environments (or roll back to a previous version).
Access built-in CloudWatch monitoring metrics such as average CPU utilization, request count, and average latency.
Receive e-mail notifications through Amazon Simple Notification Service when application health changes or application servers are added or removed.
Access Tomcat server log files without needing to log in to the application servers.
Quickly restart the application servers on all EC2 instances with a single command.
Another strong contender is Cloud Foundry. One of the nice features of Cloud Foundry is the ability to have a local version of "the cloud" running on your laptop so you can deploy and test offline.
I started working on exactly the same thing as what you describe a few weeks ago. I use Lift, which is a great framework and has a lot of potential, on top of a Linux chroot environment.
I'm done with a demo version, but Linux chroot is not that stable (nor secure), so I'm now switching to FreeBSD jails on Amazon EC2, and hopefully it'll be done soon.
http://lifthub.net/
There are also other Java hosting environments, including the VMForce mentioned above.
If you are looking for a custom setup which also has the ease of deployment that Heroku offers, try http://dotcloud.com. They are invite-only right now, but I was given access in under three days. I am working on a Lift/MongoDB project there and it works well.
Off the top of my head, only VMForce comes to mind, but it's not available yet. This will be a Java-oriented service, so that probably means you'll have to spend a wee bit of time figuring out how to package the app.
For more discussion, there was a debate about this in 2008.
I'm not entirely sure whether it's really suitable or not, but people have deployed Scala applications to Google App Engine; for example: http://mawson.wordpress.com/2009/04/10/first-steps-with-scala-on-google-app-engine/
Actually, you can run Scala on Heroku right now. You don't believe it?
https://github.com/lstoll/heroku-playframework-scala
I'm not sure the tricks lstoll has used are legit, but using the new Cedar platform, where you can run custom processes, and some ingenious Gemfile hacking, he has managed to bootstrap the Java Play platform into a process. It seems to work, as he has a live site running a test page.
The Stax cloud service offers a preconfigured Lift project skeleton. Also, there is a tutorial on how to deploy a Lift project to App Engine.