OpenShift cloud computing configuration: is it possible entirely on the cloud? - scala

Is it possible to build a big data application on the cloud with Red Hat's PaaS OpenShift? I'm looking into how to build a Scala application on the cloud with Hadoop (HDFS), Spark, and Apache Mahout, but I can't find anything about it. I've seen something with Hortonworks, but nothing clear about how to install it in an OpenShift environment, nor how to add HDFS nodes in the cloud. Is it possible with OpenShift?
It's possible on Amazon, but my question is: is it possible on OpenShift?

It really depends on what you're ultimately trying to achieve. I know you mention building a big data application on OpenShift with Scala, but what will the application ultimately be doing?
I've gotten Hadoop running in a gear before, but if you want a better example, check out this quickstart to get an idea of how it's done: https://github.com/ryanj/flask-hbase-todos. I know it's not Scala, but here's a good article that will show you how to put together a Scala app: https://www.openshift.com/blogs/building-distributed-and-event-driven-applications-in-java-or-scala-with-akka-on-openshift.
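If it helps to picture the Scala side, here is a minimal classic-Akka actor in the spirit of that article (a sketch only; the actor and message names are invented here, not taken from the article):

    import akka.actor.{Actor, ActorSystem, Props}

    // Hypothetical message type for this sketch
    case class Ping(msg: String)

    // A minimal actor that just prints what it receives
    class Echo extends Actor {
      def receive = {
        case Ping(m) => println(s"got: $m")
      }
    }

    object Main extends App {
      val system = ActorSystem("demo")
      val echo = system.actorOf(Props[Echo], "echo")
      echo ! Ping("hello from OpenShift")
      Thread.sleep(100)   // crude: give the mailbox time before shutdown
      system.terminate()
    }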

What will the application ultimately be doing?
Forecasting football match results for several football leagues: a web application (Ruby), plus statistical computation and data mining calculations in Scala with Apache frameworks (Spark and Mahout).
We get the info via CSV files, process it, and save it in a NoSQL DB (Cassandra).
And all of this on the cloud (OpenShift); that's the idea.
I've seen the info at https://github.com/ryanj/flask-hbase-todos. I'll try it that way, but with Scala.
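For what it's worth, a rough sketch of that CSV-to-Cassandra step with Spark and the DataStax spark-cassandra-connector could look like the following; the file path, keyspace, table, and column names are all made up for illustration, and the connector jar must be on the classpath:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._  // spark-cassandra-connector

    object CsvToCassandra extends App {
      val conf = new SparkConf()
        .setAppName("football-results")
        .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra
      val sc = new SparkContext(conf)

      // Parse CSV lines of the form: league,home,away,homeGoals,awayGoals
      val results = sc.textFile("matches.csv").map(_.split(",")).collect {
        case Array(league, home, away, hg, ag) => (league, home, away, hg.toInt, ag.toInt)
      }

      // Hypothetical keyspace "football" and table "results"
      results.saveToCassandra("football", "results",
        SomeColumns("league", "home", "away", "home_goals", "away_goals"))
    }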

Related

Setting up a workspace to build on Apache Atlas

I'm exploring options for open source data catalog tools that can provide metadata features like:
Open source
Search and discovery
Lineage tracking
Tagging
I found Apache Atlas to be a good candidate to start working on, since it does not currently have connectors to Google Cloud Platform.
I've spent a lot of time figuring out how the platform works, but I need to understand how I can start writing connectors to support Google Cloud Platform. Is there any documentation to get started?
I went through this link: https://atlas.apache.org/#/EclipseSetup, which talks about how to set up the environment in Eclipse, but I'm not aware of how to actually start building and testing the new code I'm thinking of writing.
I think there are a lot of components at play and I'm too much of a noob to get started on this.
TL;DR:
The detail I'm looking for is: once I write the code, how do I test that it will work after I package the application?
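One way to approach testing (an assumption on my part, not from official Atlas docs): run a local Atlas server and exercise its v2 REST API directly, since a connector ultimately registers entities through those same endpoints. A sketch using the built-in hdfs_path type; the host, port, and credentials below are Atlas's usual defaults, and the payload values are invented:

    import java.net.{HttpURLConnection, URL}
    import java.util.Base64

    object AtlasEntityPost extends App {
      // Assumption: a local Atlas server with default admin/admin credentials
      val url  = new URL("http://localhost:21000/api/atlas/v2/entity")
      val conn = url.openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      val auth = Base64.getEncoder.encodeToString("admin:admin".getBytes("UTF-8"))
      conn.setRequestProperty("Authorization", s"Basic $auth")
      conn.setDoOutput(true)

      // Hypothetical payload; a real GCP connector would build these
      // attributes from GCS/BigQuery metadata instead of an hdfs_path
      val body =
        """{"entity": {"typeName": "hdfs_path",
          |  "attributes": {"qualifiedName": "/demo/file.csv@dev",
          |                 "name": "file.csv",
          |                 "path": "/demo/file.csv"}}}""".stripMargin
      val os = conn.getOutputStream
      os.write(body.getBytes("UTF-8"))
      os.close()
      println(s"Atlas responded with HTTP ${conn.getResponseCode}")
    }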

Automatically pulling REST API data to visualize it in Apache Superset

I work in a large enterprise and have a project to build some custom automated dashboards for our IT department; the small amount of data needs to be fetched only from REST API endpoints. The process needs to be fully automated, and there is not enough time to build a custom API wrapper. For this I was going to use the Apache Airflow + Apache Superset tools. I have been googling for a couple of days for an easier open source solution than Apache Airflow to move data from the REST API endpoints and visualize it in Superset. Please share your experience: what would you choose instead of Apache Airflow?
I chose to go with the following solution:
Apache Airflow + PostgreSQL + Grafana (instead of Superset, because in Grafana you can actually create a drill-down option using a workaround)
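Whichever orchestrator ends up doing the scheduling, the recurring task itself reduces to: fetch the endpoint, write the payload into PostgreSQL, and point Grafana (or Superset) at the table. A minimal sketch of that step, in Scala rather than an Airflow DAG; the endpoint, table, and credentials are placeholders, and the PostgreSQL JDBC driver must be on the classpath:

    import java.sql.DriverManager
    import scala.io.Source

    object FetchAndLoad extends App {
      // Assumption: an unauthenticated endpoint returning a text/JSON payload
      val payload = Source.fromURL("https://example.com/api/metrics").mkString

      // Assumption: a raw_metrics(fetched_at, body) table already exists
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/dashboards", "grafana", "secret")
      try {
        val st = conn.prepareStatement(
          "INSERT INTO raw_metrics (fetched_at, body) VALUES (now(), ?)")
        st.setString(1, payload)
        st.executeUpdate()
      } finally conn.close()
    }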

Apache Kylin and PostgreSQL

I'm a student working on my final-year project; the project is about data warehousing, BI, etc.
I've been asked to work with Apache Kylin, so I did some research about it and learned a bit.
I looked into whether it is possible to use PostgreSQL as the data warehouse and make it communicate with Apache Kylin to build cubes, but I found nothing.
So would you please answer the following question:
Is it possible to make Apache Kylin communicate with a PostgreSQL DWH?
And if there is some hidden documentation about it, would you please share it?
Time is running out and I really appreciate your answers and guidance.
Thanks in advance.
Khalil
It's doable. Kylin provides a data source adapter for JDBC data sources, and PostgreSQL can be one of them; MySQL is supported by default. You can check this link to learn more: http://kylin.apache.org/development/datasource_sdk.html
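Before building out the adapter, it's worth confirming that the JDBC URL Kylin will use actually reaches your warehouse. A minimal connectivity check; the host, database, and credentials are placeholders, and the PostgreSQL JDBC driver jar (the same jar you would later put on Kylin's classpath) must be available:

    import java.sql.DriverManager

    object PgJdbcCheck extends App {
      // Placeholder connection details; mirror whatever you configure in Kylin
      val url  = "jdbc:postgresql://localhost:5432/dwh"
      val conn = DriverManager.getConnection(url, "kylin", "secret")
      try {
        val rs = conn.createStatement().executeQuery("SELECT version()")
        while (rs.next()) println(rs.getString(1)) // prints the PostgreSQL version
      } finally conn.close()
    }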

Can Eclipse/IntelliJ IDEA be used to execute code on the cluster?

Production system: HDP 2.5.0.0 using Ambari 2.4.0.1.
Plenty of demands are coming in for executing a range of code (Java MR etc., Scala, Spark, R) atop HDP, but from a desktop Windows-machine IDE.
For Spark and R, we have RStudio set up.
The challenge lies with Java, Scala and so on; also, people use a range of IDEs from Eclipse to IntelliJ IDEA.
I am aware that the Eclipse Hadoop plugin is NOT actively maintained and has plenty of bugs when working with the latest versions of Hadoop; for IntelliJ IDEA I couldn't find reliable information on the official website.
I believe the Hive and HBase client APIs are a reliable way to connect from Eclipse etc., but I am skeptical about executing MR or other custom Java/Scala code.
I referred to several threads like this and this; however, I still have the question: does any IDE like Eclipse/IntelliJ IDEA have official support for Hadoop? Even Spring Data for Hadoop seems to have lost traction, and it didn't work as expected two years ago anyway ;)
As a realistic alternative, which tool/plugin/library should be used to test MR and other Java/Scala code 'locally', i.e. on the desktop machine using a standalone version of the cluster?
Note: I do not wish to work against/in the sandbox; this is about connecting to the prod cluster directly.
I don't think there is a general solution that works for all Hadoop services equally; each has its own development, testing, and deployment scenarios, as they are different standalone products. For the MR case you can use MRUnit to simulate your work locally from the IDE. Another option is LocalJobRunner. They both allow you to check your MR logic directly from the IDE. For Storm you can use the backtype.storm.Testing library to simulate a topology's workflow. But all of these are used from the IDE without direct cluster communication, unlike the Spark and RStudio integration.
As for the MR recommendation, your job should ideally pass through the following lifecycle: write the job and test it locally using MRUnit, then run it on a development cluster with some test data (see MiniCluster as an option), and then run it on the real cluster with some custom counters that will help you locate malformed data and properly maintain the job.
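To illustrate the MRUnit option, a test of a word-count style mapper might look like this in Scala (the mapper below is a made-up stand-in for your own job's mapper, not anything from the HDP stack):

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.Mapper
    import org.apache.hadoop.mrunit.mapreduce.MapDriver

    // Hypothetical mapper under test: emits (word, 1) per token
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one = new IntWritable(1)
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").foreach(w => ctx.write(new Text(w), one))
    }

    object TokenMapperTest extends App {
      val driver = MapDriver.newMapDriver(new TokenMapper)
      driver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .runTest() // fails if the mapper's output differs from the expectation
    }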

HDFS web interface alternative

Alright, this is annoying! I am new to Hadoop, and I am trying to find a decent alternative to the basic HDFS web interface. I tried the Hadoop Eclipse plugin, but it seems it's outdated already, and it's a pain to set up correctly! I have Cloudera's distribution installed and I heard about Cloudera Desktop, but it's no longer available. Can anybody tell me a decent alternative to the HDFS web interface where I can easily download and upload files to HDFS via a GUI? P.S. I am running everything locally, no cluster involved. I've tried hard to find something, but nothing seems to point in the right direction.
You can use WebHDFS, whose REST API supports the complete FileSystem interface for HDFS: http://hadoop.apache.org/docs/r1.0.4/webhdfs.html
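For example, listing a directory is a single HTTP GET against the NameNode's web port (50070 by default in Hadoop 1.x, assuming dfs.webhdfs.enabled is set):

    import scala.io.Source

    object WebHdfsList extends App {
      // Assumption: local pseudo-distributed NameNode with WebHDFS enabled
      val url = "http://localhost:50070/webhdfs/v1/tmp?op=LISTSTATUS&user.name=hdfs"
      println(Source.fromURL(url).mkString) // prints the JSON FileStatuses for /tmp
    }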
OR
You can integrate Hadoop with Hoop (HDFS over HTTP), which is used to access HDFS via the HTTP protocol. Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.
For more details please refer to:
http://bigobject.blogspot.in/2013/03/hoop-https-over-hdfs.html
You can also use HttpFS as an alternative to Hoop (it is interoperable with the WebHDFS REST API):
http://bigobject.blogspot.in/2013/03/apache-hadoop-httpfs-service-that.html