MongoDB data pipeline to Redshift using Apache Spark

As my employer makes the big jump to MongoDB, Redshift and Spark, I am trying to be proactive and get hands-on with each of these technologies. Could you please point me to any resources that would be helpful in performing this task -
"Creating a data pipeline using Apache Spark to move data from MongoDB to RedShift"
So far I have been able to download a dev version of MongoDB and create a test Redshift instance. How do I go about setting up the rest of the process and getting my feet wet?
I understand that to create the data pipeline using Apache Spark, one has to code in Scala, Python, or Java. I have a solid understanding of SQL, so feel free to suggest which of Scala, Python, or Java would be easiest for me to learn.
My background is in data warehousing and traditional ETL (Informatica, DataStage, etc.).
Thank you in advance :)

A really good approach may be to use AWS Database Migration Service (DMS):
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
You can specify MongoDB as the source endpoint and Redshift as the target endpoint.
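Since the question specifically asks about an Apache Spark pipeline, here is a minimal PySpark sketch of the MongoDB-to-Redshift flow. It assumes the MongoDB Spark Connector (10.x) and the community spark-redshift connector are on the classpath; the connection URIs, database/collection names, S3 bucket, and credentials below are placeholders, not values from the question.

```python
# Sketch only: read a MongoDB collection into a DataFrame, then write it to
# Redshift via an S3 staging directory. All names/URIs are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-to-redshift")
         # assumes the MongoDB Spark Connector 10.x is available
         .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
         .config("spark.mongodb.read.database", "mydb")
         .config("spark.mongodb.read.collection", "orders")
         .getOrCreate())

# Read the MongoDB collection into a DataFrame
mongo_df = spark.read.format("mongodb").load()

# Flatten/select the fields you actually want in the Redshift table
flat_df = mongo_df.select("customer_id", "order_total", "created_at")

# Write to Redshift using the community spark-redshift connector,
# which stages the data in S3 and issues a COPY on the Redshift side
(flat_df.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev?user=awsuser&password=***")
    .option("dbtable", "public.orders")
    .option("tempdir", "s3a://my-bucket/tmp/")
    .option("forward_spark_s3_credentials", "true")
    .mode("append")
    .save())
```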

Related

Migrate data from NoSQL to an RDBMS

We have data in HBase and we want to move to AWS Aurora (MySQL). We need to keep the existing data, so we have to somehow load the NoSQL data into Aurora.
It's not a very big database, just a few tables.
Are there any best practices/tools for migrating data from NoSQL to a relational DB? I saw a lot of questions on the internet that ask about the reverse (RDBMS -> NoSQL), but my requirement is a bit different and I can't find any helpful information.
Can someone please help? Where do I even start?
One simple way to do this without writing too much custom code would be to use the Spark-HBase Connector from Hortonworks (SHC) to read data from an HBase table into a Spark DataFrame, and then write that DataFrame into a MySQL table. The key challenge is getting SHC to work, because in my experience it's extremely version-sensitive. So the trick is to correctly coordinate your versions of Spark, HBase, and SHC (and finding the right combination is trickier than you may think).
However, if you manage to get all the dependencies right, doing the above is a matter of a few lines of code in a Jupyter notebook or PySpark (a rough sketch follows below). You could run this on YARN to parallelize the workload, in case it's large. Should work. Give it a try.
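As a rough sketch of those "few lines", here is what the read-from-SHC / write-over-JDBC flow could look like. The HBase table name, column family, catalog column mappings, and the Aurora endpoint are assumptions made up for illustration.

```python
# Sketch: HBase -> Spark DataFrame (via SHC) -> Aurora MySQL (via JDBC).
# Assumes the SHC jar and a MySQL JDBC driver are on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-to-aurora").getOrCreate()

# SHC catalog mapping the HBase rowkey/columns to DataFrame columns
# (namespace, table, column family and column names are illustrative)
catalog = """{
  "table": {"namespace": "default", "name": "customers"},
  "rowkey": "key",
  "columns": {
    "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
    "name":  {"cf": "cf",     "col": "name",  "type": "string"},
    "email": {"cf": "cf",     "col": "email", "type": "string"}
  }
}"""

hbase_df = (spark.read
    .options(catalog=catalog)
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load())

# Write the DataFrame into Aurora MySQL over JDBC
(hbase_df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://my-aurora-endpoint:3306/mydb")
    .option("dbtable", "customers")
    .option("user", "admin")
    .option("password", "***")
    .mode("append")
    .save())
```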

Integrate PMML to MongoDB

I have built a supervised learning model in R and exported the model/decision rules in PMML format. I was hoping I could link the PMML straightforwardly to MongoDB using something like the JPMML library (as JPMML integrates well with PostgreSQL).
However, it seems the only way to link MongoDB to my PMML XML file is to use Cascading Pattern through Hadoop. Since my dataset isn't large (<50 GB), I don't really need Hadoop.
Has anyone used PMML with MongoDB before in a way that doesn't involve going down the Hadoop route? Many thanks.
Basically, you have two options here:
1. Convert the PMML file to something that you can execute inside MongoDB.
2. Deploy the PMML file "natively" to some outside service and connect MongoDB to it.
50 GB is still quite a lot of data, so option #1 is clearly preferable in terms of ease of setup and speed of execution. Is it possible to write a Java user-defined function (UDF) for MongoDB? If so, it would be possible to run the JPMML library inside MongoDB. Otherwise, you might see whether it is possible to convert your PMML model to an SQL script. For example, the latest versions of the KNIME software (2.11.1 and newer) contain a "PMML to SQL" conversion node.
If you fall back to option #2, then the following technical article might provide some inspiration: Applying predictive models to database data: the REST web service approach. A small sketch of that approach (scoring MongoDB documents against an external PMML service) is shown below.
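For illustration, here is what option #2 could look like from the MongoDB side, assuming a JPMML-based REST scoring service (for example Openscoring) is already running. The service URL, model name, and input field names are assumptions, not part of the original question.

```python
# Sketch of option #2: score MongoDB documents against a PMML model deployed
# to an external JPMML-based REST service. Endpoint, model name, and field
# names ("age", "income") are illustrative assumptions.
import requests
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["observations"]

SCORING_URL = "http://localhost:8080/openscoring/model/MyDecisionTree"

for doc in collection.find({}, {"_id": 1, "age": 1, "income": 1}):
    payload = {"id": str(doc["_id"]),
               "arguments": {"age": doc["age"], "income": doc["income"]}}
    resp = requests.post(SCORING_URL, json=payload, timeout=10)
    resp.raise_for_status()
    prediction = resp.json().get("results", {})
    # Write the prediction back onto the source document
    collection.update_one({"_id": doc["_id"]}, {"$set": {"prediction": prediction}})
```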

HBase or Mongo for an Analytics DB if already using Hadoop?

I currently have a Hadoop cluster where I store tons of logs, over which I run Pig scripts to calculate aggregated analytics. I also have a Mongo cluster where I store production data.
I've recently been put in a position where I need to do a lot of one-off analytics queries, or enable others to do them. These queries frequently need to use both production data and log data together, so whatever I go with, I'd like to have everything in one place. My log data is in json and about 10x the size of my prod data. Here are the pros/cons of Mongo and HBase I'm seeing:
Mongo Pros/ HBase Cons:
Since the log data is in JSON, I can get it into Mongo pretty easily, and I can do this in real time as it comes in, through something like Fluentd.
Most people I work with already have experience writing Mongo queries from needing to work with prod data, so getting an analytics db up on Mongo would be very simple for everyone to use.
I know much less about HBase than Mongo.
No idea how easy/difficult it would be to get JSON data, or data from Mongo, into HBase. I imagine this isn't so bad, but I don't see much documentation.
HBase Pros/Mongo Cons:
My log data is much bigger than my prod data, so storing it in both Hadoop and Mongo would be far more expensive than storing just my prod data in both.
I can build HBase on top of my already running Hadoop cluster and fit my prod data in there without adding many extra machines. If I went with Mongo, I'd need a whole new Mongo cluster.
I could use Phoenix on top of HBase to allow a simple SQL syntax for accessing all our data, but I'm not sure how unwieldy this would be for multi-level, document-based data.
I know very little about HBase currently, and I wouldn't consider myself a Mongo expert, so I'm probably missing a lot.
So, what am I missing, and which is right for my situation?
First of all, you should use something you can already handle. Therefore, MongoDB seems a good choice, especially since the data is already in JSON format.
On the other hand, I used HBase for quite a while and its read performance is amazing even with a lot of rows, but I really don't know whether there is any good and fast integration of MongoDB with Hadoop.
HBase is the Hadoop database, so it is predestined to work together with Hadoop.
If the logs can be indexed (in the HBase rowkey) by:
producing_program_identifier, timestamp, ...
then HBase could work quite well for this query pattern.
But if you decide on HBase, use the Phoenix framework; it will save you time by giving you familiar interfaces like JDBC and SQL-like queries. It also provides simple aggregation functions (count, avg, max, min), which may be sufficient. A minimal sketch of querying through Phoenix is shown below.
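As a small sketch of what that could look like from Python, here is an aggregation query issued through the phoenixdb client (which talks to the Phoenix Query Server). The table name, column names, and rowkey layout are assumptions following the rowkey pattern described above.

```python
# Sketch: aggregate over HBase data through Phoenix using the phoenixdb client.
# Assumes a Phoenix Query Server at localhost:8765 and a LOGS table whose
# rowkey starts with producing_program_identifier + timestamp (illustrative).
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# Count events and average duration per producing program for one day
cursor.execute("""
    SELECT producing_program_identifier,
           COUNT(*)          AS events,
           AVG(duration_ms)  AS avg_duration_ms
    FROM logs
    WHERE ts >= TO_TIMESTAMP('2015-01-01 00:00:00')
      AND ts <  TO_TIMESTAMP('2015-01-02 00:00:00')
    GROUP BY producing_program_identifier
""")

for program, events, avg_duration_ms in cursor.fetchall():
    print(program, events, avg_duration_ms)
```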
From what you're saying, it seems a MongoDB-based solution would work best for you.
HBase is extremely versatile and you can get it to serve both your prod needs and your analytics needs. However, the general-purpose SQL capabilities (in Phoenix, Cloudera's Impala and others) are in their infancy, and the standard HBase way to get high query performance (designing the data structure for reads) will take a lot of effort (especially since you don't have experience with HBase).
By the way, it may make sense for you to use map/reduce to pre-aggregate the data and then load the aggregates into MongoDB, and thus utilize your current setup better rather than change it either way. A sketch of that idea follows.
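For illustration only: the answer suggests map/reduce for the pre-aggregation step; the sketch below uses PySpark as a stand-in for that step and the MongoDB Spark Connector (10.x assumed) for the load. The HDFS path, field names, and target database/collection are made-up placeholders.

```python
# Sketch of "pre-aggregate on Hadoop, then load the small result into MongoDB".
# PySpark stands in for raw MapReduce here; all names/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("log-preaggregation")
         .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
         .config("spark.mongodb.write.database", "analytics")
         .config("spark.mongodb.write.collection", "daily_stats")
         .getOrCreate())

# Read the raw JSON logs from HDFS
logs = spark.read.json("hdfs:///logs/2015/01/*")

# Aggregate down to one row per (day, event_type) before loading into MongoDB
daily = (logs
         .withColumn("day", F.to_date("timestamp"))
         .groupBy("day", "event_type")
         .agg(F.count("*").alias("events"),
              F.avg("duration_ms").alias("avg_duration_ms")))

# Write the much smaller aggregate into MongoDB for ad-hoc querying
daily.write.format("mongodb").mode("append").save()
```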

HBase and Elasticsearch integration like MongoDB river

I am kind of new to both Elasticsearch and HBase, but for a research project I would like to combine the two. My research project mainly involves searching through a large collection of documents (doc, pdf, msg, etc.) and extracting named entities from the documents through MapReduce jobs running on the documents stored in HBase.
Does anyone know if there is something similar to the MongoDB river plugin for HBase? Or can anyone point me to some documentation about integrating Elasticsearch and HBase? I have looked on the internet for any documentation, but unfortunately without any luck.
Kind regards,
Martijn
I don't know of any Elasticsearch-HBase integrations, but there are a few Solr and HBase integrations that you can use, like Lily and SolBase.
Tell me what you think about this: https://github.com/posix4e/Elasticsearch-HBase-River. It uses HBase log shipping to reliably handle updates and deletes from HBase into an Elasticsearch cluster. It could easily be extended to do n-regionserver-to-m-Elasticsearch-server replication.
You can use the Phoenix JDBC driver + the ES JDBC river, as shown here: http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html
I don't know of any packaged solutions, but as long as your MapReduce preps the data in the right way, it should be fairly easy to write a simple batch job in the programming language of your choice that reads from HBase and submits to Elasticsearch (see the sketch below).
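As one possible shape for such a batch job, here is a sketch that scans an HBase table via happybase (which requires the HBase Thrift gateway) and bulk-indexes the rows into Elasticsearch. The table name, column family, and index name are assumptions for illustration.

```python
# Sketch: read rows from HBase (happybase / Thrift) and bulk-index them into
# Elasticsearch. Table, column-family, and index names are placeholders.
import happybase
from elasticsearch import Elasticsearch, helpers

hbase = happybase.Connection("localhost")        # HBase Thrift server
table = hbase.table("documents")
es = Elasticsearch("http://localhost:9200")

def actions():
    # Each HBase row becomes one Elasticsearch document
    for row_key, data in table.scan(columns=[b"cf:title", b"cf:entities"]):
        yield {
            "_index": "documents",
            "_id": row_key.decode("utf-8"),
            "_source": {
                "title": data.get(b"cf:title", b"").decode("utf-8"),
                "entities": data.get(b"cf:entities", b"").decode("utf-8").split(","),
            },
        }

helpers.bulk(es, actions())
```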
Take a look at this page (3 years later): http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html

Job scheduling in MongoDB

The project I am working on requires synchronization of data between MongoDB and SQL Server, so what is the best way to do it? This synchronization should be handled on the MongoDB side; does MongoDB support job scheduling?
I am new to MongoDB and I want to know a few things about it. I know it is not an RDBMS, but in case this is possible, then how?
Just as we can write stored programs in Oracle, can we write them directly in MongoDB? If not, which client-side language can be used?
As @Vasanth said, MongoDB has no native job management.
As to which client-side language: well, that's entirely up to you.
You can always use a prebuilt replicator like http://code.google.com/p/tungsten-replicator/ that might do the job for you.
For further reference, this question may be of help to you: https://softwareengineering.stackexchange.com/questions/125980/can-sql-server-and-mongo-be-used-together
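Since MongoDB itself has no job scheduler, the scheduling has to live in a client-side script or an external scheduler (cron, etc.). Here is a minimal sketch of a scheduled sync job using the Python "schedule" package, pymongo, and pyodbc; the connection strings, collection name, table name, and upsert logic are assumptions made up for illustration.

```python
# Sketch: client-side scheduled sync from MongoDB to SQL Server.
# All connection details, names, and the naive full-upsert logic are placeholders;
# a real job would track a "last synced" watermark instead of rescanning everything.
import time
import schedule
import pyodbc
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
orders = mongo["shop"]["orders"]

def sync_orders():
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sqlserver-host;DATABASE=shop;UID=sync_user;PWD=***")
    cursor = conn.cursor()
    for doc in orders.find({}, {"_id": 1, "customer": 1, "total": 1}):
        # Upsert each MongoDB document into the SQL Server table
        cursor.execute(
            "MERGE orders AS t "
            "USING (SELECT ? AS id, ? AS customer, ? AS total) AS s "
            "ON t.id = s.id "
            "WHEN MATCHED THEN UPDATE SET customer = s.customer, total = s.total "
            "WHEN NOT MATCHED THEN INSERT (id, customer, total) "
            "VALUES (s.id, s.customer, s.total);",
            str(doc["_id"]), doc.get("customer"), doc.get("total"))
    conn.commit()
    conn.close()

# Run the sync every 15 minutes
schedule.every(15).minutes.do(sync_orders)

while True:
    schedule.run_pending()
    time.sleep(60)
```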