Extract data from MongoDB with Sqoop to write to HDFS? - mongodb

I am concerned about extracting data from MongoDB, since my application handles most of its transactions in MongoDB.
I have worked with Sqoop to extract data and found that RDBMSs integrate well with HDFS via Sqoop. However, I have found no clear direction on extracting data from a NoSQL DB with Sqoop and dumping it onto HDFS for processing large chunks of data.
Please share your suggestions and findings.
I have extracted static information and transactional data from MySQL: I simply used Sqoop to store the data in HDFS and processed it there. Now I have live transactions of about 1 million unique email IDs per day, modelled in MongoDB. I need to move this data from MongoDB to HDFS for processing/ETL. How can I achieve this goal using Sqoop? I know I can schedule the task, but what would be the best approach to pull the data out of MongoDB via Sqoop?
Consider a 5-datanode cluster with 2 TB capacity. Data volume varies from about 1 GB to 2 GB during peak hours.

Sqoop only imports data from relational databases. There are other ways to get data from MongoDB into Hadoop,
e.g. https://docs.mongodb.com/ecosystem/tools/hadoop/
Alternatively, you can use a dataflow management tool such as NiFi or StreamSets to pull data from MongoDB in real time.
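For instance, the Spark connector for MongoDB can read a collection as a DataFrame and land it on HDFS without Sqoop. A minimal PySpark sketch, assuming mongo-spark-connector 3.x syntax, with placeholder host, database, collection, field, and path names:

    from pyspark.sql import SparkSession

    # Minimal sketch: read a MongoDB collection and write it to HDFS as Parquet.
    # Assumes the connector jar is supplied, e.g.
    #   --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
    # Host, database, collection, field and path names are placeholders.
    spark = (
        SparkSession.builder
        .appName("mongo-to-hdfs")
        .config("spark.mongodb.input.uri",
                "mongodb://mongo-host:27017/appdb.email_transactions")
        .getOrCreate()
    )

    df = spark.read.format("mongo").load()

    # Partitioning by day keeps each scheduled run limited to a new partition.
    (df.write
       .mode("append")
       .partitionBy("event_date")   # assumes documents carry an event_date field
       .parquet("hdfs:///data/raw/email_transactions"))

    spark.stop()

A daily cron or Oozie job running a script like this covers the scheduled-load part of the question; restricting each run to new documents (e.g. only yesterday's) can be done with a .filter() on the DataFrame before writing.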

Related

Transfer data from Cassandra to PostgreSQL

I would like to know whether there is a solution for this.
Problem:
I have a Cassandra database that continuously saves large-scale data from other sources. Application data is saved in PostgreSQL. For the application's functionality, I want to query all data from PostgreSQL, so I would like to keep the Cassandra data consistently copied into the PostgreSQL database as it arrives in Cassandra.
Is this possible?
Please suggest.
I would like to save Cassandra data consistently to the PostgreSQL database based on data coming into Cassandra.
There is no special utility for this. You need to create your own service to gather data from Cassandra, process it, and put the results into PostgreSQL.
Sure. You can use Change Data Capture (CDC) to copy the data from Cassandra to PostgreSQL as and when the Cassandra data changes. One option is to use Kafka Connect with appropriate connectors.
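A minimal sketch of the do-it-yourself service mentioned above, using the DataStax Python driver and psycopg2; keyspace, table, column, and connection details are placeholders, and the query pattern has to match your Cassandra table's keys to be efficient:

    from datetime import datetime

    from cassandra.cluster import Cluster
    import psycopg2

    # Minimal sketch: poll Cassandra for rows newer than a checkpoint and
    # upsert them into PostgreSQL. All names and credentials are placeholders.
    cassandra = Cluster(["cassandra-host"]).connect("sensor_ks")
    pg = psycopg2.connect("host=pg-host dbname=appdb user=app password=secret")

    # In practice the checkpoint would be persisted, e.g. in a control table.
    last_ts = datetime(2020, 1, 1)

    rows = cassandra.execute(
        "SELECT id, ts, payload FROM readings WHERE ts > %s ALLOW FILTERING",
        (last_ts,),
    )

    with pg, pg.cursor() as cur:
        for row in rows:
            cur.execute(
                "INSERT INTO readings (id, ts, payload) VALUES (%s, %s, %s) "
                "ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts, payload = EXCLUDED.payload",
                (row.id, row.ts, row.payload),
            )
    # The with-block commits the PostgreSQL transaction; afterwards, save the
    # largest ts seen as the checkpoint for the next run.

The CDC/Kafka Connect route from the other answer avoids this polling loop entirely, at the cost of running Kafka and the connectors.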

Is it possible to store MongoDB data on HDFS?

In my project, I'm facing a challenge with the data storage method. The project receives streaming data in JSON format, for which the most suitable database is MongoDB. I also have to analyze the data with Hadoop or Spark.
This is where my conflict starts: can I store MongoDB collections in HDFS, or must the MongoDB and HDFS storage units be different? It is an important issue for my decision. Must Hadoop and MongoDB share the same disk units, or should they be separate?
They need to be separate units, since the storage methods, security policy implementations, and storage mechanisms themselves are different.

What is the common practice for storing user data and analyzing it with Spark/Hadoop?

I'm new to Spark. I'm a web developer, not familiar with big data.
Let's say I have a portal website. User behavior and actions are stored in 5 sharded MongoDB clusters.
How do I analyze it with Spark?
Or can Spark get the data directly from any database (Postgres/MongoDB/MySQL/...)?
I ask because most websites use a relational DB as the back-end database.
Should I export all the data in the website's databases into HBase?
I store all the user logs in PostgreSQL; is it practical to export the data into HBase or some other store that Spark prefers?
It seems that copying the data to a new database would create a lot of duplication.
Does my big-data setup need any framework besides Spark?
For analyzing the data in the website's databases,
I don't see why I would need HDFS, Mesos, ...
How can I make Spark workers access the data in the PostgreSQL databases?
I only know how to read data from a text file,
and I have seen some code for loading data from hdfs://.
But I don't have an HDFS system now; should I set up HDFS for this purpose?
Spark is a distributed compute engine, so it expects data to be accessible from all nodes. Here are some choices you might consider:
There is a Spark-MongoDB connector; this post explains how to get it working.
Export the data out of MongoDB into Hadoop, then use Spark to process the files. For this, you need a Hadoop cluster running.
If you are on Amazon, you can put the files in S3 and access them from Spark.
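On the part of the question about reading PostgreSQL directly: Spark has a built-in JDBC data source, so the workers can pull table data in parallel without HDFS, at least for moderate volumes. A minimal PySpark sketch with placeholder connection details and partition bounds (the PostgreSQL JDBC driver jar must be supplied via --jars or spark.jars.packages):

    from pyspark.sql import SparkSession

    # Minimal sketch: read a PostgreSQL table into Spark over JDBC.
    # Host, database, table, credentials and bounds below are placeholders.
    spark = SparkSession.builder.appName("portal-analytics").getOrCreate()

    user_actions = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://pg-host:5432/portal")
        .option("dbtable", "user_actions")
        .option("user", "report")
        .option("password", "secret")
        .option("partitionColumn", "id")   # numeric/date column used to split the read
        .option("lowerBound", "1")
        .option("upperBound", "10000000")
        .option("numPartitions", "8")      # 8 parallel JDBC connections across the workers
        .load()
    )

    user_actions.groupBy("action").count().show()

This keeps the data in PostgreSQL, so there is no duplication, but every Spark job re-reads the tables; for heavy, repeated analysis it is usually worth exporting to HDFS or S3 as in the options above.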

Incrementally updating/adding data on HDFS

In my application there are 4 tables, and each table has more than 1 million rows.
Currently my Java-based reporting engine joins all the tables and fetches the data to show in reports.
Now I want to introduce Hadoop using Sqoop. I have installed Hadoop 2.2 and Sqoop 1.9.
I have done a small POC to import the data into HDFS. The problem is that it creates a new data file every time.
My requirement is:
There would be a scheduler that runs once a day, and it will:
Pick up the data from all four tables and load it into HDFS using Sqoop.
Pig will do some transformation and joining of the data and prepare the final denormalized data.
Sqoop will then export this data into a separate reporting table.
I have a few questions around this:
Do I need to import the whole data set from the DB to HDFS on every Sqoop import call?
In the master table some rows are updated and some are new, so how can I handle that? Can I merge the data while loading it into HDFS?
At the time of export, do I need to export the whole data set to the reporting table again? If yes, how would I do that?
Please help me out in this case,
and please suggest a better solution if you have one.
Sqoop supports incremental and delta imports. Check the Sqoop documentation here for more details.
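For example, a scheduled incremental load of the master table could look like this (Sqoop 1 command-line syntax; the connection string, table, and column names are placeholders):

    sqoop import \
      --connect jdbc:mysql://db-host/appdb \
      --username app \
      --password-file /user/etl/.db-password \
      --table master_table \
      --target-dir /data/raw/master_table \
      --incremental lastmodified \
      --check-column updated_at \
      --last-value "2014-06-01 00:00:00" \
      --merge-key id

With --incremental lastmodified plus --merge-key, Sqoop merges updated rows into the data already in HDFS instead of only appending; for insert-only tables, --incremental append with --check-column on the primary key is enough. If you define this as a saved Sqoop job, the --last-value is tracked automatically between scheduled runs.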

HDFS to PostgreSQL

We need a process in place to pull data from the Hadoop Distributed File System (HDFS) into a relational DB (PostgreSQL) on a regular basis. We will need to transfer several million records per hour, and I am looking for the best industry-standard way to move data out of HDFS. Does anyone have any suggestions?
The idea is for a web app to interact with PostgreSQL, which will hold the aggregated data.
Sqoop is built for exactly this purpose: moving data between relational data stores and Hadoop. Specifically, you want sqoop-export.
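As a rough illustration, an hourly export of an aggregated HDFS directory into a PostgreSQL table could look like this (Sqoop 1 syntax; connection details, table, path, and field delimiter are placeholders, and the target table must already exist in PostgreSQL):

    sqoop export \
      --connect jdbc:postgresql://pg-host:5432/reports \
      --username report \
      --password-file /user/etl/.pg-password \
      --table hourly_aggregates \
      --export-dir /data/aggregates/hourly \
      --input-fields-terminated-by '\t' \
      --num-mappers 8

Each mapper opens its own JDBC connection, so --num-mappers sets the write parallelism against PostgreSQL; a few million rows per hour is typically manageable for a sqoop-export run scheduled from cron or Oozie.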