Migrate a large Hive table into a Mongo collection - mongodb

I want my large Hive table to be migrated into a Mongo collection as a daily job.
I can split the Hive table into mini-batches and insert them into the Mongo collection one by one programmatically.
However, I think there must be a recommended, more elegant way to accomplish this.
I have already searched for suitable tools and found the mongo-hadoop connector interesting, but it is no longer maintained.
Any kind of help would be appreciated! Thanks.
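
One common approach is to run the migration as a Spark job: read the Hive table through Spark's Hive support and write it out with the MongoDB Spark Connector, which parallelises and batches the inserts for you. A minimal sketch, assuming Spark 2.x with the connector package on the classpath; the table name `warehouse.events`, the Mongo URI, and the collection name are placeholders:

```python
from pyspark.sql import SparkSession

# Hypothetical names: Hive table `warehouse.events`, Mongo collection `analytics.events`
spark = (
    SparkSession.builder
    .appName("hive-to-mongo")
    .config("spark.mongodb.output.uri",
            "mongodb://mongo-host:27017/analytics.events")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the Hive table as a DataFrame; Spark parallelises the scan
df = spark.sql("SELECT * FROM warehouse.events")

# The connector writes each partition as batched inserts
(
    df.write
    .format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    .save()
)
```

Scheduled from cron, Oozie, or Airflow, this gives you the daily job without hand-rolled batching.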

Related

What is the best way to update Elasticsearch if we have lots of conditions and transformations to apply to the records?

We have a PostgreSQL database with candidate-, job-, and campaign-related tables, including mappings from candidate to job (job_candidate_mapping, let's call it the jcm table) and from candidate to campaign (campaign_candidate_mapping, the ccm table).
We also have candidate-related tables such as candidate_education_details, candidate_company_details, etc.
We want to send the candidate-job-campaign data to Elasticsearch as one document.
What is the best way to send the related data for a candidate from multiple tables to Elasticsearch?
We are planning to create a table holding all the denormalised data we need during search, one row per candidate.
Every time we update any candidate-related data in the tables above, we need to update it in Elasticsearch as well.
So now we have to maintain this denormalised table and write extra code to keep it up to date - is that the right approach?
What is the standard way to update the search engine, and how do big companies do this?
Please help; any suggestions would be appreciated.
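
The common pattern is what you describe: assemble one denormalised document per candidate on the application side and re-index it whenever any source row changes. Indexing with a stable id makes the write idempotent, so you can simply overwrite the whole document. A minimal sketch with the official Python client; the index name `candidates` and every field below are made up for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# One denormalised document per candidate, joined on the application
# side from the jcm/ccm mappings and the candidate_* tables
doc = {
    "candidate_id": 42,
    "name": "Jane Doe",
    "jobs": [{"job_id": 7, "status": "INTERVIEW"}],  # from jcm
    "campaigns": [{"campaign_id": 3}],               # from ccm
    "education": [{"degree": "BSc"}],                # candidate_education_details
}

# Indexing by candidate_id overwrites any previous version, so the
# same call handles both insert and update
# (elasticsearch-py 8.x shown; older clients take body= instead of document=)
es.index(index="candidates", id=doc["candidate_id"], document=doc)
```

Rather than maintaining the denormalised table by hand, many teams drive the re-index from change-data-capture on the source database (e.g. Debezium reading the PostgreSQL WAL), so every committed change triggers an update automatically.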

How to use the appropriate index while querying Mongo with the mongo-spark connector's withPipeline feature?

I am trying to load a huge amount of data from MongoDB - the data size is in the millions of documents. So it makes sense to pull this data using the appropriate indexes and to query Mongo in parallel. That's why I am using mongo-spark for batch reads.
How do I use the appropriate index while querying Mongo with the mongo-spark connector's withPipeline feature?
Also, I was exploring com.mongodb.reactivestreams.client.MongoCollection. If possible, can someone throw some light on this?
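
MongoDB can only use an index for the leading $match (and $sort) stages of an aggregation, so put the selective $match first in the pipeline you hand to the connector and make sure a matching index exists. A hedged PySpark sketch, assuming a connector version (2.3.1+) that accepts the pipeline as a read option; the Scala equivalent is MongoSpark.load(sc).withPipeline(...). The URI and field names are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-batch-read")
    .config("spark.mongodb.input.uri",
            "mongodb://mongo-host:27017/db.events")  # placeholder URI
    .getOrCreate()
)

# A leading $match can be served by an index on `status`; the
# connector appends it to the query each partition sends to MongoDB
pipeline = '[{"$match": {"status": "ACTIVE"}}]'

df = (
    spark.read
    .format("com.mongodb.spark.sql.DefaultSource")
    .option("pipeline", pipeline)
    .load()
)
```

You can confirm the index is actually used by running the same pipeline with explain() in the mongo shell.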

HiveQL in MongoDB

I have been studying NoSQL and Hadoop for data warehousing; however, I have never worked with these technologies before, and I would like to check whether I have understood them correctly.
If my data is stored in MongoDB, can I use Hadoop with Hive to run HiveQL queries directly against MongoDB and store the output of those queries back in MongoDB as views, instead of in HDFS?
Also, if I understand correctly, most NoSQL databases don't support joins and aggregates, but it's possible to implement them through map-reduce. Since HiveQL queries are map-reduce jobs, when I do a join in HiveQL, would it automatically "join" the MongoDB data in map-reduce for me, with no need to worry about MongoDB's lack of support for joins?
MongoDB has very good support for aggregation-style operations. There are no joins, of course; MongoDB schemas are usually designed so that you typically don't need one.
HiveQL operates on 'tables' in HDFS; that's the default behaviour.
But there is a MongoDB-Hadoop Connector: http://docs.mongodb.org/ecosystem/tools/hadoop/
which lets you query MongoDB data from within Hadoop.
For map-reduce, you can also do that with MongoDB itself (without Hadoop).
See: http://docs.mongodb.org/manual/core/map-reduce/
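
To illustrate the point about aggregation, here is a hedged pymongo sketch of the kind of group-by you might otherwise express in HiveQL; the `shop.orders` collection and its fields are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders  # hypothetical database/collection

# Total order amount per customer - an aggregate computed inside
# MongoDB itself, with no Hadoop or HiveQL involved
totals = orders.aggregate([
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
])

for row in totals:
    print(row["_id"], row["total"])
```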

MongoDB data modeling - separate or combine collections?

I have a question about performance in Meteor. Before using Meteor I always wrote my applications in PHP and MySQL. In MySQL I created a lot of tables with many connections between them.
For example:
Table User
id;login;password;email
Table User_Data
user_id;name;age
My question now is how I should design my MongoDB collections. It's nice that collections are built like JS objects, so I don't have to pre-design my tables and can always change the columns easily. But is it better to combine all the data into one collection, or to split it across several collections?
For example:
Table User
_id;login;password;email;data:{name;age}
Is that better or worse for performance? Or is it the wrong way to design MongoDB collections?
This question is mainly about MongoDB data modeling; what you'll learn applies to MongoDB used with Meteor or with anything else.
http://docs.mongodb.org/manual/data-modeling/ talks about data modeling with MongoDB and is a good introduction.
In your particular case, you can read more about how to avoid JOINs in MongoDB.
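
As a concrete illustration of the embedded design sketched in the question, here is a hedged pymongo example; the database/collection names and field values are made up:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client.app.users  # hypothetical database/collection

# Embedded design: the User_Data fields live inside the user document,
# so a single query fetches everything and no JOIN is needed
users.insert_one({
    "login": "alice",
    "password": "<hashed password>",
    "email": "alice@example.com",
    "data": {"name": "Alice", "age": 30},
})

user = users.find_one({"login": "alice"})  # returns the whole document
```

Embedding works well when the sub-document is always read together with its parent and stays small; data that grows without bound or is shared between parents usually belongs in its own collection.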

MongoDB with Hive DW

I am planning to build a data warehouse in MongoDB for the first time. It has been suggested that I use Hadoop for map-reduce in case I need more complex analyses of the datasets.
Having discovered Hive, I like the idea of writing map-reduce jobs in a language similar to SQL. But my doubt is: can I run HiveQL queries directly against MongoDB without needing to build a Hive DW on top of Hadoop? In all the use cases I have found, it seems to work only on data stored in HDFS.
You could use the MongoDB Connector for Hadoop:
http://docs.mongodb.org/ecosystem/tools/hadoop/
MongoDB on its own has a map-reduce facility too:
http://docs.mongodb.org/manual/core/map-reduce/
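
For the second option, here is a hedged sketch of MongoDB's native mapReduce command driven from pymongo, with no Hadoop involved; the `dw.sales` collection and its fields are hypothetical:

```python
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")
db = client.dw  # hypothetical database

# Map emits (category, amount); reduce sums the amounts per category
mapper = Code("function () { emit(this.category, this.amount); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Results are written to the `sales_totals` collection
db.command(
    "mapReduce", "sales",
    map=mapper,
    reduce=reducer,
    out="sales_totals",
)
```

Note that recent MongoDB versions deprecate mapReduce in favour of the aggregation pipeline, which covers most of the same ground.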