hadoop with mongodb plugin - read data - mongodb

I know that it is possible to read and write data from MongoDB via Hadoop.
I want to know whether this adapter reads from a MongoDB collection using the native MongoDB driver, i.e. whether it goes through a mongod instance, or whether it reads the collection data directly.
Also, when Hadoop reads MongoDB data for processing in a MapReduce job, doesn't that MapReduce lock the MongoDB collection?
In other words, when Hadoop reads data from MongoDB, does it save that data for its own use, so that the MapReduce job works on data retrieved from MongoDB but stored internally in Hadoop for processing, without interfering with the data in MongoDB?

No data is cached or saved within Hadoop by the mongo-hadoop plugin.
Instead, each chunk is read into Hadoop as an individual input split, which parallelizes the Hadoop MapReduce job.
The only locking that occurs in MongoDB is a light read lock while data is read from Mongo.
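To illustrate the idea (this is not the plugin's actual code): each input split amounts to a bounded query that goes through mongod via the normal driver, roughly like the following pymongo sketch. The connection details and split boundaries are hypothetical; the real split points come from the connector (e.g. from chunk metadata).

from pymongo import MongoClient

# Hypothetical connection and split boundaries; the real split points are
# computed by the connector from the collection's chunk/split metadata.
client = MongoClient('mongodb://localhost:27017')
collection = client['mydb']['mycol']

def read_split(lower_id, upper_id):
    # Each split reads only its own _id range through mongod, so several
    # mappers can scan the collection in parallel without copying it first.
    query = {'_id': {'$gte': lower_id, '$lt': upper_id}}
    for doc in collection.find(query):
        yield doc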

Related

Download from MongoDB database using multiprocessing?

I am analyzing a MongoDB database in Python via Pymongo.
I want to parallelize the document download and processing steps (the processing is long, and I do not have enough memory to download the entire database at once).
But apparently parallel computing with Pymongo / MongoDB causes trouble for Unix systems (https://pymongo.readthedocs.io/en/stable/faq.html#using-pymongo-with-multiprocessing).
Any idea about how to circumvent this?
I use:
Pathos for parallel computing,
Pymongo 4.0.1 (no more parallel_scan in this version!),
MongoDB 5.
I have already read multiprocessing-dask-pymongo and parallelizing-loading-data-from-mongodb-into-python.
Thank you for your help!
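A minimal sketch of the pattern the linked FAQ recommends, using the standard library's multiprocessing (the same idea applies with Pathos): never reuse a MongoClient created before the fork; build one inside each worker instead. The database, collection, batch size and processing function below are placeholders.

from multiprocessing import Pool
from pymongo import MongoClient

def heavy_processing(doc):
    # Placeholder for the long per-document processing step.
    return len(doc)

def process_batch(id_batch):
    # Create the MongoClient inside the worker, after the fork,
    # as the pymongo multiprocessing FAQ recommends.
    client = MongoClient('mongodb://localhost:27017')
    coll = client['mydb']['mycol']
    out = [heavy_processing(doc) for doc in coll.find({'_id': {'$in': id_batch}})]
    client.close()
    return out

if __name__ == '__main__':
    # The parent only fetches _ids and hands out batches, so the full
    # documents never have to fit in memory at once.
    parent = MongoClient('mongodb://localhost:27017')
    ids = [d['_id'] for d in parent['mydb']['mycol'].find({}, {'_id': 1})]
    batches = [ids[i:i + 1000] for i in range(0, len(ids), 1000)]
    with Pool(processes=4) as pool:
        for batch_result in pool.map(process_batch, batches):
            pass  # aggregate or persist the results here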

Can I use mongodb data in my v4.4.5 database that was exported from v3.6.3?

I have a mongodb database with version 3.6.3. I have another mongodb database (on another machine) using version 4.4.5 with no documents in it. I want to put the data from the v3.6.3 database into the v4.4.5 database. Can I safely do this using mongoexport and then mongoimport, or do I need to perform more steps?
Yes, mongoexport writes the documents out to a JSON file, and mongoimport can read that file and insert the documents into the new database.
These transfer only the documents, not the index information. You may want to consider mongodump/mongorestore if you also need to move indexes.
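If you go the mongodump/mongorestore route, a minimal sketch could look like the following (host names, database name and dump directory are placeholders), here invoked from Python for convenience:

import subprocess

# Dump the source database (v3.6.3) to a local directory.
subprocess.run([
    'mongodump',
    '--host', 'old-server', '--port', '27017',
    '--db', 'mydb',
    '--out', '/tmp/mongo_dump',
], check=True)

# Restore it, including index definitions, into the v4.4.5 deployment.
subprocess.run([
    'mongorestore',
    '--host', 'new-server', '--port', '27017',
    '--db', 'mydb',
    '/tmp/mongo_dump/mydb',
], check=True)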

will reading a mongodb collection from pyspark make mongodb slow, or cause a write lock that crashes mongodb?

This is the way I read data from the MongoDB replica:
def read_col(self, collection_name, spark):
    # Read a MongoDB collection into a Spark DataFrame via the MongoDB Spark connector.
    return spark.read.format('com.mongodb.spark.sql.DefaultSource') \
        .option('uri', '{}/{}.{}?authSource={}'
                .format(self.mongo_url, self.mongo_db, collection_name, self.auth_source)) \
        .option('sampleSize', 50000) \
        .load()
I read a large database for data processing with PySpark. But the backend team says that while I am reading the data from MongoDB, a write lock occurs on the replica, the primary is not able to sync with the replica, and eventually MongoDB crashes. But writes have higher priority than reads, don't they?
Can anyone share their opinion on this matter?
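One mitigation worth trying (an assumption on my part, not a confirmed fix for the crash described) is to send the Spark reads to a secondary via the connection string's readPreference option, so the heavy scan does not compete with the primary. This is just the read_col method above with that extra parameter:

def read_col(self, collection_name, spark):
    # Same reader as above, but ask the driver to prefer a secondary member
    # so the Spark scan is served away from the primary.
    uri = '{}/{}.{}?authSource={}&readPreference=secondaryPreferred'.format(
        self.mongo_url, self.mongo_db, collection_name, self.auth_source)
    return spark.read.format('com.mongodb.spark.sql.DefaultSource') \
        .option('uri', uri) \
        .option('sampleSize', 50000) \
        .load()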

how to transfer data from mongodb to elasticsearch with logstash 2.2.0?

The problem is that Logstash 2.2.0 is not compatible with the logstash-input-mongodb plugin, and I can't find another way to transfer the data.
Is there a tool like Solr's import handler?
We use Mongo-Connector to transfer data from mongo to elastic: https://github.com/mongodb-labs/mongo-connector
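For reference, a minimal way to launch it could look like the sketch below; the package names and addresses are assumptions (roughly: pip install mongo-connector elastic2-doc-manager), here invoked from Python.

import subprocess

# mongo-connector tails the oplog, so MongoDB must run as a replica set.
subprocess.run([
    'mongo-connector',
    '-m', 'localhost:27017',       # MongoDB replica set member
    '-t', 'localhost:9200',        # Elasticsearch
    '-d', 'elastic2_doc_manager',  # doc manager for Elasticsearch
], check=True)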

Synchronizing data between Hadoop and PostgreSql using SymmetricDs

I'm using Hadoop to store the data of our application. How can I synchronize data between PostgreSql and Hadoop? I'm using SymmetricDS as the replication tool.
If Hadoop only copies data from PostgreSQL and no updates are done on the Hadoop side, try using Sqoop, a simple tool for importing databases into Hadoop.
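A rough sketch of such a one-off Sqoop import (connection string, credentials, table and target directory are placeholders, here invoked from Python):

import subprocess

# One-off import of a PostgreSQL table into HDFS with Sqoop.
subprocess.run([
    'sqoop', 'import',
    '--connect', 'jdbc:postgresql://pg-host:5432/mydb',
    '--username', 'myuser', '--password', 'secret',
    '--table', 'my_table',
    '--target-dir', '/user/hadoop/my_table',
], check=True)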
If you want to continue to use SymmetricDS, you can implement an IDatabaseWriter. Here is an example of writing to MongoDB: https://github.com/JumpMind/symmetric-mongo