Download from MongoDB database using multiprocessing?

I am analyzing a MongoDB database in Python via PyMongo.
I want to parallelize the document download and processing steps (the processing is long, and I do not have enough memory to download the entire database at once).
But apparently using PyMongo with multiprocessing causes trouble on Unix systems, because MongoClient is not fork-safe (https://pymongo.readthedocs.io/en/stable/faq.html#using-pymongo-with-multiprocessing).
Any idea how to circumvent this?
I use:
Pathos for parallel computing,
Pymongo 4.0.1 (no more parallel_scan in this version!),
MongoDB 5.
I have already read multiprocessing-dask-pymongo and parallelizing-loading-data-from-mongodb-into-python.
Thank you for your help!
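One pattern recommended by the PyMongo FAQ linked above is to open a separate MongoClient inside each worker process, after the fork, instead of sharing one client across processes. Below is a minimal sketch with the standard multiprocessing module; the URI, database/collection names, batch size and process_document() are placeholder assumptions, and the same idea applies to Pathos pools.

# Minimal sketch: each worker opens its own MongoClient after the fork,
# as the PyMongo FAQ recommends. Names and the processing step are placeholders.
from multiprocessing import Pool
from pymongo import MongoClient

MONGO_URI = "mongodb://localhost:27017"   # assumption

def process_document(doc):
    # placeholder for the long-running processing step
    return doc["_id"]

def worker(id_batch):
    # create the client inside the child process, never before the fork
    client = MongoClient(MONGO_URI)
    coll = client["mydb"]["mycollection"]
    out = [process_document(d) for d in coll.find({"_id": {"$in": id_batch}})]
    client.close()
    return out

if __name__ == "__main__":
    # fetch only the _ids in the parent (cheap), then hand out batches to workers
    client = MongoClient(MONGO_URI)
    ids = [d["_id"] for d in client["mydb"]["mycollection"].find({}, {"_id": 1})]
    client.close()
    batches = [ids[i:i + 1000] for i in range(0, len(ids), 1000)]
    with Pool(processes=4) as pool:
        for batch_result in pool.imap_unordered(worker, batches):
            pass  # aggregate or persist the processed results here

This keeps only the _ids in the parent's memory, so the full documents are downloaded and processed in batches inside the workers.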

Related

How to import 700+ million rows into MongoDB in minutes

We have a 32-core Windows server with 96 GB RAM and a 5 TB HDD.
Approach 1 (using Oracle SQL*Loader / SQLLDR)
We fetched the input data from the Oracle database.
We processed it and generated multiple TSV files.
Using threading, we are importing the data into the Oracle database using SQL*Loader.
It requires approximately 66 hours.
Approach 2 (using mongoimport)
We fetched the input data from the Oracle database.
We processed it and generated multiple TSV files.
Using threading, we are importing the data into a MongoDB database using the mongoimport command-line utility.
It requires approximately 65 hours.
There is no considerable difference observed in performance.
We need to process 700+ million records; please suggest the better approach for optimized performance.
We are fetching from an Oracle database, processing in our application, and storing the output in another database. This is an existing process that we run on Oracle, but it is time-consuming, so we decided to try MongoDB for a performance improvement.
We did one POC where we did not see any considerable difference. We thought it might perform better on the server because of the hardware, so we repeated the POC on the server, where we got the above-mentioned result.
We think that MongoDB is more robust than the Oracle database, but we failed to get the desired result after comparing the stats.
Please find the MongoDB-related details of the production server below:
MongoImport Command
mongoimport --db abcDB --collection abcCollection --type tsv --file abc.tsv --headerline --numInsertionWorkers 8 --bypassDocumentValidation
WiredTiger Configuration
storage:
  dbPath: C:\data\db
  journal:
    enabled: false
  wiredTiger:
    engineConfig:
      cacheSizeGB: 40
The approximate computation times were calculated from the process log details of the Oracle run and the MongoDB run.
The POC described above, carried out on the production server, is for comparing the performance of Oracle (SQL*Loader) vs MongoDB (mongoimport).
As we are using a standalone MongoDB instance for our POC, we have not set up any sharding on the production server.
If we get the desired result using MongoDB, then we will come to a conclusion about the migration.
Thank you in advance.
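Since each mongoimport run handles one file, one way to keep more of the 32 cores busy is to launch several imports in parallel, one per generated TSV file. Here is a rough Python sketch; the file pattern and worker counts are assumptions, and the database/collection names are taken from the command above.

# Sketch: run one mongoimport process per TSV file, a few at a time.
# File locations and concurrency levels are illustrative assumptions.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_import(tsv_path):
    # flags mirror the mongoimport command shown above
    cmd = ["mongoimport", "--db", "abcDB", "--collection", "abcCollection",
           "--type", "tsv", "--file", tsv_path, "--headerline",
           "--numInsertionWorkers", "4", "--bypassDocumentValidation"]
    return tsv_path, subprocess.run(cmd).returncode

if __name__ == "__main__":
    tsv_files = glob.glob("output/*.tsv")   # assumed location of the generated TSV files
    # a handful of concurrent imports; beyond that they mostly contend for disk and locks
    with ThreadPoolExecutor(max_workers=4) as pool:
        for path, code in pool.map(run_import, tsv_files):
            print(path, "exit code:", code)

Whether this helps depends on where the bottleneck is; if the single HDD is already saturated, more parallel imports will not change the total time much.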

Neo4J: Importing a large Cypher dump

I have a large dump (millions of nodes and relationships) from a Neo4j 2.2.5 database in Cypher format (produced with neo4j-shell -c dump) that I'm trying to import into a 3.0.3 instance.
However, the import process (neo4j-shell < dump.cypher) slows down drastically after a few minutes, down to a couple of records per second.
Is there any way to speed up this process? I tried upgrading the database as described in the manual, but the new instance crashes with an exception about a version mismatch in the store format.
Neo4j 3.0 comes with a bin/neo4j-admin tool for exactly this purpose.
try bin/neo4j-admin import --mode database --from /path/to/db
see: http://neo4j.com/docs/operations-manual/current/deployment/upgrade/#upgrade-instructions
The Cypher dump is not useful for large databases; it's only meant for smaller setups (a few thousand nodes), for demos etc.
FYI: in Neo4j 3.0 the Cypher export procedure from APOC is much better suited for large-scale Cypher dumps.
Update
You can also try to upgrade from 2.2 to 2.3 first, e.g. by using neo4j-shell:
add allow_store_upgrade=true to your neo4j.properties in 2.3
and then do: bin/neo4j-shell -path /path/to/db -config conf/neo4j.properties -c quit
Once it finishes, that copy of your database is on version 2.3.
Then you should be able to use neo4j-admin import ...
I recently had this same symptom, with my CSV import slowing to a crawl.
My LOAD CSV Cypher script created too many relationships at once.
So I divided my load in two: first create the nodes, then the relationships and the most connected nodes. Hope it helps.
Back to your issue
First, try to increase the memory for the JVM. In the Neo4j conf directory there is a wrapper file; the memory settings are at the beginning of it.
Lastly, from an instance with your data, export to multiple CSV files and import them into your new server.

Incredibly low GridFS performance using MongoDB 3.0.0 and Mongofiles

I have a MongoDB database with a GridFS collection containing hundreds of thousands of files (345,073 to be precise, about 100 GB in volume).
On MongoDB 2.6.8 it takes a fraction of a second to list the files using the native mongofiles and connecting to mongod. This is the command I use:
mongofiles --db files list
I just installed and linked MongoDB 3.0.0 with Homebrew, and suddenly the same operation takes more than five minutes to complete, if it completes at all. I have to kill the query most of the time, as it drives two of my CPU cores to 100%. The log file does not show anything irregular. I rebuilt the indexes to no avail. I also tried the same with my other GridFS collections in other databases, each with millions of files, and I encounter the same issue.
Then I uninstalled 3.0.0 and relinked 2.6.8 and everything is back to normal (using the exact same data files).
I am running MongoDB on Yosemite, and I reckon the problem might be platform-specific. But is there anything that I have omitted and should take into consideration? Or have I really discovered a bug that I must report?
I'm having the same problem here; for me, running mongofiles 2.6 from a Docker image fixed it. It seems they broke something with the rewrite.
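As another possible workaround while staying on 3.0, the listing can be done from Python with PyMongo by reading the fs.files metadata collection directly, bypassing mongofiles altogether. A small sketch; the connection URI is an assumption and the database name is taken from the command above.

# Sketch: list GridFS files without mongofiles by querying fs.files directly.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumption
db = client["files"]                                # the database used above

# mongofiles "list" essentially prints the filename and length stored in fs.files
for doc in db["fs.files"].find({}, {"filename": 1, "length": 1}):
    print(doc.get("filename"), doc.get("length"))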

Is the MongoDB 32-bit limitation per single database?

I am using MongoDB v2.0. I have gone through the 32-bit MongoDB limitation of "2GB". The thing that is baffling me is the 2GB limitation. I will explain our scenario:
When a database reaches 2GB, is it possible to use a different database name in a single instance? If so, does each database get 2GB? Can we use different instances of MongoDB listening on different ports? If that's possible, can we keep creating new databases until each reaches 2GB in size? In this way, can we use multiple 2GB databases with 32-bit MongoDB on 32-bit machines?
Thanks,
sampath
The 2GB is the storage limit for the whole mongod server (all databases combined), not per database. See the FAQ: http://www.mongodb.org/display/DOCS/FAQ#FAQ-Whatarethe32bitlimitations%3F
But maybe this is your solution: Database over 2GB in MongoDB

How to know when my MongoDB database overhead reaches its limit?

I installed a MongoDB database on my server. My server is 32-bit and I can't change that soon.
When you use MongoDB on a 32-bit architecture you have a limit of about 2.5 GB of data, as mentioned in this MongoDB blog post.
The thing is that I have several databases. So how can I know whether or not I am close to this limit?
Mongo will start throwing assertions when you hit the limit. If you look at the logs, it will have a bunch of error messages (helpfully) telling you to switch to a 64-bit machine.
Using the mongo shell, you can say:
> use admin
> db.runCommand({listDatabases : 1})
to get a general idea of how close to the limit you are. Sizes are in bytes.
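The same check can be scripted with PyMongo, since listDatabases reports a sizeOnDisk per database plus a totalSize, all in bytes. A small sketch (the connection URI is an assumption):

# Sketch: report per-database and total on-disk size from Python.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumption
info = client.admin.command("listDatabases")

for d in info["databases"]:
    print(d["name"], d.get("sizeOnDisk", 0), "bytes")
print("total:", info["totalSize"], "bytes")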