Faster mongoimport, in parallel, in Airflow?

tl;dr: there seems to be a limit on how fast data can be inserted into our MongoDB Atlas cluster, and inserting data in parallel does not speed this up. How can we speed this up? Is our only option a larger MongoDB Atlas cluster with more write IOPS? What even are write IOPS?
We replace and re-insert over 10GB of data daily into our MongoDB Atlas cluster. We have the following two bash commands, wrapped in Python functions to help parameterize them, that we use with BashOperator in Airflow:
upload single JSON to mongo cluster
def mongoimport_file(mongo_table, file_name):
    # upload a single file from the /tmp directory into the Mongo cluster
    # cleanup: remove the .json in /tmp at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
    echo INSERT \
    && mongoimport --uri "{uri}" --collection {mongo_table} --drop --file /tmp/{file_name}.json \
    && echo AND REMOVE LOCAL FILE... \
    && rm /tmp/{file_name}.json
    """
upload directory of JSONs to mongo cluster
def mongoimport_dir(mongo_table, dir_name):
    # upload a directory of JSONs into the Mongo cluster
    # cleanup: remove the directory at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
    echo INSERT \
    && cat /tmp/{dir_name}/*.json | mongoimport --uri "{uri}" --collection {mongo_table} --drop \
    && echo AND REMOVE LOCAL FILES... \
    && rm -rf /tmp/{dir_name}
    """
These are called in Airflow using the BashOperator:
import_to_mongo = BashOperator(
    task_id=f'mongo_import_v0__{this_table}',
    bash_command=mongoimport_file(mongo_table='tname', file_name='fname')
)
Both of these work, although with varying performance:
mongoimport_file with 1 5GB file: takes ~30 minutes to mongoimport
mongoimport_dir with 100 50MB files: takes ~1 hour to mongoimport
There is currently no parallelization with mongoimport_dir, and in fact it is slower than importing just a single file.
Within airflow, is it possible to parallelize the mongoimport of our directory of 100 JSONs, to achieve a major speedup? If there's a parallel solution using python's pymongo that cannot be done with mongoimport, we're happy to switch (although we'd strongly prefer to avoid loading these JSONs into memory).
What is the current bottleneck with importing to Mongo? Is it (a) CPUs on our server / Docker container, or (b) something in our Mongo cluster configuration (cluster RAM, cluster vCPU, cluster max connections, or cluster read/write IOPS - what are these even?)? For reference, here is our Mongo config. I assume we can speed up our import by getting a much bigger cluster, but MongoDB Atlas becomes very expensive very fast: 0.5 vCPUs doesn't sound like much, but this already runs us $150 / month...

First of all, "What is the current bottleneck with importing to mongo?" and "Is it (a) CPUs in our server / docker container?" - don't believe anyone who claims they can answer that from the screenshot you provided.
Atlas has monitoring tools that will tell you whether the bottleneck is CPU, RAM, disk, network, or any combination of those on the database side.
On the client side (Airflow), use the system monitor of your host OS to answer the question, and test disk I/O inside Docker: some combinations of host OS and Docker storage drivers have performed quite poorly in the past.
Next, "What even are write IOPS" - random
write operations per second
https://cloud.google.com/compute/docs/disks/performance
The IOPS calculation differs depending on the cloud provider. Try AWS and Azure to compare cost vs speed. M10 on AWS gives you 2 vCPUs, though again I doubt you can compare tiers 1:1 between vendors. The good thing is it's on-demand and will cost you less than a cup of coffee to spin up a test cluster and delete it.
Finally, "If there's a parallel solution using python's pymongo" - I doubt so. mongoimport uses batches of 100,000 documents, so essentially it sends it as fast as the stream is consumed on the receiver. The limitations on the client side could be: network, disk, CPU. If it is network or disk, parallel import won't improve a thing. Multi-core systems could benefit from parallel import if mongoimport was using a single CPU and it was the limiting factor. By default mongoimport uses all CPUs available: https://github.com/mongodb/mongo-tools/blob/cac1bfbae193d6ba68abb764e613b08285c6f62d/common/options/options.go#L302. You can hardly beat it with pymongo.

Related

Does mongod run in a single thread when performing a mongorestore?

When monitoring a mongodb restore, I'm tracking two processes...
CPU COMMAND
100% mongod --config /etc/mongo/mongod.conf
0% mongorestore /data/dump
MongoDB is 4.4.14 and mongorestore is version 100.5.3. I'm running it inside a docker container.
I never see mongod go past 100%. Is there a way to allow it to use more than a single core when performing a mongo restore?
By default mongorestore will restore at most 4 collections in parallel with a single insertion worker per collection (since you see only two processes running in parallel, perhaps you have only 2 collections in your database...), but you can increase these parameters if you have more collections and would like to restore faster; check the official docs for details:
--numParallelCollections (default: 4)
Number of collections mongorestore should restore in parallel.
--numInsertionWorkersPerCollection (default: 1)
Specifies the number of insertion workers to run concurrently per collection.
For large imports, increasing the number of insertion workers may increase the speed of the import.
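For example, a hypothetical invocation, written here as a Python subprocess call for consistency with the rest of this post; the dump path and worker counts are placeholders, and these flags only help if the dump actually contains multiple collections.
import subprocess

# Restore up to 8 collections in parallel, with 2 insertion workers each.
subprocess.run(
    [
        "mongorestore",
        "--numParallelCollections=8",
        "--numInsertionWorkersPerCollection=2",
        "/data/dump",
    ],
    check=True,
)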

How to speed up spark df.write jdbc to postgres database?

I am new to Spark and am attempting to speed up appending the contents of a dataframe (which can have between 200k and 2M rows) to a Postgres database using df.write:
df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()
I tried increasing the batchsize but that didn't help, as completing this task still took ~4 hours. I've also included some snapshots below from AWS EMR showing more details about how the job ran. The task to save the dataframe to the Postgres table was only assigned to one executor (which I found strange); would speeding this up involve dividing the work between executors?
Also, I have read Spark's performance tuning docs, but increasing the batchsize and queryTimeout has not seemed to improve performance. (I tried calling df.cache() in my script before df.write, but the runtime was still about 4 hours.)
Additionally, my AWS EMR hardware setup and spark-submit call are:
Master Node (1): m4.xlarge
Core Nodes (2): m5.xlarge
spark-submit --deploy-mode client --executor-cores 4 --num-executors 4 ...
Spark is a distributed data processing engine, so when you are processing your data or saving it to the file system it uses all of its executors to perform the task.
Spark JDBC writes are slow because when you establish a JDBC connection, only one of the executors establishes the link to the target database, resulting in slow speeds and failures.
To overcome this problem and speed up data writes to the database you need to use one of the following approaches:
Approach 1:
In this approach you use the Postgres COPY command to speed up the write operation. This requires the psycopg2 library on your EMR cluster.
The documentation for the COPY utility is here.
If you want to know the benchmark differences and why COPY is faster, visit here!
Postgres itself also recommends the COPY command for bulk inserts. So how do we bulk insert a Spark dataframe?
To implement faster writes, first save your Spark dataframe to the EMR file system in CSV format, repartitioning the output so that no file contains more than 100k rows.
# Repartition your dataframe dynamically based on the number of rows in df
df.repartition(10).write.option("maxRecordsPerFile", 100000).mode("overwrite").csv("path/to/save/data")
Now read the files using Python and execute the COPY command for each file.
import psycopg2

# Build the list of CSV files to load; you could also discover them with the os module.
file_paths = [
    'path/to/save/data/part-00000_0.csv',
    'path/to/save/data/part-00000_1.csv',
]

# Define a function that COPYs one CSV file into the target table
def execute_copy(file_path):
    con = psycopg2.connect(database=dbname, user=user, password=password, host=host, port=port)
    cursor = con.cursor()
    # Open the file inside the worker so we pass around paths (not file objects),
    # which also plays nicely with multiprocessing below.
    with open(file_path) as f:
        cursor.copy_from(f, 'table_name', sep=",")
    con.commit()
    con.close()
For an additional speed boost, since you are using an EMR cluster, you can leverage Python multiprocessing to copy more than one file at a time.
from multiprocessing import Pool, cpu_count

with Pool(cpu_count()) as p:
    print(p.map(execute_copy, file_paths))
This is the recommended approach, as Spark JDBC can't be tuned to gain higher write speeds due to connection constraints.
Approach 2:
Since you are already using an AWS EMR cluster, you can always leverage Hadoop capabilities to perform your table writes faster.
Here we will use sqoop export to export our data from EMRFS to the Postgres database.
#If you are using s3 as your source path
sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir s3://mybucket/myinputfiles/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16
#If you are using EMRFS as your source path
sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir /path/to/save/data/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16
Why sqoop?
Because Sqoop opens multiple connections to the database based on the number of mappers specified. So if you specify -m 8, there will be 8 concurrent connection streams writing data to Postgres.
Also, for more information on using sqoop go through this AWS Blog, SQOOP Considerations and SQOOP Documentation.
If you can hack your way around with code, then Approach 1 will definitely give you the performance boost you seek; if you are more comfortable with Hadoop components like Sqoop, then go with the second approach.
Hope it helps!
Spark-side tuning => repartition the DataFrame so that multiple executors write to the DB in parallel:
(df
    .repartition(10)  # number of concurrent connections from Spark to PostgreSQL
    .write.format('jdbc').options(
        url=psql_url_spark,
        driver=spark_env['PSQL_DRIVER'],
        dbtable="{schema}.{table}".format(schema=schema, table=table),
        user=spark_env['PSQL_USER'],
        password=spark_env['PSQL_PASS'],
        batchsize=2000000,
        queryTimeout=690
    ).mode(mode).save())
PostgreSQL-side tuning =>
You will need to bump up the parameters below on PostgreSQL accordingly.
max_connections determines the maximum number of concurrent connections to the database server. The default is typically 100 connections.
The shared_buffers configuration parameter determines how much memory is dedicated to PostgreSQL for caching data.
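As a small sketch, you could check those two settings with psycopg2 before the bulk load (the connection values below are placeholders; raising the settings happens in postgresql.conf or via ALTER SYSTEM, and shared_buffers requires a server restart):
import psycopg2

# Placeholder connection details - reuse whatever you pass to Spark/Sqoop.
con = psycopg2.connect(database="postgresDB", user="master", password="password",
                       host="hostname", port=5432)
cur = con.cursor()
for setting in ("max_connections", "shared_buffers"):
    cur.execute("SHOW " + setting)
    print(setting, "=", cur.fetchone()[0])
cur.close()
con.close()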
To solve the performance issue, you generally need to resolve the below 2 bottlenecks:
Make sure the Spark job is writing the data to the DB in parallel -
To resolve this, make sure you have a partitioned dataframe. Use "df.repartition(n)" to partition the dataframe so that each partition is written to the DB in parallel.
Note - A large number of executors will also lead to slow inserts, so start with 5 partitions and increase the number of partitions by 5 until you get optimal performance.
Make sure the DB has enough compute, memory and storage required for ingesting bulk data.
That repartitioning the dataframe achieves better write performance is a known answer, but there is an optimal way of repartitioning your dataframe.
Since you are running this process on an EMR cluster, first find out the instance type and the number of cores running on each of your slave instances, and specify the number of partitions for your dataframe accordingly.
In your case you are using m5.xlarge (2 slaves), which have 4 vCPUs each, meaning 4 threads per instance. So 8 partitions will give you an optimal result when you are dealing with huge data.
Note: the number of partitions should be increased or decreased based on your data size.
Note: batch size is also something you should consider in your writes; the bigger the batch size, the better the performance.
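As a rough sketch of that sizing advice applied to the write from the question (reusing the question's variables; the 2 x 4 figure assumes the two m5.xlarge core nodes with 4 vCPUs each):
# 2 core nodes x 4 vCPUs each = 8 partitions as a starting point;
# tune this up or down based on data volume.
num_partitions = 2 * 4

df.repartition(num_partitions).write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()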

Transfer a MongoDB database over an unstable connection

I have a fairly small MongoDB instance (15GB) running on my local machine, but I need to push it to a remote server in order for my partner to work on it. The problem is twofold:
The server only has 30GB of free space
My local internet connection is very unstable
I tried copyDatabase to transfer it directly, but it would take approximately 2 straight days to finish, during which the connection is almost guaranteed to fail at some point. I have also tried both mongoexport and mongodump, but both produce files that are ~40GB, which won't fit on the server, and that's ignoring the difficulties of transferring 40GB in the first place.
Is there another, more stable method that I am unaware of?
Since your mongodump output is much larger than your data, I'm assuming you are using MongoDB 3.0+ with the WiredTiger storage engine and your data is compressed but your mongodump output is not.
As at MongoDB 3.2, the mongodump and mongorestore tools now have support for compression (see: Archiving and Compression in MongoDB Tools). Compression is not used by default.
For your use case as described I'd suggest:
Use mongodump --gzip to create a dump directory with compressed backups of all of your collections.
Use rsync --partial SRC .... DEST or similar for a (resumable) file transfer over your unstable internet connection.
NOTE: There may be some directories you can tell rsync to ignore with --exclude; for example the local and test databases can probably be skipped. Alternatively, you may want to specify a database to backup with mongodump --gzip --db dbname.
Your partner can use a similar rsync commandline to transfer to their environment, and a command line like mongorestore --gzip /path/to/backup to populate their local MongoDB instance.
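Putting those steps together, a rough sketch, shown as Python subprocess calls for consistency with the rest of this post; the host name, paths, and database name are placeholders:
import subprocess

# Compressed dump of a single database into a local directory
subprocess.run(["mongodump", "--gzip", "--db", "dbname", "--out", "/backups/dump"], check=True)

# --partial keeps partially transferred files so the copy can resume after the
# connection drops; -avz adds archive mode and compression on the wire.
subprocess.run(
    ["rsync", "-avz", "--partial", "/backups/dump/", "user@remote-server:/backups/dump/"],
    check=True,
)

# On the server, your partner restores from the compressed dump:
#   mongorestore --gzip /backups/dump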
If you are going to transfer dumps on an ongoing basis, you will probably find rsync's --checksum option useful to include. Normally rsync transfers "updated" files based on a quick comparison of file size and modification time. A checksum involves more computation but would allow skipping collections that have identical data to previous backups (aside from the modification time).
If you need to sync data changes on an ongoing basis, you may also be better off moving your database to a cloud service (e.g. a Database-as-a-Service provider like MongoDB Atlas, or your own cloud-hosted MongoDB instance).

Performance issue while mongodump is running

We operate a server for our customer with a single mongo instance, Gradle, Postgres, and nginx running on it. The problem is that we have massive performance problems while mongodump is running: the mongo queue grows and no data can be queried. The other problem is that the customer does not want to invest in a replica set or a software update (mongod 3.x).
Does anybody have any idea how I could improve the performance?
command to create the dump:
mongodump -u ${MONGO_USER} -p ${MONGO_PASSWORD} -o ${MONGO_DUMP_DIR} -d ${MONGO_DATABASE} --authenticationDatabase ${MONGO_DATABASE} > /backup/logs/mongobackup.log
tar cjf ${ZIPPED_FILENAME} ${MONGO_DUMP_DIR}
System:
6 Cores
36 GB RAM
1TB SATA HDD
+ 2TB (backup NAS)
MongoDB 2.6.7
Thanks
Best regards,
Markus
As you have a heavy load, adding a replica set is a good solution, since the backup could then be taken on a secondary node. Be aware that a replica set needs at least three servers (you can have a master/slave/arbiter setup, where the arbiter needs only a small amount of resources).
mongodump takes a general query lock, which will have an impact if there are a lot of writes to the dumped database.
Hint: try to run the backup when there is light load on the system.
Try volume snapshots. Check with your cloud provider what options are available for taking snapshots. It is super fast and cheaper if you compare it to the actual cost of taking a backup (RAM and CPU used, and if on HDD, the transaction cost, even if it is small).

Auto compact the deleted space in mongodb?

The MongoDB documentation says that
To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).
in http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
I wonder how to make MongoDB free deleted disk space automatically?
P.S. We store many downloading tasks in MongoDB, up to 20GB, and finish these within half an hour.
In general if you don't need to shrink your datafiles you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation and the more space that MongoDB can allocate in datafiles the less fragmentation you will have.
So, you should try to provide as much disk-space as possible for the database.
However if you must shrink the database you should keep two things in mind.
MongoDB grows its data files by doubling, so the datafiles may be 64MB, then 128MB, etc., up to 2GB (at which point it stops doubling and keeps each file at 2GB).
As with most any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So, you'll need to create a job using your OS's scheduler to run a shrink. You could use a bash script, or have a job run a PHP script, etc.
Serverside Javascript
You can use server-side Javascript to do the shrink and run that JS via mongo's shell on a regular basis via a job (like cron or the Windows scheduling service) ...
Assuming a collection called foo you would save the javascript below into a file called bar.js and run ...
$ mongo foo bar.js
The javascript file would look something like ...
// Get the current collection sizes.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();
print('Storage Size: ' + tojson(storage));
print('TotalSize: ' + tojson(total));
print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');
// Run repair
db.repairDatabase()
// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();
print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));
This will run and return something like ...
MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153
Run this on a schedule (during none peak hours) and you are good to go.
Capped Collections
However there is one other option, capped collections.
Capped collections are fixed-size collections that have a very high performance auto-FIFO age-out feature (age-out is based on insertion order). They are a bit like the "RRD" concept if you are familiar with that.
In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.
Basically, you can limit the size of (or number of documents in) a collection to, say, 20GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.
This is a great way to keep a large amount of data, discarding the older data as time goes by and keeping the same amount of disk-space used.
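A minimal pymongo sketch of creating such a collection (the connection string, names, and the 20GB cap are placeholders):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Fixed-size collection: once roughly 20GB is used, the oldest documents are
# aged out automatically in insertion order.
db.create_collection("download_tasks", capped=True, size=20 * 1024 ** 3)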
I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.
You must be using a replica set.
My thought is once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up and let it resynchronize with the master.
The process is time consuming, but it should only cost a few seconds of down time, when you do the rs.stepDown().
Also this can not be automated. Well it could, but I don't think I'm willing to try.
Running db.repairDatabase() will require that you have space equal to the current size of the database available on the file system. This can be bothersome when you know that the collections left or data you need to retain in the database would currently use much less space than what is allocated and you do not have enough space to make the repair.
As an alternative if you have few collections you actually need to retain or only want a subset of the data, then you can move the data you need to keep into a new database and drop the old one. If you need the same database name you can then move them back into a fresh db by the same name. Just make sure you recreate any indexes.
use cleanup_database
db.dropDatabase();

use oversize_database
db.collection.find({},{}).forEach(function(doc){
    // copy each document into the cleanup database without clobbering `db`
    var cleanupDb = db.getSiblingDB("cleanup_database");
    cleanupDb.collection_subset.insert(doc);
});

use oversize_database
db.dropDatabase();

use cleanup_database
db.collection_subset.find({},{}).forEach(function(doc){
    // copy each document back into the freshly recreated database
    var oversizeDb = db.getSiblingDB("oversize_database");
    oversizeDb.collection.insert(doc);
});

use oversize_database
// <add indexes>
db.collection.ensureIndex({field:1});

use cleanup_database
db.dropDatabase();
An export/drop/import operation for databases with many collections would likely achieve the same result but I have not tested.
Also, as a policy you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except indexes would be lost, and your db and collections will be recreated when the inserts for the processes run next. Just make sure your jobs include creating any necessary indexes at an appropriate time.
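For instance, a small pymongo sketch of that drop-and-recreate policy (the connection string, database, collection, and index names here are hypothetical):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Drop the whole transient/processing database once the jobs complete; the
# next run's inserts recreate the database and collections on demand.
client.drop_database("processing_db")

# Recreate any necessary indexes when the next job starts loading data.
client["processing_db"]["tasks"].create_index("field")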
If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.
To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), effectively doing the same thing as a repair, and disk space is reclaimed.
By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.
To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:
NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES
#!/bin/bash
# First arg is the host MongoDB is running on, second arg is the MongoDB port

MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath

# make sure the node we are connecting to is not the primary
while [ "$($MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'db.isMaster().ismaster')" = "true" ]
do
    $MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'rs.stepDown()'
    sleep 2
done
echo "Node is no longer primary!"

# Now shut down that server
# something like (assuming the user is set up for key-based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t user@$MONGOHOST sudo service mongodb stop

# Wipe the data files for that server
ssh -t user@$MONGOHOST sudo rm -rf $DBPATH
ssh -t user@$MONGOHOST sudo mkdir $DBPATH
ssh -t user@$MONGOHOST sudo chown mongodb:mongodb $DBPATH

# Start up the server again
# similar to the shutdown, something like
ssh -t user@$MONGOHOST sudo service mongodb start