How to import 700+ million rows into MongoDB in minutes


We have a 32-core Windows Server with 96 GB RAM and a 5 TB HDD.
Approach 1 (using Oracle SQL*Loader)
We fetched the input data from an Oracle database.
We processed it and generated multiple TSV files.
Using threading, we imported the data into the Oracle database using SQL*Loader.
It takes approximately 66 hours.
Approach 2 (using mongoimport)
We fetched the input data from an Oracle database.
We processed it and generated multiple TSV files.
Using threading, we imported the data into a MongoDB database using the mongoimport command-line utility.
It takes approximately 65 hours.
We observed no considerable difference in performance.
We need to process 700+ million records. Please suggest a better approach for optimized performance.
We fetch from an Oracle database, process the data in our application, and store the output in another database. This is an existing process that we run on an Oracle database, but it is time-consuming, so we decided to try MongoDB for a performance improvement.
We did one POC, where we did not see any considerable difference. We thought it might perform better on the server because of the hardware, so we repeated the POC on the server, where we got the results mentioned above.
We believed MongoDB to be more robust than the Oracle database, but we did not get the desired result when comparing the stats.
Please find the MongoDB-related details of the production server below:
MongoImport Command
mongoimport --db abcDB --collection abcCollection --type tsv --file abc.tsv --headerline --numInsertionWorkers 8 --bypassDocumentValidation
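The threaded import described in Approach 2 might look roughly like the sketch below. It is only an illustration: the tsv_parts directory, the file pattern, and the limit of 4 parallel processes are assumptions, not details from the post. The mongoimport flags are the ones shown in the command above.

```python
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_import_command(tsv_path, db="abcDB", collection="abcCollection"):
    """Return the mongoimport argument list for one TSV file."""
    return [
        "mongoimport",
        "--db", db,
        "--collection", collection,
        "--type", "tsv",
        "--file", tsv_path,
        "--headerline",
        "--numInsertionWorkers", "8",
        "--bypassDocumentValidation",
    ]


def import_all(tsv_paths, max_parallel=4, run=subprocess.run):
    """Run one mongoimport process per file, a few at a time.

    Returns the exit code of each process, in the same order as tsv_paths.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(lambda p: run(build_import_command(p)), tsv_paths)
        return [r.returncode for r in results]


if __name__ == "__main__":
    # Hypothetical location of the generated TSV files.
    files = sorted(glob.glob("tsv_parts/*.tsv"))
    print(import_all(files))
```

Threads are enough here because each worker only waits on an external mongoimport process; the actual work happens outside the Python interpreter.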
Wired Tiger Configuration
storage:
  dbPath: C:\data\db
  journal:
    enabled: false
  wiredTiger:
    engineConfig:
      cacheSizeGB: 40
The approximate computation times were calculated from the process logs of the runs using Oracle and using MongoDB.
The POC described above was carried out on the production server to compare the performance of Oracle (SQL*Loader) vs MongoDB (mongoimport).
As we are using a standalone MongoDB instance for our POC, we have not set up any sharding on the production server.
If we get the desired result using MongoDB, we will then decide on the migration.
Thanking you in advance.

Related

How to speed up spark df.write jdbc to postgres database?

I am new to spark and am attempting to speed up appending the contents of a dataframe, (that can have between 200k and 2M rows) to a postgres database using df.write:
df.write.format('jdbc').options(
url=psql_url_spark,
driver=spark_env['PSQL_DRIVER'],
dbtable="{schema}.{table}".format(schema=schema, table=table),
user=spark_env['PSQL_USER'],
password=spark_env['PSQL_PASS'],
batchsize=2000000,
queryTimeout=690
).mode(mode).save()
I tried increasing the batchsize but that didn't help, as completing this task still took ~4hours. I've also included some snapshots below from aws emr showing more details about how the job ran. The task to save the dataframe to the postgres table was only assigned to one executor (which I found strange), would speeding this up involve dividing this task between executors?
Also, I have read spark's performance tuning docs but increasing the batchsize, and queryTimeout have not seemed to improve performance. (I tried calling df.cache() in my script before df.write, but runtime for the script was still 4hrs)
Additionally, my aws emr hardware setup and spark-submit are:
Master Node (1): m4.xlarge
Core Nodes (2): m5.xlarge
spark-submit --deploy-mode client --executor-cores 4 --num-executors 4 ...

Spark is a distributed data processing engine, so when you are processing your data or saving it on file system it uses all its executors to perform the task.
The Spark JDBC write is slow here because the whole write goes through a JDBC connection opened by just one of the executors (as you observed), which results in slow speeds and failures.
To overcome this problem and speed up data writes to the database you need to use one of the following approaches:
Approach 1:
In this approach you need to use postgres COPY command utility in order to speed up the write operation. This requires you to have psycopg2 library on your EMR cluster.
The documentation for COPY utility is here
If you want to know the benchmark differences and why copy is faster visit here!
Postgres also suggests using the COPY command for bulk inserts. So how do you bulk insert a Spark dataframe?
To implement faster writes, first save your Spark dataframe to the EMR file system in CSV format, and repartition your output so that no file contains more than 100k rows.
#Repartition your dataframe dynamically based on number of rows in df
df.repartition(10).write.option("maxRecordsPerFile", 100000).mode("overwrite").csv("path/to/save/data")
Now read the files using python and execute copy command for each file.
import glob
import psycopg2
# collect the CSV part files written by Spark (you can also get the file list using the os module)
files = glob.glob('path/to/save/data/part-*.csv')
# define a function that loads one file with COPY
def execute_copy(file_path):
    con = psycopg2.connect(database=dbname, user=user, password=password, host=host, port=port)
    cursor = con.cursor()
    # copy_from expects an open file object, so open the file inside the worker
    with open(file_path) as f:
        cursor.copy_from(f, 'table_name', sep=",")
    con.commit()
    con.close()
To gain an additional speed boost, since you are using an EMR cluster, you can leverage Python multiprocessing to copy more than one file at once.
from multiprocessing import Pool, cpu_count
# pass file paths rather than open file objects, which cannot be sent to worker processes
with Pool(cpu_count()) as p:
    p.map(execute_copy, files)
This is the approach recommended as spark JDBC can't be tuned to gain higher write speeds due to connection constraints.
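One caveat with the COPY approach above: copy_from reads PostgreSQL's plain text format, so Spark CSV output containing quoted fields or embedded commas may not load cleanly. A possible variant, sketched here, uses psycopg2's copy_expert with FORMAT csv instead. The table name, file pattern, and connection settings are placeholders, not values from the original answer.

```python
import glob


def copy_statement(table, header=False):
    """Build a COPY ... FROM STDIN statement that parses real CSV."""
    options = "FORMAT csv, HEADER true" if header else "FORMAT csv"
    return f"COPY {table} FROM STDIN WITH ({options})"


def copy_csv_file(path, table, conn_params):
    """Load one CSV file into `table` over its own connection."""
    import psycopg2  # imported here so copy_statement works without the driver

    con = psycopg2.connect(**conn_params)
    try:
        with con.cursor() as cursor, open(path) as f:
            cursor.copy_expert(copy_statement(table), f)
        con.commit()
    finally:
        con.close()


if __name__ == "__main__":
    # Placeholder connection settings and paths; replace with your own.
    params = dict(dbname="mydb", user="user", password="secret",
                  host="localhost", port=5432)
    for csv_path in sorted(glob.glob("path/to/save/data/part-*.csv")):
        copy_csv_file(csv_path, "table_name", params)
```

Pass header=True to copy_statement only if the files were written with Spark's header option enabled.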
Approach 2:
Since you are already using an AWS EMR cluster you can always leverage the hadoop capabilities to perform your table writes faster.
So here we will be using sqoop export to export our data from emrfs to the postgres db.
#If you are using s3 as your source path
sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir s3://mybucket/myinputfiles/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16
#If you are using EMRFS as your source path
sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir /path/to/save/data/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16
Why sqoop?
Because sqoop opens multiple connections to the database based on the number of mappers specified. So if you specify -m as 8, there will be 8 concurrent connection streams writing data to Postgres.
Also, for more information on using sqoop, go through this AWS Blog, SQOOP Considerations and SQOOP Documentation.
If you are comfortable working in code, then Approach 1 should give you the performance boost you seek, and if you are comfortable with Hadoop components like SQOOP, then go with the second approach.
Hope it helps!

Spark-side tuning => Repartition the DataFrame so that multiple executors write to the DB in parallel
(df
    .repartition(10)  # No. of concurrent connections from Spark to PostgreSQL
    .write.format('jdbc').options(
        url=psql_url_spark,
        driver=spark_env['PSQL_DRIVER'],
        dbtable="{schema}.{table}".format(schema=schema, table=table),
        user=spark_env['PSQL_USER'],
        password=spark_env['PSQL_PASS'],
        batchsize=2000000,
        queryTimeout=690
    ).mode(mode).save())
PostgreSQL-side tuning =>
You will need to bump up the parameters below on PostgreSQL accordingly.
max_connections determines the maximum number of concurrent connections to the database server. The default is typically 100 connections.
shared_buffers determines how much memory is dedicated to PostgreSQL for caching data.

To solve the performance issue, you generally need to resolve the two bottlenecks below:
Make sure the Spark job writes the data to the DB in parallel -
To do this, make sure you have a partitioned DataFrame. Use "df.repartition(n)" to partition the DataFrame so that the partitions are written to the DB in parallel.
Note - A large number of executors writing at once will also lead to slow inserts. So start with 5 partitions and increase the number of partitions by 5 until you get optimal performance.
Make sure the DB has enough compute, memory and storage for ingesting bulk data.
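The "start with 5 and increase by 5" advice can be turned into a small search loop. The sketch below is one possible way to do it, not part of the original answer: timed_write is a hypothetical callable you supply that performs the write with n partitions and returns the elapsed seconds, and the upper limit of 50 is an arbitrary assumption.

```python
def pick_partition_count(timed_write, start=5, step=5, limit=50):
    """Try start, start+step, ... and stop once the write stops getting faster.

    timed_write(n) must perform the write with n partitions and return
    the elapsed seconds. Returns (best_n, best_seconds).
    """
    best_n, best_seconds = start, timed_write(start)
    n = start + step
    while n <= limit:
        seconds = timed_write(n)
        if seconds >= best_seconds:
            break  # more parallelism no longer helps
        best_n, best_seconds = n, seconds
        n += step
    return best_n, best_seconds
```

With PySpark, timed_write could wrap the repartitioned JDBC write against a staging table and measure it with time.perf_counter(). Because the loop stops at the first slowdown, a noisy measurement can end the search early, so it may be worth repeating each trial.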

It is well known that repartitioning the dataframe can give better write performance, but there is an optimal way to choose the number of partitions.
Since you are running this process on an EMR cluster, first find out the instance type and the number of cores on each of your core (worker) instances, and set the number of partitions on the dataframe accordingly.
In your case you are using 2 m5.xlarge core nodes with 4 vCPUs each, which means 4 threads per instance. So 8 partitions should give you a near-optimal result when you are dealing with large data.
Note: The number of partitions should be increased or decreased based on your data size.
Note: Batch size is also something you should consider in your writes. In general, a bigger batch size gives better performance.
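The arithmetic behind the 8-partition suggestion is simply one partition per vCPU across the core nodes. A minimal sketch follows; the function name is made up for illustration.

```python
def suggested_partitions(core_nodes, vcpus_per_node):
    """One partition per available vCPU across the core nodes."""
    return core_nodes * vcpus_per_node


# Two m5.xlarge core nodes with 4 vCPUs each, as in the question.
n = suggested_partitions(core_nodes=2, vcpus_per_node=4)
print(n)  # 8
```

The result would then be passed to df.repartition(n) before the write, and adjusted up or down for the data size as noted above.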

Is it possible to restore a MongoDB database from a .bson file quicker than mongorestore?

I have a very large database from an old backup. It's about 500GB in total, stored as a single .bson file. At the current rate of my hard drive and CPU, the restore will probably take 10-20 hours. EDIT: About 9 hours.
I simply ran:
mongorestore -d database -c collection C:\very_large_backup.bson
Is it possible for MongoDB to simply access the .bson file directly, or is mongorestore the only option I have?
I plan on moving to a Microsoft SQL Server with this data (discarding the extra bits of information that might overlap). Maybe there's a faster way that way?

Neo4J: Importing a large Cypher dump

I have a large dump (millions of nodes and relationships) from a Neo4J 2.2.5 database in Cypher format (produced with neo4j-sh -c dump), that I'm trying to import into a 3.0.3 instance.
However, the import process (neo4j-sh < dump.cypher) slows down drastically after a few minutes, down to a couple records per second.
Is there any way to speed up this process? I tried upgrading the database as described in the manual, but the new instance crashes with an exception about a version mismatch in the store format.

Neo4j 3.0 comes with a bin/neo4j-admin tool for exactly this purpose.
Try: bin/neo4j-admin import --mode database --from /path/to/db
See: http://neo4j.com/docs/operations-manual/current/deployment/upgrade/#upgrade-instructions
The cypher dump is not useful for large databases; it's only for smaller setups (a few thousand nodes), for demos etc.
FYI: In Neo4j 3.0 the cypher export procedure from APOC is much better suited to large-scale cypher dumps.
Update
You can also try to upgrade from 2.2 to 2.3 first, e.g. by using neo4j-shell:
add allow_store_upgrade=true to your neo4j.properties in 2.3
and then run: bin/neo4j-shell -path /path/to/db -config conf/neo4j.properties -c quit
Once it has finished, that backup of your db is on version 2.3.
Then you should be able to use neo4j-admin import ...

I recently had this same symptom, with my CSV import slowing to a crawl.
My load-csv cypher script had too many relationships.
So I divided my load in two: first create the nodes, then the relationships and the most connected nodes. HTH.
Back to your issue
First, try to increase the memory for the JVM. In NEO/conf, there is a wrapper file. The memory settings are at the beginning.
Lastly, from an instance with your data, export to multiple CSV files and import them into your new server.

Transfer a MongoDB database over an unstable connection

I have a fairly small MongoDB instance (15GB) running on my local machine, but I need to push it to a remote server in order for my partner to work on it. The problem is twofold,
The server only has 30GB of free space
My local internet connection is very unstable
I tried copyDatabase to transfer it directly, but it would take approximately 2 straight days to finish, in which the connection is almost guaranteed to fail at some point. I have also tried both mongoexport and mongodump but both produce files that are ~40GB, which won't fit on the server, and that's ignoring the difficulties of transferring 40GB in the first place.
Is there another, more stable method that I am unaware of?

Since your mongodump output is much larger than your data, I'm assuming you are using MongoDB 3.0+ with the WiredTiger storage engine and your data is compressed but your mongodump output is not.
As of MongoDB 3.2, the mongodump and mongorestore tools support compression (see: Archiving and Compression in MongoDB Tools). Compression is not used by default.
For your use case as described I'd suggest:
Use mongodump --gzip to create a dump directory with compressed backups of all of your collections.
Use rsync --partial SRC .... DEST or similar for a (resumable) file transfer over your unstable internet connection.
NOTE: There may be some directories you can tell rsync to ignore with --exclude; for example the local and test databases can probably be skipped. Alternatively, you may want to specify a database to backup with mongodump --gzip --db dbname.
Your partner can use a similar rsync commandline to transfer to their environment, and a command line like mongorestore --gzip /path/to/backup to populate their local MongoDB instance.
If you are going to transfer dumps on an ongoing basis, you will probably find rsync's --checksum option useful to include. Normally rsync transfers "updated" files based on a quick comparison of file size and modification time. A checksum involves more computation but would allow skipping collections that have identical data to previous backups (aside from the modification time).
If you need to sync data changes on an ongoing basis, you may also be better off moving your database to a cloud service (e.g. a Database-as-a-Service provider like MongoDB Atlas, or your own MongoDB instance).
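The dump-and-rsync steps suggested above could be scripted along the following lines. This is a sketch under stated assumptions: the retry count and wait time are arbitrary, -a is added to rsync so that the dump directory is copied recursively, and the helper names are made up for illustration.

```python
import subprocess
import time


def dump_command(db, out_dir):
    """mongodump with gzip-compressed output for one database."""
    return ["mongodump", "--gzip", "--db", db, "--out", out_dir]


def rsync_command(src, dest):
    """rsync that keeps partially transferred files so a retry can resume."""
    return ["rsync", "-a", "--partial", src, dest]


def run_with_retries(command, attempts=20, wait_seconds=30,
                     run=subprocess.call, sleep=time.sleep):
    """Re-run `command` until it exits with 0; return the attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if run(command) == 0:
            return attempt
        sleep(wait_seconds)
    raise RuntimeError(f"gave up after {attempts} attempts: {command}")
```

A typical use would be subprocess.check_call(dump_command("dbname", "dump")) followed by run_with_retries(rsync_command("dump/", "user@server:/backups/dump/")), where the database name, paths and host are placeholders.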

MongoDB 2.2: why didn't replication catch up a collection following a dump/restore?

We have a three-server replicaset running MongoDB 2.2 on Ubuntu 10.04, and recently had to upgrade the hard drive for each server where one particular database resides. This database contains log information for web service requests, where they write to collections in hourly buckets using the current timestamp to determine the name, e.g. log_yyyymmddhh.
I performed this process:
backup the database on the primary server with mongodump --db log_db
take a secondary server offline, replace the disk
bring the secondary server up in standalone mode (i.e. comment out the replSet entry in /etc/mongodb.conf before starting the service)
restore the database on the secondary server with mongorestore --drop --db log_db
add the secondary server back into the replicaset and bring it online, letting replication catch up the hourly buckets that were updated/created while it had been offline
Everything seemed to go as expected, except that the collection which was the current bucket at the time of the backup was not brought up to date by replication. I had to copy that collection over by hand to get it up to date. Note that collections which were created after the backup were synced just fine.
What did I miss in this process that caused MongoDB not to get things back in synch for that one collection? I assume something got out of whack with regard to the oplog?
Edit 1:
The oplog on the primary showed that its earliest timestamp went back a couple of days, so there should have been plenty of space to maintain transactions for a few hours (which was the time the secondary was offline).
Edit 2:
Our MongoDB installation uses two disk partitions: /dev/sda1 and /dev/sdb1. The primary MongoDB directory /var/lib/mongodb/ is on /dev/sda1, and holds several databases, while the log database resides by itself on /dev/sdb1. There's a sym link /var/lib/mongodb/log_db which points to a directory on /dev/sdb1. Since the log db was getting full, we needed to upgrade the disk for /dev/sdb1.

You should be using mongodump with the --oplog option. Running a full database backup with mongodump on a replicaset that is updating collections at the same time may not leave you with a consistent backup. This becomes worse with larger databases, more collections and more frequent updates/inserts/deletes.
From the documentation for your version (2.2) of MongoDB (it's the same for 2.6 but just to be as accurate as possible):
--oplog
Use this option to ensure that mongodump creates a dump of the
database that includes an oplog, to create a point-in-time snapshot of
the state of a mongod instance. To restore to a specific point-in-time
backup, use the output created with this option in conjunction with
mongorestore --oplogReplay.
Without --oplog, if there are write operations during the dump
operation, the dump will not reflect a single moment in time. Changes
made to the database during the update process can affect the output
of the backup.
http://docs.mongodb.org/v2.2/reference/mongodump/
This is not covered well in most MongoDB tutorials around backups and restores. Generally you are better off if you can perform a live snapshot of the storage volume your database resides on (assuming your storage solution has a live snapshot ability compatible with MongoDB). Failing that, your next best bet is taking a secondary offline and then performing a snapshot or backup of the database files. Mongodump on a live database is increasingly a less optimal solution for larger databases due to performance issues.
I'd definitely take a look at the MongoDB overview of backup options: http://docs.mongodb.org/manual/core/backups/
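For reference, the --oplog / --oplogReplay pairing quoted above might be assembled like this. It is a sketch only: the host value and directory are placeholders, and it assumes (as far as I know) that --oplog applies to a dump of the whole instance rather than a single --db.

```python
def consistent_dump_command(out_dir, host="localhost:27017"):
    """mongodump of the whole instance plus the oplog written during the dump."""
    # No --db/--collection here: --oplog is assumed to need a full-instance dump.
    return ["mongodump", "--host", host, "--oplog", "--out", out_dir]


def consistent_restore_command(dump_dir, host="localhost:27017"):
    """mongorestore that replays the captured oplog after loading the data."""
    return ["mongorestore", "--host", host, "--oplogReplay", dump_dir]


if __name__ == "__main__":
    # Print the commands instead of running them; "dump" is a placeholder path.
    print(" ".join(consistent_dump_command("dump")))
    print(" ".join(consistent_restore_command("dump")))
```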

I would guess this has to do with the oplog not being long enough, although it seems like you checked that and it looked reasonably big.
Still, when adding new members to a replica set you shouldn't be snapshotting and restoring them. It's better to simply add a new member and let replication happen by itself. This is described in the Mongo docs and is the process I've always followed.
