Push billions of records spread across CSV files in S3 to MongoDB

I have an S3 bucket that receives roughly 14-15 billion records spread across 26,000 CSV files every day.
I need to parse these files and push the records to MongoDB.
Previously, with just 50 to 100 million records, I used bulk upserts with multiple parallel processes on an EC2 instance and it worked fine. But since the number of records has increased drastically, that method is no longer efficient enough.
So what will be the best method to do this?

You should look at mongoimport, which is written in Go and can make effective use of threads to parallelize the upload. It's pretty fast. You would have to copy the files from S3 to local disk prior to importing, but if you put the node in the same region as the S3 bucket and the database, it should run quickly. Also, if you use MongoDB Atlas, you could use its API to turn up the IOPS on your cluster while you load and dial them back down afterwards to speed up the process.
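As a minimal sketch, one worker could pull a single CSV down from S3 and hand it to mongoimport; the bucket, key, connection URI and database/collection names below are placeholders, and in practice you would run one such worker per file (or per batch of files) in parallel:

    # Hypothetical example: download one CSV from S3, then bulk-load it with mongoimport.
    import subprocess
    import boto3

    s3 = boto3.client("s3")
    s3.download_file("my-ingest-bucket", "daily/part-00001.csv", "/data/part-00001.csv")

    subprocess.run(
        [
            "mongoimport",
            "--uri", "mongodb://mongo-host:27017/ingest",
            "--collection", "records",
            "--type", "csv",
            "--headerline",                  # first CSV line holds the field names
            "--numInsertionWorkers", "8",    # parallel insertion threads per import
            "--file", "/data/part-00001.csv",
        ],
        check=True,
    )

mongoimport also supports --mode upsert (with --upsertFields) if you need upsert rather than plain insert semantics.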

Related

How to check, in the GCP console, the number of PUT operations performed while uploading a file to a GCS bucket?

I have uploaded a big file of more than 5 GB, and I want to check whether this big file counts as a single PUT operation or as multiple PUT operations based on the number of smaller chunks GCP divided the file into.
I am not able to find any such monitoring in the GCP console. Kindly help/guide me on how to check the number of class A or class B operations performed so far in GCP.

Is writing files to a gcloud storage bucket supposed to be slow?

I'm using a gcloud storage bucket mounted on a VM instance with gcsfuse. I have no problems opening and reading files stored on the bucket, but when I try to write files to the bucket it is enormously slow, and when I say 'enormously' I mean at least 10 times slower, if not 100 times. Is it supposed to be that way? If so, I guess I'm going to have to write files to a persistent disk, then upload the files to the storage bucket, then download the files to my personal computer from the storage bucket. Although the process will take the same amount of time, at least the psychological demoralization will not occur.
From the documentation:
- Performance: Cloud Storage FUSE has much higher latency than a local file system. As such, throughput may be reduced when reading or writing one small file at a time. Using larger files and/or transferring multiple files at a time will help to increase throughput.
- Individual I/O streams run approximately as fast as gsutil.
- The gsutil rsync command can be particularly affected by latency because it reads and writes one file at a time. Using the top-level -m flag with the command is often faster.
- Small random reads are slow due to latency to first byte (don't run a database over Cloud Storage FUSE!).
- Random writes are done by reading in the whole blob, editing it locally, and writing the whole modified blob back to Cloud Storage. Small writes to large files work as expected, but are slow and expensive.
Optionally, please check out the gsutil tool, the GCS client libraries, or even the Storage Transfer Service, since they may suit your needs better depending on your specific use case.
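For example, instead of writing output through the gcsfuse mount, you could write it to a local or persistent disk and upload it with the Python client library; the bucket name and file paths below are placeholders:

    # Hypothetical example: upload a locally written file directly to GCS,
    # bypassing the gcsfuse mount for the write path.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-results-bucket")

    # One streaming upload of the whole file is much cheaper than many small
    # random writes through FUSE.
    blob = bucket.blob("outputs/result.csv")
    blob.upload_from_filename("/mnt/disks/scratch/result.csv")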
I hope this clarifies your concerns.

Loading data to Postgres RDS is still slow after tuning parameters

We have created an RDS PostgreSQL instance (m4.xlarge) with 200 GB of storage (Provisioned IOPS). We are trying to upload data from the company data mart to the 23 tables in RDS using DataStage. However, the uploads are quite slow: it takes about 6 hours to load 400K records.
Then I started tuning the following parameters according to Best Practices for Working with PostgreSQL:
autovacuum 0
checkpoint_completion_target 0.9
checkpoint_timeout 3600
maintenance_work_mem {DBInstanceClassMemory/16384}
max_wal_size 3145728
synchronous_commit off
Other than these, I also turned off Multi-AZ and backups. SSL is enabled, though; I'm not sure whether this changes anything. However, after all the changes there is still not much improvement. DataStage is already uploading data in parallel with ~12 threads. Write IOPS is around 40/sec. Is this value normal? Is there anything else I can do to speed up the data transfer?
In PostgreSQL, you have to wait one full round trip (latency) for each INSERT statement you send. This latency is the latency between the database and the machine the data is being loaded from.
In AWS you have many options to improve performance.
For starters, you can copy your raw data onto an EC2 instance and import from there; however, you will likely not be able to use your DataStage tool unless it can be run directly on that EC2 instance.
You can configure DataStage to use batch processing, where each INSERT statement actually contains many rows; generally, the more rows per statement, the faster the load (see the sketch below).
Finally, disable data compression and make sure you've done everything you can to minimize latency between the two endpoints.
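As a rough illustration of the batching point, here is a hypothetical Python/psycopg2 sketch (the table name, connection string and data are made up); DataStage would do the equivalent internally, and a COPY-based load would typically be faster still:

    # Hypothetical example: send many rows per statement instead of one INSERT per row,
    # so each network round trip carries a whole batch.
    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect(
        "host=mydb.xxxx.us-east-1.rds.amazonaws.com dbname=mart user=loader password=secret"
    )
    rows = [(1, "a"), (2, "b"), (3, "c")]  # in practice, thousands of rows per batch

    with conn, conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO staging_orders (id, payload) VALUES %s",
            rows,
            page_size=1000,  # rows folded into each multi-row INSERT
        )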

Can I use Amazon Kinesis to connect to Amazon Redshift for data loads every couple of minutes?

I am planning to use Amazon Kinesis to capture the stream from lots of sources and, after a certain level of data transformation, direct the stream into a Redshift cluster under some table schema. I am not sure whether this is the right way to do it or not.
From the Kinesis documentation I found that it has a direct connector to Redshift. However, I have also found that Redshift works better with bulk uploads, since a data warehouse system needs indexing, so the recommendation was to store the whole stream in S3 and then use the COPY command to bulk-load it into Redshift. Could someone please add some more perspective?
When you use the connector library for Kinesis, you will be pushing data into Redshift both through S3 and in batches.
It is true that calling INSERT INTO on Redshift is not efficient, as you are sending all the data through a single leader node instead of using the parallel power of Redshift that you get when running COPY from S3.
Since Kinesis is designed to handle thousands of events per second, running a COPY every few seconds or minutes will already batch many thousands of records.
If you want to squeeze the most out of Kinesis and Redshift, you can calculate exactly how many shards you need, how many Redshift nodes you need, and how many temporary files you need to accumulate in S3 from Kinesis before calling the COPY command on Redshift.
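As a rough sketch, once enough files have accumulated under an S3 prefix, the load itself is a single COPY issued against the cluster; the table name, S3 prefix and IAM role below are placeholders:

    # Hypothetical example: bulk-load a batch of Kinesis output files from S3 into
    # Redshift with COPY, which parallelizes the load across the cluster's slices.
    import psycopg2

    conn = psycopg2.connect(
        "host=my-cluster.xxxx.us-east-1.redshift.amazonaws.com port=5439 "
        "dbname=dw user=loader password=secret"
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY events
            FROM 's3://my-stream-bucket/kinesis/2015/06/01/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS CSV
            GZIP;
        """)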

Expected balancing behavior when restoring data with mongorestore to a sharded cluster

I have noticed that when restoring data with mongorestore to a sharded cluster through mongos, all the records are initially saved to the primary shard (of the collection), and only the balancer process moves these chunks afterwards, which is relatively slow, so right after the restore I have a situation like this:
chunks:
rs_shard-1 28
rs_shard-2 29
rs_shard-4 27
rs_shard-3 644
I don't have any errors in the mongodb/mongos log files.
I'm not sure, but I think that in the past data was restored in an already balanced way. I'm now using version 2.4.6. Can someone confirm what the expected behavior is?
Here is what happens, IMHO:
When restoring the data, there are initial ranges of chunks assigned to each shard. The data is inserted by mongorestore without waiting for any response from mongos, let alone the shards, resulting in relatively fast insertion of the documents. I assume that you have a monotonically increasing shard key, like ObjectId for example. Now what happens is that one shard has been assigned the range from X to infinity (called "maxKey" in mongoland) during the initial assignment of chunk ranges. The documents in this range will be created on that shard, resulting in a lot of chunk splits and an increasing number of chunks on that server. A chunk split triggers a balancer round, but since the insertion of new documents is faster than the chunk migration, the number of chunks increases faster than the balancer can reduce it.
So what I would do is check the shard key. I am pretty sure it is monotonically increasing, which is bad not only when restoring a backup but in production use, too. Please see the shard key documentation and "Considerations for Selecting Shard Keys" in the MongoDB docs.
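If you can re-create the collection, a hashed shard key avoids the problem, because inserts no longer all fall into the last chunk range. A hypothetical pymongo sketch (database and collection names are placeholders; hashed shard keys are available from 2.4 onwards):

    # Hypothetical example: shard on a hashed _id so restored documents are spread
    # across shards instead of piling onto one shard's maxKey chunk. Run via mongos.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")
    client.admin.command("enableSharding", "mydb")
    client.admin.command("shardCollection", "mydb.mycoll", key={"_id": "hashed"})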
A few additional notes. The mongodump utility is designed for small databases, like the config DB of a sharded cluster. Your database has a size of roughly 46.5 GB, which isn't exactly small. I'd rather use file system snapshots on each individual shard, synchronized using a cron job. If you really need point-in-time recovery, you can still run mongodump in direct file access mode against the snapshotted files to create a dump, and restore those dumps using the --oplogLimit option. Other than the ability to do a point-in-time recovery, using mongodump has no advantage over taking file system snapshots, but it has the disadvantage that you have to stop the balancer in order to get a consistent backup, and lock the database during the whole backup procedure in order to have a true point-in-time recovery option.
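If you do go the snapshot route, stopping and restarting the balancer can be scripted; here is a hypothetical sketch against the config database through mongos (the 2.4-era procedure of flipping the "stopped" flag in config.settings):

    # Hypothetical example: disable the balancer before taking per-shard snapshots,
    # then re-enable it afterwards.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")
    settings = client.config.settings

    # Stop the balancer (in practice, also wait for any in-flight migration to
    # finish before snapshotting).
    settings.update_one({"_id": "balancer"}, {"$set": {"stopped": True}}, upsert=True)

    # ... take file system snapshots on each shard ...

    # Re-enable the balancer once the snapshots are done.
    settings.update_one({"_id": "balancer"}, {"$set": {"stopped": False}}, upsert=True)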