I have two MongoDB clusters and want to export the data.
I am using an EC2 instance to log in to the DocumentDB cluster and run mongoexport to get all the documents in JSON format.
Problem:
The number of records is more than 2 billion, and mongoexport will create one single file with all the records.
Any suggestions on how to:
1. mongoexport all the data to multiple files
2. write all the exported data directly to an S3 bucket instead of first writing it to EC2 and then using aws s3 cp / sync to upload it to S3?
I looked at https://www.npmjs.com/package/mongo-to-s3 - too old to use.
https://www.npmjs.com/package/mongo-dump-s3-2 - it takes a mongodump, but I want the data in JSON format.
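A rough sketch of one way to do both from the EC2 instance; the endpoint, credentials, CA bundle, chunk size, and bucket name below are placeholders:

HOST="my-docdb-cluster.cluster-xxxxxxxx.us-east-1.docdb.amazonaws.com:27017"
DB="mydb"; COLL="mycoll"; BUCKET="s3://my-export-bucket/exports"

# 1) Multiple files: with no --out, mongoexport writes one JSON document per
#    line to stdout, so `split` can cut the stream into fixed-size chunks.
mongoexport --host "$HOST" --ssl --sslCAFile rds-combined-ca-bundle.pem \
  --username myuser --password mypass \
  --db "$DB" --collection "$COLL" \
  | split -l 10000000 -d - export_part_        # ~10M documents per file

# 2) Straight to S3: `aws s3 cp - <target>` reads from stdin, so nothing has
#    to be written to the EC2 disk first. Running several of these in
#    parallel, each with a --query on an _id range, gives multiple S3 objects.
mongoexport --host "$HOST" --ssl --sslCAFile rds-combined-ca-bundle.pem \
  --username myuser --password mypass \
  --db "$DB" --collection "$COLL" \
  | gzip \
  | aws s3 cp - "$BUCKET/$COLL.json.gz"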
Related
I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
Download all the JSON files locally
Run a ruby script to parse the JSON files to generate a CSV file matching the table in the database
Connect to RDS using psql
Use the \copy command to append the data to the table (sketched below)
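For reference, the \copy step is a single psql invocation along these lines (connection string, table, and file names are placeholders):

# Append the generated CSV to the target table; \copy streams the file from
# the client machine, so it works against RDS without server file access.
psql "$RDS_CONNECTION_STRING" \
  -c "\copy my_table FROM 'generated.csv' WITH (FORMAT csv, HEADER)"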
I would like to switch this to an automated approach (maybe using an AWS Lambda). What would be the best practice?
Approach 1:
Run a script (Ruby / JS) that parses all folders from the past period (e.g., a week) and, while parsing each file, connects to the RDS DB and executes an INSERT command. I feel this would be a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is - how do I then use this temporary file to do a bulk import?
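One possible way to do the bulk-import part of Approach 2, assuming the RDS database is PostgreSQL with the aws_s3 extension available and an IAM role that lets it read the bucket (table, bucket, key, and region are placeholders):

# Import the CSV staged in S3 directly into the target table -- no local
# download and no intermediate copy of the file in the Lambda is needed.
psql "$RDS_CONNECTION_STRING" <<'SQL'
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

SELECT aws_s3.table_import_from_s3(
  'my_table',                                  -- target table
  '',                                          -- column list ('' = all columns)
  '(FORMAT csv, HEADER true)',                 -- COPY options
  aws_commons.create_s3_uri('my-bucket', 'weekly/combined.csv', 'us-east-1')
);
SQL

A scheduled Lambda (or one triggered by the S3 upload) could issue this same statement, which keeps the heavy lifting inside RDS rather than in the function itself.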
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.
Source: Azure Storage Gen2 (a file with 10 JSON lines)
Sink: Azure Cosmos DB with the Mongo API
I used an Azure Data Factory pipeline (Copy activity) to move the file data to a Mongo collection. The copy is successful, but when I run find({}) on my collection, it returns 0 records. When I run stats(), it shows the count as 10, which is expected. I cannot figure out what the issue is when reading these records from Robo3T to query MongoDB.
I created a second pipeline to read the data from Mongo and write it to Azure Storage, to test whether the data really is present in Mongo. I was able to write all 10 records to storage. This proves the data is present in Mongo, but I cannot read/access it.
You won't be able to directly read the data stored in the collection, or in any of the databases, that way. You must use the Mongo Shell via the Azure Portal: go to your Azure Cosmos DB resource -> Data Explorer -> Mongo Shell. If there are any specific errors, here is the troubleshooting document.
I have MongoDB deployed on OpenShift and I want to restore data from an S3 bucket. Is there a way to do this directly, or do I need to download the data from S3 first and then run the mongorestore command?
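mongorestore has no built-in S3 support, but if the backup was taken as a mongodump archive it can be streamed straight from S3 without a separate download step; a minimal sketch with placeholder names, run from a pod or host that can reach the MongoDB service:

# Stream the archive from S3 directly into mongorestore -- nothing is
# written to local disk. Requires a backup made with `mongodump --archive`.
aws s3 cp s3://my-backup-bucket/backup.archive.gz - \
  | mongorestore --archive --gzip \
      --host my-mongodb-service:27017 \
      --username myuser --password mypass

If the backup is a plain dump directory rather than an archive, it has to be copied down (or synced) first and restored with mongorestore <dir>.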
The issue is that migrating data from MongoDB to DynamoDB/S3/Redshift is, as I understand it, currently not workable for us via the AWS DMS service, as it does not support all data types. Or maybe I'm wrong.
The problem is that our Mongo objects contain non-scalar fields (arrays, maps).
So when I create a migration task via AWS DMS in table mode, it pulls the data badly. For some reason only selection works; transformation rules are ignored by DMS (I tried renaming and removing fields).
In document mode everything is OK, but how can I run the migration with some custom script for the transformation? Storing the data this way still needs transformation.
We need some modifications such as renaming fields, removing fields, and flattening some fields (for example, we have a map object that should be flattened into several scalar fields).
The migration should go into one of these targets: S3, DynamoDB, Redshift.
I will be thankful for any help and suggestions.
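If DMS transformation rules can't express what you need, one workaround outside DMS is to export from MongoDB, reshape the documents with a small script, and stage the result in S3 for Redshift/DynamoDB to load. A rough sketch using jq, with made-up field names:

# Export, reshape each document (rename, flatten the map, drop everything
# not listed in the new object), and stream the result to S3.
mongoexport --host localhost:27017 --db mydb --collection mycoll \
  | jq -c '{
      new_name: .oldName,       # rename
      flat_a:   .mapField.a,    # flatten a nested map into scalar fields
      flat_b:   .mapField.b
    }' \
  | gzip \
  | aws s3 cp - s3://my-staging-bucket/mycoll.json.gz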
Use the following command to take a backup of the MongoDB database:
mongodump -h localhost:27017 -d my_db_name -o $DEST
Use the command below to sync your backup to an S3 bucket:
aws s3 sync ~/db_backups s3://my-bucket-name
Once your data is in S3, you can load it into Redshift very easily using the COPY command.
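For example (cluster endpoint, database, table, bucket path, and IAM role are placeholders; note that mongodump produces BSON, so the data needs to be converted to a format Redshift understands, such as JSON or CSV, before COPY will work):

# Load the staged file from S3 into an existing Redshift table.
psql -h my-redshift-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com \
     -p 5439 -U admin -d mydb -c "
  COPY my_table
  FROM 's3://my-bucket-name/db_backups/mycoll.json'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
  FORMAT AS JSON 'auto';
"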
We have a Redshift cluster that needs one table from one of our RDS / Postgres databases. I'm not quite sure of the best way to export that data and bring it in, or what the exact steps should be.
Piecing together various blogs and articles, the consensus appears to be to use pg_dump to copy the table to a CSV file, then copy it to an S3 bucket, and from there use the Redshift COPY command to bring it into a new table. That's my high-level understanding, but I am not sure what the command-line switches should be, or the actual details. Is anyone doing this currently, and if so, is what I have above the 'recommended' way to do a one-off import into Redshift?
It appears that you want to:
Export from Amazon RDS PostgreSQL
Import into Amazon Redshift
From Exporting data from an RDS for PostgreSQL DB instance to Amazon S3 - Amazon Relational Database Service:
You can query data from an RDS for PostgreSQL DB instance and export it directly into files stored in an Amazon S3 bucket. To do this, you use the aws_s3 PostgreSQL extension that Amazon RDS provides.
This will save a CSV file into Amazon S3.
You can then use the Amazon Redshift COPY command to load this CSV file into an existing Redshift table.
You will need some way to orchestrate these operations, which would involve running a command against the RDS database, waiting for it to finish, then running a command in the Redshift database. This could be done via a Python script that connects to each database in turn (e.g., via psycopg2) and runs the command.
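A minimal sketch of that sequence, written here as two psql calls rather than the psycopg2 script described above; endpoints, credentials, table, bucket, region, and IAM role are all placeholders:

# 1) Export the table from RDS PostgreSQL to a CSV object in S3. Requires the
#    aws_s3 extension and an IAM role on the RDS instance allowing s3:PutObject.
psql -h my-rds-instance.xxxxxxxx.us-east-1.rds.amazonaws.com -U postgres -d sourcedb -c "
  SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM my_table',
    aws_commons.create_s3_uri('my-bucket', 'exports/my_table.csv', 'us-east-1'),
    options := 'format csv'
  );
"

# 2) Load the CSV into an existing Redshift table. The Redshift cluster needs
#    an IAM role that can read the bucket.
psql -h my-redshift-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com -p 5439 -U admin -d targetdb -c "
  COPY my_table
  FROM 's3://my-bucket/exports/my_table.csv'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
  FORMAT AS CSV;
"

A Python script (or any scheduler) would issue the same two statements, waiting for the export call to return before starting the COPY.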