Best practice for importing bulk data to AWS RDS PostgreSQL database - postgresql

I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
1. Download all the JSON files locally
2. Run a Ruby script to parse the JSON files and generate a CSV file matching the table in the database
3. Connect to RDS using psql
4. Use the \copy command to append the data to the table
I would like to switch this to an automated approach (maybe using an AWS Lambda). What would be the best practices?
Approach 1:
Run a script (Ruby / JS) that parses all folders for the past period (e.g., a week) and, while parsing each file, connects to the RDS database and executes an INSERT command. I feel this would be a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is - how do I then use this temporary file to do a bulk import?
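For context, my understanding is that RDS for PostgreSQL provides an aws_s3 extension that can import a CSV straight from S3, so something like the sketch below is what I have in mind for the bulk-import step (all table, bucket, and connection names are placeholders, and it assumes the extension is installed and the instance has an IAM role with read access to the bucket):

# Rough sketch: bulk-load a CSV that already sits in S3 into an RDS PostgreSQL
# table using the aws_s3 extension. All names below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-rds-endpoint.rds.amazonaws.com",  # placeholder endpoint
    dbname="mydb", user="myuser", password="mypassword",
)

IMPORT_SQL = """
SELECT aws_s3.table_import_from_s3(
    'my_table',                    -- target table
    '',                            -- column list ('' = all columns, in CSV order)
    '(FORMAT csv, HEADER true)',   -- same options the COPY command accepts
    aws_commons.create_s3_uri('my-bucket', 'exports/latest.csv', 'us-east-1')
);
"""

with conn, conn.cursor() as cur:
    # One-time setup: the extension must exist, and the instance needs an
    # IAM role that allows s3:GetObject on the bucket.
    cur.execute("CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;")
    cur.execute(IMPORT_SQL)
    print(cur.fetchone()[0])  # the function returns a summary with the row count

conn.close()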
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.

Related

Persistent Data in Heroku Postgres - Ephemeral Filesystem

This might be a simple question, but I would like some clarification.
Based on the docs, Heroku has an ephemeral filesystem. The way I interpret it is that any time you upload a file to Heroku and the configuration changes or the app is restarted, the files are gone.
However, I was wondering whether this is also the case when you upload data to Heroku Postgres through a dump file.
For development, I am using a local Postgres server. From there, I would create a dump file and then upload it using the commands found here:
https://stackoverflow.com/a/71206831/3100570
Now suppose my application writes data to Heroku Postgres via a POST request; would that data be persisted along with the initial data from the dump file in the event that the application is restarted or crashes?
Ingesting data into your PostgreSQL database this way doesn't touch your dyno's filesystem. You are simply connecting to PostgreSQL and running the SQL commands contained in that file:
-f, --file=file
SQL file to run
The data will be stored in PostgreSQL in exactly the same way it would if you did a bunch of INSERTs yourself. You should have no problem ingesting data this way and then continuing to interact with your application as normal.
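If you want to script that step rather than type it by hand, a minimal sketch might look like the following (it assumes psql is on the PATH and DATABASE_URL holds the Heroku Postgres connection string; the dump filename is a placeholder):

# Rough sketch: run a plain-SQL dump file against Heroku Postgres with psql -f.
# The data lands in Postgres, not on the dyno's ephemeral filesystem.
import os
import subprocess

database_url = os.environ["DATABASE_URL"]  # e.g. from `heroku config:get DATABASE_URL`

subprocess.run(
    ["psql", database_url, "-f", "dump.sql"],  # dump.sql is a placeholder filename
    check=True,
)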

uploading large file to AWS aurora postgres serverless

I have been trying for days to copy a large CSV file into a table in PostgreSQL. I am using pgAdmin 4 to access the database. The file on my system is 10 GB, so I am getting errors when trying to upload it via the UI or the \copy command.
With a 10 GB CSV file, there are a few different options you could consider:
I believe \copy should work; you did not provide any more information about the issue.
I'd personally use AWS Glue, an ETL service that can read directly from a file in S3.
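If you do want to stay with the \copy route but drive it from code instead of pgAdmin, a rough sketch using psycopg2 might look like this (connection details and names are placeholders; the file is streamed from the client, so the 10 GB never has to fit in memory):

# Rough sketch: the programmatic equivalent of psql's \copy for a large local CSV.
import psycopg2

conn = psycopg2.connect(
    host="my-aurora-endpoint.rds.amazonaws.com",  # placeholder endpoint
    dbname="mydb", user="myuser", password="mypassword",
)

with conn, conn.cursor() as cur, open("big_file.csv", "r") as f:
    # COPY ... FROM STDIN reads the stream sent by the client, just like \copy.
    cur.copy_expert(
        "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )

conn.close()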

loading one table from RDS / postgres into Redshift

We have a Redshift cluster that needs one table from one of our RDS / Postgres databases. I'm not quite sure of the best way to export that data and bring it in, or what the exact steps should be.
Piecing together various blogs and articles, the consensus appears to be: use pg_dump to copy the table to a CSV file, copy that to an S3 bucket, and from there use the Redshift COPY command to bring it into a new table. That's my high-level understanding, but I'm not sure what the command-line switches should be, or the actual details. Is anyone doing this currently, and if so, is what I have above the 'recommended' way to do a one-off import into Redshift?
It appears that you want to:
Export from Amazon RDS PostgreSQL
Import into Amazon Redshift
From Exporting data from an RDS for PostgreSQL DB instance to Amazon S3 - Amazon Relational Database Service:
You can query data from an RDS for PostgreSQL DB instance and export it directly into files stored in an Amazon S3 bucket. To do this, you use the aws_s3 PostgreSQL extension that Amazon RDS provides.
This will save a CSV file into Amazon S3.
You can then use the Amazon Redshift COPY command to load this CSV file into an existing Redshift table.
You will need some way to orchestrate these operations, which would involve running a command against the RDS database, waiting for it to finish, then running a command in the Redshift database. This could be done via a Python script that connects to each database (eg via psycopg2) in turn and runs the command.
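A rough sketch of that orchestration, assuming psycopg2 and placeholder names, endpoints, and role ARNs throughout:

# Rough sketch: export a table from RDS PostgreSQL to S3 with the aws_s3
# extension, then COPY it into Redshift. All identifiers are placeholders.
import psycopg2

# Step 1: export from RDS using aws_s3.query_export_to_s3.
rds = psycopg2.connect(host="my-rds-endpoint.rds.amazonaws.com",
                       dbname="sourcedb", user="rds_user", password="xxxx")
with rds, rds.cursor() as cur:
    cur.execute("""
        SELECT * FROM aws_s3.query_export_to_s3(
            'SELECT * FROM public.my_table',
            aws_commons.create_s3_uri('my-bucket', 'exports/my_table.csv', 'us-east-1'),
            options := 'format csv'
        );
    """)
rds.close()

# Step 2: load into Redshift with COPY (Redshift speaks the Postgres wire
# protocol, so psycopg2 works here as well).
redshift = psycopg2.connect(host="my-cluster.redshift.amazonaws.com", port=5439,
                            dbname="dev", user="rs_user", password="xxxx")
with redshift, redshift.cursor() as cur:
    cur.execute("""
        COPY public.my_table
        FROM 's3://my-bucket/exports/my_table.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        CSV;
    """)
redshift.close()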

How do I dump data from an Oracle database without access to the database's file system

I am trying to dump the schema and data from an existing Oracle DB and import it into another Oracle DB.
I have tried using the "Export Wizard" provided by sqldeveloper.
I found answers using Oracle Data Pump; however, I do not have access to the filesystem of the DB server.
I expect to get a file that I can copy and import into another DB.
Without Data Pump, you have to make some concessions.
The biggest concession is you're going to ask a Client application, running somewhere on your network, to deal with a potentially HUGE amount of data/IO.
Within reasonable limits, you can use the Tools > Database Export wizard to build a series of SQLPlus-style scripts, both DDL (CREATEs) and DATA (INSERTs).
Once you have those scripts, you can use SQLPlus, SQLcl, or SQL Developer to run them on your new/target database.

Easy way to get all tables out to S3 on a nightly basis?

I need to be able to dump the contents of each table in my Redshift data warehouse to S3 each night.
The outcome I want is the same as if I were manually issuing an UNLOAD command for each table.
For something this simple, I assumed I could use something like Data Pipeline or Glue, but these don't seem to make it easy.
Am I looking at this problem wrong? This seems like it should be simple.
I had this process, but in reverse, recently. My solution: a Python script that queried pg_schema (to grab eligible table names) and then looped through the results, using the table name as a parameter in an INSERT query. I ran the script as a cron job on an EC2 instance.
In theory, you could set up the script via Lambda or as a ShellCommand in Pipeline. But I could never get that to work, whereas a cron job was super simple.
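A rough sketch of that kind of loop, adapted to the UNLOAD direction the question asks about (cluster endpoint, schema, bucket, and IAM role are all placeholders):

# Rough sketch: list tables in a schema, then issue one UNLOAD per table.
import psycopg2

conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com", port=5439,
                        dbname="dev", user="rs_user", password="xxxx")

with conn, conn.cursor() as cur:
    # Grab the candidate table names (the schema name is a placeholder).
    cur.execute("""
        SELECT table_name
        FROM information_schema.tables
        WHERE table_schema = 'public' AND table_type = 'BASE TABLE';
    """)
    tables = [row[0] for row in cur.fetchall()]

    # UNLOAD cannot be parameterised like a normal query, so the statement
    # is built as a string for each table.
    for table in tables:
        cur.execute(f"""
            UNLOAD ('SELECT * FROM public.{table}')
            TO 's3://my-backup-bucket/nightly/{table}_'
            IAM_ROLE 'arn:aws:iam::123456789012:role/MyUnloadRole'
            GZIP ALLOWOVERWRITE;
        """)

conn.close()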
Do you have a specific use case for explicitly UNLOADing data to S3? Like being able to use that data with Spark/Hive?
If not, you should be scheduling snapshots of your Redshift cluster to S3 every day. This happens by default anyway.
Snapshots are stored in S3 as well.
Snapshots are incremental and fast. You can restore entire clusters using snapshots.
You can also restore individual tables from snapshots.
Here is the documentation about it: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html
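As an example of that last point, restoring a single table from a snapshot can be scripted with boto3; a minimal sketch with placeholder identifiers:

# Rough sketch: restore one table from an existing cluster snapshot with boto3.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.restore_table_from_cluster_snapshot(
    ClusterIdentifier="my-cluster",                          # placeholder cluster
    SnapshotIdentifier="rs:my-cluster-2024-01-01-00-00-00",  # hypothetical snapshot id
    SourceDatabaseName="dev",
    SourceSchemaName="public",
    SourceTableName="orders",
    NewTableName="orders_restored",
)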
This is as simple as creating a script (shell/Python/...) and putting it in crontab. Something along the lines of (snippet from a shell script):
psql -U$username -p $port -h $hostname $database -f path/to/your/unload_file.psql
and your unload_file.psql would contain the standard Redshift unload statement:
unload ('select * from schema.tablename') to 's3://scratchpad_bucket/filename.extension'
credentials 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXX'
[options];
Put your shell script in a crontab and execute it daily at the time when you want to take the backup.
However, remember:
While taking backups is indispensable, daily full backups will generate a mammoth bill for S3. You should rotate the backups / log files, i.e. regularly delete them or take a backup from S3 and store it locally.
A full daily backup might not be the best thing to do. Check whether you can do it incrementally.
It would be better to tar and gzip the files and then send them to S3, rather than storing a plain Excel or CSV file.
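A rough sketch of that last point, gzipping a CSV before uploading it with boto3 (the bucket and file names, including scratchpad_bucket from the snippet above, are placeholders):

# Rough sketch: compress a CSV and upload the compressed copy to S3.
import gzip
import shutil
import boto3

# Compress the local CSV.
with open("my_table.csv", "rb") as src, gzip.open("my_table.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Upload the compressed file instead of the raw CSV.
s3 = boto3.client("s3")
s3.upload_file("my_table.csv.gz", "scratchpad_bucket", "backups/my_table.csv.gz")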