Import CSV parts from S3 into RDS Aurora PostgreSQL

I spent some time fiddling with the tiny details of the AWS S3 extension for Postgres described here https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/postgresql-s3-export.html#postgresql-s3-export-access-bucket (postgres extension configuration, roles, policies, tiny function input details).
I want to easily export, then import, huge tables for testing purposes (indexes, generated columns, partitions, etc.) to optimize database performance.
I am using this extension because I want to avoid using my laptop to store the file with commands like the following, which involve a lot of network I/O and are affected by slow internet connections, broken pipes when the connection gets nuked by the operating system after a while, and other problems related to huge tables:
# store CSV from S3 to local
aws s3 cp s3://my_bucket/my_sub_path/my_file.csv /my_local_directory/my_file.csv
# import from local CSV to AWS RDS Aurora PostgreSQL
psql -h my_rds.amazonaws.com -U my_username -d my_dbname -c '\COPY table FROM ''my_file.csv'' CSV HEADER'
I managed to export a very big table (160GB) into CSV files to S3 with:
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM my_schema.my_large_table',
    aws_commons.create_s3_uri(
        'my_bucket/my_subpath',
        'my_file.csv',
        'eu-central-1'
    ),
    options := 'format csv'
);
However, this ends up as lots of "part files" in S3:
the first one with that same CSV filename, my_file.csv
all the others named my_file.csv_part2 ... my_file.csv_part20 and so on
Now, I don't think this is a problem as long as I am able to import the CSV data back somewhere else in AWS RDS Aurora (PostgreSQL). However, I am not sure what strategy to apply here: is it better to have all these CSV files, or can I perhaps configure the export to produce only one huge CSV file (160GB)?
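Side note: according to the AWS docs, query_export_to_s3 reports how much it wrote, so you can at least capture how many part files were produced. A minimal sketch reusing the same export as above (the rows_uploaded / files_uploaded / bytes_uploaded column names come from the AWS documentation):
-- capture the export summary to know how many part files were written
SELECT rows_uploaded, files_uploaded, bytes_uploaded
FROM aws_s3.query_export_to_s3(
    'SELECT * FROM my_schema.my_large_table',
    aws_commons.create_s3_uri('my_bucket/my_subpath', 'my_file.csv', 'eu-central-1'),
    options := 'format csv'
);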
Now for the import part (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.S3Import.html):
It turns out I have to import all these "part files" with PL/pgSQL, but then I get lost in the details of how to format those strings for the S3 paths, and in general I see all sorts of errors (both on export and import). Importing one file takes around 20 minutes, so it's quite frustrating to figure out what is going wrong.
What's wrong with the source code / error below?
Is there a better way to handle all this export/import at scale (160GB tables)?
DO $$
DECLARE
    my_csv_s3_sub_path text;
BEGIN
    FOR cnt IN 2..26 LOOP
        my_csv_s3_sub_path := 'my_subpath/my_file.csv_part' || cnt;
        RAISE NOTICE '% START loading CSV file % from S3', now(), cnt;
        SELECT aws_s3.table_import_from_s3(
            'my_schema.my_large_table_new',
            '',
            '(format csv)',
            aws_commons.create_s3_uri(
                'my_bucket',
                my_csv_s3_sub_path,
                'eu-central-1'
            )
        );
        RAISE NOTICE '% STOP loading CSV file % from S3', now(), cnt;
    END LOOP;
END; $$
The code above gives:
SQL Error [42601]: ERROR: query has no destination for result data
Hint: If you want to discard the results of a SELECT, use PERFORM instead.
Where: PL/pgSQL function inline_code_block line 8 at SQL statement
I think it's related to variables and string interpolation because I need to dynamically generate the CSV file name in S3 to be used in the Postgres AWS extension.
But I had all sorts of other errors before, e.g. an export/import inconsistency in the syntax around the S3 bucket sub-path that was leading the Postgres AWS S3 extension to throw an HTTP 400 error:
SQL Error [XX000]: ERROR: HTTP 400. Check your arguments and try again. Where: SQL function "table_import_from_s3" statement 1
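For what it's worth, the import call that eventually worked (see the answer below) splits the S3 URI with the bucket name alone as the first argument and the full object key as the second; a standalone sketch of just that helper call, for one of the part files:
SELECT aws_commons.create_s3_uri(
    'my_bucket',                      -- bucket name only
    'my_subpath/my_file.csv_part2',   -- full object key inside the bucket
    'eu-central-1'                    -- region
);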
Is there a better alternative to export/import huge tables from/to AWS RDS Aurora PostgreSQL?

The solution was to:
use PERFORM instead of SELECT when calling aws_s3.table_import_from_s3 inside a PL/pgSQL block,
loop over all the S3 paths to the CSV part files, e.g. my_subpath/my_file.csv_part1 to my_subpath/my_file.csv_part26 (bear in mind there is also the "part 0", i.e. my_subpath/my_file.csv without any part suffix),
create the table index AFTER the data import above.
-- this goes inside the loop, once for each CSV part
PERFORM aws_s3.table_import_from_s3(
    'my_schema.my_large_table_new',
    '',                  -- column list: empty means all columns
    '(format csv)',      -- COPY options
    aws_commons.create_s3_uri(
        'my_bucket',                       -- bucket name only
        'my_subpath/my_file.csv_part26',   -- full object key inside the bucket
        'eu-central-1'                     -- region
    )
);
-- then AFTER the CSV ingestion create the index on the table
CREATE INDEX my_dx ON my_schema.my_large_table_new USING btree (my_column);
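Putting the pieces together, the full import looked roughly like this. This is a sketch based on the fragments above: the part suffixes 2..26 and the unsuffixed first file match my export, so adjust the loop range to whatever part files your export actually produced.
DO $$
DECLARE
    my_csv_s3_sub_path text;
BEGIN
    -- the first exported file has no "_part" suffix
    RAISE NOTICE '% START loading first CSV file from S3', now();
    PERFORM aws_s3.table_import_from_s3(
        'my_schema.my_large_table_new',
        '',
        '(format csv)',
        aws_commons.create_s3_uri('my_bucket', 'my_subpath/my_file.csv', 'eu-central-1')
    );
    -- then all the "_partN" files (adjust the range to your export)
    FOR cnt IN 2..26 LOOP
        my_csv_s3_sub_path := 'my_subpath/my_file.csv_part' || cnt;
        RAISE NOTICE '% START loading CSV part % from S3', now(), cnt;
        PERFORM aws_s3.table_import_from_s3(
            'my_schema.my_large_table_new',
            '',
            '(format csv)',
            aws_commons.create_s3_uri('my_bucket', my_csv_s3_sub_path, 'eu-central-1')
        );
        RAISE NOTICE '% STOP loading CSV part % from S3', now(), cnt;
    END LOOP;
END; $$;
-- only after all the CSV parts are in, build the index
CREATE INDEX my_dx ON my_schema.my_large_table_new USING btree (my_column);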
This still took a full day to process all the CSV files of roughly 6GB each. Not very practical for most scenarios.
For the sake of SQL completeness, make sure the Postgres extensions are installed and configured like this:
-- drop and recreate to start from a clean state (IF EXISTS avoids errors on a fresh instance)
DROP EXTENSION IF EXISTS aws_s3;
DROP EXTENSION IF EXISTS aws_commons;
-- CASCADE also installs the aws_commons dependency
CREATE EXTENSION aws_s3 CASCADE;
You will still have to configure policies, roles and all of that on AWS.
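It can also be worth sanity-checking that both extensions are actually installed before kicking off an import that takes 20 minutes per file; a quick look at the standard pg_extension catalog is enough:
-- both rows should show up if the extensions are installed
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('aws_s3', 'aws_commons');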

Related

Best practice for importing bulk data to AWS RDS PostgreSQL database

I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
Download all the JSON files locally
Run a ruby script to parse the JSON files to generate a CSV file matching the table in the database
Connect to RDS using psql
Use \copy command to append the data to the table
I would like to switch this to an automated approach (maybe using an AWS Lambda). What would be the best practice?
Approach 1:
Run a script (Ruby / JS) that parses all folders in the past period (e.g., week) and, while parsing each file, connects to the RDS DB and executes INSERT commands. I feel this would be a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is: how do I then use this temporary file to do a bulk import?
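For reference, once the temporary CSV is in S3, the bulk-import step can be done server-side with the same aws_s3.table_import_from_s3 function shown earlier on this page, roughly like this (a sketch: the table, bucket, key and region names are placeholders, and the RDS instance needs the S3 import role/policy configured):
SELECT aws_s3.table_import_from_s3(
    'public.my_table',                    -- target table (placeholder)
    '',                                   -- column list: empty means all columns
    '(FORMAT CSV, HEADER true)',          -- COPY options
    aws_commons.create_s3_uri(
        'my_bucket',                      -- bucket (placeholder)
        'exports/temporary.csv',          -- object key (placeholder)
        'us-east-1'                       -- region (placeholder)
    )
);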
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.

Postgres Error | COPY Command Disk Quota Exceeded

I have 2 tables, one with 7.7 million records and the other with 160 million records.
I want a dump of the tables on my NAS drive (700GB+ available space). I'm using the following command to export the data to a CSV file:
\COPY (select * from table_name) to '/path_to_nas_drive/test.csv' csv header;
After running the above command, it throws this error:
Could not write COPY data : disk quota exceeded
Why is it throwing this error? Is it because of a space issue on the database server, meaning it's not able to create a buffer/temporary file, or is there a way to handle this?
Let's try scaling back the request to a single row, just to make sure you can use COPY at all:
\COPY (select * from table_name limit 1) to '/path_to_nas_drive/test.csv' csv header;
You can also try compressing the output with gzip like so:
psql -c "\COPY (select * from table_name) to stdout" yourdbname | gzip > /path_to_nas_drive/test.csv.gz
It is a client issue, not a server issue. "could not write COPY data" is generated by psql, and "disk quota exceeded" comes from the OS and is then passed on by psql.
So your NAS is telling you that you are not allowed to write as much to it as you would like to. Maybe it has plenty of space, but it isn't going to give it to you, where "you" is whoever is running psql, not whoever is running PostgreSQL server.

Creating Postgres table on AWS RDS using CSV file

I'm having an issue with creating a table in my Postgres DB on AWS RDS by importing the raw CSV data. Here are the steps I have already done:
CSV file has been uploaded on my S3 bucket
Followed AWS's tutorial to give RDS permission to import data from S3
Created an empty table on postgres
Tried using pgAdmin's 'import' feature to import the local csv file into the table, but it kept giving me the error.
So I'm using this query below to import the data into the table:
SELECT aws_s3.table_import_from_s3(
    'public.bayarea_property_data',
    '',
    '(FORMAT CSV, HEADER true)',
    'cottage-prop-data',
    'clean_ta_file_edit.csv',
    'us-west-1'
);
However, I keep getting this message:
ERROR: extra data after last expected column
CONTEXT: COPY bayarea_property_data, line 2: ",2009.0,2009.0,0.0,,0,2019,13061.0,,0,0.0,0.0,,2019,0.0,6767.0,576040,172810,403230,70,1,,1.0,,6081,..."
SQL statement "copy public.bayarea_property_data from '/rdsdbdata/extensions/aws_s3/amazon-s3-fifo-6261-20200819T083314Z-0' with (FORMAT CSV, HEADER true)"
SQL state: 22P04
Can anyone help me with this? I'm an AWS noob, so I have been struggling over the past few days. Thanks!
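The error says COPY found more fields in line 2 of the CSV than the target table has columns, so the empty table created in step 3 needs exactly one column per field in the CSV. A quick way to check the table side is a count against the standard information_schema (a small sketch using the table name from the question):
-- how many columns does the target table actually have?
SELECT count(*) AS column_count
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'bayarea_property_data';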

Create query that copies from a CSV file on my computer to the DB located on another computer in Postgres

I am trying to create a query that will copy data from a CSV file that is located on my computer to a Postgres DB that is on a different computer.
Our Postgres DB is located on another computer, and I work on my own to import and query data. I have successfully copied data from the CSV file on MY computer TO the DB in PSQL Console using the following:
\COPY table_name FROM 'c:\path\to\file.csv' CSV DELIMITER E'\t' HEADER;
But when writing a query using the SQL Editor, I use the same code above without the '\' in the beginning. I get the following error:
ERROR: could not open file "c:\pgres\dmi_vehinventory.csv" for reading: No such file or directory
********** Error **********
ERROR: could not open file "c:\pgres\dmi_vehinventory.csv" for reading: No such file or directory
SQL state: 58P01
I assume the query is actually trying to find the file on the DB's computer rather than my own.
How do I write a query that tells Postgres to look for the file on MY computer rather than the DB's computer?
Any help will be much appreciated !
\COPY is the correct way if you want to upload a file from the local computer (the computer where you've started psql).
COPY is correct when you want to load a file on the remote host from a directory on that host.
Here is an example; I've connected with psql to the remote server:
test=# COPY test(i, i1, i3)
FROM './test.csv' WITH DELIMITER ',';
ERROR: could not open file "./test.csv" for reading: No such file
test=# \COPY test(i, i1, i3)
FROM './test.csv' WITH DELIMITER ',';
test=# select * from test;
i | i1 | i3
---+----+----
1 | 2 | 3
(1 row)
There are several common misconceptions when dealing with PostgreSQL's COPY command.
Even though psql's \COPY FROM '/path/to/file/on/client' command has identical syntax (other than the backslash) to the backend's COPY FROM '/path/to/file/on/server' command, they are totally different. When you include a backslash, psql actually rewrites it to a COPY FROM STDIN command instead, and then reads the file itself and transfers it over the connection.
Executing a COPY FROM 'file' command tells the backend to itself open the given path and load it into a given table. As such, the file must be mapped in the server's filesystem and the backend process must have the correct permissions to read it. However, the upside of this variant is that it is supported by any postgresql client that supports raw sql.
Successfully executing a COPY FROM STDIN places the connection into a special COPY_IN state during which an entirely different (and much simpler) sub-protocol is spoken between the client and server, which allows for data (which may or may not come from a file) to be transferred from the client to the server. As such, this command is not well supported outside of libpq, the official client library for C. If you aren't using libpq, you may or may not be able to use this command, but you'll have to do your own research.
COPY FROM STDIN/COPY TO STDOUT doesn't really have anything to do with standard input or standard output; rather the client needs to speak the sub-protocol on the database connection. In the COPY IN case, libpq provides two commands, one to send data to the backend, and another to either commit or roll back the operation. In the COPY OUT case, libpq provides one function that receives either a row of data or an end of data marker.
I don't know anything about SQL Editor, but it's likely that issuing a COPY FROM STDIN command will leave the connection in an unusable state from its point of view, especially if it's connecting via an ODBC driver. As far as I know, ODBC drivers for PostgreSQL do not support COPY IN.

SQLLDR / BCP equivalent for DB2

My product needs to support Oracle, SQLServer, and DB2 v9. We are trying to figure out the most efficient way to periodically load data into the database. This currently takes 40+ minutes with individual insert statements, but just a few minutes when we use SQLLDR or BCP. Is there an equivalent in DB2 that allows CSV data to be loaded into the database quickly?
Our software runs on windows, so we need to assume that the database is running on a remote system.
Use DB2's LOAD utility:
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/core/r0008305.htm
If the data is in CSV format, try importing the data with comma (,) as the delimiter:
db2 import from <filename> of del modified by coldel, insert into <table name>
Or else you can use the LOAD command to load from a file:
db2 load client from /u/user/data.del of del
modified by coldel, insert into mytable