Load data with default values into Redshift from a parquet file - amazon-redshift

I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column with the getdate function from Redshift:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?

Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for workarounds, I can think of two:
1. COPY the file into a temp table, then insert from the temp table into your target table, leaving the default-value column out of the insert so the default is applied.
2. Define an external table that uses the Parquet file as its source, then insert from that external table into the table with the default value.
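The external-table workaround might look roughly like the sketch below. It assumes Redshift Spectrum is available to the cluster. The schema, database, table, and column names, the S3 prefix, and the IAM role ARN are all hypothetical placeholders, not values from the question.

```sql
-- Hypothetical names throughout: spectrum_schema, spectrum_db, events_parquet,
-- target_table, id/name, the S3 prefix, and the IAM role ARN are placeholders.

-- External schema backed by the data catalog.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table whose columns describe the Parquet file.
-- LOCATION points at the S3 prefix that holds the file(s).
CREATE EXTERNAL TABLE spectrum_schema.events_parquet (
    id   BIGINT,
    name VARCHAR(256)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/path/to/parquet/';

-- LOAD_DT is left out of the column list, so DEFAULT GETDATE() fills it.
INSERT INTO target_table (id, name)
SELECT id, name
FROM spectrum_schema.events_parquet;
```

Because LOAD_DT never appears in the INSERT column list, each inserted row should pick up the column default.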

Related

Selecting data from a BYTEA data type in Postgres that contains CSV data and storing it in a table

I have a table ("file_upload") in a PostgreSQL (11.8) database, which we use for storing the original CSV file that was used for loading some data into our system (I guess the question of best practices is up for debate here, but for now let's just assume it is).
The files are stored in a column ("file") which is of the data type "bytea"
So one row of this table contains
id - file_name - upload_date - uploaded_by - file <-- this being the column in question.
This column then stores the data of a csv file:
item_id;item_type_id;item_date;item_value
11;1;2022-09-22;123.45
12;4;2022-09-20;235.62
13;1;2022-09-21;99.99
14;2;2022-09-19;654.32
What I need to be able to do is query this column, extract the data, and store it in a temporary table (note: the structure of these CSV files is always the same, so the table structure can be pre-defined and does not have to be dynamic or anything).
Any help would be greatly appreciated
Use
COPY (SELECT file FROM file_upload WHERE id =1)
TO '/tmp/blob' (FORMAT 'binary');
to re-export the data to a file. Then create the temporary table and use COPY to read them in again. Make sure to use the proper ENCODING.
You can wrap that in a loop that performs this operation for all rows in your table.
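As an alternative to re-exporting through a file, the bytea can also be decoded and split inside the database. This is a different technique from the COPY round trip above, offered only as a sketch. It assumes the stored files are UTF-8 text, that no field contains a semicolon or a line break, and that the temporary table name item_import is free to use (it is hypothetical).

```sql
-- Hypothetical temp table; column names follow the CSV header in the question.
CREATE TEMP TABLE item_import (
    item_id      integer,
    item_type_id integer,
    item_date    date,
    item_value   numeric
);

-- Decode the bytea as UTF-8 text, split it into lines, then split each line on ';'.
INSERT INTO item_import (item_id, item_type_id, item_date, item_value)
SELECT split_part(line, ';', 1)::integer,
       split_part(line, ';', 2)::integer,
       split_part(line, ';', 3)::date,
       split_part(line, ';', 4)::numeric
FROM (
    SELECT regexp_split_to_table(convert_from(file, 'UTF8'), E'\r?\n') AS line
    FROM file_upload
    WHERE id = 1
) AS lines
WHERE line <> ''                      -- skip a trailing empty line
  AND line NOT LIKE 'item_id;%';      -- skip the header row
```

Dropping the `WHERE id = 1` filter would process every stored file in one statement, provided they all share this layout.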

Column mapping option argument is not supported for PARQUET based COPY

I have to insert Parquet file data into a Redshift table. The Parquet file may have fewer columns than the Redshift table. I used the command below.
COPY table_name
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
FORMAT AS PARQUET
But I get the following error when I run the COPY command:
Column mapping option argument is not supported for PARQUET based COPY
I tried using a column list like this:
COPY table_name(column1, column2..)
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
But then I get a "Delimiter not found" error. If I specify FORMAT AS PARQUET in the above COPY command (the one with the column list), I get "Column mapping option argument is not supported for PARQUET based COPY".
Could you please let me know how to resolve this?
Thanks
The number of columns in the Parquet file MUST match the table's columns (reference). You can't use column mapping with Parquet files.
What you can do: create a staging table and copy the Parquet file content into it. Then run an insert into your final table using insert into final_table (select col1, col2 from stg_table)
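The staging-table approach could be sketched as follows. All identifiers (stg_table, final_table, col1, col2), the column types, the S3 path, and the IAM role ARN are placeholders. IAM_ROLE is used here instead of access keys purely for illustration.

```sql
-- Hypothetical names throughout: stg_table, final_table, col1/col2,
-- the S3 path, and the IAM role ARN are placeholders.

-- 1. Staging table whose columns match the Parquet file exactly.
CREATE TEMP TABLE stg_table (
    col1 BIGINT,
    col2 VARCHAR(256)
);

-- 2. Load the Parquet data without a column list.
COPY stg_table
FROM 's3://my-bucket/path/to/parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

-- 3. Columns of final_table that are not listed here get their DEFAULT (or NULL).
INSERT INTO final_table (col1, col2)
SELECT col1, col2
FROM stg_table;
```

The column list on the INSERT does the mapping that COPY refuses to do for Parquet.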

Basic questions about Cloud SQL

I'm trying to populate a cloud sql database using a cloud storage bucket, but I'm getting some errors. The csv has the headers (or column names) as first row and does not have all the columns (some columns in the database can be null, so I'm loading the data I need for now).
The database is in postgresql and this is the first database in GCP I'm trying to configure and I'm a little bit confused.
Does it matter if the csv file has the column names?
Does the order of the columns matter in the csv file? (I guess it does if they are not all present in the csv)
The PK of the table is a serial number, which I'm not including in the csv file. Do I also need to include the PK? I mean, because it's a serial number it should be "auto assigned", right?
Sorry for the noob questions and thanks in advance :)
This is all covered by the COPY documentation.
It matters in that you will have to specify the HEADER option so that the first line is skipped:
[...] on input, the first line is ignored.
The order matters, and if the CSV file does not contain all the columns in the same order as the table, you have to specify them with COPY:
COPY mytable (col12, col2, col4, ...) FROM '/dir/afile' WITH (...);
Same as above: if you omit a table column from the column list, it will be filled with the default value, which in this case is the auto-generated serial number.
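Put together, the three points might look like the sketch below. The table definition and the CSV column order are assumptions made for illustration. The file path is the placeholder from the answer and refers to a file readable by the database server.

```sql
-- Hypothetical table: id is a serial primary key that is NOT present in the CSV.
CREATE TABLE mytable (
    id   serial PRIMARY KEY,
    col1 text,
    col2 integer,
    col3 date            -- nullable, also absent from the CSV
);

-- The CSV is assumed to contain col2 first and col1 second, with a header row.
-- HEADER skips the first line; the column list gives the order used in the file.
-- id and col3 are omitted, so id comes from the sequence and col3 stays NULL.
COPY mytable (col2, col1)
FROM '/dir/afile'
WITH (FORMAT csv, HEADER true);
```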

Can I import CSV data into a table without knowing the columns of the CSV?

I have a CSV file file.csv.
In Postgres, I have made a table named grants:
CREATE TABLE grants
(
)
WITH (
OIDS=FALSE
);
ALTER TABLE grants
OWNER TO postgres;
I want to import file.csv data without having to specify columns in Postgres.
But if I run COPY grants FROM '/PATH/TO/grants.csv' CSV HEADER;, I get this error: ERROR: extra data after last expected column.
How do I import the CSV data without having to specify columns and types?
The error is normal.
You created a table with no columns. The COPY command tries to import the data into a table that already has the right structure.
So you have to create a table that corresponds to your csv file before executing the COPY command.
I discovered pgfutter:
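As a sketch of that, assuming the CSV header has three columns (the names below are invented, since the file's real columns are not shown), the table can be declared with every column as text and the types tightened later:

```sql
-- Hypothetical columns: replace them with the names from the CSV header line.
CREATE TABLE grants (
    grant_id   text,
    recipient  text,
    amount     text
);

-- With a matching column count, the original command no longer fails
-- with "extra data after last expected column".
COPY grants FROM '/PATH/TO/grants.csv' CSV HEADER;
```

Loading everything as text first avoids type errors during the import. Columns can be cast afterwards with ALTER TABLE ... ALTER COLUMN ... TYPE ... USING.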
"Import CSV and JSON into PostgreSQL the easy way. This small tool abstract all the hassles and swearing you normally have to deal with when you just want to dump some data into the database"
Perhaps a solution ...
The best method for me was to convert the csv to dataframe and then follow
https://github.com/sp-anna-jones/data_science/wiki/Importing-pandas-dataframe-to-postgres
No, it is not possible using the COPY command:
If a list of columns is specified, COPY will only copy the data in the
specified columns to or from the file. If there are any columns in the
table that are not in the column list, COPY FROM will insert the
default values for those columns.
COPY does not create columns for you.

How can I copy an IDENTITY field?

I’d like to update some parameters for a table, such as the dist and sort key. In order to do so, I’ve renamed the old version of the table and recreated the table with the new parameters (these cannot be changed once a table has been created).
I need to preserve the id field from the old table, which is an IDENTITY field. If I try the following query however, I get an error:
insert into edw.my_table_new select * from edw.my_table_old;
ERROR: cannot set an identity column to a value [SQL State=0A000]
How can I keep the same id from the old table?
You can't INSERT data that sets the IDENTITY columns, but you can load data from S3 using the COPY command.
First you will need to create a dump of the source table with UNLOAD.
Then simply use COPY with the EXPLICIT_IDS parameter, as described in Loading default column values:
If an IDENTITY column is included in the column list, the EXPLICIT_IDS
option must also be specified in the COPY command, or the COPY command
will fail. Similarly, if an IDENTITY column is omitted from the column
list, and the EXPLICIT_IDS option is specified, the COPY operation
will fail.
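A minimal sketch of that round trip, assuming both tables have the same columns in the same order. The S3 prefix and the IAM role ARN are hypothetical placeholders.

```sql
-- Hypothetical S3 prefix and IAM role ARN; table names are from the question.

-- 1. Dump the old table to S3 (pipe-delimited text by default).
--    ESCAPE protects delimiters and newlines embedded in the data.
UNLOAD ('SELECT * FROM edw.my_table_old')
TO 's3://my-bucket/unload/my_table_old_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
ESCAPE
ALLOWOVERWRITE;

-- 2. Load into the new table, keeping the id values from the files
--    instead of generating new IDENTITY values.
COPY edw.my_table_new
FROM 's3://my-bucket/unload/my_table_old_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
ESCAPE
EXPLICIT_IDS;
```

No column list is given on the COPY, so the IDENTITY column is loaded along with the others and EXPLICIT_IDS tells Redshift to accept the values in the files.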
You can explicitly specify the columns, and ignore the identity column:
insert into existing_table (col1, col2) select col1, col2 from another_table;
Use ALTER TABLE APPEND twice, first time with IGNOREEXTRA and the second time with FILLTARGET.
If the target table contains columns that don't exist in the source
table, include FILLTARGET. The command fills the extra columns in the
source table with either the default column value or IDENTITY value,
if one was defined, or NULL.
It moves the data from one table to another extremely quickly; it took me 4 s for a 1 GB table on a dc1.large node.
Appends rows to a target table by moving data from an existing source
table.
...
ALTER TABLE APPEND is usually much faster than a similar CREATE TABLE
AS or INSERT INTO operation because data is moved, not duplicated.
Faster and simpler than UNLOAD + COPY with EXPLICIT_IDS.