Column mapping option argument is not supported for PARQUET based COPY - amazon-redshift

I have to insert parquet file data into a Redshift table. The number of columns in the parquet file might be fewer than in the Redshift table. I have used the below command.
COPY table_name
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
FORMAT AS PARQUET
But I get the below error when I run the COPY command.
Column mapping option argument is not supported for PARQUET based COPY
I tried to use column mapping like this:
COPY table_name(column1, column2..)
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
But then I get a "Delimiter not found" error. If I specify FORMAT AS PARQUET in the above COPY command (which has the column list), I get "Column mapping option argument is not supported for PARQUET based COPY" again.
Could you please let me know how to resolve this?
Thanks

The number of columns in the parquet file MUST match the number of columns in the target table. You can't use column mapping (a column list) with parquet files.
What you can do: create a staging table whose columns match the parquet file and COPY the parquet content into it. Then run an insert into your final table using insert into final_table (select col1, col2 from stg_table), as sketched below.
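A minimal sketch of that staging-table pattern, assuming hypothetical table and column names (stg_table, final_table, col1, col2) and that the extra columns of final_table accept NULLs or have defaults:
-- staging table mirrors only the columns present in the parquet file (hypothetical names)
CREATE TABLE stg_table (
    col1 INT,
    col2 VARCHAR(100)
);

COPY stg_table
FROM 's3://bucket/path/'
ACCESS_KEY_ID '...'
SECRET_ACCESS_KEY '...'
FORMAT AS PARQUET;

-- a column list is allowed on INSERT, so the narrower staging data
-- is mapped onto the wider final table here instead of in the COPY
INSERT INTO final_table (col1, col2)
SELECT col1, col2 FROM stg_table;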

Related

Load data with default values into Redshift from a parquet file

I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column with the getdate function from Redshift:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for work-arounds, I can think of two.
Copy the file to a temp table and then insert from this temp table into your table with the default value (sketched below).
Define an external table that uses the parquet file as its source and insert from that table into the table with the default value.
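A minimal sketch of the first work-around, using hypothetical names (stg_table, my_table, col_a) and assuming LOAD_DT is the column with the GETDATE() default:
-- temp table holds only the columns that exist in the parquet file (hypothetical)
CREATE TEMP TABLE stg_table (
    col_a VARCHAR(50)
);

COPY stg_table
FROM 's3://bucket/path/'
IAM_ROLE 'arn:aws:iam::...:role/...'
FORMAT AS PARQUET;

-- LOAD_DT is left out of the column list, so its DEFAULT GETDATE() is applied
INSERT INTO my_table (col_a)
SELECT col_a FROM stg_table;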

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table to a parquet location). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do it from a Spark app, and how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it should precede ...AS SELECT...; see the Spark SQL syntax reference. A sketch follows below.
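A sketch of the statement with that ordering, assuming a parquet-backed managed table is acceptable (EXTERNAL would additionally need a LOCATION clause) and that DATE_KEY already exists in spark_tmp_view; note that in a CREATE TABLE ... AS SELECT the partition column is referenced by name only, without a type:
CREATE TABLE IF NOT EXISTS mydb.result
USING PARQUET
PARTITIONED BY (DATE_KEY)
AS SELECT * FROM spark_tmp_view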

Copy from S3 AVRO file to Table in Redshift Results in All Null Values

I am trying to copy an AVRO file that is stored in S3 to a table I created in Redshift and I am getting all null values. However, the AVRO file does not have null values in it. I see the following error when I look at the log: "Missing newline: Unexpected character 0x79 found at location 9415"
I did some research online and the only post I could find said that values would be null if the column name case in the target table did not match the source file. I have ensured the case for the column in the target table is the same as the source file.
Here is mock snippet from the AVRO file:
Objavro.schemaĒ{"type":"record","name":"something","fields":[{"name":"g","type":["string","null"]},{"name":"stuff","type":["string","null"]},{"name":"stuff","type":["string","null"]}
Here is the sql code I am using in Redshift:
create table schema.table_name (g varchar(max));
copy schema.table_name
from 's3://bucket/folder/file.avro'
iam_role 'arn:aws:iam::xxxxxxxxx:role/xx-redshift-readonly'
format as avro 'auto';
I am expecting to see a table with one column called g where each row has the value stuff.

Create a table from pyspark code on top of parquet file

I am writing data to parquet format using peopleDF.write.parquet("people.parquet") in PySpark code. Now, from the same code, I want to create a table on top of this parquet file which I can then query later. How can I do that?
You can use the saveAsTable method:
peopleDF.write.saveAsTable('people_table')
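Once saved this way, the table is registered in the metastore and can be queried later, for example via spark.sql with a plain query such as:
SELECT * FROM people_table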
You have to create an external table in Hive like this:
CREATE EXTERNAL TABLE my_table (
col1 INT,
col2 INT
) STORED AS PARQUET
LOCATION '/path/to/';
where /path/to/ is the absolute path to the files in HDFS.
If you want to use partitioning, you can add PARTITIONED BY (col3 INT). In that case, to see the data you have to run a table repair (MSCK REPAIR TABLE), as sketched below.
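A minimal sketch of the partitioned variant, assuming the parquet files live in col3=<value> subdirectories under the table location:
CREATE EXTERNAL TABLE my_table (
    col1 INT,
    col2 INT
)
PARTITIONED BY (col3 INT)
STORED AS PARQUET
LOCATION '/path/to/';

-- register the partition directories that already exist on disk
MSCK REPAIR TABLE my_table;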

Can I import CSV data into a table without knowing the columns of the CSV?

I have a CSV file file.csv.
In Postgres, I have made a table named grants:
CREATE TABLE grants
(
)
WITH (
OIDS=FALSE
);
ALTER TABLE grants
OWNER TO postgres;
I want to import file.csv data without having to specify columns in Postgres.
But if I run COPY grants FROM '/PATH/TO/grants.csv' CSV HEADER;, I get this error: ERROR: extra data after last expected column.
How do I import the CSV data without having to specify columns and types?
The error is normal.
You created a table with no columns, while the COPY command tries to import data into a table that already has the matching structure.
So you have to create the table corresponding to your CSV file before executing the COPY command.
I discovered pgfutter:
"Import CSV and JSON into PostgreSQL the easy way. This small tool abstract all the hassles and swearing you normally have to deal with when you just want to dump some data into the database"
Perhaps a solution ...
The best method for me was to convert the CSV to a dataframe and then follow
https://github.com/sp-anna-jones/data_science/wiki/Importing-pandas-dataframe-to-postgres
No, it is not possible using the COPY command. As the PostgreSQL documentation on COPY puts it:
If a list of columns is specified, COPY will only copy the data in the
specified columns to or from the file. If there are any columns in the
table that are not in the column list, COPY FROM will insert the
default values for those columns.
COPY does not create columns for you.
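In practice that means defining the columns yourself before loading; a minimal sketch, with column names that are purely hypothetical stand-ins for whatever the CSV header actually contains:
-- the table must already declare every column that appears in the CSV
CREATE TABLE grants (
    grant_id   INTEGER,
    recipient  TEXT,
    amount     NUMERIC
);

COPY grants FROM '/PATH/TO/grants.csv' CSV HEADER;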