Copy from S3 AVRO file to Table in Redshift Results in All Null Values

I am trying to copy an AVRO file stored in S3 to a table I created in Redshift, but I am getting all null values. However, the AVRO file does not contain null values. When I look at the log, I see the following error: "Missing newline: Unexpected character 0x79 found at location 9415"
I did some research online and the only post I could find said that values would be null if the column name case in the target table did not match the source file. I have ensured the case for the column in the target table is the same as the source file.
Here is a mock snippet from the AVRO file:
Objavro.schemaĒ{"type":"record","name":"something","fields":[{"name":"g","type":["string","null"]},{"name":"stuff","type":["string","null"]},{"name":"stuff","type":["string","null"]}
Here is the SQL code I am using in Redshift:
create table schema.table_name (g varchar(max));
copy schema.table_name
from 's3://bucket/folder/file.avro'
iam_role 'arn:aws:iam::xxxxxxxxx:role/xx-redshift-readonly'
format as avro 'auto';
I am expecting to see a table with one column called g where each row has the value stuff.
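Since the question already suspects a column-name-case mismatch, one hedged variation worth noting (not a confirmed fix for the null values) is that Redshift's Avro COPY also accepts 'auto ignorecase', which matches Avro field names to table columns case-insensitively:
copy schema.table_name
from 's3://bucket/folder/file.avro'
iam_role 'arn:aws:iam::xxxxxxxxx:role/xx-redshift-readonly'
format as avro 'auto ignorecase';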

Related

Column mapping option argument is not supported for PARQUET based COPY

I have to insert parquet file data into a Redshift table. The number of columns in the parquet file might be less than in the Redshift table. I have used the below command.
COPY table_name
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
FORMAT AS PARQUET
But I am getting the below error when I run the COPY command.
Column mapping option argument is not supported for PARQUET based COPY
I tried to use column mapping, like this:
COPY table_name(column1, column2..)
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
But then I am getting a Delimiter not found error. If I specify FORMAT AS PARQUET in the above COPY command (which has the column list), then I again get Column mapping option argument is not supported for PARQUET based COPY.
Could you please let me know how to resolve this?
Thanks
The number of columns in the parquet file MUST match the table's columns. You can't use column mapping with parquet files.
What you can do: create a staging table and copy the parquet file content into it. Then run an insert into your final table using insert into final_table (select col1, col2 from stg_table), as sketched below.
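A minimal sketch of that staging-table approach, with hypothetical table, column, and S3 path names and placeholder credentials (the staging table's columns must match the parquet file exactly):
-- staging table mirroring only the columns that exist in the parquet file
create table stg_table (column1 int, column2 varchar(100));

copy stg_table
from 's3://bucket/folder/file.parquet'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>'
format as parquet;

-- map into the wider final table explicitly
insert into final_table (column1, column2)
select column1, column2 from stg_table;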

Not able to do copy activity with bit value in Azure Data Factory without column mapping for sink as PostgreSQL

I have multiple CSV files in a folder, like employee.csv, student.csv, etc., with headers.
I also have tables for all the files (both the header and the table column names are the same).
employee.csv
id|name|is_active
1|raja|1
2|arun|0
student.csv
id|name
1|raja
2|arun
Table Structure:
employee:
id INT, name VARCHAR, is_active BIT
student:
id INT, name VARCHAR
Now I'm trying to do a copy activity for all the files using a ForEach activity.
The student table copied successfully, but the employee table was not; it throws an error while reading the employee.csv file.
Error Message:
{"Code":27001,"Message":"ErrorCode=TypeConversionInvalidHexLength,Exception occurred when converting value '0' for column name 'is_active' from type 'String' (precision:, scale:) to type 'ByteArray' (precision:0, scale:0). Additional info: ","EventType":0,"Category":5,"Data":{},"MsgId":null,"ExceptionType":"Microsoft.DataTransfer.Common.Shared.PluginRuntimeException","Source":null,"StackTrace":"","InnerEventInfos":[]}
Use a Data Flow activity.
In the data flow activity, select the Source.
After this, add a Derived Column transformation and change the datatype of the is_active column from BIT to String.
(As an example, in my test a Salary column had a string datatype, so I changed it to integer.)
To modify the datatype, use the expression builder. You can use toString(), for example toString(is_active).
This way you can change the datatype before the sink.
In the last step, provide the Sink as PostgreSQL and run the pipeline.
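For reference only, here is the employee table from the question written out as PostgreSQL DDL, together with an illustrative insert showing that PostgreSQL itself accepts the string literals '0' and '1' for a bit(1) column once the values arrive as text (context for the conversion, not part of the pipeline):
-- target table from the question; BIT in PostgreSQL means bit(1)
CREATE TABLE employee (id INT, name VARCHAR, is_active BIT);

-- the string literals '0' and '1' are coerced to bit(1) on insert
INSERT INTO employee (id, name, is_active) VALUES
    (1, 'raja', '1'),
    (2, 'arun', '0');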

Load data with default values into Redshift from a parquet file

I need to load data into Redshift with a default-value column, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column with the getdate function from Redshift:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for work-arounds, I can think of two.
Copy the file to a temp table and then insert from this temp table into your table with the default value.
Define an external table that uses the parquet file as its source and insert from this table into the table with the default value.
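The first work-around looks like the staging-table sketch shown earlier for the parquet COPY question. A hedged sketch of the second one, using Redshift Spectrum with placeholder schema, database, role, table, and column names (only LOAD_DT and its default come from the question):
-- external schema backed by the AWS Glue data catalog (all names are placeholders)
create external schema ext_schema
from data catalog
database 'ext_db'
iam_role 'arn:aws:iam::xxxxxxxxx:role/my-spectrum-role'
create external database if not exists;

-- external table over the parquet files, listing only the columns they contain
create external table ext_schema.src_data (id int, payload varchar(256))
stored as parquet
location 's3://bucket/folder/';

-- LOAD_DT is omitted from the column list, so DEFAULT GETDATE() fills it in
insert into target_table (id, payload)
select id, payload from ext_schema.src_data;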

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table out to a parquet location first). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do this from a Spark app, and how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it should precede ...AS SELECT...; see the Spark SQL syntax.
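A hedged sketch of the reordered statement, run through spark.sql as in the question; it assumes a managed parquet (USING parquet) table is acceptable in place of the EXTERNAL Hive-serde one, which is subject to additional Spark restrictions, and note that in a CTAS the partition column is listed by name only, without a type:
CREATE TABLE IF NOT EXISTS mydb.result
USING parquet
PARTITIONED BY (DATE_KEY)
AS SELECT * FROM spark_tmp_view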

Bug during COPY in Postgres

I have a table named basic_data which contains more than 8 million rows, and I want to copy all this data into a CSV file.
So I use the COPY command like this:
copy basic_data to '/tmp/data_fdw.csv' delimiter ';' null '';
COPY 8792481
This works great, but when I want to insert my data into another table with exactly the same schema (but it's a foreign table), I get the following error:
ERROR: value out of range: overflow