Glue load job not retaining default column value in redshift - amazon-redshift

I have a Glue job that loads a CSV from S3 into a Redshift table. There is one column (updated_date) which is not mapped, and its default value is set to the current timestamp in UTC. But every time the Glue job runs, this updated_date column is null.
I tried removing updated_dt from the Glue metadata table, and I also tried removing it from SelectFields.apply() in the Glue script.
When I run a normal INSERT statement in Redshift without the updated_dt column, the default current_timestamp value is inserted for those rows.
Thanks

Well, I had the same problem. AWS support told me to convert the Glue DynamicFrame into a Spark DataFrame and use the Spark writer to load the data into Redshift:
SparkDF = GlueDynFrame.toDF()
SparkDF.write.format('jdbc').options(
    url='<JDBC url>',
    dbtable='<schema>.<table>',
    user='<username>',
    password='<password>'
).mode('append').save()
I, on the other hand, handled the problem by either dropping and recreating the target table using preactions, or by running an UPDATE in postactions to set the default values; a sketch of the postactions approach follows.
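A minimal sketch of that postactions approach, reusing the GlueDynFrame from the snippet above; the Glue connection name, database, table, and S3 temp path are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=GlueDynFrame,                           # the DynamicFrame built earlier in the job
    catalog_connection='my-redshift-connection',  # assumed Glue connection name
    connection_options={
        'dbtable': 'my_schema.my_table',          # assumed target table
        'database': 'my_db',                      # assumed database
        # Backfill the unmapped column after the load completes.
        'postactions': 'UPDATE my_schema.my_table SET updated_dt = GETDATE() WHERE updated_dt IS NULL;',
    },
    redshift_tmp_dir='s3://my-temp-bucket/tmp/',  # assumed temp dir
)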

Related

How to write decimal type to redshift using awsglue?

I am trying to write a variety of columns to redshift from a dynamic frame using the DynamicFrameWriter.from_jdbc_conf method, but all DECIMAL fields end up as a column of NULLs.
The ETL pulls in from some redshift views, and eventually writes back to a redshift table using the aforementioned method. In general this works for other datatypes, but adding a statement as simple as SELECT CAST(12.34 AS DECIMAL(4, 2)) AS decimal_test results in a column full of NULLs. In pyspark, I can print out the column and see the decimal values immediately before they are written as NULLs to the redshift table. When I look at the schema on redshift, I can see the column decimal_test has a type of NUMERIC(4, 2). What could be causing this failure?
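For context, a minimal sketch of the write path described above; the connection name, database, target table, and temp path are assumptions:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# The value is a proper DECIMAL(4, 2) in Spark and prints as 12.34 ...
decimal_df = spark.sql('SELECT CAST(12.34 AS DECIMAL(4, 2)) AS decimal_test')
decimal_dyf = DynamicFrame.fromDF(decimal_df, glueContext, 'decimal_dyf')

# ... but lands in Redshift as NULL when written via from_jdbc_conf.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=decimal_dyf,
    catalog_connection='my-redshift-connection',  # assumed connection name
    connection_options={'dbtable': 'my_schema.decimal_check', 'database': 'my_db'},
    redshift_tmp_dir='s3://my-temp-bucket/tmp/',
)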

pyspark - insert generated primary key in dataframe

I have a dataframe, and for each row I want to insert it into a Postgres database and get the generated primary key back into the dataframe. I can't find a good way to do this.
I tried with an RDD but it doesn't work (pg8000 get inserted id into dataframe).
I think it is possible with this process:
loop over dataframe.collect() to run the SQL inserts
run a SQL select into a second dataframe
join the first dataframe with the second
But I think this is not optimized.
Do you have any idea?
I'm using PySpark in an AWS Glue job. Thanks.
The only things you can optimize are the data insertion and the connectivity.
As you mentioned, you have two operations in total: inserting the data and collecting the data you inserted. As far as I understand, neither Spark JDBC nor a Python connector like psycopg2 will return the primary key of the data you inserted, so you need to do that part separately.
Back to your question:
You don't need a for loop for the inserts or .collect() to convert back to Python objects. You can use the Spark PostgreSQL JDBC writer directly on the dataframe:
df.write.mode('append').format('jdbc') \
    .option('driver', 'org.postgresql.Driver') \
    .option('url', url) \
    .option('dbtable', table_name) \
    .option('user', user) \
    .option('password', password) \
    .save()
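For the second step (getting the generated keys back), a minimal sketch, assuming the table has a generated primary key column id and a business key column natural_key (both names are placeholders) and reusing url, table_name, user, and password from above: read the table back over JDBC and join it to the original dataframe.

inserted_df = spark.read.format('jdbc') \
    .option('driver', 'org.postgresql.Driver') \
    .option('url', url) \
    .option('dbtable', table_name) \
    .option('user', user) \
    .option('password', password) \
    .load()

# Join on the columns that uniquely identify a row (assumed here to be
# 'natural_key') to pick up the generated primary key column (assumed 'id').
df_with_keys = df.join(inserted_df.select('id', 'natural_key'), on='natural_key', how='left')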

Load data with default values into Redshift from a parquet file

I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column with the getdate function from Redshift:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for workarounds, I can think of two:
COPY the file into a temp table and then insert from this temp table into your table with the default value (see the sketch below).
Define an external table that uses the parquet file as its source and insert from that table into the table with the default value.
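A minimal sketch of the first workaround via psycopg2; the cluster endpoint, credentials, table names, S3 path, and IAM role are placeholders:

import psycopg2

conn = psycopg2.connect(
    host='my-cluster.xxxxx.us-east-1.redshift.amazonaws.com',  # assumed endpoint
    port=5439, dbname='my_db', user='my_user', password='my_password',
)
with conn:
    with conn.cursor() as cur:
        # Stage the parquet file into a temp table that has no defaulted column.
        cur.execute('CREATE TEMP TABLE stg_my_table (col_a VARCHAR(100), col_b INT);')
        cur.execute(
            "COPY stg_my_table "
            "FROM 's3://my-bucket/path/to/file.parquet' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
            "FORMAT AS PARQUET;"
        )
        # LOAD_DT is left out of the column list, so DEFAULT GETDATE() applies.
        cur.execute(
            'INSERT INTO my_schema.my_table (col_a, col_b) '
            'SELECT col_a, col_b FROM stg_my_table;'
        )
conn.close()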

SCD2 Implementation in Redshift using AWS Glue PySpark

I have a requirement to move data from S3 to Redshift. Currently I am using Glue for the work.
Current Requirement:
Compare the primary key of each record in the Redshift table with the incoming file; if a match is found, close out the old record (update its end date from the high date to the current date) and insert the new one.
If no primary key match is found, insert the new record.
Implementation:
I have implemented it in Glue using pyspark with the following steps:
Created dataframes to cover three scenarios:
If a match is found, update the existing record's end date to the current date.
Insert the new record into the Redshift table where a PPK match is found.
Insert the new record into the Redshift table where no PPK match is found.
Finally, union all three dataframes into one and write the result to the Redshift table.
With this approach, both the old record (which has the high date value) and the new record (which was updated with the current date value) will be present.
Is there a way to delete the old record with the high date value using PySpark? Please advise.
We successfully implemented the desired functionality using AWS RDS [PostgreSQL] as the database service and Glue as the ETL service. My suggestion would be that instead of computing the delta in Spark dataframes, it is a far easier and more elegant solution to create stored procedures and call them from the PySpark Glue job.
[for example: S3 bucket -> staging table -> target table]
In addition, if your logic executes in under 10 minutes, I would suggest using a Python shell job with external libraries such as psycopg2 / SQLAlchemy for the DB operations, as sketched below.
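A minimal sketch of that suggestion as a Glue Python shell job; the connection details and the procedure name sp_scd2_merge are assumptions:

import psycopg2

conn = psycopg2.connect(
    host='my-db-endpoint.example.com',  # assumed endpoint (RDS PostgreSQL or Redshift)
    port=5432,
    dbname='my_db',
    user='my_user',
    password='my_password',
)
try:
    with conn.cursor() as cur:
        # The procedure is assumed to merge the staging table into the target:
        # close out superseded records (high date -> current date), insert the
        # new versions, and delete any leftover old high-date rows.
        cur.execute('CALL my_schema.sp_scd2_merge();')
    conn.commit()
finally:
    conn.close()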

Hive Partition Table with Date Datatype via Spark

I have a scenario and would like to get an expert opinion on it.
I have to load a Hive table in partitions from a relational DB via Spark (Python). I cannot create the Hive table up front, as I am not sure how many columns there are in the source and they might change in the future, so I have to fetch the data using select * from tablename.
However, I am sure of the partition column and know that will not change. This column is of "date" datatype in the source db.
I am using saveAsTable with the partitionBy option, and I am able to properly create the folders as per the partition column. The Hive table is also getting created.
The issue I am facing is that the partition column is of "date" data type, which is not supported for partitions in Hive. Because of this I am unable to read the data via Hive or Impala queries, as they report that date is not supported as a partition column.
Please note that I cannot typecast the column at the time of issuing the select statement as I have to do a select * from tablename, and not select a,b,cast(c) as varchar from table.
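For reference, a minimal sketch of the flow described above; the JDBC URL, credentials, table names, and the partition column name part_date are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Pull everything, since the source columns are not known up front.
src_df = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://dbhost:5432/sourcedb') \
    .option('dbtable', 'tablename') \
    .option('user', 'my_user') \
    .option('password', 'my_password') \
    .load()

# part_date is the date-typed partition column; the folders and the Hive table
# are created, but (as described above) Hive/Impala then refuse to read it
# because a date-typed partition column is not supported.
src_df.write \
    .mode('overwrite') \
    .partitionBy('part_date') \
    .saveAsTable('my_db.tablename')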