How to make an existing column NOT NULL in AWS Redshift?

I dynamically created a table through a Glue job and it is working fine. But per a new requirement, I need to add a new column that generates unique values and should be the primary key in Redshift.
I implemented this using the row_number() function and it works fine. But the latest requirement is that this particular column should be the primary key.
When I try to do that, Redshift requires the column to be NOT NULL. Do you know how to make the column NOT NULL dynamically through the Glue job, or any Redshift query to make it NOT NULL?
I tried all the ways I could find, without luck.
from pyspark.sql.window import Window
from pyspark.sql.functions import lit, row_number
from awsglue.dynamicframe import DynamicFrame

# assign a sequential row number over a constant ordering
w = Window().orderBy(lit('A'))
df = timestampedDf.withColumn("row_num", row_number().over(w))
rowNumDf = DynamicFrame.fromDF(df, glueContext, "rowNumDf")
postStep = "begin; ALTER TABLE TAB_CUSTOMER_DATA ALTER COLUMN row_num INTEGER NOT NULL; ALTER TABLE TAB_CUSTOMER_DATA ADD CONSTRAINT PK_1 PRIMARY KEY (row_num); end;"
## #type: DataSink
## #args: [catalog_connection = "REDSHIFT_CONNECTION", connection_options = {"dbtable": "tab_customer_data", "database": "randomdb"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## #return: datasink4
## #inputs: [frame = rowNumDf]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = rowNumDf, catalog_connection = "REDSHIFT_CONNECTION", connection_options = {"dbtable": "TAB_CUSTOMER_DATA", "database": "randomdb", "postactions": postStep}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

I solved this using the approach from the link below (a postactions sketch of these steps follows):
1. Add a new column with a default value and NOT NULL.
2. Copy the old column's values into the new column.
3. Drop the old column.
4. Make the new column the primary key.
https://ubiq.co/database-blog/how-to-remove-not-null-constraint-in-redshift/
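For reference, here is a minimal sketch of how those four steps could be issued directly from the Glue job as Redshift postactions. The table, column, and constraint names are the ones from the question; row_num_new and the DEFAULT value of 0 are placeholders of my own, and keep in mind the primary key is informational only in Redshift:

# 1) add a NOT NULL replacement column with a default, 2) copy the old values over,
# 3) drop the old nullable column, 4) declare the (informational) primary key
postStep = """
ALTER TABLE TAB_CUSTOMER_DATA ADD COLUMN row_num_new INTEGER DEFAULT 0 NOT NULL;
UPDATE TAB_CUSTOMER_DATA SET row_num_new = row_num;
ALTER TABLE TAB_CUSTOMER_DATA DROP COLUMN row_num;
ALTER TABLE TAB_CUSTOMER_DATA ADD CONSTRAINT PK_1 PRIMARY KEY (row_num_new);
"""
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = rowNumDf,
    catalog_connection = "REDSHIFT_CONNECTION",
    connection_options = {"dbtable": "TAB_CUSTOMER_DATA", "database": "randomdb", "postactions": postStep},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink4")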

Related

DELTA TABLES - The specified properties do not match the existing properties

I'm using a set of user properties on Databricks Delta tables for metadata management. The problem is that when I need to change one of those properties, I get the error 'FAILED Error: The specified properties do not match the existing properties at /mnt/silver/...'.
The Databricks documentation only states that an exception will be raised, and I didn't find any argument to force it to accept the new values.
Is it possible to just update the table properties?
Any suggestions?
Sample Code:
query = f'''
CREATE TABLE if not exists {tableMetadataDBName}.{tableMetadataTableName}
(
... my columns ...
-- COMMON COLUMNS
,Hash string
,sourceFilename STRING
,HashWithFileDate string
,Surrogate_Key STRING
,SessionId STRING
,SessionRunDate TIMESTAMP
,Year INT GENERATED ALWAYS AS ( YEAR(fileDate))
,Month INT GENERATED ALWAYS AS ( MONTH(fileDate))
,fileDate DATE
)
USING DELTA
COMMENT '{tableDocumentationURL}'
LOCATION "{savePath}/parquet"
OPTIONS( "compression"="snappy")
PARTITIONED BY (Year, Month, fileDate )
TBLPROPERTIES ("DataStage"="{txtDataStage.upper()}"
,"Environment"="{txtEnvironment}"
,"source"="{tableMetadataSource}"
,"CreationDate" = "{tableMetadataCreationDate}"
,"CreatedBy" = "{tableMetadataCreatedBy}"
,"Project" = "{tableMetadataProject}"
,"AssociatedReports" = "{tableMetadataAssociatedReports}"
,"UpstreamDependencies" = "{tableMetadataUpstreamDependencies}"
,"DownstreamDependencies" = "{tableMetadataDownstreamDependencies}"
,"Source" = "{tableMetadataSource}"
,"PopulationFrequency" = "{tableMetadataPopulationFrequency}"
,"BusinessSubject" = "{tableMetadataBusinessSubject}"
,"JiraProject" = "{tableMetadataJiraProject}"
,"DevOpsProject" = "{tableMetadataDevOpsProject}"
,"DevOpsRepository" = "{tableMetadataDevOpsRepository}"
,"URL" = "{tableMetadataURL}") '''
spark.sql(query)
Yes, it's possible to change just properties - you need to use "ALTER TABLE [table_name] SET TBLPROPERTIES ..." for that:
query = f"""ALTER TABLE {table_name} SET TBLPROPERTIES (
'CreatedBy' = '{tableMetadataCreatedBy}'
,'Project' = '{tableMetadataProject}'
....
)"""
spark.sql(query)
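As a quick check afterwards (a minimal sketch, reusing the table_name variable from the answer above), the properties can be read back with SHOW TBLPROPERTIES to confirm the update took effect:

# SHOW TBLPROPERTIES returns one (key, value) row per table property
props = spark.sql(f"SHOW TBLPROPERTIES {table_name}")
props.filter("key IN ('CreatedBy', 'Project')").show(truncate=False)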

How to drop a duplicate column in a Glue job, as Glue is creating a duplicate column

I have created a Glue job, and it is creating a duplicate column once I run the crawler on the transformed file. How do I drop the duplicate column?
I know there is a DropNullFields function, but it drops null fields, not duplicate columns.
What is the way to drop the duplicate column and store the result as CSV?
Here is the code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sample", table_name = "test", transformation_ctx = "datasource0")
dfc = datasource0.relationalize("root", "s3://testing/")
for df_name in dfc.keys():
m_df = dfc.select(df_name)
dropNullfields = DropNullFields.apply(frame = m_df)
datasink2 = glueContext.write_dynamic_frame.from_options(frame = DropNullFields ,
connection_type = "s3", connection_options = {"path": "s3://sample/" +
df_name +"/"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
You can use the drop_fields() method on the DynamicFrame. Example:
droppedFields = dropNullfields.drop_fields(paths=["lname", "userid"])
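For completeness, a minimal sketch of how that could slot into the loop from the question (the names in paths are placeholders; substitute whichever duplicate columns the crawler reports):

for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    # drop the unwanted duplicate columns by name before writing out
    deduped = m_df.drop_fields(paths=["duplicate_col_1", "duplicate_col_2"])
    glueContext.write_dynamic_frame.from_options(
        frame = deduped,
        connection_type = "s3",
        connection_options = {"path": "s3://sample/" + df_name + "/"},
        format = "csv",
        transformation_ctx = "datasink2")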

Delete in Apache Hudi - Glue Job

I have to build a Glue job for updating and deleting old rows in an Athena table.
When I run my job for deleting, it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
'hoodie.table.name': 'test_table_output',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.table.name': 'test_table_output',
'hoodie.datasource.write.operation': 'delete',
'hoodie.datasource.write.precombine.field': 'name',
'hoodie.upsert.shuffle.parallelism': 1,
'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have two data sources: the first is the old Athena table where data has to be updated or deleted, and the second is the table in which the new updated or deleted data arrives.
In ds I have selected all rows that have to be deleted from the old table.
op is the operation: 'D' for delete, 'U' for update.
Does anyone know what I am missing here?
The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are UPSERT/INSERT/BULK_INSERT. Check the Hudi docs.
Also, what is your intention for deleting records: hard delete or soft delete?
For a hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}
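Building on the options dictionary from the question, a hard delete could look roughly like this (a sketch only: the payload class line is the addition, 'upsert' is used as the write operation per the note above, and everything else is the question's own configuration):

hudi_hard_delete_options = {
    'hoodie.table.name': 'test_table_output',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': 'test_table_output',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'name',
    # emit an empty payload for every matching record key, i.e. a hard delete
    'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload',
    'hoodie.upsert.shuffle.parallelism': 1,
    'hoodie.insert.shuffle.parallelism': 1
}
df.write.format("hudi"). \
    options(**hudi_hard_delete_options). \
    mode("append"). \
    save('s3://data/test-output/')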

Why is ADD COLUMN on a Kafka table not supported in ClickHouse?

I have a problem adding a column to the Kafka queue in ClickHouse.
I've created a table with the command
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
`ts` String,
.... some other columns
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = '172.21.0.3:9092',
kafka_topic_list = 'my_topic',
kafka_group_name = 'my_group',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n',
kafka_num_consumers = 1,
kafka_skip_broken_messages = 10;
And then I try to add a column:
ALTER TABLE my_db.my_queue ON CLUSTER my_cluster ADD COLUMN new_column String;
But I get an error:
SQL Error [48]: ClickHouse exception, code: 48, host: 172.21.0.4, port: 8123; Code: 48,
e.displayText() = DB::Exception: There was an error on [clickhouse-server:9000]: Code: 48,
e.displayText() = DB::Exception: Alter of type 'ADD COLUMN' is not supported by storage Kafka
(version 20.11.4.13 (official build)) (version 20.11.4.13 (official build))
I am not familiar with ClickHouse or analytical databases in general.
So I am wondering: why is this not supported? Or should I add the column in another way?
One way of supporting messages with a changing schema from a Kafka queue is to store the raw JSON messages, like this:
CREATE TABLE my_db.my_queue ON CLUSTER my_cluster
(
`message` String
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = '172.21.0.3:9092',
kafka_topic_list = 'my_topic',
kafka_group_name = 'my_group',
kafka_format = 'JSONAsString',
kafka_row_delimiter = '\n',
kafka_num_consumers = 1,
kafka_skip_broken_messages = 10;
The JSONAsString format will store the raw JSON in the message column. This way from the Kafka table you can post-process each new row through materialized views and JSON functions.
For instance:
CREATE TABLE my_db.post_processed_data (
`ts` String,
`another_column` String
)
-- use a proper engine
Engine=Log;
CREATE MATERIALIZED VIEW my_db.my_queue_mv TO my_db.post_processed_data
AS
SELECT
JSONExtractString(message, 'ts') AS ts,
JSONExtractString(message, 'another_column') AS another_column
FROM my_db.my_queue;
If there's any change in the JSON schema on the Kafka topic, you can react by doing an ALTER TABLE .. ADD COLUMN .. on the post_processed_data table and updating the materialized view accordingly. That way the Kafka table remains as it is.
The Kafka engine does not support it.
Just drop the table and recreate it with the new schema.
It does not support ALTER because the author of the Kafka engine did not need it.

Glue creates duplicates of records, how to fix it?

Currently, we use Glue (Python scripts) for data migration from a MySQL database into a Redshift database.
Yesterday, we found an issue: some records are duplicates; these records have the same primary key that is used in the MySQL database. According to our requirements, all data in the Redshift database should be the same as in the MySQL database.
I tried to remove the Redshift table before migration, but didn't find a method for that...
Could you help me fix the issue?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "glue-db", table_name = "table", transformation_ctx = "datasource0")
applymapping0_1 = ApplyMapping.apply(frame = datasource0, mappings = [...], transformation_ctx = "applymapping0_1")
resolvechoice0_2 = ResolveChoice.apply(frame = applymapping0_1, choice = "make_cols", transformation_ctx = "resolvechoice0_2")
dropnullfields0_3 = DropNullFields.apply(frame = resolvechoice0_2, transformation_ctx = "dropnullfields0_3")
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "table", "database": "database"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
My solution is:
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "mytable", "database": "mydatabase", "preactions": "delete from public.mytable;"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
If your goal is not to have duplicates in the destination table, you can use the postactions option of the JDBC sink (see this answer for more details). Basically it allows you to implement a Redshift merge using a staging table.
For your case it should be like this (it replaces existing records):
post_actions = (
"DELETE FROM dest_table USING staging_table AS S WHERE dest_table.id = S.id;"
"INSERT INTO dest_table (id,name) SELECT id,name FROM staging_table;"
"DROP TABLE IF EXISTS staging_table"
)
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "staging_table", "database": "database", "overwrite": "true", "postactions": post_actions}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
Redshift does not enforce unique key constraints.
Unless you can guarantee that your source scripts avoid duplicates, you need to run a regular job to de-duplicate on Redshift, for example:
delete from yourtable
where id in
(
select id
from yourtable
group by 1
having count(*) >1
)
;
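Note that the query above deletes every row whose id is duplicated, including the copy you presumably want to keep. A common alternative, sketched below under the assumption that the duplicates are identical full rows (yourtable is a placeholder name), is to rebuild the table from its distinct rows, for example as a scheduled maintenance query or a Glue postaction:

# rebuild the table from its distinct rows, keeping exactly one copy of each
dedup_sql = """
CREATE TEMP TABLE yourtable_dedup AS SELECT DISTINCT * FROM yourtable;
DELETE FROM yourtable;
INSERT INTO yourtable SELECT * FROM yourtable_dedup;
DROP TABLE yourtable_dedup;
"""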
Did you consider DMS as an alternative to Glue? This could work better for you.