Glue creates duplicates of records, how to fix it? - amazon-redshift

We currently use Glue (Python scripts) to migrate data from a MySQL database into a Redshift database.
Yesterday we found an issue: some records are duplicated, that is, several rows share the same primary key that is used in the MySQL database. According to our requirements, all data in the Redshift database should be the same as in the MySQL database.
I tried to remove the Redshift table before the migration, but couldn't find a method for that.
Could you help me fix the issue?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "glue-db", table_name = "table", transformation_ctx = "datasource0")
applymapping0_1 = ApplyMapping.apply(frame = datasource0, mappings = [...], transformation_ctx = "applymapping0_1")
resolvechoice0_2 = ResolveChoice.apply(frame = applymapping0_1, choice = "make_cols", transformation_ctx = "resolvechoice0_2")
dropnullfields0_3 = DropNullFields.apply(frame = resolvechoice0_2, transformation_ctx = "dropnullfields0_3")
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "table", "database": "database"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
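My solution is to empty the table with a preactions statement before the load:
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "mytable", "database": "mydatabase", "preactions": "delete from public.mytable;"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")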

If your goal is to avoid duplicates in the destination table, you can use the postactions option of the JDBC sink (see this answer for more details). It lets you implement a Redshift merge through a staging table.
In your case it should look like this (it replaces existing records):
post_actions = (
"DELETE FROM dest_table USING staging_table AS S WHERE dest_table.id = S.id;"
"INSERT INTO dest_table (id,name) SELECT id,name FROM staging_table;"
"DROP TABLE IF EXISTS staging_table"
)
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "staging_table", "database": "database", "overwrite": "true", "postactions": post_actions}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
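The postactions string above can also be assembled from table and column names. The following is a minimal sketch, not part of the original answer: the helper name build_merge_postactions and its parameters are hypothetical, and the identifiers are assumed to be trusted, since nothing is escaped.

```python
def build_merge_postactions(dest_table, staging_table, key, columns):
    """Return a ';'-separated SQL string that replaces matching rows in
    dest_table with the rows loaded into staging_table.

    Hypothetical helper: identifiers are interpolated as-is, so only pass
    trusted table and column names.
    """
    cols = ", ".join(columns)
    statements = [
        # Remove destination rows that are about to be replaced.
        f"DELETE FROM {dest_table} USING {staging_table} "
        f"WHERE {dest_table}.{key} = {staging_table}.{key}",
        # Copy the freshly loaded rows across.
        f"INSERT INTO {dest_table} ({cols}) SELECT {cols} FROM {staging_table}",
        # Clean up the staging table.
        f"DROP TABLE IF EXISTS {staging_table}",
    ]
    return "; ".join(statements) + ";"


post_actions = build_merge_postactions(
    "dest_table", "staging_table", "id", ["id", "name"]
)
print(post_actions)
```

The resulting string would be passed as the "postactions" value in connection_options, exactly as in the call above.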

Redshift does not enforce unique key constraints.
Unless you can guarantee that your source scripts avoid duplicates, you need to run a regular job to de-duplicate on Redshift:
delete from yourtable
where id in
(
select id
from yourtable
group by 1
having count(*) >1
)
;
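Note that the DELETE above removes every row whose id occurs more than once, including the copy you may want to keep. One common alternative is to rebuild the table from its distinct rows. The sketch below only illustrates that pattern: it uses an in-memory SQLite database as a stand-in (an assumption, it has not been run against Redshift), the sample rows are made up, and it only helps when the duplicate rows are identical in every column.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c")],  # made-up sample rows
)

# Rebuild the table from its distinct rows so one copy of each survives.
conn.executescript(
    """
    CREATE TEMP TABLE deduped AS SELECT DISTINCT * FROM yourtable;
    DELETE FROM yourtable;
    INSERT INTO yourtable SELECT * FROM deduped;
    DROP TABLE deduped;
    """
)

rows = conn.execute("SELECT id, name FROM yourtable ORDER BY id").fetchall()
print(rows)
```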
Did you consider DMS as an alternative to Glue? This could work better for you.

Related

Convert cursor into pyspark dataframe

I have been using the code below to connect to our organisation's cluster and query the data. Is it possible to convert the cursor output directly into a PySpark dataframe?
jarFile="/dataflame/nas_nfs/tmp/lib/olympus-jdbc-driver.jar"
url = "olympus:jdbcdriver"
env='uat'
print("Using environment", env.upper())
className = "net.vk.olympus.jdbc.driver.OlympusDriver"
conn = jaydebeapi.connect(className, url,{'username':userid,'password':pwd,'ENV':env,'datasource':'HIVE','EnableLog':'false'},jarFile)
cursor = conn.cursor()
query = "select * from abc.defcs123 limit 5"
cursor.execute(query)
pandas_df = as_pandas(cursor)
print(pandas_df)
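One possible approach, sketched under assumptions: any DB-API 2.0 cursor exposes the column names through cursor.description and the data through fetchall(), and that pair can be handed to Spark. The helper name cursor_to_rows is hypothetical, sqlite3 stands in for the JDBC connection so the snippet runs on its own, and the Spark call is left as a comment because it needs a live SparkSession (assumed to be named spark).

```python
import sqlite3


def cursor_to_rows(cursor):
    """Return (column_names, rows) for an already executed DB-API cursor."""
    columns = [desc[0] for desc in cursor.description]
    rows = [tuple(row) for row in cursor.fetchall()]
    return columns, rows


# sqlite3 is only a stand-in for the jaydebeapi connection in the question.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("SELECT 1 AS id, 'x' AS name")
columns, rows = cursor_to_rows(cursor)
print(columns, rows)

# With a live SparkSession named `spark` (assumption):
# spark_df = spark.createDataFrame(rows, schema=columns)
```

Since the question already builds pandas_df, spark.createDataFrame(pandas_df) is another route worth trying.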

How to drop a duplicate column in a Glue job, as Glue is creating duplicate columns

I have created a Glue job, and it creates a duplicate column once I run the crawler on the transformed file. How can I drop the duplicate column?
I know there is a DropNullFields function, but it drops null fields, not duplicate columns.
What is the way to drop the duplicate column and store the result as CSV?
Here is the code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sample", table_name = "test", transformation_ctx = "datasource0")
dfc = datasource0.relationalize("root", "s3://testing/")
for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    dropNullfields = DropNullFields.apply(frame = m_df)
    # Write the transformed frame (not the DropNullFields class) to S3 as CSV
    datasink2 = glueContext.write_dynamic_frame.from_options(frame = dropNullfields,
        connection_type = "s3", connection_options = {"path": "s3://sample/" +
        df_name + "/"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
You can use the drop_fields() method of the DynamicFrame. Example:
droppedFields = dropNullfields.drop_fields(paths=["lname", "userid"])

Delete in Apache Hudi - Glue Job

I have to build a Glue job for updating and deleting old rows in an Athena table.
When I run my job to delete rows, it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
    'hoodie.table.name': 'test_table_output',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': 'test_table_output',
    'hoodie.datasource.write.operation': 'delete',
    'hoodie.datasource.write.precombine.field': 'name',
    'hoodie.upsert.shuffle.parallelism': 1,
    'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have two data sources: the first is the old Athena table whose data has to be updated or deleted, and the second is the table that receives the new updated or deleted data.
In ds I select all rows that have to be deleted from the old table.
op stands for operation: 'D' for delete, 'U' for update.
Does anyone know what I am missing here?
The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are upsert, insert and bulk_insert. Check the Hudi docs.
Also, what is your intention when deleting records: a hard delete or a soft delete?
For a hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}
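A sketch of how the answer's suggestion could be combined with the options from the question, assuming the operation is switched to upsert as the answer implies. The helper name hard_delete_options is hypothetical, and whether these settings fix the schema error from the question is not verified here.

```python
# Base options copied from the question (the shuffle settings are omitted).
hudi_base_options = {
    "hoodie.table.name": "test_table_output",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.table.name": "test_table_output",
    "hoodie.datasource.write.precombine.field": "name",
}


def hard_delete_options(base):
    """Return a copy of `base` configured for a hard delete as described
    in the answer: an upsert whose payload class is the empty payload."""
    options = dict(base)
    options["hoodie.datasource.write.operation"] = "upsert"
    options["hoodie.datasource.write.payload.class"] = (
        "org.apache.hudi.common.model.EmptyHoodieRecordPayload"
    )
    return options


hudi_delete_options = hard_delete_options(hudi_base_options)
print(hudi_delete_options)
```

The dictionary would then be passed to the writer with options(**hudi_delete_options), as in the question.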

How to make an existing column NOT NULL in AWS REDSHIFT?

I dynamically created a table through a Glue job and it works fine. A new requirement is to add a column that holds unique values and serves as the primary key in Redshift.
I implemented the column using the row_number() function and it works fine, but the latest requirement is that this column should be the primary key.
When I try to do that, Redshift requires the column to be NOT NULL. Do you know how to make the column NOT NULL dynamically through the Glue job, or with a Redshift query?
I tried everything I could think of, without luck.
w = Window().orderBy(lit('A'))
df = timestampedDf.withColumn("row_num", row_number().over(w))
rowNumDf = DynamicFrame.fromDF(df, glueContext, "df1")
postStep = "begin; ALTER TABLE TAB_CUSTOMER_DATA ALTER COLUMN row_num INTEGER NOT NULL; ALTER TABLE TAB_CUSTOMER_DATA ADD CONSTRAINT PK_1 PRIMARY KEY (row_num); end;"
## @type: DataSink
## @args: [catalog_connection = "REDSHIFT_CONNECTION", connection_options = {"dbtable": "tab_customer_data", "database": "randomdb"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = rowNumDf]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = rowNumDf, catalog_connection = "REDSHIFT_CONNECTION", connection_options = {"dbtable": "TAB_CUSTOMER_DATA", "database": "randomdb", "postactions": postStep}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()
I solved this using the approach from the link below:
Add a new column with a default value and NOT NULL.
Copy the old column's values into the new column.
Drop the old column.
Make the new column the primary key.
https://ubiq.co/database-blog/how-to-remove-not-null-constraint-in-redshift/
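The steps listed in this answer could be scripted roughly as follows. This is a sketch, not the author's actual statements: the helper name, the _new suffix and the default value 0 are assumptions, and the rename step is an addition so that the new column ends up under the original name.

```python
def not_null_pk_statements(table, column, col_type="INTEGER", default="0"):
    """Build the SQL for swapping `column` with a NOT NULL copy and making
    it the primary key. Hypothetical helper; identifiers are not escaped."""
    tmp = f"{column}_new"  # assumed temporary column name
    return [
        # 1. Add a new column with a default and NOT NULL.
        f"ALTER TABLE {table} ADD COLUMN {tmp} {col_type} DEFAULT {default} NOT NULL",
        # 2. Copy the old column's values into the new column.
        f"UPDATE {table} SET {tmp} = {column}",
        # 3. Drop the old column.
        f"ALTER TABLE {table} DROP COLUMN {column}",
        # Extra step (assumption): give the new column the original name.
        f"ALTER TABLE {table} RENAME COLUMN {tmp} TO {column}",
        # 4. Make the column the primary key.
        f"ALTER TABLE {table} ADD PRIMARY KEY ({column})",
    ]


statements = not_null_pk_statements("TAB_CUSTOMER_DATA", "row_num")
postStep = "; ".join(statements) + ";"
print(postStep)
```

The joined string could be supplied as the "postactions" value in place of the postStep from the question.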

How to execute hql file in spark with arguments

I have an HQL file that accepts several arguments, and in a standalone Spark application I call this HQL script to create a dataframe.
This is a sample of the HQL code from my script:
select id , name, age, country , created_date
from ${db1}.${table1} a
inner join ${db2}.${table2} b
on a.id = b.id
And this is how I call it in my Spark script:
import scala.io.Source
val queryFile = "path/to/my/file"
val db1 = "cust_db"
val db2 = "cust_db2"
val table1 = "customer"
val table2 = "products"
val query = Source.fromFile(queryFile).mkString
val df = spark.sql(query)
When I run it this way, I get:
org.apache.spark.sql.catalyst.parser.ParseException
Is there a way to pass arguments directly to my HQL file and then create a dataframe from the Hive query?
Parameters can be injected with code like this:
val parametersMap = Map("db1" -> db1, "db2" -> db2, "table1" -> table1, "table2" -> table2)
val injectedQuery = parametersMap.foldLeft(query)((acc, cur) => acc.replace("${" + cur._1 + "}", cur._2))
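For comparison, here is the same substitution sketched in Python. The standard library's string.Template happens to use the same ${name} placeholder syntax as the HQL file, so no manual replace loop is needed. The query text and values are taken from the question; in a PySpark job the result would be passed to spark.sql (assumed, not run here).

```python
from string import Template

# Query text as it would be read from the .hql file.
query = (
    "select id, name, age, country, created_date\n"
    "from ${db1}.${table1} a\n"
    "inner join ${db2}.${table2} b\n"
    "on a.id = b.id"
)

params = {
    "db1": "cust_db",
    "db2": "cust_db2",
    "table1": "customer",
    "table2": "products",
}

# substitute() raises KeyError if a placeholder has no matching parameter.
injected_query = Template(query).substitute(params)
print(injected_query)
```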