Delete in Apache Hudi - Glue Job - pyspark

I have to build a Glue job that updates and deletes old rows in an Athena table.
When I run my job for deleting, it returns this error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")

datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")

ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")

hudi_delete_options = {
    'hoodie.table.name': 'test_table_output',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': 'test_table_output',
    'hoodie.datasource.write.operation': 'delete',
    'hoodie.datasource.write.precombine.field': 'name',
    'hoodie.upsert.shuffle.parallelism': 1,
    'hoodie.insert.shuffle.parallelism': 1
}

from pyspark.sql.functions import lit

deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))

df.write.format("hudi"). \
    options(**hudi_delete_options). \
    mode("append"). \
    save('s3://data/test-output/')

roAfterDeleteViewDF = spark. \
    read. \
    format("hudi"). \
    load("s3://data/test-output/")

roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have two data sources: the first is the old Athena table whose data has to be updated or deleted, and the second is the table into which the new updated or deleted records arrive.
In ds I have selected all the rows that have to be deleted from the old table.
op is the operation column: 'D' for delete, 'U' for update.
Does anyone know what I am missing here?

The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are UPSERT/INSERT/BULK_INSERT. Check the Hudi docs.
Also, what is your intention for deleting records: a hard delete or a soft delete?
For a hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}
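For example, here is a rough sketch of a hard delete built from the question's own code: the table name, record key, precombine field, views and S3 path are carried over from the question and may need adjusting, and the operation is set to upsert with the empty payload class so the matched records get removed.

hudi_hard_delete_options = {
    'hoodie.table.name': 'test_table_output',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': 'test_table_output',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'name',
    # Writing a record with this payload class removes it from the table (hard delete).
    'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload',
    'hoodie.upsert.shuffle.parallelism': 1,
    'hoodie.insert.shuffle.parallelism': 1
}

# Rows flagged with op = 'D' in the incoming data are the ones to remove.
delete_df = spark.sql(
    "SELECT id, name FROM view_dyf_output "
    "WHERE id IN (SELECT id FROM view_dyf WHERE op = 'D')"
)

delete_df.write.format("hudi") \
    .options(**hudi_hard_delete_options) \
    .mode("append") \
    .save('s3://data/test-output/')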

Related

How to make an existing column NOT NULL in AWS REDSHIFT?

I had dynamically created a table through a Glue job and it is working fine. But per a new requirement, I need to add a new column that generates unique values and should be the primary key in Redshift.
I implemented this using the row_number() window function and it works fine. But the latest requirement is that this particular column should be the primary key.
When I try to do that, it asks for the column to be NOT NULL. Do you know how to make the column NOT NULL dynamically through the Glue job, or a Redshift query to make it NOT NULL?
I tried all the ways without luck.
from pyspark.sql.window import Window
from pyspark.sql.functions import lit, row_number
from awsglue.dynamicframe import DynamicFrame

w = Window().orderBy(lit('A'))
df = timestampedDf.withColumn("row_num", row_number().over(w))
rowNumDf = DynamicFrame.fromDF(df, glueContext, "df1")
postStep = "begin; ALTER TABLE TAB_CUSTOMER_DATA ALTER COLUMN row_num INTEGER NOT NULL; ALTER TABLE TAB_CUSTOMER_DATA ADD CONSTRAINT PK_1 PRIMARY KEY (row_num); end;"
## #type: DataSink
## #args: [catalog_connection = "REDSHIFT_CONNECTION", connection_options = {"dbtable": "tab_customer_data", "database": "randomdb"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## #return: datasink4
## #inputs: [frame = rowNumDf]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = rowNumDf, catalog_connection = "REDSHIFT_CONNECTION", connection_options = {"dbtable": "TAB_CUSTOMER_DATA", "database": "randomdb", "postactions": postStep}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()
I solved this using the approach from the link below:
1. add a new column with a default value and NOT NULL,
2. copy the old column values into the new column,
3. drop the old column,
4. make the new column the primary key.
https://ubiq.co/database-blog/how-to-remove-not-null-constraint-in-redshift/
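A rough sketch of that approach as Redshift SQL, wired into the question's Glue sink via a postactions string; the table, column and constraint names (TAB_CUSTOMER_DATA, row_num, row_num_new, PK_1) are illustrative and must match your own schema:

# Hypothetical sketch: rebuild row_num as NOT NULL, then promote it to the primary key.
rebuild_pk = (
    "begin; "
    "ALTER TABLE TAB_CUSTOMER_DATA ADD COLUMN row_num_new INTEGER NOT NULL DEFAULT 0; "
    "UPDATE TAB_CUSTOMER_DATA SET row_num_new = row_num; "
    "ALTER TABLE TAB_CUSTOMER_DATA DROP COLUMN row_num; "
    "ALTER TABLE TAB_CUSTOMER_DATA RENAME COLUMN row_num_new TO row_num; "
    "ALTER TABLE TAB_CUSTOMER_DATA ADD CONSTRAINT PK_1 PRIMARY KEY (row_num); "
    "end;"
)

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = rowNumDf,
    catalog_connection = "REDSHIFT_CONNECTION",
    connection_options = {"dbtable": "TAB_CUSTOMER_DATA", "database": "randomdb", "postactions": rebuild_pk},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink4")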

AnalysisException: cannot resolve given input columns:

I am running into this error when I am trying to select a couple of columns from a temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried the queries below, but they fail with the same error:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in Databricks.
Databricks runtime version: 7.0
Apache Spark version: 3.0
Scala: 2.12
(i.e. "spark_version": "7.0.x-scala2.12")
Any help will be highly appreciated.
Thanks
The column name location_id does not exist as such in the temp view (the columns are registered under the prefixed names shown in the error); select * from cars_tmp works because you do not reference any column by name.
Please see this answer https://stackoverflow.com/a/64042756/8913402 with the same error handling.
I resolved the issue by adding each column explicitly to the pandas select query, so something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
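With the columns listed explicitly, the temp view exposes plain column names and the original query resolves; a minimal sketch continuing the snippet above:

df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
# location_id now resolves as a plain column name
spark.sql('select location_id from cars_tmp').show()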

How to create multiple temp views in spark using multiple data frame

I have 10 data frames and I want to create multiple temp views so that I can perform SQL operations on them using the createOrReplaceTempView command in PySpark.
This is probably what you're after.
source_tables = [
    'sql.production.dbo.table1',
    'sql.production.dbo.table2',
    'sql.production.dbo.table3',
    'sql.production.dbo.table4',
    'sql.production.dbo.table5',
    'sql.production.dbo.table6',
    'sql.production.dbo.table7',
    'sql.production.dbo.table8',
    'sql.production.dbo.table9',
    'sql.production.dbo.table10'
]

for source_table in source_tables:
    try:
        view_name = source_table.replace('.', '_')
        # Load the source table; this assumes Spark can resolve the name directly -- swap in your own reader if not.
        df = spark.read.table(source_table)
        # Lowercase all column names
        df = df.toDF(*[c.lower() for c in df.columns])
        df.createOrReplaceTempView(view_name)
    except Exception as e:
        print(e)
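Once the loop completes, each registered view can be queried by its derived name, for example (assuming the first table registered successfully):

spark.sql("select * from sql_production_dbo_table1").show()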

Glue creates duplicates of records, how to fix it?

Currently, we use Glue (Python scripts) for data migration from a MySQL database into a Redshift database.
Yesterday we found an issue: some records are duplicates; they have the same primary key that is used in the MySQL database. According to our requirements, all data in the Redshift database should be the same as in the MySQL database.
I tried to remove the Redshift table before migration, but didn't find a method for that...
Could you help me fix the issue?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "glue-db", table_name = "table", transformation_ctx = "datasource0")
applymapping0_1 = ApplyMapping.apply(frame = datasource0, mappings = [...], transformation_ctx = "applymapping0_1")
resolvechoice0_2 = ResolveChoice.apply(frame = applymapping0_1, choice = "make_cols", transformation_ctx = "resolvechoice0_2")
dropnullfields0_3 = DropNullFields.apply(frame = resolvechoice0_2, transformation_ctx = "dropnullfields0_3")
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "table", "database": "database"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
My solution is to add a preactions statement that empties the destination table before the load:
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "mytable", "database": "mydatabase", "preactions": "delete from public.mytable;"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
If your goal is not to have duplicates in the destination table, you can use the postactions option of the JDBC sink (see this answer for more details). Basically, it lets you implement a Redshift merge using a staging table.
For your case it should look like this (it replaces existing records):
post_actions = (
"DELETE FROM dest_table USING staging_table AS S WHERE dest_table.id = S.id;"
"INSERT INTO dest_table (id,name) SELECT id,name FROM staging_table;"
"DROP TABLE IF EXISTS staging_table"
)
datasink0_4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields0_3, catalog_connection = "redshift-cluster", connection_options = {"dbtable": "staging_table", "database": "database", "overwrite": "true", "postactions": post_actions}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink0_4")
Redshift does not enforce unique key constraints.
Unless you can guarantee that your source scripts avoid duplicates, you need to run a regular job to de-duplicate on Redshift, for example:
delete from yourtable
where id in
(
select id
from yourtable
group by 1
having count(*) >1
)
;
Did you consider DMS as an alternative to Glue? This could work better for you.

Spark SQL using Scala

val scc = spark.read.jdbc(url, table, properties)
val d = scc.createOrReplaceTempView("k")
spark.sql("select * from k").show()
If you look at the first line, we read the complete table, and only in the third line do we fetch the results of the desired query. Reading the complete table and then querying it takes a lot of time. Can't we execute our query while establishing the connection? Please help me if you have any prior knowledge about this.
Check this out: you can push the query down to the database by passing a subquery (with an alias) as the table argument, so only the query's result is read over JDBC.
var dbTable = "(select emp_no, concat_ws(' ', first_name, last_name) as full_name from employees) as employees_name";
Dataset<Row> jdbcDF = sparkSession.read().jdbc(CONNECTION_URL, dbTable, connectionProperties);
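For comparison with the PySpark examples elsewhere on this page, the same subquery pushdown works from Python as well; a sketch, where url and properties are placeholders for your own connection details:

# Push the projection down to the database by passing a subquery as the "table".
query = "(select emp_no, concat_ws(' ', first_name, last_name) as full_name from employees) as employees_name"
jdbc_df = spark.read.jdbc(url=url, table=query, properties=properties)
jdbc_df.show()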