Databricks Spark-Redshift: Sortkeys not working - scala

I am trying to add the sort keys from Scala code by following the instructions here: https://github.com/databricks/spark-redshift
df.write
.format(formatRS)
.option("url", connString)
.option("jdbcdriver", jdbcDriverRS)
.option("dbtable", table)
.option("tempdir", tempDirRS + table)
.option("usestagingtable", "true")
.option("diststyle", "KEY")
.option("distkey", "id")
.option("sortkeyspec", "INTERLEAVED SORTKEY (id,timestamp)")
.mode(mode)
.save()
The sort keys seem to be implemented wrong, because when I check the table info I get:
sort key = INTERLEAVED
I need the right way to add the sort keys.

There is nothing wrong with the implementation; the problem is with the "checking query", which returns
sort key = interleaved
and is confusing enough to make you believe that something is going wrong.
So if you need to check the interleaved sort keys, you should run this query instead:
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;

Related

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join more than 1 column for aliased tables. Thx!!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")
Try using it as below:
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on=["id", "cpid"], how="inner")
Your join condition is overcomplicated. It can be as simple as this
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')

Delete in Apache Hudi - Glue Job

I have to build a Glue job for updating and deleting old rows in an Athena table.
When I run my job for deleting, it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
    'hoodie.table.name': 'test_table_output',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': 'test_table_output',
    'hoodie.datasource.write.operation': 'delete',
    'hoodie.datasource.write.precombine.field': 'name',
    'hoodie.upsert.shuffle.parallelism': 1,
    'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have two data sources: the first is the old Athena table whose data has to be updated or deleted, and the second is the table into which the new updated or deleted data arrives.
In ds I have selected all the rows that have to be deleted in the old table.
op stands for operation: 'D' for delete, 'U' for update.
Does anyone know what I am missing here?
The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are UPSERT/Insert/Bulk_insert. Check the Hudi docs.
Also, what is your intention for deleting records: a hard delete or a soft delete?
For a hard delete, you have to provide:
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}
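For reference, here is roughly how that payload class option could be attached to the write. This is a Scala sketch, untested; deleteDF is a hypothetical DataFrame holding the rows to remove, and the table name, key fields, and S3 path are the ones from the question. In the PySpark job above, the same key can simply be added to the hudi_delete_options dict.
// Sketch: hard delete by writing the keys to remove with an empty record payload.
deleteDF.write
  .format("hudi")
  .option("hoodie.table.name", "test_table_output")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "name")
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.EmptyHoodieRecordPayload")
  .mode("append")
  .save("s3://data/test-output/")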

Sparksql using scala

val scc = spark.read.jdbc(url, table, properties)
val d = scc.createOrReplaceTempView("k")
spark.sql("select * from k").show()
If you observe, at #1 we read the complete table and then at #3 we fetch the results with the desired query. Reading the complete table and then querying it takes a lot of time. Can't we execute our query while establishing the connection? Please help me if you have any prior knowledge about this.
Check this out.
var dbTable =
    "(select emp_no, concat_ws(' ', first_name, last_name) as full_name from employees) as employees_name";
Dataset<Row> jdbcDF =
    sparkSession.read().jdbc(CONNECTION_URL, dbTable, connectionProperties);
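If you prefer to stay in Scala, the same pushdown idea works by passing the subquery string as the table argument to spark.read.jdbc. A minimal sketch, assuming the url and properties from the question and the employees table from the Java snippet above:
// Sketch: the subquery is pushed down to the database; Spark only receives its result.
// The surrounding parentheses and the alias are required because the string is used as a derived table.
val pushdownQuery =
  "(select emp_no, concat_ws(' ', first_name, last_name) as full_name from employees) as employees_name"
val jdbcDF = spark.read.jdbc(url, pushdownQuery, properties)
jdbcDF.show()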

Spark: strange nullPointerException when extracting data from PostgreSQL

I'm working with PostgreSQL 9.6 and Spark 2.0.0.
I want to create a DataFrame from a PostgreSQL table, as follows:
val query =
"""(
SELECT events.event_facebook_id,
places.placeid, places.likes as placelikes,
artists.facebookId, artists.likes as artistlikes
FROM events
LEFT JOIN eventsplaces on eventsplaces.event_id = events.event_facebook_id
LEFT JOIN places on eventsplaces.event_id = places.facebookid
LEFT JOIN eventsartists on eventsartists.event_id = events.event_facebook_id
LEFT JOIN artists on eventsartists.artistid = artists.facebookid) df"""
The query is valid (if I run it in psql, I don't get any error), but with Spark,
if I execute the following code, I get a NullPointerException:
sqlContext
.read
.format("jdbc")
.options(
Map(
"url" -> claudeDatabaseUrl,
"dbtable" -> query))
.load()
.show()
If, in the query, I change artists.facebookId to another column such as artists.description (which, unlike facebookId, can be null), the exception disappears.
I find this very strange; any idea?
You have two different spellings of facebookId in your query: artists.facebook[I]d and artists.facebook[i]d.
Please try to use the correct one.
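For illustration, the corrected query with one consistent spelling might look like this. A sketch only: the correct casing depends on how the column is actually defined in PostgreSQL, where unquoted identifiers are folded to lower case, so lower case is assumed here.
// Sketch: one consistent, lower-case spelling of the Facebook id column.
val query =
  """(
    SELECT events.event_facebook_id,
           places.placeid, places.likes AS placelikes,
           artists.facebookid, artists.likes AS artistlikes
    FROM events
    LEFT JOIN eventsplaces ON eventsplaces.event_id = events.event_facebook_id
    LEFT JOIN places ON eventsplaces.event_id = places.facebookid
    LEFT JOIN eventsartists ON eventsartists.event_id = events.event_facebook_id
    LEFT JOIN artists ON eventsartists.artistid = artists.facebookid) df"""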

How to correctly use the querybuilder in order to do a subselect?

I would like to do a subselect in order to express the following PostgreSQL query with the QueryBuilder:
SELECT i.* FROM internship i
WHERE EXISTS (SELECT iw.*
FROM internship_weeks iw
WHERE i.id = iw.internship)
Does anyone have an idea how to get the same result with the QueryBuilder, or maybe with DQL?
Thanks for the help!
As an example, just to demonstrate how to use a subquery select statement inside a select statement, suppose we want to find all internships that have at least one week registered (i.e. a matching record exists in the internship_weeks table):
// get an ExpressionBuilder instance
$expr = $this->_em->getExpressionBuilder();
// create the subquery
$sub = $this->_em->createQueryBuilder()
    ->select('iw')
    ->from(InternshipWeek::class, 'iw')
    ->where('i.id = iw.internship');
// main query: internships for which the subquery returns at least one row
$qb = $this->_em->createQueryBuilder()
    ->select('i')
    ->from(Internship::class, 'i')
    ->where($expr->exists($sub->getDQL()));
return $qb->getQuery()->getResult();
Hope this helps.