Pyspark: updating a particular partition of an external Hive table

I am trying to overwrite a particular partition of a Hive table using PySpark, but each time I try, all the other partitions get wiped out. I went through a couple of posts here regarding this and implemented the steps, but it seems like I am still getting an error. The code I am using is:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
df.write.format('parquet').mode('overwrite').partitionBy('col1').option("partitionOverwriteMode", "dynamic").saveAsTable(op_dbname+'.'+op_tblname)
Initially the partitions are col1=m and col1=n, and while I am trying to overwrite only the partition col1=m, it wipes out col1=n as well.
Spark version is 2.4.4
Appreciate any help.

After multiple rounds of trial and error, this is the method that worked for me:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
op_df.write.format(op_fl_frmt).mode(operation_mode).option("partitionOverwriteMode", "dynamic").insertInto(op_dbname+'.'+op_tblname,overwrite=True)
When I tried to use saveAsTable, no matter what I did it always wiped out all the values. Setting only the flag 'spark.sql.sources.partitionOverwriteMode' to dynamic did not seem to be enough either. Hence I am using insertInto along with an overwrite flag inside it to achieve the desired output; a fuller sketch is shown below.
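For context, a minimal end-to-end sketch of this approach (the table name, path, and column names here are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Replace only the partitions present in the incoming dataframe,
# instead of truncating the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Hypothetical dataframe holding only rows for the col1='m' partition.
op_df = spark.read.parquet("/staging/new_data").filter("col1 = 'm'")

# insertInto resolves columns by position, so the dataframe must match the
# target table's column order, with the partition column(s) last.
op_df.select("col2", "col3", "col1").write.insertInto("op_db.op_tbl", overwrite=True)
With this combination, only the col1='m' partition is rewritten and col1='n' is left untouched.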

Related

Is there a way for PySpark to give user warning when executing a query on Apache Hive table without specifying partition keys?

We are using Spark SQL with Apache Hive tables (via the AWS Glue Data Catalog). One problem is that when we execute a Spark SQL query without specifying the partitions to read via the WHERE clause, we get no warning that it will proceed to load all partitions and thus likely time out or fail.
Is there a way to ideally error out, or at least give some warning, when a user executes a Spark SQL query on an Apache Hive table without specifying partition keys? It's very easy to forget to do this.
I searched for existing solutions to this and found none, both on Stack Overflow and on the wider internet. I was expecting some configuration option/code that would help me achieve the goal.
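To make the failure mode concrete, here is a small hedged illustration (the database, table, and partition column names are made up): the first query prunes partitions, the second forces Spark to list and scan every partition of the Hive table.
# Partition-pruned: only the ds='2023-01-01' partition is listed and read.
spark.sql("SELECT count(*) FROM glue_db.events WHERE ds = '2023-01-01'").show()

# No partition filter: Spark loads metadata and data for every partition,
# which is the case the question would like to warn about or block.
spark.sql("SELECT count(*) FROM glue_db.events").show()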

Insert or update a Scala dataframe in Postgres

I have a lookup table that needs to be updated from my Spark code. There are multiple Spark jobs using the same lookup table.
I have a dataframe that needs to be inserted or updated in a Postgres table. I can't do it using Spark's native Postgres support because it only supports appends and overwrites, whereas I have a clear case of insert-or-update.
Things I have tried:
I have tried using Slick, but I am unable to get anything working and the documentation seemed confusing.
I also tried using foreachPartition, iterating over every row and running the query via Slick's db.run functionality; still not working and no errors.
The thing is, I'm unable to get a working solution. Can someone share a code snippet showing how they got this working? It seems like a pretty common use case, but I'm stuck.
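Not the Slick approach the question asks about, but as a sketch of the foreachPartition pattern, here is a minimal PySpark version that does insert-or-update with psycopg2 and Postgres's ON CONFLICT clause (connection settings, table, and column names are placeholders):
import psycopg2  # assumed to be installed on the executors

def upsert_partition(rows):
    # One connection per partition; connection details are hypothetical.
    conn = psycopg2.connect(host="pg-host", dbname="mydb", user="user", password="secret")
    cur = conn.cursor()
    for row in rows:
        # Insert, or update the existing row keyed on the primary key `id`.
        cur.execute(
            "INSERT INTO lookup_table (id, value) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value",
            (row.id, row.value),
        )
    conn.commit()
    cur.close()
    conn.close()

df.foreachPartition(upsert_partition)
For larger partitions, batching the statements (for example with psycopg2.extras.execute_values) would cut down the number of round trips.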

Deltalake - Merge is creating too many files per partition

I have a table that ingests new data every day through a merge. I'm currently trying to migrate from ORC to the Delta file format and I stumbled upon a problem when processing the following simple merge operation:
DeltaTable
  .forPath(sparkSession, deltaTablePath)
  .alias(SOURCE)
  .merge(
    rawDf.alias(RAW),
    joinClause // using primary keys and when possible partition keys. Not important here I guess
  )
  .whenNotMatched().insertAll()
  .whenMatched(dedupConditionUpdate).updateAll()
  .whenMatched(dedupConditionDelete).delete()
  .execute()
When the merge is done, every impacted partition has hundreds of new files. As there is one new ingestion per day, this behaviour makes every following merge operation slower and slower.
Versions:
Spark: 2.4
Delta Lake: 0.6.1
Is there a way to repartition before saving? Or any other way to improve this?
After searching a bit in Delta core's code, I found an option that repartitions on write:
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")
You should set the delta.autoOptimize.autoCompact property on the table for auto compaction.
The following page shows how you can set it for existing and new tables:
https://docs.databricks.com/delta/optimizations/auto-optimize.html
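As a hedged sketch, both knobs can be set from a Spark session like this (the Delta table path is a placeholder, and auto compaction is a Databricks feature):
# Repartition the merge output by the table's partition columns before writing,
# so each impacted partition ends up with far fewer files.
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")

# Enable auto compaction on an existing Delta table; the path is a placeholder.
spark.sql("ALTER TABLE delta.`/mnt/lake/my_table` SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)")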

Upsert to Phoenix table in Apache Spark

Looking to find out whether anybody has found a way to perform upserts (append / update / partial insert/update) on Phoenix using Apache Spark. As per the Phoenix documentation, only SaveMode.Overwrite is supported, which is an overwrite with a full load. I tried changing the mode and it throws an error.
Currently we have *.hql jobs running to perform this operation; now we want to rewrite them in Spark Scala. Thanks for sharing your valuable inputs.
While the Phoenix connector indeed supports only SaveMode.Overwrite, the implementation doesn't conform to the Spark standard, which states that:
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame
If you check the source, you'll see that saveToPhoenix just calls saveAsNewAPIHadoopFile with PhoenixOutputFormat, which internally builds the UPSERT query for you.
In other words, SaveMode.Overwrite with the Phoenix connector is in fact UPSERT.
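For reference, a minimal write through the phoenix-spark connector looks roughly like this (the table name and ZooKeeper URL are placeholders); despite the "overwrite" mode it issues UPSERTs:
# Each row is upserted into the Phoenix table keyed on its primary key.
df.write \
  .format("org.apache.phoenix.spark") \
  .mode("overwrite") \
  .option("table", "MY_PHOENIX_TABLE") \
  .option("zkUrl", "zk-host:2181") \
  .save()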

Apache Spark 1.3 dataframe saveAsTable to a database other than default

I am trying to save a dataframe as a table using saveAsTable, and it works, but I want to save the table to a database other than the default one. Does anyone know if there is a way to set the database to use? I tried hiveContext.sql("use db_name") and this did not seem to do it. There is a saveAsTable overload that takes some options. Is there a way I can do it with the options?
It does not look like you can set the database name yet... if you read the HiveContext.scala code you see a lot of comments like:
// TODO: Database support...
So I am guessing that it's not supported yet.
Update:
In Spark 1.5.1 the following works, which did not work in earlier versions. In earlier versions you had to use a USING statement as in deformitysnot's answer.
df.write.format("parquet").mode(SaveMode.Append).saveAsTable("databaseName.tablename")
This was fixed in Spark 1.5, and you can do it using:
hiveContext.sql("USE sparkTables");
dataFrame.saveAsTable("tab3", "orc", SaveMode.Overwrite);
By the way, in Spark 1.5 you can read Spark-saved dataframes from the Hive command line (beeline, ...), something that was impossible in earlier versions.
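For completeness, rough PySpark equivalents of the snippets above (database and table names are placeholders):
# Qualified "database.table" name, supported from Spark 1.5.1 onwards.
df.write.format("parquet").mode("append").saveAsTable("databaseName.tablename")

# Or switch the current database first and use an unqualified name.
sqlContext.sql("USE databaseName")
df.write.format("orc").mode("overwrite").saveAsTable("tablename")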