Spark dataset overwrite of a particular partition not working in Spark 2.4 - Scala

The final step of my job is to store the processed data in a Hive table partitioned on the "date" column.
Sometimes, due to a job failure, I need to re-run the job for a particular partition alone.
As observed, when I use the code below with overwrite mode, Spark overwrites all the partitions.
ds.write.partitionBy("date").mode("overwrite").saveAsTable("test.someTable")
After going through multiple blogs and Stack Overflow posts, I followed the steps below to overwrite only particular partitions.
Step 1: Enable dynamic partition overwrite mode
spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")
Step 2: Write the dataframe to the Hive table using saveAsTable
Seq(("Company1", "A"),
("Company2","B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable(targetTable)
spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)
Seq(("CompanyA3", "A"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.insertInto(targetTable)
spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)
It still overwrites all the partitions.
According to this blog post, https://www.waitingforcode.com/apache-spark-sql/apache-spark-sql-hive-insertinto-command/read, "insertInto" should overwrite only the particular partitions.
If I create the table first and then use the "insertInto" method, it works fine:
Set the required configuration, then:
Step 1: Create the table
Step 2: Add data using the insertInto method
Step 3: Overwrite the partition (a rough sketch of this flow is shown below)
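For reference, a rough sketch of that working flow; the configuration keys and the table schema below are assumptions based on the example above, not the exact production code:

import org.apache.spark.sql.SaveMode
import spark.implicits._   // for .toDF

// Assumed configuration for dynamic-partition inserts into a Hive table
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// Step 1: create the partitioned table manually
spark.sql(
  s"""CREATE TABLE IF NOT EXISTS ${targetTable} (company STRING)
     |PARTITIONED BY (id STRING)
     |STORED AS PARQUET""".stripMargin)

// Step 2: add data using the insertInto method
Seq(("Company1", "A"), ("Company2", "B"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)

// Step 3: overwrite only partition id = "A"; partition id = "B" stays intact
Seq(("CompanyA3", "A"))
  .toDF("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(targetTable)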
I want to know: what is the difference between creating the Hive table via saveAsTable and creating the table manually? Why does it not work in the first scenario?
Could anyone help me with this?

Try with lowercase w!
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
not
spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")
It fooled me. You have two variations of the setting in use in your script, if you look.
It appears my original answer has since been deprecated.

Related

Hudi data gets overwritten on every new batch of Spark Structured Streaming

I am working on a Spark Structured Streaming job that consumes Kafka messages, performs an aggregation, and saves the data to an Apache Hudi table every 10 seconds. The code below works, but it overwrites the data in the resulting Hudi table on every batch. I have not yet figured out why this is happening. Is it Spark Structured Streaming or Hudi behavior? I am using MERGE_ON_READ, so the table files should not be deleted on every update, but I don't know why this happens. Because of this issue, another job of mine that reads this table fails.
spark.readStream
    .format('kafka')
    .option("kafka.bootstrap.servers", "localhost:9092")
    ...
    ...
df1 = df.groupby('a', 'b', 'c').agg(sum('d').alias('d'))
df1.writeStream
    .format('org.apache.hudi')
    .option('hoodie.table.name', 'table1')
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option('hoodie.datasource.write.keygenerator.class', 'org.apache.hudi.keygen.ComplexKeyGenerator')
    .option('hoodie.datasource.write.recordkey.field', "a,b,c")
    .option('hoodie.datasource.write.partitionpath.field', 'a')
    .option('hoodie.datasource.write.table.name', 'table1')
    .option('hoodie.datasource.write.operation', 'upsert')
    .option('hoodie.datasource.write.precombine.field', 'c')
    .outputMode('complete')
    .option('path', '/Users/lucy/hudi/table1')
    .option("checkpointLocation", "/Users/lucy/checkpoint/table1")
    .trigger(processingTime="10 second")
    .start()
    .awaitTermination()
Based on your configuration, the explanation for this problem may be that you read the same keys at each batch (the same a, b, c with different values of d), and since you have an upsert operation, Hudi replaces the old values with the new ones. Try using insert instead of upsert, or modify the Hudi key, depending on what you want to do.
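For illustration, a minimal sketch of the change (written in Scala for consistency with the rest of this page; the option key and value are the same strings in PySpark, and df1 stands for the aggregated stream from the question):

// Switch the Hudi write operation from upsert to insert so that existing
// records with the same (a, b, c) key are kept instead of being replaced.
df1.writeStream
  .format("org.apache.hudi")
  .option("hoodie.table.name", "table1")
  .option("hoodie.datasource.write.operation", "insert")  // was "upsert"
  // ... keep the remaining options from the question unchanged ...
  .option("checkpointLocation", "/Users/lucy/checkpoint/table1")
  .start()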

Eliminate duplicates (deduplication) in Streaming DataFrame

I have a Spark streaming processor.
The DataFrame dfNewExceptions has duplicates (duplicated by "ExceptionId").
Since this is a streaming dataset, the query below fails:
val dfNewUniqueExceptions = dfNewExceptions.sort(desc("LastUpdateTime"))
.coalesce(1)
.dropDuplicates("ExceptionId")
val dfNewExceptionCore = dfNewUniqueExceptions.select("ExceptionId", "LastUpdateTime")
dfNewExceptionCore.writeStream
.format("console")
// .outputMode("complete")
.option("truncate", "false")
.option("numRows",5000)
.start()
.awaitTermination(1000)
Exception in thread "main" org.apache.spark.sql.AnalysisException: Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;
This is also documented here: https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/structured-streaming-programming-guide.html
Any suggestions on how the duplicates can be removed from dfNewExceptions?
I recommend following the approach explained in the Structured Streaming Guide on Streaming Deduplication. There it says:
You can deduplicate records in data streams using a unique identifier in the events. This is exactly same as de-duplication on static using a unique identifier column. The query will store the necessary amount of data from previous records such that it can filter duplicate records. Similar to aggregations, you can use de-duplication with or without watermarking.
With watermark - If there is an upper bound on how late a duplicate record may arrive, then you can define a watermark on an event time column and deduplicate using both the guid and the event time columns. The query will use the watermark to remove old state data from past records that are not expected to get any duplicates any more. This bounds the amount of the state the query has to maintain.
An example in Scala is also given:
val dfExceptions = spark.readStream. ... // columns: ExceptionId, LastUpdateTime, ...
dfExceptions
.withWatermark("LastUpdateTime", "10 seconds")
.dropDuplicates("ExceptionId", "LastUpdateTime")
You can use watermarking to drop duplicates in a specific timeframe.
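Applied to the question's DataFrame, a minimal sketch could look like this (the 10-second watermark is an assumption; size it to how late a duplicate can realistically arrive):

// Deduplicate within the watermark window instead of sort + coalesce + dropDuplicates
val dfNewUniqueExceptions = dfNewExceptions
  .withWatermark("LastUpdateTime", "10 seconds")
  .dropDuplicates("ExceptionId", "LastUpdateTime")

// The rest of the original query stays the same and works in the default append mode
dfNewUniqueExceptions
  .select("ExceptionId", "LastUpdateTime")
  .writeStream
  .format("console")
  .option("truncate", "false")
  .option("numRows", 5000)
  .start()
  .awaitTermination(1000)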

AWS Glue - GlueContext: read partitioned data from S3, add partitions as columns of DynamicFrame

I have some data stored in an S3 bucket in parquet format, following a hive-like partitioning style, with these partition keys: retailer - year - month - day.
E.g.
my-bucket/
    retailer=a/
        year=2020/
        ....
    retailer=b/
        year=2020/
            month=2/
            ...
I want to read all of this data in a SageMaker notebook, and I want to have the partitions as columns of my DynamicFrame, so that when I call df.printSchema() they are included.
If I use Glue's suggested method, the partitions don't get included in my schema. Here's the code I'm using:
df = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://my-bucket/'],
        "partitionKeys": [
            "retailer",
            "year",
            "month",
            "day"
        ]
    },
    format='parquet'
)
By using normal Spark code and the DataFrame class instead, it works, and the partitions get included in my schema:
df = spark.read.parquet('s3://my-bucket/')
I wonder if there is a way to do it with AWS Glue's specific methods or not.
Maybe you could try crawling the data and reading it using the from_catalog option. Although I would think you don't need to mention the partition keys, since it should see that "=" means it's a partition, especially considering Glue is just a wrapper around Spark.
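For example, once a crawler has populated the Data Catalog, something along these lines should pick up the partition columns (the database and table names below are placeholders; the snippet uses Glue's Scala API for consistency with the rest of this page, and the PySpark equivalent is glueContext.create_dynamic_frame.from_catalog):

// Sketch only: "my_db" / "my_table" are whatever the crawler created
val dyf = glueContext
  .getCatalogSource(database = "my_db", tableName = "my_table")
  .getDynamicFrame()

// retailer, year, month and day should now show up as columns
dyf.printSchema()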

How to save data in parquet format and append entries

I am trying to follow this example to save some data in parquet format and read it back. If I use write.parquet("filename"), then the iterating Spark job gives an error that
"filename" already exists.
If I use the SaveMode.Append option, then the Spark job gives the error
"org.apache.spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables".
Please let me know the best way to ensure new data is just appended to the parquet file. Can I define primary keys on these parquet tables?
I am using Spark 1.6.2 on a Hortonworks 2.5 system. Here is the code:
// Option 1: peopleDF.write.parquet("people.parquet")
//Option 2:
peopleDF.write.format("parquet").mode(SaveMode.Append).saveAsTable("people.parquet")
// Read in the parquet file created above
val parquetFile = spark.read.parquet("people.parquet")
//Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT * FROM people.parquet")
I believe that if you use .parquet("...."), you should use .mode("append"),
not SaveMode.Append:
df.write.mode("append").parquet("....")
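For reference, a rough Spark 1.6-style sketch of that approach, writing to a parquet path (a directory, not a dotted table name) and reading it back through the existing sqlContext:

// Append new rows to the existing parquet directory instead of overwriting it
peopleDF.write.mode("append").parquet("people.parquet")

// Read the directory back, register it, and query the registered name
val parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.registerTempTable("parquetFile")
val allPeople = sqlContext.sql("SELECT * FROM parquetFile")
allPeople.show()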

Output Sequence while writing to HDFS using Apache Spark

I am working on a project in Apache Spark where the requirement is to write the processed output from Spark in a specific format: Header -> Data -> Trailer. For writing to HDFS I am using the .saveAsHadoopFile method and writing the data to multiple files, using the key as the file name. But the issue is that the sequence of the data is not maintained; files are written as Data -> Header -> Trailer or a different combination of the three. Is there anything I am missing with the RDD transformations?
OK, so after reading Stack Overflow questions, blogs and mail archives from Google, I found out how exactly .union() and other transformations work and how partitioning is managed. When we use .union(), the partition information and the ordering are lost in the resulting RDD, and that's why my output sequence was not being maintained.
What I did to overcome the issue was numbering the records, like
Header = 1, Body = 2, and Footer = 3,
and then using sortBy on the RDD which is the union of all three, sorting it by this order number into 1 partition. After that, to write to multiple files using the key as the filename, I used a HashPartitioner so that data for the same key goes into its own file.
val header: RDD[(String,(String,Int))] = ... // this is my header RDD
val data: RDD[(String,(String,Int))] = ... // this is my data RDD
val footer: RDD[(String,(String,Int))] = ... // this is my footer RDD
val finalRDD: RDD[(String,String)] = header.union(data).union(footer).sortBy(x=>x._2._2,true,1).map(x => (x._1,x._2._1))
val output: RDD[(String,String)] = new PairRDDFunctions[String,String](finalRDD).partitionBy(new HashPartitioner(num))
output.saveAsHadoopFile ... // and using MultipleTextOutputFormat save to multiple file using key as filename
This might not be the final or most economical solution, but it worked. I am also trying to find other ways to maintain the output sequence Header -> Body -> Footer. I also tried .coalesce(1) on all three RDDs and then doing the union, but that just added three more transformations to the RDDs; the .sortBy function also takes partition information, which I thought would be the same, but coalescing the RDDs first also worked. If anyone has another approach, please let me know or add to this; it would be really helpful as I am new to Spark.
References:
Write to multiple outputs by key Spark - one Spark job
Ordered union on spark RDDs
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-td766.html -- this one helped a lot