Hudi data is overwritten on every new batch of Spark Structured Streaming - pyspark

I am working on a Spark Structured Streaming job that consumes Kafka messages, aggregates them, and saves the result to an Apache Hudi table every 10 seconds. The code below works, but it overwrites the resulting Hudi table data on every batch. I have not yet figured out why this happens. Is it Spark Structured Streaming behavior or Hudi behavior? I am using MERGE_ON_READ, so the table files should not be deleted on every update, but I don't know why it is happening. Because of this issue, another job of mine that reads this table fails.
spark.readStream
    .format('kafka')
    .option("kafka.bootstrap.servers", "localhost:9092")
    ...
    ...
df1 = df.groupby('a', 'b', 'c').agg(sum('d').alias('d'))
df1.writeStream
    .format('org.apache.hudi')
    .option('hoodie.table.name', 'table1')
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option('hoodie.datasource.write.keygenerator.class', 'org.apache.hudi.keygen.ComplexKeyGenerator')
    .option('hoodie.datasource.write.recordkey.field', "a,b,c")
    .option('hoodie.datasource.write.partitionpath.field', 'a')
    .option('hoodie.datasource.write.table.name', 'table1')
    .option('hoodie.datasource.write.operation', 'upsert')
    .option('hoodie.datasource.write.precombine.field', 'c')
    .outputMode('complete')
    .option('path', '/Users/lucy/hudi/table1')
    .option("checkpointLocation", "/Users/lucy/checkpoint/table1")
    .trigger(processingTime="10 second")
    .start()
    .awaitTermination()

Based on your configuration, a likely explanation is that each batch contains the same keys (the same a, b, c with a different value of d), and because the operation is upsert, Hudi replaces the old values with the new ones. Try using insert instead of upsert, or change the Hudi record key, depending on what you want to achieve.
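The effect described in this answer can be modeled in plain Python, without Spark or Hudi. The sketch below is only an illustration under stated assumptions: `complete_mode_batches` and `upsert` are hypothetical helpers, the table is a dict keyed on (a, b, c), and the running aggregation stands in for `complete` output mode, which emits the whole aggregated result on every trigger.

```python
from collections import defaultdict


def complete_mode_batches(batches):
    """Yield the full aggregated result after each micro-batch (model of 'complete' mode)."""
    totals = defaultdict(int)
    for batch in batches:
        for a, b, c, d in batch:
            totals[(a, b, c)] += d
        # Every group is emitted again, not only the groups that changed.
        yield dict(totals)


def upsert(table, rows):
    """Model of an upsert keyed on (a, b, c): existing keys are replaced, new keys are added."""
    table.update(rows)
    return table


# Hypothetical input: two micro-batches of (a, b, c, d) records.
batches = [
    [("x", 1, 1, 10), ("y", 1, 1, 5)],
    [("x", 1, 1, 7)],
]

table = {}
history = []
for result in complete_mode_batches(batches):
    upsert(table, result)
    history.append(dict(table))

print(history[0])  # {('x', 1, 1): 10, ('y', 1, 1): 5}
print(history[1])  # {('x', 1, 1): 17, ('y', 1, 1): 5}
```

Because every batch carries the same keys, the value stored for each key is replaced on every write, which from the outside can look like the table being overwritten.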

Related

Apache Hudi - How to understand the hudi write operation vs spark savemode?

How should I understand a Hudi write operation of upsert combined with a DataFrame save mode of append? Since the write upserts the records, why is the mode append instead of overwrite? What is the difference?
As shown in these examples:
Example: Upsert a DataFrame, specifying the necessary field names for recordKey => _row_key, partitionPath => partition, and precombineKey => timestamp
inputDF.write()
.format("org.apache.hudi")
.options(clientOpts) //Where clientOpts is of type Map[String, String]. clientOpts can include any other options necessary.
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
// spark-shell
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
When you use overwrite mode, you tell Spark to delete the table and recreate it (or only the partitions present in your new DataFrame if you use dynamic partitionOverwriteMode).
When you use append mode, Spark appends the new data to the existing data on disk or cloud storage. With Hudi you can additionally specify an operation that merges the two versions of the data: it updates old records whose key is present in the new data, keeps old records whose key is not present in the new data, and adds new records with new keys. This is entirely different from overwriting the data.
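The difference between the two behaviors can be sketched in plain Python. This is a simplified model, not Hudi or Spark code: tables are dicts from record key to value, and `overwrite` and `append_with_upsert` are hypothetical names chosen for illustration.

```python
def overwrite(existing, incoming):
    """Overwrite: the old table content is discarded and replaced by the new rows."""
    return dict(incoming)


def append_with_upsert(existing, incoming):
    """Append + upsert: update matching keys, keep unmatched old keys, add new keys."""
    merged = dict(existing)
    merged.update(incoming)
    return merged


existing = {"k1": "old-1", "k2": "old-2"}
incoming = {"k2": "new-2", "k3": "new-3"}

print(overwrite(existing, incoming))
# {'k2': 'new-2', 'k3': 'new-3'}
print(append_with_upsert(existing, incoming))
# {'k1': 'old-1', 'k2': 'new-2', 'k3': 'new-3'}
```

In the second result, k1 survives untouched, k2 is updated, and k3 is added, which are the three cases listed in the answer.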

Eliminate duplicates (deduplication) in Streaming DataFrame

I have a Spark streaming processor.
The Dataframe dfNewExceptions has duplicates (duplicate by "ExceptionId").
Since this is a streaming dataset, the below query fails:
val dfNewUniqueExceptions = dfNewExceptions.sort(desc("LastUpdateTime"))
.coalesce(1)
.dropDuplicates("ExceptionId")
val dfNewExceptionCore = dfNewUniqueExceptions.select("ExceptionId", "LastUpdateTime")
dfNewExceptionCore.writeStream
.format("console")
// .outputMode("complete")
.option("truncate", "false")
.option("numRows",5000)
.start()
.awaitTermination(1000)
Exception in thread "main" org.apache.spark.sql.AnalysisException: Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;
This is also documented here: https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/structured-streaming-programming-guide.html
Any suggestions on how the duplicates can be removed from dfNewExceptions?
I recommend following the approach explained in the Structured Streaming Guide section on Streaming Deduplication, which says:
You can deduplicate records in data streams using a unique identifier in the events. This is exactly same as de-duplication on static using a unique identifier column. The query will store the necessary amount of data from previous records such that it can filter duplicate records. Similar to aggregations, you can use de-duplication with or without watermarking.
With watermark - If there is an upper bound on how late a duplicate record may arrive, then you can define a watermark on an event time column and deduplicate using both the guid and the event time columns. The query will use the watermark to remove old state data from past records that are not expected to get any duplicates any more. This bounds the amount of the state the query has to maintain.
An example in Scala is also given:
val dfExceptions = spark.readStream. ... // columns: ExceptionId, LastUpdateTime, ...
dfExceptions
.withWatermark("LastUpdateTime", "10 seconds")
.dropDuplicates("ExceptionId", "LastUpdateTime")
You can use watermarking to drop duplicates in a specific timeframe.
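To make the idea of watermark-bounded state more concrete, here is a simplified plain-Python model. It is not how Spark implements it: the `deduplicate` helper is hypothetical, and the watermark advances per record here purely for brevity, whereas Spark advances it between micro-batches.

```python
from datetime import datetime, timedelta


def deduplicate(events, delay):
    """Drop repeated (id, event_time) pairs, forgetting state older than the watermark."""
    seen = set()
    max_event_time = None
    output = []
    for event_id, event_time in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay
        # Evict state that is older than the watermark to keep it bounded.
        seen = {key for key in seen if key[1] >= watermark}
        if event_time < watermark:
            continue  # arrived later than the allowed delay: dropped
        key = (event_id, event_time)
        if key not in seen:
            seen.add(key)
            output.append(key)
    return output


t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    ("e1", t0),
    ("e1", t0),                         # exact duplicate, removed
    ("e2", t0 + timedelta(seconds=5)),
    ("e1", t0 + timedelta(seconds=3)),  # same id, different time, kept
]
result = deduplicate(events, delay=timedelta(seconds=10))
print(len(result))  # 3
```

Note that deduplicating on both the identifier and the event time keeps rows that share an ExceptionId but differ in LastUpdateTime, as the last event in the example shows.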

spark dataset overwrite particular partition not working in spark 2.4

In my job, the final step is to store the processed data in a Hive table partitioned on the "date" column.
Sometimes, when the job fails, I need to re-run it for one particular partition only.
I observed that with the code below, Spark overwrites all partitions when using overwrite mode.
ds.write.partitionBy("date").mode("overwrite").saveAsTable("test.someTable")
After going through multiple blogs and Stack Overflow posts, I followed the steps below to overwrite only particular partitions.
Step 1: Enable dynamic partition overwrite mode
spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")
Step 2: Write the DataFrame to the Hive table using saveAsTable
Seq(("Company1", "A"),
("Company2","B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable(targetTable)
spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)
Seq(("CompanyA3", "A"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.insertInto(targetTable)
spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)
It still overwrites all the partitions.
According to this blog, https://www.waitingforcode.com/apache-spark-sql/apache-spark-sql-hive-insertinto-command/read, insertInto should overwrite only particular partitions.
If I create the table first and then use the insertInto method, it works fine:
Set the required configuration,
Step 1: Create the table
Step 2: Add data using the insertInto method
Step 3: Overwrite the partition
I would like to know what the difference is between creating the Hive table via saveAsTable and creating the table manually. Why does it not work in the first scenario?
Could anyone help me with this?
Try with lowercase w!
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
not
spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")
It fooled me. You have 2 variations in use in your scripting if you look.
It appears my original answer is now deprecated.
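For readers unsure what the setting changes, the difference between the two partition overwrite modes can be modeled in plain Python. This is a sketch under assumptions, not Spark code: `overwrite_partitions` is a hypothetical helper, and a table is a dict from partition value to its rows.

```python
def overwrite_partitions(table, new_rows, mode):
    """Model an overwrite write; table and new_rows map partition value -> list of rows."""
    if mode == "static":
        # Static: all existing partitions are dropped, only the new data remains.
        return {p: list(rows) for p, rows in new_rows.items()}
    if mode == "dynamic":
        # Dynamic: only partitions present in the new data are replaced.
        result = {p: list(rows) for p, rows in table.items()}
        for p, rows in new_rows.items():
            result[p] = list(rows)
        return result
    raise ValueError(f"unknown mode: {mode}")


table = {"A": ["Company1"], "B": ["Company2"]}
new_rows = {"A": ["CompanyA3"]}

print(overwrite_partitions(table, new_rows, "static"))
# {'A': ['CompanyA3']}
print(overwrite_partitions(table, new_rows, "dynamic"))
# {'A': ['CompanyA3'], 'B': ['Company2']}
```

If the configuration key is misspelled, the intended dynamic setting is never applied and the write behaves like the static case, which matches the symptom in the question.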

Spark Structured Streaming groupByKey on a time Window not working

I need to batch my Kafka stream into time windows of 10 minutes each and then run some batch processing on it.
Note: records below have a timestamp field
val records = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerPool)
.option("subscribe", topic)
.option("startingOffsets", kafkaOffset)
.load()
I add a time window to each record using,
.withColumn("window", window($"timing", windowDuration))
I created some helper classes like
case class TimingWindow(
start: java.sql.Timestamp,
end: java.sql.Timestamp
)
case class RecordWithWindow(
record: MyRecord,
groupingWindow: TimingWindow
)
Now I have a DF of type [RecordWithWindow]
All this works very well.
Next,
metricsWithWindow
.groupByKey(_.groupingWindow)
//By grouping, I get several records per time window
//resulting an object of the below type which I write out to HDFS
case class WindowWithRecords(
records: Seq[MyRecord],
window: TimingWindow
)
When I examine HDFS,
Example:
Expected :
Each WindowWithRecords object having a unique TimingWindow
WindowWithRecordsA(TimingWindowA, Seq(MyRecordA, MyRecordB, MyRecordC))
Actual :
More than one WindowWithRecords object with the same TimingWindow
WindowWithRecordsA(TimingWindowA, Seq(MyRecordA, MyRecordB))
WindowWithRecordsB(TimingWindowA, Seq(MyRecordC))
Looks like the groupByKey logic is not working well.
I hope my question is clear. Any pointers would be helpful.
Found the problem:
I was not using an explicit trigger when processing the window. As a result, Spark was creating micro batches as soon as it could, as opposed to doing it at the end of the window.
streamingQuery
.writeStream
.trigger(Trigger.ProcessingTime(windowDuration))
...
.start
This was a result of me misunderstanding Spark documentation.
Note: groupByKey uses the object's hashcode. It is important to make sure that the hashcode of the object is consistent.
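The symptom and the fix can be reproduced with a small plain-Python model. It is a deliberate simplification, not Spark's execution model: `group_per_batch` is a hypothetical helper, timestamps are integer seconds, each record's timestamp doubles as its arrival time, and grouping happens independently inside each micro-batch.

```python
from collections import defaultdict


def window_start(ts, window):
    """Start of the tumbling window that contains ts."""
    return ts - ts % window


def group_per_batch(records, batch_interval, window):
    """Group records by window separately inside each micro-batch."""
    batches = defaultdict(list)
    for ts, payload in records:
        batches[ts // batch_interval].append((ts, payload))
    outputs = []
    for _, batch in sorted(batches.items()):
        groups = defaultdict(list)
        for ts, payload in batch:
            groups[window_start(ts, window)].append(payload)
        outputs.extend(sorted(groups.items()))
    return outputs


records = [(10, "A"), (200, "B"), (450, "C")]  # all fall in the window starting at 0

# Batches shorter than the window: the same window is written twice.
print(group_per_batch(records, batch_interval=300, window=600))
# [(0, ['A', 'B']), (0, ['C'])]

# Batch interval equal to the window duration: one output per window.
print(group_per_batch(records, batch_interval=600, window=600))
# [(0, ['A', 'B', 'C'])]
```

The first call mirrors the "Actual" output in the question, and the second mirrors the "Expected" output after setting the trigger to the window duration.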

How to write a Dataset to Kafka topic?

I am using Spark 2.1.0 and Kafka 0.9.0.
I am trying to push the output of a batch Spark job to Kafka. The job is supposed to run every hour, but not as a streaming job.
While looking for an answer online I could only find Kafka integration with Spark Streaming and nothing about integration with batch jobs.
Does anyone know if this is feasible?
Thanks
UPDATE :
As mentioned by user8371915, I tried to follow what was done in Writing the output of Batch Queries to Kafka.
I used a spark shell :
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Here is the simple code that I tried :
val df = Seq(("Rey", "23"), ("John", "44")).toDF("key", "value")
val newdf = df.select(to_json(struct(df.columns.map(column):_*)).alias("value"))
newdf.write.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("topic", "alerts").save()
But I get the error :
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:497)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 50 elided
Any idea what this is related to?
Thanks
tl;dr You are using an outdated Spark version. Batch writes to Kafka are enabled in 2.2 and later.
Out of the box you can use the Kafka SQL connector (the same one used with Structured Streaming):
Include spark-sql-kafka in your dependencies.
Convert the data to a DataFrame containing at least a value column of type StringType or BinaryType.
Write data to Kafka:
df
.write
.format("kafka")
.option("kafka.bootstrap.servers", server)
.save()
Follow Structured Streaming docs for details (starting with Writing the output of Batch Queries to Kafka).
If you have a DataFrame and want to write it to a Kafka topic, you first need to convert its columns into a single "value" column that contains the data in JSON format. In Scala:
import org.apache.spark.sql.functions._
val kafkaServer: String = "localhost:9092"
val topicSampleName: String = "kafkatopic"
df.select(to_json(struct("*")).as("value"))
.selectExpr("CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaServer)
.option("topic", topicSampleName)
.save()
For this error
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
I think you need to convert each message into a key-value pair. Your DataFrame must have a value column.
Let's say you have a DataFrame with the columns student_id and scores.
df.show()
>> student_id | scores
   1          | 99.00
   2          | 98.00
Then you should transform your DataFrame to
value
{"student_id":1,"scores":99.00}
{"student_id":2,"scores":98.00}
To convert it, you can use code similar to this:
df.select(to_json(struct($"student_id",$"scores")).alias("value"))
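The shape of the transformation, many columns folded into one JSON string column named value, can be shown in plain Python. This is only an illustration of the target layout, not Spark code: `to_value_rows` is a hypothetical helper and the rows are ordinary dicts.

```python
import json


def to_value_rows(rows):
    """Fold every column of each row into a single JSON string stored under 'value'."""
    return [{"value": json.dumps(row, separators=(",", ":"))} for row in rows]


# Hypothetical sample data matching the student example.
rows = [
    {"student_id": 1, "scores": 99.0},
    {"student_id": 2, "scores": 98.0},
]

value_rows = to_value_rows(rows)
for row in value_rows:
    print(row["value"])
# {"student_id":1,"scores":99.0}
# {"student_id":2,"scores":98.0}
```

Each resulting row has exactly one field, value, which is the column the Kafka writer requires.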