Spark Streaming dropDuplicates - scala

I am using Spark 2.1.1 (Scala API) to stream JSON files from an S3 location.
I want to deduplicate any incoming records based on an ID column (“event_id”) found in the json for every record. I do not care which record is kept, even if duplication of the record is only partial. I am using append mode as the data is merely being enriched/filtered, with no group by/window aggregations, via the spark.sql() method. I then use the append mode to write parquet files to s3.
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not effective in long-running production). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
That error seems odd as I am doing no aggregation (unless dropDuplicates or sparkSQL counts as an aggregation?).
I know that duplicates won't occur more than 3 days apart, so I tried again with a watermark (calling .withWatermark() immediately before dropDuplicates). However, it seems to wait until the 3 days are up before writing the data (i.e. since today is July 24, only data up to the same time on July 21 is written to the output).
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?

In my case, I used to achieve this in two ways with DStreams.
One way:
1. Load tmp_data (it contains 3 days of unique data, see below).
2. Receive batch_data and do a leftOuterJoin with tmp_data.
3. Filter the result of step 2 and output the new unique data.
4. Update tmp_data with the new unique data obtained from step 2's result and drop old data (more than 3 days old).
5. Save tmp_data to HDFS or wherever suits you.
6. Repeat the above for every batch.
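The steps above can be sketched in plain Python, with a dict standing in for tmp_data. This is only an illustration of the logic, not DStream code: the record shape (event_id, event_time), the function name dedupe_batch, and the RETENTION constant are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumption: duplicates never arrive more than 3 days apart.
RETENTION = timedelta(days=3)


def dedupe_batch(batch, seen, now):
    """Return the new unique records in `batch` and the updated `seen` state.

    batch: list of (event_id, event_time) tuples (hypothetical record shape)
    seen:  dict mapping event_id -> event_time, standing in for tmp_data
    """
    # Drop state older than the retention window ("drop old data").
    seen = {eid: ts for eid, ts in seen.items() if now - ts <= RETENTION}

    fresh = []
    for event_id, event_time in batch:
        # Equivalent of the left outer join plus filter: keep unmatched rows.
        if event_id not in seen:
            # Recording the id here also removes duplicates within the batch.
            seen[event_id] = event_time
            fresh.append((event_id, event_time))
    return fresh, seen
```

In a real job the state would have to be persisted between batches (the "save tmp_data" step), and the lookup would be a join rather than a dict probe once the state no longer fits on one machine.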
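Another way:
1. Create a table in MySQL and set a UNIQUE INDEX on event_id.
2. Receive batch_data and save event_id + event_time + whatever else you need to MySQL.
3. MySQL will skip the duplicates, provided the insert uses INSERT IGNORE (or INSERT ... ON DUPLICATE KEY UPDATE); a plain INSERT fails with a duplicate-key error instead.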
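A minimal sketch of the unique-index approach, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table name seen_events, the index name, and the helper insert_new are hypothetical. Note that SQLite spells the statement INSERT OR IGNORE, whereas MySQL uses INSERT IGNORE.

```python
import sqlite3

# In-memory database standing in for the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seen_events (event_id TEXT NOT NULL, event_time TEXT NOT NULL)"
)
conn.execute("CREATE UNIQUE INDEX idx_event_id ON seen_events (event_id)")


def insert_new(conn, batch):
    """Insert a batch and return only the rows that were not already present."""
    fresh = []
    for event_id, event_time in batch:
        # The unique index makes the database skip ids it has already stored.
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_events (event_id, event_time) VALUES (?, ?)",
            (event_id, event_time),
        )
        if cur.rowcount == 1:  # 0 means the row was ignored as a duplicate
            fresh.append((event_id, event_time))
    conn.commit()
    return fresh
```

Checking the affected-row count is one way to learn which rows were actually new, so that only those are passed downstream. Old rows would still need to be deleted periodically to keep the table at roughly 3 days of data.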

The solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a Hive table after dropping duplicates within the batch and performing a left anti join against the previous few days' worth of data in the target Hive table.


Is it possible to query the diff between two Apache Iceberg snapshots?

I have two snapshots in my Iceberg table's history, and I want to see the difference between them, or at least the columns/rows that were affected by the last snapshot. Is there an easy way of getting this information?
You can use the Java API to get the incremental changelog between two snapshot IDs of a table.
That gives you the full changelog.
If you just want to query the incremental data, an easier way is to use Spark or Flink:
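spark.read.format("iceberg")
.option("start-snapshot-id", "10963874102873")
.option("end-snapshot-id", "63874143573109")
.load("db.table") // placeholder table identifier, not from the original answer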
Currently this only returns data from append operations; replace, overwrite, and delete operations are not supported.
Enjoy yourself.

Good way to ensure data is ready to be queried while Glue partition is created?

We have queries that run on a schedule every several minutes that join a few different glue tables (via Athena) before coming up with some results. For the table in question, we have Glue Crawlers set up and partitions based on snapshot_date and a couple other columns.
In the query, we grab the latest snapshot_date and use only data from that snapshot_date. The data in S3 gets updated and put into the right folder a few times a day, but sometimes, if we query right as the data in S3 is being updated, we get empty results, apparently because the query accesses the new snapshot_date partition while Glue is still setting it up.
Is there a built-in way to ensure that our glue partitions are ready before we start querying them? So far, we considered building in artificial time "buffers" in our query around when we expect the snapshot_date partition data to be written and the glue update to be complete, but I'm aware that that's really brittle and depends on the exact timing.

Results of rdd.count, count via spark sql are the same, but they are different from count with hive sql

I use count to get the number of records in an RDD and get 13673153, but after I convert the RDD to a DataFrame, insert it into Hive, and count again, I get 13673182. Why?
spark.sql("select count(*) from ...").show()
hive sql: select count(*) from ...
This could be caused by a mismatch between data in the underlying files and the metadata registered in hive for that table. Try running:
in hive, and see if the issue is fixed. The command updates the partition information of the table. You can find more info in the documentation here.
During a Spark action, and as part of the SparkContext, Spark records which files were in scope for processing. So if the DAG needs to recover and reprocess that action, it produces the same results. This is by design.
HiveQL has no such considerations.
As you noted, the other answer did not help in this use case.
So, when Spark processes Hive tables, it looks at the list of files that it will use for the action.
In the case of a failure (node failure, etc.) it recomputes the data from the generated DAG. If it needs to go back as far as the initial read from Hive, it knows which files to use, i.e. the same files, so that the same results are produced instead of non-deterministic outcomes. Think of partitioning, for example: it is handy that the same results can be recomputed.
It's that simple. It's by design. Hope this helps.

Materialised View in Clickhouse not populating

I am currently working on a project which needs to ingest data from a Kafka Topic (JSON format), and write it directly into Clickhouse. I followed the method as suggested in the Clickhouse documentation:
Step 1: Created a clickhouse consumer which writes into a table (say, level1).
Step 2: I performed a select query on 'level1' and it gives me a set of results, but is not particularly useful as it can be read only once.
Step 3: I created a materialised view that converts data from the engine(level1) and puts it into a previously created table (say, level2). While writing into 'level2' the aggregation is on a day level (done by converting timestamp in level1 to datetime).
Therefore, data in 'level2' :- day + all columns in 'level1'
I intend to use this view (level2) as the base for any future aggregation (say, at level3)
Problem 1: 'level2' is being created but data is not being populated in it, i.e., when I perform a basic select query (select * from level2 limit 10) on the view, the output is "0 rows in set".
Is it because of day level aggregation, and it might populate at the end of the day? Can I ingest data from 'level2' in real-time?
Problem 2: Is there a way of reading the same data from my engine 'level1', multiple times?
Problem 3: Is there a way to convert Avro to JSON while reading from a kafka topic? Or can Clickhouse write data (in Avro format) directly into 'level1' without any conversion?
EDIT: There is latency in Clickhouse while retrieving data from Kafka. Had to make changes in the user.xml file in my Clickhouse server (change max_block_size).
Problem 1: 'level2' is being created but data is not being populated in it, i.e., when I perform a basic select query (select * from level2 limit 10) on the view, the output is "0 rows in set".
This might be related to the default settings of kafka storage, which always starts consuming data from the latest offset. You can change the behavior by adding this
to config.xml
Problem 2: Is there a way of reading the same data from my engine 'level1', multiple times?
It is better to avoid reading from the Kafka engine table directly, since each message can be read only once that way. You can set up a dedicated materialized view M1 for 'level1' and use that to populate 'level2' too. Reading from M1 is then repeatable.
Problem 3: Is there a way to convert Avro to JSON while reading from a kafka topic? Or can Clickhouse write data (in Avro format) directly into 'level1' without any conversion?
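Nope, though you can try Cap'n Proto, which should provide performance similar to Avro and is supported directly by ClickHouse. (Later ClickHouse releases added Avro and AvroConfluent input formats, so check whether your version has them.)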

spark streaming and spark sql considerations

I am using Spark Streaming (Scala) and receive records of customer calls to a call center through Kafka every 20 minutes. The records are converted to an RDD and then to a DataFrame so I can use Spark SQL. I have a business use case where I want to identify all the customers who called more than 3 times in the last two hours.
What would be the best approach? Should I keep inserting the records from every batch into a Hive table and run a separate script that queries for customers with 3 calls in the last two hours, or is there a better way that uses Spark's in-memory capabilities?
For this kind of use case Spark alone is enough; Hive is not required. You need a unique customer ID, so you can write a query along the lines of
ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY time DESC) AS call_count
Restrict the input to the last 2 hours and filter on call_count = 3, which returns each customer with at least three calls in that window exactly once (use call_count = 4 if you need strictly more than three calls). Then schedule this Spark script with crontab or any other scheduler.
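The counting logic itself is small enough to sketch outside Spark. The plain-Python version below is only an illustration: the record shape (cust_id, call_time) and the function name frequent_callers are assumptions, and it counts per customer directly instead of using ROW_NUMBER(), which gives the same set of customers.

```python
from collections import Counter
from datetime import datetime, timedelta


def frequent_callers(calls, now, window=timedelta(hours=2), threshold=3):
    """Return sorted customer ids with more than `threshold` calls in the window.

    calls: iterable of (cust_id, call_time) tuples (hypothetical record shape)
    """
    cutoff = now - window
    # Count only the calls that fall inside [now - window, now].
    counts = Counter(
        cust_id for cust_id, call_time in calls if cutoff <= call_time <= now
    )
    return sorted(cust for cust, n in counts.items() if n > threshold)
```

With 20-minute batches, the last two hours span several batches, so whichever approach you pick has to keep at least two hours of records available to the query.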