Spark Structured Streaming - dropDuplicates with watermark alternate solution

I am trying to deduplicate streaming data using the dropDuplicates function with a watermark. The problem I am facing is that I have two timestamps for a given record:
One is the event timestamp - the timestamp of the record's creation at the source.
The other is a transfer timestamp - the timestamp from an intermediate process that is responsible for streaming the data.
The duplicates are introduced during the intermediate stage, so for a duplicate of a given record the event timestamp is the same but the transfer timestamp is different.
For the watermark, I would like to use the transferTimestamp because I know the duplicates can't occur more than 3 minutes apart in transfer. But I can't use it within dropDuplicates, because then the duplicates won't be caught, since they have different transfer timestamps.
Here is an example,
Event 1:{ "EventString":"example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:05:00.00" }
Event 2 (duplicate): {"EventString":"example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:08:00.00"}
In this case, the duplicate was created during transfer, 3 minutes after the original event's transfer.
My code is like below:
streamDataset
  .withWatermark("transferTimestamp", "4 minutes")
  .dropDuplicates("eventstring", "transferTimestamp");
The above code won't drop the duplicates, as transferTimestamp is unique for the event and its duplicate. But currently this is the only way, as Spark forces me to include the watermark column in the dropDuplicates call when a watermark is set.
I would really like to see a dropDuplicates implementation like the one below, which would be valid for any at-least-once stream: I wouldn't have to use the watermark field in dropDuplicates, and watermark-based state eviction would still be honored. But that is not the case currently:
streamDataset
  .withWatermark("transferTimestamp", "4 minutes")
  .dropDuplicates("eventstring");
I can't use the eventTimestamp, as it is not ordered and its time range varies drastically (delayed events and junk events).
If anyone has an alternate solution or ideas for deduping in such a scenario, please let me know.

For your use case you can't use the dropDuplicates API directly. You have to implement an arbitrary stateful operation for this using a Spark API like flatMapGroupsWithState, along the lines of the sketch below.
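A minimal sketch of that approach, assuming streamDataset is a typed Dataset<Event> where Event is a JavaBean with eventString, eventTimestamp and transferTimestamp properties (the Event class, the Boolean "seen" flag used as state, and the 4-minute timeout are illustrative choices, not something prescribed by Spark):
import org.apache.spark.api.java.function.FlatMapGroupsWithStateFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.streaming.GroupStateTimeout;
import org.apache.spark.sql.streaming.OutputMode;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

Dataset<Event> deduped = streamDataset
    .withWatermark("transferTimestamp", "4 minutes")
    .groupByKey((MapFunction<Event, String>) Event::getEventString, Encoders.STRING())
    .flatMapGroupsWithState(
        (FlatMapGroupsWithStateFunction<String, Event, Boolean, Event>) (key, events, state) -> {
            if (state.hasTimedOut()) {
                state.remove();                      // watermark passed: forget this key
                return Collections.emptyIterator();
            }
            List<Event> out = new ArrayList<>();
            if (!state.exists() && events.hasNext()) {
                out.add(events.next());              // first occurrence of this eventString: emit it
                state.update(Boolean.TRUE);          // remember the key so later duplicates are dropped
                // time the key out once the transferTimestamp watermark has advanced another ~4 minutes
                state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 4 * 60 * 1000L);
            }
            return out.iterator();
        },
        OutputMode.Append(),
        Encoders.BOOLEAN(),
        Encoders.bean(Event.class),
        GroupStateTimeout.EventTimeTimeout());
With GroupStateTimeout.EventTimeTimeout, each key's state is evicted once the transferTimestamp watermark moves past its timeout, so the state stays bounded just as it would with dropDuplicates on the watermark column.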

Related

Powerful and cheap array algorithms

This is how text messages are normally stored in the Firebase Realtime Database.
I am not fond of the idea that every time someone joins a group chat, they would need to download the entire message history, e.g. 20,000 text messages. Naturally, users wouldn't swipe all the way up to the very first message. However, in the Firebase Realtime Database, storing all messages under a given parent node causes all of them to be downloaded once a user queries that node (to join the group chat).
One possible efficiency solution:
Adding a second parent node that stores older text messages. E.g. the latest 500 text messages are saved under the main messages parent node, and the remaining 19,500 old text messages are saved under a different parent node. However, the parent node for old text messages would also need to be updated as newer messages age out, and we would then need to download all 19,500 old text messages as a consequence.
Perhaps the ideal case is to create up to N parent nodes that store packets of 300 text messages each. However, what consequences would there be from creating parent nodes excessively?
What efficiency solutions are recommended for a problem like this? Is there some technique I am forgetting or unaware of?
Just sort the list by date descending and limit the query to the last N messages. Or save the "inverse" date and sort on that. You can read more about it here.
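For example, with the Android SDK a query like the following only pulls the newest messages from the server instead of the whole node (the messages/{groupChatId} path and the timestamp child are assumptions about your structure, not something prescribed by Firebase):
import com.google.firebase.database.*;

String groupChatId = "group-123";            // hypothetical group id
DatabaseReference messagesRef = FirebaseDatabase.getInstance()
        .getReference("messages")
        .child(groupChatId);                 // hypothetical per-group path

Query latestMessages = messagesRef
        .orderByChild("timestamp")           // assumes each message stores a numeric timestamp
        .limitToLast(500);                   // the server returns only the newest 500 children

latestMessages.addChildEventListener(new ChildEventListener() {
    @Override public void onChildAdded(DataSnapshot snapshot, String previousChildName) {
        // render the message; older history can be loaded on demand with endAt(oldestTimestamp)
    }
    @Override public void onChildChanged(DataSnapshot snapshot, String previousChildName) { }
    @Override public void onChildRemoved(DataSnapshot snapshot) { }
    @Override public void onChildMoved(DataSnapshot snapshot, String previousChildName) { }
    @Override public void onCancelled(DatabaseError error) { }
});
Because the ordering and limit are applied server-side, joining the chat costs roughly 500 messages of bandwidth no matter how long the history is.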

Is it possible to use composite triggers in conjunction with micro-batching with Dataflow?

We have an unbounded PCollection<TableRow> source that we are inserting into BigQuery.
An easy "by the book" way to fire windows every 500 thousand messages or five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterFirst.of(
        AfterPane.elementCountAtLeast(500000),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5)))))
    .withAllowedLateness(Duration.standardMinutes(1440))
    .discardingFiredPanes())
You would think that applying the following to the fired window/pane would allow you to write contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this yields the following error when running the pipeline: An exception occured while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified.
A relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, but that would essentially render the idea of inserting either every five minutes or every N messages completely void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?
FILE_LOADS requires a triggering frequency.
If you want more real-time results, you can use STREAMING_INSERTS instead.
Reference: https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS
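For example, a minimal sketch of the streaming-inserts variant applied to the windowed PCollection from the question (here called windowedRows, a name chosen for illustration); no triggering frequency is required for STREAMING_INSERTS:
windowedRows.apply("StreamWriteToBigQuery", BigQueryIO.writeTableRows()
    .to(destination)
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)   // rows are inserted as they arrive
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
Keep in mind that streaming inserts write rows near-continuously rather than batching per fired pane, which is what makes the results more real-time.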

Streaming data Complex Event Processing over files and a rather long period

My challenge:
We receive files every day with about 200,000 records. We keep the files for approximately 1 year to support re-processing, etc.
For the sake of the discussion, assume it is some sort of long-lasting fulfilment process, with a provisioning-ID that correlates records.
We need to identify flexible patterns in these files and trigger events.
Typical questions are:
if record A is followed by record B which is followed by record C, and all records occurred within 60 days, then trigger an event
if record D or record E was found, but record F did NOT follow within 30 days, then trigger an event
if both records D and record E were found (irrespective of the order), followed by ... within 24 hours, then trigger an event
Some patterns require lookups in a DB/NoSQL store or joins for additional information, either to select the record or to put into the event.
"Selecting a record" can be a simple "field-A equals", but can also be "field-A in []", "field-A match" or "func identify(field-A, field-B)".
"Days" might also be "hours" or "in the previous month", hence more flexible than just "days". Usually we have some date/timestamp in the record. The maximum is currently "within 6 months" (cancel within the setup phase).
The created events (preferably JSON) need to contain data from all records which were part of the selection process.
We need an approach that allows us to flexibly change (add, modify, delete) the patterns, optionally re-processing the input files.
Any thoughts on how to tackle the problem elegantly? Maybe some Python or Java framework, or do any of the public cloud solutions (AWS, GCP, Azure) address the problem space especially well?
Thanks a lot for your help.
After some discussions and reading, we will first try Apache Flink with the FlinkCEP library. From the docs and blog entries it seems able to do the job. It also seems to be AWS's choice, running on their EMR cluster. We didn't find any managed service on GCP or Azure providing this functionality. Of course, we can always deploy and manage it ourselves. Unfortunately, we didn't find a comparable Python framework.
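As an illustration of how the first rule could look in FlinkCEP (the Record POJO, its recordType and provisioningId accessors, and the records input stream are assumptions for the sketch, not part of the original question):
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// "record A followed by record B followed by record C, all within 60 days"
Pattern<Record, ?> abcWithin60Days = Pattern.<Record>begin("A")
        .where(new SimpleCondition<Record>() {
            @Override public boolean filter(Record r) { return "A".equals(r.getRecordType()); }
        })
        .followedBy("B")
        .where(new SimpleCondition<Record>() {
            @Override public boolean filter(Record r) { return "B".equals(r.getRecordType()); }
        })
        .followedBy("C")
        .where(new SimpleCondition<Record>() {
            @Override public boolean filter(Record r) { return "C".equals(r.getRecordType()); }
        })
        .within(Time.days(60));

// records is an assumed DataStream<Record> parsed from the daily files; keying by the
// provisioning-ID makes the pattern match per fulfilment process.
PatternStream<Record> matches = CEP.pattern(records.keyBy(Record::getProvisioningId), abcWithin60Days);
Rules about an event not arriving in time (like record F within 30 days) are typically detected via the timed-out partial matches that FlinkCEP exposes alongside the regular matches, and the matched records can be assembled into the JSON event in the select/process function.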

Delete data after a period of time Using Firebase

I am trying to implement a method whereby posts are deleted after a given time frame, in my case 10 weeks.
How would I go about implementing such a feature? I've read that Firebase does not support server-side scripting, so how could I go about it? When a user uploads a post, I do attach a timestamp node. Is it a case of comparing the post's timestamp to a cutoff 10 weeks in the past and then removing the post? Or is there another, more efficient way to achieve such a thing?
If I were to implement the aforementioned method, it would mean I'd need an observer/method to first analyse ALL posts, then do the comparison, and then execute the second phase depending on the timestamp: removeValue or simply return. And wouldn't I need to use NotificationCenter so I can call this code throughout my whole app?
Any input/advice would be appreciated.
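Since Realtime Database queries can filter on a child value server-side, the comparison does not have to touch every post. A minimal sketch of the comparison approach described above (shown with the Java SDK for illustration; the posts path and timestamp child are assumptions based on the description, and the same query API exists on iOS):
import com.google.firebase.database.*;

long tenWeeksMs = 10L * 7 * 24 * 60 * 60 * 1000;
double cutoff = System.currentTimeMillis() - tenWeeksMs;

DatabaseReference postsRef = FirebaseDatabase.getInstance().getReference("posts");
postsRef.orderByChild("timestamp")
        .endAt(cutoff)                                   // only posts at or older than the cutoff are returned
        .addListenerForSingleValueEvent(new ValueEventListener() {
            @Override public void onDataChange(DataSnapshot snapshot) {
                for (DataSnapshot expiredPost : snapshot.getChildren()) {
                    expiredPost.getRef().removeValue();  // delete each post older than 10 weeks
                }
            }
            @Override public void onCancelled(DatabaseError error) { }
        });
This still has to be triggered from a client (for example on app launch); a truly scheduled cleanup would need something running outside the Realtime Database itself.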

Is it possible to load the latest available datapoint and discard the rest in Druid?

Consider raw events (alpha set in Druid parlance) of the form timestamp | compoundId | dimension 1 | dimension 2 | metric 1 | metric 2
Normally in Druid, data can be loaded on realtime nodes and historical nodes based on some rules. These rules seem to be related to time ranges, e.g.:
load the last day of data on boxes A
load the last week (except last day) on boxes B
keep the rest in deep storage but don't load segments.
In contrast, I want to support the use case of:
load the last event for each given compoundId on boxes A, regardless of whether that last event happened to be loaded today or yesterday.
Is this possible?
Alternatively, if the above is not possible, I figured it would perhaps be possible, as a workaround, to create a beta set (at the finest granularity level) as follows:
Given an alpha set with the schema defined above, create a beta set so that:
all events for a given compoundId are rolled-up.
metric1 and metric2 are set to the metrics from the last occurring (largest timestamp) event.
Any advice much appreciated.
I believe the first and last aggregators are what you are looking for.
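Depending on your Druid version, a query-time rollup using the last aggregators might look roughly like this (the datasource name "alphaset", the interval, and the doubleLast type are placeholders to be adapted to your schema and metric types):
{
  "queryType": "groupBy",
  "dataSource": "alphaset",
  "granularity": "all",
  "dimensions": ["compoundId"],
  "aggregations": [
    { "type": "doubleLast", "name": "metric1_last", "fieldName": "metric1" },
    { "type": "doubleLast", "name": "metric2_last", "fieldName": "metric2" }
  ],
  "intervals": ["2018-01-01/2019-01-01"]
}
This returns, per compoundId, the metric values from the event with the largest timestamp in the queried interval, which matches the beta-set semantics described above.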