State processing in PySpark Structured Streaming - pyspark

I am running some simple logic in PySpark Structured Streaming.
I have n events for each id.
These events arrive continuously, in or out of order, from a Kafka topic.
I have to apply logic where, when an event for id '123' arrives, I check whether it is the final event; if it is, I process all events with that id. I need to wait for all the events of that id before performing the calculation.
Sample Dataset:
Id  | eventtype | value
123 | load      | 22
123 | data      | 32
123 | unload    | 20
I have to group by id 123 and the result should contain the average of all events, but I need to wait for the unload event.
Result:
Id  | eventtype | value
123 | aggregate | 24.75
How can I achieve this?
Using a state store?
Using some window?
Please explain the approach with an example.
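One way to do this (a sketch, not a tested solution) is arbitrary stateful processing: buffer each id's values in the state store until its 'unload' event arrives, then emit the aggregate. In PySpark this is groupBy(...).applyInPandasWithState (Spark 3.4+; the Scala/Java equivalent is flatMapGroupsWithState). The broker address, topic name, and JSON value schema below are assumptions for illustration only:
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

# Requires the spark-sql-kafka package on the classpath.
spark = SparkSession.builder.appName("wait-for-unload").getOrCreate()

# Streaming DataFrame with columns Id, eventtype, value (broker/topic/schema are assumptions).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "Id STRING, eventtype STRING, value DOUBLE").alias("e"))
    .select("e.*")
)

def aggregate_when_unloaded(key, pdfs, state: GroupState):
    # Buffer values for this Id until an 'unload' event is seen, then emit one aggregate row.
    if state.exists:
        buffered, seen_unload = state.get
        buffered = list(buffered)
    else:
        buffered, seen_unload = [], False
    for pdf in pdfs:
        buffered = buffered + pdf["value"].tolist()
        seen_unload = seen_unload or bool((pdf["eventtype"] == "unload").any())
    if seen_unload:
        state.remove()  # drop the buffered state for this Id
        yield pd.DataFrame({"Id": [key[0]],
                            "eventtype": ["aggregate"],
                            "value": [sum(buffered) / len(buffered)]})
    else:
        state.update((buffered, seen_unload))  # keep waiting for the final event

result = events.groupBy("Id").applyInPandasWithState(
    aggregate_when_unloaded,
    outputStructType="Id STRING, eventtype STRING, value DOUBLE",
    stateStructType="buffered ARRAY<DOUBLE>, seen_unload BOOLEAN",
    outputMode="append",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

query = result.writeStream.format("console").outputMode("append").start()
In practice you would probably also configure a state timeout (ProcessingTimeTimeout or EventTimeTimeout) so that ids whose unload event never arrives don't keep state forever.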

Related

With KSQL, why does my table keep data with older ROWTIME and discard updates with newer ROWTIME?

I have a process that feeds relatively simple vehicle data into a Kafka topic. The records are keyed by registration and the values contain things like latitude/longitude etc. plus a value called DateTime, which is a timestamp based on the sensor that took the readings (not the producer or the cluster).
My data arrives out of order in general, and especially if I keep pumping the same test data set into the vehicle-update-log topic over and over. My data set contains two records for the vehicle I'm testing with.
My expectation is that when I do a select on the table, it will return one row with the most recent data based on the ROWTIME of the records. (I've verified that the ROWTIME is getting set correctly.)
What happens instead is that the result has both rows (for the same primary KEY) and the last value is the oldest ROWTIME.
I'm confused; I thought KSQL would keep only the most recent update. Must I now write additional logic on the client side to pick the latest of the data I get?
I created the table like this:
CREATE TABLE vehicle_updates
(
Latitude DOUBLE,
Longitude DOUBLE,
DateTime BIGINT,
Registration STRING PRIMARY KEY
)
WITH
(
KAFKA_TOPIC = 'vehicle-update-log',
VALUE_FORMAT = 'JSON_SR',
TIMESTAMP = 'DateTime'
);
Here is my query:
SELECT
registration,
ROWTIME,
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS', 'Africa/Johannesburg') AS rowtime_formatted
FROM vehicle_updates
WHERE registration = 'BT66MVE'
EMIT CHANGES;
Results while no data is flowing:
+------------------------------+------------------------------+------------------------------+
|REGISTRATION |ROWTIME |ROWTIME_FORMATTED |
+------------------------------+------------------------------+------------------------------+
|BT66MVE |1631532052000 |2021-09-13 13:20:52.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
Here's the same query, but I'm pumping the data set into the topic again while the query is running. I'm surprised to be getting the older record as updates.
Results while feeding data:
+------------------------------+------------------------------+------------------------------+
|REGISTRATION |ROWTIME |ROWTIME_FORMATTED |
+------------------------------+------------------------------+------------------------------+
|BT66MVE |1631532052000 |2021-09-13 13:20:52.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
What gives?
In the end, it's an issue in Kafka Streams that is not easy to resolve: https://issues.apache.org/jira/browse/KAFKA-10493 (we are already working on a long-term solution for it, though).
While event-time-based processing is a central design pillar, there are some gaps that still need to be closed.
The underlying issue is that Kafka itself was originally designed around log-append order only. Timestamps were added later (in the 0.10 release). There are still some gaps today (e.g., https://issues.apache.org/jira/browse/KAFKA-7061) in which "offset order" is dominant. You are hitting one of those cases.
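On the client-side logic the question asks about: until those gaps are closed, you can guard against the stale updates with a few lines in the consumer. A minimal sketch in Python (the dict-shaped rows and field names mirror the query output above and are assumptions for illustration, not a ksqlDB client API):
latest_by_key = {}

def on_row(row):
    # Keep only the newest row per registration, judged by ROWTIME (i.e. DateTime).
    key = row["REGISTRATION"]
    current = latest_by_key.get(key)
    if current is None or row["ROWTIME"] >= current["ROWTIME"]:
        latest_by_key[key] = row   # newer (or equal) timestamp: accept and forward the update
        return row
    return None                    # stale update caused by offset-order processing: drop it

# Feeding the two rows from the question in arrival order keeps the 13:20:52 record:
for r in [{"REGISTRATION": "BT66MVE", "ROWTIME": 1631532052000},
          {"REGISTRATION": "BT66MVE", "ROWTIME": 1631527147000}]:
    on_row(r)
print(latest_by_key["BT66MVE"]["ROWTIME"])  # 1631532052000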

Apply Rank or partitioned row_num function in Data Fusion

I want to implement a rank or partitioned row_num function on my data in Data Fusion, but I can't find any plugin to do so.
Is there any way to do this?
I want to implement the following.
Suppose I have the above data; now I want to group the data based on AccountNumber and send the most recent record to one sink and the rest to the other.
So from the above data,
Sink1 is expected to have,
Sink2 ,
I was planning to do this segregation by applying rank or row_number partitioned by AccountNumber and sorted by Record_date desc, and sending the records with rank=1 or row_num=1 to one sink and the rest to the other.
A good approach to solve your problem is using the Spark plugin.
To add it to your Data Fusion instance, go to HUB -> Plugins -> search for Spark -> deploy the plugin. Then you can find it on the Analytics tab.
To give you an example of how you could use it, I created the pipeline below:
This pipeline basically:
Reads a file from GCS.
Executes a rank function on your data.
Filters the data with rank=1 and rank>1 into different branches.
Saves your data in different locations.
Now let's take a deeper look at each component:
1 - GCS: this is a simple GCS source. The file used for this example has the data shown below.
2 - Spark_rank: this is a Spark plugin with the code below. The code basically creates a temporary view with your data and then applies a query to rank your rows. After that, your data comes back to the pipeline. Below you can also see the input and output data for this step. Please note that the output is duplicated because it is delivered to two branches.
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
  df.createTempView("source")
  df.sparkSession.sql("SELECT AccountNumber, Address, Record_date, RANK() OVER (PARTITION BY accountNumber ORDER BY record_date DESC) as rank FROM source")
}
3 - Spark2 and Spark3: like the previous step, these steps use the Spark plugin to transform the data. Spark2 keeps only the data with rank = 1 using the code below:
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
  df.createTempView("source_0")
  df.sparkSession.sql("SELECT AccountNumber, Address, Record_date FROM source_0 WHERE rank = 1")
}
Spark3 gets the data with rank > 1 using the code below:
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
  df.createTempView("source_1")
  df.sparkSession.sql("SELECT accountNumber, address, record_date FROM source_1 WHERE rank > 1")
}
4 - GCS2 and GCS3: finally, in this step your data gets saved into GCS again.
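As an aside, if you ever need the same segregation outside Data Fusion, the ranking can be written directly with the PySpark DataFrame API. A small sketch (the GCS path is hypothetical; column names follow the queries above):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", True).csv("gs://my-bucket/accounts.csv")  # hypothetical input

# Rank rows within each AccountNumber, newest Record_date first.
w = Window.partitionBy("AccountNumber").orderBy(F.col("Record_date").desc())
ranked = df.withColumn("rank", F.rank().over(w))

latest = ranked.filter("rank = 1")   # most recent record per account -> Sink1
older = ranked.filter("rank > 1")    # everything else -> Sink2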

Kafka Streams as table: Patch log, not full Post

Desired functionality: For a given key, key123, numerous services are running in parallel and reporting their results to a single location; once all results are gathered for key123, they are passed to a new downstream consumer.
Original idea: Using AWS DynamoDB to hold all results for a given entry. Every time a result is ready a micro-service does a PATCH operation to the database on key123. An output stream checks each UPDATE to see if the entry is complete, if so, it is forwarded downstream.
New Idea: Use Kafka Streams and KSQL to reach the same goal. All services write their output to the results topic, and the topic forms a changelog KStream that we query with KSQL for completed entries. Something like:
CREATE STREAM competed_results FROM results_stream SELECT * WHERE (all results != NULL).
The part I'm not sure how to do is the PATCH operation on the stream: how do I have the output stream show the accumulation of all messages for key123 instead of just the most recent one?
KSQL users, does this even make sense? Am I close to a solution that someone has done before?
If you can produce all your events to the same topic, with the key set, then you can collect all of the events for a specific key using an aggregation in ksqlDB such as:
CREATE STREAM source (
KEY INT KEY, -- example key to group by
EVENT STRING -- example event to collect
) WITH (
kafka_topic='source', -- or whatever your source topic is called.
value_format='json' -- or whatever value format you need.
);
CREATE TABLE agg AS
SELECT
key,
COLLECT_LIST(event) as events
FROM source
GROUP BY key;
This will create a changelog topic called AGG by default. As new events are received for a specific key on the source topic, ksqlDB will produce messages to the AGG topic, with the key set to key and the value containing the list of all the events seen for that key.
You can then import this changelog as a stream:
CREATE STREAM agg_stream (
KEY INT KEY,
EVENTS ARRAY<STRING>
) WITH (
kafka_topic='AGG',
value_format='json'
);
And you can then apply some criteria to filter the stream to only include your final results:
CREATE STREAM competed_results AS
SELECT
*
FROM agg_stream
WHERE ARRAY_LENGTH(EVENTS) = 5; -- example 'complete' criteria.
You may even want to use a user-defined function to define your complete criteria:
CREATE STREAM competed_results AS
SELECT
*
FROM agg_stream
WHERE IS_COMPLETE(EVENTS);

How to express the event de-duplication logics in Siddhi stream processing

Hi: I need the following de-duplication logic to be implemented in Siddhi stream processing. Assume I have an InputStream, and I want to produce the OutputStream as follows:
(1) When the event is the first one (since the event processing engine started) in the InputStream, insert the event into the OutputStream.
(2) If an event with the same signature (for example, the same event name) arrives within a 2-minute window, we consider the event identical and should NOT insert it into the OutputStream. Otherwise, we should insert the event into the OutputStream.
I tried to use an event pattern to do the filtering. However, I cannot find a way to express the "negation logic" in Siddhi, that is, if (not (e1 --> e2 with the same signature within a 2-minute window)). Is there a clever way to perform such event de-duplication logic? Note that event de-duplication is a very common expression needed for event processing.
If I were to implement it in Java, it would be relatively straightforward: I would create a hash table. When the first event arrives, I register it in the hash table and set the acceptable time of this registered event to be 2 minutes later. When the next event arrives, I look it up in the hash table and compare the retrieved event's acceptable time with the current event time; if the current event time is smaller than the acceptable time, I do not consider it an output event. Instead of the Java implementation, I would prefer a declarative solution implemented as a Siddhi stream processing query, if that is possible.
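For concreteness, here is a minimal sketch of that imperative hash-table approach (in Python rather than Java, purely for illustration):
import time

SUPPRESSION_WINDOW_SECONDS = 120   # the 2-minute window from the requirement
accept_after = {}                  # signature -> earliest time the next event is accepted

def deduplicate(signature, event_time=None):
    # Returns True if the event should be inserted into the output stream.
    now = event_time if event_time is not None else time.time()
    threshold = accept_after.get(signature)
    if threshold is None or now >= threshold:
        accept_after[signature] = now + SUPPRESSION_WINDOW_SECONDS
        return True
    return False

print(deduplicate("login_failed", event_time=0))     # True  - first occurrence is forwarded
print(deduplicate("login_failed", event_time=60))    # False - same signature within 2 minutes
print(deduplicate("login_failed", event_time=200))   # True  - outside the 2-minute window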
You can use an in-memory table to achieve that; please find the sample below. It's pretty much similar to your approach in Java.
define stream InputStream (event_id string, data string);
define stream OutputStream (event_id string, data string);
define table ProcessedEvents (event_id string);
from InputStream[not(ProcessedEvents.event_id == event_id in ProcessedEvents)]
insert into OutputStream ;
from OutputStream
select event_id
insert into ProcessedEvents ;
from OutputStream#window.time(2 sec)
select event_id
insert expired events into PurgeStream ;
from PurgeStream
delete ProcessedEvents
on ProcessedEvents.event_id == event_id ;

Siddhi CEP: Aggregate the String values of grouped events in a batch time window

I'm using Siddhi to reduce the number of events in a system. To do so, I declared a batch time window that groups all the events based on their target_ip.
from Events#window.timeBatch(30 sec)
select id as meta_ID, Target_IP4 as target_ip
group by Target_IP4
insert into temp;
The result I would like to have is a single event for each target_ip, with the meta_ID parameter value as the concatenation of the distinct events that form it.
The problem is that the previous query generates as many events as distinct meta_ID values. For example, I'm getting:
"id_10", "target_1"
"id_11", "target_1"
And I would like to have
"id_10,id_11", "target_1"
I'm aware that some aggregation method is missing in my query. I saw a lot of aggregation functions in Siddhi, including the siddhi-execution-string extension, which has the method str:concat, but I don't know how to use it to aggregate the meta_ID values. Any idea?
You could write an execution plan as shown below, to achieve your requirement:
define stream inputStream (id string, target string);
-- Query 1
from inputStream#window.timeBatch(30 sec)
select *
insert into temp;
-- Query 2
from temp#custom:aggregator(id, target)
select *
insert into reducedStream;
Here, the custom:aggregator is the custom stream processor extension that you will have to implement. You can follow [1] when implementing it.
Let me explain a bit about how things work:
Query 1 generates a batch of events every 30 seconds. In other words, we use Query 1 for creating a batch of events.
So, at the end of every 30-second interval, the batch of events will be fed into the custom:aggregator stream processor. When input is received by the stream processor, its process() method will be invoked.
#Override
protected void process(ComplexEventChunk<StreamEvent> streamEventChunk, Processor nextProcessor, StreamEventCloner streamEventCloner, ComplexEventPopulater complexEventPopulater) {
//implement the aggregation & grouping logic here
}
The batch of events is available in the streamEventChunk. When implementing the process() method, you can iterate over the streamEventChunk and create one event per destination. You will need to implement this logic in the process() method.
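For orientation only, the grouping and concatenation that process() has to perform amounts to something like the following (sketched in Python, not the Siddhi StreamEvent API):
def aggregate_batch(events):
    # events: iterable of (meta_ID, target_ip) pairs collected in one 30-second batch
    grouped = {}
    for meta_id, target_ip in events:
        grouped.setdefault(target_ip, []).append(meta_id)
    # one output event per target_ip, with the concatenated meta_IDs
    return [(",".join(ids), target_ip) for target_ip, ids in grouped.items()]

print(aggregate_batch([("id_10", "target_1"), ("id_11", "target_1")]))
# [('id_10,id_11', 'target_1')]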
[1] https://docs.wso2.com/display/CEP420/Writing+a+Custom+Stream+Processor+Extension