Is it possible to join 2 Kafka KStreams where the JoinWindows duration is stored in the object of 1 of the streams? - apache-kafka

Let's say I have 2 streams:
TimeWindow (with begin time, end time)
Numbers (with time stamp)
Is it possible to use either the DSL API or the Processor API to join the streams such that the output contains a TimeWindow object holding the sum of the numbers that fall within the time range specified by the TimeWindow?
To be specific, how do you set XXX so that it is the duration stored in win.getDuration(), where win is the object referenced in the ValueJoiner?
timeWindow.join(
    numbers,
    (ValueJoiner<TimeWindow, Number, TimeWindow>) (win, num) -> win.addToTotal(num),
    new JoinWindows(XXX, 0)
).to("output_Topic");
The JoinWindows after value is 0 because TimeWindow's timestamp is its end time. The XXX duration should be calculated as TimeWindow's end time minus its begin time, in milliseconds.
Many thanks for any help!

Thanks to Matthias' insight, I ended up falling back to the Processor API, implementing a TimestampExtractor and using a local state store (RocksDB by default) to implement this.
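For reference, here is a rough sketch of what that Processor API approach could look like (not the code actually used above): NumberEvent stands in for the question's Number type, the TimeWindow accessors (getBeginTime(), getEndTime(), addToTotal()) are assumed, and the topology and state-store wiring are omitted.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.TimestampExtractor;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Uses the event's own field as stream time (NumberEvent is a hypothetical stand-in
// for the question's Number type, assumed to expose getTimestamp()/getValue()).
class NumberTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        return ((NumberEvent) record.value()).getTimestamp();
    }
}

// Buffers each number under its timestamp. The sketch ignores the case of two
// numbers sharing the same timestamp.
class NumberBufferProcessor implements Processor<String, NumberEvent> {
    private KeyValueStore<Long, Double> numbersByTimestamp;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        numbersByTimestamp = (KeyValueStore<Long, Double>) context.getStateStore("numbers-store");
    }

    @Override
    public void process(String key, NumberEvent num) {
        numbersByTimestamp.put(num.getTimestamp(), num.getValue());
    }

    @Override
    public void close() { }
}

// When a TimeWindow arrives, sums every buffered number inside [begin, end].
// Assumes the numbers arrive before the window record, and that the store's key
// ordering matches numeric order (true for non-negative longs and the default Long serde).
class TimeWindowSumProcessor implements Processor<String, TimeWindow> {
    private ProcessorContext context;
    private KeyValueStore<Long, Double> numbersByTimestamp;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.numbersByTimestamp = (KeyValueStore<Long, Double>) context.getStateStore("numbers-store");
    }

    @Override
    public void process(String key, TimeWindow window) {
        try (KeyValueIterator<Long, Double> it =
                 numbersByTimestamp.range(window.getBeginTime(), window.getEndTime())) {
            while (it.hasNext()) {
                KeyValue<Long, Double> entry = it.next();
                window.addToTotal(entry.value);
            }
        }
        context.forward(key, window);
    }

    @Override
    public void close() { }
}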

Related

Apache Flink - SQL Kafka connector Watermark on event time doesn't pull records

I have a question similar to Apache Flink Tumbling Window delayed result. The difference is that I'm reading the records from the topic with the SQL Kafka connector. I get the records at regular intervals, but somehow I don't get the last few records in the output. For example, the last record in the Kafka topic has timestamp 2020-11-26T13:11:36.605Z, while the last timestamp for an aggregated value is 2020-11-26T12:59:59.999. I don't understand why I'm not getting an aggregation covering the last records in the topic. Please help. Here is my code.
sourceSQL = "CREATE TABLE flink_read_kafka (clientId INT, orderId INT, contactTimeStamp TIMESTAMP(3), WATERMARK FOR contactTimeStamp AS contactTimeStamp - INTERVAL '5' SECOND) with (kafka config) ";
sinkSQL = "CREATE TABLE flink_aggr_kafka (contactTimeStamp STRING, clientId INT, orderCount BIGINT) with (kafka config) ";
aggrSQL = "insert into flink_aggr_kafka SELECT TUMBLE_ROWTIME(contactTimeStamp, INTERVAL '5' MINUTE) as contactTimeStamp, clientId, COUNT(*) orderCount from flink_read_kafka GROUP BY clientId, TUMBLE(contactTimeStamp, INTERVAL '5' MINUTE)";
blinkStreamTableEnv.executeSql(sourceSQL);
blinkStreamTableEnv.executeSql(sinkSQL);
blinkStreamTableEnv.executeSql(aggrSQL);
First, some background: A tumbling window only emits results once the watermark has passed the maximum timestamp of the window. The watermark indicates to the framework that all records with a lower timestamp have arrived, and hence the window is complete and the results can be emitted.
The watermark can only advance based on the timestamps of incoming records, so if no more records come in, the watermark will not advance and currently open windows will not be closed. So it is expected that the last windows remain open when the influx of data stops.
In your example, one would normally expect the windows with a rowtime of 2020-11-26T13:04:59.999 and 2020-11-26T13:09:59.999 to be emitted as well, because the latest records should have pushed the watermark beyond these timestamps (13:11:36.605 minus the 5-second watermark delay is 13:11:31.605).
I can think of two reasons right now why this might not be the case:
Not all parallel source instances have seen a timestamp higher than 2020-11-26T13:05:04.999, and hence the output watermark has actually not passed that value. You can test this by running the job with a parallelism of 1, which would mitigate the problem, or verify it by checking the watermark of the window operator in the Flink Web UI.
If you are using the Kafka producer in exactly-once mode and only consume records that have been committed, the records will only become visible once a checkpoint has been completed after the window has fired.
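If the first reason applies, two table configuration options may be worth trying; this is a minimal sketch, assuming the blinkStreamTableEnv from the question and Flink 1.11+ option keys:
import org.apache.flink.configuration.Configuration;

Configuration tableConf = blinkStreamTableEnv.getConfig().getConfiguration();
// Run the job with a single parallel source instance, as suggested above.
tableConf.setString("table.exec.resource.default-parallelism", "1");
// Or let the watermark advance past Kafka partitions that have gone idle.
tableConf.setString("table.exec.source.idle-timeout", "30 s");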

Count and Time window in Esper EPL

I have the following use case, which I'm trying to write in EPL without success. I'm generating analytics events of different types, produced at different intervals (1 min, 5 min, 10 min, ...). For one special kind of analytics, I need to collect 4 specific analytics events (from which I will compute another analytics event) of different types, returned every interval (1 min, 5 min, 10 min, ...). The condition is that at every whole interval, e.g. every whole minute 00:01:00, 00:02:00, I want either the 4 events returned, or nothing if the events don't all arrive within some slack period afterwards (e.g. 2 s).
case 1: events A, B, C, D arrive at 00:01:00.500, 00:01:00.600, 00:01:00.700, 00:01:00.800 - right after the fourth event arrives at Esper, the aggregated event with all 4 events is returned
case 2: the slack period is 2 seconds and events A, B, C, D arrive at 00:01:00.500, 00:01:00.600, 00:01:00.700, 00:01:02.200 - nothing is returned, as the last event is outside the slack period
You could create a trigger event every minute like this:
insert into TriggerEvent select * from pattern[timer:schedule(date:'1970-01-01T00:00:00.0Z', period: 1 minute, repetitions: -1)]
The trigger that arrives every minute can kick off a pattern or context. A pattern would seem to be good enough. Here is something like that:
select * from pattern [every TriggerEvent -> (a=A -> b=B -> c=C -> d=D) where timer:within(2 seconds)]
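For completeness, a sketch of wiring those two statements up from Java, assuming the pre-8.x Esper client API and hypothetical event classes A, B, C and D:
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

Configuration config = new Configuration();
// Hypothetical event classes standing in for the four analytics event types.
config.addEventType("A", A.class);
config.addEventType("B", B.class);
config.addEventType("C", C.class);
config.addEventType("D", D.class);
EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

// The once-a-minute trigger stream.
engine.getEPAdministrator().createEPL(
    "insert into TriggerEvent select * from pattern"
    + "[timer:schedule(date:'1970-01-01T00:00:00.0Z', period: 1 minute, repetitions: -1)]");

// Matches only when all four events arrive within 2 seconds of a trigger.
EPStatement stmt = engine.getEPAdministrator().createEPL(
    "select * from pattern [every TriggerEvent -> (a=A -> b=B -> c=C -> d=D) where timer:within(2 seconds)]");
stmt.addListener((newEvents, oldEvents) -> System.out.println(newEvents[0].getUnderlying()));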

Use External Window Time Stamp to Debug Siddhi Stream Query

I am planning to use historical event traces (stored in JSON, with my own event timestamp recorded for each event) to debug the Siddhi stream queries that I have just created. My stream starts with:
from MyInputEventStream#window.externalTime(my_own_timestamp, 10 min)
select some_fields
insert into MyOutpuStream;
and I will input my events from traces, one by one.
Suppose event 1 arrives at the specified my_own_timestamp = 1528905600000, which is 9 am PST on June 13, and event 2 arrives 11 minutes later, at my_own_timestamp = 1528906260000. I believe that I will get the output at MyOutpuStream at 9:10 am, since time_stamp(e2) - time_stamp(e1) > 10 min and e2 will trigger the system after the window passes.
Now suppose event 1 arrives at my_own_timestamp = 1528905600000, that is, 9:00 am, but no events arrive in the next 2 hours. Do I still get the output at 9:10 am? In reality the window should expire at 9:10 am, independent of when the next event arrives. But it seems that in this case, Siddhi's internal timing would have to incorporate my event input's timestamp and then set the expiration time of the events based on the clock of the process on which Siddhi is running. Is this correct? Could you help clarify?
You won't get an output at 9:10 am, because with externalTime the event expiration logic is based entirely on the timestamp that you define. The window waits for an incoming event whose timestamp makes the time difference greater than or equal to the window length before it expires the previous event.
What happens internally is:
def array previousEvents;
foreach currentEvent in currentEvents (events that are coming in):
    def currentTime = currentEvent.timestamp;
    foreach previousEvent in previousEvents:
        def previousTime = previousEvent.timestamp;
        def timeDiff = previousTime - currentTime + windowLength;
        if (timeDiff <= 0) {
            remove previousEvent from previousEvents;
            set expired timestamp of previousEvent to currentTime;
            expire previousEvent;
        }
    previousEvents.add(currentEvent);
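The same logic as a small, runnable Java sketch (hypothetical Event type; windowLength and the timestamps share the same unit, e.g. milliseconds):
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ExternalTimeWindow {
    static class Event {
        final long timestamp;
        long expiredTimestamp;
        Event(long timestamp) { this.timestamp = timestamp; }
    }

    private final long windowLength;
    private final List<Event> previousEvents = new ArrayList<>();

    ExternalTimeWindow(long windowLength) { this.windowLength = windowLength; }

    void onEvent(Event currentEvent) {
        long currentTime = currentEvent.timestamp;
        Iterator<Event> it = previousEvents.iterator();
        while (it.hasNext()) {
            Event previousEvent = it.next();
            long timeDiff = previousEvent.timestamp - currentTime + windowLength;
            if (timeDiff <= 0) {
                // Expiry is driven purely by a later event's timestamp:
                // if no further event arrives, nothing ever expires.
                it.remove();
                previousEvent.expiredTimestamp = currentTime;
                expire(previousEvent);
            }
        }
        previousEvents.add(currentEvent);
    }

    void expire(Event e) { /* emit e on the expired-events stream */ }
}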

How to use Unix timestamp to get offset using SimpleConsumer API?

I'm trying to use the SimpleConsumer example.
I modify the offset in the code:
long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
It works well when I use kafka.api.OffsetRequest.EarliestTime() or kafka.api.OffsetRequest.LatestTime(), but when I set it to a Unix timestamp, it doesn't return the message from that moment.
For example
long readOffset = getLastOffset(consumer, a_topic, a_partition, 1439196000000L, clientName);
I set the timestamp to 1439196000000L, which is 2015/8/10 16:40:00. However, it returns a message from about one hour before that time.
Is this the right way to specify the timestamp? The timestamp should be 13 digits (milliseconds), not 10 digits (seconds), right?
I am in China, using Beijing time. Does that matter?
Is it possible that Kafka has a parameter to set the time of the cluster?

redis key scheme for analytics

I want to create analytics using Redis: basic counters per object, per hour/day/week/month/year and in total.
What Redis data structure would be effective for this, and how can I avoid making many calls to Redis?
Would it be better to have each model use this set of keys:
hash - model:<id>:years => every year has a counter
hash - model:<id>:<year> => every month has a counter
hash - model:<id>:<year>:<month> => every day has a counter
hash - model:<id>:<year>:<month>:<day> => every hour has a counter
If this scheme is correct, how would I chart this data without making many calls to Redis? Would I have to loop over every year in model:<id>:years and fetch the months, then loop over the months, and so on? Or should I just grab all fields and their values from all the keys in one batch request and then process them on the server?
It's better to use a zset (sorted set) for this instead of a hash. Using the timestamp as the score, you will be able to retrieve data for a specific time range.
For a date range you would use model:<id>:<year>:<month>, for an hour range model:<id>:<year>:<month>:<day>, and so on...
Indeed, if the date range is larger than a month (e.g. from January 1st 2014 to March 20th 2014), you will have to retrieve multiple zsets (model:<id>:2014:01, model:<id>:2014:02 and model:<id>:2014:03) and merge the results.
If you really want to cover such a date range with a single request, you can always store day-precision data inside model:<id>:<year>. And if you want to handle date ranges spanning multiple years, you will just need a single zset, e.g. model:<id>:byDay.
However, please note that storing historical data will increase memory consumption over time, so you should already think about data retention. With Redis you can either use EXPIRE on the zsets or handle it yourself with cron jobs.
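A minimal Jedis sketch of that idea (the key name, the event-id member and the retention period are illustrative assumptions, not something the answer prescribes): each event becomes a zset member scored by its epoch-millis timestamp, so any sub-range of the key's period can be counted with a single ZCOUNT.
import redis.clients.jedis.Jedis;

try (Jedis jedis = new Jedis("localhost", 6379)) {
    String key = "model:42:2014:01";                  // events for January 2014
    long eventTimeMillis = 1389571200000L;            // 2014-01-13T00:00:00Z
    jedis.zadd(key, eventTimeMillis, "event:12345");  // member must be unique per event

    // Counter for January 6th to January 20th 2014 in a single round trip.
    long count = jedis.zcount(key, 1388966400000d, 1390176000000d);
    System.out.println("events between Jan 6 and Jan 20: " + count);

    // Optional retention (~400 days), per the EXPIRE note above.
    jedis.expire(key, 400 * 24 * 60 * 60);
}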