Trigger Beam ParDo at window closing only - apache-beam

I have a pipeline that reads events from Kafka. I want to count the events and log the count only when the window closes, so that I get only one output log per Kafka partition/shard for each window. I use a timestamp in the header, which I truncate to the hour to create a collection of hourly timestamps. I group the timestamps by hour and log the hourly timestamp and its count. This log will be sent to Grafana to create a dashboard with the counts.
Below is how I fetch the data from Kafka and where it defines the window duration:
int windowDuration = 5;
p.apply("Read from Kafka",KafkaIO.<byte[], GenericRecord>read()
.withBootstrapServers(options.getSourceBrokers().get())
.withTopics(srcTopics)
.withKeyDeserializer(ByteArrayDeserializer.class)
.withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider
.of(options.getSchemaRegistryUrl().get(), options.getSubject().get()))
.commitOffsetsInFinalize())
.apply("Windowing of " + windowDuration +" seconds" ,
Window.<KafkaRecord<byte[], GenericRecord>>into(
FixedWindows.of(Duration.standardSeconds(windowDuration))));
The next step in the pipeline produces two collections from the collection above: one with the events as GenericRecord and the other with the hourly timestamps, see below. I want a trigger (I believe) to be applied only to the collection holding the counts, so that it prints the count only once per window. As it stands, it prints a count every time it reads from Kafka, creating a large number of entries.
tuplePCollection.get(createdOnTupleTag)
.apply(Count.perElement())
.apply( MapElements.into(TypeDescriptors.strings())
.via( (KV<Long,Long> recordCount) -> recordCount.getKey() +
": " + recordCount.getValue()))
.apply( ParDo.of(new LoggerFn.logRecords<String>()));
Here is the DoFn I use to log the counts:
class LoggerFn<T> extends DoFn<T, T> {
@ProcessElement
public void process(ProcessContext c) {
T e = (T)c.element();
LOGGER.info(e);
c.output(e);
}
}

Window.ClosingBehavior is not a trigger in itself; it is an option you set alongside the trigger that specifies under which conditions a final pane is created when a window is permanently closed. You can use these options:
FIRE_ALWAYS: always fire the last pane.
FIRE_IF_NON_EMPTY: only fire the last pane if there is new data since the previous firing.
You can see this example.
// We first specify to never emit any panes
.triggering(Never.ever())
// We then specify to fire always when closing the window. This will emit a
// single final pane at the end of allowedLateness
.withAllowedLateness(allowedLateness, Window.ClosingBehavior.FIRE_ALWAYS)
.discardingFiredPanes())
You can see more information about this trigger.
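As a plain-Java illustration (not Beam code) of what one final pane per window amounts to for a count, the sketch below assigns each timestamp to a fixed window aligned to the epoch and produces exactly one count per window. The class and method names are hypothetical, and the 5-second window size simply mirrors windowDuration in the question.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FixedWindowCounts {
    // Start of the fixed window that contains the timestamp
    // (assumption: windows are aligned to the epoch, with no offset).
    public static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    // One entry per window: window start -> number of events in that window.
    public static Map<Long, Long> countPerWindow(List<Long> timestamps, long windowSizeMillis) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : timestamps) {
            counts.merge(windowStart(ts, windowSizeMillis), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> timestamps = List.of(1_000L, 4_999L, 5_000L, 9_000L, 12_500L);
        // Prints one line per 5-second window instead of one line per event.
        countPerWindow(timestamps, 5_000L)
            .forEach((start, count) -> System.out.println(start + ": " + count));
    }
}
```

In the pipeline itself this grouping is done by the windowing and Count transforms; the trigger and closing behavior only decide when the per-window result is emitted.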

Related

How to use Kafka time window for historical aggregation?

I need to create state store with number of authenticated users per day so I can get number of authenticated users in the last day, in the last 7 days and in the last 30 days.
In order to achieve this, every authentication event is sent to auth-event topic.
I am streaming this topic and creating window for every day.
Code :
KStream<String, GenericRecord> authStream = builder.stream("auth-event", Consumed.with(stringSerde, valueSerde)
.withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST)
.withTimestampExtractor(new TransactionTimestampExtractor()));
authStream
.groupBy(( String key, GenericRecord value) -> value.get("tenantId").toString(), Grouped.with(Serdes.String(), valueSerde))
.windowedBy(TimeWindows.of(Duration.ofDays(1)))
.count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("auth-result-store")
.withKeySerde(stringSerde)
.withValueSerde(longSerde))
.suppress(Suppressed.untilWindowCloses(unbounded()))
.toStream()
.to("auth-result-topic", Produced.with(timeWindowedSerdeFrom(String.class), Serdes.Long()));
After that I am inserting records on the topic.
I also have a REST controller that reads the store using ReadOnlyWindowStore.
The days parameter is sent from the UI and can be 1, 7 or 30. For 7 days, for example, I would like to read the last 7 windows.
Code :
final ReadOnlyWindowStore<String, Long> dayStore = kafkaStreams.store(KStreamsLdapsExample.authResultTable, QueryableStoreTypes.windowStore());
Instant timeFrom = (Instant.now().minus(Duration.ofDays(days)));
LocalDate currentDate = LocalDate.now();
LocalDateTime currentDayTime = currentDate.atTime(23, 59, 59);
Instant timeTo = Instant.ofEpochSecond(currentDayTime.toEpochSecond(ZoneOffset.UTC));
try(WindowStoreIterator<Long> it1 = dayStore.fetch(tenant, timeFrom, timeTo)) {
Long count = 0L;
JsonObject jsonObject = new JsonObject();
while (it1.hasNext())
{
final KeyValue<Long, Long> next = it1.next();
Date resultDate = new Date(next.key);
jsonObject.addProperty(resultDate.toString(), next.value);
count += next.value;
}
jsonObject.addProperty("tenant", tenant);
jsonObject.addProperty("Total number of events", count);
return ResponseEntity.ok(jsonObject.toString());
}
The problem is that, I can get results only for 1-2 days. After that older windows are lost.
The other problem is the information written in the output topic : "auth-result-topic"
I am reading the results with the console consumer, and there are a lot of empty records with no key and no value, and some records with seemingly random numbers.
Any idea what is going on with my store ? How to read past N windows?
Thanks
You will need to increase the store retention time (default is 1 day) via Materialized.as(...).withRetention(...), which you can pass into the count() operator.
You may also want to increase the window grace period via TimeWindows.of(Duration.ofDays(1)).grace(...).
For reading the data with the console consumer: you will need to specify the correct deserializers. The window serde and long serde that you use to write into the output topic use binary formats, while the console consumer assumes the string data type by default. There are corresponding command line parameters for setting different key and value deserializers, and these must match the serializers you use when writing into the topic.
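To see why a binary long shows up as empty or random output when read as a string, here is a small plain-Java sketch. It assumes the count is written as 8 big-endian bytes, which is the layout a binary long serializer produces; the class and method names are hypothetical and nothing here touches Kafka itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LongBytesDemo {
    // Encode a long as 8 big-endian bytes.
    public static byte[] serializeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Decode the same 8 bytes back into a long.
    public static long deserializeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        byte[] bytes = serializeLong(42L);
        // Interpreted as text, the bytes are seven NUL characters followed by '*' (0x2A),
        // which a terminal shows as an empty-looking or odd record.
        String asText = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("as text, length " + asText.length() + ", last char '" + asText.charAt(7) + "'");
        // Interpreted with the matching decoder, the original count comes back.
        System.out.println("as long: " + deserializeLong(bytes));
    }
}
```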

reading files and folders in order with apache beam

I have a folder structure of the type year/month/day/hour/*, and I'd like Beam to read this as an unbounded source in chronological order. Specifically, this means reading all the files in the first hour on record and adding their contents for processing, then adding the file contents of the next hour for processing, and so on up to the current time, where it waits for new files to arrive in the latest year/month/day/hour folder.
Is it possible to do this with apache beam?
So what I would do is to add timestamps to each element according to the file path. As a test I used the following example.
First of all, as explained in this answer, you can use FileIO to match continuously a file pattern. This will help as, per your use case, once you have finished with the backfill you want to keep reading new arriving files within the same job. In this case I provide gs://BUCKET_NAME/data/** because my files will be like gs://BUCKET_NAME/data/year/month/day/hour/filename.extension:
p
.apply(FileIO.match()
.filepattern(inputPath)
.continuously(
// Check for new files every minute
Duration.standardMinutes(1),
// Never stop checking for new files
Watch.Growth.<String>never()))
.apply(FileIO.readMatches())
Watch frequency and timeout can be adjusted at will.
Then, in the next step we'll receive the matched file. I will use ReadableFile.getMetadata().resourceId() to get the full path and split it by "/" to build the corresponding timestamp. I round it to the hour and do not account for timezone correction here. With readFullyAsUTF8String we'll read the whole file (be careful if the whole file does not fit into memory, it is recommended to shard your input if needed) and split it into lines. With ProcessContext.outputWithTimestamp we'll emit downstream a KV of filename and line (the filename is not needed anymore but it will help to see where each file comes from) and the timestamp derived from the path. Note that we're shifting timestamps "back in time" so this can mess up with the watermark heuristics and you will get a message such as:
Cannot output with timestamp 2019-03-17T00:00:00.000Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-05T15:41:29.645Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
To overcome this I set getAllowedTimestampSkew to Long.MAX_VALUE but take into account that this is deprecated. ParDo code:
.apply("Add Timestamps", ParDo.of(new DoFn<ReadableFile, KV<String, String>>() {
@Override
public Duration getAllowedTimestampSkew() {
return new Duration(Long.MAX_VALUE);
}
@ProcessElement
public void processElement(ProcessContext c) {
ReadableFile file = c.element();
String fileName = file.getMetadata().resourceId().toString();
String lines[];
String[] dateFields = fileName.split("/");
Integer numElements = dateFields.length;
String hour = dateFields[numElements - 2];
String day = dateFields[numElements - 3];
String month = dateFields[numElements - 4];
String year = dateFields[numElements - 5];
String ts = String.format("%s-%s-%s %s:00:00", year, month, day, hour);
Log.info(ts);
try{
lines = file.readFullyAsUTF8String().split("\n");
for (String line : lines) {
c.outputWithTimestamp(KV.of(fileName, line), new Instant(dateTimeFormat.parseMillis(ts)));
}
}
catch(IOException e){
Log.info("failed");
}
}}))
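The path-to-timestamp step can also be checked in isolation. The sketch below is a hypothetical helper (not part of the job above) that parses the year/month/day/hour folders with java.time and assumes UTC, as the text does. Beam itself works with org.joda.time.Instant, so inside the DoFn the result would still need converting, for example from its epoch milliseconds.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class PathTimestamp {
    // Derive an hour-aligned UTC instant from a path shaped like .../year/month/day/hour/filename.
    public static Instant fromPath(String path) {
        String[] parts = path.split("/");
        int n = parts.length;
        if (n < 5) {
            throw new IllegalArgumentException("Expected .../year/month/day/hour/file, got: " + path);
        }
        int year = Integer.parseInt(parts[n - 5]);
        int month = Integer.parseInt(parts[n - 4]);
        int day = Integer.parseInt(parts[n - 3]);
        int hour = Integer.parseInt(parts[n - 2]);
        // Minutes are fixed to zero: the timestamp is rounded down to the hour folder.
        return LocalDateTime.of(year, month, day, hour, 0).toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(fromPath("gs://BUCKET_NAME/data/2019/03/17/00/file1"));
        System.out.println(fromPath("gs://BUCKET_NAME/data/2019/03/18/22/file2"));
    }
}
```

Parsing the fields as integers rather than formatting a string also surfaces malformed paths early, as a NumberFormatException or DateTimeException.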
Finally, I window into 1-hour FixedWindows and log the results:
.apply(Window
.<KV<String,String>>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow())
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
.apply("Log results", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
String file = c.element().getKey();
String value = c.element().getValue();
String eventTime = c.timestamp().toString();
String logString = String.format("File=%s, Line=%s, Event Time=%s, Window=%s", file, value, eventTime, window.toString());
Log.info(logString);
}
}));
For me it worked with .withAllowedLateness(Duration.ZERO), but depending on the order in which files arrive you might need to set a larger value. Keep in mind that a value that is too high will cause windows to stay open for longer and use more persistent storage.
I set the $BUCKET and $PROJECT variables and I just upload two files:
gsutil cp file1 gs://$BUCKET/data/2019/03/17/00/
gsutil cp file2 gs://$BUCKET/data/2019/03/18/22/
And run the job with:
mvn -Pdataflow-runner compile -e exec:java \
-Dexec.mainClass=com.dataflow.samples.ChronologicalOrder \
-Dexec.args="--project=$PROJECT \
--path=gs://$BUCKET/data/** \
--stagingLocation=gs://$BUCKET/staging/ \
--runner=DataflowRunner"
Let me know how this works. This was just an example to get started, and you might need to adjust the windowing and triggering strategies, lateness, etc. to suit your use case.

How could I use the Kafka Streams window to generate one record for Candlestick chart generation

I have to use Kafka Streams to read transaction info from the transaction result topic and draw a Candlestick chart for each specific duration. Each record has a transaction id, amount, price and deal time; the key is the transaction id, which is different for every record.
What I want to do is compute, from the transaction results, the highest price, lowest price, open price, close price and tx close_time for each duration, and use them to create a Candlestick chart.
I have used the kafka stream window to do this:
final KStreamBuilder builder = new KStreamBuilder();
KStream<String, JsonNode> transactionKStream = builder.stream(keySerde, valueSerde, srcTopicName);
KTable<Windowed<String>, InfoRecord> kTableRecords= groupedStream.aggregate(
InfoRecord::new, /* initializer */
(k, v, aggregate) -> aggregate.add(k,v), /* adder */
TimeWindows.of(TimeUnit.SECONDS.toMillis(5)).until(TimeUnit.SECONDS.toMillis(5)),
infoRecordSerde);
In the source topic, each record has the txId as its key, and the txId is never duplicated, so after aggregation the resulting KTable will have the same records as the KStream, but I could use the window to get all the records in a specific duration.
I think kTableRecords should contain all the records in a specific duration, i.e. the 5 seconds.
So I could loop over all the records in those 5 seconds to get the high, low, open (the price of the very first record in the window), close (the price of the very last record in the window) and close_time (the tx time of the very last record in the window),
so that I would get only one record for this window and output it to a sink Kafka topic, but I don't know how to do this per window duration.
I think the code will be like:
kTableRecords.foreach((key, value) -> {
// TODO: Add Logic Here
})
The IDE shows that this foreach has been deprecated.
I also don't know how to distinguish the records in this window from those in the next window,
or whether I need a window retention time, as set with until in the sample code above.
I have struggled with this for several days and still don't know the correct way to do it. I would appreciate any help pointing me in the right direction, thanks.
The Kafka version is 0.11.0.0.
Update:
With the hints from Michal in his post, I changed my code to do the
high, low, open, close price calculation in the aggregator instance,
but the results made me realize that, for each different key in the specific
window, the logic creates a new instance for that key and executes add for the current key only, without interacting with the values of other keys.
What I really want is to calculate the
high, low, open, close price across all records with different keys in that
window duration, so what I need is not a new instance for each key;
there should be only one aggregate instance for each specific window,
doing the calculation over all the record values in that duration, so that each window gets one set of (high, low, open, close price).
I have read the topic:
How to compute windowed aggregations over successively increasing timed windows?
but I am not sure whether this is the right solution for me, thanks.
By the way, K-line means Candlestick chart.
UPDATE II:
Based on your updates, I created the code shown below:
KStream<String, JsonNode> transactionKStream = builder.stream(keySerde, valueSerde, srcTopicName);
KGroupedStream<String, JsonNode> groupedStream = transactionKStream.groupBy((k,v)-> "constkey", keySerde, valueSerde);
KTable<Windowed<String>, MarketInfoRecord> kTable =
groupedStream.aggregate(
MarketInfoRecord::new, /* initializer */
(k, v, aggregate) -> aggregate.add(k,v), /* adder */
TimeWindows.of(TimeUnit.SECONDS.toMillis(100)).until(TimeUnit.SECONDS.toMillis(100)),
infoRecordSerde, "test-state-store");
KStream<String, MarketInfoRecord> newS = kTable.toStream().map(
(k,v) -> {
System.out.println("key: "+k+", value:"+v);
return KeyValue.pair(k.window().start() + "_" + k.window().end(), v);
}
);
newS.to(Serdes.String(),infoRecordSerde, "OUTPUT_NEW_RESULT");
If I use a static string as the key when grouping, then the windowed aggregation certainly
creates only one aggregator instance for the window, and we can get the (high, low, open, close)
for all the records in that window. But
as the key is the same for all records, the window gets updated several times and produces several records for one window, such as:
key: [constkey#1521304400000/1521304500000], value:MarketInfoRecord{high=11, low=11, openTime=1521304432205, closeTime=1521304432205, open=11, close=11, count=1}
key: [constkey#1521304600000/1521304700000], value:MarketInfoRecord{high=44, low=44, openTime=1521304622655, closeTime=1521304622655, open=44, close=44, count=1}
key: [constkey#1521304600000/1521304700000], value:MarketInfoRecord{high=44, low=33, openTime=1521304604182, closeTime=1521304622655, open=33, close=44, count=2}
key: [constkey#1521304400000/1521304500000], value:MarketInfoRecord{high=22, low=22, openTime=1521304440887, closeTime=1521304440887, open=22, close=22, count=1}
key: [constkey#1521304600000/1521304700000], value:MarketInfoRecord{high=55, low=55, openTime=1521304629943, closeTime=1521304629943, open=55, close=55, count=1}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=77, low=77, openTime=1521304827181, closeTime=1521304827181, open=77, close=77, count=1}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=77, low=66, openTime=1521304817079, closeTime=1521304827181, open=66, close=77, count=2}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=88, low=66, openTime=1521304817079, closeTime=1521304839047, open=66, close=88, count=3}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=99, low=66, openTime=1521304817079, closeTime=1521304848350, open=66, close=99, count=4}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=100.0, low=66, openTime=1521304817079, closeTime=1521304862006, open=66, close=100.0, count=5}
So we need to deduplicate as described in the link you posted ("38945277/7897191"), right?
So, I want to know if I could do something like:
KGroupedStream<String, JsonNode> groupedStream = transactionKStream.groupByKey();
// as key was unique txId, so this group is just for doing next window operation, the record number is not changed.
KTable<Windowed<String>, MarketInfoRecord> kTable =
groupedStream.SOME_METHOD(
// just use some method to deliver the records in different windows,
// no sure if this is possible?
TimeWindows.of(TimeUnit.SECONDS.toMillis(100)).until(TimeUnit.SECONDS.toMillis(100))
// use until here to let the record purged if out of the window,
// please correct me if i am wrong?
Could we transform the time-based series of input records into several windowed groups,
where each group has its window (or the window start and end times combined into a string) as the key?
Then, for each group, the key is different, but the group has several records with different values,
so we do an aggregation (no need for a windowed aggregation here), the values are calculated, and
from each key:value pair we get one result record.
The next window has a different window-based key, so in this way the execution downstream should be able to use multiple threads
(as the key changes).
I suggest you do all the calculations you mention not in a foreach but directly in your aggregator, that is, in the adder:
(k, v, aggregate) -> aggregate.add(k,v), /* adder */
the add method can do all the things you mentioned (I suggest you first map the JsonNode to a Java object, let's call it Transaction), consider this pseudo-code:
private int low = Integer.MAX_VALUE; // whatever type you use to represent prices
private int high = Integer.MIN_VALUE;
private long openTime = Long.MAX_VALUE; // whatever type you use to represent time
private long closeTime = Long.MIN_VALUE;
...
public InfoRecord add(String key, Transaction tx) {
if(tx.getPrice() > this.high) this.high = tx.getPrice();
if(tx.getPrice() < this.low) this.low = tx.getPrice();
if(tx.getTime() < this.openTime) {
this.openTime = tx.getTime();
this.open = tx.getPrice();
}
if(tx.getTime() > this.closeTime) {
this.closeTime = tx.getTime();
this.close = tx.getPrice();
}
return this;
}
Keep in mind that you may in reality get more than one record on output for each window as the windows can be updated multiple times (they're never final) as is explained in more detail here: https://stackoverflow.com/a/38945277/7897191
I don't know what a K-line is but if you want multiple windows of increasing duration, the pattern is outlined here
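For reference, a self-contained version of the aggregate idea might look like the sketch below. The Candle class, its double prices and its epoch-millisecond times are assumptions for illustration; it is independent of Kafka Streams and would still need a serde to be used as the aggregate value.

```java
public class Candle {
    private double high = Double.NEGATIVE_INFINITY;
    private double low = Double.POSITIVE_INFINITY;
    private double open;
    private double close;
    private long openTime = Long.MAX_VALUE;
    private long closeTime = Long.MIN_VALUE;
    private long count;

    // Fold one trade into the candle. Open and close are chosen by trade time,
    // so trades that arrive out of order still produce the right values.
    public Candle add(double price, long time) {
        high = Math.max(high, price);
        low = Math.min(low, price);
        if (time < openTime) {
            openTime = time;
            open = price;
        }
        if (time > closeTime) {
            closeTime = time;
            close = price;
        }
        count++;
        return this;
    }

    public double getHigh() { return high; }
    public double getLow() { return low; }
    public double getOpen() { return open; }
    public double getClose() { return close; }
    public long getCloseTime() { return closeTime; }
    public long getCount() { return count; }

    public static void main(String[] args) {
        // The second trade has the earliest time, so it supplies the open price.
        Candle c = new Candle().add(44.0, 200L).add(33.0, 100L).add(55.0, 300L);
        System.out.println("open=" + c.getOpen() + " high=" + c.getHigh()
            + " low=" + c.getLow() + " close=" + c.getClose() + " count=" + c.getCount());
    }
}
```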
UPDATE:
To aggregate all records in a window, just change the key to some static value before doing the aggregation. So to create your grouped stream you can use groupBy(KeyValueMapper), something like:
KGroupedStream<String, JsonNode> groupedStream = transactionKStream.groupBy( (k, v) -> ""); // give all records the same key (empty string)
Please be aware that this will cause repartitioning (since partition is determined by the key and we're changing the key) and the execution downstream will become single threaded (since there will now be just one partition).

Esper. Unable to receive remove events in listeners

When using time windows in Esper, the old or removed events from the window are sent as output to the UpdateListener attached to the statement. This is what should occur according to the documentation. But when I execute code like the one below, oldEvents does not contain any events, even when a new sliding window starts. The same happens with a length window.
EPStatement statement1 = epAdmin.createEPL("select current_timestamp, sum(price)" + " from StockTick.win:time(5 sec)");
statement1.addListener(new UpdateListener() {
@Override
public void update(EventBean[] newEvents, EventBean[] oldEvents) {
System.out.println("sum \t" + newEvents[0].getUnderlying() + "\n");
System.out.println("old sum \t" + oldEvents[0].getUnderlying() + "\n");
}
});
When I send events into this query, the UpdateListener receives the events entering the window in newEvents, but when an event is removed as the window slides, it should be present in oldEvents, and I do not get any events there.
Is there a mistake in how I am constructing the listeners or statements?
By default the engine does not output the remove stream unless the select clause specifies "irstream":
select irstream current_timestamp, ....

Batching large result sets using Rx

I've got an interesting question for Rx experts. I've a relational table keeping information about events. An event consists of id, type and time it happened. In my code, I need to fetch all the events within a certain, potentially wide, time range.
SELECT * FROM events WHERE event.time > :before AND event.time < :after ORDER BY time LIMIT :batch_size
To improve reliability and deal with large result sets, I query the records in batches of size :batch_size. Now, I want to write a function that, given :before and :after, will return an Observable representing the result set.
Observable<Event> getEvents(long before, long after);
Internally, the function should query the database in batches. The distribution of events along the time scale is unknown. So the natural way to address batching is this:
fetch first N records
if the result is not empty, use the last record's time as a new 'before' parameter, and fetch the next N records; otherwise terminate
if the result is not empty, use the last record's time as a new 'before' parameter, and fetch the next N records; otherwise terminate
... and so on (the idea should be clear)
My question is:
Is there a way to express this function in terms of higher-level Observable primitives (filter/map/flatMap/scan/range etc), without using the subscribers explicitly?
So far, I've failed to do this, and come up with the following straightforward code instead:
private void observeGetRecords(long before, long after, Subscriber<? super Event> subscriber) {
long start = before;
while (start < after) {
final List<Event> records;
try {
records = getRecordsByRange(start, after);
} catch (Exception e) {
subscriber.onError(e);
return;
}
if (records.isEmpty()) break;
records.forEach(subscriber::onNext);
start = Iterables.getLast(records).getTime();
}
subscriber.onCompleted();
}
public Observable<Event> getRecords(final long before, final long after) {
return Observable.create(subscriber -> observeGetRecords(before, after, subscriber));
}
Here, getRecordsByRange implements the SELECT query using DBI and returns a List. This code works fine, but lacks elegance of high-level Rx constructs.
NB: I know that I can return Iterator as a result of SELECT query in DBI. However, I don't want to do that, and prefer to run multiple queries instead. This computation does not have to be atomic, so the issues of transaction isolation are not relevant.
Although I don't fully understand why you want such time-reuse, here is how I'd do it:
BehaviorSubject<Long> start = BehaviorSubject.create(0L);
start
.subscribeOn(Schedulers.trampoline())
.flatMap(tstart ->
getEvents(tstart, tstart + twindow)
.publish(o ->
o.takeLast(1)
.doOnNext(r -> start.onNext(r.time))
.ignoreElements()
.mergeWith(o)
)
)
.subscribe(...)