Apache Beam Windowing on a signals phase - apache-beam

Updated: Is it possible to window a data stream on a signals phase?
For example, there is a stream of timestamp, key, value:
[<t0, k1, 0>, <t1, k1, 98>, <t2, k1, 145>, <t4, k1, 0>, <t3, k1, 350>, <t5, k1, 40>, <t6, k1, 65>, <t7, k1, 120>, <t8, k1, 240>, <t9, k1, 352>].
The output would be two windows for key k1:
t0 - t3: [0, 98, 145, 350]
t4 - t9: [0, 40, 65, 120, 240, 352]
E.g. every time the value hits 0, start a new window for the group.

After your question edit and use case clarification, I would recommend looking into custom windowing to extend the standard sessions. As a starting point I built the following example (it can be improved upon).
Through WindowFn.AssignContext we can access the element() that is being windowed into a proto-session. If it's equal to a given stopValue, the window length will be confined to the minimum instead of using gapDuration:
@Override
public Collection<IntervalWindow> assignWindows(AssignContext c) {
  // Stop-value elements get a minimal 1 ms window so they never merge into a session
  Duration newGap = c.element().getValue().equals(this.stopValue) ? new Duration(1) : gapDuration;
  return Arrays.asList(new IntervalWindow(c.timestamp(), newGap));
}
Then, when merging the sorted windows, we check not only that they overlap but also that the window duration is not equal to 1 ms:
Collections.sort(sortedWindows);
List<MergeCandidate> merges = new ArrayList<>();
MergeCandidate current = new MergeCandidate();
for (IntervalWindow window : sortedWindows) {
  // get window duration and check if it's a stop-session request
  Long windowDuration = new Duration(window.start(), window.end()).getMillis();
  if (current.intersects(window) && !windowDuration.equals(1L)) {
    current.add(window);
  } else {
    merges.add(current);
    current = new MergeCandidate(window);
  }
}
merges.add(current);
for (MergeCandidate merge : merges) {
  merge.apply(c);
}
Of course, we can also add some code to support different stopping values: a stopValue field, a withStopValue method, constructors, display data if using the Dataflow Runner, etc.
/** Value that closes the session. */
private final Integer stopValue;

/** Creates a {@code StopSessions} {@link WindowFn} with the specified gap duration. */
public static StopSessions withGapDuration(Duration gapDuration) {
  return new StopSessions(gapDuration, 0);
}

/** Creates a {@code StopSessions} {@link WindowFn} with the specified stop value. */
public StopSessions withStopValue(Integer stopValue) {
  return new StopSessions(gapDuration, stopValue);
}

/** Creates a {@code StopSessions} {@link WindowFn} with the specified gap duration and stop value. */
private StopSessions(Duration gapDuration, Integer stopValue) {
  this.gapDuration = gapDuration;
  this.stopValue = stopValue;
}
Now in our pipeline we can import and use the new StopSessions class with:
import org.apache.beam.sdk.transforms.windowing.StopSessions; // custom one
...
.apply("Window into StopSessions", Window.<KV<String, Integer>>into(StopSessions
    .withGapDuration(Duration.standardSeconds(10))
    .withStopValue(0)))
To mimic your example we create some data with:
.apply("Create data", Create.timestamped(
TimestampedValue.of(KV.of("k1", 0), new Instant()), // <t0, k1, 0>
TimestampedValue.of(KV.of("k1",98), new Instant().plus(1000)), // <t1, k1, 98>
TimestampedValue.of(KV.of("k1",145), new Instant().plus(2000)), // <t2, k1, 145>
TimestampedValue.of(KV.of("k1",0), new Instant().plus(4000)), // <t4, k1, 0>
...
With standard sessions the output would be:
user=k1, scores=[0,145,350,120,0,40,65,98,240,352], window=[2019-06-08T19:13:46.785Z..2019-06-08T19:14:05.797Z)
And with the custom one I get the following:
user=k1, scores=[350,145,98], window=[2019-06-08T21:18:51.395Z..2019-06-08T21:19:03.407Z)
user=k1, scores=[0], window=[2019-06-08T21:18:54.407Z..2019-06-08T21:18:54.408Z)
user=k1, scores=[65,240,352,120,40], window=[2019-06-08T21:18:55.407Z..2019-06-08T21:19:09.407Z)
user=k1, scores=[0], window=[2019-06-08T21:18:50.395Z..2019-06-08T21:18:50.396Z)
Changing the stopValue with .withStopValue(<int>) works as expected: the 98, 145 and 350 events end up in a different session than the rest. Please note that this is not exactly as in the description, since the stopValue element is assigned to its own separate window instead of starting the new one, but it can be filtered out downstream, and it gives you an idea of how to proceed. I would like to revisit this and also look into a Python implementation.
All files here.

Likely not, from your description. There are at least two problems:
PCollections in Beam are unordered and distributed:
there are no guarantees in the model that events from one group will arrive in that order;
data-driven triggers are not supported (probably for similar reasons):
https://beam.apache.org/documentation/programming-guide/#data-driven-triggers
However, you can look into stateful processing and see if you can handle this manually, e.g. accumulate all the incoming events in state and then, from time to time, analyze the accumulated events and emit the results (a rough sketch follows the links below).
Or if you can extract/assign a common key in your business logic, then you might want to check if GroupByKey+ParDo or Combine would be helpful.
See:
https://beam.apache.org/blog/2017/02/13/stateful-processing.html
https://docs.google.com/document/d/1zf9TxIOsZf_fz86TGaiAQqdNI5OO7Sc6qFsxZlBAMiA/edit
https://beam.apache.org/documentation/programming-guide/#combine
https://beam.apache.org/documentation/programming-guide/#groupbykey
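To make the stateful-processing suggestion above more concrete, here is a rough, hypothetical sketch (not from the original answer) of a Beam stateful DoFn that buffers values per key in a BagState and flushes them with a processing-time timer. The class name BufferAndFlush and the one-minute flush interval are assumptions, and splitting the buffered values at the stop value would still need to happen inside the timer callback:

// Hypothetical sketch: buffer per-key values and flush them periodically.
class BufferAndFlush extends DoFn<KV<String, Integer>, List<Integer>> {

  @StateId("buffer")
  private final StateSpec<BagState<Integer>> bufferSpec = StateSpecs.bag();

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(ProcessContext c,
                      @StateId("buffer") BagState<Integer> buffer,
                      @TimerId("flush") Timer flush) {
    buffer.add(c.element().getValue());
    // (Re)arm a processing-time timer that fires one minute after the last element.
    flush.offset(Duration.standardMinutes(1)).setRelative();
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext c, @StateId("buffer") BagState<Integer> buffer) {
    // Copy the accumulated events, analyze them here (e.g. split sub-lists at each 0) and emit.
    List<Integer> events = new ArrayList<>();
    buffer.read().forEach(events::add);
    if (!events.isEmpty()) {
      c.output(events);
    }
    buffer.clear();
  }
}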

Related

How could I use the Kafka Streams window to generate one record for Candlestick chart generation

I have to use Kafka Streams to get the transaction info to draw a Candlestick chart for specific durations from the transaction result topic. It has transaction id, amount, price and deal time; the key is the transaction id, which is different for every record.
What I want to do is a calculation based on the transaction results to get the highest price, lowest price, open price, close price and tx close_time for each duration, and use them to create a Candlestick chart.
I have used the Kafka Streams window to do this:
final KStreamBuilder builder = new KStreamBuilder();
KStream<String, JsonNode> transactionKStream = builder.stream(keySerde, valueSerde, srcTopicName);
KTable<Windowed<String>, InfoRecord> kTableRecords = groupedStream.aggregate(
    InfoRecord::new,                          /* initializer */
    (k, v, aggregate) -> aggregate.add(k, v), /* adder */
    TimeWindows.of(TimeUnit.SECONDS.toMillis(5)).until(TimeUnit.SECONDS.toMillis(5)),
    infoRecordSerde);
As in the source topic each record has the txId as the key, and the txId is never duplicated, the aggregation result KTable has the same records as the KStream, but I can use the window to get all the records in a specific duration.
I think kTableRecords should contain all the records in a specific duration, i.e. the 5 seconds, so I could loop over all the records in those 5 seconds to get the high, low, open (the very first record's price in the window), close (the very last record's price in the window) and close_time (tx time of the very last record in the window).
That way I would get only one record for this window and output the result to a sink Kafka topic, but I don't know how to do this per window duration.
I think the code will be like:
kTableRecords.foreach((key, value) -> {
    // TODO: Add Logic Here
})
The IDE shows that this foreach has been deprecated, but I don't know how to distinguish the records in this window from those in the next window, or whether I need a window record retention time using until as in the sample code above.
I have struggled with this for several days, and I still don't know the correct way to complete my job; I'd appreciate anyone's help to put me on the right track, thanks.
Kafka version is: 0.11.0.0
Update:
With the hints from Michal in his post, I changed my code and now do the high, low, open, close price calculation in the aggregator instance, but the results made me realize that for each different key in a specific window the logic creates a new aggregator instance for that key and runs add only for the current key, without interacting with the values of other keys.
What I really want is to calculate the high, low, open, close price over all records with different keys within that window duration, so what I need is not a new instance per key; there should be only one aggregate instance per specific window, doing the calculation over all the record values in that duration, so that each window duration yields one set of (high, low, open, close price).
I have read the topic:
How to compute windowed aggregations over successively increasing timed windows?
but I am in doubt and not sure whether this is the right solution for me, thanks.
By the way, K-line means Candlestick chart.
UPDATE II:
Based on your updates, I created the code shown below:
KStream<String, JsonNode> transactionKStream = builder.stream(keySerde, valueSerde, srcTopicName);
KGroupedStream<String, JsonNode> groupedStream = transactionKStream.groupBy((k, v) -> "constkey", keySerde, valueSerde);
KTable<Windowed<String>, MarketInfoRecord> kTable =
    groupedStream.aggregate(
        MarketInfoRecord::new,                    /* initializer */
        (k, v, aggregate) -> aggregate.add(k, v), /* adder */
        TimeWindows.of(TimeUnit.SECONDS.toMillis(100)).until(TimeUnit.SECONDS.toMillis(100)),
        infoRecordSerde, "test-state-store");
KStream<String, MarketInfoRecord> newS = kTable.toStream().map(
    (k, v) -> {
      System.out.println("key: " + k + ", value:" + v);
      return KeyValue.pair(k.window().start() + "_" + k.window().end(), v);
    }
);
newS.to(Serdes.String(), infoRecordSerde, "OUTPUT_NEW_RESULT");
If I use a static string as the key when grouping, it's clear that during the windowed aggregation only one aggregator instance is created per window, and we can get the (high, low, open, close) for all the records in that window. But since the key is the same for all records, the window gets updated several times and produces several records for one window, such as:
key: [constkey#1521304400000/1521304500000], value:MarketInfoRecord{high=11, low=11, openTime=1521304432205, closeTime=1521304432205, open=11, close=11, count=1}
key: [constkey#1521304600000/1521304700000], value:MarketInfoRecord{high=44, low=44, openTime=1521304622655, closeTime=1521304622655, open=44, close=44, count=1}
key: [constkey#1521304600000/1521304700000], value:MarketInfoRecord{high=44, low=33, openTime=1521304604182, closeTime=1521304622655, open=33, close=44, count=2}
key: [constkey#1521304400000/1521304500000], value:MarketInfoRecord{high=22, low=22, openTime=1521304440887, closeTime=1521304440887, open=22, close=22, count=1}
key: [constkey#1521304600000/1521304700000], value:MarketInfoRecord{high=55, low=55, openTime=1521304629943, closeTime=1521304629943, open=55, close=55, count=1}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=77, low=77, openTime=1521304827181, closeTime=1521304827181, open=77, close=77, count=1}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=77, low=66, openTime=1521304817079, closeTime=1521304827181, open=66, close=77, count=2}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=88, low=66, openTime=1521304817079, closeTime=1521304839047, open=66, close=88, count=3}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=99, low=66, openTime=1521304817079, closeTime=1521304848350, open=66, close=99, count=4}
key: [constkey#1521304800000/1521304900000], value:MarketInfoRecord{high=100.0, low=66, openTime=1521304817079, closeTime=1521304862006, open=66, close=100.0, count=5}
So we need to deduplicate, as described in the link you posted (38945277/7897191), right?
So, I want to know if I could do something like:
KGroupedStream<String, JsonNode> groupedStream = transactionKStream.groupByKey();
// as the key is the unique txId, this grouping is just for the next window operation; the record count doesn't change
KTable<Windowed<String>, MarketInfoRecord> kTable =
    groupedStream.SOME_METHOD(
        // just some method to deliver the records into different windows,
        // not sure if this is possible?
        TimeWindows.of(TimeUnit.SECONDS.toMillis(100)).until(TimeUnit.SECONDS.toMillis(100))
        // use until here to let the records be purged when they fall out of the window,
        // please correct me if I am wrong?
so that we could transform the time-based series of input records into several windowed groups, each group having its window (or using the window start time and end time combined as a string key).
For each group the key is different, but it has several records with different values; then we do an aggregation (no need for a windowed aggregation here), the values get calculated, and from each key:value pair, i.e. , we could get one result record.
The next window has a different window-based key name, so in this way the downstream execution should have multiple threads (as the key changes).
I suggest you do all the calculations you mention not in a foreach but directly in your aggregator, that is, in the adder:
(k, v, aggregate) -> aggregate.add(k,v), /* adder */
The add method can do all the things you mentioned (I suggest you first map the JsonNode to a Java object; let's call it Transaction). Consider this pseudo-code:
private int low = Integer.MAX_VALUE;    // or whatever type you use to represent prices
private int high = Integer.MIN_VALUE;
private long openTime = Long.MAX_VALUE; // or whatever type you use to represent time
private long closeTime = Long.MIN_VALUE;
...

public InfoRecord add(String key, Transaction tx) {
  if (tx.getPrice() > this.high) this.high = tx.getPrice();
  if (tx.getPrice() < this.low) this.low = tx.getPrice();
  if (tx.getTime() < this.openTime) {
    this.openTime = tx.getTime();
    this.open = tx.getPrice();
  }
  if (tx.getTime() > this.closeTime) {
    this.closeTime = tx.getTime();
    this.close = tx.getPrice();
  }
  return this;
}
Keep in mind that you may in reality get more than one record on output for each window as the windows can be updated multiple times (they're never final) as is explained in more detail here: https://stackoverflow.com/a/38945277/7897191
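As an aside (not from the original answer): in Kafka Streams 0.11 the number of intermediate updates emitted per window can be reduced, though not eliminated, by enlarging the record cache and the commit interval, which is what the linked answer discusses. The values below are arbitrary examples:

Properties props = new Properties();
// Larger cache: more updates to the same window key are merged before being forwarded downstream.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
// Longer commit interval: the cache is flushed less often.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000);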
I don't know what a K-line is, but if you want multiple windows of increasing duration, the pattern is outlined here.
UPDATE:
To aggregate all records in a window, just change the key to some static value before doing the aggregation. So to create your grouped stream you can use groupBy(KeyValueMapper), something like:
KGroupedStream<String, JsonNode> groupedStream = transactionKStream.groupBy( (k, v) -> ""); // give all records the same key (empty string)
Please be aware that this will cause repartitioning (since the partition is determined by the key and we're changing the key) and the execution downstream will become single-threaded (since there will now be just one partition).

Flink: join file with kafka stream

I have a problem I can't really figure out.
So I have a kafka stream that contains some data like this:
{"adId":"9001", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
And I want to replace 'adId' with another value 'bookingId'.
This value is located in a csv file, but I can't really figure out how to get it working.
Here is my mapping csv file:
9001;8
9002;10
So my output would ideally be something like
{"bookingId":"8", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
This file can be refreshed at least once every hour, so the job should pick up changes to it.
I currently have this code which doesn't work for me:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30000); // create a checkpoint every 30 seconds
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

DataStream<String> adToBookingMapping = env.readTextFile(parameters.get("adToBookingMapping"));
DataStream<Tuple2<Integer, Integer>> input = adToBookingMapping.flatMap(new Tokenizer());

// Kafka consumer
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", parameters.get("bootstrap.servers"));
properties.setProperty("group.id", parameters.get("group.id"));
FlinkKafkaConsumer010<ObjectNode> consumer = new FlinkKafkaConsumer010<>(parameters.get("inbound_topic"), new JSONDeserializationSchema(), properties);
consumer.setStartFromGroupOffsets();
consumer.setCommitOffsetsOnCheckpoints(true);

DataStream<ObjectNode> logs = env.addSource(consumer);
DataStream<Tuple4<Integer, String, Integer, Float>> parsed = logs.flatMap(new Parser());

// output -> bookingId, action, impressions, sum
DataStream<Tuple4<Integer, String, Integer, Float>> joined = runWindowJoin(parsed, input, 3);

public static DataStream<Tuple4<Integer, String, Integer, Float>> runWindowJoin(
        DataStream<Tuple4<Integer, String, Integer, Float>> parsed,
        DataStream<Tuple2<Integer, Integer>> input, long windowSize) {
    return parsed.join(input)
            .where(new ParsedKey())
            .equalTo(new InputKey())
            .window(TumblingProcessingTimeWindows.of(Time.of(windowSize, TimeUnit.SECONDS)))
            //.window(TumblingEventTimeWindows.of(Time.milliseconds(30000)))
            .apply(new JoinFunction<Tuple4<Integer, String, Integer, Float>, Tuple2<Integer, Integer>, Tuple4<Integer, String, Integer, Float>>() {
                private static final long serialVersionUID = 4874139139788915879L;

                @Override
                public Tuple4<Integer, String, Integer, Float> join(
                        Tuple4<Integer, String, Integer, Float> first,
                        Tuple2<Integer, Integer> second) {
                    return new Tuple4<Integer, String, Integer, Float>(second.f1, first.f1, first.f2, first.f3);
                }
            });
}
The code only runs once and then stops, so it doesn't convert new entries in kafka using the csv file. Any ideas on how I could process the stream from Kafka with the latest values from my csv file?
Kind regards,
darkownage
Your goal appears to be to join streaming data with a slow-changing catalog (i.e. a side input). I don't think the join operation is useful here because it doesn't store the catalog entries across windows. Also, the text file is a bounded input whose lines are read once.
Consider using connect to create a connected stream, and store the catalog data as managed state to perform lookups into. The operator's parallelism would need to be 1.
You may find a better solution by researching 'side inputs', looking at the solutions that people use today. See FLIP-17 and Dean Wampler's talk at Flink Forward.
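For illustration only, here is a minimal, hypothetical sketch of the connect-based approach, reusing the parsed and input streams from the question and assuming f0 of parsed carries the adId before mapping. It keeps the catalog in a plain in-memory map rather than checkpointed managed state (which a production job would want), and relies on parallelism 1 as noted above:

DataStream<Tuple4<Integer, String, Integer, Float>> joined = parsed
        .connect(input)
        .flatMap(new RichCoFlatMapFunction<Tuple4<Integer, String, Integer, Float>,
                                           Tuple2<Integer, Integer>,
                                           Tuple4<Integer, String, Integer, Float>>() {
            // adId -> bookingId catalog; with parallelism 1 every log record sees all mappings
            private final Map<Integer, Integer> adToBooking = new HashMap<>();

            @Override
            public void flatMap1(Tuple4<Integer, String, Integer, Float> log,
                                 Collector<Tuple4<Integer, String, Integer, Float>> out) {
                Integer bookingId = adToBooking.get(log.f0);
                if (bookingId != null) {
                    out.collect(new Tuple4<>(bookingId, log.f1, log.f2, log.f3));
                }
            }

            @Override
            public void flatMap2(Tuple2<Integer, Integer> mapping,
                                 Collector<Tuple4<Integer, String, Integer, Float>> out) {
                adToBooking.put(mapping.f0, mapping.f1); // refresh / extend the catalog
            }
        }).setParallelism(1);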

How to sessionize stream with Apache Flink?

I want to sessionize this stream: 1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,0,3,3,3,5, ... to these sessions:
1,1,1
2,2,2,2,2
3,3,3,3,3,3,3
0
3,3,3
5
I've written a CustomTrigger to detect when stream elements change from 1 to 2 (2 to 3, 3 to 0 and so on) and then fire the trigger. But this is not the solution, because when I process the first element of the 2's and fire the trigger, the window will be [1,1,1,2], whereas I need to fire the trigger on the last element of the 1's.
Here is the pseudo-code of the onElement function in my custom trigger class:
override def onElement(element: Session, timestamp: Long, window: W, ctx: TriggerContext): TriggerResult = {
  if (prevState == element.value) {
    prevState = element.value
    TriggerResult.CONTINUE
  } else {
    prevState = element.value
    TriggerResult.FIRE
  }
}
How can I solve this problem?
I think a FlatMapFunction with a ListState is the easiest way to implement this use-case.
When a new element arrives (i.e., the flatMap() method is called), you check whether the value changed. If the value did not change, you append the element to the state. If the value changed, you emit the current list state as a session, clear the list, and insert the new element as the first element of the list state.
However, you should keep in mind that this assumes the order of elements is preserved. Flink ensures this within a partition, i.e., as long as elements are not shuffled and all operators run with the same parallelism.
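A minimal Java sketch of that idea (the class and state names are illustrative), assuming the stream has been keyed (e.g. keyBy on a constant or a session key) so keyed ListState is available:

public class SessionEmitter extends RichFlatMapFunction<Integer, List<Integer>> {

  private transient ListState<Integer> current;

  @Override
  public void open(Configuration parameters) {
    current = getRuntimeContext().getListState(
        new ListStateDescriptor<>("current-session", Integer.class));
  }

  @Override
  public void flatMap(Integer value, Collector<List<Integer>> out) throws Exception {
    // Rebuild the buffered session from state.
    List<Integer> buffered = new ArrayList<>();
    for (Integer v : current.get()) {
      buffered.add(v);
    }
    if (!buffered.isEmpty() && !buffered.get(buffered.size() - 1).equals(value)) {
      // Value changed: emit the finished session and start a new one.
      out.collect(buffered);
      current.clear();
    }
    current.add(value);
  }
}

As in the answer above, this relies on in-order processing with unchanged parallelism; note also that the final, still-open session is only emitted when a later value change arrives, so end-of-stream handling would need extra care.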

Batching large result sets using Rx

I've got an interesting question for Rx experts. I have a relational table keeping information about events. An event consists of an id, a type and the time at which it happened. In my code, I need to fetch all the events within a certain, potentially wide, time range.
SELECT * FROM events WHERE event.time > :before AND event.time < :after ORDER BY time LIMIT :batch_size
To improve reliability and deal with large result sets, I query the records in batches of size :batch_size. Now, I want to write a function that, given :before and :after, will return an Observable representing the result set.
Observable<Event> getEvents(long before, long after);
Internally, the function should query the database in batches. The distribution of events along the time scale is unknown. So the natural way to address batching is this:
fetch first N records
if the result is not empty, use the last record's time as a new 'before' parameter, and fetch the next N records; otherwise terminate
if the result is not empty, use the last record's time as a new 'before' parameter, and fetch the next N records; otherwise terminate
... and so on (the idea should be clear)
My question is:
Is there a way to express this function in terms of higher-level Observable primitives (filter/map/flatMap/scan/range etc), without using the subscribers explicitly?
So far, I've failed to do this, and come up with the following straightforward code instead:
private void observeGetRecords(long before, long after, Subscriber<? super Event> subscriber) {
    long start = before;
    while (start < after) {
        final List<Event> records;
        try {
            records = getRecordsByRange(start, after);
        } catch (Exception e) {
            subscriber.onError(e);
            return;
        }
        if (records.isEmpty()) break;
        records.forEach(subscriber::onNext);
        start = Iterables.getLast(records).getTime();
    }
    subscriber.onCompleted();
}

public Observable<Event> getRecords(final long before, final long after) {
    return Observable.create(subscriber -> observeGetRecords(before, after, subscriber));
}
Here, getRecordsByRange implements the SELECT query using DBI and returns a List. This code works fine, but it lacks the elegance of high-level Rx constructs.
NB: I know that I can return Iterator as a result of SELECT query in DBI. However, I don't want to do that, and prefer to run multiple queries instead. This computation does not have to be atomic, so the issues of transaction isolation are not relevant.
Although I don't fully understand why you want such time-reuse, here is how I'd do it:
// Feed the batch start times through a subject so each batch can schedule the next one.
BehaviorSubject<Long> start = BehaviorSubject.create(0L);

start
    .subscribeOn(Schedulers.trampoline())
    .flatMap(tstart ->
        getEvents(tstart, tstart + twindow)
            .publish(o ->
                o.takeLast(1)
                 // push the last event's time back into the subject to trigger the next batch
                 .doOnNext(r -> start.onNext(r.time))
                 .ignoreElements()
                 .mergeWith(o)
            )
    )
    .subscribe(...)

Empty data while reading data from kafka using Trident Topology

I am new to Trident. I am writing a Trident topology which reads data from Kafka. The topic name is 'test'. I have a local Kafka setup: I started ZooKeeper and Kafka locally, created a topic 'test' in Kafka, opened the producer and typed the message 'Hello Kafka!'.
I want to read the message 'Hello Kafka' from the 'test' topic using Trident.
Below is my code. I am getting an empty tuple.
TridentTopology topology = new TridentTopology();
BrokerHosts brokerHosts = new ZkHosts("localhost:2181");
TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(brokerHosts, "test");
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
kafkaConfig.bufferSizeBytes = 1024 * 1024 * 4;
kafkaConfig.fetchSizeBytes = 1024 * 1024 * 4;
kafkaConfig.forceFromStart = false;
OpaqueTridentKafkaSpout opaqueTridentKafkaSpout = new OpaqueTridentKafkaSpout(kafkaConfig);

topology.newStream("TestSpout", opaqueTridentKafkaSpout).parallelismHint(1)
        .each(new Fields(), new TestFilter()).parallelismHint(1)
        .each(new Fields(), new Utils.PrintFilter());
and this is my TestFilter class code
public TestFilter()
{
    //
}

@Override
public boolean isKeep(TridentTuple tuple) {
    boolean isKeep = true;
    System.out.println("TestFilter is called...");
    if (tuple != null && tuple.getValues().size() > 0) {
        System.out.println("data from kafka ::: " + tuple.getValues());
    }
    return isKeep;
}
Whenever I type a message into the Kafka producer for the 'test' topic, the first sysout gets printed but the if condition is never satisfied. I simply get the message 'TestFilter is called...' and nothing more.
I want to get the actual data I produced to the 'test' topic. How?
The problem lies in the parameters to Stream.each. The relevant portion of the javadoc for the method is:
each(Fields inputFields, Filter filter)
The documentation isn't too clear about it, but the semantics are that you should specify all the fields used by your filter via the inputFields parameter.
Storm will then apply a projection on the input tuple and forward it to the filter.
Given that you didn't specify any input fields, the projection resulted in an empty tuple, which made the tuple.getValues().size() > 0 condition inside the filter fail.
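As a hedged sketch of the fix (not from the original answer): pass the spout's output field to each() so the projection is no longer empty. storm-kafka's StringScheme declares a single field, typically named "str"; adjust the name if your scheme differs:

topology.newStream("TestSpout", opaqueTridentKafkaSpout).parallelismHint(1)
        .each(new Fields("str"), new TestFilter()).parallelismHint(1)
        .each(new Fields("str"), new Utils.PrintFilter());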
It's also worth mentioning the other variants of each:
each(Fields inputFields, Function function, Fields functionFields)
each(Function function, Fields functionFields)
These will apply the provided function to the projection of the input tuple, appending the resulting tuple to the original input tuple and naming the new fields functionFields (i.e. the projection is only used for applying the function).
In particular, the second version is equivalent to invoking each with inputFields set to null (or new Fields()) and will result in an empty tuple getting passed to the function.
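For illustration only, a hypothetical use of the first variant (assuming a Trident Stream named stream whose tuples carry the "str" field from above) that appends an upper-cased copy of the payload:

stream.each(new Fields("str"), new BaseFunction() {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        // Emits one new value that gets appended to the input tuple as the "upper" field.
        collector.emit(new Values(tuple.getString(0).toUpperCase()));
    }
}, new Fields("upper"));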