I tried to create my first real-time analytics job in Flink. The approach is kappa-architecture-like, so I have my raw data on Kafka where we receive a message for every change of state of any entity.
So the messages are of the form:
(id, newStatus, timestamp)
We want to compute, for every time window, the count of items in a given status. So the output should be of the form:
(outputTimestamp, state1:count1, state2:count2, ...)
or equivalent. These rows should contain, at any given time, the count of items in each status, where the status associated with an id is the one from the most recent message observed for that id. The status for an id should be counted in any case, even if the event is much older than the ones currently being processed. So the sum of all the counts should equal the number of distinct ids observed in the system. A later step could be forgetting about items in a final state after a while, but this is not a strict requirement right now.
This will be written to Elasticsearch and then queried.
I tried many different approaches and none of them fully satisfied the requirement. Using a sliding window I could easily achieve the expected behaviour, except that when the beginning of the sliding window moved past the timestamp of an event, that event was lost for the count, as you might expect. Other approaches failed to be consistent when working with a backlog, because I relied on tricks with keys and timestamps that broke when the data was processed all at once.
So I would like to know, even at a high level, how I should approach this problem. It looks like a relatively common use case, but the fact that the relevant information for a given id must be retained indefinitely to count the entities correctly creates a lot of problems.
I think I have a solution for your problem:
Given a DataStream of (id, state, time) as:
val stateUpdates: DataStream[(Long, Int, Long)] = ??? // (id, state, time)
You derive the actual state changes as follows:
val stateCntUpdates: DataStream[(Int, Int)] = stateUpdates // (state, cntUpdate)
  .keyBy(_._1) // key by id
  .flatMap(new StateUpdater)
StateUpdater is a stateful FlatMapFunction. It has a keyed state that stores the last state of each id. For each input record it returns two state count update records: (oldState, -1), (newState, +1). The (oldState, -1) record ensures that counts of previous states are reduced.
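A minimal sketch of such a StateUpdater, assuming the input tuples are (id: Long, state: Int, ts: Long) and that an id which has never been seen before has no previous state to retract:
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

class StateUpdater extends RichFlatMapFunction[(Long, Int, Long), (Int, Int)] {

  // keyed state holding the last observed state per id
  private var lastState: ValueState[Integer] = _

  override def open(parameters: Configuration): Unit = {
    lastState = getRuntimeContext.getState(
      new ValueStateDescriptor[Integer]("last-state", classOf[Integer]))
  }

  override def flatMap(in: (Long, Int, Long), out: Collector[(Int, Int)]): Unit = {
    val previous = lastState.value() // null if this id has not been seen before
    if (previous != null) {
      out.collect((previous.intValue(), -1)) // retract the count of the previous state
    }
    lastState.update(in._2)
    out.collect((in._2, 1)) // count the new state
  }
}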
Next you aggregate the state count changes per state and window:
val cntUpdatesPerWindow: DataStream[(Int, Int, Long)] = stateCntUpdates // (state, cntUpdate, time)
  .keyBy(_._1) // key by state
  .timeWindow(Time.minutes(10)) // window should be non-overlapping, e.g. tumbling
  .apply(new SumReducer(), new YourWindowFunction())
SumReducer sums the cntUpdates and YourWindowFunction assigns the timestamp of your window. This step aggregates all state changes for each state in a window.
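They could look roughly like this (using the window end as the output timestamp is an assumption; pick whatever timestamp fits your output schema):
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// pre-aggregates the count updates of one state within a window
class SumReducer extends ReduceFunction[(Int, Int)] {
  override def reduce(a: (Int, Int), b: (Int, Int)): (Int, Int) =
    (a._1, a._2 + b._2) // same state key, summed count update
}

// attaches a timestamp (here: the window end) to the pre-aggregated record
class YourWindowFunction extends WindowFunction[(Int, Int), (Int, Int, Long), Int, TimeWindow] {
  override def apply(state: Int, window: TimeWindow, input: Iterable[(Int, Int)],
                     out: Collector[(Int, Int, Long)]): Unit = {
    val (s, cntUpdate) = input.iterator.next() // exactly one pre-aggregated record per state and window
    out.collect((s, cntUpdate, window.getEnd))
  }
}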
Finally, we adjust the current count with the count updates.
val stateCnts: DataStream[(Int, Int, Long)] = cntUpdatesPerWindow // (state, count, time)
  .keyBy(_._1) // key by state again
  .map(new CountUpdater)
CountUpdater is a stateful MapFunction. It has a keyed state that stores the current count for each state. For each incoming record, the count is adjusted and a record (state, newCount, time) is emitted.
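A possible CountUpdater, assuming a count of zero for states that have not been seen yet:
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration

class CountUpdater extends RichMapFunction[(Int, Int, Long), (Int, Int, Long)] {

  // keyed state holding the current count per state
  private var currentCount: ValueState[Integer] = _

  override def open(parameters: Configuration): Unit = {
    currentCount = getRuntimeContext.getState(
      new ValueStateDescriptor[Integer]("current-count", classOf[Integer]))
  }

  override def map(in: (Int, Int, Long)): (Int, Int, Long) = {
    val old = Option(currentCount.value()).map(_.intValue()).getOrElse(0) // null means no count yet
    val updated = old + in._2 // apply the windowed count update
    currentCount.update(updated)
    (in._1, updated, in._3) // (state, newCount, time)
  }
}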
Now you have a stream with updated counts for each state (one record per state and window). If possible, you can update your Elasticsearch index with these records. If you need to collect all state counts for a given time, you can do that with a window, as sketched below.
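For example, a sketch that collects the per-state counts of each window into a single record (reusing the tumbling window size from above; the (windowEnd, stateToCount) output format is just one possibility):
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val countsPerWindow: DataStream[(Long, Map[Int, Int])] = stateCnts
  .timeWindowAll(Time.minutes(10)) // align with the tumbling window used above
  .apply { (window: TimeWindow, counts: Iterable[(Int, Int, Long)], out: Collector[(Long, Map[Int, Int])]) =>
    out.collect((window.getEnd, counts.map(c => c._1 -> c._2).toMap)) // state -> latest count
  }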
Please note: The state size of this program depends on the number of unique ids. That might cause problems if the id space grows very fast.
Related
I'm trying to come up with a solution that applies some logic after the join operation to pick one event from streamB among the multiple EventBs. It would be like a reduce function, but one that returns a single element instead of reducing incrementally. So the end result would be a single (EventA, EventB) pair instead of the cross product of one EventA with multiple EventBs.
streamA
.keyBy((a: EventA) => a.common_key)
.intervalJoin(
streamB
.keyBy((b: EventB) => b.common_key)
)
.between(Time.days(-30), Time.days(0))
.process(new MyJoinFunction)
The data would be ingested like (assuming they have the same key):
EventB ts: 1616686386000
EventB ts: 1616686387000
EventB ts: 1616686388000
EventB ts: 1616686389000
EventA ts: 1616686390000
Each EventA key is guaranteed to arrive only once.
Assume a join like the one above produced one EventA joined with four EventBs, successfully collected in MyJoinFunction. Now what I want to do is access these values at once and apply some logic to match the EventA to exactly one EventB.
For example, for the above dataset I need (EventA 1616686390000, EventB 1616686387000).
MyJoinFunction will be invoked for each (EventA, EventB) pair, but I'd like an operation after this that gives me access to an iterator so I can look through all EventB events for each EventA.
I am aware that I can apply another windowing operation after the join to group all the pairs, but I want this to happen immediately after the join succeeds. So if possible, I'd like to avoid adding another window since my window is already large (30 days).
Is Flink the correct choice for this use case, or am I completely on the wrong track?
This could be implemented as a KeyedCoProcessFunction. You would key both streams by their common key, connect them, and process both streams together.
You can use ListState to store the events from B (for a given key), and ValueState for A (again, for a given key). You can use an event time timer to know when the time has come to look through the B events in the ListState, and produce your result. Don't forget to clear the state once you are finished with it.
If you're not familiar with this part of the Flink API, the tutorial on Process Functions should be helpful.
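A minimal sketch of such a KeyedCoProcessFunction, assuming the common key is a String, that EventA and EventB expose a ts field, a one-minute event-time grace period after the EventA before deciding, and a placeholder selection rule (picking the earliest EventB in the 30-day range); adapt these assumptions to your actual types and logic:
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

class PickOneJoinFunction extends KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)] {

  private val thirtyDaysMs = 30L * 24 * 60 * 60 * 1000

  private var eventA: ValueState[EventA] = _
  private var eventsB: ListState[EventB] = _

  override def open(parameters: Configuration): Unit = {
    eventA = getRuntimeContext.getState(new ValueStateDescriptor[EventA]("event-a", classOf[EventA]))
    eventsB = getRuntimeContext.getListState(new ListStateDescriptor[EventB]("events-b", classOf[EventB]))
  }

  override def processElement1(a: EventA,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#Context,
      out: Collector[(EventA, EventB)]): Unit = {
    eventA.update(a)
    // decide once the watermark passes EventA's timestamp plus a grace period for late EventBs
    ctx.timerService().registerEventTimeTimer(a.ts + 60 * 1000L)
  }

  override def processElement2(b: EventB,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#Context,
      out: Collector[(EventA, EventB)]): Unit = {
    eventsB.add(b) // buffer all B events for this key
  }

  override def onTimer(timestamp: Long,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#OnTimerContext,
      out: Collector[(EventA, EventB)]): Unit = {
    val a = eventA.value()
    val candidates = eventsB.get().asScala.toList
    if (a != null && candidates.nonEmpty) {
      // application-specific choice; here, for illustration, the earliest EventB within the last 30 days
      val chosen = candidates.filter(b => b.ts <= a.ts && b.ts >= a.ts - thirtyDaysMs).minBy(_.ts)
      out.collect((a, chosen))
    }
    eventA.clear() // clear the state once finished
    eventsB.clear()
  }
}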
As far as I understand, the changelog topic for a window aggregation should contain at least one key/value for each window?
input
.groupByKey() // group by user
.windowedBy(
TimeWindows
.of(Duration.ofSeconds(60))
.advanceBy(Duration.ofSeconds(10))
.grace(Duration.ofSeconds(60)))
.aggregate(
() -> new Aggregate(config),
(userId, msg, aggregate) -> aggregate.addAndReturn(msg),
Materialized
.<String, Aggregate>as(inMemoryWindowStore(
config.getOutputStore(),
Duration.ofSeconds(300),
Duration.ofSeconds(60),
false))
.withCachingDisabled()
.withKeySerde(Serdes.String())
.withValueSerde(new MyCustomSerde()));
When I query the state store, I would expect to get one key/value for each window:
WindowStoreIterator<Aggregate> iter = store.fetch(userId, start, end)
But either I don't get anything back (the iterator is empty), or I sometimes get fewer results than the actual number of windows between start and end.
You are using the parameters of store.fetch(key, startTs, endTs) incorrectly. The two timestamps startTs and endTs do not refer to a single window's start and end timestamps; they define a time range: fetch() returns all windows whose start timestamp falls within that range.
The JavaDocs in older versions are, to be fair, not very good and may be misleading. Newer versions have improved JavaDocs: https://kafka.apache.org/23/javadoc/org/apache/kafka/streams/state/ReadOnlyWindowStore.html
Note that the parameters were changed to different types and renamed in newer versions:
WindowStoreIterator<V> fetch(K key,
Instant from,
Instant to)
Get all the key-value pairs with the given key and the time range from all the existing windows.
This iterator must be closed after use.
The time range is inclusive and applies to the starting timestamp of the window.
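For example, with the newer Instant-based signature, something like this reads every window for userId whose start timestamp falls into a query range (the queryStartMs/queryEndMs bounds and variable names are illustrative):
import java.time.Instant

val from = Instant.ofEpochMilli(queryStartMs) // lower bound on window *start* timestamps
val to   = Instant.ofEpochMilli(queryEndMs)   // upper bound on window *start* timestamps
val iter = store.fetch(userId, from, to)
try {
  while (iter.hasNext) {
    val entry = iter.next()       // KeyValue[java.lang.Long, Aggregate]
    val windowStart = entry.key   // start timestamp of this window
    val aggregate = entry.value
    // ... use windowStart and aggregate ...
  }
} finally {
  iter.close() // the iterator must be closed after use
}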
KGroupedTable.count() is returning negative values?
idAndJobTransaction
.filter((k,v) -> v!=null)
.mapValues(jobTransaction -> {
jobTransaction.setCount(0);
jobTransaction.setId(0L);
jobTransaction.setRunsheet_id(0L);
jobTransaction.setTimestamp(0L);
if(jobTransaction.getDelete_flag() == 1)
return null;
else
return jobTransaction;
} )
.groupBy((id,jobTransaction)->new KeyValue<>(jobTransaction,jobTransaction),Serialized.with(jobTransactionSerde,jobTransactionSerde))
.count()
.toStream()
.mapValues((k,v)-> new JobSummary(k,v))
.peek((k,v)->{
log.info(k.toString());
log.info(v.toString());
}).selectKey((k,v)-> v.getCompany_id()) // So that the count is consumed in order for each company
.to(JOB_SUMMARY,Produced.with(Serdes.Long(),jobSummarySerde));
The count method sometimes returns negative values. Around 1% of the values are negative. How is that possible?
EDIT 1:
I push the results of this aggregation to a Postgres table. The negative values are not limited to -1; some have very large magnitudes.
I am using 2 consumers. Does that make any difference?
Can this be an issue with Kafka Streams, or should I look into other possible causes?
EDIT 3:
I was able to capture some of the available logs and I did see the negative values in the peek. As for the JobSummary class, it's really a very simple POJO. Here's the constructor called in the Kafka Streams app.
public JobSummary(JobTransaction j, Long count){
this.setUser_id(j.getUser_id());
this.setHub_id(j.getHub_id());
this.setCity_id(j.getCity_id());
this.setCompany_id(j.getCompany_id());
this.setJob_master_id(j.getJob_master_id());
this.setJob_status_id(j.getJob_status_id());
this.setCount(count);
this.setDate(j.getDate());
}
I guess (it's the only explanation I can come up with) that this is a special corner case. First you have to understand how a KTable aggregation works internally. This is explained in a different question: TopologyTestDriver sending incorrect message on KTable aggregations
With this background, a negative count can happen if the current count in the result table is zero and the upstream base table (i.e., idAndJobTransaction) gets an idempotent update (i.e., a record in the base table is updated from <K,V> to <K,V>). This results in one subtraction record and one addition record that go to the same row in the result table (note that Kafka Streams does not compare the old and new value on a table update and blindly assumes that both are different). The subtraction and addition records are sent downstream independently, and the downstream count() updates its result in two steps: the count in the result table goes from 0 to -1 when the subtraction record is processed, and back from -1 to 0 when the addition record is processed.
From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of occurrences of key k in the last n seconds.
Given a topic (t1), if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won't be able to join on a windowed key, instead of using the table above I am mapping the stream to a table with a simple key:
t1_Stream.groupByKey()
.windowedBy(TimeWindows.of( n*1000)).count()
.toStream().map((k,v)->new KeyValue<>(k.key(), Math.toIntExact(v))).to(frequency_topic);
KTable<Integer,Integer> t1_frequency_table = builder.table(frequency_topic);
If I now look up this table when a new record arrives in my stream, how do I know whether the lookup table will be updated first or the join will happen first (the latter would add the stale frequency to the record rather than the current one)? Would it be better to create a stream instead of a table and then do a windowed join?
I want to lookup the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this? Thanks.
For the particular program you shared, the stream-side record will be processed first. The reason is that you pipe the data through a topic:
When a record is processed, it updates the aggregation result, which emits an update record that is written to the through topic. Directly afterwards, the same record is processed by the join operator. Only later will a new poll() call eventually read the aggregation result from the through topic and update the table side of the join.
Using the DSL, it does not seem possible to achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join and provides the semantics you need.
After upgrading to Spark 1.6.1, I've started refactoring an application to replace updateStateByKey with mapWithState.
In order to take advantage of the performance benefits of the new API, I don't want to call stateSnapshots, which loads all states. I only want the updated states.
The mapWithState API returns a DStream of [key, input, state, output], where each state is the partially updated state after a single input has been ingested. How can I extract only the latest states from this DStream (i.e., the state after all corresponding inputs have been ingested/mapped)?
I can do a map (to drop the input and output) and a reduceByKey on the MapWithStateDStream, choosing the state with the newer timestamp (which I set inside the update function), but I have no guarantee there won't be two partial states with the same timestamp, even when using a custom per-key partitioner.
How can I tell which partial state is the latest in the MapWithStateDStream output of mapWithState?
The mapping function of mapWithState is only called for keys that are updated in the current micro-batch. One way to achieve what you want is to return Some[S] when the state has been updated.
StateSpec.function takes a method with the following signature:
mappingFunction:
(Time, KeyType, Option[ValueType], State[StateType]) => Option[MappedType]
What we can do is make sure that our Option[MappedType] is Some[MappedType] whenever the value has been updated, and None otherwise.
For example:
def updateState(key: Int, value: Option[Int], state: State[Int]): Option[Int] = {
value match {
case Some(something) if something > 10 =>
val updatedVal = something * something
state.update(updatedVal)
Some(updatedVal)
case _ => None
}
}
And then you can do:
val spec = StateSpec.function(updateState _)
yourKeyedStream.mapWithState(spec).filter(!_.isEmpty).foreachRDD(/* do stuff on updated state */) // yourKeyedStream is the DStream[(Int, Int)] the state is keyed by
This way you filter out any non-updated state and keep only the updated snapshots you're looking for.
One solution that would work, if your update algorithm allows it, is to call reduceByKey on the input stream before calling mapWithState. Then there would be only a single update for each key per micro-batch and no partial states in the output.
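For example (a sketch; pairStream, mergeValues, and spec stand in for your keyed DStream, your own merge logic, and your StateSpec):
// Collapse all values per key within the micro-batch first, so mapWithState is
// invoked at most once per key and emits no intermediate partial states.
val merged = pairStream.reduceByKey((a, b) => mergeValues(a, b))
val updated = merged.mapWithState(spec) // one final state update per key and batch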