I'm using Kafka to process log events. I have basic knowledge of Kafka Connect and Kafka Streams for simple connectors and stream transformations.
Now I have a log file with the following structure:
timestamp event_id event
A log event consists of multiple log lines that are connected by the event_id (for example, a mail log).
Example:
1234 1 START
1235 1 INFO1
1236 1 INFO2
1237 1 END
And in general there are multiple events:
Example:
1234 1 START
1234 2 START
1235 1 INFO1
1236 1 INFO2
1236 2 INFO3
1237 1 END
1237 2 END
The time window (between START and END) could be up to 5 minutes.
As a result I want a topic like
event_id combined_log
Example:
1 START,INFO1,INFO2,END
2 START,INFO3,END
What are the right tools to achieve this? I tried to solve it with Kafka Streams but I can't figure out how.
In your use case you are essentially reconstructing sessions or transactions based on message payloads. At the moment there is no built-in, ready-to-use support for such functionality. However, you can use the Processor API part of Kafka's Streams API to implement this functionality yourself. You can write custom processors that use a state store to track when, for a given key, a session/transaction is STARTed, added to, and ENDed.
Some users in the mailing lists have been doing that IIRC, though I am not aware of an existing code example that I could point you to.
What you need to watch out for is to properly handle out-of-order data. In your example above you listed all input data in proper order:
1234 1 START
1234 2 START
1235 1 INFO1
1236 1 INFO2
1236 2 INFO3
1237 1 END
1237 2 END
In practice though, messages/records may arrive out-of-order, like so (I only show messages with key 1 to simplify the example):
1234 1 START
1237 1 END
1236 1 INFO2
1235 1 INFO1
Even if that happens, I understand that in your use case you still want to interpret this data as: START -> INFO1 -> INFO2 -> END rather than START -> END (ignoring/dropping INFO1 and INFO2 = data loss) or START -> END -> INFO2 -> INFO1 (incorrect order, probably also violating your semantic constraints).
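To make the state-store idea more concrete, here is a minimal, untested sketch of such a custom processor. It assumes the input has already been re-keyed by event_id, that record values look like "1234 START", and a state store named "log-lines-store" (all assumptions on my side). It naively emits as soon as END is seen, sorting the buffered lines by timestamp; lines that arrive after END would additionally need something like a punctuator with a grace period.

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.Collectors;

public class SessionAssembler implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> store;   // event_id -> buffered "timestamp event" lines

    @Override
    public void init(final ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("log-lines-store");   // assumed store name
    }

    @Override
    public void process(final Record<String, String> record) {
        final String eventId = record.key();        // e.g. "1"
        final String line = record.value();         // e.g. "1234 START"
        final String buffered = store.get(eventId);
        final String updated = (buffered == null) ? line : buffered + "|" + line;
        store.put(eventId, updated);

        // Emit the combined log once END has been seen, sorted by timestamp to
        // tolerate the out-of-order arrival discussed above.
        if (line.endsWith("END")) {
            final String combined = Arrays.stream(updated.split("\\|"))
                    .sorted(Comparator.comparingLong(l -> Long.parseLong(l.split(" ")[0])))
                    .map(l -> l.split(" ")[1])
                    .collect(Collectors.joining(","));
            context.forward(new Record<>(eventId, combined, record.timestamp()));
            store.delete(eventId);
        }
    }
}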
I would like to do some further operations on a windowed KTable. To give some background, I have a topic with data in the form of: {clientId, txTimestamp, txAmount}. From this topic, I have created a stream, partitioned by clientId, with the underlying topic timestamp equal to the txTimestamp event field. Starting from this stream, I want to aggregate the number of transactions per clientId in 1-hour tumbling windows. This is done with something similar to the following:
CREATE TABLE transactions_per_client
  WITH (kafka_topic='transactions_per_client_topic') AS
  SELECT clientId,
         COUNT(*) AS transactions_per_client,
         WINDOWSTART AS window_start,
         WINDOWEND AS window_end
  FROM transactions_stream
  WINDOW TUMBLING (SIZE 1 HOURS)
  GROUP BY clientId;
The aggregations work as expected and yield values similar to:
ClientId  Transactions_per_client  WindowStart  WindowEnd
1         12                       1            2
2         8                        1            2
1         24                       2            3
1         19                       3            4
What I want to do now is further process this table to add a column that represents the difference in number of transactions per client between 2 adjacent windows for the same client. For the previous table, that would be something like this:
ClientId  Transactions_per_client  WindowStart  WindowEnd  Deviation
1         12                       1            2          0
2         8                        1            2          0
1         24                       2            3          12
1         19                       3            4          -5
What would be the best way to achieve this (using either Kafka Streams or ksqlDB)? I tried to create this column with a user-defined aggregation function, but it can only be applied to a KStream, not to a KTable.
Just for future reference, the official answer at this time (April 2022) is that it cannot be done in Kafka Streams through the DSL, as "Windowed-TABLE are kind of a “dead end” in ksqlDB atm, and also for Kafka Streams, you cannot really use the DSL to further process the data" (answer on the Confluent forum here: https://forum.confluent.io/t/aggregations-on-windowed-ktables/4340). The suggestion there is to use the Processor API, which can indeed be pretty straightforward to implement. In high-level pseudocode, it would be something like this:
topology.addSource(
    NAME_OF_SOURCE_IN_THE_NEW_TOPOLOGY,
    timeWindowedDeserializer,             // key deserializer for the Windowed<Long> keys
    longDeserializer,                     // value deserializer for the per-window count
    SOURCE_TOPIC);                        // the topic backing the windowed KTable

topology.addProcessor(
    NAME_OF_PROCESSOR_IN_THE_NEW_TOPOLOGY,
    () -> new Aggregator(storeName),
    NAME_OF_SOURCE_IN_THE_NEW_TOPOLOGY);

// keyValueStoreBuilder keyed with the time-windowed serde, with a Long serde for the value
StoreBuilder storeBuilder = ...;
topology.addStateStore(storeBuilder, NAME_OF_PROCESSOR_IN_THE_NEW_TOPOLOGY);

topology.addSink(
    NAME_OF_SINK_IN_THE_NEW_TOPOLOGY,
    sinkTopic,
    timeWindowedSerializer,               // key serializer
    valueSerializer,                      // serializer for the POJO that contains the deviation field
    NAME_OF_PROCESSOR_IN_THE_NEW_TOPOLOGY);
The Aggregator in the previous snippet is an org.apache.kafka.streams.processor.api.Processor implementation that keeps track of the values it has seen and can retrieve the previously seen value for a given key.
Again, at a high level it would be something similar to this:
// previousWindow is assumed to be the windowed key of the preceding window for the
// same client, derived from the current record's window boundaries and the window size.
Long previousTransactionAggregate = kvStore.get(previousWindow);
long deviation;
if (previousTransactionAggregate != null) {
    deviation = kafkaRecord.value() - previousTransactionAggregate;
} else {
    deviation = 0L;   // no earlier window seen for this client yet
}
kvStore.put(kafkaRecord.key(), kafkaRecord.value());

Record<Windowed<Long>, TransactionPerNumericKey> newRecord =
    new Record<>(
        kafkaRecord.key(),
        new TransactionPerNumericKey(
            kafkaRecord.key().key(), kafkaRecord.value(), deviation),
        kafkaRecord.timestamp());
context.forward(newRecord);
TransactionPerNumericKey above is the structure for the enhanced windowed aggregation (the POJO containing the deviation field).
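To make the structure of that processor a bit more concrete, here is a hedged sketch of what the surrounding class could look like. This is my own illustration, not the code from the forum answer: to keep it short it keys the state store by clientId and keeps the last seen count per client, instead of looking up the previous window's key.

import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// TransactionPerNumericKey is the POJO described above (clientId, count, deviation).
public class Aggregator
        implements Processor<Windowed<Long>, Long, Windowed<Long>, TransactionPerNumericKey> {

    private final String storeName;
    private ProcessorContext<Windowed<Long>, TransactionPerNumericKey> context;
    private KeyValueStore<Long, Long> store;   // clientId -> count seen in the previous window

    public Aggregator(final String storeName) {
        this.storeName = storeName;
    }

    @Override
    public void init(final ProcessorContext<Windowed<Long>, TransactionPerNumericKey> context) {
        this.context = context;
        this.store = context.getStateStore(storeName);
    }

    @Override
    public void process(final Record<Windowed<Long>, Long> record) {
        final Long clientId = record.key().key();
        final Long previous = store.get(clientId);
        final long deviation = (previous == null) ? 0L : record.value() - previous;
        store.put(clientId, record.value());

        context.forward(new Record<>(
                record.key(),
                new TransactionPerNumericKey(clientId, record.value(), deviation),
                record.timestamp()));
    }
}

With this variant, the state store in the topology sketch above would be keyed with a Long serde rather than the time-windowed serde.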
I have three data sets:
First, called education.dta. It contains individuals (students) with their attained education for each year from 1990 to 2000. Originally it is in wide format, but I can easily reshape it to long. It is presented as wide below:
id educ_90 educ_91 ... educ_00 cohort
1 0 1 1 87
2 1 1 2 75
3 0 0 2 90
Second, called graduate.dta. It contains information on when individuals (students) finished high school. However, this data set does not span several years; it is only a "snapshot" of the individual when they finish high school, plus characteristics of the individual students such as background (e.g. parents' occupation).
id schoolid county cohort ...
1 11 123 87
2 11 123 75
3 22 243 90
The third data set is called teachers.dta. It contains information about all teachers at the high schools, such as their education, whether they work full or part time, gender, etc. This data set is in long format.
id schoolid county year education
22 11 123 2011 1
21 11 123 2001 1
23 22 243 2015 3
Now I want to merge these three data sets.
First, I want to merge education.dta and graduate.dta on id.
Problem when education.dta is wide: I manage to merge education.dta and graduate.dta. Then I make a loop so that the variables from graduate.dta take the same value over all years, for example:
forvalues j = 1990/2000 {
    gen county`j' = .
    replace county`j' = county
}
However, when I afterwards reshape to long, Stata reports that variable id does not uniquely identify the observations.
Further, I have tried to first reshape education.dta to long, and thereafter merge either 1:m or m:1 with education as master, using graduate.dta.
However, Stata again reports that id is not unique. How do I deal with this?
In the next step I want to merge the above with teachers.dta on schoolid.
I want my final dataset in long format.
Thanks for your help :)
I am not certain that I have exactly the format of your data; it would be helpful if you gave us a toy dataset to look at using dataex (which could even help you figure out the problem yourself!).
But to start, because you are seeing that id is not unique, you need to figure out why there might be multiple observations per id in any of the datasets. Can someone in graduate.dta or education.dta appear more than once? help duplicates will probably be useful to explore the data in this way.
Because you want your dataset in long format, I suggest reshaping education.dta to long first, then doing something like merge m:1 id using "graduate.dta" (once you figure out why some observations are showing up more than once), and then finally something like merge 1:1 schoolid year using "teachers.dta" to get your final dataset.
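For concreteness, here is an untested sketch of that workflow. The filenames come from your post, but the educ_ reshape stub and the final join are assumptions on my part; and since teachers.dta can have several teachers per school and year, I have used joinby for the last step instead of a 1:1 merge, so adjust to whatever your data actually supports.

* untested sketch of the suggested workflow
use "education.dta", clear
duplicates report id                     // check: each id should appear exactly once in wide form
reshape long educ_, i(id) j(year)        // wide -> long: one row per id-year
                                         // note: year takes the two-digit suffix (educ_00 becomes 0), recode if needed

merge m:1 id using "graduate.dta"        // graduate.dta holds one row per id
drop _merge

* teachers.dta is one row per teacher-year, so a plain 1:1 merge on schoolid year will not be unique;
* joinby pairs every student-year row with every matching teacher row instead
joinby schoolid year using "teachers.dta", unmatched(master)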
We have a stream of events each having the following properties:
public class Event {
private String id;
private String src;
private String dst;
}
Besides, we have a set of hierarchical or nested rules we want to model with EPL and Esper. Each rule should be applied if and only if all of its parent rules have been already activated (a matching instance occurred for all of them). For example:
2 events or more with the same src and dst in 10 seconds
+ 5 or more with src, dst the same as the src, dst in the above rule in 20s
+ 100 or more with src, dst the same as the src, dst in the above rules in 30s
We want to retrieve all event instances corresponding to each level of this rule hierarchy.
For example, considering the following events:
id    source         destination    arrival time (second)
1     192.168.1.1    192.168.1.2    1
2     192.168.1.1    192.168.1.2    2
3     192.168.1.1    192.168.1.3    3
4     192.168.1.1    192.168.1.2    4
5     192.168.1.5    192.168.1.8    5
6     192.168.1.1    192.168.1.2    6
7     192.168.1.1    192.168.1.2    7
8     192.168.1.1    192.168.1.2    8
.....
100 other events from 192.168.1.1 to 192.168.1.2 in less than 20 seconds
We want our rule hierarchy to report this instance together with the ids of all events corresponding to each level of the hierarchy. For example, something like the following report is required:
2 or more events with src 192.168.1.1 and dst 192.168.1.2 in 10 seconds (Ids: 1,2)
+ 5 or more with the same src (192.168.1.1) and dst (192.168.1.2) in 20s (Ids:1,2,4,6,7)
+ 100 or more events from 192.168.1.1 to 192.168.1.2 in 30s (Ids:1,2,4,6,7,8,...)
How can we achieve this (retrieve the ids of the events matched with all rules) in Esper EPL?
This is a complex use case and it will take some time to model. I'd start simple by keeping a named window and using some match-recognize or EPL patterns. For the rule nesting, I'd propose triggering other statements using insert-into. A context partition that gets initiated by a triggering event may also come in handy. For the events that are shared between rules, if any, go against the named window using a join or subquery, for example. For the events that arrive after a triggering event of the first or second rule, just use EPL statements that consume the triggering event. Start simple and build it up; become familiar with insert-into and with declaring an overlapping context.
You could use each rule's output as the input for the next rule in the hierarchy. For example, the first rule listens for matching events in the last 10 secs and inserts the results into a new stream with the 'insert into' clause, so the next rule is triggered by events in the new stream, and so on... It is a pretty simple use case and can be done even without context partitions.
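A rough, untested sketch of that chaining in EPL (the Event type name, the derived stream names, the exists-subquery filtering, and the use of the window(...) aggregation to collect the matching ids are all my assumptions, so treat this as a starting point rather than working rules):

// level 1: 2+ events with the same src/dst within 10 seconds
insert into Level1Match
select src, dst, window(id) as ids
from Event#time(10 sec)
group by src, dst
having count(*) >= 2;

// level 2: 5+ events within 20 seconds for a src/dst pair that already has a level-1 match
insert into Level2Match
select src, dst, window(id) as ids
from Event#time(20 sec)
where exists (select * from Level1Match#time(20 sec) as m
              where m.src = Event.src and m.dst = Event.dst)
group by src, dst
having count(*) >= 5;

// level 3: 100+ events within 30 seconds for a src/dst pair that already has a level-2 match
insert into Level3Match
select src, dst, window(id) as ids
from Event#time(30 sec)
where exists (select * from Level2Match#time(30 sec) as m
              where m.src = Event.src and m.dst = Event.dst)
group by src, dst
having count(*) >= 100;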
Imagine I have an MSSQL 2005 table (bbstats) that updates weekly, showing various cumulative categories of baseball accomplishments for a team.
week 1
Player H SO HR
Sammy 7 11 2
Ted 14 3 0
Arthur 2 15 0
Zach 9 14 3
week 2
Player H SO HR
Sammy 12 16 4
Ted 21 7 1
Arthur 3 18 0
Zach 12 18 3
I wish to highlight textually where there has been a change of leader in each category. So after week 2 there would be nothing to report for hits (H); Zach has joined Arthur for most strikeouts (SO) at 18; and Sammy is the new leader in home runs (HR) with 4.
So I would want to set up a process something like:
a) save the past data (week 1) as table bbstatsPrior,
b) update bbstats with the new results - I do not need assistance with this,
c) compare the tables to find the player(s), with ties, holding the max value for each column and output only where they differ,
d) move on to the next column and repeat.
In any real world example there would be significantly more columns to calculate for
Thanks
Responding to Brent's comments: I am really after any changes in the leaders for each category.
So I would have something like
select top 1 with ties player
from bbstatsPrior
order by H desc
and
select top 1 with ties player,H
from bbstats
order by H desc
I then want to compare the player from each query (do I need to use temp tables?). If they differ, I want to output the second select statement. For the H category Ted is the leader in both tables, but for other categories there are changes between the weeks.
I can then loop through the columns using
select name from sys.all_columns sc
where sc.object_id=object_id('bbstats') and name <>'player'
If the number of stats doesn't change often, you could easily just write a single query to get this data. Join bbStats to bbStatsPrior where bbstatsprior.week < bbstats.week and bbstats.week=#weekNumber. Then just do a simple comparison between bbstats.Hits and bbstatsPrior.Hits to get your difference.
If the stats change often, you could use dynamic SQL to do this for all columns that match a certain pattern or are in a list of columns based on sys.columns for that table?
You could add a column for each stat column to designate the leader using a correlated subquery to find the max value for that column and see if it's equal to the current record.
This might get you started, but I'd recommend posting what you currently have to achieve this and the community can help you from there.
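In the meantime, here is a rough, untested sketch of step c) for a single category (H), building on the TOP 1 WITH TIES queries you already have (table and column names are taken from your post):

-- returns the current leader rows only when the set of leaders has changed between the two tables
;WITH currLeader AS (
    SELECT TOP 1 WITH TIES player, H FROM bbstats ORDER BY H DESC
),
priorLeader AS (
    SELECT TOP 1 WITH TIES player FROM bbstatsPrior ORDER BY H DESC
)
SELECT 'H' AS category, player, H
FROM currLeader
WHERE EXISTS (SELECT player FROM currLeader
              EXCEPT
              SELECT player FROM priorLeader)
   OR EXISTS (SELECT player FROM priorLeader
              EXCEPT
              SELECT player FROM currLeader);

To cover every category, you could generate this statement per column with dynamic SQL driven by your sys.all_columns query.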
Business problem - understand process fallout using analytics data.
Here is what we have done so far:
Build a dictionary table with every possible process step
Find each process "start"
Find the last step for each start
Join dictionary table to last step to find path to final step
In the final report output we end up with a list of paths for each start to each final step:
User Fallout Step HierarchyID.ToString()
A 1/1/1
B 1/1/1/1/1
C 1/1/1/1
D 1/1/1
E 1/1
What this means is that five users (A-E) started the process. Assume only User B finished; the other four did not. Since this is a simple example (without branching), we want the output to look as follows:
Step Unique Users
1 5
2 5
3 4
4 2
5 1
The easiest solution I could think of is to take each hierarchyID.ToString(), parse that out into a set of subpaths, JOIN back to the dictionary table, and output using GROUP BY.
Given the volume of data, I'd like to use the built-in HierarchyID functions, e.g. IsAncestorOf.
Any ideas or thoughts how I could write this? Maybe a recursive CTE?
Restructuring the data may help with this. For example, structuring the data like this:
User Step Process#
---- ---- --------
A 1 1
A 2 1
A 3 1
B 1 2
B 2 2
B 3 2
B 4 2
B 5 2
E 1 3
E 2 3
E 1 4
E 2 4
E 3 4
Allows you to run the following query:
select step,
count(distinct process#) as process_iterations,
count(distinct [user]) as unique_users
from stepdata
group by step
order by step;
which returns:
Step Process_Iterations Unique_Users
---- ------------------ ------------
1 4 3
2 4 3
3 3 3
4 1 1
5 1 1
I'm not familiar with hierarchyid, but splitting out that data into chunks for analysis looks like the sort of problem numbers tables are very good for. Join a numbers table against the individual substrings in the fallout and it shouldn't be too hard to treat the whole thing as a table and analyse it on the fly, without any non-set operations.
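For the simple, non-branching case in the question, that could look roughly like this (untested; I'm assuming a table I'll call fallout with a genuine hierarchyid column FalloutStep, a Numbers table with a column number, and that a node's depth equals its step number, so the built-in GetLevel() function replaces the substring splitting):

-- Every user whose path is k levels deep has passed through steps 1..k, so the
-- unique-user count for step n is the number of users at depth >= n.
SELECT n.number                 AS Step,
       COUNT(DISTINCT f.[User]) AS UniqueUsers
FROM   fallout AS f
JOIN   Numbers AS n
  ON   n.number <= f.FalloutStep.GetLevel()
GROUP BY n.number
ORDER BY n.number;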