Kafka kstream comparing two values from two different topics - apache-kafka

I'm currently trying to send two different formats of the same event on two different topics. Let's say format A goes to topic A and format B to topic B.
Format B is only sent approximately 15% of the time, since it is not supported by older systems. Whenever B is sent, there is always an A equivalent of the same event.
What I want to do is listen to both at the same time, and if B exists, discard A.
What I've tried so far is to listen on both topics (I'm using KStreams) and do a stream-stream join:
streamA.leftJoin(streamB, (A_VALUE, B_VALUE) -> {
        if (B_VALUE != null && A_VALUE != null) {
            return B_VALUE;
        } else if (A_VALUE != null && B_VALUE == null) {
            return A_VALUE;
        }
        return null;
    },
    JoinWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(15)),
    Joined.with(
        Serdes.String(),
        Serdes.String(),
        Serdes.String()
    ));
Running tests at a load of between 50 and 200 events/s, I've seen results like this:
the number of B_VALUEs sent is always correct,
but the number of A_VALUEs is larger than expected.
I think that sometimes it's emitting both A and B.
I've also tried using a Guava cache as a "hashmap with TTL", storing all the B events and then comparing against them. Here I find that the total amount is always correct, but there are fewer B events than expected, meaning that sometimes it does not find a match.
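A rough sketch of what I mean by the Guava-cache approach (the topic name, TTL and wiring are just placeholders, not my exact code):
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

// Illustrative only: cache every B event by key, and forward an A event only when no
// B event with the same key has been seen. This only works if the B event is consumed
// before (or within the TTL of) its A counterpart.
Cache<String, String> recentBEvents = CacheBuilder.newBuilder()
        .expireAfterWrite(20, TimeUnit.MINUTES)
        .build();

streamB.foreach((key, bValue) -> recentBEvents.put(key, bValue));

streamA.filterNot((key, aValue) -> recentBEvents.getIfPresent(key) != null)
       .to("output-topic");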
If there is a better way of doing this without using a database, I'm open to it!
Note: correlated events always have the same key, e.g. <432, A_VALUE>, <432, B_VALUE>.

Related

Kafka Stream aggregate function sinking each joined record in an incremental way, instead of a single list of aggregated records

I'm having some issues with my Kafka Streams implementation in production. I've implemented a function that takes a KTable and a KStream and yields another KTable with aggregated results based on the join of these two inputs. The idea is to iterate over a list inside each KStream record, join each element with the KTable, aggregate the matched KTable events into a list, and sink the result to a topic as a record containing the original KStream event and the list of joined KTable events (1-to-N join).
Context
This is how my component interacts with its context. MovementEvent contains a list of transaction_ids that should match the transaction_id of a TransactionEvent; the joiner should match them and generate a new event (SinkedEvent) containing the original MovementEvent and a list of the matched TransactionEvents.
For reference, the Movement topic has 12 million records, while the Transaction topic has 21 million.
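For clarity, the shapes of the events involved look roughly like this (the real classes are builder-generated, so this is only a simplified sketch with illustrative field names):
import java.util.List;

// Simplified sketch of the event shapes involved in the join.
class Movement {
    String movementId;
    List<String> transactionIds;    // ids that should match TransactionEvent keys
}

class Transaction {
    String transactionId;
    // ... transaction payload
}

class SinkedEvent {
    Movement movement;              // the original MovementEvent
    List<Transaction> transactions; // all TransactionEvents matched by id (1-to-N)
}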
Implementation
public class SinkEventProcessor implements BiFunction<
        KTable<TransactionKey, Transaction>,
        KStream<String, Movement>,
        KTable<SinkedEventKey, SinkedEvent>> {

    @Override
    public KTable<SinkedEventKey, SinkedEvent> apply(final KTable<TransactionKey, Transaction> transactionTable,
                                                     final KStream<String, Movement> movementStream) {
        return movementStream
                // [A]
                .flatMap((movementKey, movement) -> movement
                        .getTransactionIds()
                        .stream()
                        .distinct()
                        .map(transactionId -> new KeyValue<>(
                                TransactionKey.newBuilder()
                                        .setTransactionId(transactionId)
                                        .build(),
                                movement))
                        .toList())
                // [B]
                .join(transactionTable, (movement, transaction) -> Pair.newBuilder()
                        .setMovement(movement)
                        .setTransaction(transaction)
                        .build())
                // [C]
                .groupBy((transactionKey, pair) -> SinkedEventKey.newBuilder()
                        .setMovementId(pair.getMovement().getMovementId())
                        .build())
                // [D]
                .aggregate(SinkedEvent::new, (key, pair, collectable) ->
                        collectable.setMovement(pair.getMovement())
                                .addTransaction(pair.getTransaction()));
    }
}
[A] I have started the implementation by iterating the Movement KStream, extracting the transactionId and creating a TransactionKey to use as the new key for the following operation, to facilitate the join with each transactionId present in the Movement entity. This operation returns a KStream<TransactionKey, Movement>.
[B] Joins the previously transformed KStream and adds each value to an intermediate pair. Returns a KStream<TransactionKey, Pair>.
[C] Groups the pairs by movementId and constructs the new key (SinkedEventKey) for the sink operation.
[D] Aggregates into the result object (SinkedEvent) by adding the transaction to the list. This operation will also sink to the topic as a KTable<SinkedEventKey, SinkedEvent>.
Problem
The problem starts when we begin processing the stream: the sink operation of the processor starts generating more records than it should. For instance, for a Movement with 4 transaction_ids, the output topic ends up looking like this:
partition | offset | count of [TransactionEvent] | expected count
----------+--------+-----------------------------+---------------
    0     |    1   |              1              |        4
    0     |    2   |              2              |        4
    0     |    3   |              4              |        4
    0     |    4   |              4              |        4
And the same happens for other records (e.g. a Movement with 13 transaction_ids will yield 13 messages). So, for some reason that I can't comprehend, the aggregate operation is sinking on every update instead of collecting into the list and sinking only once.
I've tried to reproduce it in a development cluster with exactly the same settings, to no avail. Everything seems to work properly when I try to reproduce it (a Movement with 8 transactions produces only 1 record), but whenever I bring it to production it doesn't work as intended. I'm not sure what I'm missing; any help?

Siddhi query with conditions within multiple occurrences

We can write a Siddhi query for a few occurrences of events with some condition, like:
For 3 events with customerId 'xyz' and source 'log', we can use
from every (e1 = CargoStream[e1.customerId == 'xyz' AND e1.source == 'log']<3>)
But what we need to do is add conditions between these 3 events.
Something like: all three events should have the same source, whatever that value is.
from every (e1 = CargoStream[e1.customerId == 'xyz' AND all these 3 events have same source does not matter the value]<3>)
We tried a query that accesses the indexed events within the occurrences, but it does not seem to trigger events correctly:
from every (e1 = CargoStream[e1.customerId == 'xyz' AND (e1[0].source == e1[1].source AND e1[1].source == e1[2].source)]<3>)
Is this even possible with Siddhi Query? If yes, then how?
To apply the same condition across the events, you can use partitions:
https://siddhi.io/en/v5.1/docs/query-guide/#partition
Also look into this issue: https://github.com/siddhi-io/siddhi/issues/1425
The query would look like this:
define stream AuthenticationStream (ip string, type string);
#purge(enable='true', interval='15 sec', idle.period='2 min')
partition with (ip of AuthenticationStream)
begin
from every (e1=AuthenticationStream[type == 'FAILURE' ]<1:> ->
e2=AuthenticationStream[type == 'SUCCESS' ]) within 1 min
select e1[0].ip as ip, e1[3].ip as ip4
having not(ip4 is null)
insert into BreakIn
end;

TopologyTestDriver sending incorrect message on KTable aggregations

I have a topology that aggregates on a KTable.
This is a generic method I created to build this topology on different topics I have.
public static <A, B, C> KTable<C, Set<B>> groupTable(KTable<A, B> table, Function<B, C> getKeyFunction,
Serde<C> keySerde, Serde<B> valueSerde, Serde<Set<B>> aggregatedSerde) {
return table
.groupBy((key, value) -> KeyValue.pair(getKeyFunction.apply(value), value),
Serialized.with(keySerde, valueSerde))
.aggregate(() -> new HashSet<>(), (key, newValue, agg) -> {
agg.remove(newValue);
agg.add(newValue);
return agg;
}, (key, oldValue, agg) -> {
agg.remove(oldValue);
return agg;
}, Materialized.with(keySerde, aggregatedSerde));
}
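For illustration, a hypothetical call of this method could look like the following (User, userSerde and userSetSerde are made-up types/serdes, not something from my code base):
// Hypothetical usage: group a KTable<String, User> into sets keyed by each user's country.
KTable<String, Set<User>> usersByCountry = groupTable(
        userTable,            // KTable<String, User>
        User::getCountry,     // getKeyFunction: B -> C
        Serdes.String(),      // keySerde: Serde<C>
        userSerde,            // valueSerde: Serde<B>
        userSetSerde);        // aggregatedSerde: Serde<Set<B>>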
This works pretty well when using Kafka, but not when testing via `TopologyTestDriver`.
In both scenarios, when I get an update, the subtractor is called first and then the adder is called. The problem is that when using the TopologyTestDriver, two messages are sent out for updates: one after the subtractor call, and another one after the adder call. Not to mention that the message sent after the subtractor and before the adder reflects an incorrect intermediate state.
Can anyone else confirm this is a bug? I've tested this with both Kafka versions 2.0.1 and 2.1.0.
EDIT:
I created a testcase in github to illustrate the issue: https://github.com/mulho/topology-testcase
It is expected behavior that there are two output records (one "minus" record, and one "plus" record). It's a little tricky to understand how it works, so let me try to explain.
Assume you have the following input table:
key | value
-----+---------
A | <10,2>
B | <10,3>
C | <11,4>
On KTable#groupBy() you extract the first part of the value as the new key (i.e., 10 or 11) and later sum the second part (i.e., 2, 3, 4) in the aggregation. Because records A and B both have 10 as the new key, you would sum 2+3, and you would sum 4 for the new key 11. The result table would be:
key | value
-----+---------
10 | 5
11 | 4
Now assume that an update record <B,<11,5>> changes the original input KTable to:
key | value
-----+---------
A | <10,2>
B | <11,5>
C | <11,4>
Thus, the new result table should sum up 5+4 for 11 and 2 for 10:
key | value
-----+---------
10 | 2
11 | 9
If you compare the first result table with the second, you might notice that both rows got updated. The old B|<10,3> record is subtracted from 10|5, resulting in 10|2, and the new B|<11,5> record is added to 11|4, resulting in 11|9.
This is exactly the two output records you see. The first output record (emitted after the subtract is executed) updates the first row (it subtracts the old value that is no longer part of the aggregation result), while the second record adds the new value to the aggregation result. In our example, the subtract record would be <10,<null,<10,3>>> and the add record would be <11,<<11,5>,null>> (the format of those records is <key, <plus,minus>>; note that the subtract record only sets the minus part, while the add record only sets the plus part).
Final remark: it is not possible to put the plus and minus records together, because the keys of the plus and minus records can be different (in our example, 11 and 10) and thus might go to different partitions. This implies that the plus and minus operations might be executed by different machines, and hence it's not possible to emit only one record that contains both the plus and minus parts.
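For reference, a minimal sketch of the kind of topology discussed above (assuming an illustrative value type Pair whose first part becomes the new key and whose second part is summed; the serde wiring is also only illustrative) could look like this:
// Sketch only: Pair and its accessors are assumptions, not part of the original code.
StreamsBuilder builder = new StreamsBuilder();
KTable<String, Pair> input = builder.table("input-topic");

KTable<Integer, Integer> sums = input
        .groupBy((key, value) -> KeyValue.pair(value.getNewKey(), value.getAmount()),
                 Serialized.with(Serdes.Integer(), Serdes.Integer()))
        .aggregate(
                () -> 0,
                (newKey, amount, agg) -> agg + amount,  // adder: applied for the new value
                (newKey, amount, agg) -> agg - amount,  // subtractor: applied for the old value
                Materialized.with(Serdes.Integer(), Serdes.Integer()));
Each update to the input table triggers one subtract record (keyed by the old key) and one add record (keyed by the new key), which is exactly the pair of outputs described above.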

EntityFramework counting of query results vs counting list

Should efQuery.ToList().Count and efQuery.Count() produce the same value?
How is it possible that efQuery.ToList().Count and efQuery.Count() don't produce the same value?
//GetQuery() returns a default IDbSet which is used in EntityFramework
using (var ds = _provider.DataSource())
{
//return GetQuery(ds, filters).Count(); //returns 0???
return GetQuery(ds, filters).ToList().Count; //returns 605 which is correct based on filters
}
Just ran into this myself. In my case the issue is that the query has a .Select() clause that causes further relationships to be established, which end up filtering the query further because the relationships' inner joins constrain the result.
It appears that .Count() doesn't process the .Select() part of the query.
So I have:
// projection created
var ordersData = orders.Select(ord => new OrderData() {
    OrderId = ord.OrderId,
    // ... more simple 1-1 order maps
    // Related values that cause relations in SQL
    TotalItemsCost = ord.OrderLines.Sum(lin => lin.Qty * lin.Price),
    CustomerName = ord.Customer.Name,
});
var count = ordersData.Count();            // 207
var listCount = ordersData.ToList().Count; // 192
When I compare the SQL statements, I find that Count() issues a very simple query on the Orders table that counts all orders, while the second query is a monster of 100+ lines of SQL with 10 inner joins triggered by the .Select() clause (a few more related values/aggregations are retrieved than shown here).
Basically, this seems to indicate that .Count() doesn't take the .Select() clause into account when it does its count, so the relationships that further constrain the result set are not fired for .Count().
I've been able to make this work by explicitly adding expressions to the .Count() method that pull in some of those aggregated result values which effectively force them into the .Count() query as well:
var count = ordersData.Count(o => o.TotalItemsCost != -999 &&
                                  o.CustomerName != "!##"); // 207
The key is to make sure that any of the fields that are calculated or pull in related data and cause a relationship to fire, are included in the expression which forces Count() to include the required relationships in its query.
I realize this is a total hack and I'm hoping there's a better way, but for the moment this has allowed us at least to get the right value without pulling massive data down with .ToList() first.
Assuming here that efQuery is IQueryable:
ToList() actually executes the query. If changes to the data in the store between the calls to ToList() and .Count() result in a different result set, calling ToList() will repopulate the list. ToList().Count and .Count() should then match until the data in the store changes the result set again.

Limit the rows returned in Include entity

I have a simple data model with Project, Member and ProjectMember, where Project to Member is a many-to-many relationship. Therefore the ProjectMember table contains both foreign keys.
I wrote this code:
var result = db.Projects.Include(p => p.ProjectMembers).Where(p => p.ProjectMembers.Any(pm => pm.DeletedUser == 1));
and I see that result.ProjectMembers has a count of 2. Here I have got an additional record whose DeletedUser is not equal to 1.
Did I do something wrong here?
What expression do I have to use so that result.ProjectMembers contains only the record (or records) with DeletedUser == 1?
You are asking for Projects having at least one (= any) ProjectMember with DeletedUser == 1. This condition is met. The other ProjectMembers of the Project can have any value other than 1 for DeletedUser.
If you want the Project with only ProjectMembers with DeletedUser == 1 you should start the query at ProjectMember:
ProjectMembers.Include("Project").Where(pm => pm.DeletedUser == 1)