Advanced counting in Kafka Streams - apache-kafka

I'd like to determine the number of companies assigned to a category. The (simplified) data structure of the input topic looks something like this:
Key: 3298440
Value: {"company_id": 5678, "category_id": 9876}
Key: 4367848
Value: {"company_id": 35383, "category_id": 9876}
[...]
So, as in this example, I'd like to count the distinct companies for category 9876.
My idea was to group by category_id (the new key) and increase or decrease the count (via reduce()) as the value.
In principle, I'm only interested in inserts and deletes (tombstones) for altering the count. My problem is that updates will also produce a record and invalidate the count.
I believe I'm on the wrong track here. Is there any way to do it?
This is my incorrect Java code, which also counts updates:
companyCategoriesKTable
    .groupBy(KeyValue::pair, Grouped.with(Serdes.String(), companyCategoriesSerde))
    .aggregate(
        () -> new CompanyCount(0L, 0L), /* category_id, count */
        (key, newValue, aggValue) -> {
            aggValue.setCategoryId(newValue.getCategoryId());
            aggValue.setCount(aggValue.getCount() + 1);
            return aggValue;
        },
        (key, oldValue, aggValue) -> {
            aggValue.setCategoryId(oldValue.getCategoryId());
            aggValue.setCount(aggValue.getCount() - 1);
            return aggValue;
        },
        Materialized.<String, CompanyCount, KeyValueStore<Bytes, byte[]>>as("CompanyCountStore")
            .withKeySerde(Serdes.String())
            .withValueSerde(new JsonSerde<>(CompanyCount.class))
    )
    .toStream()
    .map((k, v) -> new KeyValue<>(v.getCategoryId(), v.getCount() > 0 ? 1L : -1L))
    .groupByKey(Grouped.with(Serdes.Long(), Serdes.Long()))
    .reduce(Long::sum)
    .toStream()
    .to("output-topic", Produced.with(Serdes.Long(), Serdes.Long()));

Related

How can I count users that generate events within a certain period of time with Kafka Streams?

I have streaming events which contain a user_id. I want to count how many distinct users generate an event within a certain period of time. However, I am a beginner in Kafka and I cannot cope with the problem.
Example events within 1 minute:
{"event_name": "viewProduct", "user_id": "12"}
{"event_name": "viewProductDetails", "user_id": "23"}
{"event_name": "viewProductComments", "user_id": "12"}
{"event_name": "viewProduct", "user_id": "23"}
{"event_name": "viewProductComments", "user_id": "32"}
My code should report 3 active users for the events above.
My approach is as follows; however, this solution cannot eliminate multiple events from the same user, so it counts the same user multiple times.
builder.stream("orders") // read from orders toic
.mapValues(v -> { // get user_id via json parser
JsonNode jsonNode = null;
try {
jsonNode = objectMapper.readTree((String) v);
return jsonNode.get("user_id").asText();
} catch (JsonProcessingException e) {
e.printStackTrace();
}
return "";
})
.selectKey((k, v) -> "1") // put same key to every user_id
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(1))) // use time windows
.count() // count values
I might be missing something here, but why don't you just do:
.selectKey((k, v) -> v)
That will group the record by value, which you previously populated with user_id.
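Applied to the snippet above, the change looks like this (a sketch only; I widened the window to one minute to match the example, and note it yields one count per user_id per window rather than a single total):

// Key each record by its user_id so grouping and windowing happen per user
// instead of per a single constant key (same imports as the snippet above).
builder.stream("orders")
    .mapValues(v -> {
        try {
            return objectMapper.readTree((String) v).get("user_id").asText();
        } catch (JsonProcessingException e) {
            return "";
        }
    })
    .selectKey((k, v) -> v)                             // key = user_id
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))  // one-minute windows, as in the example
    .count();                                           // one row per (user_id, window)

If you need a single distinct-user number per window, you can take this windowed table, re-key each entry by its window, and aggregate the user IDs into a set whose size is the answer; the per-user keying shown here is the prerequisite for that.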

Query listener does not work when 'Where' clause is added

I have a query that provides 'feed' data to a collection view using RxSwift and RxFirebase.
This feed data now has a 'Privacy' field (possible values: "Public", "Sub", and "Private") and I need to filter out the 'Private' entities.
However, when I add a 'Where' clause to do this, the listener no longer picks up newly posted entities from this collection. The first call to this function always has the 'listens' bool set to true, because it wants to listen for new entities posted/deleted by a user.
I do not know why new events no longer trigger the listener.
Here is the current query code:
func fetchGlobalFeed(limit: Int, startAfter: Timestamp?, listens: Bool) -> Observable<[Entity.Fragment]> {
    var query = db.collection(k.mCollection)
        .limit(to: limit)
        .order(by: "PublishedAt", descending: true)

    if let timestamp = startAfter {
        query = query.start(after: [timestamp])
    }

    let observable = listens ? query.rx.listen() : query.rx.getDocuments()

    return observable
        .catchErrorJustComplete()
        .map(FirestoreHelpers.dataFromQuerySnapshot)
        .map { data in
            data.compactMap {
                try? FirestoreDecoder()
                    .decode(Entity.Fragment.self, from: $0)
            }
        }
}
And changing the query to:
var query = db.collection( k.mCollection )
.limit( to: limit )
.whereField( "Privacy", in: ["Public", "Sub"] ) // <- this causes issue with listener
.order( by: "PublishedAt", descending: true )
An explanation of how listeners work here, and any suggestions, would be appreciated.
Thanks.
RxFirebase version: 0.3.8
*Edit:
I don't have enough rep to post pictures (it says it requires 10 rep points), which is weird because I used to be able to.
Here is the debug output:
2020-09-10 11:37:03.101: MagmaFirestoreProvider.swift:165
(fetchGlobalFeed(limit:startAfter:listens:)) -> subscribed
2020-09-10 11:37:03.324: MagmaFirestoreProvider.swift:165
(fetchGlobalFeed(limit:startAfter:listens:)) -> Event
next([Entity.Fragment(ref: <FIRDocumentReference: 0x60000109ef40>,
publisher: Optional(UserEntity.Fragment(ref: <FIRDocumentReference:
0x60000109ee80>, username: "Art_Collective", name: nil .........
privacy: Optional(magma.MagmaMagPrivacy.Public))])
2020-09-10 11:37:03.458: MagmaFirestoreProvider.swift:165
(fetchGlobalFeed(limit:startAfter:listens:)) -> Event completed
2020-09-10 11:37:03.458: MagmaFirestoreProvider.swift:165
(fetchGlobalFeed(limit:startAfter:listens:)) -> isDisposed
When a new item is posted to the feed with a privacy of "Sub", the item does not appear and there is zero debug output.
I have to refresh the collection view to get the above output and the new item in the feed.
I assume that's because the observable is being disposed?
As suggested in the comments above, I created a composite index for the query in the Firebase console.
I will note that it was important to have the index structure a certain way.
I tried creating an index with the following settings:
Collection ID : 'mags'
Field 1: 'PublishedAt' : 'Descending'
Field 2: 'Privacy' : 'Descending'
(Note: the Firebase docs say to specify either 'Ascending' or 'Descending' for a field used in an 'in' clause even if you're not ordering on it; the direction chosen does not affect the results of the equality filter.)
This index did not work for some reason, I'm not sure why...
However, I realized other indexes created by a dev before me were mechanically similar but structured differently. So I mirrored that with the following:
Collection ID : 'mags'
Field 1: 'Privacy' : 'Ascending'
Field 2: 'PublishedAt' : 'Descending'
And it worked!

Aggregate to compacted topic with unlimited retention

I'm setting up a Kafka Streams application that consumes from a topic (retention: 14 days, cleanup.policy: delete, partitions: 1).
I wish to consume the messages and output it into another topic (retention: -1, cleanup.policy: compact, partitions: 3).
The grouping is by key on the input topic.
So:
Input-topic:
Key: A Value: { SomeJson }
Key: A Value: { Other Json}
Key: B Value: { TestJson }
Output:
Key: A Value: {[ { SomeJson }, { Other Json } ]}
Key: B Value: {[ { TestJson } ]}
It's important that the content of the output topic is never lost, so it's produced with acks=all and a replication factor of 3.
Each key in the compacted topic will have around 100 JSON records, estimated at less than 20 KB per key.
I was also hoping that the output topic could double as the state topic, so that the application wouldn't have to create another topic containing the same information.
Anyone know how to do this? Most of the examples I find relate to windowing: https://github.com/confluentinc/kafka-streams-examples/tree/5.3.1-post/src/main/java/io/confluent/examples/streams
Current code:
val mapper = new ObjectMapper();

builder.stream(properties.getInputTopic(), Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .aggregate(
        () -> new GroupedIdenthendelser(Collections.emptyList()),
        (key, value, currentAggregate) -> {
            val items = new ArrayList<>(currentAggregate.getIdenthendelser());
            items.add(value);
            return new GroupedIdenthendelser(items);
        },
        Materialized.with(Serdes.String(), new JsonSerde<>(GroupedIdenthendelser.class, mapper)))
    .toStream()
    .to(properties.getOutputTopic(), Produced.with(Serdes.String(), new JsonSerde<>(mapper)));
If someone has other tips, please share them; this is customer data, so if there are configs I should tweak, do tell. Pointers to blog posts or examples are also appreciated.
Edit: The code example above seems to work, but it creates its own state topic, which isn't needed since the output topic will always contain the same state. There will also only ever be one application instance running this, since the input topic has 1 partition, and as it relates to people in a rather fixed-size population (10,000,000 people, give or take), the size of the data won't grow above 20 KB per person either. Events per second are estimated at around 1/s, so the load is not much either.
The topology:

Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [input-topic])
        --> KSTREAM-AGGREGATE-0000000002
    Processor: KSTREAM-AGGREGATE-0000000002 (stores: [KSTREAM-AGGREGATE-STATE-STORE-0000000001])
        --> KTABLE-TOSTREAM-0000000003
        <-- KSTREAM-SOURCE-0000000000
    Processor: KTABLE-TOSTREAM-0000000003 (stores: [])
        --> KSTREAM-SINK-0000000004
        <-- KSTREAM-AGGREGATE-0000000002
    Sink: KSTREAM-SINK-0000000004 (topic: output-topic)
        <-- KTABLE-TOSTREAM-0000000003
Looking at your example dataset, I guess what you need is a real-time aggregation. Please take a look at this blog post from Confluent as a starting point.
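Regarding the edit about the unwanted state topic: one option (a sketch only, separate from the blog-post suggestion above; the store name is made up) is to name the store and disable its changelog, since you treat the compacted output topic as an equivalent copy of the state. The trade-off is that a failed or rebalanced instance can then only rebuild its state by replaying the 14-day input topic, so for customer data you may prefer to keep the changelog and accept the duplicated storage:

// Same aggregation as above, but with the state store named and its changelog
// disabled, so Kafka Streams creates no extra internal topic for it.
// Trade-off: after a failure the store can only be rebuilt from the input topic.
builder.stream(properties.getInputTopic(), Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .aggregate(
        () -> new GroupedIdenthendelser(Collections.emptyList()),
        (key, value, currentAggregate) -> {
            List<String> items = new ArrayList<>(currentAggregate.getIdenthendelser());
            items.add(value);
            return new GroupedIdenthendelser(items);
        },
        Materialized.<String, GroupedIdenthendelser, KeyValueStore<Bytes, byte[]>>as("grouped-identhendelser-store")
            .withKeySerde(Serdes.String())
            .withValueSerde(new JsonSerde<>(GroupedIdenthendelser.class, mapper))
            .withLoggingDisabled())
    .toStream()
    .to(properties.getOutputTopic(),
        Produced.with(Serdes.String(), new JsonSerde<>(GroupedIdenthendelser.class, mapper)));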

Grouping and Combining Observables in RxJava

I would like to do the following with RxJava
class Invoice(val dayOfMonth:Int,val amount:Int)
Below is a sample monthInvoices: List<Invoice> to process:
Invoice(3,100)
Invoice(3,150)
Invoice(3,50)
Invoice(4,350)
Invoice(8,400)
Invoice(8,100)
First, I would like to group it by day of the month, like the following:
Invoice(3,300)
Invoice(4,350)
Invoice(8,500)
Then I would like to create a list containing all the days of the month. Say this month has 30 days; then the output list must contain an empty Invoice object with amount 0 for every day that has no invoice.
Desired output list:
Invoice(1,0) //Since day 1 is not in the group summed list
Invoice(2,0) //day 2 is also not there
Invoice(3,300)
Invoice(4,350)
Invoice(5,0)
Invoice(6,0)
Invoice(7,0)
Invoice(8,500)
...
Invoice(30,0)
I hope I have explained the need clearly. Can anyone please suggest a solution that does this entirely in RxJava?
Try this:
fun task(invoices: List<Invoice>) =
    Observable.fromIterable(invoices)
        .groupBy { it.dayOfMonth }
        .flatMapSingle { group ->
            group.reduce(0) { t1, t2 -> t1 + t2.amount }
                .map { group.key to it }
        }
        .toMap({ it.first }, { it.second })
        .flatMapObservable { map ->
            Observable.range(1, 30)
                .map { Invoice(it, map[it] ?: 0) }
        }
This can be achieved much more easily using the collection operators inside Kotlin's standard library, but in pure RxJava you can do this by using groupBy and reduce.
val invoices = listOf(
    Invoice(3, 100),
    Invoice(3, 150),
    Invoice(3, 50),
    Invoice(4, 350),
    Invoice(8, 400),
    Invoice(8, 100)
)

Observable.range(1, 30)
    .map { Invoice(it, 0) } // Create an Observable of Invoice([day], 0)
    .mergeWith(Observable.fromIterable(invoices))
    .groupBy { it.dayOfMonth } // Merge the sources and group by day
    .flatMapMaybe { group ->
        group.reduce { t1: Invoice, t2: Invoice ->
            Invoice(t1.dayOfMonth, t1.amount + t2.amount) // Reduce each group into a single Invoice
        }
    }
    .subscribe {
        // Optionally you can call toList() before this if you want to aggregate the emissions into a single list
        println(it)
    }

Group By (Aggregate Map Reduce Functions) in MongoDB using Scala (Casbah/Rogue)

Here's a specific query I'm having trouble with. I'm using Lift-mongo-records so that I can use Rogue. I'm happy to use Rogue-specific syntax, or whatever works.
While there are good examples of using JavaScript strings via the Java driver, noted below, I'd like to know what the best practices might be.
Imagine a collection like:
comments {
_id
topic
title
text
created
}
The desired output is a list of topics and their count, for example
cats (24)
dogs (12)
mice (5)
So a user can see a list of distinct topics, ordered by count (a group-by).
Here's some pseudo-SQL:
SELECT [DISTINCT] topic, count(topic) as topic_count
FROM comments
GROUP BY topic
ORDER BY topic_count DESC
LIMIT 10
OFFSET 10
One approach is to use the DBObject DSL, something like:
val cursor = coll.group( MongoDBObject(
"key" -> MongoDBObject( "topic" -> true ) ,
//
"initial" -> MongoDBObject( "count" -> 0 ) ,
"reduce" -> "function( obj , prev) { prev.count += obj.c; }"
"out" -> "topic_list_result"
))
[...].sort( MongoDBObject( "created" -> -1 )).skip( offset ).limit( limit );
Variations of the above do not compile.
I could just ask "what am I doing wrong" but I thought I could make my confusion more acute:
can I chain the results directly or do I need "out"?
what kind of output can I expect - I mean, do I iterate over a cursor, or the "out" param?
is "cond" required?
should I be using count() or distinct()
some examples contain a "map" param...
A recent post I found which covers the Java driver implies I should use strings instead of a DSL: http://blog.evilmonkeylabs.com/2011/02/28/MongoDB-1_8-MR-Java/
Would this be the preferred method in either casbah or Rogue?
Update: 9/23
This fails in Scala/Casbah (compiles but produces error {MapReduceError 'None'} )
val map = "function (){ emit({ this.topic }, { count: 1 }); }"
val reduce = "function(key, values) { var count = 0; values.forEach(function(v) { count += v['count']; }); return {count: count}; }"
val out = coll.mapReduce( map , reduce , MapReduceInlineOutput )
ConfiggyObject.log.debug( out.toString() )
I settled on the above after seeing
https://github.com/mongodb/casbah/blob/master/casbah-core/src/test/scala/MapReduceSpec.scala
Guesses:
I am misunderstanding the toString method and what the out.object is?
missing finalize?
missing output specification?
https://jira.mongodb.org/browse/SCALA-43 ?
This works as desired from command line:
map = function (){
emit({ this.topic }, { count: 1 });
}
reduce = function(key, values) { var count = 0; values.forEach(function(v) { count += v['count']; }); return {count: count}; };
db.tweets.mapReduce( map, reduce, { out: "results" } ); //
db.results.ensureIndex( {count : 1});
db.results.find().sort( {count : 1});
Update
The issue has now been filed as a bug at MongoDB:
https://jira.mongodb.org/browse/SCALA-55
The following worked for me:
val coll = MongoConnection()("comments")
val reduce = """function(obj,prev) { prev.csum += 1; }"""
val res = coll.group( MongoDBObject("topic"->true),
MongoDBObject(), MongoDBObject( "csum" -> 0 ), reduce)
res was an ArrayBuffer full of coll.T which can be handled in the usual ways.
This appears to be a bug - somewhere.
For now, I have a less-than-ideal workaround using eval() (slower, less safe)...
db.eval( "map = function (){ emit( { topic: this.topic } , { count: 1 }); } ; ");
db.eval( "reduce = function(key, values) { var count = 0; values.forEach(function(v) { count += v['count']; }); return {count: count}; }; ");
db.eval( " db.tweets.mapReduce( map, reduce, { out: \"tweetresults\" } ); ");
db.eval( " db.tweetresults.ensureIndex( {count : 1}); ");
Then I query the output collection normally via Casbah.
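For reference, the topic/count query in this question maps directly onto MongoDB's aggregation framework, which has since replaced most uses of group() and simple map-reduce. A sketch with the plain MongoDB Java driver (not Casbah or Rogue, and the connection string and database name are placeholders):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.*;
import static com.mongodb.client.model.Sorts.descending;

public class TopicCounts {
    public static void main(String[] args) {
        MongoCollection<Document> comments = MongoClients.create("mongodb://localhost")
                .getDatabase("blog").getCollection("comments");

        // Equivalent of: SELECT topic, count(*) AS topic_count ... GROUP BY topic ORDER BY topic_count DESC LIMIT 10 OFFSET 10
        for (Document topic : comments.aggregate(Arrays.asList(
                group("$topic", sum("topic_count", 1)),
                sort(descending("topic_count")),
                skip(10),
                limit(10)))) {
            System.out.println(topic.toJson()); // e.g. {"_id": "cats", "topic_count": 24}
        }
    }
}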