KStream -- iterate LIST of values and send it to output topic - apache-kafka

I have the below scenario:
KStream<String, List<Foo>> fooList = ...; // this KStream is returned by earlier processing
Now I have to send this to an output topic one record at a time rather than as a whole list, i.e., iterate over the values of fooList and push them to the destination topic, so that individual messages are produced.

I believe this will work:
fooList.flatMapValues(v -> v).to("output");
You can read more about flatMapValues here: https://kafka.apache.org/31/javadoc/org/apache/kafka/streams/kstream/KStream.html#flatMapValues(org.apache.kafka.streams.kstream.ValueMapperWithKey).
From a mathematical point of view, you are asking how to flatten a KStream of List<Foo>. Since flatten is the same as flatMap with the identity function, this ought to be what you want.
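For completeness, here is a slightly fuller sketch of the same idea. It is a minimal, untested example: the topic names "input" and "output", the Foo type, and the fooListSerde/fooSerde serdes are placeholders, not taken from your setup.

import java.util.List;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// "input"/"output" are placeholder topic names; fooListSerde and fooSerde are
// assumed custom serdes for List<Foo> and Foo respectively.
KStream<String, List<Foo>> fooList =
    builder.stream("input", Consumed.with(Serdes.String(), fooListSerde));

fooList
    .flatMapValues(v -> v)  // one output record per Foo in the list; the key is kept
    .to("output", Produced.with(Serdes.String(), fooSerde));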

Related

Kafka Streams join unrelated streams

I have a stream of events I need to match against a KTable / changelog topic, but the matching is done by pattern matching on a property of the KTable entries. So I cannot join the streams based on a key, since I don't yet know which entry matches.
Example:
ktable X:
{
[abc]: {id: 'abc', prop: 'some pattern'},
[efg]: {id: 'efg', prop: 'another pattern'}
}
stream A:
{ id: 'xyz', match: 'some pattern'}
So stream A should forward something like {match: 'abc'}.
So I basically need to iterate over the KTable entries and find the matching entry by pattern matching on this property.
Would it be viable to create a global state store based on the KTable and then access it from the Processor API and iterate over the entries?
I could also aggregate all the entries of the KTable into one collection and then join on a 'fake' key, but this also seems rather hacky.
Or am I just forcing something that isn't really a streams problem, and should I rather put it into a Redis cache with the normal consumer API? That is also somewhat awkward, since I would rather have it backed by RocksDB.
Edit: I guess this is somewhat related to this question.
A GlobalKTable won't work, because a stream-globalTable join allows you to extract a non-key join attribute from the stream -- but the lookup into the table is still based on the table key.
However, you could read the table input topic as a KStream, extract the join attribute, set it as the key, and do an aggregation that returns a collection (i.e., List, Set, etc.). This way, you can do a stream-table join on the key, followed by a flatMapValues() (or flatMap()) that splits the join result into multiple records (depending on how many records are in the collection of the table).
As long as your join attribute does not have too many duplicates (in the table input topic), and thus the value-side collection in the table does not grow too large, this should work fine. You will need to provide a custom value Serde to (de)serialize the collection data.
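A rough sketch of that approach, as an illustration only: it assumes the table entries expose the pattern via a hypothetical getPattern() getter, the stream events expose it via a hypothetical getMatch(), entrySerde/entryListSerde are custom serdes, and builder/streamA already exist; all topic and type names below are placeholders.

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

// Read the table's input topic as a stream, re-key it by the non-key join
// attribute, and collect all entries sharing that attribute into a list.
KTable<String, List<TableEntry>> entriesByPattern = builder
    .stream("table-input-topic", Consumed.with(Serdes.String(), entrySerde))
    .selectKey((key, entry) -> entry.getPattern())
    .groupByKey(Grouped.with(Serdes.String(), entrySerde))
    .aggregate(
        ArrayList::new,
        (pattern, entry, list) -> { list.add(entry); return list; },
        Materialized.with(Serdes.String(), entryListSerde));

// Key the event stream by the same attribute, join on it, and split the
// resulting collection into one record per matching table entry.
streamA
    .selectKey((key, event) -> event.getMatch())
    .join(entriesByPattern, (event, entries) -> entries)
    .flatMapValues(entries -> entries)
    .to("output-topic");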
Normally I would map the table data so I get the join key I need. We recently had a similar case, where we had to join a stream with the corresponding data in a KTable. In our case, the stream key was the first part of the table key, so we could group by that first key part and aggregate the results into a list. In the end it looked something like this.
final KTable<String, ArrayList<String>> theTable = builder
    .table(TABLE_TOPIC, Consumed.with(keySerde, Serdes.String()))
    .groupBy((k, v) -> new KeyValue<>(k.getFirstKeyPart(), v))
    .aggregate(
        ArrayList::new,
        (key, value, list) -> {
            list.add(value);
            return list;
        },
        (key, value, list) -> {
            list.remove(value);
            return list;
        },
        Materialized.with(Serdes.String(), stringListSerde));

final KStream<String, String> theStream = builder.stream(STREAM_TOPIC);

theStream
    .join(theTable, (streamEvent, tableEventList) -> tableEventList)
    .flatMapValues(value -> value)
    .map(this::doStuff)
    .to(TARGET_TOPIC);
I am not sure if this is also possible in your case, i.e., whether you can map the table data in some way to obtain the join key.
I know this does not completely match your case, but I hope it is of some help anyway. Maybe you can clarify a bit how the matching would look in your case.

Iterating over PTable in crunch

I have the following PTables:
PTable<String, String> somePTable1 = somePCollection1.parallelDo(new SomeClass(),
    Writables.tableOf(Writables.strings(), Writables.strings()));
PTable<String, Collection<String>> somePTable2 = somePTable1.collectValues();
For somePTable2 described above, I want to make a new file for every record in somePTable2. Is there any way to iterate over somePTable2 so that I can access each record? I know I can apply a DoFn on somePTable2, but is it possible to apply the pipeline.write() operation inside a DoFn?
Try this to store your list as-is:
somePTable2.values().write()
If you want to generate one record for each element of the collection inside your PTable, you will need to apply a DoFn and emit one record per element before writing, as sketched below.
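For what it's worth, here is a rough sketch of that approach, written from memory against the Crunch Java API; the output path is a placeholder, so treat it as untested.

import java.util.Collection;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.io.To;
import org.apache.crunch.types.writable.Writables;

// Flatten the collected values back into one record per element.
PTable<String, String> flattened = somePTable2.parallelDo(
    new DoFn<Pair<String, Collection<String>>, Pair<String, String>>() {
        @Override
        public void process(Pair<String, Collection<String>> input,
                            Emitter<Pair<String, String>> emitter) {
            for (String value : input.second()) {
                // one output record per element of the collection
                emitter.emit(Pair.of(input.first(), value));
            }
        }
    },
    Writables.tableOf(Writables.strings(), Writables.strings()));

// The write itself happens on the resulting PTable, outside the DoFn.
flattened.write(To.textFile("/placeholder/output/path"));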

Pair Rx Sequences with one sequence as the master who controls when a new output is published

I'd like to pair two sequences D and A with Reactive Extensions in .NET. The resulting sequence R should pair D and A in a way that whenever new data appears on D, it is paired with the latest value from A as visualized in the following diagram:
D-1--2---3---4---
A---a------b-----
R----2---3---4---
     a   a   b
Neither CombineLatest nor Zip does exactly what I want. Any ideas on how this can be achieved?
Thanks!
You want Observable.MostRecent:
var R = A.Publish(_A => D.SkipUntil(_A).Zip(_A.MostRecent(default(char)), Tuple.Create));
Replace char with whatever the element type of your A observable is.
Conceptually, the query above is the same as the following query.
var R = D.SkipUntil(A).Zip(A.MostRecent(default(char)), Tuple.Create);
The problem with this query is that subscribing to R subscribes to A twice. This is undesirable behavior. In the first (better) query above, Publish is used to avoid subscribing to A twice. It takes a mock of A, called _A, that you can subscribe to many times in the lambda passed to Publish, while only subscribing to the real observable A once.

Spark in Scala: How to avoid linear scan for searching a key in each partition?

I have one huge key-value dataset named A, and a set of keys named B as queries. My task is: for each key in B, return whether the key exists in A, and if it exists, return the value.
I partition A with a HashPartitioner(100) first. Currently I can solve it with A.join(B'), where B' = B.map(x => (x, null)). Or we can use A.lookup() for each key in B.
However, the problem is that both join and lookup on a PairRDD do a linear scan within each partition. This is too slow. What I would like is for each partition to be a HashMap, so that a key can be found within its partition in O(1). So the ideal strategy is that when the master machine receives a bunch of keys, it assigns each key to its corresponding partition; each partition then uses its HashMap to look up the keys and returns the results to the master.
Is there an easy way to achieve it?
One potential way:
As I searched online, a similar question is here:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMwrk0kPiHoX6mAiwZTfkGRPxKURHhn9iqvFHfa4aGj3XJUCNg#mail.gmail.com%3E
As it suggests, I built a HashMap for each partition using the following code:
import scala.collection.mutable.HashMap

val hashpair = A.mapPartitions(iterator => {
  val hashmap = new HashMap[Long, Double]
  iterator.foreach { case (key, value) => hashmap.getOrElseUpdate(key, value) }
  Iterator(hashmap)
})
Now I have 100 HashMaps (if I have 100 partitions for dataset A). Here I'm lost: I don't know how to issue the queries, i.e., how to use hashpair to search for the keys in B, since hashpair is not a regular RDD. Do I need to implement a new RDD and implement the RDD methods for hashpair? If so, what is the easiest way to implement join or lookup methods for hashpair?
Thanks all.
You're probably looking for the IndexedRDD:
https://github.com/amplab/spark-indexedrdd

groupBy toList element order

I have a RichPipe with several fields, let's say:
'sex
'weight
'age
I need to group by 'sex and then get a list of tuples ('weight and 'age). I then want to do a scanLeft operation on the list for each group and get a pipe with 'sex and 'result. I currently do this with
pipe.groupBy('sex) { _.toList('weight -> 'weights).toList('age -> 'ages) }
and then zip the two lists together. I'm not sure this is the best possible way, and I'm also not sure whether the order of the values in the two lists is the same, i.e., whether zipping them could pair values with the wrong rows. I found nothing about this in the documentation.
Ok, so it looks like I've answered my own question.
You can simply do
pipe.groupBy('sex) {_.toList[(Int, Int)](('weight, 'age) -> 'list)}
which results in a list of tuples. It would've saved me a lot of time if the Fields API Reference mentioned this.