Why do some joins not work without selectKey first? - apache-kafka

In doing my joins, I am finding that the second block below gives the expected result, whereas the first block does not and never hits (aValue, bValue) -> myFunc(aValue, bValue). I didn't think the actual key mattered as long as I set the right field to join on with (aKey, aValue) -> aValue.get("someField").asText(), but something about calling .selectKey((aKey, aValue) -> aValue.get("someField").asText()) beforehand makes the join go through correctly. I have also seen some cases that did not require the selectKey. Can someone explain the difference?
// does not join correctly and gives unexpected result
KStream<String, JsonNode> c = a
    .leftJoin(b,
        (aKey, aValue) -> aValue.get("someField").asText(),
        (aValue, bValue) -> myFunc(aValue, bValue)
    );
// does join correctly and gives expected result
KStream<String, JsonNode> c = a
    .selectKey((aKey, aValue) -> aValue.get("someField").asText())
    .leftJoin(b,
        (aKey, aValue) -> aKey,
        (aValue, bValue) -> myFunc(aValue, bValue)
    );

There are many different joins with different semantics in Kafka Streams, and I am not 100% sure which join you are executing.
Given your example, it seems you are using a KStream-GlobalKTable join: b seems to be a GlobalKTable, and the second argument (i.e., (aKey, aValue) -> aValue.get("someField").asText()) is your keySelector?
If this is correct, the first code snippet looks correct to me. What version are you using (maybe there is a bug in Kafka Streams)? Can you also share the output of Topology#describe()#toString() for both cases?
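For reference, a minimal sketch of how to print that description, assuming the topology above was built with a StreamsBuilder named builder:
Topology topology = builder.build();
// describe() returns a TopologyDescription; its toString() lists every
// source, processor, sink, and state store per sub-topology
System.out.println(topology.describe().toString());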

Related

get one element from each GroupedObservable in RxJava

I'm struggling with groupBy in RxJava.
The problem is that I can't get only one element from each group.
For example, I have a list of elements:
SomeModel:
class SomeModel {
    int importantField1;
    int mainData;
    SomeModel(int importantField1, int mainData) {
        this.importantField1 = importantField1;
        this.mainData = mainData;
    }
}
My list of models, for example:
List<SomeModel> dataList = new ArrayList<>();
dataList.add(new SomeModel(3, 1));
dataList.add(new SomeModel(3, 1));
dataList.add(new SomeModel(2, 1));
In my real project the data model is more complex. I added identical models on purpose; that matters for my project.
Then I try to take one element from each group like this:
List<SomeModel> resultList = Observable.fromIterable(dataList)
    .sorted((s1, s2) -> Long.compare(s2.importantField1, s1.importantField1))
    .groupBy(s -> s.importantField1)
    .firstElement()
    // Some transformation back to an Observable. Maybe not elegant, but that is not the problem
    .flatMapObservable(item -> item)
    .groupBy(s -> s.mainData)
    // Up to this line I have all I need. But then I need to take only one element from each branch
    .flatMap(groupedItem -> groupedItem.firstElement().toObservable())
    .toList()
    .blockingGet();
And of course it's not working: I still have two identical elements in the resultList.
I can't add .firstElement() after the last .flatMap operator, because there could be situations where the last .groupBy produces more than one branch.
I need only one element from each branch.
I've tried this way:
.flatMap(groupedItem -> groupedItem.publish(item -> item.firstElement().concatWith(item.singleElement()).toObservable()))
No effect. I took this sample of code from this post: post
There the author suggests this:
.flatMap(grp -> grp.publish(o -> o.first().concatWith(o.ignoreElements())))
but even if I remove the last two rows of my code:
.toList()
.blockingGet();
and change resultList to Disposable disposable, the suggested option does not work because of an error:
.concatWith(o.ignoreElements()) does not compile, since concatWith does not take a Completable.
Some method signatures changed with 3.x since my post, so you'll need these:
first -> firstElement
firstElement returns a Maybe, which is no good inside publish, plus there is no Maybe.concatWith(CompletableSource); hence the need to convert back to an Observable:
.flatMap(grp ->
    grp.publish(o ->
        o.firstElement()
            .toObservable()
            .concatWith(o.ignoreElements())
    )
)
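Putting it together with the question's pipeline, a minimal sketch using the RxJava 3 signatures above (SomeModel and dataList are the classes from the question; untested against your real data model):
List<SomeModel> resultList = Observable.fromIterable(dataList)
    .sorted((s1, s2) -> Long.compare(s2.importantField1, s1.importantField1))
    .groupBy(s -> s.importantField1)
    .firstElement()
    .flatMapObservable(group -> group)
    .groupBy(s -> s.mainData)
    // publish lets us take the first element of each branch and then
    // drain (ignore) the rest, so every branch completes with one item
    .flatMap(grp ->
        grp.publish(o ->
            o.firstElement()
                .toObservable()
                .concatWith(o.ignoreElements())))
    .toList()
    .blockingGet();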

How can I join two KTables with custom values (and custom serdes)?

I want to join two KTables with custom value types.
The documentation makes it clear for default types (with default serdes) - https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html#ktable-ktable-join
KTable<String, Long> left = ...;
KTable<String, Double> right = ...;
// Java 8+ example, using lambda expressions
KTable<String, String> joined = left.leftJoin(right,
    (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue /* ValueJoiner */
);
but when I use custom values I get a serialization error, and there are no overloads for passing custom serdes. How can I accomplish this?
KTable<String, ModelA> left = ...;
KTable<String, ModelB> right = ...;
// Java 8+ example, using lambda expressions
KTable<String, ModelC> joined = left.leftJoin(right,
    (leftValue, rightValue) -> new ModelC("left=" + leftValue.Name + ", right=" + rightValue.Name) /* ValueJoiner */
);
I eventually understood what I was doing wrong.
The error message was a bit misleading:
Change the default Serdes in StreamConfig or provide correct Serdes
via method parameters
But I did not want to change the default serdes, and the KTable join has no overload for passing serdes.
The problem really was that I had created the KTable using the stream.toTable method, which has no overload for passing serdes either. What I did instead was declare the KTable beforehand (with serdes) and then use the stream.to method.
Probably a newbie mistake, but here it is.
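For illustration, a minimal sketch of that workaround; the topic names and the custom Serde<ModelA> (modelASerde, e.g. backed by a JSON serializer) are assumptions:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, ModelA> leftStream =
    builder.stream("left-input", Consumed.with(Serdes.String(), modelASerde));
// route the stream through a topic with explicit serdes...
leftStream.to("left-topic", Produced.with(Serdes.String(), modelASerde));
// ...and declare the KTable over that topic, again passing the serdes
KTable<String, ModelA> left =
    builder.table("left-topic", Consumed.with(Serdes.String(), modelASerde));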

Is there a Kafka Streams method to reduce a stream of numbers to only "output" when the number has changed?

I'm trying to use Kafka Streams to reduce a series of numbers, and I only want a record out when the data has changed. It works perfectly, but the problem is that it does not catch up on data from Kafka if the service running the code has been down. So I guess the solution is wrong?
My code:
KGroupedStream<String, JsonNode> groupedStream = filteredStream.groupByKey(Serdes.String(), jsonSerde);
KTable<String, JsonNode> reducedTable = groupedStream.reduce(
    (aggValue, newValue) -> Calculate.newValue(newValue, aggValue, logger), /* adder */
    "reduced-stream-store" /* state store name */);
KStream<String, JsonNode> reducedStream = reducedTable.toStream();
the "Calculate" method :
if (value != oldValue)
return value
else return null.
Thanks if you have comments/suggestions.
Returning null in your code will delete the entry from the result table; hence, your code does not do what you expect.
In fact, the DSL operators emit "on update", not "on change", and thus you cannot use the DSL for your use case. There is a ticket that suggests adding "emit on change" semantics (https://issues.apache.org/jira/browse/KAFKA-8770).
As a workaround, you will need to use a custom transform() with a state store instead. For each input record, check whether it exists in the store. If not, emit the record and put it into the store. If it does exist and is the same, don't emit anything. If it is different, emit and update the store.
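A minimal sketch of that workaround, assuming String keys and values, a KStream named stream, a StreamsBuilder named builder, and a made-up store name "last-emitted" (adapt the serdes to your JsonNode values):
StoreBuilder<KeyValueStore<String, String>> storeBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("last-emitted"),
        Serdes.String(), Serdes.String());
builder.addStateStore(storeBuilder);

KStream<String, String> changesOnly = stream.transform(() ->
    new Transformer<String, String, KeyValue<String, String>>() {
        private KeyValueStore<String, String> store;

        @Override
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, String>) context.getStateStore("last-emitted");
        }

        @Override
        public KeyValue<String, String> transform(String key, String value) {
            String last = store.get(key);
            if (value.equals(last)) {
                return null; // unchanged: emit nothing
            }
            store.put(key, value); // new or changed: remember and forward
            return KeyValue.pair(key, value);
        }

        @Override
        public void close() {}
    }, "last-emitted");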

Reactive streams with reactive side-effects

I think this will be done similarly in most reactive implementations, so I am not specifying any particular library and language here.
What is the proper way to do reactive side effects on streams (not sure what the proper terminology is here)?
So let's say we have a stream of Players. We want to transform that into a stream of only ids. But in addition to that, we want to do some additional reactive processing based on the ids.
Here is how one can do it in a not-so-elegant way; what is the idiomatic way to achieve this?
Observable<Player> playerStream = getPlayerStream();
return playerStream
    .map(p -> p.id)
    .flatMap(id -> {
        Observable<Result> weDontCare = process(id);
        // Now the hacky part
        return weDontCare.map(__ -> id);
    });
Is that OK? It seems not so elegant.
I don't know RxJava, but you also tagged Project Reactor, and there are two ways to do this with Reactor, depending on how you want the side effects to affect your stream. If you want to wait for the side effects to happen so you can handle errors etc. (my preferred way), use delayUntil:
return getPlayerStream()
    .map(p -> p.id)
    .delayUntil(id -> process(id));
If you want the id to pass straight through without waiting, and instead do a fire-and-forget style side-effect, you could use doOnNext:
return getPlayerStream()
    .map(p -> p.id)
    .doOnNext(id -> process(id).subscribe());
Observable<Player> playerStream = getPlayerStream();
return playerStream
    .map(p -> p.id)
    .doOnNext(id -> { // side effect with id
        Observable<Result> weDontCare = process(id);
        // the inner stream must be subscribed, or the side effect never runs
        weDontCare.subscribe();
    });

Does my "zipLatest" operator already exist?

Quick question about an operator I've written for myself.
Please excuse my poor-man's marble diagrams:
zip
aa--bb--cc--dd--ee--ff--------gg
--11----22--33--------44--55----
================================
--a1----b2--c3--------d4--e5----
combineLatest
aa--bb--cc--dd--ee--ff--------gg
--11----22--33--------44--55----
================================
--a1b1--c2--d3--e3--f3f4--f5--g5
zipLatest
aa--bb--cc--dd--ee--ff--------gg
--11----22--33--------44--55----
================================
--a1----c2--d3--------f4------g5
zipLatest (the one I wrote) fires at almost the same times as zip, but without the queueing zip includes.
I've already implemented it; I'm just wondering if it already exists.
I once wrote a similar method, only to discover by random chance that I'd recreated the sample operator without knowing it.
So, does this already exist in the framework, or exist as a trivial composition of elements I haven't thought of?
Note: I don't want to rely on equality of my inputs to deduplicate (a la distinctUntilChanged).
It should work with a signal that only outputs "a" on an interval.
To give an update on the issue: there is still no operator for this requirement included in RxJS 6, and none seems to be planned for future releases. There are also no open pull requests that propose this operator.
As suggested here, a combination of combineLatest, first and repeat will produce the expected behaviour:
combineLatest(obs1, obs2).pipe(first()).pipe(repeat());
combineLatest waits for both Observables to emit, throwing away all emissions apart from the latest of each. first completes the Observable after that emission, and repeat resubscribes to combineLatest, causing it to wait again for the latest values of both observables.
The resubscription behaviour of repeat is not fully documented, but can be found in the GitHub source:
source.subscribe(this._unsubscribeAndRecycle());
Though you specifically mentioned not wanting to use DistinctUntilChanged, you can use it with a counter to distinguish new values:
public static IObservable<(T, TSecond)> ZipLatest<T, TSecond>(this IObservable<T> source, IObservable<TSecond> second)
{
    return source.Select((value, id) => (value, id))
        .CombineLatest(second.Select((value, id) => (value, id)), ValueTuple.Create)
        .DistinctUntilChanged(x => (x.Item1.id, x.Item2.id), new AnyEqualityComparer<int, int>())
        .Select(x => (x.Item1.value, x.Item2.value));
}

public class AnyEqualityComparer<T1, T2> : IEqualityComparer<(T1 a, T2 b)>
{
    public bool Equals((T1 a, T2 b) x, (T1 a, T2 b) y) => Equals(x.a, y.a) || Equals(x.b, y.b);
    public int GetHashCode((T1 a, T2 b) obj) => throw new NotSupportedException();
}
Note that I've used Int32 here (because that's what Select() gives me), but it might be too small for some use cases; Int64 or Guid might be a better choice.