Handle join of KStream with KTable when key is missing from KTable - apache-kafka

I recently started experimenting with Kafka Streams. I have a scenario where I need to join a KStream with a KTable. It may be the case that the KTable does not contain some of the keys. In that case I get a NullPointerException.
Specifically, I was getting:
stream-thread [StreamThread-1] Streams application error during processing:
java.lang.NullPointerException
I don't know how to handle this; I can't see a way to filter out the records of the stream that do not correspond to a table entry.
Update
Looking a bit further, I found that I can query the underlying store to find whether a key exists, through the ReadOnlyKeyValueStore interface.
In this case my question is: would that be the best way to go? I.e., filtering the stream to be joined based on whether a key exists in the local store?
My second question would be: since I care about leveraging the Global State Store introduced in version 0.10.2 in a later phase, should I expect to be able to query the Global State Store in the same manner?
Update
The previous update is not accurate, since it is not possible to query the state store from inside the topology.
Final update
After understanding the join semantics a bit better, I was able to solve the issue simply by changing the valueJoiner to only return the joined values, instead of performing actions on them, and adding an extra filtering step after the join to filter out null values.

The solution to my problem came from understanding the join semantics a bit better.
As in database joins (although I am not saying that KStream joins follow database join semantics precisely), a left join produces records with null values wherever the right-side key is missing.
So eventually the only thing I had to do was decouple my valueJoiner from the subsequent calculations (I needed to perform some calculations on fields of the joined records and return a newly constructed object) and have it return only an array of the joined values. I could then filter out the records that resulted in null values by checking those arrays.
Based on Matthias J. Sax's suggestion, I moved from version 0.10.1 to 0.10.2 (which is compatible with broker version 0.10.1) and replaced the whole leftJoin logic with an inner join, which removes the need to filter out null values.
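For illustration, here is a minimal sketch of both variants against the 0.10.2 API (the topic and store names are made up, and String serdes are assumed as the defaults):

```java
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

// Made-up topics; default String serdes assumed.
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> stream = builder.stream("input-topic");
KTable<String, String> table = builder.table("lookup-topic", "lookup-store");

// Left join: the joiner only combines values; for keys missing from the
// table it receives null, so a filter afterwards drops those records.
KStream<String, String> joined = stream
        .leftJoin(table, (streamValue, tableValue) ->
                tableValue == null ? null : streamValue + "|" + tableValue)
        .filter((key, value) -> value != null);

// Inner join (available since 0.10.2): non-matching records are dropped
// up front, so no null filter is needed.
KStream<String, String> innerJoined =
        stream.join(table, (streamValue, tableValue) -> streamValue + "|" + tableValue);
```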

Related

Aggregate elements returned from filtering processor in Spring Batch

My use case starts by reading primary keys from the database; a filtering processor checks whether each id has already been processed, and if not, a second processor fetches the other columns from the database using the id.
I'm trying to reduce the round trips to the database. Is it possible to aggregate the ids so I can execute a batch read?
One way is to use a database table (e.g. PROCESSED) to store the ids that have already been processed, or use a status column on the existing table that you're reading. Your reader can then use something similar to "WHERE NOT EXISTS (SELECT 1 FROM PROCESSED p WHERE p.ID = x.ID)" to avoid selecting already processed items.
For this solution you need to consider how job restarts behave and make sure to call saveState(false) since you're maintaining the state of already processed items yourself.
See https://docs.spring.io/spring-batch/docs/current/reference/html/readersAndWriters.html#process-indicator
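As a rough sketch of that reader (assuming a JDBC cursor reader; ITEMS and PROCESSED are made-up table names):

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;

// Reads only ids that have no row in the PROCESSED table yet.
public JdbcCursorItemReader<Long> unprocessedIdReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<Long>()
            .name("unprocessedIdReader")
            .dataSource(dataSource)
            .sql("SELECT x.ID FROM ITEMS x "
                    + "WHERE NOT EXISTS (SELECT 1 FROM PROCESSED p WHERE p.ID = x.ID)")
            .rowMapper((rs, rowNum) -> rs.getLong("ID"))
            // restart state lives in PROCESSED, not in the execution context
            .saveState(false)
            .build();
}
```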

Cloud Firestore Datastore Mode - Transaction Contention without Composite Index

In an application that I am currently working on, we need to ensure uniqueness with respect to a tuple of three properties in a specific kind. As such, when creating a new entity, we need to ensure that no entity of that kind with the given tuple already exists.
My naïve approach to this problem was to create a simple query that filters on equality over the three fields. If an entity with the given fields was found, the operation would abort; otherwise a new entity with those fields and other related data would be inserted. However, when trying to insert many entities in parallel, transaction contention would arise.
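In outline, the check looked something like the following sketch (using the google-cloud-datastore Java client; the kind and property names are invented):

```java
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.StructuredQuery.CompositeFilter;
import com.google.cloud.datastore.StructuredQuery.PropertyFilter;
import com.google.cloud.datastore.Transaction;

Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
Transaction txn = datastore.newTransaction();
try {
    // Look for an existing entity with the same three-property tuple.
    boolean exists = txn.run(Query.newEntityQueryBuilder()
            .setKind("Vehicle")
            .setFilter(CompositeFilter.and(
                    PropertyFilter.eq("a", "valueA"),
                    PropertyFilter.eq("b", "valueB"),
                    PropertyFilter.eq("c", "valueC")))
            .setLimit(1)
            .build()).hasNext();
    if (exists) {
        txn.rollback(); // tuple already taken: abort
    } else {
        Key key = datastore.allocateId(
                datastore.newKeyFactory().setKind("Vehicle").newKey());
        txn.put(Entity.newBuilder(key)
                .set("a", "valueA").set("b", "valueB").set("c", "valueC")
                .build());
        txn.commit();
    }
} finally {
    if (txn.isActive()) {
        txn.rollback();
    }
}
```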
However, as soon as I added a composite index over those three properties, no contention occurred. I changed nothing in the code; I merely added a composite index for those fields.
I have been digging through all the documentation and searched around for anyone who has had a similar issue, but nobody has ever mentioned this "workaround".
Have I missed something? Perhaps discovered something? Or is this expected behavior; are the built-in indexes not enough?
The main document you'd want to look at is https://cloud.google.com/datastore/docs/concepts/optimize-indexes .
In your case it looks like the merge join ends up locking a number of rows while scanning for the absence of a match. With the composite index, you are only looking up the single index entry needed for your query, so there is less contention with the composite index than with a merge join.

Implement SQL update with Kafka

How can I implement an update of an object that I store in a Kafka topic / KTable?
I mean a case where I don't need to replace the whole value (which a compacted KTable would do) but update a single field. Should I read from the topic/KTable, deserialize, update the object, and then store the new value to the same topic/KTable?
Or should I join/merge two topics: one with the original value and the other with the field update?
What would you do?
Kafka (and RocksDB) store bytes; they cannot compare nested fields as though they were database columns. Doing so would require deserialization anyway.
To update a field, you need to construct and post the whole new value; a join will effectively do the same thing.
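For example, one common shape is to aggregate a stream of partial updates into a table of full objects and publish the whole new value (a sketch; FieldUpdate, MyObject, and the topic names are made up, and serdes for them are assumed to be configured):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, FieldUpdate> updates = builder.stream("object-updates");

// Rebuild the full object by folding each field update onto the current state.
KTable<String, MyObject> objects = updates
        .groupByKey()
        .aggregate(
                MyObject::new,                                   // empty initial object
                (key, update, current) -> current.apply(update), // merge one field
                Materialized.as("objects-store"));

// The whole updated value is published back, which is what a compacted
// topic needs in order to retain the latest full state per key.
objects.toStream().to("objects");
```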
Related - Is there a KSQL statement to update values in table?

Eclipselink batch fetch VS join fetch

When should I use "eclipselink.join-fetch", and when should I use "eclipselink.batch" (batch type = IN)?
Are there any limitations for join fetch, such as the number of tables being fetched?
The answer is always specific to your query, your use case, and your database, so there is no hard rule on when to use one over the other, or whether to use either at all. You cannot determine what to use unless you are serious about performance and willing to test both under production load conditions, just like any query performance tuning.
Join fetch is just what it says, causing all the data to be brought back in the one query. If your goal is to reduce the number of SQL statements, it is perfect. But it comes at a cost: inner/outer joins, Cartesian joins, etc. can increase the amount of data sent across the wire and the work the database has to do.
Batch fetching issues one extra query (1+1) and can be done in a number of ways. IN collects all the foreign key values and puts them into one statement (or more, if you have more than 1000 values on Oracle). JOIN is similar to join fetch, as it uses the criteria from the original query to select over the relationship, but it won't return as much data, as it only fetches the required rows. EXISTS is very similar, using a subquery to filter.
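For reference, both hints can be set on a query like this (a sketch; the Order entity and its orderLines relationship are made up, and em is an EntityManager):

```java
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;
import org.eclipse.persistence.annotations.BatchFetchType;
import org.eclipse.persistence.config.QueryHints;

// eclipselink.join-fetch: one query, with orderLines joined in.
TypedQuery<Order> joinFetch = em.createQuery("SELECT o FROM Order o", Order.class);
joinFetch.setHint(QueryHints.FETCH, "o.orderLines");

// eclipselink.batch with type IN: 1+1 queries, the second using an IN clause.
TypedQuery<Order> batchFetch = em.createQuery("SELECT o FROM Order o", Order.class);
batchFetch.setHint(QueryHints.BATCH_TYPE, BatchFetchType.IN);
batchFetch.setHint(QueryHints.BATCH, "o.orderLines");
```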

Read Model Partition Key Strategy

I have a collection of documents, with one document per VIN/SiteID, and our access pattern is showing all documents at a specific site. I see two potential partition keys we could choose from:
SiteID - We only have 75 sites, so the cardinality is not very high. Also, the documents are not very big, so the 10 GB partition limit is probably OK.
SiteID/VIN - The data is now more evenly distributed, but that means each logical partition will only store one item. Is this an anti-pattern? Also, to support our access pattern we will need to use a cross-partition query. Again, the data set is small, so is this a problem?
Based on what I am describing, which partition key makes more sense?
Any other suggestions would be greatly appreciated!
Your first option makes a lot of sense and could be a good partition key, but the words "probably OK" don't really inspire confidence. Remember, the only way to change the partition key is to migrate to a new collection. If you can take that risk, then SiteID (which I'm guessing you will always have) is a good partition key.
If you have both VIN and SiteID available when you are reading or querying, then the combination is the safer choice. There is no problem per se with having each logical partition store one item; it's only a problem when you are doing cross-partition queries. If you know both VIN and SiteID in your queries, then it's a great plan.
You also have to remember that your RUs are evenly split across the partitions inside a collection.
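To make the trade-off concrete, here is a sketch using the older Azure DocumentDB Java SDK (the collection link and site id are made up, and client is an initialized DocumentClient):

```java
import com.microsoft.azure.documentdb.Document;
import com.microsoft.azure.documentdb.DocumentClient;
import com.microsoft.azure.documentdb.FeedOptions;
import com.microsoft.azure.documentdb.FeedResponse;
import com.microsoft.azure.documentdb.PartitionKey;

String collectionLink = "dbs/fleet/colls/vehicles";

// With SiteID as the partition key, the access pattern is a single-partition query:
FeedOptions singlePartition = new FeedOptions();
singlePartition.setPartitionKey(new PartitionKey("site-42"));
FeedResponse<Document> bySite = client.queryDocuments(collectionLink,
        "SELECT * FROM c WHERE c.SiteID = 'site-42'", singlePartition);

// With SiteID/VIN as the partition key, the same query has to fan out:
FeedOptions crossPartition = new FeedOptions();
crossPartition.setEnableCrossPartitionQuery(true);
FeedResponse<Document> byFanOut = client.queryDocuments(collectionLink,
        "SELECT * FROM c WHERE c.SiteID = 'site-42'", crossPartition);
```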