How can I implement an update of an object that I store in a Kafka topic / KTable?
I mean, not replacing the whole value (which a compacted KTable would do), but updating a single field. Should I read from the topic/KTable, deserialize, update the object, and then write the new value back to the same topic/KTable?
Or should I join/merge two topics: one with the original value and a second with the update of the field?
What would you do?
Kafka (and RocksDB) stores bytes; it cannot compare nested fields as though they were database columns. Doing so would require deserialization anyway.
To update a field, you need to construct and post the whole new value; a join will effectively do the same thing.
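As a rough illustration, here is a minimal Kafka Streams sketch of the join approach: a stream of single-field updates is joined against the table of current values, and the rebuilt full value is written back to the table's topic. The topic names, the Customer type, and the withEmail helper are all illustrative, and serde configuration is omitted.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class FieldUpdateSketch {

    // minimal value type, for illustration only
    static class Customer {
        final String name;
        final String email;
        Customer(String name, String email) { this.name = name; this.email = email; }
        Customer withEmail(String newEmail) { return new Customer(name, newEmail); }
    }

    static void buildTopology(StreamsBuilder builder) {
        // current full values, backed by the compacted topic (topic name is an assumption)
        KTable<String, Customer> customers = builder.table("customers");

        // partial updates carrying only the new value of one field
        KStream<String, String> emailUpdates = builder.stream("customer-email-updates");

        // read the current value, rebuild the whole object with the one field changed,
        // and publish the complete new value back to the table's topic
        emailUpdates
            .join(customers, (newEmail, current) -> current.withEmail(newEmail))
            .to("customers");
    }
}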
Related - Is there a KSQL statement to update values in table?
Is there a way to share documents between partitions to avoid duplication?
If I have enum data like Message Status (sent, sending, failed, undeliverable, etc.), I have to make a copy of it for each partition instead of having all partitions share the same statuses.
Example for clarity:
User A has access to partition 1, user B has access to partition 2. For user A to have access to the enum data, it needs to exist with partition key 1. Same thing with user B, he needs a copy of the enum documents with partition key 2, because he cannot access the existing one with partition 1. So the data ends up being duplicated.
Got an answer from another place:
"No, at this time a Realm Object (that translates into a document when stored in MongoDB) can only belong to a single partition: if you need to refer to the same piece of data from multiple partitions, two of the pattern you can follow:
If the data is small, you can use an EmbeddedObject (or a list of them), that, while using more space because of duplicates, allows a quicker access to the data itself without a roundtrip to the DB. If you're talking about enums, this looks like a good choice: an ObjectId would take some amount of data anyway.
Define another partition with the common data, and open a separate, specific realm on the client for it: you'll have to manually handle the relationships, though, i.e. you can store the ObjectIds in your main realm, but will need to query the secondary one for the actual objects. Probably worth only if the common data is complex enough to justify the overhead."
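For the first pattern, here is a minimal sketch using the Realm Java SDK; the class and field names are purely illustrative, and the sync/partition setup is assumed to already exist.

// MessageStatus.java
import io.realm.RealmObject;
import io.realm.annotations.RealmClass;

// Embedded object: each parent keeps its own copy, so no cross-partition
// reference and no extra round trip is needed to read the status.
@RealmClass(embedded = true)
public class MessageStatus extends RealmObject {
    public String code;   // e.g. "sent", "failed", "undeliverable"
    public String label;
}

// Message.java
import io.realm.RealmObject;
import io.realm.annotations.PrimaryKey;
import org.bson.types.ObjectId;

// Parent object living in the user's partition.
public class Message extends RealmObject {
    @PrimaryKey
    public ObjectId _id = new ObjectId();
    public String _partition;      // partition key, e.g. "1" or "2"
    public MessageStatus status;   // embedded copy of the status data
}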
I want to query a Kafka Streams state store based on time range. The use case is that I'll have the streams processor be scheduled every 30 seconds. During each invocation, I want to query a state store but for only the entries which are "new". I thought TimestampedKeyValueStore might help but couldn't find the right APIs to do it. Is it possible to query the state store based on time range (and with exactly-once guarantee)?
You cannot query a KeyValueStore based on a time range, because that does not really align with the semantics of the store. Queries are always against the key, and a TimestampedKeyValueStore only stores an additional timestamp next to the value.
You could use a WindowedStore though. Note that a windowed store is basically also just a key-value store; however, it stores a timestamp next to the key (not the value; well, there is also TimestampedWindowStore that does both). This allows you to query time ranges.
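A minimal sketch of that idea using the Processor API, assuming a window store named "events-window-store" has been attached to the processor (store wiring, serdes, and retention configuration are omitted): the processor writes incoming values into the store and, every 30 seconds, fetches only the entries written since the previous scan.

import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.WindowStore;

public class NewEntriesProcessor implements Processor<String, Long, Void, Void> {

    private WindowStore<String, Long> store;
    private Instant lastScan = Instant.EPOCH;

    @Override
    public void init(ProcessorContext<Void, Void> context) {
        store = context.getStateStore("events-window-store");   // store name is an assumption
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, now -> {
            Instant upTo = Instant.ofEpochMilli(now);
            // fetch only the entries whose window timestamp falls in [lastScan, upTo]
            try (KeyValueIterator<Windowed<String>, Long> it = store.fetchAll(lastScan, upTo)) {
                while (it.hasNext()) {
                    KeyValue<Windowed<String>, Long> entry = it.next();
                    // process only the "new" entries here
                }
            }
            lastScan = upTo;
        });
    }

    @Override
    public void process(Record<String, Long> record) {
        store.put(record.key(), record.value(), record.timestamp());
    }
}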
We have been working on the Kafka ecosystem. Let me go through the flow:
Source(SQLServer) -> Debezium(CDC) -> Kafka Broker -> Kafka Stream(Processing, joins etc) -> Mongo connector -> Mongo DB
Now we are at the last step: we are inserting the processed data into MongoDB, but we now have a requirement to upsert data instead of just inserting it.
Can we get upsert (insert/update) functionality from the Mongo sink connector? As far as I understand, it can't be done.
Please follow the provided link; it has all the information about the Kafka Mongo connector. I have successfully implemented the upsert functionality. You just need to read this document carefully.
Kafka Connector - Mongodb
Effectively this is an upsert: we want to insert if the ${uniqueFieldToUpdateOn} is not in Mongo, or update if it exists, as follows.
There are two main ways of modelling data changes in a collection, depending on your use case: update or replace, as outlined below.
UPDATE
The following config states:
Substitute ${uniqueFieldToUpdateOn} with a field that is unique to the record that you want to model your update on.
AllowList (whitelist) this field; used with the PartialValueStrategy, this allows custom value fields to be projected for the id strategy.
UpdateOneBusinessKeyTimestampStrategy means that only the one document referenced by the unique field declared above will be updated (latest timestamp wins).
"document.id.strategy":"com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"document.id.strategy.partial.value.projection.list":"${uniqueFieldToUpdateOn}",
"document.id.strategy.partial.value.projection.type":"AllowList",
"writemodel.strategy":"com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneBusinessKeyTimestampStrategy"
REPLACE
NB: this models a REPLACE, not an update, but may be useful nonetheless.
The following config states:
Substitute ${uniqueFieldToUpdateOn} with a field that is unique to the record that you want to model your replace on.
AllowList (whitelist) this field; used with the PartialValueStrategy, this allows custom value fields to be projected for the id strategy.
ReplaceOneBusinessKeyStrategy means that only the one document referenced by the unique field declared above will be replaced.
"document.id.strategy":"com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"document.id.strategy.partial.value.projection.list":"${uniqueFieldToUpdateOn}",
"document.id.strategy.partial.value.projection.type":"AllowList",
"writemodel.strategy":"com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneBusinessKeyStrategy"
I recently started experimenting with kafka streams. I have a scenario where I need to join a KStream with a KTable. It may be the case that the KTable does not contain some of the keys. In that case I get a NullPointerException.
Specifically, I was getting:
stream-thread [StreamThread-1] Streams application error during processing:
java.lang.NullPointerException
I don't know how I can handle that. I cannot somehow filter out the records of the stream that do not correspond to a table entry.
update
Looking a bit further I found that I can query the underlying store to find whether a key exists through the ReadOnlyKeyValueStore interface.
In this case my question is, would that be the best way to go? i.e. Filtering the stream to be joined based on whether a key exists in the local store?
My second question in this case would be: since I care about leveraging the Global State Store introduced in version 0.10.2 in a next phase, should I expect that I will also be able to query the Global State Store in the same manner?
update
The previous update is not accurate since it's not possible to query the state store from inside the topology
final update
After understanding the join semantics a bit better, I was able to solve the issue just by simplifying the valueJoiner to only return the results, instead of performing actions on the joined values, and adding an extra filtering step after the join to filter out null values.
The solution to my problem came from understanding the join semantics a bit better.
Like in database joins (although I am not saying that KStream joins follow the DB join concepts precisely), the left join operation results in rows with null values wherever the right-side keys are missing.
So eventually the only thing I had to do was to decouple my valueJoiner from the subsequent calculations/operations (I needed to perform some calculations on fields of the joined records and return a newly constructed object) and have it only return an array of the joined values. Then I could filter out the records that resulted in null values by checking those arrays.
Based on Matthias J. Sax's suggestion, I used version 0.10.2 instead of 0.10.1 (0.10.2 is compatible with broker version 0.10.1) and replaced the whole leftJoin logic with an inner join, which removes the need for filtering out null values.
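For reference, here is a condensed sketch of the two approaches described above; the topic names and the String value types are placeholders.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class JoinSketch {
    static void build(StreamsBuilder builder) {
        KStream<String, String> events = builder.stream("events");   // topic names are placeholders
        KTable<String, String> lookup = builder.table("lookup");

        // Option 1: leftJoin keeps every stream record; when the key is missing from
        // the table, the right-hand value is null, so the valueJoiner only returns the
        // joined pair and the nulls are filtered out before any further calculations.
        KStream<String, String[]> joinedAndFiltered = events
            .leftJoin(lookup, (event, ref) -> new String[]{event, ref})
            .filter((key, pair) -> pair[1] != null);

        // Option 2: an inner join drops stream records with no matching table entry,
        // which removes the need for the null filter entirely.
        KStream<String, String[]> innerJoined = events
            .join(lookup, (event, ref) -> new String[]{event, ref});
    }
}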
I would like to have an action triggered every time an item is created or updated on a DynamoDB. I have been going through the doc, but cannot find anything like this. Is it possible?
Thanks.
This is not possible. DynamoDB doesn't let you run any code server-side. The only thing that might count as a server-side action as part of an update is a conditional update, but those can't trigger changes to other items.
The new update supports triggers.
https://aws.amazon.com/blogs/aws/dynamodb-update-triggers-streams-lambda-cross-region-replication-app/
Now you can use DynamoDB Streams.
A stream consists of stream records. Each stream record represents a single data modification in the DynamoDB table to which the stream belongs. Each stream record is assigned a sequence number, reflecting the order in which the record was published to the stream.
Stream records are organized into groups, or shards. Each shard acts as a container for multiple stream records, and contains information required for accessing and iterating through these records. The stream records within a shard are removed automatically after 24 hours.
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
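A minimal sketch of a Lambda handler consuming those stream records, using the aws-lambda-java-events library; it assumes the table has a stream enabled and is wired to the function as an event source.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

// Triggered by DynamoDB Streams: each invocation receives a batch of stream records,
// each record describing a single data modification (INSERT, MODIFY, REMOVE).
public class ItemChangeHandler implements RequestHandler<DynamodbEvent, Void> {
    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbStreamRecord record : event.getRecords()) {
            String eventName = record.getEventName();   // INSERT / MODIFY / REMOVE
            context.getLogger().log(eventName + ": " + record.getDynamodb().getKeys());
            // react to item creations and updates here
        }
        return null;
    }
}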
Check out http://zapier.com/help/dynamodb; it might be what you are looking for.