We have been working on a Kafka ecosystem. Let me go through the flow:
Source (SQL Server) -> Debezium (CDC) -> Kafka Broker -> Kafka Streams (processing, joins, etc.) -> Mongo sink connector -> MongoDB
Now we are at the last step: we are inserting processed data into MongoDB, but we now have a requirement to upsert data instead of just inserting it.
Can we get upsert (insert/update) functionality from the Mongo sink connector? As of now, my understanding is that it can't be done.
Please follow the provided link; it has all the information about the Kafka MongoDB connector. I have successfully implemented upsert functionality. You just need to read this document carefully.
Kafka Connector - Mongodb
Effectively this is an upsert: we want to insert if ${uniqueFieldToUpdateOn} is not in Mongo, or update if it exists, as follows.
There are two main ways of modelling data changes in a collection, depending on your use case: update or replace, as outlined below.
UPDATE
The following config states:
Set ${uniqueFieldToUpdateOn} to a field that is unique to that record and that you want to model your update on.
AllowList (whitelist) this field. For use with the PartialValueStrategy, this allows custom value fields to be projected for the id strategy.
UpdateOneBusinessKeyTimestampStrategy means that only the one document referenced by the unique field declared above will be updated (latest timestamp wins).
"document.id.strategy":"com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"document.id.strategy.partial.value.projection.list":"${uniqueFieldToUpdateOn}",
"document.id.strategy.partial.value.projection.type":"AllowList",
"writemodel.strategy":"com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneBusinessKeyTimestampStrategy"
REPLACE
NB: this models a REPLACE, not an update, but may be useful nonetheless.
The following config states:
Set ${uniqueFieldToUpdateOn} to a field that is unique to that record and that you want to model your replace on.
AllowList (whitelist) this field. For use with the PartialValueStrategy, this allows custom value fields to be projected for the id strategy.
ReplaceOneBusinessKeyStrategy means that only the one document referenced by the unique field declared above will be replaced.
"document.id.strategy":"com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"document.id.strategy.partial.value.projection.list":"${uniqueFieldToUpdateOn}",
"document.id.strategy.partial.value.projection.type":"AllowList",
"writemodel.strategy":"com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneBusinessKeyStrategy"
From the official documentation, we can set up partition fields to speed up performance when using Online Archive.
The order of fields listed in the path is important in the same way as it is in Compound Indexes. Data in the specified path is partitioned first by the value of the first field, and then by the value of the next field, and so on. Atlas supports queries on the specified fields using the partitions.

For example, suppose you are configuring the online archive for the movies collection in the sample_mflix database. If your archived field is the released date field, which you moved to the third position, your first queried field is title, and your second queried field is plot, your partition will look similar to the following: /title/plot/released

Atlas creates partitions first for the title field, followed by the plot field, and then the released field. Atlas uses the partitions for queries on the following fields:

the title field,
the title field and the plot field,
the title field and the plot field and the released field.

Atlas can also use the partitions to support a query on the title and released fields. However, in this case, Atlas would not be as efficient in supporting the query as it would be if the query were on the title and plot fields only. Partitions are parsed in order; if a query omits a particular partition, Atlas is less efficient in making use of any partitions that follow it. Since a query on title and released omits plot, Atlas uses the title partition more efficiently than the released partition to support this query.
Here I simplify the situation for the question. I need to query by:
1. title (eq) / plot (eq) / released (range)
2. title (eq) / released (range)
From the documentation, title/plot/released supports 1. but is inefficient for 2. If I change it to released/title/plot, it seems perfect but violates the ESR rule.
Do partition fields need to follow the ESR rule? What is the correct way to solve this requirement? Any deep-dive explanations are welcome.
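For concreteness, here are the two access patterns as mongosh queries (the values are made-up placeholders against the sample_mflix example above):

// 1. title (eq) / plot (eq) / released (range)
db.movies.find({
  title: "The Matrix",
  plot: "A hacker learns the truth about his reality.",
  released: { $gte: ISODate("1999-01-01"), $lt: ISODate("2000-01-01") }
});

// 2. title (eq) / released (range) -- plot is omitted, which is the case the
// quoted passage says a /title/plot/released partition order serves less efficiently.
db.movies.find({
  title: "The Matrix",
  released: { $gte: ISODate("1999-01-01"), $lt: ISODate("2000-01-01") }
});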
I want to convert the MongoDB local oplog into actual queries, so that I can execute those queries and get an exact copy of the database.
Is there any package, built-in tool, or script for this?
It's not possible to get the exact query from the oplog entry because MongoDB doesn't save the query.
The oplog has an entry for each atomic modification performed. Multi-inserts/updates/deletes performed on the mongo instance using a single query are converted to multiple entries and written to the oplog collection. For example, if we insert 10,000 documents using Bulk.insert(), 10,000 new entries will be created in the oplog collection. Now the same can also be done by firing 10,000 Collection.insertOne() queries. The oplog entries would look identical! There is no way to tell which one actually happened.
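As a quick illustration of this, run the following in mongosh against a replica-set member (database and collection names are placeholders; the oplog only exists on replica sets):

// Insert three documents in a single call, outside a transaction.
db.getSiblingDB("test").demo.insertMany([{ a: 1 }, { a: 2 }, { a: 3 }]);

// Each inserted document appears as its own op: "i" entry in the oplog;
// nothing records whether they came from one insertMany() or three insertOne() calls.
db.getSiblingDB("local").oplog.rs
  .find({ op: "i", ns: "test.demo" })
  .sort({ $natural: -1 })
  .limit(3);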
Sorry, but that is impossible.
The reason is that the oplog doesn't contain queries. The oplog includes only changes (insert, update, delete) to the data, and it's there for replication and redo.
Getting an exact copy of a DB is called "replication", and that is of course supported by the system.
To "replicate" changes to, for example, a single DB or collection, you can use change streams: https://www.mongodb.com/docs/manual/changeStreams/.
You can reconstruct the operations from the oplog. The oplog defines multiple op types; for instance, op: "i", "u", and "d" are for insert, update, and delete. For these types, check the "o"/"o2" fields, which hold the corresponding data and filters.
Then, based on the op type, call the corresponding driver APIs: db.collection.insert()/update()/delete().
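A rough mongosh sketch of that approach (database, collection, and namespace names are placeholders; the update branch assumes legacy $set/$unset-style entries, since newer servers record updates as a "$v": 2 diff that cannot be passed to updateOne() as-is):

const source = db.getSiblingDB("local").oplog.rs;
const target = db.getSiblingDB("targetdb").mycollection;

source.find({ ns: "sourcedb.mycollection", op: { $in: ["i", "u", "d"] } })
  .sort({ $natural: 1 })           // replay in the original order
  .forEach(function (entry) {
    switch (entry.op) {
      case "i":                    // insert: "o" holds the full document
        target.insertOne(entry.o);
        break;
      case "u":                    // update: "o2" holds the filter, "o" the modification
        target.updateOne(entry.o2, entry.o);
        break;
      case "d":                    // delete: "o" holds the filter (usually just _id)
        target.deleteOne(entry.o);
        break;
    }
  });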
How can I implement an update of an object that I store in a Kafka topic / KTable?
I mean, what if I need not a replacement of the whole value (which a compacted KTable would do), but a single-field update? Should I read from the topic/KTable, deserialize, update the object, and then store the new value back in the same topic/KTable?
Or should I join/merge two topics: one with the original value and the second with the field update?
What would you do?
Kafka (and RocksDB) stores bytes; it cannot compare nested fields as though they were database columns. Doing so would require deserialization anyway.
To update a field, you need to construct and post the whole value; a join will effectively do the same thing.
Related - Is there a KSQL statement to update values in table?
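A minimal sketch of that read-modify-write pattern using the kafkajs client (broker address, topic, and field names are placeholders; a Kafka Streams topology would do the equivalent with mapValues()). However the current object is obtained (consuming the compacted topic, a state-store lookup, etc.), the update is published as the whole new value under the same key, never as a single-field patch:

const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "field-updater", brokers: ["localhost:9092"] });
const producer = kafka.producer();

async function updateEmail(currentUser, newEmail) {
  // Change one field on the deserialized object...
  const updated = { ...currentUser, email: newEmail };

  // ...but re-publish the complete value; log compaction then keeps only the
  // latest value per key, which is how a KTable "sees" the update.
  await producer.send({
    topic: "users",
    messages: [{ key: String(updated.id), value: JSON.stringify(updated) }],
  });
}

async function main() {
  await producer.connect();
  await updateEmail({ id: 42, name: "Alice", email: "old@example.com" }, "new@example.com");
  await producer.disconnect();
}

main().catch(console.error);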
We can update/upsert a record in MongoDB, but is there any method or function with which we can update or upsert a document directly in MongoDB, where the source system is Kafka and the destination is MongoDB?
Yes, we can update/upsert the data.
For an update, you have to define a parameter in the Kafka connector
and whitelist the column on the basis of which you want to update the record. The properties are as follows:
document.id.strategy=com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy
value.projection.list=tokenNumber
value.projection.type=whitelist
writemodel.strategy=com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy
I was struggling with this, and finally I got the answer. I used the following MongoDB sink connector.
After banging my head against their documentation for some time, I finally figured out the solution.
This is the exact MongoDB sink connector configuration I am using:
{
"name": "mongodbsync",
"connector.class": "at.grahsl.kafka.connect.mongodb.MongoDbSinkConnector",
"topics": "alpha-foobar",
"mongodb.connection.uri": "mongodb://localhost:27017/kafkaconnect?w=1&journal=true",
"mongodb.document.id.strategy": "at.grahsl.kafka.connect.mongodb.processor.id.strategy.ProvidedInValueStrategy"
}
I left mongodb.writemodel.strategy blank in my configuration, so it is using the default one.
I used use case 2 of the following docs from the GitHub repository of the same connector.
I was dealing with this scenario: transferring MySQL table data to a MongoDB sink with the Kafka JDBC source connector.
The above strategies can also be found in the official docs.
Please feel free to ask if you have any doubts. Thanks.
I am using MongoDB in my web API. MongoDB is being updated/inserted into by other sources.
How do I query MongoDB to get only newly inserted or updated documents?
I can get sorted documents with the query below, but this doesn't solve my purpose:
db.collectionName.findOne({}, {sort:{$natural:-1}})
Is there any log or other way, like INSERTED and UPDATED in SQL?
What are these newly inserted/updated documents in your context?
Assuming that you need to fetch newly inserted/updated documents relative to a point in time, you need a field in your collection that holds the timestamp (for example, inserted and lastUpdated fields); there is an update operator ($currentDate) to help with maintaining it. But this needs application changes.
Or you can use change streams, for trigger-like functionality. You can listen for changes and take action as changes are made.
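A minimal mongosh sketch of both suggestions (collection and field names are placeholders):

// Option 1: application-maintained timestamp fields.
// $currentDate sets lastUpdated to the server's current time on every write.
db.collectionName.updateOne(
  { _id: 1 },
  { $set: { status: "processed" }, $currentDate: { lastUpdated: true } },
  { upsert: true }
);

// Later, fetch everything touched since the last poll:
const lastPollTime = new Date(Date.now() - 60 * 1000);   // e.g. one minute ago
db.collectionName.find({ lastUpdated: { $gt: lastPollTime } });

// Option 2: a change stream, which pushes inserts/updates as they happen
// (requires a replica set or sharded cluster).
const watchCursor = db.collectionName.watch([
  { $match: { operationType: { $in: ["insert", "update", "replace"] } } }
]);
while (!watchCursor.isExhausted()) {
  if (watchCursor.hasNext()) {
    printjson(watchCursor.next());   // act on each change event here
  }
}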