Batch processing with the reactive Couchbase Java driver

Suppose I have a bucket from which I need to fetch documents that have a date older than now.
This document looks like this:
{
    id: "1",
    date: "Some date",
    otherObjectKEY: "key1"
}
For each document, I need to fetch another document using its otherObjectKEY, send the latter to a Kafka topic, then delete the original document.
Using the reactive Java driver 3.0, I was able to do it with something like this:
public void batch() {
    streamOriginalObjects()
        .flatMap(originalObject -> fetchOtherObjectUsingItsKEY(originalObject)
            .flatMap(otherObject -> sendToKafkaAndDeleteOriginalObject(originalObject))
        )
        .subscribe();
}
streamOriginalObjects():
public Flux<OriginalObject> streamOriginalObjects() {
    return client.query("select ... and date <= '" + LocalDateTime.now().toString() + "'")
        .flux()
        .flatMap(result -> result.rowsAs(OriginalObject.class));
}
It works as expected, but I'm wondering whether there is a better approach (especially in terms of performance) than streaming and processing the elements one by one.

Doing a N1QL query and then fanning out key-value operations from its results is a useful and common pattern. This should make the fan-out happen in parallel:
streamOriginalObjects()
        // Split into numberOfThreads 'rails'
        .parallel(numberOfThreads)
        // Run on an unlimited thread pool
        .runOn(Schedulers.elastic())
        .concatMap(originalObject -> fetchOtherObjectUsingItsKEY(originalObject)
                .concatMap(otherObject -> sendToKafkaAndDeleteOriginalObject(originalObject))
        )
        // Back out of parallel mode
        .sequential()
        .subscribe();
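For reference, a minimal sketch of what the two helpers could look like on top of the SDK 3.x reactive key-value API. The ReactiveCollection field, the sendToKafka helper and the getter names below are assumptions for illustration, not something the question defines:
// Sketch only: assumes ReactiveCollection collection = cluster.bucket("bucketName").defaultCollection().reactive();
// and a sendToKafka(...) helper returning a Mono (e.g. backed by reactor-kafka or a wrapped producer).
private Mono<OtherObject> fetchOtherObjectUsingItsKEY(OriginalObject originalObject) {
    return collection.get(originalObject.getOtherObjectKEY())            // reactive KV get
            .map(result -> result.contentAs(OtherObject.class));
}

private Mono<MutationResult> sendToKafkaAndDeleteOriginalObject(OriginalObject originalObject) {
    return sendToKafka(originalObject)                                    // publish first...
            .then(collection.remove(originalObject.getId()));             // ...then delete the original document
}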

Related

How to update related messages in a Kafka topic if the incoming message has a reference to them only inside the value?

I need to merge the payloads if the reference number (refNo) is the same in different messages. My limitation is that I can only use a KTable, and if the key is an even number I don't need to merge the payload. Additionally, the order of incoming messages should not change the result.
For example, if we have an empty topic and incoming messages are:
1: { key: "1", value: {refNo:1, payload:{data1}} }
2: { key: "2", value: {refNo:1, payload:{data2}} }
3: { key: "3", value: {refNo:2, payload:{data3}} } // this one should be not effected and left how it is
Expected result:
1: { key: "1", value: {refNo:1, payload:{data1, data2}} }
2: { key: "2", value: {refNo:1, payload:{data2}} }
3: { key: "3", value: {refNo:2, payload:{data3}} }
The only way I can think of to do this is to use .groupBy twice and then join everything with the original topic again.
First, change the key to refNo, save the original key inside the value itself, and merge the payloads during aggregation.
Second, .groupBy again to revert the key to its initial state.
The last step joins everything back with the original topic, because one message gets lost during the grouping.
I'm pretty sure there's an easier way to do this. What is the most optimized and elegant way to solve this issue?
Edit: It's downstream and there is an output topic; the original topic is not edited.
Aggregating within KSQL could probably accomplish exactly this. You can use any of the aggregate functions like COLLECT_LIST(col1) => ARRAY
The only issue is how large of a window you would need. How frequently do you need to concatenate the data?
Also, it feels like a big "no" to write back to the original topic.
You're changing the message format slightly, and other downstream consumers could be expecting a specific message format.
Writing to a new topic seems like the better route to go; it also decreases the number of messages additional consumers need to consume.
At the moment I'm going with the solution below. It works, but I have no idea how it will perform, whether it can be optimized further, or whether there is a better way to solve my issue.
KStream<String, Value> even = inputTopicStream.filter((key, value) -> value.isEven());

inputTopicStream.toTable(Materialized.with(Serdes.String(), Value.serde))
        .groupBy(
                // re-key by refNo and keep the original key inside the value
                (key, value) -> KeyValue.pair(new Key(value.getRefNo()), addKeyToValue(key, value)),
                Grouped.with("aggregation-internal", Key.serde, Value.serde)) // assumes a serde for the custom Key type, analogous to Value.serde
        .aggregate(
                Value::new,
                (key, value, agg) -> mergePayload(key, value, agg), // adder: ensure that the key is uneven after merge
                (key, value, agg) -> handleSplit(key, value, agg))  // subtractor
        .toStream()
        .selectKey((key, value) -> value.getKey()) // restore the original key saved in the value
        .merge(even) // merge the even-key stream back in, because those records are lost during the aggregation
        .to(OUTPUT_TOPIC, Produced.with(Serdes.String(), Value.serde));

Bulk.getOperations() in MongoDB Node driver

I'd like to view the results of a bulk operation, specifically to know the IDs of the documents that were updated. I understand that this information is made available through the Bulk.getOperations() method. However, it doesn't appear that this method is available through the MongoDB NodeJS library (at least, the one I'm using).
Could you please let me know if there's something I'm doing wrong here:
const bulk = db.collection('companies').initializeOrderedBulkOp()
const results = getLatestFinanicialResults() // from remote API

results.forEach(result =>
  bulk.find({
    name: result.companyName,
    report: { $ne: result.report }
  }).updateOne([
    { $unset: 'prevReport' },
    { $set: { prevReport: '$report' } },
    { $unset: 'report' },
    { $set: { report: result.report } }
  ]))

await bulk.execute()
await bulk.getOperations() // <-- fails, undefined in the TypeScript library
I get a static IDE error:
Uncaught TypeError: bulk.getOperations is not a function
I'd like to view the results of a bulk operation, specifically to know the IDs of the documents that were updated
As of the current MongoDB server (v6.x), there are no methods that return the IDs of updated documents from a bulk operation (only of inserted and upserted documents). However, there may be a workaround depending on your use case.
The manual page that you linked for Bulk.getOperations() is for mongosh, the MongoDB Shell application. If you look into the source code for getOperations() in mongosh, it's just a convenience wrapper around batches, which returns a list of the operations sent for bulk execution.
As you are utilising ordered bulk operations, MongoDB executes the operations serially. If an error occurs during the processing of one of the write operations, MongoDB will return without processing any remaining write operations in the list.
Depending on your use case, if you modify the bulk.find() part to search by _id, for example:
bulk.find({"_id": result._id}).updateOne({$set:{prevReport:"$report"}});
You should be able to see the _id value of the operation in the batches, i.e.
await batch.execute();
console.log(JSON.stringify(batch.batches));
Example output:
{
  "originalZeroIndex": 0,
  "currentIndex": 0,
  "originalIndexes": [0],
  "batchType": 2,
  "operations": [
    {
      "q": { "_id": "634354787d080d3a1e3da51f" },
      "u": { "$set": { "prevReport": "$report" } }
    }
  ],
  "size": 0,
  "sizeBytes": 0
}
For additional information, you could also inspect the BulkWriteResult; for example, getLastOp retrieves the last operation (in case of a failure).

Morphia - Merging a complex query with a complex criteria

I am converting filters from the client to a QueryImpl using the setQueryObject method.
When I try to add another complex criteria to that query, my original query is moved to a field named baseQuery and the new criteria becomes the query field.
When the query is executed, only the new criteria is used and the baseQuery is ignored.
It only happens when the client query is formatted like this: { "$or" : [{ "field1" : { "$regex" : "value1", "$options" : "i"}},...]}
and the new criteria is formatted in the same way (meaning it is also an $or operation).
It seems to happen when I try to merge two $or queries, but when I merge an $or with an $and the queries are concatenated properly.
Am I using it wrong or is it a genuine bug?
Edit:
Code:
public static List<Entity> getData(client.Query query) {
    QueryImpl<Entity> finalQuery = Morphia.realAccess().extractFromQuery(Entity.class, query);
    finalQuery.and(finalQuery.or(finalQuery.criteria("field").equal(false),
                                 finalQuery.criteria("field").doesNotExist()));
    return finalQuery.asList();
}

public <E> QueryImpl<E> extractFromQuery(Class<E> clazz, client.Query query) {
    QueryImpl<E> result = new QueryImpl<E>(clazz, this.db.getCollection(clazz), this.db);
    result.setQueryObject(query.getFiltersAsDBObject());
    return result;
}
QueryImpl and setQueryObject() are internal constructs. You really shouldn't be using them as they may change or go away without warning. You should be using the public query builder API to build up your query document.
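For illustration, here is a rough sketch of building that kind of filter entirely through the public (legacy) Morphia query API instead of setQueryObject(); the datastore variable and the field names are placeholders, not taken from the question:
// Sketch only; "datastore", "field1" and "field2" are assumed names.
Query<Entity> q = datastore.createQuery(Entity.class);
q.and(
        // the part that previously arrived from the client as a raw $or document
        q.or(
                q.criteria("field1").containsIgnoreCase("value1"),
                q.criteria("field2").containsIgnoreCase("value2")),
        // the extra criteria added on the server side
        q.or(
                q.criteria("field").equal(false),
                q.criteria("field").doesNotExist()));
List<Entity> entities = q.asList();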
I am having the same problem. It seems to work when I do this:
finalQuery.and();
finalQuery.and(finalQuery.or(finalQuery.criteria("field").equal(false), finalQuery.criteria("field").doesNotExist()));
It is kind of ugly, and I would love to hear if someone has a different approach, but the only way I was able to find to convert the client-side data, which is a DBObject, is to use setQueryObject() of QueryImpl.

Use allowDiskUse in criteria query with Grails and the MongoDB plugin?

In order to iterate over all the documents in a MongoDB (2.6.9) collection using Grails (2.5.0) and the MongoDB Plugin (3.0.2) I created a forEach like this:
class MyObjectService {
    def forEach(Closure func) {
        def criteria = MyObject.createCriteria()
        def ids = criteria.list { projections { id() } }
        ids.each { func(MyObject.get(it)) }
    }
}
Then I do this:
class AnalysisService {
    def myObjectService

    @Async
    def analyze() {
        MyObject.withStatelessSession {
            myObjectService.forEach { myObject ->
                doSomethingAwesome(myObject)
            }
        }
    }
}
This works great...until I hit a collection that is large (>500K documents) at which point a CommandFailureException is thrown because the size of the aggregation result is greater than 16MB.
Caused by CommandFailureException: { "serverUsed" : "foo.bar.com:27017" , "errmsg" : "exception: aggregation result exceeds maximum document size (16MB)" , "code" : 16389 , "ok" : 0.0}
In reading about this, I think that one way to handle this situation is to use the option allowDiskUse in the aggregation function that runs on the MongoDB side so that the 16MB memory limit won't apply and I can get a larger aggregation result.
How can I pass this option to my criteria query? I've been reading the docs and the Javadoc for the Grails MongoDB plugin, but I can't seem to find it. Is there is another way to approach the generic problem (iterate over all members of a large collection of domain objects)?
This is not possible with the current implementation of the MongoDB Grails plugin: https://github.com/grails/grails-data-mapping/blob/master/grails-datastore-gorm-mongodb/src/main/groovy/org/grails/datastore/mapping/mongo/query/MongoQuery.java#L957
If you look at that line, you will see that the default options are used when building the AggregationOptions instance, so there is no way to pass this option through.
But there is another, hackish way to do it using Groovy's metaclass. Let's do it. :-)
Store a reference to the original builder() method before running your criteria in the service:
MetaMethod originalMethod = AggregationOptions.metaClass.static.getMetaMethod("builder", [] as Class[])
Then replace the builder method with your own implementation:
AggregationOptions.metaClass.static.builder = { ->
    def builderInstance = new AggregationOptions.Builder()
    builderInstance.allowDiskUse(true) // solution to your problem
    return builderInstance
}
Now your service method can run its criteria query and should not hit the aggregation error you were getting, since allowDiskUse is now set to true.
Finally, restore the original method so that it does not affect any other calls (optional):
AggregationOptions.metaClass.static.addMetaMethod(originalMethod)
Hope this helps!
Apart from this, why are you pulling all the IDs in the forEach method and then re-fetching each instance with get()? You are wasting database queries, which will impact performance. Also, if you follow the approach below, you don't have to make the changes above.
An example of the same (updated):
class MyObjectService {
    void forEach(Closure func) {
        List<MyObject> instanceList = MyObject.createCriteria().list {
            // Your criteria code
            eq("status", "ACTIVE") // an example
        }
        // Don't do any of this:
        // println(instanceList)
        // println(instanceList.size())
        // *** explained below
        instanceList.each { myObjectInstance ->
            func(myObjectInstance)
        }
    }
}
(I'm not adding the code of AnalysisService since there is no change)
*** This is the main point. Whenever you write a criteria query on a domain class (without a projection, and on MongoDB), Grails/gmongo will not immediately fetch the records from the database after executing the criteria code unless you call methods like toString(), size() or dump() on the result.
Now, when you apply each on that instance list, you are not actually loading all instances into memory; instead you are iterating over a Mongo cursor behind the scenes, and in MongoDB a cursor pulls records from the database in batches, which is very memory safe. So you can safely call each directly on your criteria result without blowing up the JVM, as long as you don't call any of the methods that trigger loading all records from the database.
You can confirm this behaviour even in the code: https://github.com/grails/grails-data-mapping/blob/master/grails-datastore-gorm-mongodb/src/main/groovy/org/grails/datastore/mapping/mongo/query/MongoQuery.java#L1775
Whenever you write a criteria query without a projection, you get back an instance of MongoResultList, and there is a method named initializeFully() that is called by toString() and the other methods mentioned. But you can see that MongoResultList implements an iterator which in turn uses the MongoDB cursor to iterate over the large collection, which is again memory safe.
Hope this helps!

Ways to implement data versioning in MongoDB

Can you share your thoughts on how you would implement data versioning in MongoDB? (I've asked a similar question regarding Cassandra. If you have any thoughts on which DB is better for this, please share.)
Suppose that I need to version records in a simple address book. (Address book records are stored as flat JSON objects.) I expect that the history:
will be used infrequently
will be used all at once, to present it in a "time machine" fashion
won't contain more than a few hundred versions for a single record
won't expire.
I'm considering the following approaches:
Create a new collection to store the history of records, or the changes to the records. It would store one object per version, with a reference to the address book entry. Such records would look as follows:
{
    '_id': 'new id',
    'user': user_id,
    'timestamp': timestamp,
    'address_book_id': 'id of the address book record',
    'old_record': {'first_name': 'Jon', 'last_name': 'Doe', ...}
}
This approach can be modified to store an array of versions per document, but that seems to be a slower approach without any advantages.
Store versions as serialized (JSON) objects attached to the address book entries. I'm not sure how to attach such objects to MongoDB documents; perhaps as an array of strings.
(Modelled after Simple Document Versioning with CouchDB)
The first big question when diving into this is "how do you want to store changesets"?
Diffs?
Whole record copies?
My personal approach would be to store diffs. Because the display of these diffs is really a special action, I would put the diffs in a different "history" collection.
I would use a separate collection to save memory. You generally don't want the full history for a simple query, so by keeping the history out of the object itself you also keep it out of the commonly accessed memory when that data is queried.
To make my life easy, I would make a history document contain a dictionary of time-stamped diffs. Something like this:
{
    _id : "id of address book record",
    changes : {
        1234567 : { "city" : "Omaha", "state" : "Nebraska" },
        1234568 : { "city" : "Kansas City", "state" : "Missouri" }
    }
}
To make my life really easy, I would make this part of my DataObjects (EntityWrapper, whatever) that I use to access my data. Generally these objects have some form of history, so that you can easily override the save() method to make this change at the same time.
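As a rough illustration of that save() override, here is a minimal sketch using the MongoDB Java driver, assuming a MongoDatabase named db; the collection names, the computeDiff helper and the document layout are assumptions chosen to match the history document above, not something the answer prescribes:
// Sketch only: "addressBook", "addressBookHistory" and computeDiff(...) are placeholder names.
// (Uses com.mongodb.client.MongoCollection, Filters, Updates, UpdateOptions.)
void save(Document updated) {
    MongoCollection<Document> addressBook = db.getCollection("addressBook");
    MongoCollection<Document> history = db.getCollection("addressBookHistory");

    Document current = addressBook.find(Filters.eq("_id", updated.get("_id"))).first();
    Document diff = computeDiff(current, updated); // hypothetical helper producing only the changed fields
    long ts = System.currentTimeMillis();

    // Upsert a history document keyed by the record id, adding one time-stamped diff entry
    history.updateOne(
            Filters.eq("_id", updated.get("_id")),
            Updates.set("changes." + ts, diff),
            new UpdateOptions().upsert(true));

    addressBook.replaceOne(Filters.eq("_id", updated.get("_id")), updated);
}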
UPDATE: 2015-10
It looks like there is now a spec for handling JSON diffs. This seems like a more robust way to store the diffs / changes.
There is a versioning scheme called "Vermongo" which addresses some aspects that haven't been dealt with in the other replies.
One of these issues is concurrent updates; another is deleting documents.
Vermongo stores complete document copies in a shadow collection. For some use cases this might cause too much overhead, but I think it also simplifies many things.
https://github.com/thiloplanz/v7files/wiki/Vermongo
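The core idea (copy the current version of a document into a shadow collection before modifying it) can be sketched roughly like this with the MongoDB Java driver; this is only an illustration of the approach, with assumed collection names and an assumed _version field, not Vermongo's exact layout:
// Sketch only; "people", "people.vermongo" and "_version" are illustrative choices.
void updateVersioned(MongoCollection<Document> people, MongoCollection<Document> shadow,
                     Object id, Document newContent) {
    Document current = people.find(Filters.eq("_id", id)).first();
    int version = 0;
    if (current != null) {
        version = current.getInteger("_version", 0);
        // Archive the full current document under a composite key of (original id, version)
        Document copy = new Document(current);
        copy.put("_id", new Document("_id", id).append("_version", version));
        shadow.insertOne(copy);
    }
    newContent.put("_id", id);
    newContent.put("_version", version + 1);
    people.replaceOne(Filters.eq("_id", id), newContent, new ReplaceOptions().upsert(true));
}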
Here's another solution using a single document for the current version and all old versions:
{
    _id: ObjectId("..."),
    data: [
        { vid: 1, content: "foo" },
        { vid: 2, content: "bar" }
    ]
}
data contains all versions. The data array is ordered; new versions only get $pushed to the end of the array. data.vid is the version id, which is an incrementing number.
Get the most recent version:
find(
{ "_id":ObjectId("...") },
{ "data":{ $slice:-1 } }
)
Get a specific version by vid:
find(
{ "_id":ObjectId("...") },
{ "data":{ $elemMatch:{ "vid":1 } } }
)
Return only specified fields:
find(
{ "_id":ObjectId("...") },
{ "data":{ $elemMatch:{ "vid":1 } }, "data.content":1 }
)
Insert new version: (and prevent concurrent insert/update)
update(
{
"_id":ObjectId("..."),
$and:[
{ "data.vid":{ $not:{ $gt:2 } } },
{ "data.vid":2 }
]
},
{ $push:{ "data":{ "vid":3, "content":"baz" } } }
)
2 is the vid of the current most recent version and 3 is the new version being inserted. Because you need the most recent version's vid, it's easy to get the next version's vid: nextVID = oldVID + 1.
The $and condition ensures that 2 is the latest vid.
This way there's no need for a unique index, but the application logic has to take care of incrementing the vid on insert.
Remove a specific version:
update(
{ "_id":ObjectId("...") },
{ $pull:{ "data":{ "vid":2 } } }
)
That's it!
(remember the 16MB per document limit)
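For anyone doing this from Java, the "insert new version" step above could look roughly like the sketch below with the MongoDB Java driver; the collection handle and the way the latest vid is obtained (e.g. via the $slice:-1 query shown earlier) are assumptions:
// Sketch only: latestVid is assumed to have been read beforehand with the "$slice: -1" query.
boolean appendVersion(MongoCollection<Document> coll, ObjectId id, int latestVid, Object content) {
    int nextVid = latestVid + 1;
    UpdateResult result = coll.updateOne(
            Filters.and(
                    Filters.eq("_id", id),
                    // guard: fail if someone already pushed a higher vid
                    Filters.not(Filters.gt("data.vid", latestVid)),
                    Filters.eq("data.vid", latestVid)),
            Updates.push("data", new Document("vid", nextVid).append("content", content)));
    return result.getModifiedCount() == 1; // false means a concurrent writer got there first
}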
If you're looking for a ready-to-roll solution:
Mongoid has built-in simple versioning
http://mongoid.org/en/mongoid/docs/extras.html#versioning
mongoid-history is a Ruby plugin that provides a significantly more complicated solution with auditing, undo and redo
https://github.com/aq1018/mongoid-history
I worked through a solution that accommodates published, draft and historical versions of the data:
{
    published: {},
    draft: {},
    history: {
        "1": {
            metadata: <value>,
            document: {}
        },
        ...
    }
}
I explain the model further here: http://software.danielwatrous.com/representing-revision-data-in-mongodb/
For those who may implement something like this in Java, here's an example:
http://software.danielwatrous.com/using-java-to-work-with-versioned-data/
It includes all the code, which you can fork if you like:
https://github.com/dwatrous/mongodb-revision-objects
If you are using mongoose, I have found the following plugin to be a useful implementation of the JSON Patch format:
mongoose-patch-history
Another option is to use the mongoose-history plugin.
let mongoose = require('mongoose');
let mongooseHistory = require('mongoose-history');
let Schema = mongoose.Schema;

let MySchema = Post = new Schema({
    title: String,
    status: Boolean
});

MySchema.plugin(mongooseHistory);
// The plugin will automatically create a new collection with the schema name + "_history".
// In this case, a collection named "my_schema_history" will be created.
I have used the package below for a Meteor/MongoDB project, and it works well. The main advantage is that it stores history/revisions within an array in the same document, so there is no need for additional publications or middleware to access the change history. It can keep a limited number of previous versions (e.g. the last ten), and it also supports change concatenation (so all changes that happened within a specific period are covered by one revision).
nicklozon/meteor-collection-revisions
Another sound option is to use Meteor Vermongo (here)