How to send MongoDB collection updates to a Spark streaming context?

How can I send data from a MongoDB collection to a Spark streaming context whenever the collection gets updated? I have seen socketTextStream used to write data to MongoDB, but is there any way to read a stream when the collection gets updated? Is there any way to avoid implementing a custom receiver for MongoDB? If anyone has already implemented one, it would be nice if they could share it.
Thank you for any input on this.

One way to achieve this is to use the CDC (change data capture) pattern: on one side, a reactive MongoDB library hooks onto a collection and receives events such as insert, update, and delete from MongoDB; on the other side, a Spark streaming application listens for these events.
The transport for the events can be any publish/subscribe system (e.g. Kafka).
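A minimal sketch of the Spark side of such a pipeline, assuming the change events arrive as JSON strings on a Kafka topic (the topic name, broker address, and group id below are illustrative, not part of the original setup):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class MongoChangeEventJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("mongo-cdc").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "mongo-cdc-consumer");
        kafkaParams.put("auto.offset.reset", "latest");

        // Each record value is assumed to be one JSON change event
        // published by the MongoDB side of the pipeline.
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("mongo-changes"), kafkaParams))
            .map(ConsumerRecord::value)
            .print();

        jssc.start();
        jssc.awaitTermination();
    }
}

This sketch assumes the spark-streaming-kafka-0-10 integration artifact is on the classpath.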

Related

How can reactive programming react to database changes?

I am new to reactive programming and therefore have some questions.
I am developing a small software.
I would like to take the opportunity to get to know reactive programming better.
So I looked at Spring's project-reactor.
I also use R2DBC to reactively access the database.
I would like to know if there is any way to react to database changes.
Or rather: if a user saves an entry in the database, then the server (for example, a RestController) should be notified.
How could I go about doing that?
I have already implemented the corresponding controllers, configuration, entities, etc.
Sorry for spelling mistakes.
Complement: The updates to the frontend are then made by Server Sent Events.
Basically, what Nick Tsitlakidis mentioned. Let me add a couple of things here.
The typical database query pattern is to query for a number of records. The database responds with its results and indicates that the query is complete once the server has sent all records to your application. If new records arrive while the query is active, or after it has completed, you do not see these changes immediately: while the query is running, because of isolation, and once it is complete, because you no longer hold a reference to the query.
The feature you're asking about is event-driven consumption of data. Databases call this feature continuous queries. Some stores (such as MongoDB with tailable cursors, or Postgres with logical decoding) come with features that allow keeping a cursor/query open so that your client can receive continuous updates.
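To make the MongoDB variant concrete, here is a rough sketch (not tied to the R2DBC setup in the question) of consuming a change stream through the reactive streams driver and wrapping it in a Reactor Flux; the connection string, database, and collection names are assumptions, and change streams require a replica set:

import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.reactivestreams.client.MongoClient;
import com.mongodb.reactivestreams.client.MongoClients;
import com.mongodb.reactivestreams.client.MongoCollection;
import org.bson.Document;
import reactor.core.publisher.Flux;

public class ContinuousQueryExample {
    public static void main(String[] args) throws InterruptedException {
        // Database and collection names are illustrative.
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> orders = client.getDatabase("shop").getCollection("orders");

        // Wrap the reactive-streams publisher in a Reactor Flux and react to each change.
        Flux.from(orders.watch())
            .subscribe((ChangeStreamDocument<Document> change) ->
                System.out.println(change.getOperationType() + ": " + change.getFullDocument()));

        Thread.sleep(60_000); // keep the demo alive while changes stream in
    }
}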
Kafka and JMS also follow the idea of sending messages that are typically consumed by listeners, or even through a reactive stream.
So it all boils down to the technology that you're using.
My understanding is that Reactor can't solve this problem for you on its own. If you want your application to respond (react) to some database change, then you need to identify who is making the change and implement some kind of integration there.
For example, if Service1 updates the database and Service2 needs to respond, then Service1 can either call Service2 directly, or you can emit an event from Service1 and listen for that event in Service2.
The first approach is simpler and easier to implement, but it has the disadvantage that it couples the two services. The second is trickier to implement, but the services stay decoupled.
Reactor can help you in both cases:
For events, Reactor can give you a way to listen to them, for example using the reactor-rabbitmq or reactor-kafka modules (see the sketch after this list).
For service-to-service calls, Reactor can help you if you use Spring WebFlux.
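For the event-listening case, a minimal reactor-kafka sketch could look like this; the topic name "entry-saved" and the consumer settings are illustrative:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import reactor.kafka.receiver.KafkaReceiver;
import reactor.kafka.receiver.ReceiverOptions;

public class Service2EventListener {
    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "service2");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        ReceiverOptions<String, String> options = ReceiverOptions.<String, String>create(props)
                .subscription(Collections.singleton("entry-saved"));

        // Each record represents an event emitted by Service1;
        // here we just log it and acknowledge the offset.
        KafkaReceiver.create(options)
                .receive()
                .doOnNext(record -> {
                    System.out.println("Received event: " + record.value());
                    record.receiverOffset().acknowledge();
                })
                .blockLast(); // blocking only for this standalone demo
    }
}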
Perhaps you can tell us more about your case so we can provide a more specific solution?

MongoDB/Spring: Subscribing to collection changes

I'm working with a Spring Boot app. I'm trying to implement callback-based event notification for collection modifications in MongoDB. I'm running out of ideas, as I have tried the following:
Classic Polling - Redundant, as the existing implementation is a REST endpoint that's polled by the UI, where it queries data.
Tailable Cursors - Require the collection to be capped, a limitation that likely won't work for a database with a very high storage forecast.
Change Streams - I got a runtime exception stating that the storage engine doesn't support 'Majority Read Concern'.
collection.watch(asList(Aggregates.match(Filters.in("operationType", asList("insert", "update"))))).forEach(printBlock);
I'm not authorized to view the engine configuration, but I'm assuming that if the DBA can't change the storage engine to WiredTiger, then I can't use change streams. Is this correct? Are there other solutions? How about Spring's reactive MongoDB API? I was under the impression that it still depends on tailable cursors or change streams.
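On the last point: as far as I know, Spring Data's reactive MongoDB support does indeed build on the same two mechanisms, tailable cursors and change streams, so it inherits the same constraints. A sketch of its tailable-cursor route (entity and repository names here are illustrative) still requires a capped collection:

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.repository.ReactiveMongoRepository;
import org.springframework.data.mongodb.repository.Tailable;
import reactor.core.publisher.Flux;

// Illustrative document; the collection it maps to must be capped for @Tailable to work.
@Document("events")
class AppEvent {
    @Id
    private String id;
    private String payload;
    // getters/setters omitted for brevity
}

interface AppEventRepository extends ReactiveMongoRepository<AppEvent, String> {

    // Keeps the cursor open and emits newly inserted documents as they arrive;
    // this is subject to the same capped-collection limitation noted above.
    @Tailable
    Flux<AppEvent> findWithTailableCursorBy();
}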

Read Directly from the Event Store Or Implement a Copy of the Events in the Read Side

I'm working on a project where I have implemented CQRS without event sourcing. I am using Mongo for both the read database and the write database. When something has to be changed, it is first changed in the write DB and then the read DB is synchronized.
Later I introduced something like an Event Store, also a MongoDB instance. I keep a history of all the events that changed the other databases in some way. The read DB does not synchronize with the Event Store, so I have no way to read the events.
I've reached a situation where I need the information that's inside the events in the Event Store. Should I connect directly to the Event Store and read from there, or should I make the read DB synchronize with the Event Store and basically hold a duplicate of it?
Thank you in advance guys! I am using C# .NET Core if someone needs this kind of info.
There's no functional reason why you shouldn't just read events directly from the event store.
The motivation for read models is usually low latency queries; we take the representation of data in the book of record (the event streams), and reshape it into a data structure that can answer queries. We accept the consequences of eventual consistency to get fast response times.
But if the shape we need for a query is an event stream, then we can just use the source data.
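For illustration, if the event store is a MongoDB collection, reading one aggregate's stream directly is just a filtered, ordered query. This sketch is in Java rather than the asker's C#, and the collection and field names ("events", "aggregateId", "sequence") are assumptions:

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Sorts.ascending;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class EventStoreReader {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("eventstore").getCollection("events");

            // Read one aggregate's events in the order they were written.
            events.find(eq("aggregateId", "order-42"))
                  .sort(ascending("sequence"))
                  .forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}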
If the queries of the event store are having a negative performance impact on the writes to the store, then we might redesign the system to direct queries to a cached copy of the events instead.

Pushing data from database to UI in realtime

I have a database (MySQL) to which data is being written. I need to push new records and changed records to the UI. A few constraints here: I do not have control over the code which writes to this database, and I cannot modify it to write to a queue.
So far, I am reading the DB periodically for changes and new additions (using a last-update timestamp) and pushing that data to a MongoDB instance (as I do not want to hit the main MySQL server for every request). Then I push these changes to the frontend using Cramp (a Ruby framework) and server-sent events. To maintain a per-user queue, I have Redis in the mix.
I realize that this is a convoluted way of doing realtime push. I was wondering if there is a neater solution to this mess.
If you want to push data from the server in real time, then make use of technologies that provide real-time access. I would recommend using WebSockets.
The only issue is that WebSockets are not supported by all browsers. To take care of this, you can use frameworks built on top of WebSockets that provide fallbacks to protocols the browser does support, such as long polling, streaming, etc. The following are the frameworks I would suggest (a minimal raw WebSocket endpoint sketch follows the list):
Atmosphere framework - https://github.com/Atmosphere/atmosphere
Play framework! - http://www.playframework.org/
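For reference, a bare JSR-356 (javax.websocket) server endpoint that pushes updates to every connected browser could look like the sketch below; the endpoint path and the broadcast hook are illustrative, and it has none of the fallback handling that Atmosphere or Play add:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.websocket.OnClose;
import javax.websocket.OnOpen;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;

// Clients connect to ws://host/updates; whenever the polling job detects a new
// or changed record, it calls broadcast() with the serialized row.
@ServerEndpoint("/updates")
public class UpdatesEndpoint {

    private static final Set<Session> SESSIONS = ConcurrentHashMap.newKeySet();

    @OnOpen
    public void onOpen(Session session) {
        SESSIONS.add(session);
    }

    @OnClose
    public void onClose(Session session) {
        SESSIONS.remove(session);
    }

    public static void broadcast(String json) {
        for (Session session : SESSIONS) {
            session.getAsyncRemote().sendText(json);
        }
    }
}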

CQRS & ElasticSearch - using ElasticSearch to build read model

Does anyone use ElasticSearch for building the read model in a CQRS approach? I have some questions related to such a solution:
Where do you store your domain events? In a JDBC database? In ElasticSearch?
Do you build the indexes with event handlers that process domain events, or by using the ElasticSearch River functionality?
How do you handle a complete rebuild of the view model, for example when a view is corrupted? Do you process all events to rebuild the view?
Where the authoritative repository for your domain events is located is an implementation detail. For example, you could store serialized versions on S3 or in CouchDB or any number of other storage implementations. The easiest, if you're just getting started, is a relational database.
Typically people use event handlers that understand the business intent behind each message and can then properly denormalize the message into the read model structure appropriate for the needs of your views.
If the view model is ever corrupted or perhaps you have a bug in a view model handler, there are a few simple steps to follow after fixing the bug:
Temporarily enqueue the flow of events arriving from the domain; these are the typical messages being published as your domain does its work. We still want these messages, just not yet. This could be done by turning off any message bus, or by not connecting to your queuing infrastructure if you use one.
Read all events from event storage. As each event is received (this can be done through a simple DB query), run each message through the appropriate message handler. Make sure that you keep track of the last 10,000 (or so) identifiers for all messages processed.
Now reconnect to your queues and start processing normally. If the identifier for the message has been seen, drop the message. Otherwise, process it normally.
The reason for tracking identifiers is to avoid a race condition where you're getting all events from the event store but the same message is coming across through the message queue.
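A rough sketch of this replay-and-deduplicate idea (not the original author's code; the Event and EventHandler types are placeholders):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class ReadModelRebuilder {

    interface Event { UUID id(); }
    interface EventHandler { void handle(Event event); }

    private static final int TRACKED = 10_000;
    private final Deque<UUID> recentIds = new ArrayDeque<>();
    private final Set<UUID> recentIdSet = new HashSet<>();

    private void remember(UUID id) {
        recentIds.addLast(id);
        recentIdSet.add(id);
        if (recentIds.size() > TRACKED) {
            recentIdSet.remove(recentIds.removeFirst());
        }
    }

    // Step 2: replay every stored event through its handler, tracking identifiers.
    void replay(Iterable<Event> storedEvents, EventHandler handler) {
        for (Event e : storedEvents) {
            handler.handle(e);
            remember(e.id());
        }
    }

    // Step 3: back on the live queue, drop anything already applied during replay.
    void onLiveMessage(Event event, EventHandler handler) {
        if (recentIdSet.contains(event.id())) {
            return; // duplicate of an event seen during the replay
        }
        handler.handle(event);
        remember(event.id());
    }
}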
Another technique that's highly related, but involves keeping track of all message identifiers can be found here: http://blog.jonathanoliver.com/2011/03/removing-2pc-two-phase-commit.html