Data synchronization between clients - MongoDB

I don't know if this is the right place to ask my question, but here it is.
Inspired by Firebase, I decided to write a little framework to synchronize data between clients. That should simplify the development of web applications such as chats, forums, etc...
Let's assume that there are one or more servers. A client can connect to one server and access a particular collection (a list of chat messages, for instance). If and when the client modifies the collection, those modifications will be sent to the other clients who requested access to the same collection.
I'd like my solution to be fast and general. The propagation of the modifications should be very fast, and the collections should be persisted in a DB.
The collections may be very large but the clients may request just a view of the collection (for instance, the chat messages of the last 20 minutes).
Possible solution
We have n servers, one node with a fast in-memory DB (Redis), and a cluster with a NoSQL DB.
The cluster will contain the full collections.
When a client connects to a server and is given access to a collection for the first time, the requested part of the collection is read directly from the cluster.
When a client modifies a collection C, the modification is written to the in-memory DB which will contain something like:
    123 added "message..."
    124 deleted id235
    125 modified id143 "new message..."
where 123, 124 and 125 are the versions of the collection.
In this case, the cluster contains the entire collection C and its version number which is 122.
When a client first connects to a server and accesses the collection C, the server reads the requested part of the collection from the cluster and then reads the updates from the in-memory DB in order to update the collection from version 122 to version 125.
When a client modifies the collection C:
1. the description of the modification is inserted into the in-memory DB;
2. the other servers are informed that a new version of C is available;
3. the client is sent the update.
Of course, the other servers, once informed, will send the updates to their clients as well.
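Here is a minimal sketch of this write path, assuming Node.js with the ioredis client; the key names (chatC:version, chatC:updates, chatC:events) and the update format are illustrative, not prescribed:

    import Redis from "ioredis";

    const redis = new Redis();

    // Record a client's modification of a collection and notify the other servers.
    async function applyClientUpdate(collection: string, update: object): Promise<number> {
      // Atomically assign the next version number for this collection.
      const version = await redis.incr(`${collection}:version`);
      // Store the update in a sorted set scored by version, so catching-up
      // servers can fetch all updates above a given version, in order.
      await redis.zadd(`${collection}:updates`, version, JSON.stringify({ ...update, version }));
      // Tell the other servers that a new version of the collection exists.
      await redis.publish(`${collection}:events`, String(version));
      return version;
    }

    // Example: a client adds a chat message to collection C.
    applyClientUpdate("chatC", { action: "ADD", id: "id512", msg: "message..." });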
Another process in the background will update the cluster in the following way:

    while (the in-memory database contains more than K updates for the collection C)
        read the next update, U, from the in-memory database;
        use U to update the collection C and its version number in the cluster ATOMICALLY.
The updates must be linearizable, i.e. no server should be able to see the collection C in a state where an update has been applied before a previous update.
When the cluster is fully consistent, we remove the updates from the in-memory DB, from lowest to highest version.
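A sketch of that background flusher, again with ioredis; cluster.applyUpdateAtomically stands in for whatever atomic update-plus-version-number write the chosen cluster DB offers (a hypothetical interface, not a real API):

    import Redis from "ioredis";

    const redis = new Redis();
    const K = 100; // illustrative threshold for how large the update log may grow

    // Hypothetical cluster client: applies one update and bumps the collection's
    // version number in a single atomic (linearizable) operation.
    declare const cluster: {
      applyUpdateAtomically(collection: string, update: any): Promise<void>;
    };

    async function flushCollection(collection: string): Promise<void> {
      while ((await redis.zcard(`${collection}:updates`)) > K) {
        // Read the oldest pending update (lowest version first).
        const [raw] = await redis.zrange(`${collection}:updates`, 0, 0);
        const update = JSON.parse(raw);
        await cluster.applyUpdateAtomically(collection, update);
        // Only after the cluster is fully consistent may the update be dropped.
        await redis.zremrangebyscore(`${collection}:updates`, update.version, update.version);
      }
    }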
Problem
My solution requires a DB (for the cluster) that supports transactions (ACID?) and offers strong consistency. For instance, I can't use MongoDB.
Question
Can you think of a simpler solution?
or
If my solution is acceptable, what DB do you recommend for the cluster?
Thank you for your patience.

If each element of the collection is assigned a unique id and the updates in the in-memory DB include those ids, then one doesn't need a version number for the collection in the cluster and so transactions are unnecessary.
The idea is that the ids can be used to decide if an update is needed or not. For instance, if an update says
    version=123 action=ADD id=05276 msg="Message"
and the collection in the cluster already contains a document with id=05276, then this update is old and was already applied to the collection in the cluster.
We only need to pay attention to this: before we remove some updates from the in-memory DB, we must make sure that those updates have been applied to the collection in the cluster and that the cluster is fully consistent with respect to that collection.
When a client requests access to a collection, the server it connected to needs to:
1. read all the updates from the in-memory DB and save them in memory;
2. read (the relevant part of) the collection from the cluster;
3. update the read collection with the updates saved in memory.
It's important to first read all the updates from the in-memory DB to avoid a race condition. Consider this scenario:
1. we read an old version of the collection from the cluster;
2. the collection is updated;
3. the cluster becomes fully consistent;
4. some updates are removed from the in-memory DB;
5. we update our read collection with the new updates.
The problem is that in point 5 we'd miss the updates removed in point 4: they were applied to the cluster only after our read in point 1, and they are no longer in the in-memory DB.
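A sketch of that read order in the id-based scheme (ioredis again; readSnapshotFromCluster is a hypothetical call into the cluster DB):

    import Redis from "ioredis";

    const redis = new Redis();

    // Hypothetical: reads (the relevant part of) the collection from the cluster,
    // keyed by element id.
    declare function readSnapshotFromCluster(collection: string): Promise<Map<string, any>>;

    async function openCollection(collection: string): Promise<Map<string, any>> {
      // 1. FIRST read all pending updates from the in-memory DB and keep them aside.
      const pending = (await redis.zrange(`${collection}:updates`, 0, -1))
        .map(s => JSON.parse(s));
      // 2. THEN read the collection from the cluster.
      const docs = await readSnapshotFromCluster(collection);
      // 3. Replay the saved updates; the element ids tell us whether an update
      //    was already applied to the snapshot we read, so replaying is harmless.
      for (const u of pending) {
        if (u.action === "ADD" && !docs.has(u.id)) docs.set(u.id, u.msg);
        else if (u.action === "MODIFY") docs.set(u.id, u.msg);
        else if (u.action === "DELETE") docs.delete(u.id);
      }
      return docs;
    }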

Related

MongoDB multiple collections or multiple databases

We are using .NET Core and Node.js microservices, some of them with MongoDB.
Currently we have the following DB structure:
Every customer gets their own database.
So if we have a microservice for invoices, every new customer adds one new DB for that microservice:
Invoice_customerA
Invoice_customerB
etc...
while the collections in each such DB remain the same (usually we have 1-3 collections in each DB).
In terms of logic, we choose the right DB from the request input at runtime.
I am now thinking about changing this and separating by collection instead:
taking the same example from before, this time the invoice service will have only one DB,
Invoice_allCustomers
and there will be one new collection per customer in it (or more, if the service had more collections):
collection_customerA
collection_customerB
What I am trying to understand is whether there is any difference performance-wise, or is it mostly a "cosmetic" change? Or maybe there are some other considerations?
P.S.
If the change is mostly cosmetic, I think the new solution is better for us, since we usually have only 1-2 collections per microservice, and it will be easier to navigate when there are significantly fewer databases.
As far as I know, in microservices each service should have its own database. If it is not a different service, then you can use one database with different collections in it. It is mostly a cosmetic change, but I should also warn you that MongoDB still has its limits, which you can find in the MongoDB Limits and Thresholds documentation. It really depends on the amount of data that will be stored and retrieved.
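In driver code the difference between the two layouts is small; a sketch with the Node.js MongoDB driver, using the names from the question (the fixed collection name "invoices" is an assumption):

    import { MongoClient, Collection } from "mongodb";

    const client = new MongoClient("mongodb://localhost:27017");

    // Current layout: one database per customer, fixed collection names.
    function invoicesPerCustomerDb(customerId: string): Collection {
      return client.db(`Invoice_${customerId}`).collection("invoices");
    }

    // Proposed layout: one shared database, one collection per customer.
    function invoicesPerCustomerCollection(customerId: string): Collection {
      return client.db("Invoice_allCustomers").collection(`collection_${customerId}`);
    }

Either way each customer ends up as one namespace (database.collection) on the server, which is why the change is largely cosmetic until you approach the limits mentioned above.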

Multiple applications accessing the same mongodb instance

Application A opens mongodb and reads a document. The document now exists in Application A's memory. Application B, perhaps running on a different box (but in this case as a mock running in a different process on the same box) inserts data into the document. When does Application A see application B's change?
Mongo is configured to use safe writes and a journal. There is only the one shard.
Mongo updates are written immediately to the journal, but updates to the data file can take 60 seconds (syncdelay=60). How is this supposed to work when two different applications share the same instance?
When A queries for the document again. Reads go through the server's current in-memory view of the data, so B's write is visible to A as soon as the server has applied it; the journal and the 60-second syncdelay affect durability, not visibility.
You should take a look at the WriteConcern you are using.
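For example, with the Node.js driver (other drivers expose the same options); a sketch in which the collection and field names are made up:

    import { MongoClient } from "mongodb";

    const client = new MongoClient("mongodb://localhost:27017");
    const docs = client.db("shared").collection("docs");

    // Application B: ask the server to journal the write before acknowledging it,
    // so a crash cannot lose it (this controls durability, not visibility).
    await docs.updateOne(
      { name: "shared-doc" },
      { $set: { status: "updated-by-B" } },
      { writeConcern: { w: 1, j: true } }
    );

    // Application A: sees B's change simply by querying again.
    const fresh = await docs.findOne({ name: "shared-doc" });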

Can MongoDB be a consistent event store?

When storing events in an event store, the order in which the events are stored is very important, especially when projecting the events later to restore an entity's current state.
MongoDB seems to be a good choice for persisting the event store, given its speed and flexible schema (and it is often recommended as such), but there is no such thing as a transaction in MongoDB, meaning the correct event order cannot be guaranteed.
Given that fact, should you avoid MongoDB if you are looking for a consistent event store and rather stick with a conventional RDBMS, or is there a way around this problem?
I'm not familiar with the term "event store" as you are using it, but I can address some of the issues in your question. I believe it is probably reasonable to use MongoDB for what you want, with a little bit of care.
In MongoDB, each document has an _id field which is by default in ObjectId format, which consists of a timestamp, then a machine identifier, and then a sequence counter. So you can sort on that field and you'll get your objects in their creation order, provided the ObjectIds are all created on the same machine.
Most MongoDB client drivers create the _id field locally before sending an insert command to the database. So if you have multiple clients connecting to the database, sorting by _id won't reliably give you insertion order: the leading timestamp only has one-second resolution, and within the same second documents from different clients sort by machine identifier rather than by arrival order.
But if you can convince your MongoDB client driver to not include the _id in the insert command, then the server will generate the ObjectId for each document and they will have the properties you want. Doing this will depend on what language you're working in since each language has its own client driver. Read the driver docs carefully or dive into their source code -- they're all open source. Most drivers also include a way to send a raw command to the server. So if you construct an insert command by hand this will certainly allow you to do what you want.
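For instance, with the Node.js driver you can issue the insert as a raw command so the documents reach the server without an _id and mongod generates the ObjectIds itself (a sketch; error handling omitted, and the database and collection names are made up):

    import { MongoClient } from "mongodb";

    const client = new MongoClient("mongodb://localhost:27017");
    const db = client.db("eventstore");

    // Bypass the driver's automatic client-side _id generation by sending
    // the insert command by hand: the server adds the _id field.
    await db.command({
      insert: "events",
      documents: [{ type: "OrderPlaced", payload: { orderId: 42 } }],
    });

    // Read events back in server-side creation order.
    const events = await db.collection("events").find().sort({ _id: 1 }).toArray();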
This will break down if your system is so massive that a single database server can't handle all of your write traffic. The MongoDB solution to needing to write thousands of records per second is to set up a sharded database. In this case the ObjectIds will again be created by different machines and won't have the nice sorting property you want. If you're concerned about outgrowing a single server for writes, you should look to another technology that provides distributed sequence numbers.

Replicate only documents where {'public':true} in MongoDB

I have the following network/mongodb setup:
1 primary MongoDB database (10.0.0.1, not accessible from the Internet) - contains private info in collection A, and a collection B with documents created by a trusted user. At any point in time, a user can mark any document in collection B as 'public', which changes its property from {'public':false} to {'public':true}.
1 public MongoDB database (10.0.0.2, runs a webserver accessible from the Internet via a reverse proxy) - does not contain collection A, but should contain all documents marked as 'public' from collection B. This machine will serve those public documents to users outside the network.
How would I set up mongodb so that when a document in the primary database (10.0.0.1) is updated with {'public':true}, it gets replicated to the public mongodb database (10.0.0.2)?
Other details:
I'm using the PHP driver
The documents are small, max 2KB
The load on these servers will probably never exceed 10 concurrent users
Eventual consistency is ok, up to a few minutes, but I'd like to know what my options are.
So, just to reiterate, here's a use case:
John VPNs into our private network, opens http://10.0.0.1/, creates a document (call it D2), and marks it as private. John then views an older document, D1, and decides to make it public by clicking the 'Make public' button. The server automagically makes the document available on the public server example.com (public IP x.y.z.w, internal IP 10.0.0.2).
John sends an e-mail to Sarah and asks her to read document D1 (the one that was made public). Sarah goes to http://example.com and is able to read D1, but never sees D2.
My goal is to achieve this without having to manually write scripts to synchronize those two databases. I suspect it should be possible, but I can't figure it out from what I've read about MongoDB replication.
I welcome any advice.
Thank you!
MongoDB (as of 2.0.6) does not have support for filtered replication.
However ... it may be possible for you to implement your own scheme to update records based on a tailable cursor of MongoDB's oplog. The local oplog.rs capped collection is the same mechanism used to relay changes to members of a replica set and includes details for inserts, deletes, and updates.
For an example of this technique, see this blog post: Creating Triggers for MongoDB.
In your case the actions would be something like:
copy a document to the public database when it is inserted into collection B or updated with public:true;
remove a document from the public database when it is deleted from collection B, or updated with public:false.
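A sketch of that trigger-like technique with the Node.js driver, tailing the oplog on the primary (this assumes a replica set, since a standalone mongod has no oplog; mydb, pushToPublic and removeFromPublic are hypothetical names):

    import { MongoClient } from "mongodb";

    const primary = new MongoClient("mongodb://10.0.0.1:27017");
    const oplog = primary.db("local").collection("oplog.rs");

    // Hypothetical helpers that talk to the public server (10.0.0.2).
    declare function pushToPublic(doc: any): Promise<void>;
    declare function removeFromPublic(id: any): Promise<void>;

    // Tail the capped oplog collection, watching collection B only.
    const cursor = oplog.find({ ns: "mydb.B" }, { tailable: true, awaitData: true });

    for await (const entry of cursor) {
      if (entry.op === "d") {
        // Delete: mirror the removal on the public server.
        await removeFromPublic(entry.o._id);
      } else if (entry.op === "i" || entry.op === "u") {
        // Insert or update: re-read the document and sync it according to its flag.
        const id = entry.op === "i" ? entry.o._id : entry.o2._id;
        const doc = await primary.db("mydb").collection("B").findOne({ _id: id });
        if (doc?.public) await pushToPublic(doc);
        else if (doc) await removeFromPublic(doc._id);
      }
    }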

MongoDB with redis

Can anyone give example use cases of when you would benefit from using Redis and MongoDB in conjunction with each other?
Redis and MongoDB can be used together with good results. A company well known for running MongoDB and Redis (along with MySQL and Sphinx) is Craigslist. See this presentation from Jeremy Zawodny.
MongoDB is interesting for persistent, document-oriented data indexed in various ways. Redis is more interesting for volatile data, or latency-sensitive semi-persistent data.
Here are a few examples of concrete usage of Redis on top of MongoDB.
Pre-2.2 MongoDB does not yet have an expiration mechanism, and capped collections cannot really be used to implement a real TTL. Redis has a TTL-based expiration mechanism, making it convenient to store volatile data. For instance, user sessions are commonly stored in Redis, while user data will be stored and indexed in MongoDB. Note that MongoDB 2.2 has introduced a low-accuracy expiration mechanism at the collection level (to be used for purging data, for instance).
Redis provides a convenient set datatype and its associated operations (union, intersection, difference on multiple sets, etc.). It is quite easy to implement a basic faceted search or tagging engine on top of this feature, which is an interesting addition to MongoDB's more traditional indexing capabilities.
Redis supports efficient blocking pop operations on lists. This can be used to implement an ad-hoc distributed queuing system. It is more flexible than MongoDB tailable cursors IMO, since a backend application can listen to several queues with a timeout, transfer items to another queue atomically, etc ... If the application requires some queuing, it makes sense to store the queue in Redis, and keep the persistent functional data in MongoDB.
Redis also offers a pub/sub mechanism. In a distributed application, an event propagation system may be useful. This is again an excellent use case for Redis, while the persistent data are kept in MongoDB.
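A compact sketch of these four patterns with the ioredis client (all key and channel names are illustrative):

    import Redis from "ioredis";

    const redis = new Redis();
    const subscriber = new Redis(); // a subscribing connection must be dedicated

    // Volatile data with a TTL: a session that expires after 30 minutes,
    // while the durable user profile lives in MongoDB.
    await redis.set("session:abc123", JSON.stringify({ userId: 42 }), "EX", 1800);

    // Sets for tagging / basic faceted search: items tagged both "redis" and "db".
    await redis.sadd("tag:redis", "item1", "item2");
    await redis.sadd("tag:db", "item2", "item3");
    const both = await redis.sinter("tag:redis", "tag:db"); // ["item2"]

    // An ad-hoc queue: producers LPUSH jobs, a worker blocks up to 5 seconds for one.
    await redis.lpush("jobs", JSON.stringify({ task: "resize-image" }));
    const job = await redis.brpop("jobs", 5); // [queueName, payload] or null

    // Pub/sub event propagation between application nodes.
    await subscriber.subscribe("events");
    subscriber.on("message", (channel, message) => console.log(channel, message));
    await redis.publish("events", "collection C changed");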
Because it is much easier to design a data model with MongoDB than with Redis (Redis is more low-level), it is interesting to benefit from the flexibility of MongoDB for main persistent data, and from the extra features provided by Redis (low latency, item expiration, queues, pub/sub, atomic blocks, etc ...). It is indeed a good combination.
Please note you should never run a Redis and MongoDB server on the same machine. MongoDB memory is designed to be swapped out, Redis is not. If MongoDB triggers some swapping activity, the performance of Redis will be catastrophic. They should be isolated on different nodes.
Obviously there are far more differences than this, but for an extremely high-level overview:
For use-cases:
Redis is often used as a caching layer or shared whiteboard for distributed computation.
MongoDB is often used as a replacement for traditional SQL databases.
Technically:
Redis is an in-memory db with disk persistence (the whole db needs to fit in RAM).
MongoDB is a disk-backed db which only needs enough RAM for the indexes.
There is some overlap, but it is extremely common to use both. Here's why:
MongoDB can store more data cheaper.
Redis is faster, as long as the entire dataset fits in RAM.
MongoDB's culture is "store it all, figure out access patterns later"
Redis's culture is "carefully consider how you'll access data, then store"
Both have open source tools that depend on them, many of which are used together.
Redis can be used as a replacement for a traditional datastore, but it's most often used alongside another, more conventional "long-term" data store, like Mongo, PostgreSQL, MySQL, etc.
Redis works excellently with MongoDB as a caching server. Here is what happens.
Any time Mongoose issues a cacheable query, it first goes to the cache server, which checks whether that exact query has ever been issued before.
If it hasn't, the cache server forwards the query to MongoDB, and Mongo executes it. The result of that query then goes back to the cache server, which stores it; in this way the cache server maintains a record mapping the queries that have been issued to the responses that came back for them. Finally, the cache server sends the response back to Mongoose, Mongoose gives it to Express, and it eventually ends up inside the application.
Any time the exact same query is issued again, Mongoose sends it to the cache server, but since the cache server has seen this query before, it does not forward it to MongoDB; instead it takes the response it got last time and immediately sends it back to Mongoose. There are no indexes involved and no collection scans, just a simple lookup: has this query been executed before? If yes, return the stored response immediately and send nothing to Mongo.
Concretely, we have the Mongoose application, the cache server (Redis), and MongoDB. On the cache server there is a key-value data store in which every key is some query issued before and the value is the result of that query.
Suppose we are looking up blog posts by _id, so the keys are the _ids of the records we have looked up before. Imagine Mongoose issues a new query trying to find a blog post with an _id of 123. The query flows into the cache server, which checks whether it has a result for any query looking for an _id of 123. If it does not, the query is passed on to the MongoDB instance; Mongo executes it and sends the response back. The result goes to the cache server, which immediately forwards it to Mongoose so we get as fast a response as possible, and right after that stores the query it was given alongside the result it produced.
If the same query is issued again in the future, it hits the cache server, which looks at its keys, sees that it has already fetched that blog post, and sends the stored result directly to Mongoose without reaching out to Mongo at all. No complex query logic, no indexes, nothing like that; it is a simple key-value lookup, as fast as possible.
That's an overview of how the cache server (Redis) works with MongoDB.
Now there are other concerns: are we caching data forever, and how do we update records? We don't want to keep reading stale data from the cache indefinitely.
The cache server is not used for any write actions; the cache layer is only used for reading data. Writes always go to the MongoDB instance, and any time we write data we need to ensure that we clear any data stored on the cache server that is related to the record we just updated in Mongo.
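A minimal read-through sketch of this flow, assuming Mongoose and ioredis; hashQuery and the 5-minute TTL are illustrative choices, and real invalidation is usually coarser (e.g. dropping every key belonging to a collection) than deleting one exact key:

    import Redis from "ioredis";
    import { createHash } from "crypto";
    import { Model } from "mongoose";

    const redis = new Redis();

    // One stable cache key per (collection, query) pair.
    function hashQuery(collection: string, query: object): string {
      return createHash("sha1").update(collection + JSON.stringify(query)).digest("hex");
    }

    async function cachedFind(model: Model<any>, query: object) {
      const key = hashQuery(model.collection.name, query);
      const hit = await redis.get(key);
      if (hit) return JSON.parse(hit);             // cache hit: Mongo is never touched
      const docs = await model.find(query).lean(); // cache miss: run the real query
      await redis.set(key, JSON.stringify(docs), "EX", 300); // remember it for 5 minutes
      return docs;
    }

    // Writes never go through the cache; they go to Mongo and then invalidate.
    async function updateAndInvalidate(model: Model<any>, filter: object, update: object) {
      await model.updateOne(filter, update);
      await redis.del(hashQuery(model.collection.name, filter)); // simplistic invalidation
    }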