Multiple applications accessing the same MongoDB instance

Application A opens MongoDB and reads a document. The document now exists in Application A's memory. Application B, perhaps running on a different box (but in this case a mock running in a different process on the same box), inserts data into the document. When does Application A see Application B's change?
Mongo is configured to use safe writes and a journal. There is only the one shard.
Mongo updates are written immediately to the journal, but updates to the data file can take 60 seconds (syncdelay=60). How is this supposed to work when two different applications share the same instance?

When A queries for the document again.
You should take a look at the WriteConcern you are using.
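For illustration, a minimal pymongo sketch (the connection string, database and collection names are assumptions) of requesting a journaled write concern; note that A only sees B's insert once it re-queries, whatever write concern B used:
import pymongo
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")  # assumed single instance
coll = client.appdb.get_collection(
    "docs",
    write_concern=WriteConcern(j=True),  # acknowledge only after the journal write
)

# Application B's insert is durable once this call returns...
coll.insert_one({"_id": 42, "status": "new"})

# ...but Application A only observes it when it queries again;
# whatever A read earlier is just a stale copy in A's memory.
print(coll.find_one({"_id": 42}))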

Is the update order preserved in MongoDB?

If I trigger two updates which are just 1 nanosecond apart. Is it possible that the updates could be done out-of-order? Say if the first update was more complex than the second.
I understand that MongoDB is eventually consistent, but I'm just not clear on whether or not the order of writes is preserved.
NB I am using a legacy system with an old version of MongoDB that doesn't have the newer transaction stuff
In MongoDB, write operations are atomic at the document level, since every document in a collection is independent of the others. So while one operation is writing to a document, a second operation has to wait until the first one finishes writing to that document.
From their docs:
"a write operation is atomic on the level of a single document, even if the operation modifies multiple embedded documents within a single document."
Ref: atomicity-in-mongodb
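To illustrate the single-document atomicity the quote describes, a minimal pymongo sketch (collection and field names are invented for the example); the top-level field and the embedded field change as one atomic operation, so no concurrent operation can see the document half-updated:
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017").shop.orders  # assumed names

# Both fields, including the embedded one, are updated atomically.
coll.update_one(
    {"_id": 1},
    {"$set": {"status": "shipped", "shipping.carrier": "ACME"}},
)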
So when can this be an issue? On reads, particularly if your application is read-heavy. Reads can happen during updates: if your read happens before an update finishes, your app will see the old data, and reading from a secondary can also return inconsistent data.
In general MongoDB is hosted as a replica set (a set of at least 3 servers/nodes) in which writes must be targeted to the primary, and by default reads are targeted to the primary as well. But if you override the read preference to read from secondaries in order to keep the primary free (for reporting, say), then you might see a few issues.
But why? In the background, data gets synced from the primary to the secondaries; if that sync is delayed or not done by the time your application reads, you'll see stale data, though the chances are low. All of this applies up to MongoDB version 4.0 - from 4.0 on, apps with a secondary read preference read from a WiredTiger snapshot of the data.
Ref: replica-set-data-synchronization
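A minimal pymongo sketch of the read-preference override mentioned above (replica set name, hosts, database and collection names are assumptions); reads routed to secondaries may lag the primary until replication catches up:
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"
)

primary_coll = client.appdb.events  # defaults to reading from the primary
secondary_coll = client.appdb.get_collection(
    "events", read_preference=ReadPreference.SECONDARY_PREFERRED
)

# The secondary read can return slightly stale data if replication
# has not yet applied the latest writes from the primary.
print(secondary_coll.find_one({"type": "report"}))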

data synchronization between clients

I don't know if this is the right place to ask my question, but here it is.
Inspired by Firebase, I decided to write a little framework to synchronize data between clients. That should simplify the development of web applications such as chats, forums, etc...
Let's assume that there are one or more servers. A client can connect to one server and access a particular collection (a list of chat messages, for instance). If and when the client modifies the collection, those modifications will be sent to the other clients who requested access to the same collection.
I'd like my solution to be fast and general. The propagation of the modifications should be very fast and the collections should be persisted on a DB.
The collections may be very large but the clients may request just a view of the collection (for instance, the chat messages of the last 20 minutes).
Possible solution
We have n servers, 1 node with a fast in-memory DB (Redis) and a cluster with a NoSQL DB.
The cluster will contain the full collections.
When a client connects to a server and is given access to a collection for the first time, the requested part of the collection is read directly from the cluster.
When a client modifies a collection C, the modification is written to the in-memory DB which will contain something like:
123 added "message..."
124 deleted id235
125 modified id143 "new message..."
where 123, 124 and 125 are the versions of the collection.
In this case, the cluster contains the entire collection C and its version number which is 122.
When a client first connects to a server and accesses the collection C, the server reads the requested part of the collection from the cluster and then reads the updates from the in-memory DB in order to update the collection from version 122 to version 125.
When a client modifies the collection C,
the description of the modification is inserted into the in-memory DB;
the other servers are informed that a new version of C is available;
the client is sent the update.
Of course, the other servers, once informed, will send the updates to their clients as well.
Another process in the background will update the cluster the following way:
while (the in-memory database contains less than K updates for the collection C):
    read the next update, U, from the in-memory database;
    use U to update the collection C and its version number in the cluster ATOMICALLY.
The updates must be linearizable, i.e. no server should be able to see the collection C in a state where an update has been applied before a previous update.
When the cluster is fully-consistent, we remove the updates from the in-memory DB from lowest to highest version.
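For concreteness, a minimal sketch of the "apply update U and bump the version number atomically" step; sqlite3 is used here only as a stand-in for the cluster DB, and the schema and names are invented for the example:
import sqlite3

db = sqlite3.connect("cluster_stub.db")
with db:
    db.execute("CREATE TABLE IF NOT EXISTS coll_meta (name TEXT PRIMARY KEY, version INTEGER)")
    db.execute("CREATE TABLE IF NOT EXISTS messages (id TEXT PRIMARY KEY, body TEXT)")
    db.execute("INSERT OR IGNORE INTO coll_meta VALUES ('C', 122)")  # collection C at version 122

def apply_update(update_version, msg_id, body):
    """Apply update number update_version only if C is currently at update_version - 1."""
    with db:  # one transaction: version check + write + version bump
        row = db.execute("SELECT version FROM coll_meta WHERE name = 'C'").fetchone()
        if row is None or row[0] != update_version - 1:
            return False  # already applied or out of order; skip it
        db.execute("INSERT INTO messages (id, body) VALUES (?, ?)", (msg_id, body))
        db.execute("UPDATE coll_meta SET version = ? WHERE name = 'C'", (update_version,))
        return True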
Problem
My solution requires a DB (for the cluster) which supports transactions (ACID?) and offers strong consistency. For instance, I can't use MongoDB.
Question
Can you think of a simpler solution?
or
If my solution is acceptable, what DB do you recommend for the cluster?
Thank you for your patience.
If each element of the collection is assigned a unique id and the updates in the in-memory DB include those ids, then one doesn't need a version number for the collection in the cluster and so transactions are unnecessary.
The idea is that the ids can be used to decide if an update is needed or not. For instance, if an update says
version=123 action=ADD id=05276 msg="Message"
and the collection in the cluster already contains a document with id=05276, then this update is old and was already applied to the collection in the cluster.
We only need to pay attention to this: before we remove some updates from the in-memory DB, we must make sure that those updates have been applied to the collection in the cluster and that the cluster is fully consistent wrt that collection.
When a client requests access to a collection, the server it connected to needs to:
read all the updates from the in-memory DB and save them in memory
read (the relevant part of) the collection from the cluster
update the read collection with the updates saved in memory
It's important to first read all the updates from the in-memory DB to avoid a race condition. Consider this scenario:
1. we read an old version of the collection from the cluster
2. the collection is updated
3. the cluster becomes fully consistent
4. some updates are removed from the in-memory DB
5. we update our read collection with the new updates
Problem is, in point 5 we'd miss some updates.
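A minimal pymongo sketch of the idempotent apply described above (collection and field names are assumptions); replaying an ADD whose id already exists changes nothing, so no cross-document transaction is needed:
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017").chatdb.messages  # assumed names

def apply_add(update):
    # update looks like: {"version": 123, "action": "ADD", "id": "05276", "msg": "Message"}
    # Upsert keyed by the element id: if the document already exists,
    # $setOnInsert changes nothing, so applying the same update twice is harmless.
    coll.update_one(
        {"_id": update["id"]},
        {"$setOnInsert": {"msg": update["msg"]}},
        upsert=True,
    )

def apply_delete(update):
    # Deleting an id that was already deleted is equally harmless.
    coll.delete_one({"_id": update["id"]})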

Are MongoDB indexes persistent across restarts?

Referencing the guide here, http://docs.mongodb.org/manual/core/indexes/ I cannot tell if Mongo indexes for fields are stored persistently.
If ensureIndex() is called (and completes) within an application using MongoDB, what happens if:
The application using MongoDB is restarted. Will a subsequent call to ensureIndex() cause a complete reindex?
The MongoDB server is restarted. Would a later call of ensureIndex() from a client application rebuild?
Is any of this affected by having multiple client sessions? I assume indexing is global across the entire collection per the documentation: "MongoDB defines indexes on a per-collection level."
The application using MongoDB is restarted. Will a subsequent call to ensureIndex() cause a complete reindex?
No, it should (as in every other driver) register as a no-op, since the index already exists. Some drivers provide a cache mechanism to detect, without going to the server, whether an index has already been created (e.g. the Python driver).
The MongoDB server is restarted. Would a later call of ensureIndex() from a client application rebuild?
Same as above
Is any of this affected by having multiple client sessions? I assume indexing is global across the entire collection per the documentation: "MongoDB defines indexes on a per-collection level."
Yes, indexes are stored in MongoDB on the collection itself (to be technical, as a namespace within the db.ns file). Since that is a single point of knowledge for ensureIndex, and index creation is a single process (much like the write lock, really), multiple connections should not cause the index creation to be registered twice.
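A minimal pymongo sketch (connection, collection and index names are assumptions); ensureIndex survives in modern drivers as create_index, and repeating the call for an index that already exists does not trigger a rebuild on the server:
from pymongo import MongoClient, ASCENDING

coll = MongoClient("mongodb://localhost:27017").appdb.users  # assumed names

# First call builds the index; it is persisted with the collection's data files.
coll.create_index([("email", ASCENDING)], name="email_idx")

# Calling it again - e.g. after an app or server restart - is a server-side no-op,
# because an identical index definition already exists.
coll.create_index([("email", ASCENDING)], name="email_idx")

print(coll.index_information())  # the index is still listed after restarts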

GET Consistency (and Quorum) in ElasticSearch

I am new to ElasticSearch and I am evaluating it for a project.
In ES, Replication can be sync or async. In case of async, the client is returned success as soon as the document is written to the primary shard. And then the document is pushed to other replicas asynchronously.
When written asynchronously, how do we ensure that when a GET is done, data is returned even if it has not propagated to all the replicas? Because when we do a GET in ES, the query is forwarded to one of the replicas of the appropriate shard. Provided we are writing asynchronously, the primary shard may have the document, but the selected replica for doing the GET may not have received/written the document yet. In Cassandra, we can specify consistency levels (ONE, QUORUM, ALL) at the time of writes as well as reads. Is something like that possible for reads in ES?
Right, you can set replication to be async (default is sync) to not wait for the replicas, although in practice this doesn't buy you much.
Whenever you read data you can specify the preference parameter to control where the documents are going to be taken from. If you use preference:_primary you make sure that you always take the document from the primary shard, otherwise, if the get is done before the document is available on all replicas, it might happen that you hit a shard that doesn't have it yet. Given that the get api works in real-time, it usually makes sense to keep replication sync, so that after the index operation returned you can always get back the document by id from any shard that is supposed to contain it. Still, if you try to get back a document while indexing it for the first time, well it can happen that you don't find it.
There is a write consistency parameter in Elasticsearch as well, but it is different compared to how other data stores work, and it is not related to whether replication is sync or async. With the consistency parameter you can control how many copies of the data need to be available in order for a write operation to be permissible. If not enough copies of the data are available, the write operation will fail (after waiting for up to 1 minute, an interval that you can change through the timeout parameter). This is just a preliminary check to decide whether to accept the operation or not. It doesn't mean that if the operation fails on a replica it will be rolled back. In fact, if a write operation fails on a replica but succeeds on the primary, the assumption is that there is something wrong with the replica (or the hardware it's running on), thus the shard will be marked as failed and recreated on another node. The default value for consistency is quorum, and it can also be set to one or all.
That said, when it comes to the get api, Elasticsearch is not eventually consistent, but simply consistent: once a document is indexed, you can retrieve it.
The fact that newly added documents are not available for search till the next refresh operation, which happens every second automatically by default, is not really about eventual consistency (as the documents are there and can be retrieved by id), but more about how search and lucene work and how documents are made visible through lucene.
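For the older Elasticsearch versions this question describes (where the _primary preference and per-request replication settings still exist), a minimal sketch over the REST API using Python's requests; host, index and type names are assumptions:
import requests

ES = "http://localhost:9200"  # assumed local node

# Index a document; with the (old) default of sync replication, the call
# returns only after the replicas have acknowledged the write.
requests.put(ES + "/chat/message/1", json={"text": "hello"})

# Real-time GET forced onto the primary shard, so we never hit a replica
# that has not received the document yet.
r = requests.get(ES + "/chat/message/1", params={"preference": "_primary"})
print(r.json())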
Here is the answer I gave on the mailing list:
As far as I understand the big picture, when you index a document it's written to the transaction log and then you get a successful answer from ES.
Afterwards, in an asynchronous manner, it's replicated to other nodes and indexed by Lucene.
That said, you cannot search immediately for the document, but you can GET it.
ES will read the tlog if needed when you GET a document.
I think (not sure) that if the replica is not up to date, the GET will be sent to the primary's tlog.
Correct me if I'm wrong.

how to store User logs on a large server best practice

From my experience, this is what I've come up with.
I'm currently saving Users and Statistic classes into MongoDB and everything works great.
But how do I save the log each user generates?
I was thinking of using the LogBack SiftingAppender and delegating the log information to separate MongoDB collections, so that every MongoDB collection has the id of the user.
That way I don't have to create advanced map-reduce queries, since logs are neatly stacked.
Or use SiftingAppender with a FileAppender so each user has a separate log file.
Is there a problem with this if MongoDB has one million log collections, each one named with the user id? (Is that even possible, btw?)
If everything is stored in MongoDB, the MongoDB master-slave replication makes it easy if a master node dies.
What about the FileAppender approach? Feels like there would be a whole lot of log files to administer. One could maybe save them in folders according to the alphabet: folder A for users with names/ids starting with A.
What are other options to make this work?
On your question of 1M collections: the default namespace file for a db is 16MB, which allows about 24000 namespaces (12000 collections plus their _id indexes). More info on this website
And you can set the maximum .ns (namespace) file size to 2GB with the --nssize option, which will allow roughly 3,072,000 namespaces.
Make use of embedded documents and have one document for each user with an array of embedded documents containing the log entries. You can also benefit from sharding if collections get large.
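A minimal pymongo sketch of that one-document-per-user layout (database, collection and field names are assumptions):
from datetime import datetime, timezone
from pymongo import MongoClient

logs = MongoClient("mongodb://localhost:27017").appdb.user_logs  # assumed names

def append_log(user_id, message):
    # One document per user; each log line is pushed onto an embedded array.
    # upsert=True creates the user's document the first time that user is seen.
    logs.update_one(
        {"_id": user_id},
        {"$push": {"entries": {"msg": message, "ts": datetime.now(timezone.utc)}}},
        upsert=True,
    )

append_log("user-42", "logged in")
append_log("user-42", "viewed dashboard")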