GET Consistency (and Quorum) in ElasticSearch

GET Consistency (and Quorum) in ElasticSearch - nosql

I am new to ElasticSearch and I am evaluating it for a project.
In ES, Replication can be sync or async. In case of async, the client is returned success as soon as the document is written to the primary shard. And then the document is pushed to other replicas asynchronously.
When written asynchronously, how do we ensure that when GET is done, data is returned even if it has not propagated to all the replicas. Because when we do a GET in ES, the query is forwarded to one of the replicas of the appropriate shard. Provided we are writing asynchronously, the primary shard may have the document but the selected replica for doingthe GET may not have received/written the document yet. In Cassandra, we can specify consistency levels (ONE, QUORUM, ALL) at the time of writes as well as reads. Is something like that possible for reads in ES?

Right, you can set replication to be async (default is sync) to not wait for the replicas, although in practice this doesn't buy you much.
Whenever you read data you can specify the preference parameter to control where the documents are going to be taken from. If you use preference:_primary you make sure that you always take the document from the primary shard, otherwise, if the get is done before the document is available on all replicas, it might happen that you hit a shard that doesn't have it yet. Given that the get api works in real-time, it usually makes sense to keep replication sync, so that after the index operation returned you can always get back the document by id from any shard that is supposed to contain it. Still, if you try to get back a document while indexing it for the first time, well it can happen that you don't find it.
There is a write consistency parameter in elasticsearch as well, but it is different compared to how other data storages work, and it is not related to whether replication is sync or async. With the consistency parameter you can control how many copies of the data need to be available in order for a write operation to be permissible. If not enough copies of the data are available the write operation will fail (after waiting for up to 1 minute, interval that you can change through the timeout parameter). This is just a preliminary check to decide whether to accept the operation or not. It doesn't mean that if the operation fails on a replica it will be rollbacked. In fact, if a write operation fails on a replica but succeeds on a primary, the assumption is that there is something wrong with the replica (or the hardward it's running on), thus the shard will be marked as failed and recreated on another node. Default value for consistency is quorum, and can also be set to one or all.
That said, when it comes to the get api, elasticsearch is not eventually consistent, but just consistent as once a document is indexed you can retrieve it.
The fact that newly added documents are not available for search till the next refresh operation, which happens every second automatically by default, is not really about eventual consistency (as the documents are there and can be retrieved by id), but more about how search and lucene work and how documents are made visible through lucene.

Here is the answer I gave on the mailing list:
As far as I understand the big picture, when you index a document it's written in the transaction log and then you get a succesful answer from ES.
After, in an asynchronous manner, it's replicated on other nodes and indexed by Lucene.
That said, you can not search immediatly for the document, but you can GET it.
ES will read the tlog if needed when you GET a document.
I think (not sure) that if the replica is not up to date, the GET will be sent on the primary tlog.
Correct me if I'm wrong.

Related

What Issues could arise, when reading/writing to a single shard in a sharded MongoDB?

In the offical MongoDB Documentation it is stated that:
"Clients should never connect to a single shard in order to perform read or write operations." Source
I did try writing some data to a single shard node and it worked, but it is not a big dataset (certainly not one that would require sharding).
Could this lead to other issues, which I am not yet aware of?
Or are clients discouraged from doing this, simply because of performance reasons?

In a sharded cluster, the individual shards contain the data but some of the operation state is stored on mongos nodes. When you communicate with individual shards you bypass mongoses and as such any state management that they would perform.
Depending on the operation, this could be innocuous or could result in unexpected behavior/wrong results.
There isn't a comprehensive list of issues that would arise because the scenario of direct shard access isn't tested (because it's not supported).

Problem is, by default you don't know which data portion is stored on which shard. The balancer is running in background and may change the data distribution at any time.
If you insert some data manually into one shard, then a properly connected client may not find this data.
If you read some data from a single shard then you don't know whether this data is complete or not.

Is natural order in MongoDB stable?

MongoDB documentation https://docs.mongodb.com/manual/reference/glossary/#term-natural-order defines natural order as
The order in which the database refers to documents on disk. This is the default sort order.
Also, here https://docs.mongodb.com/manual/reference/method/cursor.sort/#return-natural-order it says the following:
This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
So, can the natural ordering change under the following circumstances?
Normal operation: compactions, replication, backup/restore, node bootstrap from healthy replicas
Storage engine upgrade
Something else?
I mean, if a have a query that does not specify an ordering (or requests $natural ordering), is it possible for an existing collection, for the data that remains untouched, to be returned in a different order by that query after some time? Are there any guarantees?

Are there any guarantees?
None whatsoever.
is it possible for an existing collection, for the data that remains untouched, to be returned in a different order by that query after some time?
Yes. This possibility is allowed by the published documentation. It's unlikely for practical reasons but if the database decided to shuffle the order of documents one day out of the blue it wouldn't run afoul of any promises it made in published documentation.
Normal operation: compactions, replication, backup/restore, node bootstrap from healthy replicas
In addition to those, inserting a single document can change the order of existing documents.

Is the update order preserved in MongoDB?

If I trigger two updates which are just 1 nanosecond apart. Is it possible that the updates could be done out-of-order? Say if the first update was more complex than the second.
I understand that MongoDB is eventually consistent, but I'm just not clear on whether or not the order of writes is preserved.
NB I am using a legacy system with an old version of MongoDB that doesn't have the newer transaction stuff

In MongoDB, write operations are atomic on document level as every document in a collection is independent & individual on it's own. So when an operation is executing write on a document then the second operation has to wait until first one finishes writing to the document.
From their docs :
a write operation is atomic on the level of a single document, even if
the operation modifies multiple embedded documents within a single
document.
Ref : atomicity-in-mongodb
So when this can be an issue ? - On reads, This is when if your application is so ready heavy. As reads can happen during updates - if your read happened before update finishes then your app will see old data or in other case reading from secondary can also result in inconsistent data.
In general MongoDB is usually hosted as replica-set (A MongoDB is generally a set of atleast 3 servers/shards/nodes) in which writes has to definitely be targeted to primary shard & by default reads as well are targeted to primary shard but if you overwrite read preference to read from Secondary shards to make primary free(maybe in general for app reporting), then you might see few to zero issues.
But why ? In general in background data gets sync'd from Primary to Secondary, if at any case this got delayed or not done by the time your application reads then you'll see the issue but chances can be low. Anyway all of this is until MongoDB version 4.0 - From 4.0 secondary read preference enabled apps will read from WiredTiger snapshot of data.
Ref : replica-set-data-synchronization

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?

Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tends to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/

I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons, (especially in a microservice architecture). The app would be more resilient and self-healing by codifying the app to re-try and not give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
Although this sort of resiliency may be overkill in some types of projects, it really just depends.

I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that using this approach can simplify your indexes as no extra index is needed, the basic cursor is sufficient.

Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones Mongodb uses so that the ID shown in the URL is much shorter.

I'll use an example , i created a property management tool and it had multiple collections. For simplicity some fields would be duplicated for example the payment. And when i needed to update these record it had to happen simultaneously across all collections it appeared in so i would assign them a custom payment id so when the delete/query action is performed it changes all instances of it database wide

how do non-ACID RethinkDB or MongoDB maintain secondary indexes for non-equal queries

This is more of 'inner workings' undestanding question:
How do noSQL databases that do not support *A*CID (meaning that they cannot update/insert and then rollback data for more than one object in a single transaction) -- update the secondary indexes ?
My understanding is -- that in order to keep the secondary index in sync (other wise it will become stale for reads) -- this has to happen withing the same transaction.
furthermore, if it is possible for index to reside on a different host than the data -- then a distributed lock needs to be present and/or two-phase commit for such an update to work atomically.
But if these databases do not support the multi-object transactions (which means they do not do two-phase commit on data across multiple host) , what method do they use to guarantee that secondary indices that reside in B-trees structures separate from the data are not stale ?

This is a great question.
RethinkDB always stores secondary indexes on the same host as the primary index/data for the table. Even in case of joins, RethinkDB brings the query to the data, so the secondary indexes, primary indexes, and data always reside on the same node. As a result, there is no need for distributed locking protocols such as two phase commit.
RethinkDB does support a limited set of transactional functionality -- single document transactions. Changes to a single document are recorded atomically. Relevant secondary index changes are also recorded as part of that transaction, so either the entire change is recorded, or nothing is recorded at all.
It would be easy to extend the limited transactional functionality to support multiple documents in a single shard, but it would be hard to do it across shards (for the distributed locking reasons you brought up), so we decided not to implement transactions for multiple documents yet.
Hope this helps.

This is a MongoDB answer.
I am not quite sure what your logic here is. Updating a secondary index has nothing to do with being able to rollback multi statement transactions such as a multiple update.
MongoDB has transcactions per a single document, and that is what matters for updating indexes. These operations can be reversed using the journal if the need arises.
this has to happen withing the same transaction.
Yes, much like a RDBMS would. The more indexes you apply the slower your writes will be, and it seems to me you know why.
As the write occurs MongoDB will update all indexes which apply to that collection with the fields that apply to specific indexes.
furthermore, if it is possible for index to reside on a different host than the data
I am unsure if MongoDB allows that, I believe there is a JIRA for it; however, I cannot find that JIRA currently.
then a distributed lock needs to be present and/or two-phase commit for such an update to work atomically.
Most likely. Allowing this feature would be...well, let's just say creating a hairball.
Even in a sharded setup the index of each range resides on the shard itself, not on the config servers.
But if these databases do not support the multi-object transactions (which means they do not do two-phase commit on data across multiple host)
That is not what a two phase commit means. I believe you need to brush up on what a two phase commit is: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/
I suppose if you are talking about a transaction covering more than one shard then, hmm ok.
what method do they use to guarantee that secondary indices that reside in B-trees structures separate from the data are not stale ?
Agan I am unsure why a multi document transaction would effect whether an index would be stale or not, your not grouping across documents. The exception to that is a unique index but that works on single document updates as well; note that its uniqueness gets kinda hairy in sharded setups and cannot be guaranteed.
In an index you are creating, normally, one entry per document prefix key, uless it is a multikey index on the docment then you can make more than one index, however, either way index updating is done per single object, not by multi document transactions and I am unsure what you logic here is aas such this is the answer I have placed.

RethinkDB always stores secondary index data on the same machine as the data it's indexing. This allows it to be updated within the same transaction. Rethink promises to be ACIDy with single document operations and considers the indexing of a document to be part of the document itself.