How does MongoDB replicate an update query affecting multiple documents?
Will it use a statement-based approach, conserving oplog space, or a row-based approach?
What are the criteria for selecting row-based or statement-based replication?
Will it use a statement-based approach, conserving oplog space, or a row-based approach?
MongoDB replicates on a document-by-document basis using an oplog. So when you run an update that affects multiple documents, it will actually write each document's change to the oplog one by one, which of course takes up space; as noted in the manual: http://docs.mongodb.org/manual/core/replication/#oplog
The oplog must translate multi-updates into individual operations, in order to maintain idempotency. This can use a great deal of oplog space without a corresponding increase in disk utilization.
The oplog is basically a capped collection, and it replicates entries oldest-first.
As far as I know, MongoDB does not do statement-based replication, unlike many SQL databases.
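To see this translation yourself, you can run a multi-update against a replica set member and then read back the tail of the oplog. A minimal pymongo sketch; the replica set name, database and collection names are made up:

```python
from pymongo import MongoClient, DESCENDING

# Hypothetical replica set and namespace; run this against a replica set member.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
users = client.test.users

# One logical statement that touches many documents...
users.update_many({"active": True}, {"$set": {"checked": True}})

# ...shows up in the oplog as one entry per modified document (op "u" = update).
oplog = client.local["oplog.rs"]
recent = oplog.find({"ns": "test.users", "op": "u"}).sort("$natural", DESCENDING).limit(5)
for entry in recent:
    print(entry["ts"], entry.get("o2"))  # o2 carries the _id of the individual document
```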
Related
I have 200+ million records in a postgresql-9.5 table. Almost all queries are analytical queries. To increase and optimize query performance I have so far tried indexing, and it does not seem to be sufficient. What other options do I need to look into?
Depending on the WHERE clause conditions, create a partitioned table (https://www.postgresql.org/docs/10/static/ddl-partitioning.html); it will reduce query cost drastically. Also, if there is a fixed value that always appears in the WHERE clause, add a partial index on the partitioned table.
One important point: check the order of columns in the WHERE clause and match it when indexing.
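To make that concrete, here is a hedged sketch (psycopg2, made-up table and column names) of a range-partitioned table with a partial index, assuming the declarative partitioning from the linked PostgreSQL 10 docs:

```python
import psycopg2

# Hypothetical connection; the partition key should match the column your
# analytical queries filter on (here: created_at).
conn = psycopg2.connect("dbname=analytics user=postgres")
cur = conn.cursor()

# Declarative range partitioning (PostgreSQL 10+, as in the linked docs).
cur.execute("""
    CREATE TABLE events (
        id         bigint,
        created_at date NOT NULL,
        status     text,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);
""")
cur.execute("""
    CREATE TABLE events_2018_q1 PARTITION OF events
        FOR VALUES FROM ('2018-01-01') TO ('2018-04-01');
""")

# Partial index on a partition, useful when the WHERE clause always pins a value.
cur.execute("""
    CREATE INDEX events_2018_q1_active_idx
        ON events_2018_q1 (created_at, id)
        WHERE status = 'active';
""")
conn.commit()
```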
You should upgrade to PostgreSQL v10 so that you can use parallel query.
That enables you to run sequential and index scans with several background workers in parallel, which can speed up these operations on large tables.
A good database layout, good indexing, lots of RAM and fast storage are also important factors for good performance of analytical queries.
If the analysis involves a lot of aggregation, consider materialized views to store the aggregates. Materialized views do take up space and they need to be refreshed too. But they are very useful for data aggregation.
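As a rough illustration of the materialized-view approach, a sketch with psycopg2 and the hypothetical events table from the sketch above; the aggregate and the refresh schedule are placeholders:

```python
import psycopg2

# Hypothetical connection and table.
conn = psycopg2.connect("dbname=analytics user=postgres")
cur = conn.cursor()

# Pre-compute a heavy aggregate once instead of recomputing it per query.
cur.execute("""
    CREATE MATERIALIZED VIEW daily_event_counts AS
    SELECT created_at::date AS day, status, count(*) AS n
    FROM events
    GROUP BY 1, 2;
""")
conn.commit()

# Refresh on whatever schedule fits your staleness tolerance. With a unique
# index on the view, REFRESH ... CONCURRENTLY keeps it readable during refresh.
cur.execute("REFRESH MATERIALIZED VIEW daily_event_counts;")
conn.commit()
```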
I have a use case where a set of records in a collection needs to be deleted after a specified interval of time.
For example: records older than 10 hours should be deleted every 10th hour.
We have tried deletion based on id but found it to be slow.
Is there a way to partition the records in a collection and drop a partition as and when required in MongoDB?
MongoDB does not currently support partitions; there is a JIRA ticket to add this as a feature (SERVER-2097).
One solution is to use multiple time-based collections, cycling collections in a similar way as you would partitions. This works best when you usually only query one or a few of these time-based collections. If you often need to read across multiple collections, you could add some wrapper code to simplify that.
There are also TTL indexes, which use a background thread in the mongod server to handle the deletes for you.
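For the "older than 10 hours" case, a TTL index is usually the simplest fit. A minimal pymongo sketch with a hypothetical collection and date field:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()
events = client.mydb.events  # hypothetical database/collection

# Expire documents roughly 10 hours (36000 s) after the value in "createdAt".
# The TTL monitor runs about once a minute, so deletion is not instantaneous.
events.create_index("createdAt", expireAfterSeconds=36000)

# The indexed field must hold a BSON date for the document to be eligible for expiry.
events.insert_one({"createdAt": datetime.now(timezone.utc), "payload": "..."})
```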
Your deletes by _id may have been slow for a number of reasons, and that probably warrants more investigation beyond your original question.
Is creating multiple compound indexes to serve various types of queries better?
or
Is it better to
use a single compound index in a way that supports multiple queries (which is hard to analyse and construct, since there are a large number of queries)?
My basic question is: "Does creating multiple compound indexes slow down read/write operations?"
Please suggest a solution.
There isn't any answer that fits all cases, but in general adding the right indexes will give you better performance: you will need fewer reads when accessing data. Maintaining the indexes will cost you some write performance; however, if they are correct and actually used, your db will perform better overall. Start with monitoring: mongodb monitoring docs
Indices will slow down writes but speed up reads. A high read-to-write ratio warrants one or more indices on commonly fetched fields (keys). For example, our current system sees 25 writes to 20,000 reads (tps), so indices are beneficial given that wide margin. That being said, be mindful to hold the mongo write lock for as short a time as possible.
MongoDB uses a readers-writer lock that allows concurrent reads access to a database but gives exclusive access to a single write operation. (mongodb docs)
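A quick way to sanity-check the trade-off is to create the compound indexes you think you need and then use explain() to confirm that a given query actually uses one of them. A hedged pymongo sketch with made-up collection and field names:

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient()
orders = client.shop.orders  # hypothetical collection

# Two compound indexes serving two different query shapes.
orders.create_index([("customerId", ASCENDING), ("created", DESCENDING)])
orders.create_index([("status", ASCENDING), ("created", DESCENDING)])

# explain() shows which index (if any) the query planner picked, which is the
# quickest way to verify an index earns its write overhead.
plan = orders.find({"status": "open"}).sort("created", -1).explain()
print(plan["queryPlanner"]["winningPlan"])
```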
This is more of an 'inner workings' understanding question:
How do NoSQL databases that do not support the A (atomicity) in ACID (meaning they cannot update/insert and then roll back data for more than one object in a single transaction) update their secondary indexes?
My understanding is that, in order to keep a secondary index in sync (otherwise it will become stale for reads), this has to happen within the same transaction.
Furthermore, if it is possible for the index to reside on a different host than the data, then a distributed lock and/or two-phase commit needs to be present for such an update to work atomically.
But if these databases do not support multi-object transactions (which means they do not do two-phase commit on data across multiple hosts), what method do they use to guarantee that secondary indices that reside in B-tree structures separate from the data are not stale?
This is a great question.
RethinkDB always stores secondary indexes on the same host as the primary index/data for the table. Even in case of joins, RethinkDB brings the query to the data, so the secondary indexes, primary indexes, and data always reside on the same node. As a result, there is no need for distributed locking protocols such as two phase commit.
RethinkDB does support a limited set of transactional functionality -- single document transactions. Changes to a single document are recorded atomically. Relevant secondary index changes are also recorded as part of that transaction, so either the entire change is recorded, or nothing is recorded at all.
It would be easy to extend the limited transactional functionality to support multiple documents in a single shard, but it would be hard to do it across shards (for the distributed locking reasons you brought up), so we decided not to implement transactions for multiple documents yet.
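For illustration, a hedged sketch with the RethinkDB Python driver (table and field names are made up): creating a secondary index and then doing a single-document update, where the document and its index entries change atomically as described above.

```python
from rethinkdb import RethinkDB  # driver >= 2.4; older drivers used `import rethinkdb as r`

r = RethinkDB()
conn = r.connect("localhost", 28015)

# Hypothetical table and field; the index lives on the same shard as the documents.
r.table("users").index_create("email").run(conn)
r.table("users").index_wait("email").run(conn)

# A single-document write updates the document and its secondary index entries
# atomically: either both are visible afterwards, or neither is.
r.table("users").get("some-id").update({"email": "new@example.com"}).run(conn)
```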
Hope this helps.
This is a MongoDB answer.
I am not quite sure what your logic here is. Updating a secondary index has nothing to do with being able to roll back multi-statement transactions such as a multi-update.
MongoDB has transactions per single document, and that is what matters for updating indexes. These operations can be reversed using the journal if the need arises.
this has to happen within the same transaction.
Yes, much like an RDBMS would. The more indexes you apply, the slower your writes will be, and it seems to me you know why.
As the write occurs, MongoDB will update every index that applies to that collection, using the fields that each specific index covers.
Furthermore, if it is possible for the index to reside on a different host than the data
I am unsure if MongoDB allows that, I believe there is a JIRA for it; however, I cannot find that JIRA currently.
then a distributed lock and/or two-phase commit needs to be present for such an update to work atomically.
Most likely. Allowing this feature would be...well, let's just say creating a hairball.
Even in a sharded setup the index of each range resides on the shard itself, not on the config servers.
But if these databases do not support multi-object transactions (which means they do not do two-phase commit on data across multiple hosts)
That is not what a two phase commit means. I believe you need to brush up on what a two phase commit is: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/
I suppose if you are talking about a transaction covering more than one shard then, hmm ok.
what method do they use to guarantee that secondary indices that reside in B-tree structures separate from the data are not stale?
Again, I am unsure why a multi-document transaction would affect whether an index is stale or not; you're not grouping across documents. The exception to that is a unique index, but that works on single-document updates as well; note that its uniqueness gets kinda hairy in sharded setups and cannot be guaranteed.
In an index you normally create one entry per document per key prefix, unless it is a multikey index on the document, in which case a single document can produce more than one index entry. Either way, index updating is done per single object, not via multi-document transactions. I am unsure what your logic here is, and as such this is the answer I have placed.
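To illustrate the multikey case, a small pymongo sketch with made-up names: indexing an array field makes the index multikey, and a single-document update rewrites that document's index entries without any multi-document transaction.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient()
posts = client.blog.posts  # hypothetical collection

# Indexing an array field makes the index "multikey": MongoDB stores one index
# entry per array element, so a single document can contribute several entries.
posts.create_index([("tags", ASCENDING)])
posts.insert_one({"title": "hello", "tags": ["mongodb", "indexes", "replication"]})

# This update rewrites the document's index entries as part of the same
# single-document operation; no multi-document transaction is involved.
posts.update_one({"title": "hello"}, {"$push": {"tags": "oplog"}})
```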
RethinkDB always stores secondary index data on the same machine as the data it's indexing. This allows it to be updated within the same transaction. Rethink promises to be ACIDy with single document operations and considers the indexing of a document to be part of the document itself.
I am using mongodb with elasticsearch for my application. Elasticsearch builds its index by monitoring the oplog collection. When both applications are running constantly, any changes to the collections in mongodb are indexed immediately. The only problem I face is that if for some reason I have to delete and recreate the index, it takes ages (2 days) for the indexing to complete.
When I looked at the size of my oplog, its default capacity is 40 GB and it is holding around 60 million transactions, which is why creating a fresh index takes such a long time.
What would be the best way to optimize fresh index creation?
Is it to reduce the size of the oplog so that it holds fewer transactions while still not affecting my replication, or is it possible to create a TTL index (which I failed to do on several attempts) on the oplog?
I am using elasticsearch with mongodb using mongodb river https://github.com/richardwilly98/elasticsearch-river-mongodb/.
Any help to overcome the above mentioned issues is appreciated.
I am not an Elasticsearch pro, but your question:
What would be the best way to optimize fresh index creation?
does apply a little to everyone who uses third-party full-text search (FTS) technologies with MongoDB.
The first thing to note is that if you have A LOT of records then there is no easy way around this unless you are prepared to lose some of them.
The oplog isn't really a good fit for this. Personally, I would look at a custom script driven by timers on the main collection, or at a change table giving you a single place to quickly query for new or updated records.
Unless you are filtering the oplog for specific records, i.e. inserts, you could be pulling out ALL oplog records, including deletes, collection operations and even database operations. So you could try stripping unneeded records out of your oplog search; however, this creates a new problem: the oplog has no indexes and no index updating.
This means that if you start to read it in a more selective manner, you will actually be running an unindexed query over those 60 million records, which will result in slow(er) performance.
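As a rough sketch of what a filtered oplog read looks like (pymongo, hypothetical namespace): the filter narrows what you pull out, but it does not make the read indexed; it is still a scan over the capped collection.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # hypothetical replica set
oplog = client.local["oplog.rs"]

# Pull only inserts ("i") for one hypothetical namespace instead of every record.
# The oplog has no secondary indexes, so this is still a scan over the whole
# capped collection; on ~60 million entries expect it to take a while.
for entry in oplog.find({"ns": "mydb.articles", "op": "i"}):
    doc = entry["o"]        # the inserted document
    print(doc.get("_id"))   # hand off to your Elasticsearch indexing code here
```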
The oplog having no index updating answers another one of your questions:
is it possible to create a TTL index (which I failed to do on several attempts) on the oplog?
Nope.
As for the other one of your questions:
Is it to reduce the size of the oplog so that it holds fewer transactions
Yes, but you will have a smaller replication recovery window, and on top of that you will lose records from your "fresh" index, so only part of your data actually gets indexed. I am unsure, from your question, whether this is a problem or not.
You can reduce the oplog for a single secondary member that no other member is syncing from. Look up rs.syncFrom and "Change the Size of the Oplog" in the mongodb docs.
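If you go that route, the shell helper rs.syncFrom() just wraps the replSetSyncFrom admin command. A hedged pymongo sketch with made-up hostnames (the directConnection option needs a reasonably recent driver):

```python
from pymongo import MongoClient

# Connect directly to the secondary you want to re-point; hostnames are hypothetical.
secondary = MongoClient("mongodb://secondary2.example.com:27017", directConnection=True)

# rs.syncFrom("...") in the shell is a wrapper around the replSetSyncFrom command.
secondary.admin.command({"replSetSyncFrom": "primary.example.com:27017"})
```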