I am using MongoDB for an application that requires a high frequency of reads, writes and updates.
I am mainly concerned about the update and delete operations. Which of the two is faster? I am indexing the collection on one attribute. Both update and delete would serve my purpose, but I am not sure which is the better choice and which has better performance.
I would suggest that rather than deciding on whether to use Update or Delete for your solution, you look more closely at the SafeMode attribute.
SafeMode.True indicates that you are expecting a response from the server that will contain, among other things, a confirmation of whether the command succeeded or failed. This option blocks execution until you receive a response from the server.
SafeMode.False will not expect any response; it is basically an optimistic command. You expect it to work, but have no way to confirm it. Since you do not wait for a response, execution is not blocked, and you gain performance because all you need to do is send the request.
Now you need to consider that Deletes will free up space on the server, but you will lose the history and traceability of the data. Updates will allow you to keep historic entries, but you will need to make sure your queries exclude the 'marked for deletion' entries.
It is obviously up to you to decide whether a Delete or an Update is better, but I think the focus should be on whether you use SafeMode true or false to improve performance.
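To make the trade-off concrete, here is a minimal PyMongo sketch of the two options (database, collection and field names are hypothetical, and the modern driver API is used in place of the old SafeMode flag):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["items"]  # hypothetical database/collection names

# Hard delete: frees space on the server, but history and traceability are lost
coll.delete_one({"_id": 42})

# Soft delete via update: the document is kept, only flagged
coll.update_one({"_id": 43}, {"$set": {"deleted": True}})

# ...which means every query now has to exclude the flagged entries
active_docs = coll.find({"deleted": {"$ne": True}})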
A rather odd question, but here are the things you can base your decision on:
Deleting will keep the collection at an optimum size. Updating (I assume you mean something like setting a deleted flag to true) will result in an ever-growing collection, which eventually will make things slower.
In-place updates (updates that do not result in the document having to be moved due to an increase in size) are always faster than updates or deletes that require documents to be (re)moved.
Safe = false writes will significantly improve the throughput of updates and deletes, at the expense of not being able to check whether the update/remove was successful.
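In today's drivers the safe flag is expressed as a write concern. A rough PyMongo sketch of the two settings (names are hypothetical):

from pymongo import MongoClient, WriteConcern

db = MongoClient()["mydb"]  # hypothetical database name

# Unacknowledged writes (the old safe=false): highest throughput, no error reporting
fast = db.get_collection("items", write_concern=WriteConcern(w=0))
fast.delete_one({"_id": 42})  # fire and forget; a failure here goes unnoticed

# Acknowledged writes (the old safe=true): slower, but failures are surfaced
checked = db.get_collection("items", write_concern=WriteConcern(w=1))
checked.update_one({"_id": 43}, {"$set": {"deleted": True}})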
We would like to be able to read state inside a command use case.
We could get the state from the event store for the specific aggregate, but what about querying aggregates by field (not id), or performing more complicated queries that are not well suited to the event store?
The approach we were considering was to use our read model for those cases as well, not only for query use cases.
This might be inconsistent, so a solution could be to have the latest version of the aggregate stored in both write/read models, in order to be able to tell if the state is correct or stale.
Does this make sense, and if so, when we need to get state by id, should we use the event store or the read model?
If you want the absolute latest state of an event-sourced aggregate, you're going to have to read the latest snapshot (assuming that you are snapshotting) and then replay events since that snapshot from the event store. You can be aggressive about snapshotting (conceivably even saving a snapshot after every command), but you're giving away some write performance to make the read faster.
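A minimal, self-contained sketch of that snapshot-plus-replay idea (event shapes and names are invented for illustration):

def apply(state, event):
    """Hypothetical event handler: fold a single event into the aggregate state."""
    if event["type"] == "Deposited":
        return {**state, "balance": state["balance"] + event["amount"]}
    return state

def load_aggregate(snapshot, events_since_snapshot):
    """Start from the latest snapshot, then replay only the newer events."""
    state = snapshot["state"] if snapshot else {"balance": 0}
    version = snapshot["version"] if snapshot else 0
    for event in events_since_snapshot:
        state = apply(state, event)
        version += 1
    return state, version

# Usage: a snapshot taken at version 2, plus one event recorded after it.
state, version = load_aggregate(
    {"state": {"balance": 100}, "version": 2},
    [{"type": "Deposited", "amount": 25}],
)
# state == {"balance": 125}, version == 3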
Updating the read model directly is conceivably possible, though that level of coupling is something that should be considered very carefully. Note also that you will very likely need some sort of two-phase commit to ensure that the read model is only updated when the write model is updated and vice versa. I strongly suggest considering why you're using CQRS/ES in this project, because you are quite possibly undermining that reason by doing this sort of thing.
In general, if you need a query for processing a particular command, it's likely that query will generally be the same, i.e. you don't need free-form query support. In that case, you can often have a read model that's tuned for exactly that query and which only cares about events which could affect that query: often a fairly small subset of the events. The finer-grained the read model, the easier it is to keep in sync (if it ignores 99% of events, for instance, it can't really fall that far behind).
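For example, a read model that exists solely to answer one command-side question might look like the sketch below (event names are hypothetical); because it ignores everything else, it is cheap to keep close to up to date:

class FreeSeatsReadModel:
    """Answers exactly one query: how many seats are still free for a show."""

    def __init__(self):
        self.free_seats = {}  # show_id -> remaining seats

    def handle(self, event):
        if event["type"] == "ShowScheduled":
            self.free_seats[event["show_id"]] = event["capacity"]
        elif event["type"] == "SeatReserved":
            self.free_seats[event["show_id"]] -= 1
        # all other event types cannot change the answer and are ignored

    def free_seats_for(self, show_id):
        return self.free_seats.get(show_id, 0)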
Needing to make complex queries as part of command processing could also be a sign that your aggregate boundaries aren't right and could do with a re-examination.
"Does this make sense"
Maybe. Let's start with:
"This might be inconsistent"
Yup, it might be. So what?
We typically respond to a query by sending an unlocked copy of the answer. In other words, it's possible that the actual information in the write model will change after this response is dispatched but before the response arrives at its destination. The client will be looking at a copy of the answer taken from the past.
So we might reasonably ask how much better it is to get information no more than one minute old compared to information no more than five minutes old. If the difference in value is pennies, then you should probably deploy the five minute version. If the difference is millions of dollars, then you're in a good position to negotiate a real budget to solve the problem.
For processing a command in our own write model, that kind of inconsistency isn't usually acceptable or wise. But neither of the two common answers require keeping the read and write models synchronized. The most common answer is to just work with the write model alone. The less common answer is to grab a snapshot out of a cache, and then apply any additional events to it to bring it up to date. The latter approach is "just" a performance optimization (first rule: don't.)
The variation that trips everyone up is trying to process a command somewhere else, enforcing a consistency rule on our data here. Once again, you need a really clear picture of how valuable the consistency is to the business. If it's really important, that may be a signal that the information in question shouldn't be split into two different piles - you may be working with the wrong underlying data model.
Possibly useful references:
Pat Helland, Data on the Outside Versus Data on the Inside
Udi Dahan, Race Conditions Don't Exist
I'm wondering if transactions (https://firebase.google.com/docs/firestore/manage-data/transactions) are a viable tool to use in something like a ticketing system, where users may be attempting to read/write the same collection/document, and whoever made the request first should be handled first, the next request second, and so on.
If not, what would be a good structure for such a need with Firestore?
Transactions just guarantee atomic, consistent updates among the documents involved in the transaction. They don't guarantee the order in which those transactions complete, as the transaction handler might get retried in the face of contention.
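For reference, this is roughly what such a transaction looks like with the Firestore Python client (collection, document and field names are made up); note that the decorated function may be re-run if there is contention:

from google.cloud import firestore

db = firestore.Client()

@firestore.transactional
def claim_ticket(transaction, ticket_ref):
    snapshot = ticket_ref.get(transaction=transaction)  # read inside the transaction
    remaining = (snapshot.to_dict() or {}).get("remaining", 0)
    if remaining <= 0:
        raise ValueError("sold out")
    # Committed only if the document has not changed underneath us;
    # otherwise the whole function is retried.
    transaction.update(ticket_ref, {"remaining": remaining - 1})

ticket_ref = db.collection("events").document("concert-2024")
claim_ticket(db.transaction(), ticket_ref)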
Since you tagged this question with google-cloud-functions (but didn't mention it in your question), it sounds like you might be considering writing a database trigger to handle incoming writes. Cloud Functions triggers also do not guarantee any ordering when under load.
Ordering of any kind at the scale on which Firestore and other Google Cloud products operate is a really difficult problem to solve (please read that link to get a sense of that). There is no simple database structure that will impose an order on changes as they are made. I suggest you think carefully about your need for ordering and come up with a different solution.
The best indication of order you can get is probably by adding a server timestamp to individual documents, but you will still have to figure out how to process them. The easiest thing might be to have a backend periodically query the collection, ordered by that timestamp, and process things in that order, in batch.
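A hedged sketch of that timestamp-based approach with the Python client (collection and field names are invented):

from google.cloud import firestore

db = firestore.Client()

# When a request comes in, record it with a server-assigned timestamp.
db.collection("requests").add({
    "user_id": "user-123",
    "created_at": firestore.SERVER_TIMESTAMP,
})

# A backend job periodically drains the queue in (approximate) arrival order.
pending = (
    db.collection("requests")
    .order_by("created_at")
    .limit(100)
    .stream()
)
for doc in pending:
    print("processing", doc.id, doc.to_dict())  # real handling would go here
    doc.reference.delete()                      # remove the request once handled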
I have these operations:
Find a doc in the collection.
Manipulate doc.prop based on its current value ("prop" is a string).
Update the doc back to the collection.
So in this case, I have to make sure these operations are atomic, because updating doc.prop must be based on its current value.
Here are two approaches:
1. Add a "valueKey" (Number) property to the doc, and make sure valueKey matches when updating the doc. Increase valueKey after the update. If valueKey does not match, mark the update as a failure and retry.
2. Use "fsyncLock", provided by MongoDB, to lock the whole mongod instance during the operations.
The 1st approach works, but when facing a huge volume of these operations at the same time, failures and retries would be frequent.
The 2nd approach, which I haven't tried, seems intended for backing up the database and does not look like a good fit in this case.
So I'm wondering is there any other efficient approach?
The first approach is called an optimistic lock. Optimistic locks assume that the probability of collision is low; otherwise, as you already pointed out, there will be a lot of retries. Those retries can also be destructive - if a text is edited, it might make sense to merge the edits, but it hardly ever makes sense for a phone number.
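A minimal PyMongo sketch of the optimistic-lock approach described in the question, i.e. a compare-and-set on valueKey (database, collection and helper names are hypothetical):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["docs"]  # hypothetical database/collection names

def update_prop(doc_id, transform, max_retries=10):
    """Read doc.prop, transform it, and write it back only if nobody raced us."""
    for _ in range(max_retries):
        doc = coll.find_one({"_id": doc_id})
        new_prop = transform(doc["prop"])  # manipulate prop based on its current value
        result = coll.update_one(
            # The update only matches if valueKey is still what we read...
            {"_id": doc_id, "valueKey": doc["valueKey"]},
            # ...and bumps valueKey so any concurrent writer will miss in turn.
            {"$set": {"prop": new_prop}, "$inc": {"valueKey": 1}},
        )
        if result.modified_count == 1:
            return True  # our compare-and-set won
    return False  # still colliding after max_retries attempts

# Usage: append an exclamation mark to doc 42's prop.
update_prop(42, lambda prop: prop + "!")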
Locking the entire database is an extreme form of a pessimistic (offline) lock, where the concurrency of the system is deliberately reduced. However, that has problems because clients don't know what's going on - their edits will simply fail, which is about the worst user experience there is.
So pessimistic locks really only make sense if clients have a chance of actually knowing that something is locked. For instance, you'd somehow need to inform the user that it's not possible to edit the item she wants to edit, because someone else is already in edit mode for that item. This also has problems, especially if a user leaves the screen while still holding the lock, blocking all other users.
If you wanted to go for a pessimistic lock, however, that should absolutely never be implemented by something like a global database lock: simply lock the item itself and implement the business logic for the locking in your code.
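If you do go that route, a per-item lock can be as simple as a conditional update on the document itself; a sketch with hypothetical field names and a timeout for abandoned locks:

from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

coll = MongoClient()["mydb"]["docs"]  # hypothetical database/collection names

def try_lock(doc_id, user_id, ttl_minutes=15):
    """Claim the item for one editor; stale locks (abandoned screens) expire."""
    now = datetime.now(timezone.utc)
    result = coll.update_one(
        {
            "_id": doc_id,
            "$or": [
                {"locked_by": None},  # matches missing or null, i.e. not locked
                {"locked_at": {"$lt": now - timedelta(minutes=ttl_minutes)}},
            ],
        },
        {"$set": {"locked_by": user_id, "locked_at": now}},
    )
    return result.modified_count == 1  # True -> this user may edit the item

def unlock(doc_id, user_id):
    coll.update_one(
        {"_id": doc_id, "locked_by": user_id},
        {"$set": {"locked_by": None, "locked_at": None}},
    )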
Moral: this isn't a technology problem, it's a logical problem. Google Docs demonstrates one way to allow concurrent editing by multiple users, but it's hard to implement, has limited use in other types of applications and is still deemed annoying by some users. Git and the like show another method, where the logic of branches, merging and conflicts is exposed to the user as well, but asynchronously (multi-version concurrency control).
I want to avoid doing two operations to achieve the following :
Find the document and update it with modifier-1.
If the document does not exist, populate the default fields with modifier-2, then update with modifier-1.
It's a common pattern, so it should be possible. At the moment I am having to do two upserts.
(Feel free to adjust the pseudocode; I am new to the query language.)
update({...}, modifier-1, true)   // upsert
if (upserted)
{
    // check for a race condition: detect whether another query from another
    // thread has already populated the default values
    update({..., if_a_default_value_does_not_exist}, modifier-2, true)
}
I assume that two operations would result in two disk writes; I understand MongoDB performs asynchronous disk writes. If I can't do this with one operation, is there some mechanism in place that would merge the writes into a single write before writing to the journal / disk? And yes, this would make a significant difference when loading my 300 GB data set :D
Hassan,
The asynchronous writes to disk you mentioned are accomplished by writing the changes to memory and then fsync'ing them onto disk periodically in the background, so merging the two operations would likely not impact performance here as much as you would think.
The journal is another matter entirely - it is written separately to disk in an idempotent manner for safety to allow for easier recovery/restoration in case of failure or other similar issues. You can always start the DB with journaling off, do the import, and then restart with journaling enabled once the bulk update is done if the journal writes are causing you significant issues.
Finally, be careful of the "not exists" logic in your second modifier - from an indexing perspective a positive operator such as $exists is preferred; otherwise indexes may not be used, and that will certainly slow down your inserts.
Away from bulk inserts, for single atomic updates you can also explore the use of findAndModify (http://www.mongodb.org/display/DOCS/findAndModify+Command) to do the check and subsequent change for you; it's hard to tell from the description whether that would be a good fit, because it has its own drawbacks.
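For completeness, a hedged sketch of what such a findAndModify-style call looks like from PyMongo (the filter and modifier below are placeholders, not the asker's actual modifiers): the read, the check and the change happen as one atomic server-side operation, and the resulting document is returned.

from pymongo import MongoClient, ReturnDocument

coll = MongoClient()["mydb"]["items"]  # hypothetical database/collection names

doc = coll.find_one_and_update(
    {"_id": "example", "defaults_populated": True},  # placeholder filter
    {"$inc": {"counter": 1}},                        # placeholder modifier-1
    return_document=ReturnDocument.AFTER,            # return the updated document
)
if doc is None:
    # No document matched the filter, so fall back to the two-step upsert path
    # described in the question.
    pass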
When dealing with MongoDB, when should I use {safe: true} on queries?
Right now I use the 'safe' option just to check whether my queries were inserted or updated successfully. However, I feel this might be overkill.
Should I assume that 99% of the time my queries (assuming they are properly written) will be inserted/updated, and not worry about checking whether they were successfully applied?
Thoughts?
Assuming that when you say queries you actually mean writes/inserts (the wording of your question makes me think this), the Write Concern (safe, none, fsync, etc.) can be used to get more speed and less safety when that is acceptable, and less speed and more safety when that is necessary.
As an example, a hypothetical Facebook-style application could use an unsafe write for "Likes" while it would use a very safe write for password changes. The logic behind this is that there will be many thousand "Like"-style updates happening a second, and it doesn't matter if one is lost, whereas password updates happen less regularly but it is essential that they succeed.
Therefore, try to tailor your Write Concern choice to the kind of update you are doing, based upon your speed and data integrity requirements.
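In PyMongo terms, that per-operation tailoring might look like this (database, collection and field names are invented):

from pymongo import MongoClient, WriteConcern

db = MongoClient()["social_app"]  # hypothetical database name

# Thousands of these happen per second and losing one is harmless:
# use an unacknowledged write for maximum throughput.
likes = db.get_collection("likes", write_concern=WriteConcern(w=0))
likes.insert_one({"post_id": 42, "user_id": 7})

# A password change must not be lost: wait for a majority of replicas
# (and the journal) to acknowledge it before moving on.
users = db.get_collection("users", write_concern=WriteConcern(w="majority", j=True))
users.update_one({"_id": 7}, {"$set": {"password_hash": "new-bcrypt-hash"}})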
Here is another use case where unsafe writes are an appropriate choice: you are making a large number of writes in very short order. In this case you might perform a number of writes and then call getLastError to see if any of them failed.
collection.setWriteConcern(WriteConcern.NORMAL);  // relaxed write concern for the bulk insert itself
collection.getDB().resetError();

for (Something data : importData) {  // importData: the List<Something> being imported
    collection.insert(makeDBObject(data));
}

// Ask the server whether any of the preceding writes failed; throws if one did
collection.getDB().getLastError(WriteConcern.REPLICAS_SAFE).throwOnError();
If this block succeeds without an exception, then all of the data was inserted successfully. If there was an exception, then one or more of the write operations failed, and you will need to retry them (or check for a unique index violation, etc). In real life, you might call getLastError every 10 writes or so, to avoid having to resubmit lots of requests.
This pattern is very nice for performance when performing bulk inserts of large amounts of data.
Safe is only necessary on writes, not reads. Queries are only reads.