The efficiency of the Update Query in CrateData - crate

When executing an update query that triggers a lot of records (e.g. a million) to be updated, my understanding is that the underlying index system needs to re-ingest each document. So for this kind of "heavy" job, is there a way to control its workload, i.e. run the update at a fixed rate until it is finished?

Currently it is not possible to throttle update queries.
It would probably help your use case to split the update into parts by adding specific filters.
For example, if you have a timestamp field you could update each month separately by adjusting the queries accordingly.
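As a rough illustration of that suggestion (the table name mytable, the ts timestamp column, and the status assignment are hypothetical stand-ins for your actual schema), the update could be split into monthly slices:

-- Split one big UPDATE into monthly slices so each statement only
-- re-ingests a fraction of the documents at a time.
UPDATE mytable SET status = 'processed'
WHERE ts >= '2014-01-01' AND ts < '2014-02-01';

UPDATE mytable SET status = 'processed'
WHERE ts >= '2014-02-01' AND ts < '2014-03-01';

-- ...and so on, one month per statement, pausing between statements if needed.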

Related

Is dynamically creating and dropping collections in MongoDB going to create scalability issues?

I have an application (built in Meteor) that provides some ad hoc reporting capabilities to the end user. I have built up that functionality by using the aggregation pipeline to produce the results for a given query. This makes it extremely fast and I was using $out to push the results right into a results table.
The results table included a queryID, which the client used to figure out which were the correct results.
Unfortunately, as you may know (and I discovered), that doesn't work so well once you have more than one user running reports at a time because $out deletes the whole results table before pushing the new query in.
I see three possible workarounds:
1) Run the aggregation, but manually push the results into the results collection.
2) $out the results into a temporary collection (dynamically named to avoid conflicts) and then manually copy the results from there into the results collection, immediately dropping the temporary one. This made some sense when I thought I could use copyTo(), but that doesn't appear possible within Meteor, so I think this option doesn't make much sense relative to #1 in this case.
3) $out the results into a temporary collection (dynamically named to avoid conflicts) and have the client pull its results directly from there. I would then periodically drop the extra collections after, say, 24 hours (like I do with specific query results in the main collection today).
#3 would be the fastest by far - the time it takes to manually copy rows dwarfs the time it takes the queries to run. But I'm concerned about the impact of creating and dropping so many collections.
We're not talking millions of users here, but if an average of 500 users a day were each running 10-20 reports, there could be an additional 5-10k collections in the database at any one time. That seems like a lot. Perhaps I could be smarter about cleaning them up somehow, though I can't just immediately remove them because a user might want to have multiple tabs open with different reports. Even still, we're potentially talking about hundreds to thousands of collections.
Is that going to be a problem?
Are there other approaches I should consider instead?
Other recommendations?
Thanks!
Dropping a collection in MongoDB is a very efficient operation, in any case much more efficient than deleting a subset of documents from a larger collection.
The maximum number of collections is quite high: it is limited only by the size of the namespace file in MMAPv1, and there is no hard limit in the WiredTiger engine.
So I would favor your solution #3 (see the sketch after the list below).
Some improvements/alternatives you could consider:
Consider creating the collections in a separate database (say, one per day); then you can drop the entire database in a single operation without having to drop individual collections.
Use an endpoint for the result set, cache the results, then drop the $out collection. Let the cache handle user requirements and only rerun the aggregation if the cache has expired.
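A rough sketch of workaround #3 in the mongo shell is below; the source collection data, the results_ prefix, and the cleanup rule are hypothetical placeholders.

// Hypothetical names: "data" as the source collection and a
// "results_<queryId>" naming scheme for the per-report output collections.
var queryId = new Date().getTime();            // unique-ish id per report run (the app would use its own queryID)
var resultsName = "results_" + queryId;        // dynamically named temp collection

db.data.aggregate([
  /* ...the report's pipeline stages... */
  { $out: resultsName }                        // write the results into the temp collection
]);

// Periodic cleanup: list the temp collections and drop the ones that are
// old enough (the age check itself is left out of this sketch).
db.getCollectionNames()
  .filter(function (name) { return name.indexOf("results_") === 0; })
  .forEach(function (name) {
    // if old enough: db.getCollection(name).drop();
  });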
This kind of activity is done very easily in relational databases such as MySQL or PostgreSQL. You might consider synchronising your data to a separate relational database for reporting purposes.
There is a package, https://github.com/perak/mysql-shadow, which claims to provide synchronisation. I played with it and it didn't work perfectly, although doing just a one-way sync is more likely to succeed.
The other option is to use GraphQL over a Mongo/MySQL hybrid database, which can be done with the Apollo stack http://www.apollodata.com/

Why does MongoDB only update the first matching document in the collection?

Consider a collection student that contains the following documents.
{name:"Nithin",age:23}
{name:"Nithin",age:25}
{name:"Nithin",age:28}
{name:"Nithin",age:12}
I want to update all the documents whose name is "Nithin" to have age=60.
If we execute the following query it will only update the first document.
db.student.update({name:"Nithin"},{age:60})
To update all the documents I have to use the query
db.student.update({name:"Nithin"},{age:60},false,true)
or
db.student.update({name:"Nithin"},{age:60},{multi:true})
Why does MongoDB not update all the documents by default when executing db.student.update({name:"Nithin"},{age:60})? What is the motivation for requiring a separate form of the query to update all the documents? Does it improve performance?
Originally, in the early days of MongoDB (pre-1.1), it was not possible to update multiple documents. This was a feature added around 1.1.3.
You can see it in the release notes, New Feature 268.
I'm guessing this was not enabled by default for backwards compatibility with previous versions.
This may not really be the reason, but I see the additional multi parameter as a safeguard to prevent accidentally updating multiple records when one intends to update a single document only, something like accidentally performing an UPDATE ... SET in SQL without specifying additional constraints.
Again, this is just an assumption and may not really be the case.
I suppose part of the reason might be to avoid people coming from the SQL world to think about multi-document updates as isolated transactions.
In fact, during a long update MongoDB will periodically yield control to other queries which can potentially modify the same dataset.
So, by explicitly setting multi=true you are somewhat acknowledging this fact (well, not really, but I guess that's the spirit...)
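For reference, a minimal sketch of the multi-document form in the mongo shell, using the student collection from the question:

// Multi-document updates require an update operator such as $set; a bare
// replacement document is rejected when multi is true.
db.student.update(
  { name: "Nithin" },        // match every document with this name
  { $set: { age: 60 } },     // change only the age field
  { multi: true }            // update all matches instead of just the first
);

// Newer shells expose the same behaviour explicitly:
db.student.updateMany({ name: "Nithin" }, { $set: { age: 60 } });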

Prioritize specific long-running operation

I have a mongo collection with a little under 2 million documents in it, and I have a query that I wish to run that will delete around 700,000 of them, based on a date field.
The remove query looks something like this:
db.collection.remove({'timestamp': { $lt: ISODate('XXXXX') }})
The exact date is not important in this case; the syntax is correct and I know it will work. However, I also know it's going to take forever (last time we did something similar it took a little under 2 hours).
There is another process inserting and updating records at the same time that I cannot stop. However, as long as those insertions/updates "eventually" get executed, I don't mind them being deferred.
My question is: Is there any way to set the priority of a specific query / operation so that it runs faster / before all the queries sent afterwards? In this case, I assume mongo has to do a lot of swapping data in and out of the database which is not helping performance.
I don't know whether the priority can be fine-tuned, so there might be a better answer.
A simple workaround might be what is suggested in the documentation:
Note: For large deletion operations it may be more effect [sic] to copy the documents that you want to save to a new collection and then use drop() on the original collection.
Another approach is to write a simple script that fetches e.g. 500 elements and then deletes them using $in. You can add some kind of sleep() to throttle the deletion process. This was recommended in the newsgroup.
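A minimal sketch of that batched approach in the mongo shell, assuming the collection and timestamp field from the question; the cutoff date, batch size, and sleep interval are placeholders to tune:

var cutoff = ISODate("2014-01-01T00:00:00Z");    // placeholder cutoff date
var ids;
do {
  // Fetch the next batch of ids that fall below the cutoff.
  ids = db.collection.find(
    { timestamp: { $lt: cutoff } },
    { _id: 1 }
  ).limit(500).toArray().map(function (doc) { return doc._id; });

  if (ids.length > 0) {
    db.collection.remove({ _id: { $in: ids } }); // delete just this batch
    sleep(200);                                  // throttle so concurrent writes can keep up
  }
} while (ids.length > 0);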
If you will encounter this problem in the future, you might want to
Use a day-by-day collection so you can simply drop the entire collection once data becomes old enough (this makes aggregation harder), or
use a TTL collection where items time out automatically and don't need to be deleted in bulk.
If your application needs to delete data older than a certain amount of time, I suggest using TTL indexes. Example (from the MongoDB site):
db.log.events.ensureIndex( { "status": 1 }, { expireAfterSeconds: 3600 } )
This works like a capped collection, except that data is deleted based on time. The biggest win for you is that it works in a background thread, so your inserts/updates will be mostly unaffected. I use this technique on a SaaS product in production and it works like a charm.
This may not be your use case, but I hope that helps.

Can MongoDB run the same operation on many documents without querying each one?

I am looking for a way to update every document in a collection called "posts".
Posts get updated periodically with a popularity (sitewide popularity) and a strength (the estimated relevance to that particular user), each from different sources. What I need to do is multiply popularity and strength on each post to get a third field, relevance. Relevance is used for sorting the posts.
class Post
include Mongoid::Document
field :popularity
field :strength
field :relevance
...
The current implementation is as follows:
1) I map/reduce down to a separate collection, which stores the post id and calculated relevance.
2) I update every post individually from the map reduce results.
This is a huge amount of individual update queries, and it seems silly to map each post to its own result (1-to-1), only to update the post again. Is it possible to multiply in place, or do some sort of in-place map?
Is it possible to multiply in place, or do some sort of in-place map?
Nope.
The ideal here would be to have the Map/Reduce update the Post directly when it is complete. Unfortunately, M/R does not have that capability. In theory, you could issue updates from the "finalize" stage, but this will collapse in a sharded environment.
However, if all you are doing is a simple multiplication, then you don't really need M/R at all. You can just run a big for loop, or you can hook up the save event to update :relevance when :popularity or :strength are updated.
MongoDB doesn't have triggers, so it can't do this automatically. But you're using a business layer which is the exact place to put this kind of logic.
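A minimal sketch of that "big for loop" in the mongo shell, assuming the posts collection and the field names from the Mongoid model above:

// Walk every post and store the precomputed product in "relevance".
db.posts.find({}, { popularity: 1, strength: 1 }).forEach(function (post) {
  db.posts.update(
    { _id: post._id },
    { $set: { relevance: post.popularity * post.strength } }
  );
});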

What's the best way to find the most frequently occurring value in MongoDB?

I'm looking for the equivalent of this sort of SQL query.
SELECT field, count(*) AS counter FROM table GROUP BY field ORDER BY counter DESC
What's the best way to achieve this?
Thanks
Use Map-Reduce. Map each document by emitting the key and a value 1, then aggregate them using a simple reduce operation. See http://www.mongodb.org/display/DOCS/MapReduce
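A minimal sketch of that Map/Reduce in the mongo shell; the collection name table and field name field simply mirror the SQL in the question:

var mapFn = function () { emit(this.field, 1); };                    // key = the field's value
var reduceFn = function (key, values) { return Array.sum(values); };

// Write the per-value counts into an output collection.
db.table.mapReduce(mapFn, reduceFn, { out: "field_counts" });

// Results look like { _id: <value>, value: <count> }; sort to find the most frequent.
db.field_counts.find().sort({ value: -1 }).limit(1);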
I'd handle aggregation queries by keeping track of the respective counts separately, i.e. in their own collection. This way, you can simply query the "most frequently occurring" collection. Downside: you need to perform another write whenever the data changes.
Of course, you could also update that collection from time to time using Map/Reduce. This depends a bit on how accurate the information must be and how often it changes.
Make sure, however, not to call the Map/Reduce operation very often: it is not meant to be used in an interactive fashion (i.e. not on every page view) but rather sparingly, in an offline process that updates the counts every hour or so. Hence, if your counts change very quickly, use a counters collection.
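A minimal sketch of the counters-collection idea; the counters collection name and the helper function are hypothetical, and the $inc would run alongside every write that changes the tracked field:

// Called from the application whenever a document gains a new value
// for the tracked field.
function bumpCounter(value) {
  db.counters.update(
    { _id: value },           // one counter document per distinct value
    { $inc: { count: 1 } },
    { upsert: true }          // create the counter on first occurrence
  );
}

// The "most frequently occurring" value is then a cheap query:
db.counters.find().sort({ count: -1 }).limit(1);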