MongoDB: any idiom for collapsing two upserts into one (create, modifier to set default values, modifier to update existing document)?

I want to avoid doing two operations to achieve the following:
Find the document and update it with modifier-1.
If the document does not exist, populate the default fields with modifier-2, then update with modifier-1.
It's a common pattern, so it should be possible. At the moment I have to do two upserts.
(Feel free to adjust the pseudocode; I am new to the query language.)
update({...}, modifier-1, true)
if (upserted) {
    // check for a race condition: make sure a query from another thread
    // hasn't already populated the default values
    update({..., if_a_default_value_does_not_exist}, modifier-2, true)
}
I assume that two operations would result in two disk writes, though I understand MongoDB does asynchronous disk writes. If I can't do this with one operation, is there some mechanism that would merge the writes into a single write before writing to the journal / disk? And yes, this would make a significant difference in loading my 300 GB data set :D

Hassan,
The asynchronous writes to disk you mentioned are accomplished by writing the changes to memory and then fsync'ing them onto disk periodically in the background, so merging the two operations would likely not impact performance here as much as you would think.
The journal is another matter entirely - it is written separately to disk in an idempotent manner for safety to allow for easier recovery/restoration in case of failure or other similar issues. You can always start the DB with journaling off, do the import, and then restart with journaling enabled once the bulk update is done if the journal writes are causing you significant issues.
Finally, be careful with the "not exists" logic in your second modifier. From an indexing perspective a positive operator such as exists is preferred; otherwise indexes may not be used, and that will certainly slow down your inserts.
Away from bulk inserts, for single atomic updates you can also explore findAndModify (http://www.mongodb.org/display/DOCS/findAndModify+Command) to do the check and subsequent change for you. It's hard to tell from the description whether that would be a good fit, because it has its own drawbacks.
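For what it's worth, MongoDB 2.4 and later also provide the $setOnInsert update operator, which applies its fields only when an upsert actually creates the document. The following is a minimal sketch of folding the two modifiers into one upsert, assuming PyMongo 3+ and a collection object obtained elsewhere. The helper name upsert_with_defaults and the field names in the usage note are hypothetical, and this is a different technique from the two-step pseudocode in the question.

```python
def upsert_with_defaults(collection, query, update_ops, defaults):
    """Apply update_ops to the matching document, creating it with defaults if absent.

    `collection` is assumed to be a pymongo Collection (or any object exposing
    update_one with the same signature). `update_ops` plays the role of
    "modifier-1", e.g. {"$inc": {"hits": 1}}, and `defaults` the role of
    "modifier-2", given as plain field/value pairs.
    """
    touched = {field for fields in update_ops.values() for field in fields}
    overlap = touched & set(defaults)
    if overlap:
        # MongoDB rejects an update that targets the same path in two operators.
        raise ValueError("fields present in both modifiers: %s" % sorted(overlap))

    update = dict(update_ops)
    if defaults:
        # $setOnInsert only takes effect when the upsert inserts a new document.
        update["$setOnInsert"] = dict(defaults)
    return collection.update_one(query, update, upsert=True)
```

A call might look like upsert_with_defaults(coll, {"_id": key}, {"$inc": {"hits": 1}}, {"created": now}), where coll, key and now are placeholders. The server performs the find, the default population and the update as one atomic operation on the document, so the race check from the pseudocode is not needed in this variant.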

Related

Write operation during a long cursor operation

I use MongoDB 2.4 with a single DB.
I find all items in a collection (50,000+) and insert each one into another collection.
it = coll1.find()
while (it.hasNext()) {
coll2.save(it.next())
}
Is it a performance issue to do intensive writes while a cursor is open on the same database?
This essentially comes down to a question about concurrency ( http://docs.mongodb.org/manual/faq/concurrency/ ): whether reads can perform well under a single database-level, writer-greedy lock while you generate a write-intensive load.
MongoDB should be able to juggle your read lock with the write lock quite well here, interleaving operations and yielding the current operation under certain conditions that it sees fit to keep performance up (see the link supplied above).
This is, of course, in contrast to SQL, where read and write operations are isolated; as such, MongoDB's concurrency rules actually break the I in ACID. Of course, in SQL the lock is much more granular, so you would normally get comparable performance.
If you do see a performance hit, mainly due to IO (remember that reading requires IO as well), you might find it prudent to batch your writes into groups of maybe 1000, taking about a 5-second break after each batch to let the IO subside.
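The batching suggestion above can be sketched roughly as follows. This is only an illustration in Python: copy_in_batches and its parameters are hypothetical names, the batch size and pause are the ballpark figures from the answer rather than tuned values, and the sleep function is injectable so the pause can be skipped in tests.

```python
import time
from itertools import islice


def copy_in_batches(source, write_batch, batch_size=1000, pause_seconds=5.0,
                    sleep=time.sleep):
    """Pass items from `source` to `write_batch` in groups, pausing between groups.

    `source` is any iterable (for example a cursor) and `write_batch` is any
    callable accepting a list of items. Returns the number of items written.
    """
    iterator = iter(source)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        if total:
            # Pause before every batch except the first, to let IO settle.
            sleep(pause_seconds)
        write_batch(batch)
        total += len(batch)
    return total
```

With PyMongo, something like copy_in_batches(coll1.find(), coll2.insert_many) would be one way to wire it up, assuming coll1 and coll2 are Collection objects.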
No, because cursors are not atomic. Each read is its own atomic transaction, so MongoDB is not burdened with ensuring that the cursor represents a single snapshot in time.

Why use locking in MongoDB?

MongoDB is from the NoSQL era, and isn't locking something related to RDBMSs? From Wikipedia:
Optimistic concurrency control (OCC) is a concurrency control method for relational database management systems...
So why do I find is_locked in PyMongo? Even in a driver that makes non-blocking calls, locks still exist: Motor has is_locked.
NoSQL does not automatically mean no locks.
There are always some operations that require a lock.
For example, building an index.
And the official MongoDB documentation is a more reliable source than Wikipedia (no offense meant to Wikipedia :) )
http://docs.mongodb.org/manual/faq/concurrency/
Mongo does in-place updates, so it needs to lock in order to modify the database. There are other things that need locks, so read the link #Tigra provided for more info.
This is pretty standard as far as databases go, and it isn't an RDBMS-specific thing (Redis also does this, but on a per-key basis).
There are plans to implement collection-level (instead of database-level) locking: https://jira.mongodb.org/browse/SERVER-1240
Some databases, like CouchDB, get around the locking problem by only appending new documents. They create a new, unique revision id and once the document is finished writing, the database points to the new revision. I'm sure there's some kind of concurrency control when changing which revision is used, but it doesn't need to block the database to do that. There are certain downsides to this, such as compaction needing to be run regularly.
MongoDB implements a Database level locking system. This means that operations which are not atomic will lock on a per database level, unlike SQL whereby most techs lock on a table level for basic operations.
In-place updates only occur with certain operators, $set being one of them. The MongoDB documentation used to have a page that listed all of them, but I can't find it now.
MongoDB currently implements a read/write lock whereby each is separate but they can block each other.
Locks are utterly vital to any database, for example, how can you ensure a consistent read of a document if it is currently being written to? And if you write to the document how do you ensure that you only apply that single update at once and not multiple updates at the same time?
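As a toy illustration of the lost-update problem described above, here is a small in-process sketch using Python threads. It is only an analogy and says nothing about how MongoDB implements its locks; the Counter class and run_writers helper are hypothetical names.

```python
import threading


class Counter:
    """A shared value whose read-modify-write step is guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write below,
        # so no increment can overwrite another.
        with self._lock:
            current = self.value
            self.value = current + 1


def run_writers(counter, writers=4, increments=5000):
    """Start `writers` threads that each call increment() `increments` times."""

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(writers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

With the lock in place the final value always equals writers * increments. Without it, two threads could read the same current value and one of the two updates would be silently lost.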
I am unsure how version control can prevent this in CouchDB. Locks are vital for a consistent read and are separate from version control; for example, what if you wish to apply a read lock to the same version, or read a document that is currently being written to a new revision? You will obviously see a lock queue appear. Even though version control might help a little with write lock saturation, there will still be a write lock, and it will still need to operate at some level.
As for concurrency features: MongoDB has the ability (for one), if the data is not in RAM, to yield an operation in favour of other operations. This means that locks will not just sit there waiting for data to be paged in, and other operations will run in the meantime.
As a side note, MongoDB actually has more locks than this. It also has a JavaScript lock, which is global and blocking; it does not have the normal concurrency features of regular locks.
"and even in driver that makes non-blocking calls"
Hmm, I think you might be confused about what is meant by a "non-blocking" application or server: http://en.wikipedia.org/wiki/Non-blocking_algorithm

How to get high performance with a large transaction (PostgreSQL)

I have about 2 million rows that need to be inserted into PostgreSQL, but performance has been poor. Can I get faster inserts by splitting the large transaction into smaller ones (actually, I don't want to do this)? Or is there any other sensible solution?
No, the main way to make it much faster is to do all the inserts in one transaction. Multiple transactions, or using no transaction, is much slower.
And try to use COPY, which is even faster: http://www.postgresql.org/docs/9.1/static/sql-copy.html
If you really have to use inserts, you can also try dropping all indexes on this table, and creating them after loading the data.
This can be interesting as well: http://www.postgresql.org/docs/9.1/static/populate.html
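As a rough illustration of the COPY route, here is a sketch in Python. It assumes the psycopg2 driver (whose cursors offer copy_expert) and a connection opened elsewhere; the helper names are hypothetical, and the table and column names are interpolated directly, so they must come from trusted code rather than user input.

```python
import csv
import io


def rows_to_copy_buffer(rows):
    """Serialise rows as CSV text into an in-memory file positioned at the start."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


def bulk_load(conn, table, columns, rows):
    """Load all rows with a single COPY statement inside one transaction.

    `conn` is assumed to be a psycopg2 connection (hypothetical setup,
    not shown here).
    """
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    with conn.cursor() as cur:
        cur.copy_expert(sql, rows_to_copy_buffer(rows))
    # One commit for the whole load, matching the single-transaction advice.
    conn.commit()
```

For data that does not fit comfortably in memory, the same idea works with a file object opened on disk instead of the StringIO buffer.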
Possible methods to improve performance:
Use the COPY command.
Try to decrease the isolation level for the transaction if your data can deal with the consequences.
Tweak the PostgreSQL server configuration. The default memory limits are very low and will cause disk thrashing even on a server with gigabytes of free memory.
Turn off disk barriers (e.g. nobarrier flag for the ext4 file system) and/or fsync on the PostgreSQL server. Warning: this is usually unsafe but will improve your performance a lot.
Drop all the indexes on your table before inserting the data. Some indexes require quite a lot of work to keep up to date while rows are added. PostgreSQL may be able to create the indexes faster at the end instead of continuously updating them in parallel with the insertion process. Unfortunately, there's no simple way to "save" the current indexes and later restore/create the same indexes again.
Splitting the insert job into a series of smaller transactions will help only if you have to retry the transaction because of data dependency issues with parallel transactions. If the transaction succeeds on the first try, splitting it into several smaller transactions run in sequence will only decrease your performance.
In my experience you CAN improve INSERT time-to-completion by splitting a large transaction into smaller ones, but only if the table you are inserting to has NO indexes or constraints applied, and NO default field values that would have to contend for a shared resource under multiple concurrent transactions. In that case, splitting the insert into several distinct parts and submitting each concurrently as separate processes will complete the job in significantly less time.

Update or delete: which is faster?

I am using MongoDB for an application that requires a high frequency of reads, writes and updates.
I am just concerned about the update and delete functions: which of the two is faster? I am indexing the collection on one attribute. Update and delete both fulfil my purpose, but I am not sure which one performs better.
I would suggest that rather than deciding on whether you use Update or Delete for your solution, you look more on the SafeMode attribute.
SafeMode.True indicates that you are expecting a response from the server that will contain among other things, a confirmation of whether the command succeeded or failed. This option blocks the execution until you receive a response from the server.
SafeMode.False will not expect any response; it is basically an optimistic command. You expect it to work, but have no way to confirm it. Because execution does not block waiting for a response, you gain performance: all you need to do is send the request.
Now you need to consider that deletes will free up space on the server, but you will lose the history and traceability of the data. Updates will allow you to keep historic entries, but you will need to make sure your queries exclude the 'marked for deletion' entries.
It is obviously up to you to find whether a Delete or Update is better, but I think the focus should be on whether you use SafeMode true or false to improve performance.
A rather odd question but here are the things you can base your decision on :
Deleting will keep the collection at an optimum size. Updating (I assume you mean something like setting a deleted flag to true) will result in an ever growing collection which eventually will make things slower.
In-place updates (updates that do not require the document to be moved because it grew in size) are always faster than updates or deletes that require documents to be (re)moved.
Safe = false writes will significantly improve the throughput of updates and deletes, at the expense of not being able to check whether the update/remove was successful.

MongoDB: should I always use the 'safe' option on updates?

When dealing with MongoDB, when should I use {safe: true} on queries?
Right now I use the 'safe' option just to check whether my queries were inserted or updated successfully. However, I feel this might be overkill.
Should I assume that 99% of the time my queries (assuming they are properly written) will be inserted/updated, and not worry about checking whether they succeeded?
Thoughts?
Assuming when you say queries you actually mean writes/inserts (the wording of your question makes me think this) then the Write Concern (safe, none, fsync, etc) can be used to get more speed and less safety when that is acceptable, and less speed and more safety when that is necessary.
As an example, a hypothetical Facebook-style application could use an unsafe write for "Likes" while it would use a very safe write for password changes. The logic behind this is that there will be many thousand "Like"-style updates happening a second, and it doesn't matter if one is lost, whereas password updates happen less regularly but it is essential that they succeed.
Therefore, try to tailor your Write Concern choice to the kind of update you are doing, based upon your speed and data integrity requirements.
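One way to picture this tailoring is a small lookup from the kind of operation to write concern options. The sketch below is only illustrative: the operation names and the chosen levels are hypothetical, and it uses the w and j option names of current drivers rather than the older 'safe' flag.

```python
# Hypothetical operation categories mapped to write concern options.
WRITE_CONCERNS = {
    # Fire-and-forget: the driver does not wait for an acknowledgement.
    "like": {"w": 0},
    # Acknowledged by a majority of replica set members and journaled.
    "password_change": {"w": "majority", "j": True},
}

# Fallback for anything not listed: acknowledged by the primary only.
DEFAULT_WRITE_CONCERN = {"w": 1}


def write_concern_for(operation):
    """Return a fresh dict of write concern options for the given operation kind."""
    return dict(WRITE_CONCERNS.get(operation, DEFAULT_WRITE_CONCERN))
```

With PyMongo, these options could be passed on as collection.with_options(write_concern=WriteConcern(**write_concern_for("like"))), assuming WriteConcern is imported from pymongo and collection is an existing Collection.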
Here is another use case where unsafe writes are an appropriate choice: you are making a large number of writes in very short order. In this case you might perform a number of writes and then call getLastError to see if any of them failed.
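collection.setWriteConcern(WriteConcern.NORMAL);
collection.getDB().resetError();
// importData: the List<Something> to import, assumed to be populated beforehand
for (Something data : importData) {
    // unacknowledged insert: returns without waiting for the server's reply
    collection.insert(makeDBObject(data));
}
// check for an error, waiting for replication, and throw if one is reported
collection.getDB().getLastError(WriteConcern.REPLICAS_SAFE).throwOnError();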
If this block succeeds without an exception, then all of the data was inserted successfully. If there was an exception, then one or more of the write operations failed, and you will need to retry them (or check for a unique index violation, etc). In real life, you might call getLastError every 10 writes or so, to avoid having to resubmit lots of requests.
This pattern is very nice for performance when performing bulk inserts of large amounts of data.
Safe is only necessary on writes, not reads. Queries are only reads.