Retrying or recovering after failed IndexWriter.Commit() in Lucene.NET

Our Lucene.NET index is located on a remote machine, accessible over a UNC path. For performance reasons (and following what appears to be Lucene.NET best practice), IndexWriter is not Commit()ed after each document modification, but rather once every 30 seconds.
Now, sometimes the network fails and Commit() errors out with an exception. I know that Lucene.NET is "fully ACID", and as such these failures do not corrupt the index itself. What worries me is that not-yet-committed documents are lost.
Is there any recommended way of dealing with this? Can I retry IndexWriter.Commit() in hopes that network connectivity is restored? Or should I buffer documents in RAMDirectory and then merge these into FSDirectory, with retry semantics? Or something else entirely?

In my implementation, I use an Oracle table. When a document is created, a row is added to the table with a value indicating it is not indexed. After the IndexWriter commit succeeds, I update the table to indicate it is indexed (along with some other data like indexed_date, etc.). That way, if there is any kind of failure, the document will be found and indexed (or possibly re-indexed) when the system or connectivity is restored. The table also opens up all kinds of reporting and audit capability that would otherwise not be available.
This might not be an option for you. Buffering documents locally before handing them to the IndexWriter would also work. I'm not sure why you would need retry semantics if you only look to the local buffer for docs to index. I think you just have to make sure you delete a document from the local buffer after the commit succeeds, so you don't keep indexing it forever. ;^)
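For illustration, here is a minimal C# sketch of that pattern -- not the poster's actual code -- assuming an already-constructed Lucene.NET IndexWriter. The BufferedIndexer name and its bookkeeping are made up; the point is that the pending set is only cleared after Commit() succeeds.

    // Sketch only: documents are remembered locally until a commit succeeds, so a
    // failed Commit() simply leaves them pending for the next 30-second tick (or for
    // replay into a fresh writer). An external table, as described above, can play
    // the same role as the in-memory dictionary used here.
    using System;
    using System.Collections.Generic;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    public class BufferedIndexer
    {
        private readonly IndexWriter _writer;
        // Documents handed to the writer but not yet known to be durable.
        private readonly Dictionary<string, Document> _pending = new Dictionary<string, Document>();

        public BufferedIndexer(IndexWriter writer) => _writer = writer;

        public void Add(string id, Document doc)
        {
            _writer.AddDocument(doc);   // buffered by Lucene, not yet durable
            _pending[id] = doc;         // remembered locally until a commit succeeds
        }

        // Called every 30 seconds. On success the pending set is cleared; on failure
        // it is kept so the documents can be re-indexed after recovery (for example
        // by replaying them into a fresh writer once the UNC share is back).
        public bool TryCommit()
        {
            try
            {
                _writer.Commit();
                _pending.Clear();       // safe to forget (or mark "indexed" in a table)
                return true;
            }
            catch (Exception ex)        // typically an IOException from the network share
            {
                Console.WriteLine($"Commit failed, will retry on next tick: {ex.Message}");
                return false;
            }
        }

        // Everything that is not yet durable, for replay after repeated failures.
        public IEnumerable<KeyValuePair<string, Document>> Pending => _pending;
    }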

Related

Manual locking for MongoDB

I have these operations:
Find a doc from collection.
Manipulate doc.prop based on its current value ("prop" is a string).
Update doc back to collection.
So in this case, I have to make sure these operations are atomic, because the update of doc.prop must be based on its current value.
Here are two approaches:
1. Add "valueKey"(Number) property in doc, make sure valueKey is matched when updating doc. Increase valueKey after updated. If valueKey is not matched, mark this update as failure and retry again.
2. Use "fsyncLock" provided by MongoDB to lock the whole mongod instance, during the operations.
The 1st approach I mentioned above is well, but when facing huge volume of these operations at the same time, the "failure" and "retry" would be frequent.
The 2nd approach, which I haven't tried, I think it is for backing up database and is not good in this case.
So I'm wondering is there any other efficient approach?
The first approach is called an optimistic lock. Optimistic locks assume that the probability of collision is low, otherwise, as you already pointed out, there are a lot of retries. Those retries can also be destructive - if a text is edited, it might make sense to merge the edits, but it hardly ever makes sense for a phone number.
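For illustration, here is a minimal sketch of such an optimistic lock with the MongoDB .NET driver. The "prop" and "valueKey" names come from the question; the helper itself, the delegate and the retry limit are made up.

    // Sketch only: read the document, compute the new prop from the value just read,
    // and write back only if valueKey is unchanged; otherwise retry with fresh data.
    using MongoDB.Bson;
    using MongoDB.Driver;

    public static class OptimisticUpdate
    {
        public static bool UpdateProp(IMongoCollection<BsonDocument> coll,
                                      ObjectId id,
                                      System.Func<string, string> manipulate,
                                      int maxRetries = 5)
        {
            for (int attempt = 0; attempt < maxRetries; attempt++)
            {
                // 1. Find the doc.
                var doc = coll.Find(Builders<BsonDocument>.Filter.Eq("_id", id)).FirstOrDefault();
                if (doc == null) return false;

                var currentProp = doc["prop"].AsString;
                var currentKey = doc["valueKey"].AsInt32;

                // 2. Manipulate doc.prop based on the value we just read.
                var newProp = manipulate(currentProp);

                // 3. Update only if valueKey still matches (nobody raced us),
                //    and increment valueKey so concurrent writers will notice.
                var filter = Builders<BsonDocument>.Filter.Eq("_id", id)
                           & Builders<BsonDocument>.Filter.Eq("valueKey", currentKey);
                var update = Builders<BsonDocument>.Update
                             .Set("prop", newProp)
                             .Inc("valueKey", 1);

                if (coll.UpdateOne(filter, update).ModifiedCount == 1)
                    return true;        // our write won; done

                // Someone else updated the doc in between: loop and retry.
            }
            return false;               // still colliding after maxRetries attempts
        }
    }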
Locking the entire database is an extreme form of a pessimistic (offline) lock, where the concurrency of the system is deliberately reduced. However, that has problems because clients don't know what's going on - their edits will simply fail, which is the worst kind of user experience.
So pessimistic locks really only make sense if clients have a chance of actually knowing that something is locked. For instance, you'd somehow need to inform the user that it's not possible to edit the item she wants to edit, because someone else is already in edit mode for that item. This also has problems, especially if another user has left the screen open and keeps blocking all other users.
If you wanted to go for a pessimistic lock, however, that should absolutely never be implemented by something like a global database lock: simply lock the item itself and implement the business logic for the locking in your code.
Moral: This isn't a technology problem, it's a logical problem. Google Docs demonstrates one way to allow concurrent editing by multiple users, but it's hard to implement, has limited use in other types of applications and is still deemed annoying by some users. Git and the like show another approach, where the logic of branches, merging and conflicts is exposed to the user as well, but asynchronously (multi-version concurrency control).

Google Cloud Storage transactions?

It does not appear that GCS has any transaction mechanism. Is this correct?
I would like to be able to have a long-lived transaction. For example, it would be great if I could start a transaction and specify an expiration time (if not committed within X time it automatically gets rolled back). Then I could use this handle to insert objects, compose, delete, etc., and if all goes well, issue an isCommitPossible(), and if yes, then commit().
Is this a possibility?
Object writes are transactional (either the complete object and its metadata are successfully written and the object becomes visible; or it fails without becoming visible). But there's no transaction mechanism spanning multiple GCS operations.
Mike
The Cloud Storage client libraries offer a file-like object to work with, which has an Open() and Close() operation. If a single operation can be transactional then, in theory, it should be possible to open a single "lock file" for the duration of all other operations, only closing it when you're done with all the other files.
In other words, you would have to write your processes to use a "lock file" and, in that way, you could, at the least, know whether or not all your files were written/read or if there was some error. Whenever the next round of operations takes place, it would just look for the existence of the lock file that corresponds to the set of files written (you'd have to arrange your naming, directory layout, etc, to have it make sense for this). If it exists, we can assume that the file group was written successfully. If it doesn't exist, assume that something happened (or that the process hasn't yet completed).
I have not actually tested this out. But I offer it as an idea for others who might be desperate enough to try.
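For anyone who wants to try it, here is one possible sketch in C# with the Google.Cloud.Storage.V1 client (the idea itself is language-agnostic; the bucket and marker names are made up). The IfGenerationMatch = 0 precondition asks GCS to create the marker only if it does not already exist, which keeps the marker write atomic.

    // Sketch only: write a "_SUCCESS" marker after a group of files has been written,
    // and have readers check for the marker before trusting the group.
    using System;
    using System.IO;
    using System.Text;
    using Google.Cloud.Storage.V1;

    public static class GcsSuccessMarker
    {
        private const string Bucket = "my-bucket";          // assumption: your bucket name

        // Called by the writer once all files of the group are uploaded.
        public static void WriteMarker(StorageClient client, string groupName)
        {
            var content = Encoding.UTF8.GetBytes(DateTime.UtcNow.ToString("o"));
            using (var stream = new MemoryStream(content))
            {
                client.UploadObject(
                    Bucket,
                    $"{groupName}/_SUCCESS",                 // marker naming is up to you
                    "text/plain",
                    stream,
                    new UploadObjectOptions { IfGenerationMatch = 0 });  // only if absent
            }
        }

        // Called by readers: if the marker is missing, the group is incomplete or failed.
        public static bool GroupLooksComplete(StorageClient client, string groupName)
        {
            try
            {
                client.GetObject(Bucket, $"{groupName}/_SUCCESS");
                return true;
            }
            catch (Google.GoogleApiException)                // e.g. 404 when it is missing
            {
                return false;
            }
        }
    }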

MongoDB: Switch database/collection referenced by a given name on the fly

My application needs only read access to all of its databases. One of those databases (db_1) hosts a collection coll_1 whose entire contents* need to be replaced periodically**.
My goal is to have no or very little effect on read performance for servers currently connected to the database.
Approaches I have been able to come up with so far:
1. renameCollection
Build a temporary collection coll_tmp, then use renameCollection with dropTarget: true to move its contents over to coll_1. The downside of this approach is that as far as I can tell, renameCollection does not copy indexes, so once the collection is renamed, coll_1 would need reindexing. While I don't have a good estimate of how long this would take, I would think that query-performance will be significantly affected until reindexing is complete.
2. TTL Index
Instead of straight up replacing, use a time-to-live index to expire documents after the chosen replacement period. Insert new data every time period. This seems like a decent solution to me, except that for our specific application, old data is better than no data. In this scenario, if the cron job to repopulate the database fails for whatever reason, we could potentially be left with an empty coll_1 which is undesirable. I think this might have a negligible effect, but this solution also requires on-the-fly indexing as every document is inserted.
3. Communicate current database to read-clients
Simply use two different databases (or collections?) and inform connected clients which one is more recent. This solution would allow for finishing indexing the new coll_1_alt (and then coll_1 again) before making it available. I personally dislike the solution since it couples the read clients very closely to the database itself, and of course communication channels are always imperfect.
4. copyDatabase
Use copyDatabase to rename (designate) an alternate database db_tmp to db_1. db_tmp would also have a collection coll_1. Once reindexing is complete on db_tmp.coll_1, copyDatabase could be used to simply rename db_tmp to db_1. It seems that this would require dropping db_1 before renaming, leaving a window in which the data won't be accessible.
Ideally (and naively), I'd just set db_1 to be something akin to a symlink, switching to the most current database as needed.
Does anyone have good suggestions on how to achieve the desired effect?
*There are about 10 million documents in coll_1.
** The current plan is to replace the collection once every 24 hours. The replacement interval might get as low as once every 30 minutes, but not lower.
The problem that you point out in option 4 you will also have with option 1: dropTarget also means that the collection is not available.
Another alternative could be to just keep both the old and the new data in the same collection, and use a "version ID" that you then still have to communicate to your clients so they can include it in their queries. That at least saves you from the reindexing you pointed out for option 1.
I think your best bet is actually option 3, and it's the closest equivalent to changing a symlink, except that the switch happens on the client side.
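For illustration, a rough sketch of that "version ID" variant with the MongoDB .NET driver (the meta collection, field names and clean-up strategy here are all made up):

    // Sketch only: a tiny meta document records which version of the data is current.
    // The loader bulk-inserts documents tagged with the new version, then flips the
    // pointer in one atomic update; readers resolve the pointer (and can cache it)
    // before querying coll_1.
    using MongoDB.Bson;
    using MongoDB.Driver;

    public static class VersionedCollection
    {
        // Writer side: called after the new batch is fully inserted (its index
        // entries are built as the documents are inserted).
        public static void PublishVersion(IMongoDatabase db, int newVersion)
        {
            var meta = db.GetCollection<BsonDocument>("coll_1_meta");
            meta.UpdateOne(
                Builders<BsonDocument>.Filter.Eq("_id", "current"),
                Builders<BsonDocument>.Update.Set("version", newVersion),
                new UpdateOptions { IsUpsert = true });

            // Old documents can now be removed lazily.
            var data = db.GetCollection<BsonDocument>("coll_1");
            data.DeleteMany(Builders<BsonDocument>.Filter.Lt("version", newVersion));
        }

        // Reader side: resolve the current version, then add it to every query filter.
        public static long CountCurrentDocs(IMongoDatabase db)
        {
            var meta = db.GetCollection<BsonDocument>("coll_1_meta");
            var current = meta.Find(Builders<BsonDocument>.Filter.Eq("_id", "current"))
                              .FirstOrDefault();
            if (current == null) return 0;

            var data = db.GetCollection<BsonDocument>("coll_1");
            return data.CountDocuments(
                Builders<BsonDocument>.Filter.Eq("version", current["version"].AsInt32));
        }
    }

Whether this beats option 3 depends on how a compound index on the version field plus your usual query fields performs, and on how quickly the previous version can be deleted without hurting read latency.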

Update or Delete: which is faster?

I am using MongoDB for an application. This application requires a high frequency of reads, writes and updates.
I am just concerned about the update and delete functions: which of the two is faster? I am indexing the collection on one attribute. Update and Delete both fulfil my purpose, but I am not sure which one is the better choice and has better performance.
I would suggest that rather than deciding on whether you use Update or Delete for your solution, you look more on the SafeMode attribute.
SafeMode.True indicates that you are expecting a response from the server that will contain among other things, a confirmation of whether the command succeeded or failed. This option blocks the execution until you receive a response from the server.
SafeMode.False will not expect any response; it is basically an optimistic command. You expect it to work, but have no way to confirm that it did. Since you do not wait for a response, execution is not blocked, and you gain performance because all you need to do is send the request.
Now you need to consider that deletes will free up space on the server, but you will lose history and traceability of the data. Updates will allow you to keep historic entries, but you will need to make sure your queries exclude the 'marked for deletion' entries.
It is obviously up to you to find whether a Delete or Update is better, but I think the focus should be on whether you use SafeMode true or false to improve performance.
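To make that trade-off concrete, here is a small sketch with the current MongoDB .NET driver, where the legacy SafeMode flag has become a write concern (SafeMode.True roughly corresponds to WriteConcern.Acknowledged, SafeMode.False to WriteConcern.Unacknowledged); the collection and field names are made up.

    // Sketch only: the same delete/update issued with and without acknowledgement.
    using MongoDB.Bson;
    using MongoDB.Driver;

    public static class WriteConcernDemo
    {
        public static void Run(IMongoCollection<BsonDocument> coll, ObjectId id)
        {
            var filter = Builders<BsonDocument>.Filter.Eq("_id", id);

            // "SafeMode.True": wait for the server's response, so you can verify
            // how many documents were actually affected.
            var acknowledged = coll.WithWriteConcern(WriteConcern.Acknowledged);
            var deleteResult = acknowledged.DeleteOne(filter);
            if (deleteResult.DeletedCount == 0)
            {
                // The server confirmed nothing was deleted -- we only know this
                // because we waited for the acknowledgement.
            }

            // "SafeMode.False": fire-and-forget. Faster, because nothing waits for a
            // response, but the affected-document counts are not available and
            // failures go unnoticed.
            var fireAndForget = coll.WithWriteConcern(WriteConcern.Unacknowledged);
            fireAndForget.UpdateOne(filter,
                Builders<BsonDocument>.Update.Set("deleted", true));   // soft delete
        }
    }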
A rather odd question, but here are the things you can base your decision on:
Deleting will keep the collection at an optimum size. Updating (I assume you mean something like setting a deleted flag to true) will result in an ever-growing collection, which eventually will make things slower.
In-place updates (updates that do not result in the document having to be moved due to an increase in size) are always faster than updates or deletes that require documents to be (re)moved.
Safe = false writes will significantly improve the throughput of updates and deletes, at the expense of not being able to check whether the update/remove was successful.

When/Where is the best-practices time/place to configure a MongoDB "schema"?

In an app that uses MongoDB, when/where is the best place to make database changes that would be migrations in a relational database?
For example, how should creating indexes or setting shard keys be managed? Where should this code go?
It's probably best to do this in the shell, consciously(!), because you could cause havoc if you accidentally start such a command at the wrong moment and on the wrong instance.
Most importantly: do this offline on an extra slave instance if you add an index on an existing DB! For large data sets, building an index can take hours, even days!
see also:
http://www.mongodb.org/display/DOCS/Indexes
http://www.javabeat.net/articles/print.php?article_id=353
http://www.mongodb.org/display/DOCS/Indexing+as+a+Background+Operation
http://nosql.mypopescu.com/post/1312926692/mongodb-indexes-and-indexing
If you have a large data set, make sure to read up on the Foursquare outage last year:
http://www.infoq.com/news/2010/10/4square_mongodb_outage
http://blog.foursquare.com/2010/10/05/so-that-was-a-bummer/
http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html
One of the main reasons for not wanting to put indexing in a script or config file of some sort is that in MongoDB the index operation is blocking(!) -- that means MongoDB will stop other operations on the database from proceeding until the indexing is completed. Just imagine an innocent change in the code that requires a new index to improve performance -- and this change is carelessly checked in and deployed to production... and suddenly your production MongoDB is freezing up for your app server, because MongoDB is internally adding the new index before doing anything else... ouch! Apparently that has happened to a couple of folks; that's why they keep reminding people at the MongoDB conferences to be careful not to 'programmatically' require indexes.
Newer versions of MongoDB allow background indexing -- you should always use that, e.g. db.yourcollection.ensureIndex(..., {background: true}); a driver-level sketch follows at the end of this answer.
otherwise, not-so-fun stuff happens:
https://jira.mongodb.org/browse/SERVER-1341
https://jira.mongodb.org/browse/SERVER-3067
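For completeness, here is the shell one-liner above expressed as a sketch with the MongoDB .NET driver (collection and field names are made up). Background = true corresponds to {background: true}; on recent server versions (4.2+) the flag is effectively ignored because all index builds use an improved build process, but it is harmless to set.

    // Sketch only: create the index with the background option from a controlled
    // migration/deployment step rather than from application start-up code.
    using MongoDB.Bson;
    using MongoDB.Driver;

    public static class IndexSetup
    {
        public static void EnsureIndexes(IMongoDatabase db)
        {
            var coll = db.GetCollection<BsonDocument>("yourcollection");

            var keys = Builders<BsonDocument>.IndexKeys.Ascending("someField");
            var options = new CreateIndexOptions { Background = true };

            // Creating an index that already exists with the same options is a no-op,
            // so this is safe to re-run.
            coll.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(keys, options));
        }
    }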