What happens during Commit while using Lucene NRT - lucene.net

We're using Lucene.NET 2.9.2, and would like to move over to the Near Realtime functionality of Lucene.
We get the IndexReader from the IndexWriter (thus using NRT). My understanding is that, when used this way, the IndexReader will also return search results for documents that have been added but not yet committed (we check whether the IndexReader is current and reopen it if it's not).
Let's say I've added 50 documents and decide to commit them to the index, and let's say the documents are big, so committing takes 5 seconds.
What happens during these 5 seconds if a new search comes in? Will the internal RAMDirectory hold on to those 50 documents until the commit is complete, or could those 50 documents be unsearchable for 5 seconds?

You will still be able to search those documents. Your reader keeps pointing at the uncommitted view of the index, exactly as if you had never called Commit. Once the commit has finished, IsCurrent will indicate that you need a new reader.
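For reference, a minimal sketch of the pattern in the Java Lucene 2.9 API (Lucene.NET 2.9.2 mirrors it with Pascal-cased names); the field names and the single-threaded flow are just for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // NRT reader obtained from the writer: it sees added documents
        // whether or not they have been committed yet.
        IndexReader reader = writer.getReader();

        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);

        // The old reader is stale now; reopen() returns a fresh view that
        // already includes the uncommitted document.
        if (!reader.isCurrent()) {
            IndexReader newReader = reader.reopen();
            if (newReader != reader) {
                reader.close();
                reader = newReader;
            }
        }

        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("id", "1")), 10);
        // hits.totalHits is 1 even though nothing has been committed.

        // A commit, however long it takes, does not change what this reader
        // sees; searches against it keep returning the same documents until
        // the reader is reopened again.
        writer.commit();

        searcher.close();
        reader.close();
        writer.close();
    }
}
```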

Related

Inserts in transactions visible in Studio3T before commit

This is a pursuit of knowledge.
I have a simple example web API written in C# that inserts data into MongoDB in a transaction. By calling a certain endpoint I perform several inserts with a delay between them (for example 10 inserts, one by one, with a delay of 1 second between each insert). After the last insert I commit the transaction.
At the same time I check the total count of existing documents with my other endpoint. There is no surprise here: if I request inserting 10 documents, my document-count endpoint returns +10 only after the 10th document is inserted (and the commit executes).
However, if I keep checking the total count of the documents with Studio3T, it increases gradually with each executed insert. If I cancel the request, the count in Studio3T returns to the pre-request value. The screenshot below will hopefully explain it better:
So I have at least 3 questions that I could not find an answer to:
What setting in Studio3T allows for these dirty reads?
What do I need to set in my client to actually allow for dirty reads?
What do I need to set in my session/transaction options to block Studio3T (or any other client) from performing these dirty reads?
One last thing that is also curious: if I stop the app after several inserts but without committing, Studio3T keeps seeing the inserted documents (at least it counts them) for about a minute. I imagine there must be some kind of time-to-live for a session, but I did not find anything in the docs. I am sure I probably missed it, but if anyone knows where this behaviour is described, I would be grateful for a pointer in the right direction.
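For reference, a minimal sketch of the setup described above, written against the MongoDB Java driver rather than the C# driver the web API actually uses; database, collection, and field names are made up, and a replica set (required for transactions) is assumed:

```java
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DelayedCommitSketch {
    public static void main(String[] args) throws InterruptedException {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("items");

            try (ClientSession session = client.startSession()) {
                session.startTransaction();
                for (int i = 0; i < 10; i++) {
                    // Each insert is sent to the server immediately, but it
                    // only becomes durable as part of the transaction once
                    // commitTransaction() succeeds.
                    coll.insertOne(session, new Document("n", i));
                    Thread.sleep(1000);
                }
                session.commitTransaction();
                // If the app dies before commitTransaction(), the server
                // eventually aborts the open transaction; the server-side
                // transactionLifetimeLimitSeconds setting (60 seconds by
                // default) may explain the ~1 minute window observed above.
            }
        }
    }
}
```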

Update all documents in collection atomically. What if one of them changes during the transaction?

I'm trying to perform a Mongo transaction that will update all documents in a collection. The problem is that the documents change extremely frequently (say multiple times per second), and I risk sometimes not being able to commit the transaction because one document changed.
I could retry after this happens, but what if at least one document keeps changing every time I retry?
Then the transaction will just keep retrying and never be committed.
How do I update atomically if the documents in the collection keep changing while the update is taking place?
I'm thinking that this is a pretty common problem and there must be a good solution, but I cannot seem to figure it out.
Thank you for your time.
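A minimal sketch of the retry approach mentioned above, assuming the MongoDB Java driver: ClientSession.withTransaction re-runs the body whenever the server labels an error as transient (which includes write conflicts with concurrent writers), retrying for a bounded amount of time before giving up. Database, collection, and field names are invented:

```java
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class UpdateAllSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("items");

            try (ClientSession session = client.startSession()) {
                // withTransaction re-runs the body on TransientTransactionError
                // (e.g. a write conflict), so the retry loop does not have to
                // be hand-rolled; it still gives up after a built-in time limit.
                long modified = session.withTransaction(() ->
                        coll.updateMany(session,
                                new Document(),                 // match all documents
                                Updates.set("migrated", true))  // invented field
                            .getModifiedCount());
                System.out.println("updated " + modified + " documents");
            }
        }
    }
}
```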

How does Morphia's skip deal with new records arriving during pagination?

I've read a lot about using skip with paging (and the related performance issues). For my application the performance issues are not a problem; however, it's not clear to me what happens with skip if new records arrive between page requests.
For example, let's say I have 10 records, a user requests a page of 5 and we deliver them. While the user is browsing the first page, another 5 records are inserted into the db, and the user then requests the next page of 5. Assuming we're sorting on id or date (newest first), will the user now get back the same 5 records (because, for the second page, skip skips the 5 newly added records and returns the next 5, which are the same records that were originally returned)?
You are correct. Both performance and correctness with added/removed entries are issues.
For a good explanation see http://use-the-index-luke.com/no-offset (Markus Winand has been fighting offset for years ;-) ).
Keyset pagination is, as far as I know, supported natively neither in MongoDB nor in Morphia, so you'll have to build it yourself (see the sketch below). Make sure you're always paginating on something unique (like date plus ID).
Other systems have implemented this feature natively, for example Elasticsearch with search_after.
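A minimal sketch of keyset pagination with the plain MongoDB Java driver (Morphia can express the same query); it sorts and pages on _id, and the collection name and page size are invented:

```java
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import org.bson.types.ObjectId;

public class KeysetPagingSketch {
    // Fetch the next page of 5 documents strictly after the last _id seen
    // on the previous page; null means "first page".
    static FindIterable<Document> nextPage(MongoCollection<Document> coll,
                                           ObjectId lastSeenId) {
        return coll
                .find(lastSeenId == null
                        ? new Document()
                        : Filters.gt("_id", lastSeenId))
                .sort(Sorts.ascending("_id"))
                .limit(5);
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("records");

            ObjectId lastSeenId = null;
            for (Document doc : nextPage(coll, null)) {
                lastSeenId = doc.getObjectId("_id");  // remember the key, not an offset
            }
            // The next call picks up strictly after lastSeenId, so newly
            // inserted rows can no longer shift the page boundaries the way
            // a growing skip offset does.
            nextPage(coll, lastSeenId);
        }
    }
}
```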

Mongo delete and insert vs update

I am using MongoDB version 3.0 and the Java driver. I have a collection with 100,000+ entries. Each day there will be approximately 500 updates and approximately 500 inserts, which should be done in a batch. I will get the updated documents with their old fields plus some new ones which I have to store. I don't know which fields are newly added, and for each field I maintain a summary statistic. Since I don't know what the changes are, I will have to fetch the records that already exist to see the difference between the updated ones and the new ones and set the summary statistics appropriately. So I wanted input on how this can be done efficiently.
Should I delete the existing records and insert them again, or should I update the 500 records? And should I consider doing 1,000 upserts if that has potential advantages?
Example use case
The initial record contains f=[185, 75, 186]. I will get the update request as f=[185, 75, 186, 1, 2, 3] for the same record. The summary statistics mentioned above store the counts of the IDs in f, so the counts for 1, 2, 3 will be increased and the counts for 185, 75, 186 will remain the same.
Upserts are used to add a document if it does not exist. So if you're expecting new documents then yes, set {upsert: true}.
In order to update your statistics I think the easiest way is to redo the statistics if you were doing it in mongo (e.g. using the aggregation framework). If you index your documents properly it should be fine. I assume that your statistics update is an offline operation.
If you weren't doing the statistics in mongo then you can add another collection where you can save the updates along with the old fields (of course you update your current collection too) so you will know which documents have changed during the day. At the end of the day you can just remove this temporary/log collection once you've extracted the needed information.
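A minimal sketch of such an upsert with the MongoCollection Java API (shown with a recent driver's MongoClients factory, though the updateOne/UpdateOptions calls exist in the 3.0-era driver as well); the _id value and the contents of f are taken from the example use case above:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.Arrays;

public class UpsertSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("records");

            // Replace f wholesale; with upsert(true) the document is
            // inserted if no record with this _id exists yet.
            coll.updateOne(
                    Filters.eq("_id", 42),
                    Updates.set("f", Arrays.asList(185, 75, 186, 1, 2, 3)),
                    new UpdateOptions().upsert(true));
        }
    }
}
```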
Mongo maintains a log of every change in the oplog.rs capped collection in the local database. We create a tailable cursor on oplog.rs, keyed on the timestamp, and every change in the database/collection is streamed through it. I believe this is the best way to identify changes in Mongo. You can certainly drop the document changes you have no interest in.
Further reading: http://docs.mongodb.org/manual/reference/glossary/#term-oplog
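A minimal sketch of such a tailable cursor on oplog.rs with the Java driver; in practice you would also filter on the ts timestamp to resume from a known position, which is omitted here, and a replica set is assumed (standalone servers have no oplog):

```java
import com.mongodb.CursorType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;

public class OplogTailSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            // The oplog is the capped collection "oplog.rs" in the "local" db.
            MongoCollection<Document> oplog =
                    client.getDatabase("local").getCollection("oplog.rs");

            // TailableAwait keeps the cursor open and blocks in hasNext()
            // until new entries are appended, so this loop runs indefinitely.
            try (MongoCursor<Document> cursor = oplog.find()
                    .cursorType(CursorType.TailableAwait)
                    .iterator()) {
                while (cursor.hasNext()) {
                    Document entry = cursor.next();
                    // "ns" is the namespace (db.collection), "op" the operation
                    // type (i = insert, u = update, d = delete); skip entries
                    // for collections you don't care about.
                    System.out.println(entry.getString("ns") + " " + entry.getString("op"));
                }
            }
        }
    }
}
```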

MongoDB update (add field to nearly every document) is very very slow

I am working on a MongoDB cluster.
One database is named bnccdb, with a collection named AnalysedLiterature that has about 7,000,000 documents in it.
For each document, I want to add two keys and then update this document.
I am using the Java client. I query each document, add both keys to the BasicDBObject, and then call save() to write the object back. The speed is so slow that it would take several weeks to update the whole collection.
I wonder whether the reason my update operation is so slow is that adding keys grows the documents, which causes document moves / disk block rearrangement in the background and makes the operation extremely time-consuming.
After I changed from save() to update() the problem remains. This is my status information:
From the output of mongostat it is obvious that the faults rate is very high, but I don't know what caused it.
Can anyone help me?
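One common alternative to per-document query + save() round trips, sketched here under the assumption that the two new keys can be computed without pulling the full documents back to the client: send the changes as $set updates in unordered bulk batches. Field names and values are invented placeholders:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class AddFieldsSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> coll =
                    client.getDatabase("bnccdb").getCollection("AnalysedLiterature");

            List<WriteModel<Document>> batch = new ArrayList<>();
            // Only project _id: there is no need to transfer whole documents
            // just to add two keys on the server side.
            for (Document doc : coll.find(Filters.exists("newKey1", false))
                                    .projection(new Document("_id", 1))) {
                batch.add(new UpdateOneModel<>(
                        Filters.eq("_id", doc.get("_id")),
                        Updates.combine(
                                Updates.set("newKey1", "value1"),
                                Updates.set("newKey2", "value2"))));
                if (batch.size() == 1000) {               // flush in chunks
                    coll.bulkWrite(batch, new BulkWriteOptions().ordered(false));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                coll.bulkWrite(batch, new BulkWriteOptions().ordered(false));
            }
        }
    }
}
```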