AzureDirectory with Lucene.Net and an Azure Worker Role

I am trying to implement a Lucene.Net index using AzureDirectory inside Azure blob storage.
The indexing process runs from an Azure worker role.
In the local Azure emulator, I can process ~3 million records into the index, and it is very fast to search.
Now I am trying to get it up into live Azure, and the worker role starts processing fine.
The problem I have is that after ~500,000 records or so, the worker role falls over and restarts.
I have exception handling, and I'm using diagnostics with trace statements throughout the code, in the exception handler, and in the OnStop event. The trace statements from the main code appear in the diagnostics table fine and give me a log of my records getting processed, but the trace statements from the exception handling and the OnStop never show up.
There is a lot of code to post, so I thought I'd start by asking whether anyone knows of any limitations around this type of Lucene.Net index with AzureDirectory?
EDIT:
I finally managed to get an exception by moving a little code around.
The index is running out of disk space and I get the exception below. Going to try and increase the space and will post back with results.
There is not enough space on the disk.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.FileStream.WriteCore(Byte[] buffer, Int32 offset, Int32 count)
   at Lucene.Net.Store.SimpleFSDirectory.SimpleFSIndexOutput.FlushBuffer(Byte[] b, Int32 offset, Int32 size)
   at Lucene.Net.Store.BufferedIndexOutput.Flush()
   at Lucene.Net.Store.BufferedIndexOutput.WriteBytes(Byte[] b, Int32 offset, Int32 length)
   at Lucene.Net.Store.RAMOutputStream.WriteTo(IndexOutput out_Renamed)
   at Lucene.Net.Index.StoredFieldsWriter.FinishDocument(PerDoc perDoc)
   at Lucene.Net.Index.DocumentsWriter.WaitQueue.WriteDocument(DocWriter doc)
   at Lucene.Net.Index.DocumentsWriter.WaitQueue.Add(DocWriter doc)
   at Lucene.Net.Index.DocumentsWriter.FinishDocument(DocumentsWriterThreadState perThread, DocWriter docWriter)
   at Lucene.Net.Index.DocumentsWriter.UpdateDocument(Document doc, Analyzer analyzer, Term delTerm)
   at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer)
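In case it helps anyone else hitting this: the disk that filled up appears to be the local cache directory that AzureDirectory flushes segments to, so "increasing the space" means giving that cache a bigger LocalStorage resource. A rough sketch only, not my exact code; it assumes the AzureDirectory constructor overload that accepts a cache Directory, and the resource, setting, and catalog names are placeholders:

// Namespaces (exact ones depend on the SDK/package versions in use):
// Lucene.Net.Analysis.Standard, Lucene.Net.Index, Lucene.Net.Store,
// Microsoft.WindowsAzure.ServiceRuntime, and the AzureDirectory package.

// Sketch: declare something like <LocalStorage name="LuceneCache" sizeInMB="5000" />
// in ServiceDefinition.csdef, then hand that folder to AzureDirectory as its cache.
var cachePath = RoleEnvironment.GetLocalResource("LuceneCache").RootPath;
var cacheDirectory = FSDirectory.Open(new System.IO.DirectoryInfo(cachePath));

var account = CloudStorageAccount.Parse(
    RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
var azureDirectory = new AzureDirectory(account, "MyIndexCatalog", cacheDirectory);

using (var writer = new IndexWriter(azureDirectory,
    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
    IndexWriter.MaxFieldLength.UNLIMITED))
{
    // AddDocument calls go here; the downloaded/flushed segments now live
    // under the LuceneCache resource instead of the small default temp folder.
}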
Final Update
So I now have my indexer indexing 3.3 million rows of data in approximately 5 minutes.
I have gone back to RAM-based storage and slightly reduced the data being indexed: my documents had 3 fields, now reduced to 2.
Searching the index from an Azure web role is lightning fast as well.
I have taken on board everyone's comments and will be monitoring this over the next month or so. I'll be interested to see how it performs under load.
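For anyone following along, the RAM-first pattern is roughly the shape below. This is a sketch only, not my exact code: it assumes the finished index still ends up in the blob-backed AzureDirectory (the azureDirectory from the earlier sketch) via Lucene.Net's static Directory.Copy helper, and the record source and the two field names are placeholders.

// Build the whole index in memory first, then copy it to blob storage in one pass.
var ramDirectory = new RAMDirectory();

using (var writer = new IndexWriter(ramDirectory,
    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
    IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var record in records)   // placeholder: the 3.3 million source rows
    {
        var doc = new Document();
        doc.Add(new Field("Id", record.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("Text", record.Text, Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }
    writer.Commit();
}

// Persist the in-memory index to the blob-backed directory (assumes the
// static Directory.Copy helper available in Lucene.Net 3.0.x).
Lucene.Net.Store.Directory.Copy(ramDirectory, azureDirectory, true);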

I posted this before, but...
It's not going to work for production environments. Here is my answer on why not and what you can do: How to implement Lucene .Net search on Azure webrole
I should add that running your own Azure VM has advantages, since you can stripe disks for extra I/O performance (important during indexing and when searching outside RAM).
Here is another answer that might help, but I don't agree with the approach: https://azuredirectory.codeplex.com/discussions/402913
Edit: I should clarify that when I say "work" I mean work in a production environment.

I implemented my version of AzureDirectory here
Maybe it will help you. You will always eventually run out of RAM with RAMDirectory; it's just a question of the number of documents.

Related

What could be causing high CPU in Google Spanner databases? (unresponsive table)

I'm facing an issue where 2 out of 10 Spanner databases are showing high CPU usage (above 40%) whereas the others are around 1% each, with almost identical or more data.
I notice one of our tables has become "unresponsive": no queries work against it. We shut down all apps that connect to those DBs, and we also deleted all current sessions using gcloud sessions list and then gcloud session delete.
However, the table is still unresponsive. A simple select like select id from mytable where name = 'test' is not responding (when tested from an app, and also from the gcloud web interface). It only happens with that table, which has only a few columns with normal data and no more than 2000 records. We identified the query that could have been the source of the problem; however, the table seems to be locked (only count(*) without any where clause works).
I was wondering if there is any way to "unlock" the table, kill those "transactions" that might be causing the issue, restart those specific Spanner databases, or, in the worst-case scenario, restart the Spanner instance.
I have seen the monitoring high CPU documentation, but even if we can identify that the CPU is high, we don't really know how to restart it or bring it back to normal without first reviewing the query or queries that could have caused the issue (if that was the case).
High CPU can be caused by user-initiated queries, from different types of operations. It is important to note that your instance is where you allocate resources to be used by the underlying Cloud Spanner databases. This means that if all of your databases are in the same instance and some of them are hogging the CPU, all your other databases will struggle.
In terms of a locked table, I would be very surprised if a deadlock is the problem here, since Spanner mitigates those issues using the "wound-wait" algorithm. What I suspect might be happening is that a long time is necessary to perform the query on that table and it times out. It would be nice to investigate your problem a bit more:
How long did you wait for your query on the problematic table (before you deemed it to be stuck)? It might be a problem where your query just takes long and you are timing out before getting the results. It might be a problem where there are hotspots in your table and transactions are getting aborted often, preventing you from getting results.
What error did you get when performing the query on the table? The error code can tell you more about what might be happening.
Have you tried doing a stale read on the table to see if any data is returned (see the sketch after this list)? If lock contention is the problem, this should succeed and return results faster for you. You can then further investigate the lock usage with the statistics table (as shown below).
Inspect query statistics: you can list the queries with the highest CPU usage, find the average latency for a query, and find out the queries that time out the most. There is much more you can do, as seen here. You can also view the query statistics in the cloud console. What I would verify is: by reducing the overall CPU, does your query come back without any issues? You might need more resources if so. Or you might need to reduce hotspots in your database.
Inspect Lock statistics: you can investigate lock conflicts in your rows and tables. I think an interesting query for your problem would be "Discovering which row keys and columns had long lock wait times during the selected period". You can then see if your query is reading those row keys and columns and act accordingly.
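For the stale-read suggestion, here is a rough sketch from .NET using the Google.Cloud.Spanner.Data client; the connection string, staleness window, and table/column names are placeholders:

using System;
using System.Threading.Tasks;
using Google.Cloud.Spanner.Data;

public static class StaleReadCheck
{
    // A stale read at a bounded-staleness timestamp does not block on locks held
    // by read-write transactions, so if this returns quickly while the normal
    // query hangs, lock contention is the likely culprit.
    public static async Task RunAsync()
    {
        var connectionString =
            "Data Source=projects/my-project/instances/my-instance/databases/my-db";

        using (var connection = new SpannerConnection(connectionString))
        {
            await connection.OpenAsync();

            var bound = TimestampBound.OfMaxStaleness(TimeSpan.FromSeconds(15));
            using (var transaction = await connection.BeginReadOnlyTransactionAsync(bound))
            {
                var command = connection.CreateSelectCommand(
                    "SELECT id FROM mytable WHERE name = 'test'");
                command.Transaction = transaction;

                using (var reader = await command.ExecuteReaderAsync())
                {
                    while (await reader.ReadAsync())
                    {
                        Console.WriteLine(reader["id"]);
                    }
                }
            }
        }
    }
}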

mongocxx crash when doing many inserts & catching exceptions in underlying C driver

I have a C++ application which is processing JSON and inserting into a DB. It's doing a pretty simple insert: it parses the JSON, adds a couple of fields, and inserts as below.
I have also surrounded the call with a try/catch to handle any exceptions that might be thrown.
The application is probably processing around 360-400 records a minute.
It has been working quite well however it occasionally errors and crashes. Code and stack trace below.
Code
try {
    // Build the document: start from the incoming JSON, then append a "created" timestamp.
    bsoncxx::builder::basic::document builder;
    builder.append(bsoncxx::builder::basic::concatenate(bsoncxx::from_json(jsonString)));
    builder.append(bsoncxx::builder::basic::kvp(std::string("created"),
        bsoncxx::types::b_date(std::chrono::system_clock::now())));
    bsoncxx::document::value doc = builder.extract();

    // Insert into the "Data" collection and check the optional result.
    bsoncxx::stdx::optional<mongocxx::result::insert_one> result =
        m_client[m_dbName][std::string("Data")].insert_one(doc.view());
    if (!result) {
        m_log->warn(std::string("Insert failed."), label);
    }
} catch (...) {
    m_log->warn(std::string("Insert failed and threw exception."), label);
}
Stack
So I guess I have two questions. Any ideas as to why it is crashing? And is there some way I can catch and handle this error in such a way that it does not crash the application?
mongocxx version is: 3.3.0
mongo-c version is: 1.12.0
Any help appreciated.
Update 1
I did some profiling on the DB, and although it's doing a high number of writes it is performing quite well, and all operations are using indexes. The most inefficient operation takes 120 ms, so I don't think it's a performance issue on that end.
Update 2
After profiling with valgrind I was unable to find any allocation errors. I then used perf to profile CPU usage and have found that over time CPU usage builds up, typically starting at around 15% base load when the process first starts and then ramping up over the course of 4 or 5 hours until the crash occurs, with CPU usage around 40-50%. During this whole time the number of DB operations per second remains consistent at 10 per second.
To rule out the possibility of other processing code causing this, I removed all DB handling and ran the process overnight, and CPU usage remained flat at 15% the entire time.
I'm trying a few other strategies to try and identify root cause now. Will update if I find anything.
Update 3
I believe I have discovered the cause of this issue. After the insert process there is also an update which pushes two items onto an array using the $push operator. It does this for different records up to 18,000 times before that document is closed. The fact it was crashing on the insert under high CPU load was a bit of a red herring.
The CPU usage build-up is that process taking longer to execute as the array within the document being pushed to gets longer. I re-architected to use Redis and only insert into MongoDB once all items to be pushed onto the array have been received. This flattened out CPU usage. Another way around this could be to insert each item into a temporary collection and use aggregation with $push to compile a single record with the array once all items to be pushed have been received.
The graphs below illustrate the issue with using $push on a document as the array gets longer. The huge drops are when all records have been received and new documents are created for the next set of items.
In closing this issue I'll take a look over on the MongoDB Jira and see if anyone has reported this issue with the $push operator, and if not report it myself in case this is something that can be improved in future releases.

Is db.stats() a blocking call for MongoDB?

While researching how to check the size of a MongoDB, I found this comment:
Be warned that dbstats blocks your database while it runs, so it's not suitable in production. https://jira.mongodb.org/browse/SERVER-5714
Looking at the linked bug report (which is still open), it quotes the Mongo docs as saying:
Command takes some time to run, typically a few seconds unless the .ns file is very large (via use of --nssize). While running other operations may be blocked.
However, when I check the current Mongo docs, I don't find that text. Instead, they say:
The time required to run the command depends on the total size of the database. Because the command must touch all data files, the command may take several seconds to run.
For MongoDB instances using the WiredTiger storage engine, after an unclean shutdown, statistics on size and count may be off by up to 1000 documents as reported by collStats, dbStats, count. To restore the correct statistics for the collection, run validate on the collection.
Does this mean the WiredTiger storage engine changed this to a non-blocking call by keeping ongoing stats?
A bit late to the game, but I found this question while looking for the answer, and the answer is: yes, until 3.6.12 / 4.0.5 it was acquiring a "shared" lock ("R"), which blocks all write requests during execution. After that it is now an "intent shared" lock ("r"), which doesn't block write requests. Read requests were not impacted.
Source: https://jira.mongodb.org/browse/SERVER-36437
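As an aside, if you want to pull these numbers from application code instead of the shell, a minimal sketch using the .NET driver's RunCommand (the connection string and database name are placeholders):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

// Sketch only: runs the same dbStats command that the db.stats() shell helper wraps.
var client = new MongoClient("mongodb://localhost:27017");
var database = client.GetDatabase("mydb");

var stats = database.RunCommand<BsonDocument>(new BsonDocument("dbStats", 1));
Console.WriteLine(stats["dataSize"]);   // data size in bytes (default scale)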

Lucene searches are slow via AzureDirectory

I'm having trouble understanding the complexities of Lucene. Any help would be appreciated.
We're using a Windows Azure blob to store our Lucene index, with Lucene.Net and AzureDirectory. A WorkerRole contains the only IndexWriter, and it adds 20,000 or more records a day, and changes a small number (fewer than 100) of the existing documents. A WebRole on a different box is set up to take two snapshots of the index (into another AzureDirectory), alternating between the two, and telling the WebService which directory to use as it becomes available.
The WebService has two IndexSearchers that alternate, reloading as the next snapshot is ready--one IndexSearcher is supposed to handle all client requests at a time (until the newer snapshot is ready). The IndexSearcher sometimes takes a long time (minutes) to instantiate, and other times it's very fast (a few seconds). Since the directory is physically on disk already (not using the blob at this stage), we expected it to be a fast operation, so this is one confusing point.
We're currently up around 8 million records. The Lucene search used to be so fast (it was great), but now it's very slow. To try to improve this, we've started to IndexWriter.Optimize the index once a day after we back it up--some resources online indicated that Optimize is not required for often-changing indexes, but other resources indicate that optimization is required, so we're not sure.
The big problem is that whenever our web site has more traffic than a single user, we're getting timeouts on the Lucene search. We're trying to figure out if there's a bottleneck at the IndexSearcher object. It's supposed to be thread-safe, but it seems like something is blocking the requests so that only a single search is performed at a time. The box is an Azure VM, set to a Medium size so it has lots of resources available.
Thanks for whatever insight you can provide. Obviously, I can provide more detail if you have any further questions, but I think this is a good start.
I have much larger indexes and have not run into these issues (~100 million records).
Put the indexes in memory if you can (8 million records sounds like it should fit into memory, depending on the number of analyzed fields, etc.). You can use a RAMDirectory as the cache directory (see the sketch below).
IndexSearcher is thread-safe and supposed to be re-used, but I am not sure if that is the reality. In Lucene 3.5 (Java version) they have a SearcherManager class that manages multiple threads for you.
http://java.dzone.com/news/lucenes-searchermanager
Also, a non-Lucene point: if you are on an extra-large+ VM, make sure you are taking advantage of all of the cores. Especially if you have a Web API/ASP.NET front-end for it, those calls should all be asynchronous.
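On the first point, the RAM-cache wiring is roughly this. A sketch only: it assumes the AzureDirectory constructor overload that takes a cache Directory, and the storage setting and catalog names are placeholders.

// Sketch: keep the local AzureDirectory cache in RAM so that, after warm-up,
// searches are served from memory rather than local disk.
var account = CloudStorageAccount.Parse(
    RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
var azureDirectory = new AzureDirectory(account, "MyIndexCatalog", new RAMDirectory());

// Open one read-only searcher and re-use it across requests until the next
// snapshot is ready.
var searcher = new IndexSearcher(IndexReader.Open(azureDirectory, true));   // true = read-only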

SQL Server & ADO.NET: how to automatically cancel a long-running user query?

I have a .NET Core 2.1 application that allows users to search a large database, with the possibility of using lots of parameters. The data access is done through ADO.NET. Some of the generated queries run for a very long time (several hours). Obviously, the user gives up on waiting, but the query chugs along in SQL Server.
I realize that the root cause is the design of the app, but I would like a quick solution for now, if possible.
I have tried many solutions, but none seem to work as expected.
What I have tried:
CommandTimeout
CommandTimeout works as expected with ExecuteNonQuery but does not work with ExecuteReader, as discussed in this forum
When you execute command.ExecuteReader(), you don't get this exception because the server responds on time. The application doesn't respond because it reads data to the memory, and the ExecuteReader() method doesn't return control until all the data is read.
I have also tried using SqlDataAdapter, but this does not work either.
SQL Server query governor
SQL Server's query governor works off of the estimated execution plan, and while it does work sometimes, it does not always catch inefficient queries.
SQL Server execution time-out
Tools > Options > Query Execution > SQL Server > General
I'm not sure what this does, but after entering a value of 1, SQL Server still allows queries to run as long as they need. I tried restarting the server instance, but that did not make any difference.
Again, I realize that the cause of this problem is the way that the queries are generated, but with so many parameters and so much data, fine tuning a solution in the design of the application may take some time. As of now, we are manually killing any spid associated with this app that has run over 10 or so minutes.
EDIT:
I abandoned the hope of finding a simple solution. If you're having a similar issue, here is what we did to address it:
We created a .net core console app that polls the database for queries running over a certain allotted time. The app looks at the login name and the amount of time it's been running and determines whether to kill the process.
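A rough sketch of that kind of watchdog (this is not the actual app; the login name, threshold, and connection string are placeholders, and KILL requires the appropriate server permissions):

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

// Sketch only: find sessions from the app's login whose current request has been
// running for more than 10 minutes, then kill them. KILL cannot be parameterized,
// so the session id (a smallint read from the DMV) is formatted into the command.
using (var connection = new SqlConnection("Server=.;Database=master;Integrated Security=true"))
{
    connection.Open();

    var find = new SqlCommand(
        @"SELECT r.session_id
          FROM sys.dm_exec_requests r
          JOIN sys.dm_exec_sessions s ON s.session_id = r.session_id
          WHERE s.login_name = @login
            AND r.start_time < DATEADD(MINUTE, -10, GETDATE());", connection);
    find.Parameters.AddWithValue("@login", "AppSearchLogin");

    var sessionIds = new List<short>();
    using (var reader = find.ExecuteReader())
    {
        while (reader.Read())
        {
            sessionIds.Add(reader.GetInt16(0));
        }
    }

    foreach (var sessionId in sessionIds)
    {
        using (var kill = new SqlCommand($"KILL {sessionId};", connection))
        {
            kill.ExecuteNonQuery();
        }
    }
}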
https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlcommand.cancel?view=netframework-4.7.2
Looking through the documentation on SqlCommand.Cancel, I think it might solve your issue.
If you were to create and start a Timer before you call ExecuteReader(), you could then keep track of how long the query is running, and eventually call the Cancel method yourself.
(Note: I wanted to add this as a comment but I don't have the reputation to be allowed to yet)
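To expand on the Timer idea, a rough sketch (the connection string and query are placeholders; when the cancel takes effect, the read typically surfaces as a SqlException that you would handle as a timeout):

using System;
using System.Data.SqlClient;
using System.Threading;

// Sketch only: request cancellation of the running command after 10 minutes.
using (var connection = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
using (var command = new SqlCommand("SELECT * FROM dbo.SearchData /* generated query */", connection))
{
    connection.Open();

    // SqlCommand.Cancel asks SQL Server to abandon the batch; if the command
    // has already finished, the call does nothing.
    using (var timer = new Timer(_ => command.Cancel(), null,
        TimeSpan.FromMinutes(10), Timeout.InfiniteTimeSpan))
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // stream results back to the caller
        }
    }
}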