Lucene.NET lifetime management

Let's assume that I have a basic understanding of adding and searching documents.
What would be the best practice for managing instances of IndexWriter and IndexReader?
Currently, my application creates a singleton instance of an IndexWriter. Whenever I need to do a search, I just create an IndexSearcher from the IndexWriter using the following:
var searcher = new IndexSearcher(writer.GetReader());
I am doing this because creating a new IndexReader causes the index to be loaded into memory and then waits for the GC to reclaim the memory. This was causing out-of-memory errors.
Is this current implementation considered ideal? It has solved the memory issue, but there is an issue with the write.lock file always existing (because the IndexWriter is always instantiated and open). Here is the stack trace of the errors I get in the app:
Lock obtain timed out: NativeFSLock@C:\inetpub\wwwroot\htdocs_beta\App_Data\products3\write.lock:
System.IO.IOException: The process cannot access the file 'C:\inetpub\wwwroot\htdocs_beta\App_Data\products3\write.lock' because it is being used by another process.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access)
   at Lucene.Net.Store.NativeFSLock.Obtain()
I'm thinking maybe it would be best to create a singleton instance of IndexSearcher for searching, and then create an IndexWriter as needed. That way, the write.lock file will be created/deleted only when updating the index. The only issue I see with this is that the IndexSearcher instance will become outdated; I would need a task running that reloads the IndexSearcher whenever the index has been updated.
What do you think?
How do you handle a large index with live updating?

You should use only one IndexWriter to avoid your locking issues. Have a look at: Lucene.Net writing/reading synchronization
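For illustration, here is a minimal sketch of that single-writer pattern with a periodically refreshed near-real-time reader, assuming Lucene.Net 2.9/3.x; the index path, analyzer, and class name are placeholders, not code from the linked answer:

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class SearchIndexHolder
{
    private static readonly object SyncRoot = new object();

    // One writer for the whole application; it holds write.lock for its lifetime.
    private static readonly IndexWriter Writer = new IndexWriter(
        FSDirectory.Open(new DirectoryInfo(@"C:\inetpub\wwwroot\htdocs_beta\App_Data\products3")),
        new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);

    // A near-real-time reader obtained from the writer, not from the directory,
    // so refreshing does not reload the whole index or contend for write.lock.
    private static IndexReader _reader = Writer.GetReader();

    public static IndexSearcher GetSearcher()
    {
        lock (SyncRoot)
        {
            // Reopen() is cheap: it returns the same instance when nothing changed.
            IndexReader newReader = _reader.Reopen();
            if (newReader != _reader)
            {
                // Caveat: in production you would keep the old reader alive until
                // in-flight searches finish (e.g. via IncRef/DecRef) before disposing.
                _reader.Dispose();
                _reader = newReader;
            }
            return new IndexSearcher(_reader);
        }
    }
}

With a single long-lived writer, the write.lock file existing while the app runs is expected; the "Lock obtain timed out" error typically means a second writer tried to open the same index, for example during an overlapped IIS app-domain recycle.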

Related

potential for file id collision in C when doing pthread network io

I have an app in C that listens on a port, creates a pthread upon connection, and goes back to listening. The pthread function reads from the socket, writes a response, then waits 1/10th of a second, followed by a shutdown() and a close(), then pthread_exit(). This can happen very rapidly, resulting in possibly hundreds of threads at the same time. My question is: can the system reuse a file id before I do the final close()? I'm concerned about the possibility of the socket closing prematurely for some reason. On the listening side the file id cannot be reused until I do the close() call, even if the underlying connection is long gone, right? I'm fairly sure that this is how it works but I can't confirm.
On the listening side the file id cannot be reused until I do the close() call even if the underlying connection is long gone, right?
Yes, this is correct - the file descriptor is not released for re-use until it has been passed to close() (or is an FD_CLOEXEC file descriptor being closed automatically at execve()).
All threads try to enter the critical region to be processed; if you don't use a semaphore, mutex, or monitor, the same id may well be reused, and even the files you read from the byte stream may be corrupted. I advise you to use a semaphore, mutex, or monitor, and to read about the dining philosophers problem, because this is a very common situation. Good luck; I hope this gives you a clue about your problem.

potential memory leak using TrieMap in Scala and Tomcat

I am using a scala.collection.concurrent.TrieMap wrapped in an object to store configuration values that are fetched remotely.
import scala.collection.concurrent.TrieMap

object persistentMemoryMap {
  val storage: TrieMap[String, CacheEntry] = TrieMap[String, CacheEntry]()
}
It works just fine but I have noticed that when Tomcat is shut down it logs some alarming messages about potential memory leaks
2013-jun-27 08:58:22 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
ALLVARLIG: The web application [] created a ThreadLocal with key of type [scala.concurrent.forkjoin.ThreadLocalRandom$1] (value [scala.concurrent.forkjoin.ThreadLocalRandom$1@5d529976]) and a value of type [scala.concurrent.forkjoin.ThreadLocalRandom] (value [scala.concurrent.forkjoin.ThreadLocalRandom@59d941d7]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak
I am guessing this thread will terminate on its own eventually, but I am wondering if there is some way to kill it, or should I just leave it alone?
The scala.concurrent.forkjoin.ThreadLocalRandom's value is created only once per thread. It does not hold any references to objects other than the random value generator used by that thread -- the memory it consumes has a fixed size. Once the thread is garbage collected, its thread local random value will be collected as well -- you should just let the GC do its work.
You could still remove it manually by using Java reflection to remove the private modifier on the static field localRandom in the ThreadLocalRandom class:
https://github.com/scala/scala/blob/master/src/forkjoin/scala/concurrent/forkjoin/ThreadLocalRandom.java#L62
You could then call localRandom.set(null) to null out the reference to the random number generator. You should also ensure that the TrieMap is no longer used from that thread; otherwise ThreadLocalRandom will break, since it assumes the random number generator is non-null.
Seems hacky to me, and I think you should just stick to letting the GC collect the thread local value.

mvc-mini-profiler slows down Entity Framework

I've set up mvc-mini-profiler against my Entity Framework-powered MVC 3 site. Everything is duly configured: profiling starts in Application_Start, ends in Application_End, and so on. The profiling part works just fine.
However, when I try to swap my data model object generation to provide profilable versions, performance slows to a crawl. Not every SQL query is affected, but some queries take about 5x the entire page load. (The very first page load after firing up IIS Express takes a bit longer, but this is sustained.)
Negligible time (~2ms tops) is spent querying, executing and "data reading" the SQL, while this line:
var person = dataContext.People.FirstOrDefault(p => p.PersonID == id);
...when wrapped in using(profiler.Step()) is recorded as taking 300-400 ms. I profiled with dotTrace, which confirmed that the time is actually spent in EF as usual (the profilable components do make very brief appearances), only it is taking much longer.
This leads me to believe that the connection or some of its constituent parts are missing sufficient data, making EF perform far worse.
This is what I'm using to make the context object (my edmx model's class is called DataContext):
var conn = ProfiledDbConnection.Get(
    CreateConnection()); // CreateConnection() returns an SqlConnection
return CreateObjectContext<DataContext>(conn);
I originally used the mvc-mini-profiler provided ObjectContextUtils.CreateObjectContext method. I dove into it and noticed that it set a wildcard metadata workspace path string. Since I have the database layer isolated to one project and several MVC sites as other projects using the code, those paths have changed and I'd rather be more specific. Also, I thought this was the cause of the performance issue. I duplicated the CreateObjectContext functionality into my own project to provide this, as such:
public static T CreateObjectContext<T>(DbConnection connection) where T : System.Data.Objects.ObjectContext
{
    var workspace = new System.Data.Metadata.Edm.MetadataWorkspace(
        GetMetadataPathsString().Split('|'),
        // ^-- returns
        // "res://*/Redacted.csdl|res://*/Redacted.ssdl|res://*/Redacted.msl"
        new Assembly[] { typeof(T).Assembly });

    // The remainder of the method is copied straight from the original,
    // and I carried over a duplicate CtorCache too to make this work.
    var factory = DbProviderServices.GetProviderFactory(connection);
    var itemCollection = workspace.GetItemCollection(System.Data.Metadata.Edm.DataSpace.SSpace);
    itemCollection.GetType().GetField("_providerFactory", // <==== big fat ugly hack
        BindingFlags.NonPublic | BindingFlags.Instance).SetValue(itemCollection, factory);
    var ec = new System.Data.EntityClient.EntityConnection(workspace, connection);
    return CtorCache<T, System.Data.EntityClient.EntityConnection>.Ctor(ec);
}
...but it doesn't seem to make much of a difference. The problem still exists whether I use the above hacked version that's more specific with metadata workspace paths or the mvc-mini-profiler provided version. I just thought I'd mention that I've tried this too.
Having exhausted all this, I'm at my wits' end. Once again: when I just provide my data context as usual, no performance is lost. When I provide a "profilable" data context, performance plummets for certain queries (I don't know what influences this either). What could mvc-mini-profiler do that's wrong? Am I still feeding it the wrong data?
I think this is the same problem this person ran into.
I just resolved this issue today.
see: http://code.google.com/p/mvc-mini-profiler/issues/detail?id=43
It happened because some of our fancy hacks were not cached well enough. In particular:
var workspace = new System.Data.Metadata.Edm.MetadataWorkspace(
    new string[] { "res://*/" },
    new Assembly[] { typeof(T).Assembly });
This is a very expensive call, so we need to cache the workspace.
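For illustration, a minimal sketch of what the cached variant could look like, reusing the GetMetadataPathsString and CtorCache helpers from the question; the ConcurrentDictionary-based cache is my assumption, not the exact patch that shipped:

using System;
using System.Collections.Concurrent;
using System.Data.Common;
using System.Data.EntityClient;
using System.Data.Metadata.Edm;
using System.Data.Objects;
using System.Reflection;

// Cache one MetadataWorkspace per context type; building the workspace parses
// the csdl/ssdl/msl resources, which is what made every context creation slow.
private static readonly ConcurrentDictionary<Type, MetadataWorkspace> WorkspaceCache =
    new ConcurrentDictionary<Type, MetadataWorkspace>();

public static T CreateObjectContext<T>(DbConnection connection) where T : ObjectContext
{
    MetadataWorkspace workspace = WorkspaceCache.GetOrAdd(typeof(T), _ =>
        new MetadataWorkspace(
            GetMetadataPathsString().Split('|'),
            new Assembly[] { typeof(T).Assembly }));

    // The _providerFactory reflection hack from the question is omitted here
    // for brevity; it would run once per cached workspace rather than per call.
    var ec = new EntityConnection(workspace, connection);
    return CtorCache<T, EntityConnection>.Ctor(ec);
}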
Profiling, by definition, will affect the performance of the application being profiled. The profiler needs to insert its own method calls throughout the application, intercept low-level system calls, and record all that data someplace (meaning writes to disk). All of those tasks take up precious CPU cycles, memory, and disk access.

Why SynchronizedCollection<T> does not lock on IEnumerable.GetEnumerator()

Why does SynchronizedCollection<T> not acquire a lock on SyncObj in the explicit implementation of IEnumerable.GetEnumerator()?
IEnumerator IEnumerable.GetEnumerator()
{
    return this.items.GetEnumerator();
}
The implicit implementation does acquire a lock on SyncObj (verified with Reflector).
This could be a problem during a foreach loop over the collection: one thread might have acquired the lock while another tries to read the collection using foreach.
Because there is no way for the class to know when the client code is done using the iterator. That is one reason the MSDN Library docs on the System.Collections classes always warn that iterating a collection isn't thread-safe.
Although they appear to have forgotten to mention that in the article for SynchronizedCollection. The irony...
Modifying the collection while someone's using an iterator is a concurrency violation anyway.
What would your alternative be? Lock the collection when the iterator is acquired, and not unlock it until the iterator is destructed?
I'm going to go ahead and say that this could be a bug (ed: or at least an inconsistency) in the implementation. Reflector shows exactly what you're seeing, that every other explicit implementation calls lock on the SyncRoot given, except for IEnumerable.GetEnumerator().
Perhaps you should submit a ticket at Microsoft Connect.
I believe the reason the implicit GetEnumerator() method takes the lock is that List<T>.GetEnumerator() creates a new Enumerator<T>, which relies on the private field _version on the list. While I agree with the other posters that I don't see much use in locking the GetEnumerator() call, since the constructor of Enumerator<T> relies on non-thread-safe fields it would make sense to lock, or at least to remain consistent with the implicit implementation.
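For comparison, one hedged alternative: take a snapshot under the lock, so the copy is made atomically and the caller's foreach then runs without holding any lock. The sync and items fields below stand in for SynchronizedCollection<T>'s actual private members, so this is a sketch, not the class's real code:

IEnumerator IEnumerable.GetEnumerator()
{
    lock (this.sync) // the collection's sync root (exposed publicly as SyncRoot)
    {
        // Copy the backing List<T> while holding the lock; enumerating the
        // snapshot afterwards cannot observe, or block, concurrent writers.
        return new List<T>(this.items).GetEnumerator();
    }
}

The trade-off is an O(n) copy per enumeration, but it avoids both the unlockable-iterator problem raised above and the inconsistency with the implicit implementation.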

SqlDataReader: In this scenario, will the reader get closed?

I am cleaning up the DataReaders in an old .NET 1.1 project that I inherited.
The previous developer coded the data-access-layer in such a way that most of the DAL methods returned SqlDataReaders (thus leaving it up to the caller to properly call the .Close() or .Dispose() methods).
I have come across a situation, though, where a caller is not catching the returned SqlDataReader (and therefore is not disposing of it properly). See the code below:
Data Access Method:
Public Shared Function UpdateData() As SqlDataReader
    ...
    drSQL = cmdSQL.ExecuteReader(CommandBehavior.CloseConnection)
    Return drSQL
End Function
Calling code:
...
DataAccessLayer.UpdateData()
...
As you can see, the calling method does not receive/catch the returned SqlDataReader. So what happens? Is that SqlDataReader still out there and open? Or does it automatically get garbage collected, since nothing is referencing it?
I couldn't think of a way to debug and test this. If anybody has any ideas or suggestions that would be great.
I believe that it will get closed, but not until the garbage collector gets around to it, which may not be for a very long time...
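If you do want the caller to guarantee cleanup, the fix is to dispose the reader at the call site. A sketch of that pattern (in C#, although the DAL above is VB.NET), using the UpdateData method from the question:

using (SqlDataReader reader = DataAccessLayer.UpdateData())
{
    // Consume (or ignore) the rows; what matters is the using block.
    while (reader.Read())
    {
        // ...
    }
    // Dispose() closes the reader, and because it was created with
    // CommandBehavior.CloseConnection, the underlying connection closes too.
}

Until the caller does something like this, the reader and its connection stay open until finalization, which is exactly the leak being cleaned up here.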