Is there a port for KStem for .NET? - lucene.net

I'm about to launch into a Lucene.NET implementation and I am concerned about using the PorterStemFilter. Reading here, and reading source code, it appears to be far, far too aggressive for my needs.
I need something simpler that doesn't look for roots but just removes "er", "ed", "s", etc suffixes. From what I've read, KStem would do the trick.
I can't for the life of me find a .NET version of KStem. I can't even find source code for the Java version to handroll a port.
Could someone point me in the right direction?
Looks like it is easy enough to handcraft a reduced PorterStemmer by simply removing steps I don't want. Anyone have success with that?

You could use the HunspellStemmer, part of contrib. It can use freely available hunspell dictionaries to provide proper stemming.

Related

Reading & writing text in Scala, getting the encoding right?

I'm reading and writings some text files in Scala. As a complete beginner in the language, I wanted to make sure to find the right way to do it, e.g. get the encoding right.
So most of the stuff I found (also on SO ) recommends I use io.Source.fromFile.However, after trying it out like so, reading a UTF-8 file:
val user_list = Source.fromFile("usernames.txt").getLines.toList
val user_list = Source.fromFile("usernames.txt", enc="UTF8").getLines.toList
I looked at the docs but was left with some questions.
Get the encoding right:
the docs show that I can set an encoding in Source.fromFile as I tried above. Looking at the man on Codec and the types listed there, I was wondering if those are all my codec options - is there e.g. no Utf-16, Big-Endian vs Little-Endian, etc.?
I am slightly obsessed with this since it used to trip me up in Python a lot. Is this less of concern with Scala for some reason?
Get the reading in right:
All the examples I looked at used the getLines method and postprocessed it with MkString or List, etc. Is there any advantage to that over just reading in the entire file (my files are small) in one go?
Get the writing out right:
Every source I could find tells me that Scala has no file writing function and to use the Java FileWriter. I was surprised by this - is this still accurate?
Looking at it I feel the question might be a little broad for SO, so I'd be happy to take it back if it does not meet the requirements. At this point, I'm not struggling with specific examples but rather trying to set things up in a way I don't get in trouble later.
Thanks!
Scala only has a basic IO api in the standard library. For the most part you just use the java apis. The fact that a decent api from java exists is probably why the Scala team is not prioritizing having a robust and fully featured IO api.
There are also third party scala libraries you could use as well however. Better Files I've never used but heard good things about as a Scala file api. As well as fs2 which provides functional, streaming IO. I'm sure there are others out there as well.
For encoding, there are many possible encoding available. It's just that only a couple of the most common ones are available as static fields, the rest you typically access through Codec("Encoding Name"). Most apis will also let you just enter a String directly instead of needing to get a Codec instance first. The codec is really just a wrapper over java.nio.charset.Charset. You can run java.nio.charset.Charset.availableCharsets() to see all of the encodings available on your system.
As far as reading, if the files are small you can load them fully into memory if you prefer that. The only reason not to do so is if you want to avoid the extra memory use of loading the entire file at once if reading through line by line is enough. You may want to use Vector instead of List for efficiency reasons (Vector is better in many cases and should probably be preferred as a default collection, but tradition and old habits die hard and most people/guides seem to default to List, but this is a whole other topic)

Converting SBML model into a simulatable Matlab Function

I'm looking for a tool to convert a SBML model into a Matlab function. I've tried SBMLTranslate() function from libSBML but this returns a Matlab struct, not a function. Does anybody know if such tool exists? Thanks
There are at least three efforts in this direction:
Frank Bergmann offers an online service for SBML translation where you can upload an SBML file and it will generate a MATLAB file. The comments at the top of the generated MATLAB file explain how to use the results. The C++ source code is available on SourceForge.
Bergmann's code referenced above was used by Stanley Gu to create sbml2matlab, a Windows standalone program. Off-hand, I don't know whether Gu's version changed or enhanced the algorithm used by the Bergmann version, but it seems likely. (Note: Gu now works at Google and does not maintain this code anymore, as far as I know.)
The Systems Biology Format Converter (SBFC) is a framework written principally by Nicolas Rodriguez; it includes a collection of converters, one of which is an SBML-to-MATLAB converter. This converter is written in Java.
I have not compared the results of the translators myself yet, so cannot speak to the differences or quality of output. If you try them and have any feedback to relate, please let the authors know. Knowing what has or hasn't worked for real users will help improve things in the future.
A final caveat is that all of these have been research projects, so make sure to set your expectations accordingly. (This is not a criticism of the authors; the authors are very good – I know most of them personally – but the reality of academic development work is that we all lack the time and resources to make these systems comprehensive, hardened, polished, and documented to the degree that we wish we could.)

Alternative to ask-concurrent

I started today learning Netlogo based on an existing simulation which uses ask-concurrent. When I looked up the documentation to find what ask-concurrent does, I found the following: NOTE: The following information is included only for backwards compatibility. We don't recommend using the ask-concurrent primitive at all in new models.
However, it doesn't say anything about alternatives to this function. What should I use instead of ask-concurrent?
Use regular ask. If you find that changes the meaning of the code in a way that you consider undesirable, then in order to advise you on what to do about it, we will need to know the details of your code.
http://ccl.northwestern.edu/netlogo/docs/programming.html#ask-concurrent contains several examples of pieces of code that use ask-concurrent and how they can be written using regular ask instead.

Why use MEMCACHED_BEHAVIOR_NOREPLY?

I am writing a library that wraps libmemcached.
I noticed that there is an interesting behaviour setting for memcached which says that I can indicate that I do not care about the results of my memcached commands...
It is known as MEMCACHED_BEHAVIOR_NOREPLY. Why would somebody want to use this?
It would be great if someone could point out a few use cases? multi-get / multi-set spring to mind, but I am not clear how this would be useful.
ASCII protocol noreply was a mistake. Don't use it.
Binary protocol noreply can be used to make really awesome optimizations without any loss of correctness.

Are there any tools to visualize template/class methods and their usage?

I have taken over a large code base and would like to get an overview how and where certain classes and their methods are used.
Is there any good tool that can somehow visualize the dependencies and draw a nice call tree or something similar?
The code is in C++ in Visual Studio if that helps narrow down any selection.
Here are a few options:
CodeDrawer
CC-RIDER
Doxygen
The last one, doxygen, is more of an automatic documentation tool, but it is capable of generating dependency graphs and inheritance diagrams. It's also licensed under the GPL, unlike the first two which are not free.
When I have used Doxygen it has produced a full list of callers and callees. I think you have to turn it on.
David, thanks for the suggestions. I spent the weekend trialing the programs.
Doxygen seems to be the most comprehensive of the 3, but it still leaves some things to be desired in regard to callers of methods.
All 3 seem to have problems with C++ templates to varying degrees. CC-Rider simply crashed in the middle of the analysis and CodeDrawer does not show many of the relationships. Doxygen worked pretty well, but it too did not find and show all relations and instead overwhelmed me with lots of macro references until I filtered them out.
So, maybe I should clarify "large codebase" a bit for eventual other suggestions: >100k lines of code overall spread out over more than 100 template files plus several actual class files pulling it all together.
Any other tools out there, that might be up to the task and could do better (more thoroughly)? Oh and specifically: anything that understands IDL and COM interfaces?
When I have used Doxygen it has produced a full list of callers and callees. I think you have to turn it on.
I did that of course, but like I mentioned, doxygen does not consider interfaces between objects as they are defined in the IDL. It "only" shows direct C++ calls.
Don't get me wrong, it is already amazing what it does, but it is still not complete from my high level view trying to get a good understanding of how everything fits together.
In Java I would start with JDepend. In .NET, with NDepend. Don't know about C++.