Is there a way to remove prefixes/suffixes and contractions in Mallet - mallet

I recently started using Mallet from UMass. I looked for a command that removes prefixes/suffixes and contractions, similar to the command for removing stop words, but I didn't find any information about it. If Mallet can do this, can someone point me in the right direction? And if it cannot, but you know of something else that can, could you point me in that direction?
Thanks in advance!

You can do some computational-linguistic preprocessing on your corpus (e.g. stemming or lemmatising) before running Mallet on it.
I am not aware of a way to do this inside Mallet (and it is heavily language dependent). Maybe you can write an input filter, but I'd keep it outside Mallet.
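As a rough illustration of doing this outside Mallet, here is a minimal Python sketch that expands a few English contractions and strips a few suffixes. The rule tables (`IRREGULAR`, `CLITICS`, `SUFFIXES`) and the `MIN_STEM` cutoff are hypothetical and deliberately tiny; a proper stemmer or lemmatiser for your language will do much better. Prefixes could be handled the same way with `startswith`.

```python
import re

# Hypothetical, deliberately tiny rule tables; extend them for a real corpus.
IRREGULAR = {"can't": "can not", "won't": "will not"}
CLITICS = [("n't", " not"), ("'re", " are"), ("'ll", " will"),
           ("'ve", " have"), ("'m", " am"), ("'s", "")]
SUFFIXES = ("ing", "ed", "er", "s")
MIN_STEM = 3  # never shorten a word below this many characters


def expand_contraction(token):
    """Replace a contraction with its expanded form, if a rule matches."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    for clitic, expansion in CLITICS:
        if token.endswith(clitic):
            return token[: -len(clitic)] + expansion
    return token


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least MIN_STEM characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def preprocess(line):
    """Lowercase, tokenise, expand contractions, then strip suffixes."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", line.lower())
    words = " ".join(expand_contraction(t) for t in tokens).split()
    return " ".join(strip_suffix(w) for w in words)


print(preprocess("They don't like walking dogs"))
```

The cleaned text can then be written back to files and imported into Mallet as usual.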

Related

simulating the estimators in matlab

I have an assignment like this, but I do not know where to start (sorry, I do not know much about Matlab). I have struggled with the variable C (how can I pass a value into it when its range is unknown?) and with the blocks at the bottom as well. I have to use something like Simulink, or just code, to achieve it. I have searched the internet but have not found anything related to my task, or maybe I am using the wrong keywords. Could you give me some suggestions or relevant links? Thank you so much!

Running OPTICS algorithm on ELKI

I'm normally an R user (a beginning R user, but I'm starting to get the hang of it). However, I have heard positive things about ELKI--in particular, its speed. I came across this old post "How to group nearby latitude and longitude locations stored in SQL" and the answer posted by Anony-Mousse is similar to what I'd like to do. I would like to be able to replicate each step he has done up to the KML file he has shared on Google Drive.
I've downloaded ELKI and am able to run the mini-GUI.
Could someone post some steps on how to do what Anony-Mousse was able to do?
My data is very similar in nature. I have geocoded addresses in a csv file (more specifically, each tuple is an event and one of the variables/features/columns is the geocoded address of the event) and I'm looking to find clusters much like the OP in the link above.
Hopefully, Anony-Mousse will read this post and come to the rescue. But I'd be grateful if anyone else could help get me on my way.
Sorry about not following up earlier.
I did not keep the code for the experiments you refer to, so I don't remember whether I used a Python script to rewrite the output to KML (I believe I did) or whether I just copied and pasted from the ELKI source into a custom ResultHandler to generate the file.
Probably the former, because writing XML in Java is a bit more complicated than just printing the document in Python (although it is also more likely to produce correct XML). If so, I probably used the scipy.spatial package to compute the convex hull. Reading the ELKI text output is fairly trivial: skip the comment lines and take the two numeric columns of the remaining lines as coordinates.
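Since the original script was not kept, the following is only a sketch of what such a Python script might look like, not a reconstruction of it. It assumes comment lines start with `#`, that the first two numeric columns are longitude and latitude (swap them if your file is ordered the other way), and that each cluster has been read into its own list of lines. To stay dependency-free it computes the hull with a hand-written monotone chain instead of scipy.spatial; the function names are made up for illustration.

```python
def read_points(lines):
    """Skip comment lines; take the first two numeric columns as coordinates."""
    points = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        numbers = []
        for token in line.split():
            try:
                numbers.append(float(token))
            except ValueError:
                continue  # non-numeric column such as an ID or a label
        if len(numbers) >= 2:
            points.append((numbers[0], numbers[1]))
    return points


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_to_kml(hull, name="cluster"):
    """Render one hull as a KML polygon ('lon,lat' pairs, closed ring).

    The name is inserted as-is, so it must not contain XML special characters.
    """
    ring = hull + hull[:1]
    coords = " ".join(f"{lon},{lat}" for lon, lat in ring)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        f"<Placemark><name>{name}</name><Polygon><outerBoundaryIs><LinearRing>"
        f"<coordinates>{coords}</coordinates>"
        "</LinearRing></outerBoundaryIs></Polygon></Placemark>\n"
        "</kml>\n"
    )
```

Calling `hull_to_kml(convex_hull(read_points(lines)))` on the lines of one cluster yields a document that can be saved with a `.kml` extension.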

Is there a port for KStem for .NET?

I'm about to launch into a Lucene.NET implementation and I am concerned about using the PorterStemFilter. Reading here, and reading source code, it appears to be far, far too aggressive for my needs.
I need something simpler that doesn't look for roots but just removes suffixes such as "er", "ed" and "s". From what I've read, KStem would do the trick.
I can't for the life of me find a .NET version of KStem. I can't even find source code for the Java version to handroll a port.
Could someone point me in the right direction?
Looks like it is easy enough to handcraft a reduced PorterStemmer by simply removing steps I don't want. Anyone have success with that?
You could use the HunspellStemmer, part of contrib. It can use freely available hunspell dictionaries to provide proper stemming.
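On the idea of a reduced Porter stemmer: as a rough sketch, here is what keeping only step 1a of the Porter algorithm (plural handling) looks like. It is written in Python for brevity and would need porting to C# for Lucene.NET; the function name is made up, and further steps (for "ed", "ing" and so on) would be added back one at a time as needed.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer (plural endings)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
```

Note that even this single step maps "ponies" to "poni", so the output is a stem and not necessarily a dictionary word.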

iPhone - entering equations

I've been researching this topic for a few weeks now, but I'm still unsure as to what is the "best" way to approach this problem.
I am designing an app, and part of the input involves entering an equation (i.e. a mathematical function). I'm not looking for anything super complicated; it's single-variable, at least for now.
What is the best way to approach entry and parsing? Is there a parser that is very good for this? What about a graphical approach such as dragging/selecting parts and assembling a function by its components?
Thanks.
You should be able to use regular expressions to parse it out.
Check out NSRegularExpression and Google around for a regex that will parse out the equation into its different parts.
If you want to make your application extensible (for the future), you should read something about parser theory. There is a simple example on Wikipedia (here) from which you can start. It uses flex (to generate the lexer) and bison (to generate the parser), which can be easily integrated with Objective-C code.
If that example is more than you need, you can start with a simpler one from the bison manual (here).
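To give a feel for the lexer/parser split without the flex and bison tooling, here is a hand-written recursive-descent sketch in Python (not Objective-C, and not the Wikipedia or bison example). A regular expression does the tokenizing, and one function per grammar rule handles precedence and nested parentheses. The grammar, the `FUNCTIONS` table and the variable name `x` are assumptions made for illustration.

```python
import math
import re

TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d*)?|\.\d+)|([A-Za-z]+)|([-+*/^()]))")
FUNCTIONS = {"sin": math.sin, "cos": math.cos, "sqrt": math.sqrt}  # illustrative


def tokenize(text):
    """Lexer: split the input into (kind, value) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        number, name, op = match.groups()
        if number is not None:
            tokens.append(("num", float(number)))
        elif name is not None:
            tokens.append(("name", name))
        else:
            tokens.append(("op", op))
        pos = match.end()
    return tokens


def evaluate(text, x):
    """Parser: evaluate a single-variable expression at the given x."""
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("end", None)

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expect(op):
        if advance() != ("op", op):
            raise ValueError(f"expected {op!r}")

    def expr():  # expr := term (('+' | '-') term)*
        value = term()
        while peek() in (("op", "+"), ("op", "-")):
            value = value + term() if advance()[1] == "+" else value - term()
        return value

    def term():  # term := unary (('*' | '/') unary)*
        value = unary()
        while peek() in (("op", "*"), ("op", "/")):
            value = value * unary() if advance()[1] == "*" else value / unary()
        return value

    def unary():  # unary := '-' unary | power
        if peek() == ("op", "-"):
            advance()
            return -unary()
        return power()

    def power():  # power := atom ('^' unary)?   (right-associative)
        base = atom()
        if peek() == ("op", "^"):
            advance()
            return base ** unary()
        return base

    def atom():  # atom := number | x | function '(' expr ')' | '(' expr ')'
        kind, value = advance()
        if kind == "num":
            return value
        if kind == "name" and value == "x":
            return x
        if kind == "name" and value in FUNCTIONS:
            expect("(")
            argument = expr()
            expect(")")
            return FUNCTIONS[value](argument)
        if (kind, value) == ("op", "("):
            inner = expr()
            expect(")")
            return inner
        raise ValueError(f"unexpected token {value!r}")

    result = expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result


print(evaluate("3*x^2 + 2*x - 1", 2.0))
```

A generated parser has the same shape: the token pattern corresponds to the flex rules and each grammar function to a bison production.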
You can use MathML products like MathType and MathMagic.
For other products, see this.
If you want to use JavaScript for formatting, then use jqMath.

random forest code review

I'm doing a research project on the random forest algorithm. I have found numerous implementations of the algorithm, but the main part of the code is often written in Fortran, which I have no experience with.
I have to edit the code, change the main parameters (like tree depth, number of feature variables, ...) and trace the algorithm's performance during each run.
Currently I'm using "Windows-Precompiled-RF_MexStandalone-v0.02-". The train and predict functions are MATLAB MEX files and cannot be opened or edited. Can anyone give me some advice on what to do, or is there a valid, completely MATLAB-based version of random forests?
I've read randomforest-matlab carefully. Unfortunately, the main training part is a DLL file. After reading more, most of my questions are now resolved. My main question was how to run several trees simultaneously.
Have you taken a look at these libraries?
Stochastic Bosque
randomforest-matlab
If you're doing a research project on it, the best thing is probably to implement the individual tree training yourself in C and then write MEX wrappers. I'd start with an ID3 tree (before attempting C4.5, for instance). Then write the random forest code itself, which, once you have written the tree code, isn't all that hard.
You'll:
learn a lot
be able to modify them as much as you like
eventually move on to exploring new areas with them
I've implemented them myself from scratch so I can help once you post some of your own code. But I don't think anybody on this site will write the code for you.
Will it take effort? Yes. Will you come out of it with more knowledge and ability than you had going in? Undoubtedly.
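As a possible starting point for the ID3 tree mentioned above, here is a minimal sketch of its split criterion (entropy and information gain) for categorical features. It is in Python rather than C to keep it short; the function names and the dict-per-row data layout are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(group) / len(labels) * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """ID3 picks the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
print(best_feature(rows, labels, ["windy", "outlook"]))
```

A tree is built by applying `best_feature` recursively to each group. A random forest then trains many such trees, each on a bootstrap sample of the rows and with only a random subset of the features considered at each split.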
There is a nice library in R called randomForest. It is based on Breiman's original Fortran implementation, but it is now mainly recoded in C.
http://cran.r-project.org/web/packages/randomForest/index.html
The main parameters you talk about (tree depth, number of features to be tested, ...) are directly available.
Another library I would recommend is Weka. It is Java-based and lucid. Performance is slightly off compared to R, though. The source code can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/