Number of clusters obtained using carrot2 inconsistent on the same data set - cluster-analysis

I am using Carrot2 to cluster a set of 500 emails with the BisectingKMeans algorithm it provides. On the same data set, when I specify k = 9, only 6 clusters are generated; when I run it with 8 clusters, 7 are generated. However, when I ask for 10 clusters, all 10 are generated.
Can anyone please help me figure out the reason behind this?

I've had a look at the code, and it looks like this behaviour was caused by a bug in the cluster-splitting routine. I've committed a fix to the master branch of Carrot2, which makes the number of generated clusters more predictable. You can download binaries with the fix from the Carrot2 build server.
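For intuition about how bisecting k-means can return fewer clusters than requested even aside from a bug: the algorithm repeatedly splits the largest cluster in two, and if every remaining cluster becomes un-splittable (e.g. it contains only duplicate documents) it has to stop early. A minimal self-contained sketch of that behaviour (illustrative Python, not Carrot2's implementation; the data and stopping rule are invented):

```python
def two_means_split(points, iters=10):
    """Split 1-D points into two groups with a plain 2-means pass."""
    c1, c2 = min(points), max(points)
    a, b = [], []
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        if not a or not b:
            return None  # degenerate split: this cluster cannot be divided
        c1, c2 = sum(a) / len(a), sum(b) / len(b)
    return a, b

def bisecting_kmeans(points, k):
    """Repeatedly split the largest cluster until k clusters exist
    or no remaining cluster can be split any further."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        target = clusters[0]
        if len(set(target)) < 2:
            break  # largest cluster is a single repeated value: stop early
        split = two_means_split(target)
        if split is None:
            break
        clusters = clusters[1:] + [split[0], split[1]]
    return clusters

# 498 "documents" that collapse onto only 6 distinct values: asking for
# k = 9 yields just 6 clusters, mirroring the behaviour in the question.
data = [v for v in (0, 10, 20, 30, 40, 50) for _ in range(83)]
print(len(bisecting_kmeans(data, 9)))  # → 6
```

The fix mentioned above makes the real algorithm's cluster count more predictable; this sketch only shows why a shortfall is possible in principle.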

Related

Optimizing openmaptiles and servers for planet tiles generation

I'm currently using openmaptiles to generate planet tiles (zoom 0 to 14 or 15). This is a long process that I plan to run on dedicated servers.
I know this is a service offered by openmaptiles, but I can't afford to spend $1000 or $1200 to generate or buy the tiles.
The README of the openmaptiles project states that quickstart.sh isn't optimized for planet rendering, which is why I'd like to know how to tune the configuration to make it as fast as possible.
To be clear, I will use mbutils to generate tiles from the mbtiles file, which lets me run planet generation on different servers for different zoom levels (e.g. zoom 1 to 9 on one server and 10 to 14 on another). This way I will collect several mbtiles files that I will use to generate and merge .pbf tiles with mbutils.
I read an issue but it didn't change anything for me.
Maybe I can also remove some layers that won't be used on my map? (How would I do that?)
At the moment, when I run the script, it doesn't seem to be using the full CPU capacity.
Thanks for your help.
I found a way to accelerate the process:
I opened a PR against the openmaptiles/generate-vectortiles repo, which contains the Dockerfile of the main container for this project.
In the background, this container uses Mapbox's tilelive project, which allows a big job to be split into smaller ones.
I added two environment variables:
JOBS: the number of jobs the work should be split into
JOB_NUM: the number of the job to run
The fork is here: https://github.com/qlerebours/generate-vectortiles
It allows you to parallelize the process if you have multiple servers to generate the tiles.
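The idea behind the two variables can be pictured as plain modulo sharding of a work list. A sketch under assumptions (0-based JOB_NUM and an invented batches list; this is not the actual tilelive code, whose unit of work is a slice of the tile pyramid rather than a string):

```python
import os

def my_shard(items, jobs, job_num):
    """Return the subset of work items assigned to this job (0-based job_num)."""
    return [item for i, item in enumerate(items) if i % jobs == job_num]

# Hypothetical list of tile batches to render.
batches = [f"batch-{n}" for n in range(10)]

jobs = int(os.environ.get("JOBS", "1"))
job_num = int(os.environ.get("JOB_NUM", "0"))
print(my_shard(batches, jobs, job_num))
```

Run with JOBS=3 JOB_NUM=0 on one server, JOBS=3 JOB_NUM=1 on the next, and so on; each server then renders a disjoint subset of the work.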
You can restrict the layers returned by modifying https://github.com/openmaptiles/openmaptiles/blob/master/openmaptiles.yaml: reduce the layers entry so it contains only the layers you require.
For example, I only needed building data, so I changed the file so that the layers section contained only the following.
layers:
- layers/building/building.yaml
I worked this out by going through the history of the openmaptiles repository, and it worked for me.
Hope this helps! If you find other ways to speed up the process, it would be good to share them!
Thanks
-Rufus

Satellite 6 Job Invocation search query : using facts (faster)

I'm using Satellite 6 to manage EL 5, 6, and 7.x hosts.
I've been trying to run a Job Invocation (via Monitor -> Jobs -> Run Jobs) on a set of servers, based on a custom fact that I wrote (the fact is called ad_domain and basically tells you whether the host is Active Directory joined or not).
However, I can't figure out how to do this. Is it even possible?
I'm a Satellite newbie, and I don't even know what parameters I can use in the Search Query. Can anyone enlighten me? Is it possible to specify a Facter fact value in the Search Query so that it resolves only to hosts matching that value?
Appreciate your help in advance,
Sue
You can try:
facts.ad_domain = value

Exception in sumTypeTopicCounts

Hi, I am trying to use MALLET to obtain 500 topics, but I hit the exception below. Is this a known issue, and are there any workarounds?
overflow in merging on type 4975
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
at cc.mallet.topics.ParallelTopicModel.sumTypeTopicCounts(ParallelTopicModel.java:453)
at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:825)
at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:245)
I am using mallet-2.0.8RC2.
Recently, I ran MALLET with two different datasets (one around 100 MB and the other around 1 GB). This kind of exception usually happened with the larger dataset, when I ran it in parallel with a larger iteration count (e.g., 100). It threw ArrayIndexOutOfBoundsException in two different files, WorkerRunnable and ParallelTopicModel, in different spots. When an index reaches the end of the array, the code prints "overflow in merging on type" to the logger, and after that point the program does nothing to recover from the situation. I was able to patch these edge cases by checking the index before accessing the array. This lets me run without crashes, though I am not sure how it might affect the output; it still prints the same "overflow in merging on type" message, but it carries on instead of throwing an exception.
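The index-checking patch described above boils down to guarding the merge loop and logging instead of crashing. A hypothetical sketch of the pattern (Python for brevity; the actual fix lives in MALLET's Java sources, and this function name is invented):

```python
def merge_counts(source, target, type_index, log=print):
    """Add per-topic counts from `source` into `target`, logging and
    skipping any indices that would fall outside `target` instead of
    letting an out-of-bounds error escape mid-training."""
    for i, value in enumerate(source):
        if i >= len(target):
            # Same message MALLET logs, but we recover instead of throwing.
            log(f"overflow in merging on type {type_index}")
            break
        target[i] += value
    return target
```

As noted above, silently dropping the overflowing counts keeps training alive but may change the resulting model, so treat it as a workaround rather than a fix.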
I have uploaded the patches to my GitHub; follow the instructions there. They have resolved the issue for me, and I haven't seen this break again under different circumstances. If they don't resolve your issue, you should probably download the latest version from the MALLET GitHub repository and debug and build it yourself.
I have also uploaded both datasets; both cover four years of data (1 Jan 2015 to 1 Jan 2019). The smaller one is Stack Exchange (Data Science) and the larger one is Reddit (9 data-science subreddits) (datasets), in case you would like to play with them.
Good luck.

Predict Class Probabilities in Spark RandomForestClassifier

I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predicted probabilities from the models, but I only see the predicted classes, not the probabilities. According to this issue link, the issue is resolved, and it leads to this GitHub pull request and this. However, it seems it was resolved in version 1.5. I'm using AWS EMR, which provides Spark 1.4.1, and I still have no idea how to get the predicted probabilities. If anyone knows how to do this, please share your thoughts or solutions. Thanks!
I have already answered a similar question before.
Unfortunately, with MLlib you can't get the per-instance probabilities for classification models up to version 1.4.1.
There are JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic, which are IN PROGRESS as I'm writing this answer. Nevertheless, the issue seems to have been on hold since November 2014:
There is currently no way to get the posterior probability of a prediction with a Naive Bayes model during prediction. This should be made available along with the label.
And here is a note from Sean Owen on the mailing list, on a similar topic regarding the Naive Bayes classification algorithm:
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.
Reference : source.
This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.
Concerning AWS, there is not much you can do about that right now. One option is to fork emr-bootstrap-actions for Spark and configure it for your needs; then you'll be able to install Spark on AWS in a bootstrap step.
Nevertheless, this might be a little complicated.
There are some things you'll need to consider:
Update the spark/config.file to install your Spark 1.5. Something like:
+3 1.5.0 python s3://support.elasticmapreduce/spark/install-spark-script.py s3://path.to.your.bucket.spark.installation/spark/1.5.0/spark-1.5.0.tgz
The file listed above must be a proper build of Spark located in an S3 bucket you own for the time being.
To build Spark, I advise reading the examples section about building-spark-for-emr and also the official documentation. That should be about it! (I hope I haven't forgotten anything.)
EDIT : Amazon EMR release 4.1.0 offers an upgraded version of Apache Spark (1.5.0). You can check here for more details.
Unfortunately, this isn't possible with version 1.4.1. If you can't upgrade, you could extend the random forest class and copy some of the code I added in that pull request, but be sure to switch back to the regular version once you are able to upgrade.
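For intuition about what that pull request exposes: for a voting ensemble, a simple per-instance class probability is the fraction of trees that predict each class. A self-contained sketch of the idea (not Spark code; Spark's actual implementation aggregates per-tree class distributions rather than hard votes):

```python
from collections import Counter

def rf_class_probabilities(tree_predictions):
    """Turn the individual votes of an ensemble's trees into class
    probabilities: P(class) = votes for class / number of trees."""
    votes = Counter(tree_predictions)
    total = len(tree_predictions)
    return {cls: count / total for cls, count in votes.items()}

# e.g. 10 trees where 7 vote class 0 and 3 vote class 1
print(rf_class_probabilities([0] * 7 + [1] * 3))  # → {0: 0.7, 1: 0.3}
```

Extending the classifier as suggested above amounts to collecting those per-tree votes yourself and exposing them alongside the predicted label.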
Spark 1.5.0 is now supported natively on EMR with the emr-4.1.0 release! There is no more need to use emr-bootstrap-actions, which, by the way, only work on 3.x AMIs, not emr-4.x releases.

accessing command line arguments for headless NetLogo in the Matlab extension

I'm running the MATLAB extension for NetLogo in headless (non-GUI) mode. I've downloaded the extension source and am trying to access the command-line arguments from the Java code in the extension. The command-line arguments are stored in LabInterface.Settings, and I would like to be able to access that object from the extension's Java code. I've been working on this for a couple of days but have had no success. It seems the extension mechanism is designed to create primitives to be used inside NetLogo; these primitives have knowledge of the various NetLogo objects, but there is no way for the extension's Java code to access them. I would appreciate any help.
I would like to be able to run multiple NetLogo-MATLAB analyses with varying parameters, in batch mode across multiple machines, perhaps a Flux cluster. I need to run headless because of the batch nature of the work. Sometimes the runs will be on the same machine, sometimes split across multiple machines, Flux or Condor. I know similar functionality exists in NetLogo for running varying parameters in a single session. Is there some way to split these runs across multiple machines?
Currently, I create a series of setup files for NetLogo, where each setup file holds the parameters that vary for that run. Then I submit each NetLogo and setup-file combination as a single run, and each run can be farmed out to a separate machine or processor. Adding the MATLAB extension complicates this: the extension connects its server to port 9999, and with multiple servers running, they all attach to port 9999, which causes problems. I was hoping to use the setup-file name to derive independent port numbers tied to the setup files. This way I could create a unique socket, and hence a unique server connection, for each NetLogo run.
NetLogo doesn't provide a facility for distributing model runs on a cluster, but various people have done it anyway. See:
http://ccl.northwestern.edu/netlogo/docs/faq.html#cluster
https://github.com/jurnix/netlogo-cluster
http://mass.aitia.ai/index.php/intro/meme
and past threads about it on the netlogo-users group. There is no single standard solution.
As for getting access to LabInterface.Settings, it appears to me from looking through the NetLogo source code that the settings object isn't actually stored anywhere. It's just handed off from method to method, ultimately to lab.Lab.run, without ever actually being kept. So trying to get access to the name of the setup file won't work.
So you'll need some other way to make the extension generate unique port numbers. It seems to me there are any number of possible solutions. At the time you generate the setup file you know its name, so you could generate a port number at the same time and include it in the experiment definition contained in the file. Or you could pass a port number in a Java system property (using -D) when you start NetLogo. Or you could generate a port number based on the process ID of the JVM. Or you could have the extension try port 9999, see if it's already in use, and if it is, try a different port. That's just a few ideas... I could probably come up with ten more.
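The last idea (try 9999 and fall back to the next free port) only needs a few lines of socket code. A sketch in Python for brevity (the extension itself is Java, and this function name is invented):

```python
import socket

def find_free_port(start=9999, attempts=50):
    """Return the first port from `start` upward that can be bound,
    so each NetLogo/MATLAB run gets its own server port."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port  # socket closes on exit, freeing the port for the server
            except OSError:
                continue  # port already taken by another run; try the next one
    raise RuntimeError("no free port found")

print(find_free_port())
```

Note the check-then-use race: another process could grab the port between the probe and the server's own bind, so a more robust version would bind once and hand that socket to the server directly.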