Sequence Tagging in batch with the Mallet command prompt

I have tested SimpleTagger for sequence tagging on Mallet's command prompt interface. I would now like to train over many files and run tests in batches. Is this also possible from Mallet's command prompt? I want to get a sense of the algorithm's performance on the task at hand before I dive into using the Java API.
I have seen that classification tasks can be run in batch from the command prompt. Is it possible to use SimpleTagger in batch as well?
If not, can someone point me to reference code where sequence tagging has been done in batch using the Java API?
Somewhere I found a reference to "http://mallet.cs.umass.edu/index.php/Command_line_tutorial", but the link seems to be broken.

After some exploration, I learned that it is not possible to readily use cc.mallet.fst.SimpleTagger for batch evaluations. Instead, I found that cc.mallet.examples.TrainCRF is a handy piece of example code (it consumes the same data format as SimpleTagger). The code takes a training and a test dataset (in Mallet's sequence-tagging format, with instances separated by blank lines) as its input arguments, and that's it.
I used the mallet-2.0.8 installation available on the Mallet page.
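For reference, here is a condensed sketch of what cc.mallet.examples.TrainCRF does in mallet-2.0.8. This is paraphrased from the bundled example rather than copied verbatim, so treat it as a sketch; the pipe, trainer and evaluator classes are the ones Mallet ships with:

    import java.io.FileReader;
    import java.util.regex.Pattern;

    import cc.mallet.fst.CRF;
    import cc.mallet.fst.CRFTrainerByLabelLikelihood;
    import cc.mallet.fst.PerClassAccuracyEvaluator;
    import cc.mallet.fst.SimpleTagger;
    import cc.mallet.fst.TokenAccuracyEvaluator;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.iterator.LineGroupIterator;
    import cc.mallet.types.InstanceList;

    public class TrainCRFSketch {
        public static void main(String[] args) throws Exception {
            // args[0] = training file, args[1] = test file, both in the
            // SimpleTagger format: one token per line, blank line between instances.
            Pipe pipe = new SimpleTagger.SimpleTaggerSentence2FeatureVectorSequence();
            pipe.setTargetProcessing(true); // last field on each line is the label

            InstanceList training = new InstanceList(pipe);
            training.addThruPipe(new LineGroupIterator(
                    new FileReader(args[0]), Pattern.compile("^\\s*$"), true));

            InstanceList testing = new InstanceList(pipe);
            testing.addThruPipe(new LineGroupIterator(
                    new FileReader(args[1]), Pattern.compile("^\\s*$"), true));

            // Linear-chain CRF over the labels observed in the training data.
            CRF crf = new CRF(pipe, null);
            crf.addFullyConnectedStatesForLabels();

            CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
            trainer.setGaussianPriorVariance(10.0);
            trainer.addEvaluator(new PerClassAccuracyEvaluator(testing, "testing"));
            trainer.addEvaluator(new TokenAccuracyEvaluator(testing, "testing"));
            trainer.train(training, 500); // at most 500 optimization iterations
        }
    }

Compile and run it against mallet.jar and the jars in Mallet's lib directory, passing the training and test file paths as the two arguments.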
One caveat: beware of tuning the model based on its performance on the test set. Avoid that, and ideally do not check test-set performance at all until you have finished tuning the model on the training data.

Related

Parameters Variation not running model in AnyLogic

When I create a ParametersVariation simulation, the main model does not run. All I see is the default experiment UI showing iterations completed and replications. My end goal (as with most people) is to have the model go through a certain number of replications, but nothing is even running. There is limited documentation available on this. Please advise.
This is how Parameters Variation is intended to work. If you're running 1000 runs and multiple replications with parallel runs, how can you see what's happening in Main in each?
Typically, the best way to benefit from such an experiment is to track the results of each run using elements from the Analysis palette or even better to export results to Excel or similar.
To be able to collect data, you need to write code in the experiment's Java action fields, using root. to access elements of Main (or of whatever your top-level agent is).
Check the example below, where after each run a variable from Main is added to a dataset in the Parameters Variation experiment. At the end of 100 runs, for example, the dataset will hold 100 values of the Main variable, one per run.
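A minimal version of that action code might look like this (hypothetical names: results is a dataset defined on the experiment canvas, output is a variable defined in Main; the line goes in the experiment's "After simulation run" action field):

    // "After simulation run" action of the Parameters Variation experiment:
    // record this run's result. 'root' is the top-level (Main) agent of the
    // run that just finished; 'results' and 'output' are hypothetical names.
    results.add(getCurrentIteration(), root.output);

After all iterations complete, results holds one point per run and can be shown in a chart on the experiment canvas or exported to Excel or similar.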

Running Netlogo headless on the cloud

I've written a NetLogo model to model agent movement in a landscape. I'd like to run this model from the command prompt, using AWS/Google Compute Engine. The model uses about 500 MB worth of input rasters and shapefiles and writes rasters and CSV files. It also uses the gis, rnd, cf, table and csv extensions.
Would this be possible using the Controlling API? (https://github.com/NetLogo/NetLogo/wiki/Controlling-API). Can I just use the steps listed in the link? I have not tried running NetLogo from the command prompt before.
Also, I do not want to run BehaviorSpace, as it is not relevant to this model.
A BehaviorSpace experiment can consist of only a single run, so BehaviorSpace may actually be relevant to you here. You only need to write one short XML setup file (or no new files at all, if the experiment setup you want is already saved in the model) to do it this way, and you can then launch the experiment headless from the command line (netlogo-headless.sh with --model plus --experiment or --setup-file, writing output with --table or --spreadsheet).
Whereas if you go through the controlling API, you will have to write and compile Java (or Scala) code, which is a substantially more complex task.
But if you decide to go the Controlling API route: yes, that works too, and it is documented, as you've already noticed; a minimal sketch follows.
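For completeness, Controlling API usage looks roughly like this (adapted from the NetLogo Controlling API guide; the model path, run length and reporter are placeholders for your own):

    import org.nlogo.headless.HeadlessWorkspace;

    public class RunHeadless {
        public static void main(String[] argv) {
            // HeadlessWorkspace runs the model with no GUI at all,
            // which is what you want on an AWS/GCE instance.
            HeadlessWorkspace workspace = HeadlessWorkspace.newInstance();
            try {
                workspace.open("mymodel.nlogo");        // placeholder path
                workspace.command("setup");
                workspace.command("repeat 500 [ go ]"); // placeholder run length
                System.out.println(workspace.report("count turtles"));
                workspace.dispose();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

Extensions such as gis and csv load exactly as they do in the GUI, provided they are installed on the machine you run on.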

Talend job batch processing

I am exploring Talend at work. I was asked whether Talend supports batch processing, as in running a job in multiple threads. After going through the user guide I understood that threading is possible with subjobs. I would like to know whether it is possible to run a single job in parallel.
Talend has excellent multi-threading support. There are two basic methods for this. One method gives you more control and is implemented using components; the other is implemented as a job setting.
For the first method, see my screenshot. I use tParallelize to load three files into three tables at the same time; then, when all three files are successfully loaded, I use the same tParallelize to set the values of a control table. tParallelize can also be connected to tRunJob just as easily as to a subjob.
The other method is described very well in the Talend Help article "Run Jobs in Parallel".
Generally I recommend the first method because of the control it gives you, but if your job follows the simple pattern described in the help link, that method works as well.
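To make the fan-out/join behaviour concrete: at the job level, tParallelize is doing what the pattern below does in plain Java. This is not Talend-generated code, just a rough sketch of the equivalent logic with placeholder method names:

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelLoadSketch {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(3);

            // Fan out: three loads run concurrently, like three subjobs
            // hanging off tParallelize.
            List<Callable<Void>> loads = List.of(
                    () -> { load("file1.csv", "table1"); return null; },
                    () -> { load("file2.csv", "table2"); return null; },
                    () -> { load("file3.csv", "table3"); return null; });

            // invokeAll blocks until every load has finished --
            // the synchronization step tParallelize provides.
            pool.invokeAll(loads);
            pool.shutdown();

            // Join: runs only after all three loads complete, like the
            // synchronized link that updates the control table.
            updateControlTable();
        }

        static void load(String file, String table) { /* placeholder */ }
        static void updateControlTable()            { /* placeholder */ }
    }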

Getting the current Experiment instance at runtime

I'm running JUnit 4 with AnyLogic. In one of my tests, I need access to the Experiment running the test. Is there any clean way to access it at runtime? E.g., is there a static method along the lines of Experiment.getRunningExperiment()?
There isn't a static method that I know of. (If there were, it might be complicated by multi-run experiments, which permit parallel execution; then again, perhaps not, since there is still a single Experiment, though there would be thread-safety issues.)
However, you probably need to explain more about your usage context. If you're using AnyLogic Pro and exporting the model to run standalone, then you should have access to the experiment instance anyway (as in the help topic "Running the model from outside without UI"). Otherwise, you can use getEngine().getExperiment() from within the model, along the lines of the sketch below.
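From inside the model (an agent's action field, a function in Main, and so on), that call is simply (a minimal sketch; the cast is only needed if you want your concrete experiment class, whose name depends on your project):

    // Anywhere in the model: the Engine driving this run knows
    // which Experiment launched it.
    Experiment<?> exp = getEngine().getExperiment();

    // Cast if you need your concrete experiment class, e.g. the default
    // simulation experiment ('Simulation' is AnyLogic's default name):
    // Simulation sim = (Simulation) getEngine().getExperiment();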
Are you trying to run JUnit tests from within an Experiment? If so, what's your general design? Obviously JUnit doesn't sit as well in that scenario, since it 'expects' to be instantiating and running the thing under test.
For my automated tests (where I can't export standalone, because I don't use AnyLogic Pro), I judged it easier to avoid JUnit (it's just a framework, after all) and implement the tests 'directly': my model components write outputs and, at the end of the run, the Experiment compares those outputs to pre-prepared expected ones and flags whether the test passed or failed.
With AnyLogic Pro, you could still export standalone and use JUnit to run the 'already-a-test' Experiments, with the JUnit test checking the Experiment for a testPassed Boolean set at the end (or whatever); a sketch of that follows.
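If you take that route, the JUnit side can stay very thin. A hypothetical sketch, where ModelTestExperiment is your exported experiment class and testPassed is the Boolean flag your Experiment sets at the end of the run (both names are placeholders, and how you start and wait for the exported experiment depends on the standalone-export API described in the AnyLogic help):

    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    public class ModelRegressionTest {

        @Test
        public void experimentRunPassesItsOwnChecks() {
            // Hypothetical experiment class from an AnyLogic Pro standalone export.
            ModelTestExperiment experiment = new ModelTestExperiment();

            // Stand-in for however the exported experiment is started and run
            // to completion (see "Running the model from outside without UI").
            experiment.run();

            // The Experiment compared its outputs to the expected ones at the
            // end of the run and recorded the verdict in this flag.
            assertTrue("model outputs did not match the expected results",
                       experiment.testPassed);
        }
    }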
The fact that you want to get at the running experiment suggests that you may be doing this while runs are executing. If so, could you explain a bit more about your requirements?

MATLAB distributed computing with SGE (qsub)

Recently I got access to run my code on a cluster. My code is totally parallelizable, but I don't know how best to use its parallel nature. I have to compute the elements of a big matrix, and each element is independent of the others. I want to submit the job to run on several machines (say, 100) to speed up the computation of the matrix.
Right now, I have a script that submits multiple jobs, each responsible for computing a part of the matrix and saving it in a .mat file. At the end I merge them to get the whole matrix. To submit each individual job, I create a new .m file (run1.m, run2.m, ...) that sets a variable and then runs the function that computes the associated part of the matrix. So basically run1.m is
id=1;compute_dists_matrix
and compute_dists_matrix uses id to find the part it is going to compute. I then wrote a script to create run1.m through run60.m and qsub them to the cluster.
I wonder if there is a better way to do this, using built-in MATLAB features for example, because this seems like a very typical task.
Yes, it works, but it is not ideal, and as you say this is a common problem. MATLAB has the Parallel Computing Toolbox (and MATLAB Distributed Computing Server for clusters).
Does your cluster have it? If so, distributed arrays are worth having a look at. If you don't have access to it, then what you are doing is the only other way. You can wrap your run1.m, run2.m, ... in a controlling script to automate it for you.
I believe you could pass the id as a command-line argument and submit jobs over a range of values for it (SGE's array jobs, qsub -t, expose such an index as $SGE_TASK_ID). Command-line arguments can be handled by launching MATLAB from the command line without the IDE (e.g., with matlab -nodisplay -r) and providing the name of the script to be executed along with the list of arguments. You should also be able to set up dependencies in your job manager and create a "reduce" script that merges the partial results from the files. The whole process could then be managed from a single script that generates the id and the other necessary arguments, and submits the processing and post-processing jobs with the appropriate dependencies.