MATLAB distributed computing with SGE (qsub)

Recently I got access to a cluster to run my code on. My code is fully parallelizable, but I don't know how best to exploit that. I have to compute the elements of a big matrix, and each element is independent of the others. I want to submit the job to run on many machines (around 100) to speed up the computation of the matrix.
Right now, I have a script that submits multiple jobs, each responsible for computing one part of the matrix and saving it in a .mat file. At the end I merge the files to get the whole matrix. To submit each individual job, I create a new .m file (run1.m, run2.m, ...) that sets a variable and then runs the function that computes the associated part of the matrix. So basically run1.m is
id=1;compute_dists_matrix
and compute_dists_matrix then uses id to find the part it is going to compute. I wrote a script that creates run1.m through run60.m and then submits them to the cluster with qsub.
I wonder if there is a better way to do this, for example using some MATLAB features, because this seems to be a very typical task.

Yes, it works, but it is not ideal, and as you say it is a common problem. MATLAB has the Parallel Computing Toolbox.
Does your cluster have it? If so, distributed arrays are worth a look. If the cluster doesn't have it, then what you are doing is the only other way. You can wrap your run1.m, run2.m, ... in a controlling script to automate it for you.

I believe you could pass the id as a command-line argument and submit jobs over a range of id values. MATLAB can be launched from the command line without the IDE, given the name of the script to execute and its arguments, for example matlab -nodisplay -r "id=1; compute_dists_matrix; exit". You should also be able to set up dependencies in your job manager and create a "reduce" script that merges the partial results from the files. The whole process could then be managed by a single script that generates the id and other necessary arguments and submits the processing and postprocessing jobs with their dependencies.
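However the id reaches the worker, each job still has to turn it into "its" part of the matrix. The arithmetic is independent of the language, so here is a minimal sketch in Java rather than MATLAB. The class name BlockPartition, the method rowsFor and the sizes in main are all made up for illustration; the question does not say how compute_dists_matrix actually splits the work.

```java
// Hypothetical helper (not from the question): maps a 1-based task id
// to the contiguous block of matrix rows that task should compute.
final class BlockPartition {

    /**
     * Returns {firstRow, lastRow}, 1-based and inclusive, for task `id`
     * out of `numTasks` tasks over `numRows` rows. If there are more tasks
     * than rows, surplus tasks get an empty range (lastRow < firstRow).
     */
    static int[] rowsFor(int id, int numTasks, int numRows) {
        if (id < 1 || id > numTasks) {
            throw new IllegalArgumentException("id out of range: " + id);
        }
        int base = numRows / numTasks;   // rows that every task gets
        int extra = numRows % numTasks;  // the first `extra` tasks get one more
        int first = (id - 1) * base + Math.min(id - 1, extra) + 1;
        int count = base + (id <= extra ? 1 : 0);
        return new int[] {first, first + count - 1};
    }

    public static void main(String[] args) {
        int numTasks = 60;    // example value: one task per submitted job
        int numRows = 1000;   // example value: matrix size is an assumption
        for (int id = 1; id <= numTasks; id++) {
            int[] r = rowsFor(id, numTasks, numRows);
            System.out.println("task " + id + ": rows " + r[0] + ".." + r[1]);
        }
    }
}
```

Spreading the remainder over the first few tasks keeps block sizes within one row of each other, so no single job ends up noticeably longer than the rest.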

Related

Parameters Variation not running model in AnyLogic

When I create a ParametersVariation experiment, the main model does not run. All I see is the default UI showing the completed iterations and replications. My end goal (as with most people) is to have the model go through a certain number of replications, but nothing even seems to be running. There is limited documentation available on this. Please advise.
This is how Parameters Variation is intended to work. If you're running 1000 runs and multiple replications with parallel runs, how can you see what's happening in Main in each?
Typically, the best way to benefit from such an experiment is to track the results of each run using elements from the Analysis palette or even better to export results to Excel or similar.
To collect data, you need to write your code in the experiment's Java action fields, using the root. prefix to access elements in Main (or whatever your top-level agent is).
For example, after each run a variable from Main can be added to a dataset in the Parameters Variation experiment. At the end of, say, 100 runs, the dataset will hold 100 values of that variable, one per run.

Writing different experiment output run to different cells in a sheet (Excel file)

I am running a model with several different input values at once, and it produces different output on each run. I am trying to write code that gets AnyLogic to write the output of each experiment run to a different cell in an Excel sheet, i.e. throughput vs. time. I am using a dataset. Is there any script or hint that could help solve this?
Currently I am using the following commands. They keep overwriting the output because they always use the same cells.
Out_excelFile1.setCellValue("Sink1 Out",2,2,2);
Out_excelFile1.writeDataSet(Sink1_D,2,3,2);
It is best to use the built-in database for outputs and only write to Excel at the end of all runs.
But in your case, you need to vary the row number with your replication or iteration number. Use getCurrentIteration() or getCurrentReplication() in your "after simulation run", "after replication" or "after iteration" experiment code sections to get this right.
Then, it would look something like Out_excelFile1.setCellValue("Sink1 Out",2,getCurrentIteration(),2);
(Details depend on your actual implementation, check the help for further info on replications, iterations and those functions)
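If the experiment uses both iterations and replications, the iteration number alone is not enough to keep runs from overwriting each other. A minimal sketch of one way to derive a unique row follows. OutputCell and rowFor are hypothetical names, not part of the AnyLogic API, and the sketch assumes both counters start at 1, which is worth checking in the help.

```java
// Hypothetical helper, not part of the AnyLogic API: derives a distinct
// spreadsheet row for every (iteration, replication) pair.
final class OutputCell {

    /**
     * firstRow: row used by the very first run.
     * iteration, replication: assumed to be 1-based counters.
     * replicationsPerIteration: fixed number of replications per iteration.
     */
    static int rowFor(int firstRow, int iteration, int replication,
                      int replicationsPerIteration) {
        return firstRow
                + (iteration - 1) * replicationsPerIteration
                + (replication - 1);
    }

    public static void main(String[] args) {
        int firstRow = 2;              // example value
        int replicationsPerIter = 3;   // example value
        for (int it = 1; it <= 2; it++) {
            for (int rep = 1; rep <= replicationsPerIter; rep++) {
                System.out.println("iteration " + it + ", replication " + rep
                        + " -> row " + rowFor(firstRow, it, rep, replicationsPerIter));
            }
        }
    }
}
```

Inside the experiment, the result would be passed as the row argument, along the lines of Out_excelFile1.setCellValue("Sink1 Out", 2, rowFor(2, getCurrentIteration(), getCurrentReplication(), 3), 2). A dataset written with writeDataSet presumably spans several rows, so there the same offset would more likely belong in the column argument.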

Sequence Tagging in batch with Mallet cmd prompt

I have tested the SimpleTagger for sequence tagging through Mallet's command prompt interface. I would now like to train over many files and run tests in batches. Is this also possible from Mallet's command prompt? I want to get a rough idea of how the algorithm performs on the task at hand before I dive into the Java API.
I have seen that Classification tasks can be run in batch from the command prompt.
Is it possible to use SimpleTagger in batch? If not, can someone point me to reference code where sequence tagging has been done in batch using the Java API?
Somewhere I found a reference to "http://mallet.cs.umass.edu/index.php/Command_line_tutorial", but the link seems to be broken.
After some exploration, I learned that cc.mallet.fst.SimpleTagger cannot readily be used for batch evaluations. Instead, I found that cc.mallet.examples.TrainCRF is a handy piece of code (it uses the SimpleTagger). It takes a training and a test dataset (in Mallet sequence tagging format, with instances separated by a blank line) as input arguments, and that's it.
I used the mallet-2.0.8 installation available on the Mallet page.
Be careful NOT to tune the models based on their performance on the test set. Ideally, do not even look at test set performance until you have tuned the model sufficiently on the training set.
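Since TrainCRF takes one training file and one test file per invocation, a batch can be approximated by launching it once per file pair. A minimal sketch that only builds the command lines is shown here. The class TrainCrfBatch, the classpath string and the file names are assumptions for illustration; adjust them to your Mallet installation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch driver: builds one TrainCRF command per train/test pair.
final class TrainCrfBatch {

    /** Command line for a single TrainCRF run (training file, then test file). */
    static List<String> commandFor(String classpath, String trainFile, String testFile) {
        return List.of("java", "-cp", classpath,
                "cc.mallet.examples.TrainCRF", trainFile, testFile);
    }

    /** One command per {train, test} pair, in the order given. */
    static List<List<String>> commandsFor(String classpath, List<String[]> pairs) {
        List<List<String>> commands = new ArrayList<>();
        for (String[] pair : pairs) {
            commands.add(commandFor(classpath, pair[0], pair[1]));
        }
        return commands;
    }

    public static void main(String[] args) {
        // Assumed classpath layout of a Mallet checkout on a Unix-like system.
        String classpath = "class:lib/mallet-deps.jar";
        // Hypothetical file names.
        List<String[]> pairs = List.of(
                new String[] {"fold1.train.txt", "fold1.test.txt"},
                new String[] {"fold2.train.txt", "fold2.test.txt"});
        for (List<String> cmd : commandsFor(classpath, pairs)) {
            System.out.println(String.join(" ", cmd));
            // To actually run it:
            // new ProcessBuilder(cmd).inheritIO().start().waitFor();
        }
    }
}
```

Printing the commands first makes it easy to check paths before any training time is spent.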

Running Netlogo headless on the cloud

I've written a NetLogo model of agent movement in a landscape. I'd like to run this model from the command prompt, using AWS or Google Compute. The model uses about 500MB worth of input rasters and shapefiles and writes rasters and csv files. It also uses the extensions gis, rnd, cf, table and csv.
Would this be possible using the Controlling API? (https://github.com/NetLogo/NetLogo/wiki/Controlling-API). Can I just use the steps listed in the link? I have not tried running NetLogo from the command prompt before.
Also, I do not want to run BehaviorSpace as it is not relevant to this model.
A BehaviorSpace experiment can consist of only a single run, so BehaviorSpace may actually be relevant to you here. You only need to write one short XML file (or no new files at all, if the experiment setup you want is already part of the model) to do it this way.
Whereas if you go through the controlling API, you will have to write and compile Java (or Scala) code, which is a substantially more complex task.
But if you decide to go the controlling API route: yes, that works too, and it is documented, as you've already noticed.

Run a single job in parallel

How can we run a single job in parallel with different parameters in Talend?
The answer is straightforward, but rather depends on what you want, and whether you are using free Talend or commercial.
As far as parameters go, make sure that your jobs are using context variables - this is the preferred way of passing parameters in.
As for running in parallel, there are a few options.
Talend's studio is a Java code generator, so you can export your job (it's just Java code) and run it wherever you want. How you invoke it is up to you: schedule it, invoke it N times manually, your call. Obviously, if your job touches shared resources, then making it safe to run in parallel is up to you, and the usual concurrency issues apply.
If you have the commercial product, you can use the Talend Administration Center (TAC). The TAC allows you to schedule a job more than once with different contexts. Or, if you want to keep the parallelization logic inside your job, consider using the tParallelize component in one job to run another job N times.
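Because an exported job is plain Java, the "invoke it N times with different contexts" option can also be driven from a small launcher. The sketch below shows only the generic pattern with a thread pool. ParallelRuns, runJob and the "region" parameter are stand-ins invented for illustration; a real launcher would call the exported Talend job at the marked spot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical launcher: runs the same job once per context, in parallel.
final class ParallelRuns {

    // Stand-in for the real job. An exported Talend job would be invoked
    // here instead, with the map entries passed in as context variables.
    static String runJob(Map<String, String> context) {
        return "processed " + context.get("region");
    }

    /** Runs one job per context on a fixed-size pool; results keep input order. */
    static List<String> runAll(List<Map<String, String>> contexts, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Map<String, String> ctx : contexts) {
                Callable<String> task = () -> runJob(ctx);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());   // blocks until that run has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Map<String, String>> contexts = List.of(
                Map.of("region", "north"),
                Map.of("region", "south"),
                Map.of("region", "east"));
        for (String result : runAll(contexts, 2)) {
            System.out.println(result);
        }
    }
}
```

The pool size caps how many runs execute at once, which is one simple way to limit pressure on any shared resource the job touches.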