How many iterations are saved by JAGS/BUGS when burnin and thinning are specified? - winbugs

I have a quick question about the details of running a model in JAGS and BUGS.
Say I run a model with n.burnin=5000, n.iter=5000 and thin=2. Does this mean that the program will:
Run 5,000 iterations, and discard results; and then
Run another 10,000 iterations, only keeping every second result?
If I save these simulations as a CODA object, are all 10,000 saved, or only the thinned 5,000? I'm just trying to understand which set of iterations are used to make the ACF plot?

With JAGS, n.burnin=5000, n.iter=5000 and thin=2, means you keep nothing. You run 5000, discard the first 5000 of these 5000 and then only keep a half of the remaining values of the chain (keep 1 value and discard the next one ..).
Use for example n.burnin=2000, n.iter=7000, thin=50, n.chains=5 : so you have (7000-2000)/50 * 5 = 500 values.

Could you be more specific which software you're talking about? It looks like you're referring to the arguments of the function bugs() in the R2WinBUGS package (except that the argument is called n.thin not thin). Looking at help(bugs) it just says n.burnin is the "number of iterations to discard at the beginning". Which doesn't specifically answer your question, but looking at the source for bugs.script() in that package suggests to me that it would run 5000 iterations burn in, as you suspected. You could send a suggestion to the maintainers of that package to clarify their documentation.
In your example, bugs() would then run 0 further iterations after the burn-in. Here the documentation is clearer - n.iter is the total number of iterations including the burn-in.
For your second question, the CODA output from WinBUGS (and any software which calls WinBUGS or OpenBUGS) will only include the thinned sample.

Related

How to use "Easy edge trace" and "edge trace distances" in ImageJ?

I have already installed the both plugins but don't know how to use them for pod analysis. Need help in that as i don't have programming background. Also can we use it for batch processing of images, in case i have more than 100 images?
Another approach per specially coded ImageJ-macro gives reasonable estimates of the widths and lengths of all pods in the sample image. You can access the macro code from here. Unzip the zip-archive and drop the file "plantPodDimensions.ijm" onto the ImageJ main window. Then open the sample image and run the macro. The estimated pod dimensions appear in a table.
Specimen [right to left] Mean Pod Width [cm] Pod Length [cm]
OHiI7_pod-1 0.70±0.11 23.6
OHiI7_pod-2 0.59±0.09 22.3
OHiI7_pod-3 0.64±0.05 20.7
OHiI7_pod-4 0.41±0.04 20.5
OHiI7_pod-5 0.66±0.07 22.9
OHiI7_pod-6 0.68±0.10 24.4
OHiI7_pod-7 0.60±0.07 20.5
Of course it couldn't be tested, if the macro works as expected for other images than the sample image.
With the download of the plugins come documents that explain the use of the plugins. Batch processing is possible, if the starting points of the traces are known or can somehow be determined by additional pre-processing steps (not trivial). Both plugins are macro-recordable. In any case, batch processing will require some macro code.
For the use case in question I would recommend to perform the analyses via the GUI, not per batch processing. The coding of a suitable macro would take more time than the processing of 100+ images.

Setting up openai gym

I've been given a task to set up an openai toy gym which can only be solved by an agent with memory. I've been given an example with two doors, and at time t = 0 I'm shown either 1 or -1. At t = 1 I can move to correct door and open it.
Does anyone know how I would go about starting out? I want to show that a2c or ppo can solve this using an lstm policy. How do I go about setting up environment, etc?
To create a new environment in gym format, it should have the 5 functions mentioned in the gym.core file.
https://github.com/openai/gym/blob/e689f93a425d97489e590bba0a7d4518de0dcc03/gym/core.py#L11-L35
To lay this down in steps-
Define observation space and action space for your environment, preferably using gym.spaces module.
Write down the step function which performs agent's action and returns a 4 tuple containing - next set of observations from the environment , reward ,
done - a boolean indicating whether the episode is over , and some extra info if you want.
Write a reset function for the environment to reinitialise the episode to a random start state and return a 4 tuple similar to step.
These functions are enough to be able to run an RL agent on your environment.
You can skip the render, seed and close functions if you want.
For the task you have defined,you can model the observation and action space using Discrete(2). 0 for first door and 1 for second door.
Reset would return in it's observation which door has the reward.
Then agent would choose either of the door - 0 or 1.
Then perform a environment step by calling step(action), which will return agent's reward and done flag as true - signifying that the episode is over.
Frankly, the problem you describe seems too simple to accomplish for any reinforcement learning algorithm, but I assume you have provided that as an example.
Remembering for longer horizons is usually harder.
You can read their documentation and toy environments to understand how to create one.

Spark::KMeans calls takeSample() twice?

I have many data and I have experimented with partitions of cardinality [20k, 200k+].
I call it like that:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
and I see that initRandom() calls takeSample() once.
Then the takeSample() implementation doesn't seem to call itself or something like that, so I would expect KMeans() to call takeSample() once. So why the monitor shows two takeSample()s per KMeans()?
Note: I execute more KMeans() and they all invoke two takeSample()s, regardless of the data being .cache()'d or not.
Moreover, the number of partitions doesn't affect the number takeSample() is called, it's constant to 2.
I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!
I brought this to the mailing list of the Spark devs, so I am updating:
Details of 1st takeSample():
Details of 2nd takeSample():
where one can see that the same code is executed.
As suggested by Shivaram Venkataraman in Spark's mailing list:
I think takeSample itself runs multiple jobs if the amount of samples
collected in the first pass is not enough. The comment and code path
at GitHub
should explain when this happens. Also you can confirm this by
checking if the logWarning shows up in your logs.
// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
numIters += 1
}
However, as one can see, the 2nd comment said it shouldn't happen often, and it does happen always to me, so if anyone has another idea, please let me know.
It was also suggested that this was a problem of the UI and takeSample() was actually called only once, but that was just hot air.

SB37 JCL error to small a size?

The following space allocation is giving me an sB37 JCL error. The cobol size of the output file is 100 bytes and the lrecl size is 100 bytes. What do you think is causing this error? I have tried increase the size to 500,100 and still get the same error.
Code:
//OUTPUT1 DD DSN=A.B.C,DISP=(NEW,CATLG,DELETE),
// DCB=(LRECL=100,BLKSIZE=,RECFM=FBM),
// SPACE=(CYL,(10,5),RLSE)
Try to increase not only the space, but the volume as well.
Include VOL=(,,,#) in your DD. # is the numbers of values you want to allocate
Ex: SPACE=(CYL,(10,5),RLSE),VOL=(,,,3) - includes 3 volumes.
Additionally, you can increase the size, but try to stay within reasonable limits :)
The documentation for B37 says the application programmer should respond as indicated for message IEC030I. The documentation for IEC030I says, in part...
Probable user error. For all cases, allocate as many units as volumes
required.
...as noted in another answer. However, be advised that the documentation for the VOL parameter of the DD statement says...
If you omit the volume count or if you specify 1 through 5, the system
allows up to five volumes; if you specify 6 through 20, the system
allows 20 volumes; if you specify a count greater than 20, the system
allows 5 plus a multiple of 15 volumes. You can override the maximum
volume count in data class by using the volume-count subparameter. The
maximum volume count for an SMS-managed mountable tape data set or a
Non-managed tape data set is 255.
...so for DASD allocations you are best served specifying a volume count greater than 5 (at least).
//OUTPUT1 DD DSN=A.B.C,DISP=(NEW,CATLG,DELETE),
// DCB=(LRECL=100,BLKSIZE=,RECFM=FBM),
// SPACE=(CYL,(10,5),RLSE)
Try this instead. Notice that the secondary will take advantage of a large dataset whereas without that parameter the largest secondary that makes any sense is < 300. Oh, and if indeed it is from a COBOL program make sure that the FD says "BLOCK 0"!!!!! If it isn't "BLOCK 0" then you might not even need to change your JCL because it wasn't fixed block machine. It was merely fixed and unblocked so the space would almost never be enough. And finally you may wish to revisit why you have the M in the RECFM to begin with. Notice also that I took out the LRECL, the BLKSIZE and the RECFM. That is because the FD in the COBOL program is all you need and putting it in the JCL is not only redundant but dangerous because any change will have to be now done in multiple places.
//OUTPUT1 DD DSN=A.B.C,DISP=(NEW,CATLG,DELETE),
// DSNTYPE=LARGE,UNIT=(SYSALLDA,59),
// SPACE=(CYL,(10,1000),RLSE)
There is a limit of 65,535 tracks per one volume. So if you will specify a SPACE that exceeds that limit - system will simply ignore it.
You can increase this limit to 16,777,215 tracks by adding DSNTYPE=LARGE paramter.
Or you can specify that your dataset is a multi volume by adding VOL=(,,,3)
You can also use DATACLAS=xxxx paramter here, however first of all you need to find it. Easy way is to contact your local Storage Team and ask for one. Or If you are familiar with ISPF navigation, you can enter ISMF;4 command to open a panel
use bellow paramters before hitting enter.
CDS Name . . . . . . 'ACTIVE'
Data Class Name . . *
It should produce a list of all available data classes. Find the one that suits you ( has enougth amount of volume count, does not limit primary and secondary space

Long data load time in Matlab

I have four variables, each saved in 365 mat-files (size: 8 x 92 x 240). I try to load these into my function in a for-loop day=1:365, one variable file per day. However, the two first variables always take abnormally long time to load. My code for loading looks like this:
load([eraFolder sprintf('Y%dD%d-tempSD.mat',year,day)], 'tempSD'); % took 5420 s to load
load([eraFolder sprintf('Y%dD%d-tempDewSD.mat',year,day)], 'tempDewSD')
load([eraFolder sprintf('Y%dD%d-eEraSD.mat',year,day)], 'eEraSD'); % took 6 seconds to load
load([eraFolder sprintf('Y%dD%d-pEraSD.mat',year,day)], 'pEraSD');
Using Profiler, I could see that the first two variables took 5420 seconds to load in 365 calls, whereas the the last two variables took 6 and 4 seconds to load respectively over 365 calls. When I swap the order in which variables are loaded, e.g. eEraSD before tempSD, it is still the first two loads that take more time.
When using tic toc to track the time spent on loading, it appears that the time to load a the first or second variable exponentially increases with the number of calls (with the last calls taking 50 seconds to run) . For the third and fourth variable, the loading time stays around 0.02-0.04 seconds per file, more or less independent on how far in the for loop I have gone. See figures below.
When using importdata instead of 'load', the first line takes about 8000 seconds to load 365 times (with the loading exponentially increasing as shown for T in the second figure). The other lines then take about 10 seconds to load 365 times.
I can't understand why it looks this way and what I can do to decrease the loading time. Would greatly appreciate any idea of a possible solution for this.
I suppose your data sets are in the same directory(over network or local) and with same attributes e.g. access properties and so on.
Then the only option left is with the charateristics of the vairbales stored in those matfiles. Can you check how much those variables appear in size e.g. by loading a sample one. This will narrow down to solve your problem.
Hope that help.
FS
Thanks for your help. I finally found out what caused the problem. In a 'for' loop later in the script, I saved other data to a folder I called temp. After renaming that folder to something else (e.g. temporary), the data loading problem disappeared.
(Doesn't matter so much now that the practical problem is solved, but I can't really say I understand why there was this peculiar relationship between the later 'save' call and this 'importdata' or 'load' call.)
Please see new question about the temp folder