Parameter Variation: Get dataset for one specific iteration - AnyLogic

I run a parameter variation experiment in AnyLogic and collect histogram data about the number of specific agents. This histogram returns the min, mean, and max value (among others).
I am looking for a way to get the dataset for one specific iteration only (the iteration that is closest to the mean value of the histogram data).
Is there a way to return data for one specific iteration?
Many thanks!

Yes, but not with HistogramData objects.
Use normal Dataset objects in your experiment. In the properties, you can switch to "use x value as iteration" and store any value from your model iterations in the y-value.
Now you have a nice table with data from your individual iterations.
cheers

Related

Parameter Variation in AnyLogic: Data for a specific variation

I am using parameter variation in AnyLogic (in a system dynamics model). I am interested in how one parameter changes across the various iterations. The parameter is binary: 0 when the supply of water is greater than demand and 1 when supply is lower than demand. The parameters being varied are a given percentage of decrease in outdoor irrigation, a given percentage of decrease in indoor water use, and a given percentage of households that have rainwater harvesting systems. Visually, I need a time plot with time on the x-axis (10,950 days, i.e. 30 years) and the binary variable on the y-axis. This should essentially show which iteration pushes a 1 further into the future.
I have watched videos and seen how histograms and 2D data are used to visualize the results of the iterations, but this does not show which iteration produced which output specifically. Is there a way to, first, visually show the output as described above and, second, return the data for a specific iteration?
Many thanks!
Parameter variation experiments have After iteration and After simulation run actions that are executed after each iteration and after each simulation run, respectively. There, it is possible to access the values inside the simulation object after it has finished but before it is destroyed. There is also a getCurrentIteration() method which can be used to control the parameter variation experiment and retrieve the data.
For more detail, see the "SIR Agent Based Calibration" example model in the AnyLogic example models library (Help -> Example Models).

How to pass a vector from tableau to R

I need to pass a vector of arguments to Rserve from Tableau. Specifically, I am doing IRR calculations in R (on Rserve), and I want to pass a vector of cash flows that are stored as columns in my table (instead of rows/measures). So, I want to collect all those cash flows in a vector and pass it on to Rserve. Passing them one at a time slows down I/O.
SCRIPT_REAL("r_func(c(.arg1, .arg2, .arg3))",sum(cf1), sum(cf2), sum(cf3))
cf1..cfn are cash flows corresponding to various periods. The above code works well when the cash flows are few, but takes a long time when I have a few hundred. Further, the time is spent not in calculation but in I/O when communicating with the remote Rserve. With a local Rserve, this calculation finishes within a few seconds, while on the remote one it takes well over a minute.
Also, I want to point out that Tableau/Rserve set one argument after another, and that takes time. My expectation is that once I have a vector, there would be just one transfer and one setting of arguments, which should speed things up.
The first step in understanding how Tableau interacts with R or Python is understanding how Tableau's table calcs work.
Tableau's SCRIPT_XXX() functions are table calculations, which means you invoke them on a vector of aggregate query results, and the corresponding R or Python code needs to return a vector, usually of the same size. (I think you may be able to return a scalar or a smaller vector that gets replicated to match the size of the argument, but I'm not certain.)
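To make that contract concrete, here is a small Python sketch (not from the thread; the function name and the placeholder arithmetic are invented) of what the external script effectively receives and must hand back:

def external_calc(arg1, arg2, arg3):
    # Tableau hands the external engine one list per argument, with one
    # element per mark in the partition; all lists have the same length.
    result = [a + b + c for a, b, c in zip(arg1, arg2, arg3)]  # placeholder math
    # Returning one value per input row is the safe contract; a single scalar
    # may get broadcast back over the partition, but don't rely on it.
    return result

# Simulating what Tableau would pass for a three-row partition:
print(external_calc([100.0, 200.0, 300.0], [10.0, 20.0, 30.0], [1.0, 2.0, 3.0]))
# -> [111.0, 222.0, 333.0]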
You can control how your data is partitioned into vectors, and also the ordering of data in the vectors, by editing the table calc to specify the partitioning and addressing for that calc.
Partitioning determines how your aggregate query results are broken up into vectors for calculation purposes. Addressing determines how the elements of each vector are ordered. You can either do that based on the physical layout of the table structure, or (better) based on the specific dimensions.
See the Tableau online help for table calcs for more info, and look for online training videos from Tableau or blog entries (especially from anyone named Bora).
One way to test your understanding of these concepts is to create a Tableau table (i.e., a viz with a mark type of text) with several dimensions on the row and column shelves. Then create calculated fields for INDEX() and SIZE() and display them on text. Finally, change the partitioning and addressing in different ways by editing those table calcs. Try several different permutations. When you can confidently predict what those functions will produce for different settings, you're ready for more complex tasks, such as talking to R.
It is also instructive to experiment with FIRST(), LAST(), LOOKUP(), WINDOW_SUM(), etc., and finally dig into PREVIOUS_VALUE(). Warning: PREVIOUS_VALUE() is a bit odd and does not behave the way you probably assume it does. Still, it is a useful technique that can implement a recursive calculation, and it is about as close to a for loop as Tableau gets.

Time Series Predictions in PostgreSQL

I am new to PostgreSQL and database systems, and I am currently trying to create a database to store observed values as well as all predictions made in the past for some time series.
I have already built a table (actually a view) for observed values, with rows looking basically like:
(time, object, value)
Now I want to store predictions, meaning, for each time, what some software predicted for the next N time steps, where N is variable since the software has different prediction types.
I have thought about multiple solutions, which are the following:
1. Store each prediction as a row, using max(N) = 240 columns, i.e. (time, object, value 1, value 2, ..., value 240).
2. Store each prediction as a row, with the prediction values as binary JSON, i.e. (time, object, JSONB prediction).
3. Store each prediction value as its own row, with a column specifying the delay of the prediction in hours, i.e. (time, object, delay, value).
I don't know how each of these choices would affect performance when I retrieve and compute summary values on the predictions. A typical thing I would like to do is retrieve the performance of the prediction for some delay, i.e. how big the prediction error is when we predict x days ahead, and I need this query to execute quickly so I can display it in a dashboard.
Which choice do you think is the best? Or do you have any other idea?
Thanks a lot!
Without further information about the access patterns for the collected data, I would strongly recommend using jsonb.
Using one column per time step will result in bloat of the system catalog and statistics.
If you need to filter on the prediction values, you also don't want to maintain 240 indexes.
If you don't need to use these values within a WHERE condition, you may use json instead of jsonb.
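For illustration, a minimal Python/psycopg2 sketch of the jsonb option, assuming a prediction table, column names, and connection string of my own choosing plus the (time, object, value) view from the question; adjust names and keys to your schema:

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=timeseries")  # placeholder connection string
cur = conn.cursor()

# One row per forecast issue time; the horizon values live in one jsonb map
# keyed by the delay in hours, so N can vary per prediction type.
cur.execute("""
    CREATE TABLE IF NOT EXISTS prediction (
        pred_time timestamptz NOT NULL,
        object    text        NOT NULL,
        pred      jsonb       NOT NULL
    )
""")

cur.execute(
    "INSERT INTO prediction (pred_time, object, pred) VALUES (%s, %s, %s)",
    ("2024-01-01 00:00+00", "station_7", Json({"1": 12.3, "2": 12.9, "24": 15.0})),
)

# Mean absolute error when predicting delay hours ahead, joined against the
# observed-values view from the question.
delay = 24
cur.execute("""
    SELECT avg(abs((p.pred ->> %s)::double precision - o.value))
    FROM prediction p
    JOIN observed o
      ON o.object = p.object
     AND o.time   = p.pred_time + make_interval(hours => %s)
""", (str(delay), delay))
print(cur.fetchone())

conn.commit()
cur.close()
conn.close()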

Some questions about the train_test_split() function

I am currently trying to use Python's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: I wonder what the difference is between train_test_split(dataset, test_size, random_state=int) and train_test_split(dataset, test_size). Does the second one (without setting random_state=int) give me a different test set and training set each time I re-run my program? And does the first one give me the same test set and training set every time I re-run my program? What is the difference between setting random_state=42 vs random_state=43, for example?
In Python's scikit-learn, train_test_split will split your input data into two sets: i) train and ii) test. It has a random_state argument which controls the shuffling applied to the data before the split.
If the argument is not set, the shuffle is seeded differently on every call, so you will generally get a different split of the same dataset each time you run your program.
Suppose you want to split the data randomly so that you can measure the performance of your regression on the same data with different splits; you can use random_state to achieve this. Each random_state value gives you a pseudo-random split of your initial data, and the same value always reproduces the same split. To keep track of performance and reproduce it later on the same data, pass the random_state argument with the value used before.
This is also useful for cross-validation techniques in machine learning.
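A short sketch of that behaviour in code (the array contents are made up; only the 1000 x 8 shape mirrors the question):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000 * 8).reshape(1000, 8)   # 1000 rows, 8 columns as in the question
y = np.arange(1000)

# Fixed seed: the same split on every run. 42 vs 43 are just two different,
# equally valid shuffles; each value reproduces its own split.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=42)
print((X_te1 == X_te2).all())   # True: identical test sets

# No seed: a different split on every call (and every program run).
X_tr3, X_te3, y_tr3, y_te3 = train_test_split(X, y, test_size=0.2)
X_tr4, X_te4, y_tr4, y_te4 = train_test_split(X, y, test_size=0.2)
print((X_te3 == X_te4).all())   # almost certainly False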

Generate subset of data with known mean

I have a dataset of n observations (an n x 1 vector) and would like to create a subset of this data, whose mean is known in advance, by selecting at random only n/3 observations (or within some constraint, i.e. where the mean of the data subset is within a range around the known mean).
Can someone please help me with the code to do this in MATLAB?
Note: I don't want to use the rand function to create random data, as I have already collected my data.
For example on a smaller scale: If I had the following dataset of 12 observations:
data = [8;7;4;6;9;6;4;7;3;2;1;1];
but then wanted to randomly select a subset of this data containing only 4 observations with a mean of 4 (or with a mean between 3.5 and 4.5, for example):
Then the answer might be datasubset = [7;3;2;4], but the answer could also be datasubset = [6;4;2;4] or datasubset = [6;4;3;4].
It doesn't matter if there are several possible solutions; I just need one of them, though I'd like to know the alternative solutions as well.
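For what it's worth, one brute-force way to do this is rejection sampling: draw random subsets until one falls inside the tolerance. A minimal sketch (in Python rather than the requested MATLAB, purely to illustrate the idea; the tolerance and retry limit are arbitrary choices):

import random

data = [8, 7, 4, 6, 9, 6, 4, 7, 3, 2, 1, 1]
k = 4                      # subset size (n/3 in the question)
target, tol = 4.0, 0.5     # accept any mean in [3.5, 4.5]

subset = None
for _ in range(100000):                      # give up after enough attempts
    candidate = random.sample(data, k)       # random draw without replacement
    if abs(sum(candidate) / k - target) <= tol:
        subset = candidate
        break

print(subset)   # e.g. [7, 3, 2, 4]; many other subsets are equally valid

The equivalent draw in MATLAB would be data(randperm(numel(data), k)), with mean() for the acceptance test.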