Probability Distributions in AnyLogic

I have a question: in my simulation model I want to create some randomness in the quantities of arriving agents. In my model the chance is 25% that the quantity is 1, 50% that it is 2, and 25% that it is 3. I want to generate these quantities using the Split block of AnyLogic, and in the Split block I want to plug in a distribution that produces values accordingly.
But is there a distribution in AnyLogic that fits this kind of estimate, or should I really do it manually? I was already thinking about using a rounded value of a normal distribution, but I'm not sure whether that is correct.
Thanks.

You can use a custom distribution to create any shape of distribution you want (it is under the Agent palette).
In your case, you would need to pick one of type discrete, as shown below.
Then, in the Split block you would specify the number of copies as
customDistribution()
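If you would rather do it manually (as you were considering), a small Java helper is enough. The following is a sketch: uniform() is AnyLogic's built-in generator for a uniform random number in [0, 1), and randomQuantity is an arbitrary name for a function you would define yourself. Note that rounding a normal distribution would not reproduce the exact 25%/50%/25% split anyway, since its tails also produce values like 0 or 4, so an explicit mapping (or the custom distribution above) is preferable.

// Sketch: map a uniform(0,1) draw onto the 25% / 50% / 25% quantities.
// uniform() is AnyLogic's built-in U(0,1) generator; randomQuantity is an
// arbitrary name for a function you would define on the agent or in Main.
int randomQuantity() {
    double r = uniform();            // r ~ U(0, 1)
    if (r < 0.25) return 1;          // 25% chance of quantity 1
    if (r < 0.75) return 2;          // 50% chance of quantity 2
    return 3;                        // 25% chance of quantity 3
}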

Related

Appropriate method for clustering ordinal variables

I was reading through all (or most) previously asked questions, but couldn't find an answer to my problem...
I have 13 variables measured on an ordinal scale (they represent knowledge transfer channels), which I want to cluster (HCA) for a subsequent binary logistic regression analysis (including all 13 variables is not possible due to the sample size of N=208). A factor analysis seems inappropriate due to the scale level. I am using SPSS (but tried R as well).
Questions:
1. Am I right in using the chi-squared measure for count data instead of the (squared) Euclidean distance?
2. How can I justify the choice of method? I tried single, complete, Ward, and average linkage, but they all give different results and I can't find a source to base my decision on.
Thanks a lot in advance!
Answer 1: Since the variables are on an ordinal scale, the chi-square measure is an appropriate choice, because "A Chi-square test is designed to analyze categorical data. That means that the data has been counted and divided into categories. It will not work with parametric or continuous data (such as height in inches)." Reference.
Also, because ordinal data is essentially count or frequency data, you should not rely on regular parametric statistics (mean, standard deviation, or parametric tests such as ANOVA); use non-parametric tests instead, e.g. the Mann-Whitney U test to compare 2 groups or the Kruskal–Wallis H test to compare three or more groups.
Answer 2: In a clustering problem, the choice of distance measure depends primarily on the type of variables. I recommend reading these detailed posts: 1, 2, 3.

How to use a different distribution for rate transitions in a statechart?

I'd like to use a different distribution for defining the transition probabilities in the statechart of an agent. I know that AnyLogic uses the exponential distribution as the default for rate-triggered transitions, but on occasion other distributions are a better fit, such as in a disease statechart, where the Weibull or Gompertz distribution is often used.
How would I go about implementing this? Obviously, directly entering a Weibull distribution as the rate does not work. Would I need to define a function?
If you use a rate-triggered transition, you are forced to use the exponential distribution. What you can do instead is use a dynamic event together with a transition triggered by a message. The sketch below shows one way to create (schedule) the dynamic event.
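This is a sketch only: AnyLogic auto-generates a create_<Name>() method for every dynamic event you add, and the event name MyTransitionEvent, as well as the Weibull parameters, are illustrative (check the argument order of weibull() in your version's function reference).

// Schedule the dynamic event after a Weibull-distributed delay instead of
// an exponential one; MyTransitionEvent is a dynamic event you define.
create_MyTransitionEvent(weibull(2.0, 5.0));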
Then the action in the dynamic event is
statechart.fireEvent("message");
You can do the same with any distribution you want. Not every distribution is built in, though: for the Gompertz distribution you have to create your own function to generate random samples, for example as sketched below.
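A minimal sketch of such a function using inverse-transform sampling. It assumes the Gompertz parameterization F(t) = 1 - exp(-(b/a)(e^(a t) - 1)) with shape a and rate b (the parameter names are illustrative); uniform() is AnyLogic's built-in uniform(0, 1) generator.

// Inverse-transform sample from a Gompertz(shape a, rate b) distribution,
// obtained by solving u = F(t) for t with u ~ U(0, 1).
double gompertz(double a, double b) {
    double u = uniform();
    return Math.log(1.0 - (a / b) * Math.log(1.0 - u)) / a;
}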

clustering vs fitting a mixture model

I have a question about using a clustering method vs fitting the same data with a distribution.
Assume that I have a dataset with 2 features (feat_A and feat_B), and that I use a clustering algorithm to divide the data into an optimal number of clusters, say 3.
My goal is to assign to each input data point [feat_Ai, feat_Bi] a probability (or something similar) that the point belongs to cluster 1, 2, or 3.
a. First approach with clustering:
I cluster the data into the 3 clusters and assign to each point a probability of belonging to each cluster based on its distance from that cluster's center.
b. Second approach using mixture model:
I fit a mixture model or mixture distribution to the data. Data are fit to the distribution using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation. Clusters are assigned by selecting the component that maximizes the posterior probability.
In my problem I find the cluster centers (or fit the mixture model, if approach b is used) on a subsample of the data. Then I have to assign a probability to a lot of other data. I would like to know which approach still gives meaningful assignments in the presence of new data (a sketch of what approach b computes for a new point is given at the end of this question).
I would go for a clustering method, for example k-means, because:
If the new data come from a distribution different from the one used to fit the mixture model, the assignments could be incorrect.
With new data the posterior probabilities change.
The clustering method minimizes the within-cluster variance in order to find a kind of optimal separation border, whereas the mixture model takes the variance of the data into account to build the model (I am not sure that the resulting clusters are separated in an optimal way).
More info about the data:
The features shouldn't be assumed to be dependent.
Feat_A represents the duration of a physical activity and Feat_B the step count. In principle we could say that the step count increases with the duration of the activity, but that is not always true.
Please help me think this through, and if you have any other points please let me know.
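For concreteness, here is a sketch of what approach b's assignment step computes for a new point. All numbers (mixing weights, means, diagonal variances) are illustrative placeholders, not fitted values; a real application would take them from the EM fit on the subsample.

import java.util.Arrays;

public class MixturePosterior {

    // Density of a 2-D Gaussian with diagonal covariance at point (x, y).
    static double density(double x, double y, double[] mean, double[] var) {
        double dx = x - mean[0], dy = y - mean[1];
        double exponent = -0.5 * (dx * dx / var[0] + dy * dy / var[1]);
        double norm = 2.0 * Math.PI * Math.sqrt(var[0] * var[1]);
        return Math.exp(exponent) / norm;
    }

    public static void main(String[] args) {
        double[] weights = {0.3, 0.5, 0.2};                       // mixing proportions
        double[][] means = {{1.0, 2.0}, {4.0, 4.0}, {7.0, 1.0}};  // component means
        double[][] vars  = {{1.0, 1.0}, {0.5, 2.0}, {1.5, 0.5}};  // diagonal variances

        double x = 3.5, y = 3.0;  // a new observation [feat_A, feat_B]

        // Posterior of component k: weight_k * density_k / sum_j (weight_j * density_j)
        double[] posterior = new double[3];
        double total = 0.0;
        for (int k = 0; k < 3; k++) {
            posterior[k] = weights[k] * density(x, y, means[k], vars[k]);
            total += posterior[k];
        }
        for (int k = 0; k < 3; k++) posterior[k] /= total;

        System.out.println(Arrays.toString(posterior));  // soft cluster memberships
    }
}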

outlier detection based on gaussian mixture model

I have a set of data. I want to build a one-class distribution from that data, and based on the learned distribution I want to get a probability value for each data instance.
Based on these probability values (thresholding) I want to build a classifier that decides whether a particular data instance comes from that distribution or not.
In this case, let's say I have data of size 50x100000, where 50 is the dimension of each data instance and 100000 is the number of instances. I am learning a Gaussian mixture model on this data.
When I try to get the probability values for the instances, I get very low values. So in this case how can I build a classifier?
I don't think this makes sense. For example, suppose your data is 1-dimensional, and suppose the truth is that it has been sampled from a bimodal distribution. But suppose you haven't worked out that it's bimodal and you fit a normal distribution. You'd still have the best possible fit, but it would be the best possible fit of the wrong distribution, and the truth is that none of the points come from that fitted distribution or from any distribution that looks like it.
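A small, self-contained illustration of that point (a sketch; all numbers are made up): sample from an equal mixture of N(-3, 1) and N(+3, 1), fit a single normal by mean and standard deviation, and compare its density at a typical data point with the true mixture density.

import java.util.Random;

public class MisspecifiedFit {

    static double normalPdf(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int n = 100000;
        double[] data = new double[n];
        for (int i = 0; i < n; i++) {
            // True generator: equal mixture of N(-3, 1) and N(+3, 1).
            double center = rng.nextBoolean() ? -3.0 : 3.0;
            data[i] = center + rng.nextGaussian();
        }

        // "Best" single-normal fit: sample mean and standard deviation.
        double mean = 0.0;
        for (double v : data) mean += v;
        mean /= n;
        double ss = 0.0;
        for (double v : data) ss += (v - mean) * (v - mean);
        double sd = Math.sqrt(ss / n);

        double x = 3.0;  // a typical data point, sitting on one of the modes
        double fitted = normalPdf(x, mean, sd);
        double truth  = 0.5 * normalPdf(x, -3.0, 1.0) + 0.5 * normalPdf(x, 3.0, 1.0);

        // The fitted density at an actual mode comes out roughly 2-3 times too small.
        System.out.printf("fitted: %.4f, true mixture: %.4f%n", fitted, truth);
    }
}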

Select data based on a distribution in MATLAB

I have a set of data in a vector. If I were to plot a histogram of the data I could see (by clever inspection) that the data is distributed as a mixture of three distributions:
One normal distribution centered around x_1 with variance s_1;
One normal distribution centered around x_2 with variance s_2;
One lognormal distribution.
My data is obviously a subset of the 'real' data.
What I would like to do is take a random subset away from my data, ensuring that the resulting subset is a reasonably representative sample of the original data.
I would like to do this as easily as possible in matlab but am new to both statistics and matlab and am unsure where to start.
Thank you for any help :)
If you can identify each of the 3 distributions (in the sense that you can estimate their parameters), one approach could be to select a random subset of your data, estimate the parameters of each distribution on that subset, and check whether they are close enough (according to your own definition of "close") to the parameters of the original distributions. You should repeat this process several times and look at the average difference for a given random subset size; a sketch of this check follows below.
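A sketch of that repeat-and-compare loop for a single normal component (in Java for concreteness; the MATLAB version would follow the same structure using randperm, mean, and std). Extending it to your case means estimating the parameters of all three components, e.g. via a fitted mixture, and comparing component-wise. All sizes and parameters below are made up.

import java.util.Random;

public class SubsetCheck {

    // Sample mean and standard deviation of a vector.
    static double[] meanSd(double[] v) {
        double m = 0.0;
        for (double x : v) m += x;
        m /= v.length;
        double ss = 0.0;
        for (double x : v) ss += (x - m) * (x - m);
        return new double[]{m, Math.sqrt(ss / v.length)};
    }

    public static void main(String[] args) {
        Random rng = new Random(1);

        // Stand-in for the real data: one normal component, N(10, 2^2).
        double[] data = new double[5000];
        for (int i = 0; i < data.length; i++) data[i] = 10.0 + 2.0 * rng.nextGaussian();

        double[] full = meanSd(data);  // parameters estimated from the full data
        int subsetSize = 500, trials = 100;
        double avgMeanGap = 0.0, avgSdGap = 0.0;

        for (int t = 0; t < trials; t++) {
            // Draw a subset without replacement via a partial Fisher-Yates shuffle.
            double[] copy = data.clone();
            double[] subset = new double[subsetSize];
            for (int i = 0; i < subsetSize; i++) {
                int j = i + rng.nextInt(copy.length - i);
                double tmp = copy[i]; copy[i] = copy[j]; copy[j] = tmp;
                subset[i] = copy[i];
            }
            double[] est = meanSd(subset);
            avgMeanGap += Math.abs(est[0] - full[0]);
            avgSdGap   += Math.abs(est[1] - full[1]);
        }

        // Small average gaps indicate the subset size yields representative samples.
        System.out.printf("avg |mean gap| = %.4f, avg |sd gap| = %.4f%n",
                avgMeanGap / trials, avgSdGap / trials);
    }
}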