How can I simulate a lognormal distribution without knowing the mean and standard deviation? - MATLAB

Consider the Lucas endowment economy with inflation. We know that consumption growth and inflation are log-normally distributed, and that they are uncorrelated through time and with each other.
How can I compute the one-period nominal risk-free rate (1 + i_t,t+1)?
I have to solve this problem in MATLAB and tried using lognrnd():
g_t1 = lognrnd(mu_c, sg_c1)
g_t2 = lognrnd(mu_c, sg_c2)
pi_t1 = lognrnd(mu_pi, sg_pi)
pi_t2 = lognrnd(mu_pi, sg_pi)
but I don't know how to proceed without any parameter values. How can I then assign the simulated values to a vector or matrix?

You can’t generate simulated values from a distribution without providing a concrete parameterization for that distribution.
If you can't use theory to determine the parameter values but you have access to observational data, you can estimate the parameter values from the data. Alternatively, you can use subject-matter expert opinion for your problem context, or WAGs (Wild-Assed Guesses). In all of these cases, be aware that the true parameter values almost certainly differ from the values you are using. Consequently, I recommend using design of experiments over plausible ranges of the values, and fitting a response surface model to determine how sensitive your simulation's results are to variation in the input distributions' parameters.
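Once you have settled on some parameter values (the numbers below are purely illustrative placeholders, not values implied by the model), generating and storing the draws is straightforward. A minimal MATLAB sketch:
% Illustrative parameter values only -- replace with estimates or
% theory-implied values for your calibration.
mu_c  = 0.02;  sg_c  = 0.01;    % log-scale mean / std of consumption growth
mu_pi = 0.03;  sg_pi = 0.02;    % log-scale mean / std of inflation
T     = 10000;                  % number of simulated periods

% lognrnd(mu, sigma, m, n) returns an m-by-n matrix of draws, so you can
% fill a whole vector (or matrix) in one call instead of looping:
g    = lognrnd(mu_c,  sg_c,  T, 1);   % T-by-1 vector of gross consumption growth
infl = lognrnd(mu_pi, sg_pi, T, 1);   % T-by-1 vector of gross inflation (avoid naming it pi)

sims = [g, infl];                     % or stack both series in one T-by-2 matrix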

Related

How to compare two hyperparameters in a hierarchical model?

In one hierarchical model, we have two hyperparameters: dnorm(A_mu, 0.25^-2) and dnorm(B_mu, 0.25^-2). In this case 0.25 is the sd, which I keep fixed. A_mu and B_mu represent the group-level means. After fitting the data with rjags, we get posterior distributions for each parameter. Do I just directly compare the highest posterior density intervals (HDIs) of A_mu and B_mu? Do I need to calculate something using the sd (0.25)?
In another case, the sd of the two hyperparameters is not fixed, e.g. dnorm(A_mu, A_sd) and dnorm(B_mu, B_sd). How can I compare the two hyperparameters and make a decision, e.g. that one group is significantly different from the other?
Remember that you are getting posterior distributions for A_mu and B_mu. This makes your comparison easy, since you can look at the 95% credible intervals for the parameters (or pick whatever probability level suits your needs). JAGS uses Gibbs sampling, so you should be able to get the raw samples from the posteriors of A_mu and B_mu. You can then ask "what is the probability that B_mu is greater than some value?" by calculating the percentage of posterior samples that exceed that value (see the sketch below). Alternatively, and in a similar spirit to frequentist hypothesis testing, you can ask what the probability is that the mean of B_mu is a draw from the posterior of A_mu. So the key is just to work directly with the samples from your posterior. I recommend Andrew Gelman's BDA3 textbook (Chapter 4) as a really good reference on these concepts.
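As a concrete illustration of the percentage-of-samples comparison, here is a minimal MATLAB sketch, assuming A_samples and B_samples are hypothetical, equal-length vectors of posterior draws extracted from your MCMC output:
% A_samples, B_samples: posterior draws for A_mu and B_mu
% (hypothetical names; extract them from your rjags/coda output).
p_B_gt_A = mean(B_samples > A_samples);    % Pr(B_mu > A_mu | data)
p_B_gt_0 = mean(B_samples > 0);            % Pr(B_mu > some value), here 0

% Posterior of the difference, with a simple 95% interval:
diff_samples = B_samples - A_samples;
ci95 = quantile(diff_samples, [0.025 0.975]);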
A few things to keep in mind before drawing conclusions from the data: (1) always check the validity of your Markov chains by evaluating things like autocorrelation; (2) try to do a posterior predictive check to make sure your model fits the data well. If the model fits the data poorly, the procedure above can give very misleading results.

Appropriate method for clustering ordinal variables

I was reading through all (or most) previously asked questions, but couldn't find an answer to my problem...
I have 13 variables measured on an ordinal scale (they represent knowledge transfer channels), which I want to cluster (HCA) for a subsequent binary logistic regression analysis (including all 13 variables is not possible due to the sample size of N = 208). A factor analysis seems inappropriate due to the measurement level. I am using SPSS (but have tried R as well).
Questions:
1. Am I right in using the chi-squared measure for count data instead of the (squared) Euclidean distance?
2. How can I justify the choice of linkage method? I tried single, complete, Ward, and average linkage, but they all give different results and I can't find a source to base my decision on.
Thanks a lot in advance!
Answer 1: Since the variables are on an ordinal scale, the chi-squared measure is appropriate, because "A Chi-square test is designed to analyze categorical data. That means that the data has been counted and divided into categories. It will not work with parametric or continuous data (such as height in inches)." (Reference.)
Again, ordinal-scaled data are essentially count or frequency data, so rather than regular parametric statistics (mean, standard deviation, etc.), non-parametric tests are generally the more defensible choice, e.g. the Mann-Whitney U test to compare two groups or the Kruskal-Wallis H test to compare three or more groups.
Answer 2: In a clustering problem, the choice of distance measure depends largely on the types of the variables involved. I recommend reading these detailed posts: 1, 2, 3.
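For what it is worth, here is a minimal MATLAB sketch of hierarchically clustering the 13 variables, assuming a 208-by-13 matrix X of ordinal scores and using one minus Spearman's rank correlation as the between-variable distance (one reasonable option for ordinal data; note this is not the chi-squared measure offered by SPSS):
% X: 208-by-13 matrix of ordinal scores (rows = respondents, columns = channels)
D = pdist(X', 'spearman');       % one minus Spearman's rho between pairs of variables
Z = linkage(D, 'average');       % try 'complete' or 'single' as well and compare dendrograms
dendrogram(Z);                   % inspect the tree before deciding where to cut
T = cluster(Z, 'maxclust', 4);   % e.g. cut into 4 variable clusters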

Pearson correlation fails for perfectly correlated sets

Consider the following examples of the Pearson correlation coefficient on sets of film ratings by users A and B:
A = [2,4,4,4,4]
B = [5,4,4,4,4]
pearson(A,B) = -1
A = [5,5,5,5,5]
B = [5,5,5,5,5]
pearson(A,B) = NaN
Pearson correlation seems widely used for calculating the similarity between two sets of ratings in collaborative filtering. However, the sets above show high (even perfect) similarity, yet the outputs suggest the sets are negatively correlated (or an error is encountered due to division by zero).
I initially thought it was an issue in my implementation, but I've since validated it against a few online calculators.
If the outputs are correct, why is Pearson correlation considered a good choice for this application?
Pearson correlation measures the linear association between two data sets, i.e. how they increase or decrease together.
In visual terms: how close the points lie to a straight line when one set is plotted on the x-axis and the other on the y-axis.
[Graph omitted: example of positive correlation, irrespective of differences in the scale of the data sets.]
In your first case, the plotted pairs contain only two distinct points (the one rating that differs moves down in A and up in B), so they lie exactly on a line with negative slope and the coefficient is -1. In your second case, both data sets are constant (every rating is 5), so their standard deviations are zero; the standard deviations appear in the denominator of the Pearson correlation formula, so the coefficient is undefined.
In other words, it is not possible to say how one data set increases or decreases along with the other: in the corresponding scatter plot, all the points collapse onto a single point, so no correlation pattern can be inferred.
A very simple solution would be to handle these cases separately; or, if you want to keep a single code path, a neat hack is to make sure that the standard deviation of neither set is zero.
A non-zero standard deviation can be achieved by perturbing a single value of each set by a tiny amount; since the data sets are then still (nearly) identical, this yields a high correlation coefficient.
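A quick MATLAB illustration of both failure modes and of the perturbation trick (corr is from the Statistics Toolbox; the 1e-6 epsilon is an arbitrary choice):
A = [2; 4; 4; 4; 4];   B = [5; 4; 4; 4; 4];
r1 = corr(A, B)        % -1: the points lie exactly on a line with negative slope
C = [5; 5; 5; 5; 5];   D = [5; 5; 5; 5; 5];
r2 = corr(C, D)        % NaN: both standard deviations are zero

% Perturbation trick: nudge one rating in each vector by a tiny amount
% so that neither standard deviation is zero.
C(1) = C(1) + 1e-6;
D(1) = D(1) + 1e-6;
r3 = corr(C, D)        % 1: the perturbed vectors are still identical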
I would also recommend studying other measures of similarity, such as Euclidean distance, cosine similarity, and adjusted cosine similarity, and making an informed decision about which suits your use case best; a hybrid approach is also possible.
Pearson correlation divides by the standard deviations of the variables, which in your case are zero, hence the divide-by-zero error. It is still considered a good choice because real rating data essentially never have a standard deviation of exactly zero. In other words, completely uniform data sets are outside the domain of the Pearson correlation coefficient, but that is no reason not to use it.

MATLAB computing Bayesian Information Criterion with the fit.m results

I'm trying to compute the Bayesian Information Criterion (BIC) from the results of fit.m.
According to Wikipedia, the log-likelihood can be approximated (when the noise is ~N(0, sigma^2)) as:
L = -(n/2)*log(2*pi*sigma^2) - rss/(2*sigma^2)
with n the number of samples, k the number of free parameters, and rss the residual sum of squares. And BIC is defined as:
BIC = -2*L + k*log(n)
But this differs somewhat from the fitglm.m result even for simple polynomial models, and the discrepancy seems to increase when higher-order terms are used.
Because I want to fit Gaussian models and compute their BICs, I cannot just use fitglm.m. Or is there another way to write a Gaussian model in Wilkinson notation? I'm not familiar with the notation, so I don't know whether that's possible.
I'm not 100% sure this is your issue, but I think your definition of BIC may be misunderstood.
The Bayesian Information Criterion (BIC) is an approximation to the log of the model evidence, and can be written as
BIC = ln p(D | theta_MAP) - (M/2)*ln(N)
where D is the data, M is the number of adaptive parameters of your model, N is the data size, and, most importantly, theta_MAP is the maximum a posteriori (MAP) estimate of the parameters. (Up to a factor of -2, this is the same quantity as the -2*L + k*log(n) form you used above.)
Compare this with the much simpler Akaike Information Criterion (AIC),
AIC = ln p(D | theta_ML) - M
which instead relies on the usually easier-to-obtain maximum likelihood estimate theta_ML of the parameters.
Your sigma is simply a parameter that is subject to estimation. If the sigma you are using here is derived from the sample variance, for instance, then it corresponds to the maximum likelihood estimate, not the MAP estimate.
So your discrepancy may simply derive from the built-in function using the 'correct' estimate and you using the wrong one in your 'by-hand' calculation of the BIC.
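As a rough sketch of the by-hand calculation (assuming the Curve Fitting Toolbox's fit(); 'poly2' and the variable names are only placeholders, and note that definitions differ in whether the noise variance sigma^2 is counted as an extra free parameter, which is another common source of small discrepancies):
% x, y: column vectors of data; 'poly2' is only an example model type.
[fobj, gof] = fit(x, y, 'poly2');

n      = numel(y);                   % number of observations
rss    = gof.sse;                    % residual sum of squares from the fit
k      = numel(coeffvalues(fobj));   % number of fitted coefficients
sigma2 = rss / n;                    % ML estimate of the noise variance

logL = -(n/2)*log(2*pi*sigma2) - rss/(2*sigma2);   % Gaussian log-likelihood
BIC  = -2*logL + k*log(n);           % some definitions use (k+1) to count sigma^2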

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, e.g., salary and age. There are also some binary types (e.g. male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly; it has the benefit of letting you keep your (probably matrix-based) implementation with this kind of data. A much simpler way, however, and one appropriate for most distance-based methods, is to just use a modified distance function.
There is an infinite number of such combinations, and you need to experiment to find what works best for you. Essentially, you might use some classic metric on the numeric values (usually with normalization applied, though it may make sense to move the normalization into the distance function as well), plus a distance on the other attributes, scaled appropriately; see the sketch after the next paragraph.
In most real application domains of distance-based algorithms, this is the hardest part: optimizing your domain-specific distance function. You can see it as part of preprocessing: defining similarity.
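For illustration, here is a minimal MATLAB sketch of the 'normalized numeric block plus indicator-coded categorical block' idea; the table T and its variables salary, age, bank, and acctType are hypothetical, and the weight w is something you would tune:
% T: a table with hypothetical numeric variables (salary, age) and
% categorical variables (bank, acctType).
num      = zscore([T.salary, T.age]);              % z-scored numeric block
catBlock = [dummyvar(categorical(T.bank)), ...     % one-hot indicator block
            dummyvar(categorical(T.acctType))];

w = 0.5;                       % relative weight of the categorical block (tune this)
X = [num, w * catBlock];       % combined feature matrix

% 5 nearest neighbors of observation 1 among the remaining observations:
[idx, dist] = knnsearch(X(2:end, :), X(1, :), 'K', 5);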
There is much more than just Euclidean distance. There are various set-theoretic measures that may be much more appropriate in your case, for example the Tanimoto coefficient, Jaccard similarity, or Dice's coefficient. Cosine similarity might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straightforward way to convert categorical data into numeric form is to use indicator vectors. See the reference I posted in my previous comment.
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.