How do you determine how many variables is too many for a CCA? - vegan

I am running a CCA of some ecological data with ~50 sites and several hundred species. I know that you have to be careful when your number of explanatory variables approaches your number of samples. I have 23 explanatory variables, so this isn't a problem for me, but I have also heard that using too many explanatory variables can start to "un-constrain" the CCA.
Are there any guidelines about how many explanatory variables are appropriate? So far, I have just plotted them all and then removed the ones that appear to be redundant (leaving me with 8). Can I use the inertia values to help inform/justify this?
Thanks

This is the same question as asking "how many variables are too many for regression analysis?". Not "almost the same", but exactly the same: CCA is an ordination of the fitted values of a linear regression. In the most severe cases you can over-fit. In CCA this is evident when the first eigenvalues of the CCA and of the (unconstrained) CA are almost identical and the ordinations look similar in the first dimensions (you can use Procrustes analysis to check this). The extreme case would be that the residual variation disappears altogether, but in ordination you focus on the first dimensions, and there the effect of the constraints can be lost much earlier than in the later constrained axes or in the residuals.

More importantly: you must see CCA as a kind of regression analysis and take the same attitude to constraints as to explanatory (independent) variables in regression. If you have no prior hypothesis to study, you have all the problems of model selection in regression analysis plus the problems of multivariate ordination, but these are non-technical problems that should be handled somewhere other than Stack Overflow.
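As a rough illustration of the Procrustes check: in R, vegan itself has procrustes() and protest() for this; the sketch below just uses SciPy on exported site scores, and the arrays are made-up placeholders for the first two axes of your CCA and CA.

import numpy as np
from scipy.spatial import procrustes

# Placeholder stand-ins for the first-two-axis site scores you would export
# from vegan, e.g. via scores(cca_fit, display = "sites") and the CA equivalent.
rng = np.random.default_rng(0)
ca_scores = rng.normal(size=(50, 2))                            # 50 sites x 2 unconstrained axes
cca_scores = ca_scores + rng.normal(scale=0.05, size=(50, 2))   # nearly identical configuration

# Procrustes superimposes the two configurations after translation, scaling
# and rotation; a disparity near 0 means the constrained and unconstrained
# ordinations look the same in these dimensions, i.e. the constraints are
# barely constraining anything.
_, _, disparity = procrustes(ca_scores, cca_scores)
print(f"Procrustes disparity (0 = identical configurations): {disparity:.4f}")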

Related

When performing OLS regressions why might the strict exogeneity assumption be violated if necessary control variables aren't included?

I think I just about grasp the basics, but it's still confusing me. In my statistics course I'm investigating a hypothetical bone-wasting disease dataset among different population groups, which has a number of geographic controls and the like. I got the question wrong and said that they weren't needed, but my teacher didn't elaborate. Why do we need to control for different groups in our regression - what do we gain by doing that? Thanks for any help!
This one is a straightforward application of the OLS assumptions.
To take it almost straight from undergraduate lecture slides: in OLS you assume that the errors ε_i are jointly normal and uncorrelated, which implies that they form an i.i.d. set. If you leave out necessary control variables, you are likely omitting independent variables that affect your dependent variable; they end up in the error term, so you are implicitly assuming that the unobservables across the different population groups are unrelated to your included regressors, which is clearly unrealistic - and that is exactly how strict exogeneity gets violated.
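A tiny simulation makes this concrete (Python, purely for illustration; the variable names and numbers are made up, not from the course):

import numpy as np

# Omit a control that is correlated with the regressor of interest and the
# OLS slope is biased: the omitted variable ends up in the error term, which
# is then correlated with the regressor, violating strict exogeneity.
rng = np.random.default_rng(42)
n = 5000
group = rng.normal(size=n)                        # e.g. a geographic/population control
x = 0.8 * group + rng.normal(size=n)              # regressor correlated with the control
y = 1.0 * x + 2.0 * group + rng.normal(size=n)    # true effect of x is 1.0

def ols_coefs(X, y):
    # OLS with an intercept column prepended
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

print("slope with control:   ", ols_coefs(np.column_stack([x, group]), y)[1])  # close to 1.0
print("slope without control:", ols_coefs(x.reshape(-1, 1), y)[1])             # biased upward (~2.0)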

Does it matter which algorithm you use for Multiple Imputation by Chained Equations (MICE)

I have seen MICE implemented with different types of algorithms e.g. RandomForest or Stochastic Regression etc.
My question is: does it matter which type of algorithm is used, i.e. does one perform best? Is there any empirical evidence?
I am struggling to find any info on the web
Thank you
Yes, depending on your task it can matter quite a lot which algorithm you choose.
You can also be sure the mice developers wouldn't put effort into providing different algorithms if there were one algorithm that always performed best. As in machine learning, the "no free lunch" theorem is also relevant for imputation.
In general, the default settings of mice are often a good choice.
Look at this example from the miceRanger vignette to see how far imputations can differ between algorithms (the real distribution is marked in red, the respective multiple imputations in black).
The predictive mean matching (pmm) algorithm, for example, makes sure that only values that actually occur in the dataset are imputed. This is useful, for example, when only integer values like 0, 1, 2, 3 appear in the data (and nothing in between). Other algorithms won't do this: their regressions will also produce interpolated values such as 1.1 or 1.3, as in the picture on the right. Both behaviours come with certain drawbacks.
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.
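The question is about R's mice, but the same point can be sketched with scikit-learn's IterativeImputer, a rough MICE-style analogue: swapping the per-variable model changes the imputations. The data and estimator choices below are made up for illustration, not a benchmark.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

# Two correlated variables; knock out 20% of the second one.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 2 * x1 + rng.normal(scale=0.5, size=300)
X = np.column_stack([x1, x2])
mask = rng.random(300) < 0.2
X_miss = X.copy()
X_miss[mask, 1] = np.nan

# Same chained-equations scheme, two different per-variable models.
imp_lin = IterativeImputer(estimator=BayesianRidge(), random_state=0)
imp_rf = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0), random_state=0)

for name, imp in [("linear", imp_lin), ("random forest", imp_rf)]:
    X_imp = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_imp[mask, 1] - X[mask, 1]) ** 2))
    print(f"{name:13s} imputation RMSE vs held-out truth: {rmse:.3f}")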

Matlab: Fit a custom function to xy-data with given x-y errors

I have been looking for a Matlab function that can do a nonlinear total least squares fit, i.e. fit a custom function to data that has errors in all dimensions. The simplest case is x-y data points with different given standard deviations in x and y for every single point. This is a very common scenario in all the natural sciences, and just because most people only know how to do a least squares fit with errors in y doesn't mean it wouldn't be extremely useful. I know the problem is far more complicated than a simple y-error fit; that is probably why most people never learned how to do this properly with multidimensional errors - not even physicists like myself.
I would expect software like Matlab to be able to do this, but unless I'm bad at reading the otherwise mostly useful help pages, I think even a 'full' Matlab license doesn't provide such fitting functionality. Other tools like Origin, Igor and Scipy use the freely available Fortran package "ODRPACK95", for instance. There are a few contributions about total least squares or Deming fits on the File Exchange, but they're for linear fits only, which is of little use to me.
I'd be happy for any hint that can help me out
kind regards
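(For reference, since the question mentions that SciPy wraps ODRPACK: a minimal sketch of what such a fit looks like there, with a made-up model and made-up per-point uncertainties in both x and y.)

import numpy as np
from scipy import odr

def model_func(beta, x):
    # model y = A*sin(x) + B with beta = [A, B]
    return beta[0] * np.sin(x) + beta[1]

# Synthetic data with known per-point standard deviations sx, sy.
rng = np.random.default_rng(1)
x_true = np.linspace(0, 2 * np.pi, 40)
sx = np.full_like(x_true, 0.05)
sy = np.full_like(x_true, 0.10)
x = x_true + rng.normal(scale=sx)
y = 2.0 * np.sin(x_true) + 0.5 + rng.normal(scale=sy)

data = odr.RealData(x, y, sx=sx, sy=sy)       # errors in both dimensions
fit = odr.ODR(data, odr.Model(model_func), beta0=[1.0, 0.0]).run()
print("fitted parameters:", fit.beta, "+/-", fit.sd_beta)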
First I should point out that I haven't practiced MATLAB much since I graduated last year (also as a Physicist). That being said, I remember using
lsqcurvefit()
in MATLAB to perform non-linear curve fits. Now, this may or may not work, depending on what you mean by a custom function. I'm assuming you want to fit some known expression similar to one of these:
y = A*sin(x)+B
y = A*e^(B*x) + C
It is extremely difficult to perform a fit without knowing the form, e.g. as above. Ultimately, all mathematical functions can be approximated by polynomials on small enough intervals. This is something you might want to consider, as MATLAB has lots of tools for polynomial regression.
In the end, I would actually recommend that you write your own fit function. There are tons of examples of this online. The idea is to know the form of the true solution, as above, and guess the parameters A, B, C, .... Create an error (or cost) function, which produces a quantitative error (deviation) between your data and the guessed solution. The problem is then reduced to minimizing that error, for which MATLAB has lots of built-in functionality.
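A minimal sketch of that error-function-plus-minimizer pattern (in Python rather than MATLAB, with made-up data; note that, like lsqcurvefit, this only accounts for errors in y, unlike the ODR route above):

import numpy as np
from scipy.optimize import minimize

# Assume the known form y = A*sin(x) + B and known y uncertainties.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50)
y = 2.0 * np.sin(x) + 0.5 + rng.normal(scale=0.1, size=x.size)
sigma_y = np.full_like(y, 0.1)

def cost(params):
    # Quantitative deviation between the data and the guessed solution:
    # a chi-square style sum of weighted squared residuals.
    A, B = params
    residuals = (y - (A * np.sin(x) + B)) / sigma_y
    return np.sum(residuals ** 2)

result = minimize(cost, x0=[1.0, 0.0])   # initial guesses for A and B
print("fitted A, B:", result.x)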

Matlab's VARMAX regression parameters/coefficients nX & b

I'm having a bit of trouble following the explanation of the parameters for vgxset. Being new to the field of time-series is probably part of my problem.
The vgxset help page (http://www.mathworks.com/help/econ/vgxset.html) says that it's for a generalized model structure, VARMAX, and I assume that I just use a portion of that for VARMA. I basically tried to figure out which parameters pertain to VARMA, as opposed to the additional parameters for VARMAX. I assumed (maybe wrongly) that nX and b pertain to the exogenous variables. Unfortunately, I haven't found much on the internet about the prevailing notational conventions for a VARMAX model, so it's hard to be sure.
The SAS page for VARMAX (http://support.sas.com/documentation/cdl/en/etsug/67525/HTML/default/viewer.htm#etsug_varmax_details02.htm) shows that if you have "r" exogenous inputs and k time series, and if you look back at "s" time steps' worth of exogenous inputs, then you need "s" matrices of coefficients, each (k)x(r) in size.
This doesn't seem to be consistent with the vgxset page, which simply provides an nX-vector "b" of regression parameters. So my assumption that nX and b pertain to the exogenous inputs seems wrong, yet I'm not sure what else they can refer to in a VARMAX model. Furthermore, in all 3 examples given, nX seems to be set to the 3rd argument "s" in VARMAX(p,q,s). Again, though, it's not entirely clear because in all the examples, p=s=2.
Would someone be so kind as to shed some light on VARMAX parameters "b" and "nX"?
On Saturday, May 16, 2015 at 6:09:20 AM UTC-4, Rick wrote:
Your assessment is generally correct: the "nX" and "b" parameters do indeed correspond to the exogenous input data x(t). The number of columns (i.e., time series) in x(t) is "nX" and is what SAS calls "r", and the coefficient vector "b" is its regression coefficient.
I think the distinction here, and perhaps your confusion, is that SAS incorporates exogenous data x(t) as what's generally called a "distributed lag structure", in which they specify an r-by-T predictor time series and allow this entire series to be lagged using lag operator polynomial notation, as are the AR and MA components of the model.
MATLAB's Econometrics Toolbox adopts a more classical regression component approach. Any exogenous data is included as a simple regression component and is not associated with a lag operator polynomial.
In this convention, if the user wants to include lags of x(t), they would simply create the appropriate lag of x(t) and include it as additional series (i.e., additional columns of a larger multivariate exogenous/predictor matrix, say X(t)). See the utility function LAGMATRIX.
Note that both conventions are perfectly correct. Personally, I feel that the regression component approach is slightly more flexible since it does not require you to include "s" lags of all series in x(t).
Interesting. I'm still wrapping my brain around the use of regression to determine lag coefficients. It turns out that the multitude of online tutorials and hard-copy library texts I've looked at haven't really given much of an explanatory transition between the theoretical projection of new values onto past values and actual regression using sample data. Your description is making this more concrete. Thank you.
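To make the lag-and-append convention concrete, here is a rough numpy analogue of what LAGMATRIX does (the helper name and NaN padding below are illustrative, not the toolbox's):

import numpy as np

def lagmatrix(x, lags):
    # Return a (T x len(lags)) array whose columns are x shifted by each lag,
    # padded with NaN where no observation is available.
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    out = np.full((T, len(lags)), np.nan)
    for j, lag in enumerate(lags):
        if lag == 0:
            out[:, j] = x
        elif lag > 0:
            out[lag:, j] = x[:T - lag]
        else:                       # negative lag = lead
            out[:T + lag, j] = x[-lag:]
    return out

# One exogenous series expanded into a regression component holding its
# current value and first two lags -- extra columns of the predictor matrix
# X(t), as described above.
x = np.arange(1.0, 8.0)             # x(t) = 1, 2, ..., 7
print(lagmatrix(x, [0, 1, 2]))      # columns: x(t), x(t-1), x(t-2)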
AFTERNOTE: In keeping with the best practice of which I've been advised, I am posting links to the fora that I posed this question in:
http://www.mathworks.com/matlabcentral/newsreader/view_thread/341064
Matlab's VARMAX regression parameters/coefficients nX & b
https://stats.stackexchange.com/questions/152578/matlabs-varmax-regression-parameters-coefficients-nx-b

Solving a non-polynomial equation numerically

I've got a problem with an equation that I'm trying to solve numerically using both MATLAB and the Symbolic Toolbox. I've been through several pages of MATLAB help, picked up a few tricks and tried most of them, still without a satisfying result.
My goal is to solve set of three non-polynomial equations with q1, q2 and q3 angles. Those variables represent joint angles in my industrial manipulator and what I'm trying to achieve is to solve inverse kinematics of this model. My set of equations looks like this: http://imgur.com/bU6XjNP
I'm solving it with
numeric::solve([z1,z2,z3], [q1=x1..x2,q2=x3..x4,q3=x5..x6], MultiSolutions)
Changing the xn constants according to my needs. Yet I still get some odd results: q1 is off by approximately 0.1 rad, and q2 and q3 by ~0.01 rad. I don't have much experience with numeric::solve, so I just need to know: is it supposed to look like that?
And if not, what option do you suggest I take next? Maybe transforming the equations to polynomial form, or using a different toolbox?
Or, if trying to do this in Matlab, how can you limit your solutions when using solve()? I'm thinking of an equivalent of the Symbolic Toolbox's assume() and assumeAlso().
I would be grateful for your help.
The numerical solution of a system of nonlinear equations is generally treated as an iterative minimization process: finding the global minimum of the norm of the difference between the left- and right-hand sides of the equations. For example, fsolve essentially uses Newton iterations. Those methods perform a "deterministic" optimization: they start from an initial guess and then move through the space of unknowns, essentially following the negative of the gradient, until a solution is found.
You then have two kinds of issues:
Local minima: the stopping rule of the iteration is based on the gradient of the functional. When the gradient becomes small, the iterations stop. But the gradient can become small at local minima as well as at the desired global one, so when the initial guess is far from the actual solution you can get stuck at a false solution.
Ill-conditioning: large variations in the unknowns may produce only small variations in the data, so small numerical errors in the data (for example, machine rounding) can lead to large variations in the unknowns.
Due to the above problems, the solution found by your numerical algorithm is likely to differ (possibly substantially) from the actual one.
I recommend that you perform a consistency test: choose a starting guess, for example when using fsolve, very close to the actual solution and verify that the final result is accurate. You will then find that, as you move the initial guess further away from the actual solution, the result is likely to show some (possibly large) errors. Of course, the size of the errors depends on the nature of the system of equations; in lucky cases they may remain very small.
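As a small illustration of that consistency test (SciPy's fsolve here, a Newton-type solver like MATLAB's; the two-link system below is a made-up stand-in for the three-angle kinematics in the question, not the actual equations):

import numpy as np
from scipy.optimize import fsolve

link1, link2 = 1.0, 0.8                 # assumed link lengths
target = np.array([1.2, 0.6])           # assumed target position

def equations(q):
    # Forward kinematics of a planar two-link arm minus the target point.
    q1, q2 = q
    x = link1 * np.cos(q1) + link2 * np.cos(q1 + q2)
    y = link1 * np.sin(q1) + link2 * np.sin(q1 + q2)
    return [x - target[0], y - target[1]]

# A guess near a true solution converges cleanly; a distant guess may land on
# a different branch (elbow-up vs elbow-down) or a poorer result -- exactly
# the initial-guess sensitivity described above.
for guess in ([0.1, 0.5], [2.5, -2.5]):
    sol = fsolve(equations, guess)
    print("guess", guess, "-> q =", np.round(sol, 4),
          ", residual =", np.round(equations(sol), 6))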