When performing OLS regressions, why might the strict exogeneity assumption be violated if necessary control variables aren't included?

I think I just about grasp the basic idea, but it's still confusing me. In my statistics course I'm investigating a hypothetical bone-wasting disease dataset among different population groups, which has a number of geographic controls and the like. I got the question wrong and said that they weren't needed, but my teacher didn't elaborate. Why do we need to control for different groups in our regression - what do we gain by doing that? Thanks for any help!

This one is a straightforward application of the OLS assumptions.
To take it almost straight from undergraduate lecture slides: in OLS you make an assumption about the distribution of the errors - that they are jointly normal, uncorrelated, and mean-zero given the regressors - which implies that $(\varepsilon_i)_{i \in \mathbb{N}}$ is an i.i.d. collection satisfying strict exogeneity, $E[\varepsilon_i \mid X] = 0$. If you don't include the necessary control variables, you are ignoring independent variables that affect your dependent variable, and those omitted variables get absorbed into the error term. Running the regression anyway amounts to assuming that the unobservables differing between the population groups are unrelated to the included regressors, which is clearly unrealistic here - and once they are related, the error term is correlated with the regressors and strict exogeneity is violated.
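A small simulation makes this concrete (a hedged sketch in Python with made-up numbers - the thread doesn't name any software): the exposure of interest is correlated with a group indicator that also affects the outcome, so dropping the group control pushes it into the error term and biases the estimated effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

group = rng.binomial(1, 0.5, n)                   # population-group indicator (the "control")
x = 1.0 * group + rng.normal(size=n)              # exposure, correlated with group
y = 2.0 * x + 3.0 * group + rng.normal(size=n)    # true effect of x is 2

def ols(columns, y):
    """Ordinary least squares with an intercept, via np.linalg.lstsq."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("coef on x, controlling for group:", ols([x, group], y)[1])  # ~2.0
print("coef on x, control omitted:      ", ols([x], y)[1])         # biased, ~2.6
```

The bias equals the coefficient on the dropped term (3) times the slope from regressing group on x, here $3 \cdot 0.25 / 1.25 = 0.6$.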


Does it matter which algorithm you use for Multiple Imputation by Chained Equations (MICE)?

I have seen MICE implemented with different types of algorithms, e.g. RandomForest, Stochastic Regression, etc.
My question is: does it matter which type of algorithm is used, i.e. does one perform best? Is there any empirical evidence?
I am struggling to find any info on the web.
Thank you
Yes, depending on your task, it can matter quite a lot which algorithm you choose.
You can also be sure the mice developers wouldn't have put effort into providing different algorithms if there were one algorithm that always performed best. Of course, as in machine learning, the "no free lunch" theorem is also relevant for imputation.
In general, the default settings of mice are often a good choice.
Look at this example from the miceRanger vignette to see how far imputations can differ between algorithms (the real distribution is marked in red, the respective multiple imputations in black).
The Predictive Mean Matching (pmm) algorithm, for example, makes sure that only values that actually appear in the dataset are imputed. This is useful, for example, where only integer values like 0, 1, 2, 3 appear in the data (and no values in between). Other algorithms won't do this: since they are based on regression, they will also produce interpolated values (e.g. 1.1, 1.3, ...), as in the picture on the right. Both solutions can come with certain drawbacks.
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.
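For intuition, here is a stripped-down sketch of the pmm idea in Python (mice's real pmm additionally draws the regression coefficients from their posterior; this toy version just fits one OLS and picks a donor at random among the k closest predicted means):

```python
import numpy as np

def pmm_impute(x, y, k=5, rng=None):
    """Simplified predictive mean matching: impute each missing y by copying
    an observed y whose regression-predicted value is close to that of the
    missing case."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta                                   # predictions for everyone
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]   # k closest observed cases
        y_imp[i] = y[obs][rng.choice(donors)]         # copy a real observed value
    return y_imp

# Toy data: y only takes the integer values 0..3, with some entries missing.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = np.clip(np.round(1.5 + x), 0, 3).astype(float)
y[rng.choice(200, 40, replace=False)] = np.nan
print(np.unique(pmm_impute(x, y)))   # only values from {0, 1, 2, 3}, never 1.3
```

Because every imputed value is copied from an observed case, the imputations stay on the 0-3 grid, which is exactly the behaviour described above.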

How to improve the convergence performance of Dymola?

Recently I have been working on fluid modeling with Modelica, but I keep coming across divergence problems in systems of nonlinear equations, like in the following screenshot.
So I am considering whether it is possible to use the min/max/nominal attributes of variables to improve the model's convergence, especially when a user comes across a nonlinear solver failure. According to the answer to this question on StackOverflow, min/max attributes won't help convergence, and based on the Modelica Specification, section 4.8.6, nominal attributes are used to determine appropriate tolerances or epsilons, or may be used for scaling.
So my question is:
If I run into this kind of divergence problem caused by the nonlinearity of my model, how can I help the compiler achieve convergence more reliably and quickly?
Someone might suggest better start values for the variables used as state variables, but when I am dealing with large models, I am not sure how to find the specific state variables whose start values I should modify.
Chapter 2.6.13 "Online diagnostics for non-linear systems" in Manual 1B, and the sections following it, should help. You can, e.g., list the states that dominate the error: usually these states are a good hint about where to start your improvements.
Adding to the answer by Imke Krueger.
If the model fails after 2917 s, one possibility is that the solution was already diverging before that, with, e.g., enthalpy decreasing further and further until the model left its valid region.
Assuming it happened fairly slowly, it is best to plot the states and other variables in those components, as well as the states dominating the error (as indicated in the answer by Imke Krueger), and see if any of them seem to diverge.
If it happened more quickly:
Log events and check whether something important like a flow reversal just happened before that time.
Disable equidistant output, as it is possible that the model diverged between two output points.
An eigenvalue-based analysis of the Jacobian at time = 0 provides a ranking of the state variables from most significant to least significant. That could be a heuristic for examining the influence of the start values of the most significant state variables.
It could also be helpful to conduct a similar analysis a little before the problem occurs.
There is also the possibility of computing dynamic parameter sensitivities of the state variables (before the problem occurs) w.r.t. the start values; see e.g. https://github.com/Mathemodica/DerXP for a suggested approach. This gives you a hint as to which start values significantly influence the values of the state variables.
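To illustrate what such an eigenvalue-based ranking looks like in principle (a generic numpy sketch with a made-up Jacobian and made-up state names, not Dymola's internal procedure):

```python
import numpy as np

# Hypothetical 4-state Jacobian A = df/dx at time = 0; in practice this would
# come from linearizing the Modelica model around its start values.
A = np.array([[-1.0,  0.2,  0.0,  0.0],
              [ 0.1, -0.5,  0.3,  0.0],
              [ 0.0,  5.0, -8.0,  1.0],
              [ 0.0,  0.0,  0.5,  0.01]])
states = ["h_tank", "p_pipe", "T_wall", "m_flow"]   # made-up state names

lam, V = np.linalg.eig(A)          # eigenvalues and right eigenvectors
W = np.linalg.inv(V)               # rows are the left eigenvectors
k = np.argmax(np.abs(lam.real))    # dominant (fastest or most unstable) mode

# Participation factors: how strongly each state couples into mode k.
participation = np.abs(V[:, k] * W[k, :])
for name, p in sorted(zip(states, participation), key=lambda t: -t[1]):
    print(f"{name:8s}  {p:.3f}")
```

States with a large participation in the dominant mode are natural first candidates for better start values.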

Why do we need stochasticity in deterministic simulations?

Assuming the world were deterministic, why would we still need to introduce stochasticity into our simulations?
In a nutshell, to simplify models.
Let’s go with your assumption, even though I don’t believe it. If the universe is completely deterministic, then in any given scenario you choose to model there is one and only one correct answer. Unless you include the complete state space of absolutely everything that determines that answer, your model is wrong. Wrong, wrong, wrong!!!
For instance, if you want to predict how long it will take to fly from New York to London, you need to know the vector sums of all forces acting on the aircraft, which means you need the complete state (down to the atomic level) of the aircraft itself, the passengers, the atmosphere, fluctuations in the magnetic field of the earth, cosmic rays that can trigger upper atmospheric events, etc, etc, ad nauseam. Exclusion of any aspect of the potential forces involved makes your answer wrong.
Clearly, there’s no way to measure it all, and even if there was, there’s no way to maintain so much state information in any computing device we can build. And so we simplify and acknowledge that there is some degree of uncertainty in our model’s predictions/solutions.
Embracing the existence of uncertainty brings us directly to stochastic solutions. One view of probability is that it is a mathematical formalism for modeling uncertainty. Rather than try to model every physical aspect of an aircraft's flight, we can characterize the likely outcomes based on what proportion of flights require less (or more) than any particular amount of time, i.e., by describing the distribution of possible flight times.
Once you adopt distributional modeling, you can see how distributional behaviors propagate through other parts of a system - either analytically, if your system is sufficiently simple, or by generating realizations of the distributions and using replication and sampling via simulation.
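That last sentence is essentially Monte Carlo simulation. A minimal sketch in Python, with entirely made-up numbers for the flight-time distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000   # number of simulated (replicated) flights

# Instead of modeling every force on the aircraft, summarize the uncertainty
# with an assumed distribution of NYC-to-London flight times, in minutes.
flight_time = rng.normal(loc=420, scale=25, size=n)

# Propagate the distribution through another part of the system,
# e.g. a connection that is missed if the flight takes more than 450 minutes.
print("mean flight time (min):", flight_time.mean())
print("P(miss the connection):", (flight_time > 450).mean())
```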

How do you determine how many variables is too many for a CCA?

I am running a CCA of some ecological data with ~50 sites and several hundred species. I know that you have to be careful when your number of explanatory variables approaches your number of samples. I have 23 explanatory variables, so this isn't a problem for me, but I have also heard that using too many explanatory variables can start to "un-constrain" the CCA.
Are there any guidelines about how many explanatory variables are appropriate? So far, I have just plotted them all and then removed the ones that appear to be redundant (leaving me with 8). Can I use the inertia values to help inform/justify this?
Thanks
This is the same question as asking "how many variables are too many for regression analysis?". Not "almost the same", but exactly the same: CCA is an ordination of the fitted values of a linear regression. In the most severe cases you can over-fit. In CCA this is evident when the first eigenvalues of the CCA and the (unconstrained) CA are almost identical and the ordinations look similar in the first dimensions (you can use Procrustes analysis to check this). The extreme case would be that the residual variation disappears, but in ordination you focus on the first dimensions, and there the constraints can get lost much earlier than in the later constrained axes or in the residuals.

More importantly: you must see CCA as a kind of regression analysis and have the same attitude to constraints as to explanatory (independent) variables in regression. If you have no prior hypothesis to study, you have all the problems of model selection in regression analysis plus the problems of multivariate ordination, but these are non-technical problems that should be handled somewhere else than on Stack Overflow.
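To make the "un-constraining" point a little more explicit, here is a hedged sketch of the algebra for the linear (RDA) case; CCA does the same thing on chi-square-standardized data. Constrained ordination ordinates the fitted values of the multivariate regression of the (transformed) sites-by-species matrix $Y$ on the $p$ constraints $X$,
$$\hat{Y} \;=\; X\,(X^{\top}X)^{-1}X^{\top}\,Y \;=\; H\,Y,$$
where $H$ is the usual hat matrix. As $p$ grows towards the number of sites, $H$ approaches the identity on the centred data, so $\hat{Y} \to Y$ and the leading constrained eigenvalues converge to the unconstrained ones - exactly the symptom described above, where the first eigenvalues and first axes of CCA and CA become nearly identical.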

What's a genetic algorithm that would produce interesting/surprising results and not have a boring/obvious end point?

I find genetic algorithm simulations like this to be incredibly entrancing and I think it'd be fun to make my own. But the problem with most simulations like this is that they're usually just hill climbing to a predictable ideal result that could have been crafted with human guidance pretty easily. An interesting simulation would have countless different solutions that would be significantly different from each other and surprising to the human observing them.
So how would I go about trying to create something like that? Is it even reasonable to expect to achieve what I'm describing? Are there any "standard" simulations (in the sense that the game of life is sort of standardized) that I could draw inspiration from?
Depends on what you mean by interesting. That's a pretty subjective term. I once programmed a graph analyzer for fun. The program would first let you plot any f(x) of your choice and set the bounds. The second step was creating trees holding the most common binary operators (+, -, *, /) to represent randomly generated functions of x. The program would create a pool of such random functions, test how well they fit the original curve in question, then crossbreed and mutate some of the functions in the pool.
The results were quite cool. A totally weird function would often be a pretty good approximation to the query function. Perhaps not the most useful program, but fun nonetheless.
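For anyone who wants to try this, here is a minimal sketch of that kind of symbolic-regression GA in Python; the target function, operator handling, and GA parameters are all made-up choices for illustration, not what the original program used.

```python
import math
import random

random.seed(1)

TARGET = lambda x: x * x + 2 * x          # assumed example "query" function
XS = [i / 10 for i in range(-30, 31)]     # evaluation grid on [-3, 3]

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b,
       "/": lambda a, b: a / b if abs(b) > 1e-6 else 1.0}  # protected division

def random_tree(depth=3):
    """Random expression tree over x, constants, and the four binary operators."""
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.5 else random.uniform(-5, 5)
    op = random.choice(list(OPS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree):
    """Mean squared error against the target curve (lower is better)."""
    try:
        err = sum((evaluate(tree, x) - TARGET(x)) ** 2 for x in XS) / len(XS)
        return err if math.isfinite(err) else float("inf")
    except OverflowError:
        return float("inf")

def mutate(tree):
    """Replace a random subtree with a freshly generated one."""
    if not isinstance(tree, tuple) or random.random() < 0.2:
        return random_tree(2)
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

def crossover(a, b):
    """Graft a random piece of b into a random position of a."""
    if not isinstance(a, tuple) or random.random() < 0.3:
        return b if not isinstance(b, tuple) else random.choice(b[1:])
    op, left, right = a
    if random.random() < 0.5:
        return (op, crossover(left, b), right)
    return (op, left, crossover(right, b))

pool = [random_tree() for _ in range(200)]
for generation in range(60):
    pool.sort(key=fitness)
    survivors = pool[:50]                  # truncation selection
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(150)]
    pool = survivors + children

best = min(pool, key=fitness)
print("best MSE:", fitness(best))
print("best tree:", best)
```

The printed "best tree" is often a strange-looking expression that nevertheless approximates the target curve quite well, which is the surprising part.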
Well, for starters, that genetic algorithm is not doing hill-climbing; otherwise it would get stuck at the first local maximum/minimum.
Also, how can you say it doesn't produce surprising results? Look at this vehicle here for example produced around generation 7 for one of the runs I tried. It's a very old model of a bicycle. How can you say that's not a surprising result when it took humans millennia to come up with the same model?
To get interesting emergent behavior (that is unpredictable yet useful) it is probably necessary to give the genetic algorithm an interesting task to learn and not just a simple optimisation problem.
For instance, the Car Builder that you referred to (although quite nice in itself) just uses a fixed road as the fitness function. This makes it easy for the genetic algorithm to find an optimal solution; however, if the road changed slightly, that optimal solution might not work anymore, because the fitness of a solution may have come to depend on trivially small details of the landscape rather than being robust to changes in it. In reality, cars did not evolve on one fixed test road either, but on many different roads and terrains. Using an ever-changing road as the (dynamic) fitness function, generated by random factors but within certain realistic boundaries for slopes etc., would be more realistic and useful.
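A tiny sketch of that suggestion in Python, where `build_road` and `simulate_car` are hypothetical stand-ins for whatever the simulation actually provides: the fitness of a car genome is averaged over several freshly randomized roads, so solutions that exploit one particular road score poorly.

```python
import random

def build_road(rng, length=100, max_slope=0.3):
    """Hypothetical road generator: random slopes within realistic bounds."""
    return [rng.uniform(-max_slope, max_slope) for _ in range(length)]

def robust_fitness(genome, simulate_car, n_roads=5, seed=None):
    """Average the distance the car travels over several random roads,
    instead of scoring it on a single fixed road."""
    rng = random.Random(seed)
    return sum(simulate_car(genome, build_road(rng)) for _ in range(n_roads)) / n_roads
```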
I think EvoLisa is a GA that produces interesting results. In one sense, the output is predictable, as you are trying to match a known image. On the other hand, the details of the output are pretty cool.