I'm running many linear regressions and probit models with a massive number of covariates. That means every time Stata finishes computing and prints the results, it produces a huge list of coefficients, and each time I have to scroll back to the top of that list, where the main coefficients are printed.
I would like to know if there is a way to avoid that. I was looking for an option to print only a certain number of lines. My second attempt was running the regression with the -quietly- option and then trying to print a given number of lines, but I'm not really familiar with Stata. I usually work in R, but I have to use Stata this time, which is why I'm struggling with this commercial software.
For linear regressions the -areg- command offers a partial solution, but it only allows me to "absorb" a single factor variable. I need to absorb several variables and also run probit models, so -areg- doesn't work for me.
Does anyone have a trick to solve this, i.e., to print only a selection of covariates in Stata?
UPDATE:
A minimal example: I have the following linear regression with many place and time units as fixed effects (FEs).
regress depVar Var1 Var2-Var15 i.place i.time [pw = myweigth], cluster(ID)
I'm only interested in seeing the coefficients of Var*, but every time I run the regression I get thousands of coefficients for the FEs.
I posted the same question on reddit, and I got the following comments:
https://www.reddit.com/r/stata/comments/fwtds4/cutting_down_stata_results/
This is pretty much what I was looking for. Basically, it is solved via the estout package and its -eststo- and -esttab- commands:
eststo myRegression: quietly ///
    regress depVar Var1 Var2-Var15 i.place i.time [pw = myweigth], cluster(ID)
esttab myRegression, drop(*.place *.time)
Maybe someone can enrich this approach. Thanks!
I ran into an error during initialization when using the ThermoSysPro library.
It seems that Turbine5.Pe is larger than Turbine2.Pe, so the expression under the square root is negative, but I checked my parameters and there shouldn't be such a problem.
Is this because the nonlinear solver couldn't solve the equation in the following picture?
There is not enough information, and I would recommend setting Details and/or Nonlinear iterations under Simulation setup > Debug > Nonlinear solver diagnostics to get more information.
The full expression causing the problem is sqrt((Turbine2.Pe^2-Turbine5.Pe^2)/(Turbine2.Cst*Turbine2.proe.T))
Since the two Pe-values have fixed=true it seems unlikely that they are wrong, but it is impossible to see without the complete model.
However, it is also possible that either Cst or proe.T is negative, or computed to a negative value based on other values.
Without a complete model that is impossible to tell.
Based on a comparison between ThermoSysPro (the open-source library from EDF, https://github.com/alex19941215/ThermoSysPro ) and Thermal Power (the commercial library from Modelon, https://www.modelon.com/library/thermal-power-library ), there may be some inspiration here for people facing the same situation.
Here is the code from the ThermoSysPro library:
Connectors.FluidInlet Ce;
Connectors.FluidOutlet Cs;
Here is typical code from the Thermal Power library:
Interfaces.FlowPort feed(
    h_outflow(start=hstartin));
Interfaces.FlowPort drain(
    p(start=pstart),
    h_outflow(start=hstartout));
From the code, we can see that in the Thermal Power library each connector's attributes are given start values derived from parameters, whereas in the ThermoSysPro library the connectors use default start values, probably zero. That is why the Thermal Power library performs better in terms of initialization convergence.
I am trying to find time series outliers using Tableau forecasts. I need to compare each actual value with the 95% confidence level in the forecast results to determine whether it is an outlier.
I understand I can view the forecast results on the chart, but I want to use the forecast results in a calculated measure. Is there any way to do that? I cannot find any Tableau functions to retrieve the forecast results.
Xuefei, it doesn't look like there is a way currently, at least going by the help page: https://help.tableau.com/v2019.1/pro/desktop/en-us/forecast_options.htm. If you haven't already considered this, integration with R is easy, and that way you could just model it in R (accounting for additive/multiplicative trend, cyclicity, and seasonality) and access the forecast values from R. Integration with Python is also supposed to be easy, although I haven't tried it myself.
Here is an example of Tableau code that incorporates R code for a linear regression (this is the formula for the calculated field in Tableau):
SCRIPT_REAL("
fv=log(.arg1)
fpri=.arg2
fit=lm(fv~fpri)
exp(fit$fitted)",SUM([Impressions]),SUM([CPM]))
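Along the same lines, here is a minimal sketch of the outlier idea itself. It assumes the forecast package is installed on the Rserve host, monthly data, and a hypothetical [Sales] measure; treat it as a starting point rather than a drop-in formula:
SCRIPT_BOOL("
library(forecast)                 # assumes the forecast package is installed on the Rserve host
y   <- ts(.arg1, frequency = 12)  # frequency = 12 assumes monthly data
fit <- auto.arima(y)              # stand-in for Tableau's internal forecast model
res <- residuals(fit)
abs(res) > 1.96 * sd(res)         # TRUE where the actual falls outside a ~95% band
", SUM([Sales]))
Since this is a table calculation, make sure it is computed along the date axis so the points reach R in time order.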
Looking for advice on how to determine whether my model's output distribution is similar (and if so, how similar) to the observed dataset's distribution.
Basically, I have a GBM model with mean reversion that provides seemingly good results when I compare its distribution to the observed data. You can see their PDFs side by side in the attached picture.
PDF of observed and model data
Both datasets are huge (~6 million data points), and I am starting to suspect that this is part of the problem...
I am looking for a way to verify that the two distributions are similar. I tried the two-sample Kolmogorov-Smirnov test and the two-sample t-test, but both of them rejected the null hypothesis (always, even with different alphas). In some threads I've read that these tests are unreliable when applied to huge datasets, but there wasn't a consensus about this.
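To illustrate what I mean, here is a minimal R sketch of the phenomenon on synthetic placeholder data (my real analysis is in Matlab, but the behaviour is the same): at this sample size, even a practically negligible difference gets rejected.
set.seed(1)
obs <- rnorm(6e6)                # stand-in for the observed data
sim <- rnorm(6e6, mean = 0.005)  # stand-in for the model output: a negligible shift
ks.test(obs, sim)                # tiny p-value -> rejected despite near-identical PDFs
ks.test(sample(obs, 1000), sample(sim, 1000))  # the same test on a subsample typically does not reject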
I am currently using Matlab, but I am open to other tools if necessary.
Any help would be appreciated! I am primarily looking for a hypothesis test for verification, but if you have a different idea, don't hold it back!
Background
I have climate data (temperature, precipitation, snow depth) for all of Canada between 1900 and 2009. I have written a basic website whose simplest page allows users to choose a category and a city. They then get back a very simple report (shown without the parameters and calculations section).
The primary purpose of the web application is to provide a simple user interface so that the general public can explore the data in meaningful ways. (A list of numbers is not meaningful to the general public, nor is a website that provides too many inputs.) The secondary purpose of the application is to provide climatologists and other scientists with deeper ways to view the data. (Using too many inputs, of course.)
Tool Set
The database is PostgreSQL with R (mostly) installed. The reports are written using iReport and generated using JasperReports.
Poor Model Choice
Currently, a linear regression model is applied against annual averages of daily data. The linear regression model is calculated within a PostgreSQL function as follows:
SELECT
regr_slope( amount, year_taken ),     -- slope of amount regressed on year
regr_intercept( amount, year_taken ), -- intercept of the fitted line
corr( amount, year_taken )            -- Pearson correlation coefficient
FROM
temp_regression
INTO STRICT slope, intercept, correlation;
The results are returned to JasperReports using:
SELECT
year_taken,
amount,
year_taken * slope + intercept,
slope,
intercept,
correlation,
total_measurements
INTO result;
JasperReports calls into PostgreSQL using the following parameterized analysis function:
SELECT
year_taken,
amount,
measurements,
regression_line,
slope,
intercept,
correlation,
total_measurements,
execute_time
FROM
climate.analysis(
$P{CityId},
$P{Elevation1},
$P{Elevation2},
$P{Radius},
$P{CategoryId},
$P{Year1},
$P{Year2}
)
ORDER BY year_taken
This is not an optimal solution because it gives the false impression that the climate is changing at a slow but steady rate.
Questions
Using functions that take two parameters (e.g., year [X] and amount [Y]), such as PostgreSQL's regr_slope:
What is a better regression model to apply?
What CRAN packages provide such models? (Installable, ideally, using apt-get.)
How can the R functions be called within a PostgreSQL function?
If no such functions exist:
What parameters should I try to obtain for functions that will produce the desired fit?
How would you recommend showing the best fit curve?
Keep in mind that this is a web app for use by the general public. If the only way to analyse the data is from an R shell, then the purpose has been defeated. (I know this is not the case for most R functions I have looked at so far.)
Thank you!
The awesome PL/R package allows you to run R inside PostgreSQL as a procedural language. There are some gotchas because R likes to think about data in terms of vectors, which is not what an RDBMS does. It is still a very useful package, as it gives you R inside PostgreSQL, saving you some of the round trips of your architecture.
And PL/R is apt-get-able for you, as it has been part of Debian/Ubuntu for a while. Start with apt-cache show postgresql-8.4-plr (that is on testing; other versions/flavours have it too).
As for the appropriate modelling: that is a whole different ballgame. loess is a fair suggestion for something non-parametric, and you probably also want some sort of dynamic model, either ARMA/ARIMA or lagged regression. The choice of model is pretty critical given how politicized the topic is.
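To make the PL/R route concrete, here is a minimal sketch of calling loess from inside PostgreSQL. The function name, the array-based signature, and the temp_regression layout are assumptions based on your snippets, not a tested drop-in:
-- assumes the PL/R package above is installed and the plr language is created in the database
CREATE OR REPLACE FUNCTION climate.loess_fit(x float8[], y float8[])
RETURNS float8[] AS $$
  fit <- loess(y ~ x)   # non-parametric local regression in R
  predict(fit, x)       # smoothed amounts at the observed years
$$ LANGUAGE plr;
-- aggregate the columns into arrays, smooth, then unnest if you need rows back:
SELECT climate.loess_fit(array_agg(year_taken::float8), array_agg(amount::float8))
FROM temp_regression;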
I don't think autoregression is what you want. Non-linear isn't what you want either, because that implies discontinuous data. You have continuous data; it just may not be a straight line. If you're just visualizing, and especially if you don't know what the shape is supposed to be, then loess is what you want.
It's also easy to get a confidence interval band around the line if you just plot the data with ggplot2.
qplot(x, y, data = df, geom = 'point') + stat_smooth()  # scatterplot plus a smoothed fit with its confidence band
That will make a nice plot.
If you want a simpler graph in straight R:
plot(x, y)                  # scatterplot
lines(loess.smooth(x, y))   # overlay the loess curve
May I propose a different solution? Just use PostgreSQL to pull the data, feed it into an R script, and finally show the results. The R script can be as complicated as you want, as long as the user doesn't have to deal with it.
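As a minimal sketch of that pipeline (RPostgreSQL is the obvious driver; the database, table, and column names are assumptions based on your post):
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), dbname = "climate")
df  <- dbGetQuery(con, "SELECT year_taken, amount FROM temp_regression ORDER BY year_taken")
dbDisconnect(con)
fit <- loess(amount ~ year_taken, data = df)   # non-parametric best-fit curve
plot(df$year_taken, df$amount)                 # raw annual averages
lines(df$year_taken, predict(fit))             # smoothed trend for the report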
You may want to have a look at rapache, an Apache module that allows running R scripts in a webpage.
A couple of videos illustrating its use:
Hello world application
Jeffrey Horner's presentation of RApache + links to working apps
In particular, check how the San Francisco Estuary Institute Web Query Tool allows the user to interact with the parameters.
As for the regression, I'm not an expert, so I may be saying something extremely stupid... but wouldn't something like a LOESS regression be OK for this?