Non-linear regression models in PostgreSQL using R - postgresql

Background
I have climate data (temperature, precipitation, snow depth) for all of Canada between 1900 and 2009. I have written a basic website and the simplest page allows users to choose category and city. They then get back a very simple report (without the parameters and calculations section):
The primary purpose of the web application is to provide a simple user interface so that the general public can explore the data in meaningful ways. (A list of numbers is not meaningful to the general public, nor is a website that provides too many inputs.) The secondary purpose of the application is to provide climatologists and other scientists with deeper ways to view the data. (Using too many inputs, of course.)
Tool Set
The database is PostgreSQL with R (mostly) installed. The reports are written using iReport and generated using JasperReports.
Poor Model Choice
Currently, a linear regression model is applied against annual averages of daily data. The linear regression model is calculated within a PostgreSQL function as follows:
SELECT
regr_slope( amount, year_taken ),
regr_intercept( amount, year_taken ),
corr( amount, year_taken )
FROM
temp_regression
INTO STRICT slope, intercept, correlation;
The results are returned to JasperReports using:
SELECT
year_taken,
amount,
year_taken * slope + intercept,
slope,
intercept,
correlation,
total_measurements
INTO result;
JasperReports calls into PostgreSQL using the following parameterized analysis function:
SELECT
year_taken,
amount,
measurements,
regression_line,
slope,
intercept,
correlation,
total_measurements,
execute_time
FROM
climate.analysis(
$P{CityId},
$P{Elevation1},
$P{Elevation2},
$P{Radius},
$P{CategoryId},
$P{Year1},
$P{Year2}
)
ORDER BY year_taken
This is not an optimal solution because it gives the false impression that the climate is changing at a slow, but steady rate.
Questions
Using functions that take two parameters (e.g., year [X] and amount [Y]), such as PostgreSQL's regr_slope:
What is a better regression model to apply?
What CPAN-R packages provide such models? (Installable, ideally, using apt-get.)
How can the R functions be called within a PostgreSQL function?
If no such functions exist:
What parameters should I try to obtain for functions that will produce the desired fit?
How would you recommend showing the best fit curve?
Keep in mind that this is a web app for use by the general public. If the only way to analyse the data is from an R shell, then the purpose has been defeated. (I know this is not the case for most R functions I have looked at so far.)
Thank you!

The awesome pl/r package allows you to run R inside PostgreSQL as a procedural language. There are some gotchas because R likes to think about data in terms of vectors which is not what a RDBMS does. It is still a very useful package as it gives you R inside of PostgreSQL saving you some of the roundtrips of your architecture.
And pl/r is apt-get-able for you as it has been part of Debian / Ubuntu for a while. Start with apt-cache show postgresql-8.4-plr (that is on testing, other versions/flavours have it too).
As for the appropriate modeling: that is a whole different ballgame. loess is a fair suggestion for something non-parametric, and you probably also want some sort of dynamic model, either ARMA/ARIMA or lagged regression. The choice of modeling is pretty critical given how politicized the topic is.

I don't think autoregression is what you want. Non-linear isn't what you want either because the implies discontinuous data. You have continuous data, it just may not be a straight line. If you're just visualizing, and especially if you don't know what the shape is supposed to be then loess is what you want.
It's easy to also get a confidence interval band around the line if you just plot the data with ggplot2.
qplot(x, y, data = df, geom = 'point') + stat_smooth()
That will make a nice plot.
If you want to a simpler graph in straight R.
plot(x, y)
lines(loess.smooth(x,y))

May I propose a different solution? Just use PostgreSQL to pull the data, feed it into some R script and finally show the results. The R script may be as complicated as you want as long as the user doesn't have to deal with it.
You may want to have a look at rapache, an Apache module that allows running R scripts in a webpage.
A couple of videos illustrating its use:
Hello world application
Jeffrey Horner's presentation of RApache + links to working apps
In particular check how the San Francisco Estuary Institue Web Query Tool allows the user to interact with the parameters.
As for the regression, I'm not an expert, so I may be saying something extremely stupid... but wouldn't something like a LOESS regression be OK for this?

Related

Is it possible to access forecast results in calculated measure?

I am trying to find time series outlier using Tableau forecast. I need to compare the actual value with the 95% confidence level in forecast results to determine if it is an outlier.
I understand I can view the forecast results on the chart. But I want to use the forecast results in calculated measure. Is there any way to do it? I cannot find any Tableau functions to retrieve the forecast results.
Xuefei. Doesn't look like there is a way currently, at least going by their help page - https://help.tableau.com/v2019.1/pro/desktop/en-us/forecast_options.htm. If you haven't already considered this - integration with R is easy and that way you could just model it in R (accounting for additive/multiplicative, trend/cyclicity/seasonality) and access the forecast values from R. Integration with Python is also supposed to be easy, although I haven't tried it myself.
Example of code in Tableau to incorporate R code for linear regression (this is the formula for the calc field in Tableau)
SCRIPT_REAL("
fv=log(.arg1)
fpri=.arg2
fit=lm(fv~fpri)
exp(fit$fitted)",SUM([Impressions]),SUM([CPM]))

How to run parameters optimization for my model using the GPU

Essentially, I have a model with given a set of parameters is able to calculate different thermodynamic properties for different compounds, let's say liquid densities and vapor pressures for example.
When I want to regress the model parameters (e.g. a,b,c,d,e) by fitting data for different compounds, I usually do a lot of sequential operations that I am sure I can easily improve their efficiency. I am thinking about multiobjective optimization or even better-using GPU or multicores of the CPU. But I am a bit lost on where to start from reading just the documentation.
So within my objetive function I usually have something like:
[fval]= objective_function(a,b,c,d,e)
input_comp1=f(constants,a,b,c,d,e)
input_comp2=f(constants,a,b,c,d,e)
exp_density1=load some text file or so.
exp_density2=load some text file or so.
exp_vaporpressure1=load some text file or so.
exp_vaporpressure2=load some text file or so.
densities_1=density(a,b,c,d,e,input_comp1)
vapor_pressures1=vapor_pressures(a,b,c,d,e,input_comp1)
densities_2=density(a,b,c,d,e,input_comp2)
vapor_pressures=vapor_pressures (a,b,c,d,e,input_comp2)
ARD_d1=expression for deviations between experimental and calculated values for density of comp.1
ARD_d2=...
ARD_p1=...
ARD_p2=...
fval= ARD_d1+ARD_d2+ARD_p1+ARD_p2
Which is then evaluated by something like fminsearch but I have also used others in the past, fminsearch worked the best for me. When I do this for just one component It works fast enough for my purpose (but I am a patient man). But now I've extended the model in a way that I need to regress parameters simultaneous from more than one component and it becomes impossible.
I am quite sure this way of doing the calculations can be improved because I can run the calculations for the different compounds simultaneous instead of doing them sequentially, and then evaluate fval when the calculations for all components are done. But how?

Recommendation system design

I am currently working on a research in which I try to predict people's IQ.
This is how the research goes, on day 1 participants take IQ test. At regular intervals of 2 weeks they continue to take the test (with different questions maybe) for 6 months.
Given this information (or dataset) how does one go about designing a recommendation system.
I imagine it something like this
IQvalue --input--> [ Recommendation Engine ] --spits out--> probable IQ value (after 6 months)
My actual research is not on IQ at all. I just made this example up.
Kindly suggest if I am going in the right direction at all? Are there any algorithms that do something similar?
Appreciate any help.
For case 1, you only have the time-related IQ values, I suggest you consider the time series analysis methods. Your target is to predict how the IQ change with time. My suggestion for this solution is the statsmodels library. Its github address is as follows:
https://github.com/statsmodels/statsmodels .
This tool is written in python and easy to use. It contains many generally used tsa models, such as ARIMA.
For case 2, if you also have the features of people, for instance, the answers in the QA test, ages, gender, education, etc., I suggest you consider use a machine learning methods to predict the IQs. You may consider the random forest or gradient boost to solve this problem. I suggest you use the tools such as Scikit-learn or xgboost.
For case 3, you can model it as a recommender system problem. Suppose user-test people, item-IQ, rating-IQ value, you can construct a user-item matrix. After that, you can use RS methods, such as matrix factorization or memory-based methods to predict the IQ values.
In my opinion, the first two means may be better for your case.

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use some access logging data in the hospital about user accessing patient information and try to detect abnormal accessing behaviors. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip-code) and a patient (e.g. age, sex, zip-code). There are about 13 - 15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how would clustering work with this type of data.
I've read that one could expand the categorical data and let each category in a variable to be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high dimensional data for me? (simply expanding employer role would bring in ~100 more variables)
How would the result of clustering be interpreted?
Using clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I suppose to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more and does the proximity of data points indicate similar behaviors? Does expanding each category into a dummy column with true/false values help? What's the distance then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and just look at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and I use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance, employee role and types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem of using quartiles with data that's not normally distributed? And what would be the remedy of this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
You can use k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a python implementation.

Matrix-Algebra Design Decomposition

I am looking at refactoring some very complex code which is a subsystem of a project I have at work. Part of my examination of this code is that it is incredibly complex, and contains a lot of inputs, intermediate values and outputs depending on some core business logic.
I want to redesign this code to be easier to maintain as well as executing a hell of a lot faster, so to start off with I have been trying to look at each of the parameters and their dependencies on each other. This has lead to quite a large and tangled graph and I would like a mechanism for simplifying this graph.
A while back I came across a technique in a book about SOA design called "Matrix Design Decomposition" which uses a matrix of outputs and the dependencies they have on the inputs, applies some form of matrix algebra and can generate Business Process diagrams for those dependencies.
I know there is a web tool available at http://www.designdecomposition.com/ however it is limited in the number of input/output dependencies you can have. I have tried looking around for the algorithmic source for this tool (so I could attempt to implement it myself without the size limitation), however I have had no luck.
Does anybody know a similar technique that I could use? Currently I am even considering taking the dependency matrix and applying some Genetic Algorithms to see if evolution can come up with a simpler workflow...
Cheers,
Aidos
EDIT:
I will explain the motivation:
The original code was written for a system which computed all of the values (about 60) every time the user performed an operation (adding, removing or modifying certain properties of a item). This code was written over ten years ago and is definitely showing signs of age - others have added more complex calculations into the system and now we are getting completely unreasonable performance (up to 2 minutes before control is returned to the user). It has been decided to detach the calculations from the user actions and provide a button to "recalculate" the values.
My problem arises because there are so many calculations that are going on and they are based on the assumption that all of the required data will be available for their computation - now when I try to re-implement the calculations I keep encountering problems because I haven't got the result for a different calculation that this calculation relies on.
This is where I want to use the matrix-decomposition approach. The MD approach allows me to specify all of the inputs and outputs and gives me the "simplest" workflow that I can use for generating all of the outputs.
I can then use this "workflow" to know the precedence of the calculations I need to perform to get the same result without generating any exceptions. It also shows me which parts of the calculation system I can parallelise and where the fork and join points will be (I won't worry about that part just yet). At the moment all I have is an insanely large matrix with lots of dependencies showing in it, with no idea where to start.
I will elaborate from my comment a little more:
I don't want to use the solution from the EA process in the actual program. I want to take the dependency matrix and decompose it into modules that I will then code manually - this is purely a design aid - I am just interested in what the inputs/outputs for these modules will be. Basically a representation of the complex interdependencies between these calculations, as well as some idea of precedence.
Say I have A requires B and C. D requires A and E. F requires B, A and E, I want to effectively partition the problem space from a complex set of dependencies into a "workflow" that I can examine to get a better understanding. Once I have this understanding I can come up with a better design / implementation that is still human readable, so for the example I know I need to calculate A, then C, then D, then F.
--
I know this seems kind of strange, if you take a look at the website I linked to before the matrix based decomposition there should give you some understanding of what I am thinking of...
kquinn, If it's the piece of code I think he's referring to (I used to work there), it's already a black box solution that no human can understand as is. He's not looking to make it more complicated, less in fact. What he's trying to achieve is a whole heap of interlinked calculations.
What currently happens, is that whenever anything changes, it's an avalanche of events which cause a whole bunch of calculations to fire off, which in turn causes a whole bunch more events which continues on until finally it reaches a state of equilibrium.
What I assume he wants to do is find the dependencies for those outlying calculations and work in from there so they can be rewritten and find a way for the calculations from happening for the sake of it, rather than because they need to.
I can't offer much advice in regards to simplifying the graph, as unfortunately it's not something I have much experience in. That said, I would start looking for those outlying calculations which have no dependencies, and just traverse the graph from there. Start building up a new framework that includes the core business logic of each calculation in the simplest possible way, and refactor the crap out of it along the way.
If this is, as you say, "core business logic", then you really don't want to be screwing around with fancy decompositions and evolutionary algorithms that produce a "black box" solution that no one in the world understands or is capable of modifying. I would be very surprised if any of these techniques actually yielded any useful result; the human brain is still incomprehensibly more capable than any machine at untangling complicated relationships.
What you want to do is traditional refactoring: clean up the individual procedures, streamlining them and merging them where possible. Your goal is to make the code clear, so your successor doesn't have to go through the same process.
What language are you using?
Your problem should be pretty easy to model using Java Executors and Future<> tasks, but a similar framework is perhaps availabe on your chosen platform as well?
Also, if I understand this correctly, you want to generate a critical path for a large set of interdependent calculations -- is that something done dynamically, or do you "just" need a static analysis?
Regarding an algorithmic solution; pick up the closest copy of your numerical analysis textbook and refresh your memory on singular value decompositions and LU factorization; I'm guessing from the top off my head that this is what lies behind the tool you linked to.
EDIT: Since you're using Java, I'll give a brief outline of a suggestion proposal:
-> Use a threadpool executor to parallellize all calculations easily
-> Solve interdependencies with an object map of Future<> or FutureTask<>:s, i.e. if you variables are A, B and C, where A = B + C, do something like this:
static final Map<String, FutureTask<Integer>> mapping = ...
static final ThreadPoolExecutor threadpool = ...
FutureTask<Integer> a = new FutureTask<Integer>(new Callable<Integer>() {
public Integer call() {
Integer b = mapping.get("B").get();
Integer c = mapping.get("C").get();
return b + c;
}
}
);
FutureTask<Integer> b = new FutureTask<Integer>(...);
FutureTask<Integer> c = new FutureTask<Integer>(...);
map.put("A", a);
map.put("B", a);
map.put("C", a);
for ( FutureTask<Integer> task : map.values() )
threadpool.execute(task);
Now, if I'm not totally off (and I may very well be, it was a while since I worked in Java), you should be able to solve the apparent deadlock problem by tuning the thread pool size, or use a growing thread pool. (You still have to make sure that there are no interdependent tasks though, such as if A = B + C, and B = A + 1...)
If the black-box is linear you can discover all the coefficients by simply concatenating many vectors of input and many vectors of output.
you have input x[i] and output y[i], then you create a matrix Y whose columns are y[0], y[1], ... y[n], and a matrix X whose columns are x[0], x[1], ..., x[n]. There will be a transformation Y = T * X, then you may determine T = Y * inverse(X).
But since you said it is complex I bet it is not linear. Then if you still want a general framework you can use this a factor-graph
https://ieeexplore.ieee.org/document/910572
I would be curious if you can do this.
What I think is easier is to understand the code and rewrite it using the best practices.