Prediction limit in normal regression and survival regression - linear-regression

I am trying to predict the duration it takes for gas pipes to leak. I used 15 features which the most important one is “pipe installation year”. The latest leak data that I have is for a leak that happened in 2017 and that pipe was installed in 2009 I know that normal ML models that I built will not be able to do a good job in predicting the leak duration for pipes that have been installed after 2009. The reason I say it is because I first sort the data based on their “installation year” and then did a train test split to see how it functions in predicting test dataset, I got %93 R squared but when I turned the shuffle function off in train test split( which means that unlike the normal train test split which in, subsets are chosen randomly, the data will be in order of first %80 training and the last %20 testing) to see if it can predict the pipes that their ”year installed” was not in the model training, I only got %30 R squared. I know that “installation year” is a pretty important feature and the ML model can not predict the pipes that their “installation year” were not trained in the model.
I am also using survival regressions too on top of the normal ML models.I am not sure if I will have the same problem in COX PH model and other multivariate survival models too or not. Does COX PH be able to predict the hazard ratio and survival function for the pipes that were installed after 2009?

Will coxph be able to predict the hazard ratio and survival function for the pipes that were installed after 2009?
coxph should be able to calculate the hazard ratio and survival function for given period (it´s what it is supposed to do). You can run it and plot a KM to see if it makes sense and you can utilize the results.

Related

multiple regressor time series producing one output

Absolute beginner here. I'm trying to use a neural network to predict price of a product that's being shipped while using temperature, deaths during a pandemic, rain volume, and a column of 0 and 1's (dummy variable).
So imagine that I have a dataset given those values as well as column giving me time in a year/week format.
I started reading Rob Hyndman's forecasting book but I haven't yet seen anything that can help me. One idea that I have is to make a variable that's going to take out each column of the dataframe and make it into a time series. For example, for rain, I can do something like
rain <- df$rain_inches cost<-mainset3 %>% select(approx_cost) raintimeseries <-ts(rain, frequency=52, start=c(2015,1), end=c(2021,5))
I would the same for the other regressors.
I want to use neural networks on each of the regressors to predict cost and then put them all together.
Ideally I'm thinking it would be a good idea to train on say, 3/4 ths of the time series data and test on the remain 1/4 and then possibly make predictions.
I'm now seeing that even if I am using one regressor I'm still left with a multivariate time series and I've only found examples online for univariate models.
I'd appreciate if someone could give me ideas on how to model my problem using neural networks
I saw this link Keras RNN with LSTM cells for predicting multiple output time series based on multiple intput time series
but I just see a bunch of functions and nowhere where I can actually insert my function
The solution to your problem is the same as for the univariate case you found online except for the fact that you just need to work differently with your feature/independent set. your y variable or cost variable remains as is but your X variables will have to be in 3 dimensions which is (Number of observations, time steps, number of independent variables)

One Class SVM using LibSVM in Matlab - Conceptual

Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and if what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning 1% of the observations are outliers). But why I ask is because in other questions regarding one class SVMs, people appear to be using their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example remove the upper and lower 0.5-percentiles from Y).
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using RBF and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the amount of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): It is proved that nu is an upper bound on the fraction of training errors and
a lower bound of the fraction of support vectors. In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace can by should.
So when you say that the resulting model does yield performance somewhat in line with what I would expect ... ofcourse it does, by definition. Since you have set nu=0.01, 1% of the data is rejected by the model and thus flagged as an outlier.

Continuously train MATLAB ANN, i.e. online training?

I would like to ask for ideas what options there is for training a MATLAB ANN (artificial neural network) continuously, i.e. not having a pre-prepared training set? The idea is to have an "online" data stream thus, when first creating the network it's completely untrained but as samples flow in the ANN is trained and converges.
The ANN will be used to classify a set of values and the implementation would visualize how the training of the ANN gets improved as samples flows through the system. I.e. each sample is used for training and then also evaluated by the ANN and the response is visualized.
The effect that I expect is that for the very first samples the response of the ANN will be more or less random but as the training progress the accuracy improves.
Any ideas are most welcome.
Regards, Ola
In MATLAB you can use the adapt function instead of train. You can do this incrementally (change weights every time you get a new piece of information) or you can do it every N-samples, batch-style.
This document gives an in-depth run-down on the different styles of training from the perspective of a time-series problem.
I'd really think about what you're trying to do here, because adaptive learning strategies can be difficult. I found that they like to flail all over compared to their batch counterparts. This was especially true in my case where I work with very noisy signals.
Are you sure that you need adaptive learning? You can't periodically re-train your NN? Or build one that generalizes well enough?

Good results with NN, not with SVM; cause for concern?

I have painstakingly gathered data for a proof-of-concept study I am performing. The data consists of 40 different subjects, each with 12 parameters measured at 60 time intervals and 1 output parameter being 0 or 1. So I am building a binary classifier.
I knew beforehand that there is a non-linear relation between the input-parameters and the output so a simple perceptron of Bayes classifier would be unable to classify the sample. This assumption proved correct after initial tests.
Therefore I went to neural networks and as I hoped the results were pretty good. An error of about 1-5% is generally the result. The training is done by using 70% as training and 30% as evaluation. Running the complete dataset again (100%) through the model I was very happy with the results. The following is a typical confusion matrix (P = positive, N = negative):
P N
P 13 2
N 3 42
So I am happy and with the notion that I used a 30% for evaluation I am confident that I am not fitting noise.
Therefore I resolved to SVM for a double check and the SVM was unable to converge to a good solution. Most of the time the solutions are terrible (say 90% error...). Maybe I am not fully aware of SVM's or the implementations are not correct, but it troubles me because I thought that when NN provide a good solution, SVM's are most of the time better in seperating the data due to their maximum-margin hyperplane.
What does this say of my result? Am I fitting noise? And how do I know if this is a correct result?
I am using Encog for the calculations but the NN results are comparable to home-grown NN models I made.
If it is your first time to use SVM, I strongly recommend you to take a look at A Practical Guide to Support Vector Classication, by authors of a famous SVM package libsvm. It gives a list of suggestions to train your SVM classifier.
Transform data to the format of an SVM package
Conduct simple scaling on the data
Consider the RBF kernel
Use cross-validation to nd the best parameter C and γ
Use the best parameter C and γ
to train the whole training set
Test
In short, try scaling your data and carefully choosing the kernal plus the parameters.

Optimization of Neural Network input data

I'm trying to build an app to detect images which are advertisements from the webpages. Once I detect those I`ll not be allowing those to be displayed on the client side.
Basically I'm using Back-propagation algorithm to train the neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But in that dataset no. of attributes are very high. In fact one of the mentors of the project told me that If you train the Neural Network with that many attributes, it'll take lots of time to get trained. So is there a way to optimize the input dataset? Or I just have to use that many attributes?
1558 is actually a modest number of features/attributes. The # of instances(3279) is also small. The problem is not on the dataset side, but on the training algorithm side.
ANN is slow in training, I'd suggest you to use a logistic regression or svm. Both of them are very fast to train. Especially, svm has a lot of fast algorithms.
In this dataset, you are actually analyzing text, but not image. I think a linear family classifier, i.e. logistic regression or svm, is better for your job.
If you are using for production and you cannot use open source code. Logistic regression is very easy to implement compared to a good ANN and SVM.
If you decide to use logistic regression or SVM, I can future recommend some articles or source code for you to refer.
If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (use standard statistical methods for this) Are some features redundant? (Principal component analysis is a good stating point for this.) Don't expect the artificial neural network to do that work for you.
Also: remeber Duda&Hart's famous "no-free-lunch-theorem": No classification algorithm works for every problem. And for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda&Hart&Storks's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)
aplly a seperate ANN for each category of features
for example
457 inputs 1 output for url terms ( ANN1 )
495 inputs 1 output for origurl ( ANN2 )
...
then train all of them
use another main ANN to join results