How to Predict a Trend in MATLAB with irregular sampling time? - matlab

I'm using MATLAB to predict a trend with a machine learning approach.
My data file is an .xlsx file containing a timeline in one column (various sampling timestamps, i.e. numbers that represents seconds), and in the other columns I have some integers representing my trend.
My .xlsx file is pretty much like this:
0,0100 | 0
0,0110 | 1
0,0135 | 5
And so on.
I used "|" to distinguish between columns. The sampling time is not regular.
Given 10 values of the trend taken from 10 consecutive timestamps, I'd like to predict the 11th value at a given timestamp. For example, if the 9th value is at 34,010 and the 10th value is at 34,568s I'd like to know the value at 37,431s.
How can I do it?
I've found this link: Time Series Forecasting Using Deep Learning, but there the sampling time is regular.
Should I interpolate my trend values and re-sample them with a constant sampling time?

I would distinguish the forecasting problem from the data sampling time problem. You are dealing substantially with missing data.
Forecasting problem: You may use any machine learning technique just ignoring missing data. If you are not familiar with machine learning, I would suggest you to use LASSO (least absolute shrinkage and selection operator), which has been demonstrated to have predicting power (see "Sparse Signals in the Cross-Section of Returns" by ALEX CHINCO, ADAM D. CLARK-JOSEPH, and MAO YE).
Missing imputation problem: In the first place you should consider the reason why you have missing data. Sometime it makes no sense to impute values because the information that the value is missing is itself important and should not be overridden. Otherwise you have multiple options, other than linear interpolation, to estimate the missing values. For example check the MATLAB function fillmissing.

Related

multiple regressor time series producing one output

Absolute beginner here. I'm trying to use a neural network to predict price of a product that's being shipped while using temperature, deaths during a pandemic, rain volume, and a column of 0 and 1's (dummy variable).
So imagine that I have a dataset given those values as well as column giving me time in a year/week format.
I started reading Rob Hyndman's forecasting book but I haven't yet seen anything that can help me. One idea that I have is to make a variable that's going to take out each column of the dataframe and make it into a time series. For example, for rain, I can do something like
rain <- df$rain_inches cost<-mainset3 %>% select(approx_cost) raintimeseries <-ts(rain, frequency=52, start=c(2015,1), end=c(2021,5))
I would the same for the other regressors.
I want to use neural networks on each of the regressors to predict cost and then put them all together.
Ideally I'm thinking it would be a good idea to train on say, 3/4 ths of the time series data and test on the remain 1/4 and then possibly make predictions.
I'm now seeing that even if I am using one regressor I'm still left with a multivariate time series and I've only found examples online for univariate models.
I'd appreciate if someone could give me ideas on how to model my problem using neural networks
I saw this link Keras RNN with LSTM cells for predicting multiple output time series based on multiple intput time series
but I just see a bunch of functions and nowhere where I can actually insert my function
The solution to your problem is the same as for the univariate case you found online except for the fact that you just need to work differently with your feature/independent set. your y variable or cost variable remains as is but your X variables will have to be in 3 dimensions which is (Number of observations, time steps, number of independent variables)

Linear regression returning bad fit with large x values

I'm looking to do a linear regression to determine the estimated date of depletion for a particular resource. I have a dataset containing a column of dates, and several columns of data, always decreasing. A linear regression using scikit learn's LinearRegression() function yields a bad fit.
I converted the date column to ordinal, which resulted in values ~700,000. Relative to the y axis of values between 0-200, this is rather large. I imagine that the regression function is starting at low values and working its way up, eventually giving up before it finds a good enough fit. If i could assign starting values to the parameters, large intercept and small slope, perhaps it would fix the problem. I don't know how to do this, and i am very curious as to other solutions.
Here is a link to some data-
https://pastebin.com/BKpeZGmN
And here is my current code
model=LinearRegression().fit(dates,y)
model.score(dates,y)
y_pred=model.predict(dates)
plt.scatter(dates,y)
plt.plot(dates,y_pred,color='red')
plt.show()
print(model.intercept_)
print(model.coef_)
This code plots the linear model over the data, yielding stunning inaccuracy. I would share in this post, but i am not sure how to post an image from my desktop.
My original data is dates, and i convert to ordinal in code i have not shared here. If there is an easier way to do this that would be more accurate, i would appreciate a suggestion.
Thanks,
Will

Appropriate method for clustering ordinal variables

I was reading through all (or most) previously asked questions, but couldn't find an answer to my problem...
I have 13 variables measured on an ordinal scale (thy represent knowledge transfer channels), which I want to cluster (HCA) for a following binary logistic regression analysis (including all 13 variables is not possible due to sample size of N=208). A Factor Analysis seems inappropriate due to the scale level. I am using SPSS (but tried R as well).
Questions:
1: Am I right in using the Chi-Squared measure for count data instead of the (squared) euclidian distance?
2. How can I justify a choice of method? I tried single, complete, Ward and average, but all give different results and I can't find a source to base my decision on.
Thanks a lot in advance!
Answer 1: Since the variables are on ordinal scale, the chi-square test is an appropriate measurement test. Because, "A Chi-square test is designed to analyze categorical data. That means that the data has been counted and divided into categories. It will not work with parametric or continuous data (such as height in inches)." Reference.
Again, ordinal scaled data is essentially count or frequency data you can use regular parametric statistics: mean, standard deviation, etc Or non-parametric tests like ANOVA or Mann-Whitney U test to compare 2 groups or Kruskal–Wallis H test to compare three or more groups.
Answer 2: In a clustering problem, the choice of distance method solely depends upon the type of variables. I recommend you to read these detailed posts 1, 2,3

Extracting Patterns using Neural Networks

I am trying to extract common patterns that always appear whenever a certain event occurs.
For example, patient A, B, and C all had a heart attack. Using the readings from there pulse, I want to find the common patterns before the heart attack stroke.
In the next stage I want to do this using multiple dimensions. For example, using the readings from the patients pulse, temperature, and blood pressure, what are the common patterns that occurred in the three dimensions taking into consideration the time and order between each dimension.
What is the best way to solve this problem using Neural Networks and which type of network is best?
(Just need some pointing in the right direction)
and thank you all for reading
Described problem looks like a time series prediction problem. That means a basic prediction problem for a continuous or discrete phenomena generated by some existing process. As a raw data for this problem we will have a sequence of samples x(t), x(t+1), x(t+2), ..., where x() means an output of considered process and t means some arbitrary timepoint.
For artificial neural networks solution we will consider a time series prediction, where we will organize our raw data to a new sequences. As you should know, we consider X as a matrix of input vectors that will be used in ANN learning. For time series prediction we will construct a new collection on following schema.
In the most basic form your input vector x will be a sequence of samples (x(t-k), x(t-k+1), ..., x(t-1), x(t)) taken at some arbitrary timepoint t, appended to it predecessor samples from timepoints t-k, t-k+1, ..., t-1. You should generate every example for every possible timepoint t like this.
But the key is to preprocess data so that we get the best prediction results.
Assuming your data (phenomena) is continuous, you should consider to apply some sampling technique. You could start with an experiment for some naive sampling period Δt, but there are stronger methods. See for example Nyquist–Shannon Sampling Theorem, where the key idea is to allow to recover continuous x(t) from discrete x(Δt) samples. This is reasonable when we consider that we probably expect our ANNs to do this.
Assuming your data is discrete... you still should need to try sampling, as this will speed up your computations and might possibly provide better generalization. But the key advice is: do experiments! as the best architecture depends on data and also will require to preprocess them correctly.
The next thing is network output layer. From your question, it appears that this will be a binary class prediction. But maybe a wider prediction vector is worth considering? How about to predict the future of considered samples, that is x(t+1), x(t+2) and experiment with different horizons (length of the future)?
Further reading:
Somebody mentioned Python here. Here is some good tutorial on timeseries prediction with Keras: Victor Schmidt, Keras recurrent tutorial, Deep Learning Tutorials
This paper is good if you need some real example: Fessant, Francoise, Samy Bengio, and Daniel Collobert. "On the prediction of solar activity using different neural network models." Annales Geophysicae. Vol. 14. No. 1. 1996.

Time series classification MATLAB

My task is to classify time-series data with use of MATLAB and any neural-network framework.
Describing task more specifically:
Is is a problem from computer-vision field. Is is a scene boundary detection task.
Source data are 4 arrays of neighbouring frame histogram correlations from the videoflow.
Based on this data, we have to classify this timeseries with 2 classes:
"scene break"
"no scene break"
So network input is 4 double values for each source data entry, and output is one binary value. I am going to show example of src data below:
0.997894,0.999413,0.982098,0.992164
0.998964,0.999986,0.999127,0.982068
0.993807,0.998823,0.994008,0.994299
0.225917,0.000000,0.407494,0.400424
0.881150,0.999427,0.949031,0.994918
Problem is that pattern-recogition tools from Matlab Neural Toolbox (like patternnet) threat source data like independant entrues. But I have strong belief that results will be precise only if net take decision based on the history of previous correlations.
But I also did not manage to get valid response from reccurent nets which serve time series analysis (like delaynet and narxnet).
narxnet and delaynet return lousy result and it looks like these types of networks not supposed to solve classification tasks. I am not insert any code here while it is allmost totally autogenerated with use of Matlab Neural Toolbox GUI.
I would apprecite any help. Especially, some advice which tool fits better for accomplishing my task.
I am not sure how difficult to classify this problem.
Given your sample, 4 input and 1 output feed-forward neural network is sufficient.
If you insist on using historical inputs, you simply pre-process your input d, such that
Your new input D(t) (a vector at time t) is composed of d(t) is a 1x4 vector at time t; d(t-1) is 1x4 vector at time t-1;... and d(t-k) is a 1x4 vector at time t-k.
If t-k <0, just treat it as '0'.
So you have a 1x(4(k+1)) vector as input, and 1 output.
Similar as Dan mentioned, you need to find a good k.
Speaking of the weights, I think additional pre-processing like windowing method on the input is not necessary, since neural network would be trained to assign weights to each input dimension.
It sounds a bit messy, since the neural network would consider each input dimension independently. That means you lose the information as four neighboring correlations.
One possible solution is the pre-processing extracts the neighborhood features, e.g. using mean and std as two features representative for the originals.