Predicting into the future with Azure Machine Learning Studio Designer

I am currently developing an automated mechanism with the Azure Machine Learning Designer (AMLD). During development I used an 80/20 split to test the efficiency of my predictions.
Now I want to go live, but I've missed the point where I can actually predict into the future.
I currently get a prediction for the last 20% of my data so I can compare it to the actual data. How do I change this so that the prediction actually starts at the end of my data?
A part of my prediction process is attached:

Continuing from the comments:
Example problem statement: predicting salary based on experience.
The dataset consists of three columns; salary is the dependent variable, and the first two columns are the independent variables.
The sample code starts below.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

# Training the linear regression model on the complete dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Training the polynomial regression model on the complete dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
As a sample, I am showing polynomial regression alongside plain linear regression.
# Visualize the linear regression results
plt.scatter(X, y, color='red')
plt.plot(X, lin_reg.predict(X), color='blue')
plt.title('Truth or Bluff (Linear Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

# Visualize the polynomial regression results
plt.scatter(X, y, color='red')
plt.plot(X, lin_reg_2.predict(poly_reg.transform(X)), color='blue')
plt.title('Truth or Bluff (Polynomial Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()
# Predicting a new result with linear regression -> this is the part that can help you
lin_reg.predict([[6.5]])
# Predicting a new result with polynomial regression
lin_reg_2.predict(poly_reg.transform([[6.5]]))
Walk through the flow of implementing two different regression models on the same dataset and compare how their results differ.
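To actually forecast past the end of your data, the principle is the same whether you use plain scikit-learn or an AMLD pipeline: train on all of the historical rows (no 80/20 split) and then call predict on feature values that lie beyond the last observed one. A minimal sketch along those lines; the position levels 11 and 12 are hypothetical future inputs that do not appear in Position_Salaries.csv:

import numpy as np
# Hypothetical inputs beyond the end of the training data
future_X = np.array([[11], [12]])
# Reuse the polynomial transformer fitted above, then predict
future_pred = lin_reg_2.predict(poly_reg.transform(future_X))
print(future_pred)

Bear in mind that polynomial models extrapolate poorly far outside the training range, so keep the forecast horizon short.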

Related

FFT not showing any dominant frequencies

I am trying to perform an FFT on time series data of DC motor current from the "F.A.I.R. open dataset of brushed DC motor faults for testing of AI algorithms". However, the result does not show any dominant frequency bands; it just resembles broadband noise. The first image is a zoomed-in snapshot of the time series data (the entire series is over 100,000 data points) after the DC portion has been subtracted.
[Time series graph]
The second image is the FFT graph, and my code is below. The time period is not yet set correctly, but this does not affect the shape of the data, only the frequency values assigned to it.
[FFT graph]
import matplotlib.pyplot as plt
import numpy as np
import h5py
from scipy.fft import fft, fftfreq

filename = "MOTOR-DC_2020_12_02_17_59_47_Analogico.hdf5"
# MOTOR-DC_2020_12_02_17_59_47_Analogico.hdf5
# MOTOR-DC_2020_12_02_17_30_42_Analogico.hdf5
with h5py.File(filename, "r") as f:
    # List all groups
    print("Keys: %s" % f.keys())
    a_group_key = list(f.keys())[0]
    # Get the data
    data = list(f[a_group_key])

vibration = [data[i][0] for i in range(len(data))]
current = [data[i][1] for i in range(len(data))]
voltage = [data[i][2] for i in range(len(data))]  # note: the original chained assignment (voltage=current=...) overwrote current with the voltage column

# Number of sample points
N = len(data)  # 600
# Sample spacing
T = 0.0001
x = np.linspace(0.0, N*T, N, endpoint=False)

# Subtract the mean to remove the DC component
y = np.asarray(current) - np.mean(current)
print('Mean =', np.mean(current), 'Median =', np.median(current))
#plt.plot(x, current)

yf = fft(y)
xf = fftfreq(n=N, d=T)[:N//2]
plt.plot(xf, 2.0/N * np.abs(yf[0:N//2]))
plt.grid()
plt.show()
Try:
yf = fft(y)
xf = fftfreq(N)
xf[np.argmax(np.abs(yf))]
This will give you the normalized frequency of the most prominent harmonic.
You can then multiply it by the sampling frequency to get the actual frequency.
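As a concrete sketch using the variables from the question above (where T = 0.0001 s, so the sampling frequency is 1/T = 10 kHz); this is just the suggestion spelled out, not new analysis:

fs = 1.0 / T                              # sampling frequency in Hz
peak_bin = np.argmax(np.abs(yf[:N//2]))   # look only at positive frequencies
peak_hz = fftfreq(N)[peak_bin] * fs       # normalized frequency times fs
print('Most prominent harmonic: %.1f Hz' % peak_hz)

Equivalently, fftfreq(N, d=T)[peak_bin] gives the frequency in Hz directly.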

Remove noise and smoothen the ecg signal

I am processing the Long Term AF Database - https://physionet.org/content/ltafdb/1.0.0/
When I test the 30 s strips of this data, my model is not correctly predicting the signals, so I am trying to deal with the noise in this dataset. Here is how it looks.
Here is the code to plot:
def plot_filter_graphs(data, xmin, xmax, order):
    import numpy as np
    from scipy import signal
    from scipy.signal import lfilter, filtfilt
    from matplotlib.pyplot import plot, legend, show, grid, figure, xlim

    lowcut = 1
    highcut = 35
    nyq = 0.5 * 300  # Nyquist frequency for the 300 Hz sampling rate
    low = lowcut / nyq
    high = highcut / nyq
    b, a = signal.butter(order, [low, high], btype='band')

    # Apply the filter with a single forward pass.
    z = lfilter(b, a, data)

    # Use filtfilt to apply the filter with zero phase, then run an
    # extra manual backward pass with lfilter.
    y = filtfilt(b, a, data)
    y = np.flipud(y)
    y = signal.lfilter(b, a, y)
    y = np.flipud(y)

    # Make the plot.
    figure(figsize=(16, 5))
    plot(data, 'b', linewidth=1.75)
    plot(z, 'r--', linewidth=1.75)
    plot(y, 'k', linewidth=1.75)
    xlim(xmin, xmax)
    legend(('actual', 'lfilter', 'filtfilt'), loc='best')
    grid(True)
    show()
I am using a Butterworth band-pass filter to filter the noise. I also checked filtfilt and lfilter, but those are not giving good results either.
Any suggestions on how the noise can be removed so that the signal is accurate enough to be used for model prediction?
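One thing worth trying on top of the zero-phase band pass is a notch filter at the powerline frequency, which is a very common source of ECG noise. A minimal sketch, assuming the 300 Hz sampling rate implied by the nyq calculation above and 60 Hz mains interference (use 50 Hz where applicable):

import numpy as np
from scipy import signal

fs = 300  # sampling rate assumed from the nyq = 0.5 * 300 line above
b_bp, a_bp = signal.butter(3, [0.5, 40], btype='band', fs=fs)
b_notch, a_notch = signal.iirnotch(60, Q=30, fs=fs)
clean = signal.filtfilt(b_bp, a_bp, data)          # zero-phase band pass
clean = signal.filtfilt(b_notch, a_notch, clean)   # suppress mains hum

If the residual noise is baseline wander rather than hum, raising the low cutoff slightly (e.g. to 1 Hz, as in your code) tends to help more than the notch.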

plotting large time series efficiently (matplotlib)

I'm trying to plot three time series on the same axes using matplotlib. Each time series has 10^6 data points. While I have no problem generating the figure, the PDF output is large and very slow to open in viewers. Aside from working in a rasterized format, or only plotting a subset of the time series, is there a way to get better graphical performance? I have tried "optimizing" in Acrobat, and I have also had the same trouble with MATLAB.
The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
data=np.loadtxt('data.txt')
idx = data[:,0]
y1 = data[:,1]
y2 = data[:,2]
y3 = data[:,3]
plt.rc('text', usetex=True)
plt.rc('font', size=16)
fig, ax = plt.subplots()
ax.plot(idx, y1, color='b', label=r'$y_1$',
        linewidth=2.0, markersize=10, fillstyle='none')
ax.plot(idx, y2, color='g', label=r'$y_2$',
        linewidth=2.0, markersize=10, fillstyle='none')
ax.plot(idx, y3, color='r', label=r'$y_3$',
        linewidth=2.0, markersize=10, fillstyle='none')
plt.xlabel(r'Index')
plt.ylabel(r'Values')
legend = ax.legend(loc='upper right',fontsize=16)
ax.set_xscale('log')
plt.savefig('fig1.pdf',bbox_inches='tight')
plt.show()
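One vector-friendly knob worth trying is matplotlib's built-in path simplification, which merges line segments that are visually indistinguishable before they are written to the output file. A minimal sketch; the threshold is a starting value to experiment with, and how much it shrinks the PDF depends on the data and matplotlib version:

import matplotlib as mpl
mpl.rcParams['path.simplify'] = True
mpl.rcParams['path.simplify_threshold'] = 1.0  # in [0, 1]; higher is more aggressive

Note that simplification applies only to the line segments, not to any markers, so keep the plot calls marker-free as above.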

How to use rp, rs, and Wn parameters in scipy.signal.filter_design.ellip?

I'd like to try out the elliptic filter design function from SciPy in scipy.signal.filter_design.ellip. I'm familiar with the filter design functions in Octave, but I'm not sure how to use this:
From the documentation at http://www.scipy.org/doc/api_docs/SciPy.signal.filter_design.html
ellip(N, rp, rs, Wn, btype = 'low', analog = 0, output = 'ba')
Elliptic (Cauer) digital and analog filter design.
Description:
Design an Nth order lowpass digital or analog elliptic filter and return the filter coefficients in (B,A) or (Z,P,K) form.
See also ellipord.
I understand N (order), btype (low or high), analog (true/false), and output (ba vs. zpk).
What are rp, rs, and Wn and how are they supposed to work?
From my experience with Octave, I'm guessing that rp and rs have to do with the maximum allowed ripple in the pass and stop bands, and that Wn is a weight or controls the cutoff frequency, but how these work isn't documented and I can't find any examples.
I believe HYRY is correct. From my experience using these Python MATLAB-clone functions, they work well, with the exception of poor documentation. Yes, rp is the maximum allowable ripple in the passband and rs is the minimum required attenuation in the stopband, both in dB. Wn is the digital cutoff, or edge, frequency.
So, here's some code on how to use it to replicate the filter that MathWorks uses as an example:

import numpy as np
import matplotlib.pyplot as plt
import scipy.signal

# 6th order, 3 dB passband ripple, 50 dB stopband attenuation,
# cutoff at 300/500 of the Nyquist frequency
b, a = scipy.signal.ellip(6, 3, 50, 300.0 / 500.0)

fig = plt.figure()
plt.title('Digital filter frequency response')
ax1 = fig.add_subplot(111)
# Note: freqz returns (frequencies, response), in that order
w, h = scipy.signal.freqz(b, a)
plt.semilogy(w, np.abs(h), 'b')
plt.ylabel('Amplitude (dB)', color='b')
plt.xlabel('Frequency (rad/sample)')
plt.grid()
ax2 = ax1.twinx()
angles = np.unwrap(np.angle(h))
plt.plot(w, angles, 'g')
plt.ylabel('Angle (radians)', color='g')
plt.show()

Sorry the format is so lame, but it works! You'll notice that the frequency scale is different from what MATLAB shows; that's just cosmetic. This is what you get:
I think this function works the same as in Octave or MATLAB, so you can read the MATLAB documentation about it.
http://www.mathworks.com/help/toolbox/signal/ref/ellip.html
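For what it's worth, on recent SciPy versions the same design reads more explicitly with keyword arguments and the fs parameter (assuming, as in that MathWorks example, a 1000 Hz sampling rate, i.e. a 500 Hz Nyquist frequency):

from scipy import signal
# 6th order, 3 dB passband ripple (rp), 50 dB stopband attenuation (rs),
# 300 Hz cutoff (Wn), specified against a 1000 Hz sampling rate
b, a = signal.ellip(6, rp=3, rs=50, Wn=300, btype='low', fs=1000)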

Using scipy.stats.gaussian_kde with 2 dimensional data

I'm trying to use the scipy.stats.gaussian_kde class to smooth out some discrete data collected with latitude and longitude information, so that it ends up looking somewhat like a contour map, where the high densities are peaks and the low densities are valleys.
I'm having a hard time putting a two-dimensional dataset into the gaussian_kde class. I've played around to figure out how it works with 1 dimensional data, so I thought 2 dimensional would be something along the lines of:
from scipy import stats
from numpy import array
data = array([[1.1, 1.1],
              [1.2, 1.2],
              [1.3, 1.3]])
kde = stats.gaussian_kde(data)
kde.evaluate([1,2,3],[1,2,3])
which is saying that I have 3 points at [1.1, 1.1], [1.2, 1.2], and [1.3, 1.3], and I want the kernel density estimate evaluated from 1 to 3, using a width of 1 on the x and y axes.
When creating the gaussian_kde, it keeps giving me this error:
raise LinAlgError("singular matrix")
numpy.linalg.linalg.LinAlgError: singular matrix
Looking into the source code of gaussian_kde, I realize that the way I'm thinking about what the dataset means is completely different from how the dimensionality is calculated, but I could not find any sample code showing how multi-dimensional data works with the module. Could someone help me with some sample ways to use gaussian_kde with multi-dimensional data?
This example seems to be what you're looking for:
import numpy as np
import scipy.stats as stats
from matplotlib.pyplot import imshow
# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2, scale=1, size=(2000, 1)),
                stats.norm.rvs(loc=0, scale=3, size=(2000, 1)),
                axis=1)
kde = stats.kde.gaussian_kde(rvs.T)
# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)
z = kde(grid_coords.T)
z = z.reshape(128,128)
imshow(z,aspect=x_flat.ptp()/y_flat.ptp())
Axes need fixing, obviously.
You can also do a scatter plot of the data with
scatter(rvs[:,0],rvs[:,1])
I think you are mixing up kernel density estimation with interpolation or maybe kernel regression. KDE estimates the distribution of points if you have a larger sample of points.
I'm not sure which interpolation you want, but either the splines or rbf in scipy.interpolate will be more appropriate.
If you want one-dimensional kernel regression, then you can find a version in scikits.statsmodels with several different kernels.
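In case the kernel-regression route is what you're after, here is a minimal sketch using the modern statsmodels import path (scikits.statsmodels has since become statsmodels); the data is made up purely for illustration:

import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * np.random.randn(200)
kr = KernelReg(endog=y, exog=x, var_type='c')  # 'c' = continuous regressor
y_hat, marginal_effects = kr.fit(x)            # smoothed values at x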
update: here is an example (if this is what you want)
>>> data = 2 + 2*np.random.randn(2, 100)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 0.02573917, 0.02470436, 0.03084282])
gaussian_kde has variables in rows and observations in columns, so the orientation is reversed from the usual one in stats. In your example, all three points are on a line, so they are perfectly correlated. That is, I guess, the reason for the singular matrix.
Adjusting the array orientation and adding a small amount of noise, the example works, but it still looks very concentrated; for example, you don't have any sample points near (3, 3):
>>> data = np.array([[1.1, 1.1],
...                  [1.2, 1.2],
...                  [1.3, 1.3]]).T
>>> data = data + 0.01*np.random.randn(2,3)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 7.70204299e+000, 1.96813149e-044, 1.45796523e-251])
I found it difficult to understand the SciPy manual's description of how gaussian_kde works with 2D data. Here is an explanation which is intended to complement @endolith's example. I divided the code into several steps with comments to explain the less intuitive bits.
First, the imports:
import numpy as np
import scipy.stats as st
from matplotlib.pyplot import imshow, show
Create some dummy data: these are 1-D arrays of the "X" and "Y" point coordinates.
np.random.seed(142) # for reproducibility
x = st.norm.rvs(loc=2, scale=1, size=2000)
y = st.norm.rvs(loc=0, scale=3, size=2000)
For 2-D density estimation the gaussian_kde object has to be initialised with an array with two rows containing the "X" and "Y" datasets. In NumPy terminology, we "stack them vertically":
xy = np.vstack((x, y))
so the "X" data is in the first row xy[0,:] and the "Y" data are in the second row xy[1,:] and xy.shape is (2, 2000). Now create the gaussian_kde object:
dens = st.gaussian_kde(xy)
We will evaluate the estimated 2-D density PDF on a 2-D grid. There is more than one way of creating such a grid in NumPy. I show here an approach which is different from (but functionally equivalent to) @endolith's method:
gx, gy = np.mgrid[x.min():x.max():128j, y.min():y.max():128j]
gxy = np.dstack((gx, gy)) # shape is (128, 128, 2)
gxy is a 3-D array; the [i, j]-th element of gxy contains a 2-element list of the corresponding "X" and "Y" values: the value of gxy[i, j] is [gx[i, j], gy[i, j]], i.e. the i-th "X" value paired with the j-th "Y" value.
We have to invoke dens() (or dens.pdf() which is the same thing) on each of the 2-D grid points. NumPy has a very elegant function for this purpose:
z = np.apply_along_axis(dens, 2, gxy)
In words, the callable dens (could have been dens.pdf as well) is invoked along axis=2 (the third axis) in the 3-D array gxy and the values should be returned as a 2-D array. The only glitch is that the shape of z will be (128,128,1) and not (128,128) what I expected. Note that the documentation says that:
The shape of out [the return value, L.D.] is identical to the shape of arr, except along the
axis dimension. This axis is removed, and replaced with new dimensions
equal to the shape of the return value of func1d. So if func1d returns
a scalar out will have one fewer dimensions than arr.
Most likely dens() returned a 1-element array rather than the scalar I was hoping for. I didn't investigate the issue any further, because this is easy to fix:
z = z.reshape(128, 128)
after which we can generate the image:
imshow(z, aspect=gx.ptp() / gy.ptp())
show() # needed if you try this in PyCharm
Here is the image. (Note that I implemented @endolith's version as well and got an image indistinguishable from this one.)
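As a side note, an alternative to apply_along_axis that avoids the (128, 128, 1) shape surprise entirely is to flatten the grid and evaluate the KDE in a single vectorized call:

z = dens(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)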
The example posted in the top answer didn't work for me. I had to tweak it a little bit, and it works now:
import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt
# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2, scale=1, size=(2000, 1)),
                stats.norm.rvs(loc=0, scale=3, size=(2000, 1)),
                axis=1)
kde = stats.kde.gaussian_kde(rvs.T)
# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)
z = kde(grid_coords.T)
z = z.reshape(128,128)
plt.imshow(z,aspect=x_flat.ptp()/y_flat.ptp())
plt.show()