FFT not showing any dominant frequencies - scipy

I am trying to perform an FFT on time series data of DC motor current from the "F.A.I.R. open dataset of brushed DC motor faults for testing of AI algorithms". However, the result does not show any dominant frequency bands; it just resembles broadband noise. The first image is a zoomed-in snapshot of the time series data (the entire series is over 100,000 data points) after the DC component has been subtracted.
[Time series graph]
The second image is the FFT graph, and my code is below. The time period is not yet set correctly, but this does not affect the shape of the result, only the frequency values assigned to it.
[FFT graph]
import matplotlib.pyplot as plt
import h5py

filename = "MOTOR-DC_2020_12_02_17_59_47_Analogico.hdf5"
#MOTOR-DC_2020_12_02_17_59_47_Analogico.hdf5
#MOTOR-DC_2020_12_02_17_30_42_Analogico.hdf5
with h5py.File(filename, "r") as f:
    # List all groups
    print("Keys: %s" % f.keys())
    a_group_key = list(f.keys())[0]
    # Get the data
    data = list(f[a_group_key])

vibration = [data[i][0] for i in range(len(data))]
current = [data[i][1] for i in range(len(data))]
voltage = [data[i][2] for i in range(len(data))]
x = list(range(len(vibration)))
from scipy.fft import fft, fftfreq
import numpy as np

# Number of sample points
N = len(data)  # 600
# Sample spacing
T = 0.0001
x = np.linspace(0.0, N*T, N, endpoint=False)
y = current

# Subtract the mean to remove the DC component
y_mean = np.mean(y)
y_med = np.median(y)
print('Mean =', y_mean, 'Median =', y_med)
for i in range(len(y)):
    y[i] = y[i] - y_mean
#plt.plot(x, current)

yf = fft(y)
xf = fftfreq(n=N, d=T)[:N//2]

plt.plot(xf, 2.0/N * np.abs(yf[0:N//2]))
plt.grid()
plt.show()

Try
yf = fft(y)
xf = fftfreq(N)
xf[np.argmax(np.abs(yf))]
This will give you the normalized frequency of the most prominent harmonic.
You can then multiply it by the sampling frequency to get the actual frequency.
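For example, a short sketch (my addition, not part of the original answer), assuming the 0.0001 s sample spacing from the question, i.e. a sampling frequency of 10 kHz:
fs = 1/T                           # assumed sampling frequency (10 kHz for T = 0.0001)
yf = fft(y)
xf = fftfreq(N)                    # normalized frequencies in cycles/sample
k = np.argmax(np.abs(yf[:N//2]))   # restrict the search to the positive-frequency half
print(xf[k] * fs)                  # dominant frequency in Hz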

Predictions into the future Azure Machine Learning Studio Designer

I am currently developing an automated mechanism using the Azure Machine Learning Designer (AMLD). During development I used an 80/20 split to test the efficiency of my predictions.
Now I want to go live, but I have missed the point where I can actually predict into the future.
I currently get a prediction for the last 20% of my data so I can compare it to the actual data. How do I change it so that the prediction actually starts at the end of my data?
A part of my prediction process is attached:
Continuing with the comments:
Example problem statement: Predicting salary based on experience.
The dataset consists of three columns; salary is the dependent variable, and the first two columns are the independent variables.
The sample code starts below.
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
#Training the linear regression model on complete dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
#Training the polynomial regression model
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
As a sample, I am using polynomial regression here.
#visualize Linear regression results.
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff (Linear Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()
#Visualize Polynomial regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg_2.predict(poly_reg.fit_transform(X)), color = 'blue')
plt.title('Truth or Bluff (Polynomial Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
#Predicting new result with linear regression-> this can help you
lin_reg.predict([[6.5]])
#Predicting new result with polynomial regression
lin_reg_2.predict(poly_reg.fit_transform([[6.5]]))
Go through the flow of implementing the two different regression models on the same dataset and compare how the results differ.
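To make the "predict into the future" part concrete, here is a minimal sketch (my own addition, reusing the fitted objects from the snippet above and assuming the single feature column is an ordered index such as a time step): generate feature values past the last observed one and feed them to the fitted model.
import numpy as np
# hypothetical: forecast 10 steps beyond the last observed index
last = X[:, 0].max()
future_X = np.arange(last + 1, last + 11).reshape(-1, 1)
future_pred = lin_reg_2.predict(poly_reg.transform(future_X))
print(future_pred)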

Remove noise and smoothen the ecg signal

I am processing the Long Term AF Database - https://physionet.org/content/ltafdb/1.0.0/
When I test the 30 s strips of this data, my model is not correctly predicting the signals, so I am trying to deal with the noise in this dataset. Here is how it looks:
Here is the code to plot -
def plot_filter_graphs(data, xmin, xmax, order):
    import numpy as np
    from scipy.signal import lfilter, filtfilt, butter
    from matplotlib.pyplot import plot, legend, show, grid, figure, xlim

    lowcut = 1
    highcut = 35
    nyq = 0.5 * 300
    low = lowcut / nyq
    high = highcut / nyq
    b, a = butter(order, [low, high], btype='band')

    # Apply the filter with a single forward pass (lfilter).
    z = lfilter(b, a, data)

    # Apply the filter with filtfilt, followed by an extra backward pass.
    y = filtfilt(b, a, data)
    y = np.flipud(y)
    y = lfilter(b, a, y)
    y = np.flipud(y)

    # Make the plot.
    figure(figsize=(16, 5))
    plot(data, 'b', linewidth=1.75)
    plot(z, 'r--', linewidth=1.75)
    plot(y, 'k', linewidth=1.75)
    xlim(xmin, xmax)
    legend(('actual', 'lfilter', 'filtfilt'), loc='best')
    grid(True)
    show()
I am using a Butterworth band-pass filter to remove the noise. I also checked with filtfilt and lfilter, but those are not giving good results either.
Any suggestion on how the noise can be removed so that the signal accuracy is good and hence it can be used for model prediction?
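One thing worth trying (my suggestion, not something tested on this dataset): design the same band-pass filter in second-order-sections form and apply it with sosfiltfilt, which is numerically more robust than the (b, a) form at higher orders and gives zero-phase filtering in a single call.
import numpy as np
from scipy import signal

fs = 300                       # assumed sampling rate of the record, as in the code above
# 1-35 Hz Butterworth band-pass in second-order-sections form
sos = signal.butter(4, [1, 35], btype='band', fs=fs, output='sos')
filtered = signal.sosfiltfilt(sos, data)   # data: the raw ECG strip as a 1-D array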

Simple scipy curve_fit test not returning expected results

I am trying to estimate the amplitude, frequency, and phase of an incoming signal of about 50 Hz based on measurements of only a few cycles. The frequency needs to be precise to 0.01 Hz. Since the signal itself is going to be a pretty clear sine wave, I am trying parameter fitting with SciPy's curve_fit. I've never used it before, so I wrote a quick test function.
I start by generating samples of a single cycle of a dummy cosine wave
from math import *
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

fs = 1000                  # Sampling rate (Hz)
T = .1                     # Length of collection (s)
windowlength = int(fs*T)   # Number of samples
f0 = 10                    # Fundamental frequency of our wave (Hz)

wave = [0]*windowlength
for x in range(windowlength):
    wave[x] = cos(2*pi*f0*x/fs)

t = np.linspace(0,T,int(fs*T)) # This will be our x-axis for plotting
Then I try to fit those samples to a function, adapting the code from the official example provided by scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
# Define function to fit
def sinefit(x, A, ph, f):
    return A * np.sin(2*pi*f * x + ph)

# Call curve_fit
popt, cov = curve_fit(sinefit, t, wave, p0=[1, np.pi/2, 10])

# Plot the result
plt.plot(t, wave, 'b-', label='data')
plt.plot(t, sinefit(t, *popt), 'r-', label='fit')

print("[Amplitude, phase, frequency]")
print(popt)
This gives me popt = [1., 1.57079633, 9.9] and the plot
[plot output]
My question is: why is my frequency off? I initialized curve_fit with the exact parameters of the cosine wave, so shouldn't the first iteration of the LM algorithm see that there is zero residual and that it has already arrived at the correct answer? That seems to be the case for amplitude and phase, but the frequency is 0.1 Hz too low.
I expect this is a dumb coding mistake, since the original wave and the fit are clearly lined up in the plot. I also confirmed that the difference between them was zero across the entire sample. If they really differed by 0.1 Hz, there would be a phase shift of 3.6 degrees over my 100 ms window.
Any thoughts would be very much appreciated!
The problem is that your array t is not correct. The last value in your t is 0.1, but with a sampling period of 1/fs = 0.001, the last value in t should be 0.099. That is, the times of the 100 samples are [0, 0.001, 0.002, ..., 0.098, 0.099].
You can create t correctly with either
t = np.linspace(0, T, int(fs*T), endpoint=False)
or
t = np.arange(windowlength)/fs # Use float(fs) if you are using Python 2
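As a quick check (a sketch reusing the names from the question), repeating the fit with the corrected time axis should recover the exact parameters, because the initial guess then matches the data with zero residual:
t = np.arange(windowlength)/fs
popt, cov = curve_fit(sinefit, t, wave, p0=[1, np.pi/2, 10])
print(popt)   # approximately [1.0, 1.5707963, 10.0]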

plotting large time series efficiently (matplotlib)

I'm trying to plot three time series on the same axes using matplotlib. Each time series has 10^6 data points. While I have no problem generating the figure, the PDF output is large and very slow to open in viewers. Aside from working in a rasterized format, or only plotting a subset of the time series, is there a way to get better graphical performance? I have tried "optimizing" in Acrobat, and I have also had the same trouble with MATLAB.
The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
data=np.loadtxt('data.txt')
idx = data[:,0]
y1 = data[:,1]
y2 = data[:,2]
y3 = data[:,3]
plt.rc('text', usetex=True)
plt.rc('font', size=16)
fig, ax = plt.subplots()
ax.plot(idx, y1, color='b', label=r'$y_1$',
        linewidth=2.0, markersize=10, fillstyle='none')
ax.plot(idx, y2, color='g', label=r'$y_2$',
        linewidth=2.0, markersize=10, fillstyle='none')
ax.plot(idx, y3, color='r', label=r'$y_3$',
        linewidth=2.0, markersize=10, fillstyle='none')
plt.xlabel(r'Index')
plt.ylabel(r'Values')
legend = ax.legend(loc='upper right',fontsize=16)
ax.set_xscale('log')
plt.savefig('fig1.pdf',bbox_inches='tight')
plt.show()
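One option that keeps the output as vectors (my suggestion, not from the original post) is to lean on matplotlib's path simplification, which drops vertices that do not visibly change a solid line before the path is written out; raising the threshold removes more points and should shrink the PDF:
import matplotlib.pyplot as plt
plt.rcParams['path.simplify'] = True            # enabled by default
plt.rcParams['path.simplify_threshold'] = 1.0   # default is much smaller; 1.0 is the most aggressive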

Using scipy.stats.gaussian_kde with 2 dimensional data

I'm trying to use the scipy.stats.gaussian_kde class to smooth out some discrete data collected with latitude and longitude information, so that in the end it shows up somewhat like a contour map, where the high densities are peaks and the low densities are valleys.
I'm having a hard time putting a two-dimensional dataset into the gaussian_kde class. I've played around to figure out how it works with 1-dimensional data, so I thought 2-dimensional data would be something along the lines of:
from scipy import stats
from numpy import array
data = array([[1.1, 1.1],
              [1.2, 1.2],
              [1.3, 1.3]])
kde = stats.gaussian_kde(data)
kde.evaluate([1,2,3],[1,2,3])
which is saying that I have 3 points at [1.1, 1.1], [1.2, 1.2], and [1.3, 1.3], and I want the kernel density estimate evaluated from 1 to 3, with a width of 1, on the x and y axes.
When creating the gaussian_kde, it keeps giving me this error:
raise LinAlgError("singular matrix")
numpy.linalg.linalg.LinAlgError: singular matrix
Looking into the source code of gaussian_kde, I realize that the way I'm thinking about what the dataset means is completely different from how the dimensionality is calculated, but I could not find any sample code showing how multi-dimensional data works with the module. Could someone help me with some sample ways to use gaussian_kde with multi-dimensional data?
This example seems to be what you're looking for:
import numpy as np
import scipy.stats as stats
from matplotlib.pyplot import imshow
# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2, scale=1, size=(2000,1)),
                stats.norm.rvs(loc=0, scale=3, size=(2000,1)),
                axis=1)
kde = stats.kde.gaussian_kde(rvs.T)
# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)
z = kde(grid_coords.T)
z = z.reshape(128,128)
imshow(z,aspect=x_flat.ptp()/y_flat.ptp())
Axes need fixing, obviously.
You can also do a scatter plot of the data with
scatter(rvs[:,0],rvs[:,1])
I think you are mixing up kernel density estimation with interpolation or maybe kernel regression. KDE estimates the distribution of points if you have a larger sample of points.
I'm not sure which interpolation you want, but either the splines or rbf in scipy.interpolate will be more appropriate.
If you want one-dimensional kernel regression, then you can find a version in scikits.statsmodels with several different kernels.
update: here is an example (if this is what you want)
>>> data = 2 + 2*np.random.randn(2, 100)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 0.02573917, 0.02470436, 0.03084282])
gaussian_kde has variables in rows and observations in columns, so reversed orientation from the usual in stats. In your example, all three points are on a line, so it has perfect correlation. That is, I guess, the reason for the singular matrix.
Adjusting the array orientation and adding a small amount of noise, the example works, but it still looks very concentrated; for example, you don't have any sample point near (3,3):
>>> data = np.array([[1.1, 1.1],
...                  [1.2, 1.2],
...                  [1.3, 1.3]]).T
>>> data = data + 0.01*np.random.randn(2,3)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 7.70204299e+000, 1.96813149e-044, 1.45796523e-251])
I found it difficult to understand the SciPy manual's description of how gaussian_kde works with 2D data. Here is an explanation intended to complement @endolith's example. I divided the code into several steps with comments to explain the less intuitive bits.
First, the imports:
import numpy as np
import scipy.stats as st
from matplotlib.pyplot import imshow, show
Create some dummy data: these are 1-D arrays of the "X" and "Y" point coordinates.
np.random.seed(142) # for reproducibility
x = st.norm.rvs(loc=2, scale=1, size=2000)
y = st.norm.rvs(loc=0, scale=3, size=2000)
For 2-D density estimation the gaussian_kde object has to be initialised with an array with two rows containing the "X" and "Y" datasets. In NumPy terminology, we "stack them vertically":
xy = np.vstack((x, y))
so the "X" data is in the first row xy[0,:] and the "Y" data are in the second row xy[1,:] and xy.shape is (2, 2000). Now create the gaussian_kde object:
dens = st.gaussian_kde(xy)
We will evaluate the estimated 2-D density PDF on a 2-D grid. There is more than one way of creating such a grid in NumPy. I show here an approach which is different from (but functionally equivalent to) @endolith's method:
gx, gy = np.mgrid[x.min():x.max():128j, y.min():y.max():128j]
gxy = np.dstack((gx, gy)) # shape is (128, 128, 2)
gxy is a 3-D array; the [i, j]-th element of gxy contains a 2-element list of the corresponding "X" and "Y" values: gxy[i, j]'s value is [gx[i, j], gy[i, j]].
We have to invoke dens() (or dens.pdf() which is the same thing) on each of the 2-D grid points. NumPy has a very elegant function for this purpose:
z = np.apply_along_axis(dens, 2, gxy)
In words, the callable dens (it could have been dens.pdf as well) is invoked along axis=2 (the third axis) of the 3-D array gxy, and the values should be returned as a 2-D array. The only glitch is that the shape of z will be (128, 128, 1) and not (128, 128) as I expected. Note that the documentation says:
The shape of out [the return value, L.D.] is identical to the shape of arr, except along the axis dimension. This axis is removed, and replaced with new dimensions equal to the shape of the return value of func1d. So if func1d returns a scalar, out will have one fewer dimensions than arr.
Most likely dens() returned a 1-long tuple and not the scalar I was hoping for. I didn't investigate the issue any further, because this is easy to fix:
z = z.reshape(128, 128)
after which we can generate the image:
imshow(z, aspect=gx.ptp() / gy.ptp())
show() # needed if you try this in PyCharm
Here is the image. (Note that I have implemented @endolith's version as well and got an image indistinguishable from this one.)
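As a side note (my own addition, not part of the original answer): the apply_along_axis detour can be avoided entirely, because gaussian_kde also accepts a (2, M) array of evaluation points; the grid can be flattened, evaluated in one call, and reshaped back:
z = dens(np.vstack((gx.ravel(), gy.ravel()))).reshape(gx.shape)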
The example posted in the top answer didn't work for me. I had to tweak it a little bit and it works now:
import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt
# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2, scale=1, size=(2000,1)),
                stats.norm.rvs(loc=0, scale=3, size=(2000,1)),
                axis=1)
kde = stats.kde.gaussian_kde(rvs.T)
# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)
z = kde(grid_coords.T)
z = z.reshape(128,128)
plt.imshow(z,aspect=x_flat.ptp()/y_flat.ptp())
plt.show()