I'm trying to do a project in MATLAB that reads in a WAV file containing a sequence of music notes. For example, my WAV file might contain the sequence C-D-C-E, and feeding this file into my program would print out "C D C E."
I tried using WAVREAD to turn the file into a vector, then downsampled it and mixed it down to a single channel.
Then I was able to come up with a spectrogram that has "peaks" at certain frequencies.
From here, I would like help on how to make MATLAB recognize the frequencies at the peaks, enabling me to print out the notes.
Or am I on the wrong track?
Thanks in advance!
You are on the correct track, but this is not a simple problem. What I would suggest looking into is something called a chromagram. This takes the information you gathered from the spectrogram and "bins" it into piano note frequencies, giving an approximation of a song's harmonic content. It may not be entirely accurate because of residual energy in the notes' harmonics, but it is a start.
Do realize that transcription, which is what you are doing, is a very difficult task that has yet to be 100% solved; people are still researching it today. I have code to generate chroma, but I will have to dig for it.
EDIT
Here is some code to compute the chroma:
clc; close all; clear all;
% I didn't have a wav file to test with; replace the random signal below with:
% [audio,fs] = wavread('audioFile.wav');
audio = rand(1,10000);
fs = 44100; % temp sampling frequency, will depend on audio input
NFFT = 1024; % feel free to change FFT size
hamWin = hamming(NFFT); % window your audio signal to avoid fft edge effects
% get spectral content
S = spectrogram(audio,hamWin,NFFT/2,NFFT,fs);
% Lowest piano note: A0 = 27.5 Hz
A0 = 27.5;
% all 88 keys
keys = 0:87;
center = A0*2.^((keys)/12); % set filter center frequencies
left = A0*2.^((keys-1)/12); % define left frequency
left = (left+center)/2.0;
right = A0*2.^((keys+1)/12); % define right frequency
right = (right+center)/2;
% Construct a triangular filter bank (named fbank to avoid shadowing the built-in filter)
fbank = zeros(numel(center),NFFT/2+1); % placeholder
freqs = linspace(0,fs/2,NFFT/2+1); % frequencies of the spectrogram bins
for i = 1:numel(center)
xTemp = [0,left(i),center(i),right(i),fs/2]; % breakpoints of the triangular filter
yTemp = [0,0,1,0,0]; % magnitude at each breakpoint
fbank(i,:) = interp1(xTemp,yTemp,freqs); % interpolate to get the response at every bin
end
% multiply the filter bank by the spectrogram magnitudes to get per-key chroma values
chroma = fbank*abs(S);
%Put into 12 bin chroma
chroma12 = zeros(12,size(chroma,2));
for i = 1:size(chroma,1)
bin = mod(i,12)+1; % get modded index
chroma12(bin,:) = chroma12(bin,:) + chroma(i,:); % add octaves together
end
That should do the trick. It may not be the fastest solution, but it should get the job done, and it can surely be optimized.
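As one example of such an optimization (a sketch, not tested against the code above): the octave-folding loop above is a fixed linear operation, so it can be replaced by a single sparse matrix multiply.
% Sketch: vectorized octave folding, assuming chroma (88 x numFrames) from above
bins = mod(1:size(chroma,1),12) + 1; % same bin mapping as the loop version
fold = sparse(bins, 1:size(chroma,1), 1, 12, size(chroma,1)); % 12x88 summing matrix
chroma12 = fold * chroma; % add octaves together in one product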
As MZimmerman6 said, this is a very complex problem. Peak-to-peak measuring may be successful, but it certainly will not be if the music gets any more complicated. I have tackled this problem before and seen other people try it as well, and the most successful projects among my peers involved the following:
1) Constrain the time. It may actually be difficult for a program to determine when a note is even changing! This is especially true if you are trying to separate vocals from instrumentals, or, for example, when two chords play sequentially but share one note between them. By constraining the time, I mean finding out when each chunk of music happens; in your case, divide the track into four segments, one for each note. You may be able to use the attack of each note to your advantage, automatically detecting each attack as the beginning of a new segment to test (see the sketch after this list).
2) Constrain the frequencies. You have to use what you know; otherwise you will need to make eigenmode comparisons, and singular value decomposition has been effective in this arena. But if you somehow have recordings of the piano playing each note individually, as well as a recording of the piano playing the song, what you can do is take a fast Fourier transform of each segment (see the time constraints above), cut out the noise, and compare them. Then you employ a subtractive method or another metric to determine the best "fit" for each note.
This is a rough explanation of the concerns, but trust me: the more constraints you can put on this sort of analysis, the better.
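A minimal sketch of the time-constraint idea, using short-time energy to locate note attacks; the frame size and threshold are arbitrary illustrative choices, not tuned values:
% Sketch: detect note onsets from jumps in short-time energy
% assumes audio (mono vector) and fs are already loaded, e.g. via wavread
frameLen = round(0.02*fs); % 20 ms frames (arbitrary)
numFrames = floor(length(audio)/frameLen);
energy = zeros(1,numFrames);
for k = 1:numFrames
seg = audio((k-1)*frameLen+1 : k*frameLen);
energy(k) = sum(seg.^2); % energy of this frame
end
jump = diff(energy); % an attack shows up as a large positive jump
onsetFrames = find(jump > 4*std(jump)) + 1; % crude threshold (assumption)
onsetTimes = (onsetFrames-1)*frameLen/fs; % onset times in seconds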
I am working on a script which performs an FFT of a given short audio file in a loop. I also want to store the peak frequency, but I do not know how to do that.
The code looks similar to this:
n = ...
Frequencies = zeros(1,n); % Allocating memory for the peak frequencies
for k = 1:n
disp(k) % display the current file index
textFileName = [num2str(k) '.m4a'];
[data,fs] = audioread(textFileName);
%...
% Fast Fourier transform and plotting part works ok
%...
[peaks,frequencies] = findpeaks(abs(cutP2),cutf,'MinPeakHeight',10e-3);
% Here starts the problem
maximum_Peak = max(peaks);
Frequencies(k) = ... % I need to store the frequency which is coupled
% with the maximum amplitude but I do not know how
end
close(figure(n)) % the loop opens one redundant blank figure; I could not find any other way to close it
I do not want to store the amplitudes of the peak frequencies, but the frequencies of the peak amplitudes. If you could also help me with the redundant figure, I would be happy; I tried to implement an if statement, but it did not work.
max returns a second output containing the index of the maximum value. Use this second output to store the value of interest.
[maximum_Peak,I] = max(peaks); %Note I Use 'I' for index - personal habit
Frequencies(k) = frequencies(I);
Also, if your goal is only to find the maximum point, findpeaks may be overkill and you could potentially use:
[maximum_Peak,I] = max(abs(cutP2));
%Might want to check that max is high enough
Frequencies(k) = cutf(I);
Note that although the code is similar, it is not the same; which version is right depends on what you want to do.
Finally, some unsolicited advice: your use of frequencies and Frequencies is a bit of a red flag. Distinguishing variables only by capitalization is generally not a good idea; consider renaming the latter to something like freq_of_max_amp.
I'm trying to remove the high-frequency noise from the following file.
It's a recording of a woman reading the news, with a high-pitched noise playing loudly over it. Towards the end of the file, someone else begins to speak, but in a different language.
I want to filter out this high-pitched noise and be able to clearly hear the woman reading the news. Looking at the frequency domain:
I have tried using a low-pass filter and a band-stop filter. The band-stop filter produces a signal that no longer has the high-pitched ringing, but the audio isn't very clear and it's hard to make out what is being said; the same goes for the low-pass filter. I surmise that this is because I am filtering out not only the noise but also the harmonics of the speech. It was also necessary to amplify the audio signal after filtering, because it was quieter than before.
Is there some clever way for me to reconstruct the harmonics of the speech in order to hear what is being said more clearly? Or is there a clever way for me to filter the signal without losing too much audio clarity?
I can include any code I used in MATLAB if needed.
Note:
I shifted the signal to 0 in the image I linked
I did use filtfilt() instead of filter()
I used butter() for the filters
Given the fairly dynamic nature of the interference in your sample, stationary filters are not going to yield very satisfying results. To improve performance, you would need to dynamically adjust the filtering parameters based on estimates of the interference.
Fortunately in this case the interference is pretty strong and exhibits a fairly regular pattern which makes it easier to estimate. This can be seen from the signal's spectrogram.
For the following derivations we will assume the samples of the WAV file have been stored in the array x and that the sampling rate is fs (8000 Hz in the provided sample).
[Sx,f,t] = spectrogram(x, triang(1024), 1023, [], fs, 'onesided');
Given that the interference is strong, obtaining the frequency of the interference can be done by locating the peak frequency in each time slice:
frequency = zeros(size(Sx,2),1);
for k = 1:size(Sx,2)
[pks,loc] = findpeaks(abs(Sx(:,k))); % peaks of the magnitude spectrum
[~,idx] = max(pks); % the interference dominates, so take the largest peak
frequency(k) = f(loc(idx)); % convert the bin index to Hz via the frequency vector
end
Seeing that the interference is periodic we can use the Discrete Fourier Transform to decompose this signal:
M = 32*fs;
Ff = fft(frequency, M);
plot(fs*(0:M-1)/M, 20*log10(abs(Ff)));
xlim([0 2]);
xlabel('frequency (Hz)');
ylabel('amplitude (dB)');
Using the first two harmonics as an approximation, we can model the frequency of the interference signal as:
T = 1.0/fs;
t = (0:length(x)-1)'*T; % column vector, matching the orientation of x
freq = 750.0127340203496 ...
     + 249.99913423501602*cos(2*pi*0.25*t - 1.5702946346796276) ...
     + 250.23974282864816*cos(2*pi*0.5 *t - 1.5701043282285363);
At this point we would have enough to create a narrowband filter whose center frequency changes dynamically, following that frequency model, as we keep updating the filter coefficients. Note, however, that constantly recomputing and updating the filter coefficients is a fairly expensive process, and given that the interference is strong, we can do better by locking on to the interference phase. This can be done by correlating small blocks of the original signal with a sine and cosine at the desired frequency, then slightly tweaking the phase to align the sine/cosine with the original signal.
% Compute the phase of the sine/cosine to correlate the signal with
delta_phi = 2*pi*freq/fs;
phi = cumsum(delta_phi);
% We scale the phase adjustments with a triangular window to try to reduce
% phase discontinuities. I've chosen a window of ~200 samples somewhat arbitrarily,
% but it is large enough to cover 8 cycles of the interference around its lowest
% frequency sections (so we can get a better estimate by averaging out other signal
% contributions over multiple interference cycles), and is small enough to capture
% local phase variations.
step = 50;
L = 200;
win = triang(L);
win = win/sum(win);
for i = 0:floor((length(x)-L)/step)
% The phase tweak to align the sine/cosine isn't linear, so we run a few
% iterations to let it converge to a phase locked to the original signal
for iter = 0:1
idx = (i*step+1):(i*step+L); % current block of L samples
xseg = x(idx);
phiseg = phi(idx);
r1 = sum(xseg .* cos(phiseg));
r2 = sum(xseg .* sin(phiseg));
theta = atan2(r2, r1);
delta_phi(idx) = delta_phi(idx) - theta*win; % scale the adjustment by the window
phi = cumsum(delta_phi);
end
end
Finally, we need to estimate the amplitude of the interference. Here we choose to perform the estimation over the initial 0.15 seconds where there is a little pause before the speech starts so that the estimation is not biased by the speech's amplitude:
tmax = 0.15;
nmax = floor(tmax * fs);
amp = sqrt(2*mean(x(1:nmax).^2));
% this should give us amp ~ 0.250996990794946
These parameters then allow us to fairly precisely reconstruct the interference, and correspondingly remove the interference from the original signal by direct subtraction:
y = amp * cos(phi);
x = x - y;
Listening to the resulting output, you may notice a remaining faint whooshing noise, but nothing compared to the original interference. Obviously this is a fairly ideal case where the parameters of the interference are so easy to estimate that the results almost look too good to be true. You may not get the same performance with more random interference patterns.
Note: the Python script used for this processing (and the corresponding .wav file output) can be found here.
Yesterday I finalised the code for detecting the audio energy of a track displayed over time, which I will eventually use as part of my audio thumbnailing project.
However, I would also like a method that can detect the pitch of a track over time, so that I have two options on which to base my research.
[y, fs, nb] = wavread('Three.wav'); %# Load the signal into variable y
frameWidth = 441; %# 10 msec
numSamples = length(y); %# Number of samples in y
numFrames = floor(numSamples/frameWidth); %# Number of full frames in y
energy = zeros(1,numFrames); %# Initialize energy
for frame = 1:numFrames %# Loop over frames
startSample = (frame-1)*frameWidth+1; %# Starting index of frame
endSample = startSample+frameWidth-1; %# Ending index of frame
energy(frame) = sum(y(startSample:endSample).^2); %# Calculate frame energy
end
That is the correct code for the energy method, and after researching, I found that I would need a discrete Fourier transform to find the current pitch of each frame in the loop.
I thought that the process would be as easy as modifying the final lines of the code to include MATLAB's fft command for calculating the discrete Fourier transform, but all I am getting back are errors about an unbalanced equation.
Help would be much appreciated, even if it's just a general pointer in the right direction. Thank you.
Determining pitch is a lot more complicated than just applying a DFT. It also depends on the nature of the source; different algorithms are appropriate for speech versus a musical instrument, for example. If this is a music track, as your question seems to imply, then you're probably out of luck, as there is no obvious way of determining a single pitch value for multiple musical instruments playing together (how would you even define pitch in this context?). Maybe you could be more specific about what you are trying to do; perhaps a power spectrum would be more useful than trying to determine an arbitrary pitch?
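If a power spectrum turns out to be what you need, a minimal sketch using pwelch (assuming the y and fs you already obtained from wavread) might look like:
% Sketch: Welch power spectral density estimate of the whole track
% window/overlap/NFFT sizes are illustrative defaults, not tuned values
[pxx, f] = pwelch(y, hamming(1024), 512, 1024, fs);
plot(f, 10*log10(pxx));
xlabel('Frequency (Hz)');
ylabel('Power/Frequency (dB/Hz)');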
I am pretty new to MATLAB and I am trying to write a simple frequency-based speech detection algorithm. The end goal is to run the script on a WAV file and have it output start/end times for each speech segment. If I use the code:
fr = 128;
[ audio, fs, nbits ] = wavread(audioPath);
spectrogram(audio,fr,120,fr,fs,'yaxis')
I get a useful frequency intensity vs. time graph like this:
By looking at it, it is very easy to see when speech occurs. I could write an algorithm to automate the detection process: look at each x-axis frame, figure out which frequencies are dominant (have the highest intensity), test whether enough of the dominant frequencies are above a certain intensity threshold (the difference between yellow and red on the graph), and then label the frame as either speech or non-speech. Once the frames are labeled, it would be simple to get start/end times for each speech segment.
My problem is that I don't know how to access that data. I can use the code:
[S,F,T,P] = spectrogram(audio,fr,120,fr,fs);
to get all the outputs of the spectrogram, but the results of that code don't make any sense to me. The bounds of the S, F, T, and P arrays and matrices don't correlate to anything I see on the graph. I've looked through the help files and the API, but I get confused when they start throwing around algorithm names and acronyms; my DSP background is pretty limited.
How could I get an array of the frequency intensity values for each frame of this spectrogram analysis? I can figure the rest out from there, I just need to know how to get the appropriate data.
What you are trying to do is called speech activity detection. There are many approaches to this; the simplest might be a band-pass filter that passes the frequencies where speech is strongest, roughly between 1 kHz and 8 kHz. You could then compare the total signal energy with the band-limited energy, and if the majority of the energy is in the speech band, classify the frame as speech. That's one option, but there are others too.
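A minimal sketch of that idea; the file name, band edges (kept below the Nyquist frequency of an 8 kHz recording), frame size, and the 0.5 ratio threshold are all illustrative assumptions:
% Sketch: classify frames as speech when most of their energy lies in a speech band
[audio, fs] = wavread('speech.wav'); % hypothetical file name
[b, a] = butter(4, [300 3400]/(fs/2)); % band-pass over an assumed speech band
band = filtfilt(b, a, audio); % zero-phase band-limited version
frameLen = round(0.03*fs); % ~30 ms frames
numFrames = floor(length(audio)/frameLen);
isSpeech = false(1,numFrames);
for k = 1:numFrames
idx = (k-1)*frameLen+1 : k*frameLen;
ratio = sum(band(idx).^2) / (sum(audio(idx).^2) + eps);
isSpeech(k) = ratio > 0.5; % arbitrary threshold
end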
To get the frequencies at the peaks, you could use the FFT to get the spectrum and then use peakdetect.m. But this is a very naïve approach, as you will get a lot of peaks belonging to the harmonics of a base frequency.
Theoretically you should use some sort of cepstrum (also known as the spectrum of a spectrum), which collapses the periodicity of the harmonics in the spectrum down to the base frequency, and then use that with peakdetect. Or you could use existing tools that do this, such as Praat.
Be aware that speech analysis is usually done on frames of around 30 ms, stepping by 10 ms. You could further filter out false detections by requiring that a formant be detected in N sequential frames.
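A minimal sketch of the cepstral idea on a single frame; the 50-400 Hz search range is an assumption suitable for speech, and audio/fs are as in the sketch above:
% Sketch: estimate the pitch of one 30 ms frame via the real cepstrum
frame = audio(1:round(0.03*fs));
frame = frame .* hamming(length(frame)); % window to reduce edge effects
c = real(ifft(log(abs(fft(frame)) + eps))); % real cepstrum
qmin = floor(fs/400); % quefrency of the 400 Hz upper pitch bound
qmax = floor(fs/50); % quefrency of the 50 Hz lower pitch bound
[~, q] = max(c(qmin:qmax)); % strongest quefrency peak in the pitch range
f0 = fs / (qmin + q - 1); % pitch estimate in Hz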
Why don't you use fft with fftshift?
%% Time specifications:
Fs = 100; % samples per second
dt = 1/Fs; % seconds per sample
StopTime = 1; % seconds
t = (0:dt:StopTime-dt)';
N = size(t,1);
%% Sine wave:
Fc = 12; % hertz
x = cos(2*pi*Fc*t);
%% Fourier Transform:
X = fftshift(fft(x));
%% Frequency specifications:
dF = Fs/N; % hertz
f = -Fs/2:dF:Fs/2-dF; % hertz
%% Plot the spectrum:
figure;
plot(f,abs(X)/N);
xlabel('Frequency (in hertz)');
title('Magnitude Response');
Why do you want to use complex stuff? A nice and full solution can be found at https://dsp.stackexchange.com/questions/1522/simplest-way-of-detecting-where-audio-envelopes-start-and-stop
Have a look at the STFT (short-time Fourier transform) or, even better, the DWT (discrete wavelet transform), both of which estimate the frequency content in blocks (windows) of data. That is what you need if you want to detect sudden changes in the amplitude of certain ("speech") frequencies.
Don't use a plain FFT, since it calculates the relative frequency content over the entire duration of the signal, making it impossible to determine when a certain frequency occurred.
If you still use the built-in STFT function, then to plot the maximum per frame you can use the following command:
plot(T, floor(abs(max(S,[],1))))
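For completeness, a sketch of how S and T might be obtained before that plot; the file name and window/overlap sizes are arbitrary choices:
% Sketch: compute the STFT, then plot the strongest spectral component per frame
[audio, fs] = wavread('speech.wav'); % hypothetical file name
[S, F, T] = spectrogram(audio, hamming(256), 128, 256, fs);
plot(T, abs(max(S, [], 1))); % max picks the complex value with the largest magnitude
xlabel('Time (s)');
ylabel('Peak magnitude');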
I am wondering if I am using the Fourier transform in MATLAB the right way. I want to get the average amplitudes of all frequencies in a song. For testing purposes I am using a free mp3 download of Beethoven's "Für Elise", which I converted to an 8 kHz mono WAV file using Audacity.
My MATLAB code is as follows:
clear all % be careful
% load file
% Für Elise Recording by Valentina Lisitsa
% from http://www.forelise.com/recordings/valentina_lisitsa
% Converted to 8 kHz mono using Audacity
allSamples = wavread('fur_elise_valentina_lisitsa_8khz_mono.wav');
% apply windowing function
w = hanning(length(allSamples));
allSamples = allSamples.*w;
% FFT needs input of length 2^x
NFFT = 2^nextpow2(length(allSamples))
% Apply FFT
fftBuckets=fft(allSamples, NFFT);
fftBuckets=fftBuckets(1:(NFFT/2+1)); % keep one side; the spectrum of a real signal is symmetric/mirrored
% calculate single side amplitude spectrum,
% normalize by dividing by NFFT to get the
% popular way of displaying amplitudes
% in a range of 0 to 1
fftBuckets = (2*abs(fftBuckets))/NFFT;
% plot it: max possible frequency is 4000, because sampling rate of input
% is 8000 Hz
x = linspace(1,4000,length(fftBuckets));
bar(x,fftBuckets);
The output then looks like this:
Can somebody please tell me if my code is correct? I am especially wondering about the peaks around 0.
For normalizing, do I have to divide by NFFT or by length(allSamples)?
To me this doesn't really look like a bar chart, but I guess this is due to the many values I am plotting?
Thanks for any hints!
Depends on your definition of "correct". This is doing what you intended, I think, but it's probably not very useful. I would suggest using a 2D spectrogram instead, so that you get time-localized information on frequency content.
There is no single correct way of normalising FFT output; there are various conventions (see e.g. the discussion here). The comment in your code says that you want a range of 0 to 1; if your input values are in the range -1 to 1, then dividing by the number of bins will achieve that.
I would also recommend plotting the y-axis on a logarithmic scale (in decibels), as that's roughly how the human ear interprets loudness.
Two things that jump out at me:
I'm not sure why you are including the DC (index = 1) component in your plot. Not a big deal, but of course that bin contains no frequency information.
I think that dividing by length(allSamples) is more likely to be correct than dividing by NFFT: if you want the DC component to equal the mean of the input data, dividing by length(allSamples) is the right thing to do.
However, like Oli said, you can't really say what the "correct" normalization is until you know exactly what you are trying to calculate. I tend to use FFTs to estimate power spectra, so I want units like "DAC / rt-Hz", which would lead to a different normalization than if you wanted something like "DAC / Hz".
Ultimately there's no substitute for thinking about exactly what you want to get out of the FFT (including units), and working out for yourself what the correct normalization should be (starting from the definition of the FFT if necessary).
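As a quick sanity check of the mean/DC relationship mentioned above, a sketch with a made-up test signal:
% Sketch: verify that the DC bin divided by the signal length equals the mean
x = 0.3 + 0.5*cos(2*pi*(0:999)'/100); % arbitrary test signal with a 0.3 DC offset
X = fft(x);
dc = X(1)/length(x); % should equal mean(x) = 0.3
fprintf('DC bin: %g, mean: %g\n', real(dc), mean(x));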
You should also be aware that MATLAB's fft does not require an array length that is a power of 2 (though using one will presumably make the FFT run faster). Because zero-padding will introduce some ringing, you need to think about whether it is the right thing to do for your application.
Finally, if a periodogram / power spectrum is really what you want, MATLAB provides functions like periodogram, pwelch, and others that may be helpful.