Yesterday I finalised the code for detecting the audio energy of a track displayed over time, which I will eventually use as part of my audio thumbnailing project.
However, I would also like a method that can detect the pitch of a track displayed over time, so that I have two options on which to base my research.
[y, fs, nb] = wavread('Three.wav'); %# Load the signal into variable y
frameWidth = 441; %# 10 msec
numSamples = length(y); %# Number of samples in y
numFrames = floor(numSamples/frameWidth); %# Number of full frames in y
energy = zeros(1,numFrames); %# Initialize energy
for frame = 1:numFrames %# Loop over frames
startSample = (frame-1)*frameWidth+1; %# Starting index of frame
endSample = startSample+frameWidth-1; %# Ending index of frame
energy(frame) = sum(y(startSample:endSample).^2); %# Calculate frame energy
end
That is the correct code for the energy method, and after some research I found that I would need to use a Discrete Fourier Transform to find the pitch of each frame within the loop.
I thought the process would be as easy as modifying the final lines of the code to include MATLAB's fft command for computing Discrete Fourier Transforms, but all I am getting back are errors about an unbalanced equation.
Help would be much appreciated, even if it's just a general pointer in the right direction. Thank you.
Determining pitch is a lot more complicated than just applying a DFT. It also depends on the nature of the source: different algorithms are appropriate for speech versus a musical instrument, for example. If this is a music track, as your question seems to imply, then you're probably out of luck, as there is no obvious way of determining a single pitch value for multiple musical instruments playing together (how would you even define pitch in this context?). Maybe you could be more specific about what you are trying to do; perhaps a power spectrum would be more useful than trying to determine an arbitrary pitch?
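If the immediate goal is just to get an fft call working inside the frame loop, a minimal sketch of a per-frame magnitude spectrum (a spectral display, not a pitch estimate) could look like the following. It reuses y, fs, frameWidth and numFrames from the energy code above, assumes the signal is mono, and the FFT length nfft and the Hamming window are my own additions:
nfft = 512; %# FFT length (>= frameWidth)
spectra = zeros(nfft/2+1, numFrames); %# One column of magnitudes per frame
for frame = 1:numFrames %# Loop over frames
startSample = (frame-1)*frameWidth+1; %# Starting index of frame
endSample = startSample+frameWidth-1; %# Ending index of frame
frameData = y(startSample:endSample).*hamming(frameWidth); %# Window the frame
Y = fft(frameData, nfft); %# Zero-padded DFT of this frame
spectra(:,frame) = abs(Y(1:nfft/2+1)); %# Keep the one-sided magnitude
end
freqAxis = (0:nfft/2)*fs/nfft; %# Frequency axis in Hz
imagesc((1:numFrames)*frameWidth/fs, freqAxis, 20*log10(spectra+eps)); %# Crude spectrogram display
axis xy; xlabel('Time (s)'); ylabel('Frequency (Hz)');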
I want to generate my own samples of a Kick, Clap, Snare and Hi-Hat sounds in MATLAB based on a sample I have in .WAV format.
Right now it does not sound correct at all, and I was wondering whether my code does not make sense, or whether I am missing some sound theory.
Here is my code right now.
[y,fs]=audioread('cp01.wav');
Length_audio=length(y);
df=fs/Length_audio;
frequency_audio=-fs/2:df:fs/2-df;
frequency_audio = frequency_audio/(fs/2); % Normalize the frequency (1 corresponds to the Nyquist frequency)
figure
FFT_audio_in = fftshift(fft(y))/Length_audio;
plot(frequency_audio,abs(FFT_audio_in));
The original plot of y.
My FFT of y
I am using the findpeaks() function to find the peaks of the FFT with amplitude greater than 0.001.
[pk, loc] = findpeaks(abs(FFT_audio_in), 'MinPeakHeight', 0.001);
I then find the corresponding normalized frequencies from frequency_audio (the positive ones) and the corresponding peaks.
loc = frequency_audio(loc);
loc = loc(length(loc)/2+1:length(loc))
pk = pk(length(pk)/2+1:length(pk))
So the one-sided, normalized FFT looks like this.
Since it looks like the FFT, I think I should be able to recreate the sound by summing up sinusoids with the correct amplitudes and frequencies. Since the clap sound has 21166 data points, I use that as the bound for the for loop.
for i = 1:21166
    clap(i) = 0;
    for j = 1:length(loc)
        clap(i) = clap(i) + pk(j)*sin(loc(j)*i); % sum the sinusoid contributions for sample i
    end
end
But this results in the following sound, which is nowhere near the original sound.
What should I do differently?
You are taking the FFT of the entire duration of the sample, and then generating stationary sine waves for the whole duration. This means that the temporal signature of the drum is gone, and the temporal signature is the most characteristic feature of percussive, unvoiced instruments.
Since this is so critical, I suggest you start there first instead of with the frequency content.
The temporal signature can be approximated by the envelope of the signal. MATLAB has a convenient function for this called envelope. Use that to extract the envelope of your sample.
Then generate some white-noise and multiply the noise by the envelope to re-create a very simple version of your percussion instrument. You should hear a clear difference between Kick, Clap, Snare and Hi-Hat, though it won't sound the same as the original.
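As a rough sketch of that envelope-times-noise idea (the file name and the 300-sample envelope window length are placeholders, and envelope requires the Signal Processing Toolbox):
[y, fs] = audioread('cp01.wav'); % load the clap sample
y = y(:,1); % use one channel
env = envelope(y, 300, 'peak'); % amplitude envelope of the sample
noise = randn(size(y)); % white noise of the same length
noise = noise/max(abs(noise)); % normalize the noise
reconstructed = env.*noise; % impose the temporal signature on the noise
soundsc(reconstructed, fs); % compare against soundsc(y, fs)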
Once this is working, you can attempt to incorporate frequency information. I recommend taking the STFT to get a spectrogram of the sound, so you can see how the frequency spectrum changes over time.
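A quick way to look at that, assuming y and fs from the audioread call above (window length and overlap are arbitrary choices):
spectrogram(y(:,1), hamming(512), 384, 512, fs, 'yaxis'); % spectrogram of the clap over time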
I am writing a pitch adaptation function in MATLAB. It takes an audio signal and a pitchCoefficients vector, where each element determines by how much to pitch-shift its respective frame.
The audio signal is sliced evenly depending on how many pitch coefficients there are. If there are only two pitch coefficients, the audio is divided into two halves: the first half is pitch-shifted by the first coefficient and the second half by the second coefficient. So if my coefficients are [1,2], the first half of the audio will sound the same as the original and the second half will be twice as high-pitched.
This is the code for my function:
function [audioModified] = modifyPitch(audio, pitchCoefficients)
nwindows = length(pitchCoefficients);
windowSize = floor(length(audio)/nwindows);
audioModified = [];
for i = 1:nwindows
    start = (i-1)*windowSize + 1;
    finish = i*windowSize;
    originalWindow = audio(start:finish, 1);
    pitchCoeff = 1/pitchCoefficients(i);
    timeScaledWindow = pvoc(originalWindow, pitchCoeff); % time-scale the frame with the phase vocoder
    [P,Q] = rat(pitchCoeff);
    pitchModifiedWindow = resample(timeScaledWindow, P, Q); % resample to turn the time scaling into a pitch shift
    audioModified = [audioModified; pitchModifiedWindow];
end
end
However, the final audio (the concatenation of all the frames) has artifacts where each frame starts with a 'tic' sound. I'm assuming this happens because of the way I concatenate the frames. If the frames are too small, this effect becomes so pronounced that the audio is no longer listenable.
How should I go about mitigating or removing this problem? Is there a way to smooth the audio out the same way you can blur an image to get rid of noise?
Additional info: I use this phase vocoder (pvoc) to do the time scaling.
Try overlapping and cross-fading longer frames instead of using frames that are too small. The cross-fade will help reduce the discontinuity between adjacent (re)synthesized frames.
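As a hedged sketch of a linear cross-fade between two already pitch-shifted frames (frameA, frameB and the overlap length are hypothetical names, not part of your function):
overlap = 256; % overlap length in samples
fadeOut = linspace(1, 0, overlap)'; % ramp the first frame down
fadeIn = linspace(0, 1, overlap)'; % ramp the second frame up
mixed = frameA(end-overlap+1:end).*fadeOut + frameB(1:overlap).*fadeIn; % cross-faded region
joined = [frameA(1:end-overlap); mixed; frameB(overlap+1:end)]; % concatenate with the fade in between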
In audio processing, any sudden jump in amplitude (i.e. steeper than the current sampling rate allows) will sound like a click and is considered an error. So if your algorithm just suddenly changes samples in a window, chances are that if you look at the resulting waveform (say, in Audacity or something similar) you'll see the level make a hard rise. That in effect produces component frequencies above the Nyquist frequency (half the sampling rate), which means the system can't represent them accurately and they cause audible artifacts. There's no easy solution for that; unfortunately you can't just do things to digital audio and hope they will work. There is a boatload of DSP involved in making sure your process is representable at the current sampling rate.
Actually, I think for your problem it might be sufficient to adjust the window boundaries dynamically so that each window starts and ends at a point where your signal crosses zero. To get this done right you might need a higher resolution in your MATLAB raw data.
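One possible sketch of that idea: snap each nominal frame boundary to the nearest zero crossing before cutting (audio and boundary are hypothetical names):
crossings = find(audio(1:end-1).*audio(2:end) <= 0); % sample indices where the sign changes
[~, k] = min(abs(crossings - boundary)); % crossing closest to the nominal boundary
boundary = crossings(k); % cut the frame at this sample instead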
I'm trying to do a project that reads in a WAV file that has a sequence of music note in it using MATLAB. For example, my WAV file might contain a sequence of C-D-C-E. And feeding this file into my program would print out "C D C E."
I tried using WAVREAD to turn the file into a vector, then downsampled it and mixed it down to one channel.
Then I was able to come up with a spectrogram that has "peaks" at certain frequencies.
From here, I would like to get help on how to make MATLAB to recognize the frequencies at the peaks, therefore enabling me to print out the note.
Or am I on the wrong track?
Thanks in advance!
You are on the correct track, but this is not a simple problem. What I would suggest looking into is something called a chromagram. This takes the information you gathered from the spectrogram and "bins" it into piano-note frequencies, giving an approximation of the song's harmonic content. It may not be entirely accurate because of residual energy in the notes' harmonics, but it is a start.
Do realize that transcription, which is what you are doing, is a very difficult task that has yet to be fully solved; people are still researching it today. I have code to generate chroma, but I will have to dig for it.
EDIT
Here is some code to compute chroma:
clc; close all; clear all;
% I didn't have a wav file to hand; replace this test signal with something like
% [audio,fs] = wavread('audioFile.wav');
audio = rand(1,10000);
fs = 44100; % temp sampling frequency, will depend on audio input
NFFT = 1024; % feel free to change FFT size
hamWin = hamming(NFFT); % window your audio signal to avoid fft edge effects
% get spectral content
S = spectrogram(audio,hamWin,NFFT/2,NFFT,fs);
% Start at the lowest piano note, A0
A0 = 27.5;
% all 88 keys
keys = 0:87;
center = A0*2.^((keys)/12); % set filter center frequencies
left = A0*2.^((keys-1)/12); % define left frequency
left = (left+center)/2.0;
right = A0*2.^((keys+1)/12); % define right frequency
right = (right+center)/2;
% Construct a filter bank
filter = zeros(numel(center),NFFT/2+1); % place holder
freqs = linspace(0,fs/2,NFFT/2+1); % array of frequencies in spectrogram
for i = 1:numel(center)
    xTemp = [0,left(i),center(i),right(i),fs/2]; % create points for filter bounds
    yTemp = [0,0,1,0,0]; % set magnitudes at each filter point
    filter(i,:) = interp1(xTemp,yTemp,freqs); % use interpolation to get values for frequencies
end
% multiply filter by spectrogram to get chroma values.
chroma = filter*abs(S);
% Put into a 12-bin chroma
chroma12 = zeros(12,size(chroma,2));
for i = 1:size(chroma,1)
    bin = mod(i,12)+1; % get modded index
    chroma12(bin,:) = chroma12(bin,:) + chroma(i,:); % add octaves together
end
That should do the trick. It may not be the fastest solution and it can surely be optimized, but it should get the job done.
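To get from the chroma matrix back to printable note names (the original goal of printing something like "C D C E"), one possible follow-on, assuming the chroma12 matrix from the code above and ignoring octave information, is:
% The mod(i,12)+1 folding above starts at A0 (i = 1), so bin 2 is A, bin 3 is A#, ..., bin 1 is G#
noteNames = {'G#','A','A#','B','C','C#','D','D#','E','F','F#','G'};
[~, strongestBin] = max(chroma12, [], 1); % dominant chroma bin in each frame
fprintf('%s ', noteNames{strongestBin}); % print one note name per frame
fprintf('\n');
Consecutive frames of the same note would still need to be merged into a single printed note.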
As MZimmerman6 said, this is a very complex problem. Peak-to-peak measuring may be successful, but it certainly will not be if the music gets any more complicated. I have tackled this problem before and seen other people try it as well, and the most successful projects among my peers involved the following:
1) Constrain the time. It may actually be difficult for a program to determine when a note is even changing! This is especially true if you are trying to separate vocals from instrumentals, or, for example, when two chords play sequentially but share one note that stays the same between them. By constraining the time I mean finding out when each chunk of music happens; in your case, divide the track into four segments, one for each note. You may be able to use the attack of each note to your advantage, automatically detecting the attack as the beginning of a new segment to test (a rough sketch of this appears after this list).
2) Constrain the frequencies. You have to use what you know, otherwise you will need to make eigenmode comparisons. Singular value decomposition has been effective in this arena. But if you somehow have recordings of the piano playing the notes separately (individually), and recordings of the piano playing the song, you can take a fast Fourier transform of each segment (see the time constraints above), cut out the noise, and compare them. Then you employ a subtractive method or another metric to determine the best "fit" for each note.
This is a rough explanation of the concerns, but trust me: the more constraints you can put on this sort of analysis, the better.
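For point 1, a rough energy-based attack detector could look something like this (the file name and the threshold are placeholders you would need to tune for your material):
[x, fs] = audioread('melody.wav'); % hypothetical input file
x = x(:,1); % one channel
frameLen = round(0.02*fs); % 20 ms frames
numFrames = floor(length(x)/frameLen);
e = zeros(1, numFrames);
for k = 1:numFrames
    seg = x((k-1)*frameLen+1 : k*frameLen); % current frame
    e(k) = sum(seg.^2); % frame energy
end
rise = diff(e) > 0.5*max(e); % crude threshold on energy jumps
onsetTimes = (find(rise)+1)*frameLen/fs; % attack times in seconds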
I am pretty new to MATLAB and I am trying to write a simple frequency-based speech detection algorithm. The end goal is to run the script on a WAV file and have it output start/end times for each speech segment. If I use the code:
fr = 128;
[ audio, fs, nbits ] = wavread(audioPath);
spectrogram(audio,fr,120,fr,fs,'yaxis')
I get a useful frequency intensity vs. time graph like this:
By looking at it, it is very easy to see when speech occurs. I could write an algorithm to automate the detection by looking at each x-axis frame, figuring out which frequencies are dominant (have the highest intensity), testing whether enough of the dominant frequencies are above a certain intensity threshold (the difference between yellow and red on the graph), and then labeling that frame as either speech or non-speech. Once the frames are labeled, it would be simple to get start/end times for each speech segment.
My problem is that I don't know how to access that data. I can use the code:
[S,F,T,P] = spectrogram(audio,fr,120,fr,fs);
to get all the features of the spectrogram, but the results of that code don't make any sense to me. The bounds of the S,F,T,P arrays and matrices don't correlate to anything I see on the graph. I've looked through the help files and the API, but I get confused when they start throwing around algorithm names and acronyms - my DSP background is pretty limited.
How could I get an array of the frequency intensity values for each frame of this spectrogram analysis? I can figure the rest out from there, I just need to know how to get the appropriate data.
What you are trying to do is called speech activity detection. There are many approaches to this; the simplest might be a simple band-pass filter that passes the frequencies where speech is strongest, roughly between 1 kHz and 8 kHz. You could then compare the total signal energy with the band-pass-limited energy, and if the majority of the energy is in the speech band, classify the frame as speech. That's one option, but there are others too.
To get the frequencies at the peaks you could use the FFT to get the spectrum and then use peakdetect.m. But this is a very naïve approach, as you will get a lot of peaks belonging to the harmonic frequencies of a base sine.
Theoretically you should use some sort of cepstrum (also known as the spectrum of a spectrum), which collapses the periodicity of the harmonics in the spectrum down to the base frequency, and then use that with peakdetect. Or you could use existing tools that do this, such as Praat.
Be aware that speech analysis is usually done on frames of around 30 ms, stepping in 10 ms increments. You could further filter out false detections by requiring that a formant is detected in N sequential frames.
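A minimal sketch of the band-energy idea with 30 ms frames and a 10 ms step (the 0.5 energy-ratio threshold is a guess you would tune, and it assumes the sampling rate is comfortably above 16 kHz so the 8 kHz band edge stays below Nyquist):
[x, fs] = wavread(audioPath); % as in the question; audioread in newer MATLAB
x = x(:,1); % one channel
frameLen = round(0.030*fs); % 30 ms frames
step = round(0.010*fs); % 10 ms step
[b, a] = butter(4, [1000 8000]/(fs/2), 'bandpass'); % pass the 1-8 kHz speech band
xBand = filter(b, a, x);
numFrames = floor((length(x)-frameLen)/step) + 1;
isSpeech = false(1, numFrames);
for k = 1:numFrames
    idx = (k-1)*step + (1:frameLen); % samples in this frame
    eAll = sum(x(idx).^2); % total frame energy
    eBand = sum(xBand(idx).^2); % energy in the speech band
    isSpeech(k) = eAll > 0 && eBand/eAll > 0.5; % majority of energy in the band
end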
Why don't you use fft with fftshift?
%% Time specifications:
Fs = 100; % samples per second
dt = 1/Fs; % seconds per sample
StopTime = 1; % seconds
t = (0:dt:StopTime-dt)';
N = size(t,1);
%% Sine wave:
Fc = 12; % hertz
x = cos(2*pi*Fc*t);
%% Fourier Transform:
X = fftshift(fft(x));
%% Frequency specifications:
dF = Fs/N; % hertz
f = -Fs/2:dF:Fs/2-dF; % hertz
%% Plot the spectrum:
figure;
plot(f,abs(X)/N);
xlabel('Frequency (in hertz)');
title('Magnitude Response');
Why do you want to use complex stuff?
A nice and full solution can be found at https://dsp.stackexchange.com/questions/1522/simplest-way-of-detecting-where-audio-envelopes-start-and-stop
Have a look at the STFT (short-time Fourier transform) or (even better) the DWT (discrete wavelet transform), both of which estimate the frequency content in blocks (windows) of data, which is what you need if you want to detect sudden changes in the amplitude of certain ("speech") frequencies.
Don't use a plain FFT of the whole signal, since it calculates the relative frequency content over the entire duration, making it impossible to determine when a certain frequency occurred in the signal.
If you still use the built-in STFT function, then to plot the maximum you can use the following command:
plot(T,(floor(abs(max(S,[],1)))))