What is a good algorithm for pitch shifting? - web-audio-api

I have an algorithm in place that works for audio files, but I'm trying to apply it to frames to achieve an Auto-Tune-type effect. It shifts the pitch, but the output sounds broken up, or like it's skipping, when I apply it to the individual frames from an AudioWorklet. This is essentially the algorithm in C#; I converted it to TypeScript to use it with WebAssembly. I'm at a loss as to why the output sounds great on a full audio file but falls apart on individual frames. Any advice would be greatly appreciated.
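For context, here is a minimal sketch of the kind of buffering I suspect is needed: accumulate the 128-sample render quanta into a full analysis window and slide it with overlap, rather than shifting each quantum independently. pitchShift() is a stand-in for my WebAssembly export (assumed to take one full window and return HOP samples of output), and FRAME/HOP are illustrative values, not anything from my actual code:

    // Ambient declarations for the AudioWorklet global scope (normally
    // supplied by a .d.ts file). pitchShift() is a placeholder.
    declare class AudioWorkletProcessor {
      readonly port: MessagePort;
    }
    declare function registerProcessor(name: string, ctor: unknown): void;
    declare function pitchShift(window: Float32Array): Float32Array;

    const FRAME = 1024; // window size the shifter was designed for
    const HOP = 256;    // advance per call (75% overlap)

    class BufferedShifter extends AudioWorkletProcessor {
      private window = new Float32Array(FRAME);
      private filled = 0;
      private pending: number[] = [];

      process(inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
        const input = inputs[0]?.[0];
        const output = outputs[0][0];
        if (!input) return true;

        for (const sample of input) {
          this.window[this.filled++] = sample;
          if (this.filled === FRAME) {
            // Run the shifter on a full overlapping window, never on a
            // bare 128-sample render quantum.
            for (const s of pitchShift(this.window)) this.pending.push(s);
            this.window.copyWithin(0, HOP); // keep FRAME - HOP samples of history
            this.filled = FRAME - HOP;
          }
        }
        // Drain queued output; this adds latency but removes the seams.
        for (let i = 0; i < output.length; i++) {
          output[i] = this.pending.length > 0 ? this.pending.shift()! : 0;
        }
        return true;
      }
    }
    registerProcessor('buffered-shifter', BufferedShifter);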

Related

Recommendation in Video Analysis with Neural Network

I recently took a course on neural networks and decided to do some research work. What I have in mind is designing a network that recognizes the movement of the lips, commonly known as lip-reading.
I know the theory behind neural networks, and I chose to design a convolutional neural network, but I'm having trouble working out how to extract the features from the video, or sequence of images, that will serve as input to the network I plan to design.
Before diving into the full investigation, I wanted a bit of help in the form of concepts or ideas on how to do it, mainly for the feature-extraction part.
What I have thought in general is the following:
A vowel or syllable lasts approximately 1 to 2 seconds of video. From that video I have to extract a sequence of images that show how the lips move. Assuming I select about 10 or 15 images, I suppose all those images, after being processed, should be my "input" for feature extraction.
I have already analyzed a single image, as in the classic "recognize a letter" example, but, as I said, here I will have a whole sequence of images to analyze, and that confuses me a bit.
I would like to know if I'm on the right track with this idea, and if not, I would appreciate some guidance. I hope I have been clear; thank you very much.
This paper should help you decide how to handle the sequence of frames as input to a neural network. It looks like you can concatenate (combine) all of the frames for a particular sound into one image and feed that into your net for training and evaluation.
http://cs231n.stanford.edu/reports/2016/pdfs/217_Report.pdf
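Here is roughly what that concatenation looks like, sketched in TypeScript to match the parent question's language; the frame sizes in the usage comment are purely illustrative:

    // Sketch: tile N grayscale frames (each h rows x w cols, row-major
    // Float32Arrays) side by side into one h x (N*w) image, so a whole
    // lip movement becomes a single CNN input.
    function concatFrames(frames: Float32Array[], w: number, h: number): Float32Array {
      const outWidth = w * frames.length;
      const out = new Float32Array(h * outWidth);
      frames.forEach((frame, f) => {
        for (let row = 0; row < h; row++) {
          for (let col = 0; col < w; col++) {
            out[row * outWidth + f * w + col] = frame[row * w + col];
          }
        }
      });
      return out;
    }

    // e.g. 12 frames of a 64x64 mouth crop -> one 64x768 input image:
    // const input = concatFrames(mouthCrops, 64, 64);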

Changing the pitch of an audio WAV file in MATLAB?

How do you go about changing the pitch of an audio signal in MATLAB? Essentially I just want to change the original qualities of the audio signal without making a dramatic change: I'm trying to alter the original input audio slightly so that I can have multiple variations of it, to simulate a chorus.
The simplest approach might be a phase vocoder. You can find one MATLAB implementation here:
http://labrosa.ee.columbia.edu/matlab/pvoc/
This is a rabbit hole, though. There are many more techniques that can be employed to improve the quality and reduce the artifacts introduced by pitch shifting. See, for example, Jean Laroche and Mark Dolson, "New Phase-Vocoder Techniques for Pitch-Shifting, Harmonizing and Other Exotic Effects", Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, p. 91.
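If a full phase vocoder is overkill, a cruder granular (overlap-add) shifter shows the core idea in a few lines. Sketched here in TypeScript rather than MATLAB, with illustrative grain sizes; the graininess it produces is exactly the kind of artifact the Laroche & Dolson paper is about reducing:

    // Granular pitch-shift sketch (simpler and lower quality than a phase
    // vocoder): each grain is resampled by `ratio` to move the pitch, then
    // overlap-added at the original hop so the duration stays the same.
    function pitchShiftGranular(x: Float32Array, ratio: number): Float32Array {
      const GRAIN = 2048;
      const HOP = GRAIN / 2; // 50% overlap; Hann windows then sum to unity
      const win = new Float32Array(GRAIN);
      for (let i = 0; i < GRAIN; i++) {
        win[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / GRAIN));
      }

      const y = new Float32Array(x.length);
      const reach = Math.max(GRAIN, Math.ceil(GRAIN * ratio)) + 1;
      for (let start = 0; start + reach < x.length; start += HOP) {
        for (let i = 0; i < GRAIN; i++) {
          // Linear-interpolation resample of this grain by `ratio`.
          const pos = start + i * ratio;
          const j = Math.floor(pos);
          const frac = pos - j;
          const s = x[j] * (1 - frac) + x[j + 1] * frac;
          y[start + i] += s * win[i]; // overlap-add at the original spacing
        }
      }
      return y;
    }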

How to adjust the Head-related transfer function (HRTF) in OpenAL or Core Audio?

OpenAL makes use of HRTF algorithms to fake surround sound with stereo headphones. However, there is an important dependency between HRTF and the shape of the users head and ears.
Simplified, this means: If your head / ears differ too much from the standard HRTF function they have implemented, the surround sound effect fades towards boring stereo.
I haven't yet found a way to adjust the various factors contributing to the HRTF algorithm, such as head diameter, pinna / external ear size, ear-to-ear distance, nose length and other important properties influencing the HRTF.
Is there any known way of setting these parameters for best surround sound experience?
I don't believe you can alter the HRTF in OpenAL. You certainly couldn't do it by putting in parametric values such as nose or pinna size. The only way to find out your HRTF is to put some very tiny, very accurate microphones in your ears, go into an anechoic chamber, and take frequency response measurements at every angle around your head. Obviously this is time-consuming, expensive, and impractical. It would be fantastic to be able to work out your HRTF from measuring your head, but unfortunately acoustics isn't that deterministic, and your ear is very sensitive to inaccuracies, as you pointed out. I think the OpenAL HRTF is based on some KEMAR dummy head measurements (these perhaps?).
So, I think the short answer is that you can't alter the HRTF for OpenAL. Because HRTF is such a complex function that your ear is so sensitive to, there's no accurate way to approximate it with parametric values.
You might be able to make a "configuration game" out of optimizing the HRTF. I've been looking for an answer to the question of whether any of the virtual surround headsets or sound cards allow you to adjust them to fit your personal HRTF.
Idea: you vary the different HRTF variables and play a sound. The user has to close his eyes and move the mouse in the direction he thought the sound came from, and you measure how accurate he was.
You could use something like a thin plate spline or statistical curve fitting to plot the accuracy results and sample different regions of the multidimensional HRTF space to optimize the solution. This would be a kind of "brute force" method to find a solution that is not necessarily accurate, but as good as the user has the patience to optimize his personal HRTF.
According to a readme in the OpenAL Soft source code, it uses a 32-sample convolution filter, and you can create one using custom HRTF samples.
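To make the convolution part concrete: binaural rendering is just direct convolution of a mono source with a left/right head-related impulse response (HRIR) pair. A sketch in TypeScript, where the 32-tap length matches the readme but the tap values themselves would have to come from measured data such as the KEMAR set:

    // Sketch: apply a (hypothetical) HRIR pair to a mono signal by
    // direct convolution, producing a binaural left/right output.
    function renderBinaural(
      mono: Float32Array,
      hrirL: Float32Array, // e.g. 32 taps for one source direction
      hrirR: Float32Array,
    ): { left: Float32Array; right: Float32Array } {
      const left = new Float32Array(mono.length);
      const right = new Float32Array(mono.length);
      for (let n = 0; n < mono.length; n++) {
        for (let k = 0; k < hrirL.length && k <= n; k++) {
          left[n] += hrirL[k] * mono[n - k];
          right[n] += hrirR[k] * mono[n - k];
        }
      }
      return { left, right };
    }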
It looks like it is now possible. I stumbled upon this comment, which describes how to use hrtf_tables to approximate your own ears. Google is showing me results for something called hrtf-paths as well, but I'm not sure what that is.

Real time speech transformation in MATLAB

Is it possible to transform speech (pitch/formant shift) in (near) real time using MATLAB? How can it be done?
If not, what should I use to do that?
I need to get input from the microphone, visualise the sound wave, apply a filter to it, view the oscilloscope again, and play back the modified sound.
The real-time visualization (spectrogram) can be created with the SparkNG package by Hideki Kawahara.
Sure. There's a demo application up on the MATLAB Central File Exchange that does something similar. It reads in a signal from the sound card (this requires the Data Acquisition Toolbox) in near real time, applies an FFT - you could do something else, like applying a filter - and visualises the results in live 3D graphs. You could use it as a template and modify it to your needs, such as visualising in different ways (more of an oscilloscope style) or writing the sound to a .wav file for later playback.
If you need properly real-time processing, you might look into implementing it in Simulink rather than in base MATLAB.
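For comparison - and closer to this page's web-audio-api tag - the same microphone -> filter -> oscilloscope -> playback loop in a browser is only a few lines with the Web Audio API. A rough sketch, where the filter type and sizes are arbitrary choices:

    // Sketch: microphone -> filter -> speakers, with an analyser tapped
    // in for oscilloscope-style drawing.
    async function startLiveChain(): Promise<void> {
      const ctx = new AudioContext();
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const mic = ctx.createMediaStreamSource(stream);

      const filter = ctx.createBiquadFilter(); // stand-in for "apply a filter"
      filter.type = 'lowpass';
      filter.frequency.value = 2000;

      const analyser = ctx.createAnalyser();   // feeds the visualisation
      analyser.fftSize = 2048;

      mic.connect(filter).connect(analyser).connect(ctx.destination);

      const wave = new Float32Array(analyser.fftSize);
      const draw = () => {
        analyser.getFloatTimeDomainData(wave); // oscilloscope samples
        // ...render `wave` to a canvas here...
        requestAnimationFrame(draw);
      };
      draw();
    }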

Peak detection in Performous code

I was looking to implement voice pitch detection on iPhone using the HPS (harmonic product spectrum) method, but the detected tones are not very accurate. Performous does a decent job of pitch detection.
I looked through the code, but I did not fully get the theory behind the calculations.
They use an FFT and find the peaks, but the part where they use the phase of the FFT output got me confused. I figure they use some heuristics for voice frequencies.
So, could anyone please explain the algorithm used in Performous to detect pitch?
Performous extracts pitch from the microphone, and the code is open source. Here is a description of what the algorithm does, from the guy that coded it (Tronic on irc.freenode.net#performous):
PCM input (with buffering)
FFT (1024 samples at a time, remove 200 samples from front of the buffer afterwards)
Reassignment method (against the previous FFT that was 200 samples earlier)
Filtering of peaks (this part could be done much better or even left out)
Combining peaks into sets of harmonics (we call the combination a tone)
Temporal filtering of tones (update the set of tones detected earlier instead of simply using the newly detected ones)
Pick the best vocal tone (frequency limits, weighting, could use the harmonic array also but I don't think we do)
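The phase part that confused you is most likely the classic phase-vocoder frequency refinement, a close cousin of the reassignment method in step 3 (though not necessarily the exact formula Performous uses): comparing a bin's phase across two FFTs taken a known hop apart pins the true frequency far more precisely than the bin index alone. A sketch in TypeScript:

    // Sketch of the phase trick: phasePrev/phaseCur would come from
    // Math.atan2(im[bin], re[bin]) of two FFTs taken `hop` samples
    // apart; per the steps above, fftSize = 1024 and hop = 200.
    function refinedBinFrequency(
      phasePrev: number,
      phaseCur: number,
      bin: number,
      fftSize: number,
      hop: number,
      sampleRate: number,
    ): number {
      // Phase advance we would see if the bin's nominal frequency were exact.
      const expected = (2 * Math.PI * bin * hop) / fftSize;
      // Wrap the measured deviation into [-pi, pi].
      let dev = phaseCur - phasePrev - expected;
      dev -= 2 * Math.PI * Math.round(dev / (2 * Math.PI));
      // Fractional-bin correction, then convert bins to Hz.
      const trueBin = bin + (dev * fftSize) / (2 * Math.PI * hop);
      return (trueBin * sampleRate) / fftSize;
    }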
I still wasn't able to figure it out and implement it from this information. If anyone manages to, please post your results here and comment on this response so that SO notifies me.
The task would be to create a minimal C++ wrapper around this code.