How to adjust the Head-related transfer function (HRTF) in OpenAL or Core Audio? - iphone

OpenAL makes use of HRTF algorithms to fake surround sound with stereo headphones. However, there is an important dependency between the HRTF and the shape of the user's head and ears.
Simplified, this means: If your head / ears differ too much from the standard HRTF function they have implemented, the surround sound effect fades towards boring stereo.
I haven't yet found a way to adjust the various factors contributing to the HRTF algorithm, such as head diameter, pinna / external ear size, ear-to-ear distance, nose length and other important properties influencing the HRTF.
Is there any known way of setting these parameters for best surround sound experience?

I don't believe you can alter the HRTF in OpenAL. You certainly couldn't do it by putting in parametric values such as nose or pinna size. The only way to find out your HRTF is to put some very tiny, very accurate microphones in your ears, go into an anechoic chamber and take frequency response measurements at every angle around your head. Obviously this is time consuming, expensive and impractical. It would be fantastic to be able to work out your HRTF from measuring your head, but unfortunately acoustics isn't that deterministic and your ear is very sensitive to inaccuracies as you pointed out. I think the OpenAL HRTF is based on some KEMAR dummy head measurements (these perhaps?).
So, I think the short answer is that you can't alter the HRTF for OpenAL. Because HRTF is such a complex function that your ear is so sensitive to, there's no accurate way to approximate it with parametric values.

You might be able to make a "configuration game" out of optimizing the HRTF. I've been looking for an answer to the question of whether any of the virtual surround headsets or sound cards allow you to adjust them to fit your personal HRTF.
Idea: you vary the different HRTF variables and play a sound. The user has to close their eyes and move the mouse in the direction they think the sound came from. You measure how accurate they were.
You could use something like a thin-plate spline or statistical curve fitting to model the accuracy results, and sample different regions of the multidimensional HRTF space to optimize the solution. This would be a kind of "brute force" method: the result is not necessarily accurate, but it gets as good as the user has patience to keep optimizing their personal HRTF.
According to a readme in the OpenAL Soft source code, it uses a 32-sample convolution filter, and you can build HRTF tables from your own custom samples.
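For illustration, here is a minimal Python/NumPy sketch of what such a convolution-based HRTF renderer does. It assumes you already have a left/right pair of head-related impulse responses (hrir_left, hrir_right) from your own measurements or a public dataset such as the KEMAR set; none of this comes from OpenAL itself.

import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    # Convolve a mono signal with a left/right HRIR pair to get binaural stereo.
    # mono, hrir_left and hrir_right are 1-D float arrays.
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    # Trim to a common length and stack into an (N, 2) stereo buffer.
    n = min(len(left), len(right))
    return np.stack([left[:n], right[:n]], axis=1)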

It looks like it is now possible. I stumbled upon this comment which describes how to use hrtf_tables for approximations of your own ears. Google is showing me results for something called hrtf-paths as well but I'm not sure what that is.

Related

How do I determine processor speed required for optical flow?

I'd like to use an optical flow system to get velocities from the surrounding environment. I've read papers about how optical flow works, but they don't go into detail about the optical sensors themselves.
My question is: How do I determine how much computational power is required to perform optical flow analysis?
I'd like to use a low-power system (like microcontrollers), but I don't know what kind of camera I could use with such a system. I mean, could it be color or does it need to be B/W? Rolling shutter or global shutter? Which frame rate or number of pixels?
I'd like to specify the system myself but, without knowing how those camera attributes impact the processing load, I'm not sure where to start.
As Chuck already said in the comments, you first need to start with something. Optical-flow calculation really depends on what you are using it for and what you are trying to achieve. For real-time applications you might want to consider using faster processors (this is always true, though).
Continuing to my answer:
Optical-flow calculation performance depends on a few main things:
The optical-flow method you choose (dense or sparse); you can read more about it here and here. Of course, take into account not only that sparse is faster than dense, but also that sparse might be less accurate in some cases. Again, this depends on what you're trying to achieve.
In addition, you will see that there are different optical-flow algorithms. Some might be faster than others. There are many algorithms such as Lucas-Kanade, Horn-Schunck, TVL1, Farneback, etc.
Most optical-flow methods in libraries such as OpenCV give you the ability to change some parameters in order to play with the trade-off between accuracy and performance. See this, and also check the OpenCV methods such as this and this for example - note the different arguments.
The resolution of your image. A smaller image usually means a faster calculation.
A few things you might also want to consider:
If you are using a processor that has multiple cores, make sure that you are using all of them in the optical-flow calculation. Some libraries may already do this for you, but in some cases you will need to do it yourself. Take a look at my question and answer in this post; it might give you some ideas and help you get started with such a case.
If you want more accurate optical-flow results you must use a global-shutter camera. Rolling-shutter cameras, such as most webcams, will add an extra source of error you don't want.
You don't need a color image; if you have a grayscale camera it will be even better. If not, you will need to convert the image to grayscale (not B/W) for faster performance as well.
Some libraries such as OpenCV have an option (in some cases) to run these algorithms on a GPU. If using a GPU is an option, you might want to consider this as well.
From my own experience, the main thing that gave me a boost in performance was changing my resolution from 640x480 to 320x240 and even 160x120. In my case it didn't really hurt the accuracy.
I used an Odroid U3 mini-PC with OpenCV's PyrLK algorithm and input frames at 320x240 resolution. After applying what's described here (splitting the image into 4 parts for parallel calculation) it worked pretty well (real time).
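For anyone starting from scratch, a rough Python/OpenCV sketch of this kind of pipeline (downscaled grayscale frames plus sparse pyramidal Lucas-Kanade) might look like the following; the camera index, resolution and feature parameters are placeholders you would tune for your own setup.

import cv2

cap = cv2.VideoCapture(0)  # placeholder camera index
ok, frame = cap.read()
prev = cv2.cvtColor(cv2.resize(frame, (320, 240)), cv2.COLOR_BGR2GRAY)
# Track a sparse set of corners instead of computing dense flow for every pixel.
pts = cv2.goodFeaturesToTrack(prev, maxCorners=100, qualityLevel=0.3, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(cv2.resize(frame, (320, 240)), cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: follow the previous points into the new frame.
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
    good_new = new_pts[status.flatten() == 1]
    good_old = pts[status.flatten() == 1]
    # (good_new - good_old) gives the per-point displacement, i.e. the sparse flow.
    prev, pts = gray, good_new.reshape(-1, 1, 2)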
The answer given by Sarid has some strong points, and many of them are shared by researchers around the world. My opinions are shared by anyone who has actually worked with these topics in a real-world setting. By real world, I mean implementing optical flow on drones, on mobile phones and in IP cameras that are not sitting in a protected office, and where other systems (such as humans) need to interact and be co-dependent.
First of all, depending on your problem, you may want to invest time in looking for ready-made solutions. Optical flow sensors are readily available, cheap and robust (but usually not strong in accuracy). These are the kind of sensors you find in optical mice. They are low power, and easily interfaced with micro-controllers. Some have staggering sample rates of thousands of fps. They commonly have low spatial resolution however, and (to emphasize) high robustness but low accuracy.
If instead you are looking for the kind of optical flow that can be used for shape from motion, pedestrian detection or video encoding, for example, then you are probably better off looking for something more advanced, and that's where Sarid's answer becomes relevant.
Since your question has been migrated from Robotics Stack Exchange, I am going to assume you are interested in applications close to machine control and human-machine interaction. In that case, the most important aspects are the ones usually most ignored by people working in the field of optical-flow estimation, namely:
Latency. If you have a human interfacing at the front end, then the common term is "glass-to-glass latency". This is completely different from the fps of your system, which is connected to throughput. If you find that you are in a discussion with someone and they do not understand the difference between latency and fps, then they are not the expert you are interested in. For example, almost all researchers in computer vision who do GPU implementations of optical flow add massive latency by allowing for frame delays and inefficient memory handling (inefficient from the perspective of latency, but efficient in terms of throughput and hardware utilization). Consider the problem of controlling a drone, say making it self-stabilizing: it is better to receive a bad optical-flow estimate 10 ms earlier than a good one with 10 ms of extra delay, especially if the optical system does not give you any upper bound on the delay at any given time.
Algorithm stability. This is completely different from accuracy. Accuracy is what 99% of all research in optical flow has been obsessing about for the last 30 years. Stability is not at all something evaluated in the Middlebury benchmark, for example. Stability deals with whether small changes in your data are guaranteed to produce only small changes in the estimated optical flow. While some good work has been done in the community (on robust statistics, most interestingly), in the end the final evaluation of any algorithm disregards stability. Consider the optical mouse as a good example. The first generations of optical mice had higher accuracy (the average error from the true motion was smaller) but they had lower stability (especially when you ran the mice over "bad" textures, or with rotational motions). Later generations of optical mice have worse accuracy but focus on stability, as that is the most important thing. You don't experience the mouse cursor jumping around as much as you did in the early days of the devices... but if you move the mouse on your mat, left and right repeatedly, you will see the cursor slowly drifting (i.e. low accuracy).
Heat. Any device that estimates high-accuracy optical flow will require lots of computation. When it comes to computations per watt, GPUs are not that good. In drones you may be able to get away with this, because it is a setting where you have active cooling as a by-product of the propulsion system. In the real world, you most often cannot assume active cooling or an unlimited power supply.
To conclude: it's a fascinating area, and I hope you have a great experience coding solutions.

Audio processing with fft iphone app [duplicate]

I'm trying to write a simple tuner (no, not to make yet another tuner app), and am looking at the AurioTouch sample source (has anyone tried to comment this code??).
My worry is that aurioTouch doesn't seem to actually work very well when looking at the frequency-domain graph. I play a single note on an instrument and I don't see a nicely ordered, small set of frequencies with one strong peak at the appropriate frequency of the note.
Has anyone used aurioTouch enough to know whether the underlying code is functional or whether it is just a crude sample?
Other options I have are to use FFTW or KISS FFT. Anyone have any experience with those?
Thanks.
You're expecting the wrong thing!!
Not the library's fault
Whether the library produces it properly or not, you're looking for a pattern that rarely actually exists in real-life sounds. Only a perfect sine wave, electronically generated, will produce anything close to a single discrete 'spike' in the frequency graph. If you don't believe it, try firing up a 'spectrum analyzer' visualization in Winamp or Media Player. It's not so much the PC's fault.
Real sound waves are complicated animals
Picture a sawtooth or square wave in your mind's eye. Those sharp turnarounds - corners or points on the wave - look like tons of higher harmonics to the FFT (or even to a true Fourier transform). And if you've ever seen a real 'square wave/sawtooth' on a scope, or even a 'sine wave' produced by an instrument that is supposed to produce a sine wave, take a look at all the sharp nooks and crannies in just ONE note (if you don't have a scope, just zoom way in on the wave in Audacity - the more you zoom, the higher the frequencies you're looking at). Yep, those deviations all count as frequencies.
It's hard to tell the difference between one note and a whole orchestra sometimes in a spectrum analysis.
But I hear single notes!
So how does the ear do it? It considers the entire waveform. Then your lower brain lies to your upper brain about what the input is: one note, not a mess of overtones.
You can't do it as fully, but you can approximate it via 'training.'
Approximation: building some smarts
PLAY the note on the instrument and 'save' the frequency graph. Do this for notes in several frequency ranges, or better yet all notes.
Then interpolate to fill in the gaps between notes (by half or quarter steps) by scaling the saved graphs for that instrument by a factor of 2^(1/12) (or 2^(1/24) for quarter steps, etc.).
Figure out how to store them in a quickly searchable data structure like a BST or trie - only it would have to return a 'how close is this' score. It would also have to identify the match by the proportions of the frequencies, in case the note comes in at a different volume.
Using the smarts
Next time you're looking for a note from that instrument, just take the 'heard' frequency graph and find it in that data structure. You can record several instruments that make different waveforms and search for them too. If there are background sounds or multiple notes, take the closest match. Then, if you want to identify other notes, 'subtract' the found frequency pattern from the sampled one, and rinse, lather, repeat.
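A minimal Python sketch of that matching step, assuming note_templates is a hypothetical dict mapping note names to the magnitude spectra you saved during 'training':

import numpy as np

def closest_note(heard_spectrum, note_templates):
    # Normalizing both spectra makes the comparison volume-independent, so the
    # match is driven by the proportions of the frequency content, not its level.
    heard = heard_spectrum / (np.linalg.norm(heard_spectrum) + 1e-12)
    best_note, best_score = None, -1.0
    for note, template in note_templates.items():
        t = template / (np.linalg.norm(template) + 1e-12)
        score = float(np.dot(heard, t))  # cosine similarity as the 'how close' score
        if score > best_score:
            best_note, best_score = note, score
    return best_note, best_score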
It won't work by your voice...
If you ever tried to tune yourself by singing into a guitar tuner, you'll know that tuners aren't that clever. Of course, some instruments (voice especially) really float around the pitch and generate an ever-evolving waveform (even without somebody singing).
What are you trying to accomplish?
You wouldn't have to get this fancy for a 'simple' tuner app, but if you're not making just another tuner app then I'm guessing you actually want to identify notes (e.g., maybe you want to auto-generate MIDI files from songs on the radio ;-)
Good luck. I hope you find a library that does all this junk instead of having to roll your own.
Edit 2017
Note this webpage: http://www.feilding.net/sfuad/musi3012-01/html/lectures/015_instruments_II.htm
Well down the page, there are spectrum analyses of various organ pipes. There are many, many overtones. These are possible to detect - with enough work - if you 'train' your app with them first (just like telling a kid, 'this is what a clarinet sounds like...')
aurioTouch looks weird because the frequency axis is on a linear scale. It's very difficult to interpret FFT output when the x-axis is anything other than a logarithmic scale (traditionally log2).
If you can't use aurioTouch's integer-FFT, check out my library:
http://github.com/alexbw/iPhoneFFT
It uses double-precision, has support for multiple window types, and implements Welch's method (which should give you more stable spectra when viewed over time).
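If you are prototyping offline rather than on the device, a hedged equivalent using SciPy's implementation of Welch's method might look like this; the sample rate and segment length are example values only.

import numpy as np
from scipy.signal import welch

fs = 44100                   # example sample rate
x = np.random.randn(fs)      # stand-in for one second of audio samples
# Welch's method averages the spectra of overlapping windowed segments,
# trading some frequency resolution for a much less noisy estimate.
freqs, psd = welch(x, fs=fs, nperseg=4096)
peak_hz = freqs[np.argmax(psd)]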
@zaph, the FFT does compute a true Discrete Fourier Transform. It is simply an efficient algorithm for computing the DFT that exploits symmetries in the transform rather than computing every term directly.
FFTs use frequency bins, and the bin width is determined by the FFT parameters (the sample rate divided by the FFT length). To find a frequency you will need to sample the signal at a rate of at least twice the highest frequency present in it, and then look for the time between cycles (or, equivalently, the strongest bin). If it is not a pure tone this will of course be harder.
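To make the bin-width point concrete, here is a small NumPy sketch; the sample rate and FFT length are arbitrary example values.

import numpy as np

fs = 8000           # sample rate in Hz (example value)
n = 1024            # FFT length (example value)
bin_width = fs / n  # each bin covers fs / n Hz, here about 7.8 Hz

t = np.arange(n) / fs
x = np.sin(2 * np.pi * 440.0 * t)      # a 440 Hz test tone, well below fs / 2
spectrum = np.abs(np.fft.rfft(x))
peak_bin = int(np.argmax(spectrum))
estimated_freq = peak_bin * bin_width  # only accurate to within one bin width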
I am using the Ooura FFT to compute the FFT of accelerometer data. I do not always obtain the correct spectrum. For some reason, the Ooura FFT produces completely wrong results, with spectral magnitudes on the order of 10^200 across all frequencies.

C/C++/Obj-C Real-time algorithm to ascertain Note (not Pitch) from Vocal Input

I want to detect not the pitch, but the pitch class of a sung note.
So, whether it is C4 or C5 is not important: they must both be detected as C.
Imagine the 12 semitones arranged on a clock face, with a needle pointing to the pitch class. That's what I'm after! Ideally I would also like to be able to tell whether the sung note is spot-on or slightly off.
This is not a duplicate of previously asked questions, as it introduces the constraints that:
the sound source is a single human voice, hopefully with negligible background interference (although I may need to deal with this)
the octave is not important, only the pitch class
EDIT -- Links:
Real time pitch detection
Using the Apple FFT and Accelerate Framework
See my answer here for getting smooth FREQUENCY detection: https://stackoverflow.com/a/11042551/1457445
As far as snapping this frequency to the nearest note -- here is a method I created for my tuner app:
- (int)snapFreqToMIDI:(float)frequency {
    // referenceA is assumed to be the reference frequency for A (e.g. 440.0f);
    // adding 0.5 before truncating to int rounds to the nearest note number.
    int midiNote = (int)(12.0f * (log10f(frequency / referenceA) / log10f(2.0f)) + 57.0f + 0.5f);
    return midiNote;
}
This will return the MIDI note value (http://www.phys.unsw.edu.au/jw/notes.html)
In order to get a string from this MIDI note value:
- (NSString*)midiToString:(int)midiNote {
    // Note names for the 12 pitch classes; midiNote % 12 == 0 corresponds to C.
    NSArray *noteStrings = [[NSArray alloc] initWithObjects:@"C", @"C#", @"D", @"D#", @"E", @"F", @"F#", @"G", @"G#", @"A", @"A#", @"B", nil];
    return [noteStrings objectAtIndex:midiNote % 12];
}
For an example implementation of the pitch detection with output smoothing, look at musicianskit.com/developer.php
Pitch is a human psycho-perceptual phenomenon. Peak frequency content is not the same as either pitch or pitch class. FFT and DFT methods will not directly provide pitch, only frequency. Neither will zero-crossing measurements work well for human voice sources. Try AMDF, ASDF, autocorrelation or cepstral methods. There are also plenty of academic papers on the subject of pitch estimation.
There is another long list of pitch estimation algorithms here.
Edited addition: Apple's SpeakHere and aurioTouch sample apps (available from their iOS dev center) contain example source code for getting PCM sample blocks from the iPhone's mic.
Most of the frequency-detection algorithms cited in other answers don't work well for voice. To see intuitively why this is so, consider that all the vowels in a language can be sung at one particular note. Even though those vowels have very different frequency content, they would all have to be detected as the same note. Any note-detection algorithm for voices must take this into account somehow. Furthermore, human speech and song contain many fricatives, many of which have no implicit pitch in them.
In the generic (non-voice) case, the feature you are looking for is called the chroma feature, and there is a fairly large body of work on the subject. It is equivalently known as the harmonic pitch class profile. The original reference paper on the concept is Takuya Fujishima's "Real-Time Chord Recognition of Musical Sound: A System Using Common Lisp Music". The Wikipedia entry has an overview of a more modern variant of the algorithm. There are a bunch of free papers and MATLAB implementations of chroma feature detection.
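To give a feel for the basic idea (and only the basic idea), here is a rough Python sketch that folds FFT bin energies onto the 12 pitch classes; real chroma implementations add windowing, tuning estimation and smoothing on top of this.

import numpy as np

def chroma(frame, fs, ref_a=440.0):
    # Fold the magnitude spectrum of one audio frame onto the 12 pitch classes.
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    bins = np.zeros(12)
    for f, mag in zip(freqs[1:], spectrum[1:]):  # skip the DC bin
        # Semitone distance from the reference A, reduced modulo 12
        # (so index 0 corresponds to the pitch class A here).
        pitch_class = int(round(12 * np.log2(f / ref_a))) % 12
        bins[pitch_class] += mag
    return bins / (bins.sum() + 1e-12)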
However, since you are focusing on the human voice only, and since the human voice naturally contains tons of overtones, what you are practically looking for in this specific scenario is a fundamental frequency detection algorithm, or f0 detection algorithm. There are several such algorithms explicitly tuned for voice. Also, here is a widely cited algorithm that works on multiple voices at once. You'd then check the detected frequency against the equal-tempered scale and then find the closest match.
Since I suspect that you're trying to build a pitch detector and/or corrector a la Autotune, you may want to use M. Morise's excellent WORLD implementation, which permits fast and good quality detection and modification of f0 on voice streams.
Lastly, be aware that there are only a few vocal pitch detectors that work well within the vocal fry register. Almost all of them, including WORLD, fail on vocal fry as well as very low voices. A number of papers refer to vocal fry as "creaky voice" and have developed specific algorithms to help with that type of voice input specifically.
If you are looking for the pitch class you should have a look at the chromagram (http://labrosa.ee.columbia.edu/matlab/chroma-ansyn/)
You can also simply detect the f0 (using something like the YIN algorithm) and return the appropriate semitone; note that most fundamental-frequency estimation algorithms suffer from octave errors.
Perform a Discrete Fourier Transform on samples from your input waveform, then sum values that correspond to equivalent notes in different octaves. Take the largest value as the dominant frequency.
You can likely find some existing DFT code in Objective C that suits your needs.
Putting up information as I find it...
Pitch detection algorithm on Wikipedia is a good place to start. It lists a few methods that fail for determining octave, which is okay for my purpose.
A good explanation of autocorrelation can be found here (why can't Wikipedia put things simply like that??).
Finally I have closure on this one, thanks to this article from DSP Dimension
The article contains source code.
Basically, he performs an FFT. Then he explains that frequencies that don't coincide exactly with the centre of the bin they fall in will smear over nearby bins in a sort of bell-shaped curve, and he shows how to extract the exact frequency from this data in a second pass (the FFT being the first pass).
The article then goes further, into pitch shifting; I can simply delete that code.
Note that they supply a commercial library that does the same thing (and far more), only super-optimised. There is a free version of the library that would probably do everything I need, although since I have already worked through the iOS audio subsystem, I might as well just implement it myself.
For the record, I found an alternative way to extract the exact frequency, by fitting a quadratic curve over the peak bin and its two neighbours, here. I have no idea what the relative accuracy of these two approaches is.
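For reference, that quadratic (parabolic) interpolation over the peak bin and its two neighbours is only a few lines. This Python sketch assumes mags is a magnitude spectrum and k is the index of the peak bin:

def refine_peak(mags, k):
    # Fit a parabola through bins k-1, k, k+1 of the magnitude spectrum and
    # return the fractional bin position of its vertex. Multiply the result
    # by the bin width (sample_rate / fft_length) to get the frequency in Hz.
    a, b, c = mags[k - 1], mags[k], mags[k + 1]
    offset = 0.5 * (a - c) / (a - 2 * b + c)
    return k + offset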
As others have mentioned you should use a pitch detection algorithm. Since that ground is well-covered I will address a few particulars of your question. You said that you are looking for the pitch class of the note. However, the way to find this is to calculate the frequency of the note and then use a table to convert it to the pitch class, octave, and cents. I don't know of any way to obtain the pitch class without finding the fundamental frequency.
You will need a real-time pitch detection algorithm. In evaluating algorithms pay attention to the latency implied by each algorithm, compared with the accuracy you desire. Although some algorithms are better than others, fundamentally you must trade one for the other and cannot know both with certainty -- sort of like the Heisenberg uncertainty principle. (How can you know the note is C4 when only a fraction of a cycle has been heard?)
Your "smoothing" approach is equivalent to a digital filter, which will alter the frequency characteristics of the voice. In short, it may interfere with your attempts to estimate the pitch. If you have an interest in digital audio, digital filters are fundamental and useful tools in that field, and a fascinating subject besides. It helps to have a strong math background in understanding them, but you don't necessarily need that to get the basic idea.
Also, your zero-crossing method is a basic technique to estimate the period of a waveform and thus the pitch. It can be done this way, but only with a lot of heuristics and fine-tuning. (Essentially, develop a number of "candidate" pitches and try to infer the dominant one. A lot of special cases will emerge that will confuse this; a quick example is the sound of the letter 's'.) You'll find it much easier to begin with a frequency-domain pitch detection algorithm.
If you're a beginner, this may be very helpful. It is available for both Java and iOS.
dywapitchtrack for iOS
dywapitchtrack for Java

Does enlarging images make them easier to analyze programmatically?

Can you enlarge a feature so that, rather than taking up a certain number of pixels, it takes up one or two times that many, to make it easier to analyze? Would there be a way to generalize that in MATLAB?
This sounds an awful lot like a fictitious "zoom, enhance!" procedure that you'd hear about on CSI. In general, "blowing up" a feature doesn't make it any easier to analyze, because no additional information is created when you do this. Generally you would apply other, different transformations like noise reduction to make analysis easier.
As John F has stated, you are not adding any information. In fact, with more pixels to crunch through you are making it "harder" in the sense of requiring more processing.
You might be able to intelligently increase the resolution of an image using Compressed Sensing. It will require some work (or at least some serious thought), though, as you'll have to determine how best to sample the image you already have. There's a large number of papers referenced at Rice University Compressive Sensing Resources.
The challenge is that the image is already sampled using Nyquist-Shannon constraints. You essentially have to re-sample it using a linear basis function (with IID random elements) in such a way that the estimate is at the desired resolution and find some surrogate for the original image at that same resolution that doesn't bias the estimate.
The function imresize is useful for, well, resizing images, larger or smaller. And imcrop is useful for cropping images.
You might get other more useful answers if you tag the question image-processing too.

AurioTouch & FFT for an instrument tuner

I'm trying to write a simple tuner (no, not to make yet another tuner app), and am looking at the AurioTouch sample source (has anyone tried to comment this code??).
My worry is that aurioTouch doesn't seem to actually work very well when looking at the frequency domain graph. I play a single note on an instrument and I don't see a nicely ordered, small, set of frequencies with one string peak at the appropriate frequency of the note.
Has anyone used aurioTouch enough to know whether the underlying code is functional or whether it is just a crude sample?
Other options I have are to use FFTW or KISS FFT. Anyone have any experience with those?
Thanks.
You're expecting the wrong thing!!
Not the library's fault
Whether the library produces it properly or not, you're looking for a pattern that rarely actually exists in real-life sounds. Only a perfect sine wave, electronically generated, will cause an even partway discrete appearing 'spike' in the freq. graph. If you don't believe it try firing up a 'spectrum analyzer' visualization in winamp or media player. It's not so much the PC's fault.
Real sound waves are complicated animals
Picture a sawtooth or sqaure wave in your mind's eye. those sharp turnaround - corners or points on the wave, look like tons of higher harmonics to the FFT or even a real fourier. And if you've ever seen a real 'sqaure wave/sawtooth' on a scope, or even a 'sine wave' produced by an instrument that is supposed to produce a sinewave, take a look at all the sharp nooks and crannies in just ONE note (if you don't have a scope just zoom way in on the wave in audacity - the more you zoom, the higher notes you're looking at). Yep, those deviations all count as frequencies.
It's hard to tell the difference between one note and a whole orchestra sometimes in a spectrum analysis.
But I hear single notes!
So how does the ear do it? It considers the entire waveform. Then your lower brain lies to your upper brain about what the input is: one note, not a mess of overtones.
You can't do it as fully, but you can approximate it via 'training.'
Approximation: building some smarts
PLAY the note on the instrument and 'save' the frequency graph. Do this for notes in several frequency ranges, or better yet all notes.
Then interpolate the notes to fill in gaps (by 1/2 or 1/4 steps) by multiplying the saved graphs for that instrument by 2^(1/12) (or 1/24 for 1/4 steps, etc).
Figure out how to store them in a quickly-searchable data structure like a BST or trie. Only it would have to return a 'how close is this' score. It would have to identify the match via proportions of frequencies as well, in case it came in different volumes.
Using the smarts
Next time you're looking for a note from that instrument, just take the 'heard' freq graph and find it in that data structure. You can record several instruments that make different waveforms and search for them too. If there are background sounds or multiple notes, take the closest match. Then if you want to identify other notes, 'subtract' the found frequency pattern from the sampled one, and rinse, lather repeat.
It won't work by your voice...
If you ever tried to tune yourself by singing into a guitar tuner, you'll know that tuners arent that clever. Of course some instruments (voice esp) really float around the pitch and generate an ever-evolving waveform (even without somebody singing).
What are you trying to accomplish?
You would not have to totally get this fancy for a 'simple' tuner app, but if you're not making just another tuner app them I'm guessing you actually want to identify notes (e.g., maybe you want to autogenerate midi files from songs on the radio ;-)
Good luck. I hope you find a library that does all this junk instead of having to roll your own.
Edit 2017
Note this webpage: http://www.feilding.net/sfuad/musi3012-01/html/lectures/015_instruments_II.htm
Well down the page, there are spectrum analyses of various organ pipes. There are many, many overtones. These are possible to detect - with enough work - if you 'train' your app with them first (just like telling a kid, 'this is what a clarinet sounds like...')
aurioTouch looks weird because the frequency axis is on a linear scale. It's very difficult to interpret FFT output when the x-axis is anything other than a logarithmic scale (traditionally log2).
If you can't use aurioTouch's integer-FFT, check out my library:
http://github.com/alexbw/iPhoneFFT
It uses double-precision, has support for multiple window types, and implements Welch's method (which should give you more stable spectra when viewed over time).
#zaph, the FFT does compute a true Discrete Fourier Transform. It is simply an efficient algorithm that takes advantage of the bit-wise representation of digital signals.
FFTs use frequency bins and the bin frequency width is based on the FFT parameters. To find a frequency you will need to record it sampled at a rate at least twice the highest frequency present in the sample. Then find the time between the cycles. If it is not a pure frequency this will of course be harder.
I am using Ooura FFT to compute the FFT of acceleromter data. I do not always obtain the correct spectrum. For some reason, Ooura FFT produces completely wrong results with spectral magnitudes of the order 10^200 across all frequencies.