How would you go about comparing a spoken word to an audio file and determining if they match? For example, if I say "apple" to my iPhone application, I would like for it to record the audio and compare it with a prerecorded audio file of someone saying "apple". It should be able to determine that the two spoken words match.
What kind of algorithm or library could I use to perform this kind of voice-based audio file matching?
You should look up acoustic fingerprinting; see the Wikipedia link below. Shazam basically does this for music.
http://en.wikipedia.org/wiki/Acoustic_fingerprint
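As a toy illustration of the idea (nothing like a production fingerprinting algorithm such as Shazam's), you could split each recording into frames, take the dominant frequency bin of each frame, and compare the two sequences. The frame size, tolerance, and function names below are made up for the sketch; a real system would use hashed spectral peak pairs and some time alignment such as dynamic time warping.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

namespace {
const double kPi = 3.14159265358979323846;

// Naive DFT magnitudes for one frame (fine for a sketch; use a real FFT in practice).
std::vector<double> magnitudes(const std::vector<double>& frame) {
    const std::size_t n = frame.size();
    std::vector<double> mags(n / 2);
    for (std::size_t k = 0; k < n / 2; ++k) {
        double re = 0.0, im = 0.0;
        for (std::size_t t = 0; t < n; ++t) {
            const double phase = 2.0 * kPi * k * t / n;
            re += frame[t] * std::cos(phase);
            im -= frame[t] * std::sin(phase);
        }
        mags[k] = std::sqrt(re * re + im * im);
    }
    return mags;
}
}  // namespace

// Very crude "fingerprint": the strongest DFT bin of each frame.
std::vector<std::size_t> fingerprint(const std::vector<double>& samples,
                                     std::size_t frameSize = 512) {
    std::vector<std::size_t> print;
    for (std::size_t start = 0; start + frameSize <= samples.size(); start += frameSize) {
        std::vector<double> frame(samples.begin() + start,
                                  samples.begin() + start + frameSize);
        const std::vector<double> mags = magnitudes(frame);
        std::size_t peak = 1;  // skip the DC bin
        for (std::size_t k = 2; k < mags.size(); ++k)
            if (mags[k] > mags[peak]) peak = k;
        print.push_back(peak);
    }
    return print;
}

// Fraction of frames whose peak bins roughly agree (1.0 = every frame agrees).
double similarity(const std::vector<std::size_t>& a, const std::vector<std::size_t>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    if (n == 0) return 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const long long diff = static_cast<long long>(a[i]) - static_cast<long long>(b[i]);
        if (diff >= -2 && diff <= 2) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(n);
}
```

In practice two utterances of "apple" won't line up frame-for-frame, which is exactly the gap that real fingerprinting (or speech recognition, as in the other answers) is designed to close.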
I know this question is old, but I discovered this library today:
http://www.ispikit.com/
Sphinx does voice recognition, and PocketSphinx has been ported to the iPhone by Brian King.
Check https://github.com/KingOfBrian/VocalKit
He has provided excellent detail and made it easy to implement yourself. I've run his example and built my own variation on it.
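VocalKit is an Objective-C wrapper; underneath it, the raw pocketsphinx C API for checking a single keyword looks roughly like the sketch below. Treat this as an assumption-heavy outline rather than copy-paste code: the model paths are placeholders, and the exact arity of ps_start_utt/ps_get_hyp has changed between pocketsphinx releases (older versions take extra arguments).

```cpp
#include <pocketsphinx.h>
#include <cstring>

// Returns true if the decoder's best hypothesis for the recorded PCM buffer
// is the word "apple". Paths and buffer handling are placeholders.
bool matches_apple(const int16 *pcm, size_t n_samples) {
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
                                   "-hmm",  "/path/to/acoustic-model",     // placeholder
                                   "-lm",   "/path/to/language-model.lm",  // placeholder
                                   "-dict", "/path/to/dictionary.dic",     // placeholder
                                   NULL);
    ps_decoder_t *ps = ps_init(config);
    if (!ps) return false;

    ps_start_utt(ps);                          // older releases: ps_start_utt(ps, NULL)
    ps_process_raw(ps, pcm, n_samples, FALSE, TRUE);
    ps_end_utt(ps);

    int32 score = 0;
    const char *hyp = ps_get_hyp(ps, &score);  // older releases also return an utterance id
    const bool match = hyp && std::strcmp(hyp, "apple") == 0;

    ps_free(ps);
    return match;
}
```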
You can use a neural network library and train it to recognize different speech patterns. This requires some know-how about the general theory of neural networks and how they can be used to build systems that behave in a particular way. If you know nothing about the subject, start with the basics and then use a library rather than implementing something yourself. Hope that helps.
Related
I have a general question regarding how I should proceed with my music visualization endeavors. I am interested in visualizing classical music pieces, either recorded or live. So far I have used Processing, but I am open to other software or programming languages too. Since my background is in musicology and music theory, I would like to incorporate music-theoretical concepts or any tools that MIR research has made available for analyzing music, so that I can create visualizations based on them.
However, I don't really know where to start. I feel I should study MIR concepts, but it can become very technical, and it seems I don't need to be knowledgeable about every facet of MIR research. For example, I want to analyze and visualize music, not synthesize it. Or could it be that learning about sound synthesis would also help my visualizations? And to what extent?
Alternatively, would it be better to just pick some software or language and try its audio-analysis libraries?
Basically, my dilemma is whether I should start with MIR theory, or whether there is an alternative way to get a good overview of the features one could extract from a piece of music, so that I have a richer palette for visualizations.
If you could give me some advice on this, I would be very grateful.
Thanks, Ilias
Hi guys, I'm looking for a solution that enables a user to compare an image to a previously stored image. For example, I take a picture of a chair on my iPhone and it is compared to a locally stored image; if the similarity is reasonable, the image is confirmed and another action is called.
Most solutions I've been able to find require cloud processing on third-party servers (Visioniq, Moodstocks, Kooaba, etc.). Is this because the iPhone doesn't have sufficient processing power for this task?
A small reference library will be stored on the device and referenced when needed.
Users will be able to index their own pictures for recognition.
Is there any such solution that anyone has heard of? My searches have only turned up cloud solutions from the above-mentioned BaaS providers.
Any help will be much appreciated.
Regards,
Frank
Try using the OpenCV library on iOS:
OpenCV
You can try using FlannBasedMatcher to match images. Documentation and sample code are available in OpenCV's "Feature Matching with FLANN" tutorial.
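A minimal sketch of that tutorial's approach, assuming OpenCV 4.4+ where SIFT lives in the main features2d module (in older builds it is in xfeatures2d, or you could substitute ORB): detect features in both images, match descriptors with FlannBasedMatcher, and keep matches that pass Lowe's ratio test. The 0.7 ratio and the "enough good matches" threshold are arbitrary values you would tune; the same calls are reachable from the OpenCV iOS framework via an Objective-C++ bridge.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

// Returns true if the query image has a reasonable number of good feature
// matches against the stored reference image.
bool imagesRoughlyMatch(const cv::Mat& query, const cv::Mat& reference) {
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();

    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(query,     cv::noArray(), kp1, desc1);
    sift->detectAndCompute(reference, cv::noArray(), kp2, desc2);
    if (desc1.empty() || desc2.empty()) return false;

    // FLANN expects float descriptors; SIFT already produces CV_32F.
    cv::FlannBasedMatcher matcher;
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);

    // Lowe's ratio test: keep matches clearly better than their runner-up.
    int good = 0;
    for (const auto& pair : knn)
        if (pair.size() == 2 && pair[0].distance < 0.7f * pair[1].distance)
            ++good;

    return good > 15;  // arbitrary threshold for this sketch
}
```

The two cv::Mat inputs would come from cv::imread for the locally stored reference and from a UIImage-to-cv::Mat conversion for the camera capture.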
I have an MP3 file and need to continuously detect and show the Hz value (pitch) of the MP3 as it plays. A bit of googling shows I have two options: use an FFT or use Apple's Accelerate framework. Unfortunately, I haven't found an easy-to-use sample of either. All the samples, like aurioTouch, need tons of code just to get a simple number out of a sample buffer. Is there any easy example of pitch detection for iOS?
For example, I've found https://github.com/clindsey/pkmFFT, but it's missing some files that its author has removed. Is there anything similar that works?
I'm afraid not. Working with sound is generally hard, and Core Audio is no exception. Now to the matter at hand.
The FFT is an algorithm for transforming input from the time domain to the frequency domain. It is not necessarily tied to sound processing; you can use it for things other than sound as well.
Accelerate is an Apple-provided framework which, among many other things, offers an FFT implementation. So you don't actually have two options there, just one algorithm and an implementation of it.
Now, depending on what you want to do (e.g. whether you favour speed over accuracy, or robustness over simplicity) and the type of waveform you have (simple, complex, human speech, music), the FFT may not be enough on its own, or may not even be the right choice for your task. There are other options: autocorrelation, zero-crossing, cepstral analysis, and maximum likelihood, to mention a few. None are trivial except zero-crossing, which also gives the poorest results and fails on complex waveforms.
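To make one of those options concrete, here is a bare-bones autocorrelation pitch estimator. It is only a sketch: no windowing, no normalization per lag, no interpolation, no voicing decision, and the 60 Hz to 1000 Hz search range is an assumption you would adjust for your material.

```cpp
#include <cstddef>
#include <vector>

// Estimate the fundamental frequency of one buffer of mono samples using
// plain autocorrelation. Returns 0 if no plausible pitch is found.
double estimatePitchHz(const std::vector<float>& samples, double sampleRate) {
    const std::size_t n = samples.size();
    // Search roughly 60 Hz .. 1000 Hz, i.e. lags of sampleRate/1000 .. sampleRate/60.
    const std::size_t minLag = static_cast<std::size_t>(sampleRate / 1000.0);
    const std::size_t maxLag = static_cast<std::size_t>(sampleRate / 60.0);
    if (n <= maxLag || minLag == 0) return 0.0;

    double bestCorr = 0.0;
    std::size_t bestLag = 0;
    for (std::size_t lag = minLag; lag <= maxLag; ++lag) {
        double corr = 0.0;
        for (std::size_t i = 0; i + lag < n; ++i)
            corr += static_cast<double>(samples[i]) * samples[i + lag];
        if (corr > bestCorr) { bestCorr = corr; bestLag = lag; }
    }
    return bestLag ? sampleRate / static_cast<double>(bestLag) : 0.0;
}
```

The links below discuss the refinements (windowing, thresholds, interpolation) that turn something like this into a usable detector.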
Here is a good place to start:
http://blog.bjornroche.com/2012/07/frequency-detection-using-fft-aka-pitch.html
There are also other questions on SO.
However, as indicated by other answers, this is not something that can just be "magically" done. Even if you license code from someone (e.g., iZotope and z-plane both make excellent code for doing what you want to do), you still need to understand what's going on to get data in and out of their libraries.
If you need fast pitch detection, go with http://www.schmittmachine.com/dywapitchtrack.html
You'll find iOS sample code inside.
If you need an FFT, you should use Apple's Accelerate framework.
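For the Accelerate route, a minimal magnitude-spectrum sketch (callable from C, C++, or Objective-C on iOS) looks roughly like this. It assumes a power-of-two buffer of mono floats, recreates the FFT setup on every call for brevity (you would cache it), and picking the single loudest bin as "the pitch" only works for clean, single-tone material.

```cpp
#include <Accelerate/Accelerate.h>
#include <cmath>
#include <vector>

// Returns the frequency (Hz) of the strongest FFT bin for a power-of-two
// buffer of mono float samples. Crude, but it shows the vDSP call sequence.
double peakFrequencyHz(const std::vector<float>& samples, double sampleRate) {
    const vDSP_Length n = samples.size();  // must be a power of two
    const vDSP_Length log2n =
        static_cast<vDSP_Length>(std::log2(static_cast<double>(n)));

    FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);

    std::vector<float> real(n / 2), imag(n / 2), mags(n / 2);
    DSPSplitComplex split = { real.data(), imag.data() };

    // Pack the real signal into split-complex form, run an in-place real FFT,
    // then compute squared magnitudes per bin.
    vDSP_ctoz(reinterpret_cast<const DSPComplex*>(samples.data()), 2, &split, 1, n / 2);
    vDSP_fft_zrip(setup, &split, 1, log2n, FFT_FORWARD);
    vDSP_zvmags(&split, 1, mags.data(), 1, n / 2);

    // Find the loudest bin, skipping bin 0 (it packs the DC/Nyquist terms).
    float peakValue = 0.0f;
    vDSP_Length peakIndex = 0;
    vDSP_maxvi(mags.data() + 1, 1, &peakValue, &peakIndex, n / 2 - 1);

    vDSP_destroy_fftsetup(setup);
    return (peakIndex + 1) * sampleRate / static_cast<double>(n);
}
```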
Hope this helps.
I would like to start on a Chinese handwriting-recognition program for the iPhone, but I couldn't find any library or API to help me do so. It's hard for me to write the algorithm myself given my time constraints.
Some suggestions recommended that I use a back-end server to do the recognition work, but I don't know how to set up that kind of server.
So, any suggestions or basic steps that could help me achieve this personal project?
You might want to check out Zinnia. Tegaki relies on other APIs (Zinnia is one of them) to do the actual character recognition.
I haven't looked at the code, but I gather it's written in C or C++, so it should suit your needs better than Tegaki.
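For a feel of the shape of the API: as I recall from Zinnia's documentation, recognition from a drawn character looks roughly like the sketch below. The model path, canvas size, and stroke coordinates are placeholders, and you should verify the details against the actual zinnia.h header, since I may be misremembering specifics.

```cpp
#include <zinnia.h>
#include <cstdio>

int main() {
    // Load a pre-trained handwriting model (path is a placeholder).
    zinnia::Recognizer *recognizer = zinnia::Recognizer::create();
    if (!recognizer->open("handwriting-zh_CN.model")) {
        std::fprintf(stderr, "cannot open model file\n");
        return 1;
    }

    // Describe the drawn character: canvas size plus (stroke, x, y) points.
    zinnia::Character *character = zinnia::Character::create();
    character->clear();
    character->set_width(300);
    character->set_height(300);
    character->add(0, 51, 29);   // stroke 0, sample points (placeholders)
    character->add(0, 117, 41);
    character->add(1, 99, 65);   // stroke 1

    // Ask for the 10 best candidate characters with confidence scores.
    zinnia::Result *result = recognizer->classify(*character, 10);
    for (size_t i = 0; result && i < result->size(); ++i)
        std::printf("%s\t%f\n", result->value(i), result->score(i));

    delete result;
    delete character;
    delete recognizer;
    return 0;
}
```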
I'm building an app that has a requirement for really accurate positional audio, down to the level of modelling inter-aural time difference (ITD), the slight delay difference between stereo channels that varies with a sound's position relative to a listener. Unfortunately, the iPhone's implementation of OpenAL doesn't have this feature, nor is a delay Audio Unit supplied in the SDK.
After a bit of reading around, I've decided that the best way to approach this problem is to implement my own delay by manipulating an AudioQueue (I can also see some projects in my future which may require learning this stuff, so this is as good an excuse to learn as any). However, I don't have any experience in low-level audio programming at all, and certainly none with AudioQueue. Trying to learn both:
a) the general theory of audio processing
and
b) the specifics of how AudioQueue implements that theory
is proving far too much to take in all at once :(
So, my questions are:
1) where's a good place to start learning about DSP and how audio generation and processing works in general (down to the level of how audio data is structured in memory, how mixing works, that kinda thing)?
2) what's a good way to get a feel for how AudioQueue does this? Are there any good examples of how to get it reading from a generated ring buffer, rather than just fetching bits of a file on-demand with AudioFileReadPackets, like Apple's SpeakHere example does?
and, most importantly
3) is there a simpler way of doing this that I've overlooked?
I think Richard Lyons' "Understanding Digital Signal Processing" is widely revered as a good starter DSP book, though it's all math and no code.
If timing is so important, you'll likely want to use the Remote I/O audio unit, rather than the higher-latency audio queue. Some of the audio unit examples may be helpful to you here, like the "aurioTouch" example that uses the Remote I/O unit for capture and performs an FFT on it to get the frequencies.
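Whichever API you end up with (a Remote I/O render callback or an AudioQueue buffer callback), the ITD itself is just a small per-channel delay. Here is a plain C++ sketch, independent of Core Audio, that delays one channel of an interleaved stereo buffer by a whole number of samples using a tiny ring buffer. The delay value in the usage comment is illustrative, and a finer implementation would interpolate for fractional-sample delays.

```cpp
#include <cstddef>
#include <vector>

// Delays one channel of an interleaved stereo stream by a fixed number of
// samples, which is essentially what an inter-aural time difference (ITD) is.
class ITDDelay {
public:
    // delaySamples: how many samples the chosen channel lags behind the other.
    // delayRight:   true to delay the right channel, false for the left.
    ITDDelay(std::size_t delaySamples, bool delayRight)
        : history_(delaySamples, 0.0f), writeIndex_(0), delayRight_(delayRight) {}

    // buffer holds interleaved stereo frames: L0 R0 L1 R1 ...
    void process(float *buffer, std::size_t frameCount) {
        if (history_.empty()) return;  // zero delay: nothing to do
        const std::size_t channelOffset = delayRight_ ? 1 : 0;
        for (std::size_t i = 0; i < frameCount; ++i) {
            float &sample = buffer[2 * i + channelOffset];
            const float delayed = history_[writeIndex_];   // oldest stored sample
            history_[writeIndex_] = sample;                // store the current sample
            sample = delayed;                              // output the delayed one
            writeIndex_ = (writeIndex_ + 1) % history_.size();
        }
    }

private:
    std::vector<float> history_;   // ring buffer holding the delayed samples
    std::size_t writeIndex_;
    bool delayRight_;
};

// Example: at 44.1 kHz, ~30 samples is roughly the maximum natural ITD
// (~0.68 ms) for a source fully to one side of the head.
// ITDDelay itd(30, /*delayRight=*/true);
// itd.process(interleavedBuffer, frameCount);
```

Running something like this inside a render callback keeps the delay sample-accurate, which is the whole point of going lower-level than OpenAL here.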
If the built-in AL isn't going to do it for you, I think you've opted into the "crazy hard" level of difficulty.
Sounds like you should probably be on the coreaudio-api list (lists.apple.com), where Apple's Core Audio engineers hang out.
Another great resource for learning the fundamental basics of DSP and their applications is The Scientist and Engineer's Guide to Digital Signal Processing by Steven W. Smith. It is available online for free at http://www.dspguide.com/ but you can also order a printed copy.
I really like how the author builds up the fundamental theory in a way that is very palatable.
Furthermore, you should check out the Core Audio Public Utility which you'll find at /Developer/Extras/CoreAudio/PublicUtility. It covers a lot of the basic structures you'll need to get in place in order to work with CoreAudio.