I want to interpret an audio file on my Raspberry Pi. Does anybody have experience with this? (Audio interpreter) - raspberry-pi

I want to use a radio scanner to listen to some frequencies. The audio output should go into my Raspberry Pi, where I want to interpret it.
For example, the scanner detects a frequency -> somebody says: "Hello, World." -> I want to display "Hello, World" on my monitor.
Later I want to process the text further.
Can anybody tell me more about possible software/hardware solutions?
Are there, for example, libraries or templates for a use case like this?
I'm using a Raspberry Pi 4B!
Thank you!

To offer some perspective:
Remove noise: use noise-reduction software or algorithms to remove noise from the audio and isolate the human voice.
Extract speech: use speech-recognition software or algorithms to extract speech from the audio (a minimal sketch follows below), for example:
https://aws.amazon.com/cn/transcribe
https://cloud.google.com/speech-to-text
Note that the effectiveness of these tools may vary depending on the quality of the input audio, the language, the speaker's voice, and other factors, so it may take some experimentation to find the best tool for your scenario.
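For a quick start on the Pi, here is a minimal sketch, assuming the scanner's audio output is wired into a USB sound card and that the SpeechRecognition and PyAudio packages are installed; recognize_google uses Google's free web API rather than the Cloud Speech-to-Text service linked above:

```python
# Minimal sketch: capture audio from the default input device and transcribe it.
# Assumes: pip install SpeechRecognition pyaudio
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    # Sample the ambient noise floor so quiet static is not treated as speech
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("Listening...")
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio)
    print(text)  # e.g. "hello world"
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"API request failed: {e}")
```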

Related

Watson 'Speech to text' not recognizing microphone input properly

I am using the Unity SDK provided for IBM Watson services. I am trying the 'ExampleStreaming.cs' sample provided for speech-to-text recognition, and I test the app in the Unity editor.
This sample uses the microphone as audio input and returns results for the user's voice input. However, when I use the microphone as input, the transcribed results are far from correct. When I say "Create a black box", the transcribed words are completely unrelated to the input.
When I use pre-recorded voice clips, the output is perfect.
Does the service perform poorly for Indian accents?
What is the reason for poor microphone input recognition?
The docs say:
"In general, the service is sensitive to background noise. For instance, engine noise, working devices, street noise, and talking can significantly reduce accuracy. In addition, the microphones that are typically installed on mobile devices and tablets are often inadequate. The service performs best when professional microphones are used to capture audio with better quality."
I use a Logitech headset mic as the input source.
Satish,
Try to "clean up" the audio as best you can - by limiting background noise. Also be aware that you can use one of two different processing models - one for broadband and one for narrowband. Try them both, and see which is most appropriate for your input device.
In addition, you may find that the underlying speech model does not handle all of the domain-specific terms you are looking for. In these cases you can customize and expand the speech model, as explained in the documentation on Using Custom Language Models (https://console.bluemix.net/docs/services/speech-to-text/custom.html#custom). While this is a bit more involved, it can often make a huge difference in accuracy and overall usability.
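The question uses the Unity SDK, but model selection looks the same in any of the SDKs; here is a rough sketch with the ibm-watson Python SDK, where the API key, service URL, and file name are placeholders:

```python
# Rough sketch using the ibm-watson Python SDK (pip install ibm-watson).
# The API key, service URL, and file name below are placeholders.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")
service = SpeechToTextV1(authenticator=authenticator)
service.set_service_url("YOUR_SERVICE_URL")

with open("command.wav", "rb") as audio:
    result = service.recognize(
        audio=audio,
        content_type="audio/wav",
        # Try both: broadband for >= 16 kHz audio (typical headset mics),
        # narrowband for 8 kHz audio (telephony).
        model="en-US_BroadbandModel",
    ).get_result()

print(result["results"][0]["alternatives"][0]["transcript"])
```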

How to do high-end video encoding on BeagleBone Black

As we know, the BeagleBone Black doesn't have a DSP on its SoC specifically for video processing, but is there any way to achieve that by adding an external DSP board?
I mean, the Raspberry Pi has video processing, so has anyone tried to integrate the two boards to get both capabilities working together?
I know it's not the optimal way and the two boards are quite different, but I have only one BBB and one Raspberry Pi, and I am trying to achieve 1080p video streaming with better quality.
There is no DSP on the BeagleBone Black, so you need to implement the DSP functions in software.
If your input is audio, you can use ALSA.
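For example, a minimal capture sketch in Python, assuming the sounddevice package (which talks to ALSA through PortAudio on Linux):

```python
# Minimal ALSA capture sketch. Assumes: pip install sounddevice numpy
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 44100  # Hz
DURATION = 5         # seconds

# Record from the default ALSA capture device into a float32 numpy array
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, dtype="float32")
sd.wait()  # block until the recording is finished
print(f"Captured {audio.shape[0]} samples, peak level {np.abs(audio).max():.3f}")
```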
When you say "dont have a DSP on SoC specific for the Video processing" - I think you mean what is usually called a VPU (Video Processing Unit), and indeed Beaglebone Black's AM3358 processor doesn't have it (source: http://www.ti.com/lit/gpn/am3358)
x264 has ARM NEON optimizations, so it can encode video reasonably well in software; 640x480 @ 30fps should be fine, but 1920x1080 @ 30fps is likely out of reach (you may get 8-10fps).
On Raspberry Pi, you can use gstreamer with omxh264enc to take advantage of the onboard VPU to encode video. I think it is a bit rough (not as solid as raspivid etc) but this should get you started: https://blankstechblog.wordpress.com/2015/01/25/hardware-video-encoding-progess-with-the-raspberry-pi/
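As a hedged sketch of what such a pipeline might look like (element names and properties depend on your GStreamer build, and the capture device path is an assumption):

```python
# Sketch of a VPU-backed encode on the Pi, launched from Python.
# omxh264enc is the Pi's hardware H.264 encoder element; availability of
# elements depends on your GStreamer build, and /dev/video0 is an assumption.
import subprocess

pipeline = [
    "gst-launch-1.0",
    "v4l2src", "device=/dev/video0", "!",
    "video/x-raw,width=1920,height=1080,framerate=30/1", "!",
    "videoconvert", "!",
    "omxh264enc", "!",
    "h264parse", "!",
    "matroskamux", "!",
    "filesink", "location=capture.mkv",
]
subprocess.run(pipeline, check=True)
```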

How to encode live broadcasts of local FM radio stations

We are in the midst of the research stage for our upcoming web project. We would like to make a website that streams (all) the local FM radio stations.
While researching the right tools to set up the site, several questions have arisen:
1. What software do we need to encode the live broadcasts of (all) the local FM radio stations? How can we connect to the FM radio stations?
2. Do we need a Virtual Private Server to run the software from question 1, 24/7? Can a VPS run software 24/7?
3. If we manage to encode the live broadcasts of (all) the local FM radio stations, how do we send them to our website? Can we use a simple audio player such as QuickTime/Flash or an HTML5 audio player and embed it in our website?
I hope someone can help us with this. Your help is greatly appreciated. :)
Audio Capture
The first thing you need to do is set up an encoder source for your streams. I highly recommend putting the encoder at each radio station. The quality of FM radio isn't the greatest. You will get much better audio quality at the station. In addition, at least here in the US, many radio stations have all of their studios in the same place. It isn't uncommon to find 8 stations all coming from the same set of offices. Therefore, you may only have to install equipment in 3 or 4 buildings to cover all the stations in your market.
Most stations these days are using digital mixing. Buy a sound card that has a compatible digital input. AES/EBU and S/PDIF are common, and sound cards that support these are affordable.
If you must capture audio over the air, make sure you are using high quality receivers (digital where available), with a high quality outdoor antenna. There are a variety of receivers you can purchase, many of which mount directly in a rack.
Encoding
Now for the actual encoding, you need software. I've always had good luck with EdCast (if you can find the version prior to "EdCast Reborn"). SAM is a good choice for stations that have their own music library they need to manage, but I don't suggest it in your case. You can even use VLC for this part.
You will need to pick a good codec. If you want compatibility with HTML5, you will want to encode in MP3 and Ogg Vorbis. aacPlus is a good choice for saving bandwidth while still providing a decent audio quality. Most stations these days use aacPlus when possible, but not all browsers can play it, which is why you also need the other two. You can (and should) use multiple codecs per station.
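As one hedged example of the VLC route, you could transcode an ALSA capture to MP3 and push it to an Icecast mount; the host, password, and mount point below are placeholders, VLC must be built with libshout support for the "shout" output, and you would run one such process per codec:

```python
# Sketch: use VLC as the encoder, pushing a 128 kbps MP3 stream to Icecast.
# Host, password, and mount point are placeholders.
import subprocess

sout = (
    "#transcode{acodec=mp3,ab=128,channels=2,samplerate=44100}"
    ":std{access=shout,mux=raw,"
    "dst=source:hackme@icecast.example.com:8000/station1.mp3}"
)
subprocess.run(
    ["cvlc", "-I", "dummy", "alsa://hw:0", "--sout", sout],
    check=True,
)
```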
Server Software
I highly recommend Icecast or SHOUTcast. They take your encoded audio and distribute it to listeners. They serve up an HTTP-like stream, which is generally compatible. If you are interested, I also do Icecast/SHOUTcast compatible hosting, with the goal of being compatible with more devices, particularly mobile.
Playback
Many stations these days use a player that tries HTML5, and falls back to Flash if necessary. jPlayer is a common choice, but there are many others. It is also good to provide a link to a playlist file containing the URL of your stream, so that users can listen in their own audio player if they choose.
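A minimal sketch of that playback setup: this little script writes a page with an HTML5 audio element offering both MP3 and Ogg sources, plus an .m3u playlist file so users can open the stream in their own player (the stream URLs are placeholders):

```python
# Sketch: generate a simple playback page and a playlist file.
# The stream URLs are placeholders for your Icecast/SHOUTcast mounts.
STREAM_MP3 = "http://icecast.example.com:8000/station1.mp3"
STREAM_OGG = "http://icecast.example.com:8000/station1.ogg"

page = f"""<!DOCTYPE html>
<html>
  <body>
    <audio controls>
      <source src="{STREAM_MP3}" type="audio/mpeg">
      <source src="{STREAM_OGG}" type="audio/ogg">
      Your browser does not support HTML5 audio.
    </audio>
    <p><a href="station1.m3u">Listen in your own player</a></p>
  </body>
</html>
"""

with open("index.html", "w") as f:
    f.write(page)
with open("station1.m3u", "w") as f:
    f.write(STREAM_MP3 + "\n")  # an .m3u is just a list of stream URLs
```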

Is number recognition on iPhone possible in real-time?

I need to recognise numbers from the camera image on iPhone, in real-time. I know there will be no more than 5 digits on the image.
Is this problem realistic to solve given the computational specifications of the iPhone?
Does anyone have experience using the Tesseract OCR library, and do you think the problem could be solved with it?
That depends on your definition of "real-time", but yes, it should be possible to do relatively fast recognition of just the digits 0-9 on an iPhone 4, particularly if you can constrain the fonts, lighting conditions, etc. that they will appear in.
I highly recommend reading the article on how Sudoku Grab does its recognition of puzzles using the iPhone camera. In their case, a trained neural network was used to identify the digits, which should be reasonably simple and fast on modern iOS hardware.
The current recognition libraries out there, like OpenCV, will use the iPhone's CPU to do the processing. I've heard that they can do even more complex tasks like facial recognition fast enough to use with video sources while showing a minimal amount of stutter.
For even better performance, I believe that there's a lot of potential in the programmable GPUs on the newer iOS devices. In my benchmarks, I saw a 14X - 28X speedup when using the iPhone 4's GPU for simple image processing. While few people are looking at this right now, something like Sudoku Grab's neural network should be a parallel enough process to benefit from running on the GPU.
It should be computationally possible. There are apps that can read a bar code in real time, and also an app that does real-time translation (Word Lens). I'm not sure what libraries they use, however.
Yes, it is possible using the Tesseract engine.
Here is some sample code if you'd like to check it out:
https://github.com/nolanbrown/Tesseract-iPhone-Demo
There is a free SDK for that: http://rtrsdk.com/ It supports both iOS and Android, works in real time, and helps you capture any text; numbers should not be a problem.
Disclaimer: I work for ABBYY
Yes. Bender can help you with that. It lets you build and run neural nets on iOS. As it uses Metal under the hood, it runs fast and smoothly. It also supports running TensorFlow models directly.
So you can run an existing TensorFlow model trained for digit recognition in Bender; see "Handwritten Digit Recognition using Convolutional Neural Networks in Python with Keras" if you need help.
Disclaimer: I worked on this project.
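For reference, here is a condensed sketch of the kind of Keras CNN that tutorial covers, trained on MNIST digits; the resulting TensorFlow model is what you would then convert for on-device inference:

```python
# Condensed sketch of an MNIST digit-recognition CNN in Keras.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis].astype("float32") / 255.0
x_test = x_test[..., np.newaxis].astype("float32") / 255.0

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),  # one class per digit 0-9
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.1)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
```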

iPhone: CPU power to do DSP/Fourier transform/frequency domain?

I want to analyze mic audio on an ongoing basis (not just a snippet or prerecorded sample), display a frequency graph, and filter out certain aspects of the audio. Is the iPhone powerful enough for that? I suspect the answer is yes, given the Google and iPhone voice recognition, Shazam and other music-recognition apps, and guitar-tuner apps out there. However, I don't know what limitations I'll have to deal with.
Anyone play around with this area?
Apple's sample code aurioTouch has an FFT implementation.
The apps that I've seen do some sort of music/voice recognition need an internet connection, so it's highly likely that they just do some sort of feature extraction on the audio and send those features over HTTP so the recognition happens on the server.
In any case, frequency graphs and filtering have been done before on lesser CPUs a dozen years ago. The iPhone should be no problem.
"Fast enough" may be a function of your (or your customer's) expectations on how much frequency resolution you are looking for and your base sample rate.
An N-point FFT is on the order of N*log2(N) computations, so if you don't have enough MIPS, reducing N is a potential area of concession for you.
In many applications, sample rate is a non-negotiable, but if it was, this would be another possibility.
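A small numeric sketch of that trade-off (plain NumPy, not iPhone code): the frequency resolution of an N-point FFT is sample_rate/N, while the cost grows roughly as N*log2(N):

```python
# Frequency resolution vs. FFT size: bin width is fs/N, cost ~ N*log2(N).
import numpy as np

fs = 44100  # sample rate in Hz
for N in (512, 1024, 4096):
    t = np.arange(N) / fs
    signal = np.sin(2 * np.pi * 440.0 * t)   # a 440 Hz test tone
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    peak = freqs[np.argmax(spectrum)]        # nearest bin to 440 Hz
    print(f"N={N:5d}: bin width {fs / N:7.2f} Hz, "
          f"~{N * np.log2(N):9.0f} ops, peak at {peak:7.2f} Hz")
```

With N=512 the bins are about 86 Hz wide, so the 440 Hz tone lands on a noticeably offset bin; quadrupling N sharpens the resolution at roughly 4x the per-frame cost.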
I made an app that calculates the FFT live:
http://www.itunes.com/apps/oscope
You can find my code for the FFT on GitHub (although it's a little rough):
http://github.com/alexbw/iPhoneFFT
Apple's new iPhone OS 4.0 SDK allows for built-in computation of the FFT with the "Accelerate" library, so I'd definitely start working with the new OS if it's a central part of your app's functionality.
You can't just port FFT code written in C into your app as-is: the Thumb compiler option complicates floating-point arithmetic. You need to compile that code in ARM mode.