In Unity, how to segment the user's voice from microphone based on loudness? - unity3d

I need to collect voice segments from a continuous audio stream, and later process the piece of the user's voice that has just been said (not for speech recognition). What I am focusing on is only segmenting the voice based on its loudness.
If, after at least 1 second of silence, the voice becomes loud enough for a while and then goes silent again for at least 1 second, I consider that a sentence and want to segment the audio there.
I know I can get raw audio data from the AudioClip created by Microphone.Start(). I want to write some code like this:
void Start()
{
    audio = Microphone.Start(deviceName, true, 10, 16000);
}

void Update()
{
    audio.GetData(fdata, 0);
    for (int i = 0; i < fdata.Length; i++) {
        u16data[i] = Convert.ToUInt16(fdata[i] * 65535);
    }
    // ... Process u16data
}
But what I'm not sure about is:
1. Every frame, when I call audio.GetData(fdata, 0), do I get the latest 10 seconds of sound data if fdata is big enough, or a shorter window if fdata is smaller? Is that right?
2. fdata is a float array, but what I need is a 16 kHz, 16-bit PCM buffer. Is it right to convert the data like u16data[i] = fdata[i] * 65535?
3. What is the right way to detect loud moments and silent moments in fdata?

No. You have to read starting at the current position within the AudioClip, using Microphone.GetPosition:
"Get the position in samples of the recording."
and pass the obtained index to AudioClip.GetData:
"Use the offsetSamples parameter to start the read from a specific position in the clip."

fdata = new float[audio.samples * audio.channels];
var currentIndex = Microphone.GetPosition(null);   // position the microphone is currently writing to
audio.GetData(fdata, currentIndex);
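For continuous segmentation you usually only want the samples that were recorded since the last frame rather than the whole 10-second clip every time. A minimal sketch of that idea (the prevPos field, the ReadNewSamples name and the use of the default device via null are illustrative assumptions, not part of the original answer), relying on the fact that both the microphone position and AudioClip.GetData wrap around the looping clip:

// Sketch only: read just the samples captured since the previous call.
private int prevPos = 0;

void ReadNewSamples(AudioClip clip)
{
    int micPos = Microphone.GetPosition(null);       // null = default microphone
    if (micPos == prevPos) return;                   // nothing new recorded yet

    int newSamples = micPos - prevPos;
    if (newSamples < 0) newSamples += clip.samples;  // the circular recording wrapped around

    float[] chunk = new float[newSamples * clip.channels];
    clip.GetData(chunk, prevPos);                    // GetData wraps past the clip end
    prevPos = micPos;

    // ... pass "chunk" on to the conversion / segmentation steps below
}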
I don't understand what exactly you need this conversion for. fdata will contain
"floats ranging from -1.0f to 1.0f" (AudioClip.GetData),
so if for some reason you need signed values between short.MinValue (= -32768) and short.MaxValue (= 32767), then yes, you can do that, but into a short[] buffer (called s16data here), since a UInt16 cannot hold negative values:

s16data[i] = Convert.ToInt16(fdata[i] * short.MaxValue);

Note however that Convert.ToInt16(float) returns the
"value, rounded to the nearest 16-bit signed integer. If value is halfway between two whole numbers, the even number is returned; that is, 4.5 is converted to 4, and 5.5 is converted to 6."
You might rather want to use Mathf.RoundToInt first so that e.g. 4.5 is rounded up:

s16data[i] = Convert.ToInt16(Mathf.RoundToInt(fdata[i] * short.MaxValue));

Your naming, however, suggests that you are actually trying to get unsigned values, ushort (UInt16). For these you cannot have negative values, so you have to shift the float values up in order to map the range (-1.0f | 1.0f) to the range (0.0f | 1.0f) before multiplying by ushort.MaxValue (= 65535):

u16data[i] = Convert.ToUInt16(Mathf.RoundToInt((fdata[i] + 1f) / 2f * ushort.MaxValue));
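Since the goal in the question is a 16 kHz, 16-bit PCM buffer, which is normally signed and little-endian, the complete conversion might look roughly like the sketch below (the ToPcm16 helper and the byte packing are illustrative assumptions, not part of the original answer):

// Sketch: convert Unity's float samples (-1.0f .. 1.0f) into signed 16-bit
// little-endian PCM bytes, which is what most "16-bit PCM" consumers expect.
static byte[] ToPcm16(float[] fdata)
{
    byte[] pcm = new byte[fdata.Length * 2];
    for (int i = 0; i < fdata.Length; i++)
    {
        float clamped = Mathf.Clamp(fdata[i], -1f, 1f);               // guard against overshoot
        short s = (short)Mathf.RoundToInt(clamped * short.MaxValue);
        pcm[2 * i]     = (byte)(s & 0xFF);                            // low byte first (little-endian)
        pcm[2 * i + 1] = (byte)((s >> 8) & 0xFF);
    }
    return pcm;
}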
What you receive from AudioClip.GetData are the amplitude values of the audio signal, between -1.0f and 1.0f.
So a "loud" moment would be one where

Mathf.Abs(fdata[i]) >= aCertainLoudThreshold;

and a "silent" moment one where

Mathf.Abs(fdata[i]) <= aCertainSilentThreshold;

where aCertainSilentThreshold might e.g. be 0.2f and aCertainLoudThreshold might e.g. be 0.8f.
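Putting the thresholds together with the "at least 1 second of silence" rule from the question, the segmentation itself could look roughly like the sketch below. Everything in it is an illustrative assumption (the names, the threshold values, the per-chunk averaging): averaging the absolute values over a chunk smooths out brief spikes, so the averaged thresholds need to be lower than the per-sample values above and must be tuned for the actual microphone.

// Sketch of the segmentation idea (requires System.Collections.Generic and UnityEngine):
// a sentence starts once a chunk's mean level rises above the "loud" threshold and
// ends after at least one second in which the level stays below the "silent" threshold.
const float LoudThreshold   = 0.1f;   // assumed value, tune for your microphone
const float SilentThreshold = 0.02f;  // assumed value, tune for your microphone
const float SilenceSeconds  = 1.0f;   // from the question: 1 s of silence ends a sentence

float silentTime = 0f;
bool inSentence = false;
readonly List<float> sentence = new List<float>();

// chunk: mono samples in -1..1 (e.g. the "chunk" read above); sampleRate: e.g. 16000
void ProcessChunk(float[] chunk, int sampleRate)
{
    if (chunk.Length == 0) return;

    float sum = 0f;
    for (int i = 0; i < chunk.Length; i++) sum += Mathf.Abs(chunk[i]);
    float level = sum / chunk.Length;                   // mean absolute amplitude
    float duration = (float)chunk.Length / sampleRate;  // chunk length in seconds

    if (level >= LoudThreshold)
    {
        inSentence = true;        // speech detected
        silentTime = 0f;
    }
    else if (level <= SilentThreshold)
    {
        silentTime += duration;   // accumulate silence
    }

    if (inSentence) sentence.AddRange(chunk);

    if (inSentence && silentTime >= SilenceSeconds)
    {
        // Sentence complete: hand "sentence" (e.g. converted via ToPcm16 above) to the next step.
        sentence.Clear();
        inSentence = false;
        silentTime = 0f;
    }
}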

Related

Gstreamer appsrc seeking video being grabbed over http by chunks

I am making a media player for watching videos recorded on a server (which I also made).
The videos are just big daily .ts files, and sending the whole file is not an option. So I added an HTTP request that responds with ~20 seconds of video data.
The HTTP request contains a byte offset into the video, so the server can seek and read quickly.
On the client side the data is pushed into appsrc and displayed by the pipeline.
appsrc's properties:
duration is set correctly (with an error of less than half a second),
stream-type is set to GST_APP_STREAM_TYPE_SEEKABLE so I can perform seeks (seek-data signals).
appsrc has the 'need-data' and 'seek-data' signals connected, and an offset into the file is remembered.
'need-data' uses the offset to request the next chunk of video (and adds the size of the received data to the offset so the next chunk can be requested later).
'seek-data' changes the offset if a seek was requested.
If I watch the video from the start, everything is fine and chunks are grabbed one after another. But if I try to perform a seek, the problems start.
For example, seek function:
// pipeline is defined outside of this function
void Seek(gint64 offset_ns)
{
    // current position, duration of video and seek position
    gint64 pos_ns, dur_ns, seek_ns;

    dur_ns = GetCurrentVideoDuration();
    gst_element_query_position(pipeline, GST_FORMAT_TIME, &pos_ns);

    seek_ns = pos_ns + offset_ns;
    if (seek_ns < 0)
        seek_ns = 0;

    gst_element_seek(pipeline, 1.0, GST_FORMAT_TIME,
                     GST_SEEK_FLAG_ACCURATE | GST_SEEK_FLAG_FLUSH,
                     GST_SEEK_TYPE_SET, seek_ns,
                     GST_SEEK_TYPE_SET, dur_ns);
}
After this function is called, 'seek-data' is invoked:

gboolean seek_data_callback(GstElement *appsrc, guint64 offset, gpointer udata)
{
    // remember the requested offset
    lastOffset = offset;
    return TRUE;
}
Let's say pos_ns was 5000000000 (5 seconds: 5 * GST_SECOND)
and offset_ns was 30000000000 (30 seconds: 30 * GST_SECOND),
so seek_ns = pos_ns + offset_ns = 35 * GST_SECOND.
With this, lastOffset should increase, but sometimes it decreases, sometimes it increases, and sometimes it equals the duration, so it looks like I'm missing something.
I'm not sure how the offset is calculated in GStreamer, and I don't know whether it is possible to calculate that offset myself.
What could the problem be?

AVAudioPCMBuffer built programmatically, not playing back in stereo

I'm trying to fill an AVAudioPCMBuffer programmatically in Swift to build a metronome. This is the first real app I'm trying to build, so it's also my first audio app. Right now I'm experimenting with different frameworks and methods of getting the metronome looping accurately.
I'm trying to build an AVAudioPCMBuffer with the length of a measure/bar so that I can use the .Loops option of AVAudioPlayerNode's scheduleBuffer method. I start by loading my file (2 ch, 44100 Hz, Float32, non-interleaved; *.wav and *.m4a both have the same issue) into a buffer, then copying that buffer frame by frame, separated by empty frames, into the barBuffer. The loop below is how I'm accomplishing this.
If I schedule the original buffer to play, it plays back in stereo, but when I schedule the barBuffer, I only get the left channel. As I said, I'm a beginner at programming and have no experience with audio programming, so this might be my lack of knowledge of 32-bit float channels or of the data type UnsafePointer<UnsafeMutablePointer<Float>>. When I look at the floatChannelData property in Swift, the description makes it sound like this should be copying two channels.
var j = 0
for i in 0..<Int(capacity) {
    barBuffer.floatChannelData.memory[j] = buffer.floatChannelData.memory[i]
    j += 1
}
j += Int(silenceLengthInSamples)
// loop runs 4 times for 4 beats per bar.
edit: I removed the glaring mistake i += 1, thanks to hotpaw2. The right channel is still missing when barBuffer is played back though.
Unsafe pointers in Swift are pretty weird to get used to.
floatChannelData.memory[j] only accesses the first channel of data. To access the other channel(s), you have a couple of choices:
Using advancedBy:

// Where the current channel is at 0
// Get a channel pointer aka UnsafePointer<UnsafeMutablePointer<Float>>
let channelN = floatChannelData.advancedBy( channelNumber )
// Get the channel data aka UnsafeMutablePointer<Float>
let channelNData = channelN.memory
// Get the first two floats of channel channelNumber
let floatOne = channelNData.memory
let floatTwo = channelNData.advancedBy(1).memory
Using Subscript:

// Get the channel data aka UnsafeMutablePointer<Float>
let channelNData = floatChannelData[ channelNumber ]
// Get the first two floats of channel channelNumber
let floatOne = channelNData[0]
let floatTwo = channelNData[1]

Using the subscript is much clearer, and the step of advancing the pointer and then manually accessing memory is implicit.
For your loop, try accessing all channels of the buffer by doing something like this:

for i in 0..<Int(capacity) {
    for n in 0..<Int(buffer.format.channelCount) {
        barBuffer.floatChannelData[n][j] = buffer.floatChannelData[n][i]
    }
    j += 1  // advance the destination index once per frame, not per channel
}
Hope this helps!
This looks like a misunderstanding of Swift "for" loops. The Swift "for" loop automatically increments the "i" array index. But you are incrementing it again in the loop body, which means that you end up skipping every other sample (the Right channel) in your initial buffer.

Interpreting inputBuffer's Value in a Callback

I am basing my code on PortAudio's paex_record_file.c example. One of the parameters in the callback is inputBuffer, and I wanted to use its data to calculate other numbers with the double/float type. I changed the file extension from .raw to .txt, but Notepad still cannot read it, leading me to believe its data is not actually encoded as readable numbers. How is the data stored in inputBuffer, and how can I do arithmetic with it (add, multiply, divide, etc.)?
This is how I initialized inputParameters:
inputParameters.device = Pa_GetDefaultInputDevice(); /* default input device */
if (inputParameters.device == paNoDevice) {
fprintf(stderr,"Error: No default input device.\n");
goto error;
}
inputParameters.channelCount = 2; /* stereo input */
inputParameters.sampleFormat = paFloat32;
inputParameters.suggestedLatency = Pa_GetDeviceInfo( inputParameters.device )->defaultLowInputLatency;
inputParameters.hostApiSpecificStreamInfo = NULL;
This question is somewhat related to "print floats from audio input callback function" (unanswered).
The inputBuffer parameter to the callback is a void*. The actual type of the underlying buffer depends on the parameters and the flags that you pass to Pa_OpenStream.
If you specified paFloat32, then there will be a float* in there somewhere. However, there are two possibilities:
Interleaved: inputParameters.sampleFormat = paFloat32;
Non-Interleaved: inputParameters.sampleFormat = paFloat32|paNonInterleaved;
You specified the interleaved option. In this case, inputBuffer points to a single buffer of interleaved floats. So you can write:
float *samples = (float*)inputBuffer;
In a two channel stream samples will contain interleaved left and right samples, e.g.:
samples[0]; // first left sample
samples[1]; // first right sample
samples[2]; // second left sample
samples[3]; // second right sample
// etc.
For completeness: if it had been a non-interleaved stream, then inputBuffer would point to an array of pointers to single-channel buffers. To extract the buffer pointers you would write something like:
float *left = ((float **) inputBuffer)[0];
float *right = ((float **) inputBuffer)[1];
Note that in all cases framesPerBuffer counts frames not samples. A frame includes one sample from each channel. For example, in a stereo stream, a frame includes both the left and right channel samples.

Help with live-updating sound on the iPhone

My question is a little tricky, and I'm not exactly experienced (I might get some terms wrong), so here goes.
I'm declaring an instance of an object called "Singer". The instance is called "singer1". "singer1" produces an audio signal. Now, the following is the code where the specifics of the audio signal are determined:
OSStatus playbackCallback(void *inRefCon,
                          AudioUnitRenderActionFlags *ioActionFlags,
                          const AudioTimeStamp *inTimeStamp,
                          UInt32 inBusNumber,
                          UInt32 inNumberFrames,
                          AudioBufferList *ioData) {
    //Singer *me = (Singer *)inRefCon;
    static int phase = 0;
    for (UInt32 i = 0; i < ioData->mNumberBuffers; i++) {
        int samples = ioData->mBuffers[i].mDataByteSize / sizeof(SInt16);
        SInt16 values[samples];
        float waves;
        float volume = .5;
        for (int j = 0; j < samples; j++) {
            waves = 0;
            waves += sin(kWaveform * 600 * phase) * volume;
            waves += sin(kWaveform * 400 * phase) * volume;
            waves += sin(kWaveform * 200 * phase) * volume;
            waves += sin(kWaveform * 100 * phase) * volume;
            waves *= 32500 / 4; // <--------- make sure to divide by how many waves you're stacking
            values[j] = (SInt16)waves;
            values[j] += values[j] << 16;
            phase++;
        }
        memcpy(ioData->mBuffers[i].mData, values, samples * sizeof(SInt16));
    }
    return noErr;
}
99% of this is borrowed code, so I only have a basic understanding of how it works (I don't know about the OSStatus class or method or whatever this is). However, you see those 4 lines with 600, 400, 200 and 100 in them? Those determine the frequency. Now, what I want to do (for now) is insert my own variable in there in place of a constant, which I can change on a whim. This variable is called "fr1". "fr1" is declared in the header file, but if I try to compile I get an error about "fr1" being undeclared. Currently, my technique to fix this is the following: right beneath where I #import stuff, I add the line
fr1 = 0.0; // any number will work properly
This sort of works, as the code will compile and singer1.fr1 will actually change values if I tell it to. The problems are now these:
A) Even though this compiles and the tone specified will play (0.0 is no tone), I get the warnings "Data definition has no type or storage class" and "Type defaults to 'int' in declaration of 'fr1'". I bet this is because for some reason it's not seeing my previous declaration in the header file (as a float). However, again, if I leave this line out the code won't compile because "fr1 is undeclared".
B) Just because I change the value of fr1 doesn't mean that singer1 will update the value stored inside the "playbackCallback" function, or whatever is in charge of updating the output buffers. Perhaps this can be fixed by coding differently?
C) Even if this did work, there is still a noticeable "gap" when pausing/playing the audio, which I need to eliminate. This might mean a complete overhaul of the code so that I can "dynamically" insert new values without disrupting anything.
However, the reason I'm going through all this effort to post is that this method does exactly what I want (I can compute a value mathematically and it goes straight to the DAC, which means I can use it in the future to make triangle, square, etc. waves easily). I have uploaded Singer.h and .m to pastebin for your viewing pleasure; perhaps they will help. Sorry, I can't post 2 HTML tags, so here are the full links.
(http://pastebin.com/ewhKW2Tk)
(http://pastebin.com/CNAT4gFv)
So, TL;DR, all I really want to do is be able to define the current equation/value of the 4 waves and re-define them very often without a gap in the sound.
Thanks. (And sorry if the post was confusing or got off track, which I'm pretty sure it did.)
My understanding is that your callback function is called every time the buffer needs to be re-filled. So changing fr1..fr4 will alter the waveform, but only when the buffer updates. You shouldn't need to stop and re-start the sound to get a change, but you will notice an abrupt shift in the timbre if you change your fr values. In order to get a smooth transition in timbre, you'd have to implement something that smoothly changes the fr values over time. Tweaking the buffer size will give you some control over how responsive the sound is to your changing fr values.
Your issue with fr being undefined is due to your callback being a plain C function. Your fr variables are declared as Objective-C instance variables as part of your Singer object. They are not accessible by default.
Take a look at this project and see how he implements access to his instance variables from within his callback. Basically, he passes a reference to his instance to the callback function and then accesses instance variables through that.
https://github.com/youpy/dowoscillator
notice:
Sinewave *sineObject = inRefCon;
float freq = sineObject.frequency * 2 * M_PI / samplingRate;
and:
AURenderCallbackStruct input;
input.inputProc = RenderCallback;
input.inputProcRefCon = self;
Also, you'll want to move your callback function outside of your #implementation block, because it's not actually part of your Singer object.
You can see this all in action here: https://github.com/coryalder/SineWaver

iPhone audio analysis

I'm looking into developing an iPhone app that will potentially involve a "simple" analysis of the audio it is receiving from the standard phone mic. Specifically, I am interested in the highs and lows the mic picks up, and really everything in between is irrelevant to me. Is there an app that does this already (just so I can see what it's capable of)? And where should I look to get started on such code? Thanks for your help.
Look in the Audio Queue framework. This is what I use to get a high water mark:
AudioQueueRef audioQueue; // Imagine this is correctly set up
UInt32 dataSize = sizeof(AudioQueueLevelMeterState) * recordFormat.mChannelsPerFrame;
AudioQueueLevelMeterState *levels = (AudioQueueLevelMeterState*)malloc(dataSize);
float channelAvg = 0;
OSStatus rc = AudioQueueGetProperty(audioQueue, kAudioQueueProperty_CurrentLevelMeter, levels, &dataSize);
if (rc) {
    NSLog(@"AudioQueueGetProperty(CurrentLevelMeter) returned %d", (int)rc);
} else {
    for (int i = 0; i < recordFormat.mChannelsPerFrame; i++) {
        channelAvg += levels[i].mPeakPower;
    }
}
free(levels);
// This works because one channel always has an mAveragePower of 0.
return channelAvg;
You can get peak power in either dB Free Scale (with kAudioQueueProperty_CurrentLevelMeterDB) or simply as a float in the interval [0.0, 1.0] (with kAudioQueueProperty_CurrentLevelMeter).
Don't forget to activate level metering for AudioQueue first:
UInt32 d = 1;
OSStatus status = AudioQueueSetProperty(mQueue, kAudioQueueProperty_EnableLevelMetering, &d, sizeof(UInt32));
Check the 'SpeakHere' sample code. It will show you how to record audio using the AudioQueue API. It also contains some code to analyze the audio in real time to show a level meter.
You might actually be able to use most of that level meter code to respond to 'highs' and 'lows'.
The AurioTouch example code performs Fourier analysis on the mic input. It could be a good starting point:
https://developer.apple.com/iPhone/library/samplecode/aurioTouch/index.html
Probably overkill for your application.