I have an app that captures audio and video using AVAssetWriter. It runs a fast Fourier transform (FFT) on the audio to create a visual spectrum of the captured audio in real time.
Up until the release of the iPhone 11, this all worked fine. Users with the iPhone 11, however, are reporting that audio is not being captured at all. I have managed to narrow down the issue: the number of samples returned in captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) is either 940 or 941, whereas on previous phone models it is always 1024. I use CMSampleBufferGetNumSamples to get the number of samples. My FFT calculation relies on the number of samples being a power of 2, so it drops every frame on the newer iPhones.
Can anybody shed light on why the new iPhone 11 is returning an unusual number of samples? Here is how I have configured the AVAssetWriter:
self.videoWriter = try AVAssetWriter(outputURL: self.outputURL, fileType: AVFileType.mp4)

var videoSettings: [String: Any]
if #available(iOS 11.0, *) {
    videoSettings = [
        AVVideoCodecKey: AVVideoCodecType.h264,
        AVVideoWidthKey: Constants.VIDEO_WIDTH,
        AVVideoHeightKey: Constants.VIDEO_HEIGHT,
    ]
} else {
    videoSettings = [
        AVVideoCodecKey: AVVideoCodecH264,
        AVVideoWidthKey: Constants.VIDEO_WIDTH,
        AVVideoHeightKey: Constants.VIDEO_HEIGHT,
    ]
}

// Video input
videoWriterVideoInput = AVAssetWriterInput(mediaType: AVMediaType.video, outputSettings: videoSettings)
videoWriterVideoInput?.expectsMediaDataInRealTime = true
if (videoWriter?.canAdd(videoWriterVideoInput!))! {
    videoWriter?.add(videoWriterVideoInput!)
}

// Audio settings
let audioSettings: [String: Any] = [
    AVFormatIDKey: kAudioFormatMPEG4AAC,
    AVSampleRateKey: Constants.AUDIO_SAMPLE_RATE,           // Float(44100.0)
    AVEncoderBitRateKey: Constants.AUDIO_BIT_RATE,          // 64000
    AVNumberOfChannelsKey: Constants.AUDIO_NUMBER_CHANNELS  // 1
]

// Audio input
videoWriterAudioInput = AVAssetWriterInput(mediaType: AVMediaType.audio, outputSettings: audioSettings)
videoWriterAudioInput?.expectsMediaDataInRealTime = true
if (videoWriter?.canAdd(videoWriterAudioInput!))! {
    videoWriter?.add(videoWriterAudioInput!)
}
You can't assume a fixed sample rate: depending on the microphone and various other characteristics of the device, the hardware capture rate can differ. That doesn't play well with the FFT library I'm using (TempiFFT); to get this to work you need to detect the sample rate ahead of time.
Rather than:
let fft = TempiFFT(withSize: 1024, sampleRate: Constants.AUDIO_SAMPLE_RATE)
I need to first detect what the sample rate is when I start my AVCaptureSession, and then pass that detected value to the FFT library:
//During initialization of AVCaptureSession
audioSampleRate = Float(AVAudioSession.sharedInstance().sampleRate)
...
//Run FFT calculations
let fft = TempiFFT(withSize: 1024, sampleRate: audioSampleRate)
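If you would rather read the rate straight off the buffers you receive, a sketch along these lines should also work; the format description of each audio CMSampleBuffer carries an AudioStreamBasicDescription (this reuses the existing capture delegate method, and the variable names are only illustrative):
import AVFoundation
import CoreMedia

func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    // Ignore video buffers; only audio format descriptions carry an ASBD.
    guard let formatDescription = CMSampleBufferGetFormatDescription(sampleBuffer),
          CMFormatDescriptionGetMediaType(formatDescription) == kCMMediaType_Audio,
          let asbd = CMAudioFormatDescriptionGetStreamBasicDescription(formatDescription)?.pointee else {
        return
    }
    let detectedSampleRate = Float(asbd.mSampleRate)            // e.g. 44100 or 48000
    let sampleCount = CMSampleBufferGetNumSamples(sampleBuffer) // e.g. 1024, 940 or 941
    // ... use detectedSampleRate when creating the TempiFFT instance ...
}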
Update
On some devices you may not receive a full 1024 samples per callback (on the iPhone 11 I was getting 941), and if the FFT doesn't get the expected number of frames you may see unexpected behavior. I needed to buffer the samples returned by each callback until at least 1024 were available, and only then perform the FFT (see the sketch below).
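For completeness, here is a minimal sketch of that accumulation step, assuming mono audio and a hypothetical samples: [Float] array already extracted from each sample buffer (a simple growing buffer rather than a true ring buffer, which is fine at these sizes):
private let fftSize = 1024
private var pendingSamples: [Float] = []

func accumulate(_ samples: [Float]) {
    pendingSamples.append(contentsOf: samples)
    // Only run the FFT once a full power-of-two chunk is available.
    while pendingSamples.count >= fftSize {
        let chunk = Array(pendingSamples.prefix(fftSize))
        pendingSamples.removeFirst(fftSize)
        let fft = TempiFFT(withSize: fftSize, sampleRate: audioSampleRate)
        // ... run the forward FFT on `chunk` with `fft` and update the spectrum ...
    }
}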
Related
I'm trying to create an app that does live Video & Audio recording, using AVFoundation.
Also using AVAssetWriter I'm writing the buffers to a local file.
For the video CMSampleBuffer I'm using AVCaptureVideoDataOutputSampleBufferDelegate with an AVCaptureSession, which is straightforward.
For the Audio CMSampleBuffer I'm creating the buffer from the AudioUnit record callback.
The way I'm calculating the presentation time for the Audio buffer is like so:
var timebaseInfo = mach_timebase_info_data_t(numer: 0, denom: 0)
let timebaseStatus = mach_timebase_info(&timebaseInfo)
if timebaseStatus != KERN_SUCCESS {
    debugPrint("not working")
    return
}

let hostTime = time * UInt64(timebaseInfo.numer / timebaseInfo.denom)
let presentationTime = CMTime(value: CMTimeValue(hostTime), timescale: 1000000000)
let duration = CMTime(value: CMTimeValue(1), timescale: CMTimeScale(self.sampleRate))

var timing = CMSampleTimingInfo(
    duration: duration,
    presentationTimeStamp: presentationTime,
    decodeTimeStamp: CMTime.invalid
)
self.sampleRate is a variable that is set when recording starts, but most of the time it is 48000.
When getting the CMSampleBuffers of both the video and the audio, the presentation times have a really big difference.
Audio - CMTime(value: 981750843366125, timescale: 1000000000, flags: __C.CMTimeFlags(rawValue: 1), epoch: 0)
Video - CMTime(value: 997714237615541, timescale: 1000000000, flags: __C.CMTimeFlags(rawValue: 1), epoch: 0)
This creates a big gap when trying to write the buffers to the file.
My questions are:
Am I calculating the presentation time of the audio buffer correctly? If not, what am I missing?
How can I make sure the audio and the video land in the same region of time? (I know there should be a small, millisecond-scale difference between them.)
Ok, so this was my fault.
As Rhythmic Fistman suggested in the comments, I was getting truncation in my time calculation:
let hostTime = time * UInt64(timebaseInfo.numer / timebaseInfo.denom)
Changing to this calculation fixed it
let hostTime = (time * UInt64(timebaseInfo.numer)) / UInt64(timebaseInfo.denom)
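For reference, here is a minimal sketch of the corrected conversion wrapped in a helper, assuming time is a mach host-time value such as AudioTimeStamp.mHostTime (hostTimeToNanos is just an illustrative name):
import Darwin
import CoreMedia

func hostTimeToNanos(_ time: UInt64) -> UInt64 {
    var timebaseInfo = mach_timebase_info_data_t(numer: 0, denom: 0)
    guard mach_timebase_info(&timebaseInfo) == KERN_SUCCESS else { return time }
    // Multiply before dividing so integer division doesn't truncate the numer/denom ratio.
    return (time * UInt64(timebaseInfo.numer)) / UInt64(timebaseInfo.denom)
}

let presentationTime = CMTime(value: CMTimeValue(hostTimeToNanos(time)),
                              timescale: 1_000_000_000)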
I converted my mlmodel from tf.keras. The goal is to recognize handwritten text from an image.
When I run it using this code:
func performCoreMLImageRecognition(_ image: UIImage) {
    let model = try! HTRModel()

    // Process the input image
    let scale = image.scaledImage(200)
    let sized = scale?.resize(size: CGSize(width: 200, height: 50))
    let gray = sized?.rgb2GrayScale()
    guard let pixelBuffer = sized?.pixelBufferGray(width: 200, height: 50) else {
        fatalError("Cannot convert image to pixelBufferGray")
    }
    UIImageWriteToSavedPhotosAlbum(gray!,
                                   self,
                                   #selector(self.didFinishSavingImage(_:didFinishSavingWithError:contextInfo:)),
                                   nil)

    let mlArray = try! MLMultiArray(shape: [1, 1], dataType: MLMultiArrayDataType.float32)
    let htrinput = HTRInput(image: pixelBuffer, label: mlArray)
    if let prediction = try? model.prediction(input: htrinput) {
        print(prediction)
    }
}
I get the following error:
[espresso] [Espresso::handle_ex_plan] exception=Espresso exception: "Invalid argument": generic_reshape_kernel: Invalid bottom shape (64 12 1 1 1) for reshape to (768 50 -1 1 1) status=-6
2021-01-21 20:23:50.712585+0900 Guided Camera[7575:1794819] [coreml] Error computing NN outputs -6
2021-01-21 20:23:50.712611+0900 Guided Camera[7575:1794819] [coreml] Failure in -executePlan:error:.
Here is the model configuration
The model itself ran perfectly fine. Where am I going wrong here? I am not well versed in Swift and need help.
What does this error mean, and how do I resolve it?
Sometimes during the conversion from Keras (or whatever) to Core ML, the converter doesn't understand how to handle certain operations, which results in a model that doesn't work.
In your case, there is a layer that outputs a tensor with shape (64, 12, 1, 1, 1) while there is a reshape layer that expects something that can be reshaped to (768, 50, -1, 1, 1).
You'll need to find out which layer does this reshape and then examine the Core ML model to see why that layer receives an input tensor of the wrong size. Just because it works OK in Keras does not mean the conversion to Core ML was flawless.
You can examine the Core ML model with Netron, an open source model viewer.
(Note that 64x12 = 768, so the issue appears to be with the 50 in that tensor.)
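If you want to check the declared shapes from Swift before digging into Netron, something like this sketch should print the model's inputs and outputs (assuming the Xcode-generated HTRModel wrapper, which exposes the underlying MLModel through its model property):
import CoreML

let htr = try! HTRModel()
let description = htr.model.modelDescription
for (name, feature) in description.inputDescriptionsByName {
    print("input \(name): \(feature)")   // expected image size / multi-array shape
}
for (name, feature) in description.outputDescriptionsByName {
    print("output \(name): \(feature)")
}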
I'm reading an input file and, using the offline manual rendering mode, I want to perform amplitude modulation and write the result to an output file.
For the sake of testing, I produce pure sine waves. This works well for frequencies lower than 6,000 Hz. For higher frequencies (my goal is to work with roughly 20,000 Hz), the signal heard in the output file is distorted, and the spectrum ends at 8,000 Hz: instead of a single pure peak there are multiple peaks between 0 and 8,000 Hz.
Here's my code snippet:
let outputFile: AVAudioFile
do {
    let documentsURL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
    let outputURL = documentsURL.appendingPathComponent("output.caf")
    outputFile = try AVAudioFile(forWriting: outputURL, settings: sourceFile.fileFormat.settings)
} catch {
    fatalError("Unable to open output audio file: \(error).")
}

var sampleTime: Float32 = 0

while engine.manualRenderingSampleTime < sourceFile.length {
    do {
        let frameCount = sourceFile.length - engine.manualRenderingSampleTime
        let framesToRender = min(AVAudioFrameCount(frameCount), buffer.frameCapacity)
        let status = try engine.renderOffline(framesToRender, to: buffer)
        switch status {
        case .success:
            // The data rendered successfully. Write it to the output file.
            let sampleRate: Float = Float(mixer.outputFormat(forBus: 0).sampleRate)
            let modulationFrequency: Float = 20000.0
            for i in stride(from: 0, to: Int(buffer.frameLength), by: 1) {
                let val = sinf(2.0 * .pi * modulationFrequency * Float(sampleTime) / sampleRate)
                // TODO: perform modulation later
                buffer.floatChannelData?.pointee[i] = val
                sampleTime = sampleTime + 1.0
            }
            try outputFile.write(from: buffer)
        case .insufficientDataFromInputNode:
            // Applicable only when using the input node as one of the sources.
            break
        case .cannotDoInCurrentContext:
            // The engine couldn't render in the current render call.
            // Retry in the next iteration.
            break
        case .error:
            // An error occurred while rendering the audio.
            fatalError("The manual rendering failed.")
        @unknown default:
            fatalError("unknown error")
        }
    } catch {
        fatalError("The manual rendering failed: \(error).")
    }
}
My question: is there something wrong with my code? Or does anybody have an idea how to produce output files containing sine waves at higher frequencies?
I suspect that the manual rendering mode is not fast enough to deal with higher frequencies.
Update:
In the meantime, I analyzed the output files with Audacity, looking at the waveforms of the 1,000 Hz and 20,000 Hz files, zooming in, and comparing the spectra of the two output files.
It's strange that with the higher frequency the amplitude goes toward zero. In addition, I see more frequencies in the second spectrum.
A new question arising from this outcome is whether the following algorithm is correct:
// Process the audio in `renderBuffer` here
for i in 0..<Int(renderBuffer.frameLength) {
    let val = sinf(1000.0 * Float(index) * 2 * .pi / Float(sampleRate))
    renderBuffer.floatChannelData?.pointee[i] = val
    index += 1
}
I did check the sample rate: it is 48,000 Hz. I know that when the sampling frequency is greater than twice the maximum frequency of the signal being sampled, the original signal can be faithfully reconstructed; 48,000 Hz > 2 × 20,000 Hz = 40,000 Hz, so a 20 kHz sine should be representable in principle, even though there are only 2.4 samples per period.
Update 2:
I changed the settings as follows:
settings[AVFormatIDKey] = kAudioFormatAppleLossless
settings[AVAudioFileTypeKey] = kAudioFileCAFType
settings[AVSampleRateKey] = readBuffer.format.sampleRate
settings[AVNumberOfChannelsKey] = 1
settings[AVLinearPCMIsFloatKey] = (readBuffer.format.commonFormat == .pcmFormatInt32)
settings[AVSampleRateConverterAudioQualityKey] = AVAudioQuality.max
settings[AVLinearPCMBitDepthKey] = 32
settings[AVEncoderAudioQualityKey] = AVAudioQuality.max
Now the quality of the output signal is better, but not perfect: I get higher amplitudes, but still more than one frequency in the spectrum analyzer. Maybe a workaround would be to apply a high-pass filter?
In the meantime, I did work with a kind of SignalGenerator, streaming the manipulated buffer (with sine waves) directly to the loudspeaker; in that case the output is perfect. I think that routing the signal to a file is what causes these issues.
The speed of manual rendering mode is not the issue: in offline manual rendering the engine is driven by your render calls rather than running in real time, so speed is essentially irrelevant.
Here is skeleton code for manual rendering from a source file to an output file:
// Open the input file
let file = try! AVAudioFile(forReading: URL(fileURLWithPath: "/tmp/test.wav"))
let engine = AVAudioEngine()
let player = AVAudioPlayerNode()
engine.attach(player)
engine.connect(player, to:engine.mainMixerNode, format: nil)
// Run the engine in manual rendering mode using chunks of 512 frames
let renderSize: AVAudioFrameCount = 512
// Use the file's processing format as the rendering format
let renderFormat = AVAudioFormat(commonFormat: file.processingFormat.commonFormat, sampleRate: file.processingFormat.sampleRate, channels: file.processingFormat.channelCount, interleaved: true)!
let renderBuffer = AVAudioPCMBuffer(pcmFormat: renderFormat, frameCapacity: renderSize)!
try! engine.enableManualRenderingMode(.offline, format: renderFormat, maximumFrameCount: renderBuffer.frameCapacity)
try! engine.start()
player.play()
// The render format is also the output format
let output = try! AVAudioFile(forWriting: URL(fileURLWithPath: "/tmp/foo.wav"), settings: renderFormat.settings, commonFormat: renderFormat.commonFormat, interleaved: renderFormat.isInterleaved)
// Read using a buffer sized to produce `renderSize` frames of output
let readBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat, frameCapacity: renderSize)!
// Process the file
while true {
do {
// Processing is finished if all frames have been read
if file.framePosition == file.length {
break
}
try file.read(into: readBuffer)
player.scheduleBuffer(readBuffer, completionHandler: nil)
let result = try engine.renderOffline(readBuffer.frameLength, to: renderBuffer)
// Process the audio in `renderBuffer` here
// Write the audio
try output.write(from: renderBuffer)
if result != .success {
break
}
}
catch {
break
}
}
player.stop()
engine.stop()
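For the original sine-wave test, the "Process the audio in `renderBuffer` here" step is where the tone would be generated, and it should be generated against the render format's own sample rate. A rough sketch (the phase accumulator is assumed to be declared before the while loop; the render format above is interleaved, so all channels share one buffer):
// var phase = 0.0   // declared once, before the `while true` loop
let toneFrequency = 20_000.0
let sampleRate = renderFormat.sampleRate
let channelCount = Int(renderFormat.channelCount)
if let data = renderBuffer.floatChannelData?[0] {
    for frame in 0..<Int(renderBuffer.frameLength) {
        let value = Float(sin(2.0 * .pi * phase))
        phase += toneFrequency / sampleRate
        if phase >= 1.0 { phase -= 1.0 }
        for channel in 0..<channelCount {
            // Interleaved layout: all channels live in one buffer.
            data[frame * channelCount + channel] = value
        }
    }
}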
Here is a snippet showing how to set the same sample rate throughout the engine:
// Replace:
//engine.connect(player, to:engine.mainMixerNode, format: nil)
// With:
let busFormat = AVAudioFormat(standardFormatWithSampleRate: file.fileFormat.sampleRate, channels: file.fileFormat.channelCount)
engine.disconnectNodeInput(engine.outputNode, bus: 0)
engine.connect(engine.mainMixerNode, to: engine.outputNode, format: busFormat)
engine.connect(player, to:engine.mainMixerNode, format: busFormat)
Verify the sample rates are the same throughout with:
NSLog("%@", engine)
________ GraphDescription ________
AVAudioEngineGraph 0x7f8194905af0: initialized = 0, running = 0, number of nodes = 3
******** output chain ********
node 0x600001db9500 {'auou' 'ahal' 'appl'}, 'U'
inputs = 1
(bus0, en1) <- (bus0) 0x600001d80b80, {'aumx' 'mcmx' 'appl'}, [ 2 ch, 48000 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
node 0x600001d80b80 {'aumx' 'mcmx' 'appl'}, 'U'
inputs = 1
(bus0, en1) <- (bus0) 0x600000fa0200, {'augn' 'sspl' 'appl'}, [ 2 ch, 48000 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
outputs = 1
(bus0, en1) -> (bus0) 0x600001db9500, {'auou' 'ahal' 'appl'}, [ 2 ch, 48000 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
node 0x600000fa0200 {'augn' 'sspl' 'appl'}, 'U'
outputs = 1
(bus0, en1) -> (bus0) 0x600001d80b80, {'aumx' 'mcmx' 'appl'}, [ 2 ch, 48000 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
______________________________________
I have created an app which I am using to take acoustic measurements. The app generates a log sine sweep stimulus, and when the user presses 'start' the app simultaneously plays the stimulus sound, and records the microphone input.
All fairly standard stuff. I am using core audio as down the line I want to really delve into different functionality, and potentially use multiple interfaces, so have to start learning somewhere.
This is for iOS so I am creating an AUGraph with remoteIO Audio Unit for input and output. I have declared the audio formats, and they are correct as no errors are shown and the AUGraph initialises, starts, plays sound and records.
I have a render callback on the input scope of input 1 of my mixer (i.e., every time more audio is needed, the render callback is called and it reads a few samples from my stimulus array of floats into the buffer).
let genContext = Unmanaged.passRetained(self).toOpaque()
var genCallbackStruct = AURenderCallbackStruct(inputProc: genCallback,
                                               inputProcRefCon: genContext)
AudioUnitSetProperty(mixerUnit!, kAudioUnitProperty_SetRenderCallback,
                     kAudioUnitScope_Input, 1, &genCallbackStruct,
                     UInt32(MemoryLayout<AURenderCallbackStruct>.size))
I then have an input callback which is called every time the buffer is full on the output scope of the remoteIO input. This callback saves the samples to an array.
var inputCallbackStruct = AURenderCallbackStruct(inputProc: recordingCallback,
                                                 inputProcRefCon: context)
AudioUnitSetProperty(remoteIOUnit!, kAudioOutputUnitProperty_SetInputCallback,
                     kAudioUnitScope_Global, 0, &inputCallbackStruct,
                     UInt32(MemoryLayout<AURenderCallbackStruct>.size))
Once the stimulus reaches the last sample, the AUGraph is stopped, and then I write both the stimulus and the recorded array to separate WAV files so I can check my data. What I am finding is that there is currently about 3000 samples delay between the recorded input and the stimulus.
Whilst it is hard to see the start of the waveforms (both the speakers and the microphone may not detect that low), the ends of the stimulus (bottom WAV) and the recorded should roughly line up.
There will be propagation time for the audio, I realise this, but at a 44,100 Hz sample rate 3000 samples is about 68 ms. Core Audio is meant to keep latency down.
So my question is this: can anybody account for this additional latency? It seems quite high.
my inputCallback is as follows:
let recordingCallback: AURenderCallback = { (
    inRefCon,
    ioActionFlags,
    inTimeStamp,
    inBusNumber,
    frameCount,
    ioData) -> OSStatus in

    let audioObject = unsafeBitCast(inRefCon, to: AudioEngine.self)
    var err: OSStatus = noErr

    var bufferList = AudioBufferList(
        mNumberBuffers: 1,
        mBuffers: AudioBuffer(
            mNumberChannels: UInt32(1),
            mDataByteSize: 512,
            mData: nil))

    if let au: AudioUnit = audioObject.remoteIOUnit! {
        err = AudioUnitRender(au,
                              ioActionFlags,
                              inTimeStamp,
                              inBusNumber,
                              frameCount,
                              &bufferList)
    }

    let data = Data(bytes: bufferList.mBuffers.mData!, count: Int(bufferList.mBuffers.mDataByteSize))
    let samples = data.withUnsafeBytes {
        UnsafeBufferPointer<Int16>(start: $0, count: data.count / MemoryLayout<Int16>.size)
    }
    let factor = Float(Int16.max)
    var floats: [Float] = Array(repeating: 0.0, count: samples.count)
    for i in 0..<samples.count {
        floats[i] = Float(samples[i]) / factor
    }

    var j = audioObject.in1BufIndex
    let m = audioObject.in1BufSize
    for i in 0..<floats.count {
        audioObject.in1Buf[j] = floats[i]
        j += 1; if j >= m { j = 0 }
    }
    audioObject.in1BufIndex = j

    audioObject.inputCallbackFrameSize = Int(frameCount)
    audioObject.callbackcount += 1
    var WindowSize = totalRecordSize / Int(frameCount)
    if audioObject.callbackcount == WindowSize {
        audioObject.running = false
    }
    return 0
}
So once the engine starts, this callback should fire after the first set of data is collected from remoteIO: 512 samples, as that is the default allocated buffer size. All it does is convert the signed integers to Float and save them to a buffer. The value in1BufIndex tracks the last index written in the array, and it is read and updated on each callback to make sure the data in the array lines up.
Currently there seem to be about 3000 samples of silence in the recorded array before the captured sweep appears. Inspecting the recorded array by debugging in Xcode, all samples have values (and yes, the first 3000 are very quiet), but somehow this doesn't add up.
Below is the generator callback used to play my stimulus:
let genCallback: AURenderCallback = { (
    inRefCon,
    ioActionFlags,
    inTimeStamp,
    inBusNumber,
    frameCount,
    ioData) -> OSStatus in

    let audioObject = unsafeBitCast(inRefCon, to: AudioEngine.self)

    for buffer in UnsafeMutableAudioBufferListPointer(ioData!) {
        let frames = buffer.mData!.assumingMemoryBound(to: Float.self)
        var j = 0
        if audioObject.stimulusReadIndex < (audioObject.Stimulus.count - Int(frameCount)) {
            for i in stride(from: 0, to: Int(frameCount), by: 1) {
                frames[i] = Float(audioObject.Stimulus[j + audioObject.stimulusReadIndex])
                j += 1
                audioObject.in2Buf[j + audioObject.stimulusReadIndex] = Float(audioObject.Stimulus[j + audioObject.stimulusReadIndex])
            }
            audioObject.stimulusReadIndex += Int(frameCount)
        }
    }
    return noErr
}
There may be at least 4 things contributing to the round trip latency.
512 samples, or about 11.6 ms at 44.1 kHz, is the time required to gather enough samples before remoteIO can call your callback.
Sound propagates at about 1 foot per millisecond, double that for a round trip.
The DAC has an output latency.
There is the time needed for the multiple ADCs (there's more than one microphone on your iOS device) to sample and post-process the audio (sigma-delta conversion, beamforming, equalization, etc.). The post-processing might be done in blocks, thus incurring the latency needed to gather enough samples (an undocumented number) for one block.
There’s possibly also added overhead latency in moving data (hardware DMA of some unknown block size?) between the ADC and system memory, as well as driver and OS context switching overhead.
There’s also a startup latency to power up the audio hardware subsystems (amplifiers, etc.), so it may be best to start playing and recording audio well before outputting your sound (frequency sweep).
I have a memory reference, mBuffers.mData (from an AudioUnit bufferList), declared in the OS X and iOS framework headers as an:
UnsafeMutablePointer<Void>
What is an efficient way to write lots of Int16 values into memory referenced by this pointer?
A disassembly of this Swift source code:
for i in 0..<count {
    var x: Int16 = someFastCalculation()
    let loByte: Int32 = Int32(x) & 0x00ff
    let hiByte: Int32 = (Int32(x) >> 8) & 0x00ff
    memset(mBuffers.mData + 2 * i, loByte, 1)
    memset(mBuffers.mData + 2 * i + 1, hiByte, 1)
}
shows lots of instructions setting up the memset() function calls (far more instructions than in my someFastCalculation). This is a loop inside a real-time audio callback, so efficient code to minimize latency and battery consumption is important.
Is there a faster way?
This Swift source allows array assignment of individual audio samples to an Audio Unit (or AUAudioUnit) audio buffer, and compiles down to a faster result than using memset.
let mutableData = UnsafeMutablePointer<Int16>(mBuffers.mData)
let sampleArray = UnsafeMutableBufferPointer<Int16>(
    start: mutableData,
    count: Int(mBuffers.mDataByteSize) / sizeof(Int16))
for i in 0..<count {
    let x: Int16 = mySampleSynthFunction(i)
    sampleArray[i] = x
}
More complete Gist here.