Perform normalization using Accelerate framework - swift

I need to perform a simple math operation on Data that contains RGB pixel data. Currently I'm doing it like so:
let imageMean: Float = 127.5
let imageStd: Float = 127.5
let rgbData: Data // Some data containing RGB pixels
let floats = (0..<rgbData.count).map {
(Float(rgbData[$0]) - imageMean) / imageStd
}
return Data(bytes: floats, count: floats.count * MemoryLayout<Float>.size)
This works, but it's too slow. I was hoping I could use the Accelerate framework to calculate this faster, but have no idea how to do this. I reserved some space so that it's not allocated every time this function starts, like so:
inputBufferDataNormalized = malloc(width * height * 3) // 3 channels RGB
I tried a few functions, like vDSP_vasm, but I couldn't make it work. Can someone show me how to use it? Basically I need to replace this map function, because it takes too long. And it would probably be great to use the pre-allocated space all the time.

Following up on my comment on your other related question. You can use SIMD to parallelize the operation, but you'd need to split the original array into chunks.
This is a simplified example that assumes that the array is exactly divisible by 64, for example, an array of 1024 elements:
let arr: [Float] = (0 ..< 1024).map { _ in Float.random(in: 0...1) }
let imageMean: Float = 127.5
let imageStd: Float = 127.5
var chunks = [SIMD64<Float>]()
chunks.reserveCapacity(arr.count / 64)
for i in stride(from: 0, to: arr.count, by: 64) {
    let v = SIMD64.init(arr[i ..< i+64])
    chunks.append((v - imageMean) / imageStd) // same calculation using SIMD
}
You can now access each chunk with a subscript:
var results: [Float] = []
results.reserveCapacity(arr.count)
for chunk in chunks {
    for i in chunk.indices {
        results.append(chunk[i])
    }
}
Of course, you'd need to deal with a remainder if the array isn't exactly divisible by 64.
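For completeness, here is a minimal sketch of one way to handle that tail, reusing arr, imageMean and imageStd from above: process full 64-element chunks with SIMD and fall back to a scalar loop for whatever is left.
var results = [Float]()
results.reserveCapacity(arr.count)
let fullChunksEnd = (arr.count / 64) * 64
for i in stride(from: 0, to: fullChunksEnd, by: 64) {
    let v = (SIMD64<Float>(arr[i ..< i + 64]) - imageMean) / imageStd
    for lane in v.indices {
        results.append(v[lane])
    }
}
// Remainder: plain scalar math for the last arr.count % 64 elements.
for i in fullChunksEnd ..< arr.count {
    results.append((arr[i] - imageMean) / imageStd)
}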

I have found a way to do this using Accelerate. First I reserve space for the converted buffer, like so:
var inputBufferDataRawFloat = [Float](repeating: 0, count: width * height * 3)
Then I can use it like so:
let rawBytes = [UInt8](rgbData)
// Convert the UInt8 pixel bytes to Float
vDSP_vfltu8(rawBytes, 1, &inputBufferDataRawFloat, 1, vDSP_Length(rawBytes.count))
// inputBufferDataRawScalars (defined elsewhere) holds the precomputed constants,
// presumably mean = -127.5 and std = 1/127.5, so (A + mean) * std == (A - 127.5) / 127.5
vDSP.add(inputBufferDataRawScalars.mean, inputBufferDataRawFloat, result: &inputBufferDataRawFloat)
vDSP.multiply(inputBufferDataRawScalars.std, inputBufferDataRawFloat, result: &inputBufferDataRawFloat)
return Data(bytes: inputBufferDataRawFloat, count: inputBufferDataRawFloat.count * MemoryLayout<Float>.size)
Works very fast. Maybe there is a better function in Accelerate; if anyone knows of it, please let me know. It needs to perform the function (A[n] + B) * C (or, to be exact, (A[n] - B) / C, but that can be rewritten in the first form).
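For what it's worth, (A[n] + B) * C can also be folded into a single vDSP call: vDSP_vsmsa computes D[n] = A[n] * B + C, and with B = 1/imageStd and C = -imageMean/imageStd that is algebraically the same as (A[n] - imageMean) / imageStd. A minimal sketch reusing the names from the snippets above (not benchmarked, so treat any speed advantage as an assumption):
import Accelerate

let rawBytes = [UInt8](rgbData)
var floats = [Float](repeating: 0, count: rawBytes.count)
var normalized = [Float](repeating: 0, count: rawBytes.count)
var scale: Float = 1.0 / imageStd            // B
var offset: Float = -imageMean / imageStd    // C

vDSP_vfltu8(rawBytes, 1, &floats, 1, vDSP_Length(rawBytes.count))
// D[n] = A[n] * B + C, in one pass over the data
vDSP_vsmsa(floats, 1, &scale, &offset, &normalized, 1, vDSP_Length(floats.count))
return Data(bytes: normalized, count: normalized.count * MemoryLayout<Float>.size)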

How to populate a pixel buffer much faster?

As part of a hobby project, I'm working on a 2D game engine that will draw each pixel every frame, using a color from a palette.
I am looking for a way to do that while maintaining a reasonable frame rate (60fps being the minimum).
Without any game logic in place, I am updating the values of my pixels with some value from the palette.
I'm currently taking the mod of an index, to (hopefully) prevent the compiler from doing some loop-optimisation it could do with a fixed value.
Below is my (very naive?) implementation of updating the bytes in the pixel array.
On an iPhone 12 Pro, each run of updating all pixel values takes on average 43 ms, while on a simulator running on an M1 Mac, it takes 15 ms. Both are unacceptable, as that would leave no time for any additional game logic (which would involve many more operations than taking the mod of an Int).
I was planning to look into Metal and set up a surface, but clearly the bottleneck here is the CPU, so if I can optimize this code, I could go for a higher-level framework.
Any suggestions on a performant way to write this many bytes much, much faster (parallelisation is not an option)?
Instruments shows that most of the time is being spent in Swift's IndexingIterator.next() function. Maybe there is a way to reduce the time spent there; there is quite a substantial subtree inside it.
struct BGRA
{
    let blue: UInt8
    let green: UInt8
    let red: UInt8
    let alpha: UInt8
}
let BGRAPallet =
[
    BGRA(blue: 124, green: 124, red: 124, alpha: 0xff),
    BGRA(blue: 252, green: 0, red: 0, alpha: 0xff),
    // ... 62 more values in my code, omitting here for brevity
]
private func test()
{
    let screenWidth: Int = 256
    let screenHeight: Int = 240
    let pixelBufferPtr = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: screenWidth * screenHeight)
    let runCount = 1000
    let start = Date.now
    for _ in 0 ..< runCount
    {
        for index in 0 ..< pixelBufferPtr.count
        {
            pixelBufferPtr[index] = BGRAPallet[index % BGRAPallet.count]
        }
    }
    let elapsed = Date.now.timeIntervalSince(start)
    print("Average time per run: \(Int(elapsed * 1000) / runCount) ms")
}
First of all, I don't believe you're testing an optimized build for two reasons:
You say “This was measured with optimization set to Fastest [-O3].” But the Swift compiler doesn't recognize -O3 as a command-line flag. The C/C++ compiler recognizes that flag. For Swift the flags are -Onone, -O, -Osize, and -Ounchecked.
I ran your code on my M1 Max MacBook Pro in Debug configuration and it reported 15ms. Then I ran it in Release configuration and it reported 0ms. I had to increase the screen size to 2560x2400 (100x the pixels) to get it to report a time of 3ms.
Now, looking at your code, here are some things that stand out:
You're picking a color using BGRAPalette[index % BGRAPalette.count]. Since your palette size is 64, you can say BGRAPalette[index & 0b0011_1111] for the same result. I expected Swift to optimize that for me, but apparently it didn't, because making that change reduced the reported time to 2ms.
Indexing into BGRAPalette incurs a bounds check. You can avoid the bounds check by grabbing an UnsafeBufferPointer for the palette. Adding this optimization reduced the reported time to 1ms.
Here's my version:
public struct BGRA {
    let blue: UInt8
    let green: UInt8
    let red: UInt8
    let alpha: UInt8
}
func rng() -> UInt8 { UInt8.random(in: .min ... .max) }
let BGRAPalette = (0 ..< 64).map { _ in
    BGRA(blue: rng(), green: rng(), red: rng(), alpha: rng())
}
public func test() {
    let screenWidth: Int = 2560
    let screenHeight: Int = 2400
    let pixelCount = screenWidth * screenHeight
    let pixelBuffer = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: pixelCount)
    let runCount = 1000
    let start = SuspendingClock().now
    BGRAPalette.withUnsafeBufferPointer { paletteBuffer in
        for _ in 0 ..< runCount
        {
            for index in 0 ..< pixelCount
            {
                pixelBuffer[index] = paletteBuffer[index & 0b0011_1111]
            }
        }
    }
    let end = SuspendingClock().now
    let elapsed = end - start
    let msElapsed = elapsed.components.seconds * 1000 + elapsed.components.attoseconds / 1_000_000_000_000_000
    print("Average time per run: \(msElapsed / Int64(runCount)) ms")
    // return pixelBuffer
}
@main
struct MyMain {
    static func main() {
        test()
    }
}
In addition to the two optimizations I described, I removed the dependency on Foundation (so I could paste the code into the compiler explorer) and corrected the spelling of 'palette'.
But realistically, even this probably isn't a particularly good test of your fill rate. You didn't say what kind of game you want to write, but given your screen size of 256x240, it's likely to use a tile-based map and sprites. If so, you shouldn't copy a pixel at a time. You can write a blitter that copies blocks of pixels at a time, using CPU instructions that operate on more than 32 bits at a time. ARM64 has 128-bit (16-byte) registers.
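A rough, hypothetical sketch of that idea (blitTileRow, tileRow and offset are made-up names, not code from the question): precompute a row of tile pixels once and copy it into the frame buffer in one call, so the copy can use wide loads and stores instead of per-pixel assignments.
// Copy one precomputed row of tile pixels (e.g. 8 BGRA values = 32 bytes) in a single bulk copy.
func blitTileRow(_ tileRow: [BGRA], into pixelBuffer: UnsafeMutableBufferPointer<BGRA>, at offset: Int) {
    tileRow.withUnsafeBufferPointer { src in
        UnsafeMutableRawPointer(pixelBuffer.baseAddress! + offset)
            .copyMemory(from: src.baseAddress!, byteCount: src.count * MemoryLayout<BGRA>.stride)
    }
}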
But even more realistically, you should learn to use the GPU for your blitting. Not only is it faster for this sort of thing, it's probably more power-efficient too. Even though you're lighting up more of the chip, you're lighting it up for shorter intervals.
Okay, so I gave this a shot. You can probably move to using SIMD (single instruction, multiple data) stores. This would probably speed up the whole process by 10-20 percent or so.
Here is the code link (same code will be pasted below for brevity): https://codecatch.net/post/4b9683bf-8e35-4bf5-a1a9-801ab2e73805
I made two versions just in case your system's architecture doesn't support simd_uint4. Let me know if this is what you were looking for.
import simd
import Foundation // for Date.now

private func test() {
    let screenWidth: Int = 256
    let screenHeight: Int = 240
    let pixelBufferPtr = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: screenWidth * screenHeight)
    let runCount = 1000
    let start = Date.now
    for _ in 0 ..< runCount {
        var index = 0
        var palletIndex = 0
        let palletCount = BGRAPallet.count
        while index < pixelBufferPtr.count {
            let bgra = BGRAPallet[palletIndex]
            // Pack the four channel bytes into one 32-bit pixel value and broadcast it
            // into all four lanes, so a single 16-byte store fills four BGRA pixels.
            let packed = UInt32(bgra.blue)
                | (UInt32(bgra.green) << 8)
                | (UInt32(bgra.red) << 16)
                | (UInt32(bgra.alpha) << 24)
            let bgraVector = simd_uint4(repeating: packed)
            let maxCount = min(pixelBufferPtr.count - index, 4)
            // storeBytes is only available on raw pointers. 256 * 240 is divisible by 4,
            // so the 16-byte store never runs past the end of the buffer.
            let pixelBuffer = UnsafeMutableRawPointer(pixelBufferPtr.baseAddress! + index)
            pixelBuffer.storeBytes(of: bgraVector, as: simd_uint4.self)
            palletIndex += 1
            if palletIndex == palletCount {
                palletIndex = 0
            }
            index += maxCount
        }
    }
    let elapsed = Date.now.timeIntervalSince(start)
    print("Average time per run: \(Int(elapsed * 1000) / runCount) ms")
}

Matching Torch STFT with Accelerate

I'm trying to re-implement Torch's STFT code in Swift with Accelerate / vDSP, to produce a log Mel spectrogram by post-processing the STFT, so I can use the Mel spectrogram as an input for a CoreML port of OpenAI's Whisper.
PyTorch's native STFT / Mel code produces this spectrogram (it's clipped due to importing raw float 32s into Photoshop, lol):
and mine:
Obviously the two things to notice are the values, and the lifted frequency components.
The STFT Docs here https://pytorch.org/docs/stable/generated/torch.stft.html
X[ω, m] = ∑_{k=0}^{win_length−1} window[k] · input[m × hop_length + k] · exp(−j · 2π·ω·k / win_length)
I believe I'm properly handling window[k] · input[m×hop_length+k], but I'm a bit lost as to how to calculate the exponent, what −j refers to in the documentation, and how to convert the final exponential in vDSP. Also, if it's a sum, how do I get the 200 elements I need!?
My Log Mel Spectrogram
My code follows:
func processData(audio: [Int16]) -> [Float]
{
    assert(self.sampleCount == audio.count)
    var audioFloat: [Float] = [Float](repeating: 0, count: audio.count)
    vDSP.convertElements(of: audio, to: &audioFloat)
    vDSP.divide(audioFloat, 32768.0, result: &audioFloat)
    // Up to this point, Python and Swift are numerically identical
    // Insert numFFT/2 samples before and numFFT/2 after so we have an extra numFFT amount to process
    // TODO: Is this strictly necessary?
    audioFloat.insert(contentsOf: [Float](repeating: 0, count: self.numFFT/2), at: 0)
    audioFloat.append(contentsOf: [Float](repeating: 0, count: self.numFFT/2))
    // Split-complex arrays holding the FFT results
    var allSampleReal = [[Float]](repeating: [Float](repeating: 0, count: self.numFFT/2), count: self.melSampleCount)
    var allSampleImaginary = [[Float]](repeating: [Float](repeating: 0, count: self.numFFT/2), count: self.melSampleCount)
    // Step 2 - we need to create a 200 x 3000 matrix of STFTs - note we appear to want to output complex numbers (?)
    for m in 0 ..< self.melSampleCount
    {
        // Slice numFFT samples every hop count (barf) and make a mel spectrum out of it
        // audioFrame ends up holding split-complex numbers
        var audioFrame = Array<Float>( audioFloat[ (m * self.hopCount) ..< ( (m * self.hopCount) + self.numFFT) ] )
        // Copy of audioFrame original samples
        let audioFrameOriginal = audioFrame
        assert(audioFrame.count == self.numFFT)
        // Split-complex arrays holding a single FFT result of our audio frame, which gets appended to the allSample split-complex arrays
        var sampleReal: [Float] = [Float](repeating: 0, count: self.numFFT/2)
        var sampleImaginary: [Float] = [Float](repeating: 0, count: self.numFFT/2)
        sampleReal.withUnsafeMutableBytes { unsafeReal in
            sampleImaginary.withUnsafeMutableBytes { unsafeImaginary in
                vDSP.multiply(audioFrame,
                              hanningWindow,
                              result: &audioFrame)
                var complexSignal = DSPSplitComplex(realp: unsafeReal.bindMemory(to: Float.self).baseAddress!,
                                                    imagp: unsafeImaginary.bindMemory(to: Float.self).baseAddress!)
                audioFrame.withUnsafeBytes { unsafeAudioBytes in
                    vDSP.convert(interleavedComplexVector: [DSPComplex](unsafeAudioBytes.bindMemory(to: DSPComplex.self)),
                                 toSplitComplexVector: &complexSignal)
                }
                // Step 3 - creating the FFT
                self.fft.forward(input: complexSignal, output: &complexSignal)
            }
        }
        // We need to match: https://pytorch.org/docs/stable/generated/torch.stft.html
        // At this point, I'm unsure how to continue?
        // let twoπ = Float.pi * 2
        // let freqstep: Float = Float(16000 / (self.numFFT/2))
        //
        // var w: Float = 0.0
        // for k in 0 ..< self.numFFT/2
        // {
        //     let j: Float = sampleImaginary[k]
        //     let sample = audioFrame[k]
        //
        //     let exponent = -j * ( (twoπ * freqstep * Float(k) ) / Float((self.numFFT/2)))
        //
        //     w += powf(sample, exponent)
        // }
        allSampleReal[m] = sampleReal
        allSampleImaginary[m] = sampleImaginary
    }
    // We now have allSample split-complex arrays holding 3000 200-dimensional real and imaginary FFT results
    // We create a flattened 3000 x 200 array of DSPSplitComplex values
    var flattnedReal: [Float] = allSampleReal.flatMap { $0 }
    var flattnedImaginary: [Float] = allSampleImaginary.flatMap { $0 }
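For context, a common next step toward a Mel spectrogram (not part of the question, and not necessarily how PyTorch orders its pipeline) is to turn the split-complex FFT output into a power spectrum with vDSP_zvmags; a minimal sketch using the flattened arrays above:
var power = [Float](repeating: 0, count: flattnedReal.count)
flattnedReal.withUnsafeMutableBufferPointer { realPtr in
    flattnedImaginary.withUnsafeMutableBufferPointer { imagPtr in
        var split = DSPSplitComplex(realp: realPtr.baseAddress!, imagp: imagPtr.baseAddress!)
        // power[i] = real[i]^2 + imag[i]^2 for every bin of every frame
        vDSP_zvmags(&split, 1, &power, 1, vDSP_Length(power.count))
    }
}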

vDSP_conv occasionally returns NANs

I'm using vDSP_conv to perform autocorrelation. Mostly it works just fine but every so often it's filling the output array with NaNs.
The code:
func corr_test() {
    var pass = 0
    var x = [Float]()
    for i in 0..<2000 {
        x.append(Float(i))
    }
    while true {
        print("pass \(pass)")
        let corr = autocorr(x)
        if corr[1].isNaN {
            print("!!!")
        }
        pass += 1
    }
}
func autocorr(a: [Float]) -> [Float] {
    let resultLen = a.count * 2 + 1
    let padding = [Float].init(count: a.count, repeatedValue: 0.0)
    let a_pad = padding + a + padding
    var result = [Float].init(count: resultLen, repeatedValue: 0.0)
    vDSP_conv(a_pad, 1, a_pad, 1, &result, 1, UInt(resultLen), UInt(a_pad.count))
    return result
}
The output:
pass ...
pass 169
pass 170
pass 171
(lldb) p corr
([Float]) $R0 = 4001 values {
[0] = 2.66466637E+9
[1] = NaN
[2] = NaN
[3] = NaN
[4] = NaN
...
I'm not sure what's going on here. I think I'm handling the 0 padding correctly since if I weren't I don't think I'd be getting correct results 99% of the time.
Ideas? Gracias.
Figured it out. The key was this comment from https://developer.apple.com/library/mac/samplecode/vDSPExamples/Listings/DemonstrateConvolution_c.html :
// “The signal length is padded a bit. This length is not actually passed to the vDSP_conv routine; it is the number of elements
// that the signal array must contain. The SignalLength defined below is used to allocate space, and it is the filter length
// rounded up to a multiple of four elements and added to the result length. The extra elements give the vDSP_conv routine
// leeway to perform vector-load instructions, which load multiple elements even if they are not all used. If the caller did not
// guarantee that memory beyond the values used in the signal array were accessible, a memory access violation might result.”
“Padded a bit.” Thanks for being so specific. Anyway here's the final working product:
func autocorr(a: [Float]) -> [Float] {
    let filterLen = a.count
    let resultLen = filterLen * 2 - 1
    let signalLen = ((filterLen + 3) & 0xFFFFFFFC) + resultLen
    let padding1 = [Float].init(count: a.count - 1, repeatedValue: 0.0)
    let padding2 = [Float].init(count: (signalLen - padding1.count - a.count), repeatedValue: 0.0)
    let signal = padding1 + a + padding2
    var result = [Float].init(count: resultLen, repeatedValue: 0.0)
    vDSP_conv(signal, 1, a, 1, &result, 1, UInt(resultLen), UInt(filterLen))
    // Remove the first n-1 values which are just mirrored from the end so that [0] always has the autocorrelation.
    result.removeFirst(filterLen - 1)
    return result
}
Note that the results here aren't normalized.
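For reference, here is a hedged sketch (not from the original answer) of normalizing the result so the zero-lag value is 1.0, written in the same Swift 2-era style as the code above:
func normalizedAutocorr(a: [Float]) -> [Float] {
    let raw = autocorr(a)
    var zeroLag = raw[0]
    var normalized = [Float].init(count: raw.count, repeatedValue: 0.0)
    // C[n] = A[n] / B: divide every lag by the lag-0 value
    vDSP_vsdiv(raw, 1, &zeroLag, &normalized, 1, vDSP_Length(raw.count))
    return normalized
}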

Efficiently writing Int16 data to memory in Swift?

I have a memory reference, mBuffers.mData (from an AudioUnit bufferList), declared in the OS X and iOS framework headers as an:
UnsafeMutablePointer<Void>
What is an efficient way to write lots of Int16 values into memory referenced by this pointer?
A disassembly of this Swift source code:
for i in 0..<count {
    var x : Int16 = someFastCalculation()
    let loByte : Int32 = Int32(x) & 0x00ff
    let hiByte : Int32 = (Int32(x) >> 8) & 0x00ff
    memset(mBuffers.mData + 2 * i, loByte, 1)
    memset(mBuffers.mData + 2 * i + 1, hiByte, 1)
}
shows lots of instructions setting up the memset() function calls (far more instructions than in my someFastCalculation). This is a loop inside a real-time audio callback, so efficient code to minimize latency and battery consumption is important.
Is there a faster way?
This Swift source allows array assignment of individual audio samples to an Audio Unit (or AUAudioUnit) audio buffer, and compiles down to a faster result than using memset.
let mutableData = UnsafeMutablePointer<Int16>(mBuffers.mData)
let sampleArray = UnsafeMutableBufferPointer<Int16>(
    start: mutableData,
    count: Int(mBuffers.mDataByteSize) / sizeof(Int16))
for i in 0..<count {
    let x : Int16 = mySampleSynthFunction(i)
    sampleArray[i] = x
}
More complete Gist here.
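For readers on current Swift: the snippet above uses Swift 2-era syntax (sizeof, pointer initializers) that no longer compiles. A hedged modern equivalent, assuming mySampleSynthFunction from above and mBuffers.mData being an optional UnsafeMutableRawPointer in today's SDKs, might look like:
let count = Int(mBuffers.mDataByteSize) / MemoryLayout<Int16>.size
if let rawData = mBuffers.mData {
    // Bind the raw audio buffer to Int16 and assign samples directly.
    let samples = rawData.bindMemory(to: Int16.self, capacity: count)
    let sampleArray = UnsafeMutableBufferPointer<Int16>(start: samples, count: count)
    for i in 0 ..< count {
        sampleArray[i] = mySampleSynthFunction(i)
    }
}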

Linear regression - accelerate framework in Swift

My first question here at Stack Overflow... I hope my question is specific enough.
I have an array in Swift with measurements at certain dates. Like:
var myArray: [(day: Int, mW: Double)] = []
myArray.append((day: 0, mW: 31.98))
myArray.append((day: 1, mW: 31.89))
myArray.append((day: 2, mW: 31.77))
myArray.append((day: 4, mW: 31.58))
myArray.append((day: 6, mW: 31.46))
Some days are missing, I just didn't take a measurement... All measurements should be on a line, more or less. So I thought about linear regression. I found the Accelerate framework, but the documentation is missing and I can't find examples.
For the missing measurements I would like to have a function, with as input a missing day and as output a best guess, based on the other measurements.
func bG(day: Int) -> Double {
return // return best guess for measurement
}
Thanks for helping out.
Jan
My answer doesn't specifically talk about the Accelerate framework; however, I thought the question was interesting and decided to give it a stab. From what I gather, you're basically looking to create a line of best fit and interpolate or extrapolate more values of mW from that. To do that I used the least squares method, detailed here: http://hotmath.com/hotmath_help/topics/line-of-best-fit.html, and implemented it in Playgrounds using Swift:
// The typealias allows us to use '$X.day' and '$X.mW',
// instead of '$X.0' and '$X.1' in the following closures.
typealias PointTuple = (day: Double, mW: Double)
// The days are the values on the x-axis.
// mW is the value on the y-axis.
let points: [PointTuple] = [(0.0, 31.98),
(1.0, 31.89),
(2.0, 31.77),
(4.0, 31.58),
(6.0, 31.46)]
// When using reduce, $0 is the current total.
let meanDays = points.reduce(0) { $0 + $1.day } / Double(points.count)
let meanMW = points.reduce(0) { $0 + $1.mW } / Double(points.count)
let a = points.reduce(0) { $0 + ($1.day - meanDays) * ($1.mW - meanMW) }
let b = points.reduce(0) { $0 + pow($1.day - meanDays, 2) }
// The equation of a straight line is: y = mx + c
// Where m is the gradient and c is the y intercept.
let m = a / b
let c = meanMW - m * meanDays
In the code above, a and b refer to the following sums from the website (x is the day, y is the mW value, and x̄, ȳ are their means):
a = ∑ (xᵢ − x̄)(yᵢ − ȳ)
b = ∑ (xᵢ − x̄)²
Now you can create the function which uses the line of best fit to interpolate/extrapolate mW:
func bG(day: Double) -> Double {
return m * day + c
}
And use it like so:
bG(3) // 31.70
bG(5) // 31.52
bG(7) // 31.35
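Since the question asked specifically about Accelerate, here is a sketch of the same least-squares fit written with the vDSP Swift overlay (assumes iOS 13 / macOS 10.15 or later); it is just a translation of the calculation above, not a different method:
import Accelerate

let days: [Double] = [0, 1, 2, 4, 6]
let mWs: [Double] = [31.98, 31.89, 31.77, 31.58, 31.46]

let meanDay = vDSP.mean(days)
let meanMW = vDSP.mean(mWs)
let dDay = vDSP.add(-meanDay, days)   // day_i - meanDay
let dMW = vDSP.add(-meanMW, mWs)      // mW_i - meanMW
let m = vDSP.sum(vDSP.multiply(dDay, dMW)) / vDSP.sum(vDSP.multiply(dDay, dDay))
let c = meanMW - m * meanDay
// bG(day:) is then exactly as above: m * day + c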
If you want to do fast linear regressions in Swift, I suggest using the Upsurge framework. It provides a number of simple functions that wrap the Accelerate library and so you get the benefits of SIMD on either iOS or OSX without having to worry about the complexity of vDSP calls.
To do a linear regression with base Upsurge functions is simply:
let meanx = mean(x)
let meany = mean(y)
let meanxy = mean(x * y)
let meanx_sqr = measq(x)
let slope = (meanx * meany - meanxy) / (meanx * meanx - meanx_sqr)
let intercept = meany - slope * meanx
This is essentially what is implemented in the linregress function.
You can use it with an array of [Double], other classes such as RealArray (comes with Upsurge) or your own objects if they can expose contiguous memory.
So a script to meet your needs would look like:
#!/usr/bin/env cato
import Upsurge
typealias PointTuple = (day: Double, mW:Double)
var myArray:[PointTuple] = []
myArray.append((0, 31.98))
myArray.append((1, 31.89))
myArray.append((2, 31.77))
myArray.append((4, 31.58))
myArray.append((6, 31.46))
let x = myArray.map { $0.day }
let y = myArray.map { $0.mW }
let (slope, intercept) = Upsurge.linregress(x, y)
func bG(day: Double) -> Double {
return slope * day + intercept
}
(I left in the appends rather than using literals as you are likely programmatically adding to your array if it is of significant length)
and full disclaimer: I contributed the linregress code. I hope to also add the coefficient of determination at some point in the future.
To estimate the values between different points, you can also use SKKeyframeSequence from SpriteKit
https://developer.apple.com/documentation/spritekit/skinterpolationmode/spline
import SpriteKit
let sequence = SKKeyframeSequence(keyframeValues: [0, 20, 40, 60, 80, 100], times: [64, 128, 256, 512, 1024, 2048])
sequence.interpolationMode = .spline // .linear, .step
let estimatedValue = sequence.sample(atTime: CGFloat(1500)) as! Double // 1500 is the value you want to estimate
print(estimatedValue)
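Applied to the question's data, a hedged usage sketch might look like this (the keyframe values and times are the measurements from the question; the sampled result is only an estimate):
import SpriteKit

let mWSequence = SKKeyframeSequence(keyframeValues: [31.98, 31.89, 31.77, 31.58, 31.46],
                                    times: [0, 1, 2, 4, 6])
mWSequence.interpolationMode = .linear
// Best guess for the missing day-3 measurement, somewhere between 31.77 and 31.58
let day3 = mWSequence.sample(atTime: 3) as! Double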