Matching Torch STFT with Accelerate - swift

I'm trying to re-implement PyTorch's STFT code in Swift with Accelerate / vDSP, to produce a log Mel spectrogram by post-processing the STFT, so I can use the Mel spectrogram as the input for a Core ML port of OpenAI's Whisper.
PyTorch's native STFT / Mel code produces this spectrogram (it's clipped due to importing raw float 32s into Photoshop, lol):
and mine:
Obviously, the two things to notice are the differing values and the lifted frequency components.
The STFT docs at https://pytorch.org/docs/stable/generated/torch.stft.html define the transform as:
X[ω, m] = sum_{k = 0}^{win_length − 1} window[k] * input[m × hop_length + k] * exp(−j * 2π * ω * k / win_length)
I believe I'm properly handling window[k] * input[m × hop_length + k], but I'm a bit lost as to how to calculate the exponential, what −j refers to in the documentation, and how to express that final exponential in vDSP. Also, if it's a sum, how do I get the 200 elements I need!?
[Image: my log Mel spectrogram]
My code follows:
func processData(audio: [Int16]) -> [Float]
{
    assert(self.sampleCount == audio.count)

    var audioFloat: [Float] = [Float](repeating: 0, count: audio.count)

    vDSP.convertElements(of: audio, to: &audioFloat)
    vDSP.divide(audioFloat, 32768.0, result: &audioFloat)

    // Up to this point, Python and Swift are numerically identical

    // Insert numFFT/2 samples before and numFFT/2 after so we have an extra numFFT amount to process
    // TODO: Is this strictly necessary?
    audioFloat.insert(contentsOf: [Float](repeating: 0, count: self.numFFT/2), at: 0)
    audioFloat.append(contentsOf: [Float](repeating: 0, count: self.numFFT/2))

    // Split-complex arrays holding the FFT results
    var allSampleReal = [[Float]](repeating: [Float](repeating: 0, count: self.numFFT/2), count: self.melSampleCount)
    var allSampleImaginary = [[Float]](repeating: [Float](repeating: 0, count: self.numFFT/2), count: self.melSampleCount)

    // Step 2 - we need to create a 200 x 3000 matrix of STFTs - note we appear to want to output complex numbers (?)
    for m in 0 ..< self.melSampleCount
    {
        // Slice numFFT samples every hop count (barf) and make a mel spectrum out of it
        // audioFrame ends up holding split-complex numbers
        var audioFrame = Array<Float>( audioFloat[ (m * self.hopCount) ..< ( (m * self.hopCount) + self.numFFT) ] )

        // Copy of audioFrame original samples
        let audioFrameOriginal = audioFrame

        assert(audioFrame.count == self.numFFT)

        // Split-complex arrays holding a single FFT result of our audio frame, which gets appended to the allSample split-complex arrays
        var sampleReal: [Float] = [Float](repeating: 0, count: self.numFFT/2)
        var sampleImaginary: [Float] = [Float](repeating: 0, count: self.numFFT/2)

        sampleReal.withUnsafeMutableBytes { unsafeReal in
            sampleImaginary.withUnsafeMutableBytes { unsafeImaginary in

                vDSP.multiply(audioFrame,
                              hanningWindow,
                              result: &audioFrame)

                var complexSignal = DSPSplitComplex(realp: unsafeReal.bindMemory(to: Float.self).baseAddress!,
                                                    imagp: unsafeImaginary.bindMemory(to: Float.self).baseAddress!)

                audioFrame.withUnsafeBytes { unsafeAudioBytes in
                    vDSP.convert(interleavedComplexVector: [DSPComplex](unsafeAudioBytes.bindMemory(to: DSPComplex.self)),
                                 toSplitComplexVector: &complexSignal)
                }

                // Step 3 - creating the FFT
                self.fft.forward(input: complexSignal, output: &complexSignal)
            }
        }

        // We need to match: https://pytorch.org/docs/stable/generated/torch.stft.html
        // At this point, I'm unsure how to continue?

        // let twoπ = Float.pi * 2
        // let freqstep: Float = Float(16000 / (self.numFFT/2))
        //
        // var w: Float = 0.0
        // for k in 0 ..< self.numFFT/2
        // {
        //     let j: Float = sampleImaginary[k]
        //     let sample = audioFrame[k]
        //
        //     let exponent = -j * ( (twoπ * freqstep * Float(k) ) / Float((self.numFFT/2)) )
        //
        //     w += powf(sample, exponent)
        // }

        allSampleReal[m] = sampleReal
        allSampleImaginary[m] = sampleImaginary
    }

    // We now have allSample split-complex arrays holding 3000 200-dimensional real and imaginary FFT results
    // We create a flattened 3000 x 200 array of DSPSplitComplex values
    var flattenedReal: [Float] = allSampleReal.flatMap { $0 }
    var flattenedImaginary: [Float] = allSampleImaginary.flatMap { $0 }
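For what it's worth: in the formula, j is just the imaginary unit (engineering notation for i), so exp(−j · 2π · ω · k / win_length) is the complex DFT kernel, and the sum over k is exactly what the forward FFT already evaluates for every bin ω at once; there is no explicit exponent or powf to compute. The per-frame frequency values are the bins of that transform: PyTorch's one-sided STFT returns n_fft/2 + 1 of them (201 for Whisper's n_fft = 400), while the packed real FFT above yields numFFT/2 values with the Nyquist term folded into imag[0]. So after self.fft.forward, the split-complex output already holds X[ω, m]. Below is a minimal sketch (not part of the original post) of turning one frame into squared magnitudes, reusing the names above and assuming fft is the vDSP.FFT configured for numFFT points:

// Sketch: convert one frame's split-complex FFT output into squared magnitudes.
// Assumes sampleReal / sampleImaginary hold the forward-transform result for frame m.
var magnitudes = [Float](repeating: 0, count: self.numFFT / 2)
sampleReal.withUnsafeMutableBufferPointer { realPtr in
    sampleImaginary.withUnsafeMutableBufferPointer { imagPtr in
        let frameResult = DSPSplitComplex(realp: realPtr.baseAddress!,
                                          imagp: imagPtr.baseAddress!)
        // |X[w]|^2 = real[w]^2 + imag[w]^2 for each of the numFFT/2 packed bins
        vDSP.squareMagnitudes(frameResult, result: &magnitudes)
    }
}

The Mel step is then a matrix product of an (nMels x nBins) filterbank with the (nBins x nFrames) magnitude matrix (e.g. via cblas_sgemm or vDSP_mmul), followed by log10 and clamping; exporting the filterbank from the Python side keeps both implementations on identical coefficients.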

Related

Determine number of threads for element-wise array addition in Metal

In this example there are two large 1D arrays of size n. The arrays are added together element-wise to calculate a 1D results array using the Accelerate vDSP.add() function and a Metal GPU compute kernel adder().
// Size of each array
private let n = 5_000_000
// Create two random arrays of size n
private var array1 = (1...n).map{ _ in Float.random(in: 1...10) }
private var array2 = (1...n).map{ _ in Float.random(in: 1...10) }
// Add two arrays using Accelerate vDSP
addAccel(array1, array2)
// Add two arrays using Metal on the GPU
addMetal(array1, array2)
The Accelerate code is shown below:
import Accelerate
func addAccel(_ arr1: [Float], _ arr2: [Float]) {
    let tic = DispatchTime.now().uptimeNanoseconds

    // Add two arrays and store results
    let y = vDSP.add(arr1, arr2)

    // Print out elapsed time
    let toc = DispatchTime.now().uptimeNanoseconds
    let elapsed = Float(toc - tic) / 1_000_000_000
    print("\nAccelerate vDSP elapsed time is \(elapsed) s")

    // Print out some results
    for i in 0..<3 {
        let a1 = String(format: "%.4f", arr1[i])
        let a2 = String(format: "%.4f", arr2[i])
        let y = String(format: "%.4f", y[i])
        print("\(a1) + \(a2) = \(y)")
    }
}
The Metal code is shown below:
import MetalKit
private func setupMetal(arr1: [Float], arr2: [Float]) -> (MTLCommandBuffer?, MTLBuffer?) {

    // Get the Metal GPU device
    let device = MTLCreateSystemDefaultDevice()

    // Queue for sending commands to the GPU
    let commandQueue = device?.makeCommandQueue()

    // Get our Metal GPU function
    let gpuFunctionLibrary = device?.makeDefaultLibrary()
    let adderGpuFunction = gpuFunctionLibrary?.makeFunction(name: "adder")

    var adderComputePipelineState: MTLComputePipelineState!
    do {
        adderComputePipelineState = try device?.makeComputePipelineState(function: adderGpuFunction!)
    } catch {
        print(error)
    }

    // Create the buffers to be sent to the GPU from our arrays
    let count = arr1.count

    let arr1Buff = device?.makeBuffer(bytes: arr1,
                                      length: MemoryLayout<Float>.size * count,
                                      options: .storageModeShared)

    let arr2Buff = device?.makeBuffer(bytes: arr2,
                                      length: MemoryLayout<Float>.size * count,
                                      options: .storageModeShared)

    let resultBuff = device?.makeBuffer(length: MemoryLayout<Float>.size * count,
                                        options: .storageModeShared)

    // Create a buffer to be sent to the command queue
    let commandBuffer = commandQueue?.makeCommandBuffer()

    // Create an encoder to set values on the compute function
    let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
    commandEncoder?.setComputePipelineState(adderComputePipelineState)

    // Set the parameters of our GPU function
    commandEncoder?.setBuffer(arr1Buff, offset: 0, index: 0)
    commandEncoder?.setBuffer(arr2Buff, offset: 0, index: 1)
    commandEncoder?.setBuffer(resultBuff, offset: 0, index: 2)

    // Figure out how many threads we need to use for our operation
    let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)
    let maxThreadsPerThreadgroup = adderComputePipelineState.maxTotalThreadsPerThreadgroup
    let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
    commandEncoder?.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)

    // Tell the encoder that it is done encoding. Now we can send this off to the GPU.
    commandEncoder?.endEncoding()

    return (commandBuffer, resultBuff)
}
func addMetal(_ arr1: [Float], _ arr2: [Float]) {
    let (commandBuffer, resultBuff) = setupMetal(arr1: arr1, arr2: arr2)
    let tic = DispatchTime.now().uptimeNanoseconds

    // Push this command to the command queue for processing
    commandBuffer?.commit()

    // Wait until the GPU function completes before working with any of the data
    commandBuffer?.waitUntilCompleted()

    // Get the pointer to the beginning of our data
    let count = arr1.count
    var resultBufferPointer = resultBuff?.contents().bindMemory(to: Float.self, capacity: MemoryLayout<Float>.size * count)

    // Print out elapsed time
    let toc = DispatchTime.now().uptimeNanoseconds
    let elapsed = Float(toc - tic) / 1_000_000_000
    print("\nMetal GPU elapsed time is \(elapsed) s")

    // Print out the results
    for i in 0..<3 {
        let a1 = String(format: "%.4f", arr1[i])
        let a2 = String(format: "%.4f", arr2[i])
        let y = String(format: "%.4f", Float(resultBufferPointer!.pointee))
        print("\(a1) + \(a2) = \(y)")
        resultBufferPointer = resultBufferPointer?.advanced(by: 1)
    }
}
#include <metal_stdlib>
using namespace metal;

kernel void adder(
    constant float *array1 [[ buffer(0) ]],
    constant float *array2 [[ buffer(1) ]],
    device float *result [[ buffer(2) ]],
    uint index [[ thread_position_in_grid ]])
{
    result[index] = array1[index] + array2[index];
}
Results from running the above code on a 2019 MacBook Pro are given below. Specs for the laptop are 2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4, Intel UHD Graphics 630 1536 MB, and AMD Radeon Pro 5500M.
Accelerate vDSP elapsed time is 0.004532601 s
7.8964 + 6.3815 = 14.2779
9.3661 + 8.9641 = 18.3301
4.5389 + 8.5737 = 13.1126
Metal GPU elapsed time is 0.012219718 s
7.8964 + 6.3815 = 14.2779
9.3661 + 8.9641 = 18.3301
4.5389 + 8.5737 = 13.1126
Based on the elapsed times, the Accelerate function is faster than the Metal compute function. I think this is because I did not properly define the threads. How do I determine the optimum number of threads per grid and threads per thread group for this example?
// Figure out how many threads we need to use for our operation
let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)
let maxThreadsPerThreadgroup = adderComputePipelineState.maxTotalThreadsPerThreadgroup
let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
commandEncoder?.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
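On the literal sizing question: for a 1-D dispatch, a common pattern is to make the threadgroup width a multiple of the pipeline's SIMD width (threadExecutionWidth), capped by maxTotalThreadsPerThreadgroup; dispatchThreads already copes with a grid whose size is not a multiple of the threadgroup width. A sketch, reusing the names from the question (note that maxTotalThreadsPerThreadgroup is usually already a multiple of the SIMD width, so in practice this often matches the original code):

// Sketch: derive the threadgroup width from the pipeline's execution width.
let simdWidth = adderComputePipelineState.threadExecutionWidth
let maxThreads = adderComputePipelineState.maxTotalThreadsPerThreadgroup
let groupWidth = min(maxThreads, (maxThreads / simdWidth) * simdWidth)
let threadsPerThreadgroup = MTLSize(width: groupWidth, height: 1, depth: 1)
let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)
commandEncoder?.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)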
For Metal you are measuring the computation plus the data transfer from GPU to CPU, as well as the CPU-side work of reading the results back into an array.
You should use addCompletedHandler to measure the GPU computation time on its own.
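A sketch of that, assuming the same commandBuffer as in addMetal above; addCompletedHandler must be registered before commit, and gpuStartTime / gpuEndTime (macOS 10.15+ / iOS 10.3+) bracket only the GPU execution:

// Sketch: time only the GPU execution, excluding buffer setup and CPU readback.
commandBuffer?.addCompletedHandler { cb in
    let gpuSeconds = cb.gpuEndTime - cb.gpuStartTime
    print("Metal GPU kernel time is \(gpuSeconds) s")
}
commandBuffer?.commit()
commandBuffer?.waitUntilCompleted()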

Get/extract factorisation from SparseOpaqueFactorization using Accelerate

I am writing some linear algebra algorithms using Apple's Swift / Accelerate framework. All works, and the solved Ax = b equations produce the right results (this code is from the Apple examples).
I would like to be able to extract the LLT factorisation from the SparseOpaqueFactorization_Double object, but there doesn't seem to be any way to extract (to print) it. Does anyone know of a way of extracting the factorised matrix from the SparseOpaqueFactorization_Double object?
import Foundation
import Accelerate
print("Hello, World!")
// Example of a symmetric sparse matrix, empty cells represent zeros.
var rowIndices: [Int32] = [0, 1, 3, // Column 0
1, 2, 3, // Column 1
2, // col 2
3] // Col 3
// note that the Matrix representation is the upper triangular
// here. Since the matrix is symmetric, no need to store the lower
// triangular.
var values: [Double] = [10.0, 1.0 , 2.5, // Column 0
12.0, -0.3, 1.1, // Column 1
9.5, // Col 2
6.0 ] // Column 3
var columnStarts = [0, // Column 0
3, // Column 1
6, 7, // Column 2
8] // col 3
var attributes = SparseAttributes_t()
attributes.triangle = SparseLowerTriangle
attributes.kind = SparseSymmetric
let structure = SparseMatrixStructure(rowCount: 4,
columnCount: 4,
columnStarts: &columnStarts,
rowIndices: &rowIndices,
attributes: attributes,
blockSize: 1)
let llt: SparseOpaqueFactorization_Double = values.withUnsafeMutableBufferPointer { valuesPtr in
let a = SparseMatrix_Double(
structure: structure,
data: valuesPtr.baseAddress!
)
return SparseFactor(SparseFactorizationCholesky, a)
}
var bValues = [ 2.20, 2.85, 2.79, 2.87 ]
var xValues = [ 0.00, 0.00, 0.00, 0.00 ]
bValues.withUnsafeMutableBufferPointer { bPtr in
xValues.withUnsafeMutableBufferPointer { xPtr in
let b = DenseVector_Double(
count: 4,
data: bPtr.baseAddress!
)
let x = DenseVector_Double(
count: 4,
data: xPtr.baseAddress!
)
SparseSolve(llt, b, x)
}
}
for val in xValues {
print("x = " + String(format: "%.2f", val), terminator: " ")
}
print("")
print("Success")
OK, so after much sleuthing around the Apple Swift headers, I have solved this problem.
There is an Accelerate API call called
public func SparseCreateSubfactor(_ subfactor: SparseSubfactor_t, _ Factor: SparseOpaqueFactorization_Double) -> SparseOpaqueSubfactor_Double
which returns this SparseOpaqueSubfactor_ type. This can be used in a matrix multiplication to produce a "transparent" result (i.e. a matrix you can use/print/see). So I multiplied the SubFactor for the lower triangular part of the Cholesky factorisation by the identity matrix to extract the factors. Works a treat!
let subfactors = SparseCreateSubfactor(SparseSubfactorL, llt)
var identValues = generateIdentity(n)
ppm(identValues)
let sparseAs = SparseAttributes_t(transpose: false,
triangle: SparseUpperTriangle,
kind: SparseOrdinary,
_reserved: 0,
_allocatedBySparse: false)
let identity_m = DenseMatrix_Double(rowCount: Int32(n),
columnCount: Int32(n),
columnStride: Int32(n),
attributes: sparseAs,
data: &identValues)
SparseMultiply(subfactors, identity_m) // Output is in identity_m after the call
I wrote a small function to generate an identity matrix which I've used in the code above:
func generateIdentity(_ dimension: Int) -> [Double] {
    var iden = Array<Double>()
    for i in 0...dimension - 1 {
        for j in 0...dimension - 1 {
            if i == j {
                iden.append(1.0)
            } else {
                iden.append(0.0)
            }
        }
    }
    return iden
}
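To actually look at the extracted factor, remember that SparseMultiply leaves the result in identity_m's backing storage (identValues here) in column-major order with columnStride n. The ppm(_:) call above is presumably a pretty-printer along these lines (a hypothetical helper, not part of the original answer):

// Hypothetical helper: print an n x n column-major matrix such as identValues
// after SparseMultiply has overwritten it with the factor.
func printColumnMajor(_ values: [Double], n: Int) {
    for row in 0..<n {
        let line = (0..<n)
            .map { col in String(format: "%8.4f", values[col * n + row]) }
            .joined(separator: "  ")
        print(line)
    }
}

printColumnMajor(identValues, n: n)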

Perform normalization using Accelerate framework

I need to perform a simple math operation on Data that contains RGB pixel data. Currently I'm doing it like so:
let imageMean: Float = 127.5
let imageStd: Float = 127.5
let rgbData: Data // Some data containing RGB pixels
let floats = (0..<rgbData.count).map {
(Float(rgbData[$0]) - imageMean) / imageStd
}
return Data(bytes: floats, count: floats.count * MemoryLayout<Float>.size)
This works, but it's too slow. I was hoping I could use the Accelerate framework to calculate this faster, but have no idea how to do this. I reserved some space so that it's not allocated every time this function starts, like so:
inputBufferDataNormalized = malloc(width * height * 3) // 3 channels RGB
I tried a few functions, like vDSP_vasm, but I couldn't make it work. Can someone show me how to use it? Basically, I need to replace this map function, because it takes too long. And it would probably be great to keep using the pre-allocated space.
Following up on my comment on your other related question. You can use SIMD to parallelize the operation, but you'd need to split the original array into chunks.
This is a simplified example that assumes that the array is exactly divisible by 64, for example, an array of 1024 elements:
let arr: [Float] = (0 ..< 1024).map { _ in Float.random(in: 0...1) }
let imageMean: Float = 127.5
let imageStd: Float = 127.5
var chunks = [SIMD64<Float>]()
chunks.reserveCapacity(arr.count / 64)
for i in stride(from: 0, to: arr.count, by: 64) {
let v = SIMD64.init(arr[i ..< i+64])
chunks.append((v - imageMean) / imageStd) // same calculation using SIMD
}
You can now access each chunk with a subscript:
var results: [Float] = []
results.reserveCapacity(arr.count)
for chunk in chunks {
for i in chunk.indices {
results.append(chunk[i])
}
}
Of course, you'd need to deal with a remainder if the array isn't exactly divisible by 64.
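One way to handle that remainder is to process full chunks with SIMD and finish the tail scalar-wise (a sketch, reusing arr, imageMean and imageStd from above):

// Sketch: SIMD64 chunks for the bulk of the array, a scalar loop for the tail.
let chunkSize = 64
let fullChunks = arr.count / chunkSize
var results = [Float]()
results.reserveCapacity(arr.count)

for i in stride(from: 0, to: fullChunks * chunkSize, by: chunkSize) {
    let v = SIMD64<Float>(arr[i ..< i + chunkSize])
    let normalized = (v - imageMean) / imageStd
    for lane in normalized.indices { results.append(normalized[lane]) }
}
// Remainder that didn't fill a whole chunk
for i in (fullChunks * chunkSize) ..< arr.count {
    results.append((arr[i] - imageMean) / imageStd)
}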
I have found a way to do this using Accelerate. First I reserve space for the converted buffer like so:
var inputBufferDataRawFloat = [Float](repeating: 0, count: width * height * 3)
Then I can use it like so:
let rawBytes = [UInt8](rgbData)
vDSP_vfltu8(rawBytes, 1, &inputBufferDataRawFloat, 1, vDSP_Length(rawBytes.count))
vDSP.add(inputBufferDataRawScalars.mean, inputBufferDataRawFloat, result: &inputBufferDataRawFloat)
vDSP.multiply(inputBufferDataRawScalars.std, inputBufferDataRawFloat, result: &inputBufferDataRawFloat)
return Data(bytes: inputBufferDataRawFloat, count: inputBufferDataRawFloat.count * MemoryLayout<Float>.size)
Works very fast. Maybe there is a better function in Accelerate; if anyone knows of it, please let me know. It needs to perform the function (A[n] + B) * C (or, to be exact, (A[n] − B) / C, but the first one can be converted to this).
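If you want the whole (A[n] − B) / C step as a single call after the integer-to-float conversion, vDSP_vsmsa computes D[n] = A[n] * b + c, which matches with b = 1/std and c = −mean/std. A sketch, reusing the names from this answer and the question:

// Sketch: (x - mean) / std == x * (1/std) + (-mean/std), i.e. vDSP_vsmsa in one pass.
var scale: Float = 1.0 / imageStd
var offset: Float = -imageMean / imageStd
var normalized = [Float](repeating: 0, count: rawBytes.count)
vDSP_vfltu8(rawBytes, 1, &inputBufferDataRawFloat, 1, vDSP_Length(rawBytes.count))
vDSP_vsmsa(inputBufferDataRawFloat, 1,
           &scale, &offset,
           &normalized, 1,
           vDSP_Length(normalized.count))
return Data(bytes: normalized, count: normalized.count * MemoryLayout<Float>.size)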

Relative Strength Index in Swift

I am trying to code an RSI (which has been a good way for me to learn API data fetching and algorithms already).
The API I am fetching data from comes from a reputable exchange so I know the values my algorithm is analyzing are correct, that's a good start.
The issue I'm having is that the results of my calculations are completely off from what I can read on that particular exchange, which also provides an RSI indicator (I assume they analyze their own data, i.e. the same data I have).
I used the exact same API to translate the Ichimoku indicator into code and this time everything is correct! I believe my RSI calculations might be wrong somehow but I've checked and re-checked many times.
I also have a "literal" version of the code where every step is calculated like an excel sheet. It's pretty stupid in code but it validates the logic of the calculation and the results are the same as the following code.
Here is my code to calculate the RSI :
let period = 14
// Upward Movements and Downward Movements
var upwardMovements : [Double] = []
var downwardMovements : [Double] = []
for idx in 0..<15 {
let diff = items[idx + 1].close - items[idx].close
upwardMovements.append(max(diff, 0))
downwardMovements.append(max(-diff, 0))
}
// Average Upward Movements and Average Downward Movements
let averageUpwardMovement1 = upwardMovements[0..<period].reduce(0, +) / Double(period)
let averageDownwardMovement1 = downwardMovements[0..<period].reduce(0, +) / Double(period)
let averageUpwardMovement2 = (averageUpwardMovement1 * Double(period - 1) + upwardMovements[period]) / Double(period)
let averageDownwardMovement2 = (averageDownwardMovement1 * Double(period - 1) + downwardMovements[period]) / Double(period)
// Relative Strength
let relativeStrength1 = averageUpwardMovement1 / averageDownwardMovement1
let relativeStrength2 = averageUpwardMovement2 / averageDownwardMovement2
// Relative Strength Index
let rSI1 = 100 - (100 / (relativeStrength1 + 1))
let rSI2 = 100 - (100 / (relativeStrength2 + 1))
// Relative Strength Index Average
let relativeStrengthAverage = (rSI1 + rSI2) / 2
BitcoinRelativeStrengthIndex.bitcoinRSI = relativeStrengthAverage
Readings at 3:23pm this afternoon give 73.93 for my algorithm and 18.74 on the exchange. As the markets are crashing right now and I have access to different RSIs on different exchanges, they all display an RSI below 20 so my calculations are off.
Do you guys have any idea?
I am answering this 2 years later, but hopefully it helps someone.
RSI gets more precise the more data points you feed into it. For a default RSI period of 14, you should have at least 200 previous data points. The more, the better!
Let's suppose you have an array of close candle prices for a given market. The following function will return RSI values for each candle. You should always ignore the first data points, since they are not precise enough: the running averages haven't yet seen a full 14 candles (or whatever your period number is).
func computeRSI(on prices: [Double], periods: Int = 14, minimumPoints: Int = 200) -> [Double] {
    precondition(periods > 1 && minimumPoints > periods && prices.count >= minimumPoints)

    return Array(unsafeUninitializedCapacity: prices.count) { (buffer, count) in
        buffer.initialize(repeating: 50)

        var (previousPrice, gain, loss) = (prices[0], 0.0, 0.0)
        for p in stride(from: 1, through: periods, by: 1) {
            let price = prices[p]
            let value = price - previousPrice
            if value > 0 {
                gain += value
            } else {
                loss -= value
            }
            previousPrice = price
        }

        let (numPeriods, numPeriodsMinusOne) = (Double(periods), Double(periods &- 1))
        var avg = (gain: gain / numPeriods, loss: loss / numPeriods)
        buffer[periods] = (avg.loss > .zero) ? 100 - 100 / (1 + avg.gain/avg.loss) : 100

        for p in stride(from: periods &+ 1, to: prices.count, by: 1) {
            let price = prices[p]
            avg.gain *= numPeriodsMinusOne
            avg.loss *= numPeriodsMinusOne

            let value = price - previousPrice
            if value > 0 {
                avg.gain += value
            } else {
                avg.loss -= value
            }

            avg.gain /= numPeriods
            avg.loss /= numPeriods

            if avg.loss > .zero {
                buffer[p] = 100 - 100 / (1 + avg.gain/avg.loss)
            } else {
                buffer[p] = 100
            }

            previousPrice = price
        }

        count = prices.count
    }
}
Please note that the code is very imperative, to reduce the number of operations/loops and get maximum compiler optimization. You might be able to squeeze out more performance using the Accelerate framework, though. We also handle the edge case where a period range contains only gains (zero average loss).
If you want a running RSI calculation, just store the last RSI value and perform the RSI equation for the new price.
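A sketch of that running update; it assumes you keep the previous close and the last smoothed averages (avg.gain / avg.loss from the function above) between calls, rather than only the last RSI value, since that is what Wilder's smoothing actually needs:

// Sketch of a running (Wilder-smoothed) RSI update. State carried between calls:
// the previous close and the previous smoothed averages from computeRSI above.
struct RunningRSI {
    var previousPrice: Double
    var avgGain: Double
    var avgLoss: Double
    let periods: Double

    mutating func update(with price: Double) -> Double {
        let change = price - previousPrice
        previousPrice = price
        // Same smoothing as the batch version: scale by (n-1), add the new move, divide by n.
        avgGain = (avgGain * (periods - 1) + max(change, 0)) / periods
        avgLoss = (avgLoss * (periods - 1) + max(-change, 0)) / periods
        return avgLoss > 0 ? 100 - 100 / (1 + avgGain / avgLoss) : 100
    }
}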

vDSP_conv occasionally returns NANs

I'm using vDSP_conv to perform autocorrelation. Mostly it works just fine but every so often it's filling the output array with NaNs.
The code:
func corr_test() {
var pass = 0
var x = [Float]()
for i in 0..<2000 {
x.append(Float(i))
}
while true {
print("pass \(pass)")
let corr = autocorr(x)
if corr[1].isNaN {
print("!!!")
}
pass += 1
}
}
func autocorr(a: [Float]) -> [Float] {
let resultLen = a.count * 2 + 1
let padding = [Float].init(count: a.count, repeatedValue: 0.0)
let a_pad = padding + a + padding
var result = [Float].init(count: resultLen, repeatedValue: 0.0)
vDSP_conv(a_pad, 1, a_pad, 1, &result, 1, UInt(resultLen), UInt(a_pad.count))
return result
}
The output:
pass ...
pass 169
pass 170
pass 171
(lldb) p corr
([Float]) $R0 = 4001 values {
[0] = 2.66466637E+9
[1] = NaN
[2] = NaN
[3] = NaN
[4] = NaN
...
I'm not sure what's going on here. I think I'm handling the 0 padding correctly since if I weren't I don't think I'd be getting correct results 99% of the time.
Ideas? Gracias.
Figured it out. The key was this comment from https://developer.apple.com/library/mac/samplecode/vDSPExamples/Listings/DemonstrateConvolution_c.html :
// “The signal length is padded a bit. This length is not actually passed to the vDSP_conv routine; it is the number of elements
// that the signal array must contain. The SignalLength defined below is used to allocate space, and it is the filter length
// rounded up to a multiple of four elements and added to the result length. The extra elements give the vDSP_conv routine
// leeway to perform vector-load instructions, which load multiple elements even if they are not all used. If the caller did not
// guarantee that memory beyond the values used in the signal array were accessible, a memory access violation might result.”
“Padded a bit.” Thanks for being so specific. Anyway here's the final working product:
func autocorr(a: [Float]) -> [Float] {
let filterLen = a.count
let resultLen = filterLen * 2 - 1
let signalLen = ((filterLen + 3) & 0xFFFFFFFC) + resultLen
let padding1 = [Float].init(count: a.count - 1, repeatedValue: 0.0)
let padding2 = [Float].init(count: (signalLen - padding1.count - a.count), repeatedValue: 0.0)
let signal = padding1 + a + padding2
var result = [Float].init(count: resultLen, repeatedValue: 0.0)
vDSP_conv(signal, 1, a, 1, &result, 1, UInt(resultLen), UInt(filterLen))
// Remove the first n-1 values which are just mirrored from the end so that [0] always has the autocorrelation.
result.removeFirst(filterLen - 1)
return result
}
Note that the results here aren't normalized.
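If you do want normalized values, dividing every lag by the zero-lag term is enough; after the removeFirst call, that term sits at index 0 of result (inside autocorr, just before the return). A sketch:

// Sketch: normalize the autocorrelation so the zero-lag value becomes 1.0.
var zeroLag = result[0]
var normalized = [Float](repeating: 0, count: result.count)
vDSP_vsdiv(result, 1, &zeroLag, &normalized, 1, vDSP_Length(result.count))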