what causes the iteration of an array to conclude prematurely in metal?

what causes the iteration of an array to conclude prematurely in metal? - swift

What I'm trying to do
I'm testing out metals capability to work with loops. Since I can't define new constants in metal, I'm passing a uint into the buffer and use it to iterate over an array filled with integers. It looks like this in swift.
let array1: [Int] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
The problem(s)
However when reading the result array buffer in Swift after completing the loop in metal, it seems like not every element has been allocated.
#include <metal_stdlib>
using namespace metal;
kernel void shader(constant int *arr [[ buffer(0) ]],
device int *resultArray [[ buffer(1) ]],
constant uint &iter [[ buffer(2) ]]) // value of 12
{
for (uint i = 0; i < iter; i++){
resultArray[i] = arr[i];
}
}
out
1
2
3
4
5
6
0
0
0
0
0
0
Similarly, using the iterator to set allocate each element of resultArray, yields strange results
for (uint i = 0; i < iter; i++){
resultArray[i] = i;
}
out
4294967296
12884901890
21474836484
30064771078
38654705672
47244640266
0
0
0
0
0
0
Multiplication seems to work
for (uint i = 0; i < iter; i++){
resultArray[i] = arr[i] * i;
}
out
0
4
12
24
40
60
0
0
0
0
0
0
Addition does not
for (uint i = 0; i < iter; i++){
resultArray[i] = arr[i] + i;
}
out
4294967297
12884901892
21474836487
30064771082
38654705677
47244640272
0
0
0
0
0
0
When however, I set iter to a value of for example 24 or higher, it at least iterated over the whole arrays of size 12.
for (uint i = 0; i < iter; i++){ // iter now value of 100
resultArray[i] = arr[i] * iter;
}
100
200
300
400
500
600
100
200
300
400
500
600
What is going on here?
MCVE
yes, it's a lot of code to get a simple loop running in metal, please bare with me
main.swift
import MetalKit
let array1: [Int] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
func gpuProcess(arr1: [Int]) {
let size = arr1.count // value of 12
// GPU we want to use
let device = MTLCreateSystemDefaultDevice()
// Fifo queue for sending commands to the gpu
let commandQueue = device?.makeCommandQueue()
// The library for getting our metal functions
let gpuFunctionLibrary = device?.makeDefaultLibrary()
// Grab gpu function
let additionGPUFunction = gpuFunctionLibrary?.makeFunction(name: "shader")
var additionComputePipelineState: MTLComputePipelineState!
do {
additionComputePipelineState = try device?.makeComputePipelineState(function: additionGPUFunction!)
} catch {
print(error)
}
// Create buffers to be sent to the gpu from our array
let arr1Buff = device?.makeBuffer(bytes: arr1,
length: MemoryLayout<Int>.size * size ,
options: .storageModeShared)
let resultBuff = device?.makeBuffer(length: MemoryLayout<Int>.size * size,
options: .storageModeShared)
// Create the buffer to be sent to the command queue
let commandBuffer = commandQueue?.makeCommandBuffer()
// Create an encoder to set values on the compute function
let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
commandEncoder?.setComputePipelineState(additionComputePipelineState)
// Set the parameters of our gpu function
commandEncoder?.setBuffer(arr1Buff, offset: 0, index: 0)
commandEncoder?.setBuffer(resultBuff, offset: 0, index: 1)
// Set parameters for our iterator
var count = size
commandEncoder?.setBytes(&count, length: MemoryLayout.size(ofValue: count), index: 2)
// Figure out how many threads we need to use for our operation
let threadsPerGrid = MTLSize(width: 1, height: 1, depth: 1)
let maxThreadsPerThreadgroup = additionComputePipelineState.maxTotalThreadsPerThreadgroup // 1024
let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
commandEncoder?.dispatchThreads(threadsPerGrid,
threadsPerThreadgroup: threadsPerThreadgroup)
// Tell encoder that it is done encoding. Now we can send this off to the gpu.
commandEncoder?.endEncoding()
// Push this command to the command queue for processing
commandBuffer?.commit()
// Wait until the gpu function completes before working with any of the data
commandBuffer?.waitUntilCompleted()
// Get the pointer to the beginning of our data
var resultBufferPointer = resultBuff?.contents().bindMemory(to: Int.self,
capacity: MemoryLayout<Int>.size * size)
// Print out all of our new added together array information
for _ in 0..<size {
print("\(Int(resultBufferPointer!.pointee) as Any)")
resultBufferPointer = resultBufferPointer?.advanced(by: 1)
}
}
// Call function
gpuProcess(arr1: array1)
compute.metal
#include <metal_stdlib>
using namespace metal;
kernel void shader(constant int *arr [[ buffer(0) ]],
device int *resultArray [[ buffer(1) ]],
constant uint &iter [[ buffer(2) ]]) // value of 12
{
for (uint i = 0; i < iter; i++){
resultArray[i] = arr[i] * iter;
}
}

You are using 64 bit Int in Swift and 32 bit integers in MSL. Your GPU threads are also overlapping their work. Instead, use Int32 in Swift and make each thread process their own piece of data. Like this
import MetalKit
let array1: [Int32] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
func gpuProcess(arr1: [Int32]) {
let size = arr1.count // value of 12
// GPU we want to use
let device = MTLCreateSystemDefaultDevice()
// Fifo queue for sending commands to the gpu
let commandQueue = device?.makeCommandQueue()
// The library for getting our metal functions
let gpuFunctionLibrary = device?.makeDefaultLibrary()
// Grab gpu function
let additionGPUFunction = gpuFunctionLibrary?.makeFunction(name: "shader")
var additionComputePipelineState: MTLComputePipelineState!
do {
additionComputePipelineState = try device?.makeComputePipelineState(function: additionGPUFunction!)
} catch {
print(error)
}
// Create buffers to be sent to the gpu from our array
let arr1Buff = device?.makeBuffer(bytes: arr1,
length: MemoryLayout<Int32>.stride * size ,
options: .storageModeShared)
let resultBuff = device?.makeBuffer(length: MemoryLayout<Int32>.stride * size,
options: .storageModeShared)
// Create the buffer to be sent to the command queue
let commandBuffer = commandQueue?.makeCommandBuffer()
// Create an encoder to set values on the compute function
let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
commandEncoder?.setComputePipelineState(additionComputePipelineState)
// Set the parameters of our gpu function
commandEncoder?.setBuffer(arr1Buff, offset: 0, index: 0)
commandEncoder?.setBuffer(resultBuff, offset: 0, index: 1)
// Set parameters for our iterator
var count = size
commandEncoder?.setBytes(&count, length: MemoryLayout.size(ofValue: count), index: 2)
// Figure out how many threads we need to use for our operation
let threadsPerGrid = MTLSize(width: 1, height: 1, depth: 1)
let maxThreadsPerThreadgroup = additionComputePipelineState.maxTotalThreadsPerThreadgroup // 1024
let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
commandEncoder?.dispatchThreads(threadsPerGrid,
threadsPerThreadgroup: threadsPerThreadgroup)
// Tell encoder that it is done encoding. Now we can send this off to the gpu.
commandEncoder?.endEncoding()
// Push this command to the command queue for processing
commandBuffer?.commit()
// Wait until the gpu function completes before working with any of the data
commandBuffer?.waitUntilCompleted()
// Get the pointer to the beginning of our data
var resultBufferPointer = resultBuff?.contents().bindMemory(to: Int32.self,
capacity: MemoryLayout<Int32>.stride * size)
// Print out all of our new added together array information
for _ in 0..<size {
print("\(Int32(resultBufferPointer!.pointee) as Any)")
resultBufferPointer = resultBufferPointer?.advanced(by: 1)
}
}
// Call function
gpuProcess(arr1: array1)
Kernel:
#include <metal_stdlib>
using namespace metal;
kernel void shader(constant int *arr [[ buffer(0) ]],
device int *resultArray [[ buffer(1) ]],
constant uint &iter [[ buffer(2) ]],
uint gid [[ thread_position_in_grid ]], // this is thread index in grid, since you have height and depth of a dispatch set to 1 in CPU code, you can use 1D `int` here.
)
{
// Early out if gid is out of array boudns
if(gid >= iter)
{
return;
}
// Each thread processes it's own data
resultArray[gid] = arr[gid] * iter;
}
For more information on how to use Metal for compute refer to developer docs and for the information about attributes such as thread_position_in_grid refer to Metal Shading Language specification.

Related

Matching Torch STFT with Accelerate

Im trying to re-implement Torch's STFT code in Swift with Accelerate / vDSP, to produce a Log Mel Spectrogram by post processing the STFT so I can use the Mel Spectrogram as an input for a CoreML port of OpenAI's Whisper
Pytorch's native STFT / Mel code produces this Spectrogram (its clipped due to importing raw float 32s into Photoshop lol)
and mine:
Obviously the two things to notice are the values, and the lifted frequency components.
The STFT Docs here https://pytorch.org/docs/stable/generated/torch.stft.html
X[ω,m]=
k=0
∑
win_length-1

window[k] input[m×hop_length+k] * exp(−j * (2π⋅ωk) /win_length)
I believe Im properly handling window[k] input[m×hop_length+k] but I'm a bit lost as to how to calculate the exponent and what -J is referring to in the documentation, and how to convert the final exponential in vDSP. Also, if its a sum, how do I get the 200 elements I need!?
My Log Mel Spectrogram
My code follows:
func processData(audio: [Int16]) -> [Float]
{
assert(self.sampleCount == audio.count)
var audioFloat:[Float] = [Float](repeating: 0, count: audio.count)
vDSP.convertElements(of: audio, to: &audioFloat)
vDSP.divide(audioFloat, 32768.0, result: &audioFloat)
// Up to this point, Python and swift are numerically identical
// insert numFFT/2 samples before and numFFT/2 after so we have a extra numFFT amount to process
// TODO: Is this stricly necessary?
audioFloat.insert(contentsOf: [Float](repeating: 0, count: self.numFFT/2), at: 0)
audioFloat.append(contentsOf: [Float](repeating: 0, count: self.numFFT/2))
// Split Complex arrays holding the FFT results
var allSampleReal = [[Float]](repeating: [Float](repeating: 0, count: self.numFFT/2), count: self.melSampleCount)
var allSampleImaginary = [[Float]](repeating: [Float](repeating: 0, count: self.numFFT/2), count: self.melSampleCount)
// Step 2 - we need to create 200 x 3000 matrix of STFTs - note we appear to want to output complex numbers (?)
for (m) in 0 ..< self.melSampleCount
{
// Slice numFFTs every hop count (barf) and make a mel spectrum out of it
// audioFrame ends up holding split complex numbers
var audioFrame = Array<Float>( audioFloat[ (m * self.hopCount) ..< ( (m * self.hopCount) + self.numFFT) ] )
// Copy of audioFrame original samples
let audioFrameOriginal = audioFrame
assert(audioFrame.count == self.numFFT)
// Split Complex arrays holding a single FFT result of our Audio Frame, which gets appended to the allSample Split Complex arrays
var sampleReal:[Float] = [Float](repeating: 0, count: self.numFFT/2)
var sampleImaginary:[Float] = [Float](repeating: 0, count: self.numFFT/2)
sampleReal.withUnsafeMutableBytes { unsafeReal in
sampleImaginary.withUnsafeMutableBytes { unsafeImaginary in
vDSP.multiply(audioFrame,
hanningWindow,
result: &audioFrame)
var complexSignal = DSPSplitComplex(realp: unsafeReal.bindMemory(to: Float.self).baseAddress!,
imagp: unsafeImaginary.bindMemory(to: Float.self).baseAddress!)
audioFrame.withUnsafeBytes { unsafeAudioBytes in
vDSP.convert(interleavedComplexVector: [DSPComplex](unsafeAudioBytes.bindMemory(to: DSPComplex.self)),
toSplitComplexVector: &complexSignal)
}
// Step 3 - creating the FFT
self.fft.forward(input: complexSignal, output: &complexSignal)
}
}
// We need to match: https://pytorch.org/docs/stable/generated/torch.stft.html
// At this point, I'm unsure how to continue?
// let twoπ = Float.pi * 2
// let freqstep:Float = Float(16000 / (self.numFFT/2))
//
// var w:Float = 0.0
// for (k) in 0 ..< self.numFFT/2
// {
// let j:Float = sampleImaginary[k]
// let sample = audioFrame[k]
//
// let exponent = -j * ( (twoπ * freqstep * Float(k) ) / Float((self.numFFT/2)))
//
// w += powf(sample, exponent)
// }
allSampleReal[m] = sampleReal
allSampleImaginary[m] = sampleImaginary
}
// We now have allSample Split Complex holding 3000 200 dimensional real and imaginary FFT results
// We create flattened 3000 x 200 array of DSPSplitComplex values
var flattnedReal:[Float] = allSampleReal.flatMap { $0 }
var flattnedImaginary:[Float] = allSampleImaginary.flatMap { $0 }

Determine number of threads for element-wise array addition in Metal

In this example there are two large 1D arrays of size n. The arrays are added together element-wise to calculate a 1D results array using the Accelerate vDSP.add() function and a Metal GPU compute kernel adder().
// Size of each array
private let n = 5_000_000
// Create two random arrays of size n
private var array1 = (1...n).map{ _ in Float.random(in: 1...10) }
private var array2 = (1...n).map{ _ in Float.random(in: 1...10) }
// Add two arrays using Accelerate vDSP
addAccel(array1, array2)
// Add two arrays using Metal on the GPU
addMetal(array1, array2)
The Accelerate code is shown below:
import Accelerate
func addAccel(_ arr1: [Float], _ arr2: [Float]) {
let tic = DispatchTime.now().uptimeNanoseconds
// Add two arrays and store results
let y = vDSP.add(arr1, arr2)
// Print out elapsed time
let toc = DispatchTime.now().uptimeNanoseconds
let elapsed = Float(toc - tic) / 1_000_000_000
print("\nAccelerate vDSP elapsed time is \(elapsed) s")
// Print out some results
for i in 0..<3 {
let a1 = String(format: "%.4f", arr1[i])
let a2 = String(format: "%.4f", arr2[i])
let y = String(format: "%.4f", y[i])
print("\(a1) + \(a2) = \(y)")
}
}
The Metal code is shown below:
import MetalKit
private func setupMetal(arr1: [Float], arr2: [Float]) -> (MTLCommandBuffer?, MTLBuffer?) {
// Get the Metal GPU device
let device = MTLCreateSystemDefaultDevice()
// Queue for sending commands to the GPU
let commandQueue = device?.makeCommandQueue()
// Get our Metal GPU function
let gpuFunctionLibrary = device?.makeDefaultLibrary()
let adderGpuFunction = gpuFunctionLibrary?.makeFunction(name: "adder")
var adderComputePipelineState: MTLComputePipelineState!
do {
adderComputePipelineState = try device?.makeComputePipelineState(function: adderGpuFunction!)
} catch {
print(error)
}
// Create the buffers to be sent to the GPU from our arrays
let count = arr1.count
let arr1Buff = device?.makeBuffer(bytes: arr1,
length: MemoryLayout<Float>.size * count,
options: .storageModeShared)
let arr2Buff = device?.makeBuffer(bytes: arr2,
length: MemoryLayout<Float>.size * count,
options: .storageModeShared)
let resultBuff = device?.makeBuffer(length: MemoryLayout<Float>.size * count,
options: .storageModeShared)
// Create a buffer to be sent to the command queue
let commandBuffer = commandQueue?.makeCommandBuffer()
// Create an encoder to set values on the compute function
let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
commandEncoder?.setComputePipelineState(adderComputePipelineState)
// Set the parameters of our GPU function
commandEncoder?.setBuffer(arr1Buff, offset: 0, index: 0)
commandEncoder?.setBuffer(arr2Buff, offset: 0, index: 1)
commandEncoder?.setBuffer(resultBuff, offset: 0, index: 2)
// Figure out how many threads we need to use for our operation
let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)
let maxThreadsPerThreadgroup = adderComputePipelineState.maxTotalThreadsPerThreadgroup
let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
commandEncoder?.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
// Tell the encoder that it is done encoding. Now we can send this off to the GPU.
commandEncoder?.endEncoding()
return (commandBuffer, resultBuff)
}
func addMetal(_ arr1: [Float], _ arr2: [Float]) {
let (commandBuffer, resultBuff) = setupMetal(arr1: arr1, arr2: arr2)
let tic = DispatchTime.now().uptimeNanoseconds
// Push this command to the command queue for processing
commandBuffer?.commit()
// Wait until the GPU function completes before working with any of the data
commandBuffer?.waitUntilCompleted()
// Get the pointer to the beginning of our data
let count = arr1.count
var resultBufferPointer = resultBuff?.contents().bindMemory(to: Float.self, capacity: MemoryLayout<Float>.size * count)
// Print out elapsed time
let toc = DispatchTime.now().uptimeNanoseconds
let elapsed = Float(toc - tic) / 1_000_000_000
print("\nMetal GPU elapsed time is \(elapsed) s")
// Print out the results
for i in 0..<3 {
let a1 = String(format: "%.4f", arr1[i])
let a2 = String(format: "%.4f", arr2[i])
let y = String(format: "%.4f", Float(resultBufferPointer!.pointee))
print("\(a1) + \(a2) = \(y)")
resultBufferPointer = resultBufferPointer?.advanced(by: 1)
}
}
#include <metal_stdlib>
using namespace metal;
kernel void adder(
constant float *array1 [[ buffer(0) ]],
constant float *array2 [[ buffer(1) ]],
device float *result [[ buffer(2) ]],
uint index [[ thread_position_in_grid ]])
{
result[index] = array1[index] + array2[index];
}
Results from running the above code on a 2019 MacBook Pro are given below. Specs for the laptop are 2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4, Intel UHD Graphics 630 1536 MB, and AMD Radeon Pro 5500M.
Accelerate vDSP elapsed time is 0.004532601 s
7.8964 + 6.3815 = 14.2779
9.3661 + 8.9641 = 18.3301
4.5389 + 8.5737 = 13.1126
Metal GPU elapsed time is 0.012219718 s
7.8964 + 6.3815 = 14.2779
9.3661 + 8.9641 = 18.3301
4.5389 + 8.5737 = 13.1126
Based on the elapsed times, the Accelerate function is faster than the Metal compute function. I think this is because I did not properly define the threads. How do I determine the optimum number of threads per grid and threads per thread group for this example?
// Figure out how many threads we need to use for our operation
let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)
let maxThreadsPerThreadgroup = adderComputePipelineState.maxTotalThreadsPerThreadgroup
let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
commandEncoder?.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)

For metal you are measuring time in both computation and data transfer from GPU to CPU and also creating array on CPU.
You should use addcompletedhandler for gpu compution time

Convert UInt32 to 4 bytes Swift

How do I convert a UInt32 value to 4 bytes in swift?
I have a value of (3) when I get;
IPP_ORIENTATION.PORTRAIT.rawValue
Now, I need to convert that value into 4 bytes.
Thanks.

let value: UInt32 = 1
var u32LE = value.littleEndian // or simply value
let dataLE = Data(bytes: &u32LE, count: 4)
let bytesLE = Array(dataLE) // [1, 0, 0, 0]
var u32BE = value.bigEndian
let dataBE = Data(bytes: &u32BE, count: 4)
let bytesBE = Array(dataBE) // [0, 0, 0, 1]

Without debug mode UnsafePointer withMemoryRebound will gives wrong value

Here I am trying to concatenate 5 bytes into the single Integer value, I am getting an issue with UnsafePointer withMemoryRebound method.
when I am debugging and checking logs it will gives the correct value. But when I try without debug, it will give the wrong value.(4 out of 5 times wrong value). I got confuses on this API. Is it correct way I am using?
case 1:
let data = [UInt8](rowData) // rowData is type of Data class
let totalKM_BitsArray = [data[8],data[7],data[6],data[5],data[4]]
self.totalKm = UnsafePointer(totalKM_BitsArray).withMemoryRebound(to:UInt64.self, capacity: 1) {$0.pointee}
case 2:
Below code will work for both Enable or Disable debug mode And gives the correct value.
let byte0 : UInt64 = UInt64(data[4])<<64
let byte1 : UInt64 = UInt64(data[5])<<32
let byte2 : UInt64 = UInt64(data[6])<<16
let byte3 : UInt64 = UInt64(data[7])<<8
let byte4 : UInt64 = UInt64(data[8])
self.totalKm = byte0 | byte1 | byte2 | byte3 | byte4
Please suggest me UnsafePointer way of using? Why will this issue come?
Addtional infomation :
let totalKm : UInt64
let data = [UInt8](rowData) // data contain [100, 200, 28, 155, 0, 0, 0, 26, 244, 0, 0, 0, 45, 69, 0, 0, 0, 4, 246]
let totalKM_BitsArray = [data[8],data[7],data[6],data[5],data[4]] // contain [ 244,26,0,0,0]
self.totalKm = UnsafePointer(totalKM_BitsArray).withMemoryRebound(to:UInt64.self, capacity: 1) {$0.pointee}
// when print log gives correct value, when run on device give wrong 3544649566089386 like this.
self.totalKm = byte0 | byte1 | byte2 | byte3 | byte4
// output is 6900 This is correct as expected

There are a few problems with this approach:
let data = [UInt8](rowData) // rowData is type of Data class
let totalKM_BitsArray = [data[8], data[7], data[6], data[5], data[4]]
self.totalKm = UnsafePointer(totalKM_BitsArray)
.withMemoryRebound(to:UInt64.self, capacity: 1) { $0.pointee }
Dereferencing UnsafePointer(totalKM_BitsArray) is undefined behaviour, as the pointer to totalKM_BitsArray's buffer is only temporarily valid for the duration of the initialiser call (hopefully at some point in the future Swift will warn on such constructs).
You're trying to bind only 5 instances of UInt8 to UInt64, so the remaining 3 instances will be garbage.
You can only withMemoryRebound(_:) between types of the same size and stride; which is not the case for UInt8 and UInt64.
It's dependant on the endianness of your platform; data[8] will be the least significant byte on a little endian platform, but the most significant byte on a big endian platform.
Your implementation with bit shifting avoids all of these problems (and is generally the safer way to go as you don't have to consider things like layout compatibility, alignment, and pointer aliasing).
However, assuming that you just wanted to pad out your data with zeroes for the most significant bytes, with rowData[4] to rowData[8] making up the rest of the less significant bytes, then you'll want your bit-shifting implementation to look like this:
let rowData = Data([
100, 200, 28, 155, 0, 0, 0, 26, 244, 0, 0, 0, 45, 69, 0, 0, 0, 4, 246
])
let byte0 = UInt64(rowData[4]) << 32
let byte1 = UInt64(rowData[5]) << 24
let byte2 = UInt64(rowData[6]) << 16
let byte3 = UInt64(rowData[7]) << 8
let byte4 = UInt64(rowData[8])
let totalKm = byte0 | byte1 | byte2 | byte3 | byte4
print(totalKm) // 6900
or, iteratively:
var totalKm: UInt64 = 0
for byte in rowData[4 ... 8] {
totalKm = (totalKm << 8) | UInt64(byte)
}
print(totalKm) // 6900
or, using reduce(_:_:):
let totalKm = rowData[4 ... 8].reduce(0 as UInt64) { accum, byte in
(accum << 8) | UInt64(byte)
}
print(totalKm) // 6900
We can even abstract this into an extension on Data in order to make it easier to load such fixed width integers:
enum Endianness {
case big, little
}
extension Data {
/// Loads the type `I` from the buffer. If there aren't enough bytes to
/// represent `I`, the most significant bits are padded with zeros.
func load<I : FixedWidthInteger>(
fromByteOffset offset: Int = 0, as type: I.Type, endianness: Endianness = .big
) -> I {
let (wholeBytes, spareBits) = I.bitWidth.quotientAndRemainder(dividingBy: 8)
let bytesToRead = Swift.min(count, spareBits == 0 ? wholeBytes : wholeBytes + 1)
let range = startIndex + offset ..< startIndex + offset + bytesToRead
let bytes: Data
switch endianness {
case .big:
bytes = self[range]
case .little:
bytes = Data(self[range].reversed())
}
return bytes.reduce(0) { accum, byte in
(accum << 8) | I(byte)
}
}
}
We're doing a bit of extra work here in order to we read the right number of bytes, as well as being able to handle both big and little endian. But now that we've written it, we can simply write:
let totalKm = rowData[4 ... 8].load(as: UInt64.self)
print(totalKm) // 6900
Note that so far I've assumed that the Data you're getting is zero-indexed. This is safe for the above examples, but isn't necessarily safe depending on where the data is coming from (as it could be a slice). You should be able to do Data(someUnknownDataValue) in order to get a zero-indexed data value that you can work with, although unfortunately I don't believe there's any documentation that guarantees this.
In order to ensure you're correctly indexing an arbitrary Data value, you can define the following extension in order to perform the correct offsetting in the case where you're dealing with a slice:
extension Data {
subscript(offset offset: Int) -> Element {
get { return self[startIndex + offset] }
set { self[startIndex + offset] = newValue }
}
subscript<R : RangeExpression>(
offset range: R
) -> SubSequence where R.Bound == Index {
get {
let concreteRange = range.relative(to: self)
return self[startIndex + concreteRange.lowerBound ..<
startIndex + concreteRange.upperBound]
}
set {
let concreteRange = range.relative(to: self)
self[startIndex + concreteRange.lowerBound ..<
startIndex + concreteRange.upperBound] = newValue
}
}
}
Which you can use then call as e.g data[offset: 4] or data[offset: 4 ... 8].load(as: UInt64.self).
Finally it's worth noting that while you could probably implement this as a re-interpretation of bits by using Data's withUnsafeBytes(_:) method:
let rowData = Data([
100, 200, 28, 155, 0, 0, 0, 26, 244, 0, 0, 0, 45, 69, 0, 0, 0, 4, 246
])
let kmData = Data([0, 0, 0] + rowData[4 ... 8])
let totalKm = kmData.withUnsafeBytes { buffer in
UInt64(bigEndian: buffer.load(as: UInt64.self))
}
print(totalKm) // 6900
This is relying on Data's buffer being 64-bit aligned, which isn't guaranteed. You'll get a runtime error for attempting to load a misaligned value, for example:
let data = Data([0x01, 0x02, 0x03])
let i = data[1...].withUnsafeBytes { buffer in
buffer.load(as: UInt16.self) // Fatal error: load from misaligned raw pointer
}
By loading individual UInt8 values instead and performing bit shifting, we can avoid such alignment issues (however if/when UnsafeMutableRawPointer supports unaligned loads, this will no longer be an issue).

vDSP_conv occasionally returns NANs

I'm using vDSP_conv to perform autocorrelation. Mostly it works just fine but every so often it's filling the output array with NaNs.
The code:
func corr_test() {
var pass = 0
var x = [Float]()
for i in 0..<2000 {
x.append(Float(i))
}
while true {
print("pass \(pass)")
let corr = autocorr(x)
if corr[1].isNaN {
print("!!!")
}
pass += 1
}
}
func autocorr(a: [Float]) -> [Float] {
let resultLen = a.count * 2 + 1
let padding = [Float].init(count: a.count, repeatedValue: 0.0)
let a_pad = padding + a + padding
var result = [Float].init(count: resultLen, repeatedValue: 0.0)
vDSP_conv(a_pad, 1, a_pad, 1, &result, 1, UInt(resultLen), UInt(a_pad.count))
return result
}
The output:
pass ...
pass 169
pass 170
pass 171
(lldb) p corr
([Float]) $R0 = 4001 values {
[0] = 2.66466637E+9
[1] = NaN
[2] = NaN
[3] = NaN
[4] = NaN
...
I'm not sure what's going on here. I think I'm handling the 0 padding correctly since if I weren't I don't think I'd be getting correct results 99% of the time.
Ideas? Gracias.

Figured it out. The key was this comment from https://developer.apple.com/library/mac/samplecode/vDSPExamples/Listings/DemonstrateConvolution_c.html :
// “The signal length is padded a bit. This length is not actually passed to the vDSP_conv routine; it is the number of elements
// that the signal array must contain. The SignalLength defined below is used to allocate space, and it is the filter length
// rounded up to a multiple of four elements and added to the result length. The extra elements give the vDSP_conv routine
// leeway to perform vector-load instructions, which load multiple elements even if they are not all used. If the caller did not
// guarantee that memory beyond the values used in the signal array were accessible, a memory access violation might result.”
“Padded a bit.” Thanks for being so specific. Anyway here's the final working product:
func autocorr(a: [Float]) -> [Float] {
let filterLen = a.count
let resultLen = filterLen * 2 - 1
let signalLen = ((filterLen + 3) & 0xFFFFFFFC) + resultLen
let padding1 = [Float].init(count: a.count - 1, repeatedValue: 0.0)
let padding2 = [Float].init(count: (signalLen - padding1.count - a.count), repeatedValue: 0.0)
let signal = padding1 + a + padding2
var result = [Float].init(count: resultLen, repeatedValue: 0.0)
vDSP_conv(signal, 1, a, 1, &result, 1, UInt(resultLen), UInt(filterLen))
// Remove the first n-1 values which are just mirrored from the end so that [0] always has the autocorrelation.
result.removeFirst(filterLen - 1)
return result
}
Note that the results here aren't normalized.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

what causes the iteration of an array to conclude prematurely in metal? - swift

Related

Matching Torch STFT with Accelerate

Determine number of threads for element-wise array addition in Metal

Convert UInt32 to 4 bytes Swift

Without debug mode UnsafePointer withMemoryRebound will gives wrong value

vDSP_conv occasionally returns NANs

Categories

Resources