Compute sum of array values in parallel with Metal (Swift)

I am trying to compute the sum of a large array in parallel with Metal and Swift.
Is there a good way to do it?
My plan was to divide the array into sub-arrays, compute the sums of the sub-arrays in parallel, and then, when the parallel computation is finished, compute the sum of the sub-sums.
For example, if I have
array = [a_0, ..., a_n]
I divide the array into sub-arrays:
array_1 = [a_0, ..., a_i],
array_2 = [a_(i+1), ..., a_2i],
...
array_(n/i) = [a_(n-i+1), ..., a_n]
The sums for these sub-arrays are computed in parallel, giving
sum_1, sum_2, sum_3, ..., sum_(n/i)
and at the end I just compute the sum of the sub-sums.
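In plain Swift, with no Metal involved, the plan amounts to something like this sketch (the data and chunk size here are just illustrative):
// Illustrative CPU-only sketch of the partial-sum plan above (not Metal).
let values: [Float] = (0..<1_000).map { Float($0) }
let chunkSize = 100   // the "i" in the notation above

// Step 1: sum each sub-array (the part Metal would parallelize).
let partialSums: [Float] = stride(from: 0, to: values.count, by: chunkSize).map { start in
    values[start..<min(start + chunkSize, values.count)].reduce(0, +)
}

// Step 2: sum of the sub-sums.
let total = partialSums.reduce(0, +)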
I created an application which runs my Metal shader, but there are some things I don't quite understand.
var array:[[Float]] = [[1,2,3], [4,5,6], [7,8,9]]
// get device
let device: MTLDevice! = MTLCreateSystemDefaultDevice()
// get library
let defaultLibrary:MTLLibrary! = device.newDefaultLibrary()
// queue
let commandQueue:MTLCommandQueue! = device.newCommandQueue()
// function
let kernerFunction: MTLFunction! = defaultLibrary.newFunctionWithName("calculateSum")
// pipeline with function
let pipelineState: MTLComputePipelineState! = try device.newComputePipelineStateWithFunction(kernerFunction)
// buffer for function
let commandBuffer:MTLCommandBuffer! = commandQueue.commandBuffer()
// encode function
let commandEncoder:MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()
// add function to encode
commandEncoder.setComputePipelineState(pipelineState)
// options
let resourceOption = MTLResourceOptions()
let arrayBiteLength = array.count * array[0].count * sizeofValue(array[0][0])
let arrayBuffer = device.newBufferWithBytes(&array, length: arrayBiteLength, options: resourceOption)
commandEncoder.setBuffer(arrayBuffer, offset: 0, atIndex: 0)
var result:[Float] = [0,0,0]
let resultBiteLenght = sizeofValue(result[0])
let resultBuffer = device.newBufferWithBytes(&result, length: resultBiteLenght, options: resourceOption)
commandEncoder.setBuffer(resultBuffer, offset: 0, atIndex: 1)
let threadGroupSize = MTLSize(width: 1, height: 1, depth: 1)
let threadGroups = MTLSize(width: (array.count), height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)
commandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let data = NSData(bytesNoCopy: resultBuffer.contents(), length: sizeof(Float), freeWhenDone: false)
data.getBytes(&result, length: result.count * sizeof(Float))
print(result)
That is my Swift code, and my shader is:
kernel void calculateSum(const device float *inFloat [[buffer(0)]],
device float *result [[buffer(1)]],
uint id [[ thread_position_in_grid ]]) {
float * f = inFloat[id];
float sum = 0;
for (int i = 0 ; i < 3 ; ++i) {
sum = sum + f[i];
}
result = sum;
}
I don't know how to define that inFloat is an array of arrays.
I don't know exactly what threadGroupSize and threadGroups are.
I don't know what device and uint are in the shader parameters.
Is this the right approach?

I took the time to create a fully working example of this problem with Metal. The explanation is in the comments:
let count = 10_000_000
let elementsPerSum = 10_000
// Data type, has to be the same as in the shader
typealias DataType = CInt
let device = MTLCreateSystemDefaultDevice()!
let library = self.library(device: device)
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
// We have to calculate the sum `resultsCount` times => the number of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up), because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
// Each threadgroup processes `threadExecutionWidth` threads; the only thing that matters for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
var start, end : UInt64
var result : DataType = 0
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
result += elem
}
end = mach_absolute_time()
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
result = 0
start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
for elem in buffer {
result += elem
}
}
end = mach_absolute_time()
print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
I used my Mac to test it, but it should work just fine on iOS.
Output:
Metal result: 494936505, time: 0.024611456
CPU result: 494936505, time: 0.163341018
The Metal version is about 7 times faster. I'm sure you could get even more speed if you implemented something like divide-and-conquer with a cutoff.
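One way to read "divide-and-conquer with a cutoff" is the following sketch (not part of the original answer; it reuses `device`, `queue`, `pipeline`, `dataBuffer`, `count`, `elementsPerSum` and `DataType` from the code above): keep feeding the kernel its own output until few enough partial sums remain, then finish on the CPU.
// Sketch: repeatedly reduce the previous output with the same `parsum` kernel.
var inputBuffer = dataBuffer
var inputCount = count
let cutoff = 16 * elementsPerSum

while inputCount > cutoff {
    let outputCount = (inputCount + elementsPerSum - 1) / elementsPerSum
    // makeBuffer(length:) gives a zero-filled buffer, which the kernel accumulates into.
    let outputBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * outputCount, options: [])!
    var inputCountC = CUnsignedInt(inputCount)
    var chunkC = CUnsignedInt(elementsPerSum)

    let cmds = queue.makeCommandBuffer()!
    let encoder = cmds.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(inputBuffer, offset: 0, index: 0)
    encoder.setBytes(&inputCountC, length: MemoryLayout<CUnsignedInt>.size, index: 1)
    encoder.setBuffer(outputBuffer, offset: 0, index: 2)
    encoder.setBytes(&chunkC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
    let tew = pipeline.threadExecutionWidth
    encoder.dispatchThreadgroups(MTLSize(width: (outputCount + tew - 1) / tew, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: tew, height: 1, depth: 1))
    encoder.endEncoding()
    cmds.commit()
    cmds.waitUntilCompleted()

    inputBuffer = outputBuffer
    inputCount = outputCount
}
// Sum the remaining `inputCount` values on the CPU, exactly as `results` is summed above.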

The accepted answer is annoyingly missing the kernel that was written for it. The source is here, but below is the full program and shader that can be run as a Swift command-line application.
/*
* Command line Metal Compute Shader for data processing
*/
import Metal
import Foundation
//------------------------------------------------------------------------------
let count = 10_000_000
let elementsPerSum = 10_000
//------------------------------------------------------------------------------
typealias DataType = CInt // Data type, has to be the same as in the shader
//------------------------------------------------------------------------------
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
//------------------------------------------------------------------------------
// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
//------------------------------------------------------------------------------
// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
//------------------------------------------------------------------------------
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
//------------------------------------------------------------------------------
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
//------------------------------------------------------------------------------
// We have to calculate the sum `resultsCount` times => the number of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up), because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
// Each threadgroup processes `threadExecutionWidth` threads; the only thing that matters for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
//------------------------------------------------------------------------------
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
//------------------------------------------------------------------------------
var start, end : UInt64
var result : DataType = 0
//------------------------------------------------------------------------------
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
result += elem
}
end = mach_absolute_time()
//------------------------------------------------------------------------------
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
result = 0
start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
for elem in buffer {
result += elem
}
}
end = mach_absolute_time()
print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
Metal shader
The compute kernel used by the program above:
#include <metal_stdlib>
using namespace metal;
typedef unsigned int uint;
typedef int DataType;
kernel void parsum(const device DataType* data [[ buffer(0) ]],
const device uint& dataLength [[ buffer(1) ]],
device DataType* sums [[ buffer(2) ]],
const device uint& elementsPerSum [[ buffer(3) ]],
const uint tgPos [[ threadgroup_position_in_grid ]],
const uint tPerTg [[ threads_per_threadgroup ]],
const uint tPos [[ thread_position_in_threadgroup ]]) {
uint resultIndex = tgPos * tPerTg + tPos;
uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end
for (; dataIndex < endIndex; dataIndex++)
sums[resultIndex] += data[dataIndex];
}
Objective-C
The same command-line program as above, but in Objective-C:
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
typedef int DataType;
int main(int argc, const char * argv[]) {
@autoreleasepool {
unsigned int count = 10000000;
unsigned int elementsPerSum = 10000;
//----------------------------------------------------------------------
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLLibrary>library = [device newDefaultLibrary];
id<MTLFunction>parsum = [library newFunctionWithName:@"parsum"];
id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:parsum error:nil];
//----------------------------------------------------------------------
DataType* data = (DataType*) malloc(sizeof(DataType) * count);
for (int i = 0; i < count; i++){
data[i] = arc4random_uniform(100);
}
unsigned int dataCount = count;
unsigned int elementsPerSumC = elementsPerSum;
unsigned int resultsCount = (count + elementsPerSum - 1) / elementsPerSum;
//------------------------------------------------------------------------------
id<MTLBuffer>dataBuffer = [device newBufferWithBytes:data
length:(sizeof(int) * count)
options:MTLResourceStorageModeManaged];
id<MTLBuffer>resultsBuffer = [device newBufferWithLength:(sizeof(DataType) * resultsCount)
options:0];
DataType* results = resultsBuffer.contents;
//----------------------------------------------------------------------
id<MTLCommandQueue>queue = [device newCommandQueue];
id<MTLCommandBuffer>cmds = [queue commandBuffer];
id<MTLComputeCommandEncoder> encoder = [cmds computeCommandEncoder];
//----------------------------------------------------------------------
[encoder setComputePipelineState:pipeline];
[encoder setBuffer:dataBuffer offset:0 atIndex:0];
[encoder setBytes:&dataCount length:sizeof(unsigned int) atIndex:1];
[encoder setBuffer:resultsBuffer offset:0 atIndex:2];
[encoder setBytes:&elementsPerSumC length:sizeof(unsigned int) atIndex:3];
//----------------------------------------------------------------------
MTLSize threadgroupsPerGrid =
{
(resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth,
1,
1
};
MTLSize threadsPerThreadgroup =
{
pipeline.threadExecutionWidth,
1,
1
};
//----------------------------------------------------------------------
[encoder dispatchThreadgroups:threadgroupsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
[encoder endEncoding];
//----------------------------------------------------------------------
uint64_t start, end;
DataType result = 0;
start = mach_absolute_time();
[cmds commit];
[cmds waitUntilCompleted];
for (int i = 0; i < resultsCount; i++){
result += results[i];
}
end = mach_absolute_time();
NSLog(#"Metal Result %d. time %f", result, (float)(end - start)/(float)(NSEC_PER_SEC));
//----------------------------------------------------------------------
result = 0;
start = mach_absolute_time();
for (int i = 0; i < count; i++){
result += data[i];
}
end = mach_absolute_time();
NSLog(#"Metal Result %d. time %f", result, (float)(end - start)/(float)(NSEC_PER_SEC));
//------------------------------------------------------------------------------
free(data);
}
return 0;
}

I've been running the app on a GT 740 (384 cores) vs. an i7-4790 with a multithreaded vector-sum implementation, and here are my figures:
Metal lap time: 19.959092
CPU MT lap time: 4.353881
That's a 5:1 ratio in favour of the CPU, so unless you have a powerful GPU, using shaders is not worth it.
I've been testing the same code on an i7-3610QM with its integrated Intel HD 4000 and, surprisingly, the results are much better for Metal: 2:1.
Edit: after tweaking the thread parameters I've finally improved GPU performance; now it's up to 16x the CPU.
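For anyone trying the same kind of tweak, here is a sketch of one common adjustment (assumed, not from the comment above; it reuses `pipeline`, `encoder` and `resultsCount` from the answers' code): size the threadgroup from the pipeline's reported limits instead of hard-coding it.
let tew = pipeline.threadExecutionWidth
// Largest threadgroup width the pipeline allows that is still a multiple of the execution width.
let width = (pipeline.maxTotalThreadsPerThreadgroup / tew) * tew
let threadsPerThreadgroup = MTLSize(width: width, height: 1, depth: 1)
let threadgroupsPerGrid = MTLSize(width: (resultsCount + width - 1) / width, height: 1, depth: 1)
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)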

Related

Metal Command Buffer Internal Error: What is Internal Error (IOAF code 2067)?

Attempting to run a compute kernel results in the following message:
Execution of the command buffer was aborted due to an error during execution. Internal Error (IOAF code 2067)
To get more specific information I query the command encoder's user info and manage to extract more details. I followed instructions from this video to yield the following message:
[Metal Diagnostics] __message__: MTLCommandBuffer execution failed: The commands
associated with the encoder were affected by an error, which may or may not have been
caused by the commands themselves, and failed to execute in full __:::__
__delegate_identifier__: GPUToolsDiagnostics
The breakpoint triggered by the API Validation and Shader Validation results in a record stack frame - not a GPU backtrace. The breakpoint does not indicate any new information apart from the above message.
I cannot find any reference to the mentioned IOAF code in documentation. The additional information printed reveals nothing of assistance. The kernel is quite divergent and I am speculating that may be causing the GPU to take too much time to complete. That may be to blame but I have nothing supporting this apart from a gut feeling.
Here is the thread setup for the group:
let threadExecutionWidth = pipeline.threadExecutionWidth
let threadgroupsPerGrid = MTLSize(width: (Int(pixelCount) + threadExecutionWidth - 1) / threadExecutionWidth, height: 1, depth: 1)
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
The GPU commands are being committed and waited upon for completion:
commandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
Here is my application-side code in its entirety:
import Metal
import Foundation
import simd
typealias Float4 = SIMD4<Float>
struct SimpleFileWriter {
var fileHandle: FileHandle
init(filePath: String, append: Bool = false) {
if !FileManager.default.fileExists(atPath: filePath) {
FileManager.default.createFile(atPath: filePath, contents: nil, attributes: nil)
}
fileHandle = FileHandle(forWritingAtPath: filePath)!
if !append {
fileHandle.truncateFile(atOffset: 0)
}
}
func write(content: String) {
fileHandle.seekToEndOfFile()
guard let data = content.data(using: String.Encoding.ascii) else {
fatalError("Could not convert \(content) to ascii data!")
}
fileHandle.write(data)
}
}
var imageWidth = 480
var imageHeight = 270
var sampleCount = 16
var bounceCount = 3
let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeDefaultLibrary(bundle: Bundle.module)
let primaryRayFunc = library.makeFunction(name: "ray_trace")!
let pipeline = try! device.makeComputePipelineState(function: primaryRayFunc)
var pixelData: [Float4] = (0..<(imageWidth * imageHeight)).map{ _ in Float4(0, 0, 0, 0)}
var pixelCount = UInt(pixelData.count)
let pixelDataBuffer = device.makeBuffer(bytes: &pixelData, length: Int(pixelCount) * MemoryLayout<Float4>.stride, options: [])!
let pixelDataMirrorPointer = pixelDataBuffer.contents().bindMemory(to: Float4.self, capacity: Int(pixelCount))
let pixelDataMirrorBuffer = UnsafeBufferPointer(start: pixelDataMirrorPointer, count: Int(pixelCount))
let commandQueue = device.makeCommandQueue()!
let commandBufferDescriptor = MTLCommandBufferDescriptor()
commandBufferDescriptor.errorOptions = MTLCommandBufferErrorOption.encoderExecutionStatus
let commandBuffer = commandQueue.makeCommandBuffer(descriptor: commandBufferDescriptor)!
let commandEncoder = commandBuffer.makeComputeCommandEncoder()!
commandEncoder.setComputePipelineState(pipeline)
commandEncoder.setBuffer(pixelDataBuffer, offset: 0, index: 0)
commandEncoder.setBytes(&pixelCount, length: MemoryLayout<Int>.stride, index: 1)
commandEncoder.setBytes(&imageWidth, length: MemoryLayout<Int>.stride, index: 2)
commandEncoder.setBytes(&imageHeight, length: MemoryLayout<Int>.stride, index: 3)
commandEncoder.setBytes(&sampleCount, length: MemoryLayout<Int>.stride, index: 4)
commandEncoder.setBytes(&bounceCount, length: MemoryLayout<Int>.stride, index: 5)
// We have to calculate the sum `pixelCount` times
// => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up)
// because each threadgroup will process `threadExecutionWidth` threads
let threadExecutionWidth = pipeline.threadExecutionWidth;
let threadgroupsPerGrid = MTLSize(width: (Int(pixelCount) + threadExecutionWidth - 1) / threadExecutionWidth, height: 1, depth: 1)
// Here we set that each threadgroup should process `threadExecutionWidth` threads
// the only important thing for performance is that this number is a multiple of
// `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
if let error = commandBuffer.error as NSError? {
if let encoderInfo = error.userInfo[MTLCommandBufferEncoderInfoErrorKey] as? [MTLCommandBufferEncoderInfo] {
for info in encoderInfo {
print(info.label + info.debugSignposts.joined())
}
}
}
let sfw = SimpleFileWriter(filePath: "/Users/pprovins/Desktop/render.ppm")
sfw.write(content: "P3\n")
sfw.write(content: "\(imageWidth) \(imageHeight)\n")
sfw.write(content: "255\n")
for pixel in pixelDataMirrorBuffer {
sfw.write(content: "\(UInt8(pixel.x * 255)) \(UInt8(pixel.y * 255)) \(UInt8(pixel.z * 255)) ")
}
sfw.write(content: "\n")
Additionally, here is the shader being run. I have not included all the function definitions for brevity's sake:
kernel void ray_trace(device float4 *result [[ buffer(0) ]],
const device uint& dataLength [[ buffer(1) ]],
const device int& imageWidth [[ buffer(2) ]],
const device int& imageHeight [[ buffer(3) ]],
const device int& samplesPerPixel [[ buffer(4) ]],
const device int& rayBounces [[ buffer (5)]],
const uint index [[thread_position_in_grid]]) {
if (index >= dataLength) {
return;
}
const float3 origin = float3(0.0);
const float aspect = float(imageWidth) / float(imageHeight);
const float3 vph = float3(0.0, 2.0, 0.0);
const float3 vpw = float3(2.0 * aspect, 0.0, 0.0);
const float3 llc = float3(-(vph / 2.0) - (vpw / 2.0) - float3(0.0, 0.0, 1.0));
float3 accumulatedColor = float3(0.0);
thread float seed = getSeed(index, index % imageWidth, index / imageWidth);
float row = float(index / imageWidth);
float col = float(index % imageWidth);
for (int aai = 0; aai < samplesPerPixel; ++aai) {
float ranX = fract(rand(seed));
float ranY = fract(rand(seed));
float u = (col + ranX) / float(imageWidth - 1);
float v = 1.0 - (row + ranY) / float(imageHeight - 1);
Ray r(origin, llc + u * vpw + v * vph - origin);
float3 color = float3(0.0);
HitRecord hr = {0.0, 0.0, false};
float attenuation = 1.0;
for (int bounceIndex = 0; bounceIndex < rayBounces; ++bounceIndex) {
testForHit(sceneDistance, r, hr);
if (hr.h) {
float3 target = hr.p + hr.n + random_f3_in_unit_sphere(seed);
attenuation *= 0.5;
r = Ray(hr.p, target - hr.p);
} else {
color = default_atmosphere_color(r) * attenuation;
break;
}
}
accumulatedColor += color / samplesPerPixel;
}
result[index] = float4(sqrt(accumulatedColor), 1.0);
}
Oddly enough, it occasionally does run. Changing the number of samples to 16 or above always results in the mentioned IOAF code. With fewer than 16 samples, the code runs ~25% of the time. The more samples, the more likely it is to produce the error code.
Is there any way to get additional information on IOAF code 2067?
Determining the error code with Metal API + Shader Validation was not possible.
By testing individual portions of the kernel, the particular error was narrowed down to a while loop that caused the GPU to hang.
The problem can essentially be boiled down to code that looks like:
while(true) {
// ad infinitum
}
or, in the case of the code above, in the call to random_f3_in_unit_sphere(seed):
while(randNum(seed) < threshold) {
// the while loop is not "bounded"
// in any sense. Whoops.
++seed;
}

Why are these simple Metal GPU compute kernels slower or equal than a CPU implementation?

I wrote a lattice dynamics simulation using Metal/Swift on macOS. It contains only highly parallel multiply-and-adds, but I still can't get the Metal/GPU to beat the CPU. (6-core i5 vs Radeon Pro 5300).
The code should execute the kernel 78k times over a dataset consisting of 46080 floats.
EDIT: The 78k iterations should be executed sequentially, as they correspond to time steps of a simulation, and each of them involves ~500k (highly-parallel) floating point operations.
Is there anything basic that I'm missing?
GPU code:
kernel void eom(const device float *mtK [[ buffer(0) ]],
const device float *mtKnl [[ buffer(1) ]],
const device float *mtB [[ buffer(2) ]],
const device float *mtKh [[ buffer(3) ]],
const device float *mtKv [[ buffer(4) ]],
const device float *exm [[ buffer(5) ]],
const device float *exwfm [[ buffer(6) ]],
const device float *sourceX [[ buffer(7) ]],
const device float *sourceV [[ buffer(8) ]],
device float *dest [[ buffer(9) ]],
uint3 id [[ thread_position_in_grid ]]) {
uint materialpoint = basisOffset + id.x + linestride*id.y;
uint samplepoint = basisOffset + id.x + linestride*id.y + blockstride*id.z;
float k = mtK [materialpoint];
float b = mtB [materialpoint];
float knl = mtKnl [materialpoint];
float kh = mtKh [materialpoint];
float kv = mtKv [materialpoint];
float khp = mtKh [materialpoint+linestride];
float kvp = mtKv [materialpoint+1];
float ex = exwfm [id.z]*exm[materialpoint];
dest[samplepoint] = -sourceV[samplepoint]*b -sourceX[samplepoint]*(k + knl*sourceX[samplepoint]*sourceX[samplepoint]) + kv*sourceX[samplepoint-1] + kh*sourceX[samplepoint-linestride] + kvp*sourceX[samplepoint+1] + khp*sourceX[samplepoint+linestride] + ex;
}
CPU code:
let threadgroup_dx = 24
let threadgroup_dy = 32
let threadgroup_dz = 1
let red_widthG = 1
let red_heightG = 2
let waveformStrideG = 30
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = try device.makeLibrary(filepath: "compute.metallib")
// Initialize buffers
let dydxV = device.makeBuffer(length: totalExtendedSize*MemoryLayout<Float>.stride, options: MTLResourceOptions.storageModeManaged)!
[...] // More buffers and loading code go here
// Create pipeline state
let computeDescriptor = MTLComputePipelineDescriptor()
computeDescriptor.threadGroupSizeIsMultipleOfThreadExecutionWidth = true
computeDescriptor.computeFunction = library.makeFunction(name: "eom")
let pipEOM = try device.makeComputePipelineState(descriptor: computeDescriptor, options: [], reflection: nil)
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder(dispatchType: MTLDispatchType.serial)!
// Computation
for i in 1...78000 {
let offset = i*numSamples
encoder.setComputePipelineState(pipEOM)
encoder.setBuffer(mtrK, offset: 0, index: 0)
encoder.setBuffer(mtrKnl, offset: 0, index: 1)
encoder.setBuffer(mtrB, offset: 0, index: 2)
encoder.setBuffer(mtrKh, offset: 0, index: 3)
encoder.setBuffer(mtrKv, offset: 0, index: 4)
encoder.setBuffer(exm, offset: 0, index: 5)
encoder.setBuffer(wvf, offset: offset, index: 6)
encoder.setBuffer(yX, offset: 0, index: 7)
encoder.setBuffer(yV, offset: 0, index: 8)
encoder.setBuffer(dydxV, offset: 0, index: 9)
let numThreadGroups = MTLSize(width: red_widthG, height: red_heightG, depth: waveformStrideG)
let threadsPerThreadgroup = MTLSize(width: threadgroup_dx, height: threadgroup_dy, depth: threadgroup_dz)
encoder.dispatchThreadgroups(numThreadGroups, threadsPerThreadgroup: threadsPerThreadgroup)
}
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
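(For reference, the dispatch above launches 24 × 32 × 1 = 768 threads per threadgroup and 1 × 2 × 30 = 60 threadgroups, i.e. 768 × 60 = 46080 threads per time step, matching the 46080-float dataset.)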

Get all sound frequencies of a WAV file using Swift and AVFoundation

I would like to capture all frequencies between given timespans in a WAV file. The intent is to do some audio analysis in a later step. For testing, I've used the application "Sox" to generate a 1-second-long WAV file that contains only a single tone at 13000 Hz. I want to read the file and find that frequency.
I'm using AVFoundation (which is important) to read the file. Since the input data is in PCM, I need to use an FFT to get the actual frequencies, which I do using the Accelerate framework. However, I don't get the expected result (13000 Hz), but rather a lot of values I don't understand. I'm new to audio development, so any hint about where my code is failing is appreciated. The code includes a few comments where the issue occurs.
Thanks in advance!
Code:
import AVFoundation
import Accelerate
class Analyzer {
// This function is implemented using the code from the following tutorial:
// https://developer.apple.com/documentation/accelerate/vdsp/fast_fourier_transforms/finding_the_component_frequencies_in_a_composite_sine_wave
func fftTransform(signal: [Float], n: vDSP_Length) -> [Int] {
let observed: [DSPComplex] = stride(from: 0, to: Int(n), by: 2).map {
return DSPComplex(real: signal[$0],
imag: signal[$0.advanced(by: 1)])
}
let halfN = Int(n / 2)
var forwardInputReal = [Float](repeating: 0, count: halfN)
var forwardInputImag = [Float](repeating: 0, count: halfN)
var forwardInput = DSPSplitComplex(realp: &forwardInputReal,
imagp: &forwardInputImag)
vDSP_ctoz(observed, 2,
&forwardInput, 1,
vDSP_Length(halfN))
let log2n = vDSP_Length(log2(Float(n)))
guard let fftSetUp = vDSP_create_fftsetup(
log2n,
FFTRadix(kFFTRadix2)) else {
fatalError("Can't create FFT setup.")
}
defer {
vDSP_destroy_fftsetup(fftSetUp)
}
var forwardOutputReal = [Float](repeating: 0, count: halfN)
var forwardOutputImag = [Float](repeating: 0, count: halfN)
var forwardOutput = DSPSplitComplex(realp: &forwardOutputReal,
imagp: &forwardOutputImag)
vDSP_fft_zrop(fftSetUp,
&forwardInput, 1,
&forwardOutput, 1,
log2n,
FFTDirection(kFFTDirection_Forward))
let componentFrequencies = forwardOutputImag.enumerated().filter {
$0.element < -1
}.map {
return $0.offset
}
return componentFrequencies
}
func run() {
// The frequencies array is a array of frequencies which is then converted to points on sinus curves (signal)
let n = vDSP_Length(4*4096)
let frequencies: [Float] = [1, 5, 25, 30, 75, 100, 300, 500, 512, 1023]
let tau: Float = .pi * 2
let signal: [Float] = (0 ... n).map { index in
frequencies.reduce(0) { accumulator, frequency in
let normalizedIndex = Float(index) / Float(n)
return accumulator + sin(normalizedIndex * frequency * tau)
}
}
// These signals are then restored using the fftTransform function above, giving the exact same values as in the "frequencies" variable
let frequenciesRestored = fftTransform(signal: signal, n: n).map({Float($0)})
assert(frequenciesRestored == frequencies)
// Now I want to do the same thing, but reading the frequencies from a file (which includes a constant tone at 13000 Hz)
let file = { PATH TO A WAV-FILE WITH A SINGLE TONE AT 13000Hz RUNNING FOR 1 SECOND }
let asset = AVURLAsset(url: URL(fileURLWithPath: file))
let track = asset.tracks[0]
do {
let reader = try AVAssetReader(asset: asset)
let sampleRate = 48000.0
let outputSettingsDict: [String: Any] = [
AVFormatIDKey: kAudioFormatLinearPCM,
AVSampleRateKey: Int(sampleRate),
AVLinearPCMIsNonInterleaved: false,
AVLinearPCMBitDepthKey: 16,
AVLinearPCMIsFloatKey: false,
AVLinearPCMIsBigEndianKey: false,
]
let output = AVAssetReaderTrackOutput(track: track, outputSettings: outputSettingsDict)
output.alwaysCopiesSampleData = false
reader.add(output)
reader.startReading()
typealias audioBuffertType = Int16
autoreleasepool {
while (reader.status == .reading) {
if let sampleBuffer = output.copyNextSampleBuffer() {
var audioBufferList = AudioBufferList(mNumberBuffers: 1, mBuffers: AudioBuffer(mNumberChannels: 0, mDataByteSize: 0, mData: nil))
var blockBuffer: CMBlockBuffer?
CMSampleBufferGetAudioBufferListWithRetainedBlockBuffer(
sampleBuffer,
bufferListSizeNeededOut: nil,
bufferListOut: &audioBufferList,
bufferListSize: MemoryLayout<AudioBufferList>.size,
blockBufferAllocator: nil,
blockBufferMemoryAllocator: nil,
flags: kCMSampleBufferFlag_AudioBufferList_Assure16ByteAlignment,
blockBufferOut: &blockBuffer
);
let buffers = UnsafeBufferPointer<AudioBuffer>(start: &audioBufferList.mBuffers, count: Int(audioBufferList.mNumberBuffers))
for buffer in buffers {
let samplesCount = Int(buffer.mDataByteSize) / MemoryLayout<audioBuffertType>.size
let samplesPointer = audioBufferList.mBuffers.mData!.bindMemory(to: audioBuffertType.self, capacity: samplesCount)
let samples = UnsafeMutableBufferPointer<audioBuffertType>(start: samplesPointer, count: samplesCount)
let myValues: [Float] = samples.map {
let value = Float($0)
return value
}
// Here I would expect my array to include multiple "13000" which is the frequency of the tone in my file
// I'm not sure what the variable 'n' does in this case, but changing it seems to change the result.
// The value should be twice as high as the highest measurable frequency (Nyquist frequency) (13000),
// but this crashes the application:
let mySignals = fftTransform(signal: myValues, n: vDSP_Length(2 * 13000))
assert(mySignals[0] == 13000)
}
}
}
}
}
catch {
print("error!")
}
}
}
The test clip can be generated using:
sox -G -n -r 48000 ~/outputfile.wav synth 1.0 sine 13000

How to use explicit serialization to get data from metal data to Swift struct

I'm trying to get data from a Metal Mesh Vertex array to a Swift struct.
I know that Swift structs and C structs may not have the same memory layout.
I also know I could do this with a C bridging header, but that introduces some other issues.
How do I do this with explicit serialization from the metal data to an array of Swift structs?
It would be helpful to also see the deserialization of the same data.
import PlaygroundSupport
import MetalKit
import simd
struct Vertex {
var position: vector_float3
var normal: vector_float3
var texture: vector_float2
init() {
position = float3(repeating: 0.0)
normal = float3(repeating: 0.0)
texture = float2(repeating: 0.0)
}
init(pos: vector_float3, nor: vector_float3, text: vector_float2) {
position = pos
normal = nor
texture = text
}
}
guard let device = MTLCreateSystemDefaultDevice() else {
fatalError("GPU is not supported")
}
let allocator = MTKMeshBufferAllocator(device: device)
let mesh = MDLMesh.init(boxWithExtent: [0.75, 0.75, 0.75], segments: [1, 1, 1], inwardNormals: false, geometryType: .quads, allocator: allocator)
// Details about the vertex
print("Details about the vertex")
print( mesh.vertexDescriptor )
// Look at the vertex detail
// This doesn't work because, I think, of the way Swift pads the memory in structs.
print("Number of vertex: ", mesh.vertexCount)
let buff = mesh.vertexBuffers[0]
// This gives the wrong result but is nice code
let count = buff.length / MemoryLayout<Vertex>.stride
print("Wrong result!")
print("Space in buffer for vertex: ", count)
let wrongResult = buff.map().bytes.bindMemory(to: Vertex.self, capacity: count)
for i in 0 ..< count {
print( "Vertex: ", i, wrongResult[i])
}
// This gives the correct result but is really ugly code
print("Correct result!")
let tempResult = buff.map().bytes.bindMemory(to: Float.self, capacity: mesh.vertexCount*8)
var result = Array(repeating: Vertex(), count: mesh.vertexCount)
for i in 0 ..< mesh.vertexCount {
result[i].position.x = tempResult[i*8 + 0]
result[i].position.y = tempResult[i*8 + 1]
result[i].position.z = tempResult[i*8 + 2]
result[i].normal.x = tempResult[i*8 + 3]
result[i].normal.y = tempResult[i*8 + 4]
result[i].normal.z = tempResult[i*8 + 5]
result[i].texture.x = tempResult[i*8 + 6]
result[i].texture.y = tempResult[i*8 + 7]
}
for i in 0 ..< mesh.vertexCount {
print( "Vertex: ", i, result[i])
}

How to convert Data of Int16 audio samples to array of float audio samples

I'm currently working with audio samples.
I get them from AVAssetReader and have a CMSampleBuffer with something like this:
guard let sampleBuffer = readerOutput.copyNextSampleBuffer() else {
guard reader.status == .completed else { return nil }
// Completed
// samples is an array of Int16
let samples = sampleData.withUnsafeBytes {
Array(UnsafeBufferPointer<Int16>(
start: $0, count: sampleData.count / MemoryLayout<Int16>.size))
}
// The only way I found to convert [Int16] -> [Float]...
return samples.map { Float($0) / Float(Int16.max)}
}
guard let blockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) else {
return nil
}
let length = CMBlockBufferGetDataLength(blockBuffer)
let sampleBytes = UnsafeMutablePointer<UInt8>.allocate(capacity: length)
CMBlockBufferCopyDataBytes(blockBuffer, 0, length, sampleBytes)
sampleData.append(sampleBytes, count: length)
}
As you can see, the only way I found to convert [Int16] -> [Float] is samples.map { Float($0) / Float(Int16.max) }, but doing this increases my processing time. Is there another way to cast a pointer of Int16 to a pointer of Float?
"Casting" or "rebinding" a pointer only changes the way how memory is
interpreted. You want to compute floating point values from integers,
the new values have a different memory representation (and also a different
size).
Therefore you somehow have to iterate over all input values
and compute the new values. What you can do is to omit the Array
creation:
let samples = sampleData.withUnsafeBytes {
UnsafeBufferPointer<Int16>(start: $0, count: sampleData.count / MemoryLayout<Int16>.size)
}
return samples.map { Float($0) / Float(Int16.max) }
Another option would be to use the vDSP functions from the
Accelerate framework:
import Accelerate
// ...
let numSamples = sampleData.count / MemoryLayout<Int16>.size
var factor = Float(Int16.max)
var floats: [Float] = Array(repeating: 0.0, count: numSamples)
// Int16 array to Float array:
sampleData.withUnsafeBytes {
vDSP_vflt16($0, 1, &floats, 1, vDSP_Length(numSamples))
}
// Scaling:
vDSP_vsdiv(&floats, 1, &factor, &floats, 1, vDSP_Length(numSamples))
I don't know if that is faster, you'll have to check.
(Update: It is faster, as ColGraff demonstrated in his answer.)
An explicit loop is also much faster than using map:
let factor = Float(Int16.max)
let samples = sampleData.withUnsafeBytes {
UnsafeBufferPointer<Int16>(start: $0, count: sampleData.count / MemoryLayout<Int16>.size)
}
var floats: [Float] = Array(repeating: 0.0, count: samples.count)
for i in 0..<samples.count {
floats[i] = Float(samples[i]) / factor
}
return floats
An additional option in your case might be to use CMBlockBufferGetDataPointer() instead of CMBlockBufferCopyDataBytes()
into allocated memory.
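A minimal sketch of that variant (using the same CoreMedia call style as the question's code; `blockBuffer` is the one obtained from CMSampleBufferGetDataBuffer, and Accelerate is imported as in the snippet above):
// Get a pointer into the block buffer instead of copying its bytes out first.
var lengthAtOffset = 0
var totalLength = 0
var rawPointer: UnsafeMutablePointer<Int8>? = nil
let status = CMBlockBufferGetDataPointer(blockBuffer, 0, &lengthAtOffset, &totalLength, &rawPointer)
// This sketch assumes the block buffer is contiguous (lengthAtOffset == totalLength).
if status == kCMBlockBufferNoErr, let rawPointer = rawPointer {
    let sampleCount = totalLength / MemoryLayout<Int16>.size
    let floats: [Float] = rawPointer.withMemoryRebound(to: Int16.self, capacity: sampleCount) { int16Pointer in
        var converted = [Float](repeating: 0, count: sampleCount)
        vDSP_vflt16(int16Pointer, 1, &converted, 1, vDSP_Length(sampleCount))
        return converted
    }
    // Scale `floats` by 1/Int16.max with vDSP_vsdiv as above, if needed.
}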
You can do considerably better if you use the Accelerate Framework for the conversion:
import Accelerate
// Set up random [Int]
var randomInt = [Int16]()
randomInt.reserveCapacity(10000)
for _ in 0..<randomInt.capacity {
let value = Int16(Int32(arc4random_uniform(UInt32(UInt16.max))) - Int32(UInt16.max / 2))
randomInt.append(value)
}
// Time elapsed helper: https://stackoverflow.com/a/25022722/887210
func printTimeElapsedWhenRunningCode(title:String, operation:()->()) {
let startTime = CFAbsoluteTimeGetCurrent()
operation()
let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
print("Time elapsed for \(title): \(timeElapsed) s.")
}
// Testing
printTimeElapsedWhenRunningCode(title: "vDSP") {
var randomFloat = [Float](repeating: 0, count: randomInt.capacity)
vDSP_vflt16(randomInt, 1, &randomFloat, 1, vDSP_Length(randomInt.capacity))
}
printTimeElapsedWhenRunningCode(title: "map") {
randomInt.map { Float($0) }
}
// Results
//
// Time elapsed for vDSP : 0.000429034233093262 s.
// Time elapsed for flatMap: 0.00233501195907593 s.
It's an improvement of about 5x.
(Edit: Added some changes suggested by Martin R)
@MartinR and @ColGraff gave really good answers, and thank you to everybody for the fast replies.
However, I found an easier way to do it without any conversion at all. AVAssetReaderAudioMixOutput requires an audio settings dictionary, and in it we can set the key AVLinearPCMIsFloatKey: true. This way I can read my data like this:
let samples = sampleData.withUnsafeBytes {
UnsafeBufferPointer<Float>(start: $0,
count: sampleData.count / MemoryLayout<Float>.size)
}
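For completeness, a sketch of the output settings this refers to (the reader and track setup around it is assumed, much as in the earlier FFT question):
// Assumed sketch: request 32-bit float PCM from the reader up front,
// so no Int16 -> Float conversion is needed afterwards.
let outputSettings: [String: Any] = [
    AVFormatIDKey: kAudioFormatLinearPCM,
    AVLinearPCMIsFloatKey: true,
    AVLinearPCMBitDepthKey: 32,
    AVLinearPCMIsNonInterleaved: false,
    AVLinearPCMIsBigEndianKey: false
]
// `audioTracks` stands for the asset's audio tracks, obtained elsewhere.
let readerOutput = AVAssetReaderAudioMixOutput(audioTracks: audioTracks, audioSettings: outputSettings)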
for: Xcode 8.3.3 • Swift 3.1
extension Collection where Iterator.Element == Int16 {
var floatArray: [Float] {
return flatMap{ Float($0) }
}
}
usage:
let int16Array: [Int16] = [1, 2, 3 ,4]
let floatArray = int16Array.floatArray