Espresso exception: "Invalid argument":general shape kernel while loading mlmodel - swift

I converted my mlmodel from tf.keras. The goal is to recognize handwritten text from an image.
When I run it using this code:
func performCoreMLImageRecognition(_ image: UIImage) {
    let model = try! HTRModel()
    // process input image
    let scale = image.scaledImage(200)
    let sized = scale?.resize(size: CGSize(width: 200, height: 50))
    let gray = sized?.rgb2GrayScale()
    guard let pixelBuffer = sized?.pixelBufferGray(width: 200, height: 50) else { fatalError("Cannot convert image to pixelBufferGray") }
    UIImageWriteToSavedPhotosAlbum(gray! ,
    let mlArray = try! MLMultiArray(shape: [1, 1], dataType: MLMultiArrayDataType.float32)
    let htrinput = HTRInput(image: pixelBuffer, label: mlArray)
    if let prediction = try? model.prediction(input: htrinput) {
I get the following error:
[espresso] [Espresso::handle_ex_plan] exception=Espresso exception: "Invalid argument": generic_reshape_kernel: Invalid bottom shape (64 12 1 1 1) for reshape to (768 50 -1 1 1) status=-6
2021-01-21 20:23:50.712585+0900 Guided Camera[7575:1794819] [coreml] Error computing NN outputs -6
2021-01-21 20:23:50.712611+0900 Guided Camera[7575:1794819] [coreml] Failure in -executePlan:error:.
Here is the model configuration
The model ran perfectly fine in Keras. Where am I going wrong? I am not well versed in Swift and need help.
What does this error mean, and how do I resolve it?

Sometimes during the conversion from Keras (or whatever) to Core ML, the converter doesn't understand how to handle certain operations, which results in a model that doesn't work.
In your case, there is a layer that outputs a tensor with shape (64, 12, 1, 1, 1) while there is a reshape layer that expects something that can be reshaped to (768, 50, -1, 1, 1).
You'll need to find out which layer does this reshape and then examine the Core ML model to see why it gets an input tensor that is not the correct size. Just because it works OK in Keras does not mean the conversion to Core ML was flawless.
You can examine the Core ML model with Netron, an open source model viewer.
(Note that 64x12 = 768, so the issue appears to be with the 50 in that tensor.)
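The arithmetic behind that note can be checked with a small sketch. This is not how Core ML implements reshape; `resolveReshape` is a hypothetical helper that only applies the usual rule that a reshape must preserve the element count, with at most one -1 dimension inferred from the rest.

```swift
/// Hypothetical helper: resolves a single -1 (inferred) dimension in `target`,
/// or returns nil when the element counts cannot be made to match.
func resolveReshape(from source: [Int], to target: [Int]) -> [Int]? {
    let sourceCount = source.reduce(1, *)
    let wildcards = target.filter { $0 == -1 }.count
    guard wildcards <= 1 else { return nil }
    let knownCount = target.filter { $0 != -1 }.reduce(1, *)
    guard knownCount > 0 else { return nil }
    if wildcards == 0 {
        return knownCount == sourceCount ? target : nil
    }
    // The inferred dimension must be a whole number.
    guard sourceCount % knownCount == 0 else { return nil }
    return target.map { $0 == -1 ? sourceCount / knownCount : $0 }
}

let bottom = [64, 12, 1, 1, 1]   // 768 elements, as in the error message

// 768 * 50 = 38400 known elements cannot fit into 768, so this fails.
print(resolveReshape(from: bottom, to: [768, 50, -1, 1, 1]) as Any)   // nil

// Without the 50 the shapes are compatible.
print(resolveReshape(from: bottom, to: [768, -1, 1, 1, 1]) as Any)
```

The first call fails for the same reason the Espresso kernel reports: the target shape asks for at least 50 times more elements than the layer actually receives.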


How to populate a pixel buffer much faster?

As part of a hobby project, I'm working on a 2D game engine that will draw each pixel every frame, using a color from a palette.
I am looking for a way to do that while maintaining a reasonable frame rate (60fps being the minimum).
Without any game logic in place, I am updating the values of my pixels with some value from the palette.
I'm currently taking the mod of an index, to (hopefully) prevent the compiler from doing some loop-optimisation it could do with a fixed value.
Below is my (very naive?) implementation of updating the bytes in the pixel array.
On an iPhone 12 Pro, each run of updating all pixel values takes on average 43 ms, while on a simulator running on an M1 Mac, it takes 15 ms. Both are unacceptable, as that would leave no time for any additional game logic (which would involve many more operations than taking the mod of an Int).
I was planning to look into Metal and set up a surface, but clearly the bottleneck here is the CPU, so if I can optimize this code, I could go for a higher-level framework.
Any suggestions on a performant way to write this many bytes much, much faster (parallelisation is not an option)?
Instruments shows that most of the time is being spent in Swift's function. Maybe there is a way to reduce the time spent there; there is quite a substantial subtree inside it.
struct BGRA {
    let blue: UInt8
    let green: UInt8
    let red: UInt8
    let alpha: UInt8
}
let BGRAPallet = [
    BGRA(blue: 124, green: 124, red: 124, alpha: 0xff),
    BGRA(blue: 252, green: 0, red: 0, alpha: 0xff),
    // ... 62 more values in my code, omitting here for brevity
]
private func test() {
    let screenWidth: Int = 256
    let screenHeight: Int = 240
    let pixelBufferPtr = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: screenWidth * screenHeight)
    let runCount = 1000
    let start =
    for _ in 0 ..< runCount {
        for index in 0 ..< pixelBufferPtr.count {
            pixelBufferPtr[index] = BGRAPallet[index % BGRAPallet.count]
        }
    }
    let elapsed =
    print("Average time per run: \((Int(elapsed) * 1000) / runCount) ms")
}
First of all, I don't believe you're testing an optimized build for two reasons:
You say “This was measured with optimization set to Fastest [-O3].” But the Swift compiler doesn't recognize -O3 as a command-line flag. The C/C++ compiler recognizes that flag. For Swift the flags are -Onone, -O, -Osize, and -Ounchecked.
I ran your code on my M1 Max MacBook Pro in Debug configuration and it reported 15ms. Then I ran it in Release configuration and it reported 0ms. I had to increase the screen size to 2560x2400 (100x the pixels) to get it to report a time of 3ms.
Now, looking at your code, here are some things that stand out:
You're picking a color using BGRAPalette[index % BGRAPalette.count]. Since your palette size is 64, you can say BGRAPalette[index & 0b0011_1111] for the same result. I expected Swift to optimize that for me, but apparently it didn't, because making that change reduced the reported time to 2ms.
Indexing into BGRAPalette incurs a bounds check. You can avoid the bounds check by grabbing an UnsafeBufferPointer for the palette. Adding this optimization reduced the reported time to 1ms.
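The modulo-to-mask replacement described above can be sanity-checked in isolation. This is a minimal sketch; `maskedIndex` is a hypothetical helper, and the shortcut only holds when the palette size is a power of two. One plausible reason the compiler does not make the substitution by itself is that Int is signed, and the two operations disagree for negative values.

```swift
/// Hypothetical helper: picks a palette slot with a bit mask.
/// Only valid when `count` is a power of two.
func maskedIndex(_ index: Int, count: Int) -> Int {
    precondition(count > 0 && (count & (count - 1)) == 0, "count must be a power of two")
    return index & (count - 1)
}

// For non-negative indexes, masking and modulo select the same entry.
for index in 0 ..< 1_000 {
    assert(maskedIndex(index, count: 64) == index % 64)
}

// For negative values they differ: % keeps the sign, & does not.
print(-1 % 64, maskedIndex(-1, count: 64))   // prints "-1 63"
```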
Here's my version:
public struct BGRA {
    let blue: UInt8
    let green: UInt8
    let red: UInt8
    let alpha: UInt8
}
func rng() -> UInt8 { UInt8.random(in: .min ... .max) }
let BGRAPalette = (0 ..< 64).map { _ in
    BGRA(blue: rng(), green: rng(), red: rng(), alpha: rng())
}
public func test() {
    let screenWidth: Int = 2560
    let screenHeight: Int = 2400
    let pixelCount = screenWidth * screenHeight
    let pixelBuffer = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: pixelCount)
    let runCount = 1000
    let start = SuspendingClock().now
    BGRAPalette.withUnsafeBufferPointer { paletteBuffer in
        for _ in 0 ..< runCount {
            for index in 0 ..< pixelCount {
                pixelBuffer[index] = paletteBuffer[index & 0b0011_1111]
            }
        }
    }
    let end = SuspendingClock().now
    let elapsed = end - start
    let msElapsed = elapsed.components.seconds * 1000 + elapsed.components.attoseconds / 1_000_000_000_000_000
    print("Average time per run: \(msElapsed / Int64(runCount)) ms")
    // return pixelBuffer
}
@main
struct MyMain {
    static func main() {
        test()
    }
}
In addition to the two optimizations I described, I removed the dependency on Foundation (so I could paste the code into the compiler explorer) and corrected the spelling of ‘palette‘.
But realistically, even this probably isn't a particularly good test of your fill rate. You didn't say what kind of game you want to write, but given your screen size of 256x240, it's likely to use a tile-based map and sprites. If so, you shouldn't copy a pixel at a time. You can write a blitter that copies blocks of pixels at a time, using CPU instructions that operate on more than 32 bits at a time. ARM64 has 128-bit (16-byte) registers.
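To make the blitter idea concrete, here is a minimal sketch that copies a tile one whole row at a time instead of one pixel at a time. The `blit` function, its parameters, and the 8x8 tile are illustrative assumptions, not part of the original code. Bulk copies like this give the runtime's memory-copy routine the opportunity to use wide loads and stores; nothing here has been benchmarked.

```swift
struct BGRA: Equatable {
    var blue: UInt8, green: UInt8, red: UInt8, alpha: UInt8
}

/// Hypothetical blitter: copies a row-major tile into a row-major frame buffer,
/// issuing one bulk copy per tile row.
func blit(tile: [BGRA], tileWidth: Int, tileHeight: Int,
          into frame: UnsafeMutableBufferPointer<BGRA>, frameWidth: Int,
          x: Int, y: Int) {
    precondition(tileWidth > 0 && tileHeight > 0)
    precondition(tile.count == tileWidth * tileHeight)
    precondition(x >= 0 && y >= 0 && x + tileWidth <= frameWidth)
    precondition((y + tileHeight) * frameWidth <= frame.count)
    let rowBytes = tileWidth * MemoryLayout<BGRA>.stride
    tile.withUnsafeBufferPointer { source in
        for row in 0 ..< tileHeight {
            let destination = frame.baseAddress! + (y + row) * frameWidth + x
            let sourceRow = source.baseAddress! + row * tileWidth
            UnsafeMutableRawPointer(destination)
                .copyMemory(from: UnsafeRawPointer(sourceRow), byteCount: rowBytes)
        }
    }
}

// Usage: stamp one 8x8 tile into a 256x240 frame at (16, 32).
let frameWidth = 256, frameHeight = 240
let frame = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: frameWidth * frameHeight)
frame.initialize(repeating: BGRA(blue: 0, green: 0, red: 0, alpha: 0xff))
let tile = [BGRA](repeating: BGRA(blue: 252, green: 0, red: 0, alpha: 0xff), count: 64)
blit(tile: tile, tileWidth: 8, tileHeight: 8, into: frame, frameWidth: frameWidth, x: 16, y: 32)
// Call frame.deallocate() once the buffer is no longer needed.
```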
But even more realistically, you should learn to use the GPU for your blitting. Not only is it faster for this sort of thing, it's probably more power-efficient too. Even though you're lighting up more of the chip, you're lighting it up for shorter intervals.
Okay, so I gave this a shot. You can probably move to SIMD (single instruction, multiple data). This would probably speed up the whole process by 10 to 20 percent or so.
Here is the code link (the same code is pasted below for convenience):
I made two versions just in case your systems architecture doesn't support simd_uint4. Let me know if this is what you were looking for.
import simd
private func test() {
    let screenWidth: Int = 256
    let screenHeight: Int = 240
    let pixelBufferPtr = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: screenWidth * screenHeight)
    let runCount = 1000
    let start =
    for _ in 0 ..< runCount {
        var index = 0
        var palletIndex = 0
        let palletCount = BGRAPallet.count
        while index < pixelBufferPtr.count {
            let bgra = BGRAPallet[palletIndex]
            let bgraVector = simd_uint4(,,, bgra.alpha)
            let maxCount = min(pixelBufferPtr.count - index, 4)
            let pixelBuffer = pixelBufferPtr.baseAddress! + index
            pixelBuffer.storeBytes(of: bgraVector, as: simd_uint4.self)
            palletIndex += 1
            if palletIndex == palletCount {
                palletIndex = 0
            }
            index += maxCount
        }
    }
    let elapsed =
    print("Average time per run: \((Int(elapsed) * 1000) / runCount) ms")
}

memory issue: realtime eye localization

I'm currently working on an iOS application that should be able to detect eye localization. I've created a TFLite model that comprises some CNN layers. In order to use the TFLite model in Xcode/Swift, I've created a helper class in which the model calculates the output. Whenever I run the predict function once, it works. But apparently the predict function doesn't work in the real-time camera thread.
After about 7 seconds, Xcode throws the following error:
Thread 1: EXC_BAD_ACCESS (code=1, address=0x123424001). This error must be caused by looping through the image. Since I need each pixel value, I'm using the solution suggested by Firebase. But apparently this solution is not watertight. Can anybody help to resolve this memory issue?
func creatInputForCNN(resizedImage: UIImage?) -> Data? {
    // In this section of the code I loop through the image (150,200,3)
    // in order to fetch each pixel value (RGB).
    guard let image: CGImage = resizedImage?.cgImage else { return nil }
    guard let context = CGContext(
        data: nil,
        width: image.width, height: image.height,
        bitsPerComponent: 8, bytesPerRow: image.width * 4,
        space: CGColorSpaceCreateDeviceRGB(),
        bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue
    ) else { return nil }
    context.draw(image, in: CGRect(x: 0, y: 0, width: image.width, height: image.height))
    guard let imageData = context.data else { return nil }
    let size_w = 150
    let size_H = 200
    var inputData: Data?
    inputData = Data()
    for row in 0 ..< size_H {
        for col in 0 ..< size_w {
            let offset = 4 * (row * context.width + col)
            // (Ignore offset 0, the unused alpha channel)
            let red = imageData.load(fromByteOffset: offset+1, as: UInt8.self)
            let green = imageData.load(fromByteOffset: offset+2, as: UInt8.self)
            let blue = imageData.load(fromByteOffset: offset+3, as: UInt8.self)
            // Normalize channel values to [0.0, 1.0]. This requirement varies
            // by model. For example, some models might require values to be
            // normalized to the range [-1.0, 1.0] instead, and others might
            // require fixed-point values or the original bytes.
            var normalizedRed: Float32 = Float32(red) / 255
            var normalizedGreen: Float32 = Float32(green) / 255
            var normalizedBlue: Float32 = Float32(blue) / 255
            // Append normalized values to Data object in RGB order.
            let elementSize = MemoryLayout.size(ofValue: normalizedRed)
            var bytes = [UInt8](repeating: 0, count: elementSize)
            memcpy(&bytes, &normalizedRed, elementSize)
            inputData!.append(&bytes, count: elementSize)
            memcpy(&bytes, &normalizedGreen, elementSize)
            inputData!.append(&bytes, count: elementSize)
            memcpy(&bytes, &normalizedBlue, elementSize)
            inputData!.append(&bytes, count: elementSize)
        }
    }
    return inputData
}
This is the code

Get predict in TensorFlowLite for Swift

I run the code from these instructions:
Everything works, but I don't understand how to get the prediction values. I tried print(outputTensor) but got:
Tensor(name: "Identity", dataType: TensorFlowLite.Tensor.DataType.float32, shape: TensorFlowLite.Tensor.Shape(rank: 2, dimensions: [1, 3]), data: 12 bytes, quantizationParameters: nil)
Did you try this part from the example guide?
// Copy output to `Data` to process the inference results.
let outputSize = outputTensor.shape.dimensions.reduce(1, {x, y in x * y})
let outputData =
    UnsafeMutableBufferPointer<Float32>.allocate(capacity: outputSize)
outputTensor.data.copyBytes(to: outputData)
Then just print the outputData buffer.
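If an array is more convenient than a raw buffer, the same bytes can be reinterpreted with Foundation alone. This is a minimal sketch: `floats(from:)` is a hypothetical helper, and `raw` is a stand-in for `outputTensor.data`, built here so the example runs without TensorFlow Lite. The tensor in the question reports 12 bytes with shape [1, 3], which corresponds to three Float32 values.

```swift
import Foundation

/// Hypothetical helper: reinterprets raw tensor bytes as Float32 values.
func floats(from data: Data) -> [Float32] {
    let count = data.count / MemoryLayout<Float32>.stride
    var values = [Float32](repeating: 0, count: count)
    _ = values.withUnsafeMutableBufferPointer { data.copyBytes(to: $0) }
    return values
}

// Stand-in for `outputTensor.data`: three Float32 scores packed into 12 bytes.
let scores: [Float32] = [0.1, 0.7, 0.2]
let raw = scores.withUnsafeBufferPointer { Data(buffer: $0) }
print(raw.count)           // 12
print(floats(from: raw))   // the three scores, in order
```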

Perform normalization using Accelerate framework

I need to perform a simple math operation on Data that contains RGB pixel data. Currently I'm doing it like so:
let imageMean: Float = 127.5
let imageStd: Float = 127.5
let rgbData: Data // Some data containing RGB pixels
let floats = (0..<rgbData.count).map {
    (Float(rgbData[$0]) - imageMean) / imageStd
}
return Data(bytes: floats, count: floats.count * MemoryLayout<Float>.size)
This works, but it's too slow. I was hoping I could use the Accelerate framework to calculate this faster, but have no idea how to do this. I reserved some space so that it's not allocated every time this function starts, like so:
inputBufferDataNormalized = malloc(width * height * 3) // 3 channels RGB
I tried a few functions, like vDSP_vasm, but I couldn't make them work. Can someone direct me on how to use them? Basically I need to replace this map function, because it takes too long. It would also be great to reuse the pre-allocated space every time.
Following up on my comment on your other related question. You can use SIMD to parallelize the operation, but you'd need to split the original array into chunks.
This is a simplified example that assumes that the array is exactly divisible by 64, for example, an array of 1024 elements:
let arr: [Float] = (0 ..< 1024).map { _ in Float.random(in: 0...1) }
let imageMean: Float = 127.5
let imageStd: Float = 127.5
var chunks = [SIMD64<Float>]()
chunks.reserveCapacity(arr.count / 64)
for i in stride(from: 0, to: arr.count, by: 64) {
    let v = SIMD64.init(arr[i ..< i+64])
    chunks.append((v - imageMean) / imageStd) // same calculation using SIMD
}
You can now access each chunk with a subscript:
var results: [Float] = []
for chunk in chunks {
    for i in chunk.indices {
        results.append(chunk[i])
    }
}
Of course, you'd need to deal with a remainder if the array isn't exactly divisible by 64.
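One way to deal with that remainder is to process the largest multiple of the vector width with SIMD and finish the tail with plain scalar math. This is a minimal sketch under assumptions: the function name `normalize` is hypothetical, and it uses SIMD8 rather than SIMD64 only to keep the example small.

```swift
/// Hypothetical helper: (value - mean) / std for every element,
/// vectorised in groups of 8 with a scalar loop for the leftover tail.
func normalize(_ values: [Float], mean: Float, std: Float) -> [Float] {
    var result = [Float]()
    result.reserveCapacity(values.count)
    let width = 8
    let vectorEnd = values.count - values.count % width
    for i in stride(from: 0, to: vectorEnd, by: width) {
        let v = SIMD8<Float>(values[i ..< i + width])
        let normalized = (v - mean) / std
        for lane in normalized.indices {
            result.append(normalized[lane])
        }
    }
    // Scalar fallback for the elements that did not fill a whole vector.
    for i in vectorEnd ..< values.count {
        result.append((values[i] - mean) / std)
    }
    return result
}

print(normalize([0, 127.5, 255], mean: 127.5, std: 127.5))   // [-1.0, 0.0, 1.0]
```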
I have found a way to do this using Accelerate. First I reserve space for the converted buffer like so:
var inputBufferDataRawFloat = [Float](repeating: 0, count: width * height * 3)
Then I can use it like so:
let rawBytes = [UInt8](rgbData)
vDSP_vfltu8(rawBytes, 1, &inputBufferDataRawFloat, 1, vDSP_Length(rawBytes.count))
vDSP.add(inputBufferDataRawScalars.mean, inputBufferDataRawFloat, result: &inputBufferDataRawFloat)
vDSP.multiply(inputBufferDataRawScalars.std, inputBufferDataRawFloat, result: &inputBufferDataRawFloat)
return Data(bytes: inputBufferDataRawFloat, count: inputBufferDataRawFloat.count * MemoryLayout<Float>.size)
It works very fast. Maybe there is a better function in Accelerate; if anyone knows of one, please let me know. It needs to compute (A[n] + B) * C (or, to be exact, (A[n] - B) / C, but that can be converted to the first form).
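On the question of a better function: one candidate is vDSP_vsmsa, which multiplies a vector by a scalar and adds a scalar in a single pass. Since (A[n] - B) / C equals A[n] * (1 / C) + (-B / C), the two steps above collapse into one call. This is a sketch, not a measured improvement; the helper name `normalizeBytes` and the separate output array are illustrative assumptions, and the results can differ from an exact division in the last bits.

```swift
import Accelerate

/// Hypothetical helper: (byte - mean) / std, rewritten as byte * scale + offset.
func normalizeBytes(_ bytes: [UInt8], mean: Float, std: Float) -> [Float] {
    guard !bytes.isEmpty else { return [] }
    let n = vDSP_Length(bytes.count)
    var floats = [Float](repeating: 0, count: bytes.count)
    var output = [Float](repeating: 0, count: bytes.count)
    var scale = 1 / std        // multiply instead of divide
    var offset = -mean / std   // the subtraction folded into one constant
    vDSP_vfltu8(bytes, 1, &floats, 1, n)                    // UInt8 -> Float
    vDSP_vsmsa(floats, 1, &scale, &offset, &output, 1, n)   // floats * scale + offset
    return output
}

print(normalizeBytes([0, 255], mean: 127.5, std: 127.5))   // approximately [-1.0, 1.0]
```

In real use the two scratch arrays would be allocated once and reused between frames, as in the answer above.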

Getting pixel format from CGImage

I understand bitmap layout and pixel formats pretty well, but I'm running into an issue when working with png / jpeg images loaded through NSImage – I can't figure out if what I get is the intended behaviour or a bug.
let nsImage:NSImage = NSImage(byReferencingURL: …)
let cgImage:CGImage = nsImage.CGImageForProposedRect(nil, context: nil, hints: nil)!
let bitmapInfo:CGBitmapInfo = CGImageGetBitmapInfo(cgImage)
Swift.print(bitmapInfo.contains(CGBitmapInfo.ByteOrderDefault)) // True
My kCGBitmapByteOrder32Host is little endian, which implies that the pixel format is also little endian – BGRA in this case. But… png format is big endian by specification, and that's how the bytes are actually arranged in the data – opposite from what bitmap info tells me.
Does anybody know what's going on? Surely the system somehow knows how to deal with this, since pngs are displayed correctly. Is there a bullet-proof way of detecting the pixel format of a CGImage? Complete demo project is available at GitHub.
P. S. I'm copying raw pixel data via a CFDataGetBytePtr buffer into another library's buffer, which then gets processed and saved. In order to do so, I need to explicitly specify the pixel format. The actual images I'm dealing with (any png / jpeg files that I've checked) display correctly, for example:
But bitmap info of the same images gives me incorrect endianness information, resulting in bitmap being handled as BGRA pixel format instead of actual RGBA, when I process it the result looks like this:
The resulting image demonstrates the colour swapping between red and blue pixels. If the RGBA pixel format is specified explicitly, everything works out perfectly, but I need this detection to be automated.
P. P. S. Documentation briefly mentions that CGColorSpace is another important variable that defines pixel format / byte order, but I found no mentions how to get it out of there.
Some years later, and after testing my findings in production, I can share them with good confidence, though I'm still hoping someone with more theoretical knowledge will explain things better here. Good places to refresh your memory:
Wikipedia: RGBA color space – Representation
Apple Lists: Byte Order in CGBitmapContextCreate
Apple Lists: kCGImageAlphaPremultiplied First/Last
Based on that, you can use the following extensions:
public enum PixelFormat {
    case abgr
    case argb
    case bgra
    case rgba
}
extension CGBitmapInfo {
    public static var byteOrder16Host: CGBitmapInfo {
        return CFByteOrderGetCurrent() == Int(CFByteOrderLittleEndian.rawValue) ? .byteOrder16Little : .byteOrder16Big
    }
    public static var byteOrder32Host: CGBitmapInfo {
        return CFByteOrderGetCurrent() == Int(CFByteOrderLittleEndian.rawValue) ? .byteOrder32Little : .byteOrder32Big
    }
}
extension CGBitmapInfo {
    public var pixelFormat: PixelFormat? {
        // AlphaFirst – the alpha channel is next to the red channel, argb and bgra are both alpha first formats.
        // AlphaLast – the alpha channel is next to the blue channel, rgba and abgr are both alpha last formats.
        // LittleEndian – blue comes before red, bgra and abgr are little endian formats.
        // Little endian ordered pixels are BGR (BGRX, XBGR, BGRA, ABGR, BGR).
        // BigEndian – red comes before blue, argb and rgba are big endian formats.
        // Big endian ordered pixels are RGB (XRGB, RGBX, ARGB, RGBA, RGB).
        let alphaInfo: CGImageAlphaInfo? = CGImageAlphaInfo(rawValue: self.rawValue & type(of: self).alphaInfoMask.rawValue)
        let alphaFirst: Bool = alphaInfo == .premultipliedFirst || alphaInfo == .first || alphaInfo == .noneSkipFirst
        let alphaLast: Bool = alphaInfo == .premultipliedLast || alphaInfo == .last || alphaInfo == .noneSkipLast
        let endianLittle: Bool = self.contains(.byteOrder32Little)
        // This is slippery… while byte order host returns little endian, default bytes are stored in big endian
        // format. Here we just assume if no byte order is given, then simple RGB is used, aka big endian, though…
        if alphaFirst && endianLittle {
            return .bgra
        } else if alphaFirst {
            return .argb
        } else if alphaLast && endianLittle {
            return .abgr
        } else if alphaLast {
            return .rgba
        } else {
            return nil
        }
    }
}
Note that you should always pay attention to colour space – it directly affects how raw pixel data is stored. CGColorSpace(name: CGColorSpace.sRGB) is probably the safest one – it stores colours in plain format, for example, if you deal with red RGB it will be stored just like that (255, 0, 0) while device colour space will give you something like (235, 73, 53).
To see this in practice, drop the above and the following into a playground. You'll need two one-pixel red images, one with alpha and one without; this and this should work.
import AppKit
import CoreGraphics
extension CFData {
    public var pixelComponents: [UInt8] {
        let buffer: UnsafeMutablePointer<UInt8> = UnsafeMutablePointer.allocate(capacity: 4)
        defer { buffer.deallocate() }
        CFDataGetBytes(self, CFRange(location: 0, length: CFDataGetLength(self)), buffer)
        return Array(UnsafeBufferPointer(start: buffer, count: 4))
    }
}
let color: NSColor = .red
Thread.sleep(forTimeInterval: 2)
// Must flip coordinates to capture what we want…
let screen: NSScreen = NSScreen.screens.first(where: { $0.frame.contains(NSEvent.mouseLocation) })!
let rect: CGRect = CGRect(origin: CGPoint(x: NSEvent.mouseLocation.x - 10, y: screen.frame.height - NSEvent.mouseLocation.y), size: CGSize(width: 1, height: 1))
Swift.print("Will capture image with \(rect) frame.")
let screenImage: CGImage = CGWindowListCreateImage(rect, [], kCGNullWindowID, [])!
let urlImageWithAlpha: CGImage = NSImage(byReferencing: URL(fileURLWithPath: "/Users/ianbytchek/Downloads/red-pixel-with-alpha.png")).cgImage(forProposedRect: nil, context: nil, hints: nil)!
let urlImageNoAlpha: CGImage = NSImage(byReferencing: URL(fileURLWithPath: "/Users/ianbytchek/Downloads/red-pixel-no-alpha.png")).cgImage(forProposedRect: nil, context: nil, hints: nil)!
Swift.print(screenImage.colorSpace!, screenImage.bitmapInfo, screenImage.bitmapInfo.pixelFormat!, screenImage.dataProvider!.data!.pixelComponents)
Swift.print(urlImageWithAlpha.colorSpace!, urlImageWithAlpha.bitmapInfo, urlImageWithAlpha.bitmapInfo.pixelFormat!, urlImageWithAlpha.dataProvider!.data!.pixelComponents)
Swift.print(urlImageNoAlpha.colorSpace!, urlImageNoAlpha.bitmapInfo, urlImageNoAlpha.bitmapInfo.pixelFormat!, urlImageNoAlpha.dataProvider!.data!.pixelComponents)
let formats: [CGBitmapInfo.RawValue] = [
    // Alpha-info raw values 2, 6, 1 and 5, matching the output shown below.
    CGImageAlphaInfo.premultipliedFirst.rawValue,
    CGImageAlphaInfo.noneSkipFirst.rawValue,
    CGImageAlphaInfo.premultipliedLast.rawValue,
    CGImageAlphaInfo.noneSkipLast.rawValue,
]
for format in formats {
    // This "paints" and prints out components in the order they are stored in data.
    let context: CGContext = CGContext(data: nil, width: 1, height: 1, bitsPerComponent: 8, bytesPerRow: 32, space: CGColorSpace(name: CGColorSpace.sRGB)!, bitmapInfo: format)!
    let components: UnsafeBufferPointer<UInt8> = UnsafeBufferPointer(start: context.data!.assumingMemoryBound(to: UInt8.self), count: 4)
    context.setFillColor(red: 1 / 0xFF, green: 2 / 0xFF, blue: 3 / 0xFF, alpha: 1)
    context.fill(CGRect(x: 0, y: 0, width: 1, height: 1))
    Swift.print(context.colorSpace!, context.bitmapInfo, context.bitmapInfo.pixelFormat!, Array(components))
}
This will output the following. Note how the screen-captured image differs from the ones loaded from disk.
Will capture image with (285.7734375, 294.5, 1.0, 1.0) frame.
<CGColorSpace 0x7fde4e9103e0> (kCGColorSpaceICCBased; kCGColorSpaceModelRGB; iMac) CGBitmapInfo(rawValue: 8194) bgra [27, 13, 252, 255]
<CGColorSpace 0x7fde4d703b20> (kCGColorSpaceICCBased; kCGColorSpaceModelRGB; Color LCD) CGBitmapInfo(rawValue: 3) rgba [235, 73, 53, 255]
<CGColorSpace 0x7fde4e915dc0> (kCGColorSpaceICCBased; kCGColorSpaceModelRGB; Color LCD) CGBitmapInfo(rawValue: 5) rgba [235, 73, 53, 255]
<CGColorSpace 0x7fde4d60d390> (kCGColorSpaceICCBased; kCGColorSpaceModelRGB; sRGB IEC61966-2.1) CGBitmapInfo(rawValue: 2) argb [255, 1, 2, 3]
<CGColorSpace 0x7fde4d60d390> (kCGColorSpaceICCBased; kCGColorSpaceModelRGB; sRGB IEC61966-2.1) CGBitmapInfo(rawValue: 6) argb [255, 1, 2, 3]
<CGColorSpace 0x7fde4d60d390> (kCGColorSpaceICCBased; kCGColorSpaceModelRGB; sRGB IEC61966-2.1) CGBitmapInfo(rawValue: 1) rgba [1, 2, 3, 255]
<CGColorSpace 0x7fde4d60d390> (kCGColorSpaceICCBased; kCGColorSpaceModelRGB; sRGB IEC61966-2.1) CGBitmapInfo(rawValue: 5) rgba [1, 2, 3, 255]
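The raw values in that output can be decoded by hand, which helps when checking the extension's result. The sketch below re-implements the same decision table on plain integers so it runs without CoreGraphics. The constants mirror kCGBitmapAlphaInfoMask (0x1F), kCGBitmapByteOrderMask (0x7000) and kCGBitmapByteOrder32Little (2 << 12); the function name is hypothetical. Unlike the extension, it compares the masked byte-order field for equality instead of using contains.

```swift
enum PixelFormat { case abgr, argb, bgra, rgba }

let alphaInfoMask: UInt32 = 0x1F
let byteOrderMask: UInt32 = 0x7000
let byteOrder32Little: UInt32 = 2 << 12   // 8192

/// Hypothetical helper: maps a CGBitmapInfo raw value to a pixel format.
func pixelFormat(forRawBitmapInfo raw: UInt32) -> PixelFormat? {
    let alpha = raw & alphaInfoMask
    let little = (raw & byteOrderMask) == byteOrder32Little
    // CGImageAlphaInfo raw values: 2 premultipliedFirst, 4 first, 6 noneSkipFirst.
    let alphaFirst = [2, 4, 6].contains(alpha)
    // CGImageAlphaInfo raw values: 1 premultipliedLast, 3 last, 5 noneSkipLast.
    let alphaLast = [1, 3, 5].contains(alpha)
    switch (alphaFirst, alphaLast, little) {
    case (true, _, true): return .bgra
    case (true, _, false): return .argb
    case (false, true, true): return .abgr
    case (false, true, false): return .rgba
    default: return nil
    }
}

// 8194 = 8192 (32-bit little endian) + 2 (premultipliedFirst), the screen capture.
for raw: UInt32 in [8194, 3, 5, 2, 6, 1] {
    print(raw, pixelFormat(forRawBitmapInfo: raw) as Any)
}
```

The formats this yields for 8194, 3, 5, 2, 6 and 1 agree with the bgra, rgba, rgba, argb, argb and rgba entries printed above.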
Could you use NSBitmapFormat?
I wrote a class to source color schemes from images, and that's what I used to determine the bitmap format. Here's a snippet of how I used it:
var averageColorImage: CIImage?
var averageColorImageBitmap: NSBitmapImageRep
//... core image filter code
averageColorImage = filter?.outputImage
averageColorImageBitmap = NSBitmapImageRep(CIImage: averageColorImage!)
let red, green, blue: Int
switch averageColorImageBitmap.bitmapFormat {
case NSBitmapFormat.NSAlphaFirstBitmapFormat:
    // Alpha occupies byte 0, so the colour components start at byte 1.
    red = Int(averageColorImageBitmap.bitmapData.advancedBy(1).memory)
    green = Int(averageColorImageBitmap.bitmapData.advancedBy(2).memory)
    blue = Int(averageColorImageBitmap.bitmapData.advancedBy(3).memory)
default:
    red = Int(averageColorImageBitmap.bitmapData.memory)
    green = Int(averageColorImageBitmap.bitmapData.advancedBy(1).memory)
    blue = Int(averageColorImageBitmap.bitmapData.advancedBy(2).memory)
}
Check out the answer to How to keep NSBitmapImageRep from creating lots of intermediate CGImages?.
The gist is that the NSImage/NSBitmapImageRep implementation automatically handles the input format.
Apple's docs fail to note that the format parameter (for example in CIRenderDestination) specifies the desired output space.
If you want it in a particular format, the docs recommend drawing into that format (example in linked answer).
If you just need particular information, NSBitmapImageRep provides easy access to individual parameters. I could not find a clear and direct route to a CIFormat without setting up cascading manual tests. I assume a way exists somewhere.