How to improve tflite_flutter performance - flutter

I am using tflite_flutter in my app for:
- a Dog Detector, which returns a Rect if there is a dog in the camera view
- a Dog Classifier, which crops the image using the Rect and returns the breed
Both use tflite_flutter. Each takes 50-70 milliseconds on my Samsung Galaxy S10e.
I want to improve the performance. I have tried:
- varying the number of threads
- using ..useNnApiForAndroid = true
- using ..addDelegate(gpuDelegateV2) and ..addDelegate(NnApiDelegate()) after install.bat -d
- running the detector in an isolate
Nothing helps. What else can I try? Any ideas, anyone?

Inference latency of TFLite operations depends on multiple factors, such as delegate compatibility and whether the delegates have been enabled by the device manufacturer.
Here are a few factors to consider when optimizing models for low latency and fast inference.
1. Quantization and delegate compatibility:
For example, GPU delegates support all ranges of quantization, but Hexagon delegates are more compatible with integer and quantization-aware models.
2. GPU compatibility of the layers inside the model:
Not all ops are supported by GPU delegates by default, so you may have to substitute alternative ops to use them; for example, LeakyReLU is not supported but ReLU is.
You can check the GPU compatibility of your TFLite model using the model analyzer (see the Python sketch at the end of this answer); inference timing depends on how efficiently the model leverages the GPU/NNAPI/other delegates. You can also use the benchmarking APK to check the performance of your model against different delegates and optimize accordingly, for example by disabling quantization of suspect layers.
3. Serialization and on-device training:
You can use serialization / on-device training to reduce warm-up time and improve inference time.
4. Same input shape during inference:
Make sure you use the same input shape as the model's input during inference to avoid dimension-mismatch issues.
Thank you!
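If it helps, here is a minimal Python sketch of points 1 and 2 above, assuming a recent TensorFlow release (where tf.lite.experimental.Analyzer is available) and a hypothetical SavedModel directory saved_model_dir:

```python
# Sketch only: quantize a model and report its GPU-delegate compatibility.
# "saved_model_dir" and the output filename are placeholders for your own files.
import tensorflow as tf

# Post-training dynamic-range quantization (point 1). Make sure your target
# delegate actually supports the quantization scheme you pick.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open("detector_quant.tflite", "wb") as f:
    f.write(tflite_model)

# GPU-compatibility report (point 2): lists which ops the GPU delegate can
# run and which would fall back to the CPU.
tf.lite.experimental.Analyzer.analyze(
    model_path="detector_quant.tflite",
    gpu_compatibility=True,
)
```

Ops that fall back to the CPU are usually where GPU-delegate latency is lost, so they are the first candidates for substitution or for disabling quantization.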

Related

Using PyTorch model trained on RTX2080 on RTX3060

I am trying to run my PyTorch model (trained on an Nvidia RTX 2080) on the newer Nvidia RTX 3060 with CUDA support. It is possible to load the model and to execute it. If I run it on the CPU with the --no_cuda flag it runs smoothly and gives back correct predictions, but if I run it with CUDA, it only returns wrong predictions that make no sense.
Does the different GPU architecture of the cards affect the predictions?
OK, it seemed that the problem was the different floating-point behaviour of the two architectures: the Ampere-based RTX 3060 uses TF32 precision for matrix multiplications by default. The flag torch.backends.cuda.matmul.allow_tf32 = False needs to be set to provide stable execution of a model trained on a different architecture.
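For reference, a minimal sketch of the fix, assuming a recent PyTorch; the cuDNN flag is an extra, optional setting that also covers convolutions:

```python
import torch

# Disable TF32 so matmuls (and cuDNN convolutions) on Ampere cards such as
# the RTX 3060 run in full FP32, matching the older RTX 2080 behaviour.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False  # optional: also applies to convolutions

# Quick sanity check: this matmul is now computed without TF32 rounding.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b
```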

How to run TensorRT based deep learning model as real time?

I have optimized my deep learning model with TensorRT. A C++ interface runs inference on images with the optimized model on a Jetson TX2. This interface provides 60 FPS on average, but it is not stable: inference rates range between 50 and 160 FPS. I need to run this system in real time on a real-time-patched Jetson.
So what are your thoughts on real-time inference with TensorRT? Is it possible to develop a real-time inference system with TensorRT, and how?
I have tried setting high priorities on the process and its threads to provide preemption. I expect approximately the same FPS on every inference, so I need deterministic inference time, but the system does not behave deterministically.
Have you tried setting the clocks on the Jetson: sudo nvpmodel -m 0
Here are some links for more information:
https://elinux.org/Jetson/Performance
https://devtalk.nvidia.com/default/topic/999915/jetson-tx2/how-do-you-switch-between-max-q-and-max-p/post/5109507/#5109507
https://devtalk.nvidia.com/default/topic/1000345/jetson-tx2/two-cores-disabled-/post/5110960/#5110960
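For what it's worth, a small sketch of locking the clocks before benchmarking, assuming the standard L4T tools are installed (on older releases jetson_clocks may ship as a script such as jetson_clocks.sh):

```python
# Sketch: pin the Jetson TX2 to its maximum performance profile so dynamic
# clock scaling does not add jitter to inference times. Requires sudo.
import subprocess

def lock_max_performance():
    # MAXN mode: all CPU cores enabled, maximum CPU/GPU frequencies allowed.
    subprocess.run(["sudo", "nvpmodel", "-m", "0"], check=True)
    # Pin CPU/GPU/EMC clocks at their maximum instead of letting them scale.
    subprocess.run(["sudo", "jetson_clocks"], check=True)

if __name__ == "__main__":
    lock_max_performance()
```

Locking the clocks removes one source of variance; scheduling jitter from other processes is a separate issue.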

Object detection for a single object only

I have been working with object detection, but these methods consist of very deep neural networks and require lots of memory to store the trained models. For example, I once trained a Mask R-CNN model, and the weights take 200 MB.
However, my focus is on detecting a single object only, so I guess these methods are not suitable. Is there any object detection method that can do this job with a low memory requirement?
You can try SSD or Faster R-CNN; they are easily available in the TensorFlow Object Detection API.
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
Here you can get pre-trained models and config files.
You can select your model by looking at the speed and mAP (accuracy) columns, as per your requirements.
Following mukul's answer, I specifically recommend you check out SSDLite-MobileNetV2.
It's a lightweight model which is still expressive enough for good results.
Especially when you're restricting yourself to a single class, as you can see in the example of FaceSSD-MobileNetV2 here (note, however, that this is vanilla SSD).
So you can simply take the pre-trained SSDLite-MobileNetV2 model with the corresponding config file and modify it for a single class.
This means changing num_classes to 1, modifying the label_map.pbtxt, and of course preparing the dataset with the single class you want.
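As a small illustration of that change, a hedged sketch; the class name 'dog' and the file paths are placeholders for your own:

```python
# Sketch: write a single-class label map for the TF Object Detection API.
# The class name and output path are placeholders.
label_map = """
item {
  id: 1
  name: 'dog'
}
"""

with open("label_map.pbtxt", "w") as f:
    f.write(label_map.lstrip())

# In the SSDLite-MobileNetV2 pipeline config, also set num_classes: 1 inside
# the model { ssd { ... } } block and point label_map_path at this file.
```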
If you want a more robust model, but one which has no pre-trained detection model, you can use an FPN version.
Check out this config file, which uses MobileNetV1, and modify it for your needs (e.g. switching to MobileNetV2, enabling use_depthwise, etc.).
On one hand there's no pre-trained detection model, but on the other the detection head is shared over all (relevant) scales, so it's somewhat easier to train.
So simply fine-tune it from the corresponding classification checkpoint from here.

What kind of features are extracted with the AlexNet layers?

Question is regarding this method, which extracts features from the FC7 layer of AlexNet.
What kind of features is it actually extracting?
I used this method on images of paintings done by two artists. The training set is about 150 images from each artist (so the features are stored in a 300x4096 matrix); the validation set is 40 images. This works really well, with 85-90% correct classification. I would like to know why it works so well.
WHAT FEATURES?
FC8 is the classification layer; FC7 is the one before it, where all of the prior kernel pixels are linearised and concatenated. These represent the abstract, top-level features that the model training has "discovered". To examine these features, try one of the many layer visualization tools available online (don't ask for references here; SO bans requests for resources).
The features vary from one training to another, depending on the kernel initialization (usually random) and very dependent on the training set. However, the features tend to be simple in the early layers, with greater variety and detail in the later ones. For instance, on the original AlexNet target (ILSVRC 2012, a.k.a. ImageNet data set), the FC7 features often include vehicle tires, animal faces, various types of flower petals, green leaves and stems, two-legged animal torsos, airplane sections, car/truck/bus grill work, etc.
Does that help?
WHY DOES IT WORK SO WELL?
That depends on the data set and training parameters. How different are the images from the artists? There are plenty of features to extract: choice of subject, palette, compositional complexity, hard/soft edges, even direction of brush strokes. For instance, differentiating any two early cubists could be a little tricky; telling Rembrandt from Jackson Pollock should hit 100%.
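For readers who want to try this themselves, here is a hedged sketch of FC7 feature extraction using PyTorch/torchvision (the question's original method may have used a different framework; replacing the last classifier layer with an identity is just one common way to expose the 4096-dimensional FC7 activations, and it assumes a recent torchvision with the weights API):

```python
# Sketch: extract 4096-dim FC7 features from a pre-trained AlexNet.
# Image paths and the downstream classifier are placeholders.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = torch.nn.Identity()  # drop FC8 so the output is FC7
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fc7_features(image_path):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return model(img).squeeze(0)  # shape: (4096,)

# Stacking the features of ~300 training paintings gives the 300x4096 matrix
# mentioned in the question; any simple classifier (e.g. a linear SVM) can
# then be fit on top.
```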

Use GPU for speech: NSSpeechSynthesizer and NSSpeechRecognizer on OS X

I just did an interesting test of running a speech recognizer service and echoing what I said using NSSpeechSynthesizer.
However, NSSpeechSynthesizer is notorious for being slow and unresponsive, and I wanted to know if anyone has tried optimising this by specifying either a core, a thread or the GPU (using Metal) to process both recognition and synthesis.
I've been reading the following article to better understand pipelining values through the Metal buffer:
http://memkite.com/blog/2014/12/30/example-of-sharing-memory-between-gpu-and-cpu-with-swift-and-metal-for-ios8/
The author used Metal to offload the sigmoid function used in ML, which makes complete sense, as vector maths is what GPUs do best.
However, I would like to know if anyone has explored the possibility of sending other types of data, such as float values from a waveform (rendering synthesis through the GPU).
In particular, has anyone tried this for NSSpeechRecognizer or NSSpeechSynthesizer?
As it stands now, I have a full 3D scene with 3D HRTF sound, and both recognition and synthesis work, but sometimes there's a noticeable lag, so maybe dedicating a buffer pipeline through the GPU MTLDevice and back to play the file might work?