Using a PyTorch model trained on an RTX 2080 on an RTX 3060 - neural-network

I am trying to run my PyTorch model (trained on an Nvidia RTX 2080) on the newer Nvidia RTX 3060 with CUDA support. It is possible to load the model and to execute it. If I run it on the CPU with the --no_cuda flag it runs smoothly and gives back the correct predictions, but if I run it with CUDA, it only returns wrong predictions which make no sense.
Does the different GPU architecture of the two cards affect the predictions?

OK, it seems the problem was the different floating-point behaviour of the two architectures. On the Ampere-based RTX 3060, PyTorch may use the lower-precision TF32 format for matrix multiplications by default. The flag torch.backends.cuda.matmul.allow_tf32 = False needs to be set to get stable execution of a model trained on a different architecture.
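In case it helps anyone, a minimal sketch of where the flag goes, assuming the model is loaded and run from a Python script (the cuDNN flag is a related setting for convolutions that the answer above doesn't mention; the model path is a placeholder):

```python
import torch

# Disable TF32 so matmuls on the Ampere GPU (RTX 3060) run in full FP32,
# matching the numerics the model was trained with on the RTX 2080.
torch.backends.cuda.matmul.allow_tf32 = False
# Related flag for cuDNN convolutions (an addition, not from the answer above):
torch.backends.cudnn.allow_tf32 = False

# model = torch.load("model.pt")   # placeholder path
# model.to("cuda").eval()
```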

Related

MultiCore Programming on Raspberry Pi via Simulink

I am currently developing a model in Simulink with three main functions (let's call them A, B and C for now), where one of them runs at a different sample time than the other two. I tried to simulate this system on the Raspberry Pi via external mode but got a lot of overruns and a high CPU load. Now I am trying to split the model so that, for example, functions A and B are executed on one core and function C is executed on another core.
For this, I used this article from MathWorks, but I think that you can't actually assign a task to a core; you can only specify the periodic execution. As a result I could reduce the CPU load to a maximum of 40% but still get a lot of overruns (in my opinion, this also contradicts itself).
As a second approach, I tried this article, but I think this is not possible for Raspberry Pis, since I cannot add and assign cores in the concurrent execution tab.
My goal is to assign each task to its own core and to see the CPU load on the Raspberry Pi.
Many thanks in advance!

How to improve tflite_flutter performance

I am using tflite_flutter in my app for
a Dog Detector, which returns a Rect if there is a dog in the camera view
a Dog Classifier, which crops the image using the Rect and returns the breed.
Both use tflite_flutter. Each takes 50 - 70 milliseconds on my Samsung Galaxy S10e.
I want to improve the performance. I have tried
varying the number of threads
using ..useNnApiForAndroid = true
using ..addDelegate(gpuDelegateV2) and ..addDelegate(NnApiDelegate()) after install.bat -d
running the detector in an isolate
Nothing helps. What else can I try? Any ideas, anyone?
The inference latency of TFLite operations depends on multiple factors, such as delegate compatibility and whether delegates have been enabled by the manufacturer.
Here are a few factors to consider when optimizing models for low latency and faster inference.
1. Quantization and delegate compatibility:
For example, GPU delegates support all ranges of quantization, but Hexagon delegates are more compatible with integer and quantization-aware models.
2. GPU compatibility of the layers inside the model:
Not all ops are supported by the GPU delegate by default, so you may have to substitute alternative ops to use it; for example, LeakyReLU is not supported for GPU but ReLU is.
You can check the GPU compatibility of your lite model using the model analyzer (see the sketch after this answer). Inference timing depends on how efficiently the model leverages the GPU/NNAPI/other delegates. You can use the benchmarking APK to check the performance of lite models against different delegates and optimize the model accordingly, e.g., by disabling quantization of the suspected layers.
3. Serialization and on-device training:
You can use serialization or on-device training to reduce the warm-up time and improve inference time.
4. Same input shape during inference:
Please make sure you use the same input shape as the lite model's input during inference to avoid dimension-mismatch issues.
Thank you!
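For point 2, a minimal sketch of the model-analyzer check, run from Python rather than from Flutter (the model filename is a placeholder):

```python
import tensorflow as tf

# List the ops in the .tflite model and flag those that are not
# compatible with the TFLite GPU delegate.
tf.lite.experimental.Analyzer.analyze(
    model_path="dog_detector.tflite",  # placeholder path
    gpu_compatibility=True,
)
```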

How does the calculation work in AnyLogic?

I'm currently working on a project using AnyLogic. I'm building a system dynamics SIR model, and I made a manual calculation of each stock in Excel (using the Euler method), but the results in Excel are different from the results in AnyLogic. I'm curious how AnyLogic calculates the model that I built in it. Does anyone know how the calculation works in AnyLogic?
If your SD model is mixed with discrete-event or agent-based elements, the time step that you set in your model's configuration is not considered anymore and a different time step, which you have no access to, is used instead, unless you run the simulation in virtual mode (at least it's more likely to behave as you expect that way).
I have tested this extensively, and as long as your model is 100% system dynamics, your Euler equations should work as expected, in which case the likely reason for the mismatch is that your Excel calculation is incorrect.
On the other hand, if you use the RK4 approximation in AnyLogic, it doesn't really work properly, so I don't even know why they still have it as an option.
I suggest you try Vensim and run some tests to see the difference in results and to be sure you are calculating correctly in Excel.
In my course I talk about this topic in detail: noorjax.teachable.com
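For comparison with the Excel sheet, a minimal Python sketch of the Euler iteration that a pure system dynamics SIR model boils down to; beta, gamma, the initial stocks and the time step are placeholders for the model's own values:

```python
# Euler-method integration of a SIR model, step by step, the same way a
# stock-and-flow spreadsheet would compute it.
def sir_euler(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, dt=0.25, t_end=100.0):
    s, i, r = s0, i0, r0
    t = 0.0
    trajectory = [(t, s, i, r)]
    while t < t_end:
        ds = -beta * s * i             # flow out of Susceptible
        di = beta * s * i - gamma * i  # net flow into Infected
        dr = gamma * i                 # flow into Recovered
        s, i, r = s + dt * ds, i + dt * di, r + dt * dr
        t += dt
        trajectory.append((t, s, i, r))
    return trajectory

# dt should match the fixed time step set in the AnyLogic model's configuration;
# a mismatch there is a common reason Excel and AnyLogic diverge.
print(sir_euler()[-1])
```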

How to run a TensorRT-based deep learning model in real time?

I have optimized my deep learning model with TensorRT. A C++ interface runs inference on images with the optimized model on a Jetson TX2. This interface provides 60 FPS on average, but it is not stable: inference rates range between 50 and 160 FPS. I need to run this system in real time on a real-time-patched Jetson.
So what are your thoughts on real-time inference with TensorRT? Is it possible to develop a real-time inference system with TensorRT, and how?
I have tried setting high priorities on the process and threads to provide preemption. I expect approximately the same FPS value on every inference, so I need deterministic inference time, but the system does not behave deterministically.
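For reference, a minimal sketch of one way to give the inference process a real-time priority on a PREEMPT_RT-patched Linux/Jetson system; it is written in Python for brevity (the C++ equivalents are sched_setscheduler and sched_setaffinity), and the priority value and core index are arbitrary assumptions:

```python
import os

# Move this process into the SCHED_FIFO real-time scheduling class.
# Requires root or CAP_SYS_NICE; priority 80 is an arbitrary choice.
priority = min(80, os.sched_get_priority_max(os.SCHED_FIFO))
os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))

# Optionally pin the process to one core to reduce scheduling jitter
# (core index 3 is a placeholder).
os.sched_setaffinity(0, {3})
```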
Have you tried setting the clocks on the Jetson: sudo nvpmodel -m 0
Here are some links with more information:
https://elinux.org/Jetson/Performance
https://devtalk.nvidia.com/default/topic/999915/jetson-tx2/how-do-you-switch-between-max-q-and-max-p/post/5109507/#5109507
https://devtalk.nvidia.com/default/topic/1000345/jetson-tx2/two-cores-disabled-/post/5110960/#5110960

Using a subset of a SUMO scenario for OMNeT++ network simulation (with VEINS)

I'm trying to evaluate an application that runs on a vehicular network using OMNeT++, Veins and SUMO. Because the application relies on realistic traffic behavior, I decided to use the LuST Scenario, which seems to be the state of the art for such data. However, I'd like to use specific parts of this scenario instead of the entire scenario (e.g., a high and a low traffic load fragment, perhaps others). It'd be nice to keep the bidirectional functionality that VEINS offers, although I'm mostly interested in getting traffic data from SUMO into my simulation.
One obvious way to implement this would be to use a warm-up period. However, I'm wondering if there is a more efficient way -- simulating 8 hours of traffic just to get a several-minute fragment feels inefficient and may be problematic for simulations with sufficient repetitions.
Does VEINS have a built-in mechanism for warm-up periods, primarily one that avoids sending messages (which is by far the most time consuming part in the simulation), or does it have a way to wait for SUMO to advance, e.g., to a specific time stamp (which also avoids creating vehicle objects in OMNeT++ and thus all the initiation code)?
In case it's relevant -- I'm using the latest stable versions of OMNeT++ and SUMO (OMNeT++ 4.6 with SUMO 0.25.0) and my code base is based on VEINS 4a2 (with some changes, notably accepting the TraCI API version 10).
There are two things you can do here to reduce the number of sent messages in Veins:
Use the OMNeT++ warm-up period as described in the manual. Basically, it means setting warmup-period in your .ini file and making sure your code checks this with if (simTime() >= simulation.getWarmupPeriod()). The OMNeT++ signals for result collection are aware of this.
The TraCIScenarioManager offers a parameter double firstStepAt @unit("s") which you can use to delay its start. Again, this can be set in the .ini file.
As the VEINS FAQ states, the TraCIScenarioManagerLaunchd offers two variables to configure the region of interest, based on rectangles or roads (string roiRoads and string roiRects). To reduce the simulated area, you can restrict the simulation to a specific rectangle; for example, *.manager.roiRects="1000,1000-3000,3000" simulates a 2x2 km area between the two supplied coordinates.
With both solutions (best used in combination) you still have to run SUMO, but Veins barely consumes any of the time; a combined .ini sketch follows below.
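A minimal omnetpp.ini sketch combining the settings named above, assuming the scenario manager module is called manager; the times and coordinates are placeholders:

```ini
[General]
# OMNeT++ warm-up: results recorded before this time are discarded
warmup-period = 1000s

# Veins: delay the first TraCI step (no vehicle modules are created before this)
*.manager.firstStepAt = 1000s

# Veins: only manage vehicles inside this rectangle (x1,y1-x2,y2)
*.manager.roiRects = "1000,1000-3000,3000"
```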