Slowdown on TensorFlow convolutional network with custom parameter update

I'm trying to implement a custom parameter update on a convolutional network, but every mini batch executed gets slower and slower.
I realize that there's no need to go through this trouble with a fixed learning rate, but I plan to update this later.
I call this in a loop where the feed_dict is the mini-batch:
,.1,1),feed_dict = feed_dict)
def layered_optimizer(cost, base_rate, rate_multiplier):
    gradients = tf.gradients(cost, [*weights, *biases])
    # update parameters based on gradients: var = var - gradient * base_rate * multiplier
    for i in range(len(weights)-1):
        weights[i].assign(tf.subtract(weights[i], tf.multiply(gradients[i], base_rate * rate_multiplier)))
        biases[i].assign(tf.subtract(biases[i], tf.multiply(gradients[len(weights)+i], base_rate * rate_multiplier)))
I'm not sure if this has anything to do with the problem, but after trying to run the code a second time I get the following errors and have to restart.
could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)

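What happens is that every time layered_optimizer gets called, this line
gradients = tf.gradients(cost, [*weights, *biases])
adds a new set of gradient ops to the graph (and the assign calls add more), so the graph keeps growing, taking up unnecessary memory and making each mini-batch slower than the last. Build these ops once, before the loop, and only run them inside it.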


How to solve RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0

I trained a model on DenseNet161 and saved it:
,'/content/drive/My Drive/Stanford40/densenet161.pth')
followed by this:
model = models.densenet161(pretrained=False,num_classes=11)
model.load_state_dict(torch.load('/content/drive/My Drive/Stanford40/densenet161.pth'))
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Then, when I want to proceed to the model evaluation
test_tuple=datasets.ImageFolder('/content/drive/My Drive/Stanford40/body/valid',transform=data_transforms['valid']),batch_size=1,shuffle=True)
for inputs, labels in test_dataloader:
    maxvlaues,indices = torch.max(outputs, 1)
and I face the error as in the picture:
When I check whether my model was moved to the device with this code
The result is True.
How can I modify the code to get away from this error?
My model training part can be found at How to solve TypeError: can’t convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first
You have moved model to the GPU, but not model_ft. The runtime error is at outputs = model_ft(inputs). Could it be a case of mixed-up variable names?

Flutter compute function takes time to start execute

I am trying to use Flutter's compute function to do some heavy real-time image processing using C++ code and dart:ffi.
I tried wrapping the call to the heavy function in a compute to avoid blocking the UI thread, and I took some time measurements to see what takes the most time to execute.
The code looks like this:
double _work(CheckPhotoData p) {
  DateTime s = DateTime.now();
  Pointer<Double> rPointer = Pointer.fromAddress(p.rPointerAddress);
  Pointer<Double> gPointer = Pointer.fromAddress(p.gPointerAddress);
  Pointer<Double> bPointer = Pointer.fromAddress(p.bPointerAddress);
  final a = NativeCCode.checkPhoto(rPointer, gPointer, bPointer, p.w, 1);
  // Assumed: the elapsed-time expressions were truncated in the original post.
  print("ACTUAL NativeCCode.checkPhoto took: " + DateTime.now().difference(s).inMilliseconds.toString());
  return a;
}

class CheckPhotoWrapper {
  static Future<double> checkPhotoWrapper(Uint8List photo) async {
    final CheckPhotoData deconstructData = _deconstructData(photo);
    DateTime s = DateTime.now();
    double res = await compute(_work, deconstructData);
    print("compute took: " + DateTime.now().difference(s).inMilliseconds.toString());
    return res;
  }
}
After running the code I got this output:
ACTUAL NativeCCode.checkPhoto took: 106
compute took: 514
(this means that compute took 408ms more than the code it runs)
From what I understand from these results, Flutter's compute function itself takes much more time than the code it's executing, and this overhead hurts performance.
Even worse, my app UI is stuck when the processing starts.
Is there a way to reduce the overhead that compute introduces, or a different approach to this issue that I couldn't figure out?
Thanks for any idea or a solution to my problem.
I ran the test in debug mode on a physical device.
CheckPhotoData is a simple class containing the parameters to my _work function.
I am using Flutter version 2.2.3, channel stable.
The overhead seems to be caused by debug mode. I saw a similar compute delay of several hundred milliseconds in my app (using Flutter 2.10.2), but when running in release mode it's less than 10 milliseconds.

How do I use system time as a trigger in codesys ladder?

I'm programming a Raspberry Pi with CODESYS, using mostly ladder. I need to write all the data currently held in a couple of arrays to a CSV file at midnight, so I'd like to use a DT value as a trigger. I can't figure out how to use that value in ladder, however. I can display the local time on the visualizer, but if I wanted something like "if localTime = #value, then coil 'Write' turns on", where is the actual variable for system time?
As far as I know, you need to read the clock from the local system using function blocks, for example GetDateAndTime from the CAA DTUtil Extern library. Then you need to keep it up to date with another function block, for example RTC from the Standard library.
The following reads the system local time and then updates it with a RTC function block. Works at least on Windows, couldn't test with Raspberry. Please note that if the local time changes for some reason, this won't update it again. So you need to run the GetDateAndTime call every now and then, for example.
First, a program that updates and provides the local time:
PROGRAM PRG_UpdateSystemTime
VAR_OUTPUT
    SystemDateTime : DT;
END_VAR
VAR
    ReadLocalTime : DTU.GetDateAndTime; //Reads local time from system
    RtcBlock : RTC; //Real-time clock - updates the previously received local time
END_VAR

//NOTE: Output is UTC time
//The block that reads local time. NOTE: Error handling is missing
ReadLocalTime(xExecute:= TRUE);
//Running real-time clock
RtcBlock(
    EN := ReadLocalTime.xDone AND NOT ReadLocalTime.xError,
    PDT := ReadLocalTime.dtDateAndTime,
    CDT => SystemDateTime
);
And then one example for ladder. I think there are millions of ways. Note that the "DoSomething" will be TRUE for the whole second, so you should probably use rising edge detection.
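The rising edge detection mentioned above can be sketched outside the PLC. The following Python snippet is only an illustration, not CODESYS code: the class RisingEdge and the helper is_trigger_second are hypothetical names, and RisingEdge merely mimics the behaviour of an R_TRIG block (output TRUE for exactly one cycle when the input goes from FALSE to TRUE). The scan timestamps are made up for the simulation.

```python
from datetime import datetime, time


class RisingEdge:
    """Hypothetical stand-in for an R_TRIG block: True for one call per edge."""

    def __init__(self):
        self._previous = False

    def __call__(self, clk: bool) -> bool:
        q = clk and not self._previous  # True only on a False -> True transition
        self._previous = clk
        return q


def is_trigger_second(now: datetime, trigger: time) -> bool:
    """Level signal: True during the whole second that matches the trigger."""
    return now.time().replace(microsecond=0) == trigger


if __name__ == "__main__":
    edge = RisingEdge()
    midnight = time(0, 0, 0)
    # Simulated PLC scans; two of them fall inside the midnight second.
    scans = [
        datetime(2024, 1, 1, 23, 59, 59, 900000),
        datetime(2024, 1, 2, 0, 0, 0, 100000),
        datetime(2024, 1, 2, 0, 0, 0, 600000),
        datetime(2024, 1, 2, 0, 0, 1, 100000),
    ]
    for now in scans:
        level = is_trigger_second(now, midnight)  # like "DoSomething"
        write = edge(level)                       # like the 'Write' coil
        print(now.time(), level, write)
```

In this simulation the level signal is true for both scans inside the midnight second, but the edge output fires only on the first of them, which is the behaviour you want for a one-shot CSV write.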

In Application Programming issue

I'm working on a project on the STM32L152RCT6, where I have to build a mechanism that lets the device update its own code from a newly gated file (HEX file).
For that I have implemented a bootloader-like mechanism: it checks for new firmware, and if there is any it has to verify it and, if it is found valid, store it at the "Application location".
I'm taking the following steps:
1. Boot loader address = 0x08000000
2. Application address = 0x08008000
3. The boot loader program checks a specified location for a new file.
4. If the file is found valid, the whole HEX is copied to the application location (as per the guide).
5. Then the application code is run by jumping to that location.
Now the problem comes at step 5. I've done all the steps above, and even the data has been stored properly (verified in the STM32 utility), but when I jump to the application code it doesn't work.
Is there something I have to cross-check, or something I'm missing?
Unlike other ARM controllers that directly jump to address 0 at reset, the Cortex-M series takes the start address from a vector table. If the program is loaded directly (without a bootloader), the vector table is at the start of the binary (loaded or mapped to address 0). The first entry, at offset 0, is the initial value of the stack pointer. The second entry, at offset 4, is called the reset vector; it contains the address of the first instruction to be executed.
Programs loaded with a bootloader usually preserve this arrangement, and put the vector table at the start of the binary, 0x08008000 in your case. Then the reset vector would be at 0x08008004. But it's your application, you should check where did you put your vector table. Hint: look at the .map file generated by the linker to be sure. If it's indeed at 0x08008000, then you can transfer control to the application reset vector so:
void (*app)(void); // declare a pointer to a function
app = *(void (**)(void))0x08008004; // see below
app(); // invoke the function through the pointer
The complicated cast in the second line converts the physical address to a pointer to a pointer to a function, takes the value pointed to it, which is now a pointer to a function, and assigns it to app.
Then you should manage the switchover to the application vector table. You can do it either in the bootloader or in the application, or divide the steps between them.
Disable all interrupts and stop SysTick. Note that SysTick is a system exception, not an NVIC interrupt, so don't call NVIC_DisableIRQ() on it. I'd do this step in the bootloader, so that it is responsible for disabling whatever it has enabled.
Assign the new vector table address to SCB->VTOR. Beware that the boilerplate SystemInit() function in system_stm32l1xx.c unconditionally changes SCB->VTOR back to the start of the flash, i.e. to 0x08000000, you should edit it to use the proper offset.
You can load the stack pointer value from the vector table too, but it's tricky to do it properly, and not really necessary, the application can just continue to use the stack that was set up in the bootloader. Just check it to make sure it's reasonable.
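The "make sure it's reasonable" check can also be done on the host before the image is ever flashed. Below is a minimal sketch in Python, assuming the application is available as a raw binary image (not Intel HEX) linked for 0x08008000. The function names are hypothetical, and the flash size (256 KB) and SRAM range (32 KB at 0x20000000) are assumptions that should be checked against the datasheet of your part.

```python
import struct

APP_BASE = 0x08008000                 # application start, from the question
FLASH_END = 0x08000000 + 256 * 1024   # assumed flash size; check the datasheet
RAM_START = 0x20000000                # assumed SRAM base
RAM_END = RAM_START + 32 * 1024       # assumed SRAM size; check the datasheet


def read_vector_table_head(image: bytes):
    """Return (initial_sp, reset_vector) from the first 8 bytes of a raw image."""
    if len(image) < 8:
        raise ValueError("image too short to contain a vector table")
    return struct.unpack_from("<II", image, 0)  # two little-endian 32-bit words


def looks_valid(image: bytes) -> bool:
    """Plausibility check only; it does not prove the image will run."""
    initial_sp, reset_vector = read_vector_table_head(image)
    # The initial stack pointer should be word-aligned and lie in SRAM
    # (it may equal the address one past the end of SRAM).
    sp_ok = RAM_START < initial_sp <= RAM_END and initial_sp % 4 == 0
    # Cortex-M executes Thumb code only, so bit 0 of the reset vector is set.
    thumb_ok = (reset_vector & 1) == 1
    # The handler itself should live inside the application's flash region,
    # after the first two vector table entries.
    target = reset_vector & ~1
    in_app = APP_BASE + 8 <= target < FLASH_END
    return sp_ok and thumb_ok and in_app


if __name__ == "__main__":
    good = struct.pack("<II", 0x20008000, 0x08008199)  # made-up example values
    erased = b"\xff" * 8                               # blank flash
    print(looks_valid(good), looks_valid(erased))
```

The same two comparisons can be made in C inside the bootloader right before the jump, which is a cheap guard against branching into erased or half-written flash.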
Have you changed the application according to the new flash position?
For example, the vector table has to be set correctly via
SCB->VTOR = ...
When your bootloader starts the app, it has to configure everything back to the reset state, as the application may rely on the default reset values. In particular, you need to:
Return all hardware registers to their reset values
Switch off all peripheral clocks (do not forget about the SysTick)
Disable all enabled interrupts
Return all clock domains to their reset values
Set the vector table address
Load the stack pointer from the beginning of the APP vector table
Call the APP entry point (vector table start + 4)
Your app has to be compiled and linked using a custom linker script where the FLASH start point is 0x08008000,
for example:
FLASH (rx) : ORIGIN = 0x8000000 + 32K, LENGTH = 512K - 32K
where FLASH_BASE's value must be equal to the address of your IROM's value in KEIL
#define FLASH_BASE 0x08004000
Keil configuration

How to tell PyCUDA to reuse the memory from an earlier kernel?

My program has two kernels. The second kernel should use the already uploaded input data and the results from the first kernel, so that I can save the memory transfers. How would I achieve this?
This is how I launch my kernels:
result = gpuarray.zeros(points, dtype=np.float32)
grid = (blocks,1),
block = (block_size, 1, 1),
In PyCUDA, data is not transferred to or from the device unless you explicitly request it.
For example, if you allocate memory and transfer some data to the GPU with:
result = np.zeros((height, width), dtype=np.float64)
result_device = gpuarray.to_gpu(result)
The variable result_device is a reference to the data in the GPU. You can pass result_device to any other kernel without incurring a memory transfer back to the CPU.
In this case a memory transfer will happen again when you call:
result = result_device.get()