When I add workers to neutral network I get an error, pytorch - neural-network

I have checked a lot of post and none of them seem to work for me. But when I try to add workers to the dataloader in pytorch it just feeds me an error back. I have tired reading it and figuring it out but I can't seem to find a solution. I assume there is something I'm supposed to add to make the workers able to do their job.
I have 64GB of ram, i9-9900k, and a 3080ti. So I don't think its a memory error is it?
I included the error code with 1 worker and 4 workers because they seem to be different.
Also it works with zero workers.
here is the error with 4 workers:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\queue.py", line 172, in get raise Empty
queue.Empty
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:/Users/14055/Desktop/Class 1 Project/Chegg.py", line 202, in <module>
training()
File "c:/Users/14055/Desktop/Class 1 Project/Chegg.py", line 122, in training
for data, target in load_data.train_loader:
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__ data = self._next_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1186, in _next_data idx, data = self._get_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1142, in _get_data success, data = self._try_get_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1003, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 23204, 7668, 13636, 6132) exited unexpectedly
Error with 1 worker:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "c:\Users\14055\Desktop\Class 1 Project\Chegg.py", line 202, in <module>
training()
File "c:\Users\14055\Desktop\Class 1 Project\Chegg.py", line 122, in training
for data, target in load_data.train_loader:
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
return self._get_iterator()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 918, in __init__
w.start()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\queue.py", line 172, in get
raise Empty
queue.Empty
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:/Users/14055/Desktop/Class 1 Project/Chegg.py", line 202, in <module>
training()
File "c:/Users/14055/Desktop/Class 1 Project/Chegg.py", line 122, in training
for data, target in load_data.train_loader:
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
data = self._next_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1142, in _get_data
success, data = self._try_get_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1003, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3372) exited unexpectedly
Code:
from numpy import testing
import torch.cuda
import numpy as np
import time
import array as arr
import os
from datetime import date, datetime
from torchvision import datasets
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary
torch.cuda.set_device(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def load_data():
num_workers = 1
load_data.batch_size = 20
transform = transforms.ToTensor()
train_data = datasets.MNIST(root='data', train=True, download=True, transform=transform)
load_data.train_loader = torch.utils.data.DataLoader(train_data,
batch_size=load_data.batch_size, num_workers=num_workers, pin_memory=True,
shuffle=True)
test_data = datasets.MNIST(root='data', train=False, download=True, transform=transform)
load_data.test_loader = torch.utils.data.DataLoader(test_data,
batch_size=load_data.batch_size, num_workers=num_workers, pin_memory=True,
shuffle=True)
def visualize():
dataiter = iter(load_data.train_loader)
visualize.images, labels = dataiter.next()
visualize.images = visualize.images.numpy()
fig = plt.figure(figsize=(25, 4))
for idx in np.arange(load_data.batch_size):
ax = fig.add_subplot(2, load_data.batch_size/2, idx+1, xticks=[], yticks=[])
ax.imshow(np.squeeze(visualize.images[idx]), cmap='gray')
ax.set_title(str(labels[idx].item()))
#plt.show()
def fig_values():
img = np.squeeze(visualize.images[1])
fig = plt.figure(figsize = (12,12))
ax = fig.add_subplot(111)
ax.imshow(img, cmap='gray')
width, height = img.shape
thresh = img.max()/2.5
for x in range(width):
for y in range(height):
val = round(img[x][y],2) if img[x][y] !=0 else 0
ax.annotate(str(val), xy=(y,x),
horizontalalignment='center',
verticalalignment='center',
color='white' if img[x][y]<thresh else 'black')
#plt.show()
load_data()
#visualize()
#fig_values()
class NeuralNet(nn.Module):
def __init__(self, gpu = True):
super(NeuralNet, self ).__init__()
self.conv1 = nn.Conv2d(in_channels=1, out_channels=128, kernel_size=3, padding=1)
self.bn1 = nn.BatchNorm2d(num_features=128)
self.tns1 = nn.Conv2d(in_channels=128, out_channels=4, kernel_size=1, padding=1)
self.conv2 = nn.Conv2d(in_channels=4, out_channels=16, kernel_size=3, padding=1)
self.bn2 = nn.BatchNorm2d(num_features=16)
self.pool1 = nn.MaxPool2d(2,2)
self.conv3 = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
self.bn3 = nn.BatchNorm2d(num_features=16)
self.conv4 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
self.bn4 = nn.BatchNorm2d(num_features=32)
self.pool2 = nn.MaxPool2d(2,2)
self.tns2 = nn.Conv2d(in_channels=32, out_channels=16, kernel_size=1, padding=1)
self.conv5 = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
self.bn5 = nn.BatchNorm2d(num_features=16)
self.conv6 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
self.bn6 = nn.BatchNorm2d(num_features=32)
self.conv7 = nn.Conv2d(in_channels=32, out_channels=10, kernel_size=1, padding=1)
self.gpool = nn.AvgPool2d(kernel_size=7)
self.drop = nn.Dropout2d(0.1)
def forward(self, x):
x = self.tns1(self.drop(self.bn1(F.relu(self.conv1(x)))))
x = self.drop(self.bn2(F.relu(self.conv2(x))))
x = self.pool1(x)
x = self.drop(self.bn3(F.relu(self.conv3(x))))
x = self.drop(self.bn4(F.relu(self.conv4(x))))
x = self.tns2(self.pool2(x))
x = self.drop(self.bn5(F.relu(self.conv5(x))))
x = self.drop(self.bn6(F.relu(self.conv6(x))))
x = self.conv7(x)
x = self.gpool(x)
x = x.view(-1, 10)
return F.log_softmax(x).to(device)
#has antioverfit
def training():
model.to(device)
optimizer= torch.optim.SGD(model.parameters(), lr=0.003, weight_decay= 0.00005, momentum = .9, nesterov = True)
n_epochs = 20000
a = np.float64([9,9,9,9,9]) #antioverfit
testing_loss = 0.0
for epoch in range(n_epochs) :
if(testing_loss <= a[4]): # part of anti overfit
train_loss = 0.0
testing_loss = 0.0
model.train().to(device)
for data, target in load_data.train_loader:
optimizer.zero_grad()
data = data.to(device) #gpu
target = target.to(device) #gpu
output = model(data).to(device)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
train_loss += loss.item()*data.size(0)
train_loss = train_loss/len(load_data.train_loader.dataset)
print('Epoch: {} \tTraining Loss: {:.6f}'.format(epoch+1, train_loss))
model.eval().to(device) # Gets Validation loss
train_loss = 0.0
with torch.no_grad():
for data, target in load_data.test_loader:
data = data.to(device)
target = target.to(device)
output = model(data).to(device)
loss =F.nll_loss(output, target)
testing_loss += loss.item()*data.size(0)
testing_loss = testing_loss / len(load_data.test_loader.dataset)
print('Validation loss = ' , testing_loss)
a = np.insert(a,0,testing_loss) # part of anti overfit
a = np.delete(a,5)
print('Validation loss = ' , testing_loss)
def evalution():
test_loss = 0.0
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
model.eval().to(device)
for data, target in load_data.test_loader:
data = data.to(device)
target = target.to(device)
output = model(data).to(device)
loss =F.nll_loss(output, target)
test_loss += loss.item()*data.size(0)
_, pred = torch.max(output, 1)
correct = np.squeeze(pred.eq(target.data.view_as(pred))).to(device)
for i in range(load_data.batch_size):
try:
label = target.data[i]
class_correct[label] += correct[i].item()
class_total[label] += 1
except IndexError:
break
# calculate and print avg test loss
test_loss = test_loss/len(load_data.test_loader.dataset)
print('Test Loss: {:.6f}\n'.format(test_loss))
for i in range(10):
if class_total[i] > 0:
print('Test Accuracy of %5s: %2d%% (%2d/%2d)' % (
str(i), 100 * class_correct[i] / class_total[i],
np.sum(class_correct[i]), np.sum(class_total[i])))
else:
print('Test Accuracy of %5s: N/A (no training examples)' )
print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
100. * np.sum(class_correct) / np.sum(class_total),
np.sum(class_correct), np.sum(class_total)))
acc = (
100. * np.sum(class_correct) / np.sum(class_total),
np.sum(class_correct), np.sum(class_total))
name = f"model-{acc}.pt"
name2 = f"model-{acc}.pth"
save_path = os.path.join("models", name)
save_path2 = os.path.join("models", name2)
torch.save(model, save_path)
torch.save(model, save_path2)
model = NeuralNet().to(device)
summary(model, input_size=(1, 28, 28))
training()
evalution()

Related

Convert pointcloud csv to hdf5 to train on PointCNN network

I am trying to train my point cloud data on PointCNN so I need to convert my dataset to hdf5 as used in PointCNN. PointCNN used the modelnet40_ply_hdf5_2048 dataset.
I have tried converting my custom dataset but I am having issues with the label.
I tried this to get the label/shape_names
shape_ids = {}
shape_ids = [line.rstrip() for line in open(os.path.join(PATH, 'filelist1.txt'))]
shape_names = ['_'.join(x.split('_')[0:-1]) for x in shape_ids]
datapath = [(shape_names[i], os.path.join(PATH, shape_names[i], shape_ids[i])) for i
in range(len(shape_ids))]
Convert to h5py file
import numpy as np
from tqdm import tqdm
import h5py
filenames = [line.rstrip() for line in open(os.path.join(PATH))]
f = h5py.File("filename", 'w')
data = np.zeros((len(filenames), 1024, 3))
for i in range(0, len(datapath)):
fn = datapath[i]
cls = classes[datapath[i][0]]
label = np.array([cls]).astype(np.int32)
csvreader = np.genfromtxt("data1/" + filenames[i] + ".csv", delimiter=",").astype(np.float32)
for j in range(0,1024):
data[i,j] = [csvreader[j][0], csvreader[j][1], csvreader[j][2]]
label
dset1 = f.create_dataset("data", data=data, compression="gzip", compression_opts=4)
dset2 = f.create_dataset("label", data=label, compression="gzip", compression_opts=1)
f.close()
It did convert successfully but when I tried to train on PointCNN
PointCNN training
------Building model-------
------Successfully Built model-------
Traceback (most recent call last):
File "train_pytorch.py", line 174, in <module>
current_data, current_label, _ = provider.shuffle_data(current_data, np.squeeze(current_label))
File "provider.py", line 28, in shuffle_data
idx = np.arange(len(labels))
TypeError: len() of unsized object

"ValueError: max_evals=500 is too low for the Permutation explainer" shap answers me do I have to give more data (photos)?

I want to test the explainability of a multiclass semantic segmentation model, deeplab_v3plus with shap to know which features contribute the most to semantic classification. However I have a ValueError: max_evals=500 is too low when running my file, and I struggle to understand the reason.
import glob
from PIL import Image
import torch
from torchvision import transforms
from torchvision.utils import make_grid
import torchvision.transforms.functional as tf
from deeplab import deeplab_v3plus
import shap
def test(args):
# make a video prez
model = deeplab_v3plus('resnet101', num_classes=args.nclass, output_stride=16, pretrained_backbone=True)
model.load_state_dict(torch.load(args.seg_file,map_location=torch.device('cpu'))) # because no gpu available on sandbox environnement
model = model.to(args.device)
model.eval()
explainer = shap.Explainer(model)
with torch.no_grad():
for i, file in enumerate(args.img_folder):
img = img2tensor(file, args)
pred = model(img)
print(explainer(img))
if __name__ == '__main__':
class Arguments:
def __init__(self):
self.device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
self.seg_file = "Model_Woodscape.pth"
self.img_folder = glob.glob("test_img/*.png")
self.mean = [0.485, 0.456, 0.406]
self.std = [0.229, 0.224, 0.225]
self.h, self.w = 483, 640
self.nclass = 10
self.cmap = {
1: [128, 64, 128], # "road",
2: [69, 76, 11], # "lanemarks",
3: [0, 255, 0], # "curb",
4: [220, 20, 60], # "person",
5: [255, 0, 0], # "rider",
6: [0, 0, 142], # "vehicles",
7: [119, 11, 32], # "bicycle",
8: [0, 0, 230], # "motorcycle",
9: [220, 220, 0], # "traffic_sign",
0: [0, 0, 0] # "void"
}
args = Arguments()
test(args)
But it returns:
(dee_env) jovyan#jupyter:~/use-cases/Scene_understanding/Code_Woodscape/deeplab_v3+$ python test_shap.py
BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
Traceback (most recent call last):
File "/home/jovyan/use-cases/Scene_understanding/Code_Woodscape/deeplab_v3+/test_shap.py", line 85, in <module>
test(args)
File "/home/jovyan/use-cases/Scene_understanding/Code_Woodscape/deeplab_v3+/test_shap.py", line 37, in test
print(explainer(img))
File "/home/jovyan/use-cases/Scene_understanding/Code_Woodscape/deeplab_v3+/dee_env/lib/python3.9/site-packages/shap/explainers/_permutation.py", line 82, in __call__
return super().__call__(
File "/home/jovyan/use-cases/Scene_understanding/Code_Woodscape/deeplab_v3+/dee_env/lib/python3.9/site-packages/shap/explainers/_explainer.py", line 266, in __call__
row_result = self.explain_row(
File "/home/jovyan/use-cases/Scene_understanding/Code_Woodscape/deeplab_v3+/dee_env/lib/python3.9/site-packages/shap/explainers/_permutation.py", line 164, in explain_row
raise ValueError(f"max_evals={max_evals} is too low for the Permutation explainer, it must be at least 2 * num_features + 1 = {2 * len(inds) + 1}!")
ValueError: max_evals=500 is too low for the Permutation explainer, it must be at least 2 * num_features + 1 = 1854721!
In the source code it looks like it's because I don't give enough arguments. I only have three images in my test_img/* folder, is that why?
I have the same issue. A possible solution I found which seems to be working for my case is to replace this line
explainer = shap.Explainer(model)
With this line
explainer = shap.explainers.Permutation(model, max_evals = 1854721)
shap.Explainer by default has algorithm='auto'. From the documentation: shape.Explainer
By default the “auto” options attempts to make the best choice given
the passed model and masker, but this choice can always be overriden
by passing the name of a specific algorithm.
Since 'permutation' has been selected you can directly use shap.explainers.Permutation and set max_evals to the value suggested in the error message above.
Given the high number of your use case, this might take a really long time. I would suggest to use an easier model just for testing the above solution.

How to fix incorrect channel size in pytorch neural network?

I'm working with the Google utterance dataset in spectrogram form. Each data point has dimension (160, 101). In my data loader, I used batch_size=128. Therefore, each batch has dimension (128, 160, 101).
I use a LeNet model with the following code as the model:
class LeNet(nn.Module):
def __init__(self):
super(LeNet, self).__init__()
self.conv1 = nn.Conv2d(1, 6, 5)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1 = nn.Linear(16*5*5, 120)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 30)
def forward(self, x):
out = F.relu(self.conv1(x))
out = F.max_pool2d(out, 2)
out = F.relu(self.conv2(out))
out = F.max_pool2d(out, 2)
out = out.view(out.size(0), -1)
out = F.relu(self.fc1(out))
out = F.relu(self.fc2(out))
out = self.fc3(out)
return out
I tried unsqueezing the data with dim=3, but got this error:
Traceback (most recent call last):
File "train_speech.py", line 359, in <module>
train_loss, reg_loss, train_acc, cost = train(epoch)
File "train_speech.py", line 258, in train
outputs = (net(inputs))['out']
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/gdrive/My Drive/Colab Notebooks/mixup_erm-master/models/lenet.py", line 15, in forward
out = F.relu(self.conv1(x))
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 443, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [6, 1, 5, 5], expected input[128, 160, 101, 1] to have 1 channels, but got 160 channels instead
How do I fix this issue?
EDIT: New Error Message Below
torch.Size([128, 160, 101])
torch.Size([128, 1, 160, 101])
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
File "train_speech.py", line 363, in <module>
train_loss, reg_loss, train_acc, cost = train(epoch)
File "train_speech.py", line 262, in train
outputs = (net(inputs))['out']
IndexError: too many indices for tensor of dimension 2
I'm unsqueezing the data in each batch. The relevant section of my training code is below. inputs is analogous to x.
print(inputs.shape)
inputs = inputs.unsqueeze(1)
print(inputs.shape)
outputs = (net(inputs))['out']
Edit 2: New Error
Traceback (most recent call last):
File "train_speech.py", line 361, in <module>
train_loss, reg_loss, train_acc, cost = train(epoch)
File "train_speech.py", line 270, in train
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Function AddmmBackward returned an invalid gradient at index 1 - got [128, 400] but expected shape compatible with [128, 13024]
Edit 3: Train Loop Below
def train(epoch):
print('\nEpoch: %d' % epoch)
net.train()
train_loss = 0
reg_loss = 0
correct = 0
total = 0
for batch_idx, (inputs, targets) in enumerate(trainloader):
if use_cuda:
inputs, targets = inputs.cuda(), targets.cuda()
inputs, targets_a, targets_b, lam,layer, cost = mixup_data(inputs, targets,
args.alpha,args.mixupBatch, use_cuda)
inputs, targets_a, targets_b = map(Variable, (inputs,
targets_a, targets_b))
outputs = net(inputs)
loss = mixup_criterion(criterion, outputs, targets_a, targets_b, lam)
train_loss += loss.item()
_, predicted = torch.max(outputs.data, 1)
total += targets.size(0)
correct += (lam * predicted.eq(targets_a.data).cpu().sum().float()
+ (1 - lam) * predicted.eq(targets_b.data).cpu().sum().float())
optimizer.zero_grad()
loss.backward()
optimizer.step()
return (train_loss/batch_idx, reg_loss/batch_idx, 100.*correct/total, cost/batch_idx)
You should expand on axis=1 a.k.a. the channel axis:
>>> x = x.unsqueeze(1)
If you're inside the dataset __getitem__, then it corresponds to axis=0.

using dataloader to interface kafka data

i use dataloader to inferface the data in kafka and it doesnt work
here is my code
class kfkdataset(Dataset):
def __init__(self,consumer,image_size):
super(kfkdataset).__init__()
self.image_size=image_size
self.consumer = consumer
def __getitem__(self, index):
info = json.loads(next(self.consumer).value)
image_osspath = info['path']
image = prep_image_batch(image_osspath,self.image_size)
return image,image_osspath
def __len__(self):
# You should change 0 to the total size of your dataset.
return 9000000
consumer = KafkaConsumer('my-topic',bootstrap_servers=[])
prodataset = kfkdataset(consumer,image_size=608)#)
k = DataLoader(prodataset,
batch_size=batch_size,
num_workers=16)
for inputimage,osspath in k:
inputimage = inputimage.to(device)
detections,_ = model(inputimage)
detections = non_max_suppression(detections, 0.98, 0.4)
it works when num_workers is 1
when num_workers >1:
errors came out
File "batch_upload.py", line 80, in
for inputimage,osspath in k:
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 801, in__next__
return self._process_data(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 846,in_process_data
data.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 369, in reraise
raise self.exc_type(msg)
FileExistsError: Caught FileExistsError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/appbatch/utils/utils.py", line 49, in getitem
info = json.loads(next(self.consumer).value)
File "/usr/local/lib/python3.6/dist-packages/kafka/consumer/group.py", line 1192, in next
return self.next_v2()
File "/usr/local/lib/python3.6/dist-packages/kafka/consumer/group.py", line 1200, in next_v2
return next(self._iterator)
File "/usr/local/lib/python3.6/dist-packages/kafka/consumer/group.py", line 1115, in _message_generator_v2
record_map = self.poll(timeout_ms=timeout_ms, update_offsets=False)
File "/usr/local/lib/python3.6/dist-packages/kafka/consumer/group.py", line 654, in poll
records = self._poll_once(remaining, max_records, update_offsets=update_offsets)
File "/usr/local/lib/python3.6/dist-packages/kafka/consumer/group.py", line 701, in _poll_once
self._client.poll(timeout_ms=timeout_ms)
File "/usr/local/lib/python3.6/dist-packages/kafka/client_async.py", line 600, in poll
self._poll(timeout / 1000)
File "/usr/local/lib/python3.6/dist-packages/kafka/client_async.py", line 629, in _poll
self._register_send_sockets()
File "/usr/local/lib/python3.6/dist-packages/kafka/client_async.py", line 619, in _register_send_sockets
self._selector.modify(key.fileobj, events, key.data)
File "/usr/lib/python3.6/selectors.py", line 261, in modify
key = self.register(fileobj, events, data)
File "/usr/lib/python3.6/selectors.py", line 412, in register
self._epoll.register(key.fd, epoll_events)
FileExistsError: [Errno 17] File exists
i want know how to make it works
Basically, setting num_workers > 1 in PyTorch's DataLoader is creating several worker processes which are in turn biding to the same socket port as there is only one consumer.
One approach to parallelize and improve importing data from Kafka is to create several consumers in the same consumer group for that topic.

Loading a pretrained model fails when multiple GPU was used for training

I have trained a network model and saved its weights and architecture via checkpoint = ModelCheckpoint(filepath='weights.hdf5') callback. During training, I am using multiple GPUs by calling the funtion below:
def make_parallel(model, gpu_count):
def get_slice(data, idx, parts):
shape = tf.shape(data)
size = tf.concat([ shape[:1] // parts, shape[1:] ],axis=0)
stride = tf.concat([ shape[:1] // parts, shape[1:]*0 ],axis=0)
start = stride * idx
return tf.slice(data, start, size)
outputs_all = []
for i in range(len(model.outputs)):
outputs_all.append([])
#Place a copy of the model on each GPU, each getting a slice of the batch
for i in range(gpu_count):
with tf.device('/gpu:%d' % i):
with tf.name_scope('tower_%d' % i) as scope:
inputs = []
#Slice each input into a piece for processing on this GPU
for x in model.inputs:
input_shape = tuple(x.get_shape().as_list())[1:]
slice_n = Lambda(get_slice, output_shape=input_shape, arguments={'idx':i,'parts':gpu_count})(x)
inputs.append(slice_n)
outputs = model(inputs)
if not isinstance(outputs, list):
outputs = [outputs]
#Save all the outputs for merging back together later
for l in range(len(outputs)):
outputs_all[l].append(outputs[l])
# merge outputs on CPU
with tf.device('/cpu:0'):
merged = []
for outputs in outputs_all:
merged.append(merge(outputs, mode='concat', concat_axis=0))
return Model(input=model.inputs, output=merged)
With the testing code:
from keras.models import Model, load_model
import numpy as np
import tensorflow as tf
model = load_model('cpm_log/deneme.hdf5')
x_test = np.random.randint(0, 255, (1, 368, 368, 3))
output = model.predict(x = x_test, batch_size=1)
print output[4].shape
I got the error below:
Traceback (most recent call last):
File "cpm_test.py", line 5, in <module>
model = load_model('cpm_log/Jun5_1000/deneme.hdf5')
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 240, in load_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 301, in model_from_config
return layer_module.deserialize(config, custom_objects=custom_objects)
File "/usr/local/lib/python2.7/dist-packages/keras/layers/__init__.py", line 46, in deserialize
printable_module_name='layer')
File "/usr/local/lib/python2.7/dist-packages/keras/utils/generic_utils.py", line 140, in deserialize_keras_object
list(custom_objects.items())))
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2378, in from_config
process_layer(layer_data)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2373, in process_layer
layer(input_tensors[0], **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 578, in __call__
output = self.call(inputs, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 659, in call
return self.function(inputs, **arguments)
File "/home/muhammed/DEV_LIBS/developments/mocap/pose_estimation/training/cpm/multi_gpu.py", line 12, in get_slice
def get_slice(data, idx, parts):
NameError: global name 'tf' is not defined
By inspecting the error output, I decide that the problem is with the parallelization code. However, I can't resolve the issue.
You may need to use custom_objects to enable loading of the model.
import tensorflow as tf
model = load_model('model.h5', custom_objects={'tf': tf,})