I ran into an issue while building a convolution-deconvolution network. The original images are 565 × 584 and I'm trying to produce a segmentation of the same size.
The network worked fine with 1024 × 1024 images, but with these dimensions I get the following error when the gradients are computed:
segmentation_result.shape: (?, 565, 584, 1), targets.shape: (?, 565, 584, 1)
Process Process-1:
Traceback (most recent call last):
\Python\Python35\lib\site-packages\tensorflow\python\framework\tensor_shape.py", line 558, in merge_with
self.assert_is_compatible_with(other)
\Python\Python35\lib\site-packages\tensorflow\python\framework\tensor_shape.py", line 106, in assert_is_compatible_with
other))
ValueError: Dimensions 565 and 566 are not compatible
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
\Python\Python35\lib\multiprocessing\process.py", line 249, in _bootstrap
self.run()
\Python\Python35\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
.py", line 418, in train
network = Network(net_id = count, weight=pos_weight)
.py", line 199, in __init__
self.train_op = tf.train.AdamOptimizer().minimize(self.cost)
\Python\Python35\lib\site-packages\tensorflow\python\training\optimizer.py", line 315, in minimize
grad_loss=grad_loss)
\Python\Python35\lib\site-packages\tensorflow\python\training\optimizer.py", line 386, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
\Python\Python35\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 560, in gradients
in_grad.set_shape(t_in.get_shape())
\Python\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 443, in set_shape
self._shape = self._shape.merge_with(shape)
\Python\Python35\lib\site-packages\tensorflow\python\framework\tensor_shape.py", line 561, in merge_with
raise ValueError("Shapes %s and %s are not compatible" % (self, other))
ValueError: Shapes (?, 565, 584, 64) and (?, 566, 584, 64) are not compatible
The entire network has 10 convolutional layers and 10 deconvolutional layers, and each deconvolutional layer is a reversed version of the corresponding forward layer. Here is the code that produces a deconvolutional layer:
def create_layer_reversed(self, input, prev_layer=None):
    net_id = self.net_id
    print(net_id)
    with tf.variable_scope('conv', reuse=False):
        W = tf.get_variable('W{}_{}_'.format(self.name[-3:], net_id),
                            shape=(self.kernel_size, self.kernel_size, self.input_shape[3], self.output_channels))
        b = tf.Variable(tf.zeros([W.get_shape().as_list()[2]]))
        output = tf.nn.conv2d_transpose(
            input, W,
            tf.stack([tf.shape(input)[0], self.input_shape[1], self.input_shape[2], self.input_shape[3]]),
            strides=[1, 1, 1, 1], padding='SAME')
    Conv2d.layer_index += 1
    output.set_shape([None, self.input_shape[1], self.input_shape[2], self.input_shape[3]])
    output = lrelu(tf.add(tf.contrib.layers.batch_norm(output), b))
    return output
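For context on where the extra pixel can come from: with SAME padding, a stride-2 convolution rounds an odd size like 565 up to 283, and a stride-2 transposed convolution maps 283 straight back to 566, one more than the original. A minimal sketch of that arithmetic (illustrative only; it assumes stride-2 SAME layers somewhere in the network, which the snippet above does not show):

import math

def same_conv_out(size, stride=2):
    # Spatial size after a SAME-padded convolution.
    return math.ceil(size / stride)

def naive_transpose_out(size, stride=2):
    # Spatial size after a stride-2 transposed convolution.
    return size * stride

down = same_conv_out(565)        # 283
up = naive_transpose_out(down)   # 566, one more than the original 565
print(down, up)

One common workaround is to pass the matching forward activation's shape (e.g. via tf.shape) as the output_shape of tf.nn.conv2d_transpose rather than recomputing it, but whether that applies here depends on parts of the network not shown.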
I'm trying to solve a system of ODEs with the scipy.integrate.ode module. What does this error message mean?
create_cb_arglist: Failed to build argument list (siz) with enough arguments (tot-opt) required by user-supplied function (siz,tot,opt=6,7,0).
Traceback (most recent call last):
File "D:/DeepSpillModel/api.py", line 49, in <module>
model.solver(start_time, end_time)
File "D:\DeepSpillModel\far_field.py", line 17, in solver
far_model.simulate(self.parcels, self.initial_location, start_time,
File "D:\DeepSpillModel\single_parcel_model.py", line 15, in simulate
self.t, self.y = calculate_underwater(self.profile, parcel, t0, y0, diff_factor, self.p, delta_t_sub)
File "D:\DeepSpillModel\single_parcel_model.py", line 44, in calculate_underwater
r.integrate(t[-1] + delta_t, step=True)
File "D:\Miniconda3\envs\gnome\lib\site-packages\scipy\integrate\_ode.py", line 433, in integrate
self._y, self.t = mth(self.f, self.jac or (lambda: None),
File "D:\Miniconda3\envs\gnome\lib\site-packages\scipy\integrate\_ode.py", line 1024, in step
r = self.run(*args)
File "D:\Miniconda3\envs\gnome\lib\site-packages\scipy\integrate\_ode.py", line 1009, in run
y1, t, istate = self.runner(*args)
_vode.error: failed in processing argument list for call-back f.
Process finished with exit code 1
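For reference, the vode backend builds the callback's argument list from t, y, and whatever was registered with set_f_params; the (siz,tot,opt=6,7,0) figures in the message are its count of the arguments it can supply versus the ones the user-supplied function requires, so a mismatch between the function's signature and set_f_params typically triggers this. A minimal sketch of a consistent setup (illustrative only, with a hypothetical right-hand side, not the model from the traceback):

from scipy.integrate import ode

def rhs(t, y, k):
    # Signature f(t, y, *f_args): the extra argument k must be supplied via set_f_params.
    return [-k * y[0]]

r = ode(rhs).set_integrator('vode', method='bdf')
r.set_initial_value([1.0], 0.0)
r.set_f_params(0.5)   # exactly one extra argument, matching rhs

while r.successful() and r.t < 1.0:
    r.integrate(r.t + 0.1)
print(r.t, r.y)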
I have created a model using the diabetes dataset for prediction. I trained, evaluated, logged, and registered it as a new model in MLflow. Now I'm trying to load the registered model and predict on new data. Although I was able to compute the predictions, I'm not able to display them: when I call .show() or display() it throws an error. What is the cause of the error, and how do I display the results?
Note: everything is written in pure PySpark, and all the MLflow operations were done on Databricks.
Code:
model_details = mlflow.tracking.MlflowClient().get_latest_versions('model1',stages=['staging'])[0]
model = mlflow.pyfunc.spark_udf(spark,model_details.source)
input_df = sdf.drop('progression')
columns = list(map(lambda c: f"{c}", input_df.columns))
df = input_df.withColumn("progression", model(*columns))
df.show(truncate=False)
Error:
PythonException: An exception was thrown from a UDF: 'Exception: Java gateway process exited before sending its port number'. Full traceback below:
PythonException Traceback (most recent call last)
<command-1343735193245452> in <module>
34 df = input_df.withColumn("progression", model(*columns))
35
---> 36 df.show(truncate=False)
/databricks/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
441 print(self._jdf.showString(n, 20, vertical))
442 else:
--> 443 print(self._jdf.showString(n, int(truncate), vertical))
444
445 def __repr__(self):
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
131 # Hide where the exception came from that shows a non-Pythonic
132 # JVM exception message.
--> 133 raise_from(converted)
134 else:
135 raise
/databricks/spark/python/pyspark/sql/utils.py in raise_from(e)
PythonException: An exception was thrown from a UDF: 'Exception: Java gateway process exited before sending its port number'. Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 654, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 646, in process
serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in dump_stream
timely_flush_timeout_ms=self.timely_flush_timeout_ms)
File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in dump_stream
for batch in iterator:
File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in init_stream_yield_batches
for series in iterator:
File "/databricks/spark/python/pyspark/worker.py", line 467, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/databricks/spark/python/pyspark/worker.py", line 467, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/databricks/spark/python/pyspark/worker.py", line 111, in <lambda>
verify_result_type(f(*a)), len(a[0])), arrow_return_type)
File "/databricks/spark/python/pyspark/util.py", line 109, in wrapper
return f(*args, **kwargs)
File "/databricks/python/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 827, in predict
model = SparkModelCache.get_or_load(archive_path)
File "/databricks/python/lib/python3.7/site-packages/mlflow/pyfunc/spark_model_cache.py", line 64, in get_or_load
SparkModelCache._models[archive_path] = load_pyfunc(temp_dir)
File "/databricks/python/lib/python3.7/site-packages/mlflow/utils/annotations.py", line 43, in deprecated_func
return func(*args, **kwargs)
File "/databricks/python/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 693, in load_pyfunc
return load_model(model_uri, suppress_warnings)
File "/databricks/python/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 667, in load_model
model_impl = importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
File "/databricks/python/lib/python3.7/site-packages/mlflow/spark.py", line 707, in _load_pyfunc
.master("local[1]")
File "/databricks/spark/python/pyspark/sql/session.py", line 189, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/databricks/spark/python/pyspark/context.py", line 384, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/databricks/spark/python/pyspark/context.py", line 134, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/databricks/spark/python/pyspark/context.py", line 333, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/databricks/spark/python/pyspark/java_gateway.py", line 105, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
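For what it's worth, the inner traceback shows the pyfunc wrapper trying to start its own local SparkContext inside the worker (the .master("local[1]") frame in mlflow/spark.py), which is where the Java gateway failure originates. A minimal way to sanity-check the model and its predictions outside of a Spark UDF, assuming the same model_details.source URI (this collects the features to the driver, so it is only a check, not a scalable scoring path):

import mlflow.pyfunc

# Load the registered model directly as a pyfunc, without spark_udf.
pyfunc_model = mlflow.pyfunc.load_model(model_details.source)

# Predict on a pandas DataFrame collected from the Spark DataFrame.
pdf = input_df.toPandas()
predictions = pyfunc_model.predict(pdf)
print(predictions[:10])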
I have checked a lot of posts and none of them seem to work for me. When I try to add workers to the DataLoader in PyTorch, it just throws an error. I have tried reading it and figuring it out, but I can't seem to find a solution. I assume there is something I'm supposed to add to make the workers able to do their job.
I have 64 GB of RAM, an i9-9900K, and a 3080 Ti, so I don't think it's a memory error, is it?
I have included the error output with 1 worker and with 4 workers because they seem to be different.
It also works with zero workers.
Here is the error with 4 workers:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\queue.py", line 172, in get raise Empty
queue.Empty
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:/Users/14055/Desktop/Class 1 Project/Chegg.py", line 202, in <module>
training()
File "c:/Users/14055/Desktop/Class 1 Project/Chegg.py", line 122, in training
for data, target in load_data.train_loader:
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__ data = self._next_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1186, in _next_data idx, data = self._get_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1142, in _get_data success, data = self._try_get_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1003, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 23204, 7668, 13636, 6132) exited unexpectedly
Error with 1 worker:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "c:\Users\14055\Desktop\Class 1 Project\Chegg.py", line 202, in <module>
training()
File "c:\Users\14055\Desktop\Class 1 Project\Chegg.py", line 122, in training
for data, target in load_data.train_loader:
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
return self._get_iterator()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 918, in __init__
w.start()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\queue.py", line 172, in get
raise Empty
queue.Empty
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:/Users/14055/Desktop/Class 1 Project/Chegg.py", line 202, in <module>
training()
File "c:/Users/14055/Desktop/Class 1 Project/Chegg.py", line 122, in training
for data, target in load_data.train_loader:
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
data = self._next_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1142, in _get_data
success, data = self._try_get_data()
File "C:\Users\14055\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 1003, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3372) exited unexpectedly
Code:
from numpy import testing
import torch.cuda
import numpy as np
import time
import array as arr
import os
from datetime import date, datetime
from torchvision import datasets
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary

torch.cuda.set_device(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def load_data():
    num_workers = 1
    load_data.batch_size = 20
    transform = transforms.ToTensor()
    train_data = datasets.MNIST(root='data', train=True, download=True, transform=transform)
    load_data.train_loader = torch.utils.data.DataLoader(train_data,
        batch_size=load_data.batch_size, num_workers=num_workers, pin_memory=True,
        shuffle=True)
    test_data = datasets.MNIST(root='data', train=False, download=True, transform=transform)
    load_data.test_loader = torch.utils.data.DataLoader(test_data,
        batch_size=load_data.batch_size, num_workers=num_workers, pin_memory=True,
        shuffle=True)

def visualize():
    dataiter = iter(load_data.train_loader)
    visualize.images, labels = dataiter.next()
    visualize.images = visualize.images.numpy()
    fig = plt.figure(figsize=(25, 4))
    for idx in np.arange(load_data.batch_size):
        ax = fig.add_subplot(2, load_data.batch_size/2, idx+1, xticks=[], yticks=[])
        ax.imshow(np.squeeze(visualize.images[idx]), cmap='gray')
        ax.set_title(str(labels[idx].item()))
    #plt.show()

def fig_values():
    img = np.squeeze(visualize.images[1])
    fig = plt.figure(figsize=(12, 12))
    ax = fig.add_subplot(111)
    ax.imshow(img, cmap='gray')
    width, height = img.shape
    thresh = img.max()/2.5
    for x in range(width):
        for y in range(height):
            val = round(img[x][y], 2) if img[x][y] != 0 else 0
            ax.annotate(str(val), xy=(y, x),
                        horizontalalignment='center',
                        verticalalignment='center',
                        color='white' if img[x][y] < thresh else 'black')
    #plt.show()

load_data()
#visualize()
#fig_values()

class NeuralNet(nn.Module):
    def __init__(self, gpu=True):
        super(NeuralNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=128, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(num_features=128)
        self.tns1 = nn.Conv2d(in_channels=128, out_channels=4, kernel_size=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=4, out_channels=16, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(num_features=16)
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(num_features=16)
        self.conv4 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(num_features=32)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.tns2 = nn.Conv2d(in_channels=32, out_channels=16, kernel_size=1, padding=1)
        self.conv5 = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
        self.bn5 = nn.BatchNorm2d(num_features=16)
        self.conv6 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.bn6 = nn.BatchNorm2d(num_features=32)
        self.conv7 = nn.Conv2d(in_channels=32, out_channels=10, kernel_size=1, padding=1)
        self.gpool = nn.AvgPool2d(kernel_size=7)
        self.drop = nn.Dropout2d(0.1)

    def forward(self, x):
        x = self.tns1(self.drop(self.bn1(F.relu(self.conv1(x)))))
        x = self.drop(self.bn2(F.relu(self.conv2(x))))
        x = self.pool1(x)
        x = self.drop(self.bn3(F.relu(self.conv3(x))))
        x = self.drop(self.bn4(F.relu(self.conv4(x))))
        x = self.tns2(self.pool2(x))
        x = self.drop(self.bn5(F.relu(self.conv5(x))))
        x = self.drop(self.bn6(F.relu(self.conv6(x))))
        x = self.conv7(x)
        x = self.gpool(x)
        x = x.view(-1, 10)
        return F.log_softmax(x).to(device)

#has antioverfit
def training():
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.003, weight_decay=0.00005, momentum=.9, nesterov=True)
    n_epochs = 20000
    a = np.float64([9, 9, 9, 9, 9])  #antioverfit
    testing_loss = 0.0
    for epoch in range(n_epochs):
        if(testing_loss <= a[4]):  # part of anti overfit
            train_loss = 0.0
            testing_loss = 0.0
            model.train().to(device)
            for data, target in load_data.train_loader:
                optimizer.zero_grad()
                data = data.to(device)      #gpu
                target = target.to(device)  #gpu
                output = model(data).to(device)
                loss = F.nll_loss(output, target)
                loss.backward()
                optimizer.step()
                train_loss += loss.item()*data.size(0)
            train_loss = train_loss/len(load_data.train_loader.dataset)
            print('Epoch: {} \tTraining Loss: {:.6f}'.format(epoch+1, train_loss))
            model.eval().to(device)  # Gets Validation loss
            train_loss = 0.0
            with torch.no_grad():
                for data, target in load_data.test_loader:
                    data = data.to(device)
                    target = target.to(device)
                    output = model(data).to(device)
                    loss = F.nll_loss(output, target)
                    testing_loss += loss.item()*data.size(0)
            testing_loss = testing_loss / len(load_data.test_loader.dataset)
            print('Validation loss = ', testing_loss)
            a = np.insert(a, 0, testing_loss)  # part of anti overfit
            a = np.delete(a, 5)
    print('Validation loss = ', testing_loss)

def evalution():
    test_loss = 0.0
    class_correct = list(0. for i in range(10))
    class_total = list(0. for i in range(10))
    model.eval().to(device)
    for data, target in load_data.test_loader:
        data = data.to(device)
        target = target.to(device)
        output = model(data).to(device)
        loss = F.nll_loss(output, target)
        test_loss += loss.item()*data.size(0)
        _, pred = torch.max(output, 1)
        correct = np.squeeze(pred.eq(target.data.view_as(pred))).to(device)
        for i in range(load_data.batch_size):
            try:
                label = target.data[i]
                class_correct[label] += correct[i].item()
                class_total[label] += 1
            except IndexError:
                break
    # calculate and print avg test loss
    test_loss = test_loss/len(load_data.test_loader.dataset)
    print('Test Loss: {:.6f}\n'.format(test_loss))
    for i in range(10):
        if class_total[i] > 0:
            print('Test Accuracy of %5s: %2d%% (%2d/%2d)' % (
                str(i), 100 * class_correct[i] / class_total[i],
                np.sum(class_correct[i]), np.sum(class_total[i])))
        else:
            print('Test Accuracy of %5s: N/A (no training examples)')
    print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
        100. * np.sum(class_correct) / np.sum(class_total),
        np.sum(class_correct), np.sum(class_total)))
    acc = (
        100. * np.sum(class_correct) / np.sum(class_total),
        np.sum(class_correct), np.sum(class_total))
    name = f"model-{acc}.pt"
    name2 = f"model-{acc}.pth"
    save_path = os.path.join("models", name)
    save_path2 = os.path.join("models", name2)
    torch.save(model, save_path)
    torch.save(model, save_path2)

model = NeuralNet().to(device)
summary(model, input_size=(1, 28, 28))
training()
evalution()
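For reference, on Windows the DataLoader workers are started with the spawn method, which re-imports this script in each worker process; the freeze_support() hint in the traceback points at the standard idiom of guarding the top-level calls. A minimal sketch of that pattern using the script's own functions (illustrative, not a guaranteed fix for this setup):

# Guard all top-level work so that worker processes can safely
# re-import this module under the Windows "spawn" start method.
if __name__ == '__main__':
    load_data()
    model = NeuralNet().to(device)
    summary(model, input_size=(1, 28, 28))
    training()
    evalution()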
I'm working with the Google utterance dataset in spectrogram form. Each data point has dimension (160, 101). In my data loader I use batch_size=128, so each batch has dimension (128, 160, 101).
I use a LeNet model, defined as follows:
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16*5*5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 30)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out
I tried unsqueezing the data with dim=3, but got this error:
Traceback (most recent call last):
File "train_speech.py", line 359, in <module>
train_loss, reg_loss, train_acc, cost = train(epoch)
File "train_speech.py", line 258, in train
outputs = (net(inputs))['out']
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/gdrive/My Drive/Colab Notebooks/mixup_erm-master/models/lenet.py", line 15, in forward
out = F.relu(self.conv1(x))
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 443, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [6, 1, 5, 5], expected input[128, 160, 101, 1] to have 1 channels, but got 160 channels instead
How do I fix this issue?
EDIT: New Error Message Below
torch.Size([128, 160, 101])
torch.Size([128, 1, 160, 101])
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
File "train_speech.py", line 363, in <module>
train_loss, reg_loss, train_acc, cost = train(epoch)
File "train_speech.py", line 262, in train
outputs = (net(inputs))['out']
IndexError: too many indices for tensor of dimension 2
I'm unsqueezing the data in each batch. The relevant section of my training code is below. inputs is analogous to x.
print(inputs.shape)
inputs = inputs.unsqueeze(1)
print(inputs.shape)
outputs = (net(inputs))['out']
Edit 2: New Error
Traceback (most recent call last):
File "train_speech.py", line 361, in <module>
train_loss, reg_loss, train_acc, cost = train(epoch)
File "train_speech.py", line 270, in train
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Function AddmmBackward returned an invalid gradient at index 1 - got [128, 400] but expected shape compatible with [128, 13024]
Edit 3: Train Loop Below
def train(epoch):
    print('\nEpoch: %d' % epoch)
    net.train()
    train_loss = 0
    reg_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        if use_cuda:
            inputs, targets = inputs.cuda(), targets.cuda()
        inputs, targets_a, targets_b, lam, layer, cost = mixup_data(inputs, targets,
                                                                    args.alpha, args.mixupBatch, use_cuda)
        inputs, targets_a, targets_b = map(Variable, (inputs, targets_a, targets_b))
        outputs = net(inputs)
        loss = mixup_criterion(criterion, outputs, targets_a, targets_b, lam)
        train_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += targets.size(0)
        correct += (lam * predicted.eq(targets_a.data).cpu().sum().float()
                    + (1 - lam) * predicted.eq(targets_b.data).cpu().sum().float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (train_loss/batch_idx, reg_loss/batch_idx, 100.*correct/total, cost/batch_idx)
You should expand on axis=1 a.k.a. the channel axis:
>>> x = x.unsqueeze(1)
If you're inside the dataset __getitem__, then it corresponds to axis=0.
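As for the AddmmBackward error in Edit 2, the 13024 figure matches what LeNet's convolutional part produces for a 1×160×101 input, so fc1's in_features of 16*5*5 = 400 no longer fits once the real spectrogram size is used. A quick shape check (illustrative only, based on the model as posted):

import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)

x = torch.zeros(1, 1, 160, 101)               # one spectrogram with a channel axis
out = F.max_pool2d(F.relu(conv1(x)), 2)       # -> (1, 6, 78, 48)
out = F.max_pool2d(F.relu(conv2(out)), 2)     # -> (1, 16, 37, 22)
print(out.flatten(1).shape)                   # torch.Size([1, 13024]) = 16 * 37 * 22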
I'm training my first model with TensorFlow, but I keep getting this error:
Expected binary or unicode string, got 0.0
I followed the TensorFlow linear model tutorial (https://www.tensorflow.org/tutorials/wide) and applied it to my own dataset.
This is what I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nick/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/home/nick/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/nick/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 953, in _train_model
features, labels = input_fn()
File "<stdin>", line 2, in train_input_fn
File "<stdin>", line 5, in input_fn
File "<stdin>", line 5, in <dictcomp>
File "/home/nick/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/home/nick/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 473, in make_tensor_proto
append_fn(tensor_proto, proto_values)
File "/home/nick/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 109, in SlowAppendObjectArrayToTensorProto
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/home/nick/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 109, in <listcomp>
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/home/nick/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got 0.0
Any suggestions?
Thanks
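For reference, the traceback ends inside input_fn's dictionary comprehension while TensorFlow is building a string tensor and hits the float 0.0, which typically means a numeric column is being treated as a categorical (string) one. A minimal sketch of the split the tutorial's input_fn makes (hypothetical column names, not the actual dataset):

import tensorflow as tf

# Hypothetical column names, for illustration only.
CONTINUOUS_COLUMNS = ["age", "hours_per_week"]
CATEGORICAL_COLUMNS = ["occupation"]
LABEL_COLUMN = "label"

def input_fn(df):
    # Continuous columns become dense numeric tensors...
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # ...while categorical columns must hold strings before being
    # turned into sparse tensors.
    categorical_cols = {k: tf.SparseTensor(
        indices=[[i, 0] for i in range(df[k].size)],
        values=df[k].astype(str).values,
        dense_shape=[df[k].size, 1])
        for k in CATEGORICAL_COLUMNS}
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    label = tf.constant(df[LABEL_COLUMN].values)
    return feature_cols, label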