Accuracy is not increasing, though loss is decreasing - gpflow

I am feeding CNN features into a GPflow model. Below are the relevant chunks of my program. I am using tape.gradient with an Adam optimizer (scheduled learning rate). My accuracy gets stuck at 47% while, surprisingly, my loss keeps decreasing. I have debugged the program: the CNN features are fine, but the GP model is not learning. Could you check the training loop and tell me where I am going wrong?
def optimization_step(gp_model: gpflow.models.SVGP, image_data, labels):
    with tf.GradientTape(watch_accessed_variables=False) as tape:
        tape.watch(gp_model.trainable_variables)
        cnn_feat = cnn_model(image_data, training=False)
        cnn_feat = tf.cast(cnn_feat, dtype=default_float())
        labels = tf.cast(labels, dtype=np.int64)
        data = (cnn_feat, labels)
        loss = gp_model.training_loss(data)
        gp_grads = tape.gradient(loss, gp_model.trainable_variables)
    gp_optimizer.apply_gradients(zip(gp_grads, gp_model.trainable_variables))
    return loss, cnn_feat
The training loop is:
def simple_training_loop(gp_model: gpflow.models.SVGP, epochs: int = 3, logging_epoch_freq: int = 10):
    total_loss = []
    features = []
    tf_optimization_step = tf.function(optimization_step, autograph=False)
    for epoch in range(epochs):
        lr.assign(max(args.learning_rate_clip, args.learning_rate * (args.decay_rate ** epoch)))
        data_loader.shuffle_data(args.is_training)
        for b in range(data_loader.n_batches):
            batch_x, batch_y = data_loader.next_batch(b)
            batch_x = tf.convert_to_tensor(batch_x)
            batch_y = tf.convert_to_tensor(batch_y)
            loss, features_CNN = tf_optimization_step(gp_model, batch_x, batch_y)
I am restoring the CNN weights from checkpoints saved during transfer learning.
With more epochs, the loss continues to decrease, but the accuracy starts decreasing as well.
The GP model is declared as follows:
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
invlink = gpflow.likelihoods.RobustMax(C)
likelihood = gpflow.likelihoods.MultiClass(C, invlink=invlink)
The test function:
cnn_feat=cnn_model(test_x,training=False)
cnn_feat = tf.cast(cnn_feat, dtype=default_float())
mean, var = gp_model.predict_f(cnn_feat)
preds = np.argmax(mean, 1).reshape(test_labels.shape)
correct = (preds == test_labels.numpy().astype(int))
acc = np.average(correct.astype(float)) * 100

Can you please check whether the training loop is written correctly?
The training loop looks fine. However, a few bits should be modified for clarity and efficiency.
def simple_training_loop(gp_model: gpflow.models.SVGP, epochs: int = 3, logging_epoch_freq: int = 10):
    total_loss = []
    features = []
    @tf.function
    def compute_cnn_feat(x: tf.Tensor) -> tf.Tensor:
        return tf.cast(cnn_model(x, training=False), dtype=default_float())
    @tf.function
    def optimization_step(cnn_feat: tf.Tensor, labels: tf.Tensor):  # **Change 1.**
        with tf.GradientTape(watch_accessed_variables=False) as tape:
            tape.watch(gp_model.trainable_variables)
            data = (cnn_feat, labels)
            loss = gp_model.training_loss(data)
        gp_grads = tape.gradient(loss, gp_model.trainable_variables)  # **Change 2.**
        gp_optimizer.apply_gradients(zip(gp_grads, gp_model.trainable_variables))
        return loss
    for epoch in range(epochs):
        lr.assign(max(args.learning_rate_clip, args.learning_rate * (args.decay_rate ** epoch)))
        data_loader.shuffle_data(args.is_training)
        for b in range(data_loader.n_batches):
            batch_x, batch_y = data_loader.next_batch(b)
            batch_x = tf.convert_to_tensor(batch_x)
            batch_y = tf.convert_to_tensor(batch_y, dtype=default_float())
            cnn_feat = compute_cnn_feat(batch_x)  # **Change 3.**
            loss = optimization_step(cnn_feat, batch_y)
Change 1. The signature of a function that you wrap with tf.function should not contain mutable objects.
Change 2. The gradient tape tracks all computations inside the context manager, including the computation of the gradients themselves, i.e. tape.gradient(...). That means your code was performing an unnecessary calculation.
Change 3. For the same reason as in "Change 2", I moved the CNN feature extraction outside of the gradient tape.
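On the symptom in the title: a decreasing loss with flat accuracy is not contradictory in itself, because the loss depends on the predicted probabilities while accuracy only depends on the arg-max. Here is a minimal, library-free sketch of that effect. The probabilities are made-up illustration values, not output from the GPflow model, and plain cross-entropy is used as a stand-in for the SVGP training loss, so treat it as an intuition aid rather than a diagnosis.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    """Fraction of samples whose arg-max class equals the label."""
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)

labels = [0, 1, 0, 1]

# Hypothetical predictions at two points in training (illustration only).
# The last two samples stay on the wrong side of 0.5 in both cases.
early = [[0.60, 0.40], [0.40, 0.60], [0.45, 0.55], [0.55, 0.45]]
late = [[0.95, 0.05], [0.05, 0.95], [0.48, 0.52], [0.52, 0.48]]

loss_early, acc_early = cross_entropy(early, labels), accuracy(early, labels)
loss_late, acc_late = cross_entropy(late, labels), accuracy(late, labels)

print("early: loss=%.4f acc=%.2f" % (loss_early, acc_early))
print("late:  loss=%.4f acc=%.2f" % (loss_late, acc_late))
```

The loss drops because the already-correct samples become more confident, while the misclassified ones never cross the decision boundary, so the accuracy does not move.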

Related

Issue while solving Gym Environment ; ValueError: Weights for model sequential have not yet been created

I am new to Stack Overflow, so I apologize for any mistakes in how I ask this. I am trying to solve the CartPole-v1 gym environment using a DQN agent, and I get the following error: ValueError: Weights for model sequential have not yet been created. Weights are created when the Model is first called on inputs or build() is called with an input_shape. I have searched for a fix without success. My TensorFlow version is 2.8.0. I believe the problem is most probably in my _build_model method and in the model.fit line. The code for my agent is as follows.
class DQNAgent0:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95  # discount factor
        self.epsilon = 1.0  # 100% exploration at the start
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.learning_rate = 0.001
        self.model = self._build_model()
    def _build_model(self):
        '''model = tf.keras.Sequential([
            tf.keras.layers.Dense(1),
            #tf.keras.Input((self.state_size,)),
            tf.keras.layers.Dense(24, activation="relu"),
            tf.keras.layers.Dense(24, activation="relu"),
            tf.keras.layers.Dense(self.action_size, activation="linear"),
        ])
        model.compile(loss=tf.keras.losses.mse,
                      optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))'''
        #model = tf.keras.Sequential()
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.add(tf.keras.Input(shape = self.state_size))
        model.add(tf.keras.layers.Dense(24, activation = 'relu'))
        model.add(tf.keras.layers.Dense(24, activation = 'relu'))
        model.add(tf.keras.layers.Dense(self.action_size, activation = 'linear'))
        #opt = tf.keras.optimizers.Adam(learning_rate = self.learning_rate)
        #model.compile(loss = 'mse', optimizer = opt)
        model.compile(loss = tf.keras.losses.mse, optimizer = tf.keras.optimizers.Adam(learning_rate = self.learning_rate))
        return model
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)  # exploratory action
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])
    def replay(self, batch_size):
        # creating a random sample from our memory
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma * np.amax(self.model.predict(next_state[0])))  # reward at current timestep + discounted future reward
            target_f = self.model.predict(state)
            target_f[0][action] = target  # mapping future reward to the current reward
            self.model.fit(tf.expand_dims(state, axis=-1), target_f, epochs = 1, verbose = 0)  # fitting a model to train with state as input x and target_f as y (predicted future reward)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
    def load(self, name):
        self.model.load_weights(name)
    def save(self, name):
        self.model.save_weights(name)

Where do the weights get updated in this code?

I want to train a model in a distributed system. I found code on GitHub for distributed training where each worker node sends its gradients to the parameter server, and the parameter server sends the averaged gradients back to the workers. But in the client/worker-side code, I could not understand where the received gradients update the weights and biases.
Here is the client/worker-side code. It receives the initial values from the parameter server, then calculates the loss and gradients and sends the gradient values back to the server.
from __future__ import division
from __future__ import print_function
import numpy as np
import sys
import pickle as pickle
import socket
from datetime import datetime
import time
import tensorflow as tf
import cifar10
TCP_IP = 'some IP'
TCP_PORT = 5014
port = 0
port_main = 0
s = 0
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('train_dir', '/home/ubuntu/cifar10_train',
                           """Directory where to write event logs """
                           """and checkpoint.""")
tf.app.flags.DEFINE_integer('max_steps', 5000,
                            """Number of batches to run.""")
tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")
tf.app.flags.DEFINE_integer('log_frequency', 10,
                            """How often to log results to the console.""")
#gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.30)
def safe_recv(size, server_socket):
    data = ""
    temp = ""
    data = bytearray()
    recv_size = 0
    while 1:
        try:
            temp = server_socket.recv(size-len(data))
            data.extend(temp)
            recv_size = len(data)
            if recv_size >= size:
                break
        except:
            print("Error")
    data = bytes(data)
    return data
def train():
    """Train CIFAR-10 for a number of steps."""
    g1 = tf.Graph()
    with g1.as_default():
        global_step = tf.Variable(-1, name='global_step',
                                  trainable=False, dtype=tf.int32)
        increment_global_step_op = tf.assign(global_step, global_step+1)
        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()
        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = cifar10.inference(images)
        # Calculate loss.
        loss = cifar10.loss(logits, labels)
        grads = cifar10.train_part1(loss, global_step)
        only_gradients = [g for g, _ in grads]
        class _LoggerHook(tf.train.SessionRunHook):
            """Logs loss and runtime."""
            def begin(self):
                self._step = -1
                self._start_time = time.time()
            def before_run(self, run_context):
                self._step += 1
                return tf.train.SessionRunArgs(loss)  # Asks for loss value.
            def after_run(self, run_context, run_values):
                if self._step % FLAGS.log_frequency == 0:
                    current_time = time.time()
                    duration = current_time - self._start_time
                    self._start_time = current_time
                    loss_value = run_values.results
                    examples_per_sec = FLAGS.log_frequency * FLAGS.batch_size / duration
                    sec_per_batch = float(duration / FLAGS.log_frequency)
                    format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
                                  'sec/batch)')
                    print(format_str % (datetime.now(), self._step, loss_value,
                                        examples_per_sec, sec_per_batch))
        with tf.train.MonitoredTrainingSession(
                checkpoint_dir=FLAGS.train_dir,
                hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
                       tf.train.NanTensorHook(loss),
                       _LoggerHook()],
                config=tf.ConfigProto(
                    # log_device_placement=FLAGS.log_device_placement, gpu_options=gpu_options)) as mon_sess:
                    log_device_placement=FLAGS.log_device_placement)) as mon_sess:
            global port
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.connect((TCP_IP, port_main))
            recv_size = safe_recv(17, s)
            recv_size = pickle.loads(recv_size)
            recv_data = safe_recv(recv_size, s)
            var_vals = pickle.loads(recv_data)
            s.close()
            feed_dict = {}
            i = 0
            for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES):
                feed_dict[v] = var_vals[i]
                i = i+1
            print("Received variable values from ps")
            # Opening the socket and connecting to server
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.connect((TCP_IP, port))
            while not mon_sess.should_stop():
                gradients, step_val = mon_sess.run(
                    [only_gradients, increment_global_step_op], feed_dict=feed_dict)
                # sending the gradients
                send_data = pickle.dumps(gradients, pickle.HIGHEST_PROTOCOL)
                to_send_size = len(send_data)
                send_size = pickle.dumps(to_send_size, pickle.HIGHEST_PROTOCOL)
                s.sendall(send_size)
                s.sendall(send_data)
                # receiving the variable values
                recv_size = safe_recv(17, s)
                recv_size = pickle.loads(recv_size)
                recv_data = safe_recv(recv_size, s)
                var_vals = pickle.loads(recv_data)
                feed_dict = {}
                i = 0
                for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES):
                    feed_dict[v] = var_vals[i]
                    i = i+1
            s.close()
def main(argv=None):  # pylint: disable=unused-argument
    global port
    global port_main
    global s
    if(len(sys.argv) != 3):
        print("<port> <worker-id> required")
        sys.exit()
    port = int(sys.argv[1]) + int(sys.argv[2])
    port_main = int(sys.argv[1])
    print("Connecting to port ", port)
    cifar10.maybe_download_and_extract()
    if tf.gfile.Exists(FLAGS.train_dir):
        tf.gfile.DeleteRecursively(FLAGS.train_dir)
    tf.gfile.MakeDirs(FLAGS.train_dir)
    total_start_time = time.time()
    train()
    print("--- %s seconds ---" % (time.time() - total_start_time))
if __name__ == '__main__':
    tf.app.run()
EDIT:
Here is the train_part1() code:
def train_part1(total_loss, global_step):
    """Train CIFAR-10 model.
    Create an optimizer and apply to all trainable variables. Add moving
    average for all trainable variables.
    Args:
      total_loss: Total loss from loss().
      global_step: Integer Variable counting the number of training steps
        processed.
    Returns:
      train_op: op for training.
    """
    # Variables that affect learning rate.
    num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / FLAGS.batch_size
    decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)
    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                    global_step,
                                    decay_steps,
                                    LEARNING_RATE_DECAY_FACTOR,
                                    staircase=True)
    tf.summary.scalar('learning_rate', lr)
    # Generate moving averages of all losses and associated summaries.
    loss_averages_op = _add_loss_summaries(total_loss)
    # Compute gradients.
    with tf.control_dependencies([loss_averages_op]):
        opt = tf.train.GradientDescentOptimizer(lr)
        grads = opt.compute_gradients(total_loss)
    return grads
To me it seems that the line
gradients, step_val = mon_sess.run(
    [only_gradients, increment_global_step_op], feed_dict=feed_dict)
receives new values for the variables in feed_dict, assigns these values to the variables, and makes a training step, during which it only calculates and returns the gradients, which are later sent to the parameter server. I would expect cifar10.train_part1 (the one that returns only_gradients) to depend on the variable values and define the update.
Update: I looked into the code and changed my mind. I had to google and found an answer that shed some light on what is happening.
Gradients are actually not applied anywhere in this code. Instead, the gradients are sent to the parameter server; the parameter server averages them and applies them to the weights, then returns the weights to the local worker. The received weights are used instead of the local weights during the session run through feed_dict, i.e. the local weights are never actually updated and do not matter at all. The key is that feed_dict allows you to override any tensor in the session run, and this code overrides the variables.
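To make that flow concrete, here is a minimal pure-Python sketch of the same division of labour. It is not the code from the repository: the ParameterServer class, the local_gradient function, the one-weight model y = w * x and the learning rate are all hypothetical, and sockets are replaced by plain function calls. The point is only that the workers evaluate gradients at whatever weights they were handed, and the update happens on the server.

```python
def local_gradient(weights, batch):
    """Gradient of the mean squared error for y = w * x at the given weights."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return [grad]


class ParameterServer:
    """Holds the authoritative weights; averages gradients and applies them."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr

    def step(self, worker_grads):
        n = len(worker_grads)
        for i in range(len(self.weights)):
            avg = sum(g[i] for g in worker_grads) / n
            self.weights[i] -= self.lr * avg
        return list(self.weights)  # this is what is sent back to the workers


# Two workers, each with its own shard of data drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
server = ParameterServer(weights=[0.0], lr=0.01)
received = list(server.weights)  # initial weights sent by the server

for _ in range(200):
    # Workers never update anything locally. They only evaluate gradients
    # at the received weights, which is the role feed_dict plays above.
    grads = [local_gradient(received, shard) for shard in shards]
    received = server.step(grads)

print(received)
```

The worker side never performs an assignment to its own weights, yet training still converges, because every gradient is computed at the server's latest values.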

Correct data loading, splitting and augmentation in Pytorch

The tutorial doesn't seem to explain how we should load and split the data and apply augmentation properly.
Let's take a dataset consisting of cats and dogs. The folder structure would be:
data
    cat
        0101.jpg
        0201.jpg
        ...
    dogs
        0101.jpg
        0201.jpg
        ...
At first, I loaded the dataset with datasets.ImageFolder. ImageFolder has a transform argument where we can set augmentation transforms, but we don't want to apply augmentation to the test dataset! So let's stay with transform=None.
data = datasets.ImageFolder(root='data')
We don't have a train/test folder structure, so I assume a good approach is to use the random_split function:
train_size = int(split * len(data))
test_size = len(data) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(data, [train_size, test_size])
Now let's load the data the following way.
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=8,
                                           shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset,
                                          batch_size=8,
                                          shuffle=True)
How can I apply transformations (data augmentation) to the "train_loader" images?
Basically I need to:
1. load data from the folder structure explained above
2. split the data into test/train parts
3. apply augmentations on the train part.
I am not sure if there is a recommended way of doing this, but this is how I would work around the problem:
Given that torch.utils.data.random_split() returns Subset objects, we cannot exploit their inner datasets (I double-checked), because they are the same object; the only difference is in the indices. In this context, I would implement a simple class to apply transformations, something like this:
from torch.utils.data import Dataset
class ApplyTransform(Dataset):
    """
    Apply transformations to a Dataset
    Arguments:
        dataset (Dataset): A Dataset that returns (sample, target)
        transform (callable, optional): A function/transform to be applied on the sample
        target_transform (callable, optional): A function/transform to be applied on the target
    """
    def __init__(self, dataset, transform=None, target_transform=None):
        self.dataset = dataset
        self.transform = transform
        self.target_transform = target_transform
        # yes, you don't need these 2 lines below :(
        if transform is None and target_transform is None:
            print("Am I a joke to you? :)")
    def __getitem__(self, idx):
        sample, target = self.dataset[idx]
        if self.transform is not None:
            sample = self.transform(sample)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return sample, target
    def __len__(self):
        return len(self.dataset)
And then use it before passing the dataset to the dataloader:
import torchvision.transforms as transforms
train_transform = transforms.Compose([
    transforms.ToTensor(),
    # ...
])
train_dataset = ApplyTransform(train_dataset, transform=train_transform)
# continue with DataLoaders...
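To illustrate why the wrapper is needed, here is a small torch-free sketch. ToyDataset, ToySubset, TransformWrapper and augment are hypothetical stand-ins for ImageFolder, Subset, ApplyTransform and a real augmentation, so that the snippet runs without PyTorch. It shows that both splits share one inner dataset, so setting a transform there leaks into the test split, while wrapping only the train split does not.

```python
class ToyDataset:
    """Hypothetical stand-in for ImageFolder: items are (sample, target)."""

    def __init__(self, items, transform=None):
        self.items = items
        self.transform = transform

    def __getitem__(self, idx):
        sample, target = self.items[idx]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, target

    def __len__(self):
        return len(self.items)


class ToySubset:
    """Hypothetical stand-in for Subset: an index view over a shared dataset."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)


class TransformWrapper:
    """Same idea as ApplyTransform above, reduced to the sample transform."""

    def __init__(self, dataset, transform):
        self.dataset = dataset
        self.transform = transform

    def __getitem__(self, idx):
        sample, target = self.dataset[idx]
        return self.transform(sample), target

    def __len__(self):
        return len(self.dataset)


def augment(x):
    return x + 100  # placeholder for a real augmentation


data = ToyDataset([(i, i % 2) for i in range(10)])
train = ToySubset(data, list(range(8)))
test = ToySubset(data, [8, 9])

# Both subsets point at the same inner dataset object.
shared = train.dataset is test.dataset

# Setting the transform on the inner dataset leaks into the test split.
train.dataset.transform = augment
leaked_test_sample = test[0][0]
train.dataset.transform = None

# Wrapping only the train split leaves the test split untouched.
train_aug = TransformWrapper(train, augment)
clean_test_sample = test[0][0]
print(shared, leaked_test_sample, train_aug[0][0], clean_test_sample)
```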
I think you can look at this: https://gist.github.com/kevinzakka/d33bf8d6c7f06a9d8c76d97a7879f5cb
import numpy as np
import torch
from torchvision import datasets, transforms
from torch.utils.data.sampler import SubsetRandomSampler
# plot_images is a plotting helper from the gist author's own utilities (not shown here)
def get_train_valid_loader(data_dir,
                           batch_size,
                           augment,
                           random_seed,
                           valid_size=0.1,
                           shuffle=True,
                           show_sample=False,
                           num_workers=4,
                           pin_memory=False):
    """
    Utility function for loading and returning train and valid
    multi-process iterators over the CIFAR-10 dataset. A sample
    9x9 grid of the images can be optionally displayed.
    If using CUDA, num_workers should be set to 1 and pin_memory to True.
    Params
    ------
    - data_dir: path directory to the dataset.
    - batch_size: how many samples per batch to load.
    - augment: whether to apply the data augmentation scheme
      mentioned in the paper. Only applied on the train split.
    - random_seed: fix seed for reproducibility.
    - valid_size: percentage split of the training set used for
      the validation set. Should be a float in the range [0, 1].
    - shuffle: whether to shuffle the train/validation indices.
    - show_sample: plot 9x9 sample grid of the dataset.
    - num_workers: number of subprocesses to use when loading the dataset.
    - pin_memory: whether to copy tensors into CUDA pinned memory. Set it to
      True if using GPU.
    Returns
    -------
    - train_loader: training set iterator.
    - valid_loader: validation set iterator.
    """
    error_msg = "[!] valid_size should be in the range [0, 1]."
    assert ((valid_size >= 0) and (valid_size <= 1)), error_msg
    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010],
    )
    # define transforms
    valid_transform = transforms.Compose([
        transforms.ToTensor(),
        normalize,
    ])
    if augment:
        train_transform = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ])
    else:
        train_transform = transforms.Compose([
            transforms.ToTensor(),
            normalize,
        ])
    # load the dataset (twice: same data, different transforms)
    train_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=train_transform,
    )
    valid_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=valid_transform,
    )
    num_train = len(train_dataset)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))
    if shuffle:
        np.random.seed(random_seed)
        np.random.shuffle(indices)
    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler,
        num_workers=num_workers, pin_memory=pin_memory,
    )
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler,
        num_workers=num_workers, pin_memory=pin_memory,
    )
    # visualize some images
    if show_sample:
        sample_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=9, shuffle=shuffle,
            num_workers=num_workers, pin_memory=pin_memory,
        )
        data_iter = iter(sample_loader)
        images, labels = next(data_iter)
        X = images.numpy().transpose([0, 2, 3, 1])
        plot_images(X, labels)
    return (train_loader, valid_loader)
It seems that the author uses sampler=train_sampler and sampler=valid_sampler to do the split.
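The index bookkeeping behind that sampler-based split can be sketched without PyTorch. The helper below is a hypothetical simplification (split_indices is not part of the gist, and it uses the standard-library random module instead of NumPy): it shuffles the indices with a fixed seed and cuts them into two disjoint lists, which would then be handed to the two samplers.

```python
import random


def split_indices(num_samples, valid_size=0.1, shuffle=True, seed=0):
    """Return (train_idx, valid_idx): two disjoint lists covering all indices."""
    if not 0 <= valid_size <= 1:
        raise ValueError("valid_size should be in the range [0, 1]")
    indices = list(range(num_samples))
    split = int(valid_size * num_samples)
    if shuffle:
        # A local Random instance keeps the split reproducible
        # without touching the global random state.
        random.Random(seed).shuffle(indices)
    return indices[split:], indices[:split]


train_idx, valid_idx = split_indices(50, valid_size=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

Because both loaders read from the same underlying files but use different index lists and different transforms, augmentation is only ever applied to the training indices.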

How to simplify DataLoader for Autoencoder in Pytorch

Is there an easier way to set up the DataLoader for an autoencoder, given that the input and target data are the same, and to load the data during training? The DataLoader always seems to require two inputs.
Currently I define my DataLoader like this:
X_train = rnd.random((300,100))
X_val = rnd.random((75,100))
train = data_utils.TensorDataset(torch.from_numpy(X_train).float(), torch.from_numpy(X_train).float())
val = data_utils.TensorDataset(torch.from_numpy(X_val).float(), torch.from_numpy(X_val).float())
train_loader= data_utils.DataLoader(train, batch_size=1)
val_loader = data_utils.DataLoader(val, batch_size=1)
and train like this:
for epoch in range(50):
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = Variable(data), Variable(target).detach()
        optimizer.zero_grad()
        output = model(data, x)
        loss = criterion(output, target)
Why not subclass TensorDataset to make it compatible with unlabeled data?
class UnlabeledTensorDataset(TensorDataset):
    """Dataset wrapping unlabeled data tensors.
    Each sample will be retrieved by indexing tensors along the first
    dimension.
    Arguments:
        data_tensor (Tensor): contains sample data.
    """
    def __init__(self, data_tensor):
        self.data_tensor = data_tensor
    def __getitem__(self, index):
        return self.data_tensor[index]
    def __len__(self):
        # Defined explicitly because the parent __init__ is not called
        return self.data_tensor.size(0)
And something along these lines for training your autoencoder:
X_train = rnd.random((300,100))
train = UnlabeledTensorDataset(torch.from_numpy(X_train).float())
train_loader = data_utils.DataLoader(train, batch_size=1)
for epoch in range(50):
    for batch in train_loader:
        data = Variable(batch)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, data)
I believe this is as simple as it gets. Other than that, I guess you will have to implement your own dataset. A sample code is below.
class ImageLoader(torch.utils.data.Dataset):
    def __init__(self, root, tform=None, imgloader=PIL.Image.open):
        super(ImageLoader, self).__init__()
        self.root = root
        self.filenames = sorted(glob(root))
        self.tform = tform
        self.imgloader = imgloader
    def __len__(self):
        return len(self.filenames)
    def __getitem__(self, i):
        out = self.imgloader(self.filenames[i])  # io.imread(self.filenames[i])
        if self.tform:
            out = self.tform(out)
        return out
You can then use it as follows.
source_dataset=ImageLoader(root='/dldata/denoise_ae/clean/*.png', tform=source_depth_transform)
target_dataset=ImageLoader(root='/dldata/denoise_ae/clean_cam_n9dmaps/*.png', tform=target_depth_transform)
source_dataloader=torch.utils.data.DataLoader(source_dataset, batch_size=32, shuffle=False, drop_last=True, num_workers=15)
target_dataloader=torch.utils.data.DataLoader(target_dataset, batch_size=32, shuffle=False, drop_last=True, num_workers=15)
To test the first batch, proceed as follows.
dataiter = iter(source_dataloader)
images = next(dataiter)
print(images.size())
Finally, you can enumerate over the loaded data in the batch training loop as follows.
for i, (source, target) in enumerate(zip(source_dataloader, target_dataloader), 0):
    source, target = Variable(source.float().cuda()), Variable(target.float().cuda())
Have fun.
PS. The code samples I shared do not load validation data.

Clarification between Epoch and iteration

This answer points to the difference between an epoch and an iteration while training a neural network. However, when I look at the source code for the solver API in the Stanford CS231n course (and I'm assuming this is the case for most libraries out there as well), during each iteration, batch_size examples are randomly selected with replacement. Thus, there is no guarantee that all examples will be seen during each epoch, is there?
Does an epoch then mean that all examples would be seen in expectation? Or am I understanding this wrong?
Relevant Source Code:
def _step(self):
    """
    Make a single gradient update. This is called by train() and should not
    be called manually.
    """
    # Make a minibatch of training data
    num_train = self.X_train.shape[0]
    batch_mask = np.random.choice(num_train, self.batch_size)
    X_batch = self.X_train[batch_mask]
    y_batch = self.y_train[batch_mask]
    # Compute loss and gradient
    loss, grads = self.model.loss(X_batch, y_batch)
    self.loss_history.append(loss)
    # Perform a parameter update
    for p, w in self.model.params.iteritems():
        dw = grads[p]
        config = self.optim_configs[p]
        next_w, next_config = self.update_rule(w, dw, config)
        self.model.params[p] = next_w
        self.optim_configs[p] = next_config
def train(self):
    """
    Run optimization to train the model.
    """
    num_train = self.X_train.shape[0]
    iterations_per_epoch = max(num_train / self.batch_size, 1)
    num_iterations = self.num_epochs * iterations_per_epoch
    for t in xrange(num_iterations):
        self._step()
        # Maybe print training loss
        if self.verbose and t % self.print_every == 0:
            print '(Iteration %d / %d) loss: %f' % (
                t + 1, num_iterations, self.loss_history[-1])
        # At the end of every epoch, increment the epoch counter and decay the
        # learning rate.
        epoch_end = (t + 1) % iterations_per_epoch == 0
        if epoch_end:
            self.epoch += 1
            for k in self.optim_configs:
                self.optim_configs[k]['learning_rate'] *= self.lr_decay
        # Check train and val accuracy on the first iteration, the last
        # iteration, and at the end of each epoch.
        first_it = (t == 0)
        last_it = (t == num_iterations + 1)
        if first_it or last_it or epoch_end:
            train_acc = self.check_accuracy(self.X_train, self.y_train,
                                            num_samples=1000)
            val_acc = self.check_accuracy(self.X_val, self.y_val)
            self.train_acc_history.append(train_acc)
            self.val_acc_history.append(val_acc)
            if self.verbose:
                print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                    self.epoch, self.num_epochs, train_acc, val_acc)
            # Keep track of the best model
            if val_acc > self.best_val_acc:
                self.best_val_acc = val_acc
                self.best_params = {}
                for k, v in self.model.params.iteritems():
                    self.best_params[k] = v.copy()
    # At the end of training swap the best params into the model
    self.model.params = self.best_params
Thanks.
I believe, as you say, that in the Stanford course they are effectively using "epoch" with the less strict meaning of "expected number of times each example is seen during training". However, in my experience, most implementations consider an epoch as running through every example in the training set once, and I'd say they only chose the sampling with replacement for simplicity. If you have a good amount of data, chances are that you will not see a difference, but still, it is more correct to sample without replacement until there are no more examples.
You can check, for example, how Keras does the training in its source code; it's a bit complicated, but the important point is that make_batches is called to split the (possibly shuffled) examples into batches, which matches your initial idea of "epoch".
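To put a number on the difference, here is a small standard-library sketch. The helper names expected_fraction_seen and epoch_batches are my own, not from CS231n or Keras. It compares drawing num_train indices with replacement, which is what one "epoch" of the quoted _step() calls amounts to, against a shuffled pass without replacement.

```python
import random


def expected_fraction_seen(num_train):
    """Expected fraction of distinct examples after num_train draws with replacement."""
    return 1.0 - (1.0 - 1.0 / num_train) ** num_train


def epoch_batches(num_train, batch_size, rng):
    """Yield index batches that cover every example exactly once."""
    order = list(range(num_train))
    rng.shuffle(order)
    for start in range(0, num_train, batch_size):
        yield order[start:start + batch_size]


rng = random.Random(0)
num_train, batch_size = 1000, 100

# With replacement: count the distinct indices drawn in one "epoch".
drawn = [rng.randrange(num_train) for _ in range(num_train)]
fraction_with_replacement = len(set(drawn)) / num_train

# Without replacement: one pass touches every index exactly once.
seen = [i for batch in epoch_batches(num_train, batch_size, rng) for i in batch]
fraction_without_replacement = len(set(seen)) / num_train

print(expected_fraction_seen(num_train))
print(fraction_with_replacement)
print(fraction_without_replacement)
```

For large training sets the expected fraction approaches 1 - 1/e, so with replacement roughly 63% of the examples are seen in a given "epoch", whereas the shuffled pass sees all of them.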