Like the title says, is there any difference between the two? I think the original inception v1 model does not have Res Blocks, but maybe I'm wrong. Are they the same thing?
No, they are not the same.
Inception-ResNet is a hybrid module inspired both by the Inception architecture and by the performance of ResNet. There are two sub-versions of Inception-ResNet, namely v1 and v2. Inception-ResNet v2 has higher accuracy and computational cost compared to Inception-ResNet v1 (https://ai.googleblog.com/2016/08/improving-inception-and-image.html).
They are different. Training with residual connections accelerates the training of Inception and may outperform similarly expensive Inception without residual connections.
Related
I am a beginner with very little knowledge of CNNs and RNNs.
For example: RNNs work better for time series and CNNs for spatial features; knowing this makes it easier for me to choose between an RNN and a CNN.
However, if I have to choose between ResNet, InceptionNet, etc. for a particular application, how do I get an intuition of which would work better?
Please state your particular application if you want a detailed answer.
But if I answer your general question, I must state that:
- It depends on your dataset (number of items and size of data), feature engineering and type of your features.
- The evaluation measure of your particular application, such as accuracy, precision, recall, RMSE, F-measure, etc.
So, if you want to get an intuition about your network, it is better to run it on your data; if you can't, find a paper that uses the same dataset as yours and read its analysis section.
But every neural network performs better on some kind of data. For example, it is typical to use an LSTM for sequential data.
Experimentation Experimentation Experimentation
Get your hands dirty, you'll automatically get an intuition of which would work better.
So, to be more clear, let's consider the problem of loan default prediction.
Let's say I have trained and tested multiple classifiers offline and ensembled them. Then I deployed this model to production.
But because people change, the data and many other factors change as well, and the performance of our model will eventually decrease. So at some point it needs to be replaced with a new, better model.
What are the common techniques, model stability tests, model performance tests, metrics after deployment? How to decide when to replace current model with newer one?
It depends on the problem (classification, regression or clustering). Let's say you have a classification problem and you trained and tested a model with 75% accuracy (or some other metric). Once it is in production, if the accuracy is significantly less than 75%, you can stop the model and see what is happening.
In my case, I record the accuracy of the model in production each day for a week; after that I compute the mean and variance of the accuracy and apply a t-test on the mean, to see whether this accuracy diverges significantly from the desired accuracy.
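For illustration, here is a minimal sketch of that check with SciPy; the daily accuracies and the 0.75 baseline are made-up numbers, not from a real system:

# Hypothetical sketch: one-sample t-test of a week of production accuracies
# against the accuracy measured offline (0.75 is just an example baseline).
from scipy import stats

offline_accuracy = 0.75
daily_accuracy = [0.74, 0.72, 0.71, 0.73, 0.70, 0.72, 0.71]  # made-up monitoring data

t_stat, p_value = stats.ttest_1samp(daily_accuracy, offline_accuracy)
if p_value < 0.05 and t_stat < 0:
    print("Accuracy is significantly below the offline baseline: investigate or retrain.")
else:
    print("No significant degradation detected yet.")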
Hope that will help
I'm training a network for image localization with the Adam optimizer, and someone suggested that I use exponential decay. I don't want to try that because the Adam optimizer itself decays the learning rate. But that person insists and says they have done it before. So should I do that, and is there any theory behind the suggestion?
It depends. Adam updates each parameter with an individual learning rate. This means that every parameter in the network has a specific learning rate associated with it.
But each per-parameter learning rate is computed using lambda (the initial learning rate) as an upper limit. This means that every single learning rate can vary from 0 (no update) to lambda (maximum update).
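For reference, this is my paraphrase of the update rule from the Adam paper, writing the base step size as lambda to match the notation above:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \\
\hat{m}_t &= m_t/(1-\beta_1^t), \qquad \hat{v}_t = v_t/(1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \lambda\,\hat{m}_t/(\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
$$

Since the ratio $\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)$ is roughly of unit magnitude, lambda effectively caps the step taken for each individual parameter.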
It's true that the learning rates adapt themselves during training, but if you want to be sure that every update step does not exceed lambda, you can then lower lambda using exponential decay or whatever.
It can help to reduce the loss during the last steps of training, when the loss computed with the previous lambda has stopped decreasing.
In my experience it is usually not necessary to do learning rate decay with the Adam optimizer.
The theory is that Adam already handles learning rate optimization (see the reference):
"We propose Adam, a method for efficient stochastic optimization that
only requires first-order gradients with little memory requirement.
The method computes individual adaptive learning rates for different
parameters from estimates of first and second moments of the
gradients; the name Adam is derived from adaptive moment estimation."
As with any deep learning problem YMMV, one size does not fit all, you should try different approaches and see what works for you, etc. etc.
Yes, absolutely. From my own experience, it's very useful to use Adam with learning rate decay. Without decay, you have to set a very small learning rate so the loss won't begin to diverge after decreasing to a point. Here, I post the code to use Adam with learning rate decay using TensorFlow. Hope it is helpful to someone.
import tensorflow as tf  # TensorFlow 1.x API

learning_rate, adam_epsilon = 0.001, 1e-08  # example hyperparameters
global_step = tf.train.get_or_create_global_step()
# multiply the base learning rate by 0.95 every 10,000 steps
decayed_lr = tf.train.exponential_decay(learning_rate, global_step,
                                        10000, 0.95, staircase=True)
opt = tf.train.AdamOptimizer(decayed_lr, epsilon=adam_epsilon)
Adam has a single base learning rate, but it is a maximum rate that adapts per parameter, so I don't think many people use learning rate scheduling with it.
Due to the adaptive nature, the default rate is fairly robust, but there may be times when you want to optimize it. What you can do is find an optimal default rate beforehand by starting with a very small rate and increasing it until the loss stops decreasing, then look at the slope of the loss curve and pick the learning rate associated with the fastest decrease in loss (not the point where the loss is actually lowest). Jeremy Howard mentions this in the fast.ai deep learning course and it's from the Cyclical Learning Rates paper.
Edit: People have fairly recently started using one-cycle learning rate policies in conjunction with Adam with great results.
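Purely as an illustration (my own sketch, not code from the course or paper), that range test could look something like this with tf.keras; the start/end rates and batch count are arbitrary:

import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    # Grow the learning rate exponentially each batch and record (lr, loss),
    # so the "fastest decrease" point can be read off the curve afterwards.
    def __init__(self, start_lr=1e-7, end_lr=1.0, num_batches=1000):
        super().__init__()
        self.start_lr = start_lr
        self.factor = (end_lr / start_lr) ** (1.0 / num_batches)
        self.history = []

    def on_train_begin(self, logs=None):
        tf.keras.backend.set_value(self.model.optimizer.lr, self.start_lr)

    def on_train_batch_end(self, batch, logs=None):
        lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))
        self.history.append((lr, logs["loss"]))
        tf.keras.backend.set_value(self.model.optimizer.lr, lr * self.factor)

# Run one epoch with model.fit(..., callbacks=[LRRangeTest()]), plot .history,
# and pick the rate where the loss falls fastest (not where it is lowest).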
Be careful when using weight decay with the vanilla Adam optimizer, as it appears that the vanilla Adam formula is wrong when using weight decay, as pointed out in the article Decoupled Weight Decay Regularization https://arxiv.org/abs/1711.05101 .
You should probably use the AdamW variant when you want to use Adam with weight decay.
A simple alternative is to increase the batch size. A larger number of samples per update will force the optimizer to be more cautious with its updates. If GPU memory limits the number of samples that can be processed per update, you may have to resort to the CPU and conventional RAM for training, which will obviously slow training down further.
From another point of view
All stochastic gradient descent (SGD) optimizers, including Adam, have randomization built in and have no guarantee of reaching a global minimum.
After several reductions of the learning rate, a satisfying local extremum will be obtained.
So using learning rate decay will not help reach a global minimum the way it is supposed to.
Also, if you use it, the learning rate will eventually become very small and the algorithm will become ineffective.
I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed into the grid
daily energy tariff
I am new to clustering techniques.
So my question is: which clustering algorithm would be best for this kind of data?
I think hierarchical clustering is a good choice. Have a look here Clustering Algorithms
The simplest way to do the clustering is with the k-means algorithm. If all of your attributes are numerical, then this is the easiest approach. Even if they are not, you would have to find a distance measure for categorical or nominal attributes, but k-means is still a good choice. K-means is a partitional clustering algorithm, so I wouldn't use hierarchical clustering in this case. But that also depends on what you want to do: you need to decide whether you want to find clusters within clusters, or whether all clusters have to be completely separate from each other and not nested inside one another.
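As a rough sketch of what that could look like with scikit-learn (the column names are placeholders for the attributes listed in the question, and the features are standardized because they live on very different scales):

# Hypothetical sketch: k-means on the energy attributes with scikit-learn.
# The categorical "type of energy source" could be one-hot encoded and appended.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("energy_data.csv")  # assumed file with the attributes below
features = ["avg_daily_consumption", "avg_daily_generation",
            "avg_daily_fed_to_grid", "daily_tariff"]

X = StandardScaler().fit_transform(df[features])
df["cluster"] = KMeans(n_clusters=4, random_state=0).fit_predict(X)  # try several k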
Take care.
1) First, try k-means. If that fulfills your needs, that's it. Play with different numbers of clusters (controlled by the parameter k). There are a number of implementations of k-means, and you can implement your own version if you have good programming skills.
K-means generally works well if the data forms circular/spherical clusters. This means that there is some Gaussianity in the data (the data comes from a Gaussian distribution).
2) If k-means doesn't fulfill your expectations, it is time to read and think more. Then I suggest reading a good survey paper. The most common techniques are implemented in several programming languages and data mining frameworks, many of which are free to download and use.
3) If applying state-of-the-art clustering techniques is not enough, it is time to design a new technique. Then you can think it through yourself or collaborate with a machine learning expert.
Since most of your data is continuous, and it is reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering.
Such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions about the distributions from which those averages were calculated.
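As a minimal, illustrative sketch with scikit-learn (same placeholder column names as in the question; not a tuned solution):

# Hypothetical sketch: Gaussian Mixture Model clustering with scikit-learn.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

df = pd.read_csv("energy_data.csv")  # assumed file with the listed attributes
X = StandardScaler().fit_transform(
    df[["avg_daily_consumption", "avg_daily_generation",
        "avg_daily_fed_to_grid", "daily_tariff"]])

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
df["cluster"] = gmm.fit_predict(X)   # hard assignments
probs = gmm.predict_proba(X)         # soft membership probabilities per sample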
The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog unions based on Redis' representation thereof--this includes handling their sparse/dense representations if the data structure is exported for use elsewhere.
Two Strategies
There are two strategies, one of which seems vastly simpler. I've looked at the actual Redis source and I'm having a bit of trouble (not big in C, myself) figuring out whether it's better from a precision and efficiency perspective to use their built-in structures/routines or develop my own. For what it's worth, I'm willing to sacrifice space and to some degree errors (stdev +-2%) in the pursuit of efficiency with extremely large sets.
1. Inclusion Principle
By far the simplest of the two--essentially I would just use the lossless union (PFMERGE) in combination with this principle to calculate an estimate of the overlap. Tests seem to show this running reliably in many cases, although I'm having trouble getting an accurate handle on in-the-wild efficiency and accuracy (some cases can produce errors of 20-40% which is unacceptable in this use case).
Basically:
intersectionCardinality = aCardinality + bCardinality - unionCardinality
or, in the case of three sets (the inclusion-exclusion pattern extends to more)...
intersectionCardinality = aCardinality + bCardinality + cCardinality - abUnionCardinality - acUnionCardinality - bcUnionCardinality + abcUnionCardinality
seems to work in many cases with good accuracy, but I don't know if I trust it. While Redis has many built-in low-cardinality modifiers designed to circumvent known HLL issues, I don't know if the issue of wild inaccuracy (using inclusion/exclusion) is still present for sets with a large disparity in size...
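For concreteness, a minimal sketch of this strategy with redis-py; the key names are made up, and it relies on the fact that PFCOUNT over several keys returns the cardinality of their union:

# Hypothetical sketch of the inclusion-exclusion estimate using redis-py.
import redis

r = redis.Redis()

def intersection_estimate(key_a, key_b):
    a = r.pfcount(key_a)
    b = r.pfcount(key_b)
    union = r.pfcount(key_a, key_b)  # union cardinality without mutating either key
    return max(a + b - union, 0)     # HLL error can push the raw estimate below zero

print(intersection_estimate("hll:visitors:a", "hll:visitors:b"))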
2. Jaccard Index Intersection/MinHash
This way seems more interesting, but a part of me feels like it may computationally overlap with some of Redis' existing optimizations (ie, I'm not implementing my own HLL algorithm from scratch).
With this approach I'd use a random sampling of bins with a MinHash algorithm (I don't think an LSH implementation is worth the trouble). This would be a separate structure, but by using minhash to get the Jaccard index of the sets, you can then effectively multiply the union cardinality by that index for a more accurate count.
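As a rough, illustrative sketch of that idea in plain Python (a real implementation would use proper independent hash functions or an existing MinHash library rather than salted MD5):

# Illustrative MinHash sketch; salted MD5 stands in for k independent hash functions.
import hashlib

def minhash_signature(items, k=128):
    sig = []
    for i in range(k):
        salt = str(i).encode()
        sig.append(min(int(hashlib.md5(salt + item.encode()).hexdigest(), 16)
                       for item in items))
    return sig

def jaccard_estimate(sig_a, sig_b):
    # fraction of matching minimum hashes approximates the Jaccard index
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# intersection estimate = jaccard_estimate(...) * union cardinality from PFMERGE/PFCOUNT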
Problem is, I'm not very well versed in HLLs, and while I'd love to dig into the Google paper, I need a viable implementation in short order. Chances are I'm overlooking some basic consideration, either in Redis' existing optimizations or in the algorithm itself, that would allow for computationally cheap intersection estimates with pretty lax confidence bounds.
Thus, my question:
How do I most effectively get a computationally cheap intersection estimate of N huge (billions of members) sets, using Redis, if I'm willing to sacrifice space (and, to a small degree, accuracy)?
I read this paper some time back; it will probably answer most of your questions. The inclusion-exclusion principle inevitably compounds error margins over a large number of sets, so the MinHash approach would be the way to go.
http://tech.adroll.com/media/hllminhash.pdf
There is a third strategy to estimate the intersection size of any two sets given as HyperLogLog sketches: Maximum likelihood estimation.
For more details see the paper available at
http://oertl.github.io/hyperloglog-sketch-estimation-paper/.