Different schemes of two-bit branch prediction - cpu-architecture

Why do we have two versions of 2-bit branch prediction as shown in the figures below?
[Figure: first scheme]
[Figure: alternate scheme]
In the first scheme, a misprediction moves the predictor from weakly not taken to weakly taken (and from weakly taken to weakly not taken), whereas in the alternate scheme a misprediction moves it from weakly not taken to strongly taken (and from weakly taken to strongly not taken). How does one scheme compare to the other, or do both give the same accuracy?

The first scheme seems to be Strategy 7 described in James E. Smith's paper "A Study of Branch Prediction Strategies" (here). Figures 8 and 10 show the interesting numbers. The accuracy for this scheme ranges from 80.1% to 99.4%.
Both schemes are described in "Branch Prediction Strategies and Branch Target Buffer Design" by J.K.F. Lee and A.J. Smith. That paper also compares the two schemes, and they do not differ much in accuracy.
Even without reading the papers, you can see that some patterns favor one scheme over the other. For example, with the pattern taken, not taken, taken, not taken, ..., the first scheme is wrong on every branch if it starts in the weakly-not-taken state. The second scheme never gets locked into mispredicting every branch on that pattern, but it does get locked in on the pattern taken, taken, not taken, not taken, taken, taken, ...
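To make that concrete, here is a minimal Python sketch (not taken from either paper; the state encoding, starting state, and test patterns are just illustrative) that simulates both state machines and reports their accuracy on the two patterns mentioned above:

# State encoding: 0 = strongly not taken, 1 = weakly not taken,
#                 2 = weakly taken,       3 = strongly taken.
# A branch is predicted taken in states 2 and 3.

def step_first(state, taken):
    # First scheme: saturating counter; a misprediction in a weak state
    # only moves to the other weak state.
    return min(state + 1, 3) if taken else max(state - 1, 0)

def step_alternate(state, taken):
    # Alternate scheme: a misprediction in a weak state jumps to the
    # opposite strong state (WN -> ST on taken, WT -> SN on not taken).
    if taken:
        return 1 if state == 0 else 3
    return 2 if state == 3 else 0

def accuracy(step, outcomes, state=1):   # start in weakly not taken
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        correct += (predicted_taken == taken)
        state = step(state, taken)
    return correct / len(outcomes)

alternating = [True, False] * 8               # T, NT, T, NT, ...
pairs = [True, True, False, False] * 4        # T, T, NT, NT, ...

for name, step in (("first", step_first), ("alternate", step_alternate)):
    print(name, accuracy(step, alternating), accuracy(step, pairs))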

Related

Implementing spell drawing/casting mechanism in Luau (Roblox)

I am coding a spell-casting system where you draw a symbol with your wand (mouse), and it can recognize said symbol.
There are two methods I believe might work: a neural network, or an "invisible grid system".
The problem with the neural network approach is that it would likely be suboptimal in Roblox Luau and not match the performance or speed I wish for. (Although I may just be lacking in neural network knowledge; please let me know whether I should continue trying to implement it this way.)
For the invisible grid system, I thought of converting the drawing into 1s and 0s (1 = drawn, 0 = blank), then seeing if it is similar to one of the symbols. I create the symbols by making a dictionary like:
local Symbol = { -- "Answer Key" shape, looks like a tilted square
    "00100", -- rows stored as strings so the leading zeros are preserved
    "01010",
    "10001",
    "01010",
    "00100",
}
The problem is that user error will likely make the drawing inaccurate, like this "spell"'s blue boxes, which show user error/inaccuracy. I also suspect that if I have many symbols, comparing every value in every symbol will not be quick.
Do you know an algorithm that could help me do this? Or just some alternative way of doing this I am missing? Thank you for reading my post.
I'm sorry if the format of this is incorrect; this is my first Stack Overflow post. I will gladly delete this post if it doesn't abide by one of the rules. (Let me know if there are any tags I should add.)
One possible approach to solving this problem is to use a template matching algorithm. In this approach, you would create a "template" for each symbol that you want to recognize, which would be a grid of 1s and 0s similar to what you described in your question. Then, when the user draws a symbol, you would convert their drawing into a grid of 1s and 0s in the same way.
Next, you would compare the user's drawing to each of the templates using a similarity metric, such as the sum of absolute differences (SAD) or normalized cross-correlation (NCC). The template with the lowest SAD or highest NCC value would be considered the "best match" for the user's drawing, and therefore the recognized symbol.
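For illustration, here is a minimal sketch of that idea in Python (the symbol names, grids, and the max_distance threshold are made up; the same loops translate directly to Luau tables and string indexing):

# Minimal sketch of SAD-based template matching on 5x5 binary grids.
TEMPLATES = {
    "tilted_square": ["00100", "01010", "10001", "01010", "00100"],
    "cross":         ["00100", "00100", "11111", "00100", "00100"],
}

def sad(drawing, template):
    # Sum of absolute differences between two grids of '0'/'1' characters.
    return sum(
        abs(int(a) - int(b))
        for row_a, row_b in zip(drawing, template)
        for a, b in zip(row_a, row_b)
    )

def recognize(drawing, max_distance=4):
    # Return the best-matching symbol name, or None if nothing is close enough.
    best_name, best_score = None, None
    for name, template in TEMPLATES.items():
        score = sad(drawing, template)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= max_distance else None

# A slightly sloppy tilted square (one stray cell) still matches.
drawing = ["00100", "01010", "10011", "01010", "00100"]
print(recognize(drawing))   # -> tilted_square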
There are a few advantages to using this approach:
It is relatively simple to implement, compared to a neural network.
It is fast, since you only need to compare the user's drawing to a small number of templates.
It can tolerate some user error, since the templates can be designed to be tolerant of slight variations in the user's drawing.
There are also some potential disadvantages to consider:
It may not be as accurate as a neural network, especially for complex or highly variable symbols.
The templates must be carefully designed to be representative of the expected variations in the user's drawings, which can be time-consuming.
Overall, whether this approach is suitable for your use case will depend on the specific requirements of your spell-casting system, including the number and complexity of the symbols you want to recognize, the accuracy and speed you need, and the resources (e.g. time, compute power) that are available to you.

Rare question about interaction terms, main effects, and mean-centering - mix & match?

This question is not about the longstanding discussion of 'to mean-center, or not mean-center' interaction terms, or what mean-centering the variables in an interaction gets you (or doesn't get you).
The question is whether it is reasonable to have a model that includes an uncentered predictor serving to model a main effect (e.g. education's effect on income), together with an interaction term computed by multiplying a mean-centered version of that variable with another term (say, gender, to see if the education effect on income is conditional on gender), while leaving the 'standalone' education variable on its original scale.
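To make the specification concrete, here is a minimal sketch (the data, variable names, and the inclusion of a gender main effect are assumptions for illustration only) of the mix-and-match model being asked about:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: income, education (years), gender (0/1).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "education": rng.normal(14, 2, n),
    "gender": rng.integers(0, 2, n),
})
df["income"] = 20 + 2.5 * df["education"] + 5 * df["gender"] + rng.normal(0, 5, n)

# Centered copy of education, used only inside the product term.
df["education_c"] = df["education"] - df["education"].mean()

# The "mix and match" specification: the standalone education term is
# uncentered, while the interaction uses the centered version.
model = smf.ols("income ~ education + gender + education_c:gender", data=df).fit()
print(model.params)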
For the sake of keeping the focus on this rare combination of an uncentered version of a variable with an interaction that is based on a centered version of the same variable in the same model, let's ignore the reasons for doing this (i.e. interpretability vs. collinearity).
Everything I can find about these issues seems to always assume that either all versions of the variables (as an individual variable/main effect, and as a part of the interaction/product term) are either centered or uncentered. Hence the labeling of this as a 'rare question' about mean-centering and interactions.
My instinct is that this (mixing and matching centered and uncentered) is problematic because, despite the linear similarities between the centered and uncentered versions, you end up with a model where one of the components of the interaction is technically absent. But this may also be just because I am not a fan of arguments - still common in a lot of places - that collinearity is the reason to mean-center.
What do people here think?

How to predict word using trained CBOW

I have a question about CBOW prediction. Suppose my job is to use the 3 surrounding words w(t-3), w(t-2), w(t-1) as input to predict one target word w(t). Once the model is trained, I want to predict a missing word at the end of a sentence. Does this model only work for a sentence with four words, where the first three are known and the last is unknown? If I have a sentence of 10 words and the first nine are known, can I use those 9 words as input to predict the last missing word in that sentence?
Word2vec CBOW mode typically uses symmetric windows around a target word. But it simply averages the (current, in-training) word-vectors for all words in the window to find the 'inputs' for the prediction neural-network. Thus, it is tolerant of asymmetric windows: if fewer words are available on one side, fewer words on that side are used (perhaps even zero on that side, for words at the front/end of a text).
Additionally, during each training example, it doesn't always use the maximum-window specified, but some random-sized window up-to the specified size. So for window=5, it will sometimes use just 1 on either side, and other times 2, 3, 4, or 5. This is done to effectively overweight closer words.
Finally, and most importantly for your question, word2vec doesn't really do a full prediction during training of "what exact word does the model say should be at this target location?" In either the 'hierarchical softmax' or 'negative sampling' variants, such an exact prediction can be expensive, requiring calculations of neural-network output-node activation levels proportionate to the size of the full corpus vocabulary.
Instead, it does the much-smaller number-of-calculations required to see how strongly the neural-network is predicting the actual target word observed in the training data, perhaps in contrast to a few other words. In hierarchical-softmax, this involves calculating output nodes for a short encoding of the one target word – ignoring all other output nodes encoding other words. In negative-sampling, this involves calculating the one distinct output node for the target word, plus a few output nodes for other randomly-chosen words (the 'negative' examples).
In neither case does training know whether this target word is being predicted in preference over all other words – because it's not taking the time to evaluate all the other words. It just looks at the current strength-of-outputs for a real example's target word, and nudges them (via back-propagation) to be slightly stronger.
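As a rough illustration of how little is computed per example, here is a minimal sketch of CBOW with negative sampling (the vocabulary size, dimensions, and word indices are all made up; real implementations also apply gradient updates, which are omitted here):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model: 10,000-word vocabulary, 100-dimensional vectors.
rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # input ("projection") vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # output-layer vectors

context_ids = [12, 47, 3051]       # window words (indices are made up)
target_id = 999                    # the word actually observed at the center
negative_ids = [77, 4096, 8123]    # a few randomly drawn 'negative' words

h = W_in[context_ids].mean(axis=0)  # CBOW input: average of the context vectors

# Only the target node and a handful of negative nodes are evaluated,
# never the full vocabulary.
target_score = sigmoid(h @ W_out[target_id])         # nudged toward 1 in training
negative_scores = sigmoid(W_out[negative_ids] @ h)   # nudged toward 0 in training
print(target_score, negative_scores)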
The end result of this process is the word-vectors that are usefully-arranged for other purposes, where similar words are close to each other, and even certain relative directions and magnitudes also seem to match human judgements of words' relationships.
But the final word-vectors, and model-state, might still be just mediocre at predicting missing words from texts – because it was only ever nudged to be better on individual examples. You could theoretically compare a model's predictions for every possible target word, and thus force-create a sort of ranked-list of predicted-words – but that's more expensive than anything needed for training, and prediction of words like that isn't the usual downstream application of sets of word-vectors. So indeed most word2vec libraries don't even include any interface methods for doing full target-word prediction. (For example, the original word2vec.c from Google doesn't.)
A few versions ago, the Python gensim library added an experimental method for prediction, predict_output_word(). It only works for negative-sampling mode, and it doesn't quite handle window-word-weighting the same way as is done in training. You could give it a try, but don't be surprised if the results aren't impressive. As noted above, making actual predictions of words isn't the usual real goal of word2vec-training. (Other more stateful text-analysis, even just large co-occurrence tables, might do better at that. But they might not force word-vectors into interesting constellations like word2vec.)
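For completeness, here is a minimal sketch of using that method in gensim (assuming gensim 4.x, where the size parameter is called vector_size; the toy corpus is obviously far too small to give meaningful predictions):

from gensim.models import Word2Vec

# Toy corpus; a real corpus would need to be far larger.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "lazy", "dog", "sleeps", "in", "the", "warm", "sun"],
]

# CBOW (sg=0) with negative sampling, which predict_output_word() requires.
model = Word2Vec(
    sentences,
    vector_size=50,
    window=5,
    sg=0,
    negative=5,
    min_count=1,
    epochs=50,
)

# Rank likely center words given some context words.
print(model.predict_output_word(["the", "lazy", "sleeps"], topn=5))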

How best to deal with "None of the above" in Image Classification?

This seems to be a fundamental question which some of you out there must have an opinion on. I have an image classifier implemented in CNTK with 48 classes. If the image does not match any of the 48 classes very well, then I'd like to be able to conclude that it was not among these 48 image types. My original idea was simply that if the highest output of the final Softmax layer was low, I would be able to conclude that the test image matched none well. While I occasionally see this occur, in most testing, Softmax still produces a very high (and mistaken) result when handed an 'unknown image type'. But maybe my network is 'over fit' and if it wasn't, my original idea would work fine. What do you think? Any way to define a 49-th class called 'none-of-the-above'?
You really do have just these two options: thresholding the posterior probabilities (softmax values), or adding a garbage class.
In my area (speech), both approaches have their place:
If "none of the above" inputs are of the same nature as the "above" (e.g. non-grammatical inputs), thresholding works fine. Note that the posterior probability for a class is equal to one minus an estimate of the error rate for choosing this class. Rejecting anything with posterior < 50% would be rejecting all cases where you are more likely wrong than right. As long as your none-of-the-above classes are of similar nature, the estimate may be accurate enough to make this correct for them as well.
If "none of the above" inputs are of similar nature but your number of classes is very small (e.g. 10 digits), or if the inputs are of a totally different nature (e.g. a sound of a door slam or someone coughing), thresholding typically fails. Then, one would train a "garbage model." In our experience, it is OK to include the training data for the correct classes. Now the none-of-the-above class may match a correct class as well. But that's OK as long as the none-of-the-above class is not overtrained--its distribution will be much flatter, and thus even if it matches a known class, it will match it with a lower score and thus not win against the actual known class' softmax output.
In the end, I would use both. Definitely use a threshold (to catch the cases that the system can rule out) and use a garbage model, which I would just train on whatever you have. I would expect that including the correct examples in training will not hurt, even if it is the only data you have (please check the paper Anton posted for whether that applies to images as well). It may also make sense to try to synthesize data, e.g. by randomly combining patches from different images.
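Here is a minimal sketch of combining both ideas (plain numpy, independent of CNTK; the 0.5 threshold and the layout of 48 real classes plus one garbage class at index 48 are assumptions):

import numpy as np

def classify_with_rejection(logits, threshold=0.5, garbage_index=None):
    # logits: array of raw final-layer scores, one row per image.
    # Returns a class index per row, or None for 'none of the above'.
    z = logits - logits.max(axis=-1, keepdims=True)             # stabilized softmax
    posteriors = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    results = []
    for p in np.atleast_2d(posteriors):
        winner = int(p.argmax())
        if p[winner] < threshold or winner == garbage_index:
            results.append(None)                                # reject the image
        else:
            results.append(winner)
    return results

# Hypothetical scores for 3 images over 49 outputs (48 classes + garbage).
scores = np.random.randn(3, 49)
print(classify_with_rejection(scores, threshold=0.5, garbage_index=48))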
I agree with you that this is a key question, but I am not aware of much work in that area either.
There's one recent paper by Zhang and LeCun that addresses the question for image classification in particular. They use large quantities of unlabelled data to create an additional "none of the above" class. The catch, though, is that in some cases their unlabelled data is not completely unlabelled, and they have means of removing "unlabelled" images that actually belong to one of their labelled classes. Having said that, the authors report that apart from addressing the "none of the above" problem, they even see performance gains on their test sets.
As for fitting something post-hoc, just by looking at the outputs of the softmax, I can't provide any pointers.

Depth of Artificial Neural Networks

According to this answer, one should never use more than two hidden layers of neurons.
According to this answer, a middle layer should contain at most twice the number of input or output neurons (so if you have 5 input neurons and 10 output neurons, one should use at most 20 middle neurons per layer).
Does that mean that all data will be modeled within that number of neurons?
So if, for example, one wants to do anything from modeling weather (a million input nodes from data from different weather stations) to simple OCR (of scanned text with a resolution of 1000x1000 DPI), one would need the same number of nodes?
PS.
My last question was closed. Is there another SE site where these kinds of questions are on topic?
You will likely have overfitting of your data (aka high variance). Think of it like this: the more neurons and layers you have, the more parameters you have, and the more closely you can fit your training data.
Remember that a first-layer node computes Z = sigmoid(sum(W * x)).
A second-layer node then computes Z2 = sigmoid(sum(W * Z)).
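In code, those two equations are just a layered forward pass (a sketch with the hypothetical 5-input / 20-hidden / 10-output sizes from the question; bias terms are omitted, as in the equations above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(20, 5))     # first-layer weights: 5 inputs -> 20 hidden units
W2 = rng.normal(size=(10, 20))    # second-layer weights: 20 hidden -> 10 outputs

x = rng.normal(size=5)            # one input example

Z = sigmoid(W1 @ x)               # first layer: sigmoid(sum(W * x))
Z2 = sigmoid(W2 @ Z)              # second layer: sigmoid(sum(W * Z))

print(Z.shape, Z2.shape)          # (20,) (10,)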
Look into the machine learning class taught at Stanford; it's a great online course and a good reference.
More than two hidden layers can be useful in certain architectures
such as cascade correlation (Fahlman and Lebiere 1990) and in special
applications, such as the two-spirals problem (Lang and Witbrock 1988)
and ZIP code recognition (Le Cun et al. 1989).
Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation
Learning Architecture," NIPS2, 524-532.
Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E.,
Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to
handwritten ZIP code recognition", Neural Computation, 1, 541-551.
Check out the sections "How many hidden layers should I use?" and "How many hidden units should I use?" in the comp.ai.neural-nets FAQ for more information.