I have a set of say, 100, genomic features for which I've created a fasta file with a 500 bp window around each. I've searched these windows for a DNA sequence and found an average of 1.5 sequences per individual 500 bp window in the feature set. By chance, I expect the sequence to be present once every 1024 bp, or on average ~0.49 of my sequence per 500 bp window.
My question is how can I determine whether the 1.5 binding sites per individual feature I've uncovered is significant or not, and obtain a p-value?
And as a follow up, if I use the same set of 100 windows and search for a different sequence with the same probability (1/1024) and determine that there are now an average of 0.9 sequences per individual window, how can I determine whether this is significantly different than the 1.5 for the sequence for which I searched above?
As a second follow up, if I search for the same two sequences above (both found on average 1/1024 base pairs) in a different set of 500 bp windows for a different feature type (say, n=50), how can I determine if the results of this search are significantly different than the results above (particularly if the difference between sequence A and sequence B in feature set 1 and feature set 2 is significant)?
Thank you in advance.
I ended up using simulations to answer all of the above questions. Generate windows of desired size, 500 bp in this case, of random genomic sequence. Search for motifs in X windows (where X = number of individuals in a feature set) and compare with results obtained searching for motifs in the features of interest. To Repeat with sample size equal to that of the second feature set being analyzed. To compare features with one another, do the analogous simulation and compare results.
Related
I have a design for an experiment that I would like to code in psychtoolbox using MATLAB. I understand the basics of how to use this toolbox however it would be extremely helpful if someone has designed a similar experiment before and could provide me with some code that could help me to carry out the following:
The experimental procedure will be composed of 80 trials divided into 5 blocks with 16 trials in each block. The experiment consists of the participant selecting a number from the screen. There is a desirable number (target number) and an associated less desirable number (lure number). I won't go into further detail about the reasoning behind this experiment as it is not relevant to my question.
I have attached an image that shows 1 block of trials (16 trials). The other 4 blocks are the same as this block.
Target and lure numbers will be presented on the screen to choose from (an example can be seen in image below).
In some of the trials as can be seen from the trials table only one target number and one lure number are presented for the participant to choose from (instead of two targets and two lures).
The lure(s) that appears with each target(s) should not always be the same. I want the lure that is shown with the target to be randomly selected in each trial (as can be seen in the trials image there is more than one possible lure). In the trials image I have attached the trial numbers are presented just for clarity, in each block the presentation of the targets needs to be randomized.
You can use the BalanceTrials function that comes with psychtoolbox. You use all the possible lures and targets as inputs and it returns a random order of all possible combinations. You can also specify a minimum length of the list it returns, but if there are more combinations it will make the list longer to make it balanced. Here is an example:
numberOfTrials = 80;
targetNumbers = {'3','4','5','6','7','4 5','4 6','4 7'};
lureNumbers = {'3','4','5','6','7','4 7'};
[targets, lures] = BalanceTrials(numberOfTrials, 1, targetNumbers, lureNumbers);
You can split this up into 5 blocks or you do it each time for each block.
I have a question about CBOW prediction. Suppose my job is to use 3 surrounding words w(t-3), w(t-2), w(t-1)as input to predict one target word w(t). Once the model is trained and I want to predict a missing word after a sentence. Does this model only work for a sentence with four words which the first three are known and the last is unknown? If I have a sentence in 10 words. The first nine words are known, can I use 9 words as input to predict the last missing word in that sentence?
Word2vec CBOW mode typically uses symmetric windows around a target word. But it simply averages the (current in-training) word-vectors for all words in the window to find the 'inputs' for the prediction neural-network. Thus, it is tolerant of asymmetric windows – if there are fewer words are available on either side, fewer words on that side are used (and perhaps even zero on that side, for words at the front/end of a text).
Additionally, during each training example, it doesn't always use the maximum-window specified, but some random-sized window up-to the specified size. So for window=5, it will sometimes use just 1 on either side, and other times 2, 3, 4, or 5. This is done to effectively overweight closer words.
Finally and most importantly for your question, word2vec doesn't really do a full-prediction during training of "what exact word does the model say should be heat this target location?" In either the 'hierarchical softmax' or 'negative-sampling' variants, such an exact prediction can be expensive, requiring calculations of neural-network output-node activation levels proportionate to the size of the full corpus vocabulary.
Instead, it does the much-smaller number-of-calculations required to see how strongly the neural-network is predicting the actual target word observed in the training data, perhaps in contrast to a few other words. In hierarchical-softmax, this involves calculating output nodes for a short encoding of the one target word – ignoring all other output nodes encoding other words. In negative-sampling, this involves calculating the one distinct output node for the target word, plus a few output nodes for other randomly-chosen words (the 'negative' examples).
In neither case does training know if this target word is being predicted in preference over all other words – because it's not taking the time to evaluate all others words. It just looks at the current strength-of-outputs for a real example's target word, and nudges them (via back-propagation) to be slightly stronger.
The end result of this process is the word-vectors that are usefully-arranged for other purposes, where similar words are close to each other, and even certain relative directions and magnitudes also seem to match human judgements of words' relationships.
But the final word-vectors, and model-state, might still be just mediocre at predicting missing words from texts – because it was only ever nudged to be better on individual examples. You could theoretically compare a model's predictions for every possible target word, and thus force-create a sort of ranked-list of predicted-words – but that's more expensive than anything needed for training, and prediction of words like that isn't the usual downstream application of sets of word-vectors. So indeed most word2vec libraries don't even include any interface methods for doing full target-word prediction. (For example, the original word2vec.c from Google doesn't.)
A few versions ago, the Python gensim library added an experimental method for prediction, [predict_output_word()][1]. It only works for negative-sampling mode, and it doesn't quite handle window-word-weighting the same way as is done in training. You could give it a try, but don't be surprised if the results aren't impressive. As noted above, making actual predictions of words isn't the usual real goal of word2vec-training. (Other more stateful text-analysis, even just large co-occurrence tables, might do better at that. But they might not force word-vectors into interesting constellations like word2vec.)
I am relatively new to vowpal wabbit and would like to find out about the -b parameter (feature bits in feature table).
My training data are something like this. There are a total of about 1 million words.
1 | a = "word" b ="word131232" c="word1233" d = "word123124" e = "word23145"
However, each row would only have 5 features. how many bits should i use? I tried to run it and it seems with an increasing number of examples, the number of features set seem to increase. I do not seem to understand why is this so.
If you use -b 18 (which is the default), the features will be hashed into a table with 2^18 items, so if the number of unique features in your dataset is close to 2^18 (or even higher), you should increase the parameter -b, so there are not so many hash collisions. There is no easy way how to detect the number of collisions, but the common practice is to tune the parameter -b for a best progressive validation loss (or holdout loss if you use more passes). Of course, it also depends on the available memory on your machine.
1 | a = "word" b ="word131232" c="word1233" d = "word123124" e = "word23145"
Note that this example is wrong (not what you intended) because of the spaces around =. The equal sign has no special meaning (unlike colon which is used for separating the feature value). Features cannot contain space in their name. There is no need to enclose feature names in quotes. So the example should look like
1 | word word131232 word1233 word123124 word23145
If the prefix a, b, c, d, e has some special meaning (i.e. a=word42 should be a different feature than b=word42) you can use:
1 | a=word b=word131232 c=word1233 d=word123124 e=word23145
If all your words are already mapped to integers (within the range 0-2^b), you can use them directly as feature names and no hashing will be done (unless you specify --hash=all):
1 | 0 131232 1233 123124 23145
See the wiki page about input format.
the number of features set seem to increase
In the progress report (by default each 2^x th example), in the last column you can see current features, which is the number of features for the current example (including the constant feature and quadratic/cubic/... features if you use them) and it should not be increasing (unless you have such strange data).
In the final report, vw prints total feature number, which is the average number of features per example times the number of examples times the number of passes (so it is not the number of unique features in the dataset).
How to use Morton Order in range search?
From the wiki, In the paragraph "Use with one-dimensional data structures for range searching",
it says
"the range being queried (x = 2, ..., 3, y = 2, ..., 6) is indicated
by the dotted rectangle. Its highest Z-value (MAX) is 45. In this
example, the value F = 19 is encountered when searching a data
structure in increasing Z-value direction. ......BIGMIN (36 in the
example).....only search in the interval between BIGMIN and MAX...."
My questions are:
1) why the F is 19? Why the F should not be 16?
2) How to get the BIGMIN?
3) Are there any web blogs demonstrate how to do the range search?
EDIT: The AWS Database Blog now has a detailed introduction to this subject.
This blog post does a reasonable job of illustrating the process.
When searching the rectangular space x=[2,3], y=[2,6]:
The minimum Z Value (12) is found by interleaving the bits of the lowest x and y values: 2 and 2, respectively.
The maximum Z value (45) is found by interleaving the bits of the highest x and y values: 3 and 6, respectively.
Having found the min and max Z values (12 and 45), we now have a linear range that we can iterate across that is guaranteed to contain all of the entries inside of our rectangular space. The data within the linear range is going to be a superset of the data we actually care about: the data in the rectangular space. If we simply iterate across the entire range, we are going to find all of the data we care about and then some. You can test each value you visit to see if it's relevant or not.
An obvious optimization is to try to minimize the amount of superfluous data that you must traverse. This is largely a function of the number of 'seams' that you cross in the data -- places where the 'Z' curve has to make large jumps to continue its path (e.g. from Z value 31 to 32 below).
This can be mitigated by employing the BIGMIN and LITMAX functions to identify these seams and navigate back to the rectangle. To minimize the amount of irrelevant data we evaluate, we can:
Keep a count of the number of consecutive pieces of junk data we've visited.
Decide on a maximum allowable value (maxConsecutiveJunkData) for this count. The blog post linked at the top uses 3 for this value.
If we encounter maxConsecutiveJunkData pieces of irrelevant data in a row, we initiate BIGMIN and LITMAX. Importantly, at the point at which we've decided to use them, we're now somewhere within our linear search space (Z values 12 to 45) but outside the rectangular search space. In the Wikipedia article, they appear to have chosen a maxConsecutiveJunkData value of 4; they started at Z=12 and walked until they were 4 values outside of the rectangle (beyond 15) before deciding that it was now time to use BIGMIN. Because maxConsecutiveJunkData is left to your tastes, BIGMIN can be used on any value in the linear range (Z values 12 to 45). Somewhat confusingly, the article only shows the area from 19 on as crosshatched because that is the subrange of the search that will be optimized out when we use BIGMIN with a maxConsecutiveJunkData of 4.
When we realize that we've wandered outside of the rectangle too far, we can conclude that the rectangle in non-contiguous. BIGMIN and LITMAX are used to identify the nature of the split. BIGMIN is designed to, given any value in the linear search space (e.g. 19), find the next smallest value that will be back inside the half of the split rectangle with larger Z values (i.e. jumping us from 19 to 36). LITMAX is similar, helping us to find the largest value that will be inside the half of the split rectangle with smaller Z values. The implementations of BIGMIN and LITMAX are explained in depth in the zdivide function explanation in the linked blog post.
It appears that the quoted example in the Wikipedia article has not been edited to clarify the context and assumptions. The approach used in that example is applicable to linear data structures that only allow sequential (forward and backward) seeking; that is, it is assumed that one cannot randomly seek to a storage cell in constant time using its morton index alone.
With that constraint, one's strategy begins with a full range that is the mininum morton index (16) and the maximum morton index (45). To make optimizations, one tries to find and eliminate large swaths of subranges that are outside the query rectangle. The hatched area in the diagram refers to what would have been accessed (sequentially) if such optimization (eliminating subranges) had not been applied.
After discussing the main optimization strategy for linear sequential data structures, it goes on to talk about other data structures with better seeking capability.
Suppose that I have a data set that contains a cyclical event and I am identifying a threshold (peaks) to separate each event (to eventually find the coefficient of variation).
I have multiple trials of this data - the speed of these events is sometimes significantly faster than others. This data is also a bit noisy, so some 'false local maximas' are sometimes picked up if I don't set the 'minpeakdistance' constraint within the 'findpeaks' function.
I am trying to find a way to ensure that regardless of speed, I am finding 'true local maximas'. I have been visually inspecting each trial to ensure that I have identified only true peaks - if I have also identified false peaks, I have been adjusted the mpd value for that specific trial - but this is literally going to take days.
Any suggestions?
Example:
For most trials of my collection, the following line of code only identifies true maximas:
mpd = 'minpeakdistance';
eval(['[t' num2str(a) '.Mspine.pks(:,1),t' num2str(a) '.Mspine.locs] = findpeaks(t' num2str(a) '.Mspine.xyz(:,1), mpd,25);']);
But, for trial 11, they are moving much faster, so the mpd has to be adjusted to 9; however, if I apply an mpd value of 9 to all of the trials, it will pick up false local maximas.
theI would go over to the frequency domain to find this "cyclical event". Specifically, if you know the rate at which data is sampled/generated, using a FFT will indicate the relative strengths of all periodic events in your data. Have a look at: http://www.mathworks.se/help/matlab/ref/fft.html