NIST Test Suite - igamc: UNDERFLOW

I used a 32-bit random number generator 100,000 times, which resulted in a file of 275,714 bytes. Then I typed the following line in my terminal:
./assess 1024 (Here comes my first question: what exactly should we type here?)
Then I fed my file as input, and it came to:
"How many bitstreams?" 269
Here 269 = 275,714/1024. And I chose Binary as my format. Finally, I got numerous lines of "igamc: UNDERFLOW". How should I deal with this?

The NIST Test Suite works on a number of bitstreams of a certain length, and its result is then displayed as the proportion of bitstreams that passed each test (the Proportion column in finalAnalysisReport). So when you execute ./assess length, length is the length of one bitstream.
I think the igamc underflow can be caused by bitstreams that are too short. The NIST document specifies a recommended input size for every test: for example, it is somewhere close to 40,000 bits for the Binary Matrix Rank Test and 1,000,000 bits for the Overlapping Template Matching Test, and both of these tests use the igamc function to compute the P-value.
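As a rough sanity check, here is how the bitstream count relates to the file size, assuming the length argument to ./assess is a bit count (which is how the recommended sizes above are stated); the helper below is my own illustration, not part of the NIST code:

#include <cstdint>
#include <iostream>

// Minimal sketch: how many complete bitstreams of streamLenBits fit in a
// binary file of fileBytes bytes (any incomplete tail is discarded).
std::uint64_t completeBitstreams(std::uint64_t fileBytes, std::uint64_t streamLenBits) {
    std::uint64_t totalBits = fileBytes * 8;  // 275,714 bytes -> 2,205,712 bits
    return totalBits / streamLenBits;
}

int main() {
    std::cout << completeBitstreams(275714, 1024) << "\n";     // 2154 streams of 1024 bits
    std::cout << completeBitstreams(275714, 1000000) << "\n";  // only 2 streams of 10^6 bits
    return 0;
}

The 269 in the question comes from dividing the byte count by 1024, so it is worth double-checking which unit you are feeding to the suite.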

Calculate significance for DNA binding motifs found vs. expected in MATLAB

I have a set of, say, 100 genomic features for which I've created a fasta file with a 500 bp window around each. I've searched these windows for a DNA sequence and found an average of 1.5 occurrences per individual 500 bp window in the feature set. By chance, I expect the sequence to be present once every 1024 bp, or on average ~0.49 occurrences per 500 bp window.
My question is how can I determine whether the 1.5 binding sites per individual feature I've uncovered is significant or not, and obtain a p-value?
And as a follow-up, if I use the same set of 100 windows and search for a different sequence with the same probability (1/1024) and determine that there are now an average of 0.9 sequences per individual window, how can I determine whether this is significantly different from the 1.5 for the sequence I searched for above?
As a second follow-up, if I search for the same two sequences above (both expected on average once per 1024 base pairs) in a different set of 500 bp windows for a different feature type (say, n = 50), how can I determine whether the results of this search are significantly different from the results above (particularly whether the difference between sequence A and sequence B in feature set 1 versus feature set 2 is significant)?
Thank you in advance.
I ended up using simulations to answer all of the above questions. Generate windows of the desired size (500 bp in this case) of random genomic sequence. Search for motifs in X windows (where X = the number of individuals in a feature set) and compare with the results obtained by searching for motifs in the features of interest. Repeat with a sample size equal to that of the second feature set being analyzed. To compare features with one another, do the analogous simulation and compare the results.
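A minimal sketch of that simulation in code (C++ here just for illustration; the motif length, window size, set size, and observed mean are the numbers from the question, and the uniform A/C/G/T background is my own simplifying assumption):

#include <cstddef>
#include <iostream>
#include <random>
#include <string>

// Count all (possibly overlapping) occurrences of motif in seq.
static int countMotif(const std::string& seq, const std::string& motif) {
    int count = 0;
    for (std::size_t pos = seq.find(motif); pos != std::string::npos;
         pos = seq.find(motif, pos + 1))
        ++count;
    return count;
}

// Generate one random window with a uniform A/C/G/T background.
static std::string randomWindow(std::size_t len, std::mt19937& rng) {
    static const char bases[] = {'A', 'C', 'G', 'T'};
    std::uniform_int_distribution<int> pick(0, 3);
    std::string s(len, 'A');
    for (char& c : s) c = bases[pick(rng)];
    return s;
}

int main() {
    const std::size_t windowLen = 500;  // window size from the question
    const int nWindows = 100;           // number of features in the set
    const int nTrials = 10000;          // simulated feature sets
    const std::string motif = "ACGTA";  // hypothetical 5-mer, hit rate 1/4^5 = 1/1024
    const double observedMean = 1.5;    // observed mean hits per window

    std::mt19937 rng(12345);
    int asExtreme = 0;
    for (int t = 0; t < nTrials; ++t) {
        long hits = 0;
        for (int w = 0; w < nWindows; ++w)
            hits += countMotif(randomWindow(windowLen, rng), motif);
        if (static_cast<double>(hits) / nWindows >= observedMean)
            ++asExtreme;  // one-sided test for enrichment
    }
    // Empirical p-value: the fraction of random feature sets at least as
    // enriched as the observed one (+1 in both terms so p is never exactly 0).
    std::cout << "empirical p-value: " << (asExtreme + 1.0) / (nTrials + 1.0) << "\n";
    return 0;
}

The follow-up comparisons work the same way: simulate the difference in per-window means between two random sets of the appropriate sizes and see how often it is at least as large as the observed difference.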

Elias Gamma Coding and upper bound

While reading about Elias gamma coding on Wikipedia, I see that it mentions:
"Gamma coding is used in applications where the largest encoded value is not known ahead of time."
and that:
"It is used most commonly when coding integers whose upper-bound cannot be determined beforehand."
I don't really understand what is meant by these sentences, because whenever this algorithm is implemented, the largest value of the test data, or the range of the test data, would be known beforehand. Any help is appreciated!
As far as I'm acquainted with Elias gamma/delta coding, the first sentence simply states that these compression methods are global, which means that they do not rely on the input data to generate the code. In other words, these methods do not need to process the input before performing the compression (as local methods do); they compress the data with a function that does not depend on information from the database.
As for the second sentence, it may be taken as a guarantee that, although there may be some very large integers, the encoding will still perform well (and will represent such values with a feasible number of bytes, i.e., it is a universal method). Notice that, if you knew the biggest integer, some approaches (like minimal hashes) could perform better.
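To make the "no upper bound" point concrete, here is a minimal sketch of an Elias gamma encoder (my own illustration, not code from the article): the codeword for a positive integer n is floor(log2 n) zero bits followed by the binary representation of n, so the encoder never needs to know the largest value in advance.

#include <cstdint>
#include <iostream>
#include <string>

// Minimal Elias gamma encoder for illustration: returns the codeword for n >= 1
// as a string of '0'/'1' characters (floor(log2 n) zeros, then n in binary).
std::string eliasGamma(std::uint64_t n) {
    int bits = 0;
    for (std::uint64_t v = n; v > 1; v >>= 1) ++bits;   // bits = floor(log2 n)
    std::string code(bits, '0');                        // unary prefix of zeros
    for (int i = bits; i >= 0; --i)                     // then n in binary, MSB first
        code += ((n >> i) & 1) ? '1' : '0';
    return code;
}

int main() {
    for (std::uint64_t n : {1, 2, 3, 4, 24, 1000000})
        std::cout << n << " -> " << eliasGamma(n) << "\n";
    return 0;
}

Each codeword is 2*floor(log2 n) + 1 bits long, so small values get short codes while arbitrarily large values remain representable.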
As a last consideration, the same page you referred to also states that:
Gamma coding is used in applications where the largest encoded value is not known ahead of time, or to compress data in which small values are much more frequent than large values.
This may be obtained by generating lists of differences from the original lists of integers, and passing such differences to be compressed instead. For example, in a list of increasing numbers, you could generate:
list: 1 5 29 32 35 36 37
diff: 1 4 24 3 3 1 1
This will give you many more small numbers, and therefore a greater level of compression, than the first list.
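A minimal sketch of that difference transform (again my own illustration), whose output could then be fed to a gamma encoder such as the one sketched earlier:

#include <cstddef>
#include <iostream>
#include <vector>

// Turn a strictly increasing list into first differences (a gap list),
// which tend to be small and are therefore cheap to gamma-code.
std::vector<unsigned> toDiffs(const std::vector<unsigned>& xs) {
    std::vector<unsigned> diffs;
    for (std::size_t i = 0; i < xs.size(); ++i)
        diffs.push_back(i == 0 ? xs[0] : xs[i] - xs[i - 1]);
    return diffs;
}

int main() {
    std::vector<unsigned> list = {1, 5, 29, 32, 35, 36, 37};
    for (unsigned d : toDiffs(list)) std::cout << d << ' ';   // prints: 1 4 24 3 3 1 1
    std::cout << '\n';
    return 0;
}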

Design for max hash size given N-digit numerical input and collision related target

Assume a hacker obtains a data set of stored hashes, salts, pepper, and the algorithm, and has access to unlimited computing resources. I wish to determine a maximum hash size so that the certainty of determining the original input string is nominally equal to some target certainty percentage.
Constraints:
The input string is limited to exactly 8 numeric characters, uniformly distributed. There is no inter-digit relation such as a checksum digit.
The target nominal certainty percentage is 1%.
Assume the hashing function is uniform.
What is the maximum hash size in bytes so there are nominally 100 (i.e. 1% certainty) 8-digit values that will compute to the same hash? It should be possible to generalize to N numerical digits and X% from the accepted answer.
Please include whether there are any issues with using the first N bytes of the standard 20 byte SHA1 as an acceptable implementation.
It is recognized that this approach will greatly increase susceptibility to a brute force attack by increasing the possible "correct" answers so there is a design trade off and some additional measures may be required (time delays, multiple validation stages, etc).
It appears you want to ensure collisions, the idea being that if a hacker obtained everything and, as assumed, could brute force all the hashed values, they would not end up with the original values but only with a set of possible original values for each hashed value.
You could achieve this by executing a precursor step before your normal cryptographic hashing. This precursor step simply folds your set of possible values to a smaller set of possible values. This can be accomplished by a variety of means. Basically, you are applying an initial hash function over your input values. Using modulo arithmetic as described below is a simple variety of hash function. But other types of hash functions could be used.
If you have 8-digit original strings, there are 100,000,000 possible values: 00000000 - 99999999. To ensure that 100 original values hash to the same thing, you just need to map them to a space of 1,000,000 values. The simplest way to do that would be to convert your strings to integers, perform a modulo 1,000,000 operation, and convert back to a string. Having done that, the following values would hash to the same bucket:
00000000, 01000000, 02000000, ....
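A minimal sketch of that simple fold, assuming zero-padded 8-digit decimal strings (illustration only, not a drop-in implementation):

#include <cstdio>
#include <string>

// Fold the 100,000,000 possible 8-digit strings onto 1,000,000 buckets, so
// roughly 100 originals share each bucket before the real hash is applied.
std::string prefold(const std::string& eightDigits) {
    long v = std::stol(eightDigits) % 1000000;    // e.g. "01000000" -> 0
    char buf[8];
    std::snprintf(buf, sizeof(buf), "%06ld", v);  // back to a fixed-width string
    return std::string(buf);
}

The output of prefold would then be salted and hashed with the normal cryptographic hash.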
The problem with that is that the hacker would not only know which 100 values a hashed value could be, but they would also know with certainty what 6 of the 8 digits are. If the real-life variability of digits in the actual values being hashed is not uniform over all positions, then the hacker could use that to get around what you're trying to do.
Because of that, it would be better to choose your modulo value such that the full range of digits are represented fairly evenly for every character position within the set of values that map to the same hashed value.
If different regions of the original string have more variability than other regions, then you would want to adjust for that, since the static regions are easier to just guess anyway. The part the hacker would want is the highly variable part they can't guess. By breaking the 8 digits into regions, you can perform this pre-hash separately on each region, with your modulo values chosen to vary the degree of collisions per region.
As an example, you could break the 8 digits up as 000-000-00. The pre-hash would convert each region into a separate value, perform a modulo on each, concatenate them back into an 8-digit string, and then do the normal hashing on that. In this example, given the input "12345678", you would compute 123 % 139, 456 % 149, and 78 % 47, which produces 123 009 31. There are 139 * 149 * 47 = 973,417 possible results from this pre-hash, so there will be roughly 103 original values that map to each output value. To give an idea of how this ends up working, the following 3-digit original values in the first region would all map to the value 000: 000, 139, 278, 417, 556, 695, 834, 973. I made this up on the fly as an example, so I'm not specifically recommending these choices of regions and modulo values.
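A minimal sketch of that regional variant, using the same example split (3-3-2) and the same example modulo values 139, 149, and 47; as noted above, these particular numbers are just an off-the-cuff example, not a recommendation:

#include <cstdio>
#include <string>

// Pre-hash an 8-digit string region by region: split it as 3-3-2, reduce each
// region modulo a chosen value, and re-concatenate with fixed widths.
// 139 * 149 * 47 = 973,417 possible outputs, so about 103 originals per output.
std::string regionalPrefold(const std::string& s) {           // s = "12345678"
    int a = std::stoi(s.substr(0, 3)) % 139;                  // 123 % 139 = 123
    int b = std::stoi(s.substr(3, 3)) % 149;                  // 456 % 149 = 9
    int c = std::stoi(s.substr(6, 2)) % 47;                   //  78 % 47  = 31
    char buf[9];
    std::snprintf(buf, sizeof(buf), "%03d%03d%02d", a, b, c); // "12300931"
    return std::string(buf);                                  // then hash this normally
}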
If the hacker got everything, including the source code, and brute forced it all, he would end up with the values produced by the pre-hash. So for any particular hashed value, he would know that it is one of around 100 possible values. He would know all of those possible values, but he wouldn't know which of them was THE original value that produced the hashed value.
You should think hard before going this route. I'm wary of anything that departs from standard, accepted cryptographic recommendations.

Jpeg huffman coding procedure

A Huffman table, in the JPEG standard, is generated from a collection of statistics in two steps. One of the steps implements the function/method given by this picture (this picture is given in Annex K of the JPEG standard):
The problem is this. Earlier, the standard (Annex C) says:
Huffman tables are specified in terms of a 16-byte list (BITS) giving the number of codes for each code length from
1 to 16. This is followed by a list of the 8-bit symbol values (HUFFVAL), each of which is assigned a Huffman code.
Obviously, BITS is a list of 16 elements. But in the picture above, i is first set to 32 (i = 32) and then we want to access BITS[i]. I have probably misunderstood something, so could someone please give me an answer?
Here is JPEG standard description of picture:
Figure K.3 gives the procedure for adjusting the BITS list so that no code is longer than 16 bits. Since symbols are paired
for the longest Huffman code, the symbols are removed from this length category two at a time. The prefix for the pair
(which is one bit shorter) is allocated to one of the pair; then (skipping the BITS entry for that prefix length) a code word
from the next shortest non-zero BITS entry is converted into a prefix for two code words one bit longer. After the BITS
list is reduced to a maximum code length of 16 bits, the last step removes the reserved code point from the code length
count.
Here is the code for the picture above:
#include <vector>

// Figure K.3: adjust the BITS list so that no Huffman code is longer than 16 bits.
// BITS is indexed by code length (1..32), so the vector must hold at least 33 entries.
void adjustBitLengthTo16Bits(std::vector<char>& BITS) {
    int i = 32, j = 0;
    while (true) {
        if (BITS[i] > 0) {
            // Find the next shorter non-zero code length below i - 1.
            j = i - 1;
            j--;
            while (BITS[j] <= 0)
                j--;
            // Remove a pair of codes from length i, move one code to length i - 1,
            // and turn one code of length j into a prefix for two codes of length j + 1.
            BITS[i] = BITS[i] - 2;
            BITS[i - 1] = BITS[i - 1] + 1;
            BITS[j + 1] = BITS[j + 1] + 2;
            BITS[j] = BITS[j] - 1;
            continue;
        } else {
            i--;
            if (i != 16)
                continue;
            // All lengths above 16 are now empty: remove the reserved code point
            // from the longest remaining non-zero code length.
            while (BITS[i] == 0)
                i--;
            BITS[i]--;
            return;
        }
    }
}
This code is only for encoders that want to generate their own custom Huffman tables. The majority of JPEG encoders just use fixed tables that are reasonable approximations of the statistics of most images. In this particular case, the first step in generating a Huffman table for the AC coefficients produces a table up to 32 entries (bits) long. Since there are only 256 unique symbols to encode (skip/length pairs), there should never be more than 32 bits needed to specify all of the Huffman codes. After the first pass has produced a set of codes (up to 32-bits in length), the second pass takes the least frequent (longest) codes and "moves" them into shorter length slots so that the maximum code length is 16-bits. In an ideal Huffman table, the frequency distributions correspond to the code lengths. In this case, the table is being made to fit by squeezing the longest codes into slots reserved for shorter codes. This can be done because the 14/15/16 bit length Huffman codes have "room" for more permutations of bits and can "fit" the longer codes in them.
Update:
There is limited benefit to "optimizing" the Huffman tables in JPEG. Most of the compression occurs because of the quantization and DCT transform of the pixels. Switching to arithmetic coding has a measurable benefit (~10% size reduction), but then it limits the audience since most JPEG decoders don't support arithmetic coding due to past patent issues.

Why is my filter output not accurate?

I am simulating a digital filter, which has 4 stages.
The stages are:
CIC
half-band
The OSR is 128.
The input is 4 bits and the output is 24 bits. I am confused about the 24-bit output.
I used MATLAB to generate a 4-bit signed sinusoid input (using the SD tool) and simulated it with ModelSim. So the output should also be a sinusoid. The issue is that the output only contains 4 different values.
For a 24-bit output, shouldn't we get 2^24 - 1 different values?
What is the reason for this? Is it due to the internal bit width?
I'm not familiar with ModelSim, and I don't understand the filter terminology you used, but... are your filters linear systems? If so, an input at a given frequency will cause an output at the same frequency, though possibly with a different amplitude and phase. If your input signal is a single tone, sampled such that there are four values per cycle, the output will still have four values per cycle. Unless one of the stages performs sample rate conversion, the system is behaving as expected. As Donnie DeBoer pointed out, the word width of the calculation doesn't matter as long as it can represent the four values of the input.
Again, I am not familiar with the particulars of your system, so if one of the stages does indeed perform sample rate conversion, this doesn't apply.
Forgive my lack of filter knowledge, but does one of the filter stages interpolate between the input values? If not, then you're only going to get a maximum of 2^4 output values (based on the input resolution), regardless of your output resolution. Just because you output 24 bits doesn't mean you're going to have 2^24 values... imagine running a digital square wave into a D-to-A converter. You have all the output resolution in the world, but you still only have 2 values.
It's actually pretty simple:
Even though you have 4 bits of input, your filter coefficients may be more than 4 bits wide.
Every math stage you do adds bits. If you add two 4-bit values, the answer is a 5-bit number, so that adding 0xF and 0xF doesn't overflow. When you multiply two 4-bit values, you actually need 8 bits of output to hold the answer without the possibility of overflow. By the time all the math is done, your 4-bit input apparently needs 24 bits to hold the maximum possible output.
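A quick illustration of that bit growth (my own example, not the asker's filter):

#include <cstdio>

int main() {
    // Two 4-bit operands at their maximum value.
    unsigned a = 0xF, b = 0xF;

    unsigned sum = a + b;      // 0x1E = 30, needs 5 bits
    unsigned product = a * b;  // 0xE1 = 225, needs 8 bits

    std::printf("sum = 0x%X\n", sum);
    std::printf("product = 0x%X\n", product);
    // A chain of additions and multiplications in a multi-stage filter keeps
    // widening the word in the same way, which is how a 4-bit input can end
    // up needing a 24-bit accumulator/output.
    return 0;
}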