Sequence detector independent of cycle - SystemVerilog

How do you code an FSM that can detect 1010, but where the input can stay '1' or '0' for multiple cycles? Typical FSMs detect 1010 patterns over consecutive clock cycles. Is it possible to use the same or a similar FSM to detect 1010 patterns even though '1' can stay '1' for two cycles and '0' can stay '0' for two cycles ...

While, in principle, you can use a very similar sequence-detecting FSM to detect sequences in which each symbol stays on the line for two or more cycles rather than one, this only works if each symbol is on the line for a fixed period. For example, if a 1 or 0 only counts as a sequence character when it stays on the line for two cycles, then you can either clock your FSM with a divided clock, so it runs at half the frequency, or expand your target sequence to 11001100 (for detecting 1010). This works for any sequence in which a value of 1 or 0 is defined over a fixed number of cycles, i.e., all values are held for a known period of time.
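In the expanded-sequence variant, the detector doesn't even need a hand-drawn state diagram: a shift register compared against the expanded pattern does the job. A minimal software model of that idea (sketched in C++ rather than SystemVerilog just to show the logic; in RTL this is a shift register plus an equality compare):

#include <cstdint>
#include <iostream>
#include <string>

// Fixed-period case: if every symbol is held for exactly two cycles,
// detecting 1010 is the same as detecting 11001100 sampled once per cycle.
int main() {
    std::string line = "0110011001100";  // sampled line, one char per cycle
    uint8_t shift = 0;                   // holds the last 8 samples
    for (size_t i = 0; i < line.size(); ++i) {
        shift = (shift << 1) | (line[i] == '1');
        if (shift == 0b11001100)         // expanded pattern seen
            std::cout << "match ending at cycle " << i << '\n';
    }
}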
If, however, you want to detect the sequence with values defined over a variable number of cycles, this is not really possible. To say you have found the sequence 1010 is to say the line became one, then zero, then one, then zero. If the definition of the line becoming one is that it is one for one or more cycles, then it becomes impossible to determine whether seeing 1 in the first cycle and 1 again in the second cycle is a single one or two ones. For example, take the short sequence below:
11001100110000
In the fixed case, with one symbol every clock (the case you are familiar with), the sequence is just how it reads: 11001100110000. In the case I described above, with a fixed period of one symbol every two clocks, the sequence is 1010100. However, if we say that a symbol can span a variable number of clocks, the sequence above can be resolved into any number of sequences:
101010
1101010,
1001010,
1011010,
1010010,
1010110,
1010100,
10101000,
101010000,
11001010,
11011010,
11010010,
...,
11001100110000.
Without defining what a symbol is, it becomes difficult to create an FSM, as any input sequence quickly permutes into a very large number of possible sequences. It thus becomes impossible to determine the sender's intent without more information. Either you need another signal to determine when to sample the line, or some other way of distinguishing what the interpretation is supposed to be.
If you do want to make an FSM that determines whether any of these permutations matches a given sequence, you can do that by boiling your sequence down to its minimum requirements. In our example sequence above (11001100110000), if we wanted to see whether it CAN be interpreted as your sequence (1010), we can find the elements needed to make a possible sequence 1010: you have to see 1 for some time (i.e., at least one 1), then 0 for some time (at least one 0), then 1 for some time (at least one 1), then 0 for some time (at least one 0). Note that any alternating line would thus match.
In the case of a non-alternating sequence like 11001, we could look at the line sequence (1100110011000) for at least two ones, then at least two zeroes, then at least one one. In the FSMs you're familiar with, seeing the pattern 11000 would result in a return arc back to start, i.e., the pattern 11001 was not seen. However, in the variable-length case, the FSM would stay in the "seen 1100" state, as 11000 could still be interpreted as 1100.
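To make that concrete, here is a minimal software model of the "CAN be interpreted as 1010" detector described above (C++ rather than SystemVerilog, to keep the sketch compact; the case statement translates directly to an always_ff block):

#include <iostream>
#include <string>

// The FSM only advances on a level change, so a run of identical
// samples counts as a single symbol of the sequence.
enum State { IDLE, SEEN_1, SEEN_10, SEEN_101, SEEN_1010 };

State step(State s, char in) {
    bool one = (in == '1');
    switch (s) {
        case IDLE:      return one ? SEEN_1 : IDLE;
        case SEEN_1:    return one ? SEEN_1 : SEEN_10;     // holding 1 is fine
        case SEEN_10:   return one ? SEEN_101 : SEEN_10;   // holding 0 is fine
        case SEEN_101:  return one ? SEEN_101 : SEEN_1010;
        case SEEN_1010: return s;                          // latch the match
    }
    return IDLE;
}

int main() {
    std::string line = "11001100110000";  // the example sequence above
    State s = IDLE;
    for (char c : line) s = step(s, c);
    std::cout << (s == SEEN_1010 ? "can match 1010" : "no match") << '\n';
}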
I'm not sure how useful this would be, as the intended sequence sent on the line cannot be determined, due to the problem I mentioned: you cannot determine what constitutes a symbol.


Matlab random number rng: choosing a seed

I would like to know more precisely what happens when you choose a custom seed in Matlab, e.g.:
rng(101)
From my (limited, but nevertheless existing) understanding of how pseudo-random number generators work, one can see the seed conceptually as choosing a position in a "very long list of pseudo-random numbers".
Question: let's say, in my Matlab script, I choose rng(100) for my first computation (a sequence of instructions) and then rng(1e6) for my second. Please note that each time I do some computations, it involves generating up to about 300k random numbers (each time).
-> Does that imply that I have made sure there is no overlap between the sequence in the "list" starting at 100 and ending around 300k, and the one starting at 1e6 and ending around 1,300,000? (The idea of "no overlap" comes from the fact that rng(100) and rng(1e6) are separated by much more than 300k.)
I.e., that these are two "independent" sequences? (As far as I remember, this "long list" would be generated by a special PRNG algorithm, most likely involving modular arithmetic...?)
No, that is not the case. The mapping between the seed and the "position" in our list of generated numbers is not linear; you could actually interpret it as a hash/one-way function. It could actually happen that we get the same sequence of numbers shifted by one position (but it is very unlikely).
By default, MATLAB uses the Mersenne Twister (source).
Not quite. The seed you give to rng is the initialization point for the Mersenne Twister algorithm (used by default) that generates the pseudorandom numbers. If you choose two different seeds (no matter their relative non-negative integer values, except for maybe a special case or two), you will get effectively independent pseudorandom number streams.
For "99%" of people, the major uses of seeding the rng are using the 'shuffle' argument (to use a non-default seed based on the time to help ensure independence of numbers generated across multiple sessions), or to give it one particular seed (to be able to reproduce the same pseudorandom stream at a later date). If you try to finesse the seeds further without being extremely careful, you are more likely to cause issues than do anything helpful.
RandStream can be used to break off separate streams of pseudorandom numbers if that really matters for your application (it likely doesn't).
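For intuition, here is the seeding behavior sketched in C++: std::mt19937 implements the same Mersenne Twister algorithm that MATLAB uses by default (the wrapping around the raw generator differs, so treat this as an analogy rather than a reproduction of MATLAB's exact stream):

#include <iostream>
#include <random>

int main() {
    std::mt19937 a(100), b(100), c(1000000);
    // Same seed, same stream: a fixed seed is how you reproduce results.
    std::cout << a() << ' ' << b() << '\n';  // identical values
    // A different seed re-initializes the whole generator state; it does
    // NOT jump to "position 1e6" in the stream seeded with 100.
    std::cout << c() << '\n';
}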

Elias Gamma Coding and upper bound

While reading about Elias Gamma coding on wikipedia, I see it mentions that:
"Gamma coding is used in applications where the largest encoded value is not known ahead of time."
and that:
"It is used most commonly when coding integers whose upper-bound cannot be determined beforehand."
I don't really understand what is meant by these sentences, because whenever this algorithm is coded, the largest value of the test data, or the range of the test data, would be known beforehand. Any help is appreciated!
As far as I'm acquainted with Elias gamma/delta encoding, the first sentence simply states that these compression methods are global, which means that they do not rely on the input data to generate the code. In other words, these methods do not need to process the input before performing the compression (as local methods do); they compress the data with a function that does not depend on information from the database.
As for the second sentence, it may be taken as a guarantee that, although there may be some very large integers, the encoding will still perform well (and will represent such values with a feasible number of bits, i.e., it is a universal method). Notice that, if you knew the biggest integer in advance, some approaches (like minimal hashes) could perform better.
As a last consideration, the same page you referred to also states that:
Gamma coding is used in applications where the largest encoded value is not known ahead of time, or to compress data in which small values are much more frequent than large values.
This may be obtained by generating lists of differences from the original lists of integers, and passing those differences to be compressed instead. For example, in a list of increasing numbers, you could generate:
list: 1 5 29 32 35 36 37
diff: 1 4 24 3 3 1 1
Which will give you many more small numbers, and therefore a greater level of compression, than the first list.
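To make the code itself concrete, here is a minimal C++ sketch that gamma-codes each gap from the diff list above; note how the codeword length grows with the value, so small gaps cost only a few bits:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Elias gamma code for n >= 1: floor(log2 n) zero bits, followed by
// the binary representation of n (floor(log2 n) + 1 bits, leading 1).
std::string elias_gamma(uint64_t n) {
    int len = 0;                        // floor(log2 n)
    for (uint64_t t = n; t > 1; t >>= 1) ++len;
    std::string out(len, '0');          // unary length prefix
    for (int i = len; i >= 0; --i)      // binary body, MSB first
        out += ((n >> i) & 1) ? '1' : '0';
    return out;
}

int main() {
    std::vector<uint64_t> list = {1, 5, 29, 32, 35, 36, 37};
    uint64_t prev = 0;
    for (uint64_t x : list) {           // gamma-code the gaps, not the values
        std::cout << elias_gamma(x - prev) << ' ';
        prev = x;
    }
    std::cout << '\n';  // 1 00100 000011000 011 011 1 1
}

The gaps of 1 cost a single bit each, while the one large gap (24) costs nine bits; no upper bound on the values ever had to be fixed in advance.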

Specified Length Unique ID Generation

I need to create unique, random alphanumeric IDs of a set length. Ideally I would store a counter in my database starting at 0, and every time I need a unique ID I would get the counter value (0), run it through a hashing function giving it a set length (probably 4-6 characters) [ID = Hash(Counter, 4);], which would return my new ID (e.g. 7HU9); then I would increment my counter (0++ = 1).
I need to keep the IDs short so they can be remembered or shared easily. Security isn't a big issue, so I'm not worried about people trying random IDs, but I don't want the IDs to be predictable: there can't be an opportunity for a user to notice that the IDs increment by 3 every time, allowing them to work their way backwards through the IDs and download the ID data one by one (e.g. A5F9, A5F6, A5F3, A5F0 == BAD).
I don't want to just loop through random strings checking for uniqueness, since this would increase database load over time as keys are used up. The intention is that hashing a unique incrementing counter would guarantee ID uniqueness up to a certain counter value, at which point the length of the generated IDs would be increased by one and the counter reset, continuing this pattern forever.
Does anybody know of any hashing functions which would suit this need, or have any other ideas?
Edit: I do not need to be able to reverse the function to get the counter value back.
The tough part, as you realize, is guaranteeing a no-collision sequence.
If "not obvious" is the standard you need for guessing the algorithm, a simple mixed congruential RNG of full period - or rather a sequence of them with increasing moduli, to satisfy the requirement for growth over time - might be what you want. This is not the hash approach you're asking for, but it ought to work.
This presentation covers the basics of MCRNGs and sufficient conditions for full period in a very concise form. There are many others.
You'd first use the lowest-modulus MCRNG, starting with an arbitrary seed, until you've "used up" its cycle, and then advance to the next largest modulus.
You will want to "step" the moduli to ensure uniqueness. For example, if your first IDs are 12 bits, so you have a modulus M1 <= 2^12 (but not much less than that), then when you advance to 16 bits you'd want to pick the second modulus M2 <= 2^16 - M1. The second tier of IDs would be M1 + x_i, where x_i is the i'th output of the second RNG. A 32-bit third tier would have modulus 2^32 - M2, and its output would be M2 + y_i, where y_i is its output, etc.
The only persistent storage required is the last ID generated and the index of the current MCRNG in the sequence.
Someone with time on their hands could guess this algorithm without too much trouble. But a casual user would be unlikely to do so.
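Here is a minimal sketch of one tier of this scheme in C++. The constants a and c are hypothetical picks that satisfy the Hull-Dobell full-period conditions for m = 36^4 (c coprime to m; a - 1 divisible by m's prime factors 2 and 3, and by 4), so the generator visits every value in [0, m) - i.e., every 4-character ID - exactly once before repeating:

#include <cstdint>
#include <iostream>
#include <string>

// Full-period mixed congruential generator: x' = (a*x + c) mod m.
struct McRng {
    static constexpr uint64_t m = 1679616;  // 36^4 = 2^8 * 3^8
    static constexpr uint64_t a = 66013;    // a % 12 == 1 (hypothetical pick)
    static constexpr uint64_t c = 1013;     // coprime to m (hypothetical pick)
    uint64_t x;
    explicit McRng(uint64_t seed) : x(seed % m) {}
    uint64_t next() { return x = (a * x + c) % m; }
};

// Fixed-length base-36 encoding of a state.
std::string to_id(uint64_t v, int len) {
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    std::string s(len, '0');
    for (int i = len - 1; i >= 0; --i) { s[i] = digits[v % 36]; v /= 36; }
    return s;
}

int main() {
    McRng rng(42);  // in practice, persist only the last state
    for (int i = 0; i < 5; ++i)
        std::cout << to_id(rng.next(), 4) << '\n';  // distinct, scrambled-looking IDs
}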
Let's say your counter ranges from 1 to 10000. Slice [1, 10000] into 10 small units, each containing 1000 numbers. Each unit keeps track of its last issued ID.
unit-1 unit-2 unit-10
[1, 1000], [1001, 2000], ... , [9001, 10000]
When you need an ID, just randomly select one of units 1-10 and take that unit's next ID.
e.g.
The first time, your counter is 1 and the random selection is unit-2, so you get ID = 1001.
The second time, your counter is 2 and the random selection is unit-1, so you get ID = 1.
The third time, your counter is 3 and the random selection is unit-2, so you get ID = 1002.
...and so on.
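A quick C++ sketch of this unit scheme (a real version must also skip units whose 1000 IDs are exhausted; that check is omitted here):

#include <array>
#include <cstdint>
#include <iostream>
#include <random>

int main() {
    std::array<uint32_t, 10> issued{};          // IDs handed out per unit
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> pick(0, 9);
    for (int req = 0; req < 3; ++req) {
        int u = pick(rng);                      // random unit 0..9
        uint32_t id = u * 1000 + 1 + issued[u]++;  // unit u owns [u*1000+1, (u+1)*1000]
        std::cout << "unit-" << (u + 1) << " -> ID " << id << '\n';
    }
}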
(This was a while ago but I should write up what I ended up doing...)
The idea I came up with was actually pretty simple. I wanted alphanumeric pins, so that works out to 36 possibilities for each character, and I wanted to start with 4-character pins, so that works out to 36^4 = 1,679,616 possible pins. I realized that all I wanted to do was take all of these possible pins and throw away a percentage of them in a random way, such that a human being had a low chance of randomly finding one. So I divide 1,679,616 by 100, multiply my counter by a random number between 1 and 100, and then encode that number as my alphanumeric pin. Problem solved!
By guessing a random combination of 4 letters and numbers, you have a 1 in 100 chance of actually guessing a real in-use pin, which is all I really wanted. In my implementation I increment the pin length once the available pin space is exhausted, and everything has worked perfectly! It's been running for about 2 years now!
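The arithmetic above is terse, so here is one reading of it, sketched in C++, that actually guarantees uniqueness: give each counter value its own bucket of 100 code points and pick one of them at random, so distinct counters can never collide and only 1 in 100 of the pin space is ever live. The bucket interpretation is my assumption, not necessarily the poster's exact formula:

#include <cstdint>
#include <iostream>
#include <random>
#include <string>

// Counter k owns the 100 code points [k*100, k*100 + 99]; a random
// offset inside the bucket makes consecutive pins unguessable.
// Works while counter < 36^4 / 100 (about 16,796); after that,
// grow the pin length, as described above.
std::string next_pin(uint64_t& counter) {
    static std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<uint64_t> offset(0, 99);
    uint64_t code = counter++ * 100 + offset(rng);
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    std::string pin(4, '0');
    for (int i = 3; i >= 0; --i) { pin[i] = digits[code % 36]; code /= 36; }
    return pin;
}

int main() {
    uint64_t counter = 0;  // persisted in the database in practice
    for (int i = 0; i < 5; ++i) std::cout << next_pin(counter) << '\n';
}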

Sieve of Eratosthenes (reducing space complexity)

I wanted to generate prime numbers between two given numbers ‘a’ and ‘b’ (b > a). What I did was store Boolean values in an array of size b-1 (that is, for the numbers 2 to b) and then apply the sieve method.
Is there a better way, that reduces space complexity, if I don't need all prime numbers from 2 to b?
You need to store all primes smaller than or equal to the square root of b; then, for each number between a and b, check whether it is divisible by any of these primes (and does not itself equal one of them). So in our case the magic number is sqrt(b).
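A direct C++ sketch of this answer: sieve the primes up to sqrt(b) once, then trial-divide each candidate in [a, b] by them:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<uint64_t> primes_between(uint64_t a, uint64_t b) {
    // Small sieve up to sqrt(b): the only storage that scales with b.
    uint64_t root = static_cast<uint64_t>(std::sqrt(static_cast<double>(b)));
    std::vector<char> is_prime(root + 1, 1);
    std::vector<uint64_t> small, out;
    for (uint64_t i = 2; i <= root; ++i)
        if (is_prime[i]) {
            small.push_back(i);
            for (uint64_t j = i * i; j <= root; j += i) is_prime[j] = 0;
        }
    // Trial-divide each candidate by primes p with p*p <= n.
    for (uint64_t n = std::max<uint64_t>(a, 2); n <= b; ++n) {
        bool prime = true;
        for (uint64_t p : small) {
            if (p * p > n) break;
            if (n % p == 0) { prime = false; break; }
        }
        if (prime) out.push_back(n);
    }
    return out;
}

int main() {
    for (uint64_t p : primes_between(10, 30)) std::cout << p << ' ';
    std::cout << '\n';  // 11 13 17 19 23 29
}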
You can use segmented sieve of Eratosthenes. The basic idea is pretty simple.
In a typical sieve, we start with a large array of Booleans, all set to the same value. These represent odd numbers, starting from 3. We look at the first and see that it's true, so we add it to the list of prime numbers. Then we mark off every multiple of that number as not prime.
Now, the problem with this is that it's not very cache friendly. As we mark off the multiples of each number, we go through the entire array. Then when we reach the end, we start over from the beginning (which is no longer in the cache) and walk through the entire array again. Each time through the array, we read the entire array from main memory again.
For a segmented sieve, we do things a bit differently. We start by finding only the primes up to the square root of the limit we care about. Then we use those to mark off composites in the main array. The difference here is the order in which we mark off multiples. Instead of marking off all the multiples of three, then all the multiples of five, and so on, we start by marking off the multiples of three only for the data that will fit in the cache. Then, instead of continuing on to more data in the array, we go back and mark off the multiples of five for the data that fits in the cache. Then the multiples of seven, and so on.
Then, when we've marked off all the multiples in that cache-sized chunk of data, we move on to the next cache-sized chunk of data. We start over with marking off multiples of 3 in this chunk, then multiples of 5, and so on until we've marked off all the multiples in this chunk. We continue that pattern until we've marked off all the non-prime numbers in all the chunks, and we're done.
So, given N primes below the square root of the limit we care about, a naive sieve will read the entire array of Booleans N times. By contrast, a segmented sieve will only read each chunk of the data once. Once a chunk of data is read from main memory, all the processing on that chunk is done before any more data is read from main memory.
The exact speed-up this gives will depend on the ratio of the speed of cache to the speed of main memory, the size of the array you're using vs. the size of the cache, and so on. Nonetheless, it is generally pretty substantial--for example, on my particular machine, looking for the primes up to 100 million, the segmented sieve has a speed advantage of about 10:1.
One thing you must remember if you're using C++: there is a well-known issue with std::vector<bool>. Under C++98/03, vector<bool> was required to be a specialization that stored each Boolean as a single bit, with some proxy trickery to get bool-like behavior. That requirement has since been lifted, but many libraries still include it.
With a non-segmented sieve, it's generally a useful trade-off. Although it requires a little extra CPU time to compute masks and such to modify only a single bit at a time, it saves enough bandwidth to main memory to more than compensate.
With a segmented sieve, bandwidth to main memory isn't nearly as large a factor, so using a vector<char> generally seems to give better results (at least with the compilers and processors I have handy).
Getting optimal performance from a segmented sieve does require knowledge of the size of your processor's cache, but getting it precisely correct isn't usually critical--if you assume the size is smaller than it really is, you won't necessarily get optimal use of your cache, but you usually won't lose a lot either.
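Putting the pieces together, here is a minimal segmented sieve sketch in C++, using vector<char> as suggested above; the segment size of 2^16 is a hypothetical stand-in for your cache size, per the tuning advice in the last paragraph:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Primes in [a, b] using O(sqrt(b) + segment) memory.
std::vector<uint64_t> primes_in_range(uint64_t a, uint64_t b) {
    // 1. Ordinary sieve up to sqrt(b) for the base primes.
    uint64_t root = static_cast<uint64_t>(std::sqrt(static_cast<double>(b)));
    std::vector<char> small(root + 1, 1);
    std::vector<uint64_t> base;
    for (uint64_t i = 2; i <= root; ++i)
        if (small[i]) {
            base.push_back(i);
            for (uint64_t j = i * i; j <= root; j += i) small[j] = 0;
        }
    // 2. Sieve [a, b] one cache-sized segment at a time.
    const uint64_t seg_size = 1 << 16;  // hypothetical; tune to your cache
    std::vector<uint64_t> result;
    for (uint64_t lo = std::max<uint64_t>(a, 2); lo <= b; lo += seg_size) {
        uint64_t hi = std::min(lo + seg_size - 1, b);
        std::vector<char> seg(hi - lo + 1, 1);
        for (uint64_t p : base) {
            // First multiple of p in [lo, hi], but never p itself.
            uint64_t start = std::max(p * p, (lo + p - 1) / p * p);
            for (uint64_t j = start; j <= hi; j += p) seg[j - lo] = 0;
        }
        for (uint64_t i = lo; i <= hi; ++i)
            if (seg[i - lo]) result.push_back(i);
    }
    return result;
}

int main() {
    for (uint64_t p : primes_in_range(90, 120)) std::cout << p << ' ';
    std::cout << '\n';  // 97 101 103 107 109 113
}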

redundant encoding?

This is more of a computer science / information theory question than a straightforward programming one, so if anyone knows of a better site to post this, please let me know.
Let's say I have an N-bit piece of data that will be sent redundantly in M messages, where at least M-1 of those messages will be received successfully. I am interested in different ways of encoding the N-bit piece of data in fewer bits per message. (this is similar to RAID but at a much smaller level, where N = 8 or 16 or 32)
Example: suppose N = 16 and M = 4. Then I could use the following algorithm:
1st and 3rd message: send "0" + bits 0-7
2nd and 4th message: send "1" + bits 8-15
If I can guarantee that 3 of the 4 messages will get through, then at least one message from each group will get through. Thus I can make this work with 9 bits per message or fewer; there's probably a way to do this with fewer total bits, but I'm not sure how.
Are there some simple encoding/decoding algorithms to do this kind of thing? Does this problem have a name? (if I know what it's called, I can google it!)
note: in my particular case, the messages either arrive correctly or do not arrive at all (no messages arrive with errors).
(edit: moved 2nd part to a separate question)
(Incomplete answer follows. I may add more later.)
The term you may be interested in is channel coding: adding redundancy to a source in order to make it robust during transmission over a noisy channel. In information theory, the complementary problem to channel coding is source coding: reducing the redundancy in a source to represent it using fewer bits. (The combination of these two problems is called joint source-channel coding.)
Your first question asks to find a channel code. The simple example you give is similar to a repetition code, i.e., you send the same message more than twice (usually an odd number of times), and then the message which is received most often is accepted as the original message.
This code is inefficient. To use standard notation, let k = number of bits in original message, and n = number of bits in the transmitted message. For your example, k = 16 and n = 36. A measure of coding efficiency is k/n, where higher means more efficient. In your case, k/n = 0.44. This is low.
The repetition code is a simple kind of block code, i.e., redundancy is added to each block of k bits to create a codeword of n bits. So are the Hamming and Reed-Solomon codes as others mentioned. Hamming codes are relatively easy to understand with some basic linear algebra.
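As a concrete taste of a small block code (standard textbook material, not something specific to the question), here is a Hamming(7,4) encoder and single-error corrector in C++; at k/n = 4/7 (about 0.57) it already beats the repetition code's 0.44:

#include <cstdint>
#include <iostream>

// Hamming(7,4): 4 data bits become 7 code bits; any single flipped
// bit can be located and corrected. Parity bits sit at codeword
// positions 1, 2, 4; data bits at positions 3, 5, 6, 7.
uint8_t encode(uint8_t data) {
    uint8_t d1 = data & 1, d2 = (data >> 1) & 1,
            d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;  // covers positions 1,3,5,7
    uint8_t p2 = d1 ^ d3 ^ d4;  // covers positions 2,3,6,7
    uint8_t p4 = d2 ^ d3 ^ d4;  // covers positions 4,5,6,7
    return p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
           (d2 << 4) | (d3 << 5) | (d4 << 6);
}

uint8_t correct(uint8_t cw) {
    auto bit = [&](int pos) { return (cw >> (pos - 1)) & 1; };
    int syndrome = (bit(1) ^ bit(3) ^ bit(5) ^ bit(7))
                 | (bit(2) ^ bit(3) ^ bit(6) ^ bit(7)) << 1
                 | (bit(4) ^ bit(5) ^ bit(6) ^ bit(7)) << 2;
    return syndrome ? cw ^ (1 << (syndrome - 1)) : cw;  // syndrome = error position
}

int main() {
    uint8_t cw = encode(0b1011);
    uint8_t corrupted = cw ^ (1 << 4);  // flip the bit at position 5
    std::cout << "recovered: " << (correct(corrupted) == cw) << '\n';  // prints 1
}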
These should be enough terms for you to search on your own. Good luck.
I'm not sure if I understood all the details of your question correctly, but your problem is definitely about designing some kind of error-correcting code. This is a vast area of computer science, and thick tomes have been written about it. Start with Wikipedia and see if you can get any simple schemes (like Hamming or Reed-Solomon codes) to work in your case.
If you want to deal not only with symbol corruption but also with deletion of symbols, you should look at erasure codes; this is definitely a more difficult task, but good methods exist in many cases.
EDIT: This material from hackersdelight.org seems a nice introduction.
See erasure codes.
You're looking for a packet erasure code. There are only two useful packet erasure codes that are not totally encumbered by patents, and there's only one open-source library to implement those. Find it here: http://planete-bcast.inrialpes.fr/rubrique.php3?id_rubrique=5
Here's a trivially simple scheme that's almost twice as efficient as your example.
You chopped the message into blocks of (N/M)*2 bits. Instead, chop it into M-1 blocks of N/(M-1) bits each. (Round up if necessary.) The first block, src[0], encodes as itself: enc[0]=src[0]. Likewise, the last source block is sent as itself in the extra message: enc[M-1]=src[M-2]. Each of the other blocks gets XORed with its left neighbor: enc[i]=src[i-1]^src[i].
Prefix each encoded block with a log(M)-bit sequence number, essentially as you did, so the receiver can tell which was dropped. (If you can be sure that whichever blocks arrive will arrive in order, then a 1-bit sequence number will do. Just alternate 0 and 1.)
To decode, successively XOR from the left and the right until you hit the dropped block. E.g. src[1] == enc[0]^enc[1]. (Dropping one of the endpoint blocks isn't a special case -- e.g. if the first block is dropped, the scan from the right recovers it, and the scan from the left is of length 0.)
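Here is a compact C++17 sketch of the scheme; sequence numbers are omitted, and a dropped message is modeled as nullopt:

#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

using Block = uint8_t;  // one block = N/(M-1) bits; 8 bits here for simplicity

// M-1 source blocks -> M encoded blocks, any one of which may be dropped.
std::vector<Block> encode(const std::vector<Block>& src) {
    std::vector<Block> enc;
    enc.push_back(src.front());              // enc[0] = src[0]
    for (size_t i = 1; i < src.size(); ++i)
        enc.push_back(src[i - 1] ^ src[i]);  // enc[i] = src[i-1] ^ src[i]
    enc.push_back(src.back());               // enc[M-1] = src[M-2]
    return enc;
}

std::vector<Block> decode(const std::vector<std::optional<Block>>& enc) {
    size_t m = enc.size();
    std::vector<Block> src(m - 1);
    // XOR from the left up to the dropped block...
    size_t i = 0;
    for (Block acc = 0; i < m - 1 && enc[i]; ++i)
        src[i] = acc ^= *enc[i];
    // ...and from the right back down to it.
    size_t j = m - 1;
    for (Block acc = 0; j > i && enc[j]; --j)
        src[j - 1] = acc ^= *enc[j];
    return src;
}

int main() {
    std::vector<Block> src = {0xDE, 0xAD, 0xBE};   // M-1 = 3 blocks, M = 4
    auto enc = encode(src);
    std::vector<std::optional<Block>> rx(enc.begin(), enc.end());
    rx[1] = std::nullopt;                          // one message dropped
    for (Block b : decode(rx)) std::cout << std::hex << int(b) << ' ';
    std::cout << '\n';                             // de ad be
}

Dropping rx[0] or rx[3] instead exercises the endpoint cases: the scan from the other side recovers everything, and the scan toward the drop has length 0.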