I have a series of numbers, always starting from 64 and ending at 8. The sequence of numbers could be 64, 63, 62, 56, 50, 30, 29, 28, 27, 8. It can also be 64, 55, 27, 26, 16, 15, 14, 13, 12, 9, 8, and many other variations.
There are some fixed parameters in my sequence:
- For each 8 bytes of data in a file, there is one sequence of numbers. So if a file is 8 KB, there are approximately 1,000 of these sequences (of course some sequences repeat and some are unique).
- The numbers are always descending,
- The maximum number is 64 and the minimum number is 8, in all sequences,
- There's no fixed count of numbers in a sequence. Sometimes it consists of every number from 64 down to 8, sometimes 10 numbers, sometimes 30, sometimes more or less.
I want to be able to save these numbers in less than 56 bits, ideally much less than that. For example, if I assign a single bit to each possible number, turning the bit on (1) when that number is present in the sequence and off (0) when it is not, there are normally 56 bits of data to save per sequence.
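For concreteness, here is a minimal C sketch of that bitmask idea (the function name and layout are mine; one bit per value from 8 to 63, with 64 left implicit since it is always present):

#include <stdint.h>

/* One bit per possible value 8..63; 64 is always present, so it is
 * left implicit. That gives the 56-bit mask described above. */
uint64_t encode_mask(const int *seq, int len) {
    uint64_t mask = 0;
    for (int i = 0; i < len; i++)
        if (seq[i] < 64)                   /* 64 itself is implicit */
            mask |= 1ULL << (seq[i] - 8);  /* value v -> bit (v - 8) */
    return mask;
}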
Given that the whole file is composed of these numbers between 64 and 8, is there any way to store the sequences efficiently in a smaller file?
One can use _mm256_packs_epi32 as follows: __m256i e = _mm256_packs_epi32(ai, bi);
In the debugger, I see the value of ai: m256i_i32 = {0, 1, 0, 1, 1, 1, 0, 1}. I also see the value of bi: m256i_i32 = {1, 1, 1, 1, 0, 0, 0, 1}. The packing gave me e: m256i_i16 = {0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1}. The packing is interleaved: e contains, in order, the first four numbers of ai, the first four of bi, the last four of ai, then the last four of bi.
I am wondering if there is an instruction that just packs ai and bi side by side without the interleaving.
vpermq after packing would work, but I'm wondering if there's a single instruction to achieve this.
No sequential-across-lanes pack until AVX-512, unfortunately. (And even then only for 1 register, or not with saturation.)
The in-lane behaviour of shuffles like vpackssdw and vpalignr is one of the major warts of AVX2 that makes the 256-bit versions of those shuffles less useful than their __m128i versions. But on Intel, and on Zen 2 CPUs, it is often still best to use __m256i vectors with a vpermq at the end, if you need the elements in a specific order. (Or vpermd with a vector constant after 2 levels of packing: How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?)
If your 32-bit elements came from unpacking narrower elements, and you don't care about order of the wider elements, you can widen with in-lane unpacks, which sets you up to pack back into the original order.
This is cheap for zero-extending unpacks: _mm256_unpacklo/hi_epi16 (with _mm256_setzero_si256()). That's as cheap as vpmovzxwd (_mm256_cvtepu16_epi32), and is actually better because you can do 256-bit loads of your source data and unpack two ways, instead of narrow loads to feed vpmovzx... which only works on data at the bottom of an input register. (And memory-source vpmovzx... ymm, [mem] can't micro-fuse the load with a YMM destination, only for the 128-bit XMM version, on Intel CPUs, so the front-end cost is the same as separate load and shuffle instructions.)
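For example, here is a minimal sketch of that pattern (function name is mine; it assumes src and dst point at 16 uint16_t elements, and that the processed results still fit in 16 bits so the pack's unsigned saturation is a no-op):

#include <immintrin.h>
#include <stdint.h>

/* Zero-extend 16 uint16_t elements to 32-bit with in-lane unpacks,
 * process, then pack back into the original element order: no vpermq
 * needed, because unpack lo/hi and the pack are inverse in-lane shuffles. */
void widen_process_narrow(uint16_t *dst, const uint16_t *src) {
    __m256i v    = _mm256_loadu_si256((const __m256i *)src);
    __m256i zero = _mm256_setzero_si256();
    __m256i lo   = _mm256_unpacklo_epi16(v, zero);  /* elements 0-3, 8-11  */
    __m256i hi   = _mm256_unpackhi_epi16(v, zero);  /* elements 4-7, 12-15 */
    /* ... process lo and hi as 32-bit elements here ... */
    __m256i packed = _mm256_packus_epi32(lo, hi);   /* original order back */
    _mm256_storeu_si256((__m256i *)dst, packed);
}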
But that trick doesn't work quite as nicely for data you need to sign-extend. vpcmpgtw to get high halves for vpunpckl/hwd does work, but vpermq when re-packing is about as good, just different execution-port pressure. So vpmovsxwd is simpler there.
Slicing up your data into odd/even instead of low/high can also work, e.g. to get 16-bit elements zero-extended into 32-bit elements:
auto veven = _mm256_and_si256(v, _mm256_set1_epi32(0x0000FFFF));  // even elements: clear the high word of each dword
auto vodd  = _mm256_srli_epi32(v, 16);                            // odd elements: shift down into the low word
After processing, one can recombine with a shift and vpblendw. (1 uop for port 5 on Intel Skylake / Ice Lake). Or for bytes, vpblendvb with a control vector, but that costs 2 uops on Intel CPUs (but for any port), vs. only 1 uop on Zen2. (Those uop counts aren't including the vpslld ymm, ymm, 16 shift to line up the odd elements back with their starting points.)
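As a concrete sketch (function name mine), the recombination for the 16-bit odd/even case could look like this, including that vpslld:

#include <immintrin.h>

/* Recombine after processing: shift the odd results back up into the
 * high word of each dword, then vpblendw them over the even results.
 * (0xAA selects words 1,3,5,7 of each 128-bit lane from the 2nd input.) */
__m256i recombine(__m256i veven, __m256i vodd) {
    return _mm256_blend_epi16(veven, _mm256_slli_epi32(vodd, 16), 0xAA);
}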
Even with AVX-512, the situation isn't perfect. You still can't use a single shuffle uop to combine 2 vectors into one of the same width.
There's very nice single-vector narrowing with truncation, or signed or unsigned saturation, for any pair of element sizes like an inverse of vpmovzx / sx. e.g. qword to byte vpmov[su]qb, with an optional memory destination.
(Fun fact: vpmovdb [rdi]{k1}, zmm0 was the only way Xeon Phi (lacking both AVX-512BW and AVX-512VL) could do byte-masked stores to memory; that might be why these exist in memory-destination form. On mainstream Intel like Skylake-X / Ice Lake, the memory-destination versions are no cheaper than a separate pack into a register and then a store. https://uops.info/)
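As a sketch of that single-vector narrowing (function names mine), the qword-to-byte case with signed saturation, in register and memory-destination forms:

#include <immintrin.h>

/* vpmovsqb: narrow 8 qwords to 8 bytes with signed saturation. */
__m128i narrow_q_to_b(__m512i v) {
    return _mm512_cvtsepi64_epi8(v);          /* register destination */
}

/* Memory-destination form with a byte mask (vpmovsqb [mem]{k1}). */
void narrow_q_to_b_store(char *dst, __mmask8 k, __m512i v) {
    _mm512_mask_cvtsepi64_storeu_epi8(dst, k, v);
}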
AVX-512 also has nice 2-input shuffles with a control vector, so for dword-to-word truncation you could use vpermt2w zmm1, zmm2, zmm3. But that needs a shuffle control vector, and vpermt2w is 3 uops on SKX and IceLake. (t2d and t2q are 1 uop). vpermt2b is only available in AVX-512VBMI (Ice Lake), and is also 3 uops there.
(Unlike vpermb, which is 1 uop on Ice Lake, and AVX-512BW vpermw, which is still 2 uops on Ice Lake. So they didn't reduce the front-end cost of the backwards-compatible instruction, but ICL can run 1 of its 2 uops on port 0 or 1, instead of both on the shuffle unit on port 5. Perhaps ICL has one uop that preprocesses the shuffle control into a vpermb control or something, which would also explain the improved latency: 3 cycles for data->data, 4 cycles for control->data, vs. 6 cycles on SKX for the 2p5 uops, apparently a serial dependency starting with both the control and data vectors.)
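For reference, a sketch of the vpermt2w dword-to-word truncation mentioned above (the control vector just selects the low word of every dword across both inputs):

#include <immintrin.h>

/* Truncate 32 dwords (two vectors) to 32 words with vpermt2w.
 * Index bit 5 selects the second source, so a's low words are at
 * 0,2,...,30 and b's at 32,34,...,62. Requires AVX-512BW. */
__m512i truncate_dw_to_w(__m512i a, __m512i b) {
    const __m512i idx = _mm512_set_epi16(
        62, 60, 58, 56, 54, 52, 50, 48, 46, 44, 42, 40, 38, 36, 34, 32,
        30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10,  8,  6,  4,  2,  0);
    return _mm512_permutex2var_epi16(a, idx, b);
}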
Suppose we have to solve a global optimization problem in which we have to find the values of 5 variables, all of which are integers. Assume we get the following two parent chromosomes:
Parent 1: 6, 10, 3, 5, 12
Parent 2: 12, 10, 3, 8, 11
If we do crossover after the first 2 elements, we get the following:
Child 1: 6, 10, 3, 8, 11
Child 2: 12, 10, 3, 5, 12
Here we can clearly see that the children are related to the parents.
But when we encode chromosomes as bit strings, each one becomes a single string of bits, and we can choose any point for crossover at random. I do not see how this is any more beneficial than picking trial solutions completely at random.
I have a similar question about mutation. We randomly flip a bit. If the flipped bit has a small place value, the change will be small; if it has a big place value, the change will be big. How is that better than randomly changing a chromosome completely?
Binary encoding is still common mainly because the first works on GAs used that encoding.
Furthermore, it's often space efficient: [6, 10, 3, 5, 12] represented as a sequence of integers would probably require 5 * 32 bits; with a bit string representation, 5 * 4 bits are enough (assuming numbers in the [0, 15] range).
In this respect, the knapsack problem is the best case for the bit-string representation (each bit says whether the corresponding object is in the knapsack).
we can choose any point for crossover at random. I do not see how this is any more beneficial than picking trial solutions completely at random
In general choosing a crossover point in the middle of a digit introduces an essentially arbitrary mutation of that digit with a destructive (negative) effect.
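For illustration, a minimal C sketch (a hypothetical layout of five 4-bit genes packed into one word, as in the space-efficiency example above) showing how a cut inside a gene produces a value found in neither parent:

#include <stdint.h>
#include <stdio.h>

/* One-point crossover on a 20-bit chromosome of five 4-bit genes:
 * bits below the cut come from p2, the rest from p1. */
uint32_t crossover(uint32_t p1, uint32_t p2, int point) {
    uint32_t low = (1u << point) - 1;
    return (p1 & ~low) | (p2 & low);
}

int main(void) {
    uint32_t p1 = (6u << 16) | (10u << 12) | (3u << 8) | (5u << 4) | 12u;
    uint32_t p2 = (12u << 16) | (10u << 12) | (3u << 8) | (8u << 4) | 11u;
    uint32_t child = crossover(p1, p2, 6);  /* cut splits the 4th gene */
    for (int g = 4; g >= 0; g--)
        printf("%u ", (child >> (g * 4)) & 0xF);  /* prints 6 10 3 4 11 */
    printf("\n");
    return 0;
}

The fourth gene comes out as 4, a value present in neither parent (5 and 8): the crossover point acted as an arbitrary mutation.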
There is a nice example of this effect in the Local Search Algorithms and Optimization Problems - Genetic Algorithms section of Artificial Intelligence: A Modern Approach (Russell, Norvig).
Take also a look at a similar question on Software Engineering.
The last word of a MIDI header chunk specifies the division. It contains information about whether delta times should be interpreted as ticks per quarter note or ticks per frame (where a frame is a subdivision of a second). If bit 15 of this word is set, the information is in ticks per frame. The next 7 bits (bit 14 through bit 8) specify the number of frames per second and can contain one of four values: -24, -25, -29, or -30 (they are negative).
Does anyone know whether bit 15 counts towards this negative value? In other words, are the values which specify fps actually 8 bits long (15 through 8), or are they 7 bits long (14 through 8)? The documentation I am reading is very unclear about this, and I cannot find the info anywhere else.
The MMA's Standard MIDI-File Format Spec says:
The third word, <division>, specifies the meaning of the delta-times.
It has two formats, one for metrical time, and one for time-code-based
time:
+---+-----------------------------------------+
| 0 |         ticks per quarter-note          |
==============================================|
| 1 | negative SMPTE format | ticks per frame |
+---+-----------------------+-----------------+
|15 |14                    8|7               0|
[...]
If bit 15 of <division> is a one, delta times in a file correspond
to subdivisions of a second, in a way consistent with SMPTE and MIDI
Time Code. Bits 14 thru 8 contain one of the four values -24, -25, -29,
or -30, corresponding to the four standard SMPTE and MIDI Time Code
formats (-29 corresponds to 30 drop frame), and represents the
number of frames per second. These negative numbers are stored in
two's complement form. The second byte (stored positive) is the
resolution within a frame [...]
Two's complement representation allows sign-extending negative values without changing them: prepending extra 1 bits at the MSB end leaves the value the same.
So it does not matter whether you take 7 or 8 bits.
In practice, this value is designed to be interpreted as a signed 8-bit value, because otherwise it would have been stored as a positive value.
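A minimal C sketch of reading it either way (function name mine): treating the high byte as a signed 8-bit value gives the same fps as sign-extending the 7-bit field.

#include <stdint.h>
#include <stdio.h>

/* Decode the <division> word of a MIDI header chunk. */
void decode_division(uint16_t division) {
    if (division & 0x8000) {                     /* bit 15 set: SMPTE time */
        int8_t smpte = (int8_t)(division >> 8);  /* -24, -25, -29 or -30   */
        uint8_t ticks = division & 0xFF;         /* ticks per frame        */
        printf("%d fps, %u ticks per frame\n", -smpte, ticks);
    } else {
        printf("%u ticks per quarter-note\n", division);
    }
}

Feeding it 0xE750, for example, prints "25 fps, 80 ticks per frame".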
This may be a very basic low-level architecture question. I am trying to get my head around it, so please correct me if my understanding is wrong as well.
Word = 64 bits, 32 bits, etc. This is the number of bits a computer can process at a time.
Questions:
1.) Would this mean we can send 4 numbers (of 8 bits/1 byte each) at a time on a 32-bit machine? Or a combination of an 8-bit (1 byte) number, a 32-bit (4 bytes) number, etc., at one time?
2.) If we need to send only an 8-bit number, then how does it form a word? Is only the first byte filled and the rest padded with 0s, or is the last byte filled while the rest are padded with 0s? Or, I saw somewhere that the first byte holds information about how the rest of the bytes are filled; does that apply here? For example, in UTF-8, ASCII takes 1 byte and some other chars take up to 4 bytes. So when we send one char, do we send all 4 bytes together, filling the bytes as required for the char and setting the rest to 0s?
3.) Now, to represent an 8-digit number, we would need 27 bits (remember the famous question: sorting 1 million 8-digit numbers with just 1 MB of RAM). Can we use exactly 27 bits, i.e. 32 bits (4 bytes) minus 5 bits, and use those 5 bits for something else?
Appreciate your answers!
1- Yes, four 8-bit integers can fit in a 32-bit integer. This can be done using bitwise operations, for example (using C operators):
((a & 255) << 24) | ((b & 255) << 16) | ((c & 255) << 8) | (d & 255)
This example uses C operators, but other languages use the same operators for the same purpose (see below for a complete, compilable version of this example in C). You may want to look up the bitwise operators AND (&), OR (|), and left shift (<<).
2- Unused bits are generally 0. The first byte is sometimes used to represent the type of encoding (look up "magic numbers"), but this is implementation-dependent. Sometimes it is a different number of bits.
3- Groups of 8-digit numbers can be compressed to use only 27 bits each. This is very similar to the example, except the number of bits and size of the data are different. To do this, you will need 864-bit groups, i.e. 27 32-bit integers to store 32 27-bit numbers. This would be more complex than the example, but it would use the same principles.
Complete, compilable example in C:
#include <stdio.h>
#include <stdint.h>

/* Compresses four integers containing one byte of data in the least
 * significant byte into a single 32-bit integer. */
uint32_t compress(int a, int b, int c, int d) {
    uint32_t compressed = ((uint32_t)(a & 255) << 24) | ((uint32_t)(b & 255) << 16) |
                          ((uint32_t)(c & 255) << 8) | (uint32_t)(d & 255);
    return compressed;
}

/* Test the compress() function and print the results. */
int main(void) {
    printf("%x\n", (unsigned)compress(255, 0, 255, 0));
    printf("%x\n", (unsigned)compress(192, 168, 0, 255));
    printf("%x\n", (unsigned)compress(84, 94, 255, 2));
    return 0;
}
I think that clarification on 2 points is required here:
1. Memory addressing.
2. Word
Memories can be addressed in 2 ways: they are generally either byte-addressable or word-addressable.
Byte addressable memory means that each byte is given a separate address.
a -> 0th byte
b -> 1st byte
Word-addressable memories are those in which each group of bytes that is as wide as the word gets an address. E.g., if the word length is 32 bits:
a->0th byte
b->4th byte
And so on.
Word
I would say that a word defines the maximum number of bits a processor can handle at a time. For the 8086, for example, it's 16.
It is usually the largest number on which the processor can perform arithmetic. Continuing the example, the 8086 can operate on 16-bit numbers at a time.
Now I'll try to answer the questions:
1.) Would this mean we can send 4 numbers (of 8 bits/1 byte each) at a time on a 32-bit machine? Or a combination of an 8-bit (1 byte) number, a 32-bit (4 bytes) number, etc., at one time?
You can always define your own interpretation for a bunch of bits.
For example, if memory is byte-addressable, we can treat every byte individually, and thus we can write assembly-level code that treats each byte as a separate 8-bit number.
If it is not, you can use bit operations to extract individual bytes.
The point is, you can represent four 8-bit numbers in 32 bits.
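For instance, a small sketch (names mine) of pulling one byte back out, assuming the packing order of the compress() example above:

#include <stdint.h>

/* Extract byte n (0 = least significant) from a packed 32-bit word. */
uint8_t extract_byte(uint32_t x, int n) {
    return (x >> (n * 8)) & 255;
}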
2) Mostly, the leftover significant bits are stuffed with 0s (for unsigned numbers).
3.) Now, to represent an 8-digit number, we would need 27 bits (remember the famous question: sorting 1 million 8-digit numbers with just 1 MB of RAM). Can we use exactly 27 bits, i.e. 32 bits (4 bytes) minus 5 bits, and use those 5 bits for something else?
Yes, you can do this too. But remember the great space-time tradeoff.
You do save 5 bits per number, but you'll need bit operations and all the really cool but hard-to-read stuff, increasing runtime and making the code more complex.
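To make that concrete, a rough sketch (names mine; the buffer is assumed zero-initialized, and value must fit in width <= 32 bits) of packing width-bit values, such as 27-bit numbers, into a flat array:

#include <stdint.h>

/* Write the low `width` bits of `value` at position `index` in a
 * zeroed uint32_t buffer; a value may straddle two words. */
void pack_bits(uint32_t *buf, int index, int width, uint32_t value) {
    long bitpos = (long)index * width;
    int word = (int)(bitpos / 32), off = (int)(bitpos % 32);
    buf[word] |= value << off;             /* low part           */
    if (off + width > 32)                  /* spill to next word */
        buf[word + 1] |= value >> (32 - off);
}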
But I don't think you'll ever come across a situation where you need that level of saving, unless you are coding for a very constrained system (embedded, etc.).
How can a large set of integers, all with a known constant digit sum and a constant number of digits, be encoded?
Example of integers in base 10, with digit sum 5, and 3 digits:
014, 041, 104, 113, 122, 131, 140, 203 ....
The most important factor is space, but computing time is not completely unimportant.
The simplest way would be to store the digit sum itself, and leave it at that.
But it's possible that I misunderstand the question.
Edit: Here's my takeaway: You want to encode the set itself; yeah?
Encoding the set itself is as easy as storing the base, the digit sum, and the number of digits, e.g. {10, 5, 3} in the example you give.
Most of the time, however, you'll find that the most compact representation of a number is the number itself, unless it is very large.
Also, because the digit sum is commonly taken to be recursive (and between one and nine, inclusive), 203 has the same digit sum as 500, as 140, or as 950. This means that the set is huge for any combination of numbers, and also that any set (except for certain degenerate cases) uses every available digit of its base.
So, you know, the most efficient encoding of the numbers themselves, when stored singly, becomes the number itself, especially considering that every number up to ±2,147,483,648 generally takes the same amount of space in memory, and often in storage.
When you have as clearly defined a set of possible values to encode as this, the straightforward coding-theoretic approach is to sequentially number all possible values, then store this number in as many bits as necessary. This is quite clearly optimal if the frequencies of the individual values are identical or unknown. If you know something about the frequency distribution, you'll instead have to use something like a Huffman code to get a truly optimal result; that's rather complicated, so I'll handle only the other case.
For the uniformly distributed (or unknown) case the approach is as follows:
Imagine (you can pre-generate and store it, or generate it on the fly) a lexicographically sorted list of all your input (for the encoding) values. E.g. in your case the list would be (unless your digit sum is recursive): 005, 014, 023, 032, 041, 050, 104, 113, 122, 131, 140, 203, 212, 221, 230, 302, 311, 320, 401, 410, 500.
Then assign each item in the list an integer based on its position in the list: 005 becomes 0, 014 becomes 1, 023 becomes 2, and so on. There are 21 values in the list, so you need 5 bits to encode any index into it. This index is your encoded value, and encoding and decoding become obvious.
As for an algorithm to generate the list in the first place: the simplest way is to count from 000 to 999 and throw away everything that doesn't match your criterion. It is possible to be more clever about it by replicating counting and overflow (e.g. how 104 follows 050), but it's probably not worth the effort.
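A minimal sketch of that brute-force enumeration (base 10, three digits, digit sum 5 hard-coded; the list index is the encoded value):

#include <stdio.h>

/* Return the index of `value` among all 3-digit strings 000..999
 * whose (non-recursive) digit sum is 5, or -1 if it is not in the set. */
int encode(int value) {
    int index = 0;
    for (int n = 0; n < 1000; n++) {
        if (n / 100 + (n / 10) % 10 + n % 10 != 5)
            continue;                /* not in the set: skip */
        if (n == value)
            return index;
        index++;
    }
    return -1;
}

int main(void) {
    printf("%d\n", encode(104));     /* prints 6: 005,014,023,032,041,050,104 */
    return 0;
}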