Parse bit strings in Perl

When working with unpack, I had hoped that b3 would return a bitstring, 3 bits in length.
The code that I had hoped to be writing (for parsing a websocket data packet) was:
my($FIN,$RSV1, $RSV2, $RSV3, $opcode, $MASK, $payload_length) = unpack('b1b1b1b1b4b1b7',substr($read_buffer,0,2));
I noticed that this doesn't do what I had hoped.
If I use b16 instead of the template above, the entire 2 bytes are loaded into the first variable as "1000000101100001".
That's great, and I have no problem with that.
I can use what I've got so far, by doing a bunch of substrings, but is there a better way of doing this? I was hoping there would be a way to process that bit string with a template similar to the one I attempted to make. Some sort of function where I can pass the specification for the packet on the right hand side, and a list of variables on the left?
Edit: I don't want to do this with a regex, since it will be in a very tight loop that will occur a lot.
Edit2: Ideally it would be nice to be able to specify what the bit string should be evaluated as (Boolean, integer, etc).

If I have understood correctly, your goal is to split the 2-byte input into 7 new variables.
For this purpose you can use bitwise operations. This is an example of how to get your $opcode value:
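my $word = unpack('n', substr($read_buffer, 0, 2)); # read the 2 header bytes as one big-endian 16-bit integer
my $opcode = ($word & 0x0f00) >> 8; # mask bits 9-12 (counting from 1 at the least significant bit), then shift right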
You can do the same manipulations (maybe in a single statement, if you want) for all your variables, and it should execute at a reasonably good speed.
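For illustration, the same masking idea can be sketched for all seven fields at once. This sketch is in Python rather than Perl, the function name parse_ws_header is hypothetical, and it assumes the two header bytes are read as one big-endian 16-bit integer (what unpack('n', ...) would give you in Perl):

```python
import struct


def parse_ws_header(buf):
    """Split the first two bytes of a WebSocket frame into its seven fields.

    Assumption: the bytes are interpreted as a big-endian 16-bit integer,
    so FIN is the most significant bit of the resulting word.
    """
    (word,) = struct.unpack("!H", buf[:2])
    return {
        "fin": bool(word & 0x8000),       # bit 15
        "rsv1": bool(word & 0x4000),      # bit 14
        "rsv2": bool(word & 0x2000),      # bit 13
        "rsv3": bool(word & 0x1000),      # bit 12
        "opcode": (word & 0x0F00) >> 8,   # bits 8-11, as an integer
        "mask": bool(word & 0x0080),      # bit 7
        "payload_length": word & 0x007F,  # bits 0-6, as an integer
    }


header = parse_ws_header(bytes([0x81, 0x61]))
print(header["fin"], header["opcode"], header["mask"], header["payload_length"])
```

With the bytes 0x81 0x61 (the bit string 1000000101100001 from the question), this gives FIN set, opcode 1, MASK clear and a payload length of 97. Flags come out as booleans and the multi-bit fields as integers, which is roughly what Edit2 asks for.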

Subscript multiple characters in Julia variable name?

I can write:
x\_m<TAB> = 5
to get x subscript m as a variable name in Julia. What if I want to subscript a word instead of a single character? This
x\_max<TAB> = 5
doesn't work. However,
x\_m<TAB>\_a<TAB>\_x<TAB> = 5
does work, it's just very uncomfortable. Is there a better way?
As I noted in my comment, not all ASCII characters exist as unicode super- or sub-scripts. In addition, another difficulty in generalizing this tab completion will be determining what \_phi<TAB> should mean: is it ₚₕᵢ or ᵩ? Finally, I'll note that since these characters are cobbled together from different ranges for different uses they look pretty terrible when used together.
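To see which letters of a given word actually have a subscript form, you can ask a Unicode database. The following is only a sketch in Python (not Julia), with a hypothetical helper name to_subscript; it assumes the characters of interest follow the Unicode naming pattern "LATIN SUBSCRIPT SMALL LETTER X", so it will not find Greek subscripts such as ᵩ:

```python
import unicodedata


def to_subscript(word):
    """Return (subscripted_text, missing_letters) for a lowercase ASCII word.

    Letters without a "LATIN SUBSCRIPT SMALL LETTER ..." character in the
    running Python's Unicode database are collected in missing_letters.
    """
    found, missing = [], []
    for ch in word:
        name = f"LATIN SUBSCRIPT SMALL LETTER {ch.upper()}"
        try:
            found.append(unicodedata.lookup(name))
        except KeyError:
            missing.append(ch)
    return "".join(found), missing


print(to_subscript("max"))
print(to_subscript("min"))
```

Any letter that ends up in the second list cannot be written as a subscript with this family of characters, which is why a fully general completion is hard to define.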
A simple hack to support common words you use would be to add them piecemeal to the Base.REPLCompletions.latex_symbols dictionary:
Base.REPLCompletions.latex_symbols["\\_max"] = "ₘₐₓ"
Base.REPLCompletions.latex_symbols["\\_min"] = "ₘᵢₙ"
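You can put these additions in your .juliarc.jl file to load them every time on startup (in Julia 1.0 and later the startup file is ~/.julia/config/startup.jl, and the dictionary lives in REPL.REPLCompletions rather than Base.REPLCompletions). While it may be possible to get a comprehensive solution, it'll take much more work.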
Since Julia 1.6 this works for subscripts (\_) and superscripts (\^) in the Julia REPL.
x\_max<TAB> will print out like this: xₘₐₓ.
x\^max<TAB> will print out like this: xᵐᵃˣ.

Variable output hash function

I know that there are hash functions that produce a fixed-length output from a variable-length input. To take the simplest one, with modulo ten I will always get an output between 0 and 9, no matter how big the input number is.
I need to derive a variable-length output from an unknown password. My first thought was to use the modulo, increasing the prime number to match the number of digits I need as output.
My problems are:
I must handle short passwords as well as I would with long passwords;
I don't know how long the output should be before writing the program, and even though I will know once the user has set the password, I may need to change it if he modifies the file.
My first thought was to use a simple function and modify it based on my needs.
If I have to hash 123 but I need to have 5 characters as output, that's what I would do:
I add 2 zeros on the right, changing the input to 12300;
I take the lowest 5 digits prime number (10007);
I then get my hash by computing 12300 % 10007 = 2293, zero-padded to 02293.
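The three steps above can be sketched as follows. This is Python, the names modulo_hash and smallest_prime_with_digits are hypothetical, and the sketch assumes the input is a non-negative integer with no more digits than the requested output. It only illustrates the scheme as described and makes no claim that it is a good hash:

```python
def is_prime(n):
    """Trial division; fine for small n, far too slow for large ones."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def smallest_prime_with_digits(digits):
    """Return the smallest prime that has exactly `digits` decimal digits."""
    candidate = 10 ** (digits - 1)
    while not is_prime(candidate):
        candidate += 1
    return candidate


def modulo_hash(number, digits):
    """Pad with zeros on the right, reduce modulo a prime, zero-pad the result."""
    padded = str(number).ljust(digits, "0")       # 123 -> "12300"
    prime = smallest_prime_with_digits(digits)    # 10007 for 5 digits
    return str(int(padded) % prime).zfill(digits)


print(modulo_hash(123, 5))  # prints 02293
```

The prime search uses trial division, so it is only meant for small digit counts like the 5 used in the example.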
But since I will probably need outputs hundreds, if not thousands, of characters long, I'm pretty sure the modulo is not the solution to my problem.
I could also try to create my own hash function, but I have no idea how to verify whether it works or whether it's trash.
Are there some common solutions in literature for this kind of problem?

Bit shifting a bit array and binary math for SHA-1

I am posting for a sanity check, so please forgive me if this sounds a little basic. I am trying to learn more about encryption, so I figured a good starter project would be to implement the SHA-1 hashing algorithm. I found a walk-through and have hit a point where I do not know if the walk-through is wrong or my understanding of bitness/rotation/binary operations is wrong.
From the document:
Step 11.2: Put them together
After completing one of the four functions above, each variable will move on to this step before restarting the loop with the next word. For this step we are going to create a new variable called 'temp' and set it equal to: (A left rotate 5) + F + E + K + (the current word).
Notice that other than the left rotate the only operation we're doing is basic addition. Addition in binary is about as simple as it can be.
We'll use the results from the last word(79) as an example for this step.
A lrot 5:
00110001000100010000101101110100
F:
10001011110000011101111100100001
A lrot 5 + F
Out:
110111100110100101110101010010101
Notice that the result of this operation is one bit longer than the two inputs. This is just like adding 5 and 6, you will need a new place value to represent the answer. For everything to work out properly we will need to truncate that extra bit eventually. However, we do not want to do that until the end!
This does not quite work out. What I think would happen is:
A = 00110001000100010000101101110100
F = 10001011110000011101111100100001
A Left rotate 5 = 00100010001000010110111010000110
(A Left Rotate 5) + F = 10101101111000110100110110100111 (which is still 32 bits)
What I need is just another set of eyes on this to say "Yes krtzer, you are correct and this document is wrong" or "Your understanding of bitness, endianness, and/or bit rotation is wrong, this is how it works".
Right now I am not sure if my integer representation is wrong (the spec says use U32s, but this section says that I need to keep track of the extra bits), the endianness of my program is messing up the rotation (I use little endian) or there is something else.
Any experience or insight will be appreciated!
You are correct in your understanding of how everything works. The problem was with the article (which I wrote). An extra digit must always be added in the beginning of step 11.2, whether it's necessary or not, and if it's not necessary, it should be set to 1.
The article now reads:
Notice that the result of this operation is one bit longer than the two inputs. After each iteration the new word should be one bit longer than the last. Sometimes this will be a necessary carrier bit (like the extra place value you need to represent the result of adding the two single digit numbers 5 and 6 in base 10), and when that's not needed you must simply prepend a 1. For everything to work out properly we will need to truncate that extra bit eventually; however, we do not want to do that until the end!
The article was also unclear about the fact that an already rotated A was being shown in the example.
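To check the numbers from the question independently, here is a small sketch in Python. The helper names rotl32 and add32 are hypothetical; the sketch assumes 32-bit words with addition taken modulo 2^32, which is how the SHA-1 specification (FIPS 180-4) defines the operation. It computes both readings of the example: the asker's (rotate A first, then add F) and the article's (the value shown is already rotated, so just add F).

```python
MASK32 = 0xFFFFFFFF  # keeps a Python integer within 32 bits


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def add32(*terms):
    """Add 32-bit words modulo 2**32, discarding any carry out of bit 31."""
    return sum(terms) & MASK32


A = 0b00110001000100010000101101110100  # value labelled A in the question
F = 0b10001011110000011101111100100001

asker_reading = add32(rotl32(A, 5), F)  # rotate A, then add F
article_reading = add32(A, F)           # treat the value as already rotated

print(f"{rotl32(A, 5):032b}")     # 00100010001000010110111010000110
print(f"{asker_reading:032b}")    # 10101101111000110100110110100111
print(f"{article_reading:032b}")  # 10111100110100101110101010010101
```

The second line of output matches the asker's 32-bit result. The third line is the article's 33-bit "Out" value without its leading 1, which is consistent with the explanation that the example showed an already rotated A and that the extra leading bit was prepended rather than carried.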

Tokenize the mmaped string using strtok

I have mmapped a huge file containing 4k * 4k floats. Since it is a text file, I have to mmap it as a char string. Now I need to parse the floats and write them into a 2D array. If I tokenize it using strtok, it fails, since the mmapped string is not modifiable. If I copy the string into a std::string and then tokenize using the getline function, it works, but I feel I will lose the performance gained from mmap. How do I solve this problem optimally?
You can try some different solutions, but you will have to benchmark to find out which one is the best for you. It's not always clear that mmap()ing a file and processing the memory-mapped pages directly is the best solution. Especially if you make a single sequential pass through the file, a loop that read()s pieces at a time into a buffer can be faster, even if you use madvise() together with mmap(). Again, benchmark if you want to know what is fastest for you.
Some solutions you might try:
mmap() with PROT_READ | PROT_WRITE and MAP_PRIVATE, and then use your existing strtok() code. This will allow strtok() to write the NUL bytes it wants to write, without having those changes be reflected in the file. If you choose this solution, you should probably call madvise(MADV_DONTNEED) on the parts of the file you have already processed, else memory usage will grow linearly.
Implement your own variant of strtok() that returns the length of the matched token instead of a NUL-terminated string. It's not difficult, using memchr(). This way you don't need to modify the memory. You might then need to pass the resulting tokens to functions that take a string and a length instead of a NUL-terminated string. There aren't many such functions in the C library, but even so you might be able to get away with calling functions like strtod() if the tokens are guaranteed to end in some non-digit delimiter. Or you can copy them into a small stack-allocated buffer (they're floats, they can't be that long, right?)
Use a read()-and-process loop instead of mmap().
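The second option (a tokenizer that returns offsets and lengths instead of writing NUL bytes) can be sketched like this. The sketch is in Python rather than C++, purely to show the shape of the technique; the names iter_tokens and parse_floats are hypothetical, and it assumes the floats are separated by spaces, tabs or newlines. A per-byte loop in Python is slow, so treat it as an outline of the logic, not as a benchmark candidate.

```python
import mmap

DELIMITERS = b" \t\r\n"  # assumption: whitespace-separated values


def iter_tokens(buf):
    """Yield (start, length) for each token; never writes to buf."""
    pos, end = 0, len(buf)
    while pos < end:
        # Skip any run of delimiters.
        while pos < end and buf[pos] in DELIMITERS:
            pos += 1
        start = pos
        # Advance to the end of the token.
        while pos < end and buf[pos] not in DELIMITERS:
            pos += 1
        if pos > start:
            yield start, pos - start


def parse_floats(path):
    """Map the file read-only and convert every token to a float."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Each slice copies only the few bytes of one token.
            return [float(mm[s:s + n]) for s, n in iter_tokens(mm)]
```

Because the mapping is opened read-only and the tokenizer only reports positions, the mapped pages are never modified; the small per-token copy plays the role of the stack-allocated buffer mentioned above.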

Can the C preprocessor perform simple string manipulation?

This is a C macro weirdness question.
Is it possible to write a macro that takes a string constant X ("...") as its argument and evaluates to a string Y of the same length, such that each character of Y is a [constant] arithmetic expression of the corresponding character of X?
This is not possible, right?
No, the C preprocessor considers string literals to be a single token and therefore it cannot perform any such manipulation.
What you are asking for should be done in actual C code. If you are worried about runtime performance and wish to delegate this fixed task at compile time, modern optimising compilers should successfully deal with code like this - they can unroll any loops and pre-compute any fixed expressions, while taking code size and CPU cache use patterns into account, which the preprocessor has no idea about.
On the other hand, you may want your code to include such a modified string literal, but do not want or need the original - e.g. you want to have obfuscated text that your program will decode and you do not want to have the original strings in your executable. In that case, you can use some build-system scripting to do that by, for example, using another C program to produce the modified strings and defining them as macros in the C compiler command line for your actual program.
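As a rough illustration of that build-time approach, a small generator script could emit the transformed literal for you. The sketch below uses Python instead of a second C program; the names transform, c_string_literal and define_line, the XOR key 0x2A and the macro name MSG are all made up for the example, and XOR simply stands in for whatever per-character expression you have in mind.

```python
def transform(text, key=0x2A):
    """Apply a constant per-character expression (here: XOR with key)."""
    return bytes(b ^ key for b in text.encode("ascii"))


def c_string_literal(data):
    """Render bytes as a C string literal made only of hex escapes."""
    return '"' + "".join(f"\\x{b:02x}" for b in data) + '"'


def define_line(name, text, key=0x2A):
    """Produce a #define line suitable for a generated header."""
    return f"#define {name} {c_string_literal(transform(text, key))}"


print(define_line("MSG", "Hi"))  # prints: #define MSG "\x62\x43"
```

The output could be written to a generated header that the real program includes, so the original text never appears in the executable. Note that a character equal to the key would become a NUL byte inside the literal, so the consuming C code would need to carry the length separately in that case.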
As already said by others, the preprocessor sees entire strings as tokens. There is only one exception: the _Pragma operator, which takes a string as its argument and tokenizes its contents to pass them to a #pragma directive.
So unless you're targeting a _Pragma, the only way to do things in the preprocessing phases is to have them written as token sequences, manipulate them, and stringify them at the end.