Bit shifting a bit array and binary math for SHA-1 - hash

I am posting for a sanity check so please forgive me if this sounds a little basic. I am trying to learn more about encryption so I figured a good starter project would be implement the Sha-1 hashing algorithm. I found a walk-through and have hit a point where I do not know if the walk-through is wrong or my understanding of bitness/rotation/binary operations is wrong.
From the document:
Step 11.2: Put them together
After completing one of the four functions above, each variable will move on to this step before restarting the loop with the next
word. For this step we are going to create a new variable called
'temp' and set it equal to: (A left rotate 5) + F + E + K + (the
current word).
Notice that other than the left rotate the only operation we're doing is basic addition. Addition in binary is about as simple as it
can be.
We'll use the results from the last word(79) as an example for this step.
A lrot 5:
00110001000100010000101101110100
F:
10001011110000011101111100100001
A lrot 5 + F
Out:
110111100110100101110101010010101
Notice that the result of this operation is one bit longer than the two inputs. This is just like adding 5 and 6, you will need a new
place value to represent the answer. For everything to work out
properly we will need to truncate that extra bit eventually. However,
we do not want to do that until the end!
This does not quite work out. What I think would happen is:
A = 00110001000100010000101101110100
F = 10001011110000011101111100100001
A Left rotate 5 = 00100010001000010110111010000110
(A Left Rotate 5) + F = 10101101111000110100110110100111 (which is still 32 bits)
What I need is just another set of eyes on this to say "Yes krtzer, you are correct and this document is wrong" Or "Your understanding of bitness, endianness, and/or bit rotation is wrong, this is how it works".
Right now I am not sure if my integer representation is wrong (the spec says use U32s, but this section says that I need to keep track of the extra bits), the endianness of my program is messing up the rotation (I use little endian) or there is something else.
Any experience or insight will be appreciated!

You are correct in your understanding of how everything works. The problem was with the article (which I wrote). An extra digit must always be added in the beginning of step 11.2, whether it's necessary or not, and if it's not necessary, it should be set to 1.
The article now reads:
Notice that the result of this operation is one bit longer than the two inputs. After each iteration the new word should be one bit longer than the last. Sometimes this will be a necessary carrier bit (like the extra place value you need to represent the result of adding the two single digit numbers 5 and 6 in base 10), and when that's not needed you must simply prepend a 1. For everything to work out properly we will need to truncate that extra bit eventually; however, we do not want to do that until the end!
The article was also unclear about the fact that an already rotated A was being shown in the example.

Related

How arbitrary is one?

My teacher says our homework program must handle "an arbitrary number of input lines". It seems pretty arbitrary to only accept one line, but is it arbitrary enough? My roommate said seven is more a arbitrary number than one, and maybe he's right. But I just have no idea how to measure the arbitrariness of a number and Google doesn't seem to help.
UPDATE:
It sounds like maybe the best thing to do is accepty any given number of input lines, and hope the prof can see that that makes a lot more sense than insisting that the user just give you one specific arbitrary number of input lines. Especially since we weren't instructed to notify the user about what the arbitrary number is. You can't just make the user guess, that's crazy.
"Arbitrary" doesn't mean you get to pick a random number to accept. It means that it should handle an input with any number of lines.
So if someone decides to give your program an input with 0 lines, 1 line, 2 lines... n lines, then it should still do the right thing (and not crash).
Arbitrary means it could be ANY number. 0, 1, 7, 100124453225.
I would probably test for 0 and display some sort of error in that case since it's supposed to have SOME text. Other than that so long as there are more lines your program should keep doing whatever it's designed to do.
Typically when teachers indicate that a program should accept arbitrary amounts of input they are indicating to you that you should consider corner cases which you may not have thought about, one of the most common being no input at all which can often cause errors in programs if the programmer hasn't considered this case.
The point of the word is to emphasize that your program should be able to handle different inputs instead of simply crashing unless input comes in a certain quantity or is formatted in a specific way.

Difference between machine language, binary code and a binary file

I'm studying programming and in many sources I see the concepts: "machine language", "binary code" and "binary file". The distinction between these three is unclear to me, because according to my understanding machine language means the raw language that a computer can understand i.e. sequences of 0s and 1s.
Now if machine language is a sequence of 0s and 1s and binary code is also a sequence of 0s and 1s then does machine language = binary code?
What about binary file? What really is a binary file? To me the word "binary file" means a file, which consists of binary code. So for example, if my file was:
010010101010010
010010100110100
010101100111010
010101010101011
010101010100101
010101010010111
Would this be a binary file? If I google binary file and see Wikipedia I see this example picture of binary file which confuses me (it's not in binary?....)
Where is my confusion happening? Am I mixing file encoding here or what? If I were to ask one to SHOW me what is machine language, binary code and binary file, what would they be? =) I guess the distinction is too abstract to me.
Thnx for any help! =)
UPDATE:
In Python for example, there is one phrase in a file I/O tutorial, which I don't understand: Opens a file for reading only in binary format. What does reading a file in binary format mean?
Machine code and binary are the same - a number system with base 2 - either a 1 or 0. But machine code can also be expressed in hex-format (hexadecimal) - a number system with base 16. The binary system and hex are very interrelated with each other, its easy to convert from binary to hex and convert back from hex to binary. And because hex is much more readable and useful than binary - it's often used and shown. For instance in the picture above in your question -uses hex-numbers!
Let say you have the binary sequence 1001111000001010 - it can easily be converted to hex by grouping in blocks - each block consisting of four bits.
1001 1110 0000 1010 => 9 14 0 10 which in hex becomes: 9E0A.
One can agree that 9E0A is much more readable than the binary - and hex is what you see in the image.
I'm honestly surprised to not see the information I was looking for, looking back though, I guess the title of this thread isn't fully appropriate to the question the OP was asking.
You guys all say "Machine Code is a bunch of numbers".
Sure, the "CODE" is a bunch of numbers, but what people are wondering (I'm guessing) is "what actually is happening physically?"
I'm quite a novice when it comes to programming, but I understand enough to feel confident in 'roughly' answering this question.
Machine code, to the actual circuitry, isn't numbers or values.
Machine code is a bunch of voltage gates that are either open or closed, and depending on what they're connected to, a certain light will flicker at a certain time etc.
I'm guessing that the "machine code" dictates the pathway and timing for specific electrical signals that will travel to reach their overall destination.
So for 010101, 3 voltage gates are closed (The 0's), 3 are open (The 1's)
I know I'm close to the right answer here, but I also know it's much more sophisticated - because I can imagine that which I don't know.
010101 would be easy instructions for a simple circuit, but what I can't begin to fathom is how a complex computer processes all of the information.
So I guess let's break it down?
x-Bit-processors tell how many bits the processor can process at once.
A bit is either 1 or 0, "On" or "Off", "Open" or "Closed"
so 32-bit processors process "10101010 10101010 10101010 10101010" - this many bits at once.
A processor is an "integrated circuit", which is like a compact circuit board, containing resistors/capacitors/transistors and some memory. I'm not sure if processors have resistors but I know you'll usually find a ton of them located around the actual processor on the circuit board
Anyways, a transistor is a switch so if it receives a 1, it sends current in one direction, or if it receives a 0, it'll send current in a different direction... (or something like that)
So I imagine that as machine code goes... the segment of code the processor receives changes the voltage channels in such a way that it sends a signal to another part of the computer (why do you think processors have so many pins?), probably another integrated circuit more specialized to a specific task.
That integrated circuit then receives a chunk of code, let's say 2 to 4 bits 01 or 1100 or something, which further defines where the final destination of the signal will end up, which might be straight back to the processor, or possibly to some output device.
Machine code is a very efficient way of taking a circuit and connecting it to a lightbulb, and then taking that lightbulb out of the circuit and switching the circuit over to a different lightbulb
Memory in a computer is highly necessary because otherwise to get your computer to do anything, you would need to type out everything (in machine code). Instead, all of the 1's and 0's are stored inside some storage device, either a spinning hard disk with a magnetic head pin that 'reads' 1's or 0's based on the charge of the disk, or a flash memory device that uses a series of transistors, where sending a voltage through elicits 1's and 0's (I'm not fully aware how flash memory works)
Fortunately, someone took the time to think up a different base number system for programming (hex), and a way to compile those numbers (translate them) back into binary. And then all software programs have branched out from there.
Each key on the keyboard creates a specific signal in binary that translates to
a bunch of switches being turned on or off using certain voltages, so that a current could be run through the specific individual pixels on your screen that create "1" or "0" or "F", or all the characters of this post.
So I wonder, how does a program 'program', or 'make' the computer 'do' something... Rather, how does a compiler compile a program of a code different from binary?
It's hard to think about now because I'm extremely tired (so I won't try) but also because EVERYTHING you do on a computer is because of some program.
There are actively running programs (processes) in task manager. These keep your computer screen looking the way you've become accustomed, and also allow for the screen to be manipulated as if to say the pictures on the screen were real-life objects. (They aren't, they're just pictures, even your mouse cursor)
(Ok I'm done. enough editing and elongating my thoughts, it's time for bed)
Also, what I don't really get is how 0's are 'read' by the computer.
It seems that a '0' must not be a 'lack of voltage', rather, it must be some other type of signal
Where perhaps something like 1 volt = 1, and 0.5 volts = 0. Some distinguishable difference between currents in a circuit that would still send a signal, but could be the difference between opening and closing a specific circuit.
If I'm close to right about any of this, serious props to the computer engineers of the world, the level of sophistication is mouthwatering. I hope to know everything about technology someday. For now I'm just trying to get through arduino.
Lastly... something I've wondered about... would it even be possible to program today's computers without the use of another computer?
Machine language is a low-level programming language that generally consists entirely of numbers. Because they are just numbers, they can be viewed in binary, octal, decimal, hexadecimal, or any other way. Dave4723 gave a more thorough explanation in his answer.
Binary code isn't a very well-defined technical term, but it could mean any information represented by a sequence of 1s and 0s, or it could mean code in a machine language, or it could mean something else depending on context.
Technically, all files are stored in binary, we just don't usually look at the binary when we view a file. However, the term binary file is usually used to refer to any non-text file; e.g. an .exe, a .png, etc.
You have to understand how a computer works in its basic principles and this will clear things up for you... Therefore I recommend on reading into stuff like Neumann Architecture
Basically in a very simple computer you only have one memory like an array
which has instructions for your processor, the data and everything is a binary numbers.
Your program starts at a certain place in your memory and reads the first number...
so here comes the twist: these numbers can be instructions or data.
Your processor reads these numbers and interprets them as instructions
Example: the start address is 0
in 0 is a instruction like "read value from address 120 into the ALU (Math-Unit)
then it steps to address 1
"read value from address 121 into ALU"
then it steps to address 2
"subtract numbers in ALU"
then it steps to address 3
"if ALU-Value is smaller than zero go to address 10"
it is not smaller than zero so it steps to address 4
"go to address 20"
you see that this is a basic if(a < b)
You can write these instructions as numbers and they can be run by your processor but because nobody wants to do this work (that was what they did with punchcards in the 60s)
assembler was invented...
that looks like:
add 10 ,11, 20 // load var from address 10 and 11; run addition and store into address 20
In Conclusion:
Assembler (processor instructions) can be called binary because it's stored in plain numbers
But everything else can be a Binary file, too.
In reality if you have a simple .exe file it is both... If you have variables in there like a = 10 and b = 20, these values can be stored some where between if clauses and for loops... It depends on the compiler where it put these
But if you have a complex 3D-model it can be stored in a separate file with no executable code in it...
I hope it helps to clear things up a little.

Need to rewrite CNC file to shift absolute position

This question is really in two parts. To briefly introduce the issue, we have a requirement to take a CNC file (used with a Roland milling machine) that has been produced using a tool called ArtCam, and modify it to shift the absolute position of the pattern being cut.
The software produces, and the machine accepts, input files in the following form:
;;^IN;
!MC1;
!RC5000;
V50.0;
^PR;Z0,0,10500;
^PA;
V49.8;
Z0,0,1000;
V39.8;
Z0,0,100;
Z10,0,99;
Z1000,0,-13;
Z10,0,-124;
Z0,0,-125;
...thousands more Zx,y,z; instructions...
The first part to my question is, can anyone actually tell me what this file format is called? It's clearly not G-Code, and I haven't been able to find any reference or documentation for it anywhere.
The second part is, does anyone know how we might easily modify the absolute position of the pattern that these files cut. Obviously the Z lines are X,Y,Z position commands but I don't know if they're absolute or relative, and I don't know in what coordinate space/system they are. For all I know there might be a simple command we can add at the top that shifts the starting point, or we might need to rewrite all the Z lines, but without some information on the file format I'm at a dead end.
Thanks!
I realise this is an old question and you maybe already have an answer (or have no need for one now) but it looks like it's RML-1, assuming my searches were correct.
I first found this which showed very similar code to your example. It mentions ArtCAM and output for the MDX-540, a Roland machine.
Searching Roland's milling machines for information was a bit useless, but going through their 3D products for the MDX-540 mentions that the control command sets is "RML-1 and NC codes".
Then searching for RML-1 gives a result for a PDF manual.
Reading that PDF it looks like the single letter commands are "Mode 1", the ^ is used to select Mode2 and the 2 letter commands are Mode2 commands. !xx commands are common to both Mode1 and Mode2.
^PR sets the movement to relative mode.
^PA sets the movement to absolute mode.
Z moves.
Looking at your code sample it appears as if most positions are absolute and you'd need to re-write them all.

Parse bit strings in Perl

When working with unpack, I had hoped that b3 would return a bitstring, 3 bits in length.
The code that I had hoped to be writing (for parsing a websocket data packet) was:
my($FIN,$RSV1, $RSV2, $RSV3, $opcode, $MASK, $payload_length) = unpack('b1b1b1b1b4b1b7',substr($read_buffer,0,2));
I noticed that this doesn't do what I had hoped.
If I used b16 instead of the template above, I get the entire 2 bytes loaded into first variable as "1000000101100001".
That's great, and I have no problem with that.
I can use what I've got so far, by doing a bunch of substrings, but is there a better way of doing this? I was hoping there would be a way to process that bit string with a template similar to the one I attempted to make. Some sort of function where I can pass the specification for the packet on the right hand side, and a list of variables on the left?
Edit: I don't want to do this with a regex, since it will be in a very tight loop that will occur a lot.
Edit2: Ideally it would be nice to be able to specify what the bit string should be evaluated as (Boolean, integer, etc).
If I have understood correctly, your goal is to split the 2-bytes input to 7 new variables.
For this purpose you can use bitwise operations. This is an example of how to get your $opcode value:
my $b4 = $read_buffer & 0x0f00; # your mask to filter 9-12 bits
$opcode = $b4 >> 8; # rshift your bits
You can do the same manipulations (maybe in a single statement, if you want) for all your variables and it should execute at a resonable good speed.

Is there a diff algorithm that preserves line ownership

My goal is coming up with a script to track the point a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional vcs 'blame' scripts. I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separate from a deletion-and-addition of that line, and tracking entire functions moved around so they're in different hunks. For those experienced with diff but unfamiliar with the terminology, a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information, on moves and edit-in-place lines, vaguely alluded to in an entry on c2: DiffAlgorithm (paragraph starts with "My favorite mode"). Does anyone know what that is? (seems to be based on Tichy, see bottom.)
Here's more info on the two missing features:
no concept of a change on a line, (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b' but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated, if the position of edits is the same (just an expanded version of markup_instraline_changes but comparing edit distance on all equal-sized subsets of old and new lines.
no concept of "moving" code that preserves the ownership of the lines, e.g. this diff shouldn't alter the ownership of "line", although its position changes.
a
-line
c
+line
This could be dealt with in the same way but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check Levenshtein distance between all added against all removed lines) and with likely false positives (some, like whitespace-only lines, aren't relevant to my problem).
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84 which appears to be The string-to-string correction problem with block moves (which I haven't read yet) according to Walter Tichy's paper a year later on RCS
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
SD Smart Differencer tools compare source text by parsing the text according to the langauge it is in, discovering the code structures, and computing least-Levensthein differences in terms of programming language constructs (identifiers, expressions, statements, blocks, classes, ...) and abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copy and edits, too, which I suspect won't surprise you.
These tools are in Beta, and are presently available for COBOL, Java and C#. Lots of other langauges are in the pipe, because the SmartDifferencer is built on top of a langauge-parameterized infrastructure, DMS Software Reengineering Toolkit, which has quite a number of already existing, robust langauge grammars.
I think the idea of what amount of editing a line that can be done while it remains a descendent of some previously written line is very subjective, and based on context, both things that a computer cannot work with. You'd have to specify some sort of configurable minimum similarity on lines in your program I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example incrementing the value of some variable), and this will be be quite a common thing, so your desired algorithm won't really give truthful or useful information about a line quite often.
I would like to suggest an algorithm for this though (which makes tons of hopefully obvious assumptions by the way) so here goes:
Convert both texts to lists of lines
Copy the lists and Strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
Do a Levenshtein distance from the old to new lists ...
... keeping all intermediate data
Find all lines in the new text that were matched with old lines
Mark the line in both new/old original lists as having been matched
Delete the line from the new text (the copy)
Optional: If some matched lines are in a contiguous sequence ...
... in either original text assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text
Group together sequences of unmatched lines in both old and new texts ...
... which are contiguous in the original text
Attribute each with the line match before and after
Run through all groups in old text
If any match before and after attributes with new text groups for each
//If they are inside the same area basically
Concatenate all the lines in both groups (separately and in order)
Include a character to represent where the line breaks are
Repeat
Do a Levenshtein distance on these concatenations
If there are any significantly similar subsequences found
//I can't really define this but basically a high proportion
//of matches throughout all lines involved on both sides
For each matched subsequence
Find suitable newline spots to delimit the subsequence
Mark these lines matched in the original text
//Warning splitting+merging of lines possible
//No 1-to-1 correspondence of lines here!
Delete the subsequence from the new text group concat
Delete also from the new text working list of lines
Until there are no significantly similar subsequences found
Optional: Regroup based on remaining unmatched lines and repeat last step
//Not sure if there's any point in trying that at the moment
Concatenate the ENTIRE list of whitespaced-removed lines in the old text
Concatenate the lines in new text also (should only be unmatched ones left)
//Newline character added in both cases
Repeat
Do Levenshtein distance on these concatenations
Match similar subsequences in the same way as earlier on
//Don't need to worry deleting from list of new lines any more though
//Similarity criteria should be a fair bit stricter here to avoid
// spurious matchings. Already matched lines in old text might have
// even higher strictness, since all of copy/edit/move would be rare
While you still have matchings
//Anything left unmatched in the old text is deleted stuff
//Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparing results!
Well, hopefully you can see the basics of what I mean with that completely untested algorithm. Find obvious matches first, and verbatim moves of chunks of decreasing size, then compare stuff that's likely to be similar, then look for anything else which is similar, but both modified and moved: probably just coincidentally similar.
Well, if you try implementing this, tell me how it works out, and what details you changed, and what kind of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it just abyssmally fails due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one