xkcd: Externalities - hash

So the April 1, 2013 xkcd "Externalities" web comic features a Skein-1024 (1024-bit digest) hash-breaking contest. I'm assuming that this must be nothing more than a brute-force effort where random strings are hashed in an effort to match Randall's posted hash? Is this correct?
Also, my knowledge of Skein hashing theory is virtually non-existent, but being a halfway decent programmer I was able to download and run both SkeinFish (C#) and Maarten Bodewes' Skein implementation (Java) locally in 1024-1024 mode with some input strings. The hashes they gave, however, were different from the hash that xkcd returned for the same input. This may be an extremely naive question, but do different Skein implementations give different hashes? And which Skein implementation is xkcd using?
Thanks for pardoning my ignorance!

There are several different iterations of the Skein algorithm. xkcd is using version 1.3, which is also the most recent. Sources can be found here (look for "V1.3").
Interestingly enough, this brute-force method is the same one employed by Bitcoin to "mine" bitcoins. The big differences are the hash algorithm (SHA-256 in that case) and the target hash (which is dynamically determined to be any hash starting with a certain number of zeros). It takes a lot of work to discover such a hash, but once it has been found it is trivial to verify the source bits and that the resulting hash meets the criteria.
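To make the brute-force idea concrete, here is a minimal Python sketch. It assumes the pyskein package mentioned further down (which exposes a hashlib-style skein1024()); TARGET_HEX is a placeholder, not the hash Randall actually posted, and entries were scored by how many bits of the Skein-1024 output differ from the target, lower being better.

import os
import skein   # pyskein

TARGET_HEX = "00" * 128                # placeholder, not the real target
TARGET = int(TARGET_HEX, 16)

def score(candidate):
    """Bits by which Skein-1024(candidate) differs from the target."""
    digest = skein.skein1024(candidate, digest_bits=1024).digest()
    return bin(int.from_bytes(digest, "big") ^ TARGET).count("1")

best = 1024
for _ in range(1_000_000):             # keep hashing random candidates
    guess = os.urandom(16).hex().encode()
    s = score(guess)
    if s < best:
        best = s
        print(best, guess.decode())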

Here's the source code the Stanford team used. We ran this on about a hundred 8-core EC2 servers for a while, but not the whole competition.
https://github.com/jhiesey/skeincrack

If you were hashing non-alphanumeric characters (spaces, punctuation, etc.), you may have been getting different results due to HTML form encoding. The "enctype" attribute on the form XKCD was hosting was "application/octet-stream", which according to https://developer.mozilla.org/en-US/docs/HTML/Element/form is not a browser-supported standard. I assume the browser falls back on the URL-encoding type when it sees one it doesn't recognize.
I observed the string "=" being submitted URL-encoded in Chrome, and it returned a different hash from what I got locally with the latest pyskein. But when I submitted it with this curl command line (which no longer works), I got the expected hash:
curl -X POST --data-binary "hashable==" "http://almamater.xkcd.com/?edu=school.edu"
The Stanford code in another answer does the same thing, and they apparently had some success. I never got any random data to locally hash to a better score than even my own school, so I never got a chance to test thoroughly how to pass arbitrary data in properly. I don't know what the exact behavior was (e.g., perhaps if you omitted hashable= the server would detect that and just hash the whole POST body), but it may have intentionally been a little tricky as part of April Fool's.
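If you want to reproduce that raw-body POST from a script rather than curl, something like the following should work in principle (a sketch using Python's requests library; the endpoint is the one from the curl example above and, as noted, no longer works):

import requests

# Send the body verbatim so no form/URL encoding can mangle characters like "=".
resp = requests.post(
    "http://almamater.xkcd.com/?edu=school.edu",
    data=b"hashable==",
    headers={"Content-Type": "application/octet-stream"},
)
print(resp.status_code, resp.text[:200])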


Is there a standard or alternative for shorter UUIDs?

The UUID standard has several versions. Version 4, for example, is based on completely random input, but it still encodes the version and variant information, so only 122 of the possible 128 bits are random.
For transferring these via HTTP, it is more compact to encode them in Base64. There are libraries for this (https://github.com/skorokithakis/shortuuid).
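For example, a quick sketch of that encoding (shortuuid and similar libraries do something along these lines, typically with their own alphabet):

import base64
import uuid

# A UUIDv4 is 16 raw bytes; URL-safe Base64 turns the 36-character
# hex-and-dashes form into 22 characters carrying the same 128 bits.
u = uuid.uuid4()
short = base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")
print(u)       # e.g. 0d6f0e5e-... (36 characters)
print(short)   # 22 characters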
But what I am wondering: is there an alternative standard for shorter ID strings? Of course I could slap together a version byte + n random bytes and Base64-encode them, giving my own working 'short random ID scheme', but I wonder if there is an alternative that someone has already specified before I make up my own.
There is no standard for anything shorter.
Numerous folks have asked the same question and all come to the same conclusion: UUIDs are overkill for their specific requirements. And they developed their own alternatives, which essentially boil down to a random string of some length (based on the expected size of their dataset) in a more efficient encoding like base64. Many of them have published libraries for working with these strings. However, all of them suffer from the same two problems:
They cannot handle arbitrarily large datasets (the reason UUIDs are so big)
They are not standardized, which means there is no real interop.
If neither of these problems affect you, then feel free to pick any of the various schemes available or roll your own. But do consider the future costs if you discover you're wrong about your requirements and whether the cost of having to change justifies whatever minuscule advantage in space or complexity you get by not using a proven, universal system from the start.
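If you do end up rolling your own, such schemes usually boil down to something like the sketch below: N characters from a URL-safe alphabet drawn from a CSPRNG, with the length sized to your expected dataset (the alphabet and length here are arbitrary choices, not any standard).

import secrets
import string

ALPHABET = string.ascii_letters + string.digits + "-_"   # 64 symbols, URL-safe

def short_id(length=16):
    # 16 characters x 6 bits/character = 96 random bits
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(short_id())   # value differs on every run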
I have just found https://github.com/ai/nanoid
It is not really a 'standard', but at least it is not an arbitrary scheme that I would come up with myself. It is shorter through smarter encoding (a larger alphabet) and faster.
A quick and dirty alternative is mktemp, depending on your requirements for security, uniqueness and your access to a shell.
Use the form mktemp -u XXXXXXXX
-u: dry-run, don't create a file
XXXXXXXX is the format, in this case eight random characters
$ echo `mktemp -u XXXXXXXX`
KbbHRWYv
$ echo `mktemp -u XXXXXXXX`
UnyO2eH8
$ echo `mktemp -u XXXXXXXX`
g6NiTfvT

Reversing rand in perl 5.10.0, anyone know where to find the source code for rand/srand?

I am doing an assignment where I have a passwd file and I am to find all the passwords in it. Most of them were easy with John the Ripper and some tweaking, but the extra credit requires that I find an 8-byte alphanumeric password generated by rand in Perl 5.10.0 and encrypted with crypt.
I came up with three ways to approach this:
Brute force: 62^8 computations = 300 weeks on my machine. I could rent a server with 300 times my machine's power and do it in one week, but somehow that feels like a waste of resources/electricity for extra credit.
Break crypt: Not sure on this one. I have, however, generated a character set from the other passwords I found, reducing the incremental brute force to 5 days, but I think that will only work if this password contains only characters present in the previous ones (17 plaintexts), so maybe I'll get lucky! (Highly unlikely.)
Break rand: If I can find the seed used to generate the password, I can then generate dictionaries to feed to John. To get the seed from the file given to me, however, I have to understand how Perl creates the seed (and whether that is even possible on 5.10.0).
From what I have researched, on earlier Perl versions only the system time was used as a seed. I made a script that uses the mtime (seconds since the epoch) of the passwd file given to me (±10 seconds to be safe, although I'm sure the file was generated within one second) as the seed to generate a dictionary in this format, since I do not know at which call to rand() my password actually starts:
abcdefgh bcdefghi cdefhijk
I fed the dictionary to John. Of course this didn't work, because after Perl 5.004 Perl uses other things (the point of my question) to generate a seed.
So, my question is whether anyone knows where to find the source code Perl uses to generate the seed, and/or the source code for rand/srand. I was looking for something like this, but for version 5.10.0:
What are the weaknesses of Perl's srand() default seed, post version 5.004?
I tried using grep in the /lib/perl directory but I got lost in all the #define structure files.
Also, feel free to let me know if you think I am completely off track with the assignment, and/or give any advice on the matter.
You don't want to look in /lib/perl, you want to look in the Perl source.
Here is Perl_seed() in util.c as of v5.10.0, which is the function called if srand is called without an argument, or if rand is called without srand being called first.
As you can see, on a Unix system with random device support, it uses bytes from /dev/urandom to seed the RNG. On a system without such support, it uses a combination of the time (with microsecond resolution if possible), the PID of the Perl process, and memory locations of various data structures in the Perl interpreter.
In the urandom case, guessing the seed is effectively impossible. In the second case, it's still probably of similar difficulty to brute-forcing the passwords: you have about 20 bits of unpredictability from the microsecond timestamp, up to 16 bits from the PID, and an unknown amount from the memory addresses, probably between 0 and 20 bits if you know details of the system where it was run, but up to 64 or 96 bits if you have no knowledge at all.
I would say that attacking Perl's rand by guessing the seed is probably not practical, and reversing it from its output is probably not either, especially if it was run on a system with drand48. Have you considered a GPU-based brute-forcing tool?
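For what it's worth, on platforms where Perl's rand is backed by drand48, the generator itself is just a 48-bit linear congruential generator, so replaying it from a candidate seed is mechanically easy; the hard parts, as noted above, are guessing the seed and knowing how the password script mapped rand() output to characters. Here is a rough Python sketch of drand48 with the standard constants (the 8-character mapping is only a guess at what such a script might do, not the assignment's actual generator):

# drand48: X(n+1) = (0x5DEECE66D * X(n) + 0xB) mod 2^48,
# seeded by srand48(s) as X = (s << 16) | 0x330E.
A, C, M = 0x5DEECE66D, 0xB, 1 << 48
CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def drand48_stream(seed):
    x = ((seed & 0xFFFFFFFF) << 16) | 0x330E
    while True:
        x = (A * x + C) % M
        yield x / M                    # float in [0, 1), like drand48()

def candidate_password(seed, skip=0):
    rng = drand48_stream(seed)
    for _ in range(skip):              # unknown number of earlier rand() calls
        next(rng)
    return "".join(CHARS[int(next(rng) * len(CHARS))] for _ in range(8))

print(candidate_password(seed=1234567890))   # arbitrary example seed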

Should all implementations of SHA512 give the same Hash?

I am working on writing a SHA512 function. When I check the file I am hashing against different sources (the Linux sha512sum tool, a couple of websites, and the old source code I have for SHA512), they all give different hash values. My thought going into this project was that all implementations of a hash algorithm will output the same hash value if implemented correctly, so that it can be used as a checksum. Am I wrong in thinking this? If I am wrong, how would I check whether my work is correct?
Thanks in advance.
Yes, that's one of the basic building blocks of PKI: the same data block passed to a hash function should always return the same hash value.
Beware of the interpretation, though: the result of a SHA-2 (512) hash is a block of 512 bits, not a string value, so it will first be encoded for human consumption. It is therefore possible that you see what look like visually different results when it's simply a matter of different encodings being used.
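To make that concrete, the same 512-bit digest printed in two common encodings looks completely different, even though the underlying bytes are identical (a quick sketch with Python's hashlib):

import base64
import hashlib

# One SHA-512 digest, two human-readable encodings of the same 64 bytes.
digest = hashlib.sha512(b"hello world").digest()
print(digest.hex())                              # 128 hex characters
print(base64.b64encode(digest).decode("ascii"))  # 88 Base64 characters

Another frequent cause of mismatches is hashing slightly different input in the first place, e.g. a trailing newline added by echo or a text editor.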

Tool to compare/diff HTML in bulk

I have a lot of HTML files (tens of thousands, GBs worth) scraped from a server, and I want to check that the server produces the same results after some modifications, while ignoring the kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of numbers, etc.
Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.
(Oh, and it needs to run under Linux.)
You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program files (HTML is a special case), builds abstract syntax trees representing the essential structure of each file, and compares programs for similarity.
Because it is comparing essential program structure, it ignores inessential differences such as comments and whitespace, and determines that two code segments are either identical or that one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the web site.
In your case, what you would be looking for are files in system A which are essentially clones (exact or near misses) of files in system B. As a general rule, if file a is a variant of file b (e.g., with a few changes), CloneDR will report it as a clone and show the exact differences.
At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.
It doesn't run under Linux, but I assume your problem is hard enough to solve that this isn't what you are optimizing for.
I use WinMerge a lot on Windows, and from what I can see some people enjoy Meld on Linux, so perhaps that could do the trick for you:
http://meld.sourceforge.net/
Other examples I saw from a quick googling were Kompare, xxdiff.sourceforge.net, and kdiff3.sourceforge.net.
(I could only post one link, so I wrote the addresses of xxdiff and kdiff3 as text.)
Beyond Compare is paid software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI-based but handles thousands of files very well. It will allow you to specify unimportant changes with regular expressions as well as whitespace (at the beginning, middle and end of a line). The feature set is very extensive; check out a trial download.
I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!

Reliably getting a character count for .doc files

What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script, so counting characters is good enough for most cases.
The count does not necessarily need to match Word's, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.
I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?
Here's a link to some Linux word-to-text converters.
For example you could use
antiword file.doc | wc
to do the counting.
Edit:
This link shows that AbiWord has a command-line interface that you can use to convert .docx to .txt and then count the words using wc. AbiWord does support the .docx format.
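To make the failure case catchable, as the question requires, a small wrapper helps. Here is a rough Python sketch that shells out to antiword for .doc and reads word/document.xml directly for .docx (both paths assume reasonably well-formed files; anything else raises, which you can treat as "no reliable count"):

import subprocess
import xml.etree.ElementTree as ET
import zipfile

def extract_text(path):
    """Return the document text, raising on any failure so callers can
    treat the file as 'count unavailable'."""
    if path.lower().endswith(".docx"):
        # .docx is a zip archive; the body text lives in word/document.xml
        with zipfile.ZipFile(path) as z:
            return "".join(ET.fromstring(z.read("word/document.xml")).itertext())
    # .doc: let antiword do the conversion; check=True raises if it fails
    return subprocess.run(["antiword", path], capture_output=True,
                          check=True, text=True).stdout

def counts(path):
    text = extract_text(path)
    return len(text), len(text.split())

if __name__ == "__main__":
    import sys
    chars, words = counts(sys.argv[1])
    print(f"{chars} characters, {words} words")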
Mac OS X has support for reading Word files built into the system frameworks, so if you have that, it's easy. MacRuby sample:
NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil), language:nil)
More portably (though it gives up support for .docx), you could simply get Antiword and do antiword file.doc | wc -w.
Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.
If you consider online applications an acceptable solution, then yes, there is one.
This site, not so pretty design-wise, offers both word and character counts: http://allworldphone.com/count-words-characters.htm
I don't think there is a limit, and it shouldn't be a problem to just copy/paste the contents of your documents into the corresponding textarea and see the result.
Regarding the 100% or 99% accuracy, you could test it with a few short samples (e.g. 20-50 words) by counting them yourself first.
I hope this helps.
Regards, Chris