How can I generate this hash? - hash

I'm new to programming (just started!) and have hit a wall recently. I am making a fansite for World of Warcraft, and I want to link to a popular site (wowhead.com). The following page shows what I'm trying to figure out: http://www.wowhead.com/?talent#ozxZ0xfcRMhuVurhstVhc0c
From what I understand, the "ozxZ0xfcRMhuVurhstVhc0c" part of the link is a hash. It contains all the information about that particular talent spec on the page, and changes whenever I add or remove points in a talent. I want to be able to recreate this part, so that I can link my users directly to wowhead to view their talent trees, but I haven't the foggiest idea how to do that. Can anyone provide some guidance?

The first character indicates the class:
0 Druid
c Hunter
o Mage
s Paladin
b Priest
f Rogue
h Shaman
I Warlock
L Warrior
j Death Knight
The remaining characters indicate where in each tree points have been allocated. Each tree is separate, delimited by 'Z'. So if e.g. all the points are in the third tree, then the 2nd and 3rd characters will be "ZZ" indicating "end of first tree" and "end of second tree".
To generate the code for a given tree, split the talents up into pairs, going left-to-right and top-to-bottom. Each pair of talents is represented by a single character. So for example, in the DK's Blood tree segment, the first character will indicate the number of points allocated to Butchery and Subversion, and the second character will stand for Blade Barrier and Bladed Armor.
What character represents each allocation among the pair? I'm sure there's an algorithm, probably based on the ASCII character set, but all I've worked out so far is this lookup table. Find the number of points in the first talent along the top, and the number of points in the second talent along the left side. The encoded character is at the intersection.
    0 1 2 3 4 5
0   0 o b h L x
1   z k d u p t
2   M R r G T g
3   c s f I j e
4   m a w N n v
5   V q i A y E
So if our Death Knight has one point in Butchery and two points in Subversion, the first character is 'R'. If instead we put no points in those two and five in Blade Barrier, the first two characters will be "0x". Trailing '0's (all the other pairs in the tree with no points allocated) can be omitted, as can trailing 'Z' delimiters (when there are no points in the subsequent trees). For one final example, the entire code for a DK with just a single point in Toughness would be "jZ0o": "Death Knight", "End of the first tree", "No points in the first pair of talents", "one point in the first talent of the second pair".
Can anyone work out what function generates the lookup table above? There's probably a clue in the codes for the classes: in alphabetical order (except for the DK which was added to the game after the others), they correspond to a series in the lookup table of (0,0), (0,3), (1,0), (1,3), (2,0), etc.
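The lookup table and rules above can be turned into a small encoder. The following is only a sketch in Python based on this answer's description, not Wowhead's actual JavaScript: the names ROWS, encode_tree and encode_build are made up for illustration, and padding an odd-length tree with an empty talent is an assumption.

```python
# Rows copied from the lookup table above, indexed as ROWS[second][first].
ROWS = ["0obhLx", "zkdupt", "MRrGTg", "csfIje", "mawNnv", "VqiAyE"]


def encode_tree(points):
    """Encode one tree; points are listed left-to-right, top-to-bottom."""
    points = list(points)
    if len(points) % 2:
        points.append(0)  # assumption: an unpaired last talent is paired with 0
    pairs = zip(points[0::2], points[1::2])
    code = "".join(ROWS[second][first] for first, second in pairs)
    return code.rstrip("0")  # trailing empty pairs can be omitted


def encode_build(class_char, trees):
    """Join the trees with 'Z' and drop trailing 'Z' delimiters."""
    body = "Z".join(encode_tree(tree) for tree in trees)
    return class_char + body.rstrip("Z")


# One point in Butchery, two in Subversion.
print(encode_tree([1, 2]))                          # R
# A Death Knight with a single point in the third talent of the second tree.
print(encode_build("j", [[], [0, 0, 1, 0], []]))    # jZ0o
```

Decoding would reverse the lookup: find the character in ROWS and read off the row and column.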

If you go to http://www.wowhead.com/?talent and start using the talent tree, you can see the mysterious code being built up in the address bar as you click on the various boxes. So it's definitely not a hash but some kind of encoded structured data.
Since the code is built up as you click, the logic for building it will be in the JavaScript on that page.
So my advice is to view the page source, download the JavaScript files and have a look at them.

I don't think it is a hash value, because hash values are normally one-way values. This means you cannot (easily) restore the original information from which the hash code was generated.
Best thing would be to contact someone from wowhead.com and ask them how to interpret this information. I am sure they will help you out with some information about what type of encoding they use for the parameters. But without any help of the developers from wowhead.com it is almost impossible to figure out what information is encoded into this parameter.
I am not even sure the parameter you mentioned contains the talents of your character. Maybe it's just a session id or something like that. Take a look into the post data your browser sends to the server, it may contain a hidden field with the value you are searching for (you can use Tamper Data Firefox Addon).

I don't think ozxZ0xfcRMhuVurhstVhc0c is a hash value. I think it is a key (probably encrypted/encoded in some way). The server uses this key to retrieve information from its database. Since you don't have access to the database, you don't know which key is needed, let alone how to encode it.

You need the original function that generates the hash.
I don't think that's public though :(
Check this out: hash wikipedia
Good luck learning how to program!

These hashes are hard to 'reverse engineer' unless you know how they were generated.
For example, it could be:
s1 = "random_string-" + score;
hash = encrypt(s1)
...etc
So it is hard to get the original data back from the hash (that is the whole point anyway).
Your best bet would be to link to the profile that would have the latest score ..etc

I can't understand the behaviour of btrim()

I'm currently working with PostgreSQL and learned about the btrim function. I checked many websites for an explanation, but I don't really understand it.
Here they mention this example:
btrim('xyxtrimyyx', 'xyz')
It gives trim.
When I try this example:
btrim('xyxtrimyyx', 'yzz')
or
btrim('xyxtrimyyx', 'y')
I get this: xyxtrimyyx
I don't understand this. Why didn't it remove the y?
From the docs you point to, the definition says:
Remove the longest string consisting only of characters in characters
(a space by default) from the start and end of string
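Your examples leave the string unchanged because btrim only strips from the two ends of the string, and on each side it stops at the first character that is not in the given set. 'xyxtrimyyx' starts and ends with x, which is not in 'y' (or 'yzz'), so nothing is removed from either side and the y characters further in are never reached.
Let's take a look at the first example (from the docs):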
btrim('xyxtrimyyx', 'xyz')
This returns trim, because it goes through xyxtrimyyx and gets up to the t and doesn't see that letter in xyz, so that is where the function stops stripping from the front.
We are now left with trimyyx
Now we do the same, but from the end of the string.
While one of xyz is the last letter, remove that letter.
We do this until m, so we are left with trim.
Note: I have never worked with any form of SQL. I could be wrong about the exact way that PostgreSQL does this, but I am fairly certain from the docs that this is how it is done.
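The walk-through above can be mimicked with a few lines of Python. This is a sketch of the behaviour described in the docs quote, not PostgreSQL's implementation; btrim_like is a made-up name. Python's built-in str.strip(chars) follows the same "set of characters, both ends" rule, so it serves as a cross-check.

```python
def btrim_like(s, chars=" "):
    """Strip from both ends while the end character is in the set `chars`."""
    start, end = 0, len(s)
    # Advance from the front until a character outside the set is found.
    while start < end and s[start] in chars:
        start += 1
    # Retreat from the back in the same way.
    while end > start and s[end - 1] in chars:
        end -= 1
    return s[start:end]


print(btrim_like("xyxtrimyyx", "xyz"))  # trim
print(btrim_like("xyxtrimyyx", "y"))    # xyxtrimyyx (starts and ends with x)
print("xyxtrimyyx".strip("xyz"))        # trim
```

The second argument is treated as a set of characters, not as a substring, which is why 'yzz' and 'y' behave identically.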

number representing text string

A web form collects data on students in a band organization at school. The form data is fed into a Google Sheet that populates a merge template, and the merged forms are emailed to the recipient. A parent needs to print, sign and turn in the forms.
There are hundreds of kids in this band, and at registration time, when the forms are turned in, it is easier to sort all the papers in the stack if there is a short sort number in the corner... Volunteer kids don't apply alphabetization well.
I'm trying to create a formula that will give me that sorting number to merge onto the header of each page of the PDF they receive after submitting the form. I want it based on last name and then first name, and I need to create that number (in the Google Sheet) on the fly, because the merging happens almost instantly when the user submits the form. Hence, an Excel-type formula is desired that results in a number representing the kid's name.
I'd like each number to be unique, but some names are the same for the first few letters, and some names are only 2 characters long. I tried making A=10, B=11, Z=35 etc. (so all are 2 digits). Using only the first 3 characters of each name, Bob Jones would be 192423112411, which is hardly easy to sort paper by at a glance, and it doesn't really differentiate him from Bob Janes either. 4 digits is preferable. I also looked at the =CODE() formula, and it came out with long numbers too. Any advice is appreciated. Thanks!
Side note: What method do spreadsheets use to sort text? Do they weight the characters or what? Before I got the automerge thing to work I assigned each kid in the list a number higher than the one below and lower than above (on the sheet), then did the merge.
One option is to:
sort the name list alphabetically
add a sort number column, and put a =TEXT(row(),"0000") formula to generate a unique ID
on the merge spreadsheet, use a VLOOKUP function to retrieve the unique ID for that specific name.
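The three steps above can be sketched outside the spreadsheet to show the idea. This is an illustration in Python with a made-up roster and a hypothetical helper name; in the sheet itself, the sorted column, the =TEXT(row(),"0000") formula and VLOOKUP play these roles.

```python
def assign_sort_ids(names):
    """Map each (last, first) name to a zero-padded ID in alphabetical order."""
    ordered = sorted(set(names), key=lambda n: (n[0].lower(), n[1].lower()))
    return {name: f"{position:04d}" for position, name in enumerate(ordered, start=1)}


roster = [("Jones", "Bob"), ("Janes", "Bob"), ("Li", "Al")]
ids = assign_sort_ids(roster)
print(ids[("Janes", "Bob")])  # 0001
print(ids[("Jones", "Bob")])  # 0002
```

The IDs depend on the whole list, so they shift whenever a new name is inserted ahead of existing ones; the numbering is only stable once the roster is complete.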
First off, that wall of text was kind of hard to read through. Please try and do a little formatting so the people trying to help you can easily follow what you're trying to convey.
Personally I would suggest a hyphenated system. First initial of last name converted to a number, followed by a hyphen, followed by the first two letters of their first name converted to numbers.
Bob Jones becomes 19-1156 if you differentiate between upper and lower case (the lowercase o maps to 56), or 19-1124 if you convert everything to upper case, which I guess makes more sense.
You could use this VBA function to convert names to a system like that:
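Function ConvertToIndex(strInput As String) As String
    ' Expects "First Last"; builds <last initial>-<first two letters of first name>, with A=10 ... Z=35.
    Dim strLast As String
    Dim arrName() As String
    Dim strFirst1 As String
    Dim strFirst2 As String
    arrName = Split(strInput, " ")
    strLast = Mid(arrName(1), 1, 1)     ' first letter of the last name
    strFirst1 = Mid(arrName(0), 1, 1)   ' first letter of the first name
    strFirst2 = Mid(arrName(0), 2, 1)   ' second letter of the first name
    ' Asc("A") is 65, so subtracting 55 maps A..Z to 10..35
    ConvertToIndex = Asc(UCase(strLast)) - 55 & "-" & Asc(UCase(strFirst1)) - 55 & Asc(UCase(strFirst2)) - 55
    'MsgBox ConvertToIndex
End Function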
Thank you Tim, Nutsch and Mad Tech for your responses. I appreciate your input. Sorry the paragraph was so long, I get wordy. Because the members get their merged PDF sheet immediately after submitting I need the number to be based on the name as soon as it's entered, not after the fact; so I was looking for a formula that would reside in the sheet. Interesting VBA function too though. I'll settle for numbering them afterwards, maybe when the sheets are turned in. By then I'll know all who are in the band and can assign numbers like before. Thanks again!

Building an ngram frequency table and dealing with multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, and then calculate the difference between a given text and a known corpus.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ASCII characters, but I would very much like to have it working with Unicode multibyte support. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are made of two bytes, there appear 2 ngrams where the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a new key from a map returns the zero value (which for int is 0), which means the unknown key check in your code is redundant.
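// Parse builds a frequency table of n-rune ngrams (requires import "strings").
func Parse(text string, n int) map[string]int {
    // The buffer stores the last n runes twice, so chars[k:k+n] is always a contiguous window.
    chars := make([]rune, 2*n)
    table := make(map[string]int)
    k := 0
    // Collapse runs of whitespace, append a trailing space, and iterate rune by rune.
    for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
        chars[n+k] = chars[k]
        k = (k + 1) % n
        // Until n runes have been read, the window is left-padded with zero-value runes.
        table[string(chars[k:k+n])]++
    }
    return table
}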
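For cross-checking the Go output against the original Python prototype, the same idea can be sketched in Python, where strings are indexed by code point rather than by byte. The function name ngram_table is made up here, and unlike the Go version this sketch does not emit the zero-padded windows at the start of the text.

```python
from collections import Counter


def ngram_table(text, n):
    """Count n-character windows over whitespace-normalised text."""
    # Same normalisation as the Go code: collapse whitespace, add a trailing space.
    normalized = " ".join(text.split()) + " "
    # Slicing a str works on code points, so multibyte characters stay whole.
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


table = ngram_table("für  über", 2)
print(table["ür"])               # 1
print(table["r "])               # 2
print(len("ü".encode("utf-8")))  # 2 bytes, but a single code point
```

Iterating over bytes instead of code points is what cuts such characters in half, which is the symptom described in the question.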

Encoding that minimizes misreading / mistyping / misspeaking?

Let's say you have a system in which a fairly long key value can be accurately communicated to a user on-screen, via email or via paper; but the user needs to be able to communicate the key back to you accurately by reading it over the phone, or by reading it and typing it back into some other interface.
What is a "good" way to encode the key to make reading / hearing / typing it easy & accurate?
This could be an invoice number, a document ID, a transaction ID or some other abstract value. Let's say for the sake of this discussion the underlying key value is a big number, say 40 digits in base 10.
Some thoughts:
Shorter keys are generally better
a 40-digit base 10 value may not fit in the space given, and is easy to get lost in the middle of
the same value could be represented in base 16 in 33-34 digits
the same value could be represented in base 36 in 26 digits
the same value could be represented in base 64 in 22-23 digits
Characters that can't be visually confused with each other are better
e.g. an encoding that includes both O (oh) and 0 (zero), or S (ess) and 5 (five), could be bad
This issue depends on the font / face used to display the key, which you may be able to control in some cases (like printing on paper) but can't control in others (like web pages and email).
Also depends on whether you can control the exclusive use of upper and / or lower case -- e.g. capital D (dee) may look like O (oh) but lower case d (dee) would not; while lower case l (ell) looks like a 1 (one) while capital L (ell) would not. (With exceptions for especially exotic fonts / faces).
Characters that can't be verbally / aurally confused with each other are better
e.g. A (ay) and 8 (eight) sound alike
B (bee), C (cee), D (dee), E (ee), G (gee), P (pee), T (tee), V (vee), Z (zee) and 3 (three) all rhyme
This issue depends on the audio quality of the end-to-end channel -- bigger challenge if the expected user base could have a speech impediment, or may have to speak through a gas mask, or the communication channel could include CB radios or choppy VOIP phone systems.
Adding a check digit or two would detect errors but not help resolve errors.
An alpha - bravo - charlie - delta type dialog can help with hearing errors, but not reading errors.
Possible choices of encoding:
Base 64 -- compact, but too many hard-to-verbalize characters (underscore, dash etc.)
Base 34 -- 0-9 and A-Z but with O (oh) and I (aye) left out as the easiest to confuse with digits
Base 32 -- same as base 34 but leave out the 0 (zero) and 1 (one) as well
Is there a generally recognized encoding that is a reasonable solution for this scenario?
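To make the "Base 32" option above concrete, here is a minimal sketch in Python. The alphabet follows the description in the list (digits and capital letters with 0, 1, O and I left out); the ordering of the symbols and the function names are assumptions for illustration, not an established standard.

```python
# 8 digits (2-9) + 24 letters (A-Z without I and O) = 32 symbols.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"


def encode(value):
    """Encode a non-negative integer using the reduced alphabet."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value:
        value, remainder = divmod(value, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(text):
    """Decode a key; lower-case input is accepted since case carries no information."""
    value = 0
    for ch in text.upper():
        value = value * len(ALPHABET) + ALPHABET.index(ch)  # ValueError on bad input
    return value


key = 10**40 - 1  # the largest 40-digit decimal value
print(len(encode(key)))            # 27
print(decode(encode(key)) == key)  # True
```

A 40-digit decimal key therefore needs 27 symbols in this alphabet, one more than base 36, in exchange for avoiding the four most confusable characters. A check digit would have to be layered on top.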
When I first heard of it, I liked the article A Proposal for Proquints: Identifiers that are Readable, Spellable, and Pronounceable. It encodes data as a sequence of consonants and vowels. It's tied to the English language though. (In German, for example, f and v sound the same, so they should not both be used.) But I like the general idea.

hash function to index similar text

I'm looking for a kind of hash function for indexing similar texts. For example, if we have two very long texts, A and B, that differ only slightly, then the hash function H applied to A and B should return the same number.
So H(A) = H(B) where A and B are similar texts.
I tried DoubleMetaphone (my texts are in Italian), but I saw that it depends very strongly on the string prefix. For example:
A = "This is the very long text that I want to hash"
B = "This is the very"
==> doubleMetaPhone(A) = doubleMetaPhone(B)
This is not good for me, because strings with the same prefix would be treated as similar, and I don't want that.
Could anyone suggest another way?
see http://en.wikipedia.org/wiki/Locality_sensitive_hashing
Your problem is (close to) insoluble for many distance functions between strings.
Most distance functions (e.g. edit distance) allow you to transform a string into another string via a sequence of 1-distance transformations:
"AAAA" -> "AAAB" -> "AAABC"
According to your requirements, the first and second strings should have the same hash value. But so must the second and the third, and so on. So if we allow any pair at distance 1 to share a hash value, all strings end up with the same hash.
Even if we impose a higher threshold on the distance (maybe in relation to string length), we'll end up with a messy result.
A better (IMO) approach is to find an equivalence relation on the set of strings, such that each string in each equivalence class has the same hash. A possibility is to define classes by their distance to a predefined string (e.g. edit distance from "AAAAA"), and the distance itself would be the hash value. Probably this approach would not be the best in your case, but maybe with some extra info on the problem we can come up with a better equivalence relation.
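The "distance to a predefined string" idea can be sketched in a few lines of Python. This uses a textbook Levenshtein edit distance; the function names and the reference string "AAAAA" (taken from the example above) are illustrative choices, not a recommendation for real text.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


REFERENCE = "AAAAA"


def distance_hash(text):
    """Hash = edit distance to the fixed reference string."""
    return edit_distance(text, REFERENCE)


print(distance_hash("AAAA"))   # 1
print(distance_hash("AAAB"))   # 2
print(distance_hash("AAABC"))  # 2
```

Here "AAAB" and "AAABC" share a bucket while "AAAA" lands in a neighbouring one. Very different strings can also collide ("BBBBB" and "CCCCC" both hash to 5), which is the weakness hinted at above.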