How do I encode Reddit Data using Fast Text (running into Key Error)? - fasttext

I'm running into a Key Error when I try to convert a column that contains Reddit data into Fast Text vectors. I've cleaned the Reddit text so that each row is a list with the words of the blog (all numbers & punctuation removed). Would appreciate any advice on how to get around this key error!
def abstract2vec(a, vecs):
return np.array([vecs[w] for w in a if w not in stop_words.ENGLISH_STOP_WORDS]).mean(axis=0)
EDIT: Yes, I believe this is because I'm trying to retrieve a word vector for a word that isn't in the mode's vocabulary. Does anyone have advice on how to change the function to address that?

Related

Protractor paste list of numbers into text field

I am currently trying to use protractor to upload multiple numbers eg, 23245, 23343, 34324 into a text field these numbers can be copied out of a excel spread sheet id column and pasted into the text field on the application. The application will then add the ids onto a table. Each id will have a new row. Does anyone know if this can be done. Currently I am only sending one number into the text field. But the text field can also receive multiple numbers...
I don't fully understand the question, could you provide more context? What are multicity numbers?
If I were to send multiple numbers to a text field, I would probably:
<element>.sendKeys("1 \n 2 \n 3 \n 4 \n");
Or something similar. Please provide more information about the question so I/we can help better.
N
NOTE: I ORIGINALLY HAD THE SLASHES THE WRONG WAY, IVE EDITED THE ABOVE SNIPPET TO NOW BE CORRECT, MY BAD!

number representing text string

A web form collects data on students in a band organization at school. The form data is fed into a google sheet that then populates a merge template and the merged forms are emailed to the recipient. A parent needs to print, sign and turn in the forms. There are hundreds of kids in this band and at registration time when the forms are turned in it is easier to sort all the papers in the stack if you have a short sort number in the corner... Volunteer kids don't apply alphabetization well. I'm trying to create a formula that will give me that sorting number to merge onto the header of each page of the PDF they receive after submitting the form. I want it based on last name and then first name and be able to create that number (in the google sheet) on the fly because the merging happens almost instantly when the user submits the form. Hence, an excel type formula is desired that will result in a number representing the kids name. I'd like for each number to be unique but some names are the same for the first few letters, also some names are only 2 characters long. I tried making A=10, B=11, z=35 etc. (so all are 2 digits) So, using only the first 3 characters, Bob Jones would = 192423112411 - hardly easy to sort the paper at a glance and it doesn't really differentiate between Bob Janes either. 4 digits is preferable. I also looked at =code() formula and it came out with long numbers too. Any advice is appreciated. Thanks!
Side note: What method do spreadsheets use to sort text? Do they weight the characters or what? Before I got the automerge thing to work I assigned each kid in the list a number higher than the one below and lower than above (on the sheet), then did the merge.
One option is to:
sort the name list alphabetically
add a sort number column, and put a =TEXT(row(),"0000") formula to generate a unique ID
on the merge spreadsheet, use a VLOOKUP function to retrieve the unique ID for that specific name.
First off, that wall of text was kind of hard to read through. Please try and do a little formatting so the people trying to help you can easily follow what you're trying to convey.
Personally I would suggest a hyphenated system. First initial of last name converted to a number, followed by a hyphen, followed by the first two letters of their first name converted to numbers.
Bob Jones becomes 11-1956 assuming you differentiate between upper and lower case, or 11-1924 if you convert everything to upper case, which I guess makes more sense.
You could use this VBA function to convert names to a system like that:
Function ConvertToIndex(strInput As String) As String
Dim strLast As String
Dim arrName() As String
Dim strFirst1 As String
Dim strFirst2 As String
arrName = Split(strInput, " ")
strLast = Mid(arrName(1), 1, 1)
strFirst1 = Mid(arrName(0), 1, 1)
strFirst2 = Mid(arrName(0), 2, 1)
ConvertToIndex = Asc(UCase(strLast)) - 55 & "-" & Asc(UCase(strFirst1)) - 55 & Asc(UCase(strFirst2)) - 55
'MsgBox ConvertToIndex
End Function
Thank you Tim, Nutsch and Mad Tech for your responses. I appreciate your input. Sorry the paragraph was so long, I get wordy. Because the members get their merged PDF sheet immediately after submitting I need the number to be based on the name as soon as it's entered, not after the fact; so I was looking for a formula that would reside in the sheet. Interesting VBA function too though. I'll settle for numbering them afterwards, maybe when the sheets are turned in. By then I'll know all who are in the band and can assign numbers like before. Thanks again!

Identifying hidden characters in text

I have an ETL process that regularly extracts code from an ODBC data source, manipulates it, and inserts it into my postgres database. One of the columns from this data source regularly has odd characters in it.
For the most part I can catch and convert all of the characters appropriately, but I have one character that exists in the ODBC data source, cannot be brought into postgres (all of the text after that character gets truncated), and I'm having a hard time identifying what the character is.
I can't even insert an example of the character directly into this post because it gets stripped out :/ The closest I can get is a screen shot of the character in textmate (the only application I can actually see the character in):
There character is the diamond between the 1 and 0. When my data comes in, everything after the 0 is truncated.
Is there a good way of identifying what this character is so I can figure out a way of stripping it out?
Per tripleee's comment on the original question post:
To identify the character I grabbed the hex value of the text to identify the hex value of the offending character in question.
There are a number of ways to do this, but the quickest way for me was to use a utility application I have called HexFiend so dump the text into. Once the text was in and I highlighted the character it returned the hex value "00".
A bit more investigation pointed towards the hex null value being used as a line terminator in C applications (which makes sense given the context of my project).
I've fit this null value into my ETL process so that it gets switched out with a new line and now everything is sunshine and daises.
Thanks again for the help!

Building an ngram frequency table and dealing with multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, where I then calculate the difference between a given text and a known corpora.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ascii characters, but I would very much like to have it working with unicode multibyte support. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are made of two bytes, there appear 2 ngrams where the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a new key from a map returns the zero value (which for int is 0), which means the unknown key check in your code is redundant.
func Parse(text string, n int) map[string]int {
chars := make([]rune, 2 * n)
table := make(map[string]int)
k := 0
for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
chars[n + k] = chars[k]
k = (k + 1) % n
table[string(chars[k:k+n])]++
}
return table
}

How can I generate this hash?

I'm new to programming (just started!) and have hit a wall recently. I am making a fansite for World of Warcraft, and I want to link to a popular site (wowhead.com). The following page shows what I'm trying to figure out: http://www.wowhead.com/?talent#ozxZ0xfcRMhuVurhstVhc0c
From what I understand, the "ozxZ0xfcRMhuVurhstVhc0c" part of the link is a hash. It contains all the information about that particular talent spec on the page, and changes whenever I add or remove points into a talent. I want to be able to recreate this part, so that I can then link my users directly to wowhead to view their talent trees, but I havn't the foggiest idea how to do that. Can anyone provide some guidance?
The first character indicates the class:
0 Druid
c Hunter
o Mage
s Paladin
b Priest
f Rogue
h Shaman
I Warlock
L Warrior
j Death Knight
The remaining characters indicate where in each tree points have been allocated. Each tree is separate, delimited by 'Z'. So if e.g. all the points are in the third tree, then the 2nd and 3rd characters will be "ZZ" indicating "end of first tree" and "end of second tree".
To generate the code for a given tree, split the talents up into pairs, going left-to-right and top-to-bottom. Each pair of talents is represented by a single character. So for example, in the DK's Blood tree segment, the first character will indicate the number of points allocated to Butchery and Subversion, and the second character will stand for Blade Barrier and Bladed Armor.
What character represents each allocation among the pair? I'm sure there's an algorithm, probably based on the ASCII character set, but all I've worked out so far is this lookup table. Find the number of points in the first talent along the top, and the number of points in the second talent along the left side. The encoded character is at the intersection.
0 1 2 3 4 5
0 0 o b h L x
1 z k d u p t
2 M R r G T g
3 c s f I j e
4 m a w N n v
5 V q i A y E
So if our Death Knight has one point in Butchery and two points in Subversion, the first character is 'R'. If instead we put no points in those two and five in Blade Barrier, the first two characters will be "0x". Trailing '0's (all the other pairs in the tree with no points allocated) can be omitted, as can trailing 'Z' delimiters (when there are no points in the subsequent trees). For one final example, the entire code for a DK with just a single point in Toughness would be "jZ0o": "Death Knight", "End of the first tree", "No points in the first pair of talents", "one point in the first talent of the second pair".
Can anyone work out what function generates the lookup table above? There's probably a clue in the codes for the classes: in alphabetical order (except for the DK which was added to the game after the others), they correspond to a series in the lookup table of (0,0), (0,3), (1,0), (1,3), (2,0), etc.
If you go to http://www.wowhead.com/?talent and start using the talent tree you can see the mysterious code being built up in the address bar as you click on the various boxes. So it's definitely not a hash but some kind of encoded structure data.
As the code is built up as you click the logic for building the code will be in the JavaScript on that page.
So my advice is do a view source on the page, download the JavaScript files and have a look at them.
I think it isn't a hash value, because hash values are normally one-ways values. This means you cannot (easily) restore the original information from which the hash code was generated.
Best thing would be to contact someone from wowhead.com and ask them how to interpret this information. I am sure they will help you out with some information about what type of encoding they use for the parameters. But without any help of the developers from wowhead.com it is almost impossible to figure out what information is encoded into this parameter.
I am not even sure the parameter you mentioned contains the talents of your character. Maybe it's just a session id or something like that. Take a look into the post data your browser sends to the server, it may contain a hidden field with the value you are searching for (you can use Tamper Data Firefox Addon).
I don't think ozxZ0xfcRMhuVurhstVhc0c is a hash value. I think it is a key (probably encrypted/encoded in some way). The server uses this key to retrieve information from it database. Since you don't have access to the database you don't know which key is needed, let alone how to encode it.
You need the original function that generates the hash.
I don't think that's public though :(
Check this out: hash wikipedia
Good luck learning how to program!
These hashes are hard to 'reverse engineer' unless you know how it was generated.
For example, it could be:
s1 = "random_string-" + score;
hash = encrypt(s1)
...etc
so it is hard to get the original data back from the hash (that is the whole point anyway).
your best bet would be link to the profile that would have the latest score ..etc