What is "dont" and "isnt" in the pertained GloVe vector files (e.g. glove.6B.50d.txt)? - neural-network

I found these two words, "dont" and "isnt", in the vector file glove.6B.50d.txt downloaded from https://nlp.stanford.edu/projects/glove/. I wonder if they were originally "don't" and "isn't". This likely depends on the sentence_to_word parsing algorithm they used. If someone is familiar with it, please confirm whether this is the case.
A secondary question is whether this is a common way to deal with apostrophes in words like "don't", "isn't", "hasn't", and so on, i.e. simply replacing the apostrophe with an empty string so that "don" and "t" become one word.
Finally, I am also not sure whether GloVe comes with an API to do sentence_to_word parsing, so you can be consistent with what the researchers did originally.

I think "dont" and "isnt" really were originally "don't" and "isn't". I have seen a few other such examples, so I suspect this is simply the specific way the GloVe researchers handled apostrophes.
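If you want to check this empirically, a quick scan of the file's vocabulary will show which variants are actually present. Here is a minimal Python sketch; the file path is from the question, and the apostrophe-stripping normalize step is an assumption mirroring the question's guess, not GloVe's documented tokenization.

def load_glove_vocab(path="glove.6B.50d.txt"):
    # First whitespace-separated token on each line is the word itself
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            vocab.add(line.split(" ", 1)[0])
    return vocab

vocab = load_glove_vocab()
for w in ["don't", "dont", "isn't", "isnt", "don", "t"]:
    print(w, w in vocab)

# One consistent preprocessing step, IF the stripped forms are what the
# file contains: drop apostrophes before lookup.
def normalize(token):
    return token.replace("'", "").lower()

Whatever the answer turns out to be, applying the same normalization to your own input before lookup keeps you consistent with the vocabulary you actually have.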

Related

Is it possible to write a quine in Ook!

According to this comment on the general question "Is it possible to create a quine in every Turing-complete language?", it seems the answer is yes.
However, I haven't found any Ook! quine on the internet.
Do you think it's really possible?
And if so, will we be able to find one?
It wouldn't even be very difficult. You would want to code it in brainfuck and then translate, and the internal representation for each command should be a pair of numbers (probably from 0-2) to represent the punctuation of each half-command. You could borrow much of the structure from Erik Bosman's brainfuck quine.
Update: here it is: https://gist.github.com/danielcristofani/1fe53487df1f7afcb5b91c06d95184b2
This is ~40 commands taken directly from Erik Bosman's quine, another ~120 freshly written commands of rather clunky output code to handle Ook!'s verbosity, and then the data segment to represent all that.
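For reference, the brainfuck-to-Ook! translation step the answer describes is mechanical. Here is a minimal Python sketch using the standard Ook! command mapping; note the translator alone does not produce a quine, since the quine must also encode its own Ook! text in its data segment.

# Standard brainfuck -> Ook! mapping: each brainfuck command becomes a pair
# of half-commands drawn from three punctuation marks (., ?, !), which is why
# a pair of numbers 0-2 per command is a natural internal representation.
BF_TO_OOK = {
    ">": "Ook. Ook?", "<": "Ook? Ook.",
    "+": "Ook. Ook.", "-": "Ook! Ook!",
    ".": "Ook! Ook.", ",": "Ook. Ook!",
    "[": "Ook! Ook?", "]": "Ook? Ook!",
}

def bf_to_ook(program):
    # Non-command characters (brainfuck comments) are simply dropped
    return " ".join(BF_TO_OOK[c] for c in program if c in BF_TO_OOK)

print(bf_to_ook("++."))  # Ook. Ook. Ook. Ook. Ook! Ook.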

How to find literals in source code of Smartforms and in SAPScripts (or reports, if the others can't be done)

I'd like to check hardcoded values in (a lot of) Smartforms and SAPScript forms.
I have found a way to read the source code of both of these, but it seems that I will have to do a lot of parsing before I get anything reliable.
I've come across the function module GET_LITERAL, but that doesn't seem to help me much, since I have to specify the offset of the value, if I understood correctly what the function does in the first place.
I also found RS_LITERAL_LIST, but that also doesn't do what I expect.
I also tried searching for reports and methods, but haven't found anything that seemed to help.
A backup plan would be to find a good parsing tool, so do you know of anything like that?
Anyway, any hints would be helpful and appreciated.
[EDIT]
Forgot to mention: my system version is 4.6C.
If you have a fairly recent version of ABAP, you can use a regex.
Follow the pattern of this example, but use your source as the text and create your own regex. Have it look for single quotes at either end of a word, separated by spaces, or any integers with spaces on either side. That's just a start; you may need to work out a better pattern.
String functions count, find, and match
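To illustrate the kind of pattern the answer means, here is a sketch in Python rather than ABAP (the regex and the sample line are my assumptions; the same pattern idea carries over to ABAP's FIND ... REGEX on your form's source lines).

import re

# Matches ABAP-style single-quoted literals ('' is an escaped quote inside)
# and standalone integers surrounded by whitespace.
literal_re = re.compile(r"'(?:[^']|'')*'|(?<=\s)\d+(?=\s)")

source_line = "IF lv_status = 'OPEN' AND lv_count > 10 ."
print(literal_re.findall(source_line))  # ["'OPEN'", '10']

Running something like this over every source line you extract from the Smartforms/SAPScript would give you a first-pass list of hardcoded values to review.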

Perl module for text comparison

Can anyone suggest a Perl module which can compare two strings and return a degree to which they match? I searched CPAN extensively, and although there are similar modules like String::Approx and Data::Compare, they are not what I am looking for. Suppose I have two strings: "I love you" and "I boht you". I want functionality that will compare these two strings, taking into account numerous parameters: the matching of words in the correct order ("love" as the first word in one string should not "match" "love" as the 4th word in the other, even though both strings contain that word), words not matching but spelt almost similarly (say "love" and "loge"), the number of words, etc., and return an index, say a number from 0 to 1, representing the degree of similarity between the two strings. Is there any such Perl module?
There are many such modules. Often, though, you'll have to make use of them in some special way to account for your own assumptions. Most of the string comparison tools like this just implement some algorithm for comparing one string to another. Most assume that if you have specific policy decisions to make, you'll code them yourself.
Personally, I am not sure I'd recommend Text::Levenshtein because of bugs and lack of UTF-8 support. I don't have a better recommendation either, though.
However, these searches will reveal lots of potential modules you could look into and determine what works best for your purpose (based on the names of common algorithms for doing this sort of thing):
https://metacpan.org/search?q=levenshtein
https://metacpan.org/search?q=wagner+fischer
https://metacpan.org/search?q=edit+distance
If you're interested in spoken similarities, you can also look into phonetic comparisons:
https://metacpan.org/search?q=phonetic
https://metacpan.org/search?q=soundex
https://metacpan.org/search?q=metaphone
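To make the scoring idea concrete, here is a minimal Python sketch of the edit-distance approach those modules implement, normalized to a 0-1 similarity. This illustrates the algorithm, not any particular module's API.

# Levenshtein distance: count insertions, deletions, and substitutions
# needed to turn one string into the other, using a rolling DP row.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Normalize to 0..1: identical strings score 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

print(similarity("I love you", "I boht you"))  # 0.7

Running it per word (after aligning word positions) rather than over the whole sentence gets you closer to the "words in correct order" policy the question asks for; that policy layer is exactly the part the modules leave to you.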

What's a good package for phonetic representation for various human languages?

I'm currently working on a project for which I think being able to come up with phonetic representations of words in various languages would be really helpful. I know Aspell does this pretty well, but I don't think there's a very easy way to get at their phonetic representations, so I ask: is there some other good package for getting the phonetic representation of a word given the word and the language/dialect/accent/whatever it's coming from?
This doesn't need to be in any particular language, but if it were Perl, that would be best.
I've already tried Soundex, Metaphone, DoubleMetaphone, and everything else in Text::Phonetic, and none of that stuff was very good – definitely nowhere near as good as the stuff in Aspell.
The first thing that springs to mind is Soundex. Of course, there is a Perl module, Soundex, too. While it is designed to generate a soundex "key" from input, it might be useful for mapping different variants to a common key.
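To show what such a key looks like, here is a minimal American Soundex sketch in Python. It is an illustration of the algorithm only; the Perl Soundex module handles more edge cases.

# Letters map to digit classes; vowels, H, W, and Y get no code.
CODES = {c: str(d) for d, letters in enumerate(
    ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], 1) for c in letters}

def soundex(word):
    word = word.upper()
    digits = [CODES.get(c, "0") for c in word]
    out = word[0]           # key always starts with the first letter
    prev = digits[0]
    for c, d in zip(word[1:], digits[1:]):
        if d != "0" and d != prev:
            out += d        # new code, not a run of the same consonant class
        if c not in "HW":   # H and W don't separate duplicate codes
            prev = d
    return (out + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163

Different spellings of the same-sounding name collapsing to one key (R163 above) is exactly the "common key" behavior mentioned.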
There is a package Text::Aspell on CPAN. It might be useful.
If you are trying to build a Google-style suggestion/correction system, note that it's not based just on phonetics or AI, but on a massive amount of user input. When a user makes a search, doesn't click on any link, but corrects the input and searches again, that gives Google far more data about "correct" writing than any phonetics test or dictionary matching could.
The main problem is human language itself: people don't speak or write in a deterministic way, let alone across multiple languages.
Of course, I might be wrong, but if you need a library that lets you do this:
getLanguage(string);
I'd really like to see that working.

Reliably getting a character count for .doc files

What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script; since word boundaries are unreliable there, counting characters is good enough for most cases.
The count does not necessarily need to match Word's, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.
I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?
Here's a link to some Linux Word-to-text converters.
For example you could use
antiword file.doc | wc
to do the counting.
Edit:
This link shows that AbiWord has a command-line interface that you could use to convert .docx to .txt and then count the words using wc. AbiWord does support the .docx format.
Mac OS X has support for reading Word files built into the system frameworks, so if you have that, it's easy. MacRuby sample (note the .string call to get a plain string out of the attributed string, since countWordsInString expects an NSString):
NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil).string, language:nil)
More portably — though it gives up support for docx — you could simply get Antiword and do antiword | wc -w.
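To address the "catchable failure" requirement from the question, a thin wrapper around antiword can flag files it cannot parse. Here is a minimal Python sketch; it assumes antiword is on PATH, and note that antiword's layout whitespace means the counts won't exactly match Word's.

import subprocess

def doc_counts(path):
    # antiword exits non-zero on files it cannot parse, so failure is detectable
    result = subprocess.run(["antiword", path],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Caller now knows no reliable count is available for this file
        raise RuntimeError(f"antiword failed on {path}: {result.stderr.strip()}")
    text = result.stdout
    return {"chars": len(text), "words": len(text.split())}

print(doc_counts("file.doc"))

For .docx you'd swap in a different converter (e.g. AbiWord's CLI, as mentioned above) behind the same interface.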
Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match with what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.
If you consider online applications a solution, then yes, there is one.
This site (not so pretty, design-wise) offers both word and character counts: http://allworldphone.com/count-words-characters.htm
I don't think there is a limit, and it shouldn't be a problem to just copy/paste the contents of your documents into the corresponding textarea and see the result.
Regarding the 100% or 99% accuracy, you could test it with a few short passages (say, 20-50 words) by counting them yourself first.
I hope this helps.
Regards. Chris