Autocorrect a document corpus - autocomplete

I have an approximately 6 GB corpus of documents, mostly user-generated content from mobile platforms. Because of its origin, the corpus is rife with misspelled, abbreviated and truncated words. Is there a way I could autocorrect these words to the nearest English-language word?

This might be fun to look at, seeing that you tagged your question with machine learning:
http://norvig.com/spell-correct.html
It's a fascinating read. On the other hand, if you are not looking to tinker, a better option might be Enchant; have a look at
https://pypi.org/project/pyenchant/
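If you go the pyenchant route, the core of it is roughly this. A minimal sketch only: the dictionary name "en_US" and the pick-the-first-suggestion policy are my assumptions, and real corrections usually need frequency or context information on top.

    import enchant  # pip install pyenchant

    d = enchant.Dict("en_US")  # assumes an en_US dictionary is installed

    def correct(word):
        """Return the word unchanged if it is valid, else the top suggestion."""
        if d.check(word):
            return word
        suggestions = d.suggest(word)
        return suggestions[0] if suggestions else word

    print(correct("speling"))  # typically "spelling"

For 6 GB of text it is worth caching corrections for words you have already seen rather than calling suggest() repeatedly, since suggestion lookup is much slower than the dictionary check.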


Word database for iOS custom keyboard

I am writing a custom keyboard for an Asian language and I have a word database with over half a million words. I currently use Realm and query it to provide word suggestions: when the user types the first few letters, the keyboard searches the DB and returns words ranked by a priority value assigned to each word. This seems inefficient compared to other keyboards in the App Store, and I can't find any concrete approach to the problem. Can anyone point me in a direction to make word searching in a custom iOS keyboard more efficient?
I haven't tried Core Data, but Realm is generally considered faster than Core Data.
First, storage: plist files or plain text files wouldn't be a bad choice, and saving the words pre-sorted (by character code) helps a lot.
Second, you need an algorithm that narrows the search down to a small group of words quickly; with a sorted list you can do that with a binary search on the character codes.
Here is an example of the binary-search idea:
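A minimal sketch in Python (the same idea carries over to Swift). The word list and priority values below are made up; in practice you would load them once from your sorted file.

    from bisect import bisect_left

    # Hypothetical data: (word, priority) pairs kept sorted by word.
    words = sorted([("apple", 10), ("apply", 7), ("banana", 3), ("band", 5)])
    keys = [w for w, _ in words]  # sorted word strings, for binary search

    def suggestions(prefix, limit=3):
        """Binary-search to the first word >= prefix, then scan the matches."""
        i = bisect_left(keys, prefix)
        matches = []
        while i < len(words) and words[i][0].startswith(prefix):
            matches.append(words[i])
            i += 1
        # Rank the matching words by their priority value, highest first.
        matches.sort(key=lambda pair: -pair[1])
        return [w for w, _ in matches[:limit]]

    print(suggestions("app"))  # ['apple', 'apply']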
It is also worth reading up on other search algorithms and data structures (tries, for instance) to compare.

How do I prove a DFA has no synchronizing word?

To find a synchronizing word I have always just used trial and error, which is fine for small DFAs but not so useful on larger ones. What I want to know is whether there exists an algorithm for determining a synchronizing word, or a way of telling that one does not exist (rather than just saying "I can't find one, therefore one cannot exist", which is by no means a proof).
I have had a look around on Google and so far have only come across methods for determining upper and lower bounds on the length of a synchronizing word based on the number of states; however, this is not helpful to me.
The existence of an upper bound on the length of a shortest synchronizing word immediately implies the existence of a (very slow) algorithm for finding one: just list all strings of length up to the upper bound and test whether each is a synchronizing word. If any of them is, a synchronizing word exists; if none of them is, there is no synchronizing word. This is exponentially slow, though, so it's not advisable on large DFAs.
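As a rough sketch of that brute-force search (Python; representing the DFA as a dict delta[state][symbol] -> state is my own choice, and the Černý example at the end is just there to exercise it):

    from itertools import product

    def is_synchronizing(word, states, delta):
        """True if `word` drives every state to one and the same state."""
        targets = set()
        for s in states:
            for c in word:
                s = delta[s][c]
            targets.add(s)
        return len(targets) == 1

    def find_synchronizing_word(states, alphabet, delta, max_len):
        """Try every word up to max_len; exponential, so only for small DFAs."""
        for length in range(1, max_len + 1):
            for word in product(alphabet, repeat=length):
                if is_synchronizing(word, states, delta):
                    return "".join(word)
        return None  # no synchronizing word of length <= max_len

    # Example: the 3-state Černý automaton ('a' cycles the states, 'b' merges
    # states 2 and 0); its shortest synchronizing word has length 4, e.g. "baab".
    delta = {
        0: {"a": 1, "b": 0},
        1: {"a": 2, "b": 1},
        2: {"a": 0, "b": 0},
    }
    print(find_synchronizing_word([0, 1, 2], "ab", delta, 4))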
David Eppstein designed a polynomial-time algorithm for finding synchronizing words in DFAs, though I'm not very familiar with this algorithm.
Hope this helps!

Teaching OCR to understand NSA and FISC redactions

I'm maintaining an archive of the heavily redacted documents coming out of the Foreign Intelligence Surveillance Court.
They come with big sections of text that look like this:
And when the OCR tries to work with this, you get text like:
production of this data on a daily basis for a period of 90 days. The sole purpose of this
production is to obtain foreign intelligence information in support of
individual authorized investigations to protect against international terrorism and
So in the OCRed version, where there are blacked out spots, there are just missing words. Sometimes, the missing words create a grammatically correct sentence with a different/weird meaning (like above). Other times, the resulting sentences make no sense, but either way it's a problem. It would be much better if the OCR engine could return X's for these spots or Unicode squares like ▮▮▮▮ instead.
The result I'd like is something like:
production of this data on a daily basis for a period of 90 days. The sole purpose of this
production is to obtain foreign intelligence information in support of XXXXXXXXXXX
individual authorized investigations to protect against international terrorism and
My question is how to go about getting these X's. Is there a way to analyze the images to identify the black spots? Is there a way to replace them with X's or some better Unicode character? I'm open to any ideas to make this look right, but image editing is not a strong suit of mine, nor is hacking deep within the OCR engine.
You may want to train Tesseract to recognize those long blobs. Depending on the length of the blob, you would assign a different number of 'X' characters. Read TrainingTesseract3 for the training process.
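Separately, for the "analyze the images to identify the black spots" part of the question: the redaction bars are large solid-black regions, so they can be found with simple thresholding before running OCR. This is only a rough OpenCV sketch, not part of the Tesseract-training suggestion above; the filename and size thresholds are made up and would need tuning to the actual scans.

    import cv2  # OpenCV 4.x

    # Hypothetical input file; in practice, loop over the archive's page images.
    img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

    # Redaction bars are near-black and much thicker than text strokes.
    _, mask = cv2.threshold(img, 50, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        # Keep only wide, roughly line-height blobs (thresholds need tuning).
        if w > 100 and 15 < h < 60:
            print("possible redaction bar at", (x, y, w, h))
            # Paint the bar white so the OCR engine skips it; the recorded
            # coordinates can then be used to splice X's into the OCR output.
            cv2.rectangle(img, (x, y), (x + w, y + h), 255, -1)

    cv2.imwrite("page_clean.png", img)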

Clustering structured (numeric) and text data simultaneously

Folks,
I have a bunch of documents (approx. 200k) that each have a title and an abstract. There is other metadata available for each document, for example category (exactly one of cooking, health, exercise, etc.) and genre (exactly one of humour, action, anger, etc.). The metadata is well structured and all of it is available in a MySQL DB.
I need to show the user related documents while she is reading one of these documents on our site, and I need to give the product managers adjustable weights for title, abstract and metadata so they can experiment with this service.
I am planning to run clustering on top of this data, but I am hampered by the fact that all the Mahout clustering examples use either DenseVectors built from numbers or Lucene-based text vectorization.
In other words, the examples are numeric-only or text-only. Has anyone solved this kind of problem before? I have been reading the Mahout in Action book and the Mahout wiki, without much success.
I could do this from first principles: extract all titles and abstracts into a DB, calculate TF-IDF and LLR, treat each word as a dimension and run the experiment, with a lot of code to write. That seems like a long way around to a solution.
That, in a nutshell, is where I am stuck: am I doomed to first principles, or is there a tool or methodology that I have somehow missed? I would love to hear from folks who have solved a similar problem.
Thanks in advance
You have a text-similarity problem here, and I think you're thinking about it correctly. Just follow any example concerning text. Is it really a lot of code? Once you count the words in the docs you're mostly done; then feed the vectors into whatever clusterer you want. Term extraction is not something you do with Mahout itself, though there are certainly libraries and tools that are good at it.
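To make that workflow concrete (not Mahout; a small Python/scikit-learn sketch with made-up documents, field names and weights): vectorize title and abstract separately, one-hot encode the metadata, scale each block by its product-manager weight, and cluster the concatenation.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.cluster import KMeans
    from scipy.sparse import hstack

    # Hypothetical mini-corpus; in practice this comes from the MySQL DB.
    docs = [
        {"title": "Quick pasta dinners", "abstract": "Easy weeknight recipes", "category": "cooking"},
        {"title": "Interval training basics", "abstract": "How to structure sprint workouts", "category": "exercise"},
        {"title": "Slow cooker soups", "abstract": "Hearty recipes for winter evenings", "category": "cooking"},
    ]

    # Separate TF-IDF spaces for title and abstract so each can be weighted.
    X_title = TfidfVectorizer().fit_transform(d["title"] for d in docs)
    X_abstract = TfidfVectorizer().fit_transform(d["abstract"] for d in docs)
    X_meta = OneHotEncoder().fit_transform([[d["category"]] for d in docs])

    # Weights the product managers can tune.
    w_title, w_abstract, w_meta = 2.0, 1.0, 0.5
    X = hstack([w_title * X_title, w_abstract * X_abstract, w_meta * X_meta]).tocsr()

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # cluster id per document

The same weighted vectors can also be fed to a nearest-neighbour lookup for the "related documents" feature instead of (or in addition to) clustering.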
I'm actually working on something similar, but without the need to distinguish between numeric and text fields.
I have decided to go with the semanticvectors package, which handles the TF-IDF part, the building of the semantic-space vectors, and the similarity search. It uses a Lucene index.
Please note that you can also use the S-Space package if semanticvectors doesn't suit you (if you go down that road, of course).
The only caveat I'm facing with this approach is that the indexing isn't incremental: I have to re-index everything every time a new document is added or an old one is modified. People using semanticvectors say they get very good indexing times, but I don't know how large their corpora are. I'm going to test this against the Wikipedia dump to see how fast it can be.

Words Prediction - Get most frequent predecessor and successor

Given a word, I want to get a list of the most frequent predecessors and successors of that word in English.
I have developed code that does bigram analysis on any corpus (I have used the Enron email corpus) and can predict the most likely next word, but I want some other solution because
a) I want to check the working / accuracy of my prediction
b) Corpus or dataset based solutions fail for an unseen word
For example, given the word "excellent" I want to get the words that are most likely to come before excellent and after excellent
My question is whether any particular service or API exists for this purpose.
Any solution to this problem is bound to be a corpus-based method; you just need a bigger corpus. I'm not aware of any web service or library that does this for you, but there are ways to obtain bigger corpora:
Google has published a huge corpus of n-grams collected from the English part of the web. It's available via the Linguistic Data Consortium (LDC), but I believe you must be an LDC member to obtain it. (Many universities are.)
If you're not an LDC member, try downloading a Wikipedia database dump (get enwiki) and training your predictor on that.
If you happen to be using Python, check out the nice set of corpora (and tools) delivered with NLTK.
As for the unseen-words problem, there are ways to tackle it, e.g. by replacing all words that occur less often than some threshold with a special token like <unseen> prior to training. That will make your evaluation a bit harder.
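For instance, a minimal sketch of that replacement step (the threshold and token name are arbitrary choices):

    from collections import Counter

    def replace_rare(tokens, min_count=2, unk="<unseen>"):
        """Replace tokens occurring fewer than min_count times with `unk`."""
        counts = Counter(tokens)
        return [t if counts[t] >= min_count else unk for t in tokens]

    corpus = "the cake was excellent but the excellent service was slow".split()
    print(replace_rare(corpus))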
You have to give some more instances or context of the "unseen" word so that the algorithm can make some inference.
One indirect way is to read the rest of the words in the sentence and look them up in a dictionary to see where they tend to occur.
In general, you can't expect the algorithm to learn and understand the inference the first time. Think about it yourself: if you were given a new word, how well could you make out its meaning? Probably by looking at how it is used in the sentence and how good your own understanding is; you make an educated guess, and over time you come to understand the meaning.
I just re-read the original question and I realize the answers, mine included, got off base. I think the original poster just wanted to solve a simple programming problem, not look for datasets.
If you list all distinct word-pairs and count them, then you can answer your question with simple math on that list.
Of course you have to do a lot of processing to generate the list. While it's true that if the total number of distinct words is as high as 30,000 then there are a billion possible pairs, I doubt that in practice there are anywhere near that many. So you can probably write a program with a huge hash table in memory (or on disk) and just count them all. If you don't need the insignificant pairs, you could write a program that periodically flushes out the less important ones while scanning. You can also segment the word list and generate pairs of a hundred words versus the rest, then the next hundred and so on, calculating in passes.
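A minimal sketch of that counting approach (plain Python; the sentence fed in is made up, and for a large corpus a nested dict keyed by the first word would scale better than filtering one big Counter per query):

    from collections import Counter

    predecessors = Counter()  # (word, previous_word) -> count
    successors = Counter()    # (word, next_word) -> count

    def add_text(text):
        """Count every adjacent word pair in both directions."""
        tokens = text.lower().split()
        for prev, cur in zip(tokens, tokens[1:]):
            successors[(prev, cur)] += 1
            predecessors[(cur, prev)] += 1

    def most_frequent_neighbours(word, counter, n=5):
        """Top-n neighbours of `word` according to the given pair counter."""
        ranked = [(pair[1], c) for pair, c in counter.items() if pair[0] == word]
        return sorted(ranked, key=lambda x: -x[1])[:n]

    add_text("the food was excellent and the service was excellent too")
    print(most_frequent_neighbours("excellent", predecessors))  # words before
    print(most_frequent_neighbours("excellent", successors))    # words after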
My original answer follows; I'm leaving it because it's about my own related question:
I'm interested in something similar (I'm writing an entry system that suggests word completions and punctuation, and I would like it to be multilingual).
I found a download page for Google's n-gram files, but they're not that good; they're full of scanning errors ('i's become '1's, words run together, etc.). Hopefully Google has improved its scanning technology since then.
The just-download-Wikipedia, unpack it and strip the XML idea is a bust for me; I don't have a fast computer (heh, I have a choice between an Atom netbook and an Android device here). Imagine how long it would take me to unpack a 3-gigabyte bz2 file into what, 100 gigabytes of XML, and then process it with Beautiful Soup and filters that, he admits, crash partway through each file and need to be restarted.
For your purpose (previous and following words) you could create a dictionary of real words and filter the n-gram lists to exclude the mis-scanned words. One might hope the scanning was good enough that you could exclude mis-scans by taking only the most popular words, but I saw signs of some consistent mistakes.
By the way, the n-gram datasets are here: http://books.google.com/ngrams/datasets
This site may also have what you want: http://www.wordfrequency.info/