Teaching OCR to understand NSA and FISC redactions - unicode

I'm maintaining an archive of the heavily redacted documents coming out of the Foreign Intelligence Surveillance Court.
They come with big sections of text that look like this:
And when the OCR tries to work with this, you get text like:
production of this data on a daily basis for a period of 90 days. The sole purpose of this
production is to obtain foreign intelligence information in support of
individual authorized investigations to protect against international terrorism and
So in the OCRed version, where there are blacked out spots, there are just missing words. Sometimes, the missing words create a grammatically correct sentence with a different/weird meaning (like above). Other times, the resulting sentences make no sense, but either way it's a problem. It would be much better if the OCR engine could return X's for these spots or Unicode squares like ▮▮▮▮ instead.
The result I'd like is something like:
production of this data on a daily basis for a period of 90 days. The sole purpose of this
production is to obtain foreign intelligence information in support of XXXXXXXXXXX
individual authorized investigations to protect against international terrorism and
My question is how to go about getting these X's. Is there a way to analyze the images to identify the black spots? Is there a way to replace them with X's or some better unicode character? I'm open to any ideas to make this look right, but image editing is not a strong suit for me nor is hacking deep within the OCR engine.

You may want to train Tesseract for those long blobs. Depending on the length of the blob, you would assign a different number of 'X' characters. Read TrainingTesseract3 for training process.

Related

How do you generate a CAD geometry of randomly oriented objects?

How can one generate CAD geometries of randomly oriented and randomly sized objects (3D)? I need to model randomly sized and randomly oriented rectangles--thousands to millions of them.
I have not yet come across any CAD tools that have =rand() functions that can be inputted into dimensions. Is one way perhaps to have a CAD program import a CSV file of these randomly generated parameter values?
In SolidWorks, you can have model parameters (dimension lengths/angles, constraints, etc.) stored in an Excel spreadsheet called a Design Table. Each row in the spreadsheet will represent a different configuration of your model, and each column a different parameter. You can use Excel's built-in capabilities or an export-capable tool of your choosing to generate the configurations according to your desired distribution. I don't recall off the top of my head the easiest way to get a large number of instances with different configurations into the same assembly, but you haven't really told us what you're trying to accomplish so I can't give you specific recommendations anyways.
If you have a specific CAD tool then you can often find documentation on the internal file format. With a little experimentation you can sometimes write a small external program that will generate the header of the CAD file and then loop thousands or millions of times generating each individual object. Finally you generate the lines needed to complete the file. That can sometimes be easier than trying to force a tool to do something the designers never expected. And this might let you use the software of your choice to generate the file.
I would suggest starting small. Use the CAD tool to create a file with two or three of your rectangles. Save and inspect the contents of the file to see that it matches your understanding of the needed format. Then try externally creating what should be the same file and verify your version is correctly accepted.
You might consider that some tool designers never expected someone to want thousands or millions of anything. I would suggest sneaking up on the problem. Try doubling the number of items, check this works as expected and then repeat this process again and again until either you successfully get to millions or until you find the CAD tool won't be able to handle this.

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical researches separated into topics, or projects, in their jargon. These researches are sent as csv files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subjects IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as a key/pair value and shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but there's a 2% with a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL 2008R2. I've designed EAVs for each segment (before anything please keep in mind that this is not our first EAV design. It worked well before with less data.) and the mid-tier/front-end is PHP 5.3 and Laravel 4 framework with Bootstrap.
The issue we are experiencing is that PHP chokes up with the big files. It can't insert into SQL in a timely fashion when there's more than 100k rows and that's because there's a lot of pivoting involved and, on top of that, PHP needs to get back all the fields IDs first to start inserting. I'll explain: this is necessary because the client wants some sort of control on the fields names. We created a repository for all the possible fields to try and minimize ambiguity problems; fields, for instance, named as "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to actually insert his csv fields into another table first, we called it properties table. This action won't completely solve the problem, but as he's inserting the fields, he's seeing possible matches already inserted. When the user types in blood, there's a panel showing all the fields already used with the word blood. If the user thinks it's the same thing, he has to change the csv header to the field. Anyway, all this is to explain that's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our technologies stack choice, but we have limitations on our possible choices: I only have worked with relational DBs so far, only SQL Server actually and the other guys know only PHP. I guess a MS full stack is out of the question.
It seems to me that a non-SQL approach would be the best. I read a lot about MongoDB but honestly, I think it would be a super steep learning curve for us and if they want to start crunching the numbers or even to have some reporting capabilities,
I guess Mongo wouldn't be up to that. I'm reading about PostgreSQL which is relational and it's famous HStore type. So here is where my questions start:
Would you guys think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects or whatever to be stored into HStore fields and be somewhat queryable?
Is there any issues with Postgres sitting in a windows box? I don't think our client has Linux admins. Nor have we for that matter...
Is it's licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables or bulk-insert or other technique that relies on the back-end to do the heavy lifting?
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)

Clustering structured (numeric) and text data simultaneously

Folks,
I have a bunch of documents (approx 200k) that have a title and abstract. There is other meta data available for each document for example category - (only one of cooking, health, exercise etc), genre - (only one of humour, action, anger) etc. The meta data is well structured and all this is available in a MySql DB.
I need to show to our user related documents while she is reading one of these document on our site. I need to provide the product managers weight-ages for title, abstract and meta data to experiment with this service.
I am planning to run clustering on top of this data, but am hampered by the fact that all Mahout Clustering example use either DenseVectors formulated on top of numbers, or Lucene based text vectorization.
The examples are either numeric data only or text data only. Has any one solved this kind of a problem before. I have been reading Mahout in Action book and the Mahout Wiki, without much success.
I can do this from the first principles - extract all titles and abstracts in to a DB, calculate TFIDF & LLR, treat each word as a dimension and go about this experiment with a lot of code writing. That seems like a longish way to the solution.
That in a nutshell is where I am trapped - am I doomed to the first principles or there exist a tool / methodology that I somehow missed. I would love to hear from folks out there who have solved similar problem.
Thanks in advance
You have a text similarity problem here and I think you're thinking about it correctly. Just follow any example concerning text. Is it really a lot of code? Once you count the words in the docs you're mostly done. Then feed it into whatever clusterer you want. The term extractions is not something you do with Mahout, though there are certainly libraries and tools that are good at it.
I'm actually working on something similar, but without the need of distinciton between numeric and text fields.
I have decided to go with the semanticvectors package which does all the part about tfidf, the semantic space vectors building, and the similarity search. It uses a lucene index.
Please note that you can also use the s-space package if semanticvectors doesn't suit you (if you go down that road of course).
The only caveat I'm facing with this approach is that the indexing part can't be iterative. I have to index everything every time a new document is added, or an old document is modified. People using semanticvectors say they have very good indexing times. But I don't know how large their corpora are. I'm going to test these issues with the wikipedia dump to see how fast it can be.

Words Prediction - Get most frequent predecessor and successor

Given a word I want to get the list of most frequent predecessors and successors of the word in English language.
I have developed a code that does bigram analysis on any corpus ( I have used Enron email corpus) and can predict the most frequent next possible word but I want some other solution because
a) I want to check the working / accuracy of my prediction
b) Corpus or dataset based solutions fail for an unseen word
For example, given the word "excellent" I want to get the words that are most likely to come before excellent and after excellent
My question is whether any particular service or api exists for the purpose?
Any solution to this problem is bound to be a corpus-based method; you just need a bigger corpus. I'm not aware of any web service or library that is does this for you, but there are ways to obtain bigger corpora:
Google has published a huge corpus of n-grams collected from the English part of the web. It's available via the Linguistic Data Consortium (LDC), but I believe you must be an LDC member to obtain it. (Many universities are.)
If you're not an LDC member, try downloading a Wikipedia database dump (get enwiki) and training your predictor on that.
If you happen to be using Python, check out the nice set of corpora (and tools) delivered with NLTK.
As for the unseen words problem, there are ways to tackle it, e.g. by replacing all words that occur less often than some threshold by a special token like <unseen> prior to training. That will make your evaluation a bit harder.
You have got to give some more instances or context of "unseen" word so that the algorithm can make some inference.
One indirect way can be reading rest of the words in the sentences.. and looking into a dictionary for the words where those words are encountered.
In general, you cant expect the algorithm to learn and understand the inference in the first time. Think about yourself.. If you were given a new word.. how well can you make out its meaning (probably by looking into how it has been used in the sentence and how well your understanding is) but then you make an educated guess and over the period of time you understand the meaning.
I just re-read the original question and I realize the answers, mine included got off base. I think the original person just wanted to solve a simple programming problem, not look for datasets.
If you list all distinct word-pairs and count them, then you can answer your question with simple math on that list.
Of course you have to do a lot of processing to generate the list. While it's true that if the total number of distinct words is as much a 30,000 then there are a billion possible pairs, I doubt that in practice there are that many. So you can probably make a program with a huge hash table in memory (or on disk) and just count them all. If you don't need the insignificant pairs you could write a program that flushes out the less important ones periodically while scanning. Also you can segment the word list and generate pairs of a hundred words verses the rest, then the next hundred and so on, and calculate in passes.
My original answer is here I'm leaving it because it's my own related question:
I'm interested in something similar (I'm writing a entry system that suggest word completions and punctuation and I would like it to be multilingual).
I found a download page for google's ngram files, but they're not that good, they're full of scanning errors. 'i's become '1's, words run together etc. Hopefully Google has improved their scanning technology since then.
The just-download-wikipedia-unpack=it-and-strip-the-xml idea is a bust for me, I don't have a fast computer (heh, I have a choice between an atom netbook here and an android device). Imagine how long it would take me to unpack a 3 gigabytes of bz2 file becoming what? 100 of xml, then process it with beautiful soup and filters that he admits crash part way through each file and need to be restarted.
For your purpose (previous and following words) you could create a dictionary of real words and filter the ngram lists to exclude the mis-scanned words. One might hope that the scanning was good enough that you could exclude misscans by only taking the most popular words... But I saw some signs of constant mistakes.
The ngram datasets are here by the way http://books.google.com/ngrams/datasets
This site may have what you want http://www.wordfrequency.info/

Would MongoDB be a good fit for my industry?

I work in the promotional products industry. We sell pretty much anything that you can print, embroider, engrave, or any other method to customize. Popular products are pens, mugs, shirts, caps, etc. Because we have such a large variety of products, storing information about these products including all the possible product options, decoration options, and all associated extra charges gets extremely complicated. So much so, that although many have tried, no one has been able to provide industry product data in such a way that you could algorithmically turn the data into an eCommerce store without some degree of data massaging. It seems near impossible to store this information to properly in a relational database. I am curious if MongoDB, or any other NoSQL option, would allow me to model the information in such a way that makes it easier to store and manipulate our product information better than an RDBMS like MySQL. The company I work for is over 100 years old and has been using DB2 on an AS400 for many years. I'll need some good reasons to convince them to go with a non relational DB solution.
A common example product used in our industry is the Bic Clic Stic Pen which has over 20 color options each for barrel and trim colors. Even more colors to choose for silkscreen decoration. Then you can choose additional options for what type of ink to use. There are multiple options for packaging. After all that is selected, you have additional option for rush processing. All of these options may or may not have additional charges that can be based on how many pens you order or how many colors in your decoration. Pricing is usually based on quantity, so ordering 250 pens would be cost more per pen than ordering 1000. Similarly, the extra charge for getting special ink would be cheaper per pen ordered when you order 1000 than 250.
Without wanting to sound harsh, this has the ring of a silver bullet question.
You have an inherently complex business domain. It's not clear to me that a different way of storing your data will have any impact on that complexity - storing documents rather than relational data probably doesn't make it easier to price your pen at $0.02 less if the customer orders more than 250.
I'd recommend focussing on the business domain, and not worrying too much about the storage mechanism. I'm a big fan of Domain Driven Design - this sounds like a perfect case for that approach.
Using a document database won't solve your problem completely, but it probably can help.
If your documents represent the options available on a product and an order for that product, in most cases you will be accessing the document as a whole - it's nothing you can't do with SQL, but a good fit for a document database. Since the structure of the documents is flexible, it is relatively easy to define an object within the document as a complex type to define a particular option or rule without changing the database.
However, that only helps with the data - your real problem is on the UI side. The two documents together map directly to the order form, but whatever method you use to define the options/rules some of the products are going to end up with extremely complex settings pages.
Yes, MongoDB is what you need. It doesn't have a strict documents structure, so you'll be able to create set of models you need and embed them into your product page in any order and combinations you need. Actually its possible to work with this data without describing the real model fields directly, so I (for example) can use fields my Rails application doesn't know about at all.
Also MongoDB is extremely easy to set up for replication and sharding. Also it supports GridFS virtual filesystem, so you can store images for your products with documents which describe them and manipulate them as a single object easily.
You should definitely give it a try.
UPD: Anyway it would be good to keep your RDBMS for financial data and crunching numbers, like grouping reports for the sales analysys and so on. NoSQL bases aren't very good at this.