I'm writing an Outlook Add-in that saves emails for historical purposes. Outlook's MSG format is unfortunately overly verbose, even when compressed, which causes saved MSG files to be many times the size of their text equivalents. However, saving all messages as text has the obvious pitfall of losing attachments, images, and any relevant formatting.
For the majority of emails this isn't an issue; however, emails with a certain degree of complex formatting, pictures, attachments, etc. ought to be saved in MSG format.
The majority of users' emails are sent as HTML, making my algorithm roughly as follows:
1. If email has attachment(s), save as MSG and be done
2. If email is stored as text, save as text and be done
3. If email is not stored as HTML store as MSG and be done
4. Decide if the HTML should be converted to text, and
   store it as text if so, or
   store it as MSG if not
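In rough Python-flavoured pseudocode, the dispatch looks like this (all the helper names are placeholders for whatever the Outlook object model actually exposes; only step 4's decision function is the open question):

def save_email(mail_item):
    # Step 1: anything with attachments keeps full fidelity.
    if has_attachments(mail_item):                 # placeholder check
        save_as_msg(mail_item)                     # placeholder
    # Step 2: already plain text, so nothing is lost.
    elif body_format(mail_item) == "text":         # placeholder check
        save_as_text(mail_item)                    # placeholder
    # Step 3: RTF or anything else that is not HTML.
    elif body_format(mail_item) != "html":
        save_as_msg(mail_item)
    # Step 4: the open question -- decide per message.
    elif should_convert_to_text(html_body(mail_item)):
        save_as_text(mail_item)
    else:
        save_as_msg(mail_item)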
This is straightforward with the exception of Step #4: How can I decide which format an HTML-formatted email should be converted to upon saving?
An idea: count the weighted density of HTML tags in the message. Choose a threshold based on existing data. Messages with HTML density higher than the threshold get stored as MSG; messages with density lower than the threshold get stored as plain text.
How do you calculate the weighted density? Use an HTML parsing library. Have it parse the document and count how many of each type of HTML tag appear in the document; that's all you need from the library. Multiply each tag count by its weight and sum the results. Then convert the message to plain text and count the number of characters in it. Divide the weighted tag-count sum by that character count and you have your density.
What should the density be weighted by? By a table you create with the importance of each type of HTML tag. I would guess that losing bold and italics is not too bad. Losing ordered and unordered lists is a bit worse, unless bullets and numbers are preserved when the messages are converted to plain text. Tables should be weighted highly, as they are important to the formatting. Choose a weight for unrecognized tags too.
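As a rough sketch of the density calculation in Python (BeautifulSoup, the weight values, and the default weight are all illustrative choices, not anything prescribed above):

from bs4 import BeautifulSoup

# Illustrative weights -- tune them against your own data.
TAG_WEIGHTS = {
    "b": 1, "i": 1, "strong": 1, "em": 1,   # losing bold/italics is cheap
    "ul": 3, "ol": 3, "li": 2,              # lists matter a bit more
    "table": 10, "tr": 5, "td": 3,          # tables carry real structure
    "img": 10,                              # images are lost entirely in plain text
}
DEFAULT_WEIGHT = 2                          # weight for unrecognized tags

def weighted_tag_density(html):
    soup = BeautifulSoup(html, "html.parser")
    weighted_sum = sum(
        TAG_WEIGHTS.get(tag.name, DEFAULT_WEIGHT)
        for tag in soup.find_all(True)      # every tag in the document
    )
    plain_text = soup.get_text()            # stand-in for your HTML-to-text conversion
    return weighted_sum / max(len(plain_text), 1)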
How should you choose your threshold? Run your density-calculating function on a sample of emails. Also manually inspect those emails to see whether they would be better off as MSG or plain text, and record that choice for each email. Use some algorithm with that data to find the boundary value. That algorithm could be Naive Bayes classification, but there might be a simpler option in this case; a hand-picked guess might even be good enough. I think you could make that guess after looking at a scatter plot of human-chosen format vs. weighted HTML tag density and eyeballing the density value that approximately separates the two format decisions.
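If you'd rather not reach for a classifier, a brute-force search over the labelled sample already gives a usable boundary; a minimal sketch, assuming you have (density, wants_msg) pairs from the manual inspection:

def best_threshold(samples):
    """samples: list of (density, wants_msg) pairs from manual inspection."""
    candidates = sorted(density for density, _ in samples)
    def errors(threshold):
        # store as MSG when density > threshold; count disagreements with the human label
        return sum((density > threshold) != wants_msg
                   for density, wants_msg in samples)
    return min(candidates, key=errors)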
First of all, I know this question is kind of off-topic, but I have already tried asking elsewhere and got no response.
Adding a UNK token to the vocabulary is a conventional way to handle out-of-vocabulary (OOV) words in NLP tasks. It is totally understandable to have it for encoding, but what's the point of having it for decoding? I mean, you would never expect your decoder to generate a UNK token during prediction, right?
Depending on how you preprocess your training data, you might need the UNK token during training. Even if you use BPE or another subword segmentation, OOV tokens can appear in the training data: usually some weird UTF-8 stuff, fragments of alphabets you are not interested in at all, etc.
For example, if you take the WMT training data for English-German translation, do BPE, and take the vocabulary, your vocabulary will contain thousands of Chinese characters that occur exactly once in the training data. Even if you keep them in the vocabulary, the model has no chance to learn anything about them, not even to copy them. It makes sense to represent them as UNKs.
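A minimal sketch of that preprocessing step, assuming a simple frequency cutoff (the function names and the cutoff value are just illustrative):

from collections import Counter

def build_vocab(tokenized_corpus, min_count=2, unk="<unk>"):
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    vocab = {unk: 0}
    for tok, count in counts.items():
        # Tokens below the cutoff (e.g. stray Chinese characters in WMT data)
        # get no entry of their own; they all collapse onto <unk>.
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(sentence, vocab, unk="<unk>"):
    return [vocab.get(tok, vocab[unk]) for tok in sentence]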
Of course, what you usually do at inference time is prevent the model from predicting UNK tokens, since UNK is always incorrect.
I have used it one time in the following situation:
I had pretrained word embeddings (glove.6b.50d.txt) and my model was outputting an embedding vector. In order to transform it into a word, I used cosine similarity against all vectors in the embedding file; if the most similar vector was the UNK vector, I would output UNK.
Maybe I'm just guessing here, but what I think might happen under the hood is that the model predicts based on previous words (e.g. it predicts the word that appeared 3 iterations ago), and if that word is UNK, the network outputs UNK.
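For what it's worth, the lookup described above can be done along these lines (a sketch only; the GloVe file stores one word followed by its vector per line):

import numpy as np

def load_glove(path="glove.6b.50d.txt"):
    words, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(vectors)

def nearest_word(output_vec, words, vectors):
    # cosine similarity of the predicted vector against every embedding
    sims = vectors @ output_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(output_vec) + 1e-8
    )
    return words[int(np.argmax(sims))]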
What does the box file need to look like if I use a multipage TIFF to train Tesseract?
More precisely: how do the Y-coordinates of a box file correspond to Y-coordinates within pages?
The last (6th) column in the box file represents the zero-based page number.
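To make that concrete, a (made-up) excerpt spanning two pages could look like this; the columns are character, left, bottom, right, top, page, and the coordinates are measured from the bottom-left corner of each individual page image, so the Y values start over on every page rather than running through the whole multipage tiff:

T 26 1425 39 1439 0
h 42 1425 53 1439 0
e 56 1425 67 1439 0
T 26 1425 39 1439 1
h 42 1425 53 1439 1
e 56 1425 67 1439 1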
https://github.com/tesseract-ocr/tesseract/wiki/Make-Box-Files
Update:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
Each font should be put in a single multi-page tiff and the box file
can be modified to specify the page number for each character after
the coordinates. Thus an arbitrarily large amount of training data may
be created for any given font, allowing training for large
character-set languages.
Even though you can have as large a training text as you want, it could potentially result in an unnecessarily large image and hence slow down training.
I want to generate an output for my correlation table so I can use it in Word. The only command I found is mkcorr, which only generates output I can copy into Excel, but not into Word.
I need a correlation table in paper style (with means, standard deviations, correlations, and significance labels).
You can do this with estout/esttab/estpost after installing it:
capture ssc install estout
sysuse auto
estpost correlate price mpg weight
esttab using corr.rtf
This is pretty basic-looking, but you can make it a lot fancier after looking at some examples here.
Two things:
First, the help file of mkcorr says that you can produce the output in Word:
"mkcorr produces a correlation table in a format that is easy to import into a
spreadsheet or word processing document"
Second, mkcorr does what you want: it includes means, standard deviations, correlations, and significance labels (and also minimums and maximums).
Here is an example:
sysuse auto
mkcorr price mpg, log(C:\Users\Vista\Desktop\auto.doc) replace means sig cdec(2) mdec(2)
However, the output in Word needs some manipulation compared with that in Excel.
I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates, and we also have a business requirement to filter near-duplicate data.
I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
But I want to identify near duplicates, so I also looked at hashing algorithms like LSH and MinHash that are used in nearest-neighbour problems and near-duplicate detection.
I am kind of stuck and looking for pointers on how to proceed, and for papers/implementations that I could look at.
First, normalize the text to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress multiple white spaces to one, and remove leading and trailing white space; for speed I would perform all these operations in one pass over the text. Next, take the MD5 hash (or something faster) of the resulting string. Do a database lookup of the MD5 hash (as two 64-bit integers) in a table: if it exists, it is an exact duplicate; if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.
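A minimal Python sketch of that normalize-and-hash step (the in-memory set stands in for the database table and whatever ageing policy you choose):

import hashlib
import re

seen = set()  # stand-in for a database table of (hi, lo) 64-bit integer pairs

def normalize(text):
    # lowercase, turn runs of non-letters into a single space, trim the ends
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()

def is_exact_duplicate(text):
    digest = hashlib.md5(normalize(text).encode("utf-8")).digest()
    hi = int.from_bytes(digest[:8], "big")   # store the 128-bit hash as
    lo = int.from_bytes(digest[8:], "big")   # two 64-bit integers
    if (hi, lo) in seen:
        return True
    seen.add((hi, lo))
    return False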
To find near duplicates, the normalized string needs to be converted into potential signatures (hashes of substrings); see the SpotSigs paper and blog post by Greg Linden. Suppose the routine Sigs() does that for a given string, that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64-bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here).
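A toy Sigs() just to make the shape concrete; it hashes overlapping word shingles and keeps the smallest hashes (a crude MinHash-style selection), not the stopword-anchored selection SpotSigs actually uses:

import hashlib

def sigs(normalized_text, shingle_len=4, num_sigs=4):
    words = normalized_text.split()
    shingles = [" ".join(words[i:i + shingle_len])
                for i in range(max(len(words) - shingle_len + 1, 1))]
    hashes = sorted(
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    )
    return set(hashes[:num_sigs])  # a small set of 64-bit integers, as described above

Two documents whose signature sets overlap then become candidates for a full similarity comparison.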
Given Sigs(), the problem of efficiently finding the near duplicates is commonly called the set similarity join problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared to, as does the simhash method.
http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes
I'm using the Tesseract OCR engine in an iPhone application to read specific numeric fields from bill invoice photos.
With a lot of photo pre-processing (adaptive thresholding, artifact cleaning, etc.), the results are finally fairly accurate, but there are still some cases I want to improve.
If the user takes a photo in low-light conditions and there is some noise or artifacts in the picture, the OCR engine interprets these artifacts as additional digits. In some rare cases it can read e.g. a numeric amount of "32,15" EUR as "5432,15" EUR, and this is not at all good for the user's confidence in the product.
I assume that, if there is an internal OCR read-error value associated with each character read, it will be higher for the "54" digits of my previous example, as they are recognized from small noise pixels; if I had access to these read-error values, I would be able to easily discard the erroneous digits.
Do you know of any method to get a reading-error magnitude (or any "accuracy factor" value) for each individual character returned from the Tesseract OCR engine?
It is called the "confidence" value in Tesseract terminology. Searching for that term in the tesseract-ocr group turns up many answers that mention a TesserractExtractResult method.
The hOCR output also contains this value.
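Not iOS, but to show where the value surfaces: the pytesseract wrapper exposes the same confidences per recognized word (word-level, not per character) through image_to_data; the file name and the cutoff below are just illustrative:

import pytesseract
from PIL import Image
from pytesseract import Output

data = pytesseract.image_to_data(Image.open("invoice.png"), output_type=Output.DICT)
for text, conf in zip(data["text"], data["conf"]):
    # conf is Tesseract's confidence (0-100, or -1 for non-text boxes);
    # low-confidence fragments like the noise-induced "54" can be filtered here
    if text.strip() and float(conf) < 60:
        print("suspicious:", text, conf)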