OCR software or homemade CNN for document processing? - neural-network

I have a dilemma. If you have only one type of invoice/document, and there is a specific field that you want to extract from that invoice and use somewhere else (that field happens to be a handwritten digit, sometimes written with dashes or slashes), would you use some OCR software or build your own CNN for recognizing the digits? What accuracy would you expect from OCR? Would your CNN be more accurate, given that you are only interested in a specific type of digit writing, with specific image dimensions, etc.? What would be better in the given situation?
Keep in mind that you would not use it anywhere else for handwritten digit recognition, and that you already have 100k or more documents that have been transcribed into a computer by a human, which you can use for training and testing.
Thank you.

I would definitely go for a CNN based solution. Since the structure of your document is consistent:
Extract the desired portion of the document with a standard computer vision approach
Train a CNN on an annotated set of a few thousand documents. You should even be able to fine-tune an existing CNN trained on MNIST, which would require fewer training images.
This approach should give you >99% accuracy without much effort. The accuracy of the OCR solution really depends on which library you use and the preprocessing you implement.
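As a rough illustration of the extraction step, here is a minimal sketch of cropping a fixed field region from a page and scaling it for a network. It uses plain Python lists as a stand-in for a real image array, and the box coordinates and the `extract_field` helper are hypothetical; you would measure the box on your own template and most likely use an image library instead.

```python
# Hypothetical field position on the scanned template, in pixels.
FIELD_TOP, FIELD_LEFT = 4, 6
FIELD_HEIGHT, FIELD_WIDTH = 3, 5


def extract_field(page):
    """Crop the fixed field region and scale 0-255 grey values to [0, 1]."""
    rows = page[FIELD_TOP:FIELD_TOP + FIELD_HEIGHT]
    crop = [row[FIELD_LEFT:FIELD_LEFT + FIELD_WIDTH] for row in rows]
    if len(crop) != FIELD_HEIGHT or any(len(r) != FIELD_WIDTH for r in crop):
        raise ValueError("page is smaller than the expected template")
    return [[pixel / 255 for pixel in row] for row in crop]


# Synthetic 10x12 "page": white background (255) with one dark stroke (0).
page = [[255] * 12 for _ in range(10)]
for col in range(6, 11):
    page[5][col] = 0

field = extract_field(page)
print(len(field), "x", len(field[0]))
```

Because every document has the same layout, the crop always has the same dimensions, which is what lets a small CNN with a fixed input size work here.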

Related

Design of a Neural Network for Emotion Classification using Tweet Data

I have a dataset of four emotion labelled tweets (anger, joy, fear, sadness). For instance, I transformed tweets to a vector similar to the following input vector for anger:
Mean of frequency distribution to anger tokens
word2vec similarity to anger
Mean of anger in emotion lexicon
Mean of anger in hashtag lexicon
Is that vector valid to train a neural network?
Your input vector looks fine to start with. Of course, you might later extend it with statistical and derived data from Twitter or other relevant APIs or datasets.
Your network has four outputs, just like you mentioned:
Joy: [1,0,0,0]
Sadness: [0,1,0,0]
Fear: [0,0,1,0]
Anger: [0,0,0,1]
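A minimal sketch of that target encoding, in case it helps. The `one_hot` and `decode` helper names are made up for illustration; the index order simply follows the list above.

```python
EMOTIONS = ["joy", "sadness", "fear", "anger"]  # index order matches the targets above


def one_hot(label):
    """Turn an emotion name into its four-element target vector."""
    vector = [0] * len(EMOTIONS)
    vector[EMOTIONS.index(label.lower())] = 1
    return vector


def decode(outputs):
    """Return the emotion whose output unit is largest."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return EMOTIONS[best]


print(one_hot("Anger"))
print(decode([0.1, 0.2, 0.6, 0.1]))
```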
You may also consider adding multiple hidden layers to make it a deep network, if you wish, to increase the stability of your neural network prototype.
As your question also suggests, it is best to have a good preprocessing and feature extraction system in place before training and testing, and you certainly seem to know where the project is going.
Great project, best wishes, thank you for your good question and welcome to stackoverflow.com!
Playground Tensorflow

Choose training and test set for MLP and Hopfield network

I have a question regarding the choice of the training and the test set for a Multilayer Perceptron (MLP) and a Hopfield network.
For example, assume that we got 100 patterns of the digits 0-9 given in a bitmap format. 10 of them are perfect digits while the other 90 are distorted. Which of these patterns will be used for the training set and which for the test set? The goal is to classify the digits.
I suppose for the Hopfield network the perfect digits will be used as the training set, but what about the MLP? One approach I thought of was to take for example 70 of the distorted digits and use them as the training set along with the corresponding perfect digits as their intended targets. Is this approach correct?
Disclaimer: I have not worked with Hopfield Networks before, so I trust you in your statements about it, but it should not be of that great relevance for the answer, anyways.
I am also assuming that you want to classify the digits, which is something you don't explicitly state in your question.
As for a proper split: aside from the fact that so little training data is generally not enough to get decent results from an MLP (even for a simple task such as digit classification), it is unlikely that you will be able to "pre-label" your training data in terms of quality in most real-world scenarios. You should therefore always assume that the data you are processing is inherently noisy. A good illustration of this is that data augmentation is frequently used to enrich a training corpus. Data augmentation can consist of changes as simple as
added noise
minor rotations
horizontal/vertical flipping (the latter only makes so much sense for digits, though)
and still improve your accuracy, which goes to show that visual quality and quantity for training are two very different things. Of course, it is not per se true that quantity alone will solve your problem (although research indicates that it is at least a good idea to use a lot of data).
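To make the augmentation idea concrete, here is a small sketch for binary bitmaps like the ones in the question. It only covers added noise (random pixel flips) and a horizontal flip; the function names, the 4x4 example bitmap, and the 5% flip probability are all assumptions chosen for illustration.

```python
import random


def add_noise(bitmap, flip_prob, rng):
    """Flip each binary pixel independently with probability flip_prob."""
    return [[1 - p if rng.random() < flip_prob else p for p in row] for row in bitmap]


def flip_horizontal(bitmap):
    """Mirror the bitmap left to right (of limited use for digits)."""
    return [row[::-1] for row in bitmap]


def augment(bitmap, copies, flip_prob=0.05, seed=0):
    """Create several noisy variants of one pattern."""
    rng = random.Random(seed)
    return [add_noise(bitmap, flip_prob, rng) for _ in range(copies)]


seven = [
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
noisy_sevens = augment(seven, copies=5)
print(len(noisy_sevens), "augmented copies")
```

Each augmented copy keeps the label of the pattern it was derived from, so one clean digit can contribute several training examples.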
Further, what you judge to be a good representation might be very much different from the network's perspective (although for labeling digits it might be rather easy to tell). A decent strategy is therefore to simply perform a random sampling for your training/test split.
Something I like to do when preprocessing a dataset is, when done splitting, to check whether every class is somewhat evenly represented in the splits, so you won't overfit.
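A sketch of what that check could look like, assuming the data is a list of `(pattern, label)` pairs. The placeholder dataset, the 80/20 ratio, and the helper names are my own assumptions, not something from the question.

```python
import random
from collections import Counter


def random_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def class_counts(samples):
    """Count how often each label occurs in a list of (pattern, label) pairs."""
    return Counter(label for _, label in samples)


# Placeholder data: 100 (pattern_id, digit) pairs, 10 per digit.
samples = [(i, i % 10) for i in range(100)]
train, test = random_split(samples)
print("train:", sorted(class_counts(train).items()))
print("test: ", sorted(class_counts(test).items()))
```

If a digit turns out to be missing or rare in one of the splits, re-drawing the split with another seed or splitting per class are the usual remedies.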
Similarly, I would argue that having clean/high quality images of digits in both your test and training set might make the most sense, since you want to both be able to recognize a high quality number, as well as a sloppily written digit, and then test whether you can actually recognize it (with your test set).

Use a trained neural network to imitate its training data

I'm in the overtures of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can, with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not part. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient descent optimisation to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal lead by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.
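For comparison, the gradient-based search proposed in the question can be sketched on a toy scorer. This is not a GAN: the fixed logistic "classifier", its weights, and the step size are hypothetical stand-ins for a trained network, and the example only shows the mechanics of moving an input uphill on the classifier's score.

```python
import math
import random

# Hypothetical stand-in for a trained classifier: a fixed logistic scorer.
WEIGHTS = [0.8, -0.5, 0.3]
BIAS = -0.1


def score(x):
    """Probability that x 'looks like training data' under the toy scorer."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def ascend(x, steps=50, lr=0.5):
    """Move the input uphill along d(score)/dx = s * (1 - s) * w."""
    x = list(x)
    for _ in range(steps):
        s = score(x)
        x = [xi + lr * s * (1.0 - s) * w for xi, w in zip(x, WEIGHTS)]
    return x


rng = random.Random(0)
start = [rng.uniform(-1, 1) for _ in WEIGHTS]
result = ascend(start)
print(score(start), "->", score(result))
```

With a real network the score surface is non-convex and the input would have to be a continuous representation of the text, which is part of why generative models such as GANs are usually preferred.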

can we use autoencoders for text data

I am doing a project on health care. I am going to train my autoencoders with symptoms and diseases, i.e. my input is in textual form. Will that work? (I am using RStudio.) Can anyone please help me with this?
You have to convert the text to vectors/numbers. Traditional approaches like bag of words or TF-IDF will help with this, but more recent neural approaches such as Word2Vec or RNN language models are among the best techniques for obtaining a numeric representation of text.
Use any neural word embedding technique to convert the text (word level [word2vec], document level [doc2vec]) into numbers/vectors.
These vectors have some dimension, and to compress the representation to an even smaller dimension you can use an autoencoder.
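To show what "converting text to vectors" means in the simplest case, here is a bag-of-words sketch. Note that this is the traditional approach, not a neural embedding like word2vec; the example symptom strings and the helper names are invented for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct word to a fixed column index (alphabetical order)."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: i for i, word in enumerate(words)}


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]


# Invented example records.
records = ["fever cough headache", "cough sore throat", "fever fever chills"]
vocabulary = build_vocabulary(records)
vectors = [bag_of_words(r, vocabulary) for r in records]
print(list(vocabulary))
print(vectors)
```

Every record becomes a vector of the same length, which is the kind of fixed-size numeric input an autoencoder expects.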
Feel free to ask for any other information you need.
Try using Python for these tasks as it has the latest packages.
You can use an autoencoder on textual data, as explained here.
Autoencoders have usually worked better on image data, but recent approaches have adapted them so that they also work well on text data.
Have a look at this.
The code is also available on GitHub.

neural network data for training and testing

I have a question regarding training and testing data for my ANN.
Should the testing data go through a feature extraction process before it can be classified?
I am new to this field. Is what I am doing right?
I split the dataset into 80% train and 20% test and extract features from both sets. I feed the training data into the network for training, but not the test data. Then I move on to classification. Is this correct? I ask because my supervisor said the test data should not go through the feature extraction process. I am wondering how the ANN can recognize the input if no specific features have been extracted. Apologies for my bad English.
If anyone has a link or journal article that I can refer to, please provide it.
Thanks a lot.
Both the training and the test data need to be in the same format. Your training data and test data should therefore go through the same pre-processing steps; otherwise what the network has learned will not carry over to the test data.
You are doing it right (as far as I understand your question).
Example: If you were to show me 10 images of faces (training data) on paper and then present me 2 people (test data) by their name only (a different feature representation), I wouldn't be able to classify what I didn't learn. You can't train the network with images and then test it with audio or any representation other than the one you used for training. I can't link any papers for that as it's just common sense.
You can modify the training set, e.g. by adding noise. But whatever you do, the representation format has to be the same.
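As a small sketch of "same representation for both sets": the example below learns a feature scaling on the training rows and then applies exactly the same transformation to the test rows. Standardization is just a stand-in for whatever feature extraction you actually use, and the `Standardizer` class and the numbers are made up.

```python
import statistics


class Standardizer:
    """Learns per-feature mean/std on training data, then applies them anywhere."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [statistics.fmean(c) for c in columns]
        # Fall back to 1.0 for constant columns to avoid division by zero.
        self.stds = [statistics.pstdev(c) or 1.0 for c in columns]
        return self

    def transform(self, rows):
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test_rows = [[2.0, 40.0]]

scaler = Standardizer().fit(train_rows)      # parameters come from training data only
train_features = scaler.transform(train_rows)
test_features = scaler.transform(test_rows)  # same transformation, same format
print(train_features)
print(test_features)
```

The test set is transformed but never used to fit anything, which is probably the distinction your supervisor had in mind.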