Training tesseract for images having printed and handwritten data

I am really confused about tesseract training.
I have used tesseract with the eng language and the results are about 80% accurate; I am trying to improve the accuracy. I thought of creating a new language with my original images as training data. Will the results improve if I use the original data as training data? I have trained it, but the results are very bad compared with the original.
So, is there any way I can modify the original eng traineddata to improve the accuracy? I tried following the docs, like changing the unicharambigs, but it was of no use.
Another question: my image has both printed data (the majority) and handwritten data. How should I train tesseract: should I do it for the printed data first and then for the handwritten data, or do them separately?

Related

Word to vector: where should I start?

I'm trying to implement a neural network model on labeled data that I have. The data contains several columns (both categorical and numeric features).
A few columns in this data contain a short description written by users, which I also want to analyze, but I don't know how to start.
The data looks something like this:
problem ID   status   description                        labels
1            closed   short description of the problem   CRM
2            open     short description of the problem   ERP
3            closed   short description of the problem   CRM
Using status (which I will convert into dummy variables) and description (this is where I need you guys), I want to train the model to predict the labels.
Any idea how I should start? How can I convert the description column into useful data?
Thanks!
You basically want to do classification based on these features. For categorical variables, encode them into some trainable form. For the text, first perform cleaning; if it contains many numbers, convert the numbers into their word form. Then build vectors for it using tf-idf or another vectorization approach, normalize your numerical features, and train a simple SVM classifier on the result. If that does not give good accuracy, move to a CNN- or LSTM-based neural network; you can also try CNN + embeddings for better results.
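A minimal sketch of the first stage of that pipeline with scikit-learn, assuming the data sits in a pandas DataFrame laid out like the table above (the toy rows and the LinearSVC choice are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Toy frame matching the layout in the question
df = pd.DataFrame({
    "status": ["closed", "open", "closed"],
    "description": ["crm login fails", "erp export hangs", "crm report is empty"],
    "labels": ["CRM", "ERP", "CRM"],
})

# One-hot encode the categorical column, tf-idf the free-text column
features = ColumnTransformer([
    ("status", OneHotEncoder(handle_unknown="ignore"), ["status"]),
    ("text", TfidfVectorizer(), "description"),  # a 1-D column of raw strings
])

model = Pipeline([("features", features), ("clf", LinearSVC())])
model.fit(df[["status", "description"]], df["labels"])

print(model.predict(pd.DataFrame(
    {"status": ["open"], "description": ["crm module crashes on login"]}
)))
```

Keeping everything in one Pipeline means the same encoding and vectorization are applied at training and prediction time, which matters once you swap the toy frame for your real data.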

OCR software or homemade CNN for document processing?

I have a dilemma. If you have only one type of invoice/document, and there is a specific field on that invoice that you want to process and use somewhere else (that field happens to be a handwritten digit, sometimes written with dashes or slashes), would you use some OCR software or build your own CNN for recognizing the digits? What accuracy would you expect from OCR? Would your CNN be more accurate, given that you are only interested in a specific style of digit writing, with specific image dimensions, etc.? What would be better in the given situation?
Keep in mind that you would not use it in any other way or place for handwritten digit recognition, and that you already have 100k or more documents that have been transcribed to a computer by a human, which you can use for training and testing.
Thank you.
I would definitely go for a CNN-based solution. Since the structure of your document is consistent:
1. Extract the desired portion of the document with a standard computer vision approach.
2. Train a CNN on an annotated set of a few thousand documents. You should even be able to fine-tune an existing CNN trained on MNIST, which would require fewer training images (see the sketch below).
This approach should give you >99% accuracy without much effort. The accuracy of an OCR solution really depends on which library you use and the preprocessing you implement.
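A minimal sketch of step 2 in Keras, assuming the extracted digit crops have been resized to 28x28 grayscale; the architecture and training settings are illustrative, and the fine-tuning data below is a random placeholder for your own annotated crops:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Small CNN for 28x28 grayscale digit crops
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 1) Pretrain on MNIST
(x_mnist, y_mnist), _ = keras.datasets.mnist.load_data()
model.fit(x_mnist[..., None] / 255.0, y_mnist, epochs=3, batch_size=128)

# 2) Fine-tune on your own annotated crops (placeholders below;
#    substitute the digits extracted from your documents)
x_own = np.random.rand(1000, 28, 28, 1).astype("float32")
y_own = np.random.randint(0, 10, size=1000)
model.fit(x_own, y_own, epochs=5, batch_size=64)
```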

Non-image data with CNN [Matlab specific]

I am trying to use a CNN to build a classifier for my data.
The training set consists of 2D numerical matrices which are not image data.
It seems that Matlab's CNNs only work with image inputs:
https://uk.mathworks.com/help/nnet/ref/imageinputlayer-class.html
Does anyone have experience with cnns and non-image data using Matlab's deep learning toolbox?
Thank you.
Well, first I would like to understand why you want to use a CNN with non-image data. CNNs are especially good because they take into account information in the neighborhood. Unless your data has some kind of regional pattern (like pixels that come together to create a pattern, or sentences where word order is relevant), a CNN would not be the best approach to handle it.
That being said, if you still want to use it, you could convert the matrices to images. I'm not sure if that would help, though.
Function to convert: mat2gray
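For readers outside Matlab: mat2gray linearly rescales a matrix into the range [0, 1], so the equivalent operation in NumPy is a simple min-max normalization (a sketch of what mat2gray does, not Matlab's exact implementation):

```python
import numpy as np

def mat2gray(a):
    """Rescale a matrix linearly to [0, 1], like Matlab's mat2gray."""
    a = np.asarray(a, dtype=np.float64)
    lo, hi = a.min(), a.max()
    if hi == lo:                 # constant matrix: avoid division by zero
        return np.zeros_like(a)
    return (a - lo) / (hi - lo)

print(mat2gray([[1, 5], [3, 9]]))
```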

Handwritten digits recognition

I am going to do a neural network project on handwritten digit recognition, but this area is well studied. I have found some papers online, but most of them are from before 2012. Could anyone tell me what the state-of-the-art techniques and current issues in this area are?
You can refer to the MNIST dataset, because it is made up of a huge number of corner cases. It contains 28x28-pixel images; in its common CSV form the training file has 42,000 rows and 785 columns (one label column plus 28x28 = 784 pixel values), with a separate test file provided, so you can easily get an idea of how much data you have for training and prediction without needing train_test_split. That way you can get a proper feel for the dataset and visualize it.
Dataset link: MNIST DATASET
Reference video: HANDWRITTEN DIGIT RECOGNITION
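A quick way to load and inspect MNIST in Python, using the copy bundled with Keras (a minimal sketch; any distribution of the dataset works the same way):

```python
import matplotlib.pyplot as plt
from tensorflow import keras

# Canonical 60k/10k train/test split of 28x28 grayscale digits
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)   # (60000, 28, 28) (10000, 28, 28)

# Visualize a few samples with their labels
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax, img, label in zip(axes, x_train, y_train):
    ax.imshow(img, cmap="gray")
    ax.set_title(int(label))
    ax.axis("off")
plt.show()
```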
The state of the art is almost always determined by performance on a particular dataset. For handwritten digits, MNIST is the dataset I have seen most commonly referenced (though I'm not an expert in the area). For any dataset, you should be able to find the state-of-the-art performance listed very near where you download it.

Neural network data for training and testing

I have a question regarding training and testing data for my ANN.
Should the testing data go through a feature extraction process before it can be classified?
I am new to this field. Is what I am doing right?
I split the dataset into 80% train and 20% test, and I extract features from both sets. I feed the training features into the training network, but not the test features; those go straight to classification. Is this correct? My supervisor said the test data should not go through the feature extraction process, but I am wondering how the ANN can recognize the input if no specific features are extracted. Apologies for my bad English.
If anyone has a link or journal I can refer to, please provide it.
Thanks a lot.
Both the training and the test data need to be in the same format; thus, your training data and test data should go through the same pre-processing steps, or else your network will not learn correctly.
You are doing it right (as far as I understand your question).
Example: if you were to show me 10 images of faces (training data) on paper and then present 2 people (test data) by their name only (a different feature representation), I wouldn't be able to classify what I didn't learn. You can't train the network with images and then test it with audio, or any representation other than the one you used for training. I can't link any papers for that, as it's just common sense.
You can modify the training set, e.g. by adding noise. But whatever you do, the representation format has to be the same.
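A minimal sketch of this principle in scikit-learn, where StandardScaler stands in for any feature-extraction step and the built-in iris data is just a convenient stand-in: the extractor is fitted on the training split only, and the same fitted transformation is then applied to both splits.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the feature-extraction step on the training set only...
scaler = StandardScaler().fit(X_train)

# ...then apply the SAME transformation to both sets,
# so train and test inputs share one representation
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

clf = MLPClassifier(max_iter=1000, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```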