can we use autoencoders for text data - autoencoder

I am doing my project based on health care.I am going to train my autoencoders with the symptoms and the diseases i.e my input is in textual form. Will that work? (I am using Rstudio).Please anyone help me with this

You have to convert the text to vectors/numbers. To do this traditional approaches like Bag of words, Tf-Idf will help but the latest Neural Word Embedding like Word2Vec, RNN Language model etc are the best techniques to obtain numeric representation of text.
Please use any Neural Word Embedding technique and convert the text(word level[word2vec], document level[doc2vec]) into numbers/vectors.
Now these vectors come with some dimension and to compress this representation to even smaller dimension u can use AutoEncoder.
Feel Free to ask any other information required.
Try using Python for these tasks as it has the latest packages.

You can use Autoencoder on Textual data as explained here.
Autoencoder usually worked better on image data but recent approaches changed the autoencoder in a way it is also good on the text data.
have a look at this.
the code is also available in GitHub.

Related

Word to vector where should I start?

I'm trying to implement a neural networks model on labeled data that I have. The data contains several columns (categorical and numeric features as well).
Few columns in this data contains a short description, written by users which I also want to analyze but I don't know how to start.
The data looks something like this:
problem ID status description labels
1 closed short description of the problem CRM
2 open short description of the problem ERP
3 closed short description of the problem CRM
Using status (which I will convert into dummy variables) and description (this is where I need you guys), I want to train the model to predict the labels.
Any idea about how should I start? How can I convert the description columns into a useful data?
Thanks!
You want to do the classification basically based on the features, for categorical variables encode them into some trainable form. for text first, perform cleaning, if that has more numbers then convert numbers into their words form and make vectors for it using tf-idf or any other vectorization approach, also normalize your numerical features and then train a simple svm classifier with it, if not giving good accuracy then go with CNN and LSTM based neural network, you can also try CNN+Embeddings for better results.

OCR software or homemade CNN for document processing?

I have a dilemma. If you have only one type of invoice/document, and you have a specific field that you want to process from that invoice and use somewhere else (that filed happens to be a handwritten digit, sometimes written with dashes or slashes), would you use some OCR software or build your own CNN for recognizing the digits? What accuracy would you expect from OCR? Would your CNN be more accurate, as you are just interested in a specific type of digit writing, with specific image dimensions, etc. What would be better in the given situation?
Keep in mind, that you would not use it in any other way, or any other place for handwritten digits recognition, and you already have up to 100k and more documents that are copied to a computer by a human, and you can use it for training and testing.
Thank you.
I would definitely go for a CNN based solution. Since the structure of your document is consistent:
Extract the desired portion of the document with a standard computer vision approach
Train a CNN on an annotated set of a few thousand documents. You should even be able to finetune an existing CNN trained on MNIST and this would require less training images.
This approach should give you >99% accuracy without much effort. The accuracy of the OCR solution really depends on which library you use and the preprocessing you implement.

Understanding Feature Extraction and Feature Vectors in Image Processing?

I am working on a small project in Matlab just because of my interest in image processing and I have not studied a degree or a course related to image processing.
I want to understand a small concept about feature extraction and feature vectors. I have read some articles about that and in general I can understand that, but my question is:
For example, I want to extract some information from different objects of a binary image, the information is about length, width and distance between the objects. In one application I want to extract the features on which I want to apply some algorithms to compute width of all the objects and ignore the length and distance. Can we name this as feature extraction regarding the width? And storing them in different vectors as Feature Vectors?
It makes me think that, I might be complicating the simple things. Should I use some other terminologies for this instead of feature extraction and feature vectors?
Please suggest me if I am going in the right direction or not?
Thank you!
Feature extraction is the process of computing numerical values on regions/objects/shapes/blobs detected in an image. [Sometimes the detection itself can be called extraction and the features need not be numbers.]
The feature values can indeed be stored in vectors, usually they fill a table. Sometimes they are structured in a more complicated way (such as a graph f.i.). Most of the time they are used for classification/recognition purposes or they can just be the output of the process on hand.

Non-image data with cnn [Matlab Specific]

I am trying to use a cnn to build a classifier for my data.
The training set is comprised of 2D numerical matrices which are not image data.
It seems that Matlab's cnns only work with image inputs:
https://uk.mathworks.com/help/nnet/ref/imageinputlayer-class.html
Does anyone have experience with cnns and non-image data using Matlab's deep learning toolbox?
Thank you.
Well I first would like to understand why you want to use a CNN with non-image data? CNNs are specially good because they take into account information in the neighborhood. Unless your data has some kind of region pattern (like pixels that get together to create a pattern or sentences where word order is relevant) the CNN would not be the best approach to handle it.
That been said, if you still want to use it you could convert the matrix to images. I'm not sure if that would help though.
Function to convert: mat2gray

NuPIC on MNIST Dataset

I am a newbie. I think idea of NuPIC is really cool and therefore wanted to apply KNN Classifier on
NuPIC's output. I saw there is a KNNClassifier object already in python. I am confused about the input
patter that I should use. In case of MNIST dataset I will be having images where each image is a 2D
array of numbers and will be sparse. I can understand the format of output can be encoded using
categorical encoder in NuPIC but there is no such example of encoding an input that comes in the
form of arrays.
Any help will be highly appreciated.
This might help: http://numenta.org/search.html?q=mnist. There are some good discussions on our mailing lists about MNIST.