I have built a neural network for coreference resolution and need a corpus to train it. Since the OntoNotes corpus has become private, is there an alternative? The corpus need not be large; any alternative with a few annotated documents is fine. I only need the coreference annotations; other annotation layers are not important to me.
PS: The language in context is ENGLISH.
I found this: http://www.cs.umd.edu/~aguha/qbcoreference. It is another English dataset, based on the paper "Removing the Training Wheels: A Coreference Dataset that Entertains Humans and Challenges Computers" (link here).
I have a dataset of tweets labelled with four emotions (anger, joy, fear, sadness). For instance, I transformed each tweet into a vector similar to the following input vector for anger:
Mean of frequency distribution to anger tokens
word2vec similarity to anger
Mean of anger in emotion lexicon
Mean of anger in hashtag lexicon
Is that vector valid to train a neural network?
Your input vector looks fine to start with. Of course, you could later make it more advanced with statistical and derived data from Twitter or other relevant APIs or datasets.
Your network has four outputs, just like you mentioned:
Joy: [1,0,0,0]
Sadness: [0,1,0,0]
Fear: [0,0,1,0]
Anger: [0,0,0,1]
And you may consider adding multiple hidden layers to make it a deep network, if you wish, to make your neural network prototype more robust.
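To make the output scheme above concrete, here is a minimal sketch (using NumPy; the feature values are illustrative, not from a real lexicon) of pairing a 4-dimensional input vector with its one-hot target, ready for a 4-unit softmax output layer:

```python
import numpy as np

# Order matches the one-hot scheme above: joy, sadness, fear, anger.
EMOTIONS = ["joy", "sadness", "fear", "anger"]

def one_hot(label):
    """Return the one-hot target vector for an emotion label."""
    vec = np.zeros(len(EMOTIONS), dtype=int)
    vec[EMOTIONS.index(label)] = 1
    return vec

# One tweet's 4 features (frequency mean, word2vec similarity,
# emotion-lexicon mean, hashtag-lexicon mean) -- illustrative values only.
features = np.array([0.31, 0.72, 0.45, 0.58])
target = one_hot("anger")
print(target.tolist())  # [0, 0, 0, 1]
```

Keeping the label order in one place (the `EMOTIONS` list) avoids the classic bug of training and evaluation code disagreeing about which index means which emotion.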
As your question also shows, it is best to have a good preprocessing and feature extraction system in place before training and testing on your data, and it certainly seems you know where the project is going.
Great project, best wishes, thank you for your good question and welcome to stackoverflow.com!
See also: TensorFlow Playground.
I have a dilemma. If you have only one type of invoice/document, with a specific field that you want to extract from it and use somewhere else (that field happens to be a handwritten digit, sometimes written with dashes or slashes), would you use some OCR software or build your own CNN for recognizing the digits? What accuracy would you expect from OCR? Would your CNN be more accurate, given that you are only interested in a specific style of digit writing, with specific image dimensions, etc.? What would be better in this situation?
Keep in mind that you would not use it in any other way or place for handwritten digit recognition, and that you already have 100k or more documents that have been transcribed to a computer by a human, which you can use for training and testing.
Thank you.
I would definitely go for a CNN based solution. Since the structure of your document is consistent:
Extract the desired portion of the document with a standard computer vision approach
Train a CNN on an annotated set of a few thousand documents. You should even be able to fine-tune an existing CNN trained on MNIST, which would require fewer training images.
This approach should give you >99% accuracy without much effort. The accuracy of the OCR solution really depends on which library you use and the preprocessing you implement.
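The first step above can be sketched as follows. This is a minimal NumPy-only illustration (a real pipeline would more likely use OpenCV, and the field coordinates here are made up): crop the fixed field region from the page and normalize it to a 28x28 grayscale array, matching the input format of an MNIST-trained CNN.

```python
import numpy as np

def extract_field(page, top, left, height, width):
    """Crop the fixed field region from a grayscale page image (2-D array)."""
    return page[top:top + height, left:left + width]

def to_mnist_input(field, out_size=28):
    """Nearest-neighbour resize to out_size x out_size and scale pixel
    values to [0, 1], the usual input range for an MNIST-trained CNN."""
    h, w = field.shape
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    resized = field[rows][:, cols]
    return resized.astype(np.float32) / 255.0

# Illustrative page: a 1000 x 800 grayscale image with the digit field
# at a known, fixed position (coordinates are assumptions).
page = np.random.randint(0, 256, size=(1000, 800))
field = extract_field(page, top=120, left=540, height=64, width=64)
x = to_mnist_input(field)
print(x.shape)  # (28, 28)
```

Because the document layout is consistent, a fixed crop like this is often enough; if the scans shift slightly, a template-matching or registration step would precede the crop.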
As the title says, how can I determine the architecture, or build a reasonable model, for training a neural network given the number of examples?
For example, assuming that I have roughly 50 thousand images and have successfully converted all the data to fit a model, meaning they are ready for training, how do I choose a model that is suitable for training a neural network? I sometimes get a little confused when I have the data but do not know how to set up a model for training the NN.
Fine-tuning is the way to go.
Sometimes you have a pre-trained CNN that you can use as a starting point for your domain. For more about fine-tuning, you can check here.
Accordingly, my advice is to fine-tune a pre-trained neural network that you can find in Keras (this page, under "Available models") or TensorFlow. You can go deeper the more confident you are in your training set!
In any case, you need to look at the number of samples per class rather than the absolute number of images in your training set. If you are confident in your data, you can choose a state-of-the-art deep learning architecture and try to train it from scratch.
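The samples-per-class rule of thumb above can be turned into a small helper. The thresholds here are illustrative guesses, not established rules, so treat this as a sketch to adapt to your domain:

```python
def training_strategy(samples_per_class):
    """Rough heuristic for picking a training strategy from the number of
    annotated samples per class. The thresholds are assumptions for
    illustration -- adjust them to your domain and architecture."""
    if samples_per_class < 1_000:
        return "fine-tune a pre-trained network (freeze most layers)"
    if samples_per_class < 10_000:
        return "fine-tune a pre-trained network (unfreeze more layers)"
    return "training from scratch becomes feasible"

# 50 thousand images spread over, say, 10 classes -> 5k samples per class.
print(training_strategy(50_000 // 10))
```

The point of the helper is only to make the advice explicit: 50k images over 10 balanced classes is a very different situation from 50k images over 1,000 classes.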
I have decided to use a feed-forward NN with back-propagation training for my handwritten-text OCR application. The input layer is going to have 32*32 (1024) neurons, and there will be at least 8-12 output neurons.
Reading some articles, I found Neuroph easy to use, while Encog performs a few times better. Considering the parameters in my scenario, which API is the more suitable one? I would also appreciate a comment on the number of input nodes I have chosen: is it too large a value? (Although that is somewhat off topic.)
First, my disclaimer: I am one of the main developers on the Encog project. This means I am more familiar with Encog than Neuroph, and perhaps biased towards it. In my opinion, the relative strengths of each are as follows. Encog supports quite a few interchangeable machine learning methods and training methods. Neuroph is VERY focused on neural networks, and you can express a connection between just about anything. So if you are going to create very custom/non-standard (research) neural networks with topologies different from the typical Elman/Jordan, NEAT, HyperNEAT, and feedforward networks, then Neuroph will fit the bill nicely.
I have a couple of slightly modified / non-traditional setups for feedforward neural networks which I'd like to compare for accuracy against those used professionally today. Are there specific data sets, or types of data sets, that can be used as a benchmark for this? I.e., "the style of ANN typically used for such-and-such a task is 98% accurate against this data set." It would be great to have a variety of these: a couple for statistical analysis, a couple for image and voice recognition, etc.
Basically, is there a way to compare an ANN I've put together against ANNs used professionally, across a variety of tasks? I could pay for data or software, but would prefer free of course.
CMU has some benchmarks for neural networks: Neural Networks Benchmarks
The Fast Artificial Neural Networks library (FANN) has some benchmarks that are widely used: FANN. Download the source code (version 2.2.0) and look at the datasets directory; the format is very simple. There is always a training set (x.train) and a test set (x.test). At the beginning of each file are the number of instances, the number of inputs, and the number of outputs. The following lines alternate between the inputs of one instance and its outputs, and so on. You can find example programs with FANN in the examples directory. I think they even had detailed comparisons to other libraries in previous versions.
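Given that description of the file layout, a minimal parser for the FANN data format can be sketched in a few lines (this is my own reading of the format as described above: a header line with the three counts, then alternating input/output lines):

```python
def parse_fann(text):
    """Parse the FANN dataset format: a header line
    'num_instances num_inputs num_outputs', followed by alternating
    lines of inputs and outputs, one instance per pair of lines."""
    lines = text.strip().split("\n")
    n, n_in, n_out = (int(v) for v in lines[0].split())
    data = []
    for i in range(n):
        inputs = [float(v) for v in lines[1 + 2 * i].split()]
        outputs = [float(v) for v in lines[2 + 2 * i].split()]
        assert len(inputs) == n_in and len(outputs) == n_out
        data.append((inputs, outputs))
    return data

# A tiny XOR-style set in this format: 4 instances, 2 inputs, 1 output.
sample = """4 2 1
0 0
0
0 1
1
1 0
1
1 1
0"""
print(parse_fann(sample)[1])  # ([0.0, 1.0], [1.0])
```

With the files parsed into (input, output) pairs like this, the x.train and x.test splits plug straight into whatever training loop you are benchmarking.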
I think most of FANN's benchmarks, if not all, are from Proben1. Google for it; there is a paper by Lutz Prechelt with detailed descriptions and comparisons.