Use brain.js neural network to do text analysis

I'm trying to do some text analysis to determine if a given string is... talking about politics. I'm thinking I could create a neural network where the input is either a string or a list of words (ordering might matter?) and the output is whether the string is about politics.
However, the brain.js library only accepts as input a number between 0 and 1, or an array of numbers between 0 and 1. How can I coerce my data so that I can achieve this task?

new brain.recurrent.LSTM() does the trick for you. Example:
var brain = require('brain.js');
var net = new brain.recurrent.LSTM();
net.train([
  {input: "my unit-tests failed.", output: "software"},
  {input: "tried the program, but it was buggy.", output: "software"},
  {input: "i need a new power supply.", output: "hardware"},
  {input: "the drive has a 2TB capacity.", output: "hardware"},
  {input: "unit-tests", output: "software"},
  {input: "program", output: "software"},
  {input: "power supply", output: "hardware"},
  {input: "drive", output: "hardware"},
]);
console.log("output = "+net.run("drive"));
output = hardware
Refer to this link: https://github.com/BrainJS/brain.js/issues/65
It has a clear explanation and usage of brain.recurrent.LSTM().

You need to come up with a scheme to convert your data to a list of pairs [input, expected_output], where input is a list of numbers between 0 and 1 representing the given words, and output is a single number between 0 and 1 representing how close the sentence is to your objective analysis (being political). For example, for the sentence "The quick brown cat jumped over the lazy dog" you might want to give a score of zero, while a sentence like "President shakes off corruption scandal" might score very close to one.
As you can see, your biggest challenge is actually obtaining and cleaning the data. Converting it to the training format is easy: you could simply hash words into numbers between 0 and 1, making sure to handle different casing and punctuation, and you might want to stem words to get the best results.
One more thing: you can use a term-relevance algorithm to rank the importance of words in your training data set, so that you can choose only the top k relevant words in a sentence, since you need a uniform data size for each sentence.
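As a rough illustration (not specific to brain.js), here is a minimal Python sketch of that conversion; the md5-based hashing scheme, the punctuation stripping, and the fixed sentence length K are all assumptions made for the example:

import hashlib

K = 10  # assumed fixed number of words kept per sentence

def word_to_unit_interval(word):
    # Hash a normalized word to a deterministic number in [0, 1).
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) / 16 ** len(digest)

def sentence_to_input(sentence):
    # Lowercase, strip punctuation, keep the first K words (a stand-in for
    # picking the top-k relevant words), and pad with zeros to a uniform size.
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    values = [word_to_unit_interval(w) for w in words if w][:K]
    return values + [0.0] * (K - len(values))

training_data = [
    (sentence_to_input("The quick brown cat jumped over the lazy dog"), 0.0),
    (sentence_to_input("President shakes off corruption scandal"), 1.0),
]
print(training_data[1])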

So apparently text doesn't coerce very well to NN input.
A Naive Bayes Classifier looks like exactly what I want. https://github.com/harthur/classifier

Related

append an atom with existing variables and create a new set in clingo

I am totally new to ASP; I am learning clingo and I have a problem with variables. I am working on graphs and paths in graphs, so I used a tuple such as g((1,2,3)). What I want is to add a new node to the path so that the tuple sequence holds. For instance, the code below will give me (0,(1,2,3)), but what I want is (0,1,2,3).
Thanks in advance.
g((1,2,3)).
g((0,X)):-g(X).
Naive fix:
g((0,X,Y,Z)) :- g((X,Y,Z)).
However, I sense that you want to store the path in the tuple as if it were a list. Bad news: unlike Prolog, clingo isn't meant to handle lists as terms of atoms (like your example does). Lists are handled by indexing the elements; for example, the list [a,b,c] would be stored in predicates like p(1,a). p(2,b). p(3,c). Why? Because of grounding: you aim for a small ground program to reduce the complexity of the solving process.
To put it in numbers: assuming you are searching for a path that includes all n nodes, there are n! potential paths. For n=10 that is 3628800 potential paths, and thus 3628800 potential predicates, for a comparatively small graph. Numbering the nodes as mentioned leads to only n*n potential predicates to represent the path; for n=10 that is just 100, a huge gain compared to 3628800.
To get an impression of what you are searching for, run the following example derived from the Potassco website:
% generating path: for every time exactly one node
{ path(T,X) : node(X) } = 1 :- T=1..6.
% one node isn't allowed on two different positions
:- path(T1,X), path(T2,X), T1!=T2.
% there has to be an edge between 2 adjacent positions
:- path(T,X), path(T+1,Y), not edge(X,Y).
#show path/2.
% Nodes
node(1..6).
% (Directed) Edges
edge(1,(2;3;4)). edge(2,(4;5;6)). edge(3,(1;4;5)).
edge(4,(1;2)). edge(5,(3;4;6)). edge(6,(2;3;5)).
Output:
Answer: 1
path(1,1) path(2,3) path(3,4) path(4,2) path(5,5) path(6,6)
Answer: 2
path(1,1) path(2,3) path(3,5) path(4,4) path(5,2) path(6,6)
Answer: 3
path(1,6) path(2,2) path(3,5) path(4,3) path(5,4) path(6,1)
Answer: 4
path(1,1) path(2,4) path(3,2) path(4,5) path(5,6) path(6,3)
Answer: 5
...

How to embed a sentence into a vector

I have a sentence. I use word2vec to embed each word into a vector. For example, consider a sentence of 5 words, so I get 5 different vectors (one for each word) for the sentence. Which is the best method to turn the complete sentence into a single vector that I can pass to the ANN?
This is an open problem; many approaches exist for creating meaningful sentence vectors:
BoW models, as Fabrizio_P explained
Element-wise vector operations (http://www.aclweb.org/anthology/P/P08/P08-1028.pdf)
Addition (i.e. simply add all the word vectors together, possibly normalizing afterwards)
Multiplication (i.e. multiply all vectors together, element-wise, resulting in a logically grounded embedding)
Arbitrary task-specific recurrent functions (http://www.aclweb.org/anthology/D12-1110)
More sophisticated general-purpose approaches (https://arxiv.org/abs/1508.02354, https://arxiv.org/abs/1506.06726)
Element-wise operations, such as vector addition, suffice for most simple tasks, but obviously exhibit a high amount of information loss as sentences grow larger or the task at hand gets more demanding. Recurrent neural networks are quite good at creating task specific sentence embeddings, but obviously these require training data and some familiarity with machine learning. General purpose sentence embeddings are the most interesting ones from a research perspective, but probably not what you're looking for.
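For illustration, here is a minimal sketch of the element-wise addition/averaging approach; the toy three-dimensional embeddings are assumptions for the example (in practice you would take them from word2vec or similar):

import numpy as np

# Toy word embeddings; real ones come from word2vec, GloVe, etc.
embeddings = {
    "the": np.array([0.1, 0.3, 0.2]),
    "cat": np.array([0.7, 0.1, 0.5]),
    "sat": np.array([0.2, 0.9, 0.4]),
}

def sentence_vector(sentence, dim=3):
    # Average the vectors of all known words; unknown words are skipped.
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

print(sentence_vector("The cat sat"))  # one fixed-size vector for the whole sentence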
You could use the bag-of-words concept, as explained here: https://machinelearningmastery.com/gentle-introduction-bag-words-model/. You collect all of your words and put them in a vocabulary. After that you can represent your sentence as a vector, where each element is either 1 or 0, depending on whether the word is in the sentence or not.
For example if your sentence is
Hello my name is Peter.
Your dictionary will be
[Hello, my, name, is, Peter]
The vector for your sentence will be
[1, 1, 1, 1, 1]
If you have another sentence like
I am happy.
Your dictionary will extend including also those words. So it will be
[Hello, my, name, is, Peter, I, am, happy]
And your vector sentence will also extend
[1, 1, 1, 1, 1, 0, 0, 0]
As an alternative you can also create a vocabulary where each word is represented by a number, so that
{Hello: 1, my: 2, name: 3, is: 4, Peter: 5, I: 6, am: 7, happy: 8}
And the vector for your sentence will be
[1, 2, 3, 4, 5]
For each new sentence you will convert the words into numbers using the vocabulary as reference.
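A minimal sketch of the binary bag-of-words representation described above (the helper name bag_of_words is just for the example):

def bag_of_words(sentences):
    # Build the vocabulary from all sentences, preserving first-seen order.
    vocabulary = []
    tokenized = [s.rstrip(".").split() for s in sentences]
    for words in tokenized:
        for word in words:
            if word not in vocabulary:
                vocabulary.append(word)
    # Represent each sentence as a binary presence vector over the vocabulary.
    vectors = [[1 if word in words else 0 for word in vocabulary]
               for words in tokenized]
    return vocabulary, vectors

vocab, vecs = bag_of_words(["Hello my name is Peter.", "I am happy."])
print(vocab)  # ['Hello', 'my', 'name', 'is', 'Peter', 'I', 'am', 'happy']
print(vecs)   # [[1, 1, 1, 1, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1, 1, 1]]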
word2vec is an algorithm to create word embeddings; you can read the details here: https://www.tensorflow.org/tutorials/word2vec
You can run this algorithm on your own dataset, or use saved word embeddings that Google (or other parties) have trained on billions of documents.
The idea is to map each word to a dense vector in some n-dimensional vector space, thus capturing much more information about words and their relationships.
Put simply, each word is represented by a unique list of numbers, and mathematical operations become possible on words, sentences and documents.
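If you go the pretrained route, one common option (the file name below is just an assumption about which pretrained vectors you have downloaded) is to load them with gensim:

from gensim.models import KeyedVectors

# Load pretrained word2vec vectors, e.g. the GoogleNews binary file
# (downloaded separately); this can take a while and a lot of memory.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors["president"][:5])                   # first 5 dimensions of the embedding
print(vectors.most_similar("president", topn=3))  # nearest neighbours in the space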

Neutrality for sentiment analysis in Spark

I have built a pretty basic Naive Bayes classifier on Apache Spark, using MLlib of course. But I need a few clarifications on what exactly neutrality means.
From what I understand, in a given dataset there are pre-labeled sentences covering the necessary classes; let's take 3, for example, as below.
0-> Negative sentiment
1-> Positive sentiment
2-> Neutral sentiment
This neutral class is pre-labeled in the training set itself.
Is there any other form of neutrality handling? Suppose there are no neutral sentences available in the dataset; is it possible that I can calculate it from the probability scale, like
0.0 - 0.4 => Negative
0.4 - 0.6 => Neutral
0.6 - 1.0 => Positive
Is such a mapping possible in Spark? I searched around but could not find any. The NaiveBayesModel class in the RDD API has a predict method which just returns a double that is mapped according to the training set, i.e. if only 0 and 1 are there it will return only 0 or 1, and not a scaled value between 0.0 and 1.0 as above.
Any pointers/advice on this would be incredibly helpful.
Edit - 1
Sample code
//Performs tokenization, POS tagging and then lemmatization
//Returns an array of strings
val tokenizedString = Util.tokenizeData(text)
val hashingTF = new HashingTF()
//Returns a double
//According to the training set 1.0 => Positive, 0.0 => Negative
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Sample dataset content
1,Awesome movie
0,This movie sucks
Of course the original dataset contains longer sentences, but this should be enough for the explanation, I guess.
I am doing the classification using the above code. My questions are the following:
1) Neutrality handling in dataset
In the above dataset, if I add another category such as
2,This movie can be enjoyed by kids
For argument's sake, let's assume that it is a neutral review; then the model.predict method will give either 1.0, 0.0 or 2.0 based on the passed-in sentence.
2) model.predictProbabilities gives an array of doubles, but I am not sure in what order it gives the results, i.e. is index 0 for negative or for positive? With three classes, i.e. Negative, Positive, Neutral, in what order will that method return the predictions?
It would have been helpful to have the code that builds the model (for your example to work, the 0.0 from the dataset must be converted to 0.0 as a Double in the model, either after indexing it with a StringIndexer stage, or if you converted that from the file), but assuming that this code works:
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Then yes, it means the probability at index 0 is that of negative and the one at index 1 that of positive (it's a bit strange and there must be a reason, but everything is a double in ML, even feature and category indexes). If you have something like this in your code:
val labelIndexer = new StringIndexer()
.setInputCol("sentiment")
.setOutputCol("indexedsentiment")
.fit(trainingData)
Then you can use labelIndexer.labels to identify the labels (the probability at index 0 is for labelIndexer.labels at index 0).
Now regarding your other questions.
Neutrality can mean two different things. Type 1: a review contains as many positive as negative words. Type 2: there is (almost) no sentiment expressed.
A Neutral category can be very helpful if you want to manage Type 2. If that is the case, you need neutral examples in your dataset. Naive Bayes is not a good classifier to apply thresholding on the probabilities in order to determine Type 2 neutrality.
Option 1: Build a dataset (if you think you will have to deal with a lot of Type 2 neutral texts). The good news is, building a neutral dataset is not too difficult. For instance you can pick random texts that are not movie reviews and assume they are neutral. It would be even better if you could pick content that is closely related to movies (but neutral), like a dataset of movie synopses. You could then create a multi-class Naive Bayes classifier (between neutral, positive and negative) or a hierarchical classifier (the first step is a binary classifier that determines whether a text is a movie review or not, the second step determines the overall sentiment).
Option 2 (can be used to deal with both Type 1 and Type 2). As I said, Naive Bayes is not great at dealing with thresholds on the probabilities, but you can try that. Without a dataset, though, it will be difficult to determine the thresholds to use. Another approach is to identify the number of words or stems that have a significant polarity. One quick and dirty way to achieve that is to query your classifier with each individual word and count the number of times it returns "positive" with a probability significantly higher than the negative class (discard words whose probabilities are too close to each other, for instance within 25%; a bit of experimentation will be needed here). At the end, you may end up with, say, 20 positive words vs 15 negative ones and determine it is neutral because it is balanced, or if you have 0 positive and 1 negative, return neutral because the count of polarized words is too low. A rough sketch of this word-counting idea is shown below.
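As a rough, hypothetical sketch of that quick-and-dirty word-counting idea (classify_word below is a stand-in for querying your trained classifier with a single word and is an assumption, not a real MLlib call; the margin and count thresholds would need experimentation):

def classify_word(word):
    # Hypothetical stand-in for querying your trained classifier with one word;
    # returns (probability_negative, probability_positive).
    demo = {"awesome": (0.1, 0.9), "sucks": (0.9, 0.1)}
    return demo.get(word, (0.5, 0.5))

def neutrality_by_word_counts(words, margin=0.25, min_polarized=2):
    positive, negative = 0, 0
    for word in words:
        p_neg, p_pos = classify_word(word)
        # Discard words whose class probabilities are too close to each other.
        if abs(p_pos - p_neg) < margin:
            continue
        if p_pos > p_neg:
            positive += 1
        else:
            negative += 1
    if positive + negative < min_polarized:
        return "Neutral"  # almost no polarized words: Type 2 neutrality
    if abs(positive - negative) <= 1:
        return "Neutral"  # balanced polarized counts: Type 1 neutrality
    return "Positive" if positive > negative else "Negative"

print(neutrality_by_word_counts("this movie can be enjoyed by kids".split()))  # Neutral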
Good luck and hope this helped.
I am not sure if I understand the problem but:
the prior in Naive Bayes is computed from the data and cannot be set manually.
in MLlib you can use predictProbabilities to obtain class probabilities.
in ML you can use setThresholds to set the prediction threshold for each class.
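If you do obtain per-class probabilities (e.g. via predictProbabilities), the banded mapping proposed in the question can then be applied on top of them; a minimal, library-agnostic sketch using the question's band boundaries:

def map_to_sentiment(p_positive):
    # p_positive: probability of the positive class, in [0, 1].
    if p_positive < 0.4:
        return "Negative"
    elif p_positive <= 0.6:
        return "Neutral"
    return "Positive"

print(map_to_sentiment(0.85))  # Positive
print(map_to_sentiment(0.50))  # Neutral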

How do I interpret pycaffe classify.py output?

I created a GoogleNet Model via Nvidia DIGITS with two classes (called positive and negative).
If I classify an image with DIGITS, it shows me a nice result like positive: 85.56% and negative: 14.44%.
If I pass that model into pycaffe's classify.py with the same image, I get a result like array([[ 0.38978559, -0.06033826]], dtype=float32).
So, how do I read/interpret this result? How do I calculate the confidence levels (not sure if this is the right term) shown by DIGITS from the results shown by classify.py?
This issue led me to the solution.
As the log shows, the network produces three outputs. Classifier#classify only returns the first output. So e.g. by changing predictions = out[self.outputs[0]] to predictions = out[self.outputs[2]], I get the desired values.
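If what you get are raw, unnormalized scores rather than the percentages DIGITS shows, one way to obtain comparable confidence values (assuming the scores are the logits that feed the softmax layer) is to apply a softmax yourself:

import numpy as np

def softmax(scores):
    # Convert unnormalized class scores (logits) into probabilities summing to 1.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

raw = np.array([0.38978559, -0.06033826])  # example raw output for 2 classes
probs = softmax(raw)
print(["{:.2%}".format(p) for p in probs])  # roughly ['61.07%', '38.93%']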

how to find all the possible longest common subsequences from the same positions

I am trying to find all the possible longest common subsequences from the same positions of multiple fixed-length strings (there are 700 strings in total, each string has 25 letters). The longest common subsequence must contain at least 3 letters and appear in at least 3 strings. So if I have:
String test1 = "abcdeug";
String test2 = "abxdopq";
String test3 = "abydnpq";
String test4 = "hzsdwpq";
I need the answer to be:
String[] Answer = ["abd", "dpq"];
My one problem is this needs to be as fast as possible. I tried to find the answer with a suffix tree, but the suffix tree method gives ["ab","pq"]: a suffix tree can only find contiguous substrings common to multiple strings. The standard longest common subsequence algorithm cannot solve this problem either.
Does anyone have any idea on how to solve this with low time cost?
Thanks
I suggest you cast this into a well known computational problem before you try to use any algorithm that sounds like it might do what you want.
Here is my suggestion: convert this into a graph problem. For each position in the string you create a set of nodes (one for each unique letter found at that position amongst all the strings in your collection, so up to 700 nodes if all 700 strings differ at that position). Once you have created all the nodes for each position, you go through your set of strings and look at how often two positions share at least 3 equal connections. In your example we would look first at positions 1 and 2 and see that three strings contain "a" in position 1 and "b" in position 2, so we add a directed edge between the node "a" in the first set of nodes and the node "b" in the second group of nodes (continue doing this for all pairs of positions and all combinations of letters in those two positions). You do this for each combination of positions until you have added all necessary edges.
Once you have your final graph, you must look for the longest path; I recommend looking at the Wikipedia article here: Longest Path. In our case we will have a directed acyclic graph, so you can solve it in linear time! The preprocessing should be quadratic in the number of string positions, since I imagine your alphabet is of fixed size. A rough sketch of this construction is given below.
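Here is a rough Python sketch of that construction, assuming an edge is added whenever at least 3 strings agree on a letter pair; note that because edges only encode pairwise agreement, the resulting longest path is a candidate that you may still want to verify against the original strings:

from collections import defaultdict

def longest_shared_subsequence(strings, min_support=3):
    length = len(strings[0])
    # Count, for every pair of positions (i, j) with i < j and letters (a, b),
    # how many strings have a at position i and b at position j.
    pair_counts = defaultdict(int)
    for s in strings:
        for i in range(length):
            for j in range(i + 1, length):
                pair_counts[(i, s[i], j, s[j])] += 1
    # Directed edge (i, a) -> (j, b) when at least min_support strings agree.
    edges = defaultdict(list)
    nodes = set()
    for (i, a, j, b), count in pair_counts.items():
        if count >= min_support:
            edges[(i, a)].append((j, b))
            nodes.update([(i, a), (j, b)])
    # Longest path in this DAG: process nodes from the last position backwards.
    best_len, best_str = {}, {}
    for node in sorted(nodes, reverse=True):
        best_len[node], best_str[node] = 1, node[1]
        for succ in edges[node]:
            if 1 + best_len[succ] > best_len[node]:
                best_len[node] = 1 + best_len[succ]
                best_str[node] = node[1] + best_str[succ]
    return max(best_str.values(), key=len, default="")

print(longest_shared_subsequence(["abcdeug", "abxdopq", "abydnpq", "hzsdwpq"]))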
P.S: You sent me an email about the biclustering algorithm I am working on; it is not yet published but will be available sometime this year (fingers crossed). Thanks for your interest though :)
You may try to use hashing.
Each string has at most 25 characters, which means it has 2^25 subsequences. You take each string and calculate all 2^25 hashes. Then you join the hashes from all strings and find which of them are contained in at least 3 strings.
In order to get the lengths of those subsequences, you need to store not only hashes, but pairs <hash, subsequence_pointer> where subsequence_pointer determines the subsequence of that hash (the easiest way is to enumerate all hashes of all strings and store the hash number).
Based on this algorithm, the program in the worst case (700 strings, 25 characters each) will run for a few minutes.
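A small Python sketch of this idea, storing the subsequences themselves in a dictionary (so Python's built-in hashing plays the role of the hash function) and deduplicating per string; the brute-force enumeration is only practical for short strings, as noted above:

from itertools import combinations
from collections import defaultdict

def longest_common_subsequences(strings, min_length=3, min_strings=3):
    # Map each subsequence to the set of string indices that contain it.
    owners = defaultdict(set)
    for index, s in enumerate(strings):
        for size in range(min_length, len(s) + 1):
            for positions in combinations(range(len(s)), size):
                owners["".join(s[p] for p in positions)].add(index)
    shared = [sub for sub, idx in owners.items() if len(idx) >= min_strings]
    longest = max((len(sub) for sub in shared), default=0)
    return [sub for sub in shared if len(sub) == longest]

print(longest_common_subsequences(["abcdeug", "abxdopq", "abydnpq", "hzsdwpq"]))
# prints ['abd', 'dpq'] for the example from the question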