How to embed a sentence into a vector - neural-network

I have a sentence and I use word2vec to embed each word into a vector. For example, if I have a sentence of 5 words, I get 5 different vectors (one for each word). What is the best method to turn the complete sentence into a single vector that I can pass to the ANN?

This is an open problem; many approaches exist for creating meaningful sentence vectors:
BoW models, as Fabrizio_P explained
Element-wise vector operations (http://www.aclweb.org/anthology/P/P08/P08-1028.pdf)
Addition (i.e. simply add all the word vectors together, possibly normalizing afterwards)
Multiplication (i.e. multiply all vectors together, element-wise, resulting in a logically grounded embedding)
Arbitrary task-specific recurrent functions (http://www.aclweb.org/anthology/D12-1110)
More sophisticated general-purpose approaches (https://arxiv.org/abs/1508.02354, https://arxiv.org/abs/1506.06726)
Element-wise operations, such as vector addition, suffice for most simple tasks, but obviously lose a lot of information as sentences grow longer or the task at hand gets more demanding. Recurrent neural networks are quite good at creating task-specific sentence embeddings, but they require training data and some familiarity with machine learning. General-purpose sentence embeddings are the most interesting ones from a research perspective, but probably not what you're looking for.
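For illustration, here is a minimal sketch of the element-wise addition/averaging approach in Python; the tiny 3-dimensional word_vectors dictionary is made up for the example and would normally come from a trained word2vec model.

import numpy as np

# Toy word vectors; real word2vec embeddings would have ~100-300 dimensions.
word_vectors = {
    "the": np.array([0.1, 0.3, -0.2]),
    "cat": np.array([0.7, -0.1, 0.4]),
    "sat": np.array([-0.3, 0.5, 0.2]),
}

def sentence_vector(tokens, vectors):
    # Average the vectors of all known words; element-wise sum works the same way.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(next(iter(vectors.values())).shape)
    return np.mean(known, axis=0)   # or np.sum(known, axis=0) for plain addition

print(sentence_vector(["the", "cat", "sat"], word_vectors))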

You could use the bag-of-words concept, as explained here: https://machinelearningmastery.com/gentle-introduction-bag-words-model/. You collect all of your words and put them in a vocabulary. After that you can represent your sentence as a vector, where each element is either 1 or 0, depending on whether the word is in the sentence or not.
For example if your sentence is
Hello my name is Peter.
Your dictionary will be
[Hello, my, name, is, Peter]
The vector for your sentence will be
[1, 1, 1, 1, 1]
If you have another sentence like
I am happy.
Your dictionary will extend including also those words. So it will be
[Hello, my, name, is, Peter, I, am, happy]
And your vector sentence will also extend
[1, 1, 1, 1, 1, 0, 0, 0]
As an alternative you can also create a vocabulary where each word is represented by a number, so that
{Hello: 1, my: 2, name: 3, is: 4, Peter: 5, I: 6, am: 7, happy: 8}
And the vector for your sentence will be
[1, 2, 3, 4, 5]
For each new sentence you will convert the words into numbers using the vocabulary as reference.
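A minimal sketch of both encodings just described, in Python; the whitespace tokenization (no punctuation handling, no lowercasing) is deliberately simplistic.

def build_vocabulary(sentences):
    # Assign an index (starting at 1, as in the example above) to every distinct word.
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def binary_bow(sentence, vocab):
    # 1/0 vector: is the vocabulary word present in the sentence?
    words = set(sentence.split())
    return [1 if w in words else 0 for w in vocab]

def index_encoding(sentence, vocab):
    # Replace each word with its vocabulary index.
    return [vocab[w] for w in sentence.split()]

vocab = build_vocabulary(["Hello my name is Peter", "I am happy"])
print(binary_bow("Hello my name is Peter", vocab))      # [1, 1, 1, 1, 1, 0, 0, 0]
print(index_encoding("Hello my name is Peter", vocab))  # [1, 2, 3, 4, 5]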

word2vec is an algorithm for creating word embeddings; you can read the details here: https://www.tensorflow.org/tutorials/word2vec
You can run this algorithm on your own dataset, or use saved word embeddings that Google (or other parties) have trained on billions of documents.
The idea is to map each word to a dense vector in some n-dimensional vector space, which captures much more information about words and their relationships.
Put simply, each word is represented by a unique list of numbers, and mathematical operations become possible on words, sentences and documents.
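To make "mathematical operations on words" concrete, here is a toy sketch with made-up 4-dimensional vectors; real pretrained embeddings behave the same way, just in higher dimensions.

import numpy as np

# Made-up embeddings purely for illustration; a real model would give ~300 dimensions.
emb = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.8, 0.1, 0.6, 0.0]),
    "man":   np.array([0.2, 0.7, 0.0, 0.1]),
    "woman": np.array([0.2, 0.2, 0.5, 0.1]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic on words: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best, cosine(emb[best], target))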

Related

append an atom with exisiting variables and create new set in clingo

I am totally new to ASP; I am learning clingo and I have a problem with variables. I am working on graphs and paths in graphs, so I used a tuple such as g((1,2,3)). What I want is to add a new node to the path so that the tuple sequence stays flat. For instance, the code below gives me (0,(1,2,3)), but what I want is (0,1,2,3).
Thanks in advance.
g((1,2,3)).
g((0,X)):-g(X).
Naive fix:
g((0,X,Y,Z)) :- g((X,Y,Z)).
However, I sense that you want to store the path in the tuple as if it were a list. Bad news: unlike Prolog, clingo isn't meant to handle lists as terms of atoms (as your example does). Lists are handled by indexing the elements; for example, the list [a,b,c] would be stored in predicates like p(1,a). p(2,b). p(3,c). Why? Because of grounding: you aim to get a small ground program to reduce the complexity of the solving process. To put it in numbers: assume you are searching for a path that includes all n nodes; there are n! such paths. For n=10 that is 3,628,800 potential paths, introducing 3,628,800 predicates for a comparatively small graph. Numbering the nodes as mentioned leads to only n*n potential predicates to represent the path; for n=10 that is just 100, a huge gain compared to 3,628,800.
To get an impression of what you are searching for, run the following example, derived from the Potassco website:
% generating path: for every time exactly one node
{ path(T,X) : node(X) } = 1 :- T=1..6.
% one node isn't allowed on two different positions
:- path(T1,X), path(T2,X), T1!=T2.
% there has to be an edge between 2 adjacent positions
:- path(T,X), path(T+1,Y), not edge(X,Y).
#show path/2.
% Nodes
node(1..6).
% (Directed) Edges
edge(1,(2;3;4)). edge(2,(4;5;6)). edge(3,(1;4;5)).
edge(4,(1;2)). edge(5,(3;4;6)). edge(6,(2;3;5)).
Output:
Answer: 1
path(1,1) path(2,3) path(3,4) path(4,2) path(5,5) path(6,6)
Answer: 2
path(1,1) path(2,3) path(3,5) path(4,4) path(5,2) path(6,6)
Answer: 3
path(1,6) path(2,2) path(3,5) path(4,3) path(5,4) path(6,1)
Answer: 4
path(1,1) path(2,4) path(3,2) path(4,5) path(5,6) path(6,3)
Answer: 5
...

Online Algorithm approach for alternating subsequence

Consider a sequence A = a1, a2, a3, ..., an of integers. A subsequence B of A is a sequence B = b1, b2, ..., bm created from A by removing some elements while keeping the order. Given an integer sequence A, the goal is to compute an alternating subsequence B, i.e. a sequence b1, ..., bm such that for all i in {2, 3, ..., m-1}: if b{i-1} < b{i} then b{i} > b{i+1}, and if b{i-1} > b{i} then b{i} < b{i+1}.
Consider an online version of the problem, where the sequence A is given element-by-element and each time, one needs to directly decide whether to include the next element in the subsequence B. Is it possible to achieve a constant competitive ratio (by using a deterministic online algorithm)? Either give an online algorithm which achieves a constant competitive ratio or show that it is not possible to find such an online algorithm.
Assume sequence [9,8,9,8,9,8, .... , 9,8,9,8,2,1,2,9,8,9, ... , 8,9,8,9,8,9]
My Argumentation:
The algorithm must decide immediately whether it inserts an incoming number into the subsequence. If the algorithm now gets the numbers 1, then 2, then 2, it will eventually decide that they are part of the subsequence, and it will thus be worse than the optimal solution of n-3 by a nonlinear factor.
-> No constant competitive ratio!
Is this a proper argumentation?
If I understood what you meant, your argument is correct, but the sequence you gave in the example is wrong. For example, the algorithm may simply choose all the 9's and 8's.
You can alter your argument slightly to make it more accurate, for example consider the sequence
3,4,3,4,3,4,......, 1/5,2/6,1/5,2/6,....
Explanation:
You start the sequence with 3,4,3,4,... etc. until the algorithm picks two numbers. If it never does, it's obviously not competitive (it gets 0 or 1 out of n).
If the algorithm picked a 3, then a 4, it must next take a number lower than 4. By continuing with 5,6,5,6,... the algorithm cannot take another number.
If the algorithm chose to take a 4 then a 3, by a similar reasoning we can easily see how continuing with 1,2,1,2,... prevents the algorithm from taking another number.
Thus, in any case, the algorithm cannot take more than 2 numbers for every n, which, as you stated, isn't a constant competitive ratio.

How to handle huge sparse matrices construction using Scipy?

So, I am working on a Wikipedia dump to compute the PageRank scores of around 5,700,000 pages, give or take.
The files are preprocessed and hence are not in XML.
They are taken from http://haselgrove.id.au/wikipedia.htm
and the format is:
from_page(1): to(12) to(13) to(14)..
from_page(2): to(21) to(22)..
.
.
.
from_page(5,700,000): to(xy) to(xz)
and so on. So basically it's the construction of a [5,700,000 * 5,700,000] matrix, which would just break my 4 GB of RAM. Since it is very, very sparse, it's easier to store using scipy.sparse.lil_matrix or scipy.sparse.dok_matrix. Now my issue is:
How on earth do I go about converting the .txt file with the link information to a sparse matrix? Read it and compute it as a normal N*N matrix then convert it or what? I have no idea.
Also, the links sometimes span across lines so what would be the correct way to handle that?
eg: a random line is like..
[
1: 2 3 5 64636 867
2:355 776 2342 676 232
3: 545 64646 234242 55455 141414 454545 43
4234 5545345 2423424545
4:454 6776
]
exactly like this: no commas & no delimiters.
Any information on sparse matrix construction and data handling across lines would be helpful.
Scipy offers several implementations of sparse matrices, each with its own advantages and disadvantages. You can find information about the different matrix formats in the SciPy sparse documentation.
There are several ways to get to your desired sparse matrix. Computing the full NxN matrix and then converting is probably not possible, due to the high memory requirements (on the order of 10^13 entries!).
In your case I would prepare your data to construct a coo_matrix.
coo_matrix((data, (i, j)), [shape=(M, N)])
data[:] the entries of the matrix, in any order
i[:] the row indices of the matrix entries
j[:] the column indices of the matrix entries
You might also want to have a look at lil_matrix, which can be used to incrementally build your matrix.
Once you created the matrix you can then convert it to a better suited format for calculation, depending on your use case.
I do not recognize the data format; there might be parsers for it, there might not. Writing your own parser should not be very difficult, though. Each line containing a colon starts a new row; all indices after the colon, and in consecutive lines without colons, are the column entries for that row.
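As a rough sketch (not a definitive implementation), the parser below assumes the format described in the question: a line with a colon starts a new row, lines without a colon continue the previous row, and page ids are integers. The file name and the 1-based-id handling are assumptions.

import numpy as np
from scipy.sparse import coo_matrix

N = 5700000 + 1        # page ids appear to start at 1, so reserve one extra slot
rows, cols = [], []
current_row = None

with open("links.txt") as f:              # placeholder file name
    for line in f:
        line = line.strip()
        if not line:
            continue
        if ":" in line:                   # "from: to to to ..." starts a new row
            head, _, tail = line.partition(":")
            current_row = int(head)
            targets = tail.split()
        else:                             # continuation line: more targets for the same row
            targets = line.split()
        rows.extend([current_row] * len(targets))
        cols.extend(int(t) for t in targets)

data = np.ones(len(rows), dtype=np.float32)
adjacency = coo_matrix((data, (rows, cols)), shape=(N, N))

# Convert to CSR for fast row slicing and matrix-vector products (PageRank iterations).
adjacency = adjacency.tocsr()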

Pearson perfect hashing

I'm trying to write a generator that produces Pearson perfect hashes. Note that I don't need a minimal perfect hash. Wikipedia says that a Pearson perfect hash can be found in O(|S|) time using a randomized algorithm (where S is the set of keys). However, I haven't been able to find such an algorithm online. Is this even possible?
Note: I don't want to use gperf/cmph/etc.; I'd rather write my own implementation.
Pearson's original paper outlines an algorithm to construct a permutation table T for perfect hashing:
The table T at the heart of this new hashing function can sometimes be modified to produce a minimal, perfect hashing function over a modest list of words. In fact, one can usually choose the exact value of the function for a particular word. For example, Knuth [3] illustrates perfect hashing with an algorithm that maps a list of 31 common English words onto unique integers between −10 and 30. The table T presented in Table II maps these same 31 words onto the integers from 1 to 31 in alphabetic order.
Although the procedure for constructing the table in Table II is too involved to be detailed here, the following highlights will enable the interested reader to repeat the process:
1. A table T was constructed by pseudorandom permutation of the integers (0 ... 255).
2. One by one, the desired values were assigned to the words in the list. Each assignment was effected by exchanging two elements in the table.
3. For each word, the first candidate considered for exchange was T[h[n − 1] ⊕ C[n]], the last table element referenced in the computation of the hash function for that word.
4. A table element could not be exchanged if it was referenced during the hashing of a previously assigned word or if it was referenced earlier in the hashing of the same word.
5. If the necessary exchange was forbidden by Rule 4, attention was shifted to the previously referenced table element, T[h[n − 2] ⊕ C[n − 1]].
The procedure is not always successful. For example, using the ASCII character codes, if the word “a” hashes to 0 and the word “i” hashes to 15, it turns out that the word “in” must hash to 0. Initial attempts to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly this reason. The shift to the range (1 ... 31) was an ad hoc tactic to circumvent this problem.
Does this tampering with T damage the statistical behavior of the hashing function? Not seriously. When the 26,662 dictionary entries are hashed into 256 bins, the resulting distribution is still not significantly different from uniform (χ² = 266.03, 255 d.f., p = 0.30). Hashing the 128 randomly selected dictionary words resulted in an average of 27.5 collisions versus 26.8 with the unmodified T. When this function is extended as described above to produce 16-bit hash indices, the same test produces a substantially greater number of collisions (4,870 versus 4,721 with the unmodified T), although the distribution still is not significantly different from uniform (χ² = 565.2, 532 d.f., p = 0.154).
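Pearson's exchange procedure quoted above is fiddly to reproduce, so as a hedged illustration here is a much simpler approach: brute-force retrying random permutation tables until one happens to be collision-free on the key set. This is not the O(|S|) randomized construction, but it shows the table-driven hash itself and works for small key sets.

import random

def pearson_hash(key, table):
    # 8-bit Pearson hash: h = T[h XOR next byte].
    h = 0
    for byte in key.encode("utf-8"):
        h = table[h ^ byte]
    return h

def find_perfect_table(keys, max_tries=100000, seed=0):
    # Brute-force search for a permutation table that is collision-free on the given keys.
    # This is NOT Pearson's exchange construction, just an illustration.
    rng = random.Random(seed)
    table = list(range(256))
    for _ in range(max_tries):
        rng.shuffle(table)
        hashes = {pearson_hash(k, table) for k in keys}
        if len(hashes) == len(keys):     # no collisions -> perfect on this key set
            return list(table)
    raise RuntimeError("no perfect table found; try more iterations or fewer keys")

keys = ["if", "else", "while", "for", "return"]
table = find_perfect_table(keys)
print({k: pearson_hash(k, table) for k in keys})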

Is my subset sum algorithm of polynomial time?

I came up with a new algorithm to solve the subset sum problem, and I think it's in polynomial time. Tell me I'm either wrong or a total genius.
Quick starter facts:
The subset sum problem is NP-complete. Solving it in polynomial time would mean that P = NP. The number of subsets of a set of size N is 2^N.
On the more useful hand, the number of unique subset sums for a set of size N is at most the sum of the whole set minus the smallest element; in other words, the range of sums that any subset can possibly produce lies between the sum of all the negative elements and the sum of all the positive elements, since no sum can possibly be bigger or smaller than those, and these bounds grow at a linear rate when we add extra elements.
What this means is that as N increases, the number of duplicate subsets increases exponentially, and the number of unique, useful subsets increases only linearly. If an algorithm could be devised that could remove the duplicate subsets at the earliest possible opportunity, we would run in polynomial time. A quick example is easily taken from binary. From only the numbers that are powers of two, we can create unique subsets for any integral value. As such, any subset involving any other number (if we had all powers of two) is a duplicate and a waste. By not computing them and their derivatives, we can save virtually all the running time of the algorithm, since powers of two occur only logarithmically often compared to the integers.
As such, I propose a simple algorithm that will remove these duplicates and save having to compute them and all their derivatives.
To begin with, we'll sort the set, which is only O(N log N), and split it into two halves, positive and negative. The procedure for the negative numbers is identical, so I'll only outline the positive numbers ("the set" now means just the positive half, just for clarification).
Imagine an array indexed by sum, which has entries for all the possible result sums of the positive side (which is only linear, remember). When we add an entry, the value is the set of elements making up that subset. So, for example, array[3] = { 1, 2 }.
In general, we now move to enumerate all subsets in the set. We do this by starting with the subsets of length one, then adding to them. When we have all the unique subsets, they form an array, and we simply iterate them in the fashion used in Horowitz/Sahni.
Now we start with the "first generation" values. That is, if there were no duplicate numbers in the original data set, there are guaranteed to be no duplicates in these values. That is, all single-element subsets, and all subsets whose length is one less than the set's. These can easily be generated by summing the set and subtracting each element in turn. In addition, the set itself is a valid first-generation sum and subset, as is each individual element of the set.
Now we do the second-generation values. We loop through each value in the array and, for each unique subset, if it doesn't contain the current element, we add it and compute the new unique subset. If we have a duplicate, fun occurs. We add it to a collision list. When we come to add new subsets, we check if they're on the collision list. We key by the less desirable (normally larger, but it can be arbitrary) equal subset. When we come to add to subsets, if we would generate a collision, we simply do nothing. When we come to add the more desirable subset, it misses the check and adds, generating the common subset. Then we just repeat for the other generations.
By removing duplicate subsets in this manner, we don't have to keep combining the duplicates with the rest of the set, nor keep checking them for collisions, nor sum them. Most importantly, by not creating new subsets that are non-unique, we're not generating new subsets from them, which can, I believe, turn the algorithm from NP to P, since the growth of subsets is no longer exponential: we discard the vast majority of them before they can "reproduce" in the next generation and create more subsets by being combined with the other non-duplicate subsets.
I don't think I've explained this too well. I have pictures... they're in my head. The important thing is that by discarding duplicate subsets, you could remove virtually all of the complexity.
For example, imagine (because I'm doing this example by hand) a simple dataset that goes -7 to 7 (excluding zero), for which we aim at zero. Sort and split, so we're left with (1, 2, 3, 4, 5, 6, 7). The sum is 28, but 2^7 is 128, so 128 subsets fit in the range 1 .. 28, meaning that we know in advance that 100 subsets are duplicates. If we had 8 elements, we'd have only 36 slots but 256 subsets, so you can easily see that the number of dupes would now be 220, more than double what it was before.
In this case, the first generation values are 1, 2, 3, 4, 5, 6, 7, 28, 27, 26, 25, 24, 23, 22, 21, and we map them to their constituent components, so
1 = { 1 }
2 = { 2 }
...
28 = { 1, 2, 3, 4, 5, 6, 7 }
27 = { 2, 3, 4, 5, 6, 7 }
26 = { 1, 3, 4, 5, 6, 7 }
...
21 = { 1, 2, 3, 4, 5, 6 }
Now to generate the new subsets, we take each subset in turn and add it to each other subset, unless they have a mutual sub-subset, e.g. 28 and 27 have a huge mutual sub-subset. So when we take 1 and add it to 2, we get 3 = { 1, 2 }, but wait! It's already in the array. What this means is that we now don't add 1 to any subset that already has 2 in it, and vice versa, because that would duplicate 3's subset.
Now we have
1 = { 1 }
2 = { 2 }
// Didn't add 1 to 2 to get 3 because that's a dupe
3 = { 3 } // Add 1 to 3: again we get a duplicate. Repeat the process.
4 = { 4 } // And again.
...
8 = { 1, 7 }
21? Already has 1 in.
...
27? We already have 28
Now we add 2 to all.
1? Existing duplicate
3? Get a new duplicate
...
9 = { 2, 7 }
10 = { 1, 2, 7 }
21? Already has 2 in
...
26? Already have 28
27? Got 2 in already.
3?
1? Existing dupe
2? Existing dupe
4? New duplicate
5? New duplicate
6? New duplicate
7? New duplicate
11 = { 1, 3, 7 }
12 = { 2, 3, 7 }
13 = { 1, 2, 3, 7 }
As you can see, even though I am still adding new subsets each time, the quantity is only going up linearly.
4?
1? Existing dupe
2? Existing dupe
3? Existing dupe
5? New duplicate
6? New duplicate
7? New duplicate
8? New duplicate
9? New duplicate
14 = {1, 2, 4, 7}
15 = {1, 3, 4, 7}
16 = {2, 3, 4, 7}
17 = {1, 2, 3, 4, 7}
5?
1,2,3,4 existing duplicate
6,7,8,9,10,11,12 new duplicate
18 = {1, 2, 3, 5, 7}
19 = {1, 2, 4, 5, 7}
20 = {1, 3, 4, 5, 7}
21 = new duplicate
Now we have every value in the range, so we stop, and add to our list 1-28. Repeat for negative numbers, iterate through lists.
Edit:
This algorithm is totally wrong in any case. Subsets which have duplicate sums are not duplicates for the purposes of which subsets can be spawned from them, because they are arrived at differently; i.e., they cannot be folded.
This does not prove P = NP.
You have failed to consider the possibility that the positive numbers are 1, 2, 4, 8, 16, etc., in which case there will be no duplicates when you sum subsets, so the algorithm will run in O(2^N) time.
You can treat this as a special case, but the algorithm is still not polynomial for other similar cases. The assumption you made is where you move away from the NP-complete version of subset sum towards solving only easy (polynomial-time) problems:
[assume the sum of the positive numbers grows] at a linear rate when we add extra elements.
Here you are effectively assuming that P (i.e. the number of bits required to state the problem) is smaller than N. Quote from Wikipedia:
Thus, the problem is most difficult if N and P are of the same order.
If you assume that N and P are of the same order, then you can't assume that the sum grows linearly indefinitely as you add more elements. As you add more elements to your set, those elements also need to get larger to ensure that the problem remains hard to solve.
If P (the number of place values) is a small fixed number, then there are dynamic programming algorithms that can solve it exactly.
You have rediscovered one of these algorithms. It's a nice piece of work but it isn't something new and it doesn't prove P = NP. Sorry!
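For reference, a minimal sketch of the kind of dynamic program referred to above: it tracks the set of reachable subset sums, so its running time depends on how many distinct sums exist (pseudopolynomial), not only on N. The example values are arbitrary.

def subset_sum(nums, target):
    # Return True if some subset of nums sums to target.
    # Runs in roughly O(n * w), where w is the number of distinct reachable sums
    # (pseudopolynomial, not polynomial in the input size).
    reachable = {0}
    for x in nums:
        reachable |= {s + x for s in reachable}
        if target in reachable:
            return True
    return target in reachable

print(subset_sum([3, 34, 4, 12, 5, 2], 9))   # True  (4 + 5)
print(subset_sum([3, 34, 4, 12, 5, 2], 30))  # False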
Dead MG,
It has been almost half a year since you posted but I will answer anyway.
Mark Byers wrote most of what should be written.
The algorithm is known.
Such algorithms are known as generating function algorithms or simply as dynamic programming algorithms.
Your algorithm has a very important feature, the so-called pseudopolynomial complexity.
Traditional complexity is a function of the size of the problem. In terms of traditional complexity your algorithm has O(2^N) pessimistic (worst-case) complexity (that is, for the numbers 1, 2, 4, ... as was mentioned earlier).
The complexity of your algorithm can alternatively be expressed as a function of the size of the problem and the size of some numbers in the problem. For your algorithm it would be something like O(N*W), where W is the number of distinct sums.
This is pseudopolynomial complexity. It is a VERY important feature. Such algorithms can solve lots of real-world problem instances, regardless of the problem's complexity class.
The Horowitz and Sahni algorithm has pessimistic complexity O(2^(N/2)). This is not two times better than your algorithm but a lot more: 2^(N/2) times better. What Greg probably meant was that the Horowitz and Sahni algorithm can solve instances twice as big (having twice as many numbers in the subset sum).
That's true in theory, but in practice Horowitz and Sahni can solve (on home computers) instances with about 60 numbers, while an algorithm similar to yours can handle even instances with 1000 numbers (provided that the numbers aren't too big themselves).
In fact the two algorithms can even be mixed, that is, one of your kind with the Horowitz and Sahni algorithm. Such a solution has both pseudopolynomial complexity and pessimistic complexity of O(2^(N/2)).
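For comparison, a rough sketch of the Horowitz-Sahni meet-in-the-middle idea mentioned above: enumerate the subset sums of each half (2^(N/2) each), sort one side, and binary-search it for the complement of every sum on the other side. This is a simplified illustration, not the exact 1974 algorithm.

from bisect import bisect_left
from itertools import combinations

def half_sums(nums):
    # All 2^len(nums) subset sums of one half.
    sums = []
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            sums.append(sum(combo))
    return sums

def subset_sum_mitm(nums, target):
    # Meet-in-the-middle: roughly O(2^(N/2) * N) time instead of O(2^N).
    mid = len(nums) // 2
    left = half_sums(nums[:mid])
    right = sorted(half_sums(nums[mid:]))
    for s in left:
        need = target - s
        i = bisect_left(right, need)
        if i < len(right) and right[i] == need:
            return True
    return False

print(subset_sum_mitm([3, 34, 4, 12, 5, 2], 9))   # True
print(subset_sum_mitm([3, 34, 4, 12, 5, 2], 30))  # False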
A trained computer scientist can construct an algorithm like yours by means of generating function theory.
Both trained and untrained people can come up with it the way you did.
Do not think only in terms of "is it known?". It should matter to you that you can invent such an algorithm on your own; it means that you can probably invent other important algorithms on your own, and maybe someday one that isn't known yet. Knowing the current progress in the field and what's in the literature helps; otherwise you will keep reinventing the wheel.
Way back in high school I reinvented Dijkstra's algorithm. My version had terrible complexity because I didn't know anything about data structures. Anyway, I am still proud of myself.
If you are still studying, pay attention to generating function theory.
You may also want to check out on wiki:
pseudopolynomial time
weakly NP-complete
strongly NP-complete
generating functions
Megli
What this means is that as N increases, the number of duplicate subsets increases exponentially, and the number of unique, useful subsets increases only linearly.
Not necessarily: the number of duplicate subset sums is also determined by the value of the number closest to zero in the set (the greater the minimum distance to zero, the fewer duplicate subset sums the set has).
In general, we now move to enumerate all subsets in the set.
Unfortunately, enumerating all the sums of the subsets of the set requires performing an exponential number of addition operations (2^7 or 128 in your example). Otherwise, how would the algorithm determine what the unique sums happen to be? So, although the steps that follow the first step could very well have a polynomial running time, the algorithm as a whole has exponential complexity (because an algorithm is only as fast as its slowest part).
Incidentally, the best known algorithm for solving the subset sum problem (Horowitz and Sahni, 1974) has O(2^(N/2)) complexity, which makes it about twice as fast as this algorithm.