neo4j Similarity cosine graphaware - plugins

How do I write a cosine similarity statement using ga.nlp.ml.similarity.cosine for a News node such as this one:
CREATE (n:News)
SET n.text = "Scores of people were already lying dead or injured inside a crowded Orlando nightclub,
and the police had spent hours trying to connect with the gunman and end the situation without further violence.
But when Omar Mateen threatened to set off explosives, the police decided to act, and pushed their way through a
wall to end the bloody standoff.";
What is the proper syntax?

This is the call structure:
CALL ga.nlp.ml.similarity.cosine([<nodes>],depth,Query,Relationship type)
//nodes->The list of annotated nodes for which it will compute the distances.
//depth->Integer. If 0, it will not use imported ConceptNet 5 data when computing the distance. If greater than 0, it will consider concepts during the computation; the value defines how general the concepts it considers may be.
//Query->String. The query used to compute the tag vectors. Some queries are already defined, so this could be null.
//Relationship Type->String. The name to assign to the relationship created between AnnotatedText nodes.
This is an example:
MATCH (a:AnnotatedText)
WITH collect(a) AS list
CALL ga.nlp.ml.similarity.cosine(list, 0, null, "SIMILARITY") YIELD result
RETURN result

CALL ga.nlp.ml.similarity.cosine([<nodes>],depth,Query,Relationship type)
//nodes->Must be annotated nodes
//depth->integer data
//Query->String
//Relationship Type->String
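
For anyone unsure what the procedure is measuring, here is a minimal Python sketch of cosine similarity over tag-count vectors. It is not the plugin's implementation; the function name cosine_similarity and the sample tags are made up for illustration, and the plugin may weight its tag vectors differently.

```python
import math
from collections import Counter


def cosine_similarity(tags_a, tags_b):
    """Cosine of the angle between two sparse tag-count vectors."""
    # Only tags present in both vectors contribute to the dot product.
    dot = sum(count * tags_b.get(tag, 0) for tag, count in tags_a.items())
    norm_a = math.sqrt(sum(c * c for c in tags_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tags_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical tag bags for two annotated texts.
doc_a = Counter("police gunman nightclub police".split())
doc_b = Counter("police standoff nightclub".split())
print(cosine_similarity(doc_a, doc_b))
```

A value near 1 means the two texts share most of their tags in similar proportions; a value of 0 means they share none.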

Related

Scala Nested Iteration within RDD

I have to iterate through all columns to find similarity of 1 column value. For example:
ID,FN,LN,Phone
-----------
1,James,Butt,872-232-1212
2,Josephine,Darakjy, 872-232-1213
3,Art,Venere,872-232-1214
4,Lenna,Paprocki,872-232-1215
5,Donette, Foller,872-232-1216
6,Jmes,Butt,666-232-1212
7,Donette, Foller,888-232-1216
8,Josphne,Darkjy, 555-232-1213
Inside the loop, I will take FN, which is 'James', and see if there is a similar name anywhere in the complete data set using some kind of string distance (e.g. Levenshtein). In this case there is a match with ID#6: 'Jmes'. I will then create a bucket by adding a new GUID column like this:
ID,FN,LN,Phone,GrupId
----------------------
1,James,Butt,872-232-1212,G1
2,Josephine,Darakjy, 872-232-1213,G2
3,Art,Venere,872-232-1214,G3
4,Lenna,Paprocki,872-232-1215,G4
5,Donette, Foller,872-232-1216,G5
6,Jmes,Butt,666-232-1212,G1
7,Donette, Foller,888-232-1216,G5
8,Josphne,Darkjy, 555-232-1213,G2
I have to do the same operation on multiple columns, like LN and Phone, as well. Imagine that I have 1 million records.
Any thoughts, suggestions or links are appreciated. Thank you!
I would definitely not try anything pairwise and would rather think towards coding a per-field Levenshtein-y index and accumulate results on the fly. I’d probably start from a suffix tree -ish one.
Will try to sketch a prototype as soon as I get to the laptop...
Update: after some reading I am leaning towards Affinity Clustering combined with pairwise (yes, I know) Levenshtein cached on a Trie. Code in progress...
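
In the meantime, here is a rough single-machine Python sketch of the bucketing idea for one column. It is not the clustering approach mentioned in the update: it greedily compares each value against one representative per existing group. The function names, the lowercasing/stripping, and the max_distance threshold of 2 are all assumptions for illustration. At a million records you would still want blocking or an index so that each value is compared against only a few candidates.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def assign_groups(values, max_distance=2):
    """Give each value the id of the first group whose representative is close enough."""
    representatives = []  # one normalized value per group, in creation order
    group_ids = []
    for value in values:
        key = value.strip().lower()
        for index, representative in enumerate(representatives):
            if levenshtein(key, representative) <= max_distance:
                group_ids.append(f"G{index + 1}")
                break
        else:
            # No existing group is close enough, so start a new one.
            representatives.append(key)
            group_ids.append(f"G{len(representatives)}")
    return group_ids


first_names = ["James", "Josephine", "Art", "Lenna", "Donette", "Jmes", "Donette", "Josphne"]
print(assign_groups(first_names))
```

On the FN column from the question this reproduces the G1..G5 assignment shown there. The same function can be run per column (LN, Phone) with a threshold chosen for that field.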

Gremlin query to find the count of a label for all the nodes

Sample query
The following query returns the count of vertices with a given label, say "Asset", beneath a particular vertex (id 0):
g.V().hasId(0).repeat(out()).emit().hasLabel('Asset').count()
But I need to find that count for every node in the graph, using the same condition.
I am able to do it for each node individually, but my requirement is to get the count of 'Asset' nodes for all the nodes at once.
So I am expecting something like
{ v[0]: 2,
  v[1]: 1,
  v[2]: 1 }
where v[1] and v[2] each have one node labeled "Asset" beneath them, making the overall count for v[0] equal to 2.
There are a few ways you could do it. It's maybe a little weird, but you could use group():
g.V().
group().
by().
by(repeat(out()).emit().hasLabel('Asset').count())
Or you could do it with select(), which avoids building a big Map in memory:
g.V().as('v').
map(repeat(out()).emit().hasLabel('Asset').count()).as('count').
select('v','count')
If you want to maintain the hierarchy, you could use tree():
g.V(0).
  repeat(out()).emit().
  tree().
    by(project('v','count').
         by().
         by(repeat(out()).emit().hasLabel('Asset').count()).select(values))
Basically you get a tree from vertex 0 and then apply a project() over that to build that structure per vertex in the tree. I had a different way to do it using union but I found a possible bug and had to come up with a different method (actually Gremlin Guru, Daniel Kuppitz, came up with the above approach). I think the use of project is more natural and readable so definitely the better way. Of course as Mr. Kuppitz pointed out, with project you create an unnecessary Map (which you just get rid of with select(values)). The use of union would be better in that sense.
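
If it helps to see what these traversals compute outside of Gremlin, here is a small Python sketch over a plain adjacency map. It is only an illustration: the graph, the "Folder" label, and the function name are made up, and it assumes the graph has no cycles. Like repeat(out()).emit(), it counts once per path, so a vertex reachable by two routes would be counted twice.

```python
def count_labeled_descendants(children, labels, start, wanted="Asset"):
    """Count paths from start down to vertices carrying the wanted label."""
    total = 0
    stack = list(children.get(start, []))
    while stack:
        vertex = stack.pop()
        if labels.get(vertex) == wanted:
            total += 1
        # Keep walking outward, mirroring repeat(out()).emit().
        stack.extend(children.get(vertex, []))
    return total


# Hypothetical graph: v[0] -> v[1], v[2]; each of those has one Asset child.
children = {0: [1, 2], 1: [3], 2: [4]}
labels = {0: "Folder", 1: "Folder", 2: "Folder", 3: "Asset", 4: "Asset"}

counts = {v: count_labeled_descendants(children, labels, v) for v in labels}
print(counts)
```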

How to do recursive calculation in SPSS Modeler

If I want to compute a value that relies on the previous one (Recursive functions) how can I do it in SPSS ? Example:
Q0 = 0
Qn = Q(n-1) + Constant
If by "... the previous one ..." you mean the value of the same field (or a different field) for the previous record, you can use the @OFFSET(FIELD, EXPR) function.
The function allows you to access values from records other than the current one based on a relative reference.
After much research I couldn't find any way to write a recursive function in SPSS Modeler. The only workaround I found is to use the R Transform node within SPSS. HTH.
Depending on what you need to do, you can either chain many Derive nodes or, after sorting the records, refer to the previous value in a column.
I started by setting up a simple stream for the iterations: a CSV source file whose records hold a single field N (ranging from 1 to 100), just to keep the example small. Then I connected this data source to a Derive node that defines the field Q:
if not(@NULL(@OFFSET(N,1))) then @OFFSET(Q,1) + 2 else 0 endif
Here I used the value 2 for the Constant in the example above. I see this as a recursive function, and it relies on @OFFSET just as Kenneth suggested above.
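
For comparison, the same recurrence written out in plain Python looks like this. It is only a sketch of the logic (first record gets 0, every later record gets the previous Q plus the constant), not SPSS code; the function name is made up.

```python
def running_q(record_count, constant):
    """Build Q record by record: Q0 = 0, Qn = Q(n-1) + constant."""
    q = []
    for n in range(record_count):
        if n == 0:
            q.append(0)  # no previous record, so start at 0
        else:
            q.append(q[n - 1] + constant)  # look one record back
    return q


print(running_q(5, 2))
```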

Titan - Cassandra: Process entire set of vertices of a given type without running out of memory

I'm new to Titan and looking for the best way to iterate over the entire set of vertices with a given label without running out of memory. I come from a strong SQL background so I am still working on switching my way of thinking away from SQL-type thinking. Let's say I have 1 million profile vertices. I would like to iterate over each one and perform some type of statistical analysis of the information linked to each profile. I don't really care how long the entire analysis process takes, but I need to iterate over all of the profiles. In SQL I would do SELECT * FROM MY_TABLE, using a scroll-sensitive result, fetch the next result, grab and process the info linked to that row, then fetch the next result. I also don't care if the result is real-time accurate as it is just for gathering general stats, so if a new profile is added during iteration and I miss it, that's ok.
Even if there is a way to grab all the values for a given property, that would probably work too because then I could go through that list and grab each vertex by its ID for example.
I believe Titan does lazy loading, so you should be able to just iterate over the whole graph:
GraphTraversal<Vertex, Vertex> it = graph.traversal().V();
while (it.hasNext()) {
    Vertex v = it.next();
    // Do what you want here
}
Another option would be to use the range step so that you explicitly choose the range of vertices you need. For example:
List<Vertex> vertices = graph.traversal().V().range(0, 3).toList();
//Do what you want with your batch of vertices.
With regard to getting vertices of a specific type, you can query vertices based on their properties. For example, if you have an internal property "TYPE" which defines the type you are interested in, you can query for those vertices with:
graph.traversal().V().has("TYPE", "A"); //Gets vertices of type A
graph.traversal().V().has("TYPE", "B"); //Gets vertices of type B
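
The general pattern behind both suggestions (pull from a lazy stream, never hold everything at once) can be sketched in Python. This is not Titan code: the generator below is a stand-in for a lazy vertex stream, and in_batches is a made-up helper name.

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield lists of at most batch_size items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a lazy stream of profile vertices.
profile_ids = (i for i in range(10))
batch_sizes = [len(batch) for batch in in_batches(profile_ids, 4)]
print(batch_sizes)
```

Only one batch is in memory at a time, so the batch size rather than the total number of profiles bounds memory use.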

Pseudorandom seed methodology for lookup tables

Could someone please suggest a good way of taking a global seed value e.g. "Hello World" and using that value to lookup values in arrays or tables.
I'm sort of thinking like that classic spacefaring game of "Elite" where there were different attributes for the planets but they were not random, simply derived from the seed value for the universe.
I was thinking of running MD5 on the input value, then taking bytes from the hash, casting them to integers and reducing them with mod into acceptable indexes for lookup tables, but I suspect there must be a better way? I read something about Mersenne twisters, but maybe that would be overkill.
I'm hoping for something which will give a good distribution over the values in my lookup tables, e.g. Red, Orange, Yellow, Green, Blue, Purple
Also to emphasize I'm not looking for random values but consistent values each time.
Update: Perhaps I'm having difficulty in expressing my own problem domain. Here is an example of a site which uses generators and can generate X number of values: http://www.seventhsanctum.com
Additional criteria
I would prefer to work from first principles rather than making use of library functions such as System.Random
My approach would be to use your key as a seed for a random number generator
public StarSystem(long systemSeed){
    java.util.Random r = new java.util.Random(systemSeed);
    Color c = colorArray[r.nextInt(colorArray.length)]; // picks a pseudo-random index derived from your seed
    PoliticalSystem politics = politicsArray[r.nextInt(politicsArray.length)];
    ...
}
For a given seed this will produce the same color and the same political system every time.
To get the starting seed from a string you could just use an MD5 sum and grab the first or last 64 bits for your long. The other approach would be to just use a number for each planet. Elite also generated the names for each system using its pseudo-random generator.
for(long seed=1; seed<NUMBER_OF_SYSTEMS; seed++){
starSystems.add(new StarSystem(seed));
}
By setting the seed to a known value, the Random will return the same sequence every time it is run. This is why a good seed is very important when you want good random values. In your case, however, a known seed will produce the results you're looking for.
The C# equivalent is
public StarSystem(int systemSeed){
    System.Random r = new System.Random(systemSeed);
    Color c = colorArray[r.Next(colorArray.Length)]; // picks a pseudo-random index derived from your seed
    PoliticalSystem politics = politicsArray[r.Next(politicsArray.Length)];
    ...
}
Notice a difference? No, nor did I.
Many common random number generators will generate the same sequence given the same seed value, so it seems that all you need to do is convert your name into a number. There are any number of hashing functions that will do that.
Supplementary question: Is it required that all unique strings generate unique hashes and so (probably) unique pseudo-random sequences?
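
Putting the two answers together, a minimal Python sketch might look like the following. It leans on library functions (hashlib and random), so it does not meet the "first principles" preference in the question. The table contents, the function names, and the "seed:index" string format are assumptions for illustration.

```python
import hashlib
import random

COLORS = ["Red", "Orange", "Yellow", "Green", "Blue", "Purple"]
POLITICS = ["Anarchy", "Democracy", "Dictatorship"]  # hypothetical lookup table


def seed_from_text(text):
    """Turn any string into a 64-bit integer seed using the first 8 bytes of its MD5 digest."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def describe_system(universe_seed, system_index):
    """Derive repeatable attributes for one star system from the global seed."""
    rng = random.Random(seed_from_text(f"{universe_seed}:{system_index}"))
    return {"color": rng.choice(COLORS), "politics": rng.choice(POLITICS)}


first = describe_system("Hello World", 1)
again = describe_system("Hello World", 1)
print(first, first == again)
```

Each system gets its own generator seeded from the global seed plus its index, so any system can be recomputed on demand without generating the ones before it.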