TinkerPop Gremlin: Vendor-independent vertex ID - orientdb

Summary
I am writing a Gremlin script to work for both OrientDB and Neo4j.
For example purposes, let's say we want to load a vertex with id 1.
For Neo4j we would write the Gremlin script as
g.V(1) and for OrientDB g.V('#17:0').
How can I write the script so that it runs for both databases?

You can't have a vendor-independent element identifier, as most graph systems don't let you assign the identifier, and neither Neo4j nor OrientDB allows for that. You likely shouldn't be hardcoding identifiers in your code anyway, as I believe they can change out from under you depending on the graph system.
The correct approach would be to rely on indices and prefer to write your traversals as:
g.V().has('myId', 1234)
in which case any graph database could resolve that. If you do work with the native graph identifiers, I suggest you always treat them as variables in your code as in:
Object vid = g.V().has('myId', 1234).id().next()
...
g.V(vid).out().....
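If you control vertex creation, one vendor-neutral pattern is to assign such an application-level identifier yourself when the vertex is written and index it in each backend. A minimal Gremlin-Groovy sketch, where the 'person' label, 'knows' edge label and 'myId' property name are purely illustrative assumptions:
// Store your own identifier in a 'myId' property when creating the vertex.
// ('myId' must be indexed for fast lookups; how you create that index is vendor-specific.)
g.addV('person').property('myId', 1234).property('name', 'alice').iterate()
// The same lookup then resolves the vertex on Neo4j, OrientDB, or any other TinkerPop graph,
// and the native identifier is only ever handled as an opaque variable.
Object vid = g.V().has('myId', 1234).id().next()
g.V(vid).out('knows').values('name')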

Related

How to ensure that only one item is added to JanusGraph

Is there a way that I can ensure that any creation of a vertex in JanusGraph with a given set of properties only results in one such vertex being created?
Right now, what I do is I traverse the graph and ensure that the number of vertices I find with particular properties is only one. For example:
val g = graph.traversal
val vertices = g.V().has("type", givenType).has("name", givenName).toList
if (vertices.size > 1) {
  // the vertex is not unique, cannot add vertex
}
This can be done with the so-called get-or-create traversal, which is described in TinkerPop's Element Existence recipe and in the section Using coalesce to only add a vertex if it does not exist of the Practical Gremlin book.
For your example, this traversal would look like this:
g.V().has("type", givenType).has("name", givenName).
fold().
coalesce(unfold(),
addV("yourVertexLabel").
property("type", givenType).
property("name", givenName))
Note, however, that it depends on the graph provider whether this is an atomic operation. In the case of JanusGraph, the existence check and the conditional vertex addition are executed as two different operations, which can lead to a race condition when two threads execute this traversal at the same time; you can then still end up with two vertices with these properties. So you currently need to ensure that two threads can't execute this traversal for the same properties in parallel, e.g., with locks in your application.
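As a rough single-JVM illustration of that kind of application-level lock (just a sketch; the getOrCreate closure and the lock map are hypothetical names, and deployments with several application instances would still need distributed locking):
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantLock
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.*

// One lock per (type, name) pair so unrelated upserts don't block each other.
def locks = new ConcurrentHashMap<String, ReentrantLock>()

def getOrCreate = { g, givenType, givenName ->
    def lock = locks.computeIfAbsent("${givenType}|${givenName}".toString()) { new ReentrantLock() }
    lock.lock()
    try {
        // Same get-or-create traversal as above, now serialized per property combination.
        return g.V().has("type", givenType).has("name", givenName).
                 fold().
                 coalesce(unfold(),
                          addV("yourVertexLabel").
                            property("type", givenType).
                            property("name", givenName)).
                 next()
    } finally {
        lock.unlock()
    }
}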
If you want more information on this topic in general, I just published a blog post about it: How to Avoid Doppelgängers in a Graph Database. It also describes distributed locking as a way to implement locks for distributed systems and discusses possible improvements to better support upserts in JanusGraph in the future.

Deploy Knowledge Studio dictionary pre-annotator to Natural Language Understanding

I'm getting started with Knowledge Studio and Natural Language Understanding.
I'm able to deploy a machine-learning model to Natural Language Understanding and use the API to query it.
I would like to know if there's a way to deploy only the pre-annotator.
I read from Knowledge Studio's documentation that
You can deploy or export a machine-learning annotator. A dictionary pre-annotator can only be used to pre-annotate documents within Watson Knowledge Studio.
Does a workaround exist to create a model that simply does the job of the pre-annotator, i.e. uses dictionaries to find entities instead of the machine-learning model?
You may need to explain in more detail what you need.
WKS allows you to pre-annotate documents with dictionaries you upload. Once you have created an ML model, you can alternatively use that to annotate your training documents and then manually correct. As you continue, the amount of manual work will decrease with each model iteration.
The assumption is that you are creating a model with a reasonable number of examples. In your model results, you will want the mentions/relations to be outside, or close to outside, the gray area of the report.
The other interpretation of your request is that you want to create a dictionary-based model only. This is possible using the "Rule-Based Model" functionality. You have to create the parsing rules, but you just map what you want to find to the dictionary/rule.
Using this in production, though, is still limited. You should get a warning when you deploy these kinds of models.
It's slightly better than just a keyword search, as you can map items to parts of speech.
One last point: the purpose of WKS is to create a machine-learning model which will do the work of discovering new terms you haven't seen before. With the rule-based engine it can only find what you explicitly tell it to find.
If all you want is just dictionary entries, then you can create a very simple string comparison solution, but you lose the linguistic features.

What is the best way to retrieve information in a graph through the has step

I'm using Titan graph DB with the TinkerPop plugin. What is the best way to retrieve a vertex using the has step?
Assume employeeId is a unique attribute which has a unique vertex-centric index defined.
Is it through the label, i.e.
g.V().has(label,'employee').has('employeeId','emp123')
g.V().has('employee','employeeId','emp123')
or is it better to retrieve the vertex based on the unique property directly, i.e.
g.V().has('employeeId','emp123')
Which of the two is quicker and better?
First, you have two options to create the index:
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).buildCompositeIndex()
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).indexOnly(employee).buildCompositeIndex()
For option 1 it doesn't really matter which query you're going to use. For option 2 it's mandatory to use g.V().has('employee','employeeId','emp123').
Note that g.V().hasLabel('employee').has('employeeId','emp123') will NOT select all employees first. Titan is smart enough to apply the filter conditions that can leverage an index first.
One more thing I want to point out: the whole point of indexOnly() is to allow sharing properties between different types of vertices. So instead of calling the property employeeId, you could call it uuid and also use it for employers, companies, etc.:
mgmt.buildIndex('employeeById', Vertex.class).addKey(uuid).indexOnly(employee).buildCompositeIndex()
mgmt.buildIndex('employerById', Vertex.class).addKey(uuid).indexOnly(employer).buildCompositeIndex()
mgmt.buildIndex('companyById', Vertex.class).addKey(uuid).indexOnly(company).buildCompositeIndex()
Your queries will then always have this pattern: g.V().has('<label>','<prop-key>','<prop-value>'). This is in fact the only way to go in DSE Graph, since we got completely rid of global indexes that span across all types of vertices. At first I really didn't like this decision, but by now I have to agree that it is so much cleaner.
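Concretely, with the three indexes above every lookup follows that same pattern (the values are made up):
g.V().has('employee', 'uuid', 'emp123')
g.V().has('employer', 'uuid', 'empr42')
g.V().has('company',  'uuid', 'comp7')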
The second option, g.V().has('employeeId','emp123'), is better as long as the property employeeId has been indexed for performance.
This is because each step in a Gremlin traversal acts as a filter. So when you say:
g.V().has(label,'employee').has('employeeId','emp123')
You first go to all the vertices with the label employee and then from the employee vertices you find emp123.
With g.V().has('employeeId','emp123') a composite index allows you to go directly to the correct vertex.
Edit:
As Daniel has pointed out in his answer, Titan is actually smart enough not to visit all employees and leverages the index immediately. So in this case there appears to be little difference between the traversals. I personally favour using direct global indices without labels (i.e. g.V().has('employeeId','emp123')), but that is just a preference when using Titan; I like to keep steps and filters to a minimum.

edges.create() function causes duplication if called multiple times

I am using TitanGraphDB + Cassandra. I am starting Titan as follows:
cd titan-cassandra-0.3.1
bin/titan.sh config/titan-server-rexster.xml config/titan-server-cassandra.properties
I have a Rexster shell that I can use to communicate with Titan + Cassandra above.
cd rexster-console-2.3.0
bin/rexster-console.sh
I am attempting to model a network topology using Titan Graph DB. I want to program the Titan Graph DB from my Python program. I am using the bulbs package for that.
I create three types of vertices
- switch
- port
- device
I use the following functions to create unique vertices if they don't already exist.
self.g.vertices.index.get_unique( "dpid", dpid_str)
self.g.vertices.index.get_unique( "port_id", port_id_str)
self.g.vertices.index.get_unique( "dl_addr", dl_addr_str)
I create edges between related vertices as follows.
self.g.edges.create(switch_vertex,"out",port_vertex)
However, if this function is called twice, it creates a duplicate of the edge already present. Is there a function analogous to get_or_create() for edges so that I can avoid duplication?
In general, graphs permit duplicate edges between vertices because the definition of a duplicate edge is ambiguous and application specific.
For example, is the edge a duplicate based on its label, direction, or some combination of properties?
However, Titan 0.5 introduced a Multiplicity.SIMPLE constraint that enables you to define unique edges between a pair of vertices.
See Matthias's Titan 0.5 announcement:
https://groups.google.com/d/topic/aureliusgraphs/cNb4fKoe95M/discussion
This new feature is not yet documented, but the Titan team is in the process of updating the docs for Titan 0.5 so it will be documented soon.
Watch the Type Definition Overview page for updates:
https://github.com/thinkaurelius/titan/wiki/Type-Definition-Overview
Also see the section on cardinality constraints:
https://github.com/thinkaurelius/titan/wiki/Type-Definition-Overview#cardinality-constraints
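For reference, defining the edge label from the question with that constraint would look roughly like this in the Titan 0.5 management API (a sketch; treat the exact method names as assumptions until the docs are updated):
mgmt = g.getManagementSystem()
mgmt.makeEdgeLabel('out').multiplicity(Multiplicity.SIMPLE).make()  // at most one such edge between any pair of vertices
mgmt.commit()
With that in place, a second edges.create(switch_vertex, "out", port_vertex) call for the same pair of vertices should fail rather than silently create a duplicate.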

Running out of NHibernate HiLo-ids resulting in negative numbers

We're running an ASP.NET database application which uses HiLo to generate ids for entities. On top of this application, we have several websites using the same database. What we're seeing is that we run out of ids and the ID-column becomes a negative number.
We suspect this has something to do with the generator, as multiple websites run on top of the same codebase and database, and the HiLo algorithm probably quickly starts generating ids which are outside of the bigint range (with "quickly" being relative, of course).
Is it possible to configure the generator in such a way that it also uses the gaps (of which there are quite a few) in the Id-sequences, instead of bluntly increasing the value whenever it feels that's necessary?
Would that be a solution? Or should we be doing something else altogether?
What is your max_lo set to?
The formula to generate an id is as follows:
h = high sequence (starting at 0)
l_size = size of the low block
l = low sequence (starting at 1)
ID = h * l_size + l
Maybe your max_lo is set too high?
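To make the formula concrete, here is a small worked example with purely illustrative numbers:
def l_size = 1000   // size of the low block, which is what max_lo controls
def h = 5           // sixth hi value handed out by the database (hi starts at 0)
def l = 1           // low sequence restarts at 1 for each new hi block
assert h * l_size + l == 5001
// Each session factory start-up fetches a fresh hi value and the in-memory low counter is lost,
// so with several websites/app pools restarting, up to l_size ids are skipped per restart even if
// few rows were inserted; that explains both the gaps and why a very large max_lo can burn through
// the id range far faster than the actual row count would suggest.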
You can switch to the Guid.Comb generator if this is possible, or use int64 for ids. Take a look here before making a final decision regarding which generator to use.
I've come across the same problem and also haven't been able to find a suitable answer.
We also have a site which is being run as separate websites, with each site in its own application pool, all on the same webserver.
Pragmatically, you'd be better off just switching to Identity mapping, if your database supports it. It shouldn't be too hard to do: you should be able to modify your database schema with a bit of T-SQL and the ID mappings with a bit of search/replace.
Do you have a concept similar to a UoW (unit of work) in your application? A downside to identity generation is that it will break the UoW (early inserts in order to get the identifier). It might be a price worth paying, though.
In my case the system could easily exist as a single site/app pool (it's multi-tenant on a single database, with a single shared connection string, and is designed to run as a single instance on a webserver), so I'm going to test that before I make the jump to database identities.