edges.create() function causes duplication if called multiple times - Titan

I am using Titan Graph DB + Cassandra. I am starting Titan as follows:
cd titan-cassandra-0.3.1
bin/titan.sh config/titan-server-rexster.xml config/titan-server-cassandra.properties
I have a Rexster shell that I can use to communicate with the Titan + Cassandra setup above.
cd rexster-console-2.3.0
bin/rexster-console.sh
I am attempting to model a network topology using Titan Graph DB. I want to program the Titan Graph DB from my Python program, and I am using the bulbs package for that.
I create three types of vertices
- switch
- port
- device
I use the following functions to look up vertices so that I only create them if they don't already exist.
self.g.vertices.index.get_unique( "dpid", dpid_str)
self.g.vertices.index.get_unique( "port_id", port_id_str)
self.g.vertices.index.get_unique( "dl_addr", dl_addr_str)
I create edges between related vertices as follows.
self.g.edges.create(switch_vertex,"out",port_vertex)
However, if this function is called twice it creates a duplicate of the edge already present. Is there a function analogous to get_or_create() for edges so that I can avoid duplication?

In general, graphs permit duplicate edges between vertices because the definition of a duplicate edge is ambiguous and application specific.
For example, is the edge a duplicate based on its label, direction, or some combination of properties?
However, Titan 0.5 introduced a Multiplicity.SIMPLE constraint that enables you to define unique edges between a pair of vertices.
See Matthias's Titan 0.5 announcement:
https://groups.google.com/d/topic/aureliusgraphs/cNb4fKoe95M/discussion
This new feature is not yet documented, but the Titan team is in the process of updating the docs for Titan 0.5 so it will be documented soon.
Watch the Type Definition Overview page for updates:
https://github.com/thinkaurelius/titan/wiki/Type-Definition-Overview
Also see the section on cardinality constraints:
https://github.com/thinkaurelius/titan/wiki/Type-Definition-Overview#cardinality-constraints
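Until such a constraint is available, a common workaround is an application-side check before creating the edge. A minimal sketch of that idempotent idiom in plain Python (an in-memory stand-in, not the bulbs API):

```python
# Idempotent edge creation: remember which (out_id, label, in_id) triples
# already exist and only create the edge when the triple is unseen. This
# mirrors the "look up, then create" pattern you would run against the graph.

class EdgeStore:
    def __init__(self):
        self.edges = []     # list of (out_id, label, in_id) triples
        self._seen = set()  # fast membership test

    def get_or_create_edge(self, out_id, label, in_id):
        key = (out_id, label, in_id)
        if key in self._seen:
            return False    # edge already present, nothing created
        self._seen.add(key)
        self.edges.append(key)
        return True         # edge was created

store = EdgeStore()
store.get_or_create_edge("switch1", "out", "port1")
store.get_or_create_edge("switch1", "out", "port1")  # duplicate, ignored
print(len(store.edges))  # 1
```

Against a real graph, the membership test would instead be an index or traversal lookup on the two endpoint vertices.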

Related

How to ensure that only one item is added to JanusGraph

Is there a way that I can ensure that any creation of a vertex in janusgraph with a given set of properties only results in one such vertex being created?
Right now, what I do is I traverse the graph and ensure that the number of vertices I find with particular properties is only one. For example:
val g = graph.traversal
val vertices = g.V().has("type", givenType).has("name", givenName).toList
if (vertices.size > 1) {
// the vertex is not unique, cannot add vertex
}
This can be done with the so-called get-or-create traversal, which is described in TinkerPop's Element Existence recipe and in the section Using coalesce to only add a vertex if it does not exist of the Practical Gremlin book.
For your example, this traversal would look like this:
g.V().has("type", givenType).has("name", givenName).
  fold().
  coalesce(unfold(),
           addV("yourVertexLabel").
             property("type", givenType).
             property("name", givenName))
Note, however, that it depends on the graph provider whether this is an atomic operation. In the case of JanusGraph, the existence check and the conditional vertex addition are executed as two different operations, which can lead to a race condition when two threads execute this traversal at the same time: you can still end up with two vertices with these properties. So you currently need to ensure that two threads can't execute this traversal for the same properties in parallel, e.g., with locks in your application.
I just published a blog post about exactly this topic: How to Avoid Doppelgängers in a Graph Database if you want to get more information about this topic in general. It also describes distributed locking as a way to implement locks for distributed systems and discusses possible improvements to better support upserts in JanusGraph in the future.
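The check-then-create race and the locking workaround described above can be sketched in plain Python (an in-memory stand-in for the graph, not JanusGraph itself):

```python
import threading

class VertexStore:
    """In-memory stand-in: get-or-create guarded by an application lock."""

    def __init__(self):
        self.vertices = []
        self._lock = threading.Lock()

    def get_or_create(self, given_type, given_name):
        # Without the lock, two threads could both pass the existence
        # check and both append -- the race condition described above.
        with self._lock:
            for v in self.vertices:
                if v == (given_type, given_name):
                    return v
            v = (given_type, given_name)
            self.vertices.append(v)
            return v

store = VertexStore()
threads = [threading.Thread(target=store.get_or_create,
                            args=("person", "alice"))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(store.vertices))  # 1, despite 8 concurrent upserts
```

In a distributed setup, the in-process lock would have to be replaced with a distributed lock keyed on the uniqueness-defining properties.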

TinkerPop Gremlin: Vendor independent Vertex ID

Summary
I am writing a Gremlin script that should work for both OrientDB and Neo4j.
As an example, let's say we want to load the vertex with id 1.
For Neo4j we would write the Gremlin as g.V(1), and for OrientDB as g.V('#17:0').
How can I write the script so that it runs on both databases?
You can't have a vendor-independent element identifier, as most graph systems don't let you assign the identifier, and neither Neo4j nor OrientDB allows for that. You likely shouldn't be hardcoding identifiers in your code anyway, as I believe they can change out from under you depending on the graph system.
The correct approach would be to rely on indices and prefer to write your traversals as:
g.V().has('myId', 1234)
in which case any graph database could resolve that. If you do work with the native graph identifiers, I suggest you always treat them as variables in your code as in:
Object vid = g.V().has('myId', 1234).id().next()
...
g.V(vid).out().....
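One way to follow this advice is to resolve the application-level id to the native id once and cache it for later traversals. A small Python sketch (the lookup callables are hypothetical stand-ins for the g.V().has(...).id().next() call):

```python
# Cache the mapping from application-level ids to native graph ids, so
# native identifiers are always treated as runtime values, never hardcoded.

class IdResolver:
    def __init__(self, lookup):
        self._lookup = lookup  # callable: application id -> native id
        self._cache = {}

    def native_id(self, app_id):
        if app_id not in self._cache:
            self._cache[app_id] = self._lookup(app_id)
        return self._cache[app_id]

# Hypothetical backends returning vendor-specific identifier shapes:
neo4j = IdResolver(lambda app_id: 1)            # numeric ids
orientdb = IdResolver(lambda app_id: "#17:0")   # RID strings

print(neo4j.native_id(1234))     # 1
print(orientdb.native_id(1234))  # #17:0
```

The application code only ever sees its own 'myId' values; the vendor-specific id stays an opaque variable.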

Meaning of the spatialite scheme generated by the spatialite_osm_map tool

I used the spatialite_osm_map tool to generate a spatialite database from an .osm.pbf file. After the process was finished, a series of tables were generated in the database as shown in the image.
I noticed that there were 3 groups of tables based on the prefixes of their names: ln_, pg_ and pt_. I also noticed that the rest of the name corresponded to a key defined in OpenStreetMap.
Can someone explain to me how the information is distributed in each of these groups and tables? I've searched for a site that explains the resulting schema after the conversion, but I've only found information on how to use the tool.
I think you have already identified the key points of this scheme.
Its main purpose is to offer the OSM data in a way that is more direct and intuitive for a GIS user. The data is split according to OSM tags (aerialway, aeroway, amenity, etc.; you can change the list of tags to be used if you don't need all of them) and according to the type of geometry (pt_* for points, ln_* for lines, and pg_* for polygons). These tables (which can be seen directly as "layers" by a GIS user) can quickly be styled with simple rules thanks to this simple schema, for example in a desktop GIS application such as QGIS: green for pg_natural, blue for ln_waterway and pg_waterway, or just clicking on the "pg_building" layer to toggle its visibility. This schema doesn't preserve all the objects from the OSM database, only those needed to build the tables for the requested tags.
Contrary to the original way of storing OSM objects, with this kind of extraction you lose the relationships between objects. For example, in OSM the same node can be used both as part of the relation describing an administrative boundary and as part of a road; here you will get a road line in ln_road and a polygon in pg_boundary, but you lose the information that they may have partially shared nodes. Notably because of this last point, the size of the OSM extractions can be relatively high compared to the original file.
So I guess that this kind of scheme (which is one among other existing ways to transform OSM data) offers an interesting abstraction for those who are not accustomed to the OSM schema with its Node, Way and Relation elements (e.g., in OSM, buildings can be represented as a closed way or as a relation; here you simply get polygons for these various buildings).
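The naming scheme described above is simple enough to reproduce mechanically. A tiny Python sketch of the (geometry type, OSM tag) to table-name mapping (the prefix table mirrors the description above, not the tool's source code):

```python
# spatialite_osm_map naming scheme: one table per (geometry type, OSM tag)
# pair, with the geometry type encoded as a prefix.

PREFIX = {"point": "pt_", "line": "ln_", "polygon": "pg_"}

def table_name(geometry, osm_tag):
    """Return the table name for a geometry type and an OSM tag key."""
    return PREFIX[geometry] + osm_tag

print(table_name("polygon", "building"))  # pg_building
print(table_name("line", "waterway"))     # ln_waterway
print(table_name("point", "amenity"))     # pt_amenity
```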

Optimizing a Prefix Tree in OrientDB

In my project, I have a fairly large prefix tree, potentially containing millions of nodes (about 250K nodes in my development instance), managed in OrientDB (pointing to other vertices in my graph).
The nodes of the prefix tree are represented by a Token vertex type. Each Token has a 'key' property and is connected to its child vertices by a 'child' edge type. So, a sequence like "hello world" would be represented as:
root -child-> "hello" -child-> "world"
Currently, I have a NOTUNIQUE_HASH_INDEX on Token.key and I am querying the data structure like this:
SELECT EXPAND(OUT('child')[key=:k]) FROM :p
where k is the child key I am looking for and p is the RID of the parent node.
Generally, performance is pretty good, but I am looking for ideas on improving the query, the indexing, or both for this use case. In particular, queries starting at the root node, which has many children, take noticeably longer than the other, less-connected nodes.
Any suggestions? Thanks in advance!
Luigi Dell'Aquila from the OrientDB team provided an excellent answer on the OrientDB Google Group. To summarize, the following query (suggested by Luigi) dramatically improved performance.
SELECT FROM Token where key = :k AND in('Child') contains :p
I just ran a realistic test and query time was reduced by 97%! See https://groups.google.com/forum/#!topic/orient-database/mUkz6Z7hSwk for more details.
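The Token/child structure described above is essentially a trie. For comparison, here is the same lookup sketched in plain Python, where each parent indexes its children by key (an illustration of the access pattern, not OrientDB code):

```python
# A trie node: each Token indexes its children by key, so looking up a
# child of a known parent is a dictionary access rather than a scan of
# all 'child' edges.

class Token:
    def __init__(self, key):
        self.key = key
        self.children = {}  # key -> Token (the 'child' edges)

    def child(self, key):
        # Analogue of: SELECT EXPAND(OUT('child')[key=:k]) FROM :p
        return self.children.get(key)

    def add_child(self, key):
        return self.children.setdefault(key, Token(key))

root = Token("")
root.add_child("hello").add_child("world")  # root -child-> hello -child-> world
node = root.child("hello")
print(node.key)                 # hello
print(node.child("world").key)  # world
```

The per-parent index is why the improved query helps most at the heavily connected root: it narrows by key first instead of expanding all children of the parent.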

What is the best way to retrieve information in a graph through the has() step

I'm using Titan graph DB with the TinkerPop plugin. What is the best way to retrieve a vertex using the has() step?
Assuming employeeId is a unique attribute which has a unique vertex centric index defined.
Is it through the label, i.e.
g.V().has(label,'employee').has('employeeId','emp123')
g.V().has('employee','employeeId','emp123')
or is it better to retrieve the vertex based on the unique property directly, i.e.
g.V().has('employeeId','emp123')
Which of the two approaches is the quicker and better way?
First, you have two options to create the index:
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).buildCompositeIndex()
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).indexOnly(employee).buildCompositeIndex()
For option 1 it doesn't really matter which query you're going to use. For option 2 it's mandatory to use g.V().has('employee','employeeId','emp123').
Note that g.V().hasLabel('employee').has('employeeId','emp123') will NOT select all employees first. Titan is smart enough to apply the filter conditions that can leverage an index first.
One more thing I want to point out: the whole point of indexOnly() is to allow sharing properties between different types of vertices. So instead of calling the property employeeId, you could call it uuid and also use it for employers, companies, etc.:
mgmt.buildIndex('employeeById', Vertex.class).addKey(uuid).indexOnly(employee).buildCompositeIndex()
mgmt.buildIndex('employerById', Vertex.class).addKey(uuid).indexOnly(employer).buildCompositeIndex()
mgmt.buildIndex('companyById', Vertex.class).addKey(uuid).indexOnly(company).buildCompositeIndex()
Your queries will then always have this pattern: g.V().has('<label>','<prop-key>','<prop-value>'). This is in fact the only way to go in DSE Graph, since we got completely rid of global indexes that span across all types of vertices. At first I really didn't like this decision, but meanwhile I have to agree that this is so much cleaner.
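The effect of a label-scoped (indexOnly) composite index can be sketched with a plain dictionary keyed by (label, property value); this is only an illustration of the lookup shape, not Titan's internals:

```python
# Label-scoped composite index: (label, key value) -> vertex, so different
# vertex types can share the same property key without colliding.

scoped_index = {}  # (label, uuid) -> vertex

def add_vertex(label, uuid, name):
    v = {"label": label, "uuid": uuid, "name": name}
    scoped_index[(label, uuid)] = v
    return v

add_vertex("employee", "id-1", "Ada")
add_vertex("company", "id-1", "Acme")  # same uuid, different label: no clash

# g.V().has('employee','uuid','id-1') corresponds to one keyed lookup:
print(scoped_index[("employee", "id-1")]["name"])  # Ada
```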
The second option g.V().has('employeeId','emp123') is better as long as the property employeeId has been indexed for better performance.
This is because each step in a Gremlin traversal acts as a filter. So when you say:
g.V().has(label,'employee').has('employeeId','emp123')
You first go to all the vertices with the label employee and then from the employee vertices you find emp123.
With g.V().has('employeeId','emp123') a composite index allows you to go directly to the correct vertex.
Edit:
As Daniel has pointed out in his answer, Titan is actually smart enough not to visit all employees and leverages the index immediately. So in this case there appears to be little difference between the traversals. I personally favour using direct global indices without labels (i.e. the traversal without the label step), but that is just a preference when using Titan; I like to keep steps and filters to a minimum.