What is the best way to retrieve information in a graph through the has step - Titan

I'm using the Titan graph database with the TinkerPop plugin. What is the best way to retrieve a vertex using the has step?
Assume employeeId is a unique attribute which has a unique vertex-centric index defined.
Is it through the label, i.e.
g.V().has(label,'employee').has('employeeId','emp123')
g.V().has('employee','employeeId','emp123')
(or)
is it better to retrieve a vertex based on unique properties directly, i.e.
g.V().has('employeeId','emp123')
Which of the two is the quicker and better way?

First you have 2 options to create the index:
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).buildCompositeIndex()
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).indexOnly(employee).buildCompositeIndex()
For option 1 it doesn't really matter which query you're going to use. For option 2 it's mandatory to use g.V().has('employee','employeeId','emp123').
Note that g.V().hasLabel('employee').has('employeeId','emp123') will NOT select all employees first. Titan is smart enough to first apply the filter conditions that can leverage an index.
One more thing I want to point out: the whole point of indexOnly() is to allow properties to be shared between different types of vertices. So instead of calling the property employeeId, you could call it uuid and also use it for employers, companies, etc.:
mgmt.buildIndex('employeeById', Vertex.class).addKey(uuid).indexOnly(employee).buildCompositeIndex()
mgmt.buildIndex('employerById', Vertex.class).addKey(uuid).indexOnly(employer).buildCompositeIndex()
mgmt.buildIndex('companyById', Vertex.class).addKey(uuid).indexOnly(company).buildCompositeIndex()
Your queries will then always have this pattern: g.V().has('<label>','<prop-key>','<prop-value>'). This is in fact the only way to go in DSE Graph, since we completely got rid of global indexes that span all types of vertices. At first I really didn't like this decision, but by now I have to agree that it is so much cleaner.

The second option, g.V().has('employeeId','emp123'), is better, as long as the property employeeId has been indexed.
This is because each step in a Gremlin traversal acts as a filter. So when you say:
g.V().has(label,'employee').has('employeeId','emp123')
You first go to all the vertices with the label employee and then from the employee vertices you find emp123.
With g.V().has('employeeId','emp123') a composite index allows you to go directly to the correct vertex.
Edit:
As Daniel has pointed out in his answer, Titan is actually smart enough not to visit all employees and leverages the index immediately. So in this case it appears there is little difference between the traversals. I personally favour using direct global indices without labels (i.e. g.V().has('employeeId','emp123')), but that is just a preference when using Titan; I like to keep steps and filters to a minimum.

Database schema for a tinder like app

I have a database of millions of objects. Every day I will present 3 selected objects to my users, and like with Tinder they can swipe left to say they don't like an object or swipe right to say they like it.
I select the objects based on their location (those closest to the user are selected first) and also based on a few user settings.
I'm using MongoDB.
Now the problem: how do I design the database so that it can quickly provide a daily selection of objects to show to the end user (skipping all the objects they have already swiped)?
Well, considering you have made your choice of using MongoDB, you will have to maintain multiple collections. One is your main collection, and you will have to maintain user specific collections which hold user data, say the document ids the user has swiped. Then, when you want to fetch data, you might want to do a setDifference aggregation. SetDifference does this:
Takes two sets and returns an array containing the elements that only
exist in the first set; i.e. performs a relative complement of the
second set relative to the first.
Now how performant this is would depend on the size of your sets and the overall scale.
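As a rough illustration of what the set difference computes, here is a minimal Python sketch. The ids and the function name are made up for the example; in MongoDB the same operation would run inside an aggregation pipeline rather than in application code.

```python
# Hypothetical ids: candidates near the user (already ranked by distance)
# and the ids this user has already swiped.
candidate_ids = [101, 102, 103, 104, 105]
swiped_ids = {102, 104}


def unseen(candidates, swiped, limit=3):
    """Return up to `limit` candidates that are not in `swiped`, keeping order."""
    result = []
    for object_id in candidates:
        if object_id not in swiped:
            result.append(object_id)
            if len(result) == limit:
                break
    return result


todays_selection = unseen(candidate_ids, swiped_ids)
```

Keeping the swiped ids in a set makes each membership check cheap, but the whole swiped set still has to be loaded per user, which is where the scaling concern comes from.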
EDIT
I agree with your comment that this is not a scalable solution.
Solution 2:
One solution I could think of is to use a graph-based solution, like Neo4j. You could represent all your 1M objects and all your user objects as nodes and have relationships between users and the objects they have swiped. Your query would be to return a list of all objects the user is not connected to.
You cannot shard a graph, which brings up scaling challenges. Graph based solutions require that the entire graph be in memory. So the feasibility of this solution depends on you.
Solution 3:
Use MySQL. Have 2 tables, one being the objects table and the other being the (uid, viewed_object) mapping. A join would solve your problem. Joins work well for a long time, until you hit a certain scale. So I don't think it is a bad starting point.
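A minimal sketch of that join, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table and column names (objects, viewed, uid, object_id) are assumptions for illustration, not taken from the question; the LEFT JOIN ... IS NULL pattern is written the same way in MySQL.

```python
import sqlite3

# In-memory database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE viewed (uid INTEGER, object_id INTEGER,
                         PRIMARY KEY (uid, object_id));
    INSERT INTO objects VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    INSERT INTO viewed VALUES (7, 1), (7, 3), (8, 2);
""")


def unseen_objects(conn, uid, limit=3):
    """Anti-join: objects that have no row in `viewed` for this user."""
    rows = conn.execute(
        """
        SELECT o.id
        FROM objects AS o
        LEFT JOIN viewed AS v
               ON v.object_id = o.id AND v.uid = ?
        WHERE v.object_id IS NULL
        ORDER BY o.id
        LIMIT ?
        """,
        (uid, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

The composite primary key on (uid, object_id) doubles as the index the join needs; in a real schema the ORDER BY would be replaced by whatever distance ranking is used.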
Solution 4:
Use Bloom filters. Your problem eventually boils down to a set membership problem: given a set of ids, check whether each one is part of another set. A Bloom filter is a probabilistic data structure that answers set membership queries. Bloom filters are very small and very efficient. They are probabilistic, though: false negatives never happen, but false positives can, so that's the trade-off. Check out this post for how it's used: http://blog.vawter.com/2016/03/17/Using-Bloomfilters-to-Avoid-Repetition/
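To make the idea concrete, here is a minimal Bloom filter sketch in Python. It is a toy under stated assumptions, not a production implementation: the bit-array size, the number of hashes and the salted SHA-256 hashing scheme are arbitrary choices made for the example, and the object ids are invented.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: k hash positions over a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions by salting one hash function with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# One filter per user, holding the ids that user has already swiped.
seen = BloomFilter()
for object_id in ("obj-1", "obj-2", "obj-3"):
    seen.add(object_id)
```

For this use case a false positive only means that an object the user never swiped is occasionally skipped, which is usually acceptable; an already-swiped object is never shown again.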
I'll update the answer if I can think of something else.

Gremlin: What's an efficient way of finding an edge between two vertices?

So obviously, a straight forward way to find an edge between two vertices is to:
graph.traversal().V(outVertex).bothE(edgeLabel).filter(__.otherV().is(inVertex))
I feel that the filter step will have to iterate through all edges, making it really slow for some applications with a lot of edges.
Another way could be:
traversal = graph.traversal()
.V(outVertex)
.bothE(edgeLabel)
.as("x")
.otherV()
.is(inVertex) // uses index?
.select("x");
I'm assuming the second approach could be much faster since it will use the ID index, which would make it faster than the first approach.
Which one is faster and more efficient (in terms of IO)?
I'm using Titan, so you could also make your answer Titan specific.
Edit
In terms of time, it seems like the first approach is faster (vertex b had 20k edges):
gremlin> clock(100000){g.V(b).bothE().filter(otherV().is(a))}
==>0.0016451789999999999
gremlin> clock(100000){g.V(b).bothE().as("x").otherV().is(a).select("x")}
==>0.0018231140399999999
How about IO?
I would expect the first query to be faster. However, a few things:
None of the queries is optimal, since both of them enable path calculations. If you need to find a connection in both directions, then use 2 queries (I will give an example below)
When you use clock(), be sure to iterate() your traversals, otherwise you'll only measure how long it takes to do nothing.
These are the queries I would use to find an edge in both directions:
g.V(a).outE(edgeLabel).filter(inV().is(b))
g.V(b).outE(edgeLabel).filter(inV().is(a))
If you expect to get at most one edge:
edge = g.V(a).outE(edgeLabel).filter(inV().is(b)).tryNext().orElseGet {
g.V(b).outE(edgeLabel).filter(inV().is(a)).tryNext()
}
This way you get rid of path calculations. How those queries perform will pretty much depend on the underlying graph database. Titan's query optimizer recognizes that query pattern and should return a result in almost no time.
Now if you want to measure the runtime, do this:
clock(100) {
g.V(a).outE(edgeLabel).filter(inV().is(b)).iterate()
g.V(b).outE(edgeLabel).filter(inV().is(a)).iterate()
}
In case one does not know the vertex IDs, another solution might be
g.V().has('propertykey','value1').outE('thatlabel').as('e').inV().has('propertykey','value2').select('e')
This is also only unidirectional so one needs to reformulate the query for the opposite direction.

Storing edges in OrientDb in ordered way

I am working with the OrientDB graph API using the Java API and Gremlin Pipeline. I wanted to know: is there a way to specify the storage order for edges based on an attribute? I know we can create a custom edge type and define an index on the attribute by which we want to retrieve.
I also had a look at the tutorial on the OrientDB website:
http://orientdb.com/docs/last/Graph-Database-Tinkerpop.html#ordered-edges
There they do mention that edges can be retrieved in an ordered way, but they don't mention how the order is determined. So I would like to know:
What is the default storage order? And will fetching in this order give me edges in LIFO order?
How can we store edges in a custom order, i.e. in the order in which we want them to be fetched?
The underlying type used is a List, so the order is the insertion order. To change it, get the edge list, work on it and then call vertex.save(), where vertex is cast to OrientVertex.

Complex URL handling conception

I'm currently struggling with a complex URL handling concept. The application has a product property database table/collection with all the different product types (i.e. categories, colors, manufacturers, materials, etc.).
{_id:1,alias:"mercedes-benz",type:"brand"},
{_id:2,alias:"suv-cars",type:"category"},
{_id:3,alias:"cars",type:"category"},
{_id:4,alias:"toyota",type:"manufacturer"},
{_id:5,alias:"red",type:"color"},
{_id:6,alias:"yellow",type:"color"},
{_id:7,alias:"bmw",type:"manufacturer"},
{_id:8,alias:"leather",type:"material"}
...
Now the mission is to handle URL requests in the style below in every(!) possible order to retrieve the included product properties. The only allowed separator character is the dash (a settled SEO requirement). Some properties can also include dashes themselves, which I think is an important point, e.g. the category "suv-cars" or the manufacturer "mercedes-benz":
http://www.example.com/{category}-{color}-{manufacturer}-{material}
http://www.example.com/{color}-{manufacturer}
http://www.example.com/{color}-{category}-{material}-{manufacturer}
http://www.example.com/{category}-{color}-nonexistingproperty-{manufacturer}
http://www.example.com/{color}-{category}-{manufacturer}
http://www.example.com/{manufacturer}
http://www.example.com/{manufacturer}-{category}-{color}-{material}
http://www.example.com/{category}
http://www.example.com/{manufacturer}-nonexistingproperty-{category}-{color}-{material}
http://www.example.com/{color}-crap-{manufacturer}
...
...so every order of the properties should be allowed! The result has to be the information about the properties used in each URL request (BTW yes, the duplicate content will be fixed by redirects and a predefined schema). The "nonexistingproperty"/"crap" tokens are possible and should simply be ignored.
UPDATE:
Idea 1: One way I'm thinking about it is to split the query string by dashes and analyze the values one by one. The problem: with properties made of two, three or more words there are too many combinations and variations, which means a lot of queries, and I think that kills this idea.
Idea 2: The other way is to build a (in my opinion too large) alias/URL table with all of the different combinations, but I think that's just an ugly workaround. There are about 15,000 different properties, so the number of aliases in the different sort orders kills this idea.
Idea 3: It's your turn! Thanks for your thoughts and your time.
While your question is a bit broad, below are some ideas. There isn't a single awesome answer unless you find a free or commercial engine for this that works exactly the way you want.
The way I thought about your problem was to consider the URL as a list of keywords.
use Lucene as a keyword/tag system. It's good at the types of searches you suggest you want, including phrases, stems, etc.
store and index the data in the DB of your choice, but pull the keywords into memory and build a bit index of all keywords vs items. Iterate through the keyword table producing weighted results. If the order of keywords matters, you'll also need to make a pass through the result set to weight based on word order. These types of searches always need to cap their result set quickly in order to return results quickly.
cache the results like crazy from working matches, and give precedence to results that users seem to click on the most for a given URL.
attack the database by using tag indexes in MongoDB. You'd still need to merge and weight results. Very intensive and not likely a good use of DB resources.
read some of the academic papers on keyword searches. It's a popular topic.
build a table of words that have dashes in them, and normalize/convert those before running your queries
always check for full exact matches first
The only way this may work is if you restrict all property values to be unique. So you make a single set of categories + colors + manufacturers, etc., in which all values have to be unique. This will allow you to find which property a value belongs to.
The data structure for this should be fairly simple:
{_id:ValueOfTheProperty, Property:TypeOfProperty}
Here are some possible samples:
{ _id: Red, Property: Color }
{ _id: Green, Property: Color }
{ _id: Boots, Property: Category }
{ _id: Shoes, Property: Category }
...
This way, the order does not matter, and you are able to convert them in a single pass to a map:
{ Color: Red, Category: Boots }
Though, I predict some problems with ambiguous names here.
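To illustrate the single-pass conversion, here is a minimal Python sketch. It assumes every alias is unique across all property types and that the alias table fits in memory. It uses a greedy longest-match over the dash-separated tokens, so that aliases containing dashes (such as suv-cars) are matched before their shorter parts. The function name parse_slug is hypothetical; the sample aliases are the ones from the question.

```python
# Alias -> property type, taken from the question's sample data.
ALIASES = {
    "mercedes-benz": "brand",
    "suv-cars": "category",
    "cars": "category",
    "toyota": "manufacturer",
    "red": "color",
    "yellow": "color",
    "bmw": "manufacturer",
    "leather": "material",
}


def parse_slug(slug, aliases=ALIASES):
    """Map a dash-separated URL slug to {type: alias}, ignoring unknown tokens."""
    # Longest alias measured in dash-separated tokens.
    max_len = max((a.count("-") + 1 for a in aliases), default=1)
    tokens = slug.lower().split("-")
    found = {}
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "suv-cars" wins over "cars".
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = "-".join(tokens[i:i + length])
            if candidate in aliases:
                # Keep the first alias seen for each property type.
                found.setdefault(aliases[candidate], candidate)
                i += length
                break
        else:
            i += 1  # unknown token such as "crap": skip it
    return found
```

Every lookup is an in-memory dictionary access, so the number of word combinations does not translate into database queries. The ambiguity problem remains: if one alias is a dash-joined prefix of another sequence of aliases, the greedy match picks the longer one.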

Displaying results of perform find in a portal

I have some global variables $$A, $$B, $$C and want to search within a table for these terms in fieldA, fieldB and fieldC (using Perform Find). How can I use the result of this Perform Find to display the results in a portal?
The implementation by my predecessor sets a field fieldSEARCH to 1 if the record is in the Perform Find results and 0 otherwise, and then uses a portal filtered by this field. This seems a very dodgy way of doing it, not least because it means that multiple users will not be able to search at the same time!
Can you enhance the portal filter to filter against the variables themselves? Or you can perform the find, grab IDs of the found set, put them into a global field, and then use the field to construct the relationship. Global fields are multi-user safe.
The best way is not to do this at all, but use list views to perform searches. List views are naturally searchable and much more flexible than portals (you can easily sort them, omit arbitrary records, and so on). It's possible to repeat this functionality in portals, but it's way more complex. I mean, if there's some serious gain from using a portal, then it's doable, but if not, then the native way is obviously better.
List views are easier to search, as FileMaker still hasn't transitioned to the 21st century and insists on this model... Most users however want a Master-Detail view, like a mail app, and understandably so as it's more intuitive (i.e. produce a list view on one side, but clicking on it updates detail/fields in the middle).
If this is what you want, you may want to cast an eye at Modular FM, where someone has already done the hard work for you:
http://www.modularfilemaker.org/module/masterdetail-2-0/
HTH
Stam