Optimizing a Prefix Tree in OrientDB

In my project, I have a fairly large prefix tree, potentially containing millions of nodes (about 250K nodes in my development instance), managed in OrientDB (pointing to other vertices in my graph).
The nodes of the prefix tree are represented by a Token vertex type. Each Token has a 'key' property and is connected to its child vertices by a 'child' edge type. So, a sequence like "hello world" would be represented as:
root -child-> "hello" -child-> "world"
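Concretely, a chain like that could be created with something along these lines (a rough sketch; in the real tree the edges are created from the concrete parent RID rather than by key lookup, since keys are not unique across the tree):
CREATE CLASS Token EXTENDS V
CREATE CLASS child EXTENDS E
CREATE PROPERTY Token.key STRING
CREATE VERTEX Token SET key = 'hello'
CREATE VERTEX Token SET key = 'world'
CREATE EDGE child FROM (SELECT FROM Token WHERE key = 'hello') TO (SELECT FROM Token WHERE key = 'world')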
Currently, I have a NOTUNIQUE_HASH_INDEX on Token.key and I am querying the data structure like this:
SELECT EXPAND(OUT('child')[key=:k]) FROM :p
where k is the child key I am looking for and p is the RID of the parent node.
Generally, performance is pretty good, but I am looking for ideas on improving the query, the indexing, or both for this use case. In particular, queries starting at the root node, which has many children, take noticeably longer than queries starting at other, less-connected nodes.
Any suggestions? Thanks in advance!

Luigi Dell'Aquila from the OrientDB team provided an excellent answer on the OrientDB Google Group. To summarize, the following query (suggested by Luigi) dramatically improved performance.
SELECT FROM Token WHERE key = :k AND in('child') CONTAINS :p
I just ran a realistic test and query time was reduced by 97%! See https://groups.google.com/forum/#!topic/orient-database/mUkz6Z7hSwk for more details.
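For reference, here is the index from the question together with the improved query with example values substituted for the parameters (the RID #12:0 standing in for the parent node is just a made-up placeholder):
CREATE INDEX Token.key ON Token (key) NOTUNIQUE_HASH_INDEX
SELECT FROM Token WHERE key = 'world' AND in('child') CONTAINS #12:0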

Related

Filtering a return group by traversal in ArangoDB

I'm in the process of evaluating ArangoDB to be used instead of OrientDB. My dataset is essentially a forest of not-necessarily connected trees (a family tree).
Because the dataset is a directed acyclic graph (a tree), it's always more efficient to walk up the tree looking for something than down the tree.
In earlier versions of OrientDB, before they removed this critical feature for me, I was able to do the following query:
SELECT FROM Person WHERE haircolor = "Red" and in traverse(0, -1, "in") (birth_country = "Ireland")
Since haircolor is an indexed field, it's efficient to get all of those vertices. The magic is in the traverse operator within the WHERE clause, which stops traversal and immediately returns TRUE if it locates any ancestor from Ireland.
Yes, you can turn it around: look for everyone from Ireland and then walk downward looking for those pesky redheads, returning them. But that is substantially less efficient, since you have to evaluate every downward path, which potentially expands exponentially.
Since OrientDB shot themselves in the foot (in my opinion) by taking that feature out, I'm wondering if there's an ArangoDB query that would do a similar task without walking down the tree.
Thanks in advance for your help!
In AQL, it would go something like this:
FOR redhead IN Person // Find start vertices
  FILTER redhead.haircolor == "Red"
  FOR v, e, p IN 1..99 INBOUND redhead Ancestor // Traversal up to depth 99
    PRUNE v.birth_country == "Ireland" // Don't walk further if the condition is met
    RETURN p // Return the entire path
This assumes that the relations (edges) are stored in an edge collection called Ancestor.
PRUNE prevents further traversal down (or here: up) the path, but it includes the node at which the condition matched.
https://www.arangodb.com/docs/stable/aql/graphs-traversals.html#pruning
Note that the variable depth traversal returns not only the longest paths but also "intermediate" paths of the same route. You may want to filter them out on the client-side or take a look at this AQL solution at the cost of additional traversals: https://stackoverflow.com/a/64931939/2044940
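If you only want the paths that actually end at an ancestor born in Ireland (rather than every pruned prefix), a common variation, assuming the same Person and Ancestor collections, is to repeat the condition in a FILTER after the PRUNE:
FOR redhead IN Person
  FILTER redhead.haircolor == "Red"
  FOR v, e, p IN 1..99 INBOUND redhead Ancestor
    PRUNE v.birth_country == "Ireland"
    FILTER v.birth_country == "Ireland" // keep only paths whose last vertex matches
    RETURN p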

Apache Spark - Implementing a distributed QuadTree

I am really, really, new to Apache Spark.
I am working on implementing Approximate LOCI (or ALOCI), an anomaly detection algorithm, in a distributed way over Spark. The algorithm is based on storing points in a QuadTree, which is used to find a point's number of neighbors.
I know exactly how QuadTrees work; in fact, I implemented such a structure in Java recently. But I am completely lost when it comes to how such a structure can work in a distributed way over Spark.
Something similar to what I need can be found in Geospark.
https://github.com/DataSystemsLab/GeoSpark/tree/b2b6f1d7f0015d5c9d663a7b28d5e1bb1043c413/core/src/main/java/org/datasyslab/geospark/spatialPartitioning/quadtree
In many cases, GeoSpark uses a PointRDD class, which extends a SpatialRDD class that, as far as I can see, uses the QuadTree from the link above to partition the spatial objects. That is what I understood, at least in theory.
In practice, I still cannot figure this out. Let's say, for example, that I have millions of records in a csv and I want to read them and load them into a QuadTree.
I could read the csv into an RDD, but then what? How does this RDD logically connect to the QuadTree I am trying to build?
Of course, I don't expect a working solution here. I just need the logic here to fill the gap in my mind. How do I implement a distributed QuadTree and how do I use it?
Ok, sadly there are no answers to this, but here I am two weeks later with a working solution. Not 100% sure if it is the right approach here, though.
I created a class named Element and turned each line of my csv into an RDD[Element]. I then created a serializable class named QuadNode, which has a List[Element] and an Array[String] of size 4. When elements are added to a node, they go into the node's List. If the list grows beyond X elements (20 in my case), the node breaks into 4 children and the elements are distributed to the children. Finally, I created a class QuadTree which has an RDD[QuadNode] among its other properties. Every time a node breaks into children, these child nodes are added to the tree's RDD.
In a non-functional language, each node would have 4 pointers, one for each child. Since we are in a distributed environment, this approach cannot work. So I gave each node a unique id. The root node has id "0", and its children have ids "00", "01", "02" and "03". The children of node "00" have ids "000", "001", "002" and "003". In this way, if we want to find all the descendants of a node, we filter the tree's RDD[QuadNode] by checking whether a node's id starts with our node's id. Reversing this logic lets us find a node's parent.
This is how I implemented my QuadTree, at least for now. If someone knows a better way of implementing this, I would love to hear their opinion.
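For anyone landing here later, here is a rough Scala sketch of the id scheme described above (class names are the ones from my description; the splitting logic is omitted and the sample data is made up):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// A point parsed from one csv line (fields simplified for the sketch).
case class Element(id: Long, x: Double, y: Double)

// A quadtree node identified by a path-like id: "0" is the root,
// "00".."03" are its children, "000".."003" the children of "00", and so on.
case class QuadNode(id: String, elements: Seq[Element], isLeaf: Boolean)

object QuadTreeIdSketch {

  // All descendants of a node: every node whose id starts with the given id.
  def descendants(tree: RDD[QuadNode], nodeId: String): RDD[QuadNode] =
    tree.filter(n => n.id != nodeId && n.id.startsWith(nodeId))

  // The parent id is simply the node id with its last character dropped.
  def parentId(nodeId: String): Option[String] =
    if (nodeId.length > 1) Some(nodeId.dropRight(1)) else None

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("quadtree-id-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Pretend these nodes came out of the splitting step (threshold of 20 elements per node).
    val tree: RDD[QuadNode] = sc.parallelize(Seq(
      QuadNode("0", Seq.empty, isLeaf = false),
      QuadNode("00", Seq(Element(1L, 0.1, 0.2)), isLeaf = true),
      QuadNode("01", Seq(Element(2L, 0.7, 0.3)), isLeaf = true)
    ))

    descendants(tree, "0").collect().foreach(n => println(n.id))
    spark.stop()
  }
}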

What is the best way to retrieve information in a graph through the has() step

I'm using titan graph db with tinkerpop plugin. What is the best way to retrieve a vertex using has step?
Assuming employeeId is a unique attribute which has a unique vertex-centric index defined, is it better to go through the label, i.e.
g.V().has(label,'employee').has('employeeId','emp123')
g.V().has('employee','employeeId','emp123')
or is it better to retrieve the vertex based on the unique property directly, i.e.
g.V().has('employeeId','emp123')
Which of the two is the quicker and better way?
First you have 2 options to create the index:
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).buildCompositeIndex()
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).indexOnly(employee).buildCompositeIndex()
For option 1 it doesn't really matter which query you're going to use. For option 2 it's mandatory to use g.V().has('employee','employeeId','emp123').
Note that g.V().hasLabel('employee').has('employeeId','emp123') will NOT select all employees first. Titan is smart enough to apply those filter conditions that can leverage an index first.
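For reference, a fuller sketch of option 2 in the management API might look like this (assuming the property key and label are created in the same management transaction; depending on the Titan version the entry point is graph.openManagement() or graph.getManagementSystem()):
mgmt = graph.openManagement()
employeeId = mgmt.makePropertyKey('employeeId').dataType(String.class).make()
employee = mgmt.makeVertexLabel('employee').make()
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).indexOnly(employee).buildCompositeIndex()
mgmt.commit()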
One more thing I want to point out: the whole point of indexOnly() is to allow sharing properties between different types of vertices. So instead of calling the property employeeId, you could call it uuid and also use it for employers, companies, etc.:
mgmt.buildIndex('employeeById', Vertex.class).addKey(uuid).indexOnly(employee).buildCompositeIndex()
mgmt.buildIndex('employerById', Vertex.class).addKey(uuid).indexOnly(employer).buildCompositeIndex()
mgmt.buildIndex('companyById', Vertex.class).addKey(uuid).indexOnly(company).buildCompositeIndex()
Your queries will then always have this pattern: g.V().has('<label>','<prop-key>','<prop-value>'). This is in fact the only way to go in DSE Graph, since we got completely rid of global indexes that span across all types of vertices. At first I really didn't like this decision, but meanwhile I have to agree that this is so much cleaner.
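With that setup, lookups all follow the same shape (the values below are made up):
g.V().has('employee', 'uuid', 'emp123')
g.V().has('employer', 'uuid', 'acme-hr')
g.V().has('company', 'uuid', 'acme')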
The second option, g.V().has('employeeId','emp123'), is better, as long as the property employeeId has been indexed.
This is because each step in a Gremlin traversal acts as a filter. So when you say:
g.V().has(label,'employee').has('employeeId','emp123')
You first go to all the vertices with the label employee and then from the employee vertices you find emp123.
With g.V().has('employeeId','emp123') a composite index allows you to go directly to the correct vertex.
Edit:
As Daniel has pointed out in his answer, Titan is actually smart enough not to visit all employees and leverages the index immediately. So in this case there appears to be little difference between the traversals. I personally favour using direct global indices without labels (i.e. the traversal without the label step), but that is just a preference when using Titan; I like to keep steps and filters to a minimum.

some questions about designing on OrientDB

We were looking for the most suitable database for our innovative “collaboration application”. Sorry, we don’t know how to name it in a generally understood way. In essence, highly complicated relationships among tenants, roles, users, tasks and bills need to be handled effectively.
After reading about 5 DBs (PostgreSQL, Mongo, Couch, Arango and Neo4J), when the words “… relationships among things are more important than things themselves” came to my eyes, I made up my mind to dig into OrientDB. Both the design philosophy and the innovative features of OrientDB (multi-model, clusters, OO, native graph, full graph API, SQL-like query language, LiveQuery, multi-master, auditing, simple RIDs and version numbers ...) keep intensifying my enthusiasm.
OrientDB enlightens me to re-think and try to model from a totally different viewpoint!
We are now designing the data structure based on OrientDB. However, there are some questions puzzling me.
LINK vs. EDGE
Take the case where a CLIENT may place thousands of ORDERs: how should we choose between LINKs and EDGEs to store the relationships? I prefer EDGEs, but they seem to store thousands of RIDs of ORDERs in the CLIENT record.
Embedded records’ Security
Can an embedded record be authorized independently from its container record?
Record-level Security
How does activating Record-level Security affect the query performance?
I hope I have expressed this clearly. Any words will be truly appreciated.
LINK vs EDGE
If you don't have properties on your arc you can use a link; if you do, use edges. You really need edges if you need to traverse the relationship in both directions, while with a link list you can traverse it in only one direction (just like a hyperlink on the web), but without the overhead of edges. Edges are the right choice if you need to walk through a graph, but they require more storage space than a link list. Another difference is that if you have two vertices linked through a link, A --> (link) B, and you delete B, the link does not disappear: it remains, but points to nothing. It is designed this way because when you delete a document, finding all the other documents that link to it would mean doing a full scan of the database, which typically takes ages to complete. The Graph API, with its bidirectional links, is specifically designed to solve this problem, so in general we suggest customers use it, or else be careful and manage link consistency at the application level.
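To make the difference concrete, here is a minimal sketch in OrientDB SQL (class names and RIDs are invented; Purchase stands in for ORDER, which clashes with the SQL keyword). With an edge, the relationship is stored on both vertices and can be traversed in either direction:
CREATE CLASS Client EXTENDS V
CREATE CLASS Purchase EXTENDS V
CREATE CLASS Placed EXTENDS E
CREATE EDGE Placed FROM #21:0 TO #22:0
SELECT EXPAND(out('Placed')) FROM #21:0
SELECT EXPAND(in('Placed')) FROM #22:0
With a link list, the pointer lives only on the Client record and can be followed only from Client to Purchase:
CREATE PROPERTY Client.purchases LINKLIST Purchase
UPDATE Client ADD purchases = #22:0 WHERE @rid = #21:0
SELECT EXPAND(purchases) FROM #21:0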
RECORD-LEVEL SECURITY
Using 1 million vertices and an admin user called Luke, for a query like select from <class> where title = ? with a NOTUNIQUE_HASH_INDEX, the execution time was 0.027 sec.
OrientDB has the concept of users and roles, as well as Record Level Security. It also supports token based authentication, so it's possible to use OrientDB as your primary means of authorizing/authenticating users.
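As a rough illustration (the class name below is made up), record-level security is usually switched on by making a class extend the built-in ORestricted class; its records then carry _allow, _allowRead, _allowUpdate and _allowDelete fields listing the users and roles allowed to access them:
CREATE CLASS Bill EXTENDS V
ALTER CLASS Bill SUPERCLASS +ORestricted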
EMBEDDED RECORD'S SECURITY
I've made this example to try to answer your question.
I have this structure (sketched below).
If I want to access the embedded data, I have to run this command: select prop from User
because if I try to access it through the class that holds the type of car, I won't get any result:
select from Car
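A minimal reconstruction of a structure like that (the class and property names are only inferred from the queries above) could be:
CREATE CLASS Car
CREATE PROPERTY Car.model STRING
CREATE CLASS User EXTENDS V
CREATE PROPERTY User.prop EMBEDDED Car
INSERT INTO User SET name = 'user1', prop = {"@type": "d", "@class": "Car", "model": "Fiat"}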
UPDATE
OrientDB supports that kind of authorization/authentication, but it works a little differently from your example. For example: if a user A without admin permission inserts a record, another user B without admin permission can't see the record inserted by user A. A user can see only the records they have inserted.
Hope it helps

How OrientDB links physically work (to give O(1) relationship complexity and reference update)

How do OrientDB links physically work?
For example, if two nodes are connected, each knows the other's physical position (given by the cluster id and the position within the cluster). What happens to the physical references when I move one of them elsewhere? Is everything updated automatically? Is there some source of information about that?
Same thing about the O(1) relationship complexity: I can't find anything about it (only "OrientDB handles relationships as physical links to the records and assigns them only once, when the edge is created. That is, O(1)").
The only information I found is the following, and nothing more:
http://orientdb.com/docs/last/Tutorial-Relationships.html#the-problem-with-joins
I need more specific information about that.
UPDATE: found the information I needed thanks to Luca Garulli
Physical position update: the MOVE VERTEX SQL command (see the sketch below)
O(1) complexity: http://www.slideshare.net/lvca/how-graph-databases-started-the-multi-model-revolution (key point at slide 33)
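For illustration (the RIDs, cluster and class names below are invented), MOVE VERTEX changes the vertex's physical position and updates the connected edges so the references stay consistent:
MOVE VERTEX #12:1 TO CLUSTER:person_europe
MOVE VERTEX (SELECT FROM Person WHERE name = 'Luca') TO CLASS:Employee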
I do not know how much detail you want to get into, but you can find other information about RIDs and computational complexity at these links:
2.1 tutorial
[Design question] Record IDs