I would like to retrieve the total number of incoming or outgoing edges of a given type to/from a vertex in OrientDB. The obvious method is to construct a query using count() and inE(MyEdgeType), outE(MyEdgeType), or bothE(MyEdgeType). However, I am concerned about time complexity; if this operation is O(N) rather than O(1), I might be better off storing the number in the database rather than using count() each time I need it, because I anticipate the number of edges in question becoming very large. I have searched the documentation, but it does not seem to list the time complexities of OrientDB's functions. Also, I am unsure of whether to use in/out/both or inE/outE/bothE; I presume the E versions will be faster, but depending on how OrientDB stores edges under the hood, that may be wrong.
Is counting the set of incoming/outgoing/both edges of a given type to/from a vertex a constant-time operation--and if not, what is its time complexity? To be most efficient, do I need to use inE/outE/bothE, or in/out/both? Or is there some other method entirely that I've missed?
In OrientDB this is O(1): use inE()/outE()/bothE().size(), with or without the edge label as a parameter.
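A minimal sketch of what that looks like from application code, assuming the Blueprints-based OrientGraph Java API called from Scala; the database URL, credentials, record id and edge label below are placeholders, and countEdges is, as far as I understand, backed by the size of the stored edge collection rather than an iteration, but verify against your OrientDB version.

import com.tinkerpop.blueprints.Direction
import com.tinkerpop.blueprints.impls.orient.{OrientGraph, OrientVertex}

val graph = new OrientGraph("remote:localhost/mydb", "admin", "admin")
try {
  // Placeholder record id; countEdges should read the stored edge-collection size.
  val v = graph.getVertex("#12:0").asInstanceOf[OrientVertex]
  val incoming = v.countEdges(Direction.IN, "MyEdgeType")
  val outgoing = v.countEdges(Direction.OUT, "MyEdgeType")
  val both = v.countEdges(Direction.BOTH, "MyEdgeType")
  println(s"in=$incoming out=$outgoing both=$both")
} finally {
  graph.shutdown()
}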
While going through the MongoDB documentation on cursor.skip(), I read that it is an expensive approach, and I totally understand why: the cursor has to go through from the start to execute the skip. In the paragraph below that, the docs say:
Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.
I don't understand this part: how does this overcome the expensiveness of the skip() operation?
Thanks
When using cursor.skip(N) the server finds all the matching data and then skips over the first N matching documents.
When using range-based pagination (i.e. with a date range) the server will only find and return the matching documents. If the property you base your pagination on is indexed, the index will also be used.
The difference is the amount of data the server has to read in the two situations.
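To make the difference concrete, here is a rough sketch using the MongoDB Scala driver; the collection name "events" and the page size are made up for illustration.

import org.mongodb.scala._
import org.mongodb.scala.model.Filters.gt
import org.mongodb.scala.model.Sorts.ascending
import org.bson.types.ObjectId

val coll: MongoCollection[Document] =
  MongoClient().getDatabase("test").getCollection("events")
val pageSize = 20

// skip(): the server still walks past the first 100000 matching documents
// before it returns anything.
val deepPage = coll.find().sort(ascending("_id")).skip(100000).limit(pageSize)

// Range-based: remember the last _id of the previous page and continue from it.
// With the index on _id the server seeks straight to that point and reads
// only pageSize documents.
def nextPage(lastSeenId: ObjectId) =
  coll.find(gt("_id", lastSeenId)).sort(ascending("_id")).limit(pageSize)

// (With the async Scala driver these return Observables you still have to subscribe to.)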
I am developing an app where users will receive geo-based information depending on their position. The server should be able to handle a huge number of connected clients (>100k).
Now I have come up with four approaches for handling the users' positions.
Approach 1 - Without a geospatial index:
The app server just holds a list of connected clients and their locations.
Whenever there is information available, the server loops over the whole list and checks whether each client is within a given radius.
Doubts: Very expensive.
Approach 2 - Handle the geospatial index in the app server:
The app server maintains an R-tree with all connected clients and their locations.
For this I was looking at JSI (Java Spatial Index).
Doubts: It is very expensive to update the geospatial index with JSI.
Approach 3 - Let the database (MongoDB) do the geospatial indexing/calculation:
The app server only holds a reference to each connected client (the connection) and saves the key to that reference, together with its location, into MongoDB.
When new information is available, the server can query the database to get the keys of all clients nearby.
Pro: I guess MongoDB has a much better implementation of geospatial indexes than I could ever write in the app server.
Doubts: Clients are traveling around, which forces me to update the geospatial index frequently. Can I do that, or will I run into performance problems?
Approach 4 - My own "index" using a two-dimensional array:
Today I was thinking about creating a very simple index using a two-dimensional array, where the outer array is for the longitude and the inner one is for the latitude.
Let's say 3 longitude/latitude degrees would be enough precision.
I could retrieve a list of users in a given area with:
ArrayList<User> userList = data[91][35]; // 91.2548980712891, 35.60869979858
// I would also need to get the users in the surrounding cells: 90;35, 92;35, ...
// If I need more precision I could use one more decimal: data[912][356]
Pro: I would have fast read and write access without a query to the database.
Doubts: Degree cells get narrower towards the poles. Ugly hack?
I would be very grateful if someone could point me in "the" right direction.
The index used by MongoDB for geospatial indexing is based on a geohash, which essentially converts a two-dimensional space into a one-dimensional key, suitable for B-tree indexing. While this is somewhat less efficient than an R-tree index, it will be vastly more efficient than your approach 1. I would also argue that filtering the data at the database level with a spatial query will be more efficient and easier to maintain than building your own spatial indexing strategy on top.
The main thing to be aware of with MongoDB is that you cannot use the geometry field as a shard key, though you can shard a collection containing a geometry field using another key. Also, if you wish to do any aggregation queries (which isn't clear from your question), the geospatial ($geoNear) stage must be the first stage in the pipeline.
There is also a geohaystack index, which is based on small buckets and optimized for searches over small areas (see http://docs.mongodb.org/manual/core/geohaystack/), which might be useful in your case.
As far as speed is concerned, insertion and search on a B-tree index are essentially O(log n) (see the Wikipedia article on B-trees), while without an index your search will be O(n), so it does not take long before the difference in performance between having and not having an index is enormous.
If you are concerned about heavy writes slowing things down, you can tune the write concern in MongoDB so that you don't have to wait for replicas to acknowledge every write, at the cost of potentially inconsistent data if you should lose your primary.
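For approach 3, a minimal sketch with the MongoDB Scala driver might look like the following; the collection name ("clients"), field names ("clientId", "location") and the 5 km radius are assumptions for illustration, and the nearby query uses the Java driver's Filters.nearSphere helper with a GeoJSON point.

import org.mongodb.scala._
import org.mongodb.scala.model.{Filters, Indexes, Updates}
import com.mongodb.client.model.geojson.{Point, Position}

val clients: MongoCollection[Document] =
  MongoClient().getDatabase("geo").getCollection("clients")

// 2dsphere is the geohash-backed index described above.
clients.createIndex(Indexes.geo2dsphere("location"))

// Re-index a moving client: one indexed update per position report.
def updatePosition(clientId: String, lon: Double, lat: Double) =
  clients.updateOne(
    Filters.equal("clientId", clientId),
    Updates.set("location", new Point(new Position(lon, lat))))

// All clients within 5 km of a point (distances are in metres for GeoJSON points).
def nearby(lon: Double, lat: Double) =
  clients.find(
    com.mongodb.client.model.Filters.nearSphere(
      "location", new Point(new Position(lon, lat)), 5000.0, 0.0))

// (These calls return Observables with the async Scala driver; subscribe or
// convert them to Futures to actually run them.)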
I'm using Titan with Cassandra and have several (related) questions about querying the database with Gremlin:
1.) Is there a faster way to count all vertices than
g.V.count()
Titan tells me to use an index, but how can I use an index without a property?
WARN c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [<>]. For better performance, use indexes
2.) Is there a faster way to count all vertices with property 'myProperty' than
g.V.has('myProperty').count()
Again Titan says the following:
WARN c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [(myProperty<> null)]. For better performance, use indexes
But again, how can I do this? I already have an index on 'myProperty', but it needs a value to query quickly.
3.) And the same questions with edges...
Iterating all vertices with g.V.count() is the only way to get the count. It can't be done "faster". If your graph is so large that it takes hours to get an answer or your query just never returns at all, you should consider using Faunus. However, even with Faunus you can expect to wait for your answer (such is the nature of Hadoop...no sub-second response here), but at least you will get one.
Any time you do a table scan (i.e. iterate all Vertices) you get that warning of "iterating over all vertices". Generally speaking, you don't want to do that, as you will never get a response. Adding an index won't help you count all vertices any faster.
Edges have the same answer. Use g.E.count() in Gremlin if you can. If it takes too long, then try Faunus so you can at least get an answer.
Doing a count is expensive in big distributed graph databases. You can have a node that keeps track of many of the database's frequently needed aggregate numbers and update it from a cron job so you have the figures handy. Usually, if you have millions of vertices, having the count from the previous hour is not such a disaster.
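A rough sketch of that idea, assuming Titan's Blueprints-style Java API called from Scala; the "kind"/"stats"/"vertexCount" names are made up for illustration, the lookup assumes an index on "kind", and the expensive count itself would still come from Faunus or an offline Gremlin job.

import com.thinkaurelius.titan.core.TitanFactory
import scala.collection.JavaConverters._

val graph = TitanFactory.open("conf/titan-cassandra.properties")

// Called from a cron-style job after the expensive count has been computed.
def storeVertexCount(freshCount: Long): Unit = {
  val stats = graph.getVertices("kind", "stats").asScala.headOption.getOrElse {
    val v = graph.addVertex(null)
    v.setProperty("kind", "stats")
    v
  }
  stats.setProperty("vertexCount", freshCount)
  graph.commit()
}

// Reading the cached number later is a single indexed lookup, not a full scan.
def cachedVertexCount: Option[Long] =
  graph.getVertices("kind", "stats").asScala.headOption
    .map(v => v.getProperty[java.lang.Long]("vertexCount").longValue)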
I have a List with about 1176^3 positions.
Doing something like
val x = list.length
takes hours.
When the list has 1,271,256 positions it is fine, just a few seconds.
Does anyone have an idea how to speed it up?
List is possibly the wrong data structure for a length operation as it is O(n) - it takes longer to complete the longer the list is.
Vector is possibly a better data structure to use if you need to invoke length, as its storage supports random access in effectively constant time.
This, of course, does not mean that List is a poor structure to use; it is just that in this case it might not be preferable.
To add to gpampara's answer, in cases like these you may actually be able to justify using an Array, since it has the lowest overhead per item stored and O(1) access to elements and length determination (since it's recorded in the array header itself).
Array has many down-sides, but I consider it justifiable when memory overhead is a primary consideration (and when a fixed-size collection whose size is known at the time of creation is feasible).
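To make the contrast concrete, a tiny Scala sketch (the collection sizes here are arbitrary):

val xs = List.fill(1000000)(0)   // singly linked list
val vs = Vector.fill(1000000)(0) // tree-backed indexed sequence
val as = Array.fill(1000000)(0)  // flat JVM array

xs.length // O(n): walks a million cons cells to count them
vs.length // effectively constant time
as.length // O(1): stored in the array header itself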
For a project I am working on, I need to keep track of up to several thousand objects. The collection I choose needs to support insertion, selection, and deletion of random elements. My algorithm performs each of these operations several times, so I would like a collection that can do all these in constant time.
Is there such a collection? If not, what are some trade-offs with existing collections? I am using Scala 2.9.1.
EDIT: By "random", I mean mathematically/probabilistically random, i.e., I would like to select elements randomly from the collection using Random or some other appropriate generator.
Define "random". If you mean indexed, then there's no such collection. You can have insertion/deletion in constant time if you give up the "random element" requirement -- ie, you have have non-constant lookup of the element which will be deleted or which will be the point of insertion. Or you can have constant lookup without constant insertion/deletion.
The collection that best approaches that requirement is the Vector, which provides O(log n) for these operations.
On the other hand, if you have the element which you'll be looking up or removing, then just pick a HashMap. It's not precisely constant time, but it is a fair approximation. Just make sure you have a good hash function.
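A small sketch of both options, assuming "selection" means drawing elements with a random number generator (valid on Scala 2.9.1):

import scala.util.Random
import scala.collection.mutable

val rng = new Random

// Vector: effectively constant-time access by a randomly chosen index.
def pickRandom[A](v: Vector[A]): A = v(rng.nextInt(v.length))

// HashMap: near-constant insert/lookup/remove when you already hold the key.
val byId = mutable.HashMap.empty[Int, String]
byId += (1 -> "one")  // insert
val hit = byId.get(1) // lookup
byId -= 1             // delete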
As a starting point, take a look at The Scala 2.8 Collections API especially at Performance Characteristics.