What is the difference between a `Vertex Centric Index` query and a simple `has()` query? - titan

What is the difference between a vertex-centric index query and a simple has() query?
I have vertices with properties like targetGender (male, female), isAvailable (yes, no), and materialType (glass, plastic, ...).
I should be able to query using all of these properties. Which query is more efficient:
`g.query().has("materialType",glass);`
or
`v.query().direction(Direction.IN).labels(edgeLabel.toString()).has("materialType", glass);`

I'm not sure there's a "difference" so much as they are related things rather than comparable ones. If you don't define a vertex-centric index and issue a has()-based query that hits vertices with several hundred thousand edges, you will still get an answer, but Titan will iterate over all of those edges to produce it. If you issue the same has() query with vertex-centric indices defined, Titan will use those indices to filter the edges, saving a lot of time on the query.
No matter what you choose to do, has() will work - vertex-centric indices just make it more efficient.
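For reference, here is a minimal sketch of what defining such a vertex-centric index could look like through Titan's management API. This assumes Titan 1.0 (openManagement(); older 0.5.x uses getManagementSystem() instead), and the "madeOf" edge label and index name are made-up illustrations, not taken from the question:

```java
// Sketch only: assumed edge label and index name; Titan 1.0 management API.
import com.thinkaurelius.titan.core.EdgeLabel;
import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import org.apache.tinkerpop.gremlin.structure.Direction;

public class VertexCentricIndexSetup {
    public static void defineIndex(TitanGraph graph) {
        TitanManagement mgmt = graph.openManagement();
        PropertyKey materialType = mgmt.makePropertyKey("materialType").dataType(String.class).make();
        EdgeLabel madeOf = mgmt.makeEdgeLabel("madeOf").make();
        // Vertex-centric index: a query such as
        //   v.query().labels("madeOf").has("materialType", "glass")
        // can now be answered from the index instead of iterating every incident edge.
        mgmt.buildEdgeIndex(madeOf, "madeOfByMaterial", Direction.BOTH, materialType);
        mgmt.commit();
    }
}
```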

How to do in-memory search for polygons that contain a given point?

I have a PostgreSQL table that has a geometry type column, in which different simple polygons (possibly intersecting) are stored. The polygons are all areas within a city. I receive an input of a point (latitude-longitude pair) and need to find the list of polygons that contain the given point. What I have currently:
Unclustered GiST index defined on the polygon column.
Use ST_Contains(#param_Point, table.Polygon) on the whole table.
It is quite slow, so I am looking for a more performant in-memory alternative. I have the following ideas:
Maintain dictionary of polygons in Redis, keyed by their geohash. Polygons with same geohash would be saved as a list. When I receive the point, calculate its geohash and trim to a desired level. Then search in the Redis map and keep trimming the point's geohash until I find the first result (or enough results).
Have a trie of geohashes loaded from the database. Update the trie periodically or by receiving update events. Calculate the point's geohash, search in the trie until I find enough results. I prefer this because the map may have long lists for a geohash, given the nature of the polygons.
Any other approaches?
I have read about libraries like GeoTrie and Polygon Geohasher but can't seem to integrate them with the database and the above ideas.
Any cues or starting points, please?
Have you tried using ST_Within? Not sure if it meets your criteria, but I believe it is meant to be faster than ST_Contains.
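For what it's worth, here is a minimal JDBC sketch of that kind of point-in-polygon lookup. The table and column names are assumptions, not from the question; ST_Within(point, polygon) is just the argument-swapped form of ST_Contains(polygon, point), and both can use the GiST index:

```java
// Sketch only: assumes a table city_polygons(id, geom) with a GiST index on geom (SRID 4326).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PointInPolygonLookup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/gisdb", "user", "password")) {
            // ST_Within(point, polygon) is true when the point lies inside the polygon;
            // its implicit bounding-box test is what the GiST index accelerates.
            String sql = "SELECT id FROM city_polygons "
                       + "WHERE ST_Within(ST_SetSRID(ST_MakePoint(?, ?), 4326), geom)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setDouble(1, -73.9857); // longitude (x)
                ps.setDouble(2, 40.7484);  // latitude (y)
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println("Containing polygon id: " + rs.getLong("id"));
                    }
                }
            }
        }
    }
}
```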

What are the pros and cons of multiple rows of POLYGON vs one MULTIPOLYGON field?

So for the first time I'm going to do a project that involves maps, and layers on top of maps, which have many points and many polygons on them.
I have the tendency to create separate tables for points and polygons and then create many-to-many relationships between them and the layers table. If I do that I end up with 5 tables: points, polygons, layers, layers_points and layers_polygons.
However, I see PostGIS also offers types called MULTIPOINT and MULTIPOLYGON. If I use those types then I could put it all in the layers table. I guess that would make queries faster, because I need fewer joins. However, I'm not sure if later I might regret it, if it means that working with the individual points and polygons becomes impossible. I'm not even sure yet if it will be necessary to perform calculations on the individual points and polygons, but it would be nice to know whether that's possible or not in both approaches.
So basically I'm asking: what are the pros and cons of these different approaches?
In general, you would consider using multipolygons to represent entities that have disjoint surfaces (for example, the geometry of Alaska) or other topologies that you can't represent as polygons. The key here is that a single entity needs to be expressed with a multipolygon.
What you wouldn't do is group unrelated polygons into a multipolygon, because you won't be able to perform queries at a child polygon level, unless you extract the rings into another geometry. If the polygons are unrelated, chances are you will need to query them individually. Even if they share a layer, you can manage that relation with business logic without merging them as they aren't representing the same entity.
Keep in mind that geometry tools on the frontend won't necessarily treat multipolygons as a valid geometry or a multi object. Point-in-polygon algorithms, which look like your use case, won't necessarily work when checking whether a point is contained in a multipolygon.
Tools like Wicket.js (which transforms from/to WKT/GeoJSON/native objects) don't support multipolygons. The Google Maps API v3 doesn't support multipolygons except for the data layer (but you can't operate on the data layer as you would on a polygon feature). Turf.js has operations that run on a FeatureCollection containing several polygons, but not on a multipolygon.
Without knowing your exact use case, that's the best I can tell you. TL;DR: keep your polygons as they are.
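If you do end up with multipolygons but still occasionally need the child polygons, one option is to explode them at query time with ST_Dump. A rough sketch (the layers table and geom column are assumed names):

```java
// Sketch only: assumes a table layers(id, geom) whose geom column stores MULTIPOLYGONs.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplodeMultipolygons {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/gisdb", "user", "password");
             Statement st = conn.createStatement()) {
            // ST_Dump splits each multipolygon into its component polygons, so the
            // parts can be filtered or measured individually even though they share a row.
            ResultSet rs = st.executeQuery(
                "SELECT id, (dump).path[1] AS part, ST_Area((dump).geom) AS part_area " +
                "FROM (SELECT id, ST_Dump(geom) AS dump FROM layers) AS parts");
            while (rs.next()) {
                System.out.printf("layer %d, part %d, area %f%n",
                        rs.getLong("id"), rs.getInt("part"), rs.getDouble("part_area"));
            }
        }
    }
}
```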

Understanding Titan Traversals

I am trying to write a highly scalable system with TitanDB. I have a situation where some nodes are highly connected.
Imagine the following example at much larger scale.
Now I have the following situations:
I want to find all the friends of node X.
I want to find a specific friend of node X, for example node 5.
For scenario 1 I do: g.V(X).out(friend).toList(). For scenario 2 I do: g.V(X).out(friend).hasId(5).next(). Both of these traversals will work but scale poorly as X gets more friends. Can I optimise this situation by putting more information on the edge label? For example, if on the edge between X and 5 I change the label to friend_with_5, will the following be faster:
`g.V(X).out(friend_with_5).next()`
From my understanding this will be faster, as only one edge will be traversed. However, if I make such a change to my edge labels, how would I find all the friends of X?
You could encode data into your edge label, but doing so comes at the cost of complicating your graph schema considerably and, as you note, making it hard to do simple things like "find all my friends". I don't think you should take that approach.
The preferred method for dealing with this is with vertex-centric indices. If you denormalize any data to your edges, you should do it with those indices in mind (and not by encoding that data into the edge label). Put some unique identifier for the friend on the "friend" edge and index that.
If your supernodes are especially large (millions+ edges) you should also consider Titan's vertex partitioning feature.
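A rough sketch of that approach, assuming Titan 1.0 with TinkerPop 3; the friendId property, index name, and "user" vertex label are illustrative names, not something the question defines:

```java
// Sketch only: assumed property/index names; Titan 1.0 management API and TinkerPop 3 traversals.
import com.thinkaurelius.titan.core.EdgeLabel;
import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Direction;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class FriendEdgeIndex {

    public static void defineSchema(TitanGraph graph) {
        TitanManagement mgmt = graph.openManagement();
        PropertyKey friendId = mgmt.makePropertyKey("friendId").dataType(Long.class).make();
        EdgeLabel friend = mgmt.makeEdgeLabel("friend").make();
        // Vertex-centric index on the denormalized friendId property of "friend" edges.
        mgmt.buildEdgeIndex(friend, "friendById", Direction.OUT, friendId);
        // For very large supernodes you could also consider a partitioned vertex label,
        // e.g. mgmt.makeVertexLabel("user").partition().make()
        mgmt.commit();
    }

    public static Vertex findFriend(GraphTraversalSource g, Object xId, long friendVertexId) {
        // "Find all friends" stays simple: g.V(xId).out("friend").toList()
        // "Find one friend" filters on the indexed edge property instead of scanning every edge:
        return g.V(xId).outE("friend").has("friendId", friendVertexId).inV().next();
    }
}
```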

MongoDB and using DBRef with Spatial Data

I have a collection with 100 million documents of geometry.
I have a second collection with time data associated to each of the other geometries. This will be 365 * 96 * 100 million or 3.5 trillion documents.
Rather than store the 100 million entries (365*96) times more than needed, I want to keep them in separate collections and do a type of JOIN/DBRef/Whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geoIntersection. This will filter it down from 100 million to 5,000. Then, using those 5,000 geometry GUIDs, I want to filter the 3.5 trillion documents based on the 5,000 geometries and the additional date criteria I specify, aggregate the data, and find the average. You are left with 5,000 geometries and 5,000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL. Is this possible in MongoDB, and can it be done optimally, in say less than 10 seconds?
To clarify: as I understand it, this is what DBRefs are used for, but I have read that they are not efficient at all, and that when dealing with this much data they wouldn't be a good fit.
If you're going to be dealing with a geometry and its time series data together, it makes sense to store them in the same doc. A year's worth of data in 15-minute increments isn't a killer - and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you handle missing data sparsely; you can encode the data differently if it's sparse rather than indexing into a 35,040-slot array.
A $geoIntersects on a big pile of geometry data will be a performance issue, though. Make sure you have some indexing in place (like a 2dsphere index) to speed things up.
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.
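As a concrete (and entirely hypothetical) illustration of the indexing and pre-filtering ideas above, using the MongoDB Java driver; the collection, field names ("geometry", "state", "guid"), and search polygon are all assumptions:

```java
// Sketch only: assumed collection/field names; MongoDB Java driver 3.7+ (MongoClients).
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.geojson.Polygon;
import com.mongodb.client.model.geojson.PolygonCoordinates;
import com.mongodb.client.model.geojson.Position;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Arrays;

public class GeoPrefilterQuery {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> geoms =
                    client.getDatabase("gis").getCollection("geometries");

            // 2dsphere index so $geoIntersects can use the index rather than scanning.
            geoms.createIndex(Indexes.geo2dsphere("geometry"));

            Polygon searchArea = new Polygon(new PolygonCoordinates(Arrays.asList(
                    new Position(-74.0, 40.7), new Position(-73.9, 40.7),
                    new Position(-73.9, 40.8), new Position(-74.0, 40.8),
                    new Position(-74.0, 40.7))));

            // A cheap qualifier (e.g. a denormalized state code) intended to narrow
            // candidates before the more expensive geo predicate.
            Bson query = Filters.and(
                    Filters.in("state", "NY", "NJ"),
                    Filters.geoIntersects("geometry", searchArea));

            for (Document doc : geoms.find(query).projection(new Document("guid", 1))) {
                System.out.println(doc.getString("guid"));
            }
        }
    }
}
```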

Having a different radius for different indices/queries using a radius filter query

Current Code: https://gist.github.com/anonymous/c1a178bc4118f850d9cd
The flaw here is that I have two indices in the searchable alias. This means that I must use the same radius for both. I actually want to use a larger radius in the radius filter for one of the indices. Is there any way to do this without having two separate calls to .prepareSearch, and thus two separate hits to Elasticsearch?
Sounds like your data might be better suited to being in one index, in which case you could use and/or filters to combine a geo distance filter with a type filter.
Another option would be to use the indices query.
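For example, assuming the pre-2.x Java transport client that .prepareSearch suggests, an indices query can apply a different geo distance filter per index in a single search call. The index names, field name, and radii below are made up:

```java
// Sketch only: Elasticsearch 1.x Java API; index names, field name, and radii are assumptions.
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.DistanceUnit;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class PerIndexRadiusSearch {
    public static SearchResponse search(Client client, double lat, double lon) {
        // One prepareSearch over both indices: a 10 km radius applies to index_a,
        // and the noMatchQuery (25 km radius) applies to everything else (index_b).
        return client.prepareSearch("index_a", "index_b")
                .setQuery(QueryBuilders.indicesQuery(
                        QueryBuilders.filteredQuery(
                                QueryBuilders.matchAllQuery(),
                                FilterBuilders.geoDistanceFilter("location")
                                        .point(lat, lon)
                                        .distance(10, DistanceUnit.KILOMETERS)),
                        "index_a")
                        .noMatchQuery(QueryBuilders.filteredQuery(
                                QueryBuilders.matchAllQuery(),
                                FilterBuilders.geoDistanceFilter("location")
                                        .point(lat, lon)
                                        .distance(25, DistanceUnit.KILOMETERS))))
                .execute()
                .actionGet();
    }
}
```

If the data is merged into a single index instead, the same client can combine the geo distance filter with a type filter via FilterBuilders.andFilter(...) and FilterBuilders.typeFilter(...).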