Overpass relation railway segments - openstreetmap

I want to query the Overpass Api to find out the distance of special relations (railways). The request is fine, and returns me all relation, way and node objects I'm interested in. Example for Hamburg:
[out:json];(rel(53.55224324163863, 10.006766589408304, 53.55314275836137, 10.008081410591696)["route"="train"];>;);out body;
In Overpass, each relation object has members defining this relation. For way objects you can resolve the lat/lon of its node attributes and calculate the distance for that way. If you sum up all the way distances it seems to be reasonable.
However, there are members from that relation of the type node (most of the time, they have a role of "stop") which seem to represent the right order of stops from that relation. But instead being in between the members, they are roughly at the end.
If I try to look the stops up inside the ways, they are not all present. How am I supposed to calculate the distance between two particular stops?

There seems to be a misconception about relations. Relation members don't necessarily have to be sorted. Consequently you might have to sort the members yourself, if necessary at all.
You can take a look at JOSM which has a neat sorting algorithm for various types of relations. But I don't think it is able to place members with the role stop at the correct position. This isn't always possible because a way doesn't necessarily have to be split at each node with the stop role. This also means a single way can contain more than one node with a stop role, making it impossible to sort the relation members correctly. Unless you do some pre-processing for splitting each way accordingly.
For calculating the distance between each stop it seems unnecessary to sort the elements. Just follow the way by iterating over all each nodes and check for each node if it has a stop role in the corresponding relation. When reaching the end of the way continue with the one which shares the same node at its start or end and which is also member of the relation.

Related

Apache Spark - Implementing a distributed QuadTree

I am really, really, new to Apache Spark.
I am working on implementing Approximate LOCI (or ALOCI), an anomaly detection algorithm, on a distributed way over Spark. This algorithm is based on storing points in a QuadTree that is used to find a point's number of neighbors.
I know exactly how QuadTrees work. In fact, I have implemented such a structure in Java recently. But I am completely lost as far as it concerns the way that such a structure can work in a distributed way over Spark.
Something similar to what I need can be found in Geospark.
https://github.com/DataSystemsLab/GeoSpark/tree/b2b6f1d7f0015d5c9d663a7b28d5e1bb1043c413/core/src/main/java/org/datasyslab/geospark/spatialPartitioning/quadtree
GeoSpark uses in many cases a PointRDD class, that extends a SpatialRDD class which I can see that uses the QuadTree that can be found in the link above to partition the Spatial objects. That is what I understood, at least, in theory.
In practice, I still cannot figure this out. Let's say for example that I have millions of records in a csv and I want to read and load them in a QuadTree.
I could read a csv to an RDD, but then what? How does this RDD logically connects to the QuadTree I am trying to build?
Of course, I don't expect a working solution here. I just need the logic here to fill the gap in my mind. How do I implement a distributed QuadTree and how do I use it?
Ok, sadly there are no answers to this, but here I am two weeks later with a working solution. Not 100% sure if it is the right approach here, though.
I created a class named Element and turned each line of my csv to an RDD[Element]. I then created a serializable class named QuadNode which has a List[Elements] and an Array[String] of size 4. On adding elements to a node, these elements are added in the node's List. If the list get more than X elements (20 in my case), the node breaks into 4 children and the elements are sent to the children. Finally, I created a class QuadTree which has an RDD[QuadNodes] among its rest properties. Every time a node breaks to children then these children-nodes are added in the tree's RDD.
In a non-functional language, each node would have 4 pointers, one for each child. Since, we are in a distributed environment this approach could not work. So, I gave each node a unique Id. Root node has an id = "0". Root's nodes have ids "00", "01", "02" and "03". Node-"00" children have ids "000","001","002","003". In this way if we want to find all the descendants of a node, we filter our tree's RDD[QuadNode] by checking if nodes' ids startWith out node id. Reversing this logic helps us to find a node's parent node.
This is how I implemented my QuadTree, at least for now. If someone knows a better way of implementing this I would love to hear his/her opinion.

Whats is a RESTful way to get resources from all children?

Suppose I have the following api where vin is a Vehicle Identification Number and belongs to a single car.
GET /fleets/123/cars/55/vin
GET /fleets/123/cars/55/history
And I want to get all the vins for all the cars for a fleet. Which would be preferred among these:
GET /fleets/123/cars/all/vin
GET /fleets/123/cars/*/vin
GET /fleets/123/vins
GET /fleets/123/cars/vins
The first two preserve the hierarchy and make the controller more intuitive. The last three feels like it breaks consistency.
Are any of these preferred or is there a better way?
This is mostly an opinion-based question. There's no one true way here, so you'll get my opinion.
I'm not sure what a vin is, but if it's a type of resource, and there's a collection of vins as a child of a fleet resource, I would expect it to live here:
GET /fleets/123/vins
This communicates to me a vin is not a subordinate of a car. It's its own thing and I'm getting all vins for a specific fleet.
The fact that a vin also exists as a subordinate of car (to me) is kind of irrelevant to this. It makes perfectly sense to me that for those, that they live here:
GET /fleets/123/cars/55/vin
Similarly, if I would model the last 2 apis as 2 functions, I would call them:
findVinsForFleet(fleetId: number);
findVinForCar(fleetId: number, carId: number);

Database schema for a tinder like app

I have a database of million of Objects (simply say lot of objects). Everyday i will present to my users 3 selected objects, and like with tinder they can swipe left to say they don't like or swipe right to say they like it.
I select each objects based on their location (more closest to the user are selected first) and also based on few user settings.
I m under mongoDB.
now the problem, how to implement the database in the way it's can provide fastly everyday a selection of object to show to the end user (and skip all the object he already swipe).
Well, considering you have made your choice of using MongoDB, you will have to maintain multiple collections. One is your main collection, and you will have to maintain user specific collections which hold user data, say the document ids the user has swiped. Then, when you want to fetch data, you might want to do a setDifference aggregation. SetDifference does this:
Takes two sets and returns an array containing the elements that only
exist in the first set; i.e. performs a relative complement of the
second set relative to the first.
Now how performant this is would depend on the size of your sets and the overall scale.
EDIT
I agree with your comment that this is not a scalable solution.
Solution 2:
One solution I could think of is to use a graph based solution, like Neo4j. You could represent all your 1M objects and all your user objects as nodes and have relationships between users and objects that he has swiped. Your query would be to return a list of all objects the user is not connected to.
You cannot shard a graph, which brings up scaling challenges. Graph based solutions require that the entire graph be in memory. So the feasibility of this solution depends on you.
Solution 3:
Use MySQL. Have 2 tables, one being the objects table and the other being (uid-viewed_object) mapping. A join would solve your problem. Joins work well for the longest time, till you hit a scale. So I don't think is a bad starting point.
Solution 4:
Use Bloom filters. Your problem eventually boils down to a set membership problem. Give a set of ids, check if its part of another set. A Bloom filter is a probabilistic data structure which answers set membership. They are super small and super efficient. But ya, its probabilistic though, false negatives will never happen, but false positives can. So thats a trade off. Check out this for how its used : http://blog.vawter.com/2016/03/17/Using-Bloomfilters-to-Avoid-Repetition/
Ill update the answer if I can think of something else.

what is the best way to retrive information in a graph through has Step

I'm using titan graph db with tinkerpop plugin. What is the best way to retrieve a vertex using has step?
Assuming employeeId is a unique attribute which has a unique vertex centric index defined.
Is it through label
i.e g.V().has(label,'employee').has('employeeId','emp123')
g.V().has('employee','employeeId','emp123')
(or)
is it better to retrieve a vertex based on Unique properties directly?
i.e g.V().has('employeeId','emp123')
Which one of the two is the quickest and better way?
First you have 2 options to create the index:
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).buildCompositeIndex()
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).indexOnly(employee).buildCompositeIndex()
For option 1 it doesn't really matter which query you're going to use. For option 2 it's mandatory to use g.V().has('employee','employeeId','emp123').
Note that g.V().hasLabel('employee').has('employeeId','emp123') will NOT select all employees first. Titan is smart enough to apply those filter conditions, that can leverage an index, first.
One more thing I want to point out is this: The whole point of indexOnly() is to allow to share properties between different types of vertices. So instead of calling the property employeeId, you could call it uuid and also use it for employers, companies, etc:
mgmt.buildIndex('employeeById', Vertex.class).addKey(uuid).indexOnly(employee).buildCompositeIndex()
mgmt.buildIndex('employerById', Vertex.class).addKey(uuid).indexOnly(employer).buildCompositeIndex()
mgmt.buildIndex('companyById', Vertex.class).addKey(uuid).indexOnly(company).buildCompositeIndex()
Your queries will then always have this pattern: g.V().has('<label>','<prop-key>','<prop-value>'). This is in fact the only way to go in DSE Graph, since we got completely rid of global indexes that span across all types of vertices. At first I really didn't like this decision, but meanwhile I have to agree that this is so much cleaner.
The second option g.V().has('employeeId','emp123') is better as long as the property employeeId has been indexed for better performance.
This is because each step in a gremlin traversal acts a filter. So when you say:
g.V().has(label,'employee').has('employeeId','emp123')
You first go to all the vertices with the label employee and then from the employee vertices you find emp123.
With g.V().has('employeeId','emp123') a composite index allows you to go directly to the correct vertex.
Edit:
As Daniel has pointed out in his answer, Titan is actually smart enough to not visit all employees and leverages the index immediately. So in this case it appears there is little difference between the traversals. I personally favour using direct global indices without labels (i.e. the first traversal) but that is just a preference when using Titan, I like to keep steps and filters to a minimum.

Do having multiple labels for a node in Neo4j make any sense?

Following this post from Neo4j's google group I have to say that I don't see any benefits when using this multiple-label-thing but rather, on the contrary, IMHO it just adds complexity for what a uniqueness constraint is. It could also tempt the user to introduce inheritance into the data model which would cause frustration since that's not possible at all...
Labels have not the notion of just representing a type, they are rather roles which are viable in different contexts.
So in one role, certain attributes and relationships of a node might matter and in another role (label) a different set (that might intersect with the first one).
We stayed away from inheritance as it opens a new can of worms, and we favor composition. So you'd rather compose a node whole as the sum of its parts. You can also mimic an inheritance by also attaching the "super"-types as labels to the child elements in your hierarchy.
Node labels can also be used to separate subgraphs in a larger graph, e.g. label the proteins that are active in human pathways and phylo pathways with those labels. So you can quickly select a part of the graph that you're interested in.
Those separate subgraphs can also come from different domains, like geo,social,catalogue,supplier that are combined in a single graph.
And multiple labels also make sense to separate "technical" namespaces of your graph that are used to represent "in-graph-indexes" from your "domain"-labels.
Regarding uniqueness - all uniqueness constraints for the existing labels and properties on your nodes are enforced at the same time. If they cannot be resolved on insert or update the operation will fail.