Is there a way to get results for an overpass query paginated?

Let's say I want to get restaurants in Berlin and I have this query:
[out:json];
area["boundary"="administrative"]["name"="Berlin"] -> .a;
(
node(area.a)["amenity"="restaurant"];
); out center;
Let's say this result set is too big to extract in just one request to overpass. I would like to be able to use something like SQL's OFFSET and LIMIT arguments to get the first 100 results (0-99), process them, then get the next 100 (100-199) and so on.
I can't find an option to do that in the API, is it possible at all? If not, how should I query my data to get it divided into smaller sets?
I know I can increase the memory limit or the timeout, but this still leaves me handling one massive request instead of n small ones, which is how I would like to do it.

OFFSET is not supported by Overpass API, but you can limit the number of results returned by the query via an additional parameter in the out statement. The following example would return only 100 restaurants in Berlin:
[out:json];
area["boundary"="administrative"]["name"="Berlin"] -> .a;
(
node(area.a)["amenity"="restaurant"];
); out center 100;
One approach to limit the overall data volume is to count the number of objects in a bounding box and, if that number is too large, split the bounding box into 4 parts. Counting is supported via out count;. Once the number of objects is feasible, just use out; to get the results.
node({{bbox}})["amenity"="restaurant"];
out count;
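If it helps, here is a minimal Python sketch of that splitting strategy. It is a rough illustration, not part of the original answer: it assumes the public endpoint at https://overpass-api.de/api/interpreter, the requests library, and the restaurant filter from the question.

import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public endpoint (assumption)
MAX_OBJECTS = 100  # process at most this many objects per request

def count_objects(south, west, north, east):
    # Ask Overpass how many matching nodes the bbox contains (out count;)
    query = f'[out:json];node({south},{west},{north},{east})["amenity"="restaurant"];out count;'
    r = requests.post(OVERPASS_URL, data={"data": query})
    r.raise_for_status()
    return int(r.json()["elements"][0]["tags"]["total"])

def fetch_objects(south, west, north, east):
    # Fetch the actual objects once the bbox is small enough
    query = f'[out:json];node({south},{west},{north},{east})["amenity"="restaurant"];out center;'
    r = requests.post(OVERPASS_URL, data={"data": query})
    r.raise_for_status()
    return r.json()["elements"]

def fetch_in_chunks(south, west, north, east):
    # Recursively quarter the bbox until each part holds few enough objects
    if count_objects(south, west, north, east) <= MAX_OBJECTS:
        yield from fetch_objects(south, west, north, east)
        return
    mid_lat, mid_lon = (south + north) / 2, (west + east) / 2
    for box in [(south, west, mid_lat, mid_lon), (south, mid_lon, mid_lat, east),
                (mid_lat, west, north, mid_lon), (mid_lat, mid_lon, north, east)]:
        yield from fetch_in_chunks(*box)

Each leaf request then stays small, at the cost of extra count queries; remember to throttle your requests so you stay within the server's usage policy.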

Related

Overpass query. Absence of [maxsize] returns significantly smaller results

I have two overpass queries.
node(33.68336,-117.89466,34.14946,-117.03498);
way["highway"~"motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|secondary_link|tertiary|tertiary_link|road|residential|service"](bn);
(._;>;);
out;
The query above returns an osm.xml file that is 167.306 kb big.
[out:xml][maxsize:2000000000];
(
node(33.68336,-117.89466,34.14946,-117.03498);
way["highway"~"motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|seconda ry_link|tertiary|tertiary_link|road|residential|service"](bn);
(._;>;);
);
out;
The second query returns a file that is 618.994 kb big. Why does the second query return a significantly bigger result? Does the first query not give me the full dataset? Is there a way to get the same result with both queries? (The absence of [maxsize] sometimes leads to an error…)
I feel that there is something missing from your query:
node(33.68336,-117.89466,34.14946,-117.03498); should return all the nodes in this area, which is a lot of data.
then the second line:
way"highway"~“motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|secondary_link|tertiary|tertiary_link|road|residential|service”;
gives an error, as it should be written with brackets and straight quotes, like so:
way["highway"~"motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|secondary_link|tertiary|tertiary_link|road|residential|service"];
but this looks for all the roads in the world, and your first query is not used any more, since the output comes only from the second query. But that is a huge amount of data, probably in the GB range.
So I don't see how you would get only 167 kB. I assume you must have a bounding box or some other filter that you did not mention.
But in your second example, you make a union of the two queries, as you put them in brackets:
(... ; ...;); out; so you would get all the nodes in the area and all the roads in the world. And again, if you have an extra bounding box or filter, you might get only 619 kB. Supposing there are a lot of non-road nodes, it makes sense that you get more data, as you get the union of the two searches (all nodes + nodes from the roads).

Remove rows from search expression solr

I'm trying to search my large dataset for the items whose attribute matches the function given below, but I'm facing a problem here.
The rows parameter only selects the first 300 objects, which the function then filters for matching results, but I'm trying to search the whole index, not just the first few. How can I rewrite this to achieve that?
having(
select(search(myIndex,q="*:*", fl="*", rows=300),
id,
dotProduct(ATTRIBUTE, array(4,5,2)) as prod,
l1norm(array(1,2,3)) as a,
l1norm(ATTRIBUTE) as b,
div(prod, add(a, sub(b, prod))) as c
), and(gteq(c, 5), lteq(c, 8)))
The simplest approach would be to increase the number of rows to cover the number of entries in the index.
However if this number is huge, you should probably use the /export request handler instead of a regular select-like handler.
The /export request handler allows a fully sorted result set to be streamed out of Solr using a special rank query parser and response writer. These have been specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
Depending on your needs, you could also issue multiple queries and page through the results using the start and rows parameters, or, if the number of entries is not known by the client code, use cursorMark.
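As a rough illustration of the cursorMark approach (my sketch, not from the Solr docs: it assumes Solr at http://localhost:8983, a collection named myIndex, and id as the uniqueKey):

import requests

SOLR_SELECT = "http://localhost:8983/solr/myIndex/select"  # hypothetical address/collection

def fetch_all_docs(rows=300):
    # Page through the whole index with cursorMark instead of start/rows
    cursor = "*"
    while True:
        params = {
            "q": "*:*",
            "fl": "*",
            "rows": rows,
            "sort": "id asc",  # cursorMark requires a sort that includes the uniqueKey
            "cursorMark": cursor,
        }
        resp = requests.get(SOLR_SELECT, params=params).json()
        yield from resp["response"]["docs"]
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # getting the same cursor back means no more results
            break
        cursor = next_cursor

Unlike a large start offset, cursorMark stays cheap on deep pages because Solr does not have to skip over all the preceding documents.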

How to get all of the way IDs that each node is a part of

So I am trying to build an overpass / osm query that will, in effect, find all of the nodes that are part of multiple road segments, or 'ways'. One challenge is that I am dealing with a somewhat large area (Norfolk VA, 100,000 nodes), so I'm trying to find a reasonably performant query.
The following query is useful in that it provides all of the nodes, something I need to iterate over, as any node could be part of another way:
[out:json][timeout:25];
{{geocodeArea:Norfolk, VA}}->.searchArea;
(
(
way["highway"](area.searchArea);
node(w);
);
);
// print results
out body;
>;
out skel qt;
I also found this query, which returns every node that is part of multiple ways. Very useful, however it is a very non-performant query, O(n^2), and scales to an entire city very poorly.
way({{bbox}})->.always;
foreach .always -> .currentway(
(.always; - .currentway;)->.allotherways;
node(w.currentway)->.e;
node(w.allotherways)->.f;
node.e.f;
(._ ; .result;) -> .result;
);
.result out meta;
I think the minimum useful information I need is to have all of the node IDs returned as they are associated with each way (kind of like a map/dict), but I'm really struggling to figure out whether it is possible to make such a call. Appreciate your input!
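For what it's worth, the way-to-node associations described here are already present in the JSON output of the first query: with [out:json], every way element carries its ordered member node IDs in a nodes array. Below is a minimal Python sketch of inverting that mapping client-side (the file name is hypothetical; it assumes the query was run with [out:json] and out body;):

import json
from collections import defaultdict

# Load the saved JSON response of the way["highway"] query above
with open("norfolk_highways.json") as f:  # hypothetical file name
    data = json.load(f)

ways_per_node = defaultdict(list)  # node id -> ids of the ways containing it
for element in data["elements"]:
    if element["type"] == "way":
        for node_id in element["nodes"]:
            ways_per_node[node_id].append(element["id"])

# Nodes that belong to two or more ways, i.e. likely intersections
shared = {n: w for n, w in ways_per_node.items() if len(w) > 1}
print(f"{len(shared)} nodes are part of more than one way")

This keeps the Overpass side down to a single query and moves the quadratic-looking comparison into one linear pass over the response.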

Data Lake Analytics - Large vertex query

I have a simple query which does a GROUP BY using two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows, but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2500 vertices because I have 2500 unique CodFactura+DateKey rows. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query will actually compile. You would need the Convert expression in your GROUP BY, or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
you originally inserted a lot of data with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try hinting a small number of rows in the query that creates #table_facturas by adding OPTION(ROWCOUNT=4000).

Gremlin query to find the count of a label for all the nodes

Sample query
The following query returns the count of a label, say "Asset", for a particular id (0):
g.V().hasId(0).repeat(out()).emit().hasLabel('Asset').count()
But I need to find the count for all the nodes that are present in the graph with a condition as above.
I am able to do it individually, but my requirement is to get the count for all the nodes that have that label, say 'Asset'.
So I am expecting something like
{v[0]: 2}
{v[1]: 1}
{v[2]: 1}
where v[1] and v[2] each have a node under them with the label "Asset", making the overall count for v[0] equal to 2.
There are a few ways you could do it. It's maybe a little weird, but you could use group():
g.V().
group().
by().
by(repeat(out()).emit().hasLabel('Asset').count())
or you could do it with select() and then you don't build a big Map in memory:
g.V().as('v').
map(repeat(out()).emit().hasLabel('Asset').count()).as('count').
select('v','count')
if you want to maintain hierarchy you could use tree():
g.V(0).
repeat(out()).emit().
tree().
by(project('v','count').
by().
by(repeat(out()).emit().hasLabel('Asset').count()).select(values))
Basically you get a tree from vertex 0 and then apply a project() over that to build that structure per vertex in the tree. I had a different way to do it using union, but I found a possible bug and had to come up with a different method (actually the Gremlin Guru, Daniel Kuppitz, came up with the above approach). I think the use of project is more natural and readable, so it's definitely the better way. Of course, as Mr. Kuppitz pointed out, with project you create an unnecessary Map (which you immediately get rid of with select(values)). The use of union would be better in that sense.
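If you are driving this from client code, here is a minimal gremlinpython sketch of the group() variant (my addition; it assumes a Gremlin Server reachable at ws://localhost:8182/gremlin):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __

# Connect to a remote Gremlin Server (the address is an assumption)
conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# One round trip returns a dict keyed by vertex, e.g. {v[0]: 2, v[1]: 1, v[2]: 1}
counts = (g.V().
          group().
          by().
          by(__.repeat(__.out()).emit().hasLabel('Asset').count()).
          next())
print(counts)

conn.close()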