Overpass query: absence of [maxsize] returns significantly smaller results

I have two overpass queries.
node(33.68336,-117.89466,34.14946,-117.03498);
way["highway"~"motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|secondary_link|tertiary|tertiary_link|road|residential|service"](bn);
(._;>;);
out;
The query above returns an osm.xml file that is 167.306 kB in size.
[out:xml][maxsize:2000000000];
(
node(33.68336,-117.89466,34.14946,-117.03498);
way["highway"~"motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|seconda ry_link|tertiary|tertiary_link|road|residential|service"](bn);
(._;>;);
);
out;
The second query returns a file that is 618.994 kB in size. Why does the second query return a significantly bigger result? Does the first query not give me the full dataset? Is there a way to get the same result with both queries? (The absence of [maxsize] sometimes leads to an error…)

I feel that there is something missing from your query:
node(33.68336,-117.89466,34.14946,-117.03498); should return all the nodes in this area, which is a lot of data.
Then the second line:
way"highway"~“motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|secondary_link|tertiary|tertiary_link|road|residential|service”;
gives an error, as it should be written with brackets and straight quotes, like so:
way["highway"~"motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|secondary_link|tertiary|tertiary_link|road|residential|service"];
But this looks for all the roads in the world, and your first query is no longer used, as your output is only the second query. That is a huge amount of data, probably in the GB range.
So I don't see how you would get only 167 kB. I assume you must have a bounding box or some other filter that you did not mention.
In your second example, however, you make a union of the two queries by putting them in parentheses:
(... ; ...;); out; so you would get all the nodes in the area and all the roads in the world. And again, if you have an extra bounding box or filter, you might get only 619 kB. Supposing that there are a lot of non-road nodes, it makes sense that you get more data, as you get the union of the two searches (all nodes + nodes from the roads).
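For what it's worth, if the goal is only the road network plus its member nodes, the first query's shape already produces that, and the size difference comes from the outer union pulling every bbox node back into the output. A minimal, untested sketch that keeps the first query's output while still raising the memory limit:
[out:xml][maxsize:2000000000];
node(33.68336,-117.89466,34.14946,-117.03498);
way["highway"~"motorway|motorway_link|trunk|trunk_link|primary|primary_link|secondary|secondary_link|tertiary|tertiary_link|road|residential|service"](bn);
// output only the ways and the nodes they reference, not every node in the bbox
(._;>;);
out;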

Related

How to get all of the way IDs that each node is a part of

So I am trying to build an Overpass / OSM query that will, in effect, find me all of the nodes that are part of multiple road segments, or 'ways'. I have a challenge in that I am dealing with a somewhat large area (Norfolk VA, 100,000 nodes), so I'm trying to find a somewhat performant query.
The following query is useful in that it provides all of the nodes, something I need to iterate over, as any node could be part of another way:
[out:json][timeout:25];
{{geocodeArea:Norfolk, VA}}->.searchArea;
(
  way["highway"](area.searchArea);
  node(w);
);
// print results
out body;
>;
out skel qt;
I also found this query, which returns every node that is part of multiple ways. It is very useful, but it is also a very non-performant query, O(n^2), and it scales to an entire city very poorly.
way({{bbox}})->.always;
foreach .always -> .currentway(
  // every way except the one currently being examined
  (.always; - .currentway;)->.allotherways;
  node(w.currentway)->.e;
  node(w.allotherways)->.f;
  // nodes that appear in both sets, i.e. shared with another way
  node.e.f;
  (._ ; .result;) -> .result;
);
.result out meta;
I think the minimum useful information I need is to have all of the node IDs returned as they are associated with each way (kinda like a map/dict), but I'm really struggling to figure out whether it is possible to make such a call. Appreciate your input!
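One possibly useful observation (a sketch, not a tested answer): in JSON output, out body on the ways alone already returns each way together with its ordered list of member node IDs, which is essentially the way-to-nodes map described above. A minimal sketch, assuming the same Norfolk search area:
[out:json][timeout:25];
{{geocodeArea:Norfolk, VA}}->.searchArea;
way["highway"](area.searchArea);
// each way element in the JSON response carries a "nodes" array of member node IDs,
// which client code can invert into a node -> ways map in a single pass
out body;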

Is there a way to get results for an overpass query paginated?

Let's say I want to get restaurants in Berlin and I have this query:
[out:json];
area["boundary"="administrative"]["name"="Berlin"] -> .a;
(
node(area.a)["amenity"="restaurant"];
); out center;
Let's say this result set is too big to extract in just one request to overpass. I would like to be able to use something like SQL's OFFSET and LIMIT arguments to get the first 100 results (0-99), process them, then get the next 100 (100-199) and so on.
I can't find an option to do that in the API, is it possible at all? If not, how should I query my data to get it divided into smaller sets?
I know I can increase the memory limit or the timeout, but this still leaves me handling one massive request instead of n small ones, which is how I would like to do it.
OFFSET is not supported by the Overpass API, but you can limit the number of results returned by the query via an additional parameter in the out statement. The following example would return only 100 restaurants in Berlin:
[out:json];
area["boundary"="administrative"]["name"="Berlin"] -> .a;
(
node(area.a)["amenity"="restaurant"];
); out center 100;
One approach to limiting the overall data volume could be to count the number of objects in a bounding box and, if that number is too large, split the bounding box into 4 parts. Counting is supported via out count;. Once the number of objects is feasible, just use out; to get the results.
node({{bbox}})["amenity"="restaurant"];
out count;
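A rough client-side sketch of that split-and-count loop in Python; the endpoint URL and the JSON shape of the out count response are assumptions based on the public Overpass instance, so treat this as illustrative rather than definitive:
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # assumed public endpoint

def count_restaurants(south, west, north, east):
    """Ask Overpass how many matching nodes the bbox contains."""
    query = (f'[out:json];'
             f'node({south},{west},{north},{east})["amenity"="restaurant"];'
             f'out count;')
    reply = requests.post(OVERPASS_URL, data={"data": query})
    reply.raise_for_status()
    # "out count;" returns a single element whose tags hold the totals as strings
    return int(reply.json()["elements"][0]["tags"]["nodes"])

def fetch_in_chunks(south, west, north, east, limit=100, results=None):
    """Recursively quarter the bbox until each piece holds at most `limit` nodes."""
    if results is None:
        results = []
    if count_restaurants(south, west, north, east) > limit:
        mid_lat, mid_lon = (south + north) / 2, (west + east) / 2
        for box in [(south, west, mid_lat, mid_lon),
                    (south, mid_lon, mid_lat, east),
                    (mid_lat, west, north, mid_lon),
                    (mid_lat, mid_lon, north, east)]:
            fetch_in_chunks(*box, limit=limit, results=results)
    else:
        query = (f'[out:json];'
                 f'node({south},{west},{north},{east})["amenity"="restaurant"];'
                 f'out center;')
        reply = requests.post(OVERPASS_URL, data={"data": query})
        reply.raise_for_status()
        results.extend(reply.json()["elements"])
    return results
Calling fetch_in_chunks with a Berlin-sized bounding box would then fetch the data in pieces of at most 100 results each, at the cost of one extra counting request per box.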

Data Lake Analytics - Large vertex query

I have a simple query which makes a GROUP BY using two fields:
@facturas =
    SELECT a.CodFactura,
           Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
           SUM(a.Consumo) AS Consumo
    FROM @table_facturas AS a
    GROUP BY a.CodFactura, a.DateKey;
@table_facturas has 4100 rows, but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2500 vertices because I have 2500 unique CodFactura+DateKey rows. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query will actually compile. You would need the Convert expression in your GROUP BY, or do it in a previous SELECT statement.
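For illustration, a sketch of the first option, repeating the Convert expression in the GROUP BY (untested, and assuming the rowset names from the question):
@facturas =
    SELECT a.CodFactura,
           Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
           SUM(a.Consumo) AS Consumo
    FROM @table_facturas AS a
    // group by the computed expression itself, not the output alias
    GROUP BY a.CodFactura, Convert.ToInt32(a.Fecha.ToString("yyyyMMdd"));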
Secondly: in order to answer your question, we would need to know how the full query is defined. Where does @table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If @table_facturas is coming from an actual U-SQL table, your table is over-partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try to hint a small number of rows in the query that creates @table_facturas by adding OPTION(ROWCOUNT=4000).

Multi-Select Query Through PHRETS RETS System

I've got a system running RETS through the PHRETS library. I have a form that runs a query to pull out results, and we're adding in multi-select boxes.
So far, my code looks like this for the query: (SUB_AREA_NAME=|AreaA,AreaB,AreaC,AreaD)
This works for allowing many results to come up. Problem is this:
For some reason, the system is doing an 'and' operation instead of an 'or' operation. So any time we search more than one place, if any of the results come up empty, they will all come up empty.
For example:
Let's say AreaA has 3 houses, AreaB has 0 houses, AreaC has 10 houses, and AreaD has 1 house.
If you look up:
AreaA + AreaC you will get 13 results.
AreaA + AreaC + AreaD you will get 14 results.
AreaD by itself you will get 1 result.
AreaA + AreaB you will get 0 results.
AreaA + AreaB + AreaC + AreaD you will get 0 results.
Basically, because AreaB has no results, if you query that area with any other area that does have results, it will still come up as no results.
I need to know how to query multiple selections from one category, while showing all the results even if one area doesn't have any.
Thanks.
Some (most) RETS server implementations are not done correctly. Your query is right according to RETS specs. You just need to find out what will work for your particular situation.
For example, you could try ((SUB_AREA_NAME=AreaA)|(SUB_AREA_NAME=AreaB)|(SUB_AREA_NAME=AreaC)|(SUB_AREA_NAME=AreaD)) and see if that works.
In some cases I've seen this work; notice I removed the pipe even though that is the OR conjunction: (SUB_AREA_NAME=AreaA,AreaB,AreaC,AreaD)
Other times it won't work with the commas and you need to use 4 separate queries.
And even other times I have seen the server foul up and not encode the commas properly, so you have to do something like this: (SUB_AREA_NAME=|AreaA%2CAreaB%2CAreaC%2CAreaD)
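For the separate-queries fallback, that would mean running each of these as its own search and merging the result sets client-side (same field name as above, purely illustrative):
(SUB_AREA_NAME=AreaA)
(SUB_AREA_NAME=AreaB)
(SUB_AREA_NAME=AreaC)
(SUB_AREA_NAME=AreaD)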

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual paging on the number of results.
And I'm now faced with the challenge of whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for a speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not the LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than written to, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
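A sketch of one way to realize this; note it uses a persisted computed column rather than the insert/update trigger suggested above, and the table and column names are illustrative:
-- add a persisted computed column holding the lowercased first letter
ALTER TABLE dbo.MyTable
    ADD name_first_char_lower AS LOWER(LEFT(name, 1)) PERSISTED;

-- index it so the equality filter becomes an index seek
CREATE INDEX IX_MyTable_name_first_char_lower
    ON dbo.MyTable (name_first_char_lower);

-- the paging filter then avoids any per-row function evaluation
SELECT * FROM dbo.MyTable WHERE name_first_char_lower = @firstletter;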
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER, 5) = 'PCNSF'
    or LEFT(VOUCHER, 5) = 'PCLTF'
    or LEFT(VOUCHER, 5) = 'PCACH'
    or LEFT(VOUCHER, 4) = 'PCWP'
    or LEFT(VOUCHER, 5) = 'PCINT')
Returned 1434 rows in 1 min 27 seconds.
My data is faster with the LEFT version. As an aside, my overall query does hit some indexes.
I would always suggest using the LIKE operator when the search column has an index. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)='ABA' OR ... up to 9 OR clauses. My count returned 7,301,477 records, taking 4 seconds with LEFT and 1 second with LIKE, i.e. where column_name like 'AAA%' OR column_name like 'ABA%' OR ... up to 9 LIKE clauses.
Calling a function in the WHERE clause is not a best practice. See http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.StartsWith(...) and you'll get just a LIKE function in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework (non-Core) EntityFunctions, so I'm not sure how to do it in EF6.
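For concreteness, a minimal EF Core sketch of that call; Person, AppDbContext, and the property names are hypothetical stand-ins:
using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public static class NameSearch
{
    // Translates to roughly: WHERE [Name] LIKE @pattern, where @pattern = firstLetter + "%"
    public static List<Person> ByFirstLetter(AppDbContext db, string firstLetter)
        => db.People
             .Where(p => EF.Functions.Like(p.Name, firstLetter + "%"))
             .ToList();
}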