I have two datasets:
A "customer" dataset with customer names and geographical coordinates (x,y)
A "stations" dataset with station names and geographical coordinates (x,y)
What I need to do:
For each customer, find the nearest station in the "stations" dataset.
At the end, I need a dataset with:
customer_name, customerX, customerY, nearest_station_name, nearest_station_x, nearest_station_y
Definition of "nearest":
For example, for customer "c", where
s1 is station 1
s2 is station 2
if ((Xs1-Xc)² + (Ys1-Yc)²) < ((Xs2-Xc)² + (Ys2-Yc)²), then the nearest station is s1
if ((Xs1-Xc)² + (Ys1-Yc)²) = ((Xs2-Xc)² + (Ys2-Yc)²), then either station can be taken as the nearest
if ((Xs1-Xc)² + (Ys1-Yc)²) > ((Xs2-Xc)² + (Ys2-Yc)²), then the nearest station is s2
That means I need to know, for each customer and each station i, the value of (Xsi-Xc)² + (Ysi-Yc)².
Do you know if I can do that in Spark Scala, Spark SQL, or BigQuery without having to write a UDF?
Thank you for your help.
I tried, for every customer, looping through the list of stations to find the nearest one, but that is too complicated and would have to be a UDF, which I want to avoid unless it is mandatory:
// Linear scan: keep the station with the smallest squared distance to the customer.
double nearestStationDistance = Double.MAX_VALUE;
Station nearestStation = null;
for (Station station : stations) {
    double dx = station.x - customer.x;
    double dy = station.y - customer.y;
    double distance = dx * dx + dy * dy;
    if (distance < nearestStationDistance) {
        nearestStationDistance = distance;
        nearestStation = station;
    }
}
return nearestStation;
Afterwards, I would extract the name and coordinates from the "Station" object to complete the customer dataset.
I wrote a couple of posts about doing this in BigQuery:
https://mentin.medium.com/nearest-neighbor-in-bigquery-gis-7d50ebd5d63
https://mentin.medium.com/nearest-neighbor-using-bq-scripting-373241f5b2f5
The solution is simple to express in SQL:
SELECT
a.id,
ARRAY_AGG(b.id ORDER BY ST_Distance(a.geog, b.geog) LIMIT 1)
[ORDINAL(1)] as neighbor_id
FROM people_table a CROSS JOIN restaurant_table b
GROUP BY a.id
But that solution does not scale when tables are large, and the posts discuss options to speed things up.
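For the plain (x, y) case in the question, the same idea can be sketched without any UDF: cross join the two tables, rank the stations per customer by squared distance, and keep the first one. The sketch below runs the SQL through Python's built-in sqlite3 module purely so that it is self-contained; the sample rows are made up, customer names are assumed to be unique, and the query should carry over to Spark SQL or BigQuery with little change, though I have not run it there.

```python
import sqlite3

# In-memory database with made-up sample data (assumption: customer names are unique).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (name TEXT, x REAL, y REAL);
    CREATE TABLE stations (name TEXT, x REAL, y REAL);
    INSERT INTO customer VALUES ('c1', 0.0, 0.0), ('c2', 10.0, 10.0);
    INSERT INTO stations VALUES ('s1', 1.0, 1.0), ('s2', 9.0, 8.0), ('s3', 5.0, 5.0);
""")

# Cross join every customer with every station, rank by squared distance,
# and keep rank 1. Ties are broken by station name so the result is deterministic.
query = """
    SELECT customer_name, customer_x, customer_y,
           nearest_station_name, nearest_station_x, nearest_station_y
    FROM (
        SELECT c.name AS customer_name, c.x AS customer_x, c.y AS customer_y,
               s.name AS nearest_station_name,
               s.x AS nearest_station_x, s.y AS nearest_station_y,
               ROW_NUMBER() OVER (
                   PARTITION BY c.name
                   ORDER BY (s.x - c.x) * (s.x - c.x) + (s.y - c.y) * (s.y - c.y), s.name
               ) AS rn
        FROM customer c CROSS JOIN stations s
    )
    WHERE rn = 1
    ORDER BY customer_name
"""
nearest = conn.execute(query).fetchall()
for row in nearest:
    print(row)
```

Like the ARRAY_AGG version, this compares every customer with every station, so it has the same scaling problem on large tables.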
I have a Postgres database with the PostGIS extension installed and filled with OpenStreetMap data.
With the following SQL statement:
SELECT
l.osm_id,
sum(
st_area(st_intersection(ST_Buffer(l.way, 30), p.way))
/
st_area(ST_Buffer(l.way, 30))
) as green_fraction
FROM planet_osm_line AS l
INNER JOIN planet_osm_polygon AS p ON ST_Intersects(l.way, ST_Buffer(p.way,30))
WHERE p.natural in ('water') or p.landuse in ('forest') GROUP BY l.osm_id;
I calculate a "green" score.
My goal is to create a "green" score for each osm_id,
that is, how much of a road is near water, forest, or something similar.
For example, a road that is enclosed by a park would have a score of 1.
A road that only runs along a river for a short stretch would have a score of, for example, 0.4.
At least, that is my expectation.
But on inspecting the result of this calculation, I sometimes get values such as
212.11701212511463 for the road with OSM ID -647522
and 82 for the road with OSM ID -6497265.
I do get values between 0 and 1 too, but I don't understand why I also get such huge values.
What am I missing?
I was expecting values between 0 and 1.
Using a custom unique ID that you must populate, the query can also union any overlapping polygons, so that an area covered by several polygons is counted only once:
SELECT
l.uid,
st_area(
ST_UNION(
st_intersection(ST_Buffer(l.way, 30), p.way))
) / st_area(ST_Buffer(l.way, 30)) as green_fraction
FROM planet_osm_line AS l
INNER JOIN planet_osm_polygon AS p
ON st_dwithin(l.way, p.way,30)
WHERE p.natural in ('water') or p.landuse in ('forest')
GROUP BY l.uid;
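To see why summing per-polygon fractions can exceed 1 while the union cannot, here is a toy sketch in plain Python. Areas are modelled as sets of unit grid cells instead of real geometries, and the shapes are invented for illustration; it only mimics the arithmetic of the two queries, not PostGIS itself.

```python
# Buffer around the road, modelled as a set of unit grid cells (toy data).
road_buffer = {(x, y) for x in range(10) for y in range(2)}  # 20 cells

# Two "green" polygons that overlap each other and both cover the whole buffer.
park = {(x, y) for x in range(-2, 12) for y in range(-1, 3)}
forest = {(x, y) for x in range(0, 10) for y in range(0, 5)}
polygons = [park, forest]

buffer_area = len(road_buffer)

# SUM(area(intersection) / area(buffer)): overlapping cover is counted once per polygon.
summed_fraction = sum(len(road_buffer & p) / buffer_area for p in polygons)

# area(UNION(intersections)) / area(buffer): every covered cell is counted once.
covered = set().union(*(road_buffer & p for p in polygons))
union_fraction = len(covered) / buffer_area

print(summed_fraction, union_fraction)
```

With many overlapping polygons along a road, the summed version grows with the number of polygons, while the union version cannot exceed 1.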
I have a query that should tell me, relative to all computed distances, what fraction of locations are less than 100 meters away:
select person_tbl.tdm, sum((st_distance (person_tbl.geo, location_tbl.geo) < 100)::INT)::FLOAT / count(*)
from persons as person_tbl, locations as location_tbl
where person_tbl.geo is not null
group by person_tbl.tdm
Both tables have an index on the geometry column:
create index idx_persons_geo on persons using gist(geo)
create index idx_locations_geo on locations using gist(geo)
In the first table (persons), the values of the geo field are POLYGON.
In the second table (locations), the values of the geo field are POINT Z, POLYGON Z, or MULTIPOLYGON Z.
The persons table contains ~2M rows and the locations table contains ~500 rows.
The query takes too long (~2 hours).
The values of max_parallel_processes and max_parallel_workers are both 8.
Is there something I can do to reduce the query time (2 hours seems too long)?
Is there a better way to write the query, or do I need to define the indexes in another way?
I am trying to calculate the average of a column in Tableau, except the problem is I am trying to use a single date value (based on filter) from another data source to only calculate the average where the exam date is <= the filtered date value from the other source.
Note: Parameters will not work for me here, since new date values are being added constantly to the set.
I have tried many different approaches, but the simplest was trying to use a calculated field that pulls in the filtered exam date from the other data source.
It successfully can pull the filtered date, but the formula does not work as expected. 2 versions of the calculation are below:
IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN AVG([Raw Score]) END
IF DATEDIFF('day', DATE(ATTR([Exam Date])), DATE(ATTR([Averages (Tableau Test Scores)].[Updated]))) > 1 THEN AVG([Raw Score]) END
Basically, I am looking for the equivalent of this in SQL Server:
SELECT AVG([Raw Score]) WHERE ExamDate <= (Filtered Exam Date)
Below is a workbook that shows an example of what I am trying to accomplish. Currently it returns all blanks, likely due to the many-to-one comparison I am trying to use in my calculation.
Any feedback is greatly appreciated!
Tableau Test Exam Workbook
I was able to solve this by using Custom SQL to join the tables together and calculate the average based on my conditions, to get the column results I wanted.
Would still be great to have this ability directly in Tableau, but whatever gets the job done.
Edit:
SELECT
[AcademicYear]
,[Discipline]
--Get the number of student takers
,COUNT([Id]) AS [Students (N)]
--Get the average of the Raw Score
,CAST(AVG(RawScore) AS DECIMAL(10,2)) AS [School Mean]
--Get the number of failures based on an "adjusted score" column
,COUNT(CASE WHEN [AdjustedScore] < 70 THEN 1 END) AS [School Failures]
--This is the column used as the cutoff point for including scores
,[Average_Update].[Updated]
FROM [dbo].[Average] [Average]
FULL OUTER JOIN [dbo].[Average_Update] [Average_Update] ON ([Average_Update].[Id] = [Average].UpdateDateId)
--The meat of joining data for accurate calculations
FULL OUTER JOIN (
SELECT DISTINCT S.[Id], S.[LastName], S.[FirstName], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[Subject], P.[Id] AS PeriodId
FROM [StudentScore] S
FULL OUTER JOIN
(
--Get only the 1st attempt
SELECT DISTINCT [NBOMEId], S2.[Subject], MIN([ExamDate]) AS ExamDate
FROM [StudentScore] S2
GROUP BY [NBOMEId],S2.[Subject]
) B
ON S.[NBOMEId] = B.[NBOMEId] AND S.[Subject] = B.[Subject] AND S.[ExamDate] = B.[ExamDate]
--Group in "Exam Periods" based on the list of periods w/ start & end dates in another table.
FULL OUTER JOIN [ExamPeriod] P
ON S.[ExamDate] >= P.PeriodStart AND S.[ExamDate] <= P.PeriodEnd
WHERE S.[Subject] = B.[Subject]
GROUP BY P.[Id], S.[Subject], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[NBOMEId], S.[NBOMELastName], S.[NBOMEFirstName], S.[SecondYrTake]) [StudentScore]
ON
([StudentScore].PeriodId = [Average_Update].ExamPeriodId
AND [StudentScore].Subject = [Average].Subject
AND [StudentScore].[ExamDate] <= [Average_Update].[Updated])
--End meat
--Joins to pull in relevant data for normalized tables
FULL OUTER JOIN [dbo].[Student] [Student] ON ([StudentScore].[NBOMEId] = [Student].[NBOMEId])
INNER JOIN [dbo].[ExamPeriod] [ExamPeriod] ON ([Average_Update].ExamPeriodId = [ExamPeriod].[Id])
INNER JOIN [dbo].[AcademicYear] [AcademicYear] ON ([ExamPeriod].[AcademicYearId] = [AcademicYear].[Id])
--This will pull only the latest update entry for every academic year.
WHERE [Updated] IN (
SELECT DISTINCT MAX([Updated]) AS MaxDate
FROM [Average_Update]
GROUP BY[ExamPeriodId])
GROUP BY [AcademicYear].[AcademicYearText], [Average].[Subject], [Average_Update].[Updated]
ORDER BY [AcademicYear].[AcademicYearText], [Average_Update].[Updated], [Average].[Subject]
I couldn't download your file to test with your data, but try reversing the order in which you take the average, i.e.
AVG(IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN [Raw Score] END)
As written, I believe you are averaging the data before returning it from the IF statement, whereas you want to return the data first and then average it.
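The difference between the two orders can be sketched outside Tableau. The rows and the cutoff date below are invented, and this is plain Python rather than a Tableau calculation; it only illustrates "apply the condition per row, then average" versus "average first".

```python
from datetime import date
from statistics import mean

# Hypothetical exam rows: (exam_date, raw_score).
rows = [
    (date(2020, 1, 10), 80.0),
    (date(2020, 2, 15), 90.0),
    (date(2020, 3, 20), 40.0),
]
cutoff = date(2020, 2, 28)  # stands in for the filtered [Updated] date

# Condition applied per row first, then aggregated (the order suggested above).
avg_up_to_cutoff = mean(score for exam_date, score in rows if exam_date <= cutoff)

# Aggregated first: the mean already includes the row after the cutoff.
avg_all = mean(score for _, score in rows)

print(avg_up_to_cutoff, avg_all)
```

The first value is also what the SQL `SELECT AVG([Raw Score]) WHERE ExamDate <= ...` from the question would return for these rows.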
I am working in Tableau and trying to figure out how to create a filter exclusion. For example I have the following fields.
Hospital CallType CallDate
I want to filter out all hospitals where at least one of the calls has a CallType of ColdCall and a CallDate between X and Y.
I can do this easily in SQL but don't have access to this data in the SQL Database. It would be the following:
Select
Hospital
,CallType
,CallDate
Into
#TempTable
From
Database
Select
Hospital
,CallType
,CallDate
Into
#ExclusionTable
From
Database
Where
CallType = 'Cold'
and
CallDate Between X and Y
Select
Hospital
,CallType
,CallDate
From
#TempTable
Where
Hospital not in
(Select
Hospital
From
#ExclusionTable)
Any suggestions would be greatly appreciated.
Thanks,
Simple. Create a calculated field called Filter:
IF CallType = "Cold" AND CallDate >= X AND CallDate <= Y
THEN 1
ELSE 0
END
Then drag Hospital to Filters, go to the Condition tab, select By field, pick your Filter field, and use sum > 0. That condition matches any hospital that has at least one call meeting your conditions (all the calls that don't meet them contribute zero, so if at least one call is non-zero, the sum is above 0). To filter those hospitals out, tick Exclude on the filter, or equivalently keep only hospitals where the sum = 0.
For X and Y, I'd create parameters. It's easier (and safer) than trying to write the dates directly in the field, and you can manipulate them more easily too.
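The logic of that flag-and-sum filter can be sketched in a few lines of Python. The call rows and the dates standing in for X and Y are invented, and the window is assumed to be inclusive, as in the BETWEEN from the question.

```python
from datetime import date

# Hypothetical rows: (Hospital, CallType, CallDate).
calls = [
    ("A", "Cold", date(2021, 1, 5)),
    ("A", "Follow", date(2021, 3, 1)),
    ("B", "Follow", date(2021, 1, 7)),
    ("C", "Cold", date(2020, 6, 1)),  # cold call, but outside the window
]
x, y = date(2021, 1, 1), date(2021, 1, 31)  # stand-ins for the X and Y parameters

# Per-row flag (1 if the call matches), summed per hospital.
flag_sum = {}
for hospital, call_type, call_date in calls:
    flag = 1 if call_type == "Cold" and x <= call_date <= y else 0
    flag_sum[hospital] = flag_sum.get(hospital, 0) + flag

# Keep every row of the hospitals whose sum is 0, i.e. drop hospitals with any match.
kept = [row for row in calls if flag_sum[row[0]] == 0]
print(kept)
```

Hospital A is dropped entirely, including its non-cold call, which matches what the NOT IN query against #ExclusionTable does.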
I am very new to the spatial realm of SQL Server and need some help. I have a waypoint organizing app and I am trying to generate some queries that follow along the premise of finding waypoints that are part of geographic polygons like lakes, rivers, etc. I have preloaded my tables with data I have downloaded. I used shape2sql.exe to load shapefiles into the appropriate db tables.
Tables are as follows:
Water table - id, name, geog(geography data type)
State table - id, state_name, state_abbr, geog(geography data type)
County table - id, name, state_name, geog(geography data type)
Waypoint table - id, name, lat, lon, waterid
How do I write queries against these tables to return things like:
- all waypoints in 'michigan'
- all waypoints on 'bass lake' in 'montcalm' county in 'michigan' (there are multiple bass lakes in michigan and the country hence the county/state part)
- auto assign the water id column of the waypoint table by "processing" a group of waypoints and finding what lake they actually belong to
- etc.
Thanks!
Learned so far:
select geog.ToString() as Points, geog.STArea() as Area, geog.STLength() as Length
from water
where name like '%bass lake%' and STATE = 'mi'
will return the record for Bass Lake and the polygon with the actual coordinates for the lake.
POLYGON ((-87.670498549804691 46.304831340698243, -87.670543549804691 46.307117340698241, -87.676573549804687 46.313480340698241, -87.68120854980468 46.314821340698245, -87.685168549804686 46.315703340698242, -87.6877605498047 46.313390340698241, -87.685051549804683 46.308827340698244, -87.682360549804685 46.305650340698243, -87.677734549804683 46.304768340698246, -87.674440549804686 46.304336340698242, -87.670498549804691 46.304831340698243)) 1022083.96662664 4027.52433709888
Shooting from the hip, here, but maybe like this:
UPDATE waypoints
SET waypoints.WaterId = water.Id
FROM dbo.Waypoints AS waypoints LEFT JOIN
dbo.Water AS water ON geography::Point(waypoints.Lat, waypoints.Lon, 4326).STIntersects(water.geog) = 1
This should set the WaterId on the waypoints table to one of the matching water ids from the water table.
This should get you all the waypoints on BASS LAKE
SELECT waypoints.*
FROM dbo.Waypoints as waypoints INNER JOIN
dbo.Water AS water ON geography::Point(waypoints.Lat, waypoints.Lon, 4326).STIntersects(water.geog) = 1
WHERE water.Name = 'BASS_LAKE' -- OR WHATEVER
OK - learning as I go, so here are some answers to my own questions for anyone who would like to know.
Here is one query for finding various waypoints with conditions in the where clause:
SELECT * FROM WaypointTable wp
JOIN WaterTable w
ON wp.geogcolumn.STIntersects(w.geogcolumn) = 1
WHERE w.name LIKE '%bass lake%'
AND w.state = 'mi';
Here is a query for assigning water id's to waypoints based on where they 'fit':
UPDATE wp
SET WaterID = (
SELECT ID
FROM WaterTable
WHERE geogcolumn.STIntersects(wp.geogcolumn) = 1
)
FROM WaypointTable wp;
Both of these queries work extremely well and fast! Love it!
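For anyone wondering what the "which lake does this waypoint fall in" assignment amounts to, here is a rough planar sketch in Python. It uses a simple ray-casting test on flat coordinates, which is not how STIntersects works on the geography type (that is geodesic), and the lake shapes, ids, and waypoints are all made up.

```python
def point_in_polygon(lon, lat, ring):
    """Planar ray-casting test; ring is a list of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed by the ray.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical lakes keyed by water id, and waypoints keyed by name.
lakes = {
    1: [(0, 0), (4, 0), (4, 3), (0, 3)],
    2: [(10, 10), (12, 10), (12, 12), (10, 12)],
}
waypoints = {"dock": (1.0, 1.0), "cabin": (11.0, 11.5), "road": (6.0, 6.0)}

# Assign each waypoint the id of the first lake that contains it, else None.
water_id = {
    name: next(
        (lake_id for lake_id, ring in lakes.items() if point_in_polygon(lon, lat, ring)),
        None,
    )
    for name, (lon, lat) in waypoints.items()
}
print(water_id)
```

A waypoint outside every lake ends up with None, the analogue of a NULL water id after the UPDATE.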