Using the ST_Disjoint() Function gives unexpected result - postgresql

I am fiddling around with this dataset http://s3.cleverelephant.ca/postgis-workshop-2020.zip. It is used in this workshop http://postgis.net/workshops/postgis-intro/spatial_relationships.html.
I want to identify all the features that do not have a subway station. I thought this spatial join would be rather straightforward:
SELECT
census.boroname,
COUNT(census.boroname)
FROM nyc_census_blocks AS census
JOIN nyc_subway_stations AS subway
ON ST_Disjoint(census.geom, subway.geom)
GROUP BY census.boroname;
However, the result set is way too large.
"Brooklyn" 4753693
"Manhattan" 1893156
"Queens" 7244123
"Staten Island" 2473146
"The Bronx" 2683246
When I run a test
SELECT COUNT(id) FROM nyc_census_blocks;
I get 38794 as a result. So there are far fewer features in nyc_census_blocks than in the result set from the spatial join.
Why is that? Where is the mistake I am making?

The problem is that with ST_Disjoint every record of nyc_census_blocks is joined to every subway station it is disjoint from. A block that contains no station is therefore joined to all records of nyc_subway_stations (491), and a block that does contain a station is still joined to nearly all of them. That's why you're getting such a high count.
Alternatively you can count how many subways and census blocks do intersect, e.g. in a CTE or subquery, and in another query count how many of them return 0:
WITH j AS (
SELECT
gid,census.boroname,
(SELECT count(*)
FROM nyc_subway_stations subway
WHERE ST_Intersects(subway.geom,census.geom)) AS qt
FROM nyc_census_blocks AS census
)
SELECT boroname,count(*)
FROM j WHERE qt = 0
GROUP BY boroname;
boroname | count
---------------+-------
Brooklyn | 9517
Manhattan | 3724
Queens | 14667
Staten Island | 5016
The Bronx | 5396
(5 rows)
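The blow-up can be reproduced on a toy model. The sketch below is only an illustration and does not use the workshop data: it relies on Python's built-in sqlite3 module, the hypothetical tables blocks and stations, and one-dimensional intervals and points instead of PostGIS geometries, so NOT (x BETWEEN lo AND hi) stands in for ST_Disjoint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: a block is an interval, a station is a point.
    CREATE TABLE blocks (id INTEGER PRIMARY KEY, lo REAL, hi REAL);
    CREATE TABLE stations (id INTEGER PRIMARY KEY, x REAL);
    INSERT INTO blocks VALUES (1, 0, 9), (2, 10, 19), (3, 20, 29), (4, 30, 39);
    INSERT INTO stations VALUES (1, 5), (2, 25), (3, 26);
""")

# Joining on "disjoint" yields one row per (block, station) pair that does
# not touch, which is close to the full cross product.
disjoint_pairs = conn.execute("""
    SELECT COUNT(*)
    FROM blocks b
    JOIN stations s ON NOT (s.x BETWEEN b.lo AND b.hi)
""").fetchone()[0]

# An anti-join counts each block at most once: blocks with no station at all.
empty_blocks = conn.execute("""
    SELECT COUNT(*)
    FROM blocks b
    WHERE NOT EXISTS (
        SELECT 1 FROM stations s WHERE s.x BETWEEN b.lo AND b.hi
    )
""").fetchone()[0]

print(disjoint_pairs, empty_blocks)
```

With 4 blocks and 3 stations the disjoint join already returns 9 rows, while only 2 blocks are actually without a station.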

Related

Calculating a score for each road in Openstreetmap produces unexpected result. What am I missing?

I have a Postgres database with the PostGIS extension installed and filled with OpenStreetMap data.
With the following SQL statement :
SELECT
l.osm_id,
sum(
st_area(st_intersection(ST_Buffer(l.way, 30), p.way))
/
st_area(ST_Buffer(l.way, 30))
) as green_fraction
FROM planet_osm_line AS l
INNER JOIN planet_osm_polygon AS p ON ST_Intersects(l.way, ST_Buffer(p.way,30))
WHERE p.natural in ('water') or p.landuse in ('forest') GROUP BY l.osm_id;
I calculate a "green" score.
My goal is to create a "green" score for each osm_id.
That is, how much of a road is near water, forest or something similar.
For example, a road that is enclosed by a park would have a score of 1.
A road that only runs along a river for a short stretch would have a score of, for example, 0.4.
Or so is my expectation.
But on inspecting the result of this calculation I sometimes get values of
212.11701212511463 for a road with the OSM ID -647522
and 82 for a road with OSM ID -6497265.
I do get values between 0 and 1 too, but I don't understand why I also get such huge values.
What am I missing?
I was expecting values between 0 and 1.
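The sum counts every intersecting polygon separately, so overlapping polygons can push the fraction above 1. Using a custom unique ID that you must populate, the query can instead union any overlapping polygons before measuring the area: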
SELECT
l.uid,
st_area(
ST_UNION(
st_intersection(ST_Buffer(l.way, 30), p.way))
) / st_area(ST_Buffer(l.way, 30)) as green_fraction
FROM planet_osm_line AS l
INNER JOIN planet_osm_polygon AS p
ON st_dwithin(l.way, p.way,30)
WHERE p.natural in ('water') or p.landuse in ('forest')
GROUP BY l.uid;
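Why a union helps can be seen in one dimension. The following Python sketch is an illustration under simplifying assumptions (intervals instead of polygons, made-up coordinates, hypothetical helper names): two overlapping "green" intervals cover a road buffer, and summing their intersections counts the overlap twice.

```python
def overlap(a, b):
    """Length of the intersection of two intervals given as (lo, hi)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def union_length(intervals):
    """Total length covered by a list of intervals, overlaps counted once."""
    total = 0.0
    cur_lo = cur_hi = None
    for lo, hi in sorted(intervals):
        if cur_hi is None or lo > cur_hi:
            # A gap: close the previous run and start a new one.
            if cur_hi is not None:
                total += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping or touching: extend the current run.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total

buffer_zone = (0.0, 10.0)            # stands in for ST_Buffer(l.way, 30)
greens = [(0.0, 8.0), (4.0, 10.0)]   # two overlapping green areas
buffer_len = buffer_zone[1] - buffer_zone[0]

# Like SUM(ST_Area(ST_Intersection(...))): the overlap 4..8 is counted twice.
summed_fraction = sum(overlap(buffer_zone, g) for g in greens) / buffer_len

# Like ST_Area(ST_Union(ST_Intersection(...))): clip first, then merge.
clipped = [(max(lo, buffer_zone[0]), min(hi, buffer_zone[1])) for lo, hi in greens]
union_fraction = union_length(clipped) / buffer_len

print(summed_fraction, union_fraction)
```

The summed variant gives 1.4 here, the unioned variant stays at 1.0.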

Choosing data structure/storage solution for complex geo queries

I have a dataset of entities with their type and lat/long. Like this:
Name Type Lat Long
House1 Big 1 2
House11 Bigger 2 2
House12 Biggest 3 2
House13 Small 4 2
House14 Medium 5 2
So these are houses with their type and location. Now I need to answer queries like: "Find all house of type Big which have a Small and a Medium house in its 10km radius"
What kind of data structure/storage solution would be right here? I looked at Elasticsearch and Redis but looks like I need to iterate over all the houses of the given type (Big for the sample query above) to answer this.
It's perfectly feasible directly from PostgreSQL with PostGIS.
Considering your table structure ...
CREATE TEMPORARY TABLE t (name TEXT, type TEXT, geom GEOGRAPHY);
... and your test data ...
INSERT INTO t VALUES ('House1','Big', ST_MakePoint(1,2));
INSERT INTO t VALUES ('House11','Bigger', ST_MakePoint(2,2));
INSERT INTO t VALUES ('House12','Biggest', ST_MakePoint(3,2));
INSERT INTO t VALUES ('House13','Small', ST_MakePoint(4,2));
INSERT INTO t VALUES ('House14','Medium', ST_MakePoint(5,2));
(Note: it makes no sense here to split lat and long into different columns. PostGIS can store both in a single GEOGRAPHY or GEOMETRY column. See ST_MakePoint for more details.)
"Find all house of type Big which have a Small and a Medium house in
its 10km radius"
Try something like this using ST_Distance:
WITH j AS (SELECT * FROM t WHERE type = 'Big')
SELECT
j.name,j.type,
ST_Distance(j.geom,t.geom) AS distance,
t.name, t.type
FROM j,t
WHERE
ST_Distance(j.geom,t.geom) > 10000 AND
t.type IN ('Small','Medium');
name | type | distance | name | type
--------+------+-----------------+---------+--------
House1 | Big | 333756.3481116 | House13 | Small
House1 | Big | 445008.41595616 | House14 | Medium
(2 rows)
(This query returns records which are more than 10,000 meters away from the Big type house. Just adapt the first condition of the WHERE clause to your needs.)
EDIT: Query based on the comments.
WITH j AS (SELECT *, ARRAY(SELECT DISTINCT t2.type
FROM t t2
WHERE t2.type IN ('Small','Medium') AND
ST_Distance(t2.geom,t1.geom) < 100000
) AS nearHouseType
FROM t t1 WHERE type = 'Big')
SELECT *
FROM j
WHERE j.nearHouseType @> '{Medium, Small}'::TEXT[];
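The logic of the edited query (collect the types found near each Big house, then require that this set contains both wanted types) can be sketched in plain Python. This is an illustration only: the haversine formula on a sphere is an assumption that only approximates what ST_Distance computes on GEOGRAPHY, and the helper names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# (name, type, lon, lat), same values as the test data above
houses = [
    ("House1", "Big", 1, 2),
    ("House11", "Bigger", 2, 2),
    ("House12", "Biggest", 3, 2),
    ("House13", "Small", 4, 2),
    ("House14", "Medium", 5, 2),
]

def big_houses_near(required_types, radius_m):
    """Names of Big houses that have every required type within radius_m."""
    result = []
    for name, kind, lon, lat in houses:
        if kind != "Big":
            continue
        nearby_types = {
            other_kind
            for other_name, other_kind, other_lon, other_lat in houses
            if other_name != name
            and haversine_m(lon, lat, other_lon, other_lat) < radius_m
        }
        # Set containment plays the role of the array operator @>.
        if required_types <= nearby_types:
            result.append(name)
    return result

print(big_houses_near({"Small", "Medium"}, 10_000))
print(big_houses_near({"Small", "Medium"}, 500_000))
```

With the test data the houses are several hundred kilometers apart, so a 10 km radius finds nothing, while a 500 km radius returns House1.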

Tableau - Calculating average where date is less than value from another data source

I am trying to calculate the average of a column in Tableau, except the problem is I am trying to use a single date value (based on filter) from another data source to only calculate the average where the exam date is <= the filtered date value from the other source.
Note: Parameters will not work for me here, since new date values are being added constantly to the set.
I have tried many different approaches, but the simplest was trying to use a calculated field that pulls in the filtered exam date from the other data source.
It successfully can pull the filtered date, but the formula does not work as expected. 2 versions of the calculation are below:
IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN AVG([Raw Score]) END
IF DATEDIFF('day', DATE(ATTR([Exam Date])), DATE(ATTR([Averages (Tableau Test Scores)].[Updated]))) > 1 THEN AVG([Raw Score]) END
Basically, I am looking for the equivalent of this in SQL Server:
SELECT AVG([Raw Score]) WHERE ExamDate <= (Filtered Exam Date)
Below a workbook that shows an example of what I am trying to accomplish. Currently it returns all blanks, likely due to the many-to-one comparison I am trying to use in my calculation.
Any feedback is greatly appreciated!
Tableau Test Exam Workbook
I was able to solve this by using Custom SQL to join the tables together and calculate the average based on my conditions, to get the column results I wanted.
Would still be great to have this ability directly in Tableau, but whatever gets the job done.
Edit:
SELECT
[AcademicYear]
,[Discipline]
--Get the number of student takers
,COUNT([Id]) AS [Students (N)]
--Get the average of the Raw Score
,CAST(AVG(RawScore) AS DECIMAL(10,2)) AS [School Mean]
--Get the number of failures based on an "adjusted score" column
,COUNT(CASE WHEN [AdjustedScore] < 70 THEN 1 END) AS [School Failures]
--This is the column used as the cutoff point for including scores
,[Average_Update].[Updated]
FROM [dbo].[Average] [Average]
FULL OUTER JOIN [dbo].[Average_Update] [Average_Update] ON ([Average_Update].[Id] = [Average].UpdateDateId)
--The meat of joining data for accurate calculations
FULL OUTER JOIN (
SELECT DISTINCT S.[Id], S.[LastName], S.[FirstName], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[Subject], P.[Id] AS PeriodId
FROM [StudentScore] S
FULL OUTER JOIN
(
--Get only the 1st attempt
SELECT DISTINCT [NBOMEId], S2.[Subject], MIN([ExamDate]) AS ExamDate
FROM [StudentScore] S2
GROUP BY [NBOMEId],S2.[Subject]
) B
ON S.[NBOMEId] = B.[NBOMEId] AND S.[Subject] = B.[Subject] AND S.[ExamDate] = B.[ExamDate]
--Group in "Exam Periods" based on the list of periods w/ start & end dates in another table.
FULL OUTER JOIN [ExamPeriod] P
ON S.[ExamDate] >= P.PeriodStart AND S.[ExamDate] <= P.PeriodEnd
WHERE S.[Subject] = B.[Subject]
GROUP BY P.[Id], S.[Subject], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[NBOMEId], S.[NBOMELastName], S.[NBOMEFirstName], S.[SecondYrTake]) [StudentScore]
ON
([StudentScore].PeriodId = [Average_Update].ExamPeriodId
AND [StudentScore].Subject = [Average].Subject
AND [StudentScore].[ExamDate] <= [Average_Update].[Updated])
--End meat
--Joins to pull in relevant data for normalized tables
FULL OUTER JOIN [dbo].[Student] [Student] ON ([StudentScore].[NBOMEId] = [Student].[NBOMEId])
INNER JOIN [dbo].[ExamPeriod] [ExamPeriod] ON ([Average_Update].ExamPeriodId = [ExamPeriod].[Id])
INNER JOIN [dbo].[AcademicYear] [AcademicYear] ON ([ExamPeriod].[AcademicYearId] = [AcademicYear].[Id])
--This will pull only the latest update entry for every academic year.
WHERE [Updated] IN (
SELECT DISTINCT MAX([Updated]) AS MaxDate
FROM [Average_Update]
GROUP BY [ExamPeriodId])
GROUP BY [AcademicYear].[AcademicYearText], [Average].[Subject], [Average_Update].[Updated]
ORDER BY [AcademicYear].[AcademicYearText], [Average_Update].[Updated], [Average].[Subject]
I couldn't download your file to test with your data, but try reversing the order of taking the average ie
AVG(IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN [Raw Score] END)
As written, I believe you're averaging the data before returning it from the IF statement, whereas you want to return the data first and then average it.
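The intended order of operations (filter row by row, aggregate afterwards) matches the SQL in the question. A minimal Python sketch with made-up rows and a made-up cutoff date, not Tableau syntax:

```python
from datetime import date

# Hypothetical (exam_date, raw_score) rows and cutoff, for illustration only.
rows = [
    (date(2020, 1, 10), 80.0),
    (date(2020, 2, 15), 90.0),
    (date(2020, 3, 20), 40.0),
]
cutoff = date(2020, 2, 28)

# Step 1: keep only the rows on or before the cutoff (the row-level IF).
kept_scores = [score for exam_date, score in rows if exam_date <= cutoff]

# Step 2: average what is left (the aggregate is applied last).
average = sum(kept_scores) / len(kept_scores) if kept_scores else None

print(average)
```

The March score is excluded before averaging, so the result is the mean of 80 and 90.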

Selecting count of values in multiple columns using two tables

I'm still new to tsql and trying to figure out how to build this query.
I have two tables. One called mirror which has an official list of all campuses and is used to populate a drop down list of campuses for users on a webform. They then have 5 choices they can select, which then populates another table with their request when they submit the form(Request). ie. CampusChoice1, CampusChoice2..etc.
I am trying to build a page to display the end results of all the collected data. After some reading I'm thinking I might need to use PIVOT to make this happen but I can't get my head to see the query.
I can make a rudimentary query for each choice 1-5, but I wanted them all together, with nulls or zeros where some campuses were not chosen.
Something like
--Simple count on single col
SELECT CampusChoice1, COUNT(*) as '#'
FROM Request
Group By CampusChoice1
Or
--But this doesn't give the results I want, since it does not account for all the POSSIBLE choices.
SELECT CampusChoice1, COUNT(*) as '#',
CampusChoice2, COUNT(*) as '#',
CampusChoice3, COUNT(*) as '#',
CampusChoice4, COUNT(*) as '#',
CampusChoice5, COUNT(*) as '#'
FROM Operations.dbo.TransferRequest
Group By CampusChoice1, CampusChoice2, CampusChoice3, CampusChoice4, CampusChoice5
Any ideas how I could show this? Am I on the right track at least with the PIVOT table?
Not sure if I understood your question correctly, but assuming that you have this:
CampusChoice | Other data ...
------------------------------
CampusChoice1 | ...
CampusChoice2 | ...
CampusChoice1 | ...
Then for the example above with only 3 rows you want this end result:
CampusChoice1 | 2 | CampusChoice2 | 1 | CampusChoice3 | 0 | ...
The T-SQL to achieve this is:
select
'CampusChoice1',
sum( case when CampusChoice = 'CampusChoice1' then 1 else 0 end ) '#',
'CampusChoice2',
sum( case when CampusChoice = 'CampusChoice2' then 1 else 0 end ) '#',
'CampusChoice3',
sum( case when CampusChoice = 'CampusChoice3' then 1 else 0 end ) '#',
...
from
...
Use the sum combined with the case to sum 1's for each row for CampusChoice1 and 0's for each row not CampusChoice1, repeating this for each CampusChoiceN.
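If the goal is one row per campus from the official list, including campuses nobody chose, the same SUM/CASE idea can be combined with a LEFT JOIN from the list table. The sketch below is an assumption about the schema, not the poster's actual tables: it uses Python's built-in sqlite3 module, a column name Campus that does not appear in the question, and only two choice columns to stay short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema modelled on the question.
    CREATE TABLE mirror (Campus TEXT PRIMARY KEY);
    CREATE TABLE Request (CampusChoice1 TEXT, CampusChoice2 TEXT);
    INSERT INTO mirror VALUES ('East'), ('North'), ('South'), ('West');
    INSERT INTO Request VALUES
        ('North', 'South'),
        ('North', 'East'),
        ('South', 'North');
""")

# One row per campus; each SUM(CASE ...) counts how often the campus
# appears in that choice column. Unchosen campuses survive the LEFT JOIN
# with NULLs, which the ELSE 0 branch turns into zero counts.
counts = conn.execute("""
    SELECT m.Campus,
           SUM(CASE WHEN r.CampusChoice1 = m.Campus THEN 1 ELSE 0 END) AS Choice1,
           SUM(CASE WHEN r.CampusChoice2 = m.Campus THEN 1 ELSE 0 END) AS Choice2
    FROM mirror AS m
    LEFT JOIN Request AS r
           ON m.Campus IN (r.CampusChoice1, r.CampusChoice2)
    GROUP BY m.Campus
    ORDER BY m.Campus
""").fetchall()

for row in counts:
    print(row)
```

The same statement should carry over to T-SQL largely unchanged, with three more SUM/CASE columns for CampusChoice3 to CampusChoice5.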

How to create a custom windowing function for PostgreSQL? (Running Average Example)

I would really like to better understand what is involved in creating a UDF that operates over windows in PostgreSQL. I did some searching about how to create UDFs in general, but haven't found an example of how to do one that operates over a window.
To that end I am hoping that someone would be willing to share code for how to write a UDF (can be in C, pl/SQL or any of the procedural languages supported by PostgreSQL) that calculates the running average of numbers in a window. I realize there are ways to do this by applying the standard average aggregate function with the windowing syntax (rows between syntax I believe), I am simply asking for this functionality because I think it makes a good simple example. Also, I think if there was a windowing version of average function then the database could keep a running sum and observation count and wouldn't sum up almost identical sets of rows at each iteration.
You have to look at the PostgreSQL source code: postgresql/src/backend/utils/adt/windowfuncs.c and postgresql/src/backend/executor/nodeWindowAgg.c
There is no good documentation :( -- a fully functional window function can only be implemented in C or PL/v8 - there is no API for other languages.
http://www.pgcon.org/2009/schedule/track/Version%208.4/128.en.html presentation from author of implementation in PostgreSQL.
I found only one non core implementation - http://api.pgxn.org/src/kmeans/kmeans-1.1.0/
http://pgxn.org/dist/plv8/1.3.0/doc/plv8.html
According to the documentation "Other window functions can be added by the user. Also, any built-in or user-defined normal aggregate function can be used as a window function." (section 4.2.8). That worked for me for computing stock split adjustments:
CREATE OR REPLACE FUNCTION prod(float8, float8) RETURNS float8
AS 'SELECT $1 * $2;'
LANGUAGE SQL IMMUTABLE STRICT;
CREATE AGGREGATE prods ( float8 ) (
SFUNC = prod,
STYPE = float8,
INITCOND = 1.0
);
create or replace view demo.price_adjusted as
select id, vd,
prods(sdiv) OVER (PARTITION by id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) as adjf,
rawprice * prods(sdiv) OVER (PARTITION by id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) as price
from demo.prices_raw left outer join demo.adjustments using (id,vd);
Here are the schemas of the two tables:
CREATE TABLE demo.prices_raw (
id VARCHAR(30),
vd DATE,
rawprice float8 );
CREATE TABLE demo.adjustments (
id VARCHAR(30),
vd DATE,
sdiv float);
Starting with table
payments
+------------------------------+
| customer_id | amount | item |
| 5 | 10 | book |
| 5 | 71 | mouse |
| 7 | 13 | cover |
| 7 | 22 | cable |
| 7 | 19 | book |
+------------------------------+
SELECT customer_id,
AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount,
item
FROM payments;
we get
+----------------------------------+
| customer_id | avg_amount | item |
| 5 | 40.5 | book |
| 5 | 40.5 | mouse |
| 7 | 18 | cover |
| 7 | 18 | cable |
| 7 | 18 | book |
+----------------------------------+
AVG being an aggregate function, it can act as a window function. However not all window functions are aggregate functions. The aggregate functions are the non-sophisticated window functions.
In the query above, let's replace the built-in AVG function with our own implementation. It does the same thing, just implemented by the user. The query above becomes:
SELECT customer_id,
my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,
item
FROM payments;
The only difference from the former query is that AVG has been replaced with my_avg. We now need to implement our custom function.
On how to compute the average
Sum up all the elements, then divide by the number of elements. For customer_id of 7, that would be (13 + 22 + 19) / 3 = 18.
We can divide it into:
a step-by-step accumulation -- the sum.
a final operation -- division.
On how the aggregate function gets to the result
The average is computed in steps. Only the last value is needed for the result.
1. Start with an initial value of 0.
2. Feed 13. Compute the intermediate/accumulated sum, which is 13.
3. Feed 22. Compute the accumulated sum, which needs the previous sum plus this element: 13 + 22 = 35.
4. Feed 19. Compute the accumulated sum, which needs the previous sum plus this element: 35 + 19 = 54. This is the total that needs to be divided by the number of elements (3).
5. The result of step 4 is fed to another function that knows how to divide the accumulated sum by the number of elements.
What happened here is that the state started with the initial value of 0 and was changed with every step, then passed to the next step.
State travels between steps for as long as there is data. When all data is consumed state goes to a final function (terminal operation). We want the state to contain all the information needed for the accumulator as well as by the terminal operation.
In the specific case of computing the average, the terminal operation needs to know how many elements the accumulator worked with because it needs to divide by that. For that reason, the state needs to include both the accumulated sum and the number of elements.
We need a tuple that will contain both. Pre-defined POINT PostgreSQL type to the rescue. POINT(5, 89) means an accumulated sum of 5 elements that has the value of 89. The initial state will be a POINT(0,0).
The accumulator is implemented in what's called a state function. The terminal operation is implemented in what's called a final function.
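The interplay of state, state function and final function can be sketched outside the database as a fold. This Python snippet is only an analogy (the names state_func and final_func are made up here, and a tuple replaces the POINT), using the amounts of customer 7:

```python
from functools import reduce

def state_func(state, elem):
    """Accumulation step: state is (number of elements, accumulated sum)."""
    count, total = state
    return (count + 1, total + elem)

def final_func(state):
    """Terminal operation: divide the accumulated sum by the element count."""
    count, total = state
    return total / count if count else None

amounts = [13, 22, 19]  # payments of customer_id 7

# The state starts at (0, 0) and travels from step to step.
final_state = reduce(state_func, amounts, (0, 0))
average = final_func(final_state)

print(final_state, average)
```

The state ends up as (3, 54), and the final function turns it into 18.0.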
When defining a custom aggregate function we need to specify:
the aggregate function name and return type
the initial state
the type of the state that the infrastructure will pass between steps and to the final function
a state function -- knows how to perform the accumulation steps
a final function -- knows how to perform the terminal operation. Not always needed (e.g. in a custom implementation of SUM the final value of the accumulated sum is the result.)
Here's the definition for the custom aggregate function.
CREATE AGGREGATE my_avg (NUMERIC) ( -- NUMERIC is the type of the input values
initcond = '(0,0)', -- this is the initial state of type POINT
stype = POINT, -- this is the type of the state that will be passed between steps
sfunc = my_acc, -- this is the function that knows how to compute the new state from the existing state and a new element. Takes in the state (type POINT) and an element for the step (type NUMERIC)
finalfunc = my_final_func -- returns the result for the aggregate function. Takes in the final state of type POINT and returns the result of the aggregate function - NUMERIC
);
The only thing left is to define two functions my_acc and my_final_func.
CREATE FUNCTION my_acc (state POINT, elem_for_step NUMERIC) -- performs accumulated sum
RETURNS POINT
LANGUAGE SQL
AS $$
-- state[0] is the number of elements, state[1] is the accumulated sum
SELECT POINT(state[0]+1, state[1] + elem_for_step);
$$;
CREATE FUNCTION my_final_func (POINT) -- performs division and returns final value
RETURNS NUMERIC
LANGUAGE SQL
AS $$
-- $1[1] is the sum, $1[0] is the number of elements
SELECT ($1[1]/$1[0])::NUMERIC;
$$;
Now that the functions are available CREATE AGGREGATE defined above will run successfully. Now that we have the aggregate defined, the query based on my_avg instead of the built-in AVG can be run:
SELECT customer_id,
my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,
item
FROM payments;
The results are identical with what you get when using the built-in AVG.
The PostgreSQL documentation suggests that the users are limited to implementing user-defined aggregate functions:
In addition to these [pre-defined window] functions, any built-in or user-defined general-purpose or statistical aggregate (i.e., not ordered-set or hypothetical-set aggregates) can be used as a window function;
What I suspect ordered-set or hypothetical-set aggregates means:
the value returned is identical to all other rows (e.g. AVG and SUM. In contrast RANK returns different values for all rows in group depending on more sophisticated criteria)
it makes no sense to ORDER BY when PARTITIONing because the values are the same for all rows anyway. In contrast we want to ORDER BY when using RANK()
Query:
SELECT customer_id, item, rank() OVER (PARTITION BY customer_id ORDER BY amount desc) FROM payments;
Geometric mean
The following is a user-defined aggregate function that I found no built-in aggregate for and may be useful to some.
The state function computes the average of the natural logarithms of the terms.
The final function raises constant e to whatever the accumulator provides.
CREATE OR REPLACE FUNCTION sum_of_log(state POINT, curr_val NUMERIC)
RETURNS POINT
LANGUAGE SQL
AS $$
SELECT POINT(state[0] + 1,
(state[1] * state[0]+ LN(curr_val))/(state[0] + 1));
$$;
CREATE OR REPLACE FUNCTION e_to_avg_of_log(POINT)
RETURNS NUMERIC
LANGUAGE SQL
AS $$
select exp($1[1])::NUMERIC;
$$;
CREATE AGGREGATE geo_mean (NUMERIC)
(
stype = POINT,
initcond = '(0,0)', -- represents a POINT value
sfunc = sum_of_log,
finalfunc = e_to_avg_of_log
);
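The arithmetic behind the two functions can be checked outside the database. The following Python sketch mirrors them with a tuple in place of the POINT state; it is an illustration of the running-average-of-logarithms idea, not a replacement for the SQL definitions.

```python
from math import exp, isclose, log

def sum_of_log(state, curr_val):
    """State step: state is (count, running average of natural logs)."""
    count, avg_log = state
    return (count + 1, (avg_log * count + log(curr_val)) / (count + 1))

def e_to_avg_of_log(state):
    """Final step: raise e to the average of the logarithms."""
    return exp(state[1])

def geo_mean(values):
    state = (0, 0.0)  # same role as initcond '(0,0)'
    for value in values:
        state = sum_of_log(state, value)
    return e_to_avg_of_log(state)

print(geo_mean([2, 8]))
print(geo_mean([1, 10, 100]))
```

For 2 and 8 the result is approximately 4 (the square root of 16), for 1, 10 and 100 approximately 10, up to floating-point rounding.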
PL/R provides such functionality. See here for some examples. That said, I'm not sure that it (currently) meets your requirement of "keep[ing] a running sum and observation count and [not] sum[ming] up almost identical sets of rows at each iteration" (see here).