Choosing data structure/storage solution for complex geo queries - postgresql

I have a dataset of entities with their type and lat/long. Like this:
Name Type Lat Long
House1 Big 1 2
House11 Bigger 2 2
House12 Biggest 3 2
House13 Small 4 2
House14 Medium 5 2
So these are houses with their type and location. Now I need to answer queries like: "Find all houses of type Big which have a Small and a Medium house within a 10km radius".
What kind of data structure/storage solution would be right here? I looked at Elasticsearch and Redis, but it looks like I would need to iterate over all the houses of the given type (Big for the sample query above) to answer this.

It's perfectly feasible directly from PostgreSQL with PostGIS.
Considering your table structure ...
CREATE TEMPORARY TABLE t (name TEXT, type TEXT, geom GEOGRAPHY);
... and your test data ...
INSERT INTO t VALUES ('House1','Big', ST_MakePoint(1,2));
INSERT INTO t VALUES ('House11','Bigger', ST_MakePoint(2,2));
INSERT INTO t VALUES ('House12','Biggest', ST_MakePoint(3,2));
INSERT INTO t VALUES ('House13','Small', ST_MakePoint(4,2));
INSERT INTO t VALUES ('House14','Medium', ST_MakePoint(5,2));
(Note: it makes no sense here to split lat/long into separate columns. PostGIS can store both in a single GEOGRAPHY or GEOMETRY column. See ST_MakePoint for more details.)
"Find all house of type Big which have a Small and a Medium house in
its 10km radius"
Try something like this using ST_Distance:
WITH j AS (SELECT * FROM t WHERE type = 'Big')
SELECT
j.name,j.type,
ST_Distance(j.geom,t.geom) AS distance,
t.name, t.type
FROM j,t
WHERE
ST_Distance(j.geom,t.geom) > 10000 AND
t.type IN ('Small','Medium');
name | type | distance | name | type
--------+------+-----------------+---------+--------
House1 | Big | 333756.3481116 | House13 | Small
House1 | Big | 445008.41595616 | House14 | Medium
(2 rows)
(As written, this query returns records that are more than 10,000 meters away from the Big house. Just adapt the distance condition in the WHERE clause to your needs, e.g. use < 10000 for "within 10 km".)
EDIT: Query based on the comments.
WITH j AS (
  SELECT *,
         ARRAY(SELECT DISTINCT t2.type
               FROM t t2
               WHERE t2.type IN ('Small','Medium') AND
                     ST_Distance(t2.geom, t1.geom) < 100000
              ) AS nearHouseType
  FROM t t1
  WHERE type = 'Big')
SELECT *
FROM j
WHERE j.nearHouseType @> '{Medium,Small}'::TEXT[];
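As a side note (not from the original answer): for radius filters like this, ST_DWithin is usually preferable to comparing ST_Distance, because it can use a spatial index. A sketch of the original question ("a Small and a Medium house within 10 km") written that way, with an illustrative index name:
-- optional, lets PostGIS answer ST_DWithin via an index scan
CREATE INDEX IF NOT EXISTS t_geom_idx ON t USING gist (geom);
-- Big houses having at least one Small and one Medium house within 10 km
SELECT big.name, big.type
FROM t AS big
WHERE big.type = 'Big'
  AND EXISTS (SELECT 1 FROM t s
              WHERE s.type = 'Small'
                AND ST_DWithin(big.geom, s.geom, 10000))
  AND EXISTS (SELECT 1 FROM t m
              WHERE m.type = 'Medium'
                AND ST_DWithin(big.geom, m.geom, 10000));
With GEOGRAPHY columns the third argument of ST_DWithin is in meters.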

Related

Using the ST_Disjoint() Function gives unexpected result

I am fiddling around with this dataset http://s3.cleverelephant.ca/postgis-workshop-2020.zip. It is used in this workshop: http://postgis.net/workshops/postgis-intro/spatial_relationships.html.
I want to identify all the features that do not have a subway station. I thought this spatial join would be rather straightforward:
SELECT
census.boroname,
COUNT(census.boroname)
FROM nyc_census_blocks AS census
JOIN nyc_subway_stations AS subway
ON ST_Disjoint(census.geom, subway.geom)
GROUP BY census.boroname;
However, the result set is way too large.
"Brooklyn" 4753693
"Manhattan" 1893156
"Queens" 7244123
"Staten Island" 2473146
"The Bronx" 2683246
When I run a test
SELECT COUNT(id) FROM nyc_census_blocks;
I get 38794 as a result. So there are far fewer features in nyc_census_blocks than there are rows in the result set from the spatial join.
Why is that? Where is the mistake I am making?
The problem is that with ST_Disjoint you're counting, for every record of nyc_census_blocks, every station that is disjoint from it, which in the case of no intersection at all means every one of the 491 records of nyc_subway_stations. That's why you're getting such a high count.
Alternatively you can count how many subways and census blocks do intersect, e.g. in a CTE or subquery, and in another query count how many of them return 0:
WITH j AS (
  SELECT
    census.gid,
    census.boroname,
    (SELECT count(*)
     FROM nyc_subway_stations subway
     WHERE ST_Intersects(subway.geom, census.geom)) AS qt
  FROM nyc_census_blocks AS census
)
SELECT boroname, count(*)
FROM j
WHERE qt = 0
GROUP BY boroname;
boroname | count
---------------+-------
Brooklyn | 9517
Manhattan | 3724
Queens | 14667
Staten Island | 5016
The Bronx | 5396
(5 rows)
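An equivalent formulation (a sketch, not from the original answer) filters the blocks directly with NOT EXISTS instead of counting intersections:
SELECT census.boroname, count(*)
FROM nyc_census_blocks AS census
WHERE NOT EXISTS (
    SELECT 1
    FROM nyc_subway_stations AS subway
    WHERE ST_Intersects(subway.geom, census.geom)
)
GROUP BY census.boroname;
It should return the same counts as the CTE version above.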

Comparing two text array columns using % and ordering by number of matches

I have a user table that contains a "skills" column, which is a text array. Given some input array, I would like to find all the users whose skills match one or more of the entries in the input array (using the % similarity operator from pg_trgm), and order the results by the number of matches.
For example, I have Array['java', 'ruby', 'postgres'] and I want users who have these skills ordered by the number of matches (max is 3 in this case).
I tried unnest() with an inner join. It looked like I was getting somewhere, but I still have no idea how to capture the count of the matching array entries. Any ideas on what the structure of the query might look like?
Edit: Details:
Here is what my programmers table looks like:
id | skills
----+-------------------------------
1 | {javascript,rails,css}
2 | {java,"ruby on rails",adobe}
3 | {typescript,nodejs,expressjs}
4 | {auth0,c++,redis}
where skills is a text array.
Here is what I have so far:
SELECT * FROM programmers, unnest(skills) skill_array(x)
INNER JOIN unnest(Array['ruby', 'node']) search(y)
ON skill_array.x % search.y;
which outputs the following:
id | skills | x | y
----+-------------------------------+---------------+---------
2 | {java,"ruby on rails",adobe} | ruby on rails | ruby
3 | {typescript,nodejs,expressjs} | nodejs | node
3 | {typescript,nodejs,expressjs} | expressjs | express
*Assuming pg_trgm is enabled.
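If it isn't enabled yet, the extension can be created with:
CREATE EXTENSION IF NOT EXISTS pg_trgm;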
For an exact match between the user skills and the searched skills, you can proceed like this:
You put the searched skills in the target_skills text array
You filter the users from the table user_table whose user_skills array has at least one common element with the target_skills array by using the && operator
For each of the selected users, you select the common skills by using unnest and INTERSECT, and you calculate the number of these common skills
You order the result by the number of common skills DESC
In this process, the users with skill "ruby" will be selected for the target skill "ruby", but not the users with skill "ruby on rails".
This process can be implemented as follows:
SELECT u.user_id
, u.user_skills
, inter.skills
FROM user_table AS u
CROSS JOIN LATERAL
( SELECT array( SELECT unnest(u.user_skills)
INTERSECT
SELECT unnest(target_skills)
) AS skills
) AS inter
WHERE u.user_skills && target_skills
ORDER BY array_length(inter.skills, 1) DESC
or with this variant:
SELECT u.user_id
, u.user_skills
, array_agg(t_skill) AS inter_skills
FROM user_table AS u
CROSS JOIN LATERAL unnest(target_skills) AS t_skill
WHERE u.user_skills && array[t_skill]
GROUP BY u.user_id, u.user_skills
ORDER BY array_length(array_agg(t_skill), 1) DESC
This query can be accelerated by creating a GIN index on the user_skills column of the user_table.
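A minimal sketch of that index (the index name is illustrative); a plain GIN index on a text[] column supports the && overlap operator used in the WHERE clause above:
CREATE INDEX IF NOT EXISTS user_skills_gin_idx
    ON user_table USING gin (user_skills);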
For a partial match between the user skills and the target skills (i.e. the users with skill "ruby on rails" must be selected for the target skill "ruby"), you need to use the pattern-matching operator LIKE or a regular expression. These cannot be applied to text arrays directly, so you first need to transform your user_skills text array into plain text with the function array_to_string. The query becomes:
SELECT u.user_id
, u.user_skills
, array_agg(t_skill) AS inter_skills
FROM user_table AS u
CROSS JOIN unnest(target_skills) AS t_skill
WHERE array_to_string(u.user_skills, ' ') ~ t_skill
GROUP BY u.user_id, u.user_skills
ORDER BY array_length(array_agg(t_skill), 1) DESC;
Then you can accelerate the queries by creating the following GIN (or GiST) index:
DROP INDEX IF EXISTS user_skills ;
CREATE INDEX user_skills
ON user_table
USING gist (array_to_string(user_skills, ' ') gist_trgm_ops) ; -- the gin_trgm_ops and gist_trgm_ops operator classes support the LIKE operator and regular expressions
In any case, managing the skills as free text will always be fragile if there are typos or if the skills list is not normalized.
I accepted Edouard's answer, but I thought I'd show something else I adapted from it.
CREATE OR REPLACE FUNCTION partial_and_and(list1 TEXT[], list2 TEXT[])
RETURNS BOOLEAN AS $$
SELECT EXISTS(
SELECT * FROM unnest(list1) x, unnest(list2) y
WHERE x % y
);
$$ LANGUAGE SQL IMMUTABLE;
Then create the operator:
CREATE OPERATOR &&% (
LEFTARG = TEXT[],
RIGHTARG = TEXT[],
PROCEDURE = partial_and_and,
COMMUTATOR = &&%
);
And finally, the query:
SELECT p.id, p.skills, array_agg(t_skill) AS inter_skills
FROM programmers AS p
CROSS JOIN LATERAL unnest(Array['ruby', 'java']) AS t_skill
WHERE p.skills &&% array[t_skill]
GROUP BY p.id, p.skills
ORDER BY array_length(array_agg(t_skill), 1) DESC;
My first attempt ordered by array_length(inter_skills, 1), which raised an error saying column 'inter_skills' does not exist: in PostgreSQL an output-column alias can only be referenced on its own in ORDER BY, not inside an expression, so the aggregate has to be repeated (or you can simply ORDER BY count(t_skill)). With that change the query works. All credit goes to Edouard.

Is there any way to match multiple date ranges for inclusion in other multiple ranges in postgresql

For example, I have allowed ranges in the database, (08:00-12:00) and (12:00-15:00), and a requested range I want to test, (09:00-14:00). Is there any way to determine that my test range is included in the allowed ranges in the database? The allowed time can be split into even more parts; I just want to know whether my range fully fits the list of time ranges in the database.
You don't provide the table structure, so I have no idea of the data type; let's assume those are text values:
t=# select '(8:00, 12:30)' a,'(12:00, 15:00)' b,'(09:00, 14:00)' c;
a | b | c
---------------+----------------+----------------
(8:00, 12:30) | (12:00, 15:00) | (09:00, 14:00)
(1 row)
then here is how you can do it:
t=# \x
Expanded display is on.
t=# with d(a,b,c) as (values('(8:00, 12:30)','(12:00, 15:00)','(09:00, 14:00)'))
, w as (select '2017-01-01 ' h)
, timerange as (
select
tsrange(concat(w.h,split_part(substr(a,2),',',1))::timestamp,concat(w.h,split_part(rtrim(a,')'),',',2))::timestamp) ta
, tsrange(concat(w.h,split_part(substr(b,2),',',1))::timestamp,concat(w.h,split_part(rtrim(b,')'),',',2))::timestamp) tb
, tsrange(concat(w.h,split_part(substr(c,2),',',1))::timestamp,concat(w.h,split_part(rtrim(c,')'),',',2))::timestamp) tc
from w
join d on true
)
select *, ta + tb glued, tc <@ ta + tb fits from timerange;
-[ RECORD 1 ]----------------------------------------
ta | ["2017-01-01 08:00:00","2017-01-01 12:30:00")
tb | ["2017-01-01 12:00:00","2017-01-01 15:00:00")
tc | ["2017-01-01 09:00:00","2017-01-01 14:00:00")
glued | ["2017-01-01 08:00:00","2017-01-01 15:00:00")
fits | t
First you need to "cast" your times to timestamps, as there is no time-range type in Postgres, so we take the same day for all times (w.h = 2017-01-01) and convert a, b, c to ta, tb, tc with the default '[)' bounds (which fits our case).
Then use the range union operator (+, see https://www.postgresql.org/docs/current/static/functions-range.html#RANGE-FUNCTIONS-TABLE) to get the "glued" interval.
Lastly, check whether the test range is contained by the glued one with the <@ operator.
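As a side note, on PostgreSQL 14 or newer this maps directly onto multiranges: range_agg glues the allowed ranges together (overlapping and adjacent ranges are merged) and @> tests containment. A sketch, assuming a hypothetical table allowed(slot tsrange) holding the allowed ranges:
SELECT range_agg(slot) @> tsrange('2017-01-01 09:00', '2017-01-01 14:00') AS fits
FROM allowed;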

Recursive postgres query to view

I have the following table which models a very simple hierarchical data structure with each element pointing to its parent:
Table "public.device_groups"
Column | Type | Modifiers
--------------+------------------------+---------------------------------------------------------------
dg_id | integer | not null default nextval('device_groups_dg_id_seq'::regclass)
dg_name | character varying(100) |
dg_parent_id | integer |
I want to query the recursive list of subgroups of a specific group.
I constructed the following recursive query which works fine:
WITH RECURSIVE r(dg_parent_id, dg_id, dg_name) AS (
SELECT dg_parent_id, dg_id, dg_name FROM device_groups WHERE dg_id=1
UNION ALL
SELECT dg.dg_parent_id, dg.dg_id, dg.dg_name
FROM r pr, device_groups dg
WHERE dg.dg_parent_id = pr.dg_id
)
SELECT dg_id, dg_name
FROM r;
I now want to turn this into a view where I can choose which group I want to drill down for using a WHERE clause. This means I want to be able to do:
SELECT * FROM device_groups_recursive WHERE dg_id = 1;
And get all the (recursive) subgroups of the group with id 1
I was able to write a function (by wrapping the query from above), but I would like to have a view instead of the function.
Side-note: I know of the shortcomings of an adjacency list representation; I cannot change it currently.
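A view cannot take a parameter, but a common workaround (a sketch; the view name and the root_id column are illustrative) is to start the recursion from every group and expose the starting group as an extra column, which the caller then filters on:
CREATE VIEW device_groups_recursive AS
WITH RECURSIVE r(root_id, dg_id, dg_name) AS (
    SELECT dg_id, dg_id, dg_name FROM device_groups
    UNION ALL
    SELECT r.root_id, dg.dg_id, dg.dg_name
    FROM r JOIN device_groups dg ON dg.dg_parent_id = r.dg_id
)
SELECT root_id, dg_id, dg_name FROM r;
-- all (recursive) subgroups of the group with id 1:
SELECT dg_id, dg_name FROM device_groups_recursive WHERE root_id = 1;
Unlike the function, the filter is applied after the whole hierarchy has been enumerated, so this can be slower than the parameterized query on large tables.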

Postgresql Select all columns and column names with a specific value for a row

I have a table with many (1000+) columns and ~1M rows. The columns either have the value 1 or are NULL.
I want to be able to select, for a specific row (user), the column names that have a value of 1.
Since there are many columns in the table, specifying the columns explicitly would yield an extremely long query.
You're doing something SQL is quite bad at - dynamic access to columns, or treating a row as a set. It'd be nice if this were easier, but it doesn't work well with SQL's typed nature and the concept of a relation. Working with your data set in its current form is going to be frustrating; consider storing an array, json, or hstore of values instead.
Actually, for this particular data model, you could probably use a bitfield. See bit(n) and bit varying(n).
It's still possible to make a working query with your current model using PostgreSQL extensions, though.
Given sample:
CREATE TABLE blah (id serial primary key, a integer, b integer, c integer);
INSERT INTO blah(a,b,c) VALUES (NULL, NULL, 1), (1, NULL, 1), (NULL, NULL, NULL), (1, 1, 1);
I would unpivot each row into a key/value set using hstore (or, in newer PostgreSQL versions, the json functions). SQL itself provides no way to dynamically access columns, so we have to use an extension. So:
SELECT id, hs FROM blah, LATERAL hstore(blah) hs;
then extract the hstores to sets:
SELECT id, k, v FROM blah, LATERAL each(hstore(blah)) kv(k,v);
... at which point you can filter for values matching the criteria. Note that all column values have been converted to text, so you may want to cast them back:
SELECT id, k FROM blah, LATERAL each(hstore(blah)) kv(k,v) WHERE v::integer = 1;
You also need to exclude id from matching, so:
regress=> SELECT id, k FROM blah, LATERAL each(hstore(blah)) kv(k,v) WHERE v::integer = 1 AND
k <> 'id';
id | k
----+---
1 | c
2 | a
2 | c
4 | a
4 | b
4 | c
(6 rows)
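For completeness, a sketch of the same unpivot using the built-in JSON functions mentioned above (no extension required), against the same sample table:
SELECT id, key AS k
FROM blah, LATERAL jsonb_each_text(to_jsonb(blah)) AS kv(key, value)
WHERE key <> 'id' AND value = '1';
As with the hstore version, all values are compared as text, and NULL columns simply drop out of the result.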