Number of points within a radius of another set of points - PostgreSQL

I have two tables. One is a list of stores (with lat/long). The other is a list of customer addresses (with lat/long). What I want is a query that returns, for each store in my table, the number of customers within a certain radius. The query below gives me the total number of customers within 10,000 meters of ANY store, but I'm not sure how to loop it to return one row for each store with a count.
Note that I'm running these queries using CartoDB, where the_geom is basically long/lat.
SELECT COUNT(*) AS customer_count
FROM customer_table
WHERE EXISTS (
    SELECT 1
    FROM store_table
    WHERE ST_Distance_Sphere(store_table.the_geom, customer_table.the_geom) < 10000
)
This results in a single row:
customer_count
4009
Suggestions on how to make this work against my problem? I'm open to doing this other ways that might be more efficient (faster).
For reference, the column with store names is store_table.store_identifier.

I'll assume that you use the_geom to represent the coordinates (lat/lon) of stores and customers, and that the_geom is of the geography type. Your query will be something like this:
select s.id, count(*) as customer_count
from customers c
inner join stores s
on st_dwithin(c.the_geom, s.the_geom, 10000)
group by s.id
This should give you a neat table with a store id and the count of customers within 10,000 meters of that store.
If the_geom is of type geometry, your query will be very similar, but you should use st_distance_sphere() instead so that the distance is expressed in meters (not degrees).
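A minimal sketch of the geometry variant, assuming the same table and column names as above (st_distance_sphere() returns meters, so the threshold stays 10000):
select s.id, count(*) as customer_count
from customers c
inner join stores s
    on st_distance_sphere(c.the_geom, s.the_geom) < 10000
group by s.id
Note that st_dwithin() on geography can use a GiST index on the_geom, while the st_distance_sphere() comparison generally cannot, so the geography version will usually be faster on large tables.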

What is the best approach?

At work we have a SQL Server 2019 instance. There are two big tables in the same database that have to be joined to obtain specific data: one contains GPS data taken at 4-minute intervals, though there can be in-between records as well. The important attributes here are a non-key attribute called file_id, a timestamp (DATE_TIME column), latitude, and longitude. The other attributes are not relevant, and the key is autogenerated (an identity column), so it's of no use to me.
The other table contains transaction records that have, among other attributes, a timestamp (FECHATRX column) and the same non-key file ID attribute the GPS table has, plus an autogenerated key with no relation at all to the other key.
For each file ID there are several records in both tables that have to be joined somehow, in order to obtain both latitude and longitude for a given file ID and transaction record. The tables aren't ordered at all.
My idea is to pair records of the same file ID, and I imagine it this way (I haven't done it yet; it was only explained to me earlier today):
Order both tables by file ID and timestamp
For the same file ID, all the transaction-table records with a timestamp equal to or greater than the first timestamp from the GPS table, and lower than the following timestamp from the same GPS table, are given the latitude and longitude values from that first record, since they are considered to belong to that latitude-longitude pair (they are probably actually somewhere in the middle, but this is an assumption everybody agrees with)
When a transaction record has a timestamp equal to or greater than the second timestamp, the third timestamp acts as the end point: all the transaction-table records in between get the coordinates from the second record, until a timestamp equals or exceeds the third, and so on until a new file ID is reached or there are no records left in one or both tables
To me this sounds like nested cursors, plus several variables to hold the first GPS record's values while also holding the second GPS record's timestamp for comparison (and of course the file ID itself as a control variable). But is this the best way to obtain the latitude/longitude data for each and every transaction record from the GPS table?
Are other approaches better than using nested cursors?
As I said, I haven't done anything yet; the only thing I can do is show you some data from both tables. I just wanted to know if there is another (and simpler) way of doing this than nested cursors.
Thank you.
Alejandro
No need to reorder tables or use a complex cursor loop. A properly constructed index can provide an efficient join, and a CROSS APPLY or OUTER APPLY can handle the complex "select closest prior GPS coordinate" lookup logic.
Assuming your table structure is something like:
GPS(gps_id, file_id, timestamp, latitude, longitude, ...)
Transaction(transaction_id, timestamp, file_id, ...)
First, create an index on the GPS table to allow efficient lookup by file_id and timestamp:
CREATE INDEX IX_GPS_FileId_Timestamp
ON GPS(file_id, timestamp)
INCLUDE(latitude, longitude)
The INCLUDE clause is optional, but it allows the index to serve up lat/long without the need to access the primary table.
You can then use a query something like:
SELECT *
FROM [Transaction] T
OUTER APPLY (
    SELECT TOP 1 *
    FROM GPS G
    WHERE G.file_id = T.file_id
    AND G.timestamp <= T.timestamp
    ORDER BY G.timestamp DESC
) G1
OUTER APPLY (
    SELECT TOP 1 *
    FROM GPS G
    WHERE G.file_id = T.file_id
    AND G.timestamp >= T.timestamp
    ORDER BY G.timestamp
) G2
CROSS APPLY and OUTER APPLY are like INNER JOIN and LEFT JOIN, but have more flexibility to define a subquery with complex conditions to handle cases like this.
The G1 subquery will efficiently select the immediately prior or equal GPS timestamp record with the same file_id. G2 does the same for equal or immediately following. Per your requirements, you only need G1, but having both might give you the opportunity to interpolate between the two points or to handle cases where there is no preceding matching record.
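For example, a sketch (column names assumed from the structure above) that prefers the prior fix and falls back to the following one when a file's earliest transactions precede its first GPS record:
SELECT T.transaction_id,
    COALESCE(G1.latitude, G2.latitude) AS latitude,
    COALESCE(G1.longitude, G2.longitude) AS longitude
FROM [Transaction] T
OUTER APPLY (
    SELECT TOP 1 latitude, longitude
    FROM GPS G
    WHERE G.file_id = T.file_id
    AND G.timestamp <= T.timestamp
    ORDER BY G.timestamp DESC
) G1
OUTER APPLY (
    SELECT TOP 1 latitude, longitude
    FROM GPS G
    WHERE G.file_id = T.file_id
    AND G.timestamp >= T.timestamp
    ORDER BY G.timestamp
) G2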
See this fiddle for a demo.

How to get the latest data for a column when using grouping in Postgres

I am using Postgres alongside Sequelize. I have encountered a case where I need to write a custom query which groups the records by a particular field. I know that for the remaining columns that are not used for grouping, I need to use an aggregate function like SUM. But the problem is that for some columns I need the latest value (sorted DESC by created_at). I see no SQL function to do this. Is my only option to write subqueries, or is there a better way? Thanks!
For better understanding: in the example below, I want to group the records by address. So after the query there should only be two records, one for sydney and the other for new york. But when it comes to the distance, I want the result of the query to contain the distance from the row that was most recently created, i.e. the one with the latest created_at.
So the final two query results should be:
sydney 100 2022-09-05 18:14:53.492131+05:45
new york 40 2022-09-05 18:14:46.23328+05:45
select address, distance, created_at
from (
    select address, distance, created_at,
        row_number() over (partition by address order by created_at desc) as rn
    from your_table
) x
where rn = 1
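Alternatively, Postgres's DISTINCT ON does the same thing more concisely (a sketch using the same placeholder table name):
select distinct on (address) address, distance, created_at
from your_table
order by address, created_at desc;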

ST_contains taking too much time

I am trying to match latitude/longitude to a particular neighborhood location using the query below:
create table address_classification as (
    select distinct buildingid, street, city, state, neighborhood, borough
    from master_data
    join Borough_GEOM
    on st_contains(st_astext(geom), coordinates) = 'true'
);
In this, coordinates is of the format below:
ST_GeometryFromText('POINT('||longitude||' '||latitude||')') as coordinates
and geom is a column of type geometry.
I have already created indexes as below:
CREATE INDEX coordinates_gix ON master_data USING GIST (coordinates);
CREATE INDEX boro_geom_indx ON Borough_GEOM USING gist(geom);
I have almost 3 million records in the main table and 200 geometry records in the GEOM table. EXPLAIN ANALYZE of the query takes far too long (2 hrs).
Please let me know how I can optimize this query.
Thanks in advance.
As mentioned in the comments, don't use ST_AsText(): it doesn't belong there. It casts the geom to text, which then has to be cast back to a geometry. But, more importantly, that process is likely to fumble the index.
If you're unique on only one column, then use DISTINCT ON; there's no need to compare the others.
If you're unique on the ID column and you're only joining to add selectivity, then consider using EXISTS. Do any of these columns come from borough_GEOM other than geom?
I'd start with something like this:
CREATE TABLE address_classification AS
SELECT DISTINCT ON (buildingid)
    buildingid,
    street,
    city,
    state,
    neighborhood,
    borough
FROM master_data
JOIN borough_GEOM
    ON ST_Contains(geom, coordinates);
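And if all of the selected columns live in master_data, so the join is only there for selectivity, an EXISTS version is a reasonable sketch (this assumes neighborhood and borough are master_data columns):
CREATE TABLE address_classification AS
SELECT buildingid,
    street,
    city,
    state,
    neighborhood,
    borough
FROM master_data m
WHERE EXISTS (
    SELECT 1
    FROM borough_GEOM b
    WHERE ST_Contains(b.geom, m.coordinates)
);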

Filtering a dependent data table, returning results from the main table

Can I search in dependent data tables but return results from the main table?
This problem occurs when we have an N:N relation in the database, as in the example below: every user can have multiple locations, but even if a user has many locations it is still one physical person.
I want to query Sphinx with a condition on the locations table, and the return set should come from the users table.
Query results will be filtered by geo coordinates with GEODIST(), but that's only background information, since it's not the main subject of this question. The goal is, for example: find persons who have a location within a 20-kilometer radius of some explicit point.
SQL structure
TABLE users
id PRIMARY KEY
name TEXT
etc...
TABLE locations
id PRIMARY KEY
name TEXT
coord_x FLOAT
coord_y FLOAT
etc...
TABLE user_location
user_id INTEGER FK
location_id INTEGER FK
Of course I can simply JOIN these 3 tables in the Sphinx sql_query and filter this set, but then I get duplicated persons when a person has more than one location.
Any tips how to achieve this goal with Sphinx Search?
Of course I can simply JOIN these 3 tables in the Sphinx sql_query and filter this set, but then I get duplicated persons when a person has more than one location.
Just add a GROUP BY to the Sphinx query; then you will only ever get one row per user.
You will need to make users.id a Sphinx attribute (so you can group on it) and use a primary key from user_location as the Sphinx document ID (so it's unique).
(It gets more complicated if you have users that don't have locations and you still want to be able to search them - without the location filter. But it can still be done; perhaps use a second source on the index to find the unlocationed users.)
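As a rough sketch (the index name is an assumption, and by default GEODIST() takes radians and returns meters, so this assumes the coordinates were indexed as radians), the SphinxQL side could look like:
SELECT id, user_id, GEODIST(latitude, longitude, 0.89, 0.24) AS dist
FROM users_locations_index
WHERE dist < 20000
GROUP BY user_id;
Here each Sphinx document is one user_location row, user_id is the users.id attribute, and latitude/longitude are float attributes taken from locations.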
SELECT DISTINCT u.*
FROM users u
JOIN user_location ul ON ul.user_id = u.id
JOIN locations l ON l.id = ul.location_id
WHERE ((l.coord_x - <<your X>>) * (l.coord_x - <<your X>>)) +
((l.coord_y - <<your Y>>) * (l.coord_y - <<your Y>>)) < 400;
You might want to wrap this in a SQL language function that takes the location coordinates as parameters, and possibly the distance too. Note that this code assumes that coord_x and coord_y are in kilometers. If they are in some other unit, change the value 400 accordingly.
Note also that the query does not compute the distance to the given point by taking the square root of the squared differences in the two cardinal directions: you are not interested in the distance itself, only in whether a location is closer than a specified distance to a specified point. So you square that distance (20² = 400) and forget about the square root, which is computationally expensive. If your locations table has many records, you will notice the difference.
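A sketch of such a function, assuming a PostgreSQL backend (the function name and parameter names are made up; the distance unit follows the notes above):
CREATE FUNCTION users_near(px float, py float, dist_km float)
RETURNS SETOF users
LANGUAGE sql STABLE AS $$
    SELECT DISTINCT u.*
    FROM users u
    JOIN user_location ul ON ul.user_id = u.id
    JOIN locations l ON l.id = ul.location_id
    WHERE (l.coord_x - px) * (l.coord_x - px) +
          (l.coord_y - py) * (l.coord_y - py) < dist_km * dist_km;
$$;
-- e.g. SELECT * FROM users_near(10.0, 20.0, 20);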
An alternative that avoids the DISTINCT by using EXISTS:
SELECT *
FROM users u
WHERE EXISTS (
SELECT * FROM user_location ul
JOIN locations l ON l.id = ul.location_id
WHERE ul.user_id = u.id
AND l.coord_x ...
AND l.coord_y ...
);

Fully matching sets of records of two many-to-many tables

I have Users, Positions and Licenses.
Relations are:
users may have many licenses
positions may require many licenses
So I can easily get license requirements per position(s) as well as effective licenses per user(s).
But I wonder what would be the best way to match the two sets. The logic goes: a user needs at least those licenses that are required by a certain position. They may have more, but the remaining ones are not relevant.
I would like to get results with users and eligible positions.
PersonID PositionID
1 1 -> user 1 is eligible to work on position 1
1 2 -> user 1 is eligible to work on position 2
2 1 -> user 2 is eligible to work on position 1
3 2 -> user 3 is eligible to work on position 2
4 ...
As you can see I need a result for all users, not a single one per call, which would make things much much easier.
There are actually 5 tables here:
create table Person ( PersonID, ...)
create table Position (PositionID, ...)
create table License (LicenseID, ...)
and relations
create table PersonLicense (PersonID, LicenseID, ...)
create table PositionLicense (PositionID, LicenseID, ...)
So basically I need to find positions that a particular person is licensed to work on. There's of course a much more complex problem here, because there are other factors, but the main objective is the same:
How do I match multiple records of one relational table to multiple records of the other? This could also be described as an inner join per set of records, rather than per single record as it's usually done in T-SQL.
I'm thinking of T-SQL language constructs:
rowsets, but I've never used them before and don't know how to use them anyway
intersect statements, maybe, although these probably only work over whole sets and not groups
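(For reference, the textbook name for this problem is relational division, and the classic formulation uses nested NOT EXISTS - a sketch against the tables above:)
SELECT per.PersonID, pos.PositionID
FROM (SELECT DISTINCT PersonID FROM PersonLicense) per
CROSS JOIN (SELECT DISTINCT PositionID FROM PositionLicense) pos
WHERE NOT EXISTS (
    SELECT 1 FROM PositionLicense req
    WHERE req.PositionID = pos.PositionID
    AND NOT EXISTS (
        SELECT 1 FROM PersonLicense own
        WHERE own.PersonID = per.PersonID
        AND own.LicenseID = req.LicenseID
    )
);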
Final solution (for future reference)
In the meantime, while you fellow developers answered my question, this is something I came up with that uses CTEs and partitioning, which can of course be used on SQL Server 2008 R2. I've never used result partitioning before, so I had to learn something new (which is a plus altogether). Here's the code:
with CTEPositionLicense as (
select
PositionID,
LicenseID,
checksum_agg(LicenseID) over (partition by PositionID) as RequiredHash
from PositionLicense
)
select per.PersonID, pos.PositionID
from CTEPositionLicense pos
join PersonLicense per
on (per.LicenseID = pos.LicenseID)
group by pos.PositionID, pos.RequiredHash, per.PersonID
having pos.RequiredHash = checksum_agg(per.LicenseID)
order by per.PersonID, pos.PositionID;
So I made a comparison between these three techniques that I named as:
Cross join (by Andriy M)
Table variable (by Petar Ivanov)
Checksum - this one here (by Robert Koritnik, me)
Mine already orders results per person and position, so I added the same ordering to the other two to make them return identical results.
Resulting estimated execution plan
Checksum: 7%
Table variable: 2% (table creation) + 9% (execution) = 11%
Cross join: 82%
I also changed the table variable version into a CTE version (a CTE was used instead of a table variable), removed the order by at the end, and compared their estimated execution plans. Just for reference: the CTE version came to 43%, while the original version had 53% (10% + 43%).
One way to write this efficiently is to join PositionLicense with PersonLicense on the LicenseID. Then count the non-nulls grouped by position and person, and compare with the count of all licenses for the position - if they are equal, that person qualifies:
DECLARE @tmp TABLE(PositionId INT, LicenseCount INT)
INSERT INTO @tmp
SELECT PositionId AS PositionId,
    COUNT(1) AS LicenseCount
FROM PositionLicense
GROUP BY PositionId

SELECT per.PersonID, pos.PositionId
FROM PositionLicense AS pos
INNER JOIN PersonLicense AS per ON (pos.LicenseId = per.LicenseId)
GROUP BY pos.PositionId, per.PersonID
HAVING COUNT(1) = (
    SELECT LicenseCount FROM @tmp WHERE PositionId = pos.PositionId
)
I would approach the problem like this:
Get all the (distinct) users from PersonLicense.
Cross join them with PositionLicense.
Left join the resulting set with PersonLicense using PersonID and LicenseID.
Group the results by PersonID and PositionID.
Filter out those (PersonID, PositionID) pairs where the number of licenses in PositionLicense does not match the number of those in PersonLicense.
And here's my implementation:
SELECT
u.PersonID,
pl.PositionID
FROM (SELECT DISTINCT PersonID FROM PersonLicense) u
CROSS JOIN PositionLicense pl
LEFT JOIN PersonLicense ul ON u.PersonID = ul.PersonID
AND pl.LicenseID = ul.LicenseID
GROUP BY
u.PersonID,
pl.PositionID
HAVING COUNT(pl.LicenseID) = COUNT(ul.LicenseID)