What is the best approach? - tsql

At work we have a SQL Server 2019 instance. There are two big tables in the same database that have to be joined to obtain specific data: one contains GPS data taken at 4-minute intervals, though there can be in-between records as well. The important columns here are a non-key attribute called file_id, a timestamp (DATE_TIME column), latitude and longitude. The other attributes are not relevant, and the key is autogenerated (identity column), so it's of no use to me.
The other table contains transaction records that have, among other attributes, a timestamp (FECHATRX column) and the same non-key file ID attribute the GPS table has, plus an autogenerated key with no relation at all to the other key.
For each file ID there are several records in both tables that have to be matched in order to obtain, for a given file ID and transaction record, the corresponding latitude and longitude. The tables aren't ordered at all.
My idea is to pair records with the same file ID, and I imagine it working this way (I haven't done it yet because it was only explained to me earlier today):
Order both tables by file ID and timestamp
For the same file ID, all transaction records whose timestamp is equal to or greater than the first GPS timestamp and lower than the next GPS timestamp will be given the latitude and longitude of that first GPS record, since they are considered to belong to that latitude-longitude pair (in reality they are probably somewhere in between, but this is an assumption everybody agrees with).
Once a transaction record has a timestamp equal to or greater than the second GPS timestamp, the third GPS timestamp acts as the new end point: all transaction records in between get the coordinates of the second GPS record, until a timestamp equals or exceeds the third, and so on until a new file ID is reached or there are no records left in one or both tables.
To me this sounds like nested cursors and several variables: some to hold the first GPS record's values, another to hold the second GPS record's timestamp for comparison, and of course the file ID itself as a control variable. But is this the best way to obtain the latitude/longitude for each and every transaction record from the GPS table?
Are other approaches better than using nested cursors?
As I said, I haven't done anything yet; the only thing I can do is show you some data from both tables. I just wanted to know if there is another (and simpler) way of doing this than nested cursors.
Thank you.
Alejandro

No need to reorder tables or use a complex cursor loop. A properly constructed index can provide an efficient join, and a CROSS APPLY or OUTER APPLY can be used to handle the complex "select closest prior GPS coordinate" lookup logic.
Assuming your table structure is something like:
GPS(gps_id, file_id, timestamp, latitude, longitude, ...)
Transaction(transaction_id, timestamp, file_id, ...)
First create an index on the GPS table to allow efficient lookup by file_id and timestamp.
CREATE INDEX IX_GPS_FileId_Timestamp
ON GPS(file_id, timestamp)
INCLUDE(latitude, longitude)
The INCLUDE clause is optional, but it allows the index to serve up lat/long without the need to access the primary table.
You can then use a query something like:
SELECT *
FROM [Transaction] T
OUTER APPLY (
    SELECT TOP 1 *
    FROM GPS G
    WHERE G.file_id = T.file_id
      AND G.timestamp <= T.timestamp
    ORDER BY G.timestamp DESC
) G1
OUTER APPLY (
    SELECT TOP 1 *
    FROM GPS G
    WHERE G.file_id = T.file_id
      AND G.timestamp >= T.timestamp
    ORDER BY G.timestamp
) G2;
CROSS APPLY and OUTER APPLY are like INNER JOIN and LEFT JOIN, but have more flexibility to define a subquery with complex conditions to handle cases like this.
The G1 subquery will efficiently select the immediately prior or equal GPS timestamp record with the same file_id. G2 does the same for equal or immediately following. Per your requirements, you only need G1, but having both might give you the opportunity to interpolate between the two points or to handle cases where there is no preceding matching record.
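For example, a minimal sketch of such an interpolation, assuming the column names above (and bracketing the table name because TRANSACTION is a reserved word), could look like this; it weights the prior and next fixes by where the transaction timestamp falls between them, and falls back to whichever fix exists:
SELECT
    T.transaction_id,
    T.[timestamp] AS trx_time,
    COALESCE(
        G1.latitude + (G2.latitude - G1.latitude)
            * DATEDIFF(SECOND, G1.[timestamp], T.[timestamp])
            / NULLIF(CAST(DATEDIFF(SECOND, G1.[timestamp], G2.[timestamp]) AS float), 0),
        G1.latitude, G2.latitude)  AS est_latitude,
    COALESCE(
        G1.longitude + (G2.longitude - G1.longitude)
            * DATEDIFF(SECOND, G1.[timestamp], T.[timestamp])
            / NULLIF(CAST(DATEDIFF(SECOND, G1.[timestamp], G2.[timestamp]) AS float), 0),
        G1.longitude, G2.longitude) AS est_longitude
FROM [Transaction] T
OUTER APPLY (   -- prior-or-equal GPS fix
    SELECT TOP 1 G.[timestamp], G.latitude, G.longitude
    FROM GPS G
    WHERE G.file_id = T.file_id AND G.[timestamp] <= T.[timestamp]
    ORDER BY G.[timestamp] DESC
) G1
OUTER APPLY (   -- next-or-equal GPS fix
    SELECT TOP 1 G.[timestamp], G.latitude, G.longitude
    FROM GPS G
    WHERE G.file_id = T.file_id AND G.[timestamp] >= T.[timestamp]
    ORDER BY G.[timestamp]
) G2;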

Related

Table specifically built for a dashboard has several filters.... best way to index?

I have created a materialized view for the purposes of feeding into a dashboard.
My goal is to make this table selectable in the fastest way possible and I'm not sure how to approach it. I was hoping that if I describe the table and how it will be used, someone could offer some direction.
The context is a website with funnel steps. Each row is an instance of a user triggering a funnel step such as add to cart, checkout, payment details, and finally transaction.
Since the table is for the purposes of analytics, it will be refreshed automatically with cron once a day only, in the morning, so I'm not worried about real time update speed, only select speed with various where clauses.
Suppose I have the fields described below:
(N = ~13M rows, expected to be ~20M by January, growing by several million per month)
The table is unique on the combination of session id, user id and funnel step.
- Session Id (Id, so some duplication but generally very very granular - Varchar)
- User Id (Id, so some duplication but generally very very granular - Varchar)
- Date (Date)
- Funnel Step (10 distinct values - Varchar)
- Device Category (3 distinct values - Varchar)
- Country (~ 100 distinct values - varchar)
- City (~1000+ distinct values - varchar)
- Source (several thousand distinct values, nevertheless, stakeholder would like a filter - varchar)
Would I index each field individually? Or should I index all fields in a single index? Per the documentation, I think I can index up to 32 fields at once, but would that be advisable here given my primary goal of select query speed over everything else?
The table will feed into dashboard that reads the table and dynamically translates filter inputs into where clauses. Each time the user adjusts a filter, the table will be read and grouped and aggregated based on the filter / where clause inputs.
Example query:
select
event_action,
count(distinct user_id) as users
from website_data.ecom_funnel
where date >= $input_start_date
and date <= $input_end_date
and device_category in ($mobile, $desktop, $tablet)
and country in ($list of all countries minus any not selected)
and source in ($list of all sources minus any not selected)
group by 1 order by users desc
This will result in a funnel shaped table of data.
I cannot aggregate beforehand because the primary metric of concern is users, not sessions. These must be de-duped from the underlying table. Classic example: suppose a person visits a website once a day for a week. The number of unique visitors for that week is 1; however, if I summed visitors by day I would get 7. Similarly with my table, some users take multiple sessions to complete the funnel. So this is why I cannot pre-aggregate the table, since I need to apply filters to the underlying data and then count(distinct user_id).
Here's explain on a subset of fields if it is useful:
QUERY PLAN
Sort (cost=862194.66..862194.68 rows=9 width=24)
Sort Key: (count(DISTINCT client_id)) DESC
-> GroupAggregate (cost=847955.01..862194.51 rows=9 width=24)
Group Key: event_action
-> Sort (cost=847955.01..852701.48 rows=1898589 width=37)
Sort Key: event_action
-> Seq Scan on ecom_funnel (cost=0.00..589150.14 rows=1898589 width=37)
Filter: ((device_category = ANY ('{mobile,desktop}'::text[])) AND (source = 'google'::text))
My overarching, specific question is, given my use case, should I index each field individually or should I create one single index? Does it matter?
On top of that, any tips for optimising this materialized view to run a select query faster would be appreciated.
Looking at your filter conditions, you should check the cardinality of the device_category field by running
select device_category, count(*) from website_data.ecom_funnel group by device_category
and looking at the values to determine whether an index should lead with this column. A possible index here (without knowing the cardinality) would be a multicolumn one on:
(device_category, date)
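In concrete terms (the index name is just an example), that would be something like:
create index ix_ecom_funnel_devcat_date
    on website_data.ecom_funnel (device_category, date);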
That said, there's no benefit in creating an index on each separate column, as your query wouldn't use them all, so yes, it does matter. You would also slow down the CRUD operations other than reads.
Creating a single index on all columns probably won't speed things up much either, but that depends on the data lying under the hood (in the table) and how selective your filters are (the cardinality of the values in the columns being filtered). It would most likely create a huge overhead of walking the index tree and then fetching row ids to return the data you need.
Summing up, I would try to narrow the index down to the columns that matter most in your filtering, i.e. the ones that cut out most of the data being retrieved. If your query is meant to return the majority of rows from the table, then unfortunately an index won't speed things up and you would need to pre-aggregate instead.
Hope it helps.
Edit: I've just noticed that you already posted the count of distinct values per column. I'm not sure which column Funnel Step maps to in your table, but assuming it's the column named event_action, it might be beneficial to instead create an index that helps with the grouping as well:
(date, event_action)
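Concretely (again, the index name is just an example), that puts the date filter first and the grouping column second:
create index ix_ecom_funnel_date_action
    on website_data.ecom_funnel (date, event_action);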
It seems like you have omitted the GROUP BY clause entirely; it should be included and should group by event_action, since that's what your SELECT part is doing.
If you narrow the date range down to several days/months every time you perform a select query, it might be a huge benefit to create an index with date as the first column.
Remember that the position of a column in an index matters.
If you look for values from, let's say, several months, you could pre-aggregate and store precalculated values for each month in another table, and then UNION ALL that data with the current query, which would only select data from the current (still being updated) period.

ST_contains taking too much time

I am trying to match latitude/longitude to a particular neighborhood location using the query below:
create table address_classification as (
    select distinct buildingid, street, city, state, neighborhood, borough
    from master_data
    join Borough_GEOM
      on st_contains(st_astext(geom), coordinates) = 'true'
);
In this, coordinates is built as follows
ST_GeometryFromText('POINT('||longitude||' '||latitude||')') as coordinates
and geom is a column of type geometry.
I have already created indexes as below:
CREATE INDEX coordinates_gix ON master_data USING GIST (coordinates);
CREATE INDEX boro_geom_indx ON Borough_GEOM USING gist(geom);
I have almost 3 million records in the main table and about 200 geometry rows in the GEOM table. EXPLAIN ANALYZE of the query takes a very long time (2 hrs).
Please let me know how I can optimize this query.
Thanks in advance.
As mentioned in the comments, don't use ST_AsText(): that doesn't belong there. It's casting the geom to text, and then going back to geom. But, more importantly, that process is likely to fumble the index.
If you're unique on only one column, then use DISTINCT ON; there's no need to compare the others.
If you're unique on the ID column and you're only joining to add selectivity, then consider using EXISTS. Do any of these columns come from borough_GEOM other than geom?
I'd start with something like this,
CREATE TABLE address_classification AS
SELECT DISTINCT ON (buildingid)
buildingid,
street,
city,
state,
neighborhood,
borough
FROM master_data
JOIN borough_GEOM
ON ST_Contains(geom,coordinates);
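If it turns out that every selected column (including borough) actually lives in master_data and borough_GEOM is only there for the containment test (that's an assumption on my part), an EXISTS version would be a sketch like:
CREATE TABLE address_classification AS
SELECT DISTINCT ON (buildingid)
       buildingid,
       street,
       city,
       state,
       neighborhood,
       borough
FROM master_data m
WHERE EXISTS (
    SELECT 1
    FROM borough_GEOM b
    WHERE ST_Contains(b.geom, m.coordinates)
);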

Postgres 9.4 which type of index would be ideal for a float column

I was on MySQL and have now moved to Postgres. I have a table that is getting up to 300,000 new records a day but also has many reads. There are 2 columns that I think would be ideal for indexes: latitudes and longitudes. I know that Postgres has different types of indexes, so my question is: which type of index would be best for a table that has many writes and reads? This is the query for the reads:
SELECT p.fullname, s.post, to_char(s.created_on, 'MON DD,YYYY'), last_reply, s.id,
       r.my_id, s.comments, s.city, s.state, p.reputation, s.profile_id
FROM profiles AS p
INNER JOIN streams AS s ON (s.profile_id = p.id)
LEFT JOIN reputation AS r ON (r.stream_id = s.id AND r.my_id = ?)
WHERE s.latitudes >= ? AND ? >= s.latitudes
  AND s.longitudes >= ? AND ? >= s.longitudes
ORDER BY s.last_reply DESC
LIMIT ?
As you can see the 2 columns in the where clause are latitudes and longitudes
PostgreSQL has the point data type with many operators that have good support from the gist index. So if at all possible change your table definition to use a point rather than 2 floats.
Inserting point data is very easy, just use point(longitudes, latitudes) for the column, instead of putting the two values in separate columns. Same with getting data out: lnglat[0] is the longitude and lnglat[1] is the latitude.
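For example, a sketch of migrating the existing float columns into such a point column (the column name lnglat is the one used in the rest of this answer):
-- add a point column and backfill it from the existing float columns
ALTER TABLE streams ADD COLUMN lnglat point;
UPDATE streams SET lnglat = point(longitudes, latitudes);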
The index would be something like this:
CREATE INDEX idx_mytable_lnglat ON streams USING gist (lnglat);
There is also the box data type, which would be great for grouping all your parameters and finding a point in a box is highly optimized in the gist index.
With a point in the table and a box to search on, your query reduces to this:
SELECT p.fullname, s.post, to_char(s.created_on, 'MON DD,YYYY'), last_reply, s.id,
r.my_id, s.comments, s.city, s.state, p.reputation, s.profile_id
FROM profiles AS p
JOIN streams AS s ON (s.profile_id = p.id)
LEFT JOIN reputation AS r ON r.stream_id = s.id AND r.my_id = ?
WHERE s.lnglat <@ box(point(?, ?), point(?, ?))
ORDER BY s.last_reply DESC
LIMIT ?;
The phrase s.lnglat <@ box(point(?, ?), point(?, ?)) means "the value of column lnglat is contained in (i.e. lies inside) the box built from the two corner points".
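For example, with literal values for a hypothetical bounding box spanning longitude 10 to 11 and latitude 50 to 51:
SELECT s.id, s.post
FROM streams AS s
WHERE s.lnglat <@ box(point(10, 50), point(11, 51))
ORDER BY s.last_reply DESC
LIMIT 20;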
If you are filtering the latitude or longitude columns with range comparisons (data that can be sorted), you would probably want to use a B-tree index.
From the Postgres documentation page on indices:
B-trees can handle equality and range queries on data that can be sorted into some ordering. In particular, the PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison using one of the [greater than / lesser than-type operators]
You can read more about indices here.
Edit: Some of the G* indices look like they might be of use if you need to index on both latitude and longitude, since they appear to allow multi-dimensional (e.g. 2d) indexing.
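If you'd rather not change the schema, one option along those lines (a sketch, not something from the question) is a GiST expression index over the existing float columns; the query then has to use the same point(...) expression for the index to be considered:
CREATE INDEX idx_streams_point ON streams USING gist (point(longitudes, latitudes));

-- and in the WHERE clause, e.g.:
-- WHERE point(longitudes, latitudes) <@ box(point(?, ?), point(?, ?))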
Edit2: In order to actually create the index, you'd want to do something along the lines of the following (although you may need to change the table name to suit your needs):
CREATE INDEX idx_lat ON streams(latitudes);
Take note that B-tree indices are default so you don't need to specify the type.
Read more about index creation here.

number of points within a radius of another set of points

I have two tables. One is a list of stores (with lat/long). The other is a list of customer addresses (with lat/long). What I want is a query that will return the number of customers within a certain radius for each store in my table. The query below gives me the total number of customers within 10,000 meters of ANY store, but I'm not sure how to turn it into something that returns one row per store with a count.
Note that I'm doing these queries using CartoDB, where the_geom is basically long/lat.
SELECT COUNT(*) as customer_count FROM customer_table
WHERE EXISTS(
SELECT 1 FROM store_table
WHERE ST_Distance_Sphere(store_table.the_geom, customer_table.the_geom) < 10000
)
This results in a single row :
customer_count
4009
Suggestions on how to make this work against my problem? I'm open to doing this other ways that might be more efficient (faster).
For reference, the column with the store names is store_identifier in store_table.
I'll assume that you use the_geom to represent the coordinate (lat/lon) of store and customer. I will also assume that the_geom is of geography type. Your query will be something like this
select s.id, count(*) as customer_count
from customers c
inner join stores s
on st_dwithin(c.the_geom, s.the_geom, 10000)
group by s.id
This should give you a neat table with a store id and the count of customers within 10,000 meters of the store.
If the_geom is of type geometry, your query will be very similar, but you should use st_distance_sphere() instead in order to express the distance in meters (not degrees).
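A sketch of that geometry-based variant, using the same assumed table and column names as the query above:
select s.id, count(*) as customer_count
from customers c
inner join stores s
    on st_distance_sphere(c.the_geom, s.the_geom) < 10000
group by s.id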

How to optimize a table for queries sorted by insertion order in Postgres

I have a table of time series data where, for almost all queries, I wish to select data ordered by collection time. I do have a timestamp column, but I do not want to rely on the timestamp alone, because if two entries have the same timestamp it is crucial that I can still sort them in the order they were collected, which is information I have at insert time.
My current schema just has a timestamp column. How would I alter my schema to make sure I can sort based on collection/insertion time, and make sure querying in collection/insertion order is efficient?
Add a column based on a sequence (i.e. serial), and create an index on (timestamp_column, serial_column). Then you can get insertion order (more or less) by doing:
ORDER BY timestamp_column, serial_column;
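A minimal sketch of that, where events, collected_at and insert_seq are placeholder names:
ALTER TABLE events ADD COLUMN insert_seq serial;
CREATE INDEX idx_events_time_seq ON events (collected_at, insert_seq);

-- querying in collection order:
SELECT * FROM events ORDER BY collected_at, insert_seq;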
You could use a SERIAL column called insert_order. This way no two rows will have the same value. However, I am not sure that your requirement of being in absolute time order is possible to achieve.
For example, suppose there are two transactions, T1 and T2, that happen at the same time, and you are running on a machine with multiple processors, so in fact both T1 and T2 did the insert at exactly the same instant. Is this a case you are concerned about? There was not enough info in your question to know exactly.
Also, with a serial column you have the issue of gaps: for example, T1 could grab serial value 14 and T2 could grab value 15, then T1 rolls back and T2 does not, so you have to expect that the insert_order column might have gaps in it.