Designed PostGIS database with a points table and polygon tables: how can I make it more efficient?

This is a conceptual question, but I should have asked it long ago on this forum.
I have a PostGIS database with many tables in it. I have done some research on the use of keys in databases, but I'm not sure how to incorporate keys in the case of point data that is dynamic and grows over time.
I'm storing point data in one table, and this data grows each day. It's about 10 million rows right now and will probably grow about 10 million rows each year or so. There are lat, lon, time, and the_geom columns.
I have several other tables, each representing different polygon groups (converted shapefiles to tables with shp2pgsql), like counties, states, etc.
I'm writing queries that relate the point data to the spatial tables to see if points are inside of the polygons, resulting in things like "55 points in X polygon in the past 24 hours", etc.
The problem is, I don't have a key that relates the point table to the other tables. I think this is probably inhibiting query efficiency, but I'm not sure.
I know this question is fairly vague, and I'm happy to clarify anything, but I basically have a bunch of points in a table that I'm spatially comparing to other tables, and I'm trying to find the best way to design things.
Thanks for any help!

If you don't already have them, you should build a spatial index on both the point and polygon tables.
Even so, spatial comparisons are usually slower than numerical comparisons.
So adding one or more keys to the point table that reference the other tables, and using those in your SELECT queries instead of spatial operations, will surely speed things up.
Obviously, inserts will be slower, but given the numbers you gave (10 million per year), that should not be a problem.
Probably just adding a foreign key to the smallest entities (cities, for example) and joining the others (countries, states...) to get results will be faster than a spatial comparison.
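As a rough sketch of both suggestions, assuming hypothetical table names points and counties, a geometry column the_geom on each, and an integer primary key gid on counties (adjust to your actual schema):
-- GiST spatial indexes on both sides of the comparison
CREATE INDEX idx_points_geom ON points USING gist (the_geom);
CREATE INDEX idx_counties_geom ON counties USING gist (the_geom);
-- Precompute the containing county once per point, then filter numerically
ALTER TABLE points ADD COLUMN county_id integer REFERENCES counties (gid);
UPDATE points p
SET county_id = c.gid
FROM counties c
WHERE ST_Contains(c.the_geom, p.the_geom);
CREATE INDEX idx_points_county ON points (county_id);
-- "How many points in county X in the past 24 hours", with no spatial operation at query time
SELECT count(*)
FROM points
WHERE county_id = 42  -- hypothetical county id
  AND "time" >= now() - interval '24 hours';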

Foreign keys (and other constraints) are not needed in order to query. Rather, they arise as a consequence of whatever design is appropriate to the application, per principles of good design.
They just tell the DBMS that a list of values under a list of columns in one table also appears elsewhere as a list of values under a list of columns in some table. (This helps avoid errors and can improve optimization.)
You would still want indexes on columns that will be involved in joins. E.g. you might want the X coordinates in two tables to have sorted indexes, in the same order. This is independent of whether one column's values form a subset of another's, i.e. whether a foreign key constraint holds between them.
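To illustrate the distinction, a minimal sketch using the same hypothetical points/counties names as above; the index and the constraint are declared separately and serve different purposes:
-- The index speeds up joins and filters on the column, constraint or not
CREATE INDEX idx_points_county_id ON points (county_id);
-- The foreign key only declares that every county_id also appears in counties.gid
-- (error prevention, plus extra information for the optimizer)
ALTER TABLE points
  ADD CONSTRAINT fk_points_county FOREIGN KEY (county_id) REFERENCES counties (gid);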

Related

Best way to store data : Many columns vs many rows for a case of 10,000 new rows a day

After checking a lot of similar questions on Stack Overflow, it seems that context will tell which way is the best to hold the data...
Short story: I add over 10,000 new rows of data each day to a very simple table containing only 3 columns. I will NEVER update the rows, only do selects, grouping, and averaging. I'm looking for the best way of storing this data to make the average calculations as fast as possible.
To put you in context, I'm analyzing a recorded audio file (pink noise playback in a sound mixing studio) using FFTs. The results for a single audio file are always in the same format: the frequency bin's ID (integer) and its value in decibels (float). I want to store these values in a PostgreSQL DB.
Each bin (band) of frequencies (width = 8Hz) gets an amplitude in decibels. The first bin is ignored, so it goes like this (not actual dB values):
bin 1: 8Hz-16Hz, -85.0dB
bin 2: 16Hz-32Hz, -73.0dB
bin 3: 32Hz-40Hz, -65.0dB
...
bin 2499: 20,000Hz-20,008Hz, -49.0dB
The goal is to store an amplitude of each bin from 8Hz through 20,008Hz (1 bin covers 8Hz).
Many rows approach
For each analyzed audio file, there would be 2,499 rows of 3 columns: "Analysis UID", "Bin ID" and "dB".
For each studio (4), there is one recording daily that is to be appended in the database (that's 4 times 2,499 = 9,996 new rows per day).
After a recording in one studio, the new 2,499 rows are used to show a plot of the frequency response.
My concern is that we also need to make a plot of the averaged dB values of every bin in a single studio for 5-30 days, to see if the frequency response tends to change significantly over time (thus telling us that a calibration is needed in a studio).
I came up with the following data structure for the many rows approach:
"analysis" table:
analysisUID (serial)
studioUID (Foreign key)
analysisTimestamp
"analysis_results" table:
analysisUID (Foreign key)
freq_bin_id (integer)
amplitude_dB (float)
Is this the optimal way of storing the data? A single table holding close to 10,000 new rows a day, with averages computed over 5 or more analyses by grouping on analysisUID and freq_bin_id? That would give me 2,499 rows (each corresponding to a bin and giving me the averaged dB value).
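For what it's worth, a sketch of the averaging query that layout would support, assuming the table and column names above and a 30-day window (the studio id is hypothetical):
-- Average amplitude per frequency bin for one studio over the last 30 days
SELECT r.freq_bin_id,
       avg(r.amplitude_dB) AS avg_dB   -- one row per bin, ~2,499 rows total
FROM analysis a
JOIN analysis_results r ON r.analysisUID = a.analysisUID
WHERE a.studioUID = 1                  -- hypothetical studio id
  AND a.analysisTimestamp >= now() - interval '30 days'
GROUP BY r.freq_bin_id
ORDER BY r.freq_bin_id;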
Many columns approach:
I thought I could do it the other way around, breaking the frequency bins into 4 tables (Low, Med Low, Med High, High). Since the Postgres documentation says the column limit is "250 - 1600 depending on column types", it would be realistic to make 4 tables of around 625 columns each (2,499 / 4), each column representing a bin and containing its "dB" value, like so:
"low" table:
analysisUID (Foreign key)
freq_bin_id_1_amplitude_dB (float)
freq_bin_id_2_amplitude_dB (float)
...
freq_bin_id_625_amplitude_dB (float)
"med_low" table:
analysisUID (Foreign key)
freq_bin_id_626_amplitude_dB (float)
freq_bin_id_627_amplitude_dB (float)
...
freq_bin_id_1250_amplitude_dB (float)
etc...
Would the averages be computed faster if the server only has to Group by analysisUIDs and make averages of each column?
Rows are not going to be an issue; however, the way in which you insert said rows could be. If insert time is one of the primary concerns, make sure you can bulk insert them, OR go for a format with fewer rows.
You could potentially store all the data in jsonb format, especially since you will not be doing any updates to the data; it may be convenient to store it all in one table that way, but performance may be worse.
In any case, since you're not updating the data, the (usually default) fillfactor of 100 is appropriate.
I would NOT use the "many column" approach, as the amount of data you're talking about really isn't that much. Using your first example of 2 tables and few columns is very likely the optimal way to get your results.
It may be useful to index the following columns (example statements below):
analysis_results.freq_bin_id
analysis.analysisTimestamp
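A sketch of those two indexes, using the column names from the question:
-- Helps grouping and filtering by frequency bin
CREATE INDEX idx_results_freq_bin ON analysis_results (freq_bin_id);
-- Helps selecting analyses inside a date window (e.g. the last 5-30 days)
CREATE INDEX idx_analysis_timestamp ON analysis (analysisTimestamp);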
As to breaking the data into different sections, it'll depend on what types of queries you're running. If you're looking at ALL frequency bins, using multiple tables will just be a hassle and net you nothing.
If you only query some freq_bin values at a time, it could theoretically help; however, you're basically doing table partitioning, and once you've moved into that land you might as well make a partition for each frequency band.
If I were you, I'd create your first table structure, fill it with 30 days worth of data and query away. You may (as we often do) be overanalyzing the situation. Postgres can be very, very fast.
Remember, the raw data you're analyzing is something on the order of a few (5 or fewer) MB per day at an absolute maximum. Analyzing 150 MB of data is no sweat for a DB running on modern hardware if it's indexed and stored properly.
The optimizer is going to find the correct rows in the "smaller" table really, really fast and likely cache all of those, then go looking for the child rows, and it'll know exactly what ID's and ranges to search for. If your data is all inserted in chronological order, there's a good chance it'll read it all in very few reads with very few seeks.
My main concern is with the insert speed, as doing 10,000 individual inserts can take a while if you're not doing bulk inserts.
Since the measurements seem well behaved, you could use an array, using the freq_bin as the array index (note: array indices are 1-based in SQL).
This has the additional advantage that the array is kept in TOAST storage, keeping the physical table small.
CREATE TABLE herrie
( analysisUID serial NOT NULL PRIMARY KEY
, studioUID INTEGER NOT NULL REFERENCES studio(studioUID)
, analysisTimestamp TIMESTAMP NOT NULL
, decibels float[] -- array with one amplitude per frequency bin (2,499 entries)
, UNIQUE (studioUID, analysisTimestamp)
);
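One way to read the array back out per bin, assuming the table above and a PostgreSQL version with WITH ORDINALITY (9.4+); the ordinality value is the 1-based bin number:
-- Average amplitude per frequency bin for one studio over the last 30 days
SELECT t.bin_no,
       avg(t.amplitude) AS avg_dB
FROM herrie h
CROSS JOIN LATERAL unnest(h.decibels) WITH ORDINALITY AS t(amplitude, bin_no)
WHERE h.studioUID = 1                  -- hypothetical studio id
  AND h.analysisTimestamp >= now() - interval '30 days'
GROUP BY t.bin_no
ORDER BY t.bin_no;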

Intersecting two tables with one common row elements in matlab

I have two different tables (.csv files) as:
I need to merge these two tables in MATLAB, intersecting on the first column of each. I want to make a new table with six columns (the combined columns of both tables), where the number of rows equals the number of elements common to the first columns of both tables.
How should I do the intersection and merging of these two tables?
I'm proposing an answer. I'm not claiming it is the best answer. In fact, I think there are probably much faster ones! Also note that I do not have MATLAB in front of me right now and can't test this. There might be some mistakes.
First of all, read the .csv files into memory. In table 1, convert the first column into numeric data (currently, it looks like the values are strings). After this step, you want to have two double matrices I'll call table1 (which is 3296x5) and table2 (which is 3184x3).
Second (this is where it gets mildly interesting; step 1 was the boring stuff), find all IDs that are common to both tables. This can be done by calling commonIDs = intersect(table1(:,1), table2(:,1)).
Third, get the indices of the common rows for table1, then repeat for table2. This is done using the ismember function as follows:
goodEntries1=ismember(table1(:,1),commonIDs);
goodEntries2=ismember(table2(:,1),commonIDs);
Lastly, we extract data and combine to get a result. Note that I only include the ID column once:
result=[table1(goodEntries1,:) table2(goodEntries2,2:end)];
You will need to test this to make sure it is robust. I think that this will keep the right rows together, but depending on how ismember works, you might end up combining rows out of order (for instance, table1's ID=5 with table2's ID=6).

Selecting the proper db index

I have a table with 10+ million tuples in my Postgres database that I will be querying. There are 3 fields: "layer" (integer), "time", and "cnt". Many records share the same value for "layer" (distributed from 0 to about 5 or so, heavily concentrated between 0-2). "time" has relatively unique values, but during queries the values will be manipulated such that some become duplicates, and then they are grouped to account for those duplicates. "cnt" is just used for counting.
I am trying to query records from certain layers (WHERE layer = x) between certain times (WHERE time <= y AND time >= z), and I will be using "time" as my GROUP BY field. I currently have 4 indexes, one each on (time), (layer), (time, layer), and (layer, time), and I believe this is too many (I copied this from a template provided by my supervisor).
From what I have read online, fields with relatively unique values, as well as fields that are frequently-searched, are good candidates for indexing. I have also seen that having too many indexes will hinder the performance of my query, which is why I know I need to drop some.
This leads me to believe that the best index choice would be on (time, layer) (I assume a btree is fine because I have not seen reason to use anything else), because while I query slightly more frequently on layer, time better fits the criterion of having more relatively unique values. Or, should I just have 2 indices, 1 on layer and 1 on time?
Also, is an index on (time, layer) any different from (layer, time)? Because that is one of the confusions that led me to have so many indices. The provided template has multiple indices with the same 3 attributes, just arranged in different orders...
Your where clause appears to be:
WHERE layer = x and time <= y AND time >= z
For this query, you want an index on (layer, time). You could include cnt in the index so the index covers the query -- that is, all data columns are in the index so the original data pages don't need to be accessed for the data (they may be needed for locking information).
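A sketch of such a covering index, assuming a hypothetical table name my_table and PostgreSQL 11 or newer for the INCLUDE clause (on older versions, cnt can simply be added as a trailing key column):
-- (layer, time) matches the equality-then-range shape of the WHERE clause;
-- INCLUDE (cnt) lets the index cover the query
CREATE INDEX idx_layer_time ON my_table (layer, "time") INCLUDE (cnt);
-- The query shape this index targets (dates are hypothetical)
SELECT "time", sum(cnt)
FROM my_table
WHERE layer = 1
  AND "time" BETWEEN '2015-01-01' AND '2015-01-02'
GROUP BY "time";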
Your original four indexes are redundant, because the single-column indexes are not needed. The advice to create all four is not good advice. However, (layer, time) and (time, layer) are different indexes and under some circumstances, it is a good idea to have both.

MongoDB and using DBRef with Spatial Data

I have a collection with 100 million documents of geometry.
I have a second collection with time data associated to each of the other geometries. This will be 365 * 96 * 100 million or 3.5 trillion documents.
Rather than store the 100 million entries (365*96) times more than needed, I want to keep them in separate collections and do a type of JOIN/DBRef/Whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geoIntersection. This will filter it down from 100 million to about 5,000. Then, using those 5,000 geometry GUIDs, I want to filter the 3.5 trillion documents by the 5,000 geometries and the additional date criteria I specify, aggregate the data, and find the average. You are left with 5,000 geometries and 5,000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL; is this possible in MongoDB, and can it be done optimally, say in less than 10 seconds?
To clarify: as I understand it, this is what DBRefs are used for, but I have read that they are not efficient at all, and that when dealing with this much data they wouldn't be a good fit.
If you're going to be dealing with a geometry and its time series data together, it makes sense to store them in the same doc. A year's worth of data in 15 minute increments isn't killer, and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you sparse things up for missing data: you can encode the data differently if it's sparse rather than indexing into a 35,040-slot array.
A $geoIntersects on a big pile of geometry data will be a performance issue, though. Make sure you have a spatial index (like 2dsphere) in place to speed things up.
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.

Spatial index slowing down query

Background
I have a table that contains POLYGONS/MULTIPOLYGONS which represent customer territories:
The table contains roughly 8,000 rows
Approximately 90% of the polygons are circles
The remainder of the polygons represent one or more states, provinces, or other geographic regions. The raw polygon data for these shapes was imported from US census data.
The table has a spatial index and a clustered index on the primary key. No changes to the default SQL Server 2008 R2 settings were made. 16 cells per object, all levels medium.
Here's a simplified query that will reproduce the issue that I'm experiencing:
DECLARE @point GEOGRAPHY = GEOGRAPHY::STGeomFromText('POINT (-76.992188 39.639538)', 4326)
SELECT terr_offc_id
FROM tbl_office_territories
WHERE terr_territory.STIntersects(@point) = 1
What seems like a simple, straightforward query takes 12 or 13 seconds to execute, and has what seems like a very complex execution plan for such a simple query.
In my research, several sources have suggested adding an index hint to the query, to ensure that the query optimizer is properly using the spatial index. Adding WITH(INDEX(idx_terr_territory)) has no effect, and it's clear from the execution plan that it is referencing my index regardless of the hint.
Reducing polygons
It seemed possible that the territory polygons imported from the US Census data are unnecessarily complex, so I created a second column and tested reduced polygons (via the Reduce() method, sketched after the timings below) with varying degrees of tolerance. Running the same query as above against the new column produced the following results:
No reduction: 12649ms
Reduced by 10: 7194ms
Reduced by 20: 6077ms
Reduced by 30: 4793ms
Reduced by 40: 4397ms
Reduced by 50: 4290ms
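For reference, a sketch of how such a reduced column might be populated (the column name terr_territory_reduced is hypothetical; a tolerance of 30 corresponds to the "Reduced by 30" row):
ALTER TABLE tbl_office_territories ADD terr_territory_reduced GEOGRAPHY;
UPDATE tbl_office_territories
SET terr_territory_reduced = terr_territory.Reduce(30);  -- simplify with tolerance 30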
Clearly headed in the right direction, but dropping precision seems like an inelegant solution. Isn't this what indexes are supposed to be for? And the execution plan still seems strangely complex for such a basic query.
Spatial Index
Out of curiosity, I removed the spatial index, and was stunned by the results:
Queries were faster WITHOUT an index (sub 3 sec w/ no reduction, sub 1 sec with reduction tolerance >= 30)
The execution plan looked far, far simpler:
My questions
Why is my spatial index slowing things down?
Is reducing my polygon complexity really necessary in order to speed up my query? Dropping precision could cause problems down the road, and doesn't seem like it will scale very well.
Other Notes
SQL Server 2008 R2 Service Pack 1 has been applied
Further research suggested running the query inside a stored procedure. Tried this and nothing appeared to change.
My first thought is to check the bounding coordinates of the index to see if they cover the entirety of your geometries. Second, spatial indexes left at the default of 16 cells per object with MEDIUM at all four grid levels (16 MMMM) perform, in my experience, very poorly. I'm not sure why that is the default. I have written something about spatial index tuning in this answer.
First, make sure the index covers all of the geometries. Then try reducing cells per object to 8 (see the sketch below). If neither of those two things offers any improvement, it might be worth your time to run the spatial index tuning proc in the answer I linked above.
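For example, a rebuilt index with 8 cells per object might look like this on SQL Server 2008 R2 (the index name is hypothetical; grid levels kept at MEDIUM as in the original setup):
CREATE SPATIAL INDEX idx_terr_territory_tuned
ON tbl_office_territories (terr_territory)
USING GEOGRAPHY_GRID
WITH (GRIDS = (LEVEL_1 = MEDIUM, LEVEL_2 = MEDIUM, LEVEL_3 = MEDIUM, LEVEL_4 = MEDIUM),
      CELLS_PER_OBJECT = 8);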
Final thought: state boundaries have so many vertices, and you are testing for intersection with so many state boundary polygons, that it very well could take that long without reducing them.
Oh, and since it has been two years, starting in SQL Server 2012, there is now a GEOMETRY_AUTO_GRID tessellation that does the index tuning for you and does a great job most of the time.
This might just be due to the simpler execution plan being executed in parallel, whereas the other one is not. However, there is a warning on the first execution plan that might be worth investigating.