Selecting the proper db index - postgresql

I have a table with 10+ million tuples in my Postgres database that I will be querying. There are 3 fields, "layer" integer, "time", and "cnt". Many records share the same values for "layer" (distributed from 0 to about 5 or so, heavily concentrated between 0-2). "time" has relatively unique values, but during queries the values will be manipulated such that some will be duplicates, and then they will be grouped to account for those duplicates. "cnt" is just used to count.
I am trying to query records from certain layers (WHERE layer = x) between certain times (WHERE time <= y AND time >= z), and I will be using "time" as my GROUP BY field. I currently have 4 indexes, one each on (time), (layer), (time, layer), and (layer, time) and I believe this is too many (I copied this from a template provided by my supervisor).
From what I have read online, fields with relatively unique values, as well as fields that are frequently-searched, are good candidates for indexing. I have also seen that having too many indexes will hinder the performance of my query, which is why I know I need to drop some.
This leads me to believe that the best index choice would be on (time, layer) (I assume a btree is fine because I have not seen reason to use anything else), because while I query slightly more frequently on layer, time better fits the criterion of having more relatively unique values. Or, should I just have 2 indices, 1 on layer and 1 on time?
Also, is an index on (time, layer) any different from (layer, time)? Because that is one of the confusions that led me to have so many indices. The provided template has multiple indices with the same 3 attributes, just arranged in different orders...

Your where clause appears to be:
WHERE layer = x and time <= y AND time >= z
For this query, you want an index on (layer, time). You could include cnt in the index so the index covers the query -- that is, all data columns are in the index so the original data pages don't need to be accessed for the data (they may be needed for locking information).
Your original four indexes are redundant, because the single-column indexes are not needed. The advice to create all four is not good advice. However, (layer, time) and (time, layer) are different indexes and under some circumstances, it is a good idea to have both.
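As a rough illustration of the (layer, time) recommendation, the sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. That is an assumption made only to keep the example self-contained: the two planners differ, and in PostgreSQL you would inspect the plan with EXPLAIN instead. The table name samples, the index name idx_layer_time, and the generated data are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (layer INTEGER, time INTEGER, cnt INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(i % 3, i, 1) for i in range(1000)],  # made-up data: layers 0-2
)

# Equality column first, range column second; cnt is appended so the
# index covers the query.
conn.execute("CREATE INDEX idx_layer_time ON samples (layer, time, cnt)")

query = """
    SELECT time, SUM(cnt)
    FROM samples
    WHERE layer = ? AND time >= ? AND time <= ?
    GROUP BY time
"""
params = (1, 100, 200)

# The last column of each EXPLAIN QUERY PLAN row is a human-readable detail.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params))
rows = conn.execute(query, params).fetchall()
print(plan)
```

The equality condition on layer narrows the search to one contiguous slice of the index, and the range on time is then a single scan within that slice. With (time, layer) the range comes first, so entries for every layer in the time window have to be visited.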

Related

How does an index's fill factor relate to a query plan?

When a PostgreSQL query's execution plan is generated, how does an index's fill factor affect whether the index gets used in favor of a sequential scan?
A fellow dev and I were reviewing the performance of a PostgreSQL (12.4) query with a windowed function of row_number() OVER (PARTITION BY x, y, z) and seeing if we could speed it up with an index on said fields. We found that during the course of the query the index would get used if we created it with a fill factor >= 80 but not at 75. This was a surprise to us as we did not expect the fill factor to be considered in creating the query plan.
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used. What causes this behavior and should we consider it when selecting an index's fill factor on a table that will have frequent inserts and deletes and be periodically vacuumed?
"If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used."
So, it is not the fill factor, but rather the size of the index (which is influenced by the fill factor). This agrees with my memory that index size is a (fairly weak) influence on the cost estimate. That influence is almost zero if you are reading only one tuple, but larger if you are reading many tuples.
If the cost estimates of the plan are close to each other, then small differences such as this will be enough to drive one over the other. But that doesn't mean you should worry about them. If one plan is clearly superior to the other, then you should think about why the estimates are so close together to start with when the realities are not close together.
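To make the link between fill factor and index size concrete, here is a deliberately simplified arithmetic sketch. It is not the planner's cost formula; it only counts how many leaf pages are needed when each page is filled to a given percentage at build time. The function name leaf_pages and the figure of 200 entries per full page are hypothetical.

```python
def leaf_pages(n_entries: int, entries_per_full_page: int, fillfactor: int) -> int:
    """Approximate leaf page count when pages are packed to `fillfactor` percent."""
    per_page = max(1, entries_per_full_page * fillfactor // 100)
    # Integer ceiling division.
    return (n_entries + per_page - 1) // per_page


# Made-up numbers: one million entries, 200 entries fit in a full page.
for ff in (100, 80, 75):
    print(ff, leaf_pages(1_000_000, 200, ff))
```

Under these assumed numbers the index built at 75 needs roughly a third more pages than one built at 100. A difference of that size can tip the choice when two plans have nearly equal estimated costs, which is the situation described above.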

Most appropriate analysis method - Clustering?

I have 2 large data frames with similar variables representing 2 separate surveys. Some rows (participants) in each data frame correspond to the other and I would like to link these two together.
There is an index in both data frames, though this index indicates the locality of the survey (i.e. region) and not individual IDs.
Merging is not possible, as in most cases there are identical index values for different participants.
Given that merging on an index value from the 2 data frames is not possible, I wish to compare similar (binary) variables from both data frames, in addition to the index values common to both, in order to find the most likely match. I can then (with some margin of error) match rows with the most similar values for similar variables and merge them together.
What do you think would be the appropriate method for doing this? Clustering?
Best,
James
That obviously is not clustering. You don't want large groups of records.
What you want to do is an approximate JOIN.
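One way such an approximate join could look is sketched below: rows are only compared within the same region, and each row of the first survey is paired with the row of the second survey that differs in the fewest binary variables. This is a minimal sketch under assumptions, not a full record-linkage method. The function names, the max_distance threshold, and the sample data are all hypothetical, and the greedy matching allows one row of the second survey to be picked more than once.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_join(left, right, max_distance):
    """Pair each left row with its closest right row in the same region.

    left, right: lists of (region, tuple_of_binary_variables).
    Returns a list of (left_index, right_index, distance).
    """
    matches = []
    for i, (region_l, vars_l) in enumerate(left):
        best = None
        for j, (region_r, vars_r) in enumerate(right):
            if region_l != region_r:
                continue  # the shared index (region) must agree exactly
            d = hamming(vars_l, vars_r)
            if d <= max_distance and (best is None or d < best[1]):
                best = (j, d)
        if best is not None:
            matches.append((i, best[0], best[1]))
    return matches


# Made-up survey rows: (region, binary answers).
survey_a = [("north", (1, 0, 1, 1)), ("north", (0, 0, 0, 1)), ("south", (1, 1, 1, 0))]
survey_b = [("north", (0, 0, 0, 1)), ("north", (1, 0, 1, 0)), ("south", (0, 0, 0, 0))]
matches = approximate_join(survey_a, survey_b, max_distance=1)
print(matches)
```

The threshold controls the margin of error mentioned in the question: rows with no candidate within max_distance stay unmatched rather than being forced into a pair.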

Best way to store data : Many columns vs many rows for a case of 10,000 new rows a day

After checking a lot of similar questions on Stack Overflow, it seems that context will tell which way is the best to hold the data...
Short story: I add about 10,000 new rows of data per day to a very simple table containing only 3 columns. I will NEVER update the rows, only doing selects, grouping and making averages. I'm looking for the best way of storing this data to make the average calculations as fast as possible.
To put you in context, I'm analyzing a recorded audio file (Pink Noise playback in a sound mixing studio) using FFTs. The results for a single audio file are always in the same format: the frequency bin's ID (integer) and its value in decibels (float value). I want to store these values in a PostgreSQL DB.
Each bin (band) of frequencies (width = 8Hz) gets an amplitude in decibels. The first bin is ignored, so it goes like this (not actual dB values):
bin 1: 8Hz-16Hz, -85.0dB
bin 2: 16Hz-24Hz, -73.0dB
bin 3: 24Hz-32Hz, -65.0dB
...
bin 2499: 20,000Hz-20,008Hz, -49.0dB
The goal is to store an amplitude of each bin from 8Hz through 20,008Hz (1 bin covers 8Hz).
Many rows approach:
For each analyzed audio file, there would be 2,499 rows of 3 columns: "Analysis UID", "Bin ID" and "dB".
For each studio (4), there is one recording daily that is to be appended in the database (that's 4 times 2,499 = 9,996 new rows per day).
After a recording in one studio, the new 2,499 rows are used to show a plot of the frequency response.
My concern is that we also need to make a plot of the averaged dB values of every bin in a single studio for 5-30 days, to see if the frequency response tends to change significantly over time (thus telling us that a calibration is needed in a studio).
I came up with the following data structure for the many rows approach:
"analysis" table:
analysisUID (serial)
studioUID (Foreign key)
analysisTimestamp
"analysis_results" table:
analysisUID (Foreign key)
freq_bin_id (integer)
amplitude_dB (float)
Is this the optimal way of storing data? A single table holding close to 10,000 new rows a day and making averages of 5 or more analyses, grouping by analysisUIDs and freq_bin_ids? That would give me 2,499 rows (each corresponding to a bin and giving me the averaged dB value).
Many columns approach:
I thought I could do it the other way around, breaking the frequency bins into 4 tables (Low, Med Low, Med High, High). Since the Postgres documentation says the column limit is "250 - 1600 depending on column types", it would be realistic to make 4 tables containing around 625 columns (2,499 / 4), each column representing a bin and containing the "dB" value, like so:
"low" table:
analysisUID (Foreign key)
freq_bin_id_1_amplitude_dB (float)
freq_bin_id_2_amplitude_dB (float)
...
freq_bin_id_625_amplitude_dB (float)
"med_low" table:
analysisUID (Foreign key)
freq_bin_id_626_amplitude_dB (float)
freq_bin_id_627_amplitude_dB (float)
...
freq_bin_id_1250_amplitude_dB (float)
etc...
Would the averages be computed faster if the server only has to Group by analysisUIDs and make averages of each column?
Rows are not going to be an issue, however, the way in which you insert said rows could be. If insert time is one of the primary concerns, then make sure you can bulk insert them OR go for a format with fewer rows.
You can potentially store all the data in a jsonb format, especially since you will not be doing any updates to the data. It may be convenient to store it all in one table; however, performance may be worse.
In any case, since you're not updating the data, the (usually default) fillfactor of 100 is appropriate.
I would NOT use the "many column" approach, as the amount of data you're talking about really isn't that much. Using your first example of 2 tables and few columns is very likely the optimal way to store your results.
It may be useful to index the following columns:
analysis_results.freq_bin_id
analysis.analysisTimestamp
As to breaking the data into different sections, it'll depend on what types of queries you're running. If you're looking at ALL freq bins, using multiple tables will just be a hassle and net you nothing.
If you are only querying some freq_bins at a time, it could theoretically help. However, you're basically doing table partitioning, and once you've moved into that land, you might as well make a partition for each frequency band.
If I were you, I'd create your first table structure, fill it with 30 days worth of data and query away. You may (as we often do) be overanalyzing the situation. Postgres can be very, very fast.
Remember, the raw data you're analyzing is something on the order of a few megabytes (5 or less) per day at an absolute maximum. Analyzing 150 MB of data is no sweat for a DB running on modern hardware if it's indexed and stored properly.
The optimizer is going to find the correct rows in the "smaller" table really, really fast and likely cache all of those, then go looking for the child rows, and it'll know exactly what ID's and ranges to search for. If your data is all inserted in chronological order, there's a good chance it'll read it all in very few reads with very few seeks.
My main concern is with the insert speed, as doing 10,000 inserts can take a while if you're not doing bulk inserts.
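The "fill it and query away" advice can be tried on a small scale as follows. This sketch uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption, so that it runs without a server), with executemany standing in for a bulk insert. The table and column names follow the question; the composite primary key, the dates, the dB values, and the reduced bin count are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL (assumption)
conn.executescript("""
    CREATE TABLE analysis (
        analysisUID INTEGER PRIMARY KEY,
        studioUID INTEGER NOT NULL,
        analysisTimestamp TEXT NOT NULL
    );
    CREATE TABLE analysis_results (
        analysisUID INTEGER NOT NULL REFERENCES analysis(analysisUID),
        freq_bin_id INTEGER NOT NULL,
        amplitude_dB REAL NOT NULL,
        PRIMARY KEY (analysisUID, freq_bin_id)
    );
""")

N_BINS = 4  # 2,499 in the question; kept tiny here
for day in range(3):
    cur = conn.execute(
        "INSERT INTO analysis (studioUID, analysisTimestamp) VALUES (?, ?)",
        (1, f"2024-01-0{day + 1}"),  # hypothetical dates
    )
    uid = cur.lastrowid
    # One batch per analysis instead of one statement per row.
    batch = [(uid, b, -80.0 + b + day) for b in range(1, N_BINS + 1)]
    conn.executemany("INSERT INTO analysis_results VALUES (?, ?, ?)", batch)

# Average dB per bin for one studio over a date range.
averages = conn.execute("""
    SELECT r.freq_bin_id, AVG(r.amplitude_dB)
    FROM analysis AS a
    JOIN analysis_results AS r ON r.analysisUID = a.analysisUID
    WHERE a.studioUID = ?
      AND a.analysisTimestamp BETWEEN ? AND ?
    GROUP BY r.freq_bin_id
    ORDER BY r.freq_bin_id
""", (1, "2024-01-01", "2024-01-03")).fetchall()
print(averages)
```

Note that the averaging query groups by freq_bin_id only. Grouping by analysisUID as well would return one row per analysis and bin, with nothing left to average.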
Since the measurements seem well behaved, you could use an array, using the freq_bin as an index (note: array indices are 1-based by default in PostgreSQL).
This has the additional advantage of the array being stored in TOAST storage, keeping the physical table small.
CREATE TABLE herrie
( analysisUID serial NOT NULL PRIMARY KEY
, studioUID INTEGER NOT NULL REFERENCES studio(studioUID)
, analysisTimestamp TIMESTAMP NOT NULL
, decibels float[] -- array with 625 measurements
, UNIQUE (studioUID,analysisTimestamp)
);

Designed PostGIS Database...Points table and polygon tables...How to make more efficient?

This is a conceptual question, but I should have asked it long ago on this forum.
I have a PostGIS database, and I have many tables in it. I have researched some on the use of keys in databases, but I'm not sure how to incorporate keys in the case of the point data that is dynamic and increases with time.
I'm storing point data in one table, and this data grows each day. It's about 10 million rows right now and will probably grow about 10 million rows each year or so. There are lat, lon, time, and the_geom columns.
I have several other tables, each representing different polygon groups (converted shapefiles to tables with shp2pgsql), like counties, states, etc.
I'm writing queries that relate the point data to the spatial tables to see if points are inside of the polygons, resulting in things like "55 points in X polygon in the past 24 hours", etc.
The problem is, I don't have a key that relates the point table to the other tables. I think this is probably inhibiting query efficiency, but I'm not sure.
I know this question is fairly vague, and I'm happy to clarify anything, but I basically have a bunch of points in a table that I'm spatially comparing to other tables, and I'm trying to find the best way to design things.
Thanks for any help!
If you haven't already, you should build a spatial index on both the point and polygon tables.
Anyway, spatial comparisons are usually slower than numerical comparisons.
So adding one or more keys to the point table referencing the other tables, and using them in your select queries instead of spatial operations, will surely speed things up.
Obviously, inserts will be slower, but, given the numbers you gave (10 million rows per year), it should not be a problem.
Probably, just adding a foreign key to the smallest entities (cities for example) and joining the others to get results (countries, states...) will be faster than spatial comparison.
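The key-based approach might look like the sketch below, again using Python's sqlite3 as a self-contained stand-in for PostgreSQL/PostGIS (an assumption). It presumes that each point's city_id was resolved once, at insert time, by a spatial lookup, which is not shown here. All table, column, and index names and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL/PostGIS (assumption)
conn.executescript("""
    CREATE TABLE states (state_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE cities (
        city_id INTEGER PRIMARY KEY,
        state_id INTEGER NOT NULL REFERENCES states(state_id)
    );
    CREATE TABLE points (
        point_id INTEGER PRIMARY KEY,
        city_id INTEGER REFERENCES cities(city_id),  -- set once at insert time
        time INTEGER NOT NULL
    );
    CREATE INDEX idx_points_city_time ON points (city_id, time);
""")
conn.executemany("INSERT INTO states VALUES (?, ?)", [(1, "X"), (2, "Y")])
conn.executemany("INSERT INTO cities VALUES (?, ?)", [(10, 1), (11, 1), (20, 2)])
conn.executemany(
    "INSERT INTO points (city_id, time) VALUES (?, ?)",
    [(10, 100), (10, 5), (11, 90), (20, 95), (20, 1)],  # made-up points
)

since = 50  # hypothetical cutoff standing in for "the past 24 hours"
counts = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM points AS p
    JOIN cities AS c ON c.city_id = p.city_id
    JOIN states AS s ON s.state_id = c.state_id
    WHERE p.time >= ?
    GROUP BY s.name
    ORDER BY s.name
""", (since,)).fetchall()
print(counts)
```

The query only performs integer joins, so the spatial test is paid once per point rather than on every report. The trade-off is that the stored keys must be recomputed if polygon boundaries change.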
Foreign keys (and other constraints) are not needed to query. Moreover, they arise as a consequence of whatever design is appropriate to the application per principles of good design.
They just tell the DBMS that a list of values under a list of columns in a table also appear elsewhere as a list of values under a list of columns in some table. (For avoiding errors and improving optimization.)
You still would want indices on columns that will be involved in joins. E.g., you might want X coordinates in two tables to have sorted indices, in the same order. This is independent of whether one column's values form a subset of another's, i.e., whether a foreign key constraint holds between them.

Database solution to store and aggregate vectors?

I'm looking for a way to solve a data storage problem for a project.
The Data:
We have a batch process that generates 6000 vectors of size 3000 each daily. Each element in the vectors is a DOUBLE. For each of the vectors, we also generate tags like "Country", "Sector", "Asset Type" and so on (It's financial data).
The Queries:
What we want to be able to do is see aggregates by tag of each of these vectors. So for example if we want to see the vectors by sector, we want to get back a response that gives us all the unique sectors and a 3000x1 vector that is the sum of all the vectors of each element tagged by that sector.
What we've tried:
It's easy enough to implement a normalized star schema with 2 tables, one with the tagging information and an ID and a second table that has "VectorDate, ID, ElementNumber, Value" which will have a row to represent each element for each vector. Unfortunately, given the size of the data, it means we add 18 million records to this second table daily. And since our queries need to read (and add up) all 18 million of these records, it's not the most efficient of operations when it comes to disk reads.
Sample query:
SELECT T1.country, T2.ElementNumber, SUM(T2.Value)
FROM T1 INNER JOIN T2 ON T1.ID=T2.ID
WHERE VectorDate = 20140101
GROUP BY T1.country, T2.ElementNumber
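To make the intended result concrete, here is a plain Python sketch of the same aggregation when each vector is kept whole instead of being split into one row per element. It only illustrates the target output and is not a recommendation for a particular database. The function name sum_vectors_by_tag and the sample data are hypothetical.

```python
def sum_vectors_by_tag(vectors, tags):
    """Element-wise sum of vectors grouped by tag.

    vectors: dict mapping vector ID -> list of floats (all the same length).
    tags:    dict mapping vector ID -> tag value (e.g. a country).
    Returns a dict mapping tag -> summed vector.
    """
    totals = {}
    for vec_id, vec in vectors.items():
        tag = tags[vec_id]
        if tag not in totals:
            totals[tag] = list(vec)  # copy so the input is not modified
        else:
            acc = totals[tag]
            for i, value in enumerate(vec):
                acc[i] += value
    return totals


# Made-up data: three vectors of length 3 (3000 in the question).
vectors = {1: [1.0, 2.0, 3.0], 2: [10.0, 20.0, 30.0], 3: [0.5, 0.5, 0.5]}
tags = {1: "US", 2: "US", 3: "UK"}
totals = sum_vectors_by_tag(vectors, tags)
print(totals)
```

With this layout a day's batch is 6000 records rather than 18 million, and the per-element addition happens in memory after each vector has been read as a unit.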
I've looked into NoSQL solutions (which I don't have experience with) and seen that some, like MongoDB, allow for storing entire vectors as part of a single document. However, I'm unsure whether they would efficiently support the kind of aggregation we need (adding each element of the vector in a document to the corresponding element of other documents' vectors). I read that the $unwind operation required isn't that efficient either?
It would be great if someone could point me in the direction of a database solution that can help us solve our problem efficiently.
Thanks!