Best way to store data: many columns vs many rows for a case of 10,000 new rows a day - PostgreSQL

After checking a lot of similar questions on Stack Overflow, it seems that context determines which way is best to hold the data.
Short story: I add over 10,000 new rows of data a day to a very simple table containing only 3 columns. I will NEVER update the rows, only run selects, grouping and averages. I'm looking for the best way to store this data so that the average calculations are as fast as possible.
To put you in context, I'm analyzing a recorded audio file (pink noise playback in a sound mixing studio) using FFTs. The results for a single audio file are always in the same format: the frequency bin's ID (integer) and its value in decibels (float). I want to store these values in a PostgreSQL DB.
Each bin (band) of frequencies (width = 8Hz) gets an amplitude in decibels. The first bin is ignored, so it goes like this (not actual dB values):
bin 1: 8Hz-16Hz, -85.0dB
bin 2: 16Hz-24Hz, -73.0dB
bin 3: 24Hz-32Hz, -65.0dB
...
bin 2499: 20,000Hz-20,008Hz, -49.0dB
The goal is to store the amplitude of each bin from 8Hz through 20,008Hz (each bin covers 8Hz).
Many rows approach
For each analyzed audio file, there would be 2,499 rows of 3 columns: "Analysis UID", "Bin ID" and "dB".
For each studio (there are 4), there is one recording daily that is to be appended to the database (that's 4 times 2,499 = 9,996 new rows per day).
After a recording in one studio, the new 2,499 rows are used to show a plot of the frequency response.
My concern is that we also need to make a plot of the averaged dB values of every bin in a single studio for 5-30 days, to see if the frequency response tends to change significantly over time (thus telling us that a calibration is needed in a studio).
I came up with the following data structure for the many rows approach:
"analysis" table:
analysisUID (serial)
studioUID (Foreign key)
analysisTimestamp
"analysis_results" table:
analysisUID (Foreign key)
freq_bin_id (integer)
amplitude_dB (float)
Is this the optimal way of storing the data? A single table holding close to 10,000 new rows a day, computing averages over 5 or more analyses by filtering on their analysisUIDs and grouping by freq_bin_id? That would give me 2,499 rows (each corresponding to a bin with its averaged dB value).
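For reference, the averaging query would look roughly like this (the studio ID and 30-day window are placeholders):
-- Average dB per frequency bin for one studio over the last 30 days.
SELECT r.freq_bin_id,
       AVG(r.amplitude_dB) AS avg_amplitude_dB
FROM analysis_results r
JOIN analysis a ON a.analysisUID = r.analysisUID
WHERE a.studioUID = 1
  AND a.analysisTimestamp >= now() - interval '30 days'
GROUP BY r.freq_bin_id
ORDER BY r.freq_bin_id;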
Many columns approach:
I thought I could do it the other way around, breaking the frequency bins into 4 tables (Low, Med Low, Med High, High). Since the Postgres documentation says the column limit is "250 - 1600 depending on column types", it would be realistic to make 4 tables containing around 625 columns each (2,499 / 4), each column representing a bin and holding its "dB" value, like so:
"low" table:
analysisUID (Foreign key)
freq_bin_id_1_amplitude_dB (float)
freq_bin_id_2_amplitude_dB (float)
...
freq_bin_id_625_amplitude_dB (float)
"med_low" table:
analysisUID (Foreign key)
freq_bin_id_626_amplitude_dB (float)
freq_bin_id_627_amplitude_dB (float)
...
freq_bin_id_1250_amplitude_dB (float)
etc...
Would the averages be computed faster if the server only has to group by analysisUID and average each column?

Rows are not going to be an issue; however, the way in which you insert said rows could be. If insert time is one of the primary concerns, make sure you can bulk insert them OR go for a format with fewer rows.
You can potentially store all the data in jsonb format, especially since you will not be doing any updates to the data; it may be convenient to store an entire analysis in a single row, but performance may be lower.
In any case, since you're not updating the data, the (usually default) fillfactor of 100 is appropriate.
I would NOT use the "many column" approach, as the amount of data you're talking about really isn't that much. Using your first example of 2 tables and few columns is very likely the optimal way to get your results.
It may be useful to index the following columns:
analysis_results.freq_bin_id
analysis.analysisTimestamp
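For example (index names are illustrative):
-- Plain btree indexes on the suggested columns.
CREATE INDEX idx_analysis_results_freq_bin ON analysis_results (freq_bin_id);
CREATE INDEX idx_analysis_timestamp ON analysis (analysisTimestamp);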
As to breaking the data into different sections, it'll depend on what types of queries you're running. If you're looking at ALL freq bins, using multiple tables will just be a hassle and net you nothing.
If you only query some freq_bins at a time, it could theoretically help; however, you're basically doing table partitioning, and once you've moved into that land you might as well make a partition for each frequency band.
If I were you, I'd create your first table structure, fill it with 30 days worth of data and query away. You may (as we often do) be overanalyzing the situation. Postgres can be very, very fast.
Remember, the raw data you're analyzing is on the order of a few (5 or fewer) MB per day at an absolute maximum. Analyzing 150 MB of data is no sweat for a DB running on modern hardware if it's indexed and stored properly.
The optimizer is going to find the correct rows in the "smaller" table really, really fast and likely cache all of those, then go looking for the child rows, and it'll know exactly what IDs and ranges to search for. If your data is all inserted in chronological order, there's a good chance it'll read it all in very few reads with very few seeks.
My main concern is insert speed, as doing 10,000 individual inserts can take a while if you're not bulk inserting.
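For example, a minimal sketch of a bulk load (the CSV path is a placeholder):
-- Load one analysis (2,499 rows) in a single COPY statement.
COPY analysis_results (analysisUID, freq_bin_id, amplitude_dB)
FROM '/path/to/analysis_42.csv' WITH (FORMAT csv);
-- Or use one multi-row INSERT instead of 2,499 separate statements:
INSERT INTO analysis_results (analysisUID, freq_bin_id, amplitude_dB)
VALUES (42, 1, -85.0), (42, 2, -73.0), (42, 3, -65.0); -- ...and so on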

Since the measurements seem well behaved, you could use an array, using the freq_bin as the array index (note: array indices are 1-based in SQL).
This has the additional advantage that the array is stored in TOAST storage, keeping the physical table small.
CREATE TABLE herrie
( analysisUID serial NOT NULL PRIMARY KEY
, studioUID INTEGER NOT NULL REFERENCES studio(studioUID)
, analysisTimestamp TIMESTAMP NOT NULL
, decibels float[] -- array with 2,499 measurements (one per bin)
, UNIQUE (studioUID,analysisTimestamp)
);
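For the averages, the array can be expanded back into (bin, value) pairs; a sketch, with the studio ID and date range as placeholders:
-- Average dB per bin for one studio over the last 30 days.
SELECT bin.freq_bin_id,
       AVG(bin.amplitude_dB) AS avg_amplitude_dB
FROM herrie h,
     unnest(h.decibels) WITH ORDINALITY AS bin(amplitude_dB, freq_bin_id)
WHERE h.studioUID = 1
  AND h.analysisTimestamp >= now() - interval '30 days'
GROUP BY bin.freq_bin_id
ORDER BY bin.freq_bin_id;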

Related

Randomly sampling n rows in impala using random() or tablesample system()

I would like to randomly sample n rows from a table using Impala. I can think of two ways to do this, namely:
SELECT * FROM TABLE ORDER BY RANDOM() LIMIT <n>
or
SELECT * FROM TABLE TABLESAMPLE SYSTEM(1) limit <n>
In my case I set n to 10000 and sample from a table of over 20 million rows. If I understand correctly, the first option essentially creates a random number between 0 and 1 for each row and orders by this random number.
The second option creates many different 'buckets' and then randomly samples at least 1% of the data (in practice this always seems to be much greater than the percentage provided). In both cases I then select only the 10000 first rows.
Is the first option reliable to randomly sample the 10K rows in my case?
Edit: some additional context. The structure of the data is why the random sampling or shuffling of the entire table seems quite important to me. Additional rows are added to the table daily. For example, one of the columns is country, and usually the incoming rows are first all from country A, then from country B, etc. For this reason I am worried that the second option would sample too many rows from a single country, rather than randomly. Is that a justified concern?
Related thread that reveals the second option: What is the best query to sample from Impala for a huge database?
I beg to differ, OP. I prefer the second option.
With the first option, you are assigning a value between 0 and 1 to every row and then picking the first 10,000 records. So basically, Impala has to process all rows in the table, and the operation will be slow on a 20-million-row table.
With the second option, Impala randomly picks rows from the underlying files based on the percentage you provide. Since this works at the file level, the returned row count may differ from the percentage you specified. This method is also what Impala uses to compute statistics. So performance-wise it is much better, but the correctness of the randomness can be a problem.
Final thought -
If you are worried about the randomness and correctness of your sample, go for option 1. But if you are not too worried about randomness and just want sample data with quick performance, pick the second option. Since Impala uses it for COMPUTE STATS, I'd pick this one :)
EDIT: After looking at your requirement, I have a method to sample over a particular field or fields.
We will use a window function to assign a random row number within each country group, then pick 1% (or whatever % you want) from that data set.
This makes sure the data is evenly distributed between countries and that each country contributes the same % of rows to the result set.
select *
from (
    select
        row_number() over (partition by country order by random()) as rn,
        count(*) over (partition by country) as cntpartition,
        tab.*
    from dat.mytable tab
) rs
where rs.rn between 1 and rs.cntpartition * 1/100 -- this is for 1% of the data
[screenshot from my data omitted]
HTH

Time Series Predictions in PostgreSQL

I am new to PostgreSQL and database systems, and I am currently trying to create a database to store observed values as well as all predictions made in the past for some time series.
I have already built a table (actually a view) for observed values, with rows looking basically like:
(time, object, value)
Now I want to store predictions, i.e. for each time, what has been predicted by some software for the next N time steps, where N is variable since the software has different prediction types.
I have thought about multiple solutions, which are the following:
Store each prediction as a row, using max(N) = 240 columns, i.e. (time, object, value 1, value 2, ..., value 240).
Store each prediction as a row, with the prediction values as binary JSON, i.e. (time, object, JSONB prediction).
Store each prediction value as a row, with a column specifying the delay of the prediction in hours, i.e
(time, object, delay, value).
I don't know how each of these choices would affect performance when I retrieve and compute summary values on the predictions. A typical thing I would like to do is retrieve the performance of the predictions for some delay, i.e. how big the prediction error is when we predict x days ahead, and I need this query to execute pretty fast so it can be displayed in a dashboard.
Which choice do you think is the best? Or do you have any other idea?
Thanks a lot!
Without further information about the access patterns for the collected data, I would strongly recommend using jsonb.
Using one column per timestep will result in bloat of the system catalog and statistics.
If you need to filter on the values of the predictions, you also don't want to maintain 240 indexes.
If you don't need to use these values within a WHERE condition you may use json instead of jsonb.
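A minimal sketch of the jsonb variant and a typical dashboard query, with illustrative table and column names (predictions are keyed by their delay in hours; observed is the view mentioned in the question, adjust the names as needed):
-- One row per prediction run; the jsonb maps delay (hours) to predicted value.
CREATE TABLE prediction (
    pred_time timestamptz NOT NULL,
    object    integer     NOT NULL,
    predicted jsonb       NOT NULL,  -- e.g. {"1": 10.3, "2": 11.1, ..., "240": 9.8}
    PRIMARY KEY (pred_time, object)
);
-- Mean absolute error for a fixed delay of 24 hours.
SELECT p.object,
       AVG(ABS((p.predicted ->> '24')::float - o.value)) AS mean_abs_error
FROM prediction p
JOIN observed o
  ON o.object = p.object
 AND o.time   = p.pred_time + interval '24 hours'
GROUP BY p.object;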

Create view or table

I have a table that will have about 3 * 10^12 rows (3 trillion), but with only 3 attributes.
This table holds the IDs of 2 individuals and the similarity between them (a number between 0 and 1 that I multiplied by 100 and stored as a smallint to save space).
For a given individual I want to research, I need to summarize these columns and return how many individuals have up to 10% similarity, 20%, 30%, and so on. These thresholds are fixed (every 10%) up to identical individuals (100%).
However, as you may know, the query will be very slow, so I thought about:
Create a new table to save summarized values
Create a VIEW to save these values.
As there are about 1.7 million individuals, the search would not be so time consuming (if indexed, it returns quite fast). So, what should I do?
I would like to point out that my population will be almost fixed (after the DB is fully populated, almost no further additions are expected).
A view won't help, but a materialized view sounds like it would fit the bill, if you can afford a sequential scan of the large table whenever the materialized view gets updated.
It should probably contain a row per user with a count for each percentile range.
Alternatively, you could store the aggregated data in an independent table that is updated by a trigger on the large table whenever something changes there.
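For example, a minimal sketch of such a materialized view, assuming the large table is called similarity with columns id_a, id_b and similarity (0-100 smallint); the names are illustrative since the question doesn't give them:
-- One row per individual and 10%-bucket, with the number of matches in that bucket.
CREATE MATERIALIZED VIEW similarity_summary AS
SELECT id_a                   AS individual,
       (similarity / 10) * 10 AS bucket,  -- 0, 10, 20, ..., 100
       COUNT(*)               AS n_individuals
FROM similarity
GROUP BY id_a, (similarity / 10) * 10;
CREATE INDEX ON similarity_summary (individual);
-- Rebuild after the large table changes (this rescans the large table):
REFRESH MATERIALIZED VIEW similarity_summary;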

Selecting the proper db index

I have a table with 10+ million tuples in my Postgres database that I will be querying. There are 3 fields: "layer" (integer), "time", and "cnt". Many records share the same values for "layer" (distributed from 0 to about 5 or so, heavily concentrated between 0-2). "time" has relatively unique values, but during queries the values will be manipulated such that some become duplicates, and they are then grouped to account for those duplicates. "cnt" is just used for counting.
I am trying to query records from certain layers (WHERE layer = x) between certain times (WHERE time <= y AND time >= z), and I will be using "time" as my GROUP BY field. I currently have 4 indexes, one each on (time), (layer), (time, layer), and (layer, time), and I believe this is too many (I copied this from a template provided by my supervisor).
From what I have read online, fields with relatively unique values, as well as fields that are frequently-searched, are good candidates for indexing. I have also seen that having too many indexes will hinder the performance of my query, which is why I know I need to drop some.
This leads me to believe that the best index choice would be on (time, layer) (I assume a btree is fine because I have not seen reason to use anything else), because while I query slightly more frequently on layer, time better fits the criterion of having more relatively unique values. Or, should I just have 2 indices, 1 on layer and 1 on time?
Also, is an index on (time, layer) any different from (layer, time)? Because that is one of the confusions that led me to have so many indices. The provided template has multiple indices with the same 3 attributes, just arranged in different orders...
Your where clause appears to be:
WHERE layer = x and time <= y AND time >= z
For this query, you want an index on (layer, time). You could include cnt in the index so the index covers the query -- that is, all data columns are in the index so the original data pages don't need to be accessed for the data (they may be needed for locking information).
Your original four indexes are redundant, because the single-column indexes are not needed. The advice to create all four is not good advice. However, (layer, time) and (time, layer) are different indexes and under some circumstances, it is a good idea to have both.
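For example (the table name events is a placeholder; the INCLUDE syntax requires PostgreSQL 11+):
-- Equality column first, then the range column; INCLUDE makes it a covering index.
CREATE INDEX idx_layer_time ON events (layer, time) INCLUDE (cnt);
-- On older versions, adding cnt as a trailing key column achieves the same coverage:
-- CREATE INDEX idx_layer_time_cnt ON events (layer, time, cnt);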

Database solution to store and aggregate vectors?

I'm looking for a way to solve a data storage problem for a project.
The Data:
We have a batch process that generates 6000 vectors of size 3000 each daily. Each element in the vectors is a DOUBLE. For each of the vectors, we also generate tags like "Country", "Sector", "Asset Type" and so on (It's financial data).
The Queries:
What we want to be able to do is see aggregates by tag for each of these vectors. So for example, if we want to see the vectors by sector, we want to get back a response that gives us all the unique sectors and, for each one, a 3000x1 vector that is the element-wise sum of all the vectors tagged with that sector.
What we've tried:
It's easy enough to implement a normalized star schema with 2 tables, one with the tagging information and an ID and a second table that has "VectorDate, ID, ElementNumber, Value" which will have a row to represent each element for each vector. Unfortunately, given the size of the data, it means we add 18 million records to this second table daily. And since our queries need to read (and add up) all 18 million of these records, it's not the most efficient of operations when it comes to disk reads.
Sample query:
SELECT T1.country, T2.ElementNumber, SUM(T2.Value)
FROM T1 INNER JOIN T2 ON T1.ID=T2.ID
WHERE VectorDate = 20140101
GROUP BY T1.country, T2.ElementNumber
I've looked into NoSQL solutions (which I don't have experience with) but seen that some, like MongoDB, allow storing entire vectors as part of a single document. However, I'm unsure whether they would handle aggregations like ours efficiently (adding each element of a document's vector to the corresponding element of other documents' vectors). I've also read that the required $unwind operation isn't that efficient.
It would be great if someone could point me in the direction of a database solution that can help us solve our problem efficiently.
Thanks!