I have a PostgreSQL database where physical activities store a certain energy decimal value, e.g.
ACTIVITY    ENERGY
------------------
Activity1   0.7
Activity2   1.3
Activity3   4.5
I have a Classification system that classifies each Energy value as
Light: 0 - 2.9
Moderate: 3.0 - 5.9
Vigorous: >= 6.0
The Classification and Energy Values are subject to change. I need a way to quickly get the Type of each activity. But how to store these in a way which is easy to retrieve?
One solution is to define MIN/MAX lookups of the type "CLASSIFICATION" -- pull up all available classifications, then do a CASE/WHEN to go through each one.
LOOKUP_ID  LOOKUP_NAME   LOOKUP_VALUE  LOOKUP_TYPE
---------------------------------------------------
1          LIGHT_MIN     0             CLASSIFICATION
2          LIGHT_MAX     2.9           CLASSIFICATION
3          MODERATE_MIN  3             CLASSIFICATION
4          MODERATE_MAX  5.9           CLASSIFICATION
5          VIGOROUS_MIN  6             CLASSIFICATION
6          VIGOROUS_MAX  null          CLASSIFICATION
But this doesn't look very easy to me -- if a developer needs to get the current Classification, they'll have to step through the different cases and compare each one.
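For illustration, retrieving a classification with this lookup model would look something like the sketch below (the table and column names here are placeholders, not part of my actual schema):
select a.activity,
       a.energy,
       case
         when a.energy <= (select lookup_value from lookup
                           where lookup_name = 'LIGHT_MAX')    then 'Light'
         when a.energy <= (select lookup_value from lookup
                           where lookup_name = 'MODERATE_MAX') then 'Moderate'
         else 'Vigorous'
       end as classification
from activity a;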
Is there a better strategy to capture these ranges, or is this the right one?
Use a range type
create table classification
(
description text,
energy numrange
);
insert into classification
(description, energy)
values
('Light', numrange(0,3.0,'[)')),
('Moderate', numrange(3.0, 6.0, '[)')),
('Vigorous', numrange(6.0, null, '[)'));
Then you can join those two tables using the <@ ("is contained by") operator:
select *
from activity a
join classification c on a.energy <@ c.energy
The nice thing about the range type is that you can prevent inserting overlapping ranges by using an exclusion constraint:
alter table classification
add constraint check_range_overlap
exclude using gist (energy with &&);
Given the above sample data, the following insert would be rejected:
insert into classification
(description, energy)
values
('Strenuous', numrange(8.0, 11.0, '[)'));
I don't think this is a great solution, but it seems preferable to the model above.
Create a table with the ranges and classifications:
create table classification (
energy_min numeric,
energy_max numeric,
classification text
);
Then do a join on that table as follows:
select
a.activity, a.energy, c.classification
from
activities a
left join classification c on
a.energy >= c.energy_min and
(a.energy <= c.energy_max or c.energy_max is null);
If the set of possible classifications is relatively small, this should work well enough. I don't think it's efficient on the back end, as it's likely doing a cross join against the classification table. That said, with three (or even ten) records it's not that big of a deal.
It should scale very well and enable you to modify values on the fly and get results quickly.
If you really want to get fancy, you can also include effective-from and effective-thru dates on the "classification" table, which let you change the classifications while retaining historical classifications for older records.
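A sketch of what that could look like (the effective_from/effective_thru columns, and an activity_date column on activities, are assumptions for illustration, not part of the original model):
create table classification (
  energy_min     numeric,
  energy_max     numeric,
  classification text,
  effective_from date not null,
  effective_thru date            -- null = currently effective
);

select
  a.activity, a.energy, c.classification
from
  activities a
  left join classification c on
    a.energy >= c.energy_min and
    (a.energy <= c.energy_max or c.energy_max is null) and
    a.activity_date >= c.effective_from and
    (a.activity_date <= c.effective_thru or c.effective_thru is null);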
I'm trying to apply a rank that is based on 3 other columns.
I've tried to use the formula
{FIXED Column1,Column2 : RANK(MIN(Column3),'asc') }
but I got the error message "level of detail expressions cannot contain table calculations or the attr function" in Tableau.
What I want to do is have the rank based on the Column1 and Column2 columns, ranking by the dates (Column3).
Here is an example of the data (hope it helps).
You can't call a table calc function from within a LOD calculation.
LOD calcs execute at the data source relatively early in the order of operations. Table calcs execute much later on the client side, operating on the summary query results returned by the data source.
So essentially, table calcs can see the results of LOD calcs and take them as input, but not the other way around.
Table calcs operate on multiple rows in a summary table at a time, and so can compute values that look across whole sections of that table, such as ranks, running sums, percentages, etc. Table calcs are the only calculations native to Tableau that take the order of rows in a table into account. Read the help material on table calcs to learn about partitioning and addressing - essential concepts for using table calcs.
All you have to do to create a quick table calc for RANK(Column3) is: right-click the measure, select Edit Table Calculation --> select Specific Dimensions and check both Column1 and Column2 --> Restart every Column2.
I would like to randomly sample n rows from a table using Impala. I can think of two ways to do this, namely:
SELECT * FROM TABLE ORDER BY RANDOM() LIMIT <n>
or
SELECT * FROM TABLE TABLESAMPLE SYSTEM(1) limit <n>
In my case I set n to 10000 and sample from a table of over 20 million rows. If I understand correctly, the first option essentially creates a random number between 0 and 1 for each row and orders by this random number.
The second option creates many different 'buckets' and then randomly samples at least 1% of the data (in practice this always seems to be much greater than the percentage provided). In both cases I then select only the 10000 first rows.
Is the first option reliable to randomly sample the 10K rows in my case?
Edit: some additional context. The structure of the data is why the random sampling or shuffling of the entire table seems quite important to me. Additional rows are added to the table daily. For example, one of the columns is country, and usually the incoming rows are first all from country A, then from country B, etc. For this reason I am worried that the second option would sample too many rows from a single country, rather than randomly. Is that a justified concern?
Related thread that reveals the second option: What is the best query to sample from Impala for a huge database?
I beg to differ, OP. I prefer the second option.
First option: you are assigning a value between 0 and 1 to every row and then picking the first 10,000 records. So basically, Impala has to process all rows in the table, and the operation will be slow on a 20 million row table.
Second option: Impala randomly picks rows from the underlying files based on the percentage you provide. Since this works at the file level, the returned row count may differ from the percentage you asked for. Also, this is the method Impala uses to compute statistics. So performance-wise it is much better, but the correctness of the randomness can be a problem.
Final thought -
If you are worried about the randomness and correctness of your sample, go for option 1. But if you are not too worried about randomness and want sample data with quick performance, then pick the second option. Since Impala uses it for COMPUTE STATS, I pick that one :)
EDIT: After looking at your requirement, I have a method to sample over a particular field or fields.
We will use a window function to assign a random row number within each country group, then pick 1% (or whatever % you want) from each group.
This will make sure the data is evenly distributed between countries and each country has the same % of rows in the result data set.
select * from
(
  select
    row_number() over (partition by country order by country, random()) rn,  -- random position within each country
    count(*) over (partition by country) cntpartition,                       -- total rows per country
    tab.*
  from dat.mytable tab
) rs
where rs.rn between 1 and rs.cntpartition * 1/100 -- this is for 1% of each country's data
HTH
I am reading through Postgres' query optimizer's statistical estimator's code to understand how it works.
For reference, Postgres' query optimizer's statistical estimator estimates the size of the output of an operation (e.g. join, select) in a Postgres plan tree. This allows Postgres to choose between the different ways a query can be executed.
Postgres' statistical estimator uses cached statistics about the contents of each relation's columns to help estimate output size. The two key saved data structures seem to be:
A most common value (MCV) list: a list of each of the most common values stored in that column and the frequency that they appear in the column.
A histogram of the data stored in that column.
For example, given the table:
X Y
1 A
1 B
1 C
2 A
2 D
3 B
The most common values list for X would contain {1: 0.5, 2: 0.333}.
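For reference, these per-column statistics can be inspected through the pg_stats view; for example, for a hypothetical table a with a column x:
-- a and x are placeholder names
SELECT most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'a' AND attname = 'x';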
However, when Postgres completes the first join in a multi join operation like in the example below:
SELECT *
FROM A, B, C, D
WHERE A.ID = B.ID AND B.ID2 = C.ID2 AND C.ID3 = D.ID3
the resulting intermediate relation does not have an MCV list (or histogram), since we've just created it and haven't ANALYZEd it! This will make it harder to estimate the output size/cost of the remaining joins.
Does Postgres automatically generate/estimate the MCV (and histogram) for this table to help statistical estimation? If it does, how does it create this MCV?
For reference, here's what I've looked at so far:
The documentation giving a high level overview of how Postgres statistical planner works:
https://www.postgresql.org/docs/12/planner-stats-details.html
The code which carries out the majority of Postgres' statistical estimation:
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/selfuncs.c
The code which generates a relation's MCV:
https://github.com/postgres/postgres/blob/master/src/backend/statistics/mcv.c
Generic logic for clause selectivities:
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/clausesel.c
A pointer to the right code file to look at would be much appreciated! Many thanks for your time. :)
The result of a join is called a join relation in PostgreSQL jargon, but that does not mean that it is a “materialized” table that is somehow comparable to a regular PostgreSQL table (which is called a base relation).
In particular, since the join relation does not physically exist, it cannot be ANALYZEd to collect statistics. Rather, the row count is estimated based on the size of the joined relations and the selectivity of the join conditions. This selectivity is a number between 0 (the condition excludes all rows) and 1 (the condition does not filter out anything).
The relevant code is in calc_joinrel_size_estimate in src/backend/optimizer/path/costsize.c, which you are invited to study.
The key points are:
Join conditions that correspond to foreign keys are considered specially:
If all columns in a foreign key are join conditions, then we know that the result of such a join must be as big as the referencing table, so the selectivity is 1 / (size of the referenced table).
Other join conditions are estimated separately by guessing what percentage of rows will be eliminated by that condition.
In the case of a left (or right) outer join, we know that the result size must be at least as big as the left (or right) side.
Finally, the size of the cartesian join (the product of the relation sizes) is multiplied by all the selectivities calculated above.
Note that this treats all conditions as independent, which causes bad estimates if the conditions are correlated. But since PostgreSQL doesn't have cross-table statistics, it cannot do better.
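To make the arithmetic concrete, here is a rough worked example (the tables and row counts are made up for illustration, not taken from the PostgreSQL source):
-- Hypothetical sizes: a has 1,000 rows, b has 100 rows,
-- and a.b_id is a foreign key referencing b(id).
EXPLAIN
SELECT *
FROM a
JOIN b ON a.b_id = b.id;
-- Cartesian product size:        1,000 * 100     = 100,000
-- FK join condition selectivity: 1 / 100 (rows in the referenced table b)
-- Estimated join rows:           100,000 * 1/100 = 1,000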
I have one data table with various identifiers in 3 columns (called BU, Company, and Group). I created a cross table that sums the Face by 2 layers: an identifier ('Actual' and 'Plan') and a reporting period ('9/30/16' and '9/30/17'). The table was easy, aside from the variance section. I am currently using this formula to compute the variance:
SN(Sum([Face]) - Sum([Face]) OVER (ParallelPeriod([Axis.Columns])),
Sum([Face])) AS [PlanVariance]
Unfortunately, this gives me the correct values in the Plan Variance section of the cross table for the Plan identifier. However, it provides the wrong values for the Actual identifier. (The Actual identifier under Plan Variance is equal to the Actual identifier under the Sum(Face) section.) If I remove the SN function, the Plan Variance is empty for all identifiers that have no Face for a group AND is empty for the Actual section under Plan Variance.
Is there a way to create a cross table that would show the variance for the Plan Identifier ONLY? Can I stop the cross table from calculating the plan variance on the actual segment? Or is there a way to have the actual field hidden in the plan variance section of the final visualization?
Thanks for any help/advice you can provide!
After checking a lot of similar questions on Stack Overflow, it seems that context will tell which way is best to hold the data...
Short story: I add over 10,000 new rows of data to a very simple table containing only 3 columns. I will NEVER update the rows; I only do selects, grouping, and averages. I'm looking for the best way to store this data so that the average calculations are as fast as possible.
To put you in context, I'm analyzing a recorded audio file (Pink Noise playback in a sound mixing studio) using FFTs. The results for a single audio file are always in the same format: the frequency bin's ID (integer) and its value in decibels (float). I want to store these values in a PostgreSQL DB.
Each bin (band) of frequencies (width = 8Hz) gets an amplitude in decibels. The first bin is ignored, so it goes like this (not actual dB values):
bin 1: 8Hz-16Hz, -85.0dB
bin 2: 16Hz-24Hz, -73.0dB
bin 3: 24Hz-32Hz, -65.0dB
...
bin 2499: 20,000Hz-20,008Hz, -49.0dB
The goal is to store an amplitude of each bin from 8Hz through 20,008Hz (1 bin covers 8Hz).
Many rows approach
For each analyzed audio file, there would be 2,499 rows of 3 columns: "Analysis UID", "Bin ID" and "dB".
For each of the 4 studios, there is one recording daily to be appended to the database (that's 4 x 2,499 = 9,996 new rows per day).
After a recording in one studio, the new 2,499 rows are used to show a plot of the frequency response.
My concern is that we also need to make a plot of the averaged dB values of every bin in a single studio for 5-30 days, to see if the frequency response tends to change significantly over time (thus telling us that a calibration is needed in a studio).
I came up with the following data structure for the many rows approach:
"analysis" table:
analysisUID (serial)
studioUID (Foreign key)
analysisTimestamp
"analysis_results" table:
analysisUID (Foreign key)
freq_bin_id (integer)
amplitude_dB (float)
Is this the optimal way of storing the data? A single table holding close to 10,000 new rows a day, with averages computed over 5 or more analyses by grouping on analysisUID and freq_bin_id? That would give me 2,499 rows (each corresponding to a bin and giving me the averaged dB value).
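For reference, the averaging query under this model would be something like the following sketch (the studio id and the 30-day window are placeholders):
select ar.freq_bin_id,
       avg(ar.amplitude_dB) as avg_amplitude_dB
from analysis_results ar
join analysis a on a.analysisUID = ar.analysisUID
where a.studioUID = 1                                   -- placeholder studio
  and a.analysisTimestamp >= now() - interval '30 days'
group by ar.freq_bin_id
order by ar.freq_bin_id;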
Many columns approach:
I thought I could do it the other way around, breaking the frequency bins into 4 tables (Low, Med Low, Med High, High). Since the Postgres documentation says the column limit is "250 - 1600 depending on column types", it would be realistic to make 4 tables containing around 625 columns each (2,499 / 4), each column representing a bin and containing the "dB" value, like so:
"low" table:
analysisUID (Foreign key)
freq_bin_id_1_amplitude_dB (float)
freq_bin_id_2_amplitude_dB (float)
...
freq_bin_id_625_amplitude_dB (float)
"med_low" table:
analysisUID (Foreign key)
freq_bin_id_626_amplitude_dB (float)
freq_bin_id_627_amplitude_dB (float)
...
freq_bin_id_1250_amplitude_dB (float)
etc...
Would the averages be computed faster if the server only has to group by analysisUID and average each column?
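For comparison, the averaging query under this model would look roughly like the sketch below (against the "low" table above; the studio id and date window are placeholders):
select avg(l.freq_bin_id_1_amplitude_dB) as avg_bin_1,
       avg(l.freq_bin_id_2_amplitude_dB) as avg_bin_2
       -- ... one avg() per bin column, 625 in total
from low l
join analysis a on a.analysisUID = l.analysisUID
where a.studioUID = 1
  and a.analysisTimestamp >= now() - interval '30 days';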
Rows are not going to be an issue; however, the way in which you insert said rows could be. If insert time is one of the primary concerns, make sure you can bulk insert them OR go for a format with fewer rows.
You could potentially store all the data in jsonb format, especially since you will not be doing any updates to the data -- it may be convenient to store it all in one table, but the performance may be worse.
In any case, since you're not updating the data, the (usually default) fillfactor of 100 is appropriate.
I would NOT use the "many column" approach, as the amount of data you're talking about really isn't that much. Using your first example of 2 tables and few columns is very likely the optimal way to get your results.
It may be useful to index the following columns:
analysis_results.freq_bin_id
analysis.analysisTimestamp
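A sketch of those indexes (the index names are illustrative):
create index analysis_results_freq_bin_idx on analysis_results (freq_bin_id);
create index analysis_timestamp_idx        on analysis (analysisTimestamp);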
As to breaking the data into different sections, it'll depend on what types of queries you're running. If you're looking at ALL freq bins, using multiple tables will just be a hassle and net you nothing.
If you're only querying some freq_bins at a time, it could theoretically help; however, you're basically doing table partitioning, and once you've moved into that land you might as well make a partition for each frequency band.
If I were you, I'd create your first table structure, fill it with 30 days worth of data and query away. You may (as we often do) be overanalyzing the situation. Postgres can be very, very fast.
Remember, the raw data you're analyzing is something on the order of a few (5 or fewer) MB per day at an absolute maximum. Analyzing 150 MB of data is no sweat for a DB running on modern hardware if it's indexed and stored properly.
The optimizer is going to find the correct rows in the "smaller" table really, really fast and likely cache all of those, then go looking for the child rows, and it'll know exactly what IDs and ranges to search for. If your data is all inserted in chronological order, there's a good chance it'll read it all in very few reads with very few seeks.
My main concern is with the insert speed, as doing 10,000 inserts can take a while if you're not doing bulk inserts.
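A sketch of what a bulk load could look like (the file path, analysisUID, and values are placeholders):
-- One COPY per analysis instead of 2,499 single-row INSERTs:
copy analysis_results (analysisUID, freq_bin_id, amplitude_dB)
from '/path/to/analysis_results.csv' with (format csv);

-- Or a single multi-row INSERT:
insert into analysis_results (analysisUID, freq_bin_id, amplitude_dB)
values (123, 1, -85.0),
       (123, 2, -73.0),
       (123, 3, -65.0);   -- ... and so on for all bins of that analysis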
Since the measurements seem well behaved, you could use an array, using the freq_bin as an index (note: array indices are 1-based in SQL).
This has the additional advantage that the array is kept in TOAST storage, keeping the physical table small.
CREATE TABLE herrie
( analysisUID serial NOT NULL PRIMARY KEY
, studioUID INTEGER NOT NULL REFERENCES studio(studioUID)
, analysisTimestamp TIMESTAMP NOT NULL
, decibels float[] -- array of amplitude measurements, one per frequency bin
, UNIQUE (studioUID,analysisTimestamp)
);
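To average per bin with this layout, one option is unnest(...) WITH ORDINALITY to recover the bin index, as in the sketch below (the studio id and date window are placeholders):
SELECT bin.freq_bin_id,
       AVG(bin.amplitude_dB) AS avg_amplitude_dB
FROM herrie h
CROSS JOIN LATERAL unnest(h.decibels)
     WITH ORDINALITY AS bin(amplitude_dB, freq_bin_id)
WHERE h.studioUID = 1
  AND h.analysisTimestamp >= now() - INTERVAL '30 days'
GROUP BY bin.freq_bin_id
ORDER BY bin.freq_bin_id;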