Understanding storage requirements of a PostgreSQL table - postgresql

I have a PostgreSQL table that stores OHLCV data from an exchange. The table has recently become large and I'm trying to reduce its size by understanding how the different table components contribute to the total size.
This is the table's schema:
+-----------+-----------------------------+-------------------------+
| Column | Type | Modifiers |
|-----------+-----------------------------+-------------------------|
| ts | timestamp without time zone | not null default now() |
| millisecs | bigint | not null |
| open | numeric | not null |
| high | numeric | not null |
| low | numeric | not null |
| close | numeric | not null |
| volume | numeric | not null |
| symbol_id | integer | not null |
+-----------+-----------------------------+-------------------------+
The size of the table, excluding indexes and TOAST, is:
SELECT pg_size_pretty(pg_relation_size('ohlcv'));
+----------------+
| pg_size_pretty |
|----------------|
| 6871 MB |
+----------------+
However, when I get the size of each column and add up the results, I get:
SELECT
pg_size_pretty(
sum(pg_column_size(ts)) +
sum(pg_column_size(millisecs)) +
sum(pg_column_size(open)) +
sum(pg_column_size(high)) +
sum(pg_column_size(low)) +
sum(pg_column_size(close)) +
sum(pg_column_size(volume)) +
sum(pg_column_size(symbol_id))
) FROM ohlcv;
+----------------+
| pg_size_pretty |
|----------------|
| 3769 MB |
+----------------+
This is a fairly large difference in size. I realized NUMERIC is a variable-size type if the precision and scale are not specified, so I figured the difference in sizes is due to column padding. I ran the following query to test my theory:
SELECT
pg_size_pretty((
max(pg_column_size(ts)) +
max(pg_column_size(millisecs)) +
max(pg_column_size(open)) +
max(pg_column_size(high)) +
max(pg_column_size(low)) +
max(pg_column_size(close)) +
max(pg_column_size(volume)) +
max(pg_column_size(symbol_id))
) * count(*)) FROM ohlcv;
+----------------+
| pg_size_pretty |
|----------------|
| 5797 MB |
+----------------+
This query gets the maximum number of bytes for each column and then multiplies that by the number of rows.
The result is closer to the table size but still more than 1 GB smaller.
Finally, there's this query that returns a number somewhere between the two:
SELECT pg_size_pretty(sum(pg_column_size(ohlcv.*))) FROM ohlcv;
+----------------+
| pg_size_pretty |
|----------------|
| 6240 MB |
+----------------+
Can someone help me understand in detail the differences between these sizes? I understand there's overhead per-table, but what accounts for the difference between the results of the last 2 queries?
By the way, I have already vacuumed the table with
VACUUM FULL VERBOSE ohlcv;
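For reference, pg_class gives a rough idea of how many rows fit on each 8 kB heap page, and therefore how much per-tuple and per-page overhead is in play (reltuples and relpages are planner estimates, so the numbers are only approximate):
SELECT relpages,
       reltuples,
       round(reltuples / relpages) AS approx_rows_per_page
FROM pg_class
WHERE relname = 'ohlcv';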

Related

Any way to find and delete almost similar records with SQL?

I have a table in a Postgres DB that has a lot of almost identical rows. For example:
1. 00Zicky_-_San_Pedro_Danilo_Vigorito_Remix
2. 00Zicky_-_San_Pedro__Danilo_Vigorito_Remix__
3. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx__
4. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx_
5. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__
6. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__
I'm thinking about writing a little Golang script to remove duplicates, but maybe SQL can do it?
Table definition:
\d+ songs
Table "public.songs"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+----------------------------------------+----------+-------------+--------------+-------------
song_id | integer | | not null | nextval('songs_song_id_seq'::regclass) | plain | | |
song_name | character varying(250) | | not null | | extended | | |
fingerprinted | smallint | | | 0 | plain | | |
file_sha1 | bytea | | | | extended | | |
total_hashes | integer | | not null | 0 | plain | | |
date_created | timestamp without time zone | | not null | now() | plain | | |
date_modified | timestamp without time zone | | not null | now() | plain | | |
Indexes:
"pk_songs_song_id" PRIMARY KEY, btree (song_id)
Referenced by:
TABLE "fingerprints" CONSTRAINT "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap
I tried several methods to find duplicates, but those methods only search for exact matches.
There is no operator which is essentially A almost = B. (Well, there is full text search, but that seems a little excessive here.) If the only difference is the number of - and _ characters, then just get rid of them and compare the resulting strings. If they are equal, then one is a duplicate. You can use the replace() function to remove them. So something like this (see demo):
delete
from songs s2
where exists ( select null
               from songs s1
               where s1.song_id < s2.song_id
                 and replace(replace(s1.song_name, '_', ''), '-', '') =
                     replace(replace(s2.song_name, '_', ''), '-', '')
             );
If your table is large this will not be fast, but a functional index may help:
create index song_name_idx on songs
    (replace(replace(song_name, '_', ''), '-', ''));
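If you want to preview which rows would go before actually deleting anything, the same predicate works in a plain SELECT; this is just a sketch reusing the replace() expression from above:
select s2.song_id, s2.song_name
from songs s2
where exists ( select null
               from songs s1
               where s1.song_id < s2.song_id
                 and replace(replace(s1.song_name, '_', ''), '-', '') =
                     replace(replace(s2.song_name, '_', ''), '-', '')
             );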

Dump integer-only sql table to binary file?

I have a PostgreSQL table w/ a bunch of columns that are just different sizes of integer.
Table "public.place2022_final"
Column | Type | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
toff | integer | | |
palette_index | bigint | | |
censorship | boolean | | |
row0 | integer | | |
col0 | integer | | |
row1 | integer | | |
col1 | integer | | |
uint_id | bigint | | |
seqno | bigint | | |
I can export it to a CSV, but for my purposes I really want the data to be small. Is there a way I can create a minimal dump to a binary file, w/ a format something like
<8 bytes for # of rows in table><4 bytes for row 1 toff><8 bytes for row 1 palette_index>...<do that for all fields, then repeat for all rows>.
I also know for a fact that all these bigints can be squashed down to 32-bit ints... so doing that "conversion" here would be nice too.
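PostgreSQL's binary COPY format is not quite the minimal layout described above (it adds a file header, a 2-byte field count per row, and a 4-byte length per field), but as a sketch it avoids the text overhead of CSV and lets you cast the bigints down to int4 on the way out. Column names are taken from the table above; the output path is only an example and must be writable by the server (or use \copy from psql to write client-side):
COPY (
    SELECT toff,
           palette_index::int4,   -- assumes the values really fit in 32 bits
           censorship,
           row0, col0, row1, col1,
           uint_id::int4,         -- will raise an error if any value exceeds int4 range
           seqno::int4
    FROM place2022_final
) TO '/tmp/place2022_final.bin' (FORMAT binary);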

How to get non-aggregated measures?

I calculate my metrics with SQL and publish the resulting table to Tableau Server. Afterward, I use this data source to create charts and dashboards.
For one analysis, I have already calculated the measures per day with SQL. When I use the resulting table in Tableau, it aggregates these measures with SUM by default. However, I don't want a SUM or AVG of the averages, or a SUM of the percentiles.
What I want is the result I would get if I didn't select the date dimension and didn't GROUP BY date in SQL, as shown below.
Here is the query:
SELECT
-- date,
COUNT(DISTINCT id) AS count_of_id,
AVG(timediff_in_sec) AS avg_timediff,
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY timediff_in_sec) AS percentile_25,
PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY timediff_in_sec) AS percentile_50
FROM
(
--subquery
) AS t1
-- GROUP BY date
Here are the first rows of the resulting table:
+------------+--------------+-------------+---------------+---------------+
| date | avg_timediff | count_of_id | percentile_25 | percentile_50 |
+------------+--------------+-------------+---------------+---------------+
| 10/06/2020 | 61,65186364 | 22 | 8,5765 | 13,3015 |
| 11/06/2020 | 127,2913333 | 3 | 15,6045 | 17,494 |
| 12/06/2020 | 306,0348214 | 28 | 12,2565 | 17,629 |
| 13/06/2020 | 13,2664 | 5 | 11,944 | 13,862 |
| 14/06/2020 | 16,728 | 7 | 14,021 | 17,187 |
| 15/06/2020 | 398,6424595 | 37 | 11,893 | 19,271 |
| 16/06/2020 | 293,6925152 | 33 | 12,527 | 17,134 |
| 17/06/2020 | 155,6554286 | 21 | 13,452 | 16,715 |
| 18/06/2020 | 383,8101429 | 7 | 266,048 | 493,722 |
+------------+--------------+-------------+---------------+---------------+
How can I achieve the desired output above?
Drag them all into the dimensions list, then they will be static dimensions. For your use case, you could also just drag the Date field to Rows. Aggregating a single value, which is what you have for each date, returns the same value whatever the aggregation type.

Does the size of a row or column affect aggregation queries in PostgreSQL?

Consider the following table definition:
Column | Type | Collation | Nullable | Default
-----------------+--------------------------+-----------+----------+-------------
id | uuid | | not null |
reference_id | text | | |
data | jsonb | | |
tag | character varying(255) | | |
created_at | timestamp with time zone | | |
updated_at | timestamp with time zone | | |
is_active | boolean | | not null | true
status | integer | | | 0
message | text | | |
batch_id | uuid | | not null |
config | jsonb | | |
The overall table size is over 500M, and the data column in every row holds a JSON document of over 50MB.
Questions -
Does the size of the data column affect aggregations such as count?
Assume we are running the below query -
select count(*)
from table
where batch_id = '88f30539-32d7-445c-8d34-f1da5899175c';
Does the size of the data column affect aggregations such as sum?
Assume we are running the below queries -
Query 1 -
select sum((data->>'count')::int)
from table
where batch_id = '88f30539-32d7-445c-8d34-f1da5899175c';
Query 2 -
select sum(jsonb_array_length(data->'some_array'))
from table
where batch_id = '88f30539-32d7-445c-8d34-f1da5899175c';
The best way to know is to measure.
Once the data is large enough to always be TOASTed, then its size will no longer affect the performance of queries which do not need to access the TOASTed data contents, like your first one. Your last two do need to access the contents and their performance will depend on the size.
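One way to measure is to check how much of the table actually lives out-of-line in TOAST; roughly (the table name here is a placeholder for your real one):
SELECT pg_size_pretty(pg_relation_size('my_table'))   AS heap_size,
       pg_size_pretty(pg_total_relation_size('my_table')
                      - pg_relation_size('my_table')
                      - pg_indexes_size('my_table'))   AS toast_size;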

PostgreSQL two groups segregated but not ordered only by zero price column

I need help with a bit of a crazy single-query goal, please; I'm not sure whether GROUP BY or a sub-SELECT applies here.
The following query:
SELECT id_finish, description, inside_rate, outside_material, id_part, id_metal
FROM parts_finishing AS pf
LEFT JOIN parts_finishing_descriptions AS fd ON (pf.id_description=fd.id);
It returns results like the following:
+-------------+-------------+------------------+--------------------------------+
| description | inside_rate | outside_material | id_part - id_finish - id_metal |
+-------------+-------------+------------------+--------------------------------+
| Nickle | 0 | 33.44 | 4444-44-44, 5555-55-55 |
+-------------+-------------+------------------+--------------------------------+
| Bend | 11.22 | 0 | 1111-11-11 |
+-------------+-------------+------------------+--------------------------------+
| Pack | 22.33 | 0 | 2222-22-22, 3333-33-33 |
+-------------+-------------+------------------+--------------------------------+
| Zinc | 0 | 44.55 | 6000-66-66 |
+-------------+-------------+------------------+--------------------------------+
I need the results returned in the fashion below, but there are catches:
I need to group by either the inside_rate column or the outside_material column, and ORDER BY the description column, but not order or sort by price (inside_rate and outside_material are the prices). A row belongs to one group if inside_rate is 0 and to the other group if outside_material is 0.
Within each group, I then need a secondary ORDER BY on the description column, descending.
I need to return a list of parts (composed of three separate columns) for that inside/outside group and price for that finishing.
+-------------+-------------+------------------+--------------------------------+
| description | inside_rate | outside_material | id_part - id_finish - id_metal |
+-------------+-------------+------------------+--------------------------------+
| Bend | 11.22 | 0 | 1111-11-11 |
+-------------+-------------+------------------+--------------------------------+
| Pack | 22.33 | 0 | 2222-22-22, 3333-33-33 |
+-------------+-------------+------------------+--------------------------------+
| Nickle | 0 | 33.44 | 4444-44-44, 5555-55-55 |
+-------------+-------------+------------------+--------------------------------+
| Zinc | 0 | 44.55 | 6000-66-66 |
+-------------+-------------+------------------+--------------------------------+
The tables I'm working with and their data types:
Table "public.parts_finishing"
Column | Type | Modifiers
------------------+---------+-------------------------------------------------------------
id | bigint | not null default nextval('parts_finishing_id_seq'::regclass)
id_part | bigint |
id_finish | bigint |
id_metal | bigint |
id_description | bigint |
date | date |
inside_hours_k | numeric |
inside_rate | numeric |
outside_material | numeric |
sort | integer |
Indexes:
"parts_finishing_pkey" PRIMARY KEY, btree (id)
Table "public.parts_finishing_descriptions"
Column | Type | Modifiers
------------+---------+------------------------------------------------------------------
id | bigint | not null default nextval('parts_finishing_descriptions_id_seq'::regclass)
date | date |
description | text |
rate_hour | numeric |
type | text |
Indexes:
"parts_finishing_descriptions_pkey" PRIMARY KEY, btree (id)
I'd make an SQL fiddle though it refuses to load for me regardless of the browser.
Not entirely sure I understand your question. Might look like this:
SELECT fd.description, pf.inside_rate, pf.outside_material
     , concat_ws(' - ', pf.id_part::text
                      , pf.id_finish::text
                      , pf.id_metal::text) AS id_part_finish_metal
FROM   parts_finishing pf
LEFT   JOIN parts_finishing_descriptions fd ON pf.id_description = fd.id
ORDER  BY (pf.inside_rate = 0)        -- 1. sorts the group with an inside_rate first
     , fd.description DESC NULLS LAST -- 2. possible NULL values last
;
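If several parts can share the same finish and you want them collapsed into a single row per description, as the sample output suggests, one option is to aggregate the concatenated ids on top of the same join. This is only a sketch along those lines:
SELECT fd.description, pf.inside_rate, pf.outside_material
     , string_agg(concat_ws(' - ', pf.id_part::text
                                 , pf.id_finish::text
                                 , pf.id_metal::text), ', ') AS id_part_finish_metal
FROM   parts_finishing pf
LEFT   JOIN parts_finishing_descriptions fd ON pf.id_description = fd.id
GROUP  BY fd.description, pf.inside_rate, pf.outside_material
ORDER  BY (pf.inside_rate = 0)
     , fd.description DESC NULLS LAST;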