How can pg_column_size be smaller than octet_length? - postgresql

I'm trying to estimate the anticipated size of a table from its column types and length values, and I'm using pg_column_size for this.
While testing the function, I noticed something that seems wrong: the value returned by pg_column_size(...) is sometimes even smaller than the value returned by octet_length(...) on the same string.
There is nothing but numeric characters in the column.
postgres=# \d+ t5
Table "public.t5"
Column | Type | Modifiers | Storage | Stats target | Description
--------+-------------------+-----------+----------+--------------+-------------
c1 | character varying | | extended | |
Has OIDs: no
postgres=# select pg_column_size(c1), octet_length(c1) as octet from t5;
pg_column_size | octet
----------------+-------
2 | 1
704 | 700
101 | 7000
903 | 77000
(4 rows)
Is this a bug or something? Does anyone have a formula to calculate the anticipated table size from the column types and length values?

I'd say pg_column_size is reporting the compressed size of TOASTed values, while octet_length is reporting the uncompressed sizes. I haven't verified this by checking the function source or definitions, but it'd make sense, especially as strings of numbers will compress quite well. You're using EXTENDED storage so the values are eligible for TOAST compression. See the TOAST documentation.
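If you want to confirm that compression is what's going on, one option is to load the same data into a column whose storage mode forbids compression and compare again. This is only a sketch (the copy table t5_uncompressed is made up for illustration), but the commands themselves are standard PostgreSQL:
-- EXTERNAL = allow out-of-line storage, but never compress.
CREATE TABLE t5_uncompressed (c1 varchar);
ALTER TABLE t5_uncompressed ALTER COLUMN c1 SET STORAGE EXTERNAL;
INSERT INTO t5_uncompressed SELECT c1 FROM t5;
-- With compression off, pg_column_size should track octet_length much more closely.
SELECT pg_column_size(c1), octet_length(c1) FROM t5_uncompressed;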
As for calculating the expected database size, that's a whole new question. As you can see from the following demo, it depends on things like how compressible your strings are.
Here's a demonstration of how octet_length can end up bigger than pg_column_size once TOAST kicks in. First, let's get the results on query output, where no TOAST comes into play:
regress=> SELECT octet_length(repeat('1234567890',(2^n)::integer)), pg_column_size(repeat('1234567890',(2^n)::integer)) FROM generate_series(0,12) n;
octet_length | pg_column_size
--------------+----------------
10 | 14
20 | 24
40 | 44
80 | 84
160 | 164
320 | 324
640 | 644
1280 | 1284
2560 | 2564
5120 | 5124
10240 | 10244
20480 | 20484
40960 | 40964
(13 rows)
Now let's store that same query output into a table and get the size of the stored rows:
regress=> CREATE TABLE blah AS SELECT repeat('1234567890',(2^n)::integer) AS data FROM generate_series(0,12) n;
SELECT 13
regress=> SELECT octet_length(data), pg_column_size(data) FROM blah;
octet_length | pg_column_size
--------------+----------------
10 | 11
20 | 21
40 | 41
80 | 81
160 | 164
320 | 324
640 | 644
1280 | 1284
2560 | 51
5120 | 79
10240 | 138
20480 | 254
40960 | 488
(13 rows)
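If what you ultimately want is the on-disk footprint rather than per-value sizes, it's usually easier to measure the table directly; note that summing pg_column_size over all rows will undershoot, because it ignores per-row headers and page-level overhead. A quick check against the demo table above:
-- Heap plus TOAST, and the grand total including indexes.
SELECT pg_size_pretty(pg_table_size('blah'))          AS table_size,
       pg_size_pretty(pg_total_relation_size('blah')) AS total_size;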

Related

Copy file from CSV into Postgresql table - timestamp problem

I have data in csv format with rows of daily company stock quotes which look like this:
INTSW2027243,20200319,7.7700,7.7800,7.3600,7.3600,2442
INTSW2027391,20200319,7.4200,7.6000,6.8300,6.8900,15262
INTSW2027409,20200319,7.4800,7.5600,7.4200,7.5600,743
INTSW2028365,20200319,0.7100,0.7200,0.5400,0.5500,47495
Atari,20200319,351.0000,365.5000,350.0000,357.0000,9040
The second column of the file is the date: 2020-03-19 in this case.
I use the COPY FROM command to update the postgres companies table.
COPY companies (ticker, date, open, high, low, close, vol) FROM '/home/user/Downloads/company.csv' using delimiters ',' with null as '\null';
Whenever I use the COPY FROM command to copy the file into the PostgreSQL table, the date '20200319' turns into 1970-08-22 20:11:59, and my last record ends up looking something like this:
id | ticker | date | open | high | low | close | vol
---------+--------+---------------------+------+-------+-----+-------+------
2248402 | Atari | 1970-08-22 20:11:59 | 351 | 365.5 | 350 | 357 | 9040
If I manually update the companies table with the following command, I get proper results:
INSERT INTO companies (ticker, date, open, high, low, close, vol) VALUES ('Atari', to_timestamp('20200319', 'YYYYMMDD')::timestamp without time zone ,351.0000,365.5000,350.0000,357.0000,9040);
However, the above approach doesn't work when the data is being loaded from a CSV file.
Proper result:
id | ticker | date | open | high | low | close | vol
---------+--------+---------------------+------+-------+-----+-------+------
2250513 | Atari | 2020-03-19 00:00:00 | 351 | 365.5 | 350 | 357 | 9040
My Questions:
Is there a way to change the output date format in COPY FROM command?
What is the proper way to update large postgres tables with daily quotes from csv files in bulk by means of sql commands?
My postgres version:
psql (PostgreSQL) 11.7
Edit:
This is not a psql copy question.
OK, I took the advice from madflow's comments and partially from Abelisto, and adjusted datestyle with SET.
Initially I tried:
SET datestyle = 'YYYYMMDD'; (and many more combinations of it)
But was getting the following error:
invalid value for parameter "DateStyle": "YYYYMMDD"
I then moved on to trying: set datestyle to "YMD";
And got: SET
Now when I try: show datestyle;
I get:
DateStyle
-----------
ISO, YMD
(1 row)
And, when I try the following command:
COPY companies (ticker, date, open, high, low, close, vol) FROM '/home/user/Downloads/company.csv' using delimiters ',' with null as '\null';
It looks like I'm finally getting the right date format, so there's no need to adjust the COPY FROM command:
id | ticker | date | open | high | low | close | vol
---------+--------+---------------------+------+-------+-----+-------+-------
1379256 | Atari | 2020-03-16 00:00:00 | 294 | 337.5 | 256 | 337 | 48690
1379257 | Atari | 2020-03-17 00:00:00 | 347 | 381 | 338 | 357 | 36945
1379258 | Atari | 2020-03-18 00:00:00 | 364 | 380 | 350 | 357 | 19650
2251920 | Atari | 2020-03-19 00:00:00 | 351 | 365.5 | 350 | 357 | 9040
So, thanks guys for suggestions!
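Regarding the second question (bulk loading when the CSV's date format isn't accepted directly), another pattern that avoids touching datestyle is to COPY into a staging table that keeps the date as plain text and convert it on insert. The sketch below assumes the column layout from the sample file; companies_stage is a made-up name:
CREATE TEMP TABLE companies_stage (
    ticker   text,
    date_raw text,
    open     numeric,
    high     numeric,
    low      numeric,
    close    numeric,
    vol      bigint
);
-- Modern COPY option syntax; adjust the NULL marker to match your file.
COPY companies_stage FROM '/home/user/Downloads/company.csv' WITH (FORMAT csv, NULL '\null');
INSERT INTO companies (ticker, date, open, high, low, close, vol)
SELECT ticker,
       to_timestamp(date_raw, 'YYYYMMDD')::timestamp without time zone,
       open, high, low, close, vol
FROM companies_stage;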

Size of Postgresql table - increased on table row or column?

So I have a strange problem. Recently I came across a database structure that was something like this:
| salt # 01:00 | salt # 02:00 | salt # 03:00 |
|:-------------|--------------|--------------|
| 0 | 3 | 2 |
Where each datestamp was a separate column. I didn't think such a structure was optimal, so I rearranged it to look something like this:
| creation_time | salt |
|:--------------------|----------|
| 2018-08-15 01:00:00 | 0 |
| 2018-08-15 02:00:00 | 3 |
| 2018-08-15 03:00:00 | 2 |
For some reason, this increases the size of the table enormously. In one table I went from 3269 rows in the old structure to 5142823 rows, which is fine, but the size increased from 1048 kB to 674 MB.
So is this some kind of hack to reduce table size, i.e. defining each creation_time as a column instead of including it in the row? Could I do anything to reduce the size?
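For anyone investigating where the extra space goes, PostgreSQL can report heap, index, and total sizes directly; a small sketch (salt_measurements is a hypothetical table name):
SELECT pg_size_pretty(pg_relation_size('salt_measurements'))       AS heap,
       pg_size_pretty(pg_indexes_size('salt_measurements'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('salt_measurements')) AS total;
Each row also carries a fixed tuple header (roughly 24 bytes) plus alignment padding, which is one reason millions of narrow rows can take far more space than a few thousand wide ones.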

Query count of rows where id is less than a series of values in Redshift

I have a table etl_control which stores the latest_id of the x_data table every day. Now I have a requirement to get the number of rows for each day.
My idea is to run a query for each day with the condition x_data.id <= etl_control.latest_id and get the count.
The table structures are as follows.
etl_control:
record_date | latest_id |
---------------------------------
2016-11-01 | 55 |
2016-11-02 | 125 |
2016-11-03 | 154 |
2016-11-04 | 190 |
2016-11-05 | 201 |
2016-11-06 | 225 |
2016-11-07 | 287 |
x_data:
id | value |
---------------------------------
10 | xyz |
11 | xyz |
21 | xyz |
55 | xyz |
101 | xyz |
108 | xyz |
125 | xyz |
142 | xyz |
154 | xyz |
160 | xyz |
166 | xyz |
178 | xyz |
190 | xyz |
191 | xyz |
The end result should have the number of rows in x_data for each day. I tried a number of variations using JOIN, WITH and COUNT(*) OVER. But the biggest hurdle is to iteratively compare x_data.id with etl_control.latest_id.
Really sorry folks. Got the answer myself after posting the question.
The query is really simple.
WITH data AS (
SELECT e.latest_id
FROM x_data AS x, etl_control AS e
WHERE x.id <= e.latest_id)
SELECT latest_id, count(*) FROM data GROUP BY latest_id;
This basically builds a derived table in which each x_data row is paired with every latest_id that is greater than or equal to its id.
A simple GROUP BY on this derived table then gives the expected count per latest_id.
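If the report needs to be keyed by date rather than by latest_id, the same idea works with record_date carried through; a sketch against the two tables above:
SELECT e.record_date, COUNT(x.id) AS row_count
FROM etl_control AS e
JOIN x_data AS x ON x.id <= e.latest_id
GROUP BY e.record_date
ORDER BY e.record_date;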

Amazon Redshift table block allocation

Our cluster is a 4-node cluster. We have a table with 72 columns. When we query the svv_diskusage table to check the allocation of columns on each slice, we see that every column has been allocated 2 blocks (0 and 1). But a few of the columns have the datatype varchar(1), which should not occupy two blocks of space.
Is it possible that if one of the columns occupies more than one block (say, a varchar(1500)), the same number of blocks is allocated for all the other columns of the table? If so, how does this affect the overall database size of the cluster?
Each Amazon Redshift storage block is 1MB in size. Each block contains data from only one column within one table.
The SVV_DISKUSAGE system view contains a list of these blocks, eg:
select db_id, trim(name) as tablename, col, tbl, max(blocknum)
from svv_diskusage
where name='salesnew'
group by db_id, name, col, tbl
order by db_id, name, col, tbl;
db_id | tablename | col | tbl | max
--------+------------+-----+--------+-----
175857 | salesnew | 0 | 187605 | 154
175857 | salesnew | 1 | 187605 | 154
175857 | salesnew | 2 | 187605 | 154
175857 | salesnew | 3 | 187605 | 154
175857 | salesnew | 4 | 187605 | 154
175857 | salesnew | 5 | 187605 | 79
175857 | salesnew | 6 | 187605 | 79
175857 | salesnew | 7 | 187605 | 302
175857 | salesnew | 8 | 187605 | 302
175857 | salesnew | 9 | 187605 | 302
175857 | salesnew | 10 | 187605 | 3
175857 | salesnew | 11 | 187605 | 2
175857 | salesnew | 12 | 187605 | 296
(13 rows)
The number of blocks required to store each column depends upon the amount of data and the compression encoding used for that table.
Amazon Redshift also stores the minvalue and maxvalue of the data stored in each block. These are visible in the SVV_DISKUSAGE view. The values are often called Zone Maps, and they are used to identify blocks that can be skipped when scanning data. For example, if a WHERE clause looks for rows with a value of 5 in a column, then blocks whose minvalue is 6 can be skipped entirely. This is especially useful when the data is compressed.
To investigate why your data is consuming two blocks, examine:
The minvalue and maxvalue of each block
The number of values (num_values) stored in each block
Those values will give you an idea of how much data is stored in each block, and whether that matches your expectations.
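For example, a query along these lines pulls the zone maps and per-block row counts out of SVV_DISKUSAGE (using the salesnew table from the example above; substitute your own table name):
SELECT slice, col, blocknum, num_values, minvalue, maxvalue
FROM svv_diskusage
WHERE name = 'salesnew'
ORDER BY col, slice, blocknum;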
Also, take a look at the Distribution Key (DISTKEY) used on the table. If the DISTKEY is set to ALL, then table data is replicated between multiple nodes. This could also explain your block count.
Finally, if data has been deleted from the table, then old values might be consuming disk space. Run a VACUUM command on the table to remove deleted data.
A good reference is: Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?

Scala find missing values in a range

For a given range, for instance
val range = (1 to 5).toArray
val ready = Array(2,4)
the missing values (not ready) are
val missing = range.toSet diff ready.toSet
Set(5, 1, 3)
The real use case includes thousands of range instances with (possibly) thousands of missing or not ready values. Is there a more time-efficient approach in Scala?
The diff operation is implemented in Scala as a foldLeft over the left operand where each element of the right operand is removed from the left collection. Let's assume that the left and right operand have m and n elements, respectively.
Calling toSet on an Array or Range object will return a HashTrieSet, which is a HashSet implementation and, thus, offers a remove operation with complexity of almost O(1). Thus, the overall complexity for the diff operation is O(m).
Comparing this with a different approach shows that it is actually quite good. One could also solve the problem by sorting both ranges and then traversing them once, merge-sort fashion, to eliminate all elements that occur in both ranges. That gives a complexity of O(max(m, n) * log(max(m, n))), because you have to sort both ranges.
Update
I ran some experiments to investigate whether the computation can be sped up by using mutable hash sets instead of immutable ones. The result, as shown in the tables below, is that it depends on the size ratio of range and ready.
It seems that using immutable hash sets is more efficient while ready.size / range.size < 0.2; above this ratio, mutable hash sets outperform the immutable ones.
For my experiments I set range = (1 to n), with n being the number of elements in range. For ready I selected a random subset of range with the respective number of elements. I repeated each run 20 times and summed up the times measured with System.currentTimeMillis().
range.size == 100000
+-----------+-----------+---------+
| Fraction | Immutable | Mutable |
+-----------+-----------+---------+
| 0.01 | 28 | 111 |
| 0.02 | 23 | 124 |
| 0.05 | 39 | 115 |
| 0.1 | 113 | 129 |
| 0.2 | 174 | 140 |
| 0.5 | 472 | 200 |
| 0.75 | 722 | 203 |
| 0.9 | 786 | 202 |
| 1.0 | 743 | 212 |
+-----------+-----------+---------+
range.size == 500000
+-----------+-----------+---------+
| Fraction | Immutable | Mutable |
+-----------+-----------+---------+
| 0.01 | 73 | 717 |
| 0.02 | 140 | 771 |
| 0.05 | 328 | 722 |
| 0.1 | 538 | 706 |
| 0.2 | 1053 | 836 |
| 0.5 | 2543 | 1149 |
| 0.75 | 3539 | 1260 |
| 0.9 | 4171 | 1305 |
| 1.0 | 4403 | 1522 |
+-----------+-----------+---------+