TimescaleDB: Understanding the return values after creating hypertable and the creation of chunks after populating the hypertable - postgresql

I have an existing table in my database named price (has 264 rows) and I converted it into a hypertable price_hypertable doing:
CREATE TABLE price_hypertable (LIKE price INCLUDING DEFAULTS INCLUDING CONSTRAINTS EXCLUDING INDEXES);
SELECT create_hypertable('price_hypertable', 'start');
and the output it gave me is as follows:
create_hypertable
-------------------------------
(4,public,price_hypertable,t)
(1 row)
The next thing I did was to populate the price_hypertable as follows:
insert into price_hypertable select * from price;
And I got the following output:
INSERT 0 264
Now, I wanted to check the chunks created, for which I did:
select public.show_chunks('price_hypertable');
and the output I got:
show_chunks
----------------------------------------
_timescaledb_internal._hyper_4_3_chunk
_timescaledb_internal._hyper_4_4_chunk
(2 rows)
When I do:
select * from _timescaledb_internal._hyper_4_3_chunk;
select * from _timescaledb_internal._hyper_4_4_chunk ;
I see that the 264 entries are split as follows:
_timescaledb_internal._hyper_4_3_chunk has 98 rows
_timescaledb_internal._hyper_4_4_chunk has 166 rows
I have a few questions about these steps and their outputs:
Can someone please explain what the values 4 and t represent when I ran SELECT create_hypertable('price_hypertable', 'start')?
After populating price_hypertable, the data was automatically split into chunks, but of different sizes. Why does this happen? Why wasn't the data just split in half (132 rows in each chunk instead of 98 and 166)?
Any help is appreciated. Thanks

For the first question, it is easier to see what they represent by executing create_hypertable as
SELECT * FROM create_hypertable('price_hypertable', 'start');
This gives something like:
 hypertable_id | schema_name |    table_name    | created
---------------+-------------+------------------+---------
             4 | public      | price_hypertable | t
For the second question, TmTron already answered. This is because the rows are sorted into buckets based on the time, and they are not necessarily evenly spaced. There is no automation that picks the correct interval for each bucket.
You can find information about the return values in the API documentation on create_hypertable, which also discusses the parameter chunk_time_interval that can be used to set the chunk size.
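For example, a minimal sketch of setting an explicit chunk size (the one-month interval is purely illustrative, not a recommendation for your data):
-- Set the chunk size explicitly when creating the hypertable
SELECT create_hypertable('price_hypertable', 'start',
                         chunk_time_interval => INTERVAL '1 month');
-- or, for an existing hypertable (affects new chunks only)
SELECT set_chunk_time_interval('price_hypertable', INTERVAL '1 month');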

Related to your 2nd question:
When you don't specify the chunk_time_interval explicitly, the default is 7 days: see create-hypertable, Best Practices.
So the number of rows in each chunk depends on the distribution of your data (according to your start date-time column).
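To see the time range each chunk covers (and hence why 98 rows landed in one chunk and 166 in the other), you can query the chunks view; this sketch assumes TimescaleDB 2.x, where timescaledb_information.chunks is available:
-- Each chunk covers a fixed time range; row counts just reflect how many rows fall in each range
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'price_hypertable'
ORDER BY range_start;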

Related

How to find the difference between two table sizes in postgres using shell script

I have one table named 'Table_size_details' in my database which stores the size of the tables in my Postgres database, every Friday.
Date        Table_Name  Table_size  Growth_Difference  Growth_Percentage
----------  ----------  ----------  -----------------  -----------------
20-08-2021  Demo        1.2 GB
13-08-2021  Demo        578 MB
I have been given a task to add two more columns named 'Growth_Difference' and 'Growth_Percentage'. In the 'Growth_Difference' column I need to find the difference between the current table size (1.2 GB) and the previous week's table size (578 MB) and display it in MB. I also need to find the growth percentage between the current table size and the previous week's table size.
I have been asked to develop this using a shell script.
Table_size_old=`psql -d abc -At -c "SELECT Table_size_details from abc order by Date desc limit 1;"`
Table_size_new=`psql -At -c "SELECT pg_size_pretty(pg_total_relation_size('Table_size_details'));"`
growth_table=`expr $Table_size_new - $Table_size_old;`
I used the above logic to find the difference between the new and old table sizes, but I'm getting expr: syntax error on the growth_table line. I believe it's because I'm trying to subtract 578 MB from 1.2 GB as text.
I'm new to shell scripting; could anyone help me find a solution?
Appreciate your help in advance.
There is no need to even attempt this in a shell script. Further, since both growth columns are computed values, there is no need to store them. This can be done in a single query, or perhaps even better, a single query that populates a view. With that, getting what you are looking for is a simple select from that view.
First off, however, do not store your size as a string containing a number and a unit code. Store instead a single numeric value in a constant unit, converting all values to that unit. For example, select GB as the constant unit; then 578 MB would be stored as 0.578, and no unit conversion is ever needed. With that done (or added), create a view as follows:
create or replace view table_size_growth as
select table_name
     , run_date
     , size_in_gb
     , round((size_in_gb - gb_last_week)::numeric, 6) as growth_in_gb
     , case when gb_last_week < 0.0000001   -- set to desired precision
            then null::double precision
            else round((100 * (size_in_gb - gb_last_week) / abs(gb_last_week))::numeric, 6)
       end as growth_in_pct
  from (select ts.*
             , lag(ts.size_in_gb) over (partition by ts.table_name
                                        order by ts.run_date) as gb_last_week
          from table_size_details ts
       ) s
 order by table_name, run_date;
Your script (or anything else) now needs just the single query select * from table_size_growth. Note: this provides every week for every table you are capturing; use a where clause as needed. See the example here.
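As a hedged sketch of the whole flow, assuming table_size_details has the columns (table_name, run_date, size_in_gb) used by the view above and that the table being measured is named demo:
-- Weekly capture: store the size as a plain number of GB, not as pretty-printed text
INSERT INTO table_size_details (table_name, run_date, size_in_gb)
SELECT 'demo', current_date, pg_total_relation_size('demo') / 1024.0 ^ 3;

-- The weekly report is then a single query against the view
SELECT * FROM table_size_growth WHERE table_name = 'demo';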

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column containing a comma-delimited string with up to 60 values into multiple rows with 1 value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible, since Redshift has neither CROSS APPLY nor STRING_SPLIT, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma-delimited string into multiple columns, followed by another query that unions everything to get the final output into a single column. But the typical run takes over 6 hours to process the same amount of data.
I wasn't expecting to run into this issue, given the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM, while the Redshift cluster is based on dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams, but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm assuming I'm not understanding something, but I have no idea what I'm missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but short of looking into cross joins, something I'd like to avoid (when I tried to talk to the DBAs running the Redshift cluster about that option, their response was a flat "No, can't do that."), I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a numbers table, for which you can take any table as long as it has more rows than the number of values you need. You need to make sure the numbers are int by forcing the type. Using the function regexp_count on the column to be split in the ON condition to count the number of fields (delimiters + 1) will generate a row per repetition.
Then you use the split_part function on the column, and use the numbers.num column to extract a different part of the text for each of the rows.
SELECT comma_delimited_string,
       numbers.num,
       REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN (SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
      FROM mytable
      LIMIT 15               -- max number of fields
     ) AS numbers
  ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
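If you also need the output to land in a table (the way SELECT ... INTO did in the T-SQL version), one sketch is to wrap the same pattern in CREATE TABLE AS; id_1 and id_2 are the column names from the question, and the table and column names below should be adjusted to your schema:
-- Sketch only: materialize one row per delimited value
CREATE TABLE myResultsTable AS
SELECT t.id_1,
       t.id_2,
       SPLIT_PART(t.comma_delimited_string, ',', numbers.num) AS value
FROM mytable t
JOIN (SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
      FROM mytable
      LIMIT 60               -- 60 = max values per string, per the question
     ) AS numbers
  ON numbers.num <= REGEXP_COUNT(t.comma_delimited_string, ',') + 1
WHERE t.comma_delimited_string IS NOT NULL;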

Postgres Crosstab Dynamic Number of Columns

In Postgres 9.4, I have a table like this:
id  extra_col  days  value
--  ---------  ----  -----
1   rev           0      4
1   rev          30      5
2   cost         60      6
I want this pivoted result:
id  extra_col   0  30  60
--  ---------  --  --  --
1   rev         4   5
2   cost                6
This is simple enough with a crosstab,
but I want the following specifications:
The day column will be dynamic: sometimes increments of 1, 2, 3 (days), sometimes 0, 30, 60 (accounting months), and sometimes 360, 720 (accounting years).
The range of days will be dynamic (e.g., 0..500 days versus 1..10 days).
The first two columns are static (id and extra_col).
The return type for all the dynamic columns will remain the same (in this example, integer).
Here are the solutions I've explored, none of which work for me, for the following reasons:
Automatically creating pivot table column names in PostgreSQL - requires two trips to the database.
Using crosstab_hash - is not dynamic.
From all the solutions I've explored, it seems the only one that allows this to occur in one trip to the database requires the same query to be run three times. Is there a way to store the query as a CTE within the crosstab function?
SELECT *
FROM
CROSSTAB(
--QUERY--,
$$--RUN QUERY AGAIN TO GET NUMBER OF COLUMNS--$$
)
as ct (
--RUN QUERY AGAIN AND CREATE STRING OF COLUMNS WITH TYPE--
)
Every solution based on any built-in functionality needs to know the number of output columns; the PostgreSQL planner requires it. There is a workaround based on cursors - it is the only way to get a truly dynamic result from Postgres.
The example is relatively long and unreadable (SQL really doesn't support cross tabulation well), so I will not rewrite the code from the blog here: http://okbob.blogspot.cz/2008/08/using-cursors-for-generating-cross.html.
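That said, here is a minimal sketch of the cursor idea (not the code from that blog post), assuming the sample table is named pivot_source: a plpgsql function builds the column list from the data, opens a refcursor over a dynamically generated query, and the client fetches from it.
CREATE OR REPLACE FUNCTION dynamic_crosstab(cur refcursor)
RETURNS refcursor
LANGUAGE plpgsql AS $$
DECLARE
    cols text;
BEGIN
    -- one output column per distinct value of "days", e.g. "0", "30", "60"
    SELECT string_agg(format('max(value) FILTER (WHERE days = %s) AS %I', days, days),
                      ', ' ORDER BY days)
      INTO cols
      FROM (SELECT DISTINCT days FROM pivot_source) d;

    -- open the cursor over the dynamically built pivot query
    OPEN cur FOR EXECUTE format(
        'SELECT id, extra_col, %s FROM pivot_source GROUP BY id, extra_col ORDER BY id',
        cols);
    RETURN cur;
END;
$$;

-- Usage, inside a single transaction:
-- BEGIN;
-- SELECT dynamic_crosstab('ct');
-- FETCH ALL FROM ct;
-- COMMIT;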

Cassandra CQL3 select row keys from table with compound primary key

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.
Suppose a table created by
CREATE TABLE foo (
row int,
column int,
txt text,
PRIMARY KEY (row, column)
);
Then I'd like to perform the equivalent of SELECT DISTINCT row FROM foo.
As far as I understand, it should be possible to execute this query efficiently inside Cassandra's data model (given the way compound primary keys are implemented), as it would just query the 'raw' table.
I searched the CQL documentation but I didn't find any options to do that.
My backup plan is to create a separate table - something like
CREATE TABLE foo_rows (
row int,
PRIMARY KEY (row)
);
But this requires the hassle of keeping the two in sync - writing to foo_rows for any write to foo (also a performance penalty).
So is there any way to query for distinct row(partition) keys?
I'll give you the bad way to do this first. If you insert these rows:
insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');
Doing a
'select row from foo;'
will give you the following:
row
-----
1
1
2
2
Not distinct since it shows all possible combinations of row and column. To query to get one row value, you can add a column value:
select row from foo where column = 1;
But then you will get this warning:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Ok. Then with this:
select row from foo where column = 1 ALLOW FILTERING;
row
-----
1
2
Great. What I wanted. Let's not ignore that warning though. If you only have a small number of rows, say 10000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read full table scan) and then filter the unique values for the result set. In some cases, this query will just time out. Given that, probably not what you were looking for.
You mentioned that you were worried about a performance hit from inserting into multiple tables. Multiple-table inserts are a perfectly valid data modeling technique; Cassandra can do an enormous amount of writes. As for it being a pain to sync, I don't know your exact application, but I can give general tips.
If you need a distinct scan, you need to think about partition columns. This is what we call an index or query table. The important thing to consider in any Cassandra data model is the application queries. If I were using IP addresses as the row, I might create something like this to scan all the IP addresses I have in order.
CREATE TABLE ip_addresses (
first_quad int,
last_quads ascii,
PRIMARY KEY (first_quad, last_quads)
);
Now, to insert some rows in my 192.x.x.x address space:
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');
To get the distinct rows in the 192 space, I do this:
SELECT * FROM ip_addresses WHERE first_quad = 192;
first_quad | last_quads
------------+------------
192 | 000000001
192 | 000000002
192 | 000001001
192 | 000001255
To get every single address, you would just need to iterate over every possible row key from 0-255. In my example, I would expect the application to be asking for specific ranges to keep things performant. Your application may have different needs but hopefully you can see the pattern here.
According to the documentation, as of CQL version 3.11, Cassandra understands the DISTINCT modifier.
So you can now write
SELECT DISTINCT row FROM foo
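For example, with the four rows inserted in the answer above (a sketch, assuming your Cassandra version accepts DISTINCT on the partition key), you would get roughly:
SELECT DISTINCT row FROM foo;
 row
-----
   1
   2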
#edofic
Partition (row) keys are used as a unique index to distinguish different rows in the storage engine, so by nature row keys are always distinct. You don't need to put DISTINCT in the SELECT clause.
Example
INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');
Then
SELECT row FROM foo
will return 2 values: 1 and 2
Below is how things are persisted in Cassandra
+---------+---------------+---------------+
| row key | column1/value | column2/value |
+---------+---------------+---------------+
| 1       | 1/'1-1'       | 2/'1-2'       |
| 2       | 1/'2-1'       |               |
+---------+---------------+---------------+

T-SQL - CROSS APPLY to a PIVOT? (using pivot with a table-valued function)?

I have a table-valued function, basically a split-type function, that returns up to 4 rows per string of data.
So I run:
select * from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
I get:
Seq Data
1 15
2 25
However, my end data needs to look like
15 25
so I do a pivot.
select [1],[2],[3],[4]
from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
pivot (max(data) for seq in ([1],[2],[3],[4]))
as pivottable
which works as expected:
1 2
--- ---
15 25
HOWEVER, that's great for one row. I now need to do it for several hundred records at once. My thought is to do a CROSS APPLY, but not sure how to combine a CROSS APPLY and a PIVOT.
(yes, obviously the easy answer is to write a modified version that returns 4 columns, but that's not a great option for other reasons)
Any help greatly appreciated.
And the reason I'm doing this: the current query uses a scalar-valued version of SPLIT, called 12 times within the same SELECT against the same million rows (where the data string is 500+ bytes).
As far as I know, that would require it to scan the same 500 bytes * 1,000,000 rows, 12 times.
This is how you use CROSS APPLY. Assume table1 is your table and Line is the field in your table you want to split:
SELECT * FROM table1 AS a
CROSS APPLY dbo.split(a.Line) AS b
PIVOT (MAX(data) FOR seq IN ([1],[2],[3],[4])) AS p
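To run it against a whole table while keeping the two-parameter split signature from the question and landing the result in a table, a sketch could look like this (dbo.sourceTable, its id column, DataString, and dbo.splitResults are hypothetical names standing in for your actual schema):
-- Sketch: split each row's string, then pivot the up-to-4 values into columns
SELECT p.id, p.[1], p.[2], p.[3], p.[4]
INTO dbo.splitResults
FROM (
    SELECT t.id, s.seq, s.data
    FROM dbo.sourceTable AS t
    CROSS APPLY dbo.split('a', t.DataString) AS s
) AS src
PIVOT (MAX(data) FOR seq IN ([1],[2],[3],[4])) AS p;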