postgres crosstab with unknown number of columns - postgresql

I have a table test which looks like the following
That should be transformed to the following table
I can achieve that with the following crosstab statement
SELECT *
FROM crosstab('select category, month, sum from test') AS ct(category text, r202208 float, r202209 float);
However this only works when I know the columns beforehand, but the table contains how much money I spend in each category and each month, so I don't know the months upfront. Is it possible to autogenerate the columns based on the content of the month column of the original table, maybe by using a Postgres function?
Of course I could generate the SQL query dynamically as a string and execute it with Java, Node or something else. However, I get the data as CSV from my financial institution and the original table is just a view, so I would like to create that table without the help of external programming.
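One way to stay inside Postgres is to have a DO block (or a plpgsql function) build the column list from the data and run the crosstab through EXECUTE. A minimal sketch, assuming the tablefunc extension is installed; the target table name test_pivot is made up, and note that the single-argument form of crosstab fills columns positionally, so gaps in the data can shift values:
DO $$
DECLARE
    col_list text;
BEGIN
    -- one "rYYYYMM float" column definition per distinct month in the data
    SELECT string_agg(format('r%s float', month), ', ' ORDER BY month)
      INTO col_list
      FROM (SELECT DISTINCT month FROM test) m;

    EXECUTE format(
        'CREATE TABLE test_pivot AS
         SELECT *
         FROM crosstab(''SELECT category, month, sum FROM test ORDER BY 1, 2'')
              AS ct(category text, %s)',
        col_list);
END $$;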

Related

Pivot function without manually typing values in `for in`?

The documentation provides an example of using the pivot() function.
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname IN ('prop', 'rudder', 'wing')
);
I would like to use pivot() without having to manually specify each value of partname. I want all parts. I tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname);
That gave an error. Then tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname IN (select distinct partname from part)
);
That also threw an error.
How can I tell Redshift to include all values of partname in the pivot?
I don't think this can be done in a simple single query. It would mean that the query compiler would need to work without knowing how many output columns will be produced, and I don't think it can do that.
You can do this in multiple queries - use a query to create the list of partnames and then use this to "generate" a second query that populates the IN list. So something needs to issue these queries and generate the second one. This can be some code external to Redshift (lots of options) or a stored procedure in Redshift. This code, no matter where it exists, should understand that Redshift has a limit on the maximum number of columns - 1,600.
The Redshift docs are fairly good on the topic of dynamic SQL for stored procedures. The EXECUTE statement will be used to fire off the second query in a stored procedure. See: https://docs.aws.amazon.com/redshift/latest/dg/c_PLpgSQL-statements.html
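A rough sketch of that two-query pattern inside a stored procedure (the procedure name pivot_all_parts and the temp table pivot_result are made up, not from the docs):
CREATE OR REPLACE PROCEDURE pivot_all_parts()
AS $$
DECLARE
    part_list text;
BEGIN
    -- build the quoted IN list: 'prop', 'rudder', 'wing', ...
    SELECT LISTAGG(DISTINCT '''' || partname || '''', ', ')
      INTO part_list
      FROM part;

    EXECUTE 'DROP TABLE IF EXISTS pivot_result';
    EXECUTE 'CREATE TEMP TABLE pivot_result AS '
         || 'SELECT * FROM (SELECT partname, price FROM part) '
         || 'PIVOT (AVG(price) FOR partname IN (' || part_list || '))';
END;
$$ LANGUAGE plpgsql;

CALL pivot_all_parts();
SELECT * FROM pivot_result;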

Is there a way in PostgreSQL to use a dynamic table partition name in a select?

I have a very large table partitioned by date into quarterly chunks for performance reasons.
I would like to write a SQL query that uses a dynamic partition name based on whichever partition is the most recent.
Is there some way to write this?
example:
partitions names look like this:
schema.table_2020_q4
schema.table_2020_q3
schema.table_2020_q2
schema.table_2020_q1
schema.table_2019_q4
schema.table_2019_q3
...
today is 2020-07-23 = 2020 Q3, so my sql will look like this:
select * from schema.table_2020_q3
when the app runs this query on 2020-10-01, it needs to be:
select * from schema.table_2020_q4
Thanks
Your table is partitioned by that date, so make the date part of your WHERE expression (in the same format you use for partitioning) and the database should pick the right partition automatically.
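For example, assuming the partitions belong to a parent table (called schema.parent_table here) and the partition-key column is called created_at (a guess), a filter on the current quarter lets the planner prune to the right partition:
SELECT *
FROM schema.parent_table
WHERE created_at >= date_trunc('quarter', current_date)
  AND created_at <  date_trunc('quarter', current_date) + interval '3 months';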

How do I make a simple day dimension table for data warehousing star schema with postgresql?

How would I go about creating and populating a simple DAY dimension table for a star schema in PostgreSQL?
It is for an intro course to data warehousing and so it only has a few fields but most of the examples online are very involved and seem very complicated for a beginner. This isn't for an assignment - it is for studying because I am trying to make my own simple Star Schema with a fact table so I can start getting comfortable with it.
Can anyone give me a simple example of how I'd create the table with just a few fields (day_key as the surrogate key, a string describing the day, and some integer values representing the days or months for example) so I can at least get started on understanding?
A very simple DAY dimension table that should work for most versions of PostgreSQL (I am using 10.5). This is just something that should help someone newer to Data Warehousing make a basic day dimension for use when just getting started.
Create a Day Table
CREATE TABLE day (
day_key SERIAL PRIMARY KEY, -- SERIAL is an integer that auto-increments as new rows are added
description VARCHAR(40), -- a 'string' for a description
full_date DATE, -- an actual date type
month_number INTEGER,
month_name VARCHAR(40),
year INTEGER
);
Inserting Rows into the Day dimension
INSERT INTO day(description, full_date, month_number, month_name, year)
SELECT
    to_char(days.d, 'FMMonth DD, YYYY'),
    days.d::DATE,
    to_char(days.d, 'MM')::integer,
    to_char(days.d, 'FMMonth'),
    to_char(days.d, 'YYYY')::integer
FROM (
    SELECT generate_series(
        ('2019-01-01')::date, -- 'start' date
        ('2019-12-31')::date, -- 'end' date
        interval '1 day'      -- one row for each day between the start and end dates
    )
) AS days(d);
Notes:
Basically you are just using the rows generated by the nested SELECT generate_series(...) to insert into the Day table.
I used FM twice above to remove some of the whitespace padding that is automatically added by some of these date format patterns.
I'd recommend removing the INSERT INTO day(...) line the first time you run this, just to make sure the format of each column is what you're after before inserting into your table, as shown below.
This is just what I've seen commonly used - the PostgreSQL documentation has more thorough examples of ways to format date types and derive all kinds of useful dimensions.
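For example, that preview is just the same SELECT without the INSERT, limited to a few rows:
SELECT
    to_char(days.d, 'FMMonth DD, YYYY') AS description,
    days.d::DATE                        AS full_date,
    to_char(days.d, 'MM')::integer      AS month_number,
    to_char(days.d, 'FMMonth')          AS month_name,
    to_char(days.d, 'YYYY')::integer    AS year
FROM (
    SELECT generate_series(
        '2019-01-01'::date,
        '2019-12-31'::date,
        interval '1 day'
    )
) AS days(d)
LIMIT 5;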

Can redshift stored procs be used to make a date range UNION ALL query

Since Redshift does not natively support date partitioning, other than in Redshift Spectrum, all our tables are date partitioned:
my_table_name_YYYY_MM_DD
So every time we do queries, it usually looks like this:
select columns, i, want from
(select * from tbl1_date UNION ALL
select * from tbl2_date UNION ALL
select * from tbl3_date UNION ALL
select * from tbl4_date);
Where there's one UNION ALL per day.
Can stored procedures generate a date range so our business analysts stop losing their hair when I send them a Python or bash script to generate the date range?
Yes, you could create a stored procedure that generates dynamic SQL using only the needed tables. See my answer here for a template to start from: Issue with passing column name as a parameter to "PREPARE" in Redshift
However, you should be aware that Redshift is able to achieve most of what you want automatically using a "Time Series Table" view. This is documented here:
Using Time Series Tables
Use Time-Series Tables
You define a view that is composed of a UNION ALL over a sequence of identical tables with a sort key defined on a commonly filtered date or timestamp column. When you query that view Redshift is able to eliminate the scans on any UNION'ed tables that would not contain relevant data.
For example:
CREATE OR REPLACE VIEW store_sales_vw
AS SELECT * FROM store_sales_1998
UNION ALL SELECT * FROM store_sales_1999
UNION ALL SELECT * FROM store_sales_2000
UNION ALL SELECT * FROM store_sales_2001
UNION ALL SELECT * FROM store_sales_2002
UNION ALL SELECT * FROM store_sales_2003
;
SELECT cd.cd_education_status
,COUNT(*) sales_count
,AVG(ss_quantity) avg_quantity
FROM store_sales_vw vw
JOIN customer_demographics cd
ON vw.ss_cdemo_sk = cd.cd_demo_sk
WHERE ss_sold_ts BETWEEN '1999-09-01' AND '2000-08-31'
GROUP BY cd.cd_education_status
In this example Redshift will only use the store_sales_1999 and store_sales_2000 tables, skipping the other tables in the view. Note that the table skipping is not based on the name of the table. Redshift knows the MIN and MAX values of the sort key timestamp in each table.
If you pursue this approach please be sure to keep the total size of the UNION fairly low. I recommend (at most) daily tables for the last week [7], weekly tables for the last month [5], quarterly tables for the last year [4], and then yearly tables for older data.
You can use ALTER TABLE … APPEND to merge the daily tables into weekly tables and so on.
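For example (the table names here are made up):
-- moves all rows from the daily table into the weekly table;
-- the two tables need compatible column definitions
ALTER TABLE sales_2020_w31 APPEND FROM sales_2020_08_01;
DROP TABLE sales_2020_08_01; -- APPEND leaves the source table empty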

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column containing a comma-delimited string of up to 60 values into multiple rows, one value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible because Redshift has no CROSS APPLY or STRING_SPLIT functionality, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma delimited string into multiple columns, plus another query that unions everything to get the final output into a single column. But the typical run takes over 6 hours to process the same amount of data.
I wasn't expecting to run into this issue given the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM, while the Redshift cluster is based on dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams, but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm assuming I'm not understanding something, but I have no idea what I'm missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but short of looking into cross joins, something I'd like to avoid (when I talked to the DBAs running the Redshift cluster about that option their response was a flat "No, can't do that."), I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a numbers table, for which you can take any table as long as it has more rows than the number of fields you need. You need to make sure the numbers are integers by forcing the type. Using the function regexp_count on the column to be split, in the ON condition, to count the number of fields (delimiters + 1) generates a row per repetition.
Then you use the split_part function on the column, with the numbers.num column, to extract a different part of the text for each of those rows.
SELECT comma_delimited_string,
       numbers.num,
       REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
    FROM mytable
    LIMIT 15 -- max number of fields
) AS numbers
  ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1;
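If you also need to materialize the output the way the original SELECT ... INTO did, a sketch of the same pattern (my_results_table is made up; id_1, id_2 and the 60-value limit come from the question):
CREATE TABLE my_results_table AS
SELECT id_1,
       id_2,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS value
FROM mytable
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
    FROM mytable
    LIMIT 60 -- the strings hold up to 60 values
) AS numbers
  ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
WHERE comma_delimited_string IS NOT NULL;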