In a PostgreSQL crosstab, can I automate the tuple part?

I'm trying to get a tall table (with just 3 columns: variable, timestamp and value) into a wide format where timestamp is the index, the columns are the variable names, and the values are the values of the new table.
In python/pandas this would be something along the lines of
import pandas as pd
df = pd.read_csv("./mydata.csv") # assume timestamp, varname & value columns
df.pivot(index="timestamp", columns="varname", values="value")
For PostgreSQL there is crosstab (from the tablefunc extension); so far I have:
SELECT * FROM crosstab(
$$
SELECT
"timestamp",
"varname",
"value"
FROM mydata
ORDER BY "timestamp" ASC, "varname" ASC
$$
) AS ct(
"timestamp" timestamp,
"varname1" numeric,
...
"varnameN" numeric
);
The problem is that I can potentially have dozens to hundreds of variable names. The types are always numeric, but the number of variable names is not stable (we could need more variables or realize that others are not necessary).
Is there a way to automate the "ct" part so that some other query (e.g. select distinct "varname" from mydata) produces it instead of me having to type in every single variable name present?
PS: PostgreSQL version is 12.9 at home, 14.0 in production. The number of rows in the original table is around 2 million; however, I'm going to filter by timestamp and varname, so potentially only a few hundred thousand rows. After filtering I get ~50 unique varnames, but that will increase in a few weeks.
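One possible way to avoid typing the list (a sketch only) is to generate the ct(...) definition from the data itself and splice it into the crosstab call, e.g. from a client script or a PL/pgSQL function that EXECUTEs the assembled statement:
-- Sketch: build the crosstab column-definition list from the distinct varnames.
SELECT '"timestamp" timestamp, '
    || string_agg(format('%I numeric', varname), ', ' ORDER BY varname) AS ct_definition
FROM (SELECT DISTINCT "varname" FROM mydata) AS v;
The ORDER BY inside string_agg should match the varname ordering of the crosstab source query, since crosstab fills the output columns positionally.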

Related

Athena - Union tables with incompatible data types

We have two tables with a column differing in its data type. A column in the first table is of type int, while the same column in the second table is of type float/real. If it were a plain column I could have CAST it to a common type; the problem is that these columns are nested deep inside a struct.
The error I'm getting is:
SYNTAX_ERROR: line 23:1: column 4 in row(priceconfiguration row(maximumvalue integer, minimumvalue integer, type varchar, value integer)) query has incompatible types: Union, row(priceconfiguration row(maximumvalue integer, minimumvalue integer, type varchar, value real))
The query (simplified) is:
WITH t1 AS (
SELECT
"so"."createdon"
, "so"."modifiedon"
, "so"."deletedon"
, "so"."createdby"
, "so"."priceconfiguration"
, "so"."year"
, "so"."month"
, "so"."day"
FROM
my_db.raw_price so
UNION ALL
SELECT
"ao"."createdon"
, "ao"."modifiedon"
, "ao"."deletedon"
, "ao"."createdby"
, "ao"."priceconfiguration"
, "ao"."year"
, "ao"."month"
, "ao"."day"
FROM
my_db.src_price ao
)
SELECT t1.* FROM t1 ORDER BY "modifiedon" DESC
In fact, the real table is more complex than this, and the column priceconfiguration is nested deep inside the tables. So CASTing the column in question is not directly possible unless all the structs are un-nested to CAST the offending column.
Is there a way to UNION these two tables without unnesting and casting?
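For illustration, the unnest-and-cast workaround mentioned above would look roughly like this for the raw_price branch of the UNION (a sketch only; the struct layout is taken from the error message, and the other columns are omitted):
-- Sketch: rebuild the struct so its inner "value" field is REAL instead of
-- INTEGER, matching the type exposed by the src_price branch.
SELECT
"so"."createdon"
, CAST(
    ROW(
      CAST(
        ROW(
          "so"."priceconfiguration"."priceconfiguration"."maximumvalue"
        , "so"."priceconfiguration"."priceconfiguration"."minimumvalue"
        , "so"."priceconfiguration"."priceconfiguration"."type"
        , "so"."priceconfiguration"."priceconfiguration"."value"
        ) AS ROW(maximumvalue INTEGER, minimumvalue INTEGER, type VARCHAR, value REAL)
      )
    ) AS ROW(priceconfiguration ROW(maximumvalue INTEGER, minimumvalue INTEGER, type VARCHAR, value REAL))
  ) AS priceconfiguration
FROM my_db.raw_price so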
The solution was to upgrade the Athena Engine Version to v2.
The v2 engine has more support for schema evolution. As per the AWS doc:
Schema evolution support has been added for data in Parquet format. Added support for reading array, map, or row type columns from partitions where the partition schema is different from the table schema. This can occur when the table schema was updated after the partition was created. The changed column types must be compatible. For row types, trailing fields may be added or dropped, but the corresponding fields (by ordinal) must have the same name.
Ref:
https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference.html

How to create a column that holds an array in Postgres?

Background:
I am making a db for a reservations calendar. The reservations are hourly based, so I need to insert many items into one column called "hours_reserved".
Example tables of what I need:
Table "Space"
Column / Values
id / 1
date / 5.2.2020
hours / { 8-10, 10-12 }
Table "reservation"
Column / Values
id / 1
space_id / 1
date / 5.2.2020
reserved_hours / 8-10
Table "reservation" (second row)
Column / Values
id / 2
space_id / 1
date / 5.2.2020
reserved_hours / 10-12
So I need to have multiple items inserted into the "space" table's "hours" column.
How do I do this in Postgres?
Also is there a better way to accomplish this?
There is more than one way to do this, depending on the type of the hours field (i.e. text[], json or jsonb). I'd go with jsonb, just because you can do a lot of things with it and you'll find the experience useful in the short term.
CREATE TABLE "public"."space"
("id" SERIAL, "date_schedule" date, "hours" jsonb, PRIMARY KEY ("id"))
Whenever you insert a manually crafted record into this table, write the value as text (a single-quoted JSON literal) and cast it to jsonb:
insert into "space"
(date_schedule,hours)
values
('05-02-2020'::date, '["8-10", "10-12"]'::jsonb);
There is more than one way to match these available hours against the reservations; take a look at the docs on the json and jsonb operators. For example, doing:
SELECT id,date_schedule, jsonb_array_elements(hours) hours FROM "public"."space"
would yield one row per hour, with each hour still wrapped in ugly double quotes (which is correct, since json can hold several kinds of scalars; that column is polymorphic :D).
However, you can apply a little transformation to remove them and be able to perform a join with reservations:
with unnested as (
SELECT id,date_schedule, jsonb_array_elements(hours) hours FROM "public"."space"
)
select id,date_schedule,replace(hours::text, '"','') from unnested
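Building on that, the join against the reservation table could look like this (a sketch; the reservation column names are taken from the example tables in the question):
-- Sketch: match each unnested hour slot against a reservation row.
with unnested as (
SELECT id, date_schedule, jsonb_array_elements(hours) hours FROM "public"."space"
)
select u.id as space_id, u.date_schedule, r.id as reservation_id
from unnested u
join reservation r
on r.space_id = u.id
and r.date = u.date_schedule
and r.reserved_hours = replace(u.hours::text, '"', '');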
The same can be achieved by defining the field as text[] (the insertion syntax is different but trivial; see the sketch below the query). In that scenario the data is stored as a native Postgres array, which you can unwrap as:
SELECT id,date_schedule, unnest(hours) FROM "public"."space"
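For completeness, the text[] insertion mentioned above could look like this (a sketch using the ARRAY constructor):
-- Sketch: the same insert, with "hours" declared as text[] instead of jsonb.
insert into "space"
(date_schedule, hours)
values
('05-02-2020'::date, ARRAY['8-10', '10-12']);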
Apparently
ALTER TABLE mytable
ADD COLUMN myarray text[];
Works fine.
I got the following problem when trying to PUT (update) into that column using Postman (create works fine):
{
"myarray": ["8-10"]
}
results in:
"message": "error: invalid input syntax for type integer:
\"{\"myarray\":[\"8-10\"]}\""
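The error message suggests the whole JSON body is being bound to an integer column rather than to the array column. For comparison, a direct SQL update of a text[] column would look like this (a sketch; the WHERE clause and id value are assumptions for illustration):
-- Sketch: update a text[] column with an ARRAY constructor, not a JSON object.
UPDATE mytable
SET myarray = ARRAY['8-10']
WHERE id = 1;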

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to turn a column containing a comma-delimited string with up to 60 values into multiple rows, with one value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible, since Redshift has no CROSS APPLY or STRING_SPLIT functionality, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma-delimited string into multiple columns, plus another query that unions everything to get the final output into a single column. But a typical run takes over 6 hours to process the same amount of data.
I wasn't expecting this issue, given the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM, while the Redshift cluster is based on dc1.8xlarge nodes (I don't know the total count). The cluster is shared with other teams, but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm assuming I'm not understanding something, but I have no idea what I'm missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that work better than the example I found? I've searched for other methods and optimizations, but short of looking into cross joins, something I'd like to avoid (and when I tried to talk to the DBAs running the Redshift cluster about that option, their response was a flat "No, can't do that."), I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN against a numbers table, for which you can use any table as long as it has more rows than the number of fields you need; make sure the numbers are int by forcing the type. Using the function regexp_count on the column to be split in the ON condition, to count the number of fields (delimiters + 1), generates one row per repetition.
Then you use the split_part function on the column, with numbers.num selecting a different part of the text for each of those rows.
SELECT
comma_delimited_string
, numbers.num
, REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields
, SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN
(
select
(row_number() over (order by 1))::int as num
from
mytable
limit 15 --max num of fields
) as numbers
ON numbers.num <= regexp_count(comma_delimited_string , ',') + 1
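To mirror the SELECT ... INTO of the original T-SQL, the same join can be wrapped in a CREATE TABLE AS (a sketch; id_1, id_2 and the 60-value maximum come from the question, and the result table name is an assumption):
-- Sketch: materialize one row per comma-separated value, as in the T-SQL version.
CREATE TABLE myresultstable AS
SELECT
id_1
, id_2
, SPLIT_PART(comma_delimited_string, ',', numbers.num) AS value
FROM mytable
JOIN
(
select
(row_number() over (order by 1))::int as num
from
mytable
limit 60 --max num of fields
) as numbers
ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
WHERE comma_delimited_string IS NOT NULL;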

Alphanumeric Sorting in PostgreSQL

I have this table with a character varying column in Postgres 9.6:
id | column
------------
1 |IR ABC-1
2 |IR ABC-2
3 |IR ABC-10
I see some solutions typecasting the column as bytea:
select * from "table" order by "column"::bytea;
But it always results in:
id | column
------------
1 |IR ABC-1
2 |IR ABC-10
3 |IR ABC-2
I don't know why '10' always comes before '2'. How do I sort this table, assuming the basis for ordering is the last whole number of the string, regardless of what the characters before that number are?
When sorting character data types, collation rules apply, unless you work with locale "C", which sorts characters by their byte values. Applying collation rules may or may not be desirable; it makes sorting more expensive in any case. If you want to sort without collation rules, don't cast to bytea, use COLLATE "C" instead (quoting "table" and "column" below, since both happen to be reserved words):
SELECT * FROM "table" ORDER BY "column" COLLATE "C";
However, this does not yet solve the problem with the numbers in the string you mention. Split the string and sort the numeric part as a number:
SELECT *
FROM "table"
ORDER BY split_part("column", '-', 2)::numeric;
Or, if all your numbers fit into bigint or even integer, use that instead (cheaper).
I ignored the leading part because you write:
... the basis for ordering is the last whole number of the string, regardless of what the character before that number is.
Related:
Alphanumeric sorting with PostgreSQL
Split comma separated column data into additional columns
What is the impact of LC_CTYPE on a PostgreSQL database?
Typically, it's best to save distinct parts of a string in separate columns as proper respective data types to avoid any such confusion.
And if the leading string is identical for all columns, consider just dropping the redundant noise. You can always use a VIEW to prepend a string for display, or do it on-the-fly, cheaply.
As in the comments, split the string and cast the integer part:
select *
from "table"
cross join lateral regexp_split_to_array("column", '-') as r(a)
order by a[1], a[2]::integer;

sort varchar column with alphanumeric data in Redshift

I have a column in a Redshift database which contains values like 11E, 11N, 11W, 12W, 12E, 12S, 1S, 2E. I need to sort the column like 1S, 2E, 11E, 11N, 11W, 12E, 12S, 12W.
You need to separate the numbers from the characters.
Try this ORDER BY:
SELECT a FROM example
ORDER BY
convert(integer, case when length(a) = 2 then left(a, 1) else left(a, 2) end),
right(a,1)
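A more general variant that also handles values longer than three characters (a sketch, assuming every value is a numeric prefix followed by letters) splits the two parts with regexp_substr:
SELECT a FROM example
ORDER BY
REGEXP_SUBSTR(a, '^[0-9]+')::int, -- numeric prefix
REGEXP_SUBSTR(a, '[A-Za-z]+$') -- trailing letters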