Athena - Union tables with incompatible data types - hiveql

We have two tables with a column differing in its data type. The column in the first table is of type int, while the same column in the second table is of type float/real. If it were a top-level column I could have CAST it to a common type; the problem here is that these columns are deep inside a struct.
The error I'm getting is:
SYNTAX_ERROR: line 23:1: column 4 in row(priceconfiguration row(maximumvalue integer, minimumvalue integer, type varchar, value integer)) query has incompatible types: Union, row(priceconfiguration row(maximumvalue integer, minimumvalue integer, type varchar, value real))
The query (simplified) is,
WITH t1 AS (
SELECT
"so"."createdon"
, "so"."modifiedon"
, "so"."deletedon"
, "so"."createdby"
, "so"."priceconfiguration"
, "so"."year"
, "so"."month"
, "so"."day"
FROM
my_db.raw_price so
UNION ALL
SELECT
"ao"."createdon"
, "ao"."modifiedon"
, "ao"."deletedon"
, "ao"."createdby"
, "ao"."priceconfiguration"
, "ao"."year"
, "ao"."month"
, "ao"."day"
FROM
my_db.src_price ao
)
SELECT t1.* FROM t1 ORDER BY "modifiedon" DESC
In fact, the real tables are more complex than this, and the priceconfiguration column is nested deep inside them. So directly CASTing the column in question is not possible unless all the structs are unnested to CAST the offending field.
Is there a way to UNION these two tables without unnesting and casting?
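For context, the cast-based workaround the question wants to avoid would look roughly like this on the integer-typed side, rebuilding the struct to the wider shape reported in the error message (a sketch only; with the real, deeper nesting the whole outer struct has to be spelled out, which is exactly why it becomes impractical):
```lang-sql
-- Sketch: cast the int-typed struct to the real-typed shape from the error
-- message so that both branches of the UNION ALL agree on the column type.
SELECT
  "so"."createdon"
, CAST("so"."priceconfiguration" AS ROW(
      priceconfiguration ROW(
          maximumvalue integer
        , minimumvalue integer
        , type varchar
        , value real))) AS priceconfiguration
FROM my_db.raw_price so
```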

The solution was to upgrade the Athena engine version to v2.
The v2 engine has better support for schema evolution. As per the AWS docs,
Schema evolution support has been added for data in Parquet format.
Added support for reading array, map, or row type columns from
partitions where the partition schema is different from the table
schema. This can occur when the table schema was updated after the
partition was created. The changed column types must be compatible.
For row types, trailing fields may be added or dropped, but the
corresponding fields (by ordinal) must have the same name.
Ref:
https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference.html

Related

How to handle NULL in visualization

I have two tables (just an example, of course) that I loaded into the app from different sources via script.
Table 1:

| ID | Attribute T1 |
|----|--------------|
| 1  | 100          |
| 3  | 200          |

Table 2:

| ID | Attribute T2 |
|----|--------------|
| 1  | Value 1      |
| 2  | Value 2      |
On a sheet I create a table with the columns ID, Attribute T1 and Attribute T2.
Finally I have this table:

| ID | Attribute T1 | Attribute T2 |
|----|--------------|--------------|
| 1  | 100          | Value 1      |
| 2  | -            | Value 2      |
| 3  | 200          | -            |
As you can see, this limits me in filtering and analyzing the data; for example, I can't show all the data that isn't represented in Table 1, or all the data where Attribute T1 is not equal to 100.
I tried to use NullAsValue, but it didn't help. I would appreciate any ideas on how to handle my case.
To achieve what you're attempting, you'll need to Join or Concatenate your tables. The reason is that Null means something different depending on how the data is loaded.
There are basically two "types" of Null:
"Implied" Null
When you associate several tables in your data model, as you've done in your example, Qlik is essentially treating that as a natural outer join between the tables. But since it's not an actual join that happens when the script executes, the Nulls that arise from data incongruencies (like in your example) are basically implied, since there really is an absence of data there. There's nothing in the data or script that actually says "there are no Attribute T1 values for ID of 2." Because of that, you can't use a function like NullAsValue() or Coalesce() to replace Nulls with another value because those Nulls aren't even there -- there's nothing to actually replace.
The above tables don't have any actual Nulls -- just implied ones from their association and the fact that the ID fields in either table don't have all the same values.
"Realized" Null
If, instead of just using associations, you actually combine the tables using the Join or Concatenate prefixes, then Qlik is forced to actually generate a Null value in the absence of data. Instead of Null being implied, it's actually there in the data model -- it's been realized. In this case, we can actually use functions like NullAsValue() or Coalesce() or Alt() to replace Nulls with another value since we actually have something in our table to replace.
The above joined table has actual Nulls that are realized in the data model, so they can be replaced.
To replace Nulls at that point, you can use the NullAsValue() or Coalesce() functions like this in the Data Load Editor:
table1:
load * inline [
ID , Attribute T1
1 , 100
3 , 200
];

// The join realizes a Null in Attribute T1 for ID 2 inside table1
table2:
join load * inline [
ID , Attribute T2
1 , Value 1
2 , Value 2
];

// Replace realized Nulls in Attribute T1 during the following load
NullAsValue [Attribute T1];
Set NullValue = '-NULL-';

new_table:
NoConcatenate load
ID
, [Attribute T1]
, Coalesce([Attribute T2], '-AlsoNULL-') as [Attribute T2]
Resident table1;

Drop Table table1;
That will result in a table like this:

| ID | Attribute T1 | Attribute T2 |
|----|--------------|--------------|
| 1  | 100          | Value 1      |
| 2  | -NULL-       | Value 2      |
| 3  | 200          | -AlsoNULL-   |
The Coalesce() and Alt() functions are also available in chart expressions.
Here are some quick links to the things discussed here:
Qlik Null interpretation
Qlik table associations
NullAsValue() function
Coalesce() function
Alt() function

In a PostgreSQL crosstab, can I automate the tuple part?

I'm trying to get a tall table (with just 3 columns: variable name, timestamp and value) into a wide format where timestamp is the index, the columns are the variable names, and the values are the values of the new table.
In python/pandas this would be something along the lines of
import pandas as pd
df = pd.read_csv("./mydata.csv") # assume timestamp, varname & value columns
df.pivot(index="timestamp", columns="varname", values="value")
For PostgreSQL there is crosstab; so far I have:
SELECT * FROM crosstab(
$$
SELECT
"timestamp",
"varname",
"value"
FROM mydata
ORDER BY "timestamp" ASC, "varname" ASC
$$
) AS ct(
"timestamp" timestamp,
"varname1" numeric,
...
"varnameN" numeric
);
The problem is that I can potentially have dozens to hundreds of variable names. The types are always numeric, and the set of variable names is not stable (we could need more variables or realize that others are not necessary).
Is there a way to automate the "ct" part so that some other query (e.g. select distinct "varname" from mydata) produces it instead of me having to type in every single variable name present?
PS: The PostgreSQL version is 12.9 at home, 14.0 in production. The number of rows in the original table is around 2 million, but I'm going to filter by timestamp and varname, so potentially only a few hundred thousand rows. After filtering I got ~50 unique varnames, but that will increase in a few weeks.
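The question itself points at one direction: have another query build the column-definition list. A minimal sketch of that idea, assuming the mydata table above (the generated text still has to be pasted into the crosstab call, or interpolated via dynamic SQL in a PL/pgSQL function):
```lang-sql
-- Build the "ct(...)" column list from the varnames actually present.
SELECT 'ct("timestamp" timestamp, '
       || string_agg(quote_ident(varname) || ' numeric', ', ' ORDER BY varname)
       || ')'
FROM (SELECT DISTINCT varname FROM mydata) AS v;
```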

How to specify on-error behavior for PostgreSQL conversion to UUID

I need to write a query to join 2 tables based on a UUID field.
Table 1 contains user_uuid of type uuid.
Table 2 has this user_uuid at the end of a url column, after the last slash.
The problem is that sometimes this url contains another value that is not castable to uuid.
A workaround like this works pretty well:
LEFT JOIN table2 on table1.user_uuid::text = regexp_replace(table2.url, '.*[/](.*)$', '\1')
However, I have a feeling that a better solution would be to try to cast to uuid before joining.
And here I have a problem. This query:
LEFT JOIN table2 on table1.user_uuid = cast (regexp_replace(table2.url, '.*[/](.*)$', '\1') as uuid)
gives ERROR: invalid input syntax for type uuid: "rfa-hl-21-014.html" SQL state: 22P02
Is there any elegant way to specify the behavior on cast error? I mean without tons of regexp checks and case-when-then-end...
Appreciate any help and ideas.
There are additional considerations when converting a uuid to text. Postgres will yield the converted value in standard form (lower case and hyphenated). However, there are other formats for the same uuid value that could occur in your input, for example upper case and not hyphenated. As text these would not compare equal, but as uuid they would. See the demo here.
select *
from table1 t1
join table2 t2
  on replace(t1.t_uuid::text, '-', '') = replace(lower(t2.t_stg), '-', '');
Since your data clearly contains non-uuid values, you cannot assume standard uuid format either. There are also additional formats (although apparently not often used) for a valid UUID. You may want to review the UUID Type documentation.
You could cast the uuid from table 1 to text and join that with the suffix from table 2. That will never give you a type conversion error.
This might require an extra index on the expression in the join condition if you need fast nested loop joins.
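A minimal sketch of such an expression index, reusing the regexp from the question (note the extra parentheses required around an indexed expression):
```lang-sql
-- Index the extracted url suffix so a nested loop join on
-- table1.user_uuid::text = <suffix> can use an index scan on table2.
CREATE INDEX table2_url_suffix_idx
    ON table2 ((regexp_replace(url, '.*[/](.*)$', '\1')));
```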

UNION types text and bigint cannot be matched

I'm running a complex stored procedure and I'm getting an error when I have 3 unions, but with 2 unions there is no error. If I remove either of the top two unions, it runs fine. If I make one of the NULLs a 0, it runs fine. The error is "UNION types text and bigint cannot be matched".
```lang-sql
SELECT NULL AS total_time_spent
FROM tbl1
GROUP BY student_id
UNION ALL
SELECT NULL AS total_time_spent
FROM tbl2
GROUP BY student_id
UNION ALL
SELECT sum(cast(("value" ->> 'seconds') AS integer)) AS total_time_spent
FROM tbl3
GROUP BY student_id
```
I've tried all kinds of casting on the sum result or the sum input. The JSON that I'm pulling from is either NULL, [], or something like this:
[{"date": "2020-09-17", "seconds": 458}]
According to the SQL standard, the NULL value exists in every data type, but lacking an explicit type cast, the first subquery resolves the data type to text (earlier versions of PostgreSQL would have used unknown here, but we don't want that data type in query results).
The error message is then a consequence of the type resolution rules for UNION in PostgreSQL.
Use an explicit type cast to avoid the problem:
SELECT CAST(NULL AS bigint) FROM ...
UNION ...
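Applied to the simplified query from the question, that would look roughly like this; only the NULL branches need the cast, since sum() over an integer already yields bigint:
```lang-sql
SELECT CAST(NULL AS bigint) AS total_time_spent
FROM tbl1
GROUP BY student_id
UNION ALL
SELECT CAST(NULL AS bigint) AS total_time_spent
FROM tbl2
GROUP BY student_id
UNION ALL
SELECT sum(cast(("value" ->> 'seconds') AS integer)) AS total_time_spent
FROM tbl3
GROUP BY student_id;
```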

Create pivot table with dynamic column names

I am creating a pivot table which represents crash values for a particular year. Currently, I am hard-coding the column names to create the pivot table. Is there any way to make the column names dynamic? The years are stored inside an array:
{2018,2017,2016 ..... 2008}
with crash as (
--- pivot table generated for total fatality ---
SELECT *
FROM crosstab('SELECT b.id, b.state_code, a.year, count(case when a.type = ''Fatal'' then a.type end) as fatality
FROM '||state_code_input||'_all as a, (select * from source_grid_repository where state_code = '''||upper(state_code_input)||''') as b
where st_contains(b.geom,a.geom)
group by b.id, b.state_code, a.year
order by b.id, a.year',$$VALUES ('2018'),('2017'),('2016'),('2015'),('2014'),('2013'),('2012'),('2011'),('2010'),('2009'),('2008') $$)
AS pivot_table(id integer, state_code varchar, fat_2018 bigint, fat_2017 bigint, fat_2016 bigint, fat_2015 bigint, fat_2014 bigint, fat_2013 bigint, fat_2012 bigint, fat_2011 bigint, fat_2010 bigint, fat_2009 bigint, fat_2008 bigint)
)
In the above code, fat_2018, fat_2017, fat_2016, etc. were hard-coded. I need the years after fat_ to be dynamic.
This question has been asked many times, and there are decent (even dynamic) solutions. While CROSSTAB() is available in recent versions of Postgres, not everyone has sufficient privileges to install the prerequisite extensions.
One such solution involves a temp type (temp table) created by an anonymous function and JSON expansion of the resulting type.
See also: DB FIDDLE (UK): https://dbfiddle.uk/Sn7iO4zL
How to pivot or crosstab in postgresql without writing a function?
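For a rough idea of the shape such dynamic solutions take, here is a sketch that sticks with crosstab() (so, unlike the JSON approach above, it still needs the tablefunc extension): an anonymous DO block builds the fat_YYYY column list from the years that actually occur and EXECUTEs the pivot into a temp table. The table and column names (crash_counts, id, year, fatality) are placeholders, not the ones from the question:
```lang-sql
DO $$
DECLARE
    col_list text;
BEGIN
    -- Build "fat_2018 bigint, fat_2017 bigint, ..." from the data itself.
    SELECT string_agg(format('fat_%s bigint', year), ', ' ORDER BY year DESC)
      INTO col_list
      FROM (SELECT DISTINCT year FROM crash_counts) AS y;

    -- Interpolate the column list into the crosstab call and materialize it.
    EXECUTE format(
        'CREATE TEMP TABLE pivot_result AS
         SELECT * FROM crosstab(
             ''SELECT id, year, fatality::bigint FROM crash_counts ORDER BY 1, 2'',
             ''SELECT DISTINCT year FROM crash_counts ORDER BY 1 DESC''
         ) AS ct(id integer, %s)',
        col_list);
END $$;

SELECT * FROM pivot_result;
```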
It is not possible. PostgreSQL has a strict type system. The result is a table (relation), and the format of this table (columns, column names, column types) must be defined before query execution, at planning time. So you cannot write a query for Postgres that returns a dynamic number of columns.