Postgres: Are There Downsides to Using a JSON Column vs. an integer[] Column?

TLDR: If I want to save arrays of integers in a Postgres table, are there any pros or cons to using an array column (integer[]) vs. using a JSON column (e.g. does one perform better than the other)?
Backstory:
I'm using a PostgreSQL database, and Node/Knex to manage it. Knex doesn't have any way of directly defining a PostgreSQL integer[] column type, so someone filed a Knex bug asking for it ... but one of the Knex devs closed the ticket, essentially saying that there was no need to support PostgreSQL array column types when anyone can instead use the JSON column type.
My question is, what downsides (if any) are there to using a JSON column type to hold a simple array of integers? Are there any benefits, such as improved performance, to using a true array column, or am I equally well off by just storing my arrays inside a JSON column?
EDIT: Just to be clear, all I'm looking for in an answer is either of the following:
A) an explanation of how JSON columns and integer[] columns in PostgreSQL work, including either how one is better than the other or how the two are (at least roughly) equal.
B) no explanation, but at least a reference to some benchmarks that show that one column type or the other performs better (or that the two are equal)

An int[] is a lot more efficient in terms of the storage it requires. Consider the following query, which returns the size of an array with 500 elements:
select pg_column_size(array_agg(i)) as array_size,
       pg_column_size(jsonb_agg(i)) as jsonb_size,
       pg_column_size(json_agg(i)) as json_size
from generate_series(1,500) i;
returns:
array_size | jsonb_size | json_size
-----------+------------+----------
2024 | 6008 | 2396
(I am quite surprised that the JSON value is so much smaller than the JSONB, but that's a different topic)
If you always use the array as a single value, it does not really matter in terms of query performance. But if you need to look into the array and search for specific value(s), that will be a lot more efficient with a native array.
There are a lot more functions and operators available for native arrays than there are for JSON arrays. You can easily search for a single value in a JSON array, but searching for multiple values requires workarounds.
The following query demonstrates that:
with array_test (id, int_array, json_array) as (
  values
    (1, array[1,2,3], '[1,2,3]'::jsonb)
)
select id,
       int_array @> array[1] as array_single,
       json_array @> '1' as json_single,
       int_array @> array[1,2] as array_all,
       json_array ?& array['1','2'] as json_all,
       int_array && array[1,2] as array_any,
       json_array ?| array['1','2'] as json_any
from array_test;
You can easily query an array to see whether it contains one specific value, and this also works for JSON arrays; those are the expressions array_single and json_single. With a native array you could also use 1 = any(int_array) instead.
But checking whether an array contains all values from a list, or any value from a list, does not work with JSON arrays; those are the json_all and json_any expressions.
The above test query returns:
id | array_single | json_single | array_all | json_all | array_any | json_any
---+--------------+-------------+-----------+----------+-----------+---------
1 | true | true | true | false | true | false
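As mentioned above, the single-value check on a native array can also be written with any(). A minimal, self-contained sketch using the same test data:
with array_test (id, int_array, json_array) as (
  values
    (1, array[1,2,3], '[1,2,3]'::jsonb)
)
select id,
       1 = any(int_array) as contains_one
from array_test;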

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user | timestamp           | event_type
-----+---------------------+-----------
   1 | 2021-01-01 12:00:00 | foo
   1 | 2021-01-01 15:00:00 | bar
   2 | 2021-01-01 16:00:00 | foo
   2 | 2021-01-01 19:00:00 | foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendy" approach to this problem would simply be to have an aggregate column for each event type:
user | nb_events | ... | nb_foo | nb_bar
-----+-----------+-----+--------+-------
   1 |         2 | ... |      1 |      1
   2 |         2 | ... |      2 |      0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1,600 columns). Moreover, there may be multiple types of aggregation on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user | nb_events | ... | count_by_event_type (SUPER)
-----+-----------+-----+----------------------------
   1 |         2 | ... | {"foo": 1, "bar": 1}
   2 |         2 | ... | {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? The Redshift docs only really discuss SUPER columns in the context of the initial data load (e.g. by using json_parse), but never the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data and then convert it to columnar data as soon as possible.
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events | ... | count_by_event_type (SUPER)
----------+-----+----------------------------
        4 | ... | {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select
                k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                 unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

Get string after ',' delimiter comma or special characters

The field name is message, table name is log.
Data Examples:
Values for message:
"(wsname,cmdcode,stacode,data,order_id) values (hyd-l904149,2,1,,1584425657892);"
"(wsname,cmdcode,stacode,data,order_id) values (hyd-l93mt54,2,1,,1584427657892);"
(command_execute,order_id,workstation,cmdcode,stacode,application_to_kill,application_parameters) values (kill, 1583124192811, hyd-psag314, 10, 2, tsws.exe, -u production ); "
In the log table I need to extract a separate column wsname with the values hyd-l904149, hyd-l93mt54 and hyd-psag314, a column cmdcode with the values 2, 2 and 10, and a column stacode with the values 1, 1 and 2, e.g.:
wsname      | cmdcode | stacode
------------+---------+--------
hyd-l904149 |       2 |       1
hyd-l93mt54 |       2 |       1
hyd-psag314 |      10 |       2
Use regexp_matches to extract the left and right parts of the values clause, then regexp_split_to_array to split those parts on commas, then filter the rows containing wsname using the 'wsname' = any(your_array) construct, and finally select the required columns from the values array.
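A minimal sketch of that approach, assuming a Postgres version that has array_position (9.5 or later):
select vals[array_position(keys, 'wsname')]  as wsname,
       vals[array_position(keys, 'cmdcode')] as cmdcode,
       vals[array_position(keys, 'stacode')] as stacode
from (
  select regexp_split_to_array(e[1], '\s*,\s*') as keys,  -- column names inside the first (...)
         regexp_split_to_array(e[2], '\s*,\s*') as vals   -- values inside the second (...)
  from log l,
       regexp_matches(l.message, '\(([^\)]+)\)\s+values\s+\(([^\)]+)\)') as x(e)
) t
where 'wsname' = any(keys);  -- note: the third sample message uses 'workstation', so it is filtered out here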
Or, as an alternative solution: fix the data so that it is a syntactically valid part of an insert statement, create auxiliary tables, insert the data into them, and then just select.
As I mentioned in the comments, PostgreSQL has a built-in function for this:
split_part(string, delimiter, field_number)
http://www.sqlfiddle.com/#!15/eb1df/1
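For the messages where the value positions are fixed, nesting split_part is enough. A sketch against the sample data above (it does not handle the third message, whose column order differs):
select split_part(split_part(message, 'values (', 2), ',', 1) as wsname,
       split_part(split_part(message, 'values (', 2), ',', 2) as cmdcode,
       split_part(split_part(message, 'values (', 2), ',', 3) as stacode
from log;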
As the JSON capabilities of the unsupported version 9.3 are very limited, I would install the hstore extension, and then do it like this:
select coalesce(vals -> 'wsname', vals -> 'workstation') as wsname,
       vals -> 'cmdcode' as cmdcode,
       vals -> 'stacode' as stacode
from (
  select hstore(regexp_split_to_array(e[1], '\s*,\s*'),
                regexp_split_to_array(e[2], '\s*,\s*')) as vals
  from log l,
       regexp_matches(l.message, '\(([^\)]+)\)\s+values\s+\(([^\)]+)\)') as x(e)
) t
regexp_matches() splits the message into two arrays: one for the list of column names and one for the matching values. These arrays are used to create a key/value pair so that I can access the value for each column by the column name.
If you know that the positions of the columns are always the same, you can remove the use of the hstore type. But that would require quite a huge CASE expression to test where the actual columns appear.
With a modern, supported version of Postgres, I would use jsonb_object(text[], text[]) passing the two arrays resulting from the regexp_matches() call.
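A sketch of that variant on a current Postgres (jsonb_object is built in, so no extension is needed):
select coalesce(vals ->> 'wsname', vals ->> 'workstation') as wsname,
       vals ->> 'cmdcode' as cmdcode,
       vals ->> 'stacode' as stacode
from (
  select jsonb_object(regexp_split_to_array(e[1], '\s*,\s*'),
                      regexp_split_to_array(e[2], '\s*,\s*')) as vals
  from log l,
       regexp_matches(l.message, '\(([^\)]+)\)\s+values\s+\(([^\)]+)\)') as x(e)
) t;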

How to convert a response from KSQL - UDF returning JSON array to columns

I have a custom UDF called getCityStats(string city, double distance) which takes two arguments and returns an array of JSON strings (objects) as follows:
{"zipCode":"90921","mode":3.54}
{"zipCode":"91029","mode":7.23}
{"zipCode":"96928","mode":4.56}
{"zipCode":"90921","mode":6.54}
{"zipCode":"91029","mode":4.43}
{"zipCode":"96928","mode":3.96}
I would like to process them in a KSQL table creation query as
create table city_stats
as
select
zipCode,
avg(mode) as mode
from
(select
getCityStats(city,distance) as (zipCode,mode)
from
city_data_stream
) t
group by zipCode;
In other words, can KSQL handle a tuple type, so that an array of JSON strings can be processed and returned as indicated above in a table creation query?
No, KSQL doesn't currently support the syntax you're suggesting. Whilst KSQL can work with arrays, it doesn't yet support any kind of explode function, so you can only reference specific index points in the array.
Feel free to view and, if appropriate, upvote these issues: #527 and #1830, or indeed raise your own if they don't cover what you want to do.

PostgreSQL queries treat Int as string datatypes

I store the following rows in my table ('DataScreen') under a JSONB column ('Results')
{"Id":11,"Product":"Google Chrome","Handle":3091,"Description":"Google Chrome"}
{"Id":111,"Product":"Microsoft Sql","Handle":3092,"Description":"Microsoft Sql"}
{"Id":22,"Product":"Microsoft OneNote","Handle":3093,"Description":"Microsoft OneNote"}
{"Id":222,"Product":"Microsoft OneDrive","Handle":3094,"Description":"Microsoft OneDrive"}
Here, in these JSON objects, "Id" and "Handle" are integer properties while the others are string properties.
When I query my table like below:
Select Results->>'Id' From DataScreen
order by Results->>'Id' ASC
I get incorrect results because PostgreSQL treats everything as a text column and hence sorts the values as text, not as integers.
Hence it gives the result as
11,111,22,222
instead of
11,22,111,222.
I don't want to use explicit casting to retrieve it, like below,
Select Results->>'Id' From DataScreen order by CAST(Results->>'Id' AS INT) ASC
because I will not be sure of the datatype of the column, since the JSON structure is dynamic and the keys and values may change next time. The same could then happen with another JSON document that has integer and string keys.
I want something so that integers in the JSON structure of the JSONB column are treated as integers only and not as text (strings).
How do I write my query so that Id and Handle are retrieved as integer values and not as strings, without explicit casting?
I think your assumptions about the id field don't make sense. You said:
(a) Either id contains integers only or
(b) it contains strings and integers.
I'd say,
If (a) then numerical ordering is correct.
If (b) then lexical ordering is correct.
But if (a) holds for some time and then (b), then the correct order changes, too. And that doesn't make sense. Imagine:
For the current database you expect the order 11,22,111,222. Then you add a row
{"Id":"aa","Product":"Microsoft OneDrive","Handle":3095,"Description":"Microsoft OneDrive"}
and suddenly the correct order of the other rows changes to 11,111,22,222,aa. That sudden change is what bothers me.
So I would either expect a lexical ordering ab initio, or restrict my id field to integers and use explicit casting.
Every other option I can think of is just not practical. You could, for example, create a custom < and > implementation for your id field which results in 11,111,22,222,aa. ("Order all integers by numerical value and all strings by lexical order and put all integers before the strings").
But that is a lot of work (it involves a custom data type, a custom cast function and a custom operator function) and yields some counterintuitive results, e.g. 11,111,22,222,0a,1a,2a,aa (note the position of 0a and so on. They come after 222).
Hope that helps ;)
If Id is always an integer, you can cast it in the select part and just use ORDER BY 1:
select (Results->>'Id')::int From DataScreen order by 1 ASC

array_agg guaranteed consistent across multiple columns in Postgres?

Suppose I have the following table in Postgres 9.4:
a | b
---+---
1 | 2
3 | 1
2 | 3
1 | 1
If I run
select array_agg(a) as a_agg, array_agg(b) as b_agg from foo
I get what I want
a_agg | b_agg
-----------+-----------
{1,3,2,1} | {2,1,3,1}
The orderings of the two arrays are consistent: the first element of each comes from a single row, as does the second, as does the third. I don't actually care about the order of the arrays, only that they be consistent across columns.
It seems natural that this would "just happen", and it seems to. But is it reliable? Generally, the ordering of SQL things is undefined unless an ORDER BY clause is specified. It is perfectly possible to get postgres to generate inconsistent pairings with inconsistent ORDER BY clauses within array_agg (with some explicitly counterproductive extra work):
select array_agg(a order by b) as agg_a, array_agg(b order by a) as agg_b from foo;
yields
agg_a | agg_b
-----------+-----------
{3,1,1,2} | {2,1,3,1}
This is no longer consistent. The first array elements 3 and 2 did not come from the same original row.
I'd like to be certain that, without any ORDER BY clause, the natural thing just happens. Even with an ordering on either column, ambiguity would remain because of the duplicate elements. I'd prefer to avoid imposing an unambiguous sort, because in my real application, the tables will be large and the sorting might be costly. But I can't find any documentation that guarantees or specifies that, absent imposition of inconsistent orderings, multiple array_agg calls will be ordered consistently, even though it'd be very surprising if they weren't.
Is it safe to assume that the ordering of multiple array_agg columns will be consistently ordered when no ordering is explicitly imposed on the query or within the aggregate functions?
According to the PostgreSQL documentation:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. [...]
However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering.
The way I understand it: you can't be sure that the order of rows is preserved unless you use ORDER BY.
It seems there is a similar (or almost the same) question here:
PostgreSQL array_agg order
I prefer ebk's answer there:
So I think it's fine to assume that all the aggregates, none of which uses ORDER BY, in your query will see input data in the same order. The order itself is unspecified though (which depends on the order the FROM clause supplies rows).
But you can still add an ORDER BY inside the array_agg call to force the same order.
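For example, a sketch that imposes one explicit, consistent ordering inside both aggregates (using the foo table from the question):
select array_agg(a order by a, b) as a_agg,
       array_agg(b order by a, b) as b_agg
from foo;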