Aggregate multiple columns in a SQL table as JSON or array - PostgreSQL

I would like to aggregate some columns into an array or JSON object in Redash.
The data table is in a Presto database, and I also need to query it from PySpark Hive.
The table is large, and I need to keep the result as small as possible so that I can save the dataframe to S3 quickly and then read it back from S3 as Parquet efficiently.
I am not sure what the best data structure for this is (a JSON object? an array of arrays?).
The original table (> 10^9 rows, some columns (e.g. obj_desc) may have more than 30 English words):
id | cat_name  | cat_desc      | obj_name | obj_desc    | obj_num
1  | furniture | living office | desk     | 4 corners   | 1.5
1  | furniture | living office | chair    | 4 legs      | 0.8
1  | furniture | restroom      | tub      | white wide  | 2.7
1  | cloth     | fashion       | T-shirt  | black large | 1.1
I want (this may not be the best data structure):
id | cat_item_aggregation
1  | [['furniture', ['living office', ['desk', '4 corners', '1.5'], ['chair', '4 legs', '0.8']], ['restroom', ['tub', 'white wide', '2.7']]], ['cloth', ['fashion', ['T-shirt', 'black large', '1.1']]]]
I have tried array_agg, following "PostgreSQL: Efficiently aggregate array columns as part of a group by" and "Postgres - aggregate two columns into one item", and also json_build_object, following "Return as array of JSON objects in SQL (Postgres)" and "How to group multiple columns into a single array or similar?", but they do not work in Redash.
Could anybody let me know what the best data structure would be for this kind of table, and how to build it?
JSON may be better than an array of arrays, because it is hard to pull individual elements back out of nested arrays?
Thanks.
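For reference, a minimal Spark SQL sketch of one way to build such a document per id. It assumes the source table is registered as a view named items with the columns shown above (the view name is an assumption):

-- Spark SQL sketch: one row per id, with all category/object rows
-- collected into a single JSON array of objects.
SELECT
    id,
    to_json(
        collect_list(
            named_struct(
                'cat_name', cat_name,
                'cat_desc', cat_desc,
                'obj_name', obj_name,
                'obj_desc', obj_desc,
                'obj_num',  obj_num
            )
        )
    ) AS cat_item_aggregation
FROM items
GROUP BY id;

If the result is only ever written to Parquet and read back with Spark, dropping to_json and keeping the array-of-structs column is usually more compact and easier to decompose than a JSON string or nested arrays, because Parquet stores nested structs natively.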

Related

How to efficiently retrieve rows with large JSON objects in Postgres

I've inherited the task of retrieving data from a Postgres table.
The table has ~1m rows, and there are about 145k rows that I wish to retrieve. These 145k rows have a common string in one of their columns batch_name that I can use to search for them.
The table has two columns payload & result that are of type JSON. The result column contains the data that I wish to retrieve.
When I make even the simplest queries to the table:
SELECT * FROM table_name WHERE batch_name = 'an_id' limit 10
The request takes ~7-10 seconds to return data.
This is despite the fact that the batch_name column has an index on it and is of type varchar(255).
Whilst investigating this, I've discovered that the JSON objects in the result column and payload column can be absolutely gigantic objects. When prettified, they are sometimes ~27k lines long.
These gigantic JSON objects seem to be the root cause of the problem.
My questions are:
What can I do to improve the efficiency of this query? Or is the ultimate solution here to just modify the table such that we are no longer storing gigantic JSON objects?
Given that I don't need to actually query fields in these JSON objects (but I DO need to retrieve them), would simply storing them as strings improve efficiency?
Why is storing large JSON objects SO inefficient?
Thanks in advance for any help, it's much appreciated.
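A hedged sketch of the usual first steps, reusing the table and column names from the question (the id column and the 'status' key are assumptions): check where the time actually goes, and avoid SELECT * so the huge JSON values are never fetched in full.

-- See whether the index is used and how much time goes into fetching rows.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table_name WHERE batch_name = 'an_id' LIMIT 10;

-- Fetch only what is needed; skipping payload avoids reading it at all,
-- and ->> pulls a single field instead of the whole ~27k-line object.
SELECT id,
       result ->> 'status' AS status   -- 'status' is a hypothetical key
FROM   table_name
WHERE  batch_name = 'an_id'
LIMIT  10;

Large json values are stored out of line in the table's TOAST storage and compressed, so every SELECT * has to fetch and decompress them; retrieving only the columns or fields you need sidesteps most of that cost.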

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user | timestamp           | event_type
1    | 2021-01-01 12:00:00 | foo
1    | 2021-01-01 15:00:00 | bar
2    | 2021-01-01 16:00:00 | foo
2    | 2021-01-01 19:00:00 | foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendly" approach to this problem would simply be to have an aggregate column for each event type:
user | nb_events | ... | nb_foo | nb_bar
1    | 2         | ... | 1      | 1
2    | 2         | ... | 2      | 0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1600 columns). Moreover, there may be multiple types of aggregations on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user | nb_events | ... | count_by_event_type (SUPER)
1    | 2         | ... | {"foo": 1, "bar": 1}
2    | 2         | ... | {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? All Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
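For what it's worth, here is a hedged sketch of one way to work around this: build the key/value pairs as an ordinary JSON string with LISTAGG and convert the result to SUPER with JSON_PARSE. The table and column names (events, user_id, event_type) are assumptions, and whether JSON_PARSE accepts a computed expression like this can depend on your Redshift version, so treat this as a sketch rather than a confirmed recipe.

-- Sketch: per-user SUPER document built from a regular GROUP BY.
-- Assumes event_type values contain no characters that need JSON escaping.
SELECT
    user_id,
    SUM(cnt) AS nb_events,
    JSON_PARSE(
        '{' || LISTAGG('"' || event_type || '":' || cnt::text, ',') || '}'
    ) AS count_by_event_type
FROM (
    SELECT user_id, event_type, COUNT(*) AS cnt
    FROM events
    GROUP BY user_id, event_type
) per_type
GROUP BY user_id;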
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events | ... | count_by_event_type (SUPER)
4         | ... | {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select
                k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                 unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
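One way to sidestep the correlated-subquery restriction, sketched under the same assumptions as above (the UNPIVOT syntax is taken from the query in the question, and JSON_PARSE over a computed string may depend on your Redshift version): compute the two aggregates in separate, uncorrelated subqueries and cross join the single-row results.

-- Sketch: re-aggregate over all users without a correlated subquery.
WITH totals AS (
    SELECT SUM(nb_events) AS nb_events
    FROM my_aggregated_table
), per_type AS (
    SELECT k::text AS event_type, SUM(v::int) AS cnt
    FROM my_aggregated_table t, UNPIVOT t.count_by_event_type AS v AT k
    GROUP BY k
), rebuilt AS (
    SELECT JSON_PARSE(
        '{' || LISTAGG('"' || event_type || '":' || cnt::text, ',') || '}'
    ) AS count_by_event_type
    FROM per_type
)
SELECT totals.nb_events, rebuilt.count_by_event_type
FROM totals CROSS JOIN rebuilt;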
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.
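And for Q3, if falling back to Spark is acceptable, the whole aggregation is only a few lines of Spark SQL over the raw events. A sketch, assuming the events are exposed as a view named events (backticks because user and timestamp can clash with built-in names):

-- Spark SQL sketch: one document per user with a native map column,
-- which can be kept as map<string,bigint> or serialised with to_json.
SELECT
    `user`,
    SUM(cnt)                                                 AS nb_events,
    MAX(last_event)                                          AS most_recent_event,
    map_from_entries(collect_list(struct(event_type, cnt)))  AS count_by_event_type
FROM (
    SELECT `user`, event_type, COUNT(*) AS cnt, MAX(`timestamp`) AS last_event
    FROM events
    GROUP BY `user`, event_type
) AS per_type
GROUP BY `user`;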

kdb+/q optimize union function

To give you a bit of background: I have a process which runs a large, complex calculation that takes a while to complete. It runs on a timer. After some investigation I realised that what is causing the slowness isn't the actual calculation but the built-in q function, union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns; the first column is a symbol. Table A is actually a compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
/ column names added so the table literals parse and the two schemas match
big:([]s:n?`4;v:n?100)
small:([]s:500?`4;v:500?100)
\ts big union small
I tried keying both columns and upserting, join and then distinct, "big, small where not small in big" but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table, it has to be keyed and the upsert operator should be used. For example:
n:5000000
//big ids are unique numbers from 0 to 4999999
//table is keyed with the 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//small ids are unique numbers: 250 from the 0-4999999 interval and 250 from the 5000000-9999999 interval
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable, it is efficient to upsert into it by name:
`big upsert small
If big is local:
big:big upsert small
As a result, big will have 5,000,250 rows, because 250 of the 500 keys (id column) in small already exist in big.
This may not be relevant, but just a quick thought: if your big table has a column of type `sym and this column does not really show up much throughout your program, why not cast it to string or some other type? If you are doing this update process every single day, then as the data gets packed into your partitioned HDB, whenever new data is added the kdb+ process has to reassign/rewrite its sym file, and I believe this is the part that actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting the schema for the table to minimise the amount of rehashing (not sure if that is the right term!) of your sym file, or, as the person above mentioned, trying to apply an attribute to your table. This may reduce the time too.

PostgreSQL - GIN index doesn't help in improving performance

I am pretty new to the PostgreSQL world.
I store the following JSON objects in a jsonb column of PostgreSQL, one object per row:
{"cid":"CID1","Display":"User One CID1","F-Name":"Craig","LName":"One"}
{"cid":"CID1","Display":"User One CID1","F-Name":"Leo","LName":"One"}
{"cid":"CID2","OrderNo":"Ordr One Ord1","O-Name":"Michael","LName":"One"}
{"cid":"CID2","OrderNo":"Ordr One Ord1","O-Name":"Sam","LName":"One"}
{"cid":"CID3","InvocNo":"Invc One Inv1","I-Name":"Ron","LName":"One"}
{"cid":"CID3","InvocNo":"Invc One Inv1","I-Name":"Books","LName":"One"}
So these N objects are stored as N rows in a jsonb column (named res). I have a requirement to query these JSON objects with text-match / "contains"-type queries on keys such as 'Display', 'OrderNo', 'InvocNo', 'F-Name', and 'O-Name'.
The JSON objects are dynamic, and the keys of one JSON object may not match those of another. I am currently creating a GIN index on the res column like below:
CREATE INDEX gin_idx ON mytable USING gin (res)
Filter queries on these keys do not show any improvement when using the GIN index. My DB is filled with 50,000 rows of such data.
Across all these JSON objects, only the 'cid' key is common to every object.
Which type of index is best suited to this scenario, given that a key present in one JSON object may not be part of another?
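A hedged sketch of why the index may not be kicking in and what tends to work instead (table and column names are from the question; the key names come from the sample rows): the default jsonb GIN index only accelerates operators such as @> (containment) and ? (key exists), not LIKE-style substring matches on values. For substring search on a specific key, an expression index with pg_trgm is one common option.

-- Exact-value containment: this form CAN use gin_idx.
SELECT * FROM mytable WHERE res @> '{"F-Name": "Craig"}';

-- Substring / contains matching on one key: a trigram expression index.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_res_fname_trgm
    ON mytable USING gin ((res ->> 'F-Name') gin_trgm_ops);
SELECT * FROM mytable WHERE res ->> 'F-Name' ILIKE '%craig%';

One such expression index is needed per frequently searched key, which is workable when the set of searchable keys ('Display', 'OrderNo', 'InvocNo', ...) stays small even though the documents themselves are dynamic.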

How to retrieve a list of Columns from a single row in Cassandra?

The below is a sample of my Cassandra CF.
column1 column2 column3 ......
row1 : name:abay,value:10 name:benny,value:7 name:catherine,value:24 ................
ComparatorType:utf8
How can I fetch the columns with names ('abay', 'john', 'peter', 'allen') from this row in a single query using the Hector API?
The number of names in the list may vary every time.
I know that I can get them in sorted order using a SliceQuery.
But there are cases when I need to fetch data randomly, as I mentioned above.
Kindly help me.
Based on your query, it seems you have two options.
If you only need to run this query occasionally, you can get all columns for the row and filter them on the client. If you have at most a few thousand columns, this should be ok for an occasional query.
If you need to run this frequently, you'll want to write the data such that you can query using name as the key. This probably means you'll have to write the data twice into two CFs, where one is by your current key, and the other is by name. This is a common Cassandra tactic.