split a field in redshift - amazon-redshift

I have a table in Redshift that contains some concatenated ids:
Product_id , options_id
1, 2
5, 5;9;7
52, 4;5;8,11
I want to split every row of my table like this:
Product_id , options_id
1 , 2
5, 5
5, 9
5, 7
52, 4
52, 5
52, 9
In the Redshift documentation I found a similar function, split_part, but with that function I must specify the index of the part I want to get. For example:
Product_id , options_id
5, 5;9;7
split_part(options_id, ';', 2) will return 9.
Any help please
Thanks.

So, the problem here is to take one row and split it into multiple rows. That's not too hard in PostgreSQL -- you could use the unnest() function.
However, Amazon Redshift does not implement every function available in PostgreSQL, and unnest() is unsupported.
While it is possible to write a User Defined Function in Redshift, the function can only return one value, not several rows.
A good option is to iterate through each part, extracting each in turn as a row. See the workaround in Error while using regexp_split_to_table (Amazon Redshift) for a clever implementation (but still something of a hack). This is a similar concept to Expanding JSON arrays to rows with SQL on RedShift.
The bottom line is that you can come up with some hacks that will work to a limited degree, but the best option is to clean the data before loading it into Amazon Redshift. At the moment, Redshift is optimized for extremely fast querying over massive amounts of data, but it is not fully-featured in terms of data manipulation. That will probably change in future (just like User Defined functions were not originally available) but for now we have to work within its current functionality.
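For reference, a minimal sketch of that kind of hack in Redshift SQL (an assumption on my part, not code from the linked answers): cross-join the table against a small numbers table and pull each piece out with split_part. It assumes no row ever has more than ten parts and that the table is called product_options.
-- hypothetical numbers-table workaround; adjust the upper bound to your data
with numbers as (
    select 1 as n union all select 2 union all select 3 union all select 4 union all select 5
    union all select 6 union all select 7 union all select 8 union all select 9 union all select 10
)
select po.product_id,
       split_part(po.options_id, ';', n.n) as option_id
from product_options po
join numbers n
  on n.n <= regexp_count(po.options_id, ';') + 1   -- one part per delimiter, plus one
order by po.product_id, n.n;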

Stealing from this answer Split column into multiple rows in Postgres
select product_id, p.option
from product_options po,
unnest(string_to_array(po.options_id, ';')) p(option)
sqlfiddle

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user | timestamp           | event_type
1    | 2021-01-01 12:00:00 | foo
1    | 2021-01-01 15:00:00 | bar
2    | 2021-01-01 16:00:00 | foo
2    | 2021-01-01 19:00:00 | foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendly" approach to this problem would simply be to have an aggregate column for each event type:
user | nb_events | ... | nb_foo | nb_bar
1    | 2         | ... | 1      | 1
2    | 2         | ... | 2      | 0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1600 columns). Moreover, there may be multiple types of aggregations on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user | nb_events | ... | count_by_event_type (SUPER)
1    | 2         | ... | {"foo": 1, "bar": 1}
2    | 2         | ... | {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? All Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
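For illustration, one possible direction might look like the sketch below. It is untested and rests on a few assumptions: the source table is called events(user_id, event_type), event_type never contains quote characters, and json_parse accepts a computed string expression on your cluster. The idea is to assemble the object as a JSON string with listagg, then convert it with json_parse.
-- hypothetical sketch: build the SUPER column from a SELECT via listagg + json_parse
select user_id,
       sum(cnt) as nb_events,
       json_parse('{' || listagg('"' || event_type || '":' || cnt::text, ',')
                         within group (order by event_type) || '}') as count_by_event_type
from (
    select user_id, event_type, count(*) as cnt
    from events
    group by user_id, event_type
) per_type
group by user_id;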
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events | ... | count_by_event_type (SUPER)
4         | ... | {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select
                k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                 unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
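For reference, a non-correlated rewrite along these lines might look like the sketch below (untested; it assumes the re-aggregation can be split into separate steps and again leans on the listagg + json_parse construction from Q1):
-- hypothetical sketch: unpivot first, aggregate per key, then rebuild the SUPER object
with exploded as (
    select k::varchar as event_type, v::int as cnt
    from my_aggregated_table t, unpivot t.count_by_event_type as v at k
),
per_key as (
    select event_type, sum(cnt) as cnt
    from exploded
    group by event_type
)
select (select sum(nb_events) from my_aggregated_table) as nb_events,
       json_parse('{' || listagg('"' || event_type || '":' || cnt::text, ',')
                         within group (order by event_type) || '}') as count_by_event_type
from per_key;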
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for an HDB?
I tried select[1] from t where date=2021.02.25 but it gives an error:
Not yet implemented: it probably makes sense, but it’s not defined nor implemented, and needs more thinking about as the language evolves
select[n] syntax works only if the table is already loaded into memory.
The easiest way to get the first row of an HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied to already loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to map constraints in first before pulling a subset, this is a reasonable solution.
It will however pull N rows for each date partition selected due to the way that q queries work under the covers. So YMMV and this might not be the best solution if it was behind an API for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sum=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
Per your first request (which is different from the above), select[1] from t, you can achieve that with
.Q.ind[t;enlist 0]

Unique among two columns

Assuming PostgreSQL >= 10, is there a way to constrain a table to have unique values across two (or more) columns? That is, a value can only appear in one of the columns. I'd like to avoid triggers as long as I can. For a single column that would be trivial.
Let's have this table:
CREATE TABLE foo (
col1 INTEGER,
col2 INTEGER
);
So it should be
1 2
4 3
5 7
While 8 4 would be impossible, because there is 4 3 already.
So far I figured it might be possible with the constraint EXCLUDE ((ARRAY[col1, col2]) WITH &&), but it seems unsupported (yet?):
ERROR: operator &&(anyarray,anyarray) is not a member of operator family "array_ops"
This requirement could also be seen as requiring that the table inner-joined with itself (on a.col1 = b.col2) be empty. I guess I could use triggers, but I'd like to avoid them as long as I can.
P. S. Here is a related question.
I'm pretty sure this answer is quite close to what you're looking to achieve, but, as mentioned in that answer, there's no true way to do this, as it is not common practice.
In programming, when something like this happens, it would be better to perform some database refactoring to find an alternative, more ideal, solution.
Hope to be of any help!
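As an aside, the direction the EXCLUDE attempt points in can apparently be pushed further with the intarray extension, which ships a GiST operator class for int4[] that does support && -- a hedged sketch, not something verified in this thread:
-- hypothetical sketch: intarray's gist__int_ops instead of the stock anyarray && operator
CREATE EXTENSION IF NOT EXISTS intarray;

CREATE TABLE foo (
    col1 INTEGER NOT NULL,
    col2 INTEGER NOT NULL,
    EXCLUDE USING gist ((ARRAY[col1, col2]) gist__int_ops WITH &&)
);

INSERT INTO foo VALUES (1, 2), (4, 3), (5, 7);  -- accepted
INSERT INTO foo VALUES (8, 4);                  -- rejected: 4 already appears in col2
Each row's ARRAY[col1, col2] is compared against every other row's, so any value shared between the two columns across rows is rejected; note that intarray only covers integer arrays and the columns need to be NOT NULL.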

insert based on value in first row

I have a fixed file that I am importing into a single column with data similar to what you see below:
ABC$ WC 11683
11608000163118430001002010056788000000007680031722800315723
11683000486080280000002010043213000000007120012669100126691
ABC$ WC 000000020000000148000
ABC$ WC 11683
1168101057561604000050200001234000000027020023194001231940
54322010240519720000502000011682000000035640006721001067210
1167701030336257000050200008765000000023610029066101151149
11680010471244820000502000011680000000027515026398201263982
I want to split and insert this data into another table but I want to do so as long as the '11683' is equal to a column value in a different table + 1. I will then increment that value (not seen here).
I tried the following:
declare @blob as varchar(5)
declare @Num as varchar(5)
set @blob = substring(sdg_winn_blob.blob, 23, 5)
set @Num = (Cnum.num + 1)
IF @blob = @Num
    INSERT INTO SDG_CWF
    (
        GAME, SERIAL, WINNER, TYPE
    )
    SELECT convert(numeric, substring(blob, 28, 5)),
           convert(numeric, substring(blob, 8, 9)),
           (Case when (substring(blob, 6, 2) = '10') then '3'
                 when (substring(blob, 6, 2) = '11') then '4'
                 else substring(blob, 7, 1)
            End),
           (Case when (substring(blob, 52, 2) = '10') then '3'
                 when (substring(blob, 52, 2) = '11') then '4'
                 else substring(blob, 53, 1)
            End)
    FROM sdg_winn_blob
    WHERE blob not like 'ABC$%'
else
    print 'The Job Failed'
The insert works fine until I try to check to see if the number at position (23, 5) is the same as the number in the Cnum table. I get the error:
Msg 4104, Level 16, State 1, Line 4
The multi-part identifier "sdg_winn_blob.blob" could not be bound.
Msg 4104, Level 16, State 1, Line 5
The multi-part identifier "Cnum.num" could not be bound.
It looks like you may be used to a procedural, object oriented style of coding. SQL Server wants you to think quite differently...
This line:
set @blob = substring(sdg_winn_blob.blob, 23, 5)
Is failing because SQL interprets it in isolation. Within just that line, you haven't told SQL what the object sdg_winn_blob is, nor its member blob.
Since those things are database tables / columns, they can only be accessed as part of a query including a FROM clause. It's the FROM that tells SQL where these things are.
So you'll need to replace that line (and the immediate next one) with something like the following:
Select @blob = substring(sdg_winn_blob.blob, 23, 5)
From sdg_winn_blob
Where...
Furthermore, as far as I can tell, your whole approach here is conceptually iterative: you're thinking about this in terms of looking at each line in turn, processing it, then moving onto the next. SQL does provide facilities to do that (which you've not used here), but they are very rarely the best solution. SQL prefers (and is optimised for) a set based approach: design a query that will operate on all rows in one go.
As it stands I don't think your query will ever do quite what you want, because you're expecting iterative behaviour that SQL doesn't follow.
The way you need to approach this if you want to "think like SQL Server" is to construct (using just SELECT type queries) a set of rows that has the '11683' type values from the header rows, applied to each corresponding "data" row that you want to insert to SDG_CWF.
Then you can use a SQL JOIN to link this row set to your Cnum table and ascertain, for each row, whether it meets the condition you want in Cnum. This set of rows can then just be inserted into SDG_CWF. No variables or IF statement involved (they're necessary in SQL far less often than some people think).
There are multiple possible approaches to this, none of them terribly easy (unless I'm missing something obvious). All will need you to break your logic down into steps, taking your initial set of data (just a blob column) and turning it into something a bit closer to what you need, then repeating. You might want to work this out yourself but if not, I've set out an example in this SQLFiddle.
I don't claim that example is the fastest or neatest (it isn't) but hopefully it'll show what I mean about thinking the way SQL wants you to think. The SQL engine behind that website is using SQL 2008, but the solution I give should work equally well on 2005. There are niftier possible ways if you get access to 2012 or later versions.
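As one illustration of those niftier 2012+ ways (a rough sketch, not the SQLFiddle example; it assumes the staging table has a row_id column preserving file order, which the question doesn't show, that Cnum.num is numeric, and it glosses over telling the trailer 'ABC$' records apart from header records):
with flagged as (
    select row_id,
           blob,
           -- row_id of the most recent 'ABC$' header row at or before this row
           max(case when blob like 'ABC$%' then row_id end)
               over (order by row_id rows unbounded preceding) as header_row_id
    from sdg_winn_blob
)
insert into SDG_CWF (GAME, SERIAL, WINNER, TYPE)
select convert(numeric, substring(d.blob, 28, 5)),
       convert(numeric, substring(d.blob, 8, 9)),
       case when substring(d.blob, 6, 2) = '10' then '3'
            when substring(d.blob, 6, 2) = '11' then '4'
            else substring(d.blob, 7, 1) end,
       case when substring(d.blob, 52, 2) = '10' then '3'
            when substring(d.blob, 52, 2) = '11' then '4'
            else substring(d.blob, 53, 1) end
from flagged d
join flagged h on h.row_id = d.header_row_id          -- the header row for this data row
join Cnum c on substring(h.blob, 23, 5) = c.num + 1   -- same check the IF was trying to do (implicit varchar-to-int conversion assumed)
where d.blob not like 'ABC$%';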

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual paging on the number of results.
And now I'm faced with a choice whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than they are written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
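For what it's worth, a sketch of that idea in T-SQL using a persisted computed column instead of a trigger (an alternative to the trigger approach above; the table and column names are illustrative):
-- hypothetical sketch: indexed computed column holding the lowercased first letter
ALTER TABLE dbo.my_table
    ADD name_first_char_lower AS LOWER(LEFT(name, 1)) PERSISTED;

CREATE INDEX IX_my_table_name_first_char_lower
    ON dbo.my_table (name_first_char_lower);

-- the paging query can then seek on the new index
SELECT *
FROM dbo.my_table
WHERE name_first_char_lower = @firstletter;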
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the left 5. As an aside my overall query does hit some indexes.
I would always suggest using the LIKE operator when the search column has an index. I tested the above pattern in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)='ABA' OR ... up to 9 OR clauses. The count returned 7,301,477 records in 4 seconds with LEFT, and in 1 second with LIKE, i.e. where column_name like 'AAA%' OR column_name like 'ABA%' OR ... up to 9 LIKE clauses.
Calling a function in the WHERE clause is not a best practice. Refer to http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.StartsWith(...) and you'll get just a LIKE in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework (non-Core) EntityFunctions, so I'm not sure how to do it in EF6.