Convert T-SQL Cross Apply to Redshift - amazon-redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column in the table with a value containing a comma delimited string with up to 60 values into multiple rows with 1 value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL this processes 10 million records in just under 1 hour which is fine for my purposes. Obviously a direct conversation to Redshift isn't possible due to Redshift not having a Cross Apply or String Split functionality so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows) which utilizes split_part() to split the comma delimited string into multiple columns. Then another query that unions everything to get the final output into a single column. But the typical run takes over 6 hours to process the same amount of data.
I wasn't expecting to run into this issue just knowing the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM while the Redshift server is based on the dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift so I'm still assuming I'm not understanding something. But I have no idea what am I missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an adim so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations but unless I start looking into Cross Joins, something I'd like to avoid (Plus when I tried to talk to the DBA's running the Redshift cluster about this option their response was a flat "No, can't do that.") I'm not even sure where to go at this point so any help would be much appreciated!
Thanks!

I've found a solution that works for me.
You need to do a JOIN on a number table, for which you can take any table as long as it has more rows that the numbers you need. You need to make sure the numbers are int by forcing the type. Using the funcion regexp_count on the column to be split for the ON statement to count the number of fields (delimiter +1), will generate a table with a row per repetition.
Then you use the split_part function on the column, and use the number.num column to extract for each of the rows a different part of the text.
SELECT comma_delimited_string, numbers.num, REGEXP_COUNT(comma_delimited_string , ',')+1 AS nfields, SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN
(
select
(row_number() over (order by 1))::int as num
from
mytable
limit 15 --max num of fields
) as numbers
ON numbers.num <= regexp_count(comma_delimited_string , ',') + 1

Related

PostgreSQL - 100 million records transfer from archive to a new table

I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Table A and B are archive tables without any indexes. (Millions of records)
Table X and Y are a replica of A and B with good indexes. (Some thousands of records)
Below is the code for my project.
with data as
(
SELECT a.*, b.* FROM A_archive a
join B_archive b where a.transaction_id = b.transaction_id
UNION
SELECT x.*, y.* FROM X x
join Y y where x.transaction_id = y.transaction_id
)
INSERT INTO
Another_Table
(
columns
)
select * from data
On Conflict(transaction_id)
do udpate ...
The above whole thing is running in production environment and has nearly 140 million records.
Due to this production database is taking almost 10 hours to process the data and failing.
I am also having a distributed job scheduler in AWS to schedule this query inside a function and retrieve the latest records every 5 hours. The archive tables store closed invoice data. Pega UI will be using this table for retrieving data about closed invoices and showing to the customer.
Please suggest something that is a bit more performant.
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication. It will save the s**tton of data shuffling and comparisons required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
That UNION must be costly, it pretty much builds a temp-table in the background containing all the A-B + X-Y records, sorts it (over all fields) and then removes any doubles. If you say 100 million records are involved then that's a LOT of sorting going on that most likely will involve swapping out to disk.
Keep in mind that you only need to do this if there are expected duplicates
in the result from the JOIN between A and B
in the result from the JOIN between X and Y
in the combined result from the two above
IF neither of those are expected, just use UNION ALL
In fact, in that case, why not have 1 INSERT operation for A-B and another one for X-Y? Going by the description I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
Also, as mentioned by O.Jones, archive tables or not, they should come at least with a (preferably clustered) index on the transaction_id fields you're JOINing on. (same for the Another_Table btw)
All that said, processing 100M records in 1 transaction IS going to take some time, it's just a lot of data that's being moved around. But 10h does sound excessive indeed.

How to avoid skewing in redshift for Big Tables?

I wanted to load the table which is having a table size of more than 1 TB size from S3 to Redshift.
I cannot use DISTSTYLE as ALL because it is a big table.
I cannot use DISTSTYLE as EVEN because I want to use this table in joins which are making performance issue.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our redshift cluster has 20 nodes.
So, I tried distribution key on a workday but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How to avoid the skew in such cases?
How we avoid skewing of the table in case of an uneven number of row counts for the unique key (let's say hour1 have 1million rows, hour2 have 1.5million rows, hour3 have 2million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either SORTKEY or COMPOUND SORTKEY. Sort Key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised choose a column that causes the least amount of skew as the DISTKEY. A column which has many distinct values, such as timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types, or days of week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Here is the architecture that I recommend
1) load to a staging table with dist even and sort by something that is sorted on your loaded s3 data - this means you will not have to vacuum the staging table
2) set up a production table with the sort / dist you need for your queries. after each copy from s3, load that new data into the production table and vacuum.
3) you may wish to have 2 mirror production tables and flip flop between them using a late binding view.
its a bit complex to do this you need may need some professional help. There may be specifics to your use case.
As of writing this(Just after Re-invent 2018), Redshift has Automatic Distribution available, which is a good starter.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in Answers POSTED earlier try a few combinations by replicating the same table with different DIST keys ,if you don't like what Automatic DIST is doing. After the tables are created run the admin utility from the git repos (preferably create a view on the SQL script in the Redshift DB).
Also, if you have good clarity on query usage pattern then you can use the following queries to check how well the sort key are performing using the below SQLs.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
of course , there is always room for improvement in the SQLs above, depending on specific stats that you may want to look at or drill down to.
Hope this helps.

Most efficient way to DECODE multiple columns -- DB2

I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns
Currently, the database has a number of tables most of which have a significant number of their columns as numbers, these numbers correspond to a table with the real values. We are talking 9,500 different values (e.g '502=yes' or '1413= Graduate Student')
In any situation, I would just do WHERE clause and show where they are equal, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved, 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine.......but I manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this AND some need DECODE statements that would be 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible to do for you, so I'll leave that part.
As mentioned in my comment I would stay away from using DECODE as described in your post. I would start by doing it as usual joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (may help the optimizer creating a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license admits you can experiment with Materialized Query Tables (MQT). Note that there is a penalty for modifications of the base tables, so if you have more of a OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option if your lookup table is fairly static is to cache the lookup table in the application. Read the TEST_TABLE from the database, and lookup descriptions in the application. Further improvements may be to add triggers that invalidate the cache when lookup table is modified.
If you don't want to do all these joins you could create yourself an own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs this will return the text value corrseponding to an id .. if not found NULL.
With your mentioned 10k id/text pairs and an index on the ID field this shouldn't be a performance issue as such data amount should be easily be cached in the corresponding bufferpool.

Why UNION SELECT gives you the number of columns in SQLI

I am trying to figure out why one can figure out the number of columns from using 'union SELECT'
For example, if you have a webpage http://www.vulnerable-site.com/index.php?firstArg=1
I learned that you can put
http://www.vulnerable-site.com/index.php?firstArg=1 union SELECT 1,2,3,4 --
to find out the number of columns in the table. Basically, you keep adding numbers to till you stop getting errors. Why is that?
Can anyone please help me with this basic question.
Thanks
I came to know that the number of columns on both sides of the UNION statement must match. So, this command can be used to find out the number of columns in a table.
UNION keyword can be used to retrieve data from other tables within the database. This results in an SQL injection UNION attack. so Determining the number of columns required in an SQL injection UNION attack. there are two effective methods to determine how many columns are being returned from the original query. "ORDER BY" and "UNION SELECT"

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, blJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When i try to execute the query it results in the following error.
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way is to truncate (substring())the "tblJobAnnouncement.JobClassDesc" in the query which has column size of around 8000.
Do we have any work around so that i need not truncate the values. Or Can this query be optimised? Any setting in SQL Server 2000?
The [non obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan "fist" produces the list of qualifying rows (records) (the list of records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), only retaining the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan but the reliance on an index implicitly implies the "sortability" of the underlying columns. In other cases yet, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability of comparing two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to assert that, without DISTINCT, the query completes OK (if only including some duplicates). Once this fact is confirmed, and if effectively the query could produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
An unrelated hint, is to use table aliases, using a short string to avoid repeating these long table names. For example (only did a few tables, but you get the idea...)
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0,[dbo.TableName])
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB, --IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000