T-SQL - CROSS APPLY to a PIVOT? (using pivot with a table-valued function)

I have a table-valued function, basically a split-type function, that returns up to 4 rows per string of data.
So I run:
select * from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
I get:
Seq Data
1 15
2 25
However, my end data needs to look like
15 25
so I do a pivot.
select [1],[2],[3],[4]
from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
pivot (max(data) for seq in ([1],[2],[3],[4]))
as pivottable
which works as expected:
1 2
--- ---
15 25
HOWEVER, that's great for one row. I now need to do it for several hundred records at once. My thought is to do a CROSS APPLY, but I'm not sure how to combine a CROSS APPLY and a PIVOT.
(yes, obviously the easy answer is to write a modified version that returns 4 columns, but that's not a great option for other reasons)
Any help greatly appreciated.
And the reason I'm doing this: the current query uses a scalar-valued version of SPLIT, called 12 times within the same SELECT against the same million rows (where the data string is 500+ bytes).
As far as I know, that requires scanning the same 500 bytes * 1,000,000 rows 12 times.

This is how you use CROSS APPLY. Assume table1 is your table and Line is the column in your table you want to split. The split call takes the same key argument ('a') as in your example:
SELECT *
FROM table1 AS a
CROSS APPLY dbo.split('a', a.Line) AS b
PIVOT (MAX(data) FOR seq IN ([1],[2],[3],[4])) AS p
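To make the row-by-row behaviour concrete, here is a minimal Python sketch of what CROSS APPLY followed by PIVOT computes. It is only an illustration: the `split` function below is a hypothetical stand-in whose parsing rules are inferred from the single sample in the question, and `table1` holds made-up rows. Because PIVOT groups by every column that is not named in the pivot clause, the columns of table1 act as the grouping key, so each source row produces one output row as long as table1 has a unique column.

```python
def split(key, line):
    """Hypothetical stand-in for dbo.split: return (seq, data) pairs for one key.

    Assumed format (inferred from the question's sample): records are separated
    by ';', the first field of a record is the sequence number, and the other
    fields are a one-letter key followed by the value.
    """
    rows = []
    for record in line.split(";"):
        seq, *items = record.split(",")
        for item in items:
            if item.startswith(key):
                rows.append((int(seq), item[len(key):]))
    return rows


def pivot_row(key, line, width=4):
    """Spread the (seq, data) pairs of one line across `width` columns."""
    cols = [None] * width
    for seq, data in split(key, line):
        if 1 <= seq <= width:
            current = cols[seq - 1]
            # MAX(data) in the PIVOT clause: keep the largest value per column.
            cols[seq - 1] = data if current is None else max(current, data)
    return cols


# Made-up rows standing in for table1 (id, Line).
table1 = [
    (1, "1,a15,b20,c40;2,a25,d30;3,e50"),
    (2, "1,a7;2,b9;3,a11"),
]

# One output row per source row: the id plus the four pivoted columns.
result = [(row_id, *pivot_row("a", line)) for row_id, line in table1]
for row in result:
    print(row)
```

Each line is parsed once per source row, which is the point of replacing the 12 scalar calls with a single table-valued call.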

Related

More efficient way to fuzzy match large datasets in SAS

I have a dataset with over 33 million records that includes a name field. I need to flag records where this name field value also appears in a second dataset that includes about 5 million records. For my purposes, a fuzzy match would be both acceptable and beneficial.
I wrote the following program to do this. It works, but it has been running for 4 days so far, so I'd like to find a more efficient way to write it.
proc sql noprint;
create table INDIV_MATCH as
select A.NAME, SPEDIS(A.NAME, B.NAME) as SPEDIS_VALUE, COMPGED(A.NAME,B.NAME) as COMPGED_SCORE
from DATASET1 A join DATASET2 B
on COMPGED(A.NAME, B.NAME) le 400 and SPEDIS(A.NAME, B.NAME) le 10
order by A.name;
quit;
Any help would be much appreciated!

TimescaleDB: Understanding the return values after creating hypertable and the creation of chunks after populating the hypertable

I have an existing table in my database named price (has 264 rows) and I converted it into a hypertable price_hypertable doing:
CREATE TABLE price_hypertable (LIKE price INCLUDING DEFAULTS INCLUDING CONSTRAINTS EXCLUDING INDEXES);
SELECT create_hypertable('price_hypertable', 'start');
and the output it gave me is as follows:
create_hypertable
-------------------------------
(4,public,price_hypertable,t)
(1 row)
The next thing I did was to populate the price_hypertable as follows:
insert into price_hypertable select * from price;
And I got the following output:
INSERT 0 264
Now, I wanted to check the chunks created, for which I did:
select public.show_chunks('price_hypertable');
and the output I got:
show_chunks
----------------------------------------
_timescaledb_internal._hyper_4_3_chunk
_timescaledb_internal._hyper_4_4_chunk
(2 rows)
When I do:
select * from _timescaledb_internal._hyper_4_3_chunk;
select * from _timescaledb_internal._hyper_4_4_chunk ;
I see that the 264 entries are split as follows:
_timescaledb_internal._hyper_4_3_chunk has 98 rows
_timescaledb_internal._hyper_4_4_chunk has 166 rows
I have a few questions about these steps and their outputs:
Can someone please explain what the values 4 and t represent in the output of
SELECT create_hypertable('price_hypertable', 'start');?
After populating the price_hypertable, the data was automatically split into chunks, but of different size. Why does this happen? Why wasn't the data just split in half (132 rows in each chunk instead of 98 and 166)?
Any help is appreciated. Thanks
For the first question, it is easier to see what they represent by executing create_hypertable as
SELECT * FROM create_hypertable('price_hypertable', 'start');
This gives something like:
 hypertable_id | schema_name | table_name       | created
---------------+-------------+------------------+---------
             4 | public      | price_hypertable | t
For the second question, TmTron already answered. The rows are sorted into buckets based on time, and the rows are not necessarily evenly spaced in time. Nothing automatically picks a suitable interval for each bucket.
You can find information about the return values in the API documentation on create_hypertable, which also discusses the parameter chunk_time_interval that can be used to set the chunk size.
Related to your second question:
When you don't specify the chunk_time_interval explicitly, the default is 7 days: see create-hypertable, Best Practices.
So the number of rows in each chunk depends on the distribution of your data (according to your start date-time column).
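A small Python sketch of the bucketing idea, assuming fixed 7-day intervals. The alignment of bucket boundaries to the Unix epoch and the hourly timestamps are assumptions made purely for illustration (the real price data is not shown in the question); the timestamps were chosen so that the counts happen to mirror the ones reported above.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=7)  # the default interval mentioned above
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed alignment origin


def chunk_index(ts):
    """Index of the fixed-width time bucket that a timestamp falls into."""
    return (ts - EPOCH) // CHUNK_INTERVAL


# Hypothetical data: 264 hourly rows starting 98 hours before a bucket boundary.
boundary = datetime(2020, 1, 2, tzinfo=timezone.utc)
start = boundary - timedelta(hours=98)
timestamps = [start + timedelta(hours=i) for i in range(264)]

# Count how many rows land in each bucket.
rows_per_chunk = Counter(chunk_index(ts) for ts in timestamps)
for index, count in sorted(rows_per_chunk.items()):
    print(index, count)
```

The rows are evenly spaced here, yet the chunks still differ in size, because the boundaries are fixed in time and do not depend on where the data starts or how many rows there are.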

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column in the table with a value containing a comma delimited string with up to 60 values into multiple rows with 1 value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible because Redshift has no CROSS APPLY or STRING_SPLIT functionality, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma-delimited string into multiple columns. Another query then unions everything to get the final output into a single column. But the typical run takes over 6 hours to process the same amount of data.
I wasn't expecting to run into this issue just knowing the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM while the Redshift server is based on the dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm still assuming I'm not understanding something, but I have no idea what I am missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but unless I start looking into cross joins, something I'd like to avoid (plus, when I tried to talk to the DBAs running the Redshift cluster about this option, their response was a flat "No, can't do that."), I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a numbers table, for which you can take any table as long as it has more rows than the numbers you need. You need to make sure the numbers are int by forcing the type. Using the function regexp_count on the column to be split in the ON clause to count the number of fields (number of delimiters + 1) generates a row per field.
Then you use the split_part function on the column, and use the number.num column to extract for each of the rows a different part of the text.
SELECT comma_delimited_string,
       numbers.num,
       REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN (
    select (row_number() over (order by 1))::int as num
    from mytable
    limit 15 -- max num of fields
) as numbers
ON numbers.num <= regexp_count(comma_delimited_string, ',') + 1
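The logic of the numbers-table join can be sketched in Python as follows. This is only an illustration of what the query computes, not Redshift code: `split_part` and `explode` are hypothetical helpers written for this example, and the input strings are made up.

```python
def split_part(text, delimiter, n):
    """Mimic SPLIT_PART: 1-based field lookup, empty string when out of range."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def explode(strings, max_fields=15):
    """Emit one (string, num, field) row per delimited field, like the join above."""
    numbers = range(1, max_fields + 1)      # the "numbers" derived table
    rows = []
    for text in strings:
        nfields = text.count(",") + 1       # REGEXP_COUNT(text, ',') + 1
        for num in numbers:
            if num <= nfields:              # the ON condition of the join
                rows.append((text, num, split_part(text, ",", num)))
    return rows


exploded = explode(["a,b,c", "x"])
for row in exploded:
    print(row)
```

One thing the sketch makes visible: any field beyond `max_fields` is silently dropped, so the limit in the numbers subquery has to be at least the largest number of values a string can contain (the question mentions up to 60).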

Postgres Crosstab Dynamic Number of Columns

In Postgres 9.4, I have a table like this:
id  extra_col  days  value
--  ---------  ----  -----
1   rev        0     4
1   rev        30    5
2   cost       60    6
I want this pivoted result:
id  extra_col  0   30  60
--  ---------  --  --  --
1   rev        4   5
2   cost               6
This is simple enough with a crosstab, but I want the following specifications:
The day column will be dynamic: sometimes increments of 1, 2, 3 (days), sometimes 0, 30, 60 days (accounting months), and sometimes 360, 720 (accounting years).
The range of days will be dynamic (e.g., 0..500 days versus 1..10 days).
The first two columns are static (id and extra_col).
The return type for all the dynamic columns will remain the same (in this example, integer).
Here are the solutions I've explored, none of which work for me for the following reasons:
Automatically creating pivot table column names in PostgreSQL - requires two trips to the database.
Using crosstab_hash - is not dynamic
From all the solutions I've explored, it seems the only one that allows this to occur in one trip to the database requires that the same query be run three times. Is there a way to store the query as a CTE within the crosstab function?
SELECT *
FROM
CROSSTAB(
--QUERY--,
$$--RUN QUERY AGAIN TO GET NUMBER OF COLUMNS--$$
)
as ct (
--RUN QUERY AGAIN AND CREATE STRING OF COLUMNS WITH TYPE--
)
Every solution based on built-in functionality needs to know the number of output columns in advance, because the PostgreSQL planner requires it. There is a workaround based on cursors; it is the only way to get a truly dynamic result from Postgres.
The example is relatively long and hard to read (SQL really doesn't support cross-tabulation), so I will not reproduce the code from the blog here: http://okbob.blogspot.cz/2008/08/using-cursors-for-generating-cross.html
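To illustrate why the column list has to exist before the final query is planned, here is a minimal Python sketch of the client-side step in the two-trip approach mentioned in the question: it builds the column definition list for the `as ct (...)` clause from the distinct day values returned by a first query. The function name and the types assumed for id and extra_col are hypothetical, and this does not remove the extra round trip the question wants to avoid.

```python
def crosstab_column_list(days, value_type="integer"):
    """Build the column definition list for crosstab's AS ct (...) clause.

    `days` would come from a first query such as SELECT DISTINCT days FROM ...;
    the types of id and extra_col below are assumptions for this example.
    """
    fixed = ["id integer", "extra_col text"]
    # Numeric column names must be double-quoted to be valid identifiers.
    dynamic = [f'"{day}" {value_type}' for day in sorted(set(days))]
    return ", ".join(fixed + dynamic)


column_list = crosstab_column_list([0, 30, 60, 30])
print(column_list)
```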

Why does this Oracle 10g SQL run slow only when I query a subquery with a where clause?

I can't paste in the entire SQL for various reasons, so consider this example:
select *
from
(select nvl(get_quantity(1), 10) available_qty
from dual)
where available_qty > 30;
get_quantity is a function that makes a calculation based on the ID of a record that's passed through it. If it returns null, I use nvl() to force it to 10.
The query runs very slow when I use the WHERE clause in the parent query. When I comment out the WHERE clause, however, it runs very fast. What I don't get is why it can display the data very fast, but it can't query it just as fast. I am querying the results of a subquery, too. I was under the impression that subqueries return a "rendered" dataset. It's almost as if querying the available_qty identifier is causing it to reference something within the subquery.
This is why I don't think the contents of the get_quantity function are relevant here, so I didn't bother posting it. Instead, I think it's a misunderstanding on my part of how Oracle handles subqueries and whatnot.
Do any of you Oracle gurus have any idea what I am doing wrong?
Afterthought: as I was entering tags for this question, the tag "correlated subquery" came up. In doing some quick research, it seems that this type of subquery somewhat depends on the outer query. Could this be related to my problem?
Let's try an experiment. First we'll run the following query:
select lvl, rnd
from (select level as lvl from dual connect by level <= 5) a,
(select dbms_random.value() rnd from dual) b;
The "a" subquery will return 5 rows with values from 1 to 5. The "b" subquery will return one row with a random value. If the function were run before the two subqueries are joined (as a Cartesian product), the same random value would be returned for each row. The actual results:
LVL RND
---------- ----------
1 .417932089
2 .963531718
3 .617016889
4 .128395638
5 .069405568
5 rows selected.
Clearly the function was run for each of the joined rows, not for the subquery before the join. This is a result of Oracle's optimizer deciding that the best path for the query is to do things in that order. To prevent this, we have to add something to the second subquery that will make Oracle run the subquery in its entirety before performing the join. We'll add rownum to the subquery, since Oracle knows rownum will change if it's run after the join. The following query demonstrates this:
select lvl, rnd
from (select level as lvl from dual connect by level <= 5) a,
(select dbms_random.value() rnd, rownum from dual) b;
As you can see from the results, the function was only run once in this case:
LVL RND
---------- ----------
1 .028513902
2 .028513902
3 .028513902
4 .028513902
5 .028513902
5 rows selected.
In your case, it seems likely that the filter provided by the where clause is making the optimizer take a different path, where it's running the function repeatedly, rather than once. By making Oracle run the subquery as written, you should get more consistent run-times.
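The difference between the two execution orders can be mimicked in a few lines of Python. This is only an analogy for the behaviour shown above, not a model of Oracle's optimizer; the class and function names are made up for this sketch.

```python
import random


class CountingFunction:
    """Stand-in for an expensive function; counts how often it is called."""

    def __init__(self, seed=0):
        self.calls = 0
        self._rng = random.Random(seed)

    def __call__(self):
        self.calls += 1
        return self._rng.random()


def join_merged(levels, fn):
    """Subquery merged into the join: the function runs once per joined row."""
    return [(lvl, fn()) for lvl in levels]


def join_materialized(levels, fn):
    """Subquery run in full first (the rownum trick): the function runs once."""
    rnd = fn()
    return [(lvl, rnd) for lvl in levels]


levels = range(1, 6)

per_row_fn = CountingFunction()
merged = join_merged(levels, per_row_fn)

once_fn = CountingFunction()
materialized = join_materialized(levels, once_fn)

print(per_row_fn.calls, once_fn.calls)
```

If the function is slow, the call count is what dominates the run time, which is why a change of plan triggered by the WHERE clause can make such a large difference.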