Need to split column into rows and columns - tsql

I have a table like this:
ID cst
1 string1;3;string2;string3;34;string4;-1;string5;string6;12;string7;5;string8,string9, 65
2 string10;-3;string11;string12;56;string13;6;string14;string15;9
etc.
Now I want to split the cst column into 5 columns and multiple rows.
So like this:
ID C1 C2 C3 C4 C5
1 string1 3 string2 string3 34
1 string4 -1 string5 string6 12
1 string7 5 string8 string9 65
2 string10 -3 string11 string12 56
2 string13 6 string14 string15 9
etc.
How to accomplish this? I am on SQL-server 2017, so I can use the string_split function. The problem with this function is that it produces only one output column...
Preferably I would like yo create an UDF that outputs a table. The function would use these input parameters: the string, the separator character, the number of columns. So the function can be used dynamically with a varying number of columns.
ps. the strings can be of variable length of course.

Try it along this:
Hint: There are some "normal" commas in your sample data.
I suspected these as wrong and used semicolons.
If this is wrong, you might use a general REPLACE() to use ";" instead of ",".
Create a declared table to simulate your issue
DECLARE #tbl TABLE(ID INT, cst VARCHAR(1000));
INSERT INTO #tbl(ID,cst)
VALUES(1,'string1;3;string2;string3;34;string4;-1;string5;string6;12;string7;5;string8;string9; 65')
,(2,'string10;-3;string11;string12;56;string13;6;string14;string15;9');
--The query (for almost any version of SQL-Server, find v2017+ as UPDATE below)
WITH cte AS
(
SELECT t.ID
,B.Nr
,A.Casted.value('(/x[sql:column("B.Nr")]/text())[1]','varchar(max)') AS ValueAtPosition
,(B.Nr-1) % 5 AS Position
,(B.Nr-1)/5 AS GroupingKey
FROM #tbl t
CROSS APPLY(SELECT CAST('<x>' + REPLACE(t.cst,';','</x><x>') + '</x>' AS XML)) A(Casted)
CROSS APPLY(SELECT TOP(A.Casted.value('count(x)','int')) ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) FROM master..spt_values) B(Nr)
)
SELECT ID
,GroupingKey
,MAX(CASE WHEN Position=0 THEN ValueAtPosition END) AS C1
,MAX(CASE WHEN Position=1 THEN ValueAtPosition END) AS C2
,MAX(CASE WHEN Position=2 THEN ValueAtPosition END) AS C3
,MAX(CASE WHEN Position=3 THEN ValueAtPosition END) AS C4
,MAX(CASE WHEN Position=4 THEN ValueAtPosition END) AS C5
FROM cte
GROUP BY ID,GroupingKey
ORDER BY ID,GroupingKey;
The idea in short:
we use APPLY to add your string casted to XML to the result set. This will help to split the string ("a;b;c" => <x>a</x><x>b</x><x>c</x>)
We use another APPLY to create a tally on the fly with a computed TOP-clause. It will return as many virtual rows as there are elements in the XML
We use sql:column() to grab each element's value by its position and some simple maths to create a grouping key and a running number from 0 to 4 and so on.
We use GROUP BY together with MAX(CASE...) to place the values in the fitting column (old-fashioned pivot or conditional aggregation).
Hint: If you want this fully generically, with a number of columns not knwon in advance. You cannot use any kind of function or ad-hoc query. You would rather need some kind of dynamic statement creation together with EXEC within a stored procedure.
to be honest: This might be a case of XY-problem. Such approaches are the wrong idea - at least in almost all situations I can think of.
UPDATE for SQL-Server 2017+
You are on v2017, this allows for JSON, which is a bit faster in position safe string splitting. Try this:
SELECT t.ID
,A.*
FROM #tbl t
CROSS APPLY OPENJSON(CONCAT('["',REPLACE(t.cst,';','","'),'"]')) A
The general idea is the same. We transform a string to a JSON-array ("a,b,c" => ["a","b","c"]) and read it with APPLY OPENJSON().
You can perform the same maths at the "key" column and do the rest as above.
Just because it is ready here, this is the full query for v2017+
WITH cte AS
(
SELECT t.ID
,A.[key]+1 AS Nr
,A.[value] AS ValueAtPosition
,A.[key] % 5 AS Position
,A.[key]/5 AS GroupingKey
FROM #tbl t
CROSS APPLY OPENJSON(CONCAT('["',REPLACE(t.cst,';','","'),'"]')) A
)
SELECT ID
,GroupingKey
,MAX(CASE WHEN Position=0 THEN ValueAtPosition END) AS C1
,MAX(CASE WHEN Position=1 THEN ValueAtPosition END) AS C2
,MAX(CASE WHEN Position=2 THEN ValueAtPosition END) AS C3
,MAX(CASE WHEN Position=3 THEN ValueAtPosition END) AS C4
,MAX(CASE WHEN Position=4 THEN ValueAtPosition END) AS C5
FROM cte
GROUP BY ID,GroupingKey
ORDER BY ID,GroupingKey;

The easiest option here honestly might be the following steps:
Write out the current table to a CSV flat file, using semicolon as the separator (which is also the separator for the current cst column
Then load the CSV using SQL Server's bulk loading tool, again with semicolon as the column separator. This will yield a table with 16 columns, ID, and then C1 through and including C15.
Create a new table (ID, C1, C2, C3, C4, C5)
Then populate the above table using:
INSERT INTO newTable (ID, C1, C2, C3, C4, C5)
SELECT ID, C1, C2, C3, C4, C5 FROM loadedTable UNION ALL
SELECT ID, C6, C7, C8, C9, C10 FROM loadedTable UNION ALL
SELECT ID, C11, C12, C13, C14, C15 FROM loadedTable;
While the above suggestion might seem like a lot of work, SQL Server has poor support for regex and complex string splitting operations, especially on earlier versions. Working directly with your current table might be either not possible or more work than the above.

Related

Select top 5 and bottom 5 columns from a list of columns based on their values

I have a requirement where I need to Select top 5 and bottom 5 columns from a list of columns based on their values.
If more than 1 column has same value then select any one from them.
Eg
CREATE TABLE #b(Company VARCHAR(10),A1 INt,A2 INt,A3 INt,A4 INt,B1 INt,G1 INt,G2 INt,G3 INt,HH5 INt,SS6 INt)
INSERT INTo #b
SELECT 'test_A',8,10,6,10,0,6,0,6,13,4 UNION ALL
SELECT 'test_B',17,7,0,1,3,18,0,6,9,5 UNION ALL
SELECT 'test_C',0,0,6,1,2,6,3,4,3,2 UNION ALL
SELECt 'test_D',13,1,4,1,4,1,9,0,0,5
SELECT * FROM #b
Desired Output:
Company
Top5
Bottom5
test_A
HH5,A2,A1,A3,SS6
B1,SS6,A3,A1,A2
test_B
G1,A1,HH5,A2,G3
A3,A4,B1,SS6,G3
I am able to find the top values but not the column names.
Here is I am stuck at, I am able to find the max scores but not sure how to find the column that holds this max value.
SELECT Company,(
SELECT MAX(myval)
FROM (VALUES (A1),(A2),(A3),(A4),(B1),(G1),(G2),(G3),(HH5)) AS temp(myval))
AS MaxOfColumns
FROM #b
As Larnu suggested, the first step would be to UNPIVOT the data into a form like (Company, ColumnName, Value). You can then use the ROW_NUMBER() window function to assign ordinals 1 - 10 to each value for each company based on the sorted value.
Next, you can wrap the above in a Common Table Expression (CTE) to feed a query that, for each Company, uses conditional aggregation with the STRING_AGG() to selectively combine the top 5 and bottom 5 column names to produce the desired result.
Something like:
;WITH Data AS (
SELECT
Company,
ColumnName,
Value,
ROW_NUMBER() OVER(PARTITION BY Company ORDER BY Value DESC, ColumnName) AS Ord
FROM #b
UNPIVOT (
Value FOR ColumnName IN (A1, A2, A3, A4, B1, G1, G2, G3, HH5, SS6)
) U
)
SELECT
D.Company,
STRING_AGG(CASE WHEN D.Ord BETWEEN 1 AND 5 THEN D.ColumnName END, ', ')
WITHIN GROUP (ORDER BY D.ORD) AS Top5,
STRING_AGG(CASE WHEN D.Ord BETWEEN 6 AND 10 THEN D.ColumnName END, ', ')
WITHIN GROUP (ORDER BY D.ORD) AS Bottom5
FROM Data D
GROUP BY D.Company
ORDER BY D.Company
For older SQL Server versions that don't support STRING_AGG(), the FOR XML PATH(''),TYPE construct can be used to concatenate text. The .value('text()[1]', 'varchar(max)') function is then used to safely extract the result from the XML, and finally the STUFF() function is used to strip out the leading separator (comma-space).
;WITH Data AS (
SELECT
Company,
ColumnName,
Value,
ROW_NUMBER() OVER(PARTITION BY Company ORDER BY Value DESC, ColumnName) AS Ord
FROM #b
UNPIVOT (
Value FOR ColumnName IN (A1, A2, A3, A4, B1, G1, G2, G3, HH5, SS6)
) U
)
SELECT B.Company, C.Top5, C.Bottom5
FROM #b B
CROSS APPLY (
SELECT
STUFF((
SELECT ', ' + D.ColumnName
FROM Data D
WHERE D.Company = B.Company
AND D.Ord BETWEEN 1 AND 5
ORDER BY D.ORD
FOR XML PATH(''),TYPE
).value('text()[1]', 'varchar(max)'), 1, 2, '') AS Top5,
STUFF((
SELECT ', ' + D.ColumnName
FROM Data D
WHERE D.Company = B.Company
AND D.Ord BETWEEN 6 AND 10
ORDER BY D.ORD
FOR XML PATH(''),TYPE
).value('text()[1]', 'varchar(max)'), 1, 2, '') AS Bottom5
) C
ORDER BY B.Company
See this db<>fiddle fr a demo.
If you also want lists of the top 5 and bottom 5 values, you can repeat the aggregations above while substituting CONVERT(VARCHAR, D.Value) for D.ColumnName where appropriate.

SSRS Expression Split string in rows and column

I am working with SQL Server 2008 Report service. I have to try to split string values in different columns in same row in expression but I can't get the excepted output. I have provided input and output details. I have to split values by space (" ") and ("-").
Input :
Sample 1:
ASY-LOS,SLD,ME,A1,A5,J4A,J4B,J4O,J4P,J4S,J4T,J7,J10,J2A,J2,S2,S3,S3T,S3S,E2,E2F,E6,T6,8,SB1,E1S,OTH AS2-J4A,J4B,J4O,J4P,J4S,J4T,J7,J1O,J2A,S2,S3,J2,T6,T8,E2,E4,E6,SLD,SB1,OTH
Sample 2:
A1 A2 A3 A5 D2 D3 D6 E2 E4 E5 E6 EOW LH LL LOS OTH P8 PH PL PZ-1,2,T1,T2,T3 R2-C,E,A RH RL S1 S2-D S3
Output should be:
Thank you.
I wrote this before I saw your comment about having to do it in the report. If you can explain why you cannot do this in the dataset query then there may be a way around that.
Anyway, here's one way of doing this using SQL
DECLARE #t table (RowN int identity (1,1), sample varchar(500))
INSERT INTO #t (sample) SELECT 'ASY-LOS,SLD,ME,A1,A5,J4A,J4B,J4O,J4P,J4S,J4T,J7,J10,J2A,J2,S2,S3,S3T,S3S,E2,E2F,E6,T6,8,SB1,E1S,OTH AS2-J4A,J4B,J4O,J4P,J4S,J4T,J7,J1O,J2A,S2,S3,J2,T6,T8,E2,E4,E6,SLD,SB1,OTH'
INSERT INTO #t (sample) SELECT 'A1 A2 A3 A5 D2 D3 D6 E2 E4 E5 E6 EOW LH LL LOS OTH P8 PH PL PZ-1,2,T1,T2,T3 R2-C,E,A RH RL S1 S2-D S3'
drop table if exists #s1
SELECT RowN, sample, SampleIdx = idx, SampleValue = [Value]
into #s1
from #t t
CROSS APPLY
spring..fn_Split(sample, ' ') as x
drop table if exists #s2
SELECT
s1.*
, s2idx = Idx
, s2Value = [Value]
into #s2
FROM #s1 s1
CROSS APPLY spring..fn_Split(SampleValue, '-')
SELECT SampleKey = [1],
Output = [2] FROM #s2
PIVOT (
MAX(s2Value)
FOR s2Idx IN ([1],[2])
) p
This produced the following results
If you do not have a split function, here is the script to create the one I use
CREATE FUNCTION [dbo].[fn_Split]
/* Define I/O parameters WARNING! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE! */
(#pString VARCHAR(8000)
,#pDelimiter CHAR(1)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
/*"Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000: enough to cover VARCHAR(8000)*/
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)--10E+1 or 10 rows
,E2(N) AS (SELECT 1 FROM E1 a,E1 b)--10E+2 or 100 rows
,E4(N) AS (SELECT 1 FROM E2 a,E2 b)--10E+4 or 10,000 rows max
/* This provides the "base" CTE and limits the number of rows right up front
for both a performance gain and prevention of accidental "overruns" */
,cteTally(N) AS (
SELECT TOP (ISNULL(DATALENGTH(#pString), 0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4
)
/* This returns N+1 (starting position of each "element" just once for each delimiter) */
,cteStart(N1) AS (
SELECT 1 UNION ALL
SELECT t.N + 1 FROM cteTally t WHERE SUBSTRING(#pString, t.N, 1) = #pDelimiter
)
/* Return start and length (for use in SUBSTRING later) */
,cteLen(N1, L1) AS (
SELECT s.N1
,ISNULL(NULLIF(CHARINDEX(#pDelimiter, #pString, s.N1), 0) - s.N1, 8000)
FROM cteStart s
)
/* Do the actual split.
The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. */
SELECT
idx = ROW_NUMBER() OVER (ORDER BY l.N1)
,value = SUBSTRING(#pString, l.N1, l.L1)
FROM cteLen l

Lag Function to populate missing rows

I am trying to come up with a SQL to get to this data in a Postgres 9.6 database table.
Table Data
I have tried various variations of windows function but none of these seems to work,
Based on input column C3, I am projecting a fourth column C4 and the output should resemble as below.
Final Desired output
How can I accomplish this using SQL? The table can have up to 100 Million records.
I was able to get desired output using the following SQL
with t2 as (select c1, c3 from Test_table where c3 is not null)
update Test_table t1
set c3 = t2.c3
from t2
where t1.c1 <= t2.c1
and t1.c3 is null;
select
c1
,c2
,C3
,dense_rank() over(order by c3) cr
from Test_table
order by c1;
Use a window function to select the smallest c3 in a window ordered by c1 descending, but sort the whole output by c1 ascending:
select c1, c2, c3, min(c3) over (order by c1 desc) as c4 from t order by c1;
I ran the SQL provided by you and I get this output. I thought the picture described what I really wanted. Hopefully, showing your output from running your SQL and desired output may help.
SQL output from your query and desired output

how to "join" two tables in postgres

I have two tables contains almost the same things. But they got datas from different sources, and in perfect world they are identical. Practicaly - they differs. The goal is to find matching records and connect each other, and then the umatched records are the result.
first_table:
id1, date1, value1
second_table:
id2, date2, value2
I create third table "joiner":
id1,id2
And now use this spell:
INSERT INTO joiner (SELECT id1,id2 FROM first_table,second_table WHERE value1=value2 and date1=date2 ORDER BY date1,date2,id1,id2);
(sorting is important, because sometimes some packages are missed, so i have to add it later)
And everything would be great, but... sometimes there are more than one record with the same value and date, and there's no way to identify it. The accepted solution is to join first from first_table with first from second_table, and second from first_table with second from second_table ,etc.
And here comes the problem.
Because the joiner has the unique keys on each column - the insert raises unique_violation error, because the example result is:
id1|id2
-------
a1| b1
a1| b2
a2| b1
a2| b2
If I use SELECT distinct id1,id2 of course nothing changes (a1,b1)!=(a1,b2)
If I use SELECT distinct on (id1) id1,id2 - the result sometimes is:
id1|id2
-------
a1| b1
a2| b1
I tried to use WHERE NOT EXISTS (SELECT 1 FROM first_table f WHERE f.id1<>first_table.id1) AND NOT EXISTS (SELECT 1 FROM second_table s WHERE s.id2<>second_table.id2) - still nothing
I tried to add function with EXCEPTION, but this is also wrong - because it raises exception but joiner is still empty...
Any ideas?
update
I don't know why some people vote down for my question without any comment. Maybe because it is not enough clear - so especially for those example:
first_table:
id1, value1, date1
1,10, 2015-03-01
2,11, 2015-03-01
3,10, 2015-03-01
4,14, 2015-03-02
second_table:
id2, value2, date2
1,10, 2015-03-01
2,11, 2015-03-01
3,10, 2015-03-01
4,15, 2015-03-02
expected joiner
id1, id2
1,1
2,2
3,3
As you can see id1=4 and id2=4 doesn't have joiner - because value differs (auditor needs to manualy check and fix).
And there is a problem with id1=1 and id1=3 - are identical, so joiner without uniqness would looks like:
id1, id2
1,1
1,3
2,2
3,1
3,3
Which is wrong.
The solution to your problem is to use row_number() to enumerate the values for common date/value pairs in each table.
You query can be improved in other ways as well:
When using insert, always list the columns.
Learn to use proper explicit join syntax. Simple rule: never use comma in the from clause.
Use table aliases to specify where columns are coming from.
The query is:
INSERT INTO joiner(id1, id2)
SELECT id1, id2
FROM (select ft.*, row_number() over (partition by value1, date1 order by value1) as seqnum
from first_table ft
) ft JOIN
(select st.*, row_number() over (partition by value2, date2 order by value2) as seqnum
from second_table st
) st
ON ft.value1 = st.value2 and ft.date1 = st.date2 and ft.seqnum = st.seqnum
ORDER BY ft.date1, st.date2, ft.id1, st.id2;
I don't think the order by is important, but I'm leaving it in because you think it is relevant.

Find equal twin record postgresql

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I have a record with id 22 and I know it has a twin because I run this (simplified code):
SELECT min(co_id),co_name,count(*) FROM co
GROUP BY co_name
HAVING count(*) > 1
The result shows there are one twin (count 2) and I get the oldest id by min(co_id)
My question is how I search for the twin co_id? Just passing the oldest id?
Something like:
SELECT co_id FROM co
WHERE co_name EQUAL TO co_id='22'
LIMIT 2
Sample data:
id co_name
22 Volvo
23 Volvo
24 Ford
25 Ford
I know id 22 and I want to search for the twin 23 based on the content of 22.
The closest I found is this. Which is far from generic. And a nightmare for comparing 60 field:
SELECT id,
(SELECT max(b.id) from co b
WHERE a.co_name = b.co_name
LIMIT 1) as twin
FROM co a
WHERE id='22'
How do I do this in a more simple and generic way? I just want the twin record co_id.
Thank you in advance!
select max_co,co_name from (
select max(co_id) max_co,min(co_id) min_co,co_name from co
group by co_name having count(*)>1) where min_co=(your old co id as input);
You can join your table with itself:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON c1.co_name=c2.co_name
AND c1.id>c2.id
this will return all duplicated records (but not the original record with the lowest id). Or since you're using Postgresql you can use a window function:
SELECT *
FROM (
SELECT
id,
co_name,
row_number() OVER (PARTITION by co_name ORDER BY id) as row
FROM
co_name
) s
WHERE
row>1;
Please see an example here.
If you want to compare multiple columns, the JOIN solution would be more flexible. I don't know exactly how you want to compare your columns and how you exactly define "twin" rows, but you a query like this should help:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON (
c1.co_name=c2.co_name
OR c1.co_city=c2.co_city
OR c1.co_owner=c2.co_owner
OR ...
) AND c1.id>c2.id
if you just want duplicated records of id=22 then you can try with this:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON c1.co_name=c2.co_name
AND c1.id>c2.id
WHERE
c2.id=22
or if you just want a single twin, comparing 60 columns, you can try with this query:
SELECT MIN(ID) as Twin /* or MAX(ID), depending what you're after */
FROM
co_name c1 INNER JOIN co_name c2
ON (
c1.co_name=c2.co_name
OR c1.co_city=c2.co_city
OR c1.co_owner=c2.co_owner
OR ...
) AND c1.id>c2.id
WHERE
c2.id=22
I found one solution that is working on 60 columns if I use variables in stead of hardcode in the query. Thanks everybody for all input. Some of them were about the same track.
SELECT id,
(SELECT max(b.id) from co b
WHERE concat(a.co_name,etc) = concat(b.co_name,etc)
LIMIT 1) as twin
FROM co a
WHERE id='22'
Not the best one, but fetch one twin at a time. And it is far from generic. Thanks for pointing me in the right direction. A generic solution would be nicer.