I want to filter a column of ROW type and aggregate the rows when they hold complementary information.
My data looks like this:
|col1|rowcol |
|----|--------------------------------|
|1 |{col1=2, col2=null, col3=4} |
|1 |{col1=null, col2=3, col3=null} |
|2 |{col1=7, col2=8, col3=null} |
|2 |{col1=null, col2=null, col3=56} |
|3 |{col1=1, col2=3, col3=7} |
Here is some code you can use to get a working example:
select col1, cast(rowcol as row(col1 integer, col2 integer, col3 integer))
from (
values
(1, row(2,null,4)),
(1, row(null,3,null)),
(2, row(7,8,null)),
(2, row(null,null,56)),
(3, row(1,3,7))
)
AS x (col1, rowcol)
I am expecting the following result:
|col1|rowcol |
|----|-------------------------------|
|1 |{col1=2, col2=3, col3=4} |
|2 |{col1=7, col2=8, col3=56} |
|3 |{col1=1, col2=3, col3=7} |
Maybe someone can help me...
Thanks in advance
You need to group the rows by col1 and merge the non-null values, for example using max:
-- sample data
WITH dataset (col1, rowcol) AS (
VALUES
(1, row(2,null,4)),
(1, row(null,3,null)),
(2, row(7,8,null)),
(2, row(null,null,56)),
(3, row(1,3,7))
)
--query
select col1,
cast(row(max(r.col1), max(r.col2), max(r.col3)) as row(col1 integer, col2 integer, col3 integer)) rowcol
from (
select col1,
cast(rowcol as row(col1 integer, col2 integer, col3 integer)) r
from dataset
)
group by col1
order by col1 -- for ordered output
Output:
|col1|rowcol                    |
|----|--------------------------|
|1   |{col1=2, col2=3, col3=4}  |
|2   |{col1=7, col2=8, col3=56} |
|3   |{col1=1, col2=3, col3=7}  |
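Any aggregate that skips NULLs will do the merging here. On Trino/Presto, for example, arbitrary() returns an arbitrary non-null value per group, so it can replace max() when the fields are not meaningfully ordered. A sketch using the same dataset CTE, assuming (as in the question) at most one non-null value per field and group:
SELECT col1,
       CAST(ROW(arbitrary(r.col1), arbitrary(r.col2), arbitrary(r.col3))
            AS ROW(col1 integer, col2 integer, col3 integer)) AS rowcol
FROM (
    SELECT col1,
           CAST(rowcol AS ROW(col1 integer, col2 integer, col3 integer)) AS r
    FROM dataset
)
GROUP BY col1
ORDER BY col1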
I have a data frame that looks something along the lines of:
+-----+-----+------+-----+
|col1 |col2 |col3 |col4 |
+-----+-----+------+-----+
|1.1 |2.3 |10.0 |1 |
|2.2 |1.5 |5.0 |1 |
|3.3 |1.3 |1.5 |1 |
|4.4 |0.5 |7.0 |1 |
|5.5 |1.2 |8.1 |2 |
|6.6 |2.3 |8.2 |2 |
|7.7 |4.5 |10.3 |2 |
+-----+-----+------+-----+
I would like to subtract the previous row from each row, but only if they have the same entry in col4, so row 2 - row 1 and row 3 - row 2, but not row 5 - row 4. Also, col4 should not be changed, so the result would be:
+-----+-----+------+------+
|col1 |col2 |col3 |col4 |
+-----+-----+------+------+
|1.1 |-0.8 |-5.0 |1 |
|1.1 |-0.2 |-3.5 |1 |
|1.1 |-0.8 |5.5 |1 |
|1.1 |1.1 |0.1 |2 |
|1.1 |2.2 |2.1 |2 |
+-----+-----+------+------+
This sounds like it should be simple, but I can't seem to figure it out.
You can accomplish this using Spark SQL, i.e. by creating a temporary view from your DataFrame and applying the SQL below. It uses the window function LAG to subtract the previous row's value, ordered by col1 and partitioned by col4. The first row in each col4 partition is identified using ROW_NUMBER and filtered out.
df.createOrReplaceTempView('my_temp_view')
results = sparkSession.sql('<insert sql below here>')
SELECT
col1,
col2,
col3,
col4
FROM (
SELECT
(col1 - (LAG(col1,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col1,
(col2 - (LAG(col2,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col2,
(col3 - (LAG(col3,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col3,
col4,
ROW_NUMBER() OVER (PARTITION BY col4 ORDER BY col1) rn
FROM
my_temp_view
) t
WHERE rn <> 1
db-fiddle
Here is just the idea: a self-join based on an RDD with zipWithIndex, then back to a DataFrame. There is some overhead that you can tailor; z is your col4.
At scale I am not sure what performance the Catalyst Optimizer will deliver; I looked at .explain(true) and am not entirely convinced, but I find the output hard to interpret at times. Ordering of the data is guaranteed.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,IntegerType, ArrayType, LongType}
val df = sc.parallelize(Seq( (1.0, 2.0, 1), (0.0, -1.0, 1), (3.0, 4.0, 1), (6.0, -2.3, 4))).toDF("x", "y", "z")
val newSchema = StructType(df.schema.fields ++ Array(StructField("rowid", LongType, false)))
val rddWithId = df.rdd.zipWithIndex
val dfZippedWithId = spark.createDataFrame(rddWithId.map{ case (row, index) => Row.fromSeq(row.toSeq ++ Array(index))}, newSchema)
dfZippedWithId.show(false)
dfZippedWithId.printSchema()
val res = dfZippedWithId.as("dfZ1").join(dfZippedWithId.as("dfZ2"),
    $"dfZ1.z" === $"dfZ2.z" && $"dfZ1.rowid" === $"dfZ2.rowid" - 1,
    "inner")
  .withColumn("newx", $"dfZ2.x" - $"dfZ1.x") //.explain(true)
res.show(false)
This returns the input with the added rowid:
+---+----+---+-----+
|x |y |z |rowid|
+---+----+---+-----+
|1.0|2.0 |1 |0 |
|0.0|-1.0|1 |1 |
|3.0|4.0 |1 |2 |
|6.0|-2.3|4 |3 |
+---+----+---+-----+
and the result which you can tailor by selecting and adding extra calculations:
+---+----+---+-----+---+----+---+-----+----+
|x |y |z |rowid|x |y |z |rowid|newx|
+---+----+---+-----+---+----+---+-----+----+
|1.0|2.0 |1 |0 |0.0|-1.0|1 |1 |-1.0|
|0.0|-1.0|1 |1 |3.0|4.0 |1 |2 |3.0 |
+---+----+---+-----+---+----+---+-----+----+
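Once the rowid column exists, a window function over it is an alternative to the self-join. A Spark SQL sketch, assuming the zipped DataFrame has been registered as a temporary view, e.g. dfZippedWithId.createOrReplaceTempView("zipped") (the view name is made up here):
SELECT newx, newy, z
FROM (
  SELECT x - LAG(x) OVER (PARTITION BY z ORDER BY rowid) AS newx,
         y - LAG(y) OVER (PARTITION BY z ORDER BY rowid) AS newy,
         z,
         ROW_NUMBER() OVER (PARTITION BY z ORDER BY rowid) AS rn
  FROM zipped
) AS t
WHERE rn > 1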
I would like to do a "filldown" type operation on a DataFrame, in order to remove nulls and make sure the last row is a kind of summary row, containing the last known values for each column based on the timestamp, grouped by itemId. As I'm using Azure Synapse Notebooks the language can be Scala, PySpark, Spark SQL or even C#. However, the real solution has up to millions of rows and hundreds of columns, so I need a dynamic solution that can take advantage of Spark. We can provision a big cluster, so how do we make sure we take good advantage of it?
Sample data:
// Assign sample data to dataframe
val df = Seq(
( 1, "10/01/2021", 1, "abc", null ),
( 2, "11/01/2021", 1, null, "bbb" ),
( 3, "12/01/2021", 1, "ccc", null ),
( 4, "13/01/2021", 1, null, "ddd" ),
( 5, "10/01/2021", 2, "eee", "fff" ),
( 6, "11/01/2021", 2, null, null ),
( 7, "12/01/2021", 2, null, null )
).
toDF("eventId", "timestamp", "itemId", "attrib1", "attrib2")
df.show
Expected results with rows 4 and 7 as summary rows:
+-------+----------+------+-------+-------+
|eventId| timestamp|itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
| 1|10/01/2021| 1| abc| null|
| 2|11/01/2021| 1| abc| bbb|
| 3|12/01/2021| 1| ccc| bbb|
| 4|13/01/2021| 1| ccc| ddd|
| 5|10/01/2021| 2| eee| fff|
| 6|11/01/2021| 2| eee| fff|
| 7|12/01/2021| 2| eee| fff|
+-------+----------+------+-------+-------+
I have reviewed this option but had trouble adapting it for my use case.
Spark / Scala: forward fill with last observation
I have a more or less working Spark SQL solution, but it will be very verbose for the large number of columns; I'm hoping for something easier to maintain:
%%sql
WITH cte AS (
SELECT
eventId,
itemId,
ROW_NUMBER() OVER( PARTITION BY itemId ORDER BY timestamp ) AS rn,
attrib1,
attrib2
FROM df
)
SELECT
eventId,
itemId,
CASE rn WHEN 1 THEN attrib1
ELSE COALESCE( attrib1, LAST_VALUE(attrib1, true) OVER( PARTITION BY itemId ) )
END AS attrib1_xlast,
CASE rn WHEN 1 THEN attrib2
ELSE COALESCE( attrib2, LAST_VALUE(attrib2, true) OVER( PARTITION BY itemId ) )
END AS attrib2_xlast
FROM cte
ORDER BY eventId
For many columns you can build the list of expressions programmatically, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, last}

val window = Window.partitionBy($"itemId").orderBy($"timestamp")
// Instead of selecting columns one by one, build the list of column expressions:
// each column is coalesced with the last non-null value seen so far in the window.
val expr = df.columns
  .map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))
df.select(expr: _*).show(false)
Update: to forward-fill only the attrib columns and keep the remaining columns unchanged:
val mainColumns = df.columns.filterNot(_.startsWith("attrib"))
val aggColumns = df.columns.diff(mainColumns).map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))
df.select(( mainColumns.map(col) ++ aggColumns): _*).show(false)
Result:
+-------+----------+------+-------+-------+
|eventId|timestamp |itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
|1 |10/01/2021|1 |abc |null |
|2 |11/01/2021|1 |abc |bbb |
|3 |12/01/2021|1 |ccc |bbb |
|4 |13/01/2021|1 |ccc |ddd |
|5 |10/01/2021|2 |eee |fff |
|6 |11/01/2021|2 |eee |fff |
|7 |12/01/2021|2 |eee |fff |
+-------+----------+------+-------+-------+
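If you would rather stay in Spark SQL, the coalesce(col(c), last(col(c), true).over(window)) expression above translates to one expression per column. A sketch, assuming df is exposed as a view as in the question (it still has to be generated for hundreds of columns):
SELECT
    eventId,
    timestamp,
    itemId,
    COALESCE(attrib1, LAST(attrib1, true) OVER (PARTITION BY itemId ORDER BY timestamp)) AS attrib1,
    COALESCE(attrib2, LAST(attrib2, true) OVER (PARTITION BY itemId ORDER BY timestamp)) AS attrib2
FROM df
ORDER BY eventId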
I have a table with a table_id column and two other columns. I want a kind of numbering using the row_number function, and I want the result to look like this:
id |col1 |col2 |what I want
------------------------------
1 |x |a |1
2 |x |b |2
3 |x |a |3
4 |x |a |3
5 |x |c |4
6 |x |c |4
7 |x |c |4
Please consider that:
there's only one x, so "partition by col1" is OK. Other than that,
there are two sequences of a's, and they must be counted separately
(not 1,2,1,1,3,3,3), and sorting must be by id, not by col2 (so
order by col2 is NOT OK).
I want the number to increase by one any time col2 changes compared to the previous line.
row_number() over (partition by col1 order by col2) DOESN'T WORK, because I want it ordered by id.
Using LAG and a windowed COUNT appears to get you what you are after: LAG carries the previous row's col2 (defaulting to the current value on the first row), and the running COUNT only counts the rows where col2 differs from that previous value, since the CASE returns NULL for unchanged rows and COUNT ignores NULLs:
WITH Previous AS(
SELECT V.id,
V.col1,
V.col2,
V.[What I want],
LAG(V.Col2,1,V.Col2) OVER (ORDER BY ID ASC) AS PrevCol2
FROM (VALUES(1,'x','a',1),
(2,'x','b',2),
(3,'x','a',3),
(4,'x','a',3),
(5,'x','c',4),
(6,'x','c',4),
(7,'x','c',4))V(id, col1, col2, [What I want]))
SELECT P.id,
P.col1,
P.col2,
P.[What I want],
COUNT(CASE P.Col2 WHEN P.PrevCol2 THEN NULL ELSE 1 END) OVER (ORDER BY P.ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) +1 AS [What you get]
FROM Previous P;
DB<>Fiddle
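For reference, the same pattern against a real table rather than the inline VALUES would look like this (a sketch; the table and column names dbo.MyTable(id, col1, col2) are made up here):
WITH Previous AS (
    SELECT T.id,
           T.col1,
           T.col2,
           LAG(T.col2, 1, T.col2) OVER (ORDER BY T.id ASC) AS PrevCol2
    FROM dbo.MyTable AS T
)
SELECT P.id,
       P.col1,
       P.col2,
       COUNT(CASE P.col2 WHEN P.PrevCol2 THEN NULL ELSE 1 END)
           OVER (ORDER BY P.id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) + 1 AS GroupNumber
FROM Previous AS P;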
I have a SQL table with transactions, some of which are in a parent-child relationship. But the relationship is only determined by a line and a type column; there is no other reference. I have to build this reference.
DECLARE @tmp TABLE (line INT, type CHAR(1), product VARCHAR(30))
INSERT @tmp VALUES
( 1,' ','22411')
,( 2,' ','22413')
,( 3,'P','27050')
,( 4,'C','22492')
,( 5,'C','22493')
,( 6,'C','22490')
,( 7,' ','22410')
,( 8,' ','22511')
,( 9,'P','27051')
,(10,'C','22470')
,(11,'C','22471')
,(12,'C','22473')
,(13,'C','22474')
,(14,' ','22015')
,(15,' ','22167')
,(16,' ','12411')
,(17,' ','22500')
Line 3 is a parent product. Lines 4 to 6 are the child rows.
Line 9 is another parent product. Lines 10 to 13 are the child rows.
Desired output is something like this, where the parent-child lines are grouped:
line|type|product|group
1 | |22411 |1
2 | |22413 |2
3 |P |27050 |3
4 |C |22492 |3
5 |C |22493 |3
6 |C |22490 |3
7 | |22410 |7
8 | |22511 |8
9 |P |27051 |9
10 |C |22470 |9
11 |C |22471 |9
12 |C |22473 |9
13 |C |22474 |9
14 | |22015 |14
15 | |22167 |15
16 | |12411 |16
17 | |22500 |17
How to achieve this without a cursor?
If the input is always valid, you can use the query below.
SELECT *, RANK() OVER(ORDER BY gid) AS [group]
FROM
(
SELECT *, SUM(CASE WHEN type = 'C' AND (prev = 'P' OR prev = 'C') THEN 0 ELSE 1 END) OVER(ORDER BY line) AS gid
FROM
(
SELECT *, LAG(type) OVER(ORDER BY line) AS prev
FROM @tmp
) AS withPreviouLine
) AS grouped
It can't handle consecutive 'C' rows without a preceding 'P'. LAG was added in SQL Server 2012.
This will work on almost all versions of SQL Server, since you didn't specify which version you're using.
But if you're using at least SQL Server 2012, then you can use @qxg's solution, since it's simpler.
Here's the code you need; it is not a single query, but it gives you the result you want:
CREATE TABLE #tmp (line INT, type CHAR(1), product VARCHAR(30))
INSERT #tmp VALUES ( 1,' ','22411')
,( 2,' ','22413')
,( 3,'P','27050')
,( 4,'C','22492')
,( 5,'C','22493')
,( 6,'C','22490')
,( 7,' ','22410')
,( 8,' ','22511')
,( 9,'P','27051')
,(10,'C','22470')
,(11,'C','22471')
,(12,'C','22473')
,(13,'C','22474')
,(14,' ','22015')
,(15,' ','22167')
,(16,' ','12411')
,(17,' ','22500')
select t1.*
, case
when t1.type = 'C'
and t3.type = ''
then 'G'
when t1.type = 'P' or t1.type = 'C'
then 'G'
else 'N'
end [same_group]
into #tmp2
from #tmp t1
left join #tmp t2 on t1.line = t2.line + 1
left join #tmp t3 on t1.line = t3.line - 1
order by t1.line
select *
, case
when t.type <> ''
then (select max(line)
from #tmp2
where same_group = 'G'
and type = 'P'
and line <= t.line)
else t.line
end [group_id]
from #tmp2 t
order by line
You probably could refactor it to be a single query, but I don't have the time to do so at the moment.
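For what it's worth, here is one possible single-query refactor (only a sketch; like the first answer, it assumes the input is valid, i.e. every 'C' row belongs to the closest preceding 'P' row):
SELECT t.line,
       t.type,
       t.product,
       CASE WHEN t.type = 'C'
            THEN (SELECT MAX(p.line) FROM #tmp p WHERE p.type = 'P' AND p.line < t.line)
            ELSE t.line
       END AS [group]
FROM #tmp t
ORDER BY t.line;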
I need to generate a sequence starting from a CSV string and a maximum count.
When the sequence is exhausted, I need to start it again and continue until the COUNT variable is saturated.
I have the following CSV:
A,B,C,D
In order to get 4 rows out of this CSV I am using XML and the following statement:
SET @xml_csv = N'<root><r>' + replace('A, B, C, D',',','</r><r>') + '</r></root>'
SELECT
REPLACE(t.value('.','varchar(max)'), ' ', '') AS [delimited items]
FROM
@xml_csv.nodes('//root/r') AS a(t)
Now my SELECT returns the following output:
|-------------|
| A |
| B |
| C |
| D |
Assuming I have a @count variable set to 9, I need to output the following:
|--|-----------|
|1 |A |
|2 |B |
|3 |C |
|4 |D |
|5 |A |
|6 |B |
|7 |C |
|8 |D |
|9 |A |
I tried to join a table called master..[spt_values], but with a COUNT of 10 I get 10 rows for A, 10 rows for B and so on, while I need the sequence ordered and repeated until the count is saturated.
Basically you are on the correct path: joining the split result with a numbers table will get you the correct output.
I've chosen to use a different function for splitting the CSV data, since it uses a numbers table for the split as well (taken from this great article).
First, if you don't already have a numbers table, create one. Here is the script used in the article I've linked to:
SET NOCOUNT ON;
DECLARE @UpperLimit INT = 1000;
WITH n AS
(
SELECT
x = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
FROM sys.all_objects AS s1
CROSS JOIN sys.all_objects AS s2
CROSS JOIN sys.all_objects AS s3
)
SELECT Number = x
INTO dbo.Numbers
FROM n
WHERE x BETWEEN 1 AND @UpperLimit;
GO
CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers(Number)
WITH (DATA_COMPRESSION = PAGE);
GO
Then, create the split function:
CREATE FUNCTION dbo.SplitStrings_Numbers
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = SUBSTRING(@List, Number,
    CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number)
FROM dbo.Numbers
WHERE Number <= CONVERT(INT, LEN(@List))
    AND SUBSTRING(@Delimiter + @List, Number, LEN(@Delimiter)) = @Delimiter
);
GO
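A quick sanity check of the function on its own; this only splits the list, the repetition comes from the join in the next step:
SELECT Item
FROM dbo.SplitStrings_Numbers(N'A,B,C,D', N',');
-- returns four rows: A, B, C, D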
Next step: Join the split results with the numbers table:
DECLARE @Csv varchar(20) = 'A,B,C,D'
SELECT TOP 10 Item
FROM dbo.SplitStrings_Numbers(@Csv, ',')
CROSS JOIN Numbers
ORDER BY Number
Output:
Item
----
A
B
C
D
A
B
C
D
A
B
Great thanks to Aaron Bertrand for sharing his knowledge.
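If you also want the sequence number shown in your expected output, and want to drive the row count from a variable rather than a literal TOP 10, something along these lines should work (a sketch; the CHARINDEX ordering is only reliable here because the items are distinct single characters):
DECLARE @count int = 9;
DECLARE @Csv varchar(20) = 'A,B,C,D';

SELECT TOP (@count)
       ROW_NUMBER() OVER (ORDER BY n.Number, CHARINDEX(s.Item, @Csv)) AS seq,
       s.Item
FROM dbo.SplitStrings_Numbers(@Csv, ',') AS s
CROSS JOIN dbo.Numbers AS n
ORDER BY n.Number, CHARINDEX(s.Item, @Csv);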