I have a dataframe
Hi, I have a dataframe as below:
+---+-----+
| id|level|
+---+-----+
|  0|    0|
|  1|    0|
|  2|    1|
|  3|    1|
|  4|    1|
|  5|    0|
|  6|    1|
|  7|    1|
|  8|    0|
|  9|    1|
| 10|    0|
+---+-----+
and I need the sum of consecutive 1's, so the output should be 3, 2, 1. However, the constraint in this scenario is that I cannot use a UDF. Is there any built-in Scala/Spark function that can do this?
You could use row_number and count (SQL/Dataframe API), to count the number of consecutive values (repeat) in a column.
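The trick is to subtract two row numbers: the row's position in the whole table and its position among the rows with the same level. This offset stays constant within a run of consecutive equal values and changes whenever the run is interrupted, so it can be used as a grouping key.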
Scala
var df = spark.createDataFrame(Seq((0,0),(1,0),(2,1),(3,1),(4,1),(5,0),(6,1),(7,1),(8,0),(9,1),(10,0))).toDF("id","level")
df.createOrReplaceTempView("DT")
var df_cnt = spark.sql("select level, count(*) from (select *, (row_number() over (order by id) - row_number() over (partition by level order by id) ) as grp from DT order by id) as t where level !=0 group by grp, level ")
df_cnt.show()
The rows must be ordered by id in both window definitions; otherwise the query will produce the wrong result.
Pyspark
df = spark.createDataFrame([(0,0),(1,0),(2,1),(3,1),(4,1),(5,0),(6,1),(7,1),(8,0),(9,1),(10,0)]).toDF("id","level")
df.createOrReplaceTempView('DT')
# same as before with spark.sql(...)
SQL
select level, count(*) from
(select *,
(row_number() over (order by id) -
row_number() over (partition by level order by id)
) as grp
from DT order by id) as t
where level !=0
group by grp, level
Intermediate SQL computation detail (row offset and grouping):
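The intermediate values can be reproduced without Spark. The following is a minimal plain-Python sketch of what the inner query computes for the sample data, not the Spark execution itself. The helper name `grouping_keys` and the labels `rn_all` and `rn_level` are made up for illustration.

```python
from collections import Counter, defaultdict

# Sample data from the question: (id, level).
rows = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 1), (5, 0),
        (6, 1), (7, 1), (8, 0), (9, 1), (10, 0)]


def grouping_keys(rows):
    """Emulate the two row_number() calls and their difference (grp)."""
    seen_per_level = defaultdict(int)
    out = []
    # rn_all plays the role of row_number() over (order by id)
    for rn_all, (id_, level) in enumerate(sorted(rows), start=1):
        # rn_level plays the role of row_number() over (partition by level order by id)
        seen_per_level[level] += 1
        rn_level = seen_per_level[level]
        out.append((id_, level, rn_all, rn_level, rn_all - rn_level))
    return out


detail = grouping_keys(rows)
for row in detail:
    print(row)  # (id, level, rn_all, rn_level, grp)

# Equivalent of: where level != 0 group by grp, level
run_lengths = Counter((grp, level) for _, level, _, _, grp in detail if level != 0)
counts = [run_lengths[key] for key in sorted(run_lengths)]
print(counts)  # [3, 2, 1]
```

The rows with level 1 get grp values 2, 2, 2, 3, 3 and 4, which is why grouping by grp separates the three runs.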
You could do something like this:
val seq = Seq(0,0,1,1,1,0,1,1,0,1,0)
val seq1s = seq.foldLeft("")(_ + _).split("0")
seq1s.map(_.sliding(1).count(_ == "1"))
res: Array[Int] = Array(0, 0, 3, 2, 1)
If you don't want the 0s there, you could just filter them out using this instead:
seq1s.map(_.sliding(1).count(_ == "1")).filterNot(_ == 0)
res: Array[Int] = Array(3, 2, 1)
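For comparison, here is a small sketch of the same run-length idea in Python using `itertools.groupby`. Like the Scala snippet above, it works on a local list and not on a distributed DataFrame, so it only applies if the column can be collected to the driver.

```python
from itertools import groupby

levels = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]

# groupby yields one (value, run) pair per stretch of equal neighbouring values,
# so the length of each run of 1s is the count we are after.
ones_runs = [len(list(run)) for value, run in groupby(levels) if value == 1]
print(ones_runs)  # [3, 2, 1]
```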
Related
What I have
id | value
1 | foo
2 | foo
3 | bah
4 | bah
5 | bah
6 | jezz
7 | jezz
8 | jezz
9 | pas
10 | log
What I need:
Enumerate rows as in the following example
id | value | enumeration
1 | foo | 1
2 | foo | 1
3 | bah | 2
4 | bah | 2
5 | bah | 2
6 | jezz | 3
7 | jezz | 3
8 | jezz | 3
9 | pas | 4
10 | log | 5
I've tried row_number with over partition. But this leads to another kind of enumeration.
Thanks for any help
You can use rank() or dense_rank() for that case:
SELECT
*,
dense_rank() OVER (ORDER BY value)
FROM
mytable
rank() assigns the same ordered number to every element of a group, but it creates gaps (if there were 3 elements in the first group, the second group starting at row 4 would get the number 4). dense_rank() avoids these gaps.
Note that this orders the table by the value column alphabetically. So, the result will be: bah == 1, foo == 2, jezz == 3, log == 4, pas == 5.
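To make the difference between the two functions concrete, here is a rough plain-Python emulation on the sample values. It only mimics the numbering that `rank()` and `dense_rank()` produce with `ORDER BY value`; it is not how a database evaluates them.

```python
values = ["foo", "foo", "bah", "bah", "bah", "jezz", "jezz", "jezz", "pas", "log"]

ordered = sorted(values)

# rank(): position of the first row of each group in the sorted order (leaves gaps).
rank = {}
for position, value in enumerate(ordered, start=1):
    rank.setdefault(value, position)

# dense_rank(): consecutive numbers over the distinct sorted values (no gaps).
dense_rank = {value: n for n, value in enumerate(sorted(set(values)), start=1)}

print(rank)        # {'bah': 1, 'foo': 4, 'jezz': 6, 'log': 9, 'pas': 10}
print(dense_rank)  # {'bah': 1, 'foo': 2, 'jezz': 3, 'log': 4, 'pas': 5}
```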
If you want to keep your order, you need an additional order criterion. In your case you could use the id column to create such a column, if no other is available:
First, use first_value() to find the lowest id per value group:
SELECT
*,
first_value(id) OVER (PARTITION BY value ORDER BY id)
FROM
mytable
This first value (foo == 1, bah == 3, ...) can be used to keep the original order when calculating the dense_rank():
SELECT
id,
value,
dense_rank() OVER (ORDER BY first_value)
FROM (
SELECT
*,
first_value(id) OVER (PARTITION BY value ORDER BY id)
FROM
mytable
) s
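The two steps can be checked against the expected enumeration with a short plain-Python sketch. The dictionary names `first_id` and `rank_of_first` are illustrative only.

```python
rows = [(1, "foo"), (2, "foo"), (3, "bah"), (4, "bah"), (5, "bah"),
        (6, "jezz"), (7, "jezz"), (8, "jezz"), (9, "pas"), (10, "log")]

# Step 1, like first_value(id) over (partition by value order by id):
# remember the lowest id seen for each value.
first_id = {}
for id_, value in sorted(rows):
    first_id.setdefault(value, id_)

# Step 2, like dense_rank() over (order by first_value):
# number the distinct first ids consecutively.
rank_of_first = {fid: n for n, fid in enumerate(sorted(set(first_id.values())), start=1)}

enumerated = [(id_, value, rank_of_first[first_id[value]]) for id_, value in rows]
for row in enumerated:
    print(row)
```

Note that both the SQL and this sketch give a value the same number every time it appears, so a value that shows up again later in the table would not start a new group.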
Say I have a table like the following that represents a path from a -> b -> c -> d -> e -> f:
+------+----+--------+
| from | to | weight |
+------+----+--------+
| a | b | 1 |
| b | c | 2 |
| c | d | 1 |
| d | e | 1 |
| e | f | 3 |
+------+----+--------+
Each row knows where it came from and where it is going.
I would like to union a total row that takes the starting name, ending name, and a total weight like so:
+------+----+--------+
| from | to | weight |
+------+----+--------+
| a | f | 8 |
+------+----+--------+
The first table is a result of a CTE expression, and I can easily get the total of the previous query with SUM, but I'm unable to get the LAST_VALUE to work in a similar way to:
WITH RECURSIVE cte AS (
...
)
SELECT *
FROM cte
UNION ALL
SELECT 'total', FIRST_VALUE(from), LAST_VALUE(to), SUM(weight)
FROM cte
The FIRST_VALUE and LAST_VALUE functions require OVER clauses which seem to add unnecessary complications to what I would expect, so I think I am going the wrong direction with that. Any ideas on how to achieve this?
So I made a strange solution that:
Selects the first from value (partitioned by TRUE)
Selects the last to value (partitioned by TRUE again)
Cross joins the sum of all weights, limited to 1
WITH RECURSIVE cte AS (
...
)
SELECT *
FROM cte
UNION ALL (
SELECT FIRST_VALUE(from) OVER (PARTITION BY TRUE), LAST_VALUE(to) OVER (PARTITION BY TRUE), total
FROM cte
CROSS JOIN (
SELECT SUM(weight) as total
FROM cte
) tmp
LIMIT 1
);
Is it hacky? Yes. Does it work? Also yes. I'm sure there are better solutions, and I would love to hear them.
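One possible alternative idea, sketched here in plain Python rather than SQL, does not depend on row order at all: the start of the path is the only `from` value that never appears as a `to`, and the end is the only `to` value that never appears as a `from`. This assumes the rows form a single simple path with no branches or cycles; the variable names are illustrative.

```python
# Rows of the path table: (from, to, weight).
edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 1), ("d", "e", 1), ("e", "f", 3)]

sources = {src for src, _, _ in edges}
targets = {dst for _, dst, _ in edges}

# Unpacking into a 1-tuple fails loudly if the single-path assumption is violated.
(start,) = sources - targets
(end,) = targets - sources

total_row = (start, end, sum(weight for _, _, weight in edges))
result = edges + [total_row]
print(total_row)  # ('a', 'f', 8)
```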
I am trying to calculate an aggregate function for a field for a subset of rows in a table. The problem is that I'd like to find the mean of every combination of rows taken k at a time --- so for all the rows, I'd like to find (say) the mean of every combination of 10 rows. So:
id | count
----|------
1 | 5
2 | 3
3 | 6
...
30 | 16
should give me
mean of ids 1..10; ids 1, 3..11; ids 1, 4..12, and so on. I know this will yield a lot of rows.
There are SO answers for finding combinations from arrays. I could do this programmatically by taking 30 ids 10 at a time and then SELECTing them. Is there a way to do this with PARTITION BY, TABLESAMPLE, or another function (something like python's itertools.combinations())? (TABLESAMPLE by itself won't guarantee which subset of rows I am selecting as far as I can tell.)
The method described in the cited answer is static. A more convenient solution may be to use recursion.
Example data:
drop table if exists my_table;
create table my_table(id int primary key, number int);
insert into my_table values
(1, 5),
(2, 3),
(3, 6),
(4, 9),
(5, 2);
Query which finds 2 element subsets in 5 element set (k-combination with k = 2):
with recursive recur as (
select
id,
array[id] as combination,
array[number] as numbers,
number as sum
from my_table
union all
select
t.id,
combination || t.id,
numbers || t.number,
sum + t.number
from my_table t
join recur r on r.id < t.id
and cardinality(combination) < 2 -- param k
)
select combination, numbers, sum/2.0 as average -- param k
from recur
where cardinality(combination) = 2 -- param k
combination | numbers | average
-------------+---------+--------------------
{1,2} | {5,3} | 4.0000000000000000
{1,3} | {5,6} | 5.5000000000000000
{1,4} | {5,9} | 7.0000000000000000
{1,5} | {5,2} | 3.5000000000000000
{2,3} | {3,6} | 4.5000000000000000
{2,4} | {3,9} | 6.0000000000000000
{2,5} | {3,2} | 2.5000000000000000
{3,4} | {6,9} | 7.5000000000000000
{3,5} | {6,2} | 4.0000000000000000
{4,5} | {9,2} | 5.5000000000000000
(10 rows)
The same query for k = 3 gives:
combination | numbers | average
-------------+---------+--------------------
{1,2,3} | {5,3,6} | 4.6666666666666667
{1,2,4} | {5,3,9} | 5.6666666666666667
{1,2,5} | {5,3,2} | 3.3333333333333333
{1,3,4} | {5,6,9} | 6.6666666666666667
{1,3,5} | {5,6,2} | 4.3333333333333333
{1,4,5} | {5,9,2} | 5.3333333333333333
{2,3,4} | {3,6,9} | 6.0000000000000000
{2,3,5} | {3,6,2} | 3.6666666666666667
{2,4,5} | {3,9,2} | 4.6666666666666667
{3,4,5} | {6,9,2} | 5.6666666666666667
(10 rows)
Of course, you can remove numbers from the query if you do not need them.
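Since the question mentions `itertools.combinations()`, here is a short Python sketch that can be used to cross-check the query output on the example data. The function name `combination_means` is made up for illustration.

```python
from itertools import combinations

# Example data from my_table: id -> number.
table = {1: 5, 2: 3, 3: 6, 4: 9, 5: 2}


def combination_means(table, k):
    """Map every k-combination of ids to the mean of the matching numbers."""
    return {
        combo: sum(table[i] for i in combo) / k
        for combo in combinations(sorted(table), k)
    }


averages = combination_means(table, 2)
for combo, average in averages.items():
    print(combo, average)
```

Either way, the number of rows grows quickly: for 30 ids taken 10 at a time there are 30,045,015 combinations.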
I'm using Amazon RDS (Aurora) so don't have access to the crosstab() function.
My dataset is a count of particular actions per user and looks like:
| uid | action1 | action2 |
| alice | 2 | 2 |
| bob | 1 | 2 |
| charlie | 5 | 0 |
How can I pivot this dataset to make a histogram of action counts? So it would look like:
# | Action1 | Action2
---------------------
0 | | 1
1 | 1 |
2 | 1 | 2
3 | |
4 | |
5 | 1 |
6 | |
Here's a SQL fiddle I've been using with the values already entered: http://sqlfiddle.com/#!17/2b966/1
I have a solution but it is very verbose:
WITH nums AS (
SELECT n
FROM (VALUES (0), (1), (2), (3), (4), (5)) nums(n)
),
action1_counts as (
select
action1,
count(*) as total
from test
group by 1
),
action2_counts as (
select
action2,
count(*) as total
from test
group by 1
)
select
nums.n,
coalesce(a1.total, 0) as Action1,
coalesce(a2.total, 0) as Action2
from nums
LEFT join action1_counts a1 on a1.action1 = nums.n
LEFT join action2_counts a2 on a2.action2 = nums.n
order by 1
Assume action is between 0 and 6.
select a1.action, a1.action1, nullif(count(t2.action2),0) as action2
from
( select t.action, nullif(count(t1.action1),0) as action1
from
(select action from generate_series(0,6) g(action)) t
left join
test t1
on t1.action1 = t.action
group by t.action
) a1
left join
test t2
on t2.action2 = a1.action
group by a1.action, a1.action1
order by a1.action
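To check the expected histogram independently of the database, here is a small plain-Python sketch using `collections.Counter` on the three sample rows. `None` stands in for the empty cells, and the range 0 to 6 is the same assumption as above.

```python
from collections import Counter

# Sample rows from the question: (uid, action1, action2).
rows = [("alice", 2, 2), ("bob", 1, 2), ("charlie", 5, 0)]

# Count how many users have each action count.
action1 = Counter(a1 for _, a1, _ in rows)
action2 = Counter(a2 for _, _, a2 in rows)

# One output row per bucket; .get() returns None for buckets with no users.
histogram = [(n, action1.get(n), action2.get(n)) for n in range(0, 7)]
for row in histogram:
    print(row)
```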
I have a table in SQL Server 2008 R2 which contains product orders. For the most part, it is one entry per product:
ID | Prod | Qty
------------
1 | A | 1
4 | B | 1
7 | A | 1
8 | A | 1
9 | A | 1
12 | C | 1
15 | A | 1
16 | A | 1
21 | B | 1
I want to create a view based on the table which looks like this
ID | Prod | Qty
------------------
1 | A | 1
4 | B | 1
9 | A | 3
12 | C | 1
16 | A | 2
21 | B | 1
I've written a query using a table expression, but I am stumped on how to make it work. The sql below does not actually work, but is a sample of what I am trying to do. I've written this query multiple different ways, but cannot figure out how to get the right results. I am using row_number to generate a sequential id. From that, I can order and compare consecutive rows to see if the next row has the same product as the previous row since ReleaseId is sequential, but not necessarily contiguous.
;with myData AS
(
SELECT
row_number() over (order by a.ReleaseId) as 'Item',
a.ReleaseId,
a.ProductId,
a.Qty
FROM OrdersReleased a
UNION ALL
SELECT
row_number() over (order by b.ReleaseId) as 'Item',
b.ReleaseId,
b.ProductId,
b.Qty
FROM OrdersReleased b
INNER JOIN myData c ON b.Item = c.Item + 1 and b.ProductId = c.ProductId
)
SELECT * from myData
Usually you drop the ID out of something like this, since it is a summary.
SELECT a.ProductId,
SUM(a.Qty) AS Qty
FROM OrdersReleased a
GROUP BY a.ProductId
ORDER BY a.ProductId
-- if you want to do sub query you can do it as a column (if you don't have a very large dataset).
SELECT a.ProductId,
SUM(a.Qty) AS Qty,
(SELECT COUNT(1)
FROM OrdersReleased b
WHERE b.ReleaseId - 1 = a.ReleaseId
AND b.ProductId = a.ProductId) as NumberBackToBack
FROM OrdersReleased a
GROUP BY a.ProductId
ORDER BY a.ProductId
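Note that grouping by ProductId alone merges all rows of a product, while the view in the question collapses only consecutive rows. As a reference for the expected result, here is a minimal plain-Python sketch of that collapsing step on the sample data. It keeps the last ID of each run, as in the desired output, and is meant as an illustration of the logic, not as a replacement for a SQL view.

```python
from itertools import groupby

# Sample rows from the question: (ID, Prod, Qty).
orders = [(1, "A", 1), (4, "B", 1), (7, "A", 1), (8, "A", 1), (9, "A", 1),
          (12, "C", 1), (15, "A", 1), (16, "A", 1), (21, "B", 1)]

collapsed = []
# Sort by ID, then group neighbouring rows that share the same product.
for prod, run in groupby(sorted(orders), key=lambda row: row[1]):
    run = list(run)
    last_id = run[-1][0]
    collapsed.append((last_id, prod, sum(qty for _, _, qty in run)))

for row in collapsed:
    print(row)
```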