Spark - Iterating through all rows in dataframe comparing multiple columns for each row against another - scala

| match_id | player_id | team | win |
| 0 | 1 | A | A |
| 0 | 2 | A | A |
| 0 | 3 | B | A |
| 0 | 4 | B | A |
| 1 | 1 | A | B |
| 1 | 4 | A | B |
| 1 | 8 | B | B |
| 1 | 9 | B | B |
| 2 | 8 | A | A |
| 2 | 4 | A | A |
| 2 | 3 | B | A |
| 2 | 2 | B | A |
I have a dataframe that looks like the one above.
I need to create (key, value) pairs such that, for every match:
(k => (player_id_1, player_id_2), v => 1), if player_id_1 wins against player_id_2 in a match
and
(k => (player_id_1, player_id_2), v => 0), if player_id_1 loses against player_id_2 in a match
I will thus have to iterate through the entire dataframe, comparing each player_id to the others based on the other 3 columns.
I am planning to achieve this as follows:
Group by match_id
In each group, for a player_id, check against the other player_ids the following:
a. If match_id is the same and team is different
Then
if team = win (player_id_1's team won)
(k => (player_id_1, player_id_2), v => 1)
else (team != win)
(k => (player_id_1, player_id_2), v => 0)
For example, after partitioning by matches, consider the first match (match_id 0).
player_id 1 needs to be compared to player_ids 2, 3 and 4.
While iterating, the record for player_id 2 will be skipped as the team is the same;
for player_id 3, as the team is different, team & win will be compared.
As player_id 1 was in team A and player_id 3 was in team B, and team A won, the key-value pair formed would be
((1,3),1)
I have a fair idea of how to achieve this in imperative programming, but I am really new to Scala and functional programming, and I can't figure out how, while iterating through every row, to create a (key, value) pair based on checks against the other fields.
I tried my best to explain the problem. Please do let me know if any part of my question is unclear; I would be happy to elaborate. Thank you.
P.S: I am using Spark 1.6

This can be achieved using the DataFrame API as shown below.
DataFrame API version:
// when not running in spark-shell, these imports are needed for toDF(), $-columns, when() and not()
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = Seq((0,1,"A","A"),(0,2,"A","A"),(0,3,"B","A"),(0,4,"B","A"),(1,1,"A","B"),(1,4,"A","B"),(1,8,"B","B"),(1,9,"B","B"),(2,8,"A","A"),(2,4,"A","A"),(2,3,"B","A"),(2,2,"B","A"))
.toDF("match_id", "player_id", "team", "win")
val result = df.alias("left")
.join(df.alias("right"), $"left.match_id" === $"right.match_id" && not($"right.team" === $"left.team"))
.select($"left.player_id", $"right.player_id", when($"left.team" === $"left.win", 1).otherwise(0).alias("flag"))
scala> result.collect().map(x => (x.getInt(0),x.getInt(1)) -> x.getInt(2)).toMap
res4: scala.collection.immutable.Map[(Int, Int),Int] = Map((1,8) -> 0, (3,4) -> 0, (3,1) -> 0, (9,1) -> 1, (4,1) -> 0, (8,1) -> 1, (2,8) -> 0, (8,3) -> 1, (1,9) -> 0, (1,4) -> 1, (8,2) -> 1, (4,9) -> 0, (3,2) -> 0, (1,3) -> 1, (4,8) -> 0, (4,2) -> 1, (2,4) -> 1, (8,4) -> 1, (2,3) -> 1, (4,3) -> 1, (9,4) -> 1, (3,8) -> 0)
Spark SQL version:
df.registerTempTable("data_table")
val result = sqlContext.sql("""
SELECT DISTINCT t0.player_id, t1.player_id, CASE WHEN t0.team == t0.win THEN 1 ELSE 0 END AS flag FROM data_table t0
INNER JOIN data_table t1
ON t0.match_id = t1.match_id
AND t0.team != t1.team
""")

Related

Sum of consecutive values in column of a Spark dataframe

Hi, I have a dataframe as below:
+-------+--------+
|id |level |
+-------+--------+
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 0 |
| 6 | 1 |
| 7 | 1 |
| 8 | 0 |
| 9 | 1 |
| 10 | 0 |
+-------+--------+
and I need the sums of consecutive 1's, so the output should be 3, 2, 1. The constraint in this scenario is that I am not able to use a UDF. Is there any built-in Scala/Spark function that can do this trick?
You could use row_number and count (SQL/DataFrame API) to count the number of consecutive repeated values in a column.
The trick is to compute the offset between the global row number and the row number within each value: this offset stays constant across a run of consecutive identical values, so it can be used as a group key.
Scala
val df = spark.createDataFrame(Seq((0,0),(1,0),(2,1),(3,1),(4,1),(5,0),(6,1),(7,1),(8,0),(9,1),(10,0))).toDF("id","level")
df.createOrReplaceTempView("DT")
val df_cnt = spark.sql("select level, count(*) from (select *, (row_number() over (order by id) - row_number() over (partition by level order by id)) as grp from DT order by id) as t where level != 0 group by grp, level")
df_cnt.show()
The ordering by id must be maintained, otherwise the query will produce the wrong result.
PySpark
df = spark.createDataFrame([(0,0),(1,0),(2,1),(3,1),(4,1),(5,0),(6,1),(7,1),(8,0),(9,1),(10,0)]).toDF("id","level")
df.createOrReplaceTempView('DT')
# same as before with spark.sql(...)
SQL
select level, count(*) from
(select *,
(row_number() over (order by id) -
row_number() over (partition by level order by id)
) as grp
from DT order by id) as t
where level != 0
group by grp, level
Intermediate SQL computation detail (row offset and grouping):
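The grp values for the sample data can be worked out by hand: the inner subquery computes a global row_number, a per-level row_number, and their difference grp, which stays constant within each run of consecutive identical levels:

id | level | row_number (all) | row_number (per level) | grp
 0 |   0   |        1         |           1            |  0
 1 |   0   |        2         |           2            |  0
 2 |   1   |        3         |           1            |  2
 3 |   1   |        4         |           2            |  2
 4 |   1   |        5         |           3            |  2
 5 |   0   |        6         |           3            |  3
 6 |   1   |        7         |           4            |  3
 7 |   1   |        8         |           5            |  3
 8 |   0   |        9         |           4            |  5
 9 |   1   |       10         |           6            |  4
10 |   0   |       11         |           5            |  6

Grouping the level = 1 rows by grp then gives counts 3 (grp = 2), 2 (grp = 3) and 1 (grp = 4), i.e. the expected output 3, 2, 1.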
You could do something like this:
val seq = Seq(0,0,1,1,1,0,1,1,0,1,0)
val seq1s = seq.foldLeft("")(_ + _).split("0")
seq1s.map(_.sliding(1).count(_ == "1"))
res: Array[Int] = Array(0, 0, 3, 2, 1)
If you don't want the 0s there, you could just filter them out using this instead:
seq1s.map(_.sliding(1).count(_ == "1")).filterNot(_ == 0)
res: Array[Int] = Array(3, 2, 1)
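Note that this second approach works on a plain Scala collection rather than on the DataFrame itself. A minimal sketch of bridging the two, assuming the level column is small enough to collect to the driver:

// collect the level column, ordered by id, into a local Seq and reuse the fold above
val seq = df.orderBy("id").select("level").collect().map(_.getInt(0)).toSeq
val seq1s = seq.foldLeft("")(_ + _).split("0")
seq1s.map(_.sliding(1).count(_ == "1")).filterNot(_ == 0)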

Aggregate all combinations of rows taken k at a time

I am trying to calculate an aggregate function for a field for a subset of rows in a table. The problem is that I'd like to find the mean of every combination of rows taken k at a time --- so for all the rows, I'd like to find (say) the mean of every combination of 10 rows. So:
id | count
----|------
1 | 5
2 | 3
3 | 6
...
30 | 16
should give me
mean of ids 1..10; ids 1, 3..11; ids 1, 4..12, and so on. I know this will yield a lot of rows.
There are SO answers for finding combinations from arrays. I could do this programmatically by taking 30 ids 10 at a time and then SELECTing them. Is there a way to do this with PARTITION BY, TABLESAMPLE, or another function (something like Python's itertools.combinations())? (TABLESAMPLE by itself won't guarantee which subset of rows I am selecting, as far as I can tell.)
The method described in the cited answer is static. A more convenient solution may be to use recursion.
Example data:
drop table if exists my_table;
create table my_table(id int primary key, number int);
insert into my_table values
(1, 5),
(2, 3),
(3, 6),
(4, 9),
(5, 2);
A query which finds the 2-element subsets of a 5-element set (k-combinations with k = 2):
with recursive recur as (
select
id,
array[id] as combination,
array[number] as numbers,
number as sum
from my_table
union all
select
t.id,
combination || t.id,
numbers || t.number,
sum + t.number
from my_table t
join recur r on r.id < t.id
and cardinality(combination) < 2 -- param k
)
select combination, numbers, sum/2.0 as average -- param k
from recur
where cardinality(combination) = 2 -- param k
combination | numbers | average
-------------+---------+--------------------
{1,2} | {5,3} | 4.0000000000000000
{1,3} | {5,6} | 5.5000000000000000
{1,4} | {5,9} | 7.0000000000000000
{1,5} | {5,2} | 3.5000000000000000
{2,3} | {3,6} | 4.5000000000000000
{2,4} | {3,9} | 6.0000000000000000
{2,5} | {3,2} | 2.5000000000000000
{3,4} | {6,9} | 7.5000000000000000
{3,5} | {6,2} | 4.0000000000000000
{4,5} | {9,2} | 5.5000000000000000
(10 rows)
The same query for k = 3 gives:
combination | numbers | average
-------------+---------+--------------------
{1,2,3} | {5,3,6} | 4.6666666666666667
{1,2,4} | {5,3,9} | 5.6666666666666667
{1,2,5} | {5,3,2} | 3.3333333333333333
{1,3,4} | {5,6,9} | 6.6666666666666667
{1,3,5} | {5,6,2} | 4.3333333333333333
{1,4,5} | {5,9,2} | 5.3333333333333333
{2,3,4} | {3,6,9} | 6.0000000000000000
{2,3,5} | {3,6,2} | 3.6666666666666667
{2,4,5} | {3,9,2} | 4.6666666666666667
{3,4,5} | {6,9,2} | 5.6666666666666667
(10 rows)
Of course, you can remove numbers from the query if you do not need them. Keep in mind that the result has C(n, k) rows, so the 30-rows-taken-10-at-a-time case from the question comes to roughly 30 million combinations.

Postgresql summing duplicate elements

In the table there can exist 2 rows that give the same information; only a single column value is different. Basically the data is duplicated because of this one column. Can I somehow sum the other elements in such a manner that this duplication is taken into account?
To illustrate the idea of the problem
Example:
|id|type|val1|val2|
|1 | 2 | 1 | 1 |
|1 | 3 | 1 | 1 |
|1 | 2 | 2 | 2 |
|1 | 3 | 2 | 2 |
Expected result
|id|type|val1|val2|count|
|1 |2,3 | 3 | 3 | 2 |
Actual result
|id|type|val1|val2|count|
|1 |2,3 | 6 | 6 | 4 |
In the actual data, the type and val come from 2 different tables connected by a 3rd table, so the query is like this:
SELECT id,
array_to_string(array_agg(DISTINCT x.type ORDER BY x.type), ','::text) AS type,
sum(y.val1) AS val1,
sum(y.val2) AS val2,
count(y.val1) AS count
FROM a
JOIN x ON x.a_id = a.id AND x.active = true
JOIN y ON y.a_id = a.id AND y.active = true
GROUP BY a.id
SOLUTION
SELECT id,
array_to_string(array_agg(DISTINCT x.type ORDER BY x.type), ','::text) AS type,
sum(distinct y.val1) AS val1,
sum(distinct y.val2) AS val2,
count(distinct y.val1) AS count
FROM a
JOIN x ON x.a_id = a.id AND x.active = true
JOIN y ON y.a_id = a.id AND y.active = true
GROUP BY a.id

How to do 2 distinct groupby conditions on the same data frame in Scala?

I have a data frame, and I need to do two different groupBys on the same data frame.
+----+------+-------+-------+----------------------+
| id | type | item  | value | timestamp            |
+----+------+-------+-------+----------------------+
| 1  | rent | dvd   | 12    | 2016-09-19T00:00:00Z |
| 1  | rent | dvd   | 12    | 2016-09-19T00:00:00Z |
| 1  | buy  | tv    | 12    | 2016-09-20T00:00:00Z |
| 1  | rent | movie | 12    | 2016-09-20T00:00:00Z |
| 1  | buy  | movie | 12    | 2016-09-18T00:00:00Z |
| 1  | buy  | movie | 12    | 2016-09-18T00:00:00Z |
+----+------+-------+-------+----------------------+
I would like to get the result as:
id : 1
totalValue : 72 --- group by based on id
typeCount : {"rent" : 3, "buy" : 3} --- group by based on id
itemCount : {"dvd" : 2, "tv" : 1, "movie" : 3 } --- group by based on id
typeForDay : {"rent: 2, "buy" : 2 } --- group By based on id and dayofmonth(col("timestamp")) atmost 1 type per day
I tried:
val count_by_value = udf {( listValues :scala.collection.mutable.WrappedArray[String]) => if (listValues == null) null else listValues.groupBy(identity).mapValues(_.size)}
val group1 = df.groupBy("id").agg(collect_list("type"),sum("value") as "totalValue", collect_list("item"))
val group1Result = group1.withColumn("typeCount", count_by_value($"collect_list(type)"))
.drop("collect_list(type)")
.withColumn("itemCount", count_by_value($"collect_list(item)"))
.drop("collect_list(item)")
val group2 = df.groupBy(col("id"), dayofmonth(col("timestamp"))).agg(collect_set("type"))
val group2Result = group2.withColumn("typeForDay", count_by_value($"collect_set(type)"))
.drop("collect_set(type)")
val groupedResult = group1Result.join(group2Result, "id").show()
But it takes time; is there a more efficient way of doing this?
A better approach is to add each grouping field to the key and reduce, instead of using groupBy(). You can use these:
// total value per id (rec(0) = id, rec(3) = value)
df1.map(rec => (rec(0), rec(3).toString().toInt)).
reduceByKey(_+_).take(5).foreach(println)
=> (1,72)
// count per (id, type): reduce by the composite key first, then flatten it
df1.map(rec => ((rec(0), rec(1)), 1)).
reduceByKey(_+_).
map(x => (x._1._1, x._1._2, x._2)).
take(5).foreach(println)
=>(1,rent,3)
(1,buy,3)
// count per (id, item)
df1.map(rec => ((rec(0), rec(2)), 1)).
reduceByKey(_+_).
map(x => (x._1._1, x._1._2, x._2)).
take(5).foreach(println)
=>(1,dvd,2)
(1,tv,1)
(1,movie,3)
// count per (id, type, day of month); substring(8,10) extracts the day from the timestamp
df1.map(rec => ((rec(0), rec(1), rec(4).toString().substring(8,10)), 1)).
reduceByKey(_+_).map(x => (x._1._1, x._1._2, x._1._3, x._2)).
take(5).foreach(println)
=>(1,rent,19,2)
(1,buy,20,1)
(1,buy,18,2)
(1,rent,20,1)
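If the per-id maps asked for in the question (for example typeCount) are still needed, the reduced counts can be folded into maps with one more reduceByKey. A rough sketch, assuming the same df1 as above with rec(1) holding the type column:

// build e.g. {"rent" -> 3, "buy" -> 3} per id from the (id, type) counts
df1.map(rec => ((rec(0), rec(1).toString()), 1)).
reduceByKey(_+_).
map(x => (x._1._1, Map(x._1._2 -> x._2))).
reduceByKey(_ ++ _).
take(5).foreach(println)
// prints something like (1,Map(rent -> 3, buy -> 3))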

How to traverse a hierarchical tree structure backwards using recursive queries

I'm using PostgreSQL 9.1 to query hierarchical tree-structured data, consisting of edges (or elements) with connections to nodes. The data are actually for stream networks, but I've abstracted the problem to simple data types. Consider the example tree table. Each edge has length and area attributes, which are used to determine some useful metrics from the network.
CREATE TEMP TABLE tree (
edge text PRIMARY KEY,
from_node integer UNIQUE NOT NULL, -- can also act as PK
to_node integer REFERENCES tree (from_node),
mode character varying(5), -- redundant, but illustrative
length numeric NOT NULL,
area numeric NOT NULL,
fwd_path text[], -- optional ordered sequence, useful for debugging
fwd_search_depth integer,
fwd_length numeric,
rev_path text[], -- optional unordered set, useful for debugging
rev_search_depth integer,
rev_length numeric,
rev_area numeric
);
CREATE INDEX ON tree (to_node);
INSERT INTO tree(edge, from_node, to_node, mode, length, area) VALUES
('A', 1, 4, 'start', 1.1, 0.9),
('B', 2, 4, 'start', 1.2, 1.3),
('C', 3, 5, 'start', 1.8, 2.4),
('D', 4, 5, NULL, 1.2, 1.3),
('E', 5, NULL, 'end', 1.1, 0.9);
In this network, the edges A-E are connected by nodes 1-5, and the NULL to_node (Ø) represents the end node. The from_node is always unique, so it can act as a PK. If this network flows like a drainage basin, the flow would be from top to bottom, where the starting tributary edges are A, B and C and the ending outflow edge is E.
The documentation for WITH provides a nice example of how to use search graphs in recursive queries. So, to get the "forwards" information, the query starts at the end and works backwards:
WITH RECURSIVE search_graph AS (
-- Begin at ending nodes
SELECT E.from_node, 1 AS search_depth, E.length
, ARRAY[E.edge] AS path -- optional
FROM tree E WHERE E.to_node IS NULL
UNION ALL
-- Accumulate each edge, working backwards (upstream)
SELECT o.from_node, sg.search_depth + 1, sg.length + o.length
, o.edge || sg.path -- optional
FROM tree o, search_graph sg
WHERE o.to_node = sg.from_node
)
UPDATE tree SET
fwd_path = sg.path,
fwd_search_depth = sg.search_depth,
fwd_length = sg.length
FROM search_graph AS sg WHERE sg.from_node = tree.from_node;
SELECT edge, from_node, to_node, fwd_path, fwd_search_depth, fwd_length
FROM tree ORDER BY edge;
edge | from_node | to_node | fwd_path | fwd_search_depth | fwd_length
------+-----------+---------+----------+------------------+------------
A | 1 | 4 | {A,D,E} | 3 | 3.4
B | 2 | 4 | {B,D,E} | 3 | 3.5
C | 3 | 5 | {C,E} | 2 | 2.9
D | 4 | 5 | {D,E} | 2 | 2.3
E | 5 | | {E} | 1 | 1.1
The above makes sense, and scales well for large networks. For example, I can see edge B is 3 edges from the end, and the forward path is {B,D,E} with a total length of 3.5 from the tip to the end.
However, I cannot figure out a good way to build a reverse query. That is, from each edge, what are the accumulated "upstream" edges, lengths and areas. Using WITH RECURSIVE, the best I have is:
WITH RECURSIVE search_graph AS (
-- Begin at starting nodes
SELECT S.from_node, S.to_node, 1 AS search_depth, S.length, S.area
, ARRAY[S.edge] AS path -- optional
FROM tree S WHERE from_node IN (
-- Starting nodes have a from_node without any to_node
SELECT from_node FROM tree EXCEPT SELECT to_node FROM tree)
UNION ALL
-- Accumulate edges, working forwards
SELECT c.from_node, c.to_node, sg.search_depth + 1, sg.length + c.length, sg.area + c.area
, c.edge || sg.path -- optional
FROM tree c, search_graph sg
WHERE c.from_node = sg.to_node
)
UPDATE tree SET
rev_path = sg.path,
rev_search_depth = sg.search_depth,
rev_length = sg.length,
rev_area = sg.area
FROM search_graph AS sg WHERE sg.from_node = tree.from_node;
SELECT edge, from_node, to_node, rev_path, rev_search_depth, rev_length, rev_area
FROM tree ORDER BY edge;
edge | from_node | to_node | rev_path | rev_search_depth | rev_length | rev_area
------+-----------+---------+----------+------------------+------------+----------
A | 1 | 4 | {A} | 1 | 1.1 | 0.9
B | 2 | 4 | {B} | 1 | 1.2 | 1.3
C | 3 | 5 | {C} | 1 | 1.8 | 2.4
D | 4 | 5 | {D,A} | 2 | 2.3 | 2.2
E | 5 | | {E,C} | 2 | 2.9 | 3.3
I would like to build aggregates into the second term of the recursive query, since each downstream edge connects to 1 or many upstream edges, but aggregates are not allowed with recursive queries. Also, I'm aware that the join is sloppy, since the WITH RECURSIVE result has multiple join conditions for edge.
The expected result for the reverse / backwards query is:
edge | from_node | to_node | rev_path | rev_search_depth | rev_length | rev_area
------+-----------+---------+-------------+------------------+------------+----------
A | 1 | 4 | {A} | 1 | 1.1 | 0.9
B | 2 | 4 | {B} | 1 | 1.2 | 1.3
C | 3 | 5 | {C} | 1 | 1.8 | 2.4
D | 4 | 5 | {A,B,D} | 3 | 3.5 | 3.5
E | 5 | | {A,B,C,D,E} | 5 | 6.4 | 6.8
How can I write this update query?
Note that I'm ultimately more concerned about accumulating accurate length and area totals, and that the path attributes are for debugging. In my real-world case, forwards paths are up to a couple hundred, and I expect reverse paths in the tens of thousands for large and complex catchments.
UPDATE 2:
I rewrote the original recursive query so that all accumulation/aggregation is done outside the recursive part; it should perform better than the previous version of this answer. The recursive part only walks the tree and tags every visited edge with the start_node it was reached from, and the array_agg, count and sum are then applied in an ordinary GROUP BY, which sidesteps the restriction that aggregates are not allowed inside the recursive term.
This is very similar to the answer from @a_horse_with_no_name for a similar question.
WITH
RECURSIVE search_graph(edge, from_node, to_node, length, area, start_node) AS
(
SELECT edge, from_node, to_node, length, area, from_node AS "start_node"
FROM tree
UNION ALL
SELECT o.edge, o.from_node, o.to_node, o.length, o.area, p.start_node
FROM tree o
JOIN search_graph p ON p.from_node = o.to_node
)
SELECT array_agg(edge) AS "edges"
-- ,array_agg(from_node) AS "nodes"
,count(edge) AS "edge_count"
,sum(length) AS "length_sum"
,sum(area) AS "area_sum"
FROM search_graph
GROUP BY start_node
ORDER BY start_node
;
Results are as expected:
start_node | edges | edge_count | length_sum | area_sum
------------+-------------+------------+------------+------------
1 | {A} | 1 | 1.1 | 0.9
2 | {B} | 1 | 1.2 | 1.3
3 | {C} | 1 | 1.8 | 2.4
4 | {D,B,A} | 3 | 3.5 | 3.5
5 | {E,D,C,B,A} | 5 | 6.4 | 6.8