Aggregating similar records in Spark 1.6.2 - Scala

I have a data set where, after some transformations using Spark SQL (1.6.2) in Scala, I got the following data (only part of it is shown).
home  away  count
a     b     90
b     a     70
c     d     50
e     f     45
f     e     30
Now I want to get the final result by aggregating rows with the same home/away pair, e.g. a and b appear twice. Matching home/away pairs may not always come in consecutive rows:
home  away  count
a     b     160
c     d     50
e     f     75
Can someone help me out with this?

You can create a temporary column using array and sort_array and then groupBy on it to solve this. Here I assumed that there can be at most two rows with the same pair of values in the home/away columns, and that which value is in home and which is in away doesn't matter:
val df = Seq(("a", "b", 90),
("b", "a", 70),
("c", "d", 50),
("e", "f", 45),
("f", "e", 30)).toDF("home", "away", "count")
val df2 = df.withColumn("home_away", sort_array(array($"home", $"away")))
.groupBy("home_away")
.agg(sum("count").as("count"))
.select($"home_away"(0).as("home"), $"home_away"(1).as("away"), $"count")
.drop("home_away")
Will give:
+----+----+-----+
|home|away|count|
+----+----+-----+
| e| f| 75|
| c| d| 50|
| a| b| 160|
+----+----+-----+
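As a hedged alternative (assuming Spark 1.5+, where least and greatest are available), the pair can be normalised directly in the groupBy instead of going through an array column; this is just a sketch of the same idea, not part of the answer above:

import org.apache.spark.sql.functions._

df.groupBy(least($"home", $"away").as("home"), greatest($"home", $"away").as("away"))
  .agg(sum("count").as("count"))
  .show()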

Related

How to perform complicated manipulations on Scala Datasets

I am fairly new to Scala, and coming from a SQL and pandas background, the Dataset objects in Scala are giving me a bit of trouble.
I have a dataset that looks like the following...
+-------+------+
|car_num|colour|
+-------+------+
|    145|     c|
|    132|     p|
|    104|     u|
|    110|     c|
|    110|     f|
|    113|     c|
|    115|     c|
|     11|     i|
|    117|     s|
|    118|     a|
+-------+------+
I have loaded it as a dataset using a case class that looks like the following
case class carDS(carNum: String, Colour: String)
Each car_num is unique to a car, many of the cars have multiple entries. The colour column refers to the colour the car was painted.
I would like to know how to add a column that gives, for example, the longest run of paint jobs a car has had without being green (g).
So far I have tried this.
carDS
  .map(x => (x.carNum, x.Colour))
  .groupBy("_1")
  .count()
  .orderBy($"count".desc)
  .show()
But I believe that just gives me a count column with the number of times the car was painted, not the longest sequential run of paint jobs without being green.
I think I might need to use a function in my query like the following
def colourrun(sq: String): Int = {
  println(sq)
  sq.mkString(" ")
    .split("g")
    .filter(_.nonEmpty)
    .map(_.trim)
    .map(s => s.split(" ").length)
    .max
}
but I am unsure where it should go.
Ultimately if car 102 had been painted r, b, g, b, o, y, r, g
I would want the count column to give 4 as the answer.
How would I do this?
thanks
Here's one approach that groups the paint jobs for a given car into monotonically numbered groups separated by paint jobs of colour "g", followed by a couple of groupBy/agg steps to get the maximum count of paint jobs between "g" paint jobs.
(Note that a timestamp column is being added to ensure a deterministic ordering of the rows in the dataset.)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val ds = Seq(
  ("102", "r", 1), ("102", "b", 2), ("102", "g", 3), ("102", "b", 4), ("102", "o", 5), ("102", "y", 6), ("102", "r", 7), ("102", "g", 8),
  ("145", "c", 1), ("145", "g", 2), ("145", "b", 3), ("145", "r", 4), ("145", "g", 5), ("145", "c", 6), ("145", "g", 7)
).toDF("car_num", "colour", "timestamp").as[(String, String, Long)]

val win = Window.partitionBy("car_num").orderBy("timestamp")

ds.
  withColumn("group", sum(when($"colour" === "g", 1).otherwise(0)).over(win)).
  groupBy("car_num", "group").agg(
    when($"group" === 0, count("group")).otherwise(count("group") - 1).as("count")
  ).
  groupBy("car_num").agg(max("count").as("max_between_g")).
  show
// +-------+-------------+
// |car_num|max_between_g|
// +-------+-------------+
// | 102| 4|
// | 145| 2|
// +-------+-------------+
An alternative to using the DataFrame API is to apply groupByKey to the Dataset followed by mapGroups like below:
ds.
  map(c => (c._1, c._2)).  // (car_num, colour); ds holds tuples, not the case class
  groupByKey(_._1).mapGroups { case (k, iter) =>
    // Fold over the colours, resetting the running count at each "g" and
    // remembering the longest run seen so far.
    val (lastRun, maxRun) = iter.map(_._2).foldLeft((0, 0)) { case ((cnt, mx), c) =>
      if (c == "g") (0, math.max(cnt, mx)) else (cnt + 1, mx)
    }
    (k, math.max(lastRun, maxRun))  // also covers a trailing run not ended by "g"
  }.
  show
show
// +---+---+
// | _1| _2|
// +---+---+
// |102| 4|
// |145| 2|
// +---+---+
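(Side note, not from the original answer: if the source data has no timestamp column at all, one assumption-laden option is to attach an ordering column with monotonically_increasing_id(); it only reflects the order in which the rows are read, so treat it as a best-effort stand-in for a real timestamp.)

import org.apache.spark.sql.functions._

// rawDf is a hypothetical DataFrame with just car_num and colour, read in paint-job order.
val withOrder = rawDf.withColumn("timestamp", monotonically_increasing_id())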

Find top N games for every ID based on total time using Spark and Scala

Find the top N games for every id, based on total time watched. Here is my input dataframe:
InputDF:
id | Game | Time
---+------+-----
 1 |    A |   10
 2 |    B |  100
 1 |    A |  100
 2 |    C |  105
 1 |    N |  103
 2 |    B |  102
 1 |    N |   90
 2 |    C |  110
And this is the output that I am expecting:
OutputDF:
id | Game | Time
---+------+-----
 1 |    N |  193
 1 |    A |  110
 2 |    C |  215
 2 |    B |  202
Here is what I have tried, but it is not working as expected:
val windowDF = Window.partitionBy($"id").orderBy($"Time".desc)

InputDF.withColumn("rank", row_number().over(windowDF))
  .filter("rank <= 10")
Your top-N ranking is applied to the individual time values rather than the total time per game. A groupBy/sum to compute the total time, followed by a ranking on that total time, will do:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (1, "A", 10),
  (2, "B", 100),
  (1, "A", 100),
  (2, "C", 105),
  (1, "N", 103),
  (2, "B", 102),
  (1, "N", 90),
  (2, "C", 110)
).toDF("id", "game", "time")

val win = Window.partitionBy($"id").orderBy($"total_time".desc)

df.
  groupBy("id", "game").agg(sum("time").as("total_time")).
  withColumn("rank", row_number.over(win)).
  where($"rank" <= 10).
  show
// +---+----+----------+----+
// | id|game|total_time|rank|
// +---+----+----------+----+
// | 1| N| 193| 1|
// | 1| A| 110| 2|
// | 2| C| 215| 1|
// | 2| B| 202| 2|
// +---+----+----------+----+
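If ties in total_time should not be broken arbitrarily, a hedged variant (my tweak, not part of the answer above) is to swap row_number for dense_rank so that tied games share a rank; the cut-off N can also be made explicit:

val topN = 2  // hypothetical N

df.
  groupBy("id", "game").agg(sum("time").as("total_time")).
  withColumn("rank", dense_rank().over(win)).
  where($"rank" <= topN).
  show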

Improving performance of distinct + groupByKey on Spark

I am trying to learn Spark and came up with this problem, but my solution doesn't seem to be performing well. I was hoping someone could educate me on how I can improve the performance. The problem I have is as follows.
I have a few million tuples (e.g. (A, B), (A, C), (B, C), etc.), possibly with duplicate tuples (both key and value). What I would like to do is group the tuples by key AND, to make it more interesting, limit the length of the grouped values to some arbitrary number (let's say 3).
So, for example, if I have:
[(A, B), (A, C), (A, D), (A, E), (B, C)]
I would expect the output to be:
[(A, [B, C, D]), (A, [E]), (B, [C])]
If the list of values for a key gets longer than 3, it is split up and you get the same key listed multiple times, as shown above with (A, [E]). Hopefully this makes sense.
The solution I came up with was:
val myTuples: Array[(String, String)] = ...

sparkContext.parallelize(myTuples)
  .distinct()                                  // to delete duplicates
  .groupByKey()                                // to group up the tuples by key
  .flatMapValues(values => values.grouped(3))  // split up values in groups of 3
  .repartition(sparkContext.defaultParallelism)
  .collect()
My solution works okay but is there a more efficient way to do this? I hear that groupByKey is very inefficient. Any help would be greatly appreciated.
Also, is there a good number I should choose for partitions? I noticed that distinct takes a partitions parameter, but I'm not sure what I should have put.
Thanks!
You need to reformulate your problem slightly, as you aren't actually grouping by a single key here: in your example above, you output multiple rows for "A". Below, I add a column that we can additionally group by (it increments every 3 records), and use collect_list, a Spark SQL function, to produce the arrays you are looking for. Note that by sticking entirely to Spark SQL, you get many optimisations from Spark (through Catalyst, its query optimiser).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val data = List(("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C")).toDF("key", "value")

val data2 = data.withColumn("index", floor(
  (row_number().over(Window.partitionBy("key").orderBy("value")) - 1) / 3)
)
data2.show
+---+-----+-----+
|key|value|index|
+---+-----+-----+
| B| C| 0|
| A| B| 0|
| A| C| 0|
| A| D| 0|
| A| E| 1|
+---+-----+-----+
data2.groupBy("key","index").agg(collect_list("value")).show
+---+-----+-------------------+
|key|index|collect_list(value)|
+---+-----+-------------------+
| B| 0| [C]|
| A| 0| [B, C, D]|
| A| 1| [E]|
+---+-----+-------------------+
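As a small follow-up sketch (assuming Spark 2.x with spark.implicits._ in scope), the aggregated column can be renamed and the result collected back as (key, values) pairs, roughly the shape of the original RDD-based output:

val grouped = data2.groupBy("key", "index")
  .agg(collect_list("value").as("values"))
  .select($"key", $"values")

// Collect as typed pairs; the ordering of the rows is not guaranteed.
grouped.as[(String, Seq[String])].collect()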

Get the number of nulls per row in a PySpark dataframe

This is probably a duplicate, but somehow I have been searching for a long time already:
I want to get the number of nulls per Row in a Spark dataframe. I.e.
col1  col2  col3
null  1     a
1     2     b
2     3     null
Should in the end be:
col1  col2  col3  number_of_null
null  1     a     1
1     2     b     0
2     3     null  1
In a general fashion, I want to get the number of times a certain string or number appears in a spark dataframe row.
I.e.
col1  col2  col3  number_of_ABC
ABC   1     a     1
1     2     b     0
2     ABC   ABC   2
I am using Pyspark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I seem not to be able to google this. :/
EDIT: Assume that I have so many columns that I can't list them all.
EDIT2: I explicitly don't want to have a pandas solution.
EDIT3: The solution explained with sums or means does not work as it throws errors:
(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
In Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val df = List(
  ("ABC", "1", "a"),
  ("1", "2", "b"),
  ("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")

val expected = "ABC"

val complexColumn: Column = df.schema.fieldNames
  .map(c => when(col(c) === lit(expected), 1).otherwise(0))
  .reduce((a, b) => a + b)

df.withColumn("countABC", complexColumn).show(false)
Output:
+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1 |a |1 |
|1 |2 |b |0 |
|2 |ABC |ABC |2 |
+----+----+----+--------+
As stated in pasha701's answer, I resort to map and reduce. Note that I am working on Spark 1.6.x and Python 2.7
Taking your DataFrame as df (and as is)
import pyspark.sql.functions as func

dfvals = [
    (None, "1", "a"),
    ("1", "2", "b"),
    ("2", None, None)
]
df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])

new_df = df.withColumn('null_cnt',
                       reduce(lambda x, y: x + y,
                              map(lambda x: func.when(func.isnull(func.col(x)), 1).otherwise(0),
                                  df.schema.names)))
Check if the value is Null and assign 1 or 0. Add the result to get the count.
new_df.show()
+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null| 1| a| 1|
| 1| 2| b| 0|
| 2|null|null| 2|
+----+----+----+--------+
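Back in Scala, a minimal variant of the same map/reduce idea (a sketch, assuming a DataFrame df with nullable columns; the ABC example above has none, so the counts would all be 0 there) that also sidesteps the "boolean and int" mismatch from EDIT3: cast the null indicator to an int before summing. Swapping the predicate for === lit("ABC") gives the occurrence count instead.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Sum of per-column null indicators, cast to int to avoid adding booleans to ints.
val nullCount: Column = df.schema.fieldNames
  .map(c => col(c).isNull.cast("int"))
  .reduce(_ + _)

df.withColumn("number_of_null", nullCount).show(false)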

Data filtering in Spark

I am trying to do a certain kind of filtering using Spark. I have a data frame that looks like the following:
ID  Property#1  Property#2  Property#3
--------------------------------------
01  a           b           c
01  a           X           c
02  d           e           f
03  i           j           k
03  i           j           k
I expect the properties for a given ID to be the same. In the example above, I would like to filter out the following:
ID  Property#2
--------------
01  b
01  X
Note that it is okay for IDs to be repeated in the data frame as long as the properties are the same (e.g. ID '03' in the first table). The code needs to be as efficient as possible, as I am planning to apply it to datasets with >10k rows. I tried extracting the distinct rows using the distinct function in the DataFrame API, grouping them on the ID column using groupBy, and aggregating the results using the countDistinct function, but unfortunately I couldn't get a working version of the code. Also, the way I implemented it seems to be quite slow. I was wondering if anyone could provide some pointers on how to approach this problem.
Thanks!
You can for example aggregate and join. First you'll have to create a lookup table:
import org.apache.spark.sql.functions._

val df = Seq(
  ("01", "a", "b", "c"), ("01", "a", "X", "c"),
  ("02", "d", "e", "f"), ("03", "i", "j", "k"),
  ("03", "i", "j", "k")
).toDF("id", "p1", "p2", "p3")

val lookup = df.distinct.groupBy($"id").count
Then filter the records:
df.join(broadcast(lookup), Seq("id"))
df.join(broadcast(lookup), Seq("id")).where($"count" !== 1).show
// +---+---+---+---+-----+
// | id| p1| p2| p3|count|
// +---+---+---+---+-----+
// | 01| a| b| c| 2|
// | 01| a| X| c| 2|
// +---+---+---+---+-----+
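An alternative sketch (my variant, not taken from the answer above) that avoids the join entirely: count the distinct property rows per id with a window over df.distinct and keep only the ids where that count exceeds one.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("id")

df.distinct
  .withColumn("cnt", count(lit(1)).over(w))
  .where($"cnt" > 1)
  .drop("cnt")
  .show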