Spark: Aggregating based on a column - Scala

I have a file consisting of 3 fields (Emp_ids, Groups, Salaries)
100 A 430
101 A 500
201 B 300
I want to get the following results:
1) Group name and count(*)
2) Group name and max(salary)
val myfile = "/home/hduser/ScalaDemo/Salary.txt"
val conf = new SparkConf().setAppName("Salary").setMaster("local[2]")
val sc= new SparkContext( conf)
val sal= sc.textFile(myfile)

Scala DSL:
case class Data(empId: Int, group: String, salary: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sqlContext.createDataFrame(sal.map { v =>
  val arr = v.split(' ').map(_.trim())
  Data(arr(0).toInt, arr(1), arr(2).toInt)
})
df.show()
+-----+-----+------+
|empId|group|salary|
+-----+-----+------+
| 100| A| 430|
| 101| A| 500|
| 201| B| 300|
+-----+-----+------+
df.groupBy($"group").agg(count("*") as "count").show()
+-----+-----+
|group|count|
+-----+-----+
| A| 2|
| B| 1|
+-----+-----+
df.groupBy($"group").agg(max($"salary") as "maxSalary").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
| A| 500|
| B| 300|
+-----+---------+
Or with plain SQL:
df.registerTempTable("salaries")
sqlContext.sql("select group, count(*) as count from salaries group by group").show()
+-----+-----+
|group|count|
+-----+-----+
| A| 2|
| B| 1|
+-----+-----+
sqlContext.sql("select group, max(salary) as maxSalary from salaries group by group").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
| A| 500|
| B| 300|
+-----+---------+
While Spark SQL is the recommended way to do such aggregations for performance reasons, it can easily be done with the RDD API as well:
val rdd = sc.parallelize(Seq(Data(100, "A", 430), Data(101, "A", 500), Data(201, "B", 300)))
rdd.map(v => (v.group, 1)).reduceByKey(_ + _).collect()
res0: Array[(String, Int)] = Array((B,1), (A,2))
rdd.map(v => (v.group, v.salary)).reduceByKey((s1, s2) => if (s1 > s2) s1 else s2).collect()
res1: Array[(String, Int)] = Array((B,300), (A,500))
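If you want both aggregates in a single pass over the RDD, a rough sketch using aggregateByKey (the (count, maxSalary) accumulator below is just one possible choice, not taken from the answer above) would be:
// accumulate (count, maxSalary) per group in one pass
rdd.map(v => (v.group, v.salary))
  .aggregateByKey((0, Int.MinValue))(
    (acc, s) => (acc._1 + 1, math.max(acc._2, s)),   // fold a salary into the accumulator
    (a, b) => (a._1 + b._1, math.max(a._2, b._2))    // merge accumulators across partitions
  )
  .collect()
// e.g. Array((B,(1,300)), (A,(2,500)))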

Related

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc "674b"
"xyz" "123a"
"abc" "783b"
I need the following output where each item in each column is converted into a unique key.
For example: abc -> 1, def -> 2, xyz -> 3
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hash code will be the same whenever the input value and data type are the same. Try this:
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>
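To keep the original column names in the result, the UDF calls can be aliased (a minor variation on the call above, not part of the original transcript):
df.select(myudf('col1).as("col1"), myudf('col2).as("col2")).show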
If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (since zipWithIndex always starts from 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames on the index keys:
val rdd = sc.parallelize(Seq(
("abc", "123a"),
("def", "783b"),
("abc", "674b"),
("xyz", "123a"),
("abc", "783b")
))
val df1 = rdd.map(_._1).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col1", "c1key")
val df2 = rdd.map(_._2).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col2", "c2key")
val dfJoined = rdd.toDF("col1", "col2").
join(df1, Seq("col1")).
join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc| 2| 1|
// |783b| def| 3| 1|
// |123a| xyz| 1| 2|
// |123a| abc| 2| 2|
// |674b| abc| 2| 3|
// +----+----+-----+-----+
dfJoined.
select($"c1key".as("col1"), $"c2key".as("col2")).
show
// +----+----+
// |col1|col2|
// +----+----+
// | 2| 1|
// | 3| 1|
// | 1| 2|
// | 2| 2|
// | 2| 3|
// +----+----+
Note that if you're okay with having the keys start from 0, the map(r => (r._1, r._2 + 1)) step can be skipped when generating df1 and df2.
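For example, a zero-based variant of the lookup tables (a sketch reusing the same rdd; the names df1ZeroBased and df2ZeroBased are just for illustration) would be:
val df1ZeroBased = rdd.map(_._1).distinct.zipWithIndex.toDF("col1", "c1key")
val df2ZeroBased = rdd.map(_._2).distinct.zipWithIndex.toDF("col2", "c2key")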

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following:
when col1 = X, store the values of col2 and col3,
and assign those column values to the next row where col1 = Y.
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated.
Note: Spark 1.6
Here's one approach using Window functions, with the following steps:
Add a row-identifying column (not needed if there already is one) and combine the non-key columns (presumably many of them) into a single struct column
Create tmp1 with conditional nulls and tmp2 using the last/rowsBetween Window function to back-fill with the last non-null value
Create newcols conditionally from cols and tmp2
Expand newcols back into individual columns using foldLeft
Note that this solution uses a Window function without partitioning, so it may not scale to large datasets.
val df = Seq(
("X", "x1", "x2"),
("Z", "z1", "z2"),
("Y", "", ""),
("X", "x3", "x4"),
("P", "p1", "p2"),
("Q", "q1", "q2"),
("Y", "", "")
).toDF("col1", "col2", "col3")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val colList = df.columns.filter(_ != "col1")
val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
struct(colList.map(col): _*).as("cols")
)
val df3 = df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
) )
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A workaround is to fall back to Spark SQL, which supports ignore-nulls in its last() method:
df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
createOrReplaceTempView("df2table")
// on Spark 1.x, use registerTempTable("df2table") instead
val df3 = sqlContext.sql("""
select col1, id, cols, tmp1, last(tmp1, true) over (
order by id rows between unbounded preceding and current row
) as tmp2
from df2table
""")
Yes, there is a lag function, which requires an ordering:
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame(
Seq(
Temp("A", Some("a1"), Some("a2")),
Temp("D", Some("d1"), Some("d2")),
Temp("B", Some("b1"), Some("b2")),
Temp("E", None, None),
Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

Updating a column of a dataframe based on another column value

I'm trying to update the value of a column using another column's value in Scala.
This is the data in my dataframe :
+-------------------+------+------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+------+------+-----+------+----+--------------------+-----------+
| 1| 0| 0| Name| 0|Desc| | 0|
| 2| 2.11| 10000|Juice| 0| XYZ|2016/12/31 : Inco...| 0|
| 3|-0.500|-24.12|Fruit| -255| ABC| 1994-11-21 00:00:00| 0|
| 4| 0.087| 1222|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5| 0.087| 1222|Bread|-22.06| | | 0|
+-------------------+------+------+-----+------+----+--------------------+-----------+
Here column _c5 contains a value which is incorrect (the value in row 2 contains the string Incorrect), based on which I'd like to update its isBadRecord field to 1.
Is there a way to update this field?
You can use the withColumn API with a function that meets your needs to fill in 1 for bad records.
For your case you can write a udf function:
import org.apache.spark.sql.functions.udf
def fillbad = udf((c5: String) => if (c5.contains("Incorrect")) 1 else 0)
And call it as
val newDF = dataframe.withColumn("isBadRecord", fillbad(dataframe("_c5")))
Instead of reasoning about updating it, I suggest you think about it as you would in SQL; you could do the following:
import org.apache.spark.sql.functions.when
val spark: SparkSession = ??? // your spark session
val df: DataFrame = ??? // your dataframe
import spark.implicits._
df.select(
$"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
$"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
Here is a self-contained script that you can copy and paste into your Spark shell to see the result locally:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.when
sc.setLogLevel("ERROR")
val schema =
StructType(Seq(
StructField("UniqueRowIdentifier", IntegerType),
StructField("_c0", DoubleType),
StructField("_c1", DoubleType),
StructField("_c2", StringType),
StructField("_c3", DoubleType),
StructField("_c4", StringType),
StructField("_c5", StringType),
StructField("isBadRecord", IntegerType)))
val contents =
Seq(
Row(1, 0.0 , 0.0 , "Name", 0.0, "Desc", "", 0),
Row(2, 2.11 , 10000.0 , "Juice", 0.0, "XYZ", "2016/12/31 : Incorrect", 0),
Row(3, -0.5 , -24.12, "Fruit", -255.0, "ABC", "1994-11-21 00:00:00", 0),
Row(4, 0.087, 1222.0 , "Bread", -22.06, "", "2017-02-14 00:00:00", 0),
Row(5, 0.087, 1222.0 , "Bread", -22.06, "", "", 0)
)
val df = spark.createDataFrame(sc.parallelize(contents), schema)
df.show()
val withBadRecords =
df.select(
$"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
$"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
withBadRecords.show()
Its relevant output is the following:
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 0|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 1|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
The best option is to create a UDF that tries to convert the value to a Date format.
If it can be converted, return 0; otherwise return 1.
This works even if you have a bad date format.
import java.text.SimpleDateFormat
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
val spark = SparkSession.builder().master("local")
  .appName("test").getOrCreate()
import spark.implicits._
//create test dataframe
val data = spark.sparkContext.parallelize(Seq(
(1,"1994-11-21 Xyz"),
(2,"1994-11-21 00:00:00"),
(3,"1994-11-21 00:00:00")
)).toDF("id", "date")
// create udf which tries to convert to date format
// returns 0 if success and returns 1 if failure
val check = udf((value: String) => {
Try(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(value)) match {
case Success(d) => 0
case Failure(e) => 1
}
})
// Add column
data.withColumn("badData", check($"date")).show
Hope this helps!

Combining RDD's with some values missing

Hi, I have two RDDs that I want to combine into one.
The first RDD is of the format
//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
I have another RDD
//((UserID,MovID),"NA")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))
So the second RDD has more keys, but they overlap with those in RDD1. I need to combine the RDDs so that only those keys of the 2nd RDD that are not already in RDD1 are appended to RDD1.
You can do something like this -
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Setting up the rdds as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)
val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))
val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))
val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)
// Convert to Dataframe so it is easier to handle
val df1 = rdd1.toDF
val df2 = rdd2.toDF
// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
+----+---+----+
df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| -1|
| U5| M4| 3|
| U6| M6| -1|
| U3| M2| -1|
| U4| M2| -1|
| U4| M3| 5|
+----+---+----+
// Figure out the extra reviews in second dataframe that do not match (user, mov) in first
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")
// Union them. Be careful because of this: http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).union(b.select(columns: _*))
}
// Final result of combining only unique values in df2
unionByName(df1, xtraReviews).show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
| U5| M4| 3|
| U4| M3| 5|
| U6| M6| -1|
+----+---+----+
It might also be possible to do it in this way:
RDDs are comparatively slow, so read your data into DataFrames or convert it to DataFrames.
Use Spark's dropDuplicates() on both DataFrames, e.g. df.dropDuplicates(Seq("Key1", "Key2")), to get distinct values on the keys in each DataFrame, and then
simply union them like df1.union(df2).
The benefit is that you are doing it the Spark way and hence get all the parallelism and speed.
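As a rough sketch of that idea (reusing df1 and df2 from the answer above, and applying dropDuplicates after the union so that at most one row survives per (user, mov) key):
// Note: dropDuplicates keeps an arbitrary row per key, so this does not
// guarantee that df1's real rating wins over df2's -1 placeholder.
val combined = df1.union(df2).dropDuplicates(Seq("user", "mov"))
combined.show()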

How to map adjacent elements in scala

I have an RDD[String] in device,timestamp,on/off format. How do I calculate the amount of time each device is switched on? What is the best way of doing this in Spark?
on means 1 and off means 0.
E.g.
A,1335952933,1
A,1335953754,0
A,1335994294,1
A,1335995228,0
B,1336001513,1
B,1336002622,0
B,1336006905,1
B,1336007462,0
Intermediate step 1
A,((1335953754 - 1335952933),(1335995228 - 1335994294))
B,((1336002622- 1336001513),(1336007462 - 1336006905))
Intermediate step 2
(A,(821,934))
(B,(1109,557))
output
(A,1755)
(B,1666)
I'll assume that the RDD[String] can be parsed into an RDD of DeviceLog, where DeviceLog is:
case class DeviceLog(val id: String, val timestamp: Long, val onoff: Int)
The DeviceLog class is pretty straightforward.
// initialize contexts
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
Those initialize the Spark context and SQL context that we'll use for DataFrames.
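A hypothetical parse of the raw "device,timestamp,on/off" lines into DeviceLog (assuming an RDD[String] named rawLines, e.g. from sc.textFile; the steps below use a hard-coded list instead) might look like:
val deviceLogs = rawLines.map { line =>
  val Array(id, ts, onoff) = line.split(",").map(_.trim)   // e.g. "A,1335952933,1"
  DeviceLog(id, ts.toLong, onoff.toInt)
}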
Step 1:
val input = List(
DeviceLog("A",1335952933,1),
DeviceLog("A",1335953754,0),
DeviceLog("A",1335994294,1),
DeviceLog("A",1335995228,0),
DeviceLog("B",1336001513,1),
DeviceLog("B",1336002622,0),
DeviceLog("B",1336006905,1),
DeviceLog("B",1336007462,0))
val df = input.toDF()
df.show()
+---+----------+-----+
| id| timestamp|onoff|
+---+----------+-----+
| A|1335952933| 1|
| A|1335953754| 0|
| A|1335994294| 1|
| A|1335995228| 0|
| B|1336001513| 1|
| B|1336002622| 0|
| B|1336006905| 1|
| B|1336007462| 0|
+---+----------+-----+
Step 2: Partition by device id, order by timestamp and retain pair information (on/off)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, sum}
val wSpec = Window.partitionBy("id").orderBy("timestamp")
val df1 = df
.withColumn("spend", lag("timestamp", 1).over(wSpec))
.withColumn("one", lag("onoff", 1).over(wSpec))
.where($"spend" isNotNull)
df1.show()
+---+----------+-----+----------+---+
| id| timestamp|onoff| spend|one|
+---+----------+-----+----------+---+
| A|1335953754| 0|1335952933| 1|
| A|1335994294| 1|1335953754| 0|
| A|1335995228| 0|1335994294| 1|
| B|1336002622| 0|1336001513| 1|
| B|1336006905| 1|1336002622| 0|
| B|1336007462| 0|1336006905| 1|
+---+----------+-----+----------+---+
Step 3: Compute upTime and filter by criteria
val df2 = df1
.withColumn("upTime", $"timestamp" - $"spend")
.withColumn("criteria", $"one" - $"onoff")
.where($"criteria" === 1)
df2.show()
+---+----------+-----+----------+---+------+--------+
| id| timestamp|onoff|     spend|one|upTime|criteria|
+---+----------+-----+----------+---+------+--------+
| A|1335953754| 0|1335952933| 1| 821| 1|
| A|1335995228| 0|1335994294| 1| 934| 1|
| B|1336002622| 0|1336001513| 1| 1109| 1|
| B|1336007462| 0|1336006905| 1| 557| 1|
+---+----------+-----+----------+---+------+--------+
Step 4: group by id and sum
val df3 = df2.groupBy($"id").agg(sum("upTime"))
df3.show()
+---+-----------+
| id|sum(upTime)|
+---+-----------+
| A| 1755|
| B| 1666|
+---+-----------+
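For reference, roughly the same computation can also be sketched directly on the RDD (a hedged alternative to the DataFrame steps above, assuming the deviceLogs RDD parsed earlier and that each device's log is small enough to group in memory):
deviceLogs
  .map(d => (d.id, (d.timestamp, d.onoff)))
  .groupByKey()                        // acceptable here only because per-device logs are small
  .mapValues { events =>
    events.toSeq.sortBy(_._1)          // order by timestamp
      .sliding(2)                      // adjacent pairs
      .collect { case Seq((tOn, 1), (tOff, 0)) => tOff - tOn }
      .sum
  }
  .collect()                           // e.g. Array((A,1755), (B,1666))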