Given df as below:
val df = spark.createDataFrame(Seq(
(1, 2, 3),
(3, 2, 1)
)).toDF("One", "Two", "Three")
with schema:
root
 |-- One: integer (nullable = false)
 |-- Two: integer (nullable = false)
 |-- Three: integer (nullable = false)
I would like to write a UDF that takes the three columns as input and returns a new column based on the highest input value, similar to the below:
import org.apache.spark.sql.functions.udf
def udfScoreToCategory=udf((One: Int, Two: Int, Three: Int): Int => {
cols match {
case cols if One > Two && One > Three => 1
case cols if Two > One && Two > Three => 2
case _ => 0
}}
It would be interesting to see how to do something similar with a vector type as input:
import org.apache.spark.ml.linalg.Vector
def udfVectorToCategory=udf((cols:org.apache.spark.ml.linalg.Vector): Int => {
cols match {
case cols if cols(0) > cols(1) && cols(0) > cols(2) => 1,
case cols if cols(1) > cols(0) && cols(1) > cols(2) => 2
case _ => 0
}})
Some problems:
cols in the first example is not in scope.
(...): T => ... is not valid syntax for an anonymous function.
It is better to use val over def here.
One way to define this:
val udfScoreToCategory = udf[Int, Int, Int, Int] { (one, two, three) =>
  (one, two, three) match {
    case _ if one > two && one > three => 1
    case _ if two > one && two > three => 2
    case _ => 0
  }
}
and
val udfVectorToCategory = udf[Int, org.apache.spark.ml.linalg.Vector]{
_.toArray match {
case Array(one, two, three) if one > two && one > three => 1
case Array(one, two, three) if two > one && two > three => 2
case _ => 0
}}
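As a rough usage sketch (the vector variant assumes some DataFrame vecDf with a Vector column named "features", which is a hypothetical name, not from the question):
// Three-column variant applied to the df from the question
df.withColumn("category", udfScoreToCategory($"One", $"Two", $"Three")).show()
// Vector variant, assuming a hypothetical vecDf with a Vector column "features"
// vecDf.withColumn("category", udfVectorToCategory($"features")).show()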
In general, for the first case you should use `when`:
import org.apache.spark.sql.functions.when
when ($"one" > $"two" && $"one" > $"three", 1)
.when ($"two" > $"one" && $"two" > $"three", 2)
.otherwise(0)
where one, two, three are column names.
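Put together on the df from the question, this looks roughly like the sketch below (the output column name "category" is just an illustrative choice):
import org.apache.spark.sql.functions.when

val withCategory = df.withColumn(
  "category",
  when($"One" > $"Two" && $"One" > $"Three", 1)
    .when($"Two" > $"One" && $"Two" > $"Three", 2)
    .otherwise(0)
)
withCategory.show()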
I was able to find the biggest element of the vector by:
val vectorToCluster = udf{ (x: Vector) => x.argmax }
However, I am still puzzled about how to do pattern matching on multiple column values.
Related
I need to perform a comparison operation (like greater than or less than) on two columns, each of which is a list with n values (the values are timestamps), and my result should also be a list.
How can I do this operation?
Input:
Date1                                           Date2
["2016-11-24 12:06:47"]                         ["2017-10-04 03:30:23"]
["null"]                                        []
["2017-01-25 10:07:25","2018-01-25 10:07:25"]   ["2017-09-15 03:30:16","2017-09-15 03:30:16"]
Output should be:
Result
["Less"]
["Not Okay"]
["Less","Great"]
I need to perform comparison operation
It seems you are looking for the .compareTo method:
scala> "a".compareTo("b")
res: Int = -1
scala> "a".compareTo("a")
res: Int = 0
scala> "b".compareTo("a")
res: Int = 1
Using the first example mentioned:
val date1 = "2016-11-24 12:06:47"
val date2 = "2017-10-04 03:30:23"
scala> date1.compareTo(date2)
res: Int = -1
If we ignore for a moment the "Not Okay" case, we could implement the "Less" or "Great" cases with a function like:
def compareLexicographically(s1: String, s2: String): String = s1.compareTo(s2) match {
  case n if n < 0 => "Less"
  case _ => "Great"
}
(compareTo only guarantees the sign of the result, so we match on that rather than on -1.)
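For example, with the two dates above:
compareLexicographically(date1, date2) // "Less"
compareLexicographically(date2, date1) // "Great"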
Looking at your example, I am assuming the rows are tuples of lists of Strings:
val rows: List[(List[String], List[String])] =
List((
List("2016-11-24 12:06:47"),
List("2017-10-04 03:30:23")
),
(
List("2017-01-25 10:07:25", "2018-01-25 10:07:25"),
List("2017-09-15 03:30:16", "2017-09-15 03:30:16")
))
I would first zip the elements from the two columns to get a List[(String, String)]:
rows.flatMap(r => r._1.zip(r._2))
Then simply map with compareLexicographically:
scala> rows.flatMap(r => r._1.zip(r._2)).map((compareLexicographically _).tupled)
res: List[String] = List(Less, Less, Great)
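Since the original data lives in DataFrame columns of array<string>, the same logic could be wrapped in a Spark UDF. This is a hedged sketch (not from the original answer), assuming a DataFrame df with columns "Date1" and "Date2" and treating null, empty or "null" entries as the "Not Okay" case:
import org.apache.spark.sql.functions.udf

val compareListsUdf = udf { (d1: Seq[String], d2: Seq[String]) =>
  if (d1 == null || d2 == null || d1.isEmpty || d2.isEmpty || d1.contains("null") || d2.contains("null"))
    Seq("Not Okay")
  else
    d1.zip(d2).map { case (a, b) => if (a.compareTo(b) < 0) "Less" else "Great" }
}

// df.withColumn("Result", compareListsUdf($"Date1", $"Date2"))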
I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is the day and the following characters are events.
I have to calculate the day on which each event has its maximum number of occurrences.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A is repeated 20 times on day 2 and B 10 times on day 3.)
I am splitting the original dataset:
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What could be the best implementation for count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.rdd.RDD
val rdd = sqlContext.sparkContext.makeRDD(Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
))
val keys = Seq("A", "B")
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map{str =>
val split = str.split(":")
(s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}
keys.map{key => {
key -> seqOfMaps.mapValues(_.get(key).get).sortBy(a => -a._2).first._1
}}.toMap
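// On the sample data above this evaluates to Map(A -> Day2, B -> Day3),
// since A peaks on day 2 (20 occurrences) and B on day 3 (10).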
The processing that needs to be done consists in transforming the data into an RDD on which it is easy to apply functions like "find the maximum for a list".
I will try to explain it step by step.
I've used dummy data for the "A" and "B" chars.
The foo rdd is the first step; it will give you an RDD[(String, Array[String])].
Let's count each char in the Array[String]:
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we flatMap over the values to expand our rdd by char:
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the tuples so they read as (char, count, day):
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we group by char:
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the maximum count for each char:
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
Hope this helps.
Previous answers are good, but I prefer a solution like this:
val data = Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
val parts = s.split(":")
val charCount = parts(1).groupBy(i => i).mapValues(_.length)
charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)
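If the exact shape from the question is wanted (a map from event to day), one more small step could be added on top of result; this is a sketch, not part of the original answer:
// Collect the (day, char, count) winners into e.g. Map(A -> Day2, B -> Day3)
val byEvent = result
  .map { case (day, ch, _) => (ch.toString, s"Day$day") }
  .collect()
  .toMap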
I have an RDD[Log] with various fields (username, content, date, bytes) and I want to find different things for each field/column.
For example, I want to get the min/max and average bytes found in the RDD. When I do:
val q1 = cleanRdd.filter(x => x.bytes != 0)
I get the full lines of the RDD with bytes != 0. But how can I actually sum them, calculate the avg, find the min/max etc.? How can I take only one column from my RDD and apply transformations to it?
EDIT: Prasad told me about changing the type to a DataFrame, but he gave no instructions on how to do so, and I can't find a solid answer on the site. Any help would be great.
EDIT: Log class:
case class Log (username: String, date: String, status: Int, content: Int)
Using cleanRdd.take(5).foreach(println) gives something like this:
Log(199.72.81.55 ,01/Jul/1995:00:00:01 -0400,200,6245)
Log(unicomp6.unicomp.net ,01/Jul/1995:00:00:06 -0400,200,3985)
Log(199.120.110.21 ,01/Jul/1995:00:00:09 -0400,200,4085)
Log(burger.letters.com ,01/Jul/1995:00:00:11 -0400,304,0)
Log(199.120.110.21 ,01/Jul/1995:00:00:11 -0400,200,4179)
Well... you have a lot of questions.
So... you have the following abstraction of a Log
case class Log (username: String, date: String, status: Int, content: Int, bytes: Int)
Que - How can I take only one column from my RDD?
Ans - You have a map function on RDDs. So for an RDD[A], map takes a function of type A => B and transforms it into an RDD[B].
val logRdd: RDD[Log] = ...
val byteRdd = logRdd
.filter(l => l.bytes != 0)
.map(l => l.bytes)
Que - How can I actually sum them?
Ans - You can do it by using reduce / fold / aggregate.
val sum = byteRdd.reduce((acc, b) => acc + b)
val sum = byteRdd.fold(0)((acc, b) => acc + b)
val sum = byteRdd.aggregate(0)(
(acc, b) => acc + b,
(acc1, acc2) => acc1 + acc2
)
Note: An important thing to notice here is that a sum of Ints can grow bigger than what an Int can handle, so in most real-life cases we should use at least a Long as our accumulator instead of an Int. Since reduce and fold require the accumulator to have the same type as the RDD's elements, that effectively rules them out, and we are left with aggregate only.
val sum = byteRdd.aggregate(0L)(
(acc, b) => acc + b,
(acc1, acc2) => acc1 + acc2
)
Now, if you have to calculate multiple things like min, max and avg, I suggest calculating them in a single aggregate instead of multiple passes, like this:
// (count, sum, min, max)
val accInit = (0, 0, Int.MaxValue, Int.MinValue)
val (count, sum, min, max) = byteRdd.aggregate(accInit)(
{ case ((count, sum, min, max), b) =>
(count + 1, sum + b, Math.min(min, b), Math.max(max, b)) },
{ case ((count1, sum1, min1, max1), (count2, sum2, min2, max2)) =>
(count1 + count2, sum1 + sum2, Math.min(min1, min2), Math.max(max1, max2)) }
)
val avg = sum.toDouble / count
Have a look at the DataFrame API. You need to convert your RDD to a DataFrame, and then you can use the min, max and avg functions like below:
val rdd = cleanRdd.filter(x => x.bytes != 0)
val df = sparkSession.createDataFrame(rdd)
Assuming you want to operate on the bytes column:
import org.apache.spark.sql.functions._
df.select(avg("bytes")).show
df.select(min("bytes")).show
df.select(max("bytes")).show
Update:
Tried the following in spark-shell:
case class Log (username: String, date: String, status: Int, content: Int)
val inputRDD = sc.parallelize(Seq(
  Log("199.72.81.55", "01/Jul/1995:00:00:01 -0400", 200, 6245),
  Log("unicomp6.unicomp.net", "01/Jul/1995:00:00:06 -0400", 200, 3985),
  Log("199.120.110.21", "01/Jul/1995:00:00:09 -0400", 200, 4085),
  Log("burger.letters.com", "01/Jul/1995:00:00:11 -0400", 304, 0),
  Log("199.120.110.21", "01/Jul/1995:00:00:11 -0400", 200, 4179)))
val rdd = inputRDD.filter(x => x.content != 0)
val df = rdd.toDF("username", "date", "status", "content")
df.printSchema
import org.apache.spark.sql.functions._
df.select(avg("content")).show
df.select(min("content")).show
df.select(max("content")).show
Let's say I want to print duplicates in a list with their count. So I have 3 options as shown below:
def dups(dup:List[Int]) = {
//1)
println(dup.groupBy(identity).collect { case (x,ys) if ys.lengthCompare(1) > 0 => (x,ys.size) }.toSeq)
//2)
println(dup.groupBy(identity).collect { case (x, List(_, _, _*)) => x }.map(x => (x, dup.count(y => x == y))))
//3)
println(dup.distinct.map((a:Int) => (a, dup.count((b:Int) => a == b )) ).filter( (pair: (Int,Int) ) => { pair._2 > 1 } ))
}
Questions:
-> For option 2, is there any way to name the list parameter so that it can be used to append the size of the list just like I did in option 1 using ys.size?
-> For option 1, is there any way to avoid the last call to toSeq to return a List?
-> Which one of the 3 choices is the most efficient, i.e. uses the fewest passes over the list?
As an example input: List(1,1,1,2,3,4,5,5,6,100,101,101,102)
Should print: List((1,3), (5,2), (101,2))
Based on @lutzh's answer below, the best way would be to do the following:
import scala.collection.breakOut

val list: List[(Int, Int)] = dup.groupBy(identity).collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) })(breakOut)
val list2: List[(Int, Int)] = dup.groupBy(identity).collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }(breakOut)
For option 1, is there any way to avoid the last call to toSeq to return a List?
collect takes a CanBuildFrom, so if you assign it to something of the desired type you can use breakOut:
import collection.breakOut
val dups: List[(Int,Int)] =
dup
.groupBy(identity)
.collect({ case (x,ys) if ys.size > 1 => (x,ys.size)} )(breakOut)
collect will create a new collection (just like map), using a Builder. Usually the return type is determined by the origin type. With breakOut you basically ignore the origin type and look for a builder for the result type. So when collect creates the resulting collection, it will already create the "right" type, and you don't have to traverse the result again to convert it.
For option 2, is there any way to name the list parameter so that it can be used to append the size of the list just like I did in option 1 using ys.size?
Yes, you can bind it to a variable with @:
val dups: List[(Int,Int)] =
dup
.groupBy(identity)
.collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) } )(breakOut)
which one of the 3 choices is more efficient?
Calling dup.count for each match seems inefficient, as dup needs to be traversed again for every distinct element; I'd avoid that.
My guess would be that the guard (if ys.lengthCompare(1) > 0) takes a few cycles less than the List(_, _, _*) pattern, but I haven't measured. And am not planning to.
Disclaimer: There may be a completely different (and more efficient) way of doing it that I can't think of right now. I'm only answering your specific questions.
I have a cartesian RDD which allows me to filter an RDD on a certain time range, but I need to get the minimum value of the RDD so I can calculate the delta time between each record and the entry that occurred first.
I have a case class that is made up like the below:
case class auction(id: String, prodID: String, timestamp: Long)
and I put together two RDDs, one that contains the auction of note, the other containing the auctions that occurred in that time period, as below:
val specificmessages = allauctions.cartesian(winningauction)
  .filter { case (x, y) => x.timestamp > y.timestamp - 10 &&
                           x.timestamp < y.timestamp + 10 &&
                           x.prodID == y.prodID }
I would like to add a field to specificmessages that contains the delta between each record's timestamp and the minimum auction timestamp.
You can use DataFrames like this:
import org.apache.spark.sql.{functions => f}
import org.apache.spark.sql.expressions.Window
// Convert RDDs to DFs
val allDF = allauctions.toDF
val winDF = winningauction.toDF("winId", "winProdId", "winTimestamp")
// Prepare join conditions
val prodCond = $"prodID" === $"winProdID"
val tsCond = f.abs($"timestamp" - $"winTimestamp") < 10
// Create window
val w = Window
.partitionBy($"id", $"prodID", $"timestamp")
.orderBy($"winTimestamp")
val joined = allDF
.join(winDF, prodCond && tsCond)
.select($"*", first($"winTimestamp").over(w).alias("mintimestamp")
Using plain RDDs
// Create PairRDDs
def allPairs = allauctions.map(a => (a.prodID, a))
def winPairs = winningauction.map(a => (a.prodID, a))
allPairs
.join(winPairs) // Join by prodId -> RDD[(prodID, (auction, auction))]
// Keep only pairs whose timestamps are within the 10-unit window
.filter{case (_, (x, y)) => (x.timestamp - y.timestamp).abs < 10}
.values // Drop key -> RDD[(auction, auction)]
.groupByKey // Group by the allauctions record -> RDD[(auction, Iterable[auction])]
.flatMap{ case (k, vals) => {
val minTs = vals.map(_.timestamp).min // Find min ts from winauction
vals.map(v => (k, v, minTs))
}} // -> RDD[(auction, auction, ts)]
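As with the DataFrame version, the delta the question asks for could then be derived from these (auction, auction, minTs) triples. A small sketch, assuming the chain above is bound to a val named withMin (a hypothetical name, not in the original answer):
// delta between each record's timestamp and the minimum matching timestamp in its group
val withDelta = withMin.map { case (a, w, minTs) => (a, w, a.timestamp - minTs) }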