Compare two columns which are lists of n values and save the data as a list in Scala

I need to perform a comparison operation (like greater than or less than) on two columns, each of which is a list of n values (the values are timestamps), and my result should also be a list.
How can I do this operation?
Input:
Date1 Date2
["2016-11-24 12:06:47"] ["2017-10-04 03:30:23"]
["null"] []
["2017-01-25 10:07:25","2018-01-25 10:07:25"] ["2017-09-15 03:30:16","2017-09-15 03:30:16"]
Output should be:
Result
["Less"]
["Not Okay"]
["Less","Great"]

"I need to perform comparison operation"
It seems you are looking for the .compareTo method:
scala> "a".compareTo("b")
res: Int = -1
scala> "a".compareTo("a")
res: Int = 0
scala> "b".compareTo("a")
res: Int = 1
Using the first example mentioned:
val date1 = "2016-11-24 12:06:47"
val date2 = "2017-10-04 03:30:23"
scala> date1.compareTo(date2)
res: Int = -1
If we ignore the "Not Okay" case for a moment, we could implement the "Less" / "Great" cases with a small function. Note that compareTo returns the difference of the first differing characters (any negative or positive Int, not just -1 or 1), so test the sign rather than matching on -1:
def compareLexicographically(s1: String, s2: String): String =
  if (s1.compareTo(s2) < 0) "Less" else "Great"
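Applied to the pair defined earlier (a quick check with the sign-based version above):
scala> compareLexicographically(date1, date2)
res: String = Less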
Looking at your example, I am assuming the rows are tuples of lists of Strings:
val rows: List[(List[String], List[String])] = List(
  (
    List("2016-11-24 12:06:47"),
    List("2017-10-04 03:30:23")
  ),
  (
    List("2017-01-25 10:07:25", "2018-01-25 10:07:25"),
    List("2017-09-15 03:30:16", "2017-09-15 03:30:16")
  )
)
I would first zip the elements from the columns to get List[(String, String)]
rows.flatMap(r => r._1.zip(r._2))
Then simply map with compareLexicographically:
scala> rows.flatMap(r => r._1.zip(r._2)).map((compareLexicographically _).tupled)
res: List[String] = List(Less, Less, Great)
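Note that this flattens all rows into a single list. A sketch of my own (not part of the original answer) that keeps the per-row structure and covers the question's "Not Okay" case, under the assumption that a "null" entry or mismatched list sizes should yield "Not Okay":
def compareRow(d1: List[String], d2: List[String]): List[String] =
  // assumption: "null" entries or unequal sizes mean the row cannot be compared
  if (d1.size != d2.size || d1.contains("null") || d2.contains("null")) List("Not Okay")
  else d1.zip(d2).map((compareLexicographically _).tupled)

scala> rows.map { case (d1, d2) => compareRow(d1, d2) }
res: List[List[String]] = List(List(Less), List(Less, Great))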

Related

reduceByKey RDD spark scala

I have an RDD[(String, String, Int)] and want to use reduceByKey to get the result shown below. I don't want to convert it to a DataFrame and then perform a groupBy to get the result. The Int is a constant, always 1.
Is it possible to use reduceByKey here to get the result? Presenting it in tabular format for easy reading.
Question:

String    String         Int
First     Apple          1
Second    Banana         1
First     Flower         1
Third     Tree           1

Result:

String    String         Int
First     Apple,Flower   2
Second    Banana         1
Third     Tree           1
You cannot use reduceByKey directly if you have a Tuple3; reduceByKey is only available on pair RDDs such as RDD[(String, String)].
Also, once you groupBy, you could in principle call reduceByKey on the grouped result, but since the keys are unique after grouping it makes no sense; instead we use map to transform each group one to one.
So, assuming df is your main RDD of (String, String, Int) tuples, this piece of code:
val rest = df.groupBy(x => x._1).map(x => {
  val key = x._1          // ex: First
  val groupedData = x._2  // ex: [(First, Apple, 1), (First, Flower, 1)]
  // ex: [(First, Apple, 1), (First, Flower, 1)] => [Apple, Flower] => Apple, Flower
  val concat = groupedData.map(d => d._2).mkString(",")
  // ex: [(First, Apple, 1), (First, Flower, 1)] => [1, 1] => 2
  val sum = groupedData.map(d => d._3).sum
  (key, concat, sum)      // return a Tuple3 again, same format
})
Returns this result:
(Second,Banana,1)
(First,Apple,Flower,2)
(Third,Tree,1)
EDIT
Using reduceByKey without the sum: if your dataset looks like:
import org.apache.spark.rdd.RDD

val data = List(
  ("First", "Apple"),
  ("Second", "Banana"),
  ("First", "Flower"),
  ("Third", "Tree")
)
val df: RDD[(String, String)] = sparkSession.sparkContext.parallelize(data)
Then, this:
df.reduceByKey((acc, next) => acc + "," + next)
Gives this:
(First,Apple,Flower)
(Second,Banana)
(Third,Tree)
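If you also need the count, here is a sketch of my own (not part of the original answer) that re-keys the Tuple3 as (key, (name, count)), so a single reduceByKey both concatenates the names and sums the counts:
// assumes the same sparkSession as above
val triples: RDD[(String, String, Int)] = sparkSession.sparkContext.parallelize(List(
  ("First", "Apple", 1), ("Second", "Banana", 1),
  ("First", "Flower", 1), ("Third", "Tree", 1)))

val reduced = triples
  .map { case (k, name, n) => (k, (name, n)) }            // key by the first column
  .reduceByKey { case ((names1, n1), (names2, n2)) =>     // combine names and sum counts
    (names1 + "," + names2, n1 + n2)
  }
  .map { case (k, (names, n)) => (k, names, n) }          // back to a Tuple3

// reduced.collect() => Array((First,Apple,Flower,2), (Second,Banana,1), (Third,Tree,1)), order may vary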
Good luck!

scala - reset to 1 when the duplicated value changes in a list

I'm trying to generate sequence numbers on duplicated elements. It should reset to 1 when the value changes.
val dt = List("date", "date", "decimal", "decimal", "decimal", "string", "string")
var t = 0
dt.sorted.map( x => {t=t+1; (x,t)} )
This gives result as
List((date,1), (date,2), (decimal,3), (decimal,4), (decimal,5), (string,6), (string,7))
But what I expect is to get it as
List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))
How do I change the value of t back to 0 when the value changes in my list?
Are there better methods to get the above output?
The best method to use for this is scanLeft which is like foldLeft but emits a value at each step. The code looks like this:
val ds = dt.sorted
ds.tail.scanLeft((ds.head, 1)) {
  case ((prev, n), cur) if prev == cur => (cur, n + 1)
  case (_, cur) => (cur, 1)
}
At each step it increments the count if the value is the same as the previous, otherwise it resets it to 1.
This will work if the list has a single element. Although tail will be Nil, the first element in the result of scanLeft is always the first parameter to the method, in this case (ds.head, 1).
This will not work if the list is empty, as ds.head will throw an exception. This can be fixed by using a match first:
ds match {
  case head :: tail =>
    tail.scanLeft((head, 1)) {
      case ((prev, n), cur) if prev == cur => (cur, n + 1)
      case (_, cur) => (cur, 1)
    }
  case _ => Nil
}
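For the sample list above, both versions produce the expected sequence (my own check of the same code):
List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))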
To reset the counter you need to look back at the previous element, which .map() can't do.
dt.foldLeft(List.empty[(String, Int)]) { case (lst, str) =>
  lst.headOption.fold((str, 1) :: Nil) {
    case (`str`, cnt) => (str, cnt + 1) :: lst
    case _            => (str, 1) :: lst
  }
}.reverse
//res0: List[(String, Int)] = List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))
explanation
foldLeft - consider the dt elements, one at a time, left to right
List.empty[(String,Int)] - we'll build a List of tuples, start with an empty list
case (lst,str) - the list we're building and the current String element from dt
lst.headOption - get the head of the list if it exists
fold((str,1)::Nil) - if lst is empty return a new list with a single element
case (str,cnt) - if the head string element is the same as the current dt element
(str,cnt+1) :: lst - add a new element, with incremented count, to the list
case _ - head string element is different from the current dt element
(str,1) :: lst - add a new element, with count = 1, to the list
.reverse - we've built the results in reverse order, reverse it
Hope this helps.
scala> val dt = List("date", "date", "decimal", "decimal", "decimal", "string", "string")
dt: List[String] = List(date, date, decimal, decimal, decimal, string, string)
scala> val dtset = dt.toSet
dtset: scala.collection.immutable.Set[String] = Set(date, decimal, string)
scala> dtset.map( x => dt.filter( y => y == x))
res41: scala.collection.immutable.Set[List[String]] = Set(List(date, date), List(decimal, decimal, decimal), List(string, string))
scala> dtset.map( x => dt.filter( y => y == x)).flatMap(a => a.zipWithIndex)
res42: scala.collection.immutable.Set[(String, Int)] = Set((string,0), (decimal,1), (decimal,0), (string,1), (date,0), (date,1), (decimal,2))
scala> dtset.map( x => dt.filter( y => y == x)).flatMap(a => a.zipWithIndex).toList
res43: List[(String, Int)] = List((string,0), (decimal,1), (decimal,0), (string,1), (date,0), (date,1), (decimal,2))
Note that these indices are 0-based and the Set loses the original order, so you still need to add 1 and sort this list to your needs.
By adding one more mutable string variable to track the previous value, the below also works:
val dt = List("date", "date", "decimal", "decimal", "decimal", "string","string")
var t = 0
var s = ""
val dt_seq = dt.sorted.map { x => t = if (s != x) 1 else t + 1; s = x; (x, t) }
Results:
dt_seq: List[(String, Int)] = List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))
Another way is to use groupBy(identity) and get the indices from the map values:
val dt = List("date", "date", "decimal", "decimal", "decimal", "string","string")
val dtg = dt.groupBy(identity).map( x => (x._2 zip x._2.indices.map(_+1)) ).flatten.toList
which results in
dtg: List[(String, Int)] = List((decimal,1), (decimal,2), (decimal,3), (date,1), (date,2), (string,1), (string,2))
Thanks to @Leo: instead of indices, you can use Stream from 1 with zip, which gives the same results.
val dtg = dt.groupBy(identity).map( x => (x._2 zip (Stream from 1)) ).flatten.toList
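Since groupBy returns an unordered Map, the flattened list can come out in any group order. A small addition of my own (not in the original answer) to restore the sorted order shown in the expected output:
val ordered = dtg.sorted
// => List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))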

RDD/Scala Get one column from RDD

I have an RDD[Log] file with various fields (username,content,date,bytes) and I want to find different things for each field/column.
For example, I want to get the min/max and average bytes found in the RDD. When I do:
val q1 = cleanRdd.filter(x => x.bytes != 0)
I get the full lines of the RDD with bytes != 0. But how can I actually sum them, calculate the avg, find the min/max etc? How can I take only one column from my RDD and apply transformations on it?
EDIT: Prasad told me about converting the RDD to a DataFrame, but he gave no instructions on how to do so, and I can't find a solid answer on the site. Any help would be great.
EDIT: Log class:
case class Log (username: String, date: String, status: Int, content: Int)
Running cleanRdd.take(5).foreach(println) gives something like this:
Log(199.72.81.55 ,01/Jul/1995:00:00:01 -0400,200,6245)
Log(unicomp6.unicomp.net ,01/Jul/1995:00:00:06 -0400,200,3985)
Log(199.120.110.21 ,01/Jul/1995:00:00:09 -0400,200,4085)
Log(burger.letters.com ,01/Jul/1995:00:00:11 -0400,304,0)
Log(199.120.110.21 ,01/Jul/1995:00:00:11 -0400,200,4179)
Well... you have a lot of questions.
So... you have the following abstraction of a Log
case class Log(username: String, date: String, status: Int, content: Int, bytes: Int)
Que - How can I take only one column from my RDD?
Ans - RDDs have a map function. For an RDD[A], map takes a transform function of type A => B and turns it into an RDD[B].
val logRdd: RDD[Log] = ...
val byteRdd = logRdd
.filter(l => l.bytes != 0)
.map(l => l.bytes)
Que - How can I actually sum them?
Ans - You can do it using reduce / fold / aggregate.
val sum = byteRdd.reduce((acc, b) => acc + b)
val sum = byteRdd.fold(0)((acc, b) => acc + b)
val sum = byteRdd.aggregate(0)(
(acc, b) => acc + b,
(acc1, acc2) => acc1 + acc2
)
Note: an important thing to notice here is that a sum of Int can grow bigger than what an Int can handle, so in most real-life cases we should use at least a Long as our accumulator instead of an Int. Since reduce and fold require the accumulator to have the same type as the elements, that rules them out and leaves us with aggregate only.
val sum = byteRdd.aggregate(0L)(
(acc, b) => acc + b,
(acc1, acc2) => acc1 + acc2
)
Now if you have to calculate multiple things like min, max and avg, I suggest calculating them in a single aggregate instead of multiple passes, like this:
// accumulator: (count, sum, min, max)
val accInit = (0, 0, Int.MaxValue, Int.MinValue)
val (count, sum, min, max) = byteRdd.aggregate(accInit)(
  { case ((count, sum, min, max), b) =>
      (count + 1, sum + b, Math.min(min, b), Math.max(max, b)) },
  { case ((count1, sum1, min1, max1), (count2, sum2, min2, max2)) =>
      (count1 + count2, sum1 + sum2, Math.min(min1, min2), Math.max(max1, max2)) }
)
val avg = sum.toDouble / count
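As a side note of mine (not part of the original answer), Spark can also compute all of these in one pass via StatCounter on numeric RDDs:
// byteRdd as defined above; stats() comes from Spark's DoubleRDDFunctions
val stats = byteRdd.map(_.toDouble).stats()
// stats.count, stats.sum, stats.mean, stats.min, stats.max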
Have a look at the DataFrame API. You need to convert your RDD to a DataFrame, and then you can use min, max, avg functions like below:
val rdd = cleanRdd.filter(x => x.bytes != 0)
import sparkSession.implicits._
val df = rdd.toDF()
Assuming you want to run operations on the column bytes:
import org.apache.spark.sql.functions._
df.select(avg("bytes")).show
df.select(min("bytes")).show
df.select(max("bytes")).show
Update:
Tried the following in spark-shell:
case class Log (username: String, date: String, status: Int, content: Int)
val inputRDD = sc.parallelize(Seq(Log("199.72.81.55","01/Jul/1995:00:00:01 -0400",200,6245), Log("unicomp6.unicomp.net","01/Jul/1995:00:00:06 -0400",200,3985), Log("199.120.110.21","01/Jul/1995:00:00:09 -0400",200,4085), Log("burger.letters.com","01/Jul/1995:00:00:11 -0400",304,0), Log("199.120.110.21","01/Jul/1995:00:00:11 -0400",200,4179)))
val rdd = inputRDD.filter(x => x.content != 0)
val df = rdd.toDF("username", "date", "status", "content")
df.printSchema
import org.apache.spark.sql.functions._
df.select(avg("content")).show
df.select(min("content")).show
df.select(max("content")).show
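The screenshots with the outcome from the original post are not reproduced here, but the expected values can be worked out by hand from the four rows that survive the content != 0 filter (6245, 3985, 4085, 4179):
avg(content) = (6245 + 3985 + 4085 + 4179) / 4 = 4623.5
min(content) = 3985
max(content) = 6245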

Count number of Ints in Scala list using a fold

Say I have the following list of type Any:
val list = List("foo", 1, "bar", 2)
I would now like to write a function that counts only the number of Ints in a list using a fold. In the case of the list above, the result should be "2".
I know counting the number of all elements using fold would look something like this:
def count(list: List[Any]): Int =
  list.foldLeft(0)((sum, _) => sum + 1)
How can I tweak this to only count occurrences of Int?
Another version:
list.count(_.isInstanceOf[Int])
And, if you insist on the foldLeft version, here is one:
def count(list: List[Any]): Int =
  list.foldLeft(0)((sum, x) => x match {
    case _: Int => sum + 1
    case _      => sum
  })
Filtering the list by Int and taking the size gives you what you want and is fairly straightforward:
scala> list.filter(_.isInstanceOf[Int]).size
res0: Int = 2

Summing items within a Tuple

Below is a data structure of List of tuples, of type List[(String, String, Int)]:
val data3 = List( ("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1) )
//> data3: List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1), (id2,a,1))
I'm attempting to count the occurrences (i.e. sum the Int values) associated with each id, so the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a Tuple:
data3.map { case (id, name, num) => (id, name, num + 1) }
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (id2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type RDD[(String, Seq[(String, Int)])], corresponding to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will >always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the below function, when used with groupBy will give you a list with keys being the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
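Putting that together, a minimal sketch of my own (not from the original answer) completing the groupBy approach on the local List; the Spark RDD equivalents are in the other answers:
val counted = data3
  .groupBy(_._1)                               // Map[id, List[(id, name, count)]]
  .map { case (id, items) =>
    (id, items.head._2, items.map(_._3).sum)   // keep the name, sum the counts
  }
  .toList
// => List((id1,a,3), (id2,a,1))  (Map iteration order may differ)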
The following is the most readable, efficient and scalable:
data
  .map { case (key1, key2, value) => ((key1, key2), value) }
  .reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]; map once more with { case ((k1, k2), v) => (k1, k2, v) } if you need the flat Tuple3. By using reduceByKey, the summation is parallelized: partial sums are computed on the map side before the shuffle. Think about the case where there are only 10 groups but billions of records; grouping and then using .sum won't scale, as the summation can only be distributed across 10 cores.
A few more notes about the other answers:
Using head here is unnecessary: instead of .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) you can reuse the key, e.g. .map { case ((k1, k2), v) => (k1, k2, v.map(_._3).sum) }.
Using a foldLeft here is really horrible when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}