Spark data processing - Scala

I want to process a big dataset using Spark and Scala as part of my analysis process.
Sample Input
id, related_ids
a, "b,e,f,i,j"
b, "e,i,j,k,l"
c, "f,i,j,m,n"
d, "c,i,g,m,s"
Sample Output
a, "c,d"
b, "a,c"
c, "a,b"
d, "NULL"
I thought of creating a DataFrame on top of it and doing operations, but I'm not able to move further after creating the DataFrame.
case class Record(id: Int, name: Int, name1: Int, name2: Int, name3: Int, name4: Int)

val input11 = sc.textFile(inputFile).map(x => x.replaceAll("\"", ""))
val input22 = input11.map(x => x.split(",")).collect {
  case Array(id, name, name1, name2, name3, name4) =>
    Record(id.toInt, name.toInt, name1.toInt, name2.toInt, name3.toInt, name4.toInt)
}
val tbl = input22.toDF()
tbl.registerTempTable("rawData")
val res = sqlContext.sql("select name from rawData")

You can get exactly what you want with the following code.
First, I import your data:
val d = sc.parallelize(Seq(
  1 -> Seq(2, 5, 6, 10, 11),
  2 -> Seq(5, 10, 11, 15, 16),
  3 -> Seq(6, 10, 11, 17, 21),
  4 -> Seq(3, 10, 12, 17, 22)))
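(Side sketch, not part of the original answer: if you want to build this pair RDD from the raw text file instead of hard-coding it, something like the following could work, assuming the header line has been removed; note that the ids would still be strings, so expand below would need to be generalized or the ids mapped to numbers.)
// Hypothetical parsing sketch: strip quotes, split on commas, keep (id, related) pairs
val parsed = sc.textFile(inputFile)
  .map(_.replaceAll("\"", ""))
  .map { line =>
    val fields = line.split(",").map(_.trim)
    fields.head -> fields.tail.toSeq   // e.g. ("a", Seq("b","e","f","i","j"))
  }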
Then define a function that enumerates all the 2-tuples that can be created from an ordered list:
def expand(seq: Seq[Int]): Seq[(Int, Int)] =
  if (seq.isEmpty)
    Seq[(Int, Int)]()
  else
    seq.tail.map(x => seq.head -> x) ++ expand(seq.tail)
Example:
scala> expand(Seq(1,2,3,4))
res27: Seq[(Int, Int)] = List((1,2), (1,3), (1,4), (2,3), (2,4), (3,4))
And the final calculation would go as follows:
val p = 2
val connexions = d
.flatMapValues(x=>x)
.map(_.swap)
.groupByKey
.map(_._2)
.flatMap(x=>expand(x.toSeq.sorted))
.map(_ -> 1)
.reduceByKey(_+_)
.filter(_._2>= p)
.map(_._1)
.flatMap(x=> Seq(x._1 -> x._2, x._2 -> x._1))
.groupByKey.mapValues(_.toArray.sorted)
which, after a collect, yields:
Array((1,Array(2, 3)), (2,Array(1, 3)), (3,Array(1, 2, 4)), (4,Array(3)))
Note, by the way, that you made a mistake in your example: 4 and 3 have 2 elements in common (10 and 17). With p=3, you get:
Array((1,Array(2, 3)), (2,Array(1)), (3,Array(1)))
To also get the lines that do not have any "co_relations", join back with the original data:
d
.leftOuterJoin(connexions)
.mapValues(x=> x._1 -> x._2.getOrElse(null))
And you finally get (with p=3):
Array((1,(List(2, 5, 6, 10, 11),Array(2, 3))),
(2,(List(5, 10, 11, 15, 16),Array(1))),
(3,(List(6, 10, 11, 17, 21),Array(1))),
(4,(List(3, 10, 12, 17, 22),null)))
Nonetheless, if you want to study connections between your data points in a more general way, I encourage you to have a look at Spark's graph API (GraphX). You might, for instance, be interested in computing the connected components of your graph.
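For illustration, here is a minimal GraphX sketch of that idea (my addition, not part of the original answer); it assumes the co-related (id, id) pairs computed above are available as an RDD named pairs, which is hypothetical:
import org.apache.spark.graphx.Graph

// Hypothetical: `pairs` holds the co-related (id, id) pairs produced by the pipeline above
val edges = pairs.map { case (a, b) => (a.toLong, b.toLong) }
// Build a graph from edge tuples (vertices get a default attribute of 0)
val graph = Graph.fromEdgeTuples(edges, defaultValue = 0)
// Each vertex ends up tagged with the smallest vertex id of its connected component
val components = graph.connectedComponents().vertices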

Related

How to partition Fs2 Stream by key to transform each partition separately?

Here is what I want to achieve. For example, given the data:
time, part, data
0, a, 3
1, a, 4
2, b, 10
3, b, 20
3, a, 5
and transformation:
stream.keyBy(_.part).scan(0)((s, d) => s + d)
get:
0, a, 3
1, a, 7
2, b, 10
3, b, 30
3, a, 12
I've tried partitioning it using groupAdjacentBy, but it becomes too complex, because I need to preserve complex state between each Chunk for each key.
I wonder if there is something similar to Flink's DataStream.keyBy? Or a simpler way to implement it?
OK, I've found an interesting solution (it cannot be flattened, though).
The problem, as stated, can be solved by "partitioning" in the scan operation itself:
import cats.implicits._
import cats.effect.IO
import fs2._
case class Element(time: Long, part: Symbol, value: Int)
val elements = Stream(
Element(0, 'a, 3),
Element(1, 'a, 4),
Element(2, 'b, 10),
Element(3, 'b, 20),
Element(3, 'a, 5)
)
val runningSumsByPart = elements
.scan(Map.empty[Symbol, Int] -> none[Element]) {
case ((sums, _), el @ Element(_, part, value)) =>
val sum = sums.getOrElse(part, 0) + value
(sums + (part -> sum), el.copy(value = sum).some)
}
.collect { case (_, Some(el)) => el }
runningSumsByPart.covary[IO].evalTap(el => IO { println(el) }).compile.drain.unsafeRunSync()
Outputs:
Element(0,'a,3)
Element(1,'a,7)
Element(2,'b,10)
Element(3,'b,30)
Element(3,'a,12)
I did something like this. First split, then merge. I don't know yet how to return 2 streams though. I just know how to process them in one place and then merge them together.
val notEqualS = in
.filter(_.isInstanceOf[NotEqual])
.map(_.asInstanceOf[NotEqual])
...
val invalidS = in
.filter(_.isInstanceOf[Invalid])
.map(_.asInstanceOf[Invalid])
...
notEqualS.merge(invalidS)
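As a small aside (my suggestion, not from the original post), the same split can be expressed with collect and type patterns, which avoids the isInstanceOf/asInstanceOf pair:
// Sketch: split the stream by subtype using collect with type patterns
val notEqualS = in.collect { case ne: NotEqual => ne }
val invalidS  = in.collect { case inv: Invalid => inv }
// ... per-type processing, as in the original ...
notEqualS.merge(invalidS)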

Compute the maximum length assigned to each element using Scala

For example, this is the content in a file:
20,1,helloworld,alaaa
2,3,world,neww
1,223,ala,12341234
Desired output:
0 -> 2
1 -> 3
2 -> 10
3 -> 8
I want to find the maximum length for each element position (column).
It's possible to extend this to any number of columns. First, read the file as a DataFrame:
val df = spark.read.csv("path")
Then create an SQL expression for each column and evaluate it with expr (this needs the SQL functions import):
import org.apache.spark.sql.functions.{array, expr}
val cols = df.columns.map(c => s"max(length(cast($c as String)))").map(expr(_))
Select the new columns as an array and convert to a Map (the Dataset conversion needs the implicits import):
import spark.implicits._
df.select(array(cols: _*)).as[Seq[Int]].collect()
  .head
  .zipWithIndex.map(_.swap)
  .toMap
This should give you the desired Map.
Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)
Update:
The OP's example suggests that all rows will have the same number of columns.
Using Spark SQL and max(length()) on the DataFrame columns is the idea suggested in this answer.
You can do:
import spark.implicits._

val xx = Seq(
  ("20", "1", "helloworld", "alaaa"),
  ("2", "3", "world", "neww"),
  ("1", "223", "ala", "12341234")
).toDF("a", "b", "c", "d")
xx.createOrReplaceTempView("yy")
spark.sql("select max(length(a)), max(length(b)), max(length(c)), max(length(d)) from yy")
I would recommend using RDD's aggregate method:
val rdd = sc.textFile("/path/to/textfile").
map(_.split(","))
// rdd.collect: Array[Array[String]] = Array(
// Array(20, 1, helloworld, alaaa), Array(2, 3, world, neww), Array(1, 223, ala, 12341234)
// )
val seqOp = (m: Array[Int], r: Array[String]) =>
(r zip m).map( t => Seq(t._1.length, t._2).max )
val combOp = (m1: Array[Int], m2: Array[Int]) =>
(m1 zip m2).map( t => Seq(t._1, t._2).max )
val size = rdd.collect.head.size
rdd.
aggregate( Array.fill[Int](size)(0) )( seqOp, combOp ).
zipWithIndex.map(_.swap).
toMap
// res2: scala.collection.immutable.Map[Int,Int] = Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)
Note that aggregate takes:
an array of 0's (of size equal to rdd's row size) as the initial value,
a function seqOp for calculating maximum string lengths within a partition, and,
another function combOp to combine results across partitions for the final maximum values.

Scala - Reduce list of tuples by key

I have a list of tuples containing a userId and a point. I want to combine or reduce this list by key.
val points: List[(Int, Double)] = List(
(1, 1.0),
(2, 3.2),
(4, 2.0),
(1, 4.0),
(2, 6.8)
)
The expected result should look like:
List((1, 5.0), (2, 10.0), (4, 2.0))
I tried with groupBy and mapValues, but got an error:
val aggrPoint: Map[Int, Double] = points.groupBy(_._1).mapValues(seq => seq.reduce(_._2 + _._2))
Error:(16, 180) type mismatch;
found : Double
required: (Int, Double)
What am I doing wrong, and is there an idiomatic way to achieve this?
P.S. I found that in Spark, aggregateByKey does this job. But is there a built-in method in Scala?
What am I doing wrong, and is there an idiomatic way to achieve this?
Let's go step by step to see what you are doing wrong. (I am going to use the REPL.)
First of all, let's define the points:
scala> val points: List[(Int, Double)] = List(
| (1, 1.0),
| (2, 3.2),
| (4, 2.0),
| (1, 4.0),
| (2, 6.8)
| )
points: List[(Int, Double)] = List((1,1.0), (2,3.2), (4,2.0), (1,4.0), (2,6.8))
As you can see, you have a List[Tuple2[Int, Double]], so when you do groupBy and mapValues as
scala> points.groupBy(_._1).mapValues(seq => println(seq))
List((2,3.2), (2,6.8))
List((4,2.0))
List((1,1.0), (1,4.0))
res1: scala.collection.immutable.Map[Int,Unit] = Map(2 -> (), 4 -> (), 1 -> ())
You can see that the seq object is a List[Tuple2[Int, Double]] again, but it only contains the grouped tuples as a list.
So when you apply seq.reduce(_._2 + _._2), the reduce function takes two inputs of Tuple2[Int, Double], but the output is just a Double, which doesn't match the next iteration on seq, where the expected input is Tuple2[Int, Double]. That's the main issue. All you have to do is match the input and output types of the reduce function.
One way would be to keep the result as a Tuple2[Int, Double], as in:
scala> points.groupBy(_._1).mapValues(seq => seq.reduce{(x,y) => (x._1, x._2 + y._2)})
res6: scala.collection.immutable.Map[Int,(Int, Double)] = Map(2 -> (2,10.0), 4 -> (4,2.0), 1 -> (1,5.0))
But this isn't your desired output, so you can extract the double value from the reduced Tuple2[Int, Double] as
scala> points.groupBy(_._1).mapValues(seq => seq.reduce{(x,y) => (x._1, x._2 + y._2)}._2)
res8: scala.collection.immutable.Map[Int,Double] = Map(2 -> 10.0, 4 -> 2.0, 1 -> 5.0)
or you can just use map before you apply the reduce function, as
scala> points.groupBy(_._1).mapValues(seq => seq.map(_._2).reduce(_ + _))
res3: scala.collection.immutable.Map[Int,Double] = Map(2 -> 10.0, 4 -> 2.0, 1 -> 5.0)
I hope the explanation is clear enough to understand your mistake and how a reduce function works.
You can map the tuples inside mapValues to their second elements and then sum them, as follows:
points.groupBy(_._1).mapValues( _.map(_._2).sum ).toList
// res1: List[(Int, Double)] = List((2,10.0), (4,2.0), (1,5.0))
Using collect
points.groupBy(_._1).collect{
case e => e._1 -> e._2.map(_._2).sum
}.toList
//res1: List[(Int, Double)] = List((2,10.0), (4,2.0), (1,5.0))
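As a side note on the P.S. (my addition, not from the original answers): since Scala 2.13 there is a built-in groupMapReduce that does the grouping, projection, and reduction in one pass:
// Scala 2.13+: group by the key, project to the value, and sum per key
points.groupMapReduce(_._1)(_._2)(_ + _)
// Map(1 -> 5.0, 2 -> 10.0, 4 -> 2.0)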

Scala: How to "map" an Array[Int] to a Map[String, Int] using the "map" method?

I have the following Array[Int]: val array = Array(1, 2, 3), for which I have the following mapping relation between an Int and a String:
val a1 = array.map{
case 1 => "A"
case 2 => "B"
case 3 => "C"
}
To create a Map to contain the above mapping relation, I am aware that I can use a foldLeft method:
val a2 = array.foldLeft(Map[String, Int]()) { (m, e) =>
m + (e match {
case 1 => ("A", 1)
case 2 => "B" -> 2
case 3 => "C" -> 3
})
}
which outputs:
a2: scala.collection.immutable.Map[String,Int] = Map(A -> 1, B -> 2, C -> 3)
This is the result I want. But can I achieve the same result via the map method?
The following code does not work:
val a3 = array.map[(String, Int), Map[String, Int]] {
case 1 => ("A", 1)
case 2 => ("B", 2)
case 3 => ("C", 3)
}
The signature of map is
def map[B, That](f: A => B)
(implicit bf: CanBuildFrom[Repr, B, That]): That
What is this CanBuildFrom[Repr, B, That]? I tried to read Tribulations of CanBuildFrom but don't really understand it. That article mentioned that Scala 2.12+ provides two implementations of map. But how come I didn't find them when I used Scala 2.12.4?
I mostly use Scala 2.11.12.
Call toMap at the end of your expression:
val a3 = array.map {
case 1 => ("A", 1)
case 2 => ("B", 2)
case 3 => ("C", 3)
}.toMap
I'll first define your function here for the sake of brevity in later explanation:
// worth noting that this function is effectively partial
// i.e. will throw a `MatchError` if n is not in (1, 2, 3)
def toPairs(n: Int): (String, Int) =
n match {
case 1 => "a" -> 1
case 2 => "b" -> 2
case 3 => "c" -> 3
}
One possible way to go (as already highlighted in another answer) is to use toMap, which only works on collections of pairs:
val ns = Array(1, 2, 3)
ns.toMap // doesn't compile
ns.map(toPairs).toMap // does what you want
It is worth noting, however, that unless you are working with a lazy representation (like an Iterator or a Stream), this will result in two passes over the collection and the creation of an unnecessary intermediate collection: the first pass maps toPairs over the collection, and the second turns the whole collection of pairs into a Map (with toMap).
You can see it clearly in the implementation of toMap.
As suggested in the article you already linked in the question (and in particular here), you can avoid this double pass in two ways:
you can leverage scala.collection.breakOut, an implementation of CanBuildFrom that you can pass to map (among others) to change the target collection, provided that you explicitly give the compiler a type hint:
val resultMap: Map[String, Int] = ns.map(toPairs)(collection.breakOut)
val resultSet: Set[(String, Int)] = ns.map(toPairs)(collection.breakOut)
Otherwise, you can create a view over your collection, which wraps it in the lazy wrapper you need so that the operation does not result in a double pass:
ns.view.map(toPairs).toMap
You can read more about implicit builder providers and views in this Q&A.
Basically toMap (credits to Sergey Lagutin) is the right answer.
You could actually make the code a bit more compact though:
val a1 = array.map { i => ((i + 64).toChar, i) }.toMap
If you run this code:
val array = Array(1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 0)
val a1 = array.map { i => ((i + 64).toChar, i) }.toMap
println(a1)
You will see this on the console:
Map(E -> 5, J -> 10, F -> 6, A -> 1, # -> 0, G -> 7, L -> 12, B -> 2, C -> 3, H -> 8, K -> 11, D -> 4)

Play Scala - groupBy remove repetitive values

I apply the groupBy function to my List collection; however, I want to remove the repetitive values in the value part of the Map. Here is the initial List collection:
PO_ID  PRODUCT_ID  RETURN_QTY
1      1           10
1      1           20
1      2           30
1      2           10
When I apply groupBy to that List, it will produce something like this:
(1, 1) -> (1, 1, 10),(1, 1, 20)
(1, 2) -> (1, 2, 30),(1, 2, 10)
What I really want is something like this:
(1, 1) -> (10),(20)
(1, 2) -> (30),(10)
So, is there any way to remove the repetitive part in the Map's values [(1,1),(1,2)]?
Thanks.
For
val a = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
consider
a.groupBy( v => (v._1,v._2) ).mapValues( _.map (_._3) )
which delivers
Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Note that mapValues operates over the lists of triplets obtained from groupBy, while the inner map extracts the third element of each triplet.
Is it easier to pull the tuple apart first?
scala> val ts = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
ts: Seq[(Int, Int, Int)] = List((1,1,10), (1,1,20), (1,2,30), (1,2,10))
scala> ts map { case (a,b,c) => (a,b) -> c }
res0: Seq[((Int, Int), Int)] = List(((1,1),10), ((1,1),20), ((1,2),30), ((1,2),10))
scala> ((Map.empty[(Int, Int), List[Int]] withDefaultValue List.empty[Int]) /: res0) { case (m, (k,v)) => m + ((k, m(k) :+ v)) }
res1: scala.collection.immutable.Map[(Int, Int),List[Int]] = Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Guess not.