Compute the maximum length assigned to each element using scala - scala

For example, this is the content in a file:
20,1,helloworld,alaaa
2,3,world,neww
1,223,ala,12341234
Desired output"
0-> 2
1-> 3
2-> 10
3-> 8
I want to find max-length assigned to each element.

It's possible to extend this to any number of columns. First read the file as a dataframe:
val df = spark.read.csv("path")
Then create an SQL expression for each column and evaluate it with expr:
val cols = df.columns.map(c => s"max(length(cast($c as String)))").map(expr(_))
Select the new columns as an array and covert to Map:
df.select(array(cols:_*)).as[Seq[Int]].collect()
.head
.zipWithIndex.map(_.swap)
.toMap
This should give you the desired Map.
Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)

Update:
OP's example suggests that they will be of equal lengths.
Using Spark-SQL and max(length()) on the DF columns is the idea that is being suggested in this answer.
You can do:
val xx = Seq(
("20","1","helloworld","alaaa"),
("2","3","world","neww"),
("1","223","ala","12341234")
).toDF("a", "b", "c", "d")
xx.registerTempTable("yy")
spark.sql("select max(length(a)), max(length(b)), max(length(c)), max(length(d)) from yy")

I would recommend using RDD's aggregate method:
val rdd = sc.textFile("/path/to/textfile").
map(_.split(","))
// res1: Array[Array[String]] = Array(
// Array(20, 1, helloworld, alaaa), Array(2, 3, world, neww), Array(1, 223, ala, 12341234)
// )
val seqOp = (m: Array[Int], r: Array[String]) =>
(r zip m).map( t => Seq(t._1.length, t._2).max )
val combOp = (m1: Array[Int], m2: Array[Int]) =>
(m1 zip m2).map( t => Seq(t._1, t._2).max )
val size = rdd.collect.head.size
rdd.
aggregate( Array.fill[Int](size)(0) )( seqOp, combOp ).
zipWithIndex.map(_.swap).
toMap
// res2: scala.collection.immutable.Map[Int,Int] = Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)
Note that aggregate takes:
an array of 0's (of size equal to rdd's row size) as the initial value,
a function seqOp for calculating maximum string lengths within a partition, and,
another function combOp to combine results across partitions for the final maximum values.

Related

Spark data processing

I want to process Big Dataset using Spark and Scala as part of my analysis process.
Sample Input
id, related_ids
a, "b,e,f,i,j"
b, "e,i,j,k,l"
c, "f,i,j,m,n"
d, "c,i,g,m,s"
Sample Output
a, "c,d"
b, "a,c"
c, "a,b"
d, "NULL"
I thought of creating a data frame on top of that doing operations but I'm not able to move further after creating the data frame.
{
val input11 = sc.textFile(inputFile).map(x=>x.replaceAll("\"",""))
val input22 = input11.map(x=>x.split(",")).collect {
case Array(id,name,name1, name2, name3, name4) => Record(id.toInt, name.toInt, name1.toInt, name2.toInt, name3.toInt, name4.toInt)
}
val tbl = input22.toDF()
tbl.registerTempTable("rawData")
val res = sqlContext.sql("select name from rawData")
}
case class Record(id: Int, name: Int, name1 : Int, name2 : Int, name3 : Int, name4 : Int)
}
You can get exactly what you want with the following code:
I import your data:
val d = sc.parallelize(Seq(1 -> Seq(2,5,6,10,11),
2 -> Seq(5,10,11,15,16),
3-> Seq(6,10,11,17,21),
4 -> Seq(3,10,12,17,22))
Then defines a function that will enumerate all the 2-tuples that can be created from an ordered list.
def expand(seq : Seq[Int]): Seq[(Int, Int)] =
if (seq.isEmpty)
Seq[(Int, Int)]()
else
seq.tail.map(x=> seq.head -> x) ++ expand(seq.tail)
example :
scala> expand(Seq(1,2,3,4))
res27: Seq[(Int, Int)] = List((1,2), (1,3), (1,4), (2,3), (2,4), (3,4))
And the final calculation would go as follows:
val p = 2
d
.flatMapValues(x=>x)
.map(_.swap)
.groupByKey
.map(_._2)
.flatMap(x=>expand(x.toSeq.sorted))
.map(_ -> 1)
.reduceByKey(_+_)
.filter(_._2>= p)
.map(_._1)
.flatMap(x=> Seq(x._1 -> x._2, x._2 -> x._1))
.groupByKey.mapValues(_.toArray.sorted)
which yields:
Array((1,Array(2, 3)), (2,Array(1, 3)), (3,Array(1, 2, 4)), (4,Array(3)))
Note by the way that you made a mistake in your exemple, 4 and 3 have 2 elements in common (10 and 17). With p=3, you get:
Array((1,Array(2, 3)), (2,Array(1)), (3,Array(1)))
To get even the lines who do not have any "co_relations", join with the original data.
d
.leftOuterJoin(connexions)
.mapValues(x=> x._1 -> x._2.getOrElse(null))
And you finally get (with p=3):
Array((1,(List(2, 5, 6, 10, 11),Array(2, 3))),
(2,(List(5, 10, 11, 15, 16),Array(1))),
(3,(List(6, 10, 11, 17, 21),Array(1))),
(4,(List(3, 10, 12, 17, 22),null)))
Nonetheless, if you want to study connexions between your data points in a more general way, I encourage you to have a look to the Graph API of spark. You might for instance be interested in computing the connected components of your graph. ( GraphX )

Sum of Values based on key in scala

I am new to scala I have List of Integers
val list = List((1,2,3),(2,3,4),(1,2,3))
val sum = list.groupBy(_._1).mapValues(_.map(_._2)).sum
val sum2 = list.groupBy(_._1).mapValues(_.map(_._3)).sum
How to perform N values I tried above but its not good way how to sum N values based on key
Also I have tried like this
val sum =list.groupBy(_._1).values.sum => error
val sum =list.groupBy(_._1).mapvalues(_.map(_._2).sum (_._3).sum) error
It's easier to convert these tuples to List[Int] with shapeless and then work with them. Your tuples are actually more like lists anyways. Also, as a bonus, you don't need to change your code at all for lists of Tuple4, Tuple5, etc.
import shapeless._, syntax.std.tuple._
val list = List((1,2,3),(2,3,4),(1,2,3))
list.map(_.toList) // convert tuples to list
.groupBy(_.head) // group by first element of list
.mapValues(_.map(_.tail).map(_.sum).sum) // sums elements of all tails
Result is Map(2 -> 7, 1 -> 10).
val sum = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._1 + j._2 + j._3).sum))
> sum: scala.collection.immutable.Map[Int,Int] = Map(2 -> 9, 1 -> 12)
Since tuple can't type safe convert to List, need to specify add one by one as j._1 + j._2 + j._3.
using the first element in the tuple as the key and the remaining elements as what you need you could do something like this:
val list = List((1,2,3),(2,3,4),(1,2,3))
list: List[(Int, Int, Int)] = List((1, 2, 3), (2, 3, 4), (1, 2, 3))
val sum = list.groupBy(_._1).map { case (k, v) => (k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum) }
sum: Map[Int, Int] = Map(2 -> 7, 1 -> 10)
i know its a bit dirty to do asInstanceOf[Int] but when you do .productIterator you get a Iterator of Any
this will work for any tuple size

Specify subset of elements in Spark RDD (Scala)

My dataset is a RDD[Array[String]] with more than 140 columns. How can I select a subset of columns without hard-coding the column numbers (.map(x => (x(0),x(3),x(6)...))?
This is what I've tried so far (with success):
val peopleTups = people.map(x => x.split(",")).map(i => (i(0),i(1)))
However, I need more than a few columns, and would like to avoid hard-coding them.
This is what I've tried so far (that I think would be better, but has failed):
// Attempt 1
val colIndices = [0,3,6,10,13]
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Error output from attempt 1:
<console>:28: error: type mismatch;
found : List[Int]
required: Int
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Attempt 2
colIndices map peopleTups.lift
// Attempt 3
colIndices map peopleTups
// Attempt 4
colIndices.map(index => peopleTups.apply(index))
I found this question and tried it, but because I'm looking at an RDD instead of an array, it didn't work: How can I select a non-sequential subset elements from an array using Scala and Spark?
You should map over the RDD instead of the indices.
val list = List.fill(2)(Array.range(1, 6))
// List(Array(1, 2, 3, 4, 5), Array(1, 2, 3, 4, 5))
val rdd = sc.parallelize(list) // RDD[Array[Int]]
val indices = Array(0, 2, 3)
val selectedColumns = rdd.map(array => indices.map(array)) // RDD[Array[Int]]
selectedColumns.collect()
// Array[Array[Int]] = Array(Array(1, 3, 4), Array(1, 3, 4))
What about this?
val data = sc.parallelize(List("a,b,c,d,e", "f,g,h,i,j"))
val indices = List(0,3,4)
data.map(_.split(",")).map(ss => indices.map(ss(_))).collect
This should give
res1: Array[List[String]] = Array(List(a, d, e), List(f, i, j))

Spark: produce RDD[(X, X)] of all possible combinations from RDD[X]

Is it possible in Spark to implement '.combinations' function from scala collections?
/** Iterates over combinations.
*
* #return An Iterator which traverses the possible n-element combinations of this $coll.
* #example `"abbbc".combinations(2) = Iterator(ab, ac, bb, bc)`
*/
For example how can I get from RDD[X] to RDD[List[X]] or RDD[(X,X)] for combinations of size = 2. And lets assume that all values in RDD are unique.
Cartesian product and combinations are two different things, the cartesian product will create an RDD of size rdd.size() ^ 2 and combinations will create an RDD of size rdd.size() choose 2
val rdd = sc.parallelize(1 to 5)
val combinations = rdd.cartesian(rdd).filter{ case (a,b) => a < b }`.
combinations.collect()
Note this will only work if an ordering is defined on the elements of the list, since we use <. This one only works for choosing two but can easily be extended by making sure the relationship a < b for all a and b in the sequence
This is supported natively by a Spark RDD with the cartesian transformation.
e.g.:
val rdd = sc.parallelize(1 to 5)
val cartesian = rdd.cartesian(rdd)
cartesian.collect
Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (1,4), (1,5),
(2,1), (2,2), (2,3), (2,4), (2,5),
(3,1), (3,2), (3,3), (3,4), (3,5),
(4,1), (4,2), (4,3), (4,4), (4,5),
(5,1), (5,2), (5,3), (5,4), (5,5))
As discussed, cartesian will give you n^2 elements of the cartesian product of the RDD with itself.
This algorithm computes the combinations (n,2) of an RDD without having to compute the n^2 elements first: (used String as type, generalizing to a type T takes some plumbing with classtags that would obscure the purpose here)
This is probably less time efficient that cartesian + filtering due to the iterative count and take actions that forces the computation of the RDD, but more space efficient as it calculates only the C(n,2) = n!/(2*(n-2))! = (n*(n-1)/2) elements instead of the n^2 of the cartesian product.
import org.apache.spark.rdd._
def combs(rdd:RDD[String]):RDD[(String,String)] = {
val count = rdd.count
if (rdd.count < 2) {
sc.makeRDD[(String,String)](Seq.empty)
} else if (rdd.count == 2) {
val values = rdd.collect
sc.makeRDD[(String,String)](Seq((values(0), values(1))))
} else {
val elem = rdd.take(1)
val elemRdd = sc.makeRDD(elem)
val subtracted = rdd.subtract(elemRdd)
val comb = subtracted.map(e => (elem(0),e))
comb.union(combs(subtracted))
}
}
This creates all combinations (n, 2) and works for any RDD without requiring any ordering on the elements of RDD.
val rddWithIndex = rdd.zipWithIndex
rddWithIndex.cartesian(rddWithIndex).filter{case(a, b) => a._2 < b._2}.map{case(a, b) => (a._1, b._1)}
a._2 and b._2 are the indices, while a._1 and b._1 are the elements of the original RDD.
Example:
Note that, no ordering is defined on the maps here.
val m1 = Map('a' -> 1, 'b' -> 2)
val m2 = Map('c' -> 3, 'a' -> 4)
val m3 = Map('e' -> 5, 'c' -> 6, 'b' -> 7)
val rdd = sc.makeRDD(Array(m1, m2, m3))
val rddWithIndex = rdd.zipWithIndex
rddWithIndex.cartesian(rddWithIndex).filter{case(a, b) => a._2 < b._2}.map{case(a, b) => (a._1, b._1)}.collect
Output:
Array((Map(a -> 1, b -> 2),Map(c -> 3, a -> 4)), (Map(a -> 1, b -> 2),Map(e -> 5, c -> 6, b -> 7)), (Map(c -> 3, a -> 4),Map(e -> 5, c -> 6, b -> 7)))

Play Scala - groupBy remove repetitive values

I apply groupBy function to my List collection, however I want to remove the repetitive values in the value part of the Map. Here is the initial List collection:
PO_ID PRODUCT_ID RETURN_QTY
1 1 10
1 1 20
1 2 30
1 2 10
When I apply groupBy to that List, it will produce something like this:
(1, 1) -> (1, 1, 10),(1, 1, 20)
(1, 2) -> (1, 2, 30),(1, 2, 10)
What I really want is something like this:
(1, 1) -> (10),(20)
(1, 2) -> (30),(10)
So, is there anyway to remove the repetitive part in the Map's values [(1,1),(1,2)] ?
Thanks..
For
val a = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
consider
a.groupBy( v => (v._1,v._2) ).mapValues( _.map (_._3) )
which delivers
Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Note that mapValues operates over a List[List] of triplets obtained from groupBy, whereas in map we extract the third element of each triplet.
Is it easier to pull the tuple apart first?
scala> val ts = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
ts: Seq[(Int, Int, Int)] = List((1,1,10), (1,1,20), (1,2,30), (1,2,10))
scala> ts map { case (a,b,c) => (a,b) -> c }
res0: Seq[((Int, Int), Int)] = List(((1,1),10), ((1,1),20), ((1,2),30), ((1,2),10))
scala> ((Map.empty[(Int, Int), List[Int]] withDefaultValue List.empty[Int]) /: res0) { case (m, (k,v)) => m + ((k, m(k) :+ v)) }
res1: scala.collection.immutable.Map[(Int, Int),List[Int]] = Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Guess not.