Looking for ways to optimize the program [Scala/Spark] - scala

I have an RDD which looks like the following:
( (tag_1, set_1), (tag_2, set_2) ) , ... , ( (tag_M, set_M), (tag_L, set_L) ), ...
For each pair from the RDD I'm going to compute the expression p(k) for k = 0,...,3 and find the sum p(0)+...+p(3). For each pair of pairs, n_1 is the length of the set in the first pair and n_2 is the length of the set in the second pair.
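Reading it off the code below, the expression appears to be
p(k) = \prod_{j=0}^{n_2-k-1} \left(1 - \frac{n_1}{N-j}\right) \cdot \prod_{j=0}^{k-1} \frac{(n_1-j)(n_2-j)}{(N-n_2+k-j)(k-j)}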
For now I wrote the following:
val N = 1000

pairRDD.map({
  case ((t1, l1), (t2, l2)) => (t1, t2, {
    val n_1 = l1.size
    val n_2 = l2.size
    val vals = (0 to 3).map(k => {
      val P1 = (0 to (n_2 - k - 1))
        .map(j => 1 - n_1 / (N - j.toDouble))
        .foldLeft(1.0)(_ * _)
      val P2 = (0 to (k - 1))
        .map(j => (n_1 - j.toDouble) * (n_2 - j.toDouble) / (N - n_2 + k.toDouble - j.toDouble) / (k.toDouble - j.toDouble))
        .foldLeft(1.0)(_ * _)
      P1 * P2
    })
    vals.sum
  })
})
The problem is that it seems to work really slowly, and I hope there are some features of Scala/Spark that I don't know about that could reduce the execution time here.
Edit:
1) To start with, I have a CSV file with 2 columns: tag and message_id. For each tag I find the messages it occurs in and create pairs like I described above (tagIdsZipped). The code is here
2) Then I want to compute the expression for each pair and write it to a file.
Actually, I would also like to filter the result, but that would take even longer, so I'm not even trying it for now.
3) No, actually I don't, but the problems appeared when I tried to use this code; previously I did the following:
val tagPairsWithMeasure: RDD[(Tag, Tag, Measure)] = tagIdsZipped.map({
  case ((t1, l1), (t2, l2)) => (t1, t2, {
    val numer = l1.intersect(l2).size
    val denom = Math.sqrt(l1.size) * Math.sqrt(l2.size)
    numer.toDouble / denom
  })
})
and everything worked fine. (see 4) )
4) In the file I described in 1) there are about 25 million rows (~1.2 GB). I'm computing on a Xeon E5-2673 @ 2.4 GHz with 32 GB RAM. It took about 1.5 hours to execute the code with the function I described in 3). I see that there are more operations now, but it took about 3 hours and only about 25% of the task was done. The main problem is that I will have to work with about 3 times more data, but I can't even manage it on the 'smaller' set.
Thank you in advance!

As has been mentioned, there is not much to improve on the Spark side here.
The biggest issue I can see is the use of Range.map.
(0 to (n_2-k-1)) creates a Range object.
Calling map on it materializes a Vector, allocating a lot of memory.
The simplest solution is to work with iterators, since foldLeft is a streaming-friendly function:
(0 to (n_2-k-1)).iterator instead of (0 to (n_2-k-1))
It also probably makes sense to try rewriting it imperatively using vars, loops and arrays, since computation inside a plain loop is extremely cheap. But that is a weapon of last resort.
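For illustration, here is a rough sketch of that imperative rewrite (the same arithmetic as the original code, just with plain while loops and no intermediate collections; treat it as a sketch rather than a drop-in replacement):

def pSum(n1: Int, n2: Int, N: Int): Double = {
  var total = 0.0
  var k = 0
  while (k <= 3) {
    // P1 = product over j = 0 .. n2-k-1 of (1 - n1/(N-j))
    var p1 = 1.0
    var j = 0
    while (j <= n2 - k - 1) {
      p1 *= 1 - n1 / (N - j.toDouble)
      j += 1
    }
    // P2 = product over j = 0 .. k-1 of (n1-j)(n2-j) / ((N-n2+k-j)(k-j))
    var p2 = 1.0
    j = 0
    while (j <= k - 1) {
      p2 *= (n1 - j.toDouble) * (n2 - j.toDouble) / (N - n2 + k.toDouble - j) / (k.toDouble - j)
      j += 1
    }
    total += p1 * p2
    k += 1
  }
  total
}
// usage inside the map, e.g. pairRDD.map { case ((t1, l1), (t2, l2)) => (t1, t2, pSum(l1.size, l2.size, N)) }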

Have you tried using DataFrames?
Maybe you can create a DataFrame with a schema like this:
tagIdsDF
+------+------+------+------+
| tag1 | set1 | tag2 | set2 |
+------+------+------+------+
|tag_1 |set_1 |tag_2 |set_2 |
| ...  | ...  | ...  | ...  |
|tag_M |set_M |tag_L |set_L |
+------+------+------+------+
and define a UDF to compute the sum:
val pFun = udf((l1: Seq[Double], l2: Seq[Double]) => {
  // N must be in scope here (e.g. val N = 1000 as in the question)
  val n_1 = l1.size
  val n_2 = l2.size
  val vals = (0 to 3).map(k => {
    val P1 = (0 to (n_2 - k - 1))
      .map(j => 1 - n_1 / (N - j.toDouble))
      .foldLeft(1.0)(_ * _)
    val P2 = (0 to (k - 1))
      .map(j => (n_1 - j.toDouble) * (n_2 - j.toDouble) / (N - n_2 + k.toDouble - j.toDouble) / (k.toDouble - j.toDouble))
      .foldLeft(1.0)(_ * _)
    P1 * P2
  })
  vals.sum
})
Notice that you don't need to pass tag_1/tag_2, because that information is already in the resulting DataFrame. Then you can call it like this:
val tagWithMeasureDF = tagIdsDF.withColumn("measure", pFun($"set1", $"set2"))
and you get this df:
tagWithMeasureDF
+------+------+------+------+---------+
| tag1 | set1 | tag2 | set2 | measure |
+------+------+------+------+---------+
|tag_1 |set_1 |tag_2 |set_2 |   m1    |
| ...  | ...  | ...  | ...  |   ...   |
|tag_M |set_M |tag_L |set_L |   mn    |
+------+------+------+------+---------+
Doing something like this may help you achieve the desired performance.
Hope this helps, and if it works let me know!
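If it helps, one way tagIdsDF could be built from the pair RDD in the question might be the following (just a sketch: I'm assuming the sets can be converted to Seqs, and the UDF's Seq[Double] parameter type may need to be adjusted to the actual element type, since only the sizes are used anyway):

import spark.implicits._
// sketch: tagIdsZipped is assumed to be the RDD of ((tag, set), (tag, set)) pairs from the question
val tagIdsDF = tagIdsZipped
  .map { case ((t1, s1), (t2, s2)) => (t1, s1.toSeq, t2, s2.toSeq) }
  .toDF("tag1", "set1", "tag2", "set2")

After that, pFun can be applied as shown above.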

Related

Length of dataframe inside UDF function

I need to write a complex User Defined Function (UDF) that takes multiple columns as input. Something like:
val uudf = udf{(value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p} // actually a more complex function, but let's keep it simple
The third parameter, cumsum_p, is a cumulative sum of p, where p is based on the length of the group it is computed in, because this UDF will then be used in a group-by.
I came up with this solution, which is almost OK:
val uudf = udf{(value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p}
val w = Window.orderBy($"sale_qty")
df.withColumn("needThat",
  uudf(col("sale_qty"),
    lead("sale_qty", 1).over(w),
    sum(lit(1 / length_group)).over(w)
  )
).show()
The problem is that if I replace lit(1/length_group) with lit(1/count("sale_qty")), the created column now contains only 1 element, which leads to an error...
You should compute count("sale_qty") first:
val w = Window.orderBy($"sale_qty")
df
  .withColumn("cnt", count($"sale_qty").over())
  .withColumn("needThat",
    uudf(col("sale_qty"),
      lead("sale_qty", 1).over(w),
      sum(lit(1) / $"cnt").over(w)
    )
  ).show()

How to count the number of words per line in text file using RDD?

Is there a way to count the number of word occurrences for each line of an RDD and not the complete RDD using map and reduce?
For example, if an RDD[String] contains these two lines:
Let's have some fun.
To have fun you don't need any plans.
then the output should be like a map containing the key-value pairs:
("Let's",1)
("have",1)
("some",1)
("fun",1)
("To",1)("have",1)("fun",1)("you",1)("don't",1)("need",1)("plans",1)
Please don't use the RDD API if you've just started using Spark and no one told you to use it. There's a much nicer and often more efficient Spark SQL API for this and many other distributed computations over large datasets in Spark.
Using the RDD API is like using assembler where you could use Scala (or another higher-level programming language). It's certainly too much to think about when starting your journey into Spark, which is why I'd personally recommend the higher-level Spark SQL API with DataFrames and Datasets in the first place.
Given the input:
$ cat input.txt
Let's have some fun.
To have fun you don't need any plans.
and that you were to use Dataset API, you could do the following:
val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line |words |
+-------------------------------------+------+
|Let's have some fun. |Let's |
|Let's have some fun. |have |
|Let's have some fun. |some |
|Let's have some fun. |fun. |
| | |
|To have fun you don't need any plans.|To |
|To have fun you don't need any plans.|have |
|To have fun you don't need any plans.|fun |
|To have fun you don't need any plans.|you |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need |
|To have fun you don't need any plans.|any |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+
scala> wordsPerLine.
groupBy("line", "words").
count.
withColumn("word_count", struct($"words", $"count")).
select("line", "word_count").
groupBy("line").
agg(collect_set("word_count")).
show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line |collect_set(word_count) |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun. |[[have,1], [fun.,1], [Let's,1], [some,1]] |
| |[[,1]] |
+-------------------------------------+------------------------------------------------------------------------------+
Done. Simple, isn't it?
See functions object (for explode and struct functions).
According to my understanding, you can do the following.
You said that you have RDD[String] data:
val data = Seq("Let's have some fun.",
"To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)
You can apply flatMap to split the lines and create (word, 1) tuples in a map function:
val output = rddData.flatMap(_.split(" ")).map(word => (word, 1))
that should give you your desired output
output.foreach(println)
To get occurrences by line, you should do the following:
val output = rddData.map(
  _.split(" ")
    .map((_, 1))
    .groupBy(_._1)
    .map { case (group: String, traversable) => traversable.reduce { (a, b) => (a._1, a._2 + b._2) } }
    .toList
).flatMap(tuple => tuple)
What you want is to transform a line into a Map(word, count). So you can define a function that counts words per line:
def wordsCount(line: String):Map[String,Int] = {
line.split(" ").map(v => (v,1)).groupBy(_._1).mapValues(_.size)
}
then just apply it to your RDD[String]:
val lines:RDD[String] = ...
val wordsByLineRDD:RDD[Map[String,Int]] = lines.map(wordsCount)
// this should give you a Map per line with count of each word
wordsByLineRDD.take(2)
// Something like
// Array(Map(some -> 1, have -> 1, Let's -> 1, fun. -> 1), Map(any -> 1, have -> 1, don't -> 1, you -> 1, need -> 1, fun -> 1, To -> 1, plans. -> 1))
Although it is an old question, I was looking for an answer to this in PySpark. I finally managed it like below.
file_ = cont_.parallelize (
["shots are shots that are shots with more big shots by big people",
"people comes in all shapes and sizes, as people are idoits of the idiots",
"i know what i am writing is nonsense, but i don't care because i am doing this to test my spark program",
"my spark is a current spark, that spark in my eyes."]
)
file_ \
.map(lambda x : [((x, i), 1) for i in x.split()]) \
.flatMap(lambda x : x) \
.reduceByKey(lambda x, y : x + y) \
.sortByKey(False) \
.map(lambda x : (x[0][1], x[1])) \
.collect()
Let's say you have your RDD like this:
val data = Seq("Let's have some fun.",
"To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)
Then simply apply flatMap and then map:
val res = rddData.flatMap(line => line.split(" ")).map(word => (word,1))
Expected Output
res.take(100)
res4: Array[(String, Int)] = Array((Let's,1), (have,1), (some,1), (fun.,1), (To,1), (have,1), (fun,1), (you,1), (don't,1), (need,1), (any,1), (plans.,1))

Get all possible combinations of 3 values from n possible elements

I'm trying to get the list of all the possible combinations of 3 elements from a list of 30 items. I tried to use the following code, but it fails throwing an OutOfMemoryError. Is there any alternative approach which is more efficient than this?
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).
  select("item_id").distinct.cache

items.take(1) // Compute cache before join

val itemCombinations = items.select($"item_id".alias("id_A")).
  join(
    items.select($"item_id".alias("id_B")), $"id_A".lt($"id_B")).
  join(
    items.select($"item_id".alias("id_C")), $"id_B".lt($"id_C"))
The approach seems OK but might generate quite some overhead at the query execution level. Given that n is a fairly small number, we could do it using the Scala implementation directly:
val localItems = items.collect
val combinations = localItems.combinations(3)
The result is an iterator that can be consumed one element at a time, without significant memory overhead.
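For example, to peek at a few combinations without materializing all of them (a small sketch, assuming item_id is a string column):

val ids = items.collect().map(_.getString(0)) // pull the (small) distinct id list to the driver
ids.combinations(3)                           // lazy iterator over all 3-element combinations
  .take(5)
  .foreach(c => println(c.mkString(", ")))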
Spark Version (edit)
Given the desire to make a Spark version of this, it could be possible to avoid the query planner (assuming that the issue is there) by dropping to the RDD level. This is basically the same expression as the join in the question:
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).select("item_id").rdd
// note: items is an RDD[Row] here; you may need to map it to the underlying id value first so that the x < y comparisons compile
val combinations = items.cartesian(items).filter{ case (x, y) => x < y }.cartesian(items).filter{ case ((x, y), z) => y < z }
Running an equivalent code in my local machine:
val data = List.fill(1000)(scala.util.Random.nextInt(999))
val rdd = sparkContext.parallelize(data)
val combinations = rdd.cartesian(rdd).filter{case(x,y) => x<y}.cartesian(rdd).filter{case ((x,y),z) => y<z}
combinations.count
// res5: Long = 165623528

Joining process with broadcast variable ends up endless spilling

I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams - doc1           3-grams - doc2
ion - 100772C111         ion - 200772C222
on  - 100772C111         gon - 200772C222
n   - 100772C111         n   - 200772C222
... - ....               ... - ....
ion - 3332145654         on  - 58898874
mju - 3332145654         mju - 58898874
... - ....               ... - ....
In each file, doc numbers (doc1 or doc2) appear one under the other. As a result of the join I would like to get the number of common 3-grams between the docs, e.g.
(100772C111-200772C222, 2) --> there are two common 3-grams, which are 'ion' and ' n'
The server on which I run my code has 128 GB RAM and 24 cores. In my IntelliJ run configuration I set the VM options to -Xmx64G.
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
  .set("spark.executor.memory", "40g")
  .set("spark.driver.memory", "40g")
val sc = new SparkContext(conf)

val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3), line.split("\t")(1))).distinct()

val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)

val joined = emp.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))

val olsun = joined.reduceByKey((a, b) => a + b)

olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
So, as seen, during the join process using the broadcast variable my key values change, so it seems I need to repartition the joined values, which is highly expensive. As a result, I ended up with too much spilling, and it never finished. I think 128 GB of memory should be sufficient. As far as I understood, when a broadcast variable is used, shuffling is decreased significantly? So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
again ending up with too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
  .set("spark.executor.memory", "50g")
  .set("spark.driver.memory", "50g")
val sc = new SparkContext(conf)

val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt").map { line => val s = line.split("\t"); (s(5), s(0)) } //.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt").map { line => val s = line.split("\t"); (s(3), s(1)) } //.distinct()

val cog = emp_new.cogroup(emp)

val skk = cog.flatMap {
  case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
    (l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y), 1) else ((y, x), 1) }.toList
}

val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark is first moving the ENTIRE dataset into the master node, a huge bottleneck and prone to produce memory errors on the master node. Then this piece of memory is shuffled back to ALL nodes (lots of network overhead), bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Also figure out whether you have too few keys. For joining, Spark basically needs to load all the values for a key into memory, and if your keys are too few that might still be too big a partition for any given executor.
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do (see the sketch after these steps):
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs; that gets you the 3-gram key and two lists of doc#.
Process those in a single Scala function, outputting a set of all unique doc-pairs.
Then do countByKey or countByKeyApprox to get a count of the number of distinct 3-grams for each doc pair.
Note: you can skip the .distinct() calls with this one. Also, you should not split every line twice. Change line => (line.split("\t")(3), line.split("\t")(1)) to line => { val s = line.split("\t"); (s(3), s(1)) }.
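A minimal sketch of those steps (illustrative only: the file paths and split indices are placeholders, adapt them to your actual layout):

// (3-gram, doc#) tuples from each file -- adjust the column indices to your data
val grams1 = sc.textFile("doc1.txt").map { line => val s = line.split("\t"); (s(0), s(1)) }
val grams2 = sc.textFile("doc2.txt").map { line => val s = line.split("\t"); (s(0), s(1)) }

// cogroup by 3-gram, emit every unique (doc1, doc2) pair that shares that 3-gram,
// then count how many 3-grams each pair has in common
val commonCounts = grams1.cogroup(grams2).flatMap { case (_, (docs1, docs2)) =>
  for (d1 <- docs1.toSet; d2 <- docs2.toSet) yield ((d1, d2), 1)
}.countByKey() // Map[(doc1, doc2), count]; use countByKeyApprox if this is too large for the driver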
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since they add up to 1.0). I should have seen that sooner but was focused on the problem itself.
Check the tuning guides here and here.
Also, if you are running it on a single machine, you might try with a single, huge executor (or even ditch Spark completely), as you don't need the overhead of a distributed processing platform (or distributed hardware failure tolerance, etc.).

My RDD changes its values by itself

I have a basic RDD[Object] to which I apply a map with a hash function on the Object values, using the Scala nextGaussian and nextDouble functions. When I print the values, they change with each print.
def hashmin(x: Data_Object, w: Double) = {
  val x1 = x.get_vector.toArray
  var a1 = Array(0.0).tail
  val b = Random.nextDouble * w
  for (ind <- 0 to x1.size - 1) {
    val nG = Random.nextGaussian
    a1 = a1 :+ nG
  }
  var sum = 0.0
  for (ind <- 0 to x1.size - 1) {
    sum = sum + (x1(ind) * a1(ind))
  }
  val hash_val = (sum + b) / w
  val hash_val1 = (x.get_id, hash_val)
  hash_val1
}
val w = 8
val rddhash = parsedData.map(x => hashmin(x,w))
rddhash.foreach(println)
rddhash.foreach(println)
I don't understand why. Thank you in advance.
RDDs are merely a "pointer" to the data + operations to be applied to it. Actions materialize those operations by executing the RDD lineage.
So, RDDs are basically recomputed when an action is requested. In this case, the map function calling hashmin is being evaluated every time the foreach action is called.
There are a few options:
Cache the RDD - this will cause the lineage to be broken and the results of the first transformation will be preserved:
val rddhash = parsedData.map(x => hashmin(x,w)).cache()
Use a seed for your random function, so that the pseudo-random sequence generated is the same each time.
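For the second option, a possible sketch (hypothetical: it assumes get_id is a stable identifier per record, as in the question's code):

// derive a deterministic seed from each record so the same record always gets
// the same pseudo-random sequence, even when the RDD is recomputed
def hashminSeeded(x: Data_Object, w: Double) = {
  val rnd = new scala.util.Random(x.get_id.hashCode) // assumption: get_id is stable
  val x1 = x.get_vector.toArray
  val b = rnd.nextDouble * w
  val a1 = Array.fill(x1.size)(rnd.nextGaussian)
  val sum = x1.zip(a1).map { case (xi, ai) => xi * ai }.sum
  (x.get_id, (sum + b) / w)
}

Then parsedData.map(x => hashminSeeded(x, w)) returns the same values on every action.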
RDDs are lazy - they're computed when they're used. So the calls to Random.nextGaussian are made again each time you call foreach.
You can use persist() to store an RDD if you want to keep fixed values.
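For example (a minimal sketch):

import org.apache.spark.storage.StorageLevel

val rddhash = parsedData.map(x => hashmin(x, w)).persist(StorageLevel.MEMORY_AND_DISK)
rddhash.foreach(println) // computed once and stored
rddhash.foreach(println) // served from the persisted copy, so the values no longer change between prints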