I need to write a complex User Defined Function (UDF) that takes multiple columns as input. Something like:
val uudf = udf { (value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p } // actually a more complex function, but let's keep it simple
The third parameter, cumsum_p, is a cumulative sum of p, where p is derived from the length of the group it is computed over, because this UDF will then be used in a group-by.
I came up with this solution, which is almost OK:
val uudf = udf { (value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p }
val w = Window.orderBy($"sale_qty")
df.withColumn("needThat",
uudf(col("sale_qty"),
lead("sale_qty",1).over(w), sum(lit(1/length_group)).over(w)
)
).show()
The problem is that if I replace lit(1/length_group) with lit(1/count("sale_qty")), the created column contains only one element, which leads to an error...
You should compute count("sale_qty") first:
val w = Window.orderBy($"sale_qty")
df
  .withColumn("cnt", count($"sale_qty").over())
  .withColumn("needThat",
    uudf(col("sale_qty"),
         lead("sale_qty", 1).over(w),
         sum(lit(1) / $"cnt").over(w))
  ).show()
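For completeness, here is a minimal, self-contained sketch of the same idea on toy data (assuming a SparkSession named spark, as in spark-shell; the UDF body is only a stand-in for the real, more complex function):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(3, 1, 4, 1, 5).toDF("sale_qty")   // toy data

val uudf = udf { (value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p }

val w = Window.orderBy($"sale_qty")
df
  .withColumn("cnt", count($"sale_qty").over())            // total number of rows, computed once
  .withColumn("needThat",
    uudf(col("sale_qty"),
         lead("sale_qty", 1).over(w),                      // lead() is null on the last row; mind how the UDF handles it
         sum(lit(1) / $"cnt").over(w)))                    // running sum of 1/cnt
  .show()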
Is there a way to count the number of word occurrences for each line of an RDD, rather than for the complete RDD, using map and reduce?
For example, if an RDD[String] contains these two lines:
Let's have some fun.
To have fun you don't need any plans.
then the output should be something like a map containing the key-value pairs:
("Let's",1)
("have",1)
("some",1)
("fun",1)
("To",1)("have",1)("fun",1)("you",1)("don't",1)("need",1)("plans",1)
Please don't use the RDD API if you've just started using Spark and no one has told you to use it. There's a much nicer and often more efficient Spark SQL API for this and many other distributed computations over large datasets in Spark.
Using the RDD API is like using assembler for something you could write in Scala (or another higher-level programming language). It's certainly too much to think about when starting your journey into Spark, which is why I'd personally recommend the higher-level Spark SQL API with DataFrames and Datasets in the first place.
Given the input:
$ cat input.txt
Let's have some fun.
To have fun you don't need any plans.
and assuming that you were to use the Dataset API, you could do the following:
val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line |words |
+-------------------------------------+------+
|Let's have some fun. |Let's |
|Let's have some fun. |have |
|Let's have some fun. |some |
|Let's have some fun. |fun. |
| | |
|To have fun you don't need any plans.|To |
|To have fun you don't need any plans.|have |
|To have fun you don't need any plans.|fun |
|To have fun you don't need any plans.|you |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need |
|To have fun you don't need any plans.|any |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+
scala> wordsPerLine.
groupBy("line", "words").
count.
withColumn("word_count", struct($"words", $"count")).
select("line", "word_count").
groupBy("line").
agg(collect_set("word_count")).
show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line |collect_set(word_count) |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun. |[[have,1], [fun.,1], [Let's,1], [some,1]] |
| |[[,1]] |
+-------------------------------------+------------------------------------------------------------------------------+
Done. Simple, isn't it?
See the functions object (for the explode and struct functions).
According to my understanding, you can do the following.
You said that you have RDD[String] data:
val data = Seq("Let's have some fun.",
"To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)
You can apply flatMap to split the lines and create (word, 1) tuples in the map function:
val output = rddData.flatMap(_.split(" ")).map(word => (word, 1))
that should give you your desired output
output.foreach(println)
To get occurrences by line, you should do the following:
val output = rddData
  .map(_.split(" ")
        .map((_, 1))
        .groupBy(_._1)
        .map { case (group: String, traversable) => traversable.reduce { (a, b) => (a._1, a._2 + b._2) } }
        .toList)
  .flatMap(tuple => tuple)
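A quick usage check of the per-line counts (ordering of the collected pairs may vary across runs):
output.collect().foreach(println)
// e.g.
// (Let's,1)
// (have,1)
// (some,1)
// (fun.,1)
// (To,1)
// ... one (word, count) entry per distinct word per line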
What you want is to transform a line into a Map(word, count). So you can define a function that counts words per line:
def wordsCount(line: String): Map[String, Int] = {
  line.split(" ").map(v => (v, 1)).groupBy(_._1).mapValues(_.size)
}
then just apply it to your RDD[String]:
val lines:RDD[String] = ...
val wordsByLineRDD:RDD[Map[String,Int]] = lines.map(wordsCount)
// this should give you a Map per line with count of each word
wordsByLineRDD.take(2)
// Something like
// Array(Map(some -> 1, have -> 1, Let's -> 1, fun. -> 1), Map(any -> 1, have -> 1, don't -> 1, you -> 1, need -> 1, fun -> 1, To -> 1, plans. -> 1))
Although it is an old question, I was looking for an answer to this in PySpark and finally managed it as below.
file_ = cont_.parallelize (
["shots are shots that are shots with more big shots by big people",
"people comes in all shapes and sizes, as people are idoits of the idiots",
"i know what i am writing is nonsense, but i don't care because i am doing this to test my spark program",
"my spark is a current spark, that spark in my eyes."]
)
file_ \
.map(lambda x : [((x, i), 1) for i in x.split()]) \
.flatMap(lambda x : x) \
.reduceByKey(lambda x, y : x + y) \
.sortByKey(False) \
.map(lambda x : (x[0][1], x[1])) \
.collect()
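For reference, here is a roughly equivalent Scala sketch of the same idea of keying by (line, word) (assuming a SparkContext named sc; the sample data follows the earlier question):
val lines = sc.parallelize(Seq(
  "Let's have some fun.",
  "To have fun you don't need any plans."))

lines
  .flatMap(line => line.split("\\s+").map(word => ((line, word), 1)))  // key by (line, word)
  .reduceByKey(_ + _)                                                  // count occurrences per line and word
  .map { case ((line, word), n) => (line, (word, n)) }                 // regroup by line if needed
  .collect()
  .foreach(println)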
Let's say you have your RDD like this:
val data = Seq("Let's have some fun.",
"To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)
Then simply apply flatMap and then map:
val res = rddData.flatMap(line => line.split(" ")).map(word => (word,1))
Expected Output
res.take(100)
res4: Array[(String, Int)] = Array((Let's,1), (have,1), (some,1), (fun.,1), (To,1), (have,1), (fun,1), (you,1), (don't,1), (need,1), (any,1), (plans.,1))
I'm trying to get the list of all possible combinations of 3 elements from a list of 30 items. I tried the following code, but it fails with an OutOfMemoryError. Is there an alternative approach that is more efficient than this?
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).
select("item_id").distinct.cache
items.take(1) // Compute cache before join
val itemCombinations = items.select($"item_id".alias("id_A")).
join(
items.select($"item_id".alias("id_B")), $"id_A".lt($"id_B")).
join(
items.select($"item_id".alias("id_C")), $"id_B".lt($"id_C"))
The approach seems OK but might generate quite some overhead at the query execution level. Given that n is a fairly small number, we could do it using the Scala implementation directly:
val localItems = items.collect
val combinations = localItems.combinations(3)
The result is an iterator that can be consumed one element at a time, without significant memory overhead.
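For instance, consuming a few of the C(30, 3) = 4060 triples (this prints Rows; map them to plain values first if you need bare ids):
// an Iterator can be traversed only once
combinations.take(5).foreach(c => println(c.mkString(", ")))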
Spark Version (edit)
Given the desire to make a Spark version of this, it could be possible to avoid the query planner (assuming that the issue is there) by dropping to the RDD level. This is basically the same expression as the join in the question:
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).select("item_id")
  .rdd.map(_.getAs[Int]("item_id")) // extract a comparable value; adjust the type to whatever item_id actually is
val combinations = items.cartesian(items).filter { case (x, y) => x < y }
  .cartesian(items).filter { case ((x, y), z) => y < z }
Running equivalent code on my local machine:
val data = List.fill(1000)(scala.util.Random.nextInt(999))
val rdd = sparkContext.parallelize(data)
val combinations = rdd.cartesian(rdd).filter{case(x,y) => x<y}.cartesian(rdd).filter{case ((x,y),z) => y<z}
combinations.count
// res5: Long = 165623528
I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams doc1 3-grams doc2
ion - 100772C111 ion - 200772C222
on - 100772C111 gon - 200772C222
n - 100772C111 n - 200772C222
... - .... ... - ....
ion - 3332145654 on - 58898874
mju - 3332145654 mju - 58898874
... - .... ... - ....
In each file, doc numbers (doc1 or doc2) appear one under the other. As a result of the join, I would like to get the number of common 3-grams between the docs, e.g.
(100772C111-200772C222,2) --> there are two common 3-grams, which are 'ion' and ' n'
The server on which I run my code has 128 GB RAM and 24 cores. In my IntelliJ run configuration, I set the VM options to -Xmx64G.
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
.set("spark.executor.memory","40g")
.set("spark.driver.memory","40g")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)
olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
As seen, during the join with the broadcast variable my keys change (they become doc pairs), so it seems I need to repartition the joined values, which is highly expensive. As a result I ended up with too much spilling, and the job never finished. I think 128 GB of memory should be sufficient. As far as I understood, shuffling should be reduced significantly when a broadcast variable is used, so what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
again ending up with too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
.set("spark.executor.memory","50g")
.set("spark.driver.memory","50g")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt").map{line => val s = line.split("\t"); (s(5),s(0))}//.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt").map{line => val s = line.split("\t"); (s(3),s(1))}//.distinct()
val cog = emp_new.cogroup(emp)
val skk = cog.flatMap {
case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
(l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y),1) else ((y, x),1) }.toList
}
val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark first moves the ENTIRE dataset to the driver, which is a huge bottleneck and prone to produce memory errors there. Then this piece of memory is broadcast to ALL nodes (lots of network overhead), bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Also check whether you have too few distinct keys. For a join, Spark needs to hold all the values for a given key in memory, and if your keys are too few, a single partition may still be too big for any given executor.
A separate note: reduceByKey will repartition anyway.
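As a hedged sketch of that plain-join approach (assuming emp and emp_new are the (3-gram, doc#) pair RDDs built as in the question):
val pairCounts = emp.join(emp_new)                      // (3-gram, (docFrom1, docFrom2))
  .map { case (_, (d1, d2)) => (s"$d1-$d2", 1) }        // one hit per shared 3-gram
  .reduceByKey(_ + _)                                   // number of shared 3-grams per doc pair
// then save or collect pairCounts, depending on its size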
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do:
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs; that gets you the 3-gram key and two lists of doc#s.
Process those in a single Scala function and output a set of all unique doc pairs.
Then do countByKey or countByKeyApprox to get the number of distinct 3-grams shared by each doc pair (see the sketch below).
Note: you can skip the .distinct() calls with this approach. Also, you should not split every line twice. Change line => (line.split("\t")(3), line.split("\t")(1)) to line => { val s = line.split("\t"); (s(3), s(1)) }.
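A hedged sketch of those four steps (file names and split indices follow the question's code; adjust them to your data):
val grams1 = sc.textFile("doc1.txt").map { line => val s = line.split("\t"); (s(3), s(1)) }  // (3-gram, doc#)
val grams2 = sc.textFile("doc2.txt").map { line => val s = line.split("\t"); (s(3), s(1)) }  // (3-gram, doc#)

val sharedCounts = grams1.cogroup(grams2)                      // (3-gram, (doc#s from file 1, doc#s from file 2))
  .flatMap { case (_, (docs1, docs2)) =>
    for (d1 <- docs1.toSet; d2 <- docs2.toSet) yield ((d1, d2), 1)   // unique doc pairs sharing this 3-gram
  }
  .countByKey()                                                // Map[(String, String), Long]: shared 3-grams per doc pair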
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since they add up to 1.0). I should have seen that sooner but was focused on the problem itself.
Check the tuning guides here and here.
Also, if you are running this on a single machine, you might try a single, huge executor (or even ditch Spark completely), as you don't need the overhead of a distributed processing platform (or its tolerance of distributed hardware failures, etc.).
I have a basic RDD[Object] to which I apply a map with a hash function over the object values, using the Scala nextGaussian and nextDouble functions. When I print the values, they change at each print.
def hashmin(x: Data_Object, w: Double) = {
  val x1 = x.get_vector.toArray
  var a1 = Array(0.0).tail
  val b = Random.nextDouble * w
  for (ind <- 0 to x1.size - 1) {
    val nG = Random.nextGaussian
    a1 = a1 :+ nG
  }
  var sum = 0.0
  for (ind <- 0 to x1.size - 1) {
    sum = sum + (x1(ind) * a1(ind))
  }
  val hash_val = (sum + b) / w
  val hash_val1 = (x.get_id, hash_val)
  hash_val1
}
val w = 8
val rddhash = parsedData.map(x => hashmin(x,w))
rddhash.foreach(println)
rddhash.foreach(println)
I don't understand why. Thank you in advance.
RDDs are merely a "pointer" to the data + operations to be applied to it. Actions materialize those operations by executing the RDD lineage.
So, RDDs are basically recomputed when an action is requested. In this case, the map function calling hashmin is being evaluated every time the foreach action is called.
There are a few options:
Cache the RDD - this way the results of the first computation are kept and the lineage is not re-executed:
val rddhash = parsedData.map(x => hashmin(x,w)).cache()
Use a seed for your random function, so that the generated pseudo-random sequence is the same each time.
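A minimal sketch of that seeding idea, assuming the seed can be derived from the record's own id (Data_Object and its accessors are the question's types):
import scala.util.Random

def hashminSeeded(x: Data_Object, w: Double) = {
  val rnd = new Random(x.get_id.hashCode)            // deterministic per record, stable across recomputations
  val x1  = x.get_vector.toArray
  val b   = rnd.nextDouble * w
  val a1  = Array.fill(x1.length)(rnd.nextGaussian)
  val sum = x1.zip(a1).map { case (v, g) => v * g }.sum
  (x.get_id, (sum + b) / w)
}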
RDDs are lazy - they're computed when they're used. So the calls to Random.nextGaussian are made again each time you call foreach.
You can use persist() to store the RDD if you want to keep the values fixed.
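For example, a small sketch (MEMORY_ONLY is the default storage level; recomputation can still happen if cached partitions are evicted):
import org.apache.spark.storage.StorageLevel

val rddhash = parsedData.map(x => hashmin(x, w)).persist(StorageLevel.MEMORY_ONLY)
rddhash.foreach(println)   // first action computes and caches the values
rddhash.foreach(println)   // second action reuses the cached values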