I am currently trying to enrich data for machine learning with requests per minute. The data is stored in a Kafka topic, and on application start the whole content of the topic is loaded and processed. Therefore, as far as I know, it is not possible to use any of Spark Streaming's window operations, because all the data arrives at the same time.
My approach was to try the following:
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = kMeansInformationRdd.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).count()
(duration, rpm)
})
However, with this approach I get the following exception:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
Is there a way to achieve what I want to do?
If you need any more information just drop me a comment and I will update what you need.
EDIT:
Broadcasting the RDD does not work. Broadcasting the collected array does not yield acceptable performance.
The following runs, but is horribly slow and therefore not really an option:
val collected = kMeansInformationRdd.collect()
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = collected.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
(duration, rpm)
})
UPDATE:
This code at least gets the job done much faster. However, as far as I can see, it still gets slower the higher the requests per minute are, since the filtered array grows. Interestingly, it also gets slower towards the end, and I cannot figure out why. If someone sees the issue, or performance problems that could generally be improved, I would be happy if you let me know.
kMeansInformationRdd = kMeansInformationRdd.cache()
kMeansInformationRdd = kMeansInformationRdd.sortBy(_._2, true) //sortBy returns a new RDD, so the result has to be reassigned
var kMeansFeatureArray: Array[(String, Long, Long)] = Array()
var buffer: collection.mutable.Map[String, Array[Long]] = collection.mutable.Map()
var counter = 0
kMeansInformationRdd.collect.foreach(x => {
val ts = x._2
val identifier = x._1 //make sure the identifier represents actually the entity that receives the traffic -e.g. machine (IP?) not only endpoint
var bufferInstance = buffer.get(identifier).getOrElse(Array[Long]())
bufferInstance = bufferInstance ++ Array(ts)
val instanceSizeBefore = bufferInstance.length
bufferInstance = bufferInstance.filter(p => p > ts - 60000) //keep only timestamps within the last minute (60,000 ms)
val instanceSizeAfter = bufferInstance.length
buffer.put(identifier, bufferInstance)
val rpm = bufferInstance.size.toLong
kMeansFeatureArray = kMeansFeatureArray ++ Array((identifier, x._3, rpm)) //identifier, duration, rpm
counter = counter +1
if(counter % 10000==0){
println(counter)
println((identifier, x._3, rpm))
println((instanceSizeBefore, instanceSizeAfter))
}
})
val kMeansFeatureRdd = sc.parallelize(kMeansFeatureArray)
The code given in the EDIT section is not correct: that is not the way a variable is broadcast in Spark. The correct way is as follows:
val collected = sc.broadcast(kMeansInformationRdd.collect())
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
val begin = x._2 //Long - unix timestamp millis
val duration = x._3 //Long
val rpm = collected.value.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
(duration, rpm)
})
Of course, you can use a global variable instead of sc.broadcast, but that is not recommended. Why?
The reason is the difference between using an external variable DIRECTLY (my so-called "global variable") and BROADCASTING it with sc.broadcast():
When the external variable is used directly, Spark sends a copy of the serialized variable along with each TASK, whereas with sc.broadcast the variable is sent once per EXECUTOR. The number of tasks is normally around 10 times larger than the number of executors, so when the variable (say an array) is large enough (more than 10-20K), the former can cost a lot of time in network transfer and cause frequent GC, which slows Spark down. Hence a large variable (>10-20K) should be broadcast explicitly.
When the external variable is used directly, it is not persisted: it ends with the task and cannot be reused. With sc.broadcast(), the variable is automatically persisted in the executors' memory and lasts until you explicitly unpersist it, so a broadcast variable is available across tasks and stages.
So if the variable is expected to be used multiple times, sc.broadcast() is recommended.
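To illustrate that lifecycle, a minimal sketch (the RDD contents and names here are purely illustrative):
val smallMap = sc.parallelize(Seq(("a", 1), ("b", 2))).collectAsMap()
val lookup = sc.broadcast(smallMap) // one copy is shipped to each executor
val bigRdd = sc.parallelize(Seq("a", "b", "c", "a"))
val enriched = bigRdd.map(k => (k, lookup.value.getOrElse(k, 0))) // the same copy is reused across tasks and stages
enriched.count()
lookup.unpersist() // frees the executor-side copies; lookup.destroy() would also remove the driver-side value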
Related
I'm writing a program that works with big data, which is why I'm using Spark and Scala. I need to partition the database, and for this I use
var data0 = conf.dataBase.repartition(8).persist(StorageLevel.MEMORY_AND_DISK_SER)
but then I need to do things within each partition before working with the piece of the database corresponding to that partition, and for that I use
var tester = data0.mapPartitions { x =>
  configFuzzyPredProblem()
  Strategy.getStrategy.executeStrategy(conf.iterByRun, 5, GeneratorType.HillClimbing)
}.persist(StorageLevel.MEMORY_AND_DISK_SER)
Within the method executeStrategy() I use the database, but I do not know whether it is the global one or the one corresponding to that partition. How can I tell which one I am using, so that I only process the piece of the database belonging to that partition?
Here is a simple example using mapPartitionsWithIndex, which follows the same rules as mapPartitions - excluding the index aspect.
You can see that inside mapPartitions you need to process an iterable, an Iterator[Int] in this example. In this case 3 partitions are processed; in your case 8, each with some entries or possibly zero entries.
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
iter.map(x => index + "," + x)
}
val rdd2 = rdd1.mapPartitionsWithIndex(myfunc)
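For illustration, collecting rdd2 shows which partition each element ended up in; the exact split depends on how parallelize slices the list, so the output indicated below is only what I would expect for 10 elements in 3 partitions:
rdd2.collect().foreach(println)
// prints one "index,element" line per element, e.g. 0,1 0,2 0,3 1,4 1,5 1,6 2,7 2,8 2,9 2,10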
I cannot see inside your function, but I assume it is OK and that it will process one partition - one part of your database.
I have an RDD[(Int, Array[Double], Double, Double)].
val full_data = rdd.map(row => {
val label = row._1
val feature = row._2.map(_.toDouble)
val QD = k_function(feature)
val alpha = 0.0
(label,feature,QD,alpha)
})
Now I want to update the value of alpha in each record (say, to 10):
var tmp = full_data.map( x=> {
x._4 = 10
})
I got the error
Error: reassignment to val
x._4 = 10
I have changed all the vals to vars, but the error still occurs. How can I update the value of alpha? And I would also like to know how to update a full row or a specific row in an RDD.
RDDs are immutable by nature. They are made so for easy caching, sharing and replication. In a multi-threaded system like Spark, it is always safer to copy than to mutate, for fault tolerance and correctness of processing; recreating immutable data is much easier than tracking mutable data.
A transformation is like copying the RDD data to another RDD. Every variable is treated as a val, i.e. immutable, so if you want to replace the last double with 10, what you can do is
var tmp = full_data.map( x=> {
(x._1, x._2, x._3, 10)
})
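The same copy-instead-of-mutate pattern also answers the "specific row" part of the question: build a new tuple only where some condition holds and copy the rest unchanged. A sketch, assuming you select rows by their label:
val updated = full_data.map {
  case (label, feature, qd, alpha) =>
    if (label == 1) (label, feature, qd, 10.0) // new alpha only for the selected rows
    else (label, feature, qd, alpha)           // unchanged copy otherwise
}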
I want to time my Spark program execution speed but due to laziness it's quite difficult. Let's take into account this (meaningless) code here:
var graph = GraphLoader.edgeListFile(context, args(0))
val graph_degs = graph.outerJoinVertices(graph.degrees)((vid, attr, deg) => deg.getOrElse(0)).triplets.cache // outerJoinVertices needs a mapping function; this placeholder keeps each vertex's degree
/* I'd need to start the timer here */
val t1 = System.currentTimeMillis
val edges = graph_degs.flatMap(trip => { /* do something*/ })
.union(graph_degs)
val count = edges.count
val t2 = System.currentTimeMillis
/* I'd need to stop the timer here */
println("It took " + t2-t1 + " to count " + count)
The thing is, transformations are lazy, so nothing gets evaluated before the val count = edges.count line. But from my point of view t1 gets a value even though the code above it has not produced a value yet... the code above t1 actually gets evaluated after the timer has started, regardless of its position in the code. That's a problem...
In the Spark Web UI I can't find anything useful about this, since I need the time spent after that specific line of code. Do you think there is an easy way to see when a group of transformations actually gets evaluated?
Since consecutive transformations (within the same task - meaning they are not separated by shuffles and are performed as part of the same action) are executed as a single "step", Spark does not measure them individually. And from driver code, you can't either.
What you can do is measure the duration of applying your function to each record, and use an Accumulator to sum it all up, e.g.:
// create accumulator
val durationAccumulator = sc.longAccumulator("flatMapDuration")
// "wrap" your "doSomething" operation with time measurement, and add to accumulator
val edges = rdd.flatMap(trip => {
val t1 = System.currentTimeMillis
val result = doSomething(trip)
val t2 = System.currentTimeMillis
durationAccumulator.add(t2 - t1)
result
})
// perform the action that would trigger evaluation
val count = edges.count
// now you can read the accumulated value
println("It took " + durationAccumulator.value + " to flatMap " + count)
You can repeat this for any individual transformation.
Disclaimers:
Of course, this will not include the time Spark spent shuffling things around and doing the actual counting - for that, indeed, the Spark UI is your best resource.
Note that accumulators are sensitive to things like retries - a retried task will update the accumulator twice.
Style Note:
You can make this code more reusable by creating a measure function that "wraps" around any function and updates a given accumulator:
// write this once (LongAccumulator comes from org.apache.spark.util):
import org.apache.spark.util.LongAccumulator

def measure[T, R](action: T => R, acc: LongAccumulator): T => R = input => {
val t1 = System.currentTimeMillis
val result = action(input)
val t2 = System.currentTimeMillis
acc.add(t2 - t1)
result
}
// use it with any transformation:
rdd.flatMap(measure(doSomething, durationAccumulator))
The Spark Web UI records every single action, and even reports the times of every stage of that action - it's all in there! You need to look through the Stages tab, not the Jobs tab. I've found it's only usable, though, if you compile and submit your code; it is useless in the REPL. Are you using the REPL by any chance?
I am joining two RDDs from text files in standalone mode. One has 400 million rows (9 GB), and the other has 4 million rows (110 KB).
3-grams doc1 3-grams doc2
ion - 100772C111 ion - 200772C222
on - 100772C111 gon - 200772C222
n - 100772C111 n - 200772C222
... - .... ... - ....
ion - 3332145654 on - 58898874
mju - 3332145654 mju - 58898874
... - .... ... - ....
In each file, the doc numbers (doc1 or doc2) appear one under the other. As a result of the join I would like to get the number of common 3-grams between the docs, e.g.
(100772C111-200772C222, 2) --> there are two common 3-grams, which are 'ion' and ' n'
The server on which I run my code has 128 GB RAM and 24 cores. I set the VM options of my IntelliJ run configuration to -Xmx64G.
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
.set("spark.executor.memory","40g")
.set("spark.driver.memory","40g")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)
olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
So, as seen above, my key values change during the join that uses the broadcast variable, so it seems I need to repartition the joined values, which is highly expensive. As a result, I ended up with a massive spilling problem and the job never finished. I think 128 GB of memory should be sufficient. As far as I understood, shuffling should decrease significantly when a broadcast variable is used. So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
again ending up with too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
.set("spark.executor.memory","50g")
.set("spark.driver.memory","50g")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt").map{line => val s = line.split("\t"); (s(5),s(0))}//.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt").map{line => val s = line.split("\t"); (s(3),s(1))}//.distinct()
val cog = emp_new.cogroup(emp)
val skk = cog.flatMap {
case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
(l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y),1) else ((y, x),1) }.toList
}
val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark first moves the ENTIRE dataset to the driver, which is a huge bottleneck and prone to produce memory errors there. Then this piece of memory is copied out to ALL nodes (lots of network overhead), which is bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Also figure out whether you have too few keys. For a join, Spark basically needs to load all the values for a key into memory, and if you have too few keys, that might still be too big a partition for any given executor.
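One quick way to check that (a sketch using the RDD names from the question; note that countByKey brings the per-key counts to the driver, so only do this if the number of distinct keys is manageable):
val distinctKeys = emp.keys.countApproxDistinct() // rough number of distinct join keys
val heaviest = emp.countByKey().toSeq.sortBy(-_._2).take(10) // the 10 most frequent keys and their counts
println(s"distinct keys: $distinctKeys, heaviest: $heaviest")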
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do (a sketch follows the steps):
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs; that gets you the 3-gram key and two lists of doc#s.
Process those in a single Scala function and output the set of all unique (doc1, doc2) pairs.
Then do countByKey or countByKeyApprox to get a count of the number of distinct 3-grams for each doc pair.
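A minimal sketch of those steps, reusing the field indices and variable names from the question (the file paths are placeholders, and this assumes the per-3-gram doc lists are small):
val emp = sc.textFile("doc1.txt").map { line => val s = line.split("\t"); (s(3), s(1)) } // (3-gram, doc#)
val emp_new = sc.textFile("doc2.txt").map { line => val s = line.split("\t"); (s(3), s(1)) }
val cogrouped = emp.cogroup(emp_new) // (3-gram, (docs from file 1, docs from file 2))
val commonGramCounts = cogrouped.flatMap { case (_, (docs1, docs2)) =>
  for (d1 <- docs1; d2 <- docs2) yield ((d1, d2), 1)
}.countByKey() // Map[(doc1, doc2), number of shared 3-grams]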
Note: you can skip the .distinct() calls with this approach. Also, you should not split every line twice: change line => (line.split("\t")(3), line.split("\t")(1)) to line => { val s = line.split("\t"); (s(3), s(1)) }.
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since the two fractions add up to 1.0). I should have seen that sooner, but I was focused on the problem itself.
Check the tuning guides here and here.
Also, since you are running on a single machine, you might try a single, huge executor (or even ditch Spark completely), as you don't need the overhead of a distributed processing platform (or its tolerance to distributed hardware failures, etc.).
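For instance, in local mode the configuration could be reduced to something like this (a sketch; in local mode there is no separate executor process, so spark.executor.memory has no effect and the heap is simply the driver JVM's heap, i.e. your -Xmx setting or --driver-memory with spark-submit):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("abdulhay")
  .setMaster("local[*]") // one JVM using all 24 cores
val sc = new SparkContext(conf)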
In the Spark core "example" directory (I am using Spark 1.2.0), there is an example called "SparkPageRank.scala",
val sparkConf = new SparkConf().setAppName("PageRank")
val iters = if (args.length > 0) args(1).toInt else 10
val ctx = new SparkContext(sparkConf)
val lines = ctx.textFile(args(0), 1)
val links = lines.map{ s =>
val parts = s.split("\\s+")
(parts(0), parts(1))
}.distinct().groupByKey().cache()
var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
val contribs = links.join(ranks).values.flatMap{ case (urls, rank) =>
val size = urls.size
urls.map(url => (url, rank / size))
}
ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
val output = ranks.collect()
ctx.stop()
}
I realize that in this example the lineage keeps extending after each iteration. As a result, when I monitor the directory that holds the shuffle data, the shuffle data storage keeps increasing after each iteration.
How should I structure the application code so that the ContextCleaner's doCleanupShuffle is activated at certain intervals (say, every few iterations), so that I can prevent the ever-increasing shuffle data storage for a computation that takes many iterations?
Apparently the clean-up of the shuffle files happens when the objects used for the shuffle are GCed. Since your snippet is a simple PageRank example, I assume you are running it on a very small dataset, so the heap never fills up enough to trigger a GC and the objects are never collected. Try with a bigger file, or trigger the GC manually (although that is usually not recommended).
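If you want to experiment with forcing it, a sketch would be to hint a GC on the driver every few iterations so that the ContextCleaner notices the dereferenced shuffle objects (i and iters refer to the loop in your snippet):
// inside the iteration loop of the example:
if (i % 5 == 0) {
  System.gc() // hint the driver JVM to collect; shuffles of GCed RDDs become eligible for cleanup
}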
More information on : https://github.com/apache/spark/pull/5074/files
By the way, in your example it would be more efficient to partition the data instead of shuffling it on every iteration.
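A sketch of what that could look like in the example, assuming a HashPartitioner: pre-partitioning links and deriving ranks via mapValues keeps the two RDDs co-partitioned, so the join inside the loop does not have to re-shuffle links on every iteration.
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(ctx.defaultParallelism)
val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct().groupByKey(partitioner).cache() // fix the partitioning once, up front

var ranks = links.mapValues(v => 1.0) // mapValues preserves links' partitioner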