Data-processing takes too long if pre-processing just before - scala

I've been trying to perform a cumulative sum (cumsum) on a data set. I want the cumsum to be computed per partition of the data set (e.g. the cumsum of feature1 over time for personA).
I know how to do it, and it works perfectly on its own; I'll explain that part later. Here's the piece of code doing it:
// assume this DF contains all the data I need,
// with one column per possible value and only 1/0 on each row:
// 1 <-> the feature has the value
// 0 <-> the feature doesn't have the value
// this DF is the one I get after the one-hot operation
// this operation is performed to apply ML algorithms on features
// having simultaneously multiple values
df_after_onehot.createOrReplaceTempView("test_table")
// @param values DataFrame containing all possible values, e.g. A, B, C
def cumSumForFeatures(values: DataFrame) = {
values
.map(value => "CAST(sum(" + value(0) + ") OVER (PARTITION BY person ORDER BY date) as Integer) as sum_" + value(0))
.reduce(_+ ", " +_)
}
val req = "SELECT *, " + cumSumForFeatures(possible_segments) + " FROM test_table"
// val req = "SELECT * FROM test_table"
println("executing: " + req)
val data_after_cumsum = sqLContext.sql(req).orderBy("person", "date")
data_after_cumsum.show(10, false)
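To make the intended result concrete, here is a minimal plain-Scala sketch of a per-person cumulative sum, without Spark. The `Obs` case class and its fields are hypothetical stand-ins for one row of the one-hot DataFrame; this only illustrates what `sum(...) OVER (PARTITION BY person ORDER BY date)` is expected to produce, not how Spark executes it.

```scala
object CumSumSketch {
  // Hypothetical row type: one person, one date, one 0/1 feature flag.
  case class Obs(person: String, date: Int, feature: Int)

  // Running total of `feature` per person, ordered by date.
  // Persons are emitted in alphabetical order to keep the output predictable.
  def cumSum(rows: Seq[Obs]): Seq[(Obs, Int)] =
    rows
      .groupBy(_.person)
      .toSeq
      .sortBy(_._1)
      .flatMap { case (_, group) =>
        val ordered = group.sortBy(_.date)
        // scanLeft yields the seed 0 first, so drop it with tail.
        val totals = ordered.scanLeft(0)((acc, o) => acc + o.feature).tail
        ordered.zip(totals)
      }
}
```

Each person's rows are handled independently, which is why the window in the query is partitioned by person.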
The problem happens when I try to perform the same operation after some pre-processing (like the one-hot operation, or adding computed features). I tried with a very small data set and it still doesn't work.
Here is the printed stack trace (the relevant part, at least):
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[Executor task launch worker-3] ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
So it seems to be related to a GC issue or the JVM heap size? I just don't understand how it's related to my pre-processing.
I tried calling unpersist on DataFrames that are no longer used.
I tried modifying the JVM options on my machine (e.g. -Xmx2048m).
The issue is the same once I deploy on AWS.
Extract of my pom.xml (for versions of Java, Spark, Scala):
<spark.version>2.1.0</spark.version>
<scala.version>2.10.4</scala.version>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
Would you know how I could fix my issue?
Thanks

From what I understand, there could be three reasons for this:
the JVM heap overflows because of DataFrames that are kept in memory but no longer used
the cumsum request could be too big to be processed with the small amount of RAM left
show/print operations increase the number of steps necessary for the job, and may interfere with Spark's internal optimizations
Considering that, I decided to "unpersist" no-longer-used DataFrames. That did not seem to change much.
Then, I decided to remove all unnecessary show/print operations. That reduced the number of steps considerably.
I changed my code to be more functional, but I kept 3 separate values to help debugging. That did not change much, but my code is cleaner.
Finally, here is what helped me deal with the problem. Instead of making my request go through the data set in one pass, I partitioned the list of features into slices:
def listOfSlices[T](list: List[T], sizeOfSlices: Int): List[List[T]] =
(for (i <- 0 until list.length by sizeOfSlices) yield list.slice(i, i+sizeOfSlices)).toList
I perform the request for each slice with a map operation, then join the results together to get my final DataFrame. That way, I kind of distribute the computation, and it seems to be more efficient.
val possible_features_slices = listOfSlices[String](possible_features, 5)
val df_cum_sum = possible_features_slices
.map(possible_features_slice =>
dfWithCumSum(sqLContext, my_df, possible_features_slice, "feature", "time")) // operation described in the original post
.foldLeft[DataFrame](null)((a, b) => if (a == null) b else if (b == null) a else a.join(b, Seq("person", "list_features", "time")))
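As a side note, the slicing and the null-seeded fold can be written with standard-library calls. The sketch below is plain Scala and makes no claim about performance; `combineAll` is a hypothetical helper name, and its `combine` argument stands in for the DataFrame join used above.

```scala
object SliceSketch {
  // Same result as the hand-written listOfSlices, using `grouped`.
  def listOfSlices[T](list: List[T], sizeOfSlices: Int): List[List[T]] =
    list.grouped(sizeOfSlices).toList

  // Combine per-slice results pairwise without a null seed.
  // Returns None when there are no slices at all.
  def combineAll[A](parts: List[A])(combine: (A, A) => A): Option[A] =
    parts.reduceOption(combine)
}
```

With DataFrames, `combine` would be something like `(a, b) => a.join(b, Seq("person", "list_features", "time"))`, reusing the join keys from the snippet above.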
I really want to emphasize that I still do not understand the reason behind my problem, and I am still hoping for an answer on that point.

Related

10 most common female first names - order changes

I am working through the exercise in Databricks, and the code below returns firstName in a different order every time I run it. Please explain why the order is not the same for every run:
val peopleDF = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")
id:integer
firstName:string
middleName:string
lastName:string
gender:string
birthDate:timestamp
ssn:string
salary:integer
/* Create a DataFrame called top10FemaleFirstNamesDF that contains the 10 most common female first names out of the people data set.*/
import org.apache.spark.sql.functions.count
val top10FemaleFirstNamesDF_1 = peopleDF
  .filter($"gender" === "F")
  .groupBy($"firstName")
  .agg(count($"firstName").alias("cnt_firstName"))
  .withColumn("cnt_firstName", $"cnt_firstName".cast("Int"))
  .sort($"cnt_firstName".desc)
  .limit(10)
val top10FemaleNamesDF = top10FemaleFirstNamesDF_1.orderBy($"firstName")
In some runs the assertion passes and in others it fails:
lazy val results = top10FemaleNamesDF.collect()
dbTest("DF-L2-names-0", Row("Alesha", 1368), results(0))
// dbTest("DF-L2-names-1", Row("Alice", 1384), results(1))
// dbTest("DF-L2-names-2", Row("Bridgette", 1373), results(2))
// dbTest("DF-L2-names-3", Row("Cristen", 1375), results(3))
// dbTest("DF-L2-names-4", Row("Jacquelyn", 1381), results(4))
// dbTest("DF-L2-names-5", Row("Katherin", 1373), results(5))
// dbTest("DF-L2-names-6", Row("Lashell", 1387), results(6))
// dbTest("DF-L2-names-7", Row("Louie", 1382), results(7))
// dbTest("DF-L2-names-8", Row("Lucille", 1384), results(8))
// dbTest("DF-L2-names-9", Row("Sharyn", 1394), results(9))
println("Tests passed!")
The problem might be the limit 10. Due to the distributed nature of Spark, you can't assume that limit will give you the same result every time it runs. Spark might pick a different partition in different runs to give you 10 elements.
If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition.
However, I do realize you are sorting the data first and then limiting it. The limit function is supposed to return deterministic results when the underlying RDD is sorted. It might be non-deterministic for unsorted data.
It would be worthwhile to look at the physical plan of your query.
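One possible cause, which this thread does not confirm, is a tie in the counts at the cutoff: if two names share the count that decides the last place, sorting by count alone leaves their relative order open, and limit(10) may keep either one. Here is a minimal plain-Scala sketch of that effect, with made-up names and counts (hypothetical data, not the people-10m set). In Spark, the analogous change would be a second sort key, e.g. `.sort($"cnt_firstName".desc, $"firstName")`.

```scala
object TopNSketch {
  // Top n by count only. Scala's sortBy is stable, so rows tied on the count
  // keep their input order and the result depends on how the input arrived.
  def topByCount(rows: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
    rows.sortBy { case (_, cnt) => -cnt }.take(n)

  // Top n with the name as a tie-breaker: the result no longer depends
  // on the order of the input rows.
  def topByCountThenName(rows: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
    rows.sortBy { case (name, cnt) => (-cnt, name) }.take(n)
}
```

Feeding the same rows in two different orders (standing in for partitions being read in a different order) gives different results for `topByCount` whenever the tie falls on the cutoff, while `topByCountThenName` returns the same rows both times.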

How to create WindowSpec to count rows per type before and after the current row?

I have had to implement an event-centric windowing batch job, with a varying number of event names.
The rule is as follows: for a certain event, every time it occurs, we sum all other events within certain time windows.
action1 00:01
action2 00:02
action1 00:03
action3 00:04
action3 00:05
For the above dataset, it should be:
window_before: Map(action1 -> 1)
window_after: Map(action1 -> 1, action3 -> 2)
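To pin down the expected output, here is a minimal plain-Scala sketch of the rule, reading the example as describing the action2 event at 00:02 (that reading is an assumption). The `Event` case class, the `windows` function, and the integer `time` ticks are hypothetical; this is not the UDAF discussed below, only a reference for what it should compute.

```scala
object WindowCountSketch {
  // Hypothetical event type; `time` is an abstract tick (e.g. minutes).
  case class Event(name: String, time: Int)

  // Number of occurrences per event name.
  private def countByName(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.name).map { case (name, group) => name -> group.size }

  // Counts per event name strictly before and strictly after `pivot`,
  // looking at most `range` ticks in each direction.
  def windows(events: Seq[Event], pivot: Event, range: Int): (Map[String, Int], Map[String, Int]) = {
    val before = events.filter(e => e.time < pivot.time && e.time >= pivot.time - range)
    val after = events.filter(e => e.time > pivot.time && e.time <= pivot.time + range)
    (countByName(before), countByName(after))
  }
}
```

Because the result is a map keyed by event name, the number of distinct names does not need to be known in advance, which is the property the UDAF is meant to preserve.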
In order to achieve this, we use WindowSpec and a custom udaf that aggregates all counters into a map. The udaf is necessary because the number of action names is completely arbitrary.
Of course at first, the UDAF used Spark's catalyst converters, which was horrendously slow.
Now I've reached what I think is a decent optimum, where I just maintain an array of keys and values with immutable lists (lower GC times, lower iterator overhead) all serialized as binary, so the Scala runtime handles boxing/unboxing and not Spark, using byte arrays instead of strings.
The problem is that some stragglers are very problematic, and the workload cannot be parallelized, unlike when we had a static number of columns and were just summing/counting numeric columns.
I tried to test another technique where I created a number of columns equal to the max cardinality of events and then aggregated back to a map, but the number of columns in the projection was simply killing spark (think a thousand columns easily).
One of the problems is the huge stragglers: most of the time a single partition (something like userid, app) takes 100 times longer than the median, even though everything is properly repartitioned.
Has anyone else run into a similar problem?
Example WindowSpec:
val windowSpec = Window
.partitionBy($"id", $"product_id")
.orderBy("time")
.rangeBetween(-30L * 24 * 60 * 60, -1) // previous 30 days, assuming "time" is in epoch seconds
then
df.withColumn("over30days", myUdaf($"name", $"count").over(windowSpec))
A naive version of the UDAF:
// Pseudocode sketch: the remaining UserDefinedAggregateFunction overrides are omitted
class UDAF[A] {
private var zero: A = ev.zero
val dt = schemaFor[A].dataType
override def bufferSchema: StructType =
StructType(StructField("actions", MapType(StringType, dt)) :: Nil)
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val name = input.get(0)
val count = input.get(1)
buffer.update(name, buffer.getOrElse(name, ev.zero) + count)
}
}
My current version is less readable than the naive version above but effectively does the same thing, using two binary arrays to circumvent CatalystConverters.

spark No space left on device when working on extremely large data

The following is my Scala Spark code:
val vertex = graph.vertices
val edges = graph.edges.map(v=>(v.srcId, v.dstId)).toDF("key","value")
var FMvertex = vertex.map(v => (v._1, HLLCounter.encode(v._1)))
var encodedVertex = FMvertex.toDF("keyR", "valueR")
var Degvertex = vertex.map(v => (v._1, 0.toLong))
var lastRes = Degvertex
//calculate FM of the next step
breakable {
for (i <- 1 to MaxIter) {
var N_pre = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
var adjacency = edges.join(
encodedVertex,//FMvertex.toDF("keyR", "valueR"),
$"value" === $"keyR"
).rdd.map(r => (r.getAs[VertexId]("key"), r.getAs[Array[Byte]]("valueR"))).reduceByKey((a,b)=>HLLCounter.Union(a,b))
FMvertex = FMvertex.union(adjacency).reduceByKey((a,b)=>HLLCounter.Union(a,b))
// update the vertex encoding
encodedVertex = FMvertex.toDF("keyR", "valueR")
var N_curr = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
lastRes = N_curr
var middleAns = N_curr.union(N_pre).reduceByKey((a,b)=>Math.abs(a-b))//.mapValues(x => x._1 - x._2)
if (middleAns.values.sum() == 0){
println(i)
break
}
Degvertex = Degvertex.join(middleAns).mapValues(x => x._1 + i * x._2)//.map(identity)
}
}
val res = Degvertex.join(lastRes).mapValues(x => x._1.toDouble / x._2.toDouble)
return res
In which I use several functions I defined in Java:
import net.agkn.hll.HLL;
import com.google.common.hash.*;
import com.google.common.hash.Hashing;
import java.io.Serializable;
public class HLLCounter implements Serializable {
private static int seed = 1234567;
private static HashFunction hs = Hashing.murmur3_128(seed);
private static int log2m = 15;
private static int regwidth = 5;
public static byte[] encode(Long id) {
HLL hll = new HLL(log2m, regwidth);
Hasher myhash = hs.newHasher();
hll.addRaw(myhash.putLong(id).hash().asLong());
return hll.toBytes();
}
public static byte[] Union(byte[] byteA, byte[] byteB) {
HLL hllA = HLL.fromBytes(byteA);
HLL hllB = HLL.fromBytes(byteB);
hllA.union(hllB);
return hllA.toBytes();
}
public static long decode(byte[] bytes) {
HLL hll = HLL.fromBytes(bytes);
return hll.cardinality();
}
}
This code calculates Effective Closeness on a large graph, and I used a HyperLogLog package.
The code works fine when I run it on a graph with about ten million vertices and a hundred million edges. However, when I ran it on a graph with thousands of millions of vertices and billions of edges, after several hours running on the cluster, it shows:
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 29.1 failed 4 times, most recent failure: Lost task 91.3 in stage 29.1 (TID 17065, 9.10.135.216, executor 102): java.io.IOException: : No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
Can anybody help me? I have only been using Spark for a few days. Thank you for helping.
Xiaotian, you state "The shuffle read and shuffle write is about 1TB. I do not need those intermediate values or RDDs". This statement affirms that you are not familiar with Apache Spark or possibly the algorithm you are running. Please let me explain.
When adding three numbers, you have to make a choice about the first two numbers to add. For example (a+b)+c or a+(b+c). Once that choice is made, there is a temporary intermediate value that is held for the number within the parenthesis. It is not possible to continue the computation across all three numbers without the intermediary number.
The RDD is a space efficient data structure. Each "new" RDD represents a set of operations across an entire data set. Some RDDs represent a single operation, like "add five" while others represent a chain of operations, like "add five, then multiply by six, and subtract by seven". You cannot discard an RDD without discarding some portion of your mathematical algorithm.
At its core, Apache Spark is a scatter-gather algorithm. It distributes a data set to a number of worker nodes, where that data set is part of a single RDD that gets distributed, along with the needed computations. At this point in time, the computations are not yet performed. As the data is requested from the computed form of the RDD, the computations are performed on-demand.
Occasionally, it is not possible to finish a computation on a single worker without knowing some of the intermediate values from other workers. This kind of cross communication between the workers always happens between the head node which distributes the data to the various workers and collects and aggregates the data from the various workers; but, depending on how the algorithm is structured, it can also occur mid-computation (especially in algorithms that groupBy or join data slices).
You have an algorithm that requires shuffling, in such a manner that a single node cannot collect the results from all of the other nodes because the single node doesn't have enough ram to hold the intermediate values collected from the other nodes.
In short, you have an algorithm that can't scale to accommodate the size of your data set with the hardware you have available.
At this point, you need to go back to your Apache Spark algorithm and see if it is possible to
Tune the partitions in the RDD to reduce the cross talk (partitions that are too small might require more cross talk in shuffling as a fully connected inter-transfer grows at O(N^2), partitions that are too big might run out of ram within a compute node).
Restructure the algorithm such that full shuffling is not required (sometimes you can reduce in stages such that you are dealing with more reduction phases, each phase having less data combine).
Restructure the algorithm such that shuffling is not required (it is possible, but unlikely that the algorithm is simply mis-written, and factoring it differently can avoid requesting remote data from a node's perspective).
If the problem is in collecting the results, rewrite the algorithm to return the results not in the head node's console, but in a distributed file system that can accommodate the data (like HDFS).
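The "reduce in stages" idea from the list above can be sketched in plain Scala as follows. This is an illustration under assumptions, not the poster's algorithm: `stagedReduce` and `fanIn` are hypothetical names, and `combine` must be associative (a sketch union such as `HLLCounter.Union` from the question would qualify). Spark's RDD API offers `treeReduce` and `treeAggregate`, which follow the same general idea.

```scala
object StagedReduceSketch {
  // Combine values in rounds, at most `fanIn` inputs per combine step,
  // so no single step has to take in every partial result at once.
  @annotation.tailrec
  def stagedReduce[A](values: Seq[A], fanIn: Int)(combine: (A, A) => A): Option[A] = {
    require(fanIn >= 2, "fanIn must be at least 2")
    if (values.isEmpty) None
    else if (values.size == 1) Some(values.head)
    else {
      // One round: reduce each group of `fanIn` values to a single value.
      val nextRound = values.grouped(fanIn).map(_.reduce(combine)).toList
      stagedReduce(nextRound, fanIn)(combine)
    }
  }
}
```

Each round shrinks the number of partial results by roughly a factor of `fanIn`, trading extra rounds for smaller inputs per step.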
Without the nuts and bolts of your Apache Spark program, access to your data set, and access to your Spark cluster and its logs, it's hard to know which of these common approaches would benefit you the most, so I listed them all.
Good Luck!

Spark: Randomly sampling with replacement a DataFrame with the same amount of sample for each class

Although there are a lot of seemingly similar questions, none of them answers mine.
I have a DataFrame already processed in order to be fed to a DecisionTreeClassifier, and it contains a column label which is filled with either 0.0 or 1.0.
I need to bootstrap my data set by randomly selecting, with replacement, the same number of rows for each value of my label column.
I've looked at all the docs and all I could find are DataFrame.sample(...) and DataFrameStatFunctions.sampleBy(...), but the issue with those is that the number of samples retained is not guaranteed, and the second one doesn't allow replacement! This wouldn't be an issue on a larger data set, but in around 50% of my cases one of the label values will have fewer than a hundred rows, and I really don't want skewed data.
Despite my best efforts, I was unable to find a clean solution to this problem, and I resigned myself to collecting the whole DataFrame and doing the sampling "manually" in Scala before recreating a new DataFrame to train my DecisionTreeClassifier on. But this seems highly inefficient and cumbersome; I would much rather stay with DataFrames and keep all the benefits that come with that structure.
Here is my current implementation for reference and so you know exactly what I'd like to do:
val nbSamplePerClass = /* some int value currently ranging between 50 and 10000 */
val onesDataFrame = inputDataFrame.filter("label > 0.0")
val zeros = inputDataFrame.except(onesDataFrame).collect()
val ones = onesDataFrame.collect()
val nbZeros = zeros.length // collect() returns an Array, so use length rather than count()
val nbOnes = ones.length
val rng = new scala.util.Random()
def randomIndexes(maxIndex: Int): Seq[Int] =
(0 until nbSamplePerClass).map(_ => rng.nextInt(maxIndex))
val zerosSample = randomIndexes(nbZeros).map(idx => zeros(idx))
val onesSample = randomIndexes(nbOnes).map(idx => ones(idx))
val samples = scala.collection.JavaConversions.seqAsJavaList(zerosSample ++ onesSample)
val resDf = sqlContext.createDataFrame(samples, inputDataFrame.schema)
Does anyone know how I could implement such a sampling while only working with DataFrames?
I'm pretty sure that it would significantly speed up my code!
Thank you for your time.

Implement a MergeSort like feature in spark with scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDD which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this with a merge, by:
Sorting the RDD by a key (each group being sorted holds no more than 100 records; the example above is a single group)
Performing a merge operation similar to mergesort. Here I also need to keep track of the previous value to find the max; still, I traverse the list only once.
Since there are too many variables here, I am getting a "Task not serializable" exception. Is this implementation approach correct? I am trying to avoid the Cartesian product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues( itmorg => {
val miorec = itmorg.toList.sortBy(_(1).toString)
for( r <- 0 to miorec.length) {
for ( q <- 0 to rdd2.value.length) {
if ( (miorec(r)(1).toString > rdd2.value(q).toString && miorec(r-1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
org.apache.spark.sql.Row(miorec(r-1)(0),miorec(r-1)(1),miorec(r-1)(2),miorec(r-1)(3),rdd2.value(q))
}
}
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling and to keep the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 and then aggregateByKey, but less efficient).
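The two steps described above can be sketched in plain Scala, with nested sequences standing in for RDD partitions. This is a minimal illustration under assumptions, not tested Spark code: `localBest` and `merge` are hypothetical names, and in Spark the first would run inside mapPartitions and the second inside reduce.

```scala
object FloorJoinSketch {
  // For each small-table label, the best (name, value) found so far in the big table.
  type Best = Map[String, (String, Int)]

  // Local step: for each small-table row, the largest big-table value <= its threshold.
  // Labels with no qualifying row in this partition are simply left out.
  def localBest(partition: Seq[(String, Int)], small: Seq[(String, Int)]): Best =
    small.flatMap { case (label, limit) =>
      partition.filter(_._2 <= limit) match {
        case Seq() => Nil
        case candidates => List(label -> candidates.maxBy(_._2))
      }
    }.toMap

  // Merge step: keep the larger candidate per label.
  def merge(a: Best, b: Best): Best =
    b.foldLeft(a) { case (acc, (label, cand)) =>
      acc.get(label) match {
        case Some(cur) if cur._2 >= cand._2 => acc
        case _ => acc.updated(label, cand)
      }
    }
}
```

The local step scans the partition once per small-table row, which stays cheap as long as the small table has fewer than about a hundred rows, as in the question.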
As for your exception, you would need to show the code. "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)