10 most common female first names - order changes - scala

I am running through the exercise in Databricks and the below code returns firstName in different order everytime I run. Please explain the reason why the order is not same for every run:
val peopleDF = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")
id:integer
firstName:string
middleName:string
lastName:string
gender:string
birthDate:timestamp
ssn:string
salary:integer
/* Create a DataFrame called top10FemaleFirstNamesDF that contains the 10 most common female first names out of the people data set.*/
import org.apache.spark.sql.functions.count
val top10FemaleFirstNamesDF_1 = peopleDF.filter($"gender"=== "F").groupBy($"firstName").agg(count($"firstName").alias("cnt_firstName")).withColumn("cnt_firstName",$"cnt_firstName".cast("Int")).sort($"cnt_firstName".desc).limit(10)
val top10FemaleNamesDF = top10FemaleFirstNamesDF_1.orderBy($"firstName")
Some runs the assertion passes and in some run the assertion fails:
lazy val results = top10FemaleNamesDF.collect()
dbTest("DF-L2-names-0", Row("Alesha", 1368), results(0))
// dbTest("DF-L2-names-1", Row("Alice", 1384), results(1))
// dbTest("DF-L2-names-2", Row("Bridgette", 1373), results(2))
// dbTest("DF-L2-names-3", Row("Cristen", 1375), results(3))
// dbTest("DF-L2-names-4", Row("Jacquelyn", 1381), results(4))
// dbTest("DF-L2-names-5", Row("Katherin", 1373), results(5))
// dbTest("DF-L2-names-5", Row("Lashell", 1387), results(6))
// dbTest("DF-L2-names-7", Row("Louie", 1382), results(7))
// dbTest("DF-L2-names-8", Row("Lucille", 1384), results(8))
// dbTest("DF-L2-names-9", Row("Sharyn", 1394), results(9))
println("Tests passed!")

The problem might be the limit 10. Due to distributed nature of spark, you can't assume every time it runs the limit function it is going to give you same result. Spark might find different partition in different runs to give you 10 elements.
If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition.
However, I do realize you are sorting the data first and then limiting on that. The limit function supposed to return deterministically when the underlying rdd is sorted. It might be non-deterministic for unsorted data.
It will be worthwhile to see the physical plan of your query.

Related

Scala - divide the dataset into dataset of arrays with a fixed size

I have a function whose purpose is to divide a dataset into arrays of a given size.
For example - I have a dataset with 123 objects of the Foo type, I provide to the function arraysSize 10 so as a result I will have a Dataset[Array[Foo]] with 12 arrays with 10 Foo's and 1 array with 3 Foo.
Right now function is working on collected data - I would like to change it on dataset based because of performance but I dont know how.
This is my current solution:
private def mapToFooArrays(data: Dataset[Foo],
arraysSize: Int): Dataset[Array[Foo]]= {
data.collect().grouped(arraysSize).toSeq.toDS()
}
The reason for doing this transformation is because the data will be sent in the event. Instead of sending 1 million events with information about 1 object, I prefer to send, for example, 10 thousand events with information about 100 objects
IMO, this is a weird use case. I can not think of any efficient solution to do this, as it is going to require a lot of shuffling no matter how we do it.
But, the following is still better, as it avoids collecting to the driver node and will thus be more scalable.
Things to keep in mind -
what is the value of data.count() ?
what is the size of a single Foo ?
what is the value of arraySize ?
what is your executor configuration ?
Based on these factors you will be able to come up with the desiredArraysPerPartition value.
val desiredArraysPerPartition = 50
private def mapToFooArrays(
data: Dataset[Foo],
arraysSize: Int
): Dataset[Array[Foo]] = {
val size = data.count()
val numArrays = (size.toDouble / arrarySize).ceil
val numPartitions = (numArrays.toDouble / desiredArraysPerPartition).ceil
data
.repartition(numPartitions)
.mapPartitions(_.grouped(arrarySize).map(_.toArray))
}
After reading the edited part, I think that 100 size in 10 thousand events with information about 100 objects is not really important. As it is referred as about 100. There can be more than one events with less than 100 Foo's.
If we are not very strict about that 100 size, then there is no need of reshuffle.
We can locally group the Foo's present in each of the partitions. As this grouping is being done locally and not globally, this might result in more than one (potentially one for each partition) Arrays with less than 100 Foo's.
private def mapToFooArrays(
data: Dataset[Foo],
arraysSize: Int
): Dataset[Array[Foo]] =
data
.mapPartitions(_.grouped(arrarySize).map(_.toArray))

Spark flushing Dataframe on show / count

I am trying to print the count of a dataframe, and then first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
val col_name = "colA"
val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
val modifiedDFRaw = modified_df.limit(10)
modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}
val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)
def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
val strg_name = Option(sval).getOrElse(return None)
if (change_cnt < 20) {
change_cnt = change_cnt + 1
Some(strg_name.multiply(new java.math.BigDecimal("1000")))
} else {
Some(strg_name)
}
}
First of all function used as UserDefinedFunction has to be at least idempotent, but optimally pure. Otherwise the results are simply non-deterministic. While some escape hatch is provided in the latest versions (it is possible to hint Spark that function shouldn't be re-executed) these won't help you here.
Moreover having mutable stable (it is not exactly clear what is the source of change_cnt, but it is both written and read in the udf) as simply no go - Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decision based on such object.
Unfortunately both components are simply not salvageable. You'll have to go back to planning phase and rethink your design.
Your Dataframe is a distributed dataset and trying to do a count() returns unpredictable results since the count() can be different in each node. Read the documentation about RDDs below. It is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd

spark No space left on device when working on extremely large data

The followings are my scala spark code:
val vertex = graph.vertices
val edges = graph.edges.map(v=>(v.srcId, v.dstId)).toDF("key","value")
var FMvertex = vertex.map(v => (v._1, HLLCounter.encode(v._1)))
var encodedVertex = FMvertex.toDF("keyR", "valueR")
var Degvertex = vertex.map(v => (v._1, 0.toLong))
var lastRes = Degvertex
//calculate FM of the next step
breakable {
for (i <- 1 to MaxIter) {
var N_pre = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
var adjacency = edges.join(
encodedVertex,//FMvertex.toDF("keyR", "valueR"),
$"value" === $"keyR"
).rdd.map(r => (r.getAs[VertexId]("key"), r.getAs[Array[Byte]]("valueR"))).reduceByKey((a,b)=>HLLCounter.Union(a,b))
FMvertex = FMvertex.union(adjacency).reduceByKey((a,b)=>HLLCounter.Union(a,b))
// update vetex encode
encodedVertex = FMvertex.toDF("keyR", "valueR")
var N_curr = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
lastRes = N_curr
var middleAns = N_curr.union(N_pre).reduceByKey((a,b)=>Math.abs(a-b))//.mapValues(x => x._1 - x._2)
if (middleAns.values.sum() == 0){
println(i)
break
}
Degvertex = Degvertex.join(middleAns).mapValues(x => x._1 + i * x._2)//.map(identity)
}
}
val res = Degvertex.join(lastRes).mapValues(x => x._1.toDouble / x._2.toDouble)
return res
In which I use several functions I defined in Java:
import net.agkn.hll.HLL;
import com.google.common.hash.*;
import com.google.common.hash.Hashing;
import java.io.Serializable;
public class HLLCounter implements Serializable {
private static int seed = 1234567;
private static HashFunction hs = Hashing.murmur3_128(seed);
private static int log2m = 15;
private static int regwidth = 5;
public static byte[] encode(Long id) {
HLL hll = new HLL(log2m, regwidth);
Hasher myhash = hs.newHasher();
hll.addRaw(myhash.putLong(id).hash().asLong());
return hll.toBytes();
}
public static byte[] Union(byte[] byteA, byte[] byteB) {
HLL hllA = HLL.fromBytes(byteA);
HLL hllB = HLL.fromBytes(byteB);
hllA.union(hllB);
return hllA.toBytes();
}
public static long decode(byte[] bytes) {
HLL hll = HLL.fromBytes(bytes);
return hll.cardinality();
}
}
This code is used for calculating Effective Closeness on a large graph, and I used Hyperloglog package.
The code works fine when I ran it on a graph with about ten million vertices and hundred million of edges. However, when I ran it on a graph with thousands million of graph and billions of edges, after several hours running on clusters, it shows
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 29.1 failed 4 times, most recent failure: Lost task 91.3 in stage 29.1 (TID 17065, 9.10.135.216, executor 102): java.io.IOException: : No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
Can anybody help me? I just begin to use spark for several days. Thank you for helping.
Xiaotian, you state "The shuffle read and shuffle write is about 1TB. I do not need those intermediate values or RDDs". This statement affirms that you are not familiar with Apache Spark or possibly the algorithm you are running. Please let me explain.
When adding three numbers, you have to make a choice about the first two numbers to add. For example (a+b)+c or a+(b+c). Once that choice is made, there is a temporary intermediate value that is held for the number within the parenthesis. It is not possible to continue the computation across all three numbers without the intermediary number.
The RDD is a space efficient data structure. Each "new" RDD represents a set of operations across an entire data set. Some RDDs represent a single operation, like "add five" while others represent a chain of operations, like "add five, then multiply by six, and subtract by seven". You cannot discard an RDD without discarding some portion of your mathematical algorithm.
At its core, Apache Spark is a scatter-gather algorithm. It distributes a data set to a number of worker nodes, where that data set is part of a single RDD that gets distributed, along with the needed computations. At this point in time, the computations are not yet performed. As the data is requested from the computed form of the RDD, the computations are performed on-demand.
Occasionally, it is not possible to finish a computation on a single worker without knowing some of the intermediate values from other workers. This kind of cross communication between the workers always happens between the head node which distributes the data to the various workers and collects and aggregates the data from the various workers; but, depending on how the algorithm is structured, it can also occur mid-computation (especially in algorithms that groupBy or join data slices).
You have an algorithm that requires shuffling, in such a manner that a single node cannot collect the results from all of the other nodes because the single node doesn't have enough ram to hold the intermediate values collected from the other nodes.
In short, you have an algorithm that can't scale to accommodate the size of your data set with the hardware you have available.
At this point, you need to go back to your Apache Spark algorithm and see if it is possible to
Tune the partitions in the RDD to reduce the cross talk (partitions that are too small might require more cross talk in shuffling as a fully connected inter-transfer grows at O(N^2), partitions that are too big might run out of ram within a compute node).
Restructure the algorithm such that full shuffling is not required (sometimes you can reduce in stages such that you are dealing with more reduction phases, each phase having less data combine).
Restructure the algorithm such that shuffling is not required (it is possible, but unlikely that the algorithm is simply mis-written, and factoring it differently can avoid requesting remote data from a node's perspective).
If the problem is in collecting the results, rewrite the algorithm to return the results not in the head node's console, but in a distributed file system that can accommodate the data (like HDFS).
Without the nuts-and-bolts of your Apache Spark program, and access to your data set, and access to your Spark cluster and it's logs, it's hard to know which one of these common approaches would benefit you the most; so I listed them all.
Good Luck!

Data-processing takes too long if pre-processing just before

So I've been trying to perform a cumsum operation on a data-set. I want to emphasize that I want my cumsum to happen on partitions on my data-set (eg. cumsum for feature1 over time for personA).
I know how to do it, and it works "on its own" perfectly - i'll explain that part later. Here's the piece of code doing it:
// it's admitted that this DF contains all data I need
// with one column/possible value, with only 1/0 in each line
// 1 <-> feature has the value
// 0 <-> feature doesn't contain the value
// this DF is the one I get after the one-hot operation
// this operation is performed to apply ML algorithms on features
// having simultaneously multiple values
df_after_onehot.createOrReplaceTempView("test_table")
// #param DataFrame containing all possibles values eg. A, B, C
def cumSumForFeatures(values: DataFrame) = {
values
.map(value => "CAST(sum(" + value(0) + ") OVER (PARTITION BY person ORDER BY date) as Integer) as sum_" + value(0))
.reduce(_+ ", " +_)
}
val req = "SELECT *, " + cumSumForFeatures(possible_segments) + " FROM test_table"
// val req = "SELECT * FROM test_table"
println("executing: " + req)
val data_after_cumsum = sqLContext.sql(req).orderBy("person", "date")
data_after_cumsum.show(10, false)
The problem happens when I try to perform the same operation with some pre-processing before (like the one-hot operation, or adding computed features before). I tried with a very small dataset and it doesn't work.
Here is the printed stack trace (the part that should interess you at least):
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[Executor task launch worker-3] ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
So it seems it's related to a GC issue/JVM heap size? I just don't understand how it's related to my pre-processing?
I tried unpersist operation on not-used-anymore DFs.
I tried modifying the options on my machine (eg. -Xmx2048m).
The issue is the same once I deploy on AWS.
Extract of my pom.xml (for versions of Java, Spark, Scala):
<spark.version>2.1.0</spark.version>
<scala.version>2.10.4</scala.version>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
Would you know how I could fix my issue?
Thanks
From what I understand, I think that we could have two reasons for that:
JVM's heap overflow because of kept-in-memory-but-no-longer-used DataFrames
the cum-sum request could be too big to be processed with the few amount of RAM left
show/print operations increase the number of steps necessary for the job, and may interfer with Spark's inner optimizations
Considering that, I decided to "unpersist" no-longer-used DataFrames. That did not seem to change much.
Then, I decided to remove all unecessary show/print operations. That improved the number of step very much.
I changed my code to be more functionnal, but I kept 3 separate values to help debugging. That did not change much, but my code is cleaner.
Finally, here is the thing that helped me deal with the problem. Instead of making my request go through the dataset in one pass, I partitionned the list of features into slices:
def listOfSlices[T](list: List[T], sizeOfSlices: Int): List[List[T]] =
(for (i <- 0 until list.length by sizeOfSlices) yield list.slice(i, i+sizeOfSlices)).toList
I perform the request for each slice of, with a map operation. Then I join together them to have my final DataFrame. That way, I kind of distribute the computation, and it seems that this way is more efficient.
val possible_features_slices = listOfSlices[String](possible_features, 5)
val df_cum_sum = possible_features_slices
.map(possible_features_slice =>
dfWithCumSum(sqLContext, my_df, possible_segments_slice, "feature", "time")) // operation described in the original post
.foldLeft[DataFrame](null)((a, b) => if (a == null) b else if (b == null) a else a.join(b, Seq("person", "list_features", "time")))
I just really want to emphasize that I still not understand the reason behind my problem, and I still expect an answer at this level.

Spark: Randomly sampling with replacement a DataFrame with the same amount of sample for each class

Despite existing a lot of seemingly similar questions none answers my question.
I have a DataFrame already processed in order to be fed to a DecisionTreeClassifier and it contains a column label which is filled with either 0.0 or 1.0.
I need to bootstrap my data set, by randomly selecting with replacement the same amount of rows for each values of my label column.
I've looked at all the doc and all I could find are DataFrame.sample(...) and DataFrameStatFunctions.sampleBy(...) but the issue with those are that the number of sample retained is not guaranteed and the second one doesn't allow replacement! This wouldn't be an issue on larger data set but in around 50% of my cases I'll have one of the label values that have less than a hundred rows and I really don't want skewed data.
Despite my best efforts, I was unable to find a clean solution to this problem and I resolved myself. to collecting the whole DataFrame and doing the sampling "manually" in Scala before recreating a new DataFrame to train my DecisionTreeClassifier on. But this seem highly inefficient and cumbersome, I would much rather stay with DataFrame and keep all the benefits coming from that structure.
Here is my current implementation for reference and so you know exactly what I'd like to do:
val nbSamplePerClass = /* some int value currently ranging between 50 and 10000 */
val onesDataFrame = inputDataFrame.filter("label > 0.0")
val zeros = inputDataFrame.except(onesDataFrame).collect()
val ones = onesDataFrame.collect()
val nbZeros = zeros.count().toInt
val nbOnes = ones.count().toInt
def randomIndexes(maxIndex: Int) = (0 until nbSamplePerClass).map(
_ => new scala.util.Random().nextInt(maxIndex)).toSeq
val zerosSample = randomIndexes(nbZeros).map(idx => zeros(idx))
val onesSample = randomIndexes(nbOnes).map(idx => ones(idx))
val samples = scala.collection.JavaConversions.seqAsJavaList(zerosSample ++ onesSample)
val resDf = sqlContext.createDataFrame(samples, inputDataFrame.schema)
Does anyone know how I could implement such a sampling while only working with DataFrames?
I'm pretty sure that it would significantly speed up my code!
Thank you for your time.