PySpark filtering gives inconsistent behavior - pyspark

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
df = getattr(self, name)
df = df.where(df[frequency_col] >= threshold)
setattr(self, name, df)
return self
The problem is a very strange behavior where if I put a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations, and I've noticed weird behaviors in Spark in the past sometimes doing very simple things like this after a join or a union can give very strange results where eventually the only solution is to write out the data and read it back in again and do the operation in a completely separate script. I hope there is a better solution than this!

Related

10 most common female first names - order changes

I am running through the exercise in Databricks and the below code returns firstName in different order everytime I run. Please explain the reason why the order is not same for every run:
val peopleDF = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")
id:integer
firstName:string
middleName:string
lastName:string
gender:string
birthDate:timestamp
ssn:string
salary:integer
/* Create a DataFrame called top10FemaleFirstNamesDF that contains the 10 most common female first names out of the people data set.*/
import org.apache.spark.sql.functions.count
val top10FemaleFirstNamesDF_1 = peopleDF.filter($"gender"=== "F").groupBy($"firstName").agg(count($"firstName").alias("cnt_firstName")).withColumn("cnt_firstName",$"cnt_firstName".cast("Int")).sort($"cnt_firstName".desc).limit(10)
val top10FemaleNamesDF = top10FemaleFirstNamesDF_1.orderBy($"firstName")
Some runs the assertion passes and in some run the assertion fails:
lazy val results = top10FemaleNamesDF.collect()
dbTest("DF-L2-names-0", Row("Alesha", 1368), results(0))
// dbTest("DF-L2-names-1", Row("Alice", 1384), results(1))
// dbTest("DF-L2-names-2", Row("Bridgette", 1373), results(2))
// dbTest("DF-L2-names-3", Row("Cristen", 1375), results(3))
// dbTest("DF-L2-names-4", Row("Jacquelyn", 1381), results(4))
// dbTest("DF-L2-names-5", Row("Katherin", 1373), results(5))
// dbTest("DF-L2-names-5", Row("Lashell", 1387), results(6))
// dbTest("DF-L2-names-7", Row("Louie", 1382), results(7))
// dbTest("DF-L2-names-8", Row("Lucille", 1384), results(8))
// dbTest("DF-L2-names-9", Row("Sharyn", 1394), results(9))
println("Tests passed!")
The problem might be the limit 10. Due to distributed nature of spark, you can't assume every time it runs the limit function it is going to give you same result. Spark might find different partition in different runs to give you 10 elements.
If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition.
However, I do realize you are sorting the data first and then limiting on that. The limit function supposed to return deterministically when the underlying rdd is sorted. It might be non-deterministic for unsorted data.
It will be worthwhile to see the physical plan of your query.

Spark flushing Dataframe on show / count

I am trying to print the count of a dataframe, and then first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
val col_name = "colA"
val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
val modifiedDFRaw = modified_df.limit(10)
modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}
val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)
def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
val strg_name = Option(sval).getOrElse(return None)
if (change_cnt < 20) {
change_cnt = change_cnt + 1
Some(strg_name.multiply(new java.math.BigDecimal("1000")))
} else {
Some(strg_name)
}
}
First of all function used as UserDefinedFunction has to be at least idempotent, but optimally pure. Otherwise the results are simply non-deterministic. While some escape hatch is provided in the latest versions (it is possible to hint Spark that function shouldn't be re-executed) these won't help you here.
Moreover having mutable stable (it is not exactly clear what is the source of change_cnt, but it is both written and read in the udf) as simply no go - Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decision based on such object.
Unfortunately both components are simply not salvageable. You'll have to go back to planning phase and rethink your design.
Your Dataframe is a distributed dataset and trying to do a count() returns unpredictable results since the count() can be different in each node. Read the documentation about RDDs below. It is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd

Data-processing takes too long if pre-processing just before

So I've been trying to perform a cumsum operation on a data-set. I want to emphasize that I want my cumsum to happen on partitions on my data-set (eg. cumsum for feature1 over time for personA).
I know how to do it, and it works "on its own" perfectly - i'll explain that part later. Here's the piece of code doing it:
// it's admitted that this DF contains all data I need
// with one column/possible value, with only 1/0 in each line
// 1 <-> feature has the value
// 0 <-> feature doesn't contain the value
// this DF is the one I get after the one-hot operation
// this operation is performed to apply ML algorithms on features
// having simultaneously multiple values
df_after_onehot.createOrReplaceTempView("test_table")
// #param DataFrame containing all possibles values eg. A, B, C
def cumSumForFeatures(values: DataFrame) = {
values
.map(value => "CAST(sum(" + value(0) + ") OVER (PARTITION BY person ORDER BY date) as Integer) as sum_" + value(0))
.reduce(_+ ", " +_)
}
val req = "SELECT *, " + cumSumForFeatures(possible_segments) + " FROM test_table"
// val req = "SELECT * FROM test_table"
println("executing: " + req)
val data_after_cumsum = sqLContext.sql(req).orderBy("person", "date")
data_after_cumsum.show(10, false)
The problem happens when I try to perform the same operation with some pre-processing before (like the one-hot operation, or adding computed features before). I tried with a very small dataset and it doesn't work.
Here is the printed stack trace (the part that should interess you at least):
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[Executor task launch worker-3] ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
So it seems it's related to a GC issue/JVM heap size? I just don't understand how it's related to my pre-processing?
I tried unpersist operation on not-used-anymore DFs.
I tried modifying the options on my machine (eg. -Xmx2048m).
The issue is the same once I deploy on AWS.
Extract of my pom.xml (for versions of Java, Spark, Scala):
<spark.version>2.1.0</spark.version>
<scala.version>2.10.4</scala.version>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
Would you know how I could fix my issue?
Thanks
From what I understand, I think that we could have two reasons for that:
JVM's heap overflow because of kept-in-memory-but-no-longer-used DataFrames
the cum-sum request could be too big to be processed with the few amount of RAM left
show/print operations increase the number of steps necessary for the job, and may interfer with Spark's inner optimizations
Considering that, I decided to "unpersist" no-longer-used DataFrames. That did not seem to change much.
Then, I decided to remove all unecessary show/print operations. That improved the number of step very much.
I changed my code to be more functionnal, but I kept 3 separate values to help debugging. That did not change much, but my code is cleaner.
Finally, here is the thing that helped me deal with the problem. Instead of making my request go through the dataset in one pass, I partitionned the list of features into slices:
def listOfSlices[T](list: List[T], sizeOfSlices: Int): List[List[T]] =
(for (i <- 0 until list.length by sizeOfSlices) yield list.slice(i, i+sizeOfSlices)).toList
I perform the request for each slice of, with a map operation. Then I join together them to have my final DataFrame. That way, I kind of distribute the computation, and it seems that this way is more efficient.
val possible_features_slices = listOfSlices[String](possible_features, 5)
val df_cum_sum = possible_features_slices
.map(possible_features_slice =>
dfWithCumSum(sqLContext, my_df, possible_segments_slice, "feature", "time")) // operation described in the original post
.foldLeft[DataFrame](null)((a, b) => if (a == null) b else if (b == null) a else a.join(b, Seq("person", "list_features", "time")))
I just really want to emphasize that I still not understand the reason behind my problem, and I still expect an answer at this level.

Spark: Randomly sampling with replacement a DataFrame with the same amount of sample for each class

Despite existing a lot of seemingly similar questions none answers my question.
I have a DataFrame already processed in order to be fed to a DecisionTreeClassifier and it contains a column label which is filled with either 0.0 or 1.0.
I need to bootstrap my data set, by randomly selecting with replacement the same amount of rows for each values of my label column.
I've looked at all the doc and all I could find are DataFrame.sample(...) and DataFrameStatFunctions.sampleBy(...) but the issue with those are that the number of sample retained is not guaranteed and the second one doesn't allow replacement! This wouldn't be an issue on larger data set but in around 50% of my cases I'll have one of the label values that have less than a hundred rows and I really don't want skewed data.
Despite my best efforts, I was unable to find a clean solution to this problem and I resolved myself. to collecting the whole DataFrame and doing the sampling "manually" in Scala before recreating a new DataFrame to train my DecisionTreeClassifier on. But this seem highly inefficient and cumbersome, I would much rather stay with DataFrame and keep all the benefits coming from that structure.
Here is my current implementation for reference and so you know exactly what I'd like to do:
val nbSamplePerClass = /* some int value currently ranging between 50 and 10000 */
val onesDataFrame = inputDataFrame.filter("label > 0.0")
val zeros = inputDataFrame.except(onesDataFrame).collect()
val ones = onesDataFrame.collect()
val nbZeros = zeros.count().toInt
val nbOnes = ones.count().toInt
def randomIndexes(maxIndex: Int) = (0 until nbSamplePerClass).map(
_ => new scala.util.Random().nextInt(maxIndex)).toSeq
val zerosSample = randomIndexes(nbZeros).map(idx => zeros(idx))
val onesSample = randomIndexes(nbOnes).map(idx => ones(idx))
val samples = scala.collection.JavaConversions.seqAsJavaList(zerosSample ++ onesSample)
val resDf = sqlContext.createDataFrame(samples, inputDataFrame.schema)
Does anyone know how I could implement such a sampling while only working with DataFrames?
I'm pretty sure that it would significantly speed up my code!
Thank you for your time.

Implement a MergeSort like feature in spark with scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDD which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this using Merge by:
Sorting RDD by a key (sort within a group will have not more than 100 records in that group. In the above example is within a group)
Performing the merge operation similar to mergesort. Here I need to keep a track of the previous value as well to find the max; still I traverse the list only once.
Since there are too may variables here I am getting "Task not serializable" exception. Is this implementation approach Correct? I am trying to avoid the Cartesian Product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues( itmorg => {
val miorec = itmorg.toList.sortBy(_(1).toString)
for( r <- 0 to miorec.length) {
for ( q <- 0 to rdd2.value.length) {
if ( (miorec(r)(1).toString > rdd2.value(q).toString && miorec(r-1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
org.apache.spark.sql.Row(miorec(r-1)(0),miorec(r-1)(1),miorec(r-1)(2),miorec(r-1)(3),rdd2.value(q))
}
}
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling an keeping the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 then aggregateByKey, but less efficient).
As for your exception, you woudl need to show the code, "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)