spark filter on external variable - scala

I have a series of sales records which are in an RDD like so,
case class salesRecord(startDate: Int, startTime: Int, itemNumber: Int)
val transactions: RDD[salesRecords]
each day is separated by a startDate and startTime which are the seconds since midnight. I need to be able to filter the start time between two bands which have been input by the user.
so imagine the user has input
val timeBandStart: Int //example 100
val timeBandEnd: Int // 5000
and they only want the records of each day between these bands, so to do this I've tried the following.
val timeFiltered = transactions.filter { record =>
record.startTime >= timeBandStart && record.startTime <= timeBandEnd
}
the issue i'm facing is that I get nothing on my output, I know for sure the records are within these time bands, so should appear on my output. To try and debug this I've tried extract what the timebands which i'm trying to filter on, with this.
val test = transactions.map( (record.startTime, timeBandStart, timeBandEnd)
My output of this is the following (32342, 0, 0), (32455, 0, 0).
Why are my timeBands not being set within the filter? I thought this could be something to with the variables not being broadcast to all of the nodes, so I tried placing the timebands with a broadcast variable. But that didn't work....
Probably something really stupid, can somebody point out what i'm doing wrong?
Cheers!

I've fixed my bug, it's related to this issue. https://issues.apache.org/jira/browse/SPARK-4170
I'm now not extending from App and instead using the main method. My filter is now being applied correctly.
Thanks for your help!

Related

How to aggregate events in flink stream before merging with current state by reduce function?

My events are like: case class Event(user: User, stats: Map[StatType, Int])
Every event contains +1 or -1 values in it.
I have my current pipeline that works fine but produces new event for every change of statistics.
eventsStream
.keyBy(extractKey)
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
I'd like to aggregate these increments in a time window before merging them with the current state. So I want the same rolling reduce but with a time window.
Current simple rolling reduce:
500 – last reduced value
+1
-1
+1
Emitted events: 501, 500, 501
Rolling reduce with a window:
500 – last reduced value
v-- window
+1
-1
+1
^-- window
Emitted events: 501
I've tried naive solution to put time window just before reduce but after reading the docs I see that reduce now has different behavior.
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
It seems that I should make keyed stream and reduce it after reducing my time window:
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.reduce(reduceFunc)
.keyBy(extractKey)
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
Is it the right pipeline to solve a problem?
There's probably different options, but one would be to implement a WindowFunction and then run apply after the windowing:
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.apply(new MyWindowFunction)
(WindowFuntion takes type parameters for the type of the input value, the type of the output value and the type of the key.)
There's an example of that in here. Let me copy the relevant snippet:
/** User-defined WindowFunction to compute the average temperature of SensorReadings */
class TemperatureAverager extends WindowFunction[SensorReading, SensorReading, String, TimeWindow] {
/** apply() is invoked once for each window */
override def apply(
sensorId: String,
window: TimeWindow,
vals: Iterable[SensorReading],
out: Collector[SensorReading]): Unit = {
// compute the average temperature
val (cnt, sum) = vals.foldLeft((0, 0.0))((c, r) => (c._1 + 1, c._2 + r.temperature))
val avgTemp = sum / cnt
// emit a SensorReading with the average temperature
out.collect(SensorReading(sensorId, window.getEnd, avgTemp))
}
I don't know how your data looks so I can't attempt a full answer, but that should serve as inspiration.
Yes, your proposed pipeline will have the desired effect. The window will reduce together the 2-minute batches. The results of those batches will flow into the final reduce, which will produce an updated result after each of its inputs (which are the window results).

Generating a random sequence using Scala

I need to get a random sequence of 100 values from 10^-10 to 10^10 and storing to an Array using Scala. I tried following but it didn't work
Array(scala.math.pow(10,-10).doubleValue to scala.math.pow(10,10).intValue by scala.math.pow(10,5).toLong)
Can anyone help me to figure out how to do this correctly?
So you need to fill() the array with Random elements.
import scala.util.Random
val rndm = new Random(1911L)
Array.fill(100)(rndm.between(math.pow(10,-10), math.pow(10,10)))
//res0: Array[Double] = Array(6.08868427907728E9
// , 3.29548545155816E9
// , 9.52802903383275E9
// , 7.981295238889314E9
// , 1.9462480080050848E9
// . . .
This works because the 2nd parameter to the fill() method is "by-name", i.e. re-evaluated for every element.
UPDATE
Things aren't quite as clean if you don't have the .between() method (Scala 2.13).
Array.fill(100)(rndm.nextDouble())
.map(_ * math.pow(10,10))
Note that this actually has a floor of 0.0 instead of the desired 0.0000000001. It's very unlikely you'd have an entry that's too small, especially when taking only 100 samples. Still, there are steps you could take to insure that can't happen.

10 most common female first names - order changes

I am running through the exercise in Databricks and the below code returns firstName in different order everytime I run. Please explain the reason why the order is not same for every run:
val peopleDF = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")
id:integer
firstName:string
middleName:string
lastName:string
gender:string
birthDate:timestamp
ssn:string
salary:integer
/* Create a DataFrame called top10FemaleFirstNamesDF that contains the 10 most common female first names out of the people data set.*/
import org.apache.spark.sql.functions.count
val top10FemaleFirstNamesDF_1 = peopleDF.filter($"gender"=== "F").groupBy($"firstName").agg(count($"firstName").alias("cnt_firstName")).withColumn("cnt_firstName",$"cnt_firstName".cast("Int")).sort($"cnt_firstName".desc).limit(10)
val top10FemaleNamesDF = top10FemaleFirstNamesDF_1.orderBy($"firstName")
Some runs the assertion passes and in some run the assertion fails:
lazy val results = top10FemaleNamesDF.collect()
dbTest("DF-L2-names-0", Row("Alesha", 1368), results(0))
// dbTest("DF-L2-names-1", Row("Alice", 1384), results(1))
// dbTest("DF-L2-names-2", Row("Bridgette", 1373), results(2))
// dbTest("DF-L2-names-3", Row("Cristen", 1375), results(3))
// dbTest("DF-L2-names-4", Row("Jacquelyn", 1381), results(4))
// dbTest("DF-L2-names-5", Row("Katherin", 1373), results(5))
// dbTest("DF-L2-names-5", Row("Lashell", 1387), results(6))
// dbTest("DF-L2-names-7", Row("Louie", 1382), results(7))
// dbTest("DF-L2-names-8", Row("Lucille", 1384), results(8))
// dbTest("DF-L2-names-9", Row("Sharyn", 1394), results(9))
println("Tests passed!")
The problem might be the limit 10. Due to distributed nature of spark, you can't assume every time it runs the limit function it is going to give you same result. Spark might find different partition in different runs to give you 10 elements.
If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition.
However, I do realize you are sorting the data first and then limiting on that. The limit function supposed to return deterministically when the underlying rdd is sorted. It might be non-deterministic for unsorted data.
It will be worthwhile to see the physical plan of your query.

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
df = getattr(self, name)
df = df.where(df[frequency_col] >= threshold)
setattr(self, name, df)
return self
The problem is a very strange behavior where if I put a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations, and I've noticed weird behaviors in Spark in the past sometimes doing very simple things like this after a join or a union can give very strange results where eventually the only solution is to write out the data and read it back in again and do the operation in a completely separate script. I hope there is a better solution than this!

Spark: Randomly sampling with replacement a DataFrame with the same amount of sample for each class

Despite existing a lot of seemingly similar questions none answers my question.
I have a DataFrame already processed in order to be fed to a DecisionTreeClassifier and it contains a column label which is filled with either 0.0 or 1.0.
I need to bootstrap my data set, by randomly selecting with replacement the same amount of rows for each values of my label column.
I've looked at all the doc and all I could find are DataFrame.sample(...) and DataFrameStatFunctions.sampleBy(...) but the issue with those are that the number of sample retained is not guaranteed and the second one doesn't allow replacement! This wouldn't be an issue on larger data set but in around 50% of my cases I'll have one of the label values that have less than a hundred rows and I really don't want skewed data.
Despite my best efforts, I was unable to find a clean solution to this problem and I resolved myself. to collecting the whole DataFrame and doing the sampling "manually" in Scala before recreating a new DataFrame to train my DecisionTreeClassifier on. But this seem highly inefficient and cumbersome, I would much rather stay with DataFrame and keep all the benefits coming from that structure.
Here is my current implementation for reference and so you know exactly what I'd like to do:
val nbSamplePerClass = /* some int value currently ranging between 50 and 10000 */
val onesDataFrame = inputDataFrame.filter("label > 0.0")
val zeros = inputDataFrame.except(onesDataFrame).collect()
val ones = onesDataFrame.collect()
val nbZeros = zeros.count().toInt
val nbOnes = ones.count().toInt
def randomIndexes(maxIndex: Int) = (0 until nbSamplePerClass).map(
_ => new scala.util.Random().nextInt(maxIndex)).toSeq
val zerosSample = randomIndexes(nbZeros).map(idx => zeros(idx))
val onesSample = randomIndexes(nbOnes).map(idx => ones(idx))
val samples = scala.collection.JavaConversions.seqAsJavaList(zerosSample ++ onesSample)
val resDf = sqlContext.createDataFrame(samples, inputDataFrame.schema)
Does anyone know how I could implement such a sampling while only working with DataFrames?
I'm pretty sure that it would significantly speed up my code!
Thank you for your time.