What is the difference between Apache Spark compute and slice? - scala

I trying to make unit test on DStream.
I put data in my stream with a mutable queue ssc.queueStream(red)
I set the ManualClock to 0
Start my streaming context.
Advance my ManualClock to batchDuration milis
When i'm doing a
stream.slice(Time(0), Time(clock.getTimeMillis())).map(_.collect().toList)
I got a result.
when I do
for (time <- 0L to stream.slideDuration.milliseconds + 10) {
println("time "+ time + " " +stream.compute(Time(time)).map(_.collect().toList))
}
None of them contain a result event the stream.compute(Time(clock.getTimeMillis()))
So what is the difference between this two functions without considerings the parameters differences?

Compute will return an RDD only if the provided time is a correct time in a sliding window i.e it's the zero time + a multiple of the slide duration.
Slice will align both the from and to times to slide duration and compute for each of them.

In slide you provide time interval and from time interval as long as it is valid we generate Seq[Time]
def to(that: Time, interval: Duration): Seq[Time] = {
(this.milliseconds) to (that.milliseconds) by (interval.milliseconds) map (new Time(_))
}
and then we "compute" for each instance of Seq[Time]
alignedFromTime.to(alignedToTime, slideDuration).flatMap { time =>
if (time >= zeroTime) getOrCompute(time) else None
}
as oppose to compute we only compute for the instance of Time that we pass the compute method...

Related

How to aggregate events in flink stream before merging with current state by reduce function?

My events are like: case class Event(user: User, stats: Map[StatType, Int])
Every event contains +1 or -1 values in it.
I have my current pipeline that works fine but produces new event for every change of statistics.
eventsStream
.keyBy(extractKey)
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
I'd like to aggregate these increments in a time window before merging them with the current state. So I want the same rolling reduce but with a time window.
Current simple rolling reduce:
500 – last reduced value
+1
-1
+1
Emitted events: 501, 500, 501
Rolling reduce with a window:
500 – last reduced value
v-- window
+1
-1
+1
^-- window
Emitted events: 501
I've tried naive solution to put time window just before reduce but after reading the docs I see that reduce now has different behavior.
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
It seems that I should make keyed stream and reduce it after reducing my time window:
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.reduce(reduceFunc)
.keyBy(extractKey)
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
Is it the right pipeline to solve a problem?
There's probably different options, but one would be to implement a WindowFunction and then run apply after the windowing:
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.apply(new MyWindowFunction)
(WindowFuntion takes type parameters for the type of the input value, the type of the output value and the type of the key.)
There's an example of that in here. Let me copy the relevant snippet:
/** User-defined WindowFunction to compute the average temperature of SensorReadings */
class TemperatureAverager extends WindowFunction[SensorReading, SensorReading, String, TimeWindow] {
/** apply() is invoked once for each window */
override def apply(
sensorId: String,
window: TimeWindow,
vals: Iterable[SensorReading],
out: Collector[SensorReading]): Unit = {
// compute the average temperature
val (cnt, sum) = vals.foldLeft((0, 0.0))((c, r) => (c._1 + 1, c._2 + r.temperature))
val avgTemp = sum / cnt
// emit a SensorReading with the average temperature
out.collect(SensorReading(sensorId, window.getEnd, avgTemp))
}
I don't know how your data looks so I can't attempt a full answer, but that should serve as inspiration.
Yes, your proposed pipeline will have the desired effect. The window will reduce together the 2-minute batches. The results of those batches will flow into the final reduce, which will produce an updated result after each of its inputs (which are the window results).

Spark Exponential Moving Average

I have a dataframe of timeseries pricing data, with an ID, Date and Price.
I need to compute the Exponential Moving Average for the Price Column, and add it as a new column to the dataframe.
I have been using Spark's window functions before, and it looked like a fit for this use case, but given the formula for the EMA:
EMA: {Price - EMA(previous day)} x multiplier + EMA(previous day)
where
multiplier = (2 / (Time periods + 1)) //let's assume Time period is 10 days for now
I got a bit confused as to how can I access to the previous computed value in the column, while actually window-ing over the column.
With a simple moving average, it's simple, since all you need to do is compute a new column while averaging the elements in the window:
var window = Window.partitionBy("ID").orderBy("Date").rowsBetween(-windowSize, Window.currentRow)
dataFrame.withColumn(avg(col("Price")).over(window).alias("SMA"))
But it seems that with EMA its a bit more complicated since at every step I need the previous computed value.
I have also looked at Weighted moving average in Pyspark but I need an approach for Spark/Scala, and for a 10 or 30 days EMA.
Any ideas?
In the end, I've analysed how exponential moving average is implemented in pandas dataframes. Besides the recursive formula which I described above, and which is difficult to implement in any sql or window function(because its recursive), there is another one, which is detailed on their issue tracker:
y[t] = (x[t] + (1-a)*x[t-1] + (1-a)^2*x[t-2] + ... + (1-a)^n*x[t-n]) /
((1-a)^0 + (1-a)^1 + (1-a)^2 + ... + (1-a)^n).
Given this, and with additional spark implementation help from here, I ended up with the following implementation, which is roughly equivalent with doing pandas_dataframe.ewm(span=window_size).mean().
def exponentialMovingAverage(partitionColumn: String, orderColumn: String, column: String, windowSize: Int): DataFrame = {
val window = Window.partitionBy(partitionColumn)
val exponentialMovingAveragePrefix = "_EMA_"
val emaUDF = udf((rowNumber: Int, columnPartitionValues: Seq[Double]) => {
val alpha = 2.0 / (windowSize + 1)
val adjustedWeights = (0 until rowNumber + 1).foldLeft(new Array[Double](rowNumber + 1)) { (accumulator, index) =>
accumulator(index) = pow(1 - alpha, rowNumber - index); accumulator
}
(adjustedWeights, columnPartitionValues.slice(0, rowNumber + 1)).zipped.map(_ * _).sum / adjustedWeights.sum
})
dataFrame.withColumn("row_nr", row_number().over(window.orderBy(orderColumn)) - lit(1))
.withColumn(s"$column$exponentialMovingAveragePrefix$windowSize", emaUDF(col("row_nr"), collect_list(column).over(window)))
.drop("row_nr")
}
(I am presuming the type of the column for which I need to compute the exponential moving average is Double.)
I hope this helps others.

Q: [Anylogic] Measuring production throughput rate

I would like to know how to measure the throughput rate of the production line on Anylogic.
Question: Are there any methods to measure the Time Between Departure of the agent at the sink block? >>(I will calculate the throughput rate by inverting the time between departure value.)
At the moment, I just simply calculated the throughput based on Little's law, which I use the average lead time and WIP level of the line. I am not sure that whether the throughput value based on this calculation will be equal to the inverted value of the time between departure or not?
I hope you guys could help me figure it out.
Thanks in advance!
There is a function "time()" that returns the current model time in model time units. Using this function, you may know the times when agent A and agent B left the system, and calculate the difference between these times. You can do this by writing the code like below in the "On exit" field of the "sink" block:
statistic.add(time() - TimeOfPreviousAgent);
TimeOfPreviousAgent = time();
"TimeOfPreviousAgent" is a variable of "double" type;
"statistic" is a "Statistic" element used to collect the measurements
This approach of measuring time in the process flow is described in the tutorial Bank Office.
As an alternative, you can store leaving time of each agent into a collection. Then, you will need to iterate over the samples stored in the collection to find the difference between each pair of samples.
Not sure if this will help but it stems off Tatiana's answer. In the agents state chart you can create variables TimeIn, TimeOut, and TimeInSystem. Then at the Statechart Entry Point have,
TimeIn = time();
And at the Final state have,
TimeOut = time();
TimeInSystem = TimeOut - TimeIn;
To observe these times for each individual agent you can use the following code,
System.out.println("I came in at " + TimeIn + " and exited at " TimeOut + " and spent " + TimeInSystem + " seconds in the system";
Then for statistical analysis you can calculate the min, avg, and max throughputs of all agents by creating in Main variables, TotalTime, TotalAgentsServiced, AvgServiceTime, MaxServiceTime, MinServiceTime and then add a function call it say TrackAvgTimeInSystem ... within the function add argument NextAgent with type double. In the function body have,
TotalTime += NextAgent;
TotalAgentsServiced += 1;
AverageServiceTime = TotalTime/TotalCarsServiced;
if(MinServiceTimeReported == 0)
{
MinServiceTime = NextAgent;
}
else if(NextAgent < MinServiceTime)
{
MinServiceTime = NextAgent;
}
if(NextAgent > MaxServiceTime)
{
MaxServiceTime = NextAgent;
}
Then within your agent's state charts, in the Final State call the function
get_Main().TrackAvgTimeInSystem(TimeInSystem);
This then calculates the min, max, and average throughput of all agents.

Problems with simple time measurement in scala

I try to "profile" an expensive method by the means of just printing the system time. I've written a small method that prints the current time in seconds relative to the start-time. :
object Bechmark extends App {
var starttime = 0L
def printTime(): Unit = {
if (starttime == 0L) {
starttime = System.currentTimeMillis()
}
println((System.currentTimeMillis() - starttime) / 1000.0)
}
printTime()
Thread.sleep(100)
printTime()
}
I expect therefore that the first call to printTime prints something close to 0. But the output I get is
0.117
0.221
I don't understand why the first call already gives me ~120 miliseconds? What is the correct implementation for my purpose?
As others have mentioned the running time of your application does not necessarily represent the actual world time elapsed. There are several factors that affect it: warm up time of the JVM, JVM garbage collection, steady state of the JVM for the accurate measurement, OS process dispatching and shuffling.
For Scala-related purposes I suggest ScalaMeter
that allows you to tune all the aforementioned variables and measure the time quite accurately.

In Rx (or RxJava/RxScala), how to make an auto-resetting stateful latch map/filter for measuring in-stream elapsed time to touch a barrier?

Apologies if the question is poorly phrased, I'll do my best.
If I have a sequence of values with times as an Observable[(U,T)] where U is a value and T is a time-like type (or anything difference-able I suppose), how could I write an operator which is an auto-reset one-touch barrier, which is silent when abs(u_n - u_reset) < barrier, but spits out t_n - t_reset if the barrier is touched, at which point it also resets u_reset = u_n.
That is to say, the first value this operator receives becomes the baseline, and it emits nothing. Henceforth it monitors the values of the stream, and as soon as one of them is beyond the baseline value (above or below), it emits the elapsed time (measured by the timestamps of the events), and resets the baseline. These times then will be processed to form a high-frequency estimate of the volatility.
For reference, I am trying to write a volatility estimator outlined in http://www.amazon.com/Volatility-Trading-CD-ROM-Wiley/dp/0470181990 , where rather than measuring the standard deviation (deviations at regular homogeneous times), you repeatedly measure the time taken to breach a barrier for some fixed barrier amount.
Specifically, could this be written using existing operators? I'm a bit stuck on how the state would be reset, though maybe I need to make two nested operators, one which is one-shot and another which keeps creating that one-shot... I know it could be done by writing one by hand, but then I need to write my own publisher etc etc.
Thanks!
I don't fully understand the algorithm and your variables in the example, but you can use flatMap with some heap-state and return empty() or just() as needed:
int[] var1 = { 0 };
source.flatMap(v -> {
var1[0] += v;
if ((var1[0] & 1) == 0) {
return Observable.just(v);
}
return Observable.empty();
});
If you need a per-sequence state because of multiple consumers, you can defer the whole thing:
Observable.defer(() -> {
int[] var1 = { 0 };
return source.flatMap(v -> {
var1[0] += v;
if ((var1[0] & 1) == 0) {
return Observable.just(v);
}
return Observable.empty();
});
}).subscribe(...);