I am new to Scala and I am working on implementing an algorithm. In C# this would have been a much easier task with the necessary loops, but it is a bit confusing to implement with Scala's functional programming semantics.
Assume I have to fill a spreadsheet (S) with N rows and M columns with values that I have in a one-dimensional list (L).
While filling an individual cell in the spreadsheet, there is some back-and-forth logic involved:
2a. The system walks through the items in L sequentially and fills each one into the next empty cell in sheet S.
2b. While filling a cell with the value of the currently processed item from L, the system checks whether the current cell can accept that value. If yes, it fills it in, moves on to the next item and follows step 2a. If not, it checks whether the next item from L would fit. It keeps evaluating items until it finds a value that fits; if it runs out of values, it leaves the cell blank.
2c. After filling the cell in step 2b, the system moves on to the next cell. It first checks whether any of the values left unprocessed in step 2b could be accepted by the current cell. If yes, it fills it in and keeps working with the unprocessed values. If no unprocessed value fits, it pulls the next item from L, based on where the pointer ended up in step 2b.
It would be great if I could get ideas on how to structure this in Scala. As I mentioned earlier, in C# this would be easy with foreach loops, but I am not sure what the best way to do this is with functional programming constructs.
You can remember that imperative:
for (init; condition; afterEach) {
  instructions
}
is just syntactic sugar for:
init
while (condition) {
  instructions
  afterEach
}
(at least until you use break or continue). So if you are able to rewrite your for-loop code as while-loop code, the translation is pretty straightforward.
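For illustration, a direct imperative translation could look like the sketch below. It is only a sketch: the sheet representation as a 2D array of Option and the canAccept predicate are made up, since the actual cell and item types are not shown in the question.

import scala.collection.mutable

// Imperative sketch: fills an n x m sheet from the items in L.
// `canAccept` is a hypothetical stand-in for the real acceptance check.
def fillSheet[Item](
    n: Int, m: Int,
    items: List[Item],
    canAccept: (Int, Int, Item) => Boolean
): Array[Array[Option[Item]]] = {
  val sheet = Array.fill(n, m)(Option.empty[Item])
  val pending = mutable.Queue(items: _*) // items not placed yet, in their original order
  var i = 0
  while (i < n) {
    var j = 0
    while (j < m) {
      // take the first pending item the current cell accepts; skipped items stay queued
      pending.dequeueFirst(item => canAccept(i, j, item)).foreach { item =>
        sheet(i)(j) = Some(item)
      }
      j += 1
    }
    i += 1
  }
  sheet
}

Because skipped items stay at the front of the queue, the next cell considers them again before pulling fresh items from L, which matches steps 2b and 2c.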
If you are not interested in such a solution, you could do something like:
val indices = for {
  i <- (0 until n).toStream // or .to(LazyList) if on 2.13
  j <- (0 until m).toStream // or .to(LazyList) if on 2.13
} yield i -> j
indices.foldLeft(allItemsToInsert) { case (itemsLeft, (i, j)) =>
  itemsLeft.find(item => /* predicate if item can be inserted at (i, j) */) match {
    case Some(item) =>
      // insert item into the spreadsheet
      itemsLeft diff List(item) // remove the found element - use another data structure if you find this too costly
    case None =>
      itemsLeft // nothing could be inserted, move on
  }
}
This goes through all indices one after another and, for each cell, tries to find the first element which can be inserted there. If it finds one, it inserts it and takes it off the list; if nothing fits, it moves on to the next cell.
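To make that concrete, here is a small self-contained variant of the same fold, with made-up types: items are Ints, the sheet is an immutable Map keyed by (row, column), and the acceptance predicate is invented. Both the sheet and the remaining items are threaded through the accumulator:

// Sketch with invented types and predicate: even rows accept even items, odd rows accept odd items.
val n = 2
val m = 3
val allItemsToInsert = List(4, 7, 2, 9, 6)

def canAccept(i: Int, j: Int, item: Int): Boolean = (i % 2 == 0) == (item % 2 == 0)

val allIndices = for {
  i <- (0 until n).toList
  j <- (0 until m).toList
} yield i -> j

val (sheet, leftovers) =
  allIndices.foldLeft((Map.empty[(Int, Int), Int], allItemsToInsert)) {
    case ((filled, itemsLeft), (i, j)) =>
      itemsLeft.find(item => canAccept(i, j, item)) match {
        case Some(item) => (filled + ((i, j) -> item), itemsLeft diff List(item))
        case None       => (filled, itemsLeft)
      }
  }

With this predicate, row 0 ends up with the even items (4, 2, 6), row 1 with the odd ones (7, 9), and leftovers holds whatever could not be placed.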
You can tweak the logic to e.g. partition the items into insertable and non-insertable ones, if more than one item could be inserted at a given position:
indices.foldLeft(allItemsToInsert) { case (itemsLeft, (i, j)) =>
  val (insertable, nonInsertable) = itemsLeft.partition(item => /* predicate if item can be inserted */)
  // insert the insertable items
  nonInsertable // pass the non-insertable items on to the next index
}
Alternatively you could also use tail recursion if you really need to go back and forth:
@scala.annotation.tailrec
def insertValues(items: List[Item], i: Int, j: Int): Unit = {
  if (items.nonEmpty) {
    // insert what you can into the spreadsheet
    val itemsLeft = ...    // items that you haven't inserted yet
    val (newI, newJ) = ... // position of the next cell
    insertValues(itemsLeft, newI, newJ)
  }
}
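To show how the indices could advance, here is a slightly fleshed-out version of that recursion. Again only a sketch: canAccept is a hypothetical predicate, and the sheet is threaded through as an immutable Map instead of being mutated as a side effect.

import scala.annotation.tailrec

// Recursive sketch: advances cell by cell, threading the remaining items and the
// sheet (a Map from (row, col) to Item) through the recursion.
def insertValues[Item](
    n: Int, m: Int,
    items: List[Item],
    canAccept: (Int, Int, Item) => Boolean
): Map[(Int, Int), Item] = {
  @tailrec
  def loop(i: Int, j: Int, left: List[Item], sheet: Map[(Int, Int), Item]): Map[(Int, Int), Item] =
    if (i >= n || left.isEmpty) sheet
    else {
      // position of the next cell to visit
      val (nextI, nextJ) = if (j + 1 < m) (i, j + 1) else (i + 1, 0)
      left.find(item => canAccept(i, j, item)) match {
        case Some(item) => loop(nextI, nextJ, left diff List(item), sheet + ((i, j) -> item))
        case None       => loop(nextI, nextJ, left, sheet)
      }
    }
  loop(0, 0, items, Map.empty)
}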
I am working on a community detection algorithm that uses the concept of propagating labels to nodes. I have a problem selecting the right type for the Label_counter variable.
There is an algorithm called LPA (label propagation algorithm) which propagates labels to nodes through iterations. Think of a label as a node property. The initial label of each node is its node ID, and in each iteration a node updates its label to the most frequent label among its neighbors. The algorithm I am working on is similar to LPA. At first every node has an initial label equal to 0, and then nodes get new labels. As nodes update and get new labels, based on some conditions the Label_counter should be incremented by one, so that its value can be used as the label for other nodes, for example label = 1, label = 2 and so on. As an example, the Zachary karate club dataset has 34 nodes and 2 communities.
The initial state is like this:
(1,0)
(2,0)
.
.
.
(34,0)
The first number is the node ID and the second one is the label.
As nodes get new labels, the Label_counter increments, and in the next iterations other nodes get new labels and the Label_counter increments again.
(1,1)
(2,1)
(3,1)
.
.
.
(33,3)
(34,3)
Nodes with the same label belong to the same community.
The problem I have is this:
Because the nodes in the RDD and the variables are distributed across the machines (each machine has its own copy of the variables) and executors do not communicate with each other, if one executor updates the Label_counter the other executors will not be informed of the new value, and nodes may get wrong labels. Is it appropriate to use an Accumulator as the label counter in this case, since accumulators are shared variables across machines, or is there another way to handle this problem?
In Spark it is always complicated to compute index-like values, because they depend on data that is not present in all the partitions. I can propose the following idea:
1. Compute the number of times the condition is met per partition.
2. Compute the cumulative increment per partition, so that we know the initial increment of each partition.
3. Increment the values of each partition based on that initial increment.
Here is what the code could look like. Let me start by setting up a few things.
// Let's define some condition
def condition(node: Long) = node % 10 == 1

// step 0, generate the data
val rdd = spark.range(34)
  .select('id + 1).repartition(10).rdd
  .map(r => (r.getAs[Long](0), 0))
  .sortBy(_._1).cache()

rdd.collect
Array[(Long, Int)] = Array((1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0),
(9,0), (10,0), (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0),
(19,0), (20,0), (21,0), (22,0), (23,0), (24,0), (25,0), (26,0), (27,0), (28,0),
(29,0), (30,0), (31,0), (32,0), (33,0), (34,0))
Then the core of the solution:
// step 1 and 2
val partIncrInit = rdd
  // to each partition, we associate the number of times we need to increment
  .mapPartitionsWithIndex { case (i, p) =>
    Iterator(i -> p.map(_._1).count(condition))
  }
  .collect.sorted   // sort by partition index
  .map(_._2)        // we don't need the index anymore
  .scanLeft(0)(_+_) // cumulated sum

// step 3, we increment each partition based on this initial increment.
val result = rdd
  .mapPartitionsWithIndex { case (i, p) =>
    var incr = 0
    p.map { case (node, value) =>
      if (condition(node)) incr += 1
      (node, partIncrInit(i) + value + incr)
    }
  }
result.collect
Array[(Long, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1),
(9,1), (10,1), (11,2), (12,2), (13,2), (14,2), (15,2), (16,2), (17,2), (18,2),
(19,2), (20,2), (21,3), (22,3), (23,3), (24,3), (25,3), (26,3), (27,3), (28,3),
(29,3), (30,3), (31,4), (32,4), (33,4), (34,4))
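If you need this in several places, the three steps can be packaged into a small helper; this is just the code above wrapped in a function:

import org.apache.spark.rdd.RDD

// Helper packaging steps 1 to 3: each value is incremented by the number of nodes,
// taken in partition order, that satisfy the condition up to and including it.
def conditionalIncrement(rdd: RDD[(Long, Int)], condition: Long => Boolean): RDD[(Long, Int)] = {
  // steps 1 and 2: per-partition counts, then a cumulative sum giving each partition's offset
  val partIncrInit = rdd
    .mapPartitionsWithIndex { case (i, p) => Iterator(i -> p.map(_._1).count(condition)) }
    .collect.sorted
    .map(_._2)
    .scanLeft(0)(_ + _)
  // step 3: increment within each partition, starting from that partition's offset
  rdd.mapPartitionsWithIndex { case (i, p) =>
    var incr = 0
    p.map { case (node, value) =>
      if (condition(node)) incr += 1
      (node, partIncrInit(i) + value + incr)
    }
  }
}

conditionalIncrement(rdd, condition).collect then produces the same array as result.collect above.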
I'm getting data from my database in the reverse order of how I need it to be. In order to get it in the correct order I have a couple of choices: I can insert each new piece of data at index 0 of my array, or just append it and then reverse the array at the end. Something like this:
let data = ["data1", "data2", "data3", "data4", "data5", "data6"]
var reversedArray = [String]()
for item in data {
    reversedArray.insert(item, at: 0)
}

// OR

reversedArray = data.reversed()
Which one of these options would be faster? Would there be any significant difference between the 2 as the number of items increased?
Appending new elements has an amortized complexity of O(1). According to the documentation, reversed() also has constant complexity, since it returns a lazy view; even materializing that view into a real array is only O(n).
Inserting at index 0 has complexity O(n), where n is the current length of the array, and you are inserting all the elements one by one.
So appending and then reversing should be faster. But you won't see a noticeable difference if you're only dealing with a few dozen elements.
Creating the array by repeatedly inserting items at the beginning will be slowest, because it takes time proportional to the square of the number of items involved.
(Clarification: building the entire reversed array this way takes time proportional to n^2, because each insert takes time proportional to the number of items already in the array, which gives 1 + 2 + 3 + ... + n, and that sum is proportional to n squared.)
Reversing the array after building it will be much faster, because it takes time proportional to the number of items involved.
Just accessing the items in reverse order will be faster still, because you avoid reversing the array altogether.
Look up 'big O notation' for more information. Also note that an algorithm with O(n^2) runtime can outperform one with O(n) runtime for small values of n.
My test results…
do {
    let start = Date()
    (1..<100).forEach { _ in
        for var item in data {
            reversedArray.insert(item, at: 0)
        }
    }
    print("First: \(Date().timeIntervalSince1970 - start.timeIntervalSince1970)")
}

do {
    let start = Date()
    (1..<100).forEach { _ in
        reversedArray = data.reversed()
    }
    print("Second: \(Date().timeIntervalSince1970 - start.timeIntervalSince1970)")
}
First: 0.0124959945678711
Second: 0.00890707969665527
Interestingly, running them 10,000 times…
First: 7.67399883270264
Second: 0.0903480052947998
I would like to understand a process's behavior. Basically, this Spark process must create at most five files, one for each territory, and save them into HDFS.
Territories are provided by an array of five strings. But when I look at the Spark UI, I see the same action being executed many times.
These are my questions:
Why has the isEmpty action been executed 4 times for each territory instead of once? I expect just one action per territory.
How is the number of tasks decided when isEmpty is computed? The first time there is just one task, the second time there are 4 tasks, the third 20 and the fourth 35. What is the logic behind that sizing? Can I control that number in some way?
NOTE: if someone has a more "big data" solution to accomplish the same goal, please suggest it.
This is the code excerpt for the Spark process:
class IntegrationStatusD1RequestProcess {

  logger.info(s"Retrieving all measurement point from DB")

  val allMPoints = registryData.createIncrementalRegistryByMPointID()
    .setName("allMPoints")
    .persist(StorageLevel.MEMORY_AND_DISK)

  logger.info("getTerritories return always an array of five String")

  intStatusHelper.getTerritories.foreach { territory =>

    logger.info(s"Retrieving measurement point for territory $territory")

    val intStatusesChanged = allMPoints
      .filter { m => m.getmPoint.substring(0, 3) == territory }
      .setName(s"intStatusesChanged_${territory}")
      .persist(StorageLevel.MEMORY_AND_DISK)

    intStatusesChanged.isEmpty match {
      case true  => logger.info(s"No changes detected for territory")
      case false =>
        // create file and save it into hdfs
    }
  }
}
This is the image showing all the Spark jobs:
The first two images show the isEmpty stages:
isEmpty is inefficient if you expect it to be true!
Here's the RDD code for isEmpty:
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
It calls take. This is an efficient implementation if you think the RDD isn't empty, but is a horrible implementation if you think that it is.
The implementation of take follows these recursive steps, starting with parts = 1:
Collect the first parts partitions.
Check whether this result contains >= n items.
If yes, take the first n.
If no, repeat step 1 with parts = parts * 4.
This implementation strategy lets the execution short-circuit if the RDD has more elements than you want to take, which is usually the case. But if your RDD has fewer elements than you want to take, you end up computing partition #1 log4(nPartitions) + 1 times, partitions #2-4 log4(nPartitions) times, partitions #5-16 log4(nPartitions) - 1 times, and so on.
A better implementation for this use case
This implementation only computes each partition once by sacrificing short-circuit capability:
def fasterIsEmpty(rdd: RDD[_]): Boolean = {
  rdd.mapPartitions(it => Iterator(it.isEmpty))
    .fold(true)(_ && _)
}
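For example, plugging this into the loop from the question (a sketch that just reuses the names from your code excerpt) would look like:

intStatusHelper.getTerritories.foreach { territory =>
  logger.info(s"Retrieving measurement point for territory $territory")

  val intStatusesChanged = allMPoints
    .filter { m => m.getmPoint.substring(0, 3) == territory }
    .setName(s"intStatusesChanged_${territory}")
    .persist(StorageLevel.MEMORY_AND_DISK)

  if (fasterIsEmpty(intStatusesChanged)) {
    logger.info(s"No changes detected for territory $territory")
  } else {
    // create file and save it into hdfs
  }
}

This way each persisted RDD is evaluated only once when checking for emptiness.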
Apologies if the question is poorly phrased, I'll do my best.
If I have a sequence of values with times as an Observable[(U,T)], where U is a value type and T is a time-like type (or anything difference-able, I suppose), how could I write an operator which is an auto-resetting one-touch barrier: it stays silent while abs(u_n - u_reset) < barrier, but emits t_n - t_reset when the barrier is touched, at which point it also resets u_reset = u_n?
That is to say, the first value this operator receives becomes the baseline, and it emits nothing. From then on it monitors the values of the stream, and as soon as one of them differs from the baseline by the barrier amount (above or below), it emits the elapsed time (measured by the timestamps of the events) and resets the baseline. These times will then be processed to form a high-frequency estimate of the volatility.
For reference, I am trying to write a volatility estimator outlined in http://www.amazon.com/Volatility-Trading-CD-ROM-Wiley/dp/0470181990 , where rather than measuring the standard deviation (deviations at regular, homogeneous times), you repeatedly measure the time taken to breach a barrier of some fixed size.
Specifically, could this be written using existing operators? I'm a bit stuck on how the state would be reset, though maybe I need two nested operators, one which is one-shot and another which keeps creating that one-shot... I know it could be done by writing one by hand, but then I would need to write my own publisher, etc.
Thanks!
I don't fully understand the algorithm and the variables in your example, but you can use flatMap with some state held on the heap and return empty() or just() as needed:
int[] var1 = { 0 };

source.flatMap(v -> {
    var1[0] += v;
    if ((var1[0] & 1) == 0) {
        return Observable.just(v);
    }
    return Observable.empty();
});
If you need per-sequence state because of multiple consumers, you can defer the whole thing:
Observable.defer(() -> {
    int[] var1 = { 0 };
    return source.flatMap(v -> {
        var1[0] += v;
        if ((var1[0] & 1) == 0) {
            return Observable.just(v);
        }
        return Observable.empty();
    });
}).subscribe(...);
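Applying the same stateful idea to the barrier logic described in the question, here is a plain-Scala sketch of just the state machine (not an Rx operator), assuming U = Double for values, T = Long for timestamps and a numeric barrier threshold; the same transitions could then be driven from flatMap/defer as above:

// Plain-Scala sketch of the barrier/reset logic from the question:
// the first event sets the baseline; whenever |u - uReset| >= barrier we emit the
// elapsed time t - tReset and reset the baseline to the current event.
def barrierTimes(events: List[(Double, Long)], barrier: Double): List[Long] = {
  val init: (Option[(Double, Long)], List[Long]) = (None, Nil)
  val (_, emitted) = events.foldLeft(init) {
    case ((None, out), (u, t)) =>
      (Some((u, t)), out)                    // first event: set the baseline, emit nothing
    case ((Some((uReset, tReset)), out), (u, t)) =>
      if (math.abs(u - uReset) >= barrier)
        (Some((u, t)), out :+ (t - tReset))  // barrier touched: emit elapsed time, reset
      else
        (Some((uReset, tReset)), out)        // still within the barrier: stay silent
  }
  emitted
}

For example, barrierTimes(List((100.0, 0L), (100.4, 5L), (101.2, 9L)), 1.0) yields List(9): it took 9 time units for the value to move at least 1.0 away from the first value.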