Execution order in Scala for/yield block - scala

I make three database calls (that all return Future values) using this syntax:
for {
  a <- databaseCallA
  b <- databaseCallB(a)
  c <- databaseCallC(a)
} yield (a,b,c)
The second and third call depend on the result of the first, but the two of them could be run in parallel.
How can I get databaseCallC to be issued immediately after databaseCallB (without waiting for the result b)?
Or is this already happening?

This is not happening currently: the for comprehension desugars to nested flatMap calls, so databaseCallC(a) is not even invoked until b is available. To parallelise the second and third calls, you could use this:
for {
  a <- databaseCallA
  (eventualB, eventualC) = (databaseCallB(a), databaseCallC(a))
  b <- eventualB
  c <- eventualC
} yield (a,b,c)
This starts the computations of both b and c as soon as a is available, and completes with the triple once all three are available.
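The start-both-then-wait idea is not specific to Scala. As a rough analogue only (Python's asyncio rather than Scala Futures, with call_a, call_b and call_c as hypothetical stand-ins for the database calls), this sketch creates both dependent tasks before awaiting either one and records the order of events:

```python
import asyncio

events = []  # records the order in which calls start and finish

async def call_a():
    # Hypothetical stand-in for databaseCallA.
    events.append("A start")
    await asyncio.sleep(0.01)
    events.append("A done")
    return 1

async def call_b(a):
    # Hypothetical stand-in for databaseCallB(a).
    events.append("B start")
    await asyncio.sleep(0.02)
    events.append("B done")
    return a + 1

async def call_c(a):
    # Hypothetical stand-in for databaseCallC(a).
    events.append("C start")
    await asyncio.sleep(0.01)
    events.append("C done")
    return a + 2

async def fetch_all():
    a = await call_a()
    # Create both dependent tasks before awaiting either result.
    task_b = asyncio.create_task(call_b(a))
    task_c = asyncio.create_task(call_c(a))
    b = await task_b
    c = await task_c
    return a, b, c

result = asyncio.run(fetch_all())
```

Awaiting call_b(a) directly, before creating the task for call_c(a), would reproduce the sequential behaviour from the question.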

Related

yield results in `finish_bundle` from a custom DoFn

One step of my pipeline involves fetching from an external data source and I'd like to do that in chunks (order doesn't matter). I couldn't find any class that does something similar so I've created the following:
import apache_beam as beam
class FixedSizeBatchSplitter(beam.DoFn):
    def __init__(self, size):
        self.size = size
    def start_bundle(self):
        self.current_batch = []
    def finish_bundle(self):
        if self.current_batch:
            yield self.current_batch
    def process(self, element):
        self.current_batch.append(element)
        if len(self.current_batch) >= self.size:
            yield self.current_batch
            self.current_batch = []
However, when I run this pipeline, I get the error RuntimeError: Finish Bundle should only output WindowedValue type:
with beam.Pipeline() as p:
    res = (p
           | beam.Create(range(10))
           | beam.ParDo(FixedSizeBatchSplitter(3))
          )
Why is that? How come I can yield outputs in process but not in finish_bundle? By the way, if I remove finish_bundle the pipeline works but obviously discards the leftovers.
A DoFn may be processing elements from multiple different windows. When you're in process(), the "current window" is unambiguous: it's the window of the element being processed. When you're in finish_bundle, there is no current element, so the window is ambiguous and you need to specify it explicitly. You need to yield something of the form WindowedValue(something, timestamp, [window]).
If all your data is in the global window, that makes it easier: window will be just GlobalWindow(). If you're using multiple windows, then you'll need to have 1 buffer per window; capture the window in process() so that you add to the proper buffer; and in finish_bundle emit each of them in the respective window.
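The per-window buffering described above can be sketched in plain Python. This is only a model of the bookkeeping, not Beam code: PerWindowBatcher is a hypothetical stand-in for the DoFn, the window is passed in as an ordinary hashable value, and each output is a (window, batch) pair where a real DoFn would take the window from a beam.DoFn.WindowParam argument and wrap the batch in a WindowedValue.

```python
from collections import defaultdict

class PerWindowBatcher:
    """Plain-Python stand-in for the DoFn; not the Beam API."""

    def __init__(self, size):
        self.size = size
        self.buffers = defaultdict(list)  # one buffer per window

    def process(self, element, window):
        # The window is known here, so add to that window's buffer.
        buffer = self.buffers[window]
        buffer.append(element)
        if len(buffer) >= self.size:
            self.buffers[window] = []
            yield window, buffer

    def finish_bundle(self):
        # There is no current element here, so each leftover batch is
        # emitted together with the window it was captured under.
        for window, buffer in self.buffers.items():
            if buffer:
                yield window, buffer
        self.buffers = defaultdict(list)

batcher = PerWindowBatcher(size=2)
emitted = []
inputs = [(1, "w1"), (2, "w2"), (3, "w1"), (4, "w1"), (5, "w2"), (6, "w2")]
for element, window in inputs:
    emitted.extend(batcher.process(element, window))
emitted.extend(batcher.finish_bundle())
```

With a batch size of 2, the full batches come out of process and the two leftovers (one per window) come out of finish_bundle, each still paired with its own window.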

Flatten syntax with yield - improving code readability

I'm trying to improve the readability of my code and I'm having a hard time with this little chunk.
foo is a method that accepts a List[Ping]
Thing.generate returns a List[Ping]
listOfPings is a List[Ping]
hasQuality returns a Boolean value from evaluating a Ping
Here's the code:
foo((for {
  pinger <- listOfPings
} yield pinger.generate.filter(_.hasQuality)).flatten)
Each Ping in listOfPings is creating a List[Thing] with the generate method, meaning the result of the yield at the end of the loop is a List[List[Ping]].
I'm flattening that List[List[Ping]] (not the individual lists), and passing the whole result into foo.
I'm having trouble making this look nicer, potentially with a flatMap? I sincerely appreciate the help.
Something like:
foo {
  for (p <- listOfPings ; q <- p.generate if q.hasQuality) yield q
}

Error in recursive list logic

I am trying to build a list in Scala: given a length and a function, the output should be the list from 0 up to length-1.
For example:
listMaker(3,f) = List(0,1,2)
So far I have created a helper class that takes two Ints and returns a list in that range.
The listMaker function is as follows:
def listMaker[A](length:Int, f:Int =>A):List[A] = length match{
  case 0 => List()
  case _ => listMaker(length,f)
}
My f function just takes a variable x and returns it:
def f(x:Int)=x
The comment below makes sense, but I still get errors. I think the edited code is an easier way to get where I want to go.
However, now I get an infinite loop. What part of the logic am I missing?
A recursive function typically has to gradually "bite off" pieces of the input data until there is nothing left - otherwise it can never terminate.
What this means in your particular case is that length must decrease on each recursive call until it reaches zero.
def listMaker[A](length:Int, f:Int =>A):List[A] = length match{
  case 0 => List()
  case _ => listMaker(length,f)
}
But you are not reducing length: you are passing it unchanged to the next recursive call, so your function cannot terminate.
(There are other problems too - you need to build up your result list as you recurse, but your current code simply returns an empty list. I assume this is a learning exercise, so I'm not supplying working code...).

Scala lazy val caching

In the following example:
def maybeTwice2(b: Boolean, i: => Int) = {
  lazy val j = i
  if (b) j+j else 0
}
Why is "hi" not printed twice when I call it like this:
maybeTwice2(true, { println("hi"); 1+41 })
This example is actually from the book "Functional Programming in Scala", and the reason given for why "hi" is not printed twice is not convincing enough for me. So I thought of asking here!
So i is a by-name parameter, in effect a function that produces an integer, right? When you call the method with b set to true, the first branch of the if expression is executed.
What happens is that j is defined in terms of i, and the first time j is used in a computation it evaluates i, printing "hi" and caching the resulting value 1 + 41 = 42. The second time j is used, the value is already computed, so the method returns 84 without evaluating i twice, thanks to the lazy val j.
This SO answer explores how a lazy val is internally implemented. In j + j, j is a lazy val, which amounts to a function which executes the code you provide for the definition of the lazy val, returns an integer and caches it for further calls. So it prints hi and returns 1+41 = 42. Then the second j gets evaluated, and calls the same function. Except this time, instead of running your code, it fetches the value (42) from the cache. The two integers are then added (returning 84).
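The caching behaviour can be imitated outside Scala. The following is only a rough model in Python, not how the Scala compiler implements lazy val: Lazy is a hypothetical helper, the by-name parameter i is modelled as a zero-argument function, and a list records each evaluation in place of println.

```python
class Lazy:
    """Evaluates a zero-argument function at most once and caches the result."""
    _UNSET = object()

    def __init__(self, thunk):
        self._thunk = thunk
        self._value = Lazy._UNSET

    def get(self):
        if self._value is Lazy._UNSET:
            self._value = self._thunk()  # first access: run the code
        return self._value               # later accesses: cached value

calls = []  # records each evaluation of the thunk

def maybe_twice2(b, i):
    j = Lazy(i)  # nothing is evaluated yet
    return j.get() + j.get() if b else 0

def noisy():
    calls.append("hi")
    return 1 + 41

result = maybe_twice2(True, noisy)
skipped = maybe_twice2(False, noisy)
```

The second j.get() finds the cached value, so noisy runs once for the first call and not at all when b is False.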

Remove variable labels attached with foreign/Hmisc SPSS import functions

As usual, I got an SPSS file that I've imported into R with the spss.get function from the Hmisc package. I'm bothered by the labelled class that Hmisc::spss.get adds to all variables in the data.frame, and I want to remove it.
The labelled class gives me headaches when I try to run ggplot or even when I want to do some menial analysis! One solution would be to remove the labelled class from each variable in the data.frame. How can I do that? Is that possible at all? If not, what are my other options?
I really want to avoid re-editing variables "from scratch" with as.data.frame(lapply(x, as.numeric)) and as.character where applicable... And I certainly don't want to run SPSS and remove labels manually (I don't like SPSS, nor care to install it)!
Thanks!
Here's how I get rid of the labels altogether. Similar to Jyotirmoy's solution but works for a vector as well as a data.frame. (Partial credits to Frank Harrell)
clear.labels <- function(x) {
  if(is.list(x)) {
    for(i in 1 : length(x)) class(x[[i]]) <- setdiff(class(x[[i]]), 'labelled')
    for(i in 1 : length(x)) attr(x[[i]],"label") <- NULL
  }
  else {
    class(x) <- setdiff(class(x), "labelled")
    attr(x, "label") <- NULL
  }
  return(x)
}
Use as follows:
my.unlabelled.df <- clear.labels(my.labelled.df)
EDIT
Here's a bit of a cleaner version of the function, same results:
clear.labels <- function(x) {
  if(is.list(x)) {
    for(i in seq_along(x)) {
      class(x[[i]]) <- setdiff(class(x[[i]]), 'labelled')
      attr(x[[i]],"label") <- NULL
    }
  } else {
    class(x) <- setdiff(class(x), "labelled")
    attr(x, "label") <- NULL
  }
  return(x)
}
A belated note/warning regarding class membership in R objects. The correct method for identifying "labelled" is not to test with an is function or equality (==) but rather with inherits. Methods that test for a specific position in the class vector will miss cases where the order of the existing classes is not the one assumed.
You can avoid creating "labelled" variables in spss.get with the argument use.value.labels=FALSE.
w <- spss.get('/tmp/my.sav', use.value.labels=FALSE, datevars=c('birthdate','deathdate'))
The code from Bhattacharya could fail if the class of the labelled vector were simply "labelled" rather than c("labelled", "factor") in which case it should have been:
class(x[[i]]) <- NULL # no error from assignment of empty vector
The error you report can be reproduced with this code:
> b <- 4:6
> label(b) <- 'B Label'
> str(b)
Class 'labelled' atomic [1:3] 4 5 6
..- attr(*, "label")= chr "B Label"
> class(b) <- class(b)[-1]
Error in class(b) <- class(b)[-1] :
invalid replacement object to be a class string
You can try out the read.spss function from the foreign package.
A rough and ready way to get rid of the labelled class created by spss.get
for (i in 1:ncol(x)) {
  z<-class(x[[i]])
  if (z[[1]]=='labelled'){
    class(x[[i]])<-z[-1]
    attr(x[[i]],'label')<-NULL
  }
}
But can you please give an example where labelled causes problems?
If I have a variable MAED in a data frame x created by spss.get, I have:
> class(x$MAED)
[1] "labelled" "factor"
> is.factor(x$MAED)
[1] TRUE
So well-written code that expects a factor (say) should not have any problems.
Suppose:
library(Hmisc)
w <- spss.get('...')
You could remove the labels of a variable called "var1" by using:
attributes(w$var1)$label <- NULL
If you also want to remove the class "labelled", you could do:
class(w$var1) <- NULL
or if the variable has more than one class:
class(w$var1) <- class(w$var1)[-which(class(w$var1)=="labelled")]
Hope this helps!
Well, I figured out that unclass function can be utilized to remove classes (who would tell, aye?!):
library(Hmisc)
# let's presuppose that variable x is gathered through spss.get() function
# and that x is factor
> class(x)
[1] "labelled" "factor"
> foo <- unclass(x)
> class(foo)
[1] "integer"
It's not the luckiest solution, just imagine back-converting a bunch of vectors... If anyone tops this, I'll check it as an answer...