Issues with Spark Serialization? - scala

Consider a job that contains two phases which (for convenience) cannot be merged. Let's call the first step A and the second step B. Hence, we always need to do A and then B.
Workflow
start new cluster with a job A
    build MyOutput
    count(MyOutput) = 2000 (*)
    write(MyOutput)
start new cluster with a job B
    read(MyOutput)
    count(MyOutput) = 1788 (**)
Details
A provides an output which is an RDD[MyObject], namely MyOutput. To write MyOutput, here is what I do: MyOutput.saveAsObjectFile("...")
Then B uses MyOutput as an input, reading the previously written file. Here is what I do: val MyOutput: RDD[MyObject] = sc.objectFile("...")
A and B happen on two separate clusters.
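Putting the two steps together, this is roughly what the two jobs do (a sketch only: sc is the SparkContext of each job, and buildMyOutput stands in for whatever step A actually computes):
import org.apache.spark.rdd.RDD
// Job A (first cluster): build MyOutput, count it, then persist it as an object file
val MyOutput: RDD[MyObject] = buildMyOutput()   // hypothetical helper for step A
println(MyOutput.count())                       // 2000 (*)
MyOutput.saveAsObjectFile("...")                // same (elided) path in both jobs
// Job B (second cluster): read the object file back and count it again
val MyOutputRead: RDD[MyObject] = sc.objectFile("...")
println(MyOutputRead.count())                   // sometimes only 1788 (**)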
Problem
The main point is that the problem does not ALWAYS happen, but when it does appear, (**) < (*) -- it seems we lost some data, whereas there should not be any difference between these counts. Obviously something is wrong.
Do you know what is happening here? How can I solve it?

Related

Unexpected Drools Ruleflow Behaviour

Drools version: 6.5.0
For a rule flow that takes the (Start -> A -> B -> End) route, the expectation is that all the rules in A (RuleflowGroup: A) will be executed before all the rules in B (RuleflowGroup: B). However, the result produced by the AgendaEventListener methods (i.e., beforeMatchFired, afterMatchFired) follows the reverse order: rules associated with B are executed before rules associated with A.
Any explanation would be very helpful.
Please find the rule flow diagram below.
If it is the same as version 7.x, which I am currently using, it is because the ruleflow is a bit more complicated than you think. It is not just a flow (A -> B -> C); it is a stack.
So it is A, then B/A, then C/B/A. When C finishes executing, it returns back to B/A and then to A.
If you want to get rid of that, you can add a rule at the last level, with the lowest priority, with eval(true) in the when part and halt() in the then part, to end the session before it returns to the previous ruleflow group.

scalaz-stream consume stream based on computed value

I've got two streams and I want to be able to consume only one based on a computation that I run every x seconds.
I think I basically need to create a third tick stream - something like every(3.seconds) - that does the computation and then come up with sort of a switch between the other two.
I'm kind of stuck here (and I've only just started fooling around with scalaz-stream).
Thanks!
There are several ways we can approach this problem. One way to approach it is using awakeEvery. For a concrete example, see here.
To describe the example briefly, suppose we would like to query Twitter every 5 seconds, get the tweets, and perform sentiment analysis. We can compose this pipeline as follows:
val source =
  awakeEvery(5 seconds) |> buildTwitterQuery(query) through queryChannel flatMap {
    Process emitAll _
  }
Note that the queryChannel can be stated as follows.
def statusTask(query: Query): Task[List[Status]] = Task {
  twitterClient.search(query).getTweets.toList
}
val queryChannel: Channel[Task, Query, List[Status]] = channel lift statusTask
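To actually run the pipeline and get results back, you could, for example, collect the first few emitted statuses (take(10) here is just an arbitrary cut-off for illustration):
// run the stream and collect the first 10 statuses it produces
val firstStatuses = source.take(10).runLog.run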
Let me know if you have any questions. As stated earlier, for the complete example, see this.
I hope it helps!

Slow Performance When Using Scalaz Task

I'm looking to improve the performance of a Monte Carlo simulation I am developing.
I first did an implementation which simulates each path sequentially, as follows:
def simulate() = {
  for (path <- 0 to 30000) {
    (0 to 100).foreach(
      x => // do some computation
    )
  }
}
This basically simulates 30,000 paths, each of which has 100 discretised random steps.
The above function runs very quickly on my machine (about 1s) for the calculation I am doing.
I then thought about speeding it up even further by making the code run in a multithreaded fashion.
I decided to use Task for this and I coded the following:
val simulation = (1 |-> 30000).map(n => Task {
  (1 |-> 100).map(x => // do some computation)
})
I then use this as follows:
Task.gatherUnordered(simulation).run
When I kick this off, I know my machine is doing a lot of work, as I can see that in the activity monitor and the machine fan is going ballistic. After about two minutes of heavy activity, the work it seems to be doing finishes, but I don't get any value returned (I am expecting a collection of Doubles from each task that was processed).
My questions are:
1. Why does this take longer than the sequential example? I am more than likely doing something wrong but I can't see it.
2. Why don't I get any returned collection of values from the tasks that are apparently being processed?
I'm not sure why Task.gatherUnordered is so slow, but if you change Task.gatherUnordered to Nondeterminism.gatherUnordered everything will be fine:
import scalaz.Nondeterminism
Nondeterminism[Task].gatherUnordered(simulation).run
I'm going to create an issue on Github about Task.gatherUnordered. This definitely should be fixed.
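For reference, a self-contained sketch of the corrected version (using a plain Range instead of |->, with a dummy computation standing in for the real one) could look like this:
import scalaz.Nondeterminism
import scalaz.concurrent.Task
// one Task per path, each doing 100 steps of a dummy computation
val simulation: List[Task[Double]] =
  (1 to 30000).toList.map { n =>
    Task {
      (1 to 100).map(x => math.random).sum
    }
  }
// gather through the Nondeterminism instance instead of Task.gatherUnordered
val results: List[Double] = Nondeterminism[Task].gatherUnordered(simulation).run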

How to read from TCP and write to stdout?

I'm failing to get a simple scalaz-stream example running, reading from TCP and writing to std out.
val src = tcp.reads(1024)
val addr = new InetSocketAddress(12345)
val p = tcp.server(addr, concurrentRequests = 1) {
  src ++ tcp.lift(io.stdOutLines)
}
p.run.run
It just sits there, not printing anything.
I've also tried various arrangements using to, always with the tcp.lift incantation to get a Process[Connection, A], including
tcp.server(addr, concurrentRequests = 1)(src) map (_ to tcp.lift(io.stdOutLines))
which doesn't even compile.
Do I need to wye the source and print streams together? An example I found on the original pull request for tcp replacing nio seemed to indicate this, but wye no longer appears to exist on Process, so confusion reigns unfortunately.
Edit: it turns out that in addition to the type problems explained by Paul, you also need to run the inner processes "manually", for example by doing p.map(_.run.run).run.run. I don't suppose that's the idiomatic way to do this, but it does work.
You need to pass src through the sink to actually write anything. I think this should do it:
import scalaz.stream.{io, tcp, text}
import scalaz.stream.tcp.syntax._
val p = tcp.server(addr, concurrentRequests = 1) {
  tcp.reads(1024).pipe(text.utf8Decode) through tcp.lift(io.stdOutLines)
}
p.run.run
The expression src ++ tcp.lift(io.stdOutLines) should really be a type error. The type of tcp.reads(1024) is Process[Connection,ByteVector], and the type of tcp.lift(io.stdOutLines) is Process[Connection, String => Task[Unit]]. Appending those two processes does not make sense, and the only reason it typechecks is due to the covariance of Process[+F[_],+O]. Scala is "helpfully" inferring Any when you append two processes with unrelated output types.
A future release of scalaz-stream may add a constraint on ++ and other functions that exploit covariance to make sure the least upper bound that gets computed isn't something useless like Any or Serializable. This would go a long way to preventing mistakes like this. In the meantime, make sure you understand the types of all the functions you are working with, what they do, and how you are sticking them together.
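As an aside, the same covariance pitfall is easy to reproduce with plain collections: appending sequences with unrelated element types compiles, but the inferred least upper bound is useless.
val ints: List[Int] = List(1, 2, 3)
val sinks: List[String => Unit] = List((s: String) => println(s))
// compiles only because List is covariant; the element type is inferred as Any
val appended = ints ++ sinks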

Merging scalaz-stream input processes seems to "wait" on stdin

I have a simple program:
import scalaz._
import stream._
object Play extends App {
  val in1 = io.linesR("C:/tmp/as.txt")
  val in2 = io.linesR("C:/tmp/bs.txt")
  val p = (in1 merge in2) to io.stdOutLines
  p.run.run
}
The file as.txt contains five as and the file bs.txt contains 3 bs. I see this sort of output:
a
b
b
a
a
b
a
a
a
However, when I change the declaration of in2 as follows:
val in2 = io.stdInLines
Then I get what I think is unexpected behaviour. According to the documentation 1, the program should pull data non-deterministically from each stream according to whichever stream is quicker to supply stuff. This should mean that I see a bunch of as immediately printed to the console but this is not what happens at all.
Indeed, until I press ENTER, nothing happens. It's quite clear that the behaviour looks a lot like what I would expect if I was choosing a stream at random to get the next element from and then, if that stream was blocking, the merged process blocks too (even if the other stream contains data).
What is going on?
[1] Well, OK, there is very little documentation, but Dan Spiewak said very clearly in his talk that it would grab whichever stream was the first to supply data.
The problem is in the implementation of stdInLines. It is blocking: it never forks the read onto another thread with Task.
Try changing the implementation of stdInLines to this one:
def stdInLines: Process[Task, String] =
  Process.repeatEval(Task.apply {
    Option(scala.Console.readLine())
      .getOrElse(throw Cause.Terminated(Cause.End))
  })
The original io.stdInLines is running the readLine() in the same thread, so it always waits there until you type something.
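With this forked stdInLines in scope, the original program can be wired up the same way, and the file lines should no longer be held back by the blocking console read (a sketch reusing the definitions above):
val in1 = io.linesR("C:/tmp/as.txt")
val p = (in1 merge stdInLines) to io.stdOutLines
p.run.run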