Weird behavior of reduceByKeyAndWindow function in Spark - scala

I am using Spark 1.6 and came across the function reduceByKeyAndWindow, which I am using to perform a word count over data transmitted over a Kafka topic.
Following is the list of alternatives that reduceByKeyAndWindow provides. As we can see, all the alternatives have similar signatures with extra parameters.
When I use reduceByKeyAndWindow with just my reduce function, or with my reduce function and a duration, it works and doesn't give me any errors, as shown below.
But when I use the alternative with a reduce function, a duration and a sliding window time, it starts giving me the following error; the same happens with the other alternatives, as shown below.
I am not really sure what is happening here or how I can fix the problem.
Any help is appreciated.

If you comment out the line .map(x => (x, 1L)) applied to words, you should be able to use the method .reduceByWindow(_ + _, Seconds(2), Seconds(2)) from DStream.
If you transform the words into (word, count) pairs, then you should use the method below.
reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
Please see the documentation for more details on what the reduce function and inverse reduce function are: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
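For reference, here is a minimal sketch of a windowed word count using the inverse-reduce variant. The socket source and local master are illustrative stand-ins for the Kafka stream from the question, and explicit Long parameter types are used on the reduce functions to keep type inference happy:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("checkpoint")   // checkpointing is required by the inverse-reduce variant

// Stand-in for the Kafka DStream from the question.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))

// Pair each word with a count of 1, then reduce over a 10-minute window sliding every 2 seconds.
val wordCounts = words.map(x => (x, 1L)).reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // reduce: add counts entering the window
  (a: Long, b: Long) => a - b,   // inverse reduce: subtract counts leaving the window
  Minutes(10), Seconds(2), 2)

wordCounts.print()
ssc.start()
ssc.awaitTermination()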

Related

why is the map function inherently parallel?

I was reading the following presentation:
http://www.idt.mdh.se/kurser/DVA201/slides/parallel-4up.pdf
and the author claims that the map function lends itself very well to parallelism (specifically, he supports this claim on page 3, i.e. slides 9 and 10).
If one were given the problem of increasing each value of a list by 1, I can see how looping through the list imperatively would require an index value to change and hence cause potential race condition problems. But I'm curious how the map function better allows a programmer to write parallel code successfully.
Is it due to the way map is recursively defined? So each function call can be thrown to a different thread?
I'm hoping someone can provide some specifics. Thanks!
The map function applies the same pure function to n elements in a collection and aggregates the results. The order in which you apply the function to the members of the collection doesn't matter, because by definition the return value of the function depends purely on its input.
The others already explained that the standard map implementation isn't parallel.
But in Scala, since you tagged it, you can get the parallel version as simply as
val list = ... // some list
list.par.map(x => ...) // instead of list.map(x => ...)
See also Parallel Collections Overview and documentation for ParIterable and other types in the scala.collection.parallel package.
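For instance, a minimal runnable sketch (assuming Scala 2.12, where the parallel collections still ship with the standard library):

// Increment every element of a list, sequentially and in parallel.
val xs = (1 to 1000000).toList

val sequential = xs.map(_ + 1)       // single-threaded
val parallel   = xs.par.map(_ + 1)   // work is split across a fork/join pool; element order is preserved

assert(sequential == parallel.seq)   // same result either way, because the function is pure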
You can find the implementation of the parallel map in https://github.com/scala/scala/blob/v2.12.1/src/library/scala/collection/parallel/ParIterableLike.scala, if you want (look for def map and class Map). It requires very non-trivial infrastructure and certainly isn't just taking the recursive definition of sequential map and parallelizing it.
If one had defined map via a loop, how would that break down?
The slides give F# parallel arrays as the example at the end, and at https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/array.fs#L266 you can see that the non-parallel implementation there is a loop:
let inline map (mapping: 'T -> 'U) (array: 'T[]) =
    checkNonNull "array" array
    let res : 'U[] = Microsoft.FSharp.Primitives.Basics.Array.zeroCreateUnchecked array.Length
    for i = 0 to res.Length-1 do
        res.[i] <- mapping array.[i]
    res
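For comparison, here is what a loop-based sequential map might look like in Scala (a sketch, not the standard library's code). Nothing in it is inherently unparallelisable, but because each iteration writes to a specific slot, the programmer would have to split the index range and manage threads by hand, which is exactly the work the parallel collections do for you:

import scala.reflect.ClassTag

// A loop-based map over an array: every iteration writes to a distinct slot,
// so the iterations are independent and could in principle run on separate threads.
def mapLoop[T, U: ClassTag](f: T => U)(in: Array[T]): Array[U] = {
  val out = new Array[U](in.length)
  var i = 0
  while (i < in.length) {
    out(i) = f(in(i))
    i += 1
  }
  out
}

val doubled = mapLoop[Int, Int](_ * 2)(Array(1, 2, 3))   // Array(2, 4, 6)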

scalaz-stream consume stream based on computed value

I've got two streams and I want to be able to consume only one based on a computation that I run every x seconds.
I think I basically need to create a third tick stream - something like every(3.seconds) - that does the computation and then comes up with a sort of switch between the other two.
I'm kind of stuck here (and I've only just started fooling around with scalaz-stream).
Thanks!
There are several ways we can approach this problem. One way to approach it is using awakeEvery. For a concrete example, see here.
To describe the example briefly, suppose we would like to query Twitter every 5 seconds, get the tweets, and perform sentiment analysis. We can compose this pipeline as follows:
val source =
  awakeEvery(5 seconds) |> buildTwitterQuery(query) through queryChannel flatMap {
    Process emitAll _
  }
Note that the queryChannel can be stated as follows.
def statusTask(query: Query): Task[List[Status]] = Task {
  twitterClient.search(query).getTweets.toList
}

val queryChannel: Channel[Task, Query, List[Status]] = channel lift statusTask
Let me know if you have any questions. As stated earlier, for the complete example, see this.
I hope it helps!

Graphx: I've got NullPointerException inside mapVertices

I want to use GraphX. For now I am just launching it locally.
I get a NullPointerException in these few lines. The first println works well, and the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
  42 // doesn't mean anything
}).vertices.collect
It does not matter which method of the 'graph' object I call. But 'graph' is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
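If the value you need inside the closure is more than a bare scalar but still small enough to live on the driver, an explicit broadcast variable does the same job; a minimal sketch, assuming sc is the active SparkContext:

// Compute the value once on the driver, then ship it to the executors as a read-only broadcast.
val inDegreeCount = sc.broadcast(graph.inDegrees.count)

graph.mapVertices { (id, v) =>
  // Only the broadcast handle is captured by the closure, never the Graph itself.
  println("graph.inDegrees = " + inDegreeCount.value)
  v
}.vertices.collect()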
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.

Get a value from RichPipe

I have a RichPipe with 3 fields: name: String, time: Long and value: Int. I need to get the value for a specific (name, time) pair. How can I do it? I can't figure it out from the Scalding documentation, as it is very cryptic, and I can't find any examples that do this.
Well, a RichPipe is not a key-value store, which is why there is no documentation on using it as a key-value store :) A RichPipe should be thought of as a pipe - so you can't get at data in the middle without first going in at one end and traversing the pipe until you find the element you're looking for. Furthermore, this is a little painful in Scalding because you have to write your results to disk (because it's built on top of Hadoop) and then read the result back from disk in order to use it in your application. So the code will be something like:
myPipe.filter[(String, Long)](('name, 'time)) { _ == (specificName, specificTime) }
  .write(Tsv("tmp/location"))
Then you'll need some higher level code to run the job and read the data back into memory to get at the result. Rather than write out all the code to do this (it's pretty straightforward), why don't you give some more context about what your use case is and what you are trying to do - maybe you can solve your problem under the Map-Reduce programming model.
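For the read-back step, something as simple as reading the written TSV off the output directory works; a sketch, assuming the job wrote a single local part file (the part-00000 name is illustrative):

import scala.io.Source

// Read the first matching row back from the Tsv output and pull out the third (value) column.
val value: Int = Source.fromFile("tmp/location/part-00000")
  .getLines()
  .map(_.split("\t"))
  .collectFirst { case Array(_, _, v) => v.toInt }
  .getOrElse(sys.error("no matching row found"))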
Alternatively, use Spark: you'll have the same problem of having to traverse a distributed dataset, but you don't have the faff of writing to disk and reading back again. Furthermore, you can use a custom partitioner in Spark, which could give you near key-value-store-like behaviour. But anyway, naively, the code would be:
val theValueYouWant =
  myRDD.filter {
    case (`specificName`, `specificTime`, _) => true
    case _ => false
  }
  .toArray.head._3

Scala linked list stackoverflow

Using Scala, I have added about 100000 nodes to a linked list. When I use the length function, for example mylist.length, I get a java.lang.StackOverflowError. Is my list too big to process? The list contains only String objects.
It appears the library implementation is not tail-recursive: override def length: Int = if (isEmpty) 0 else next.length + 1. It seems like this is something that could be discussed on the mailing list to check if an enhancement ticket should be opened.
You can compute the length like this:
@annotation.tailrec
def length[T](l: LinkedList[T], acc: Int = 0): Int =
  if (l.isEmpty) acc else length(l.tail, acc + 1)
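A quick usage sketch (the list construction here is just illustrative):

import scala.collection.mutable.LinkedList

val myList = LinkedList((1 to 100000).map(_.toString): _*)
println(length(myList))   // 100000, computed in constant stack space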
In Scala, computing the length of a List is an order n operation, therefore you should try to avoid it. You might consider switching to an Array, as that is a constant time operation.
You could try increasing the stack/heap size available to the JVM.
JAVA_OPTS="-Xmx512M -Xms16M -Xss16M" scala MyClass.scala
Where
-Xss<size> maximum native stack size for any thread
-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size
This question has some more information.
See also this Scala document.
Can you confirm that you truly need to use the length method? It sounds like you might not be using the correct collection type for your use-case (hard to tell without any extra information). Lists are optimised to be mapped over using folds or a tail-recursive function.
That said, this is absolutely an oversight that can easily be fixed in the standard library with a tail-recursive function. Hopefully we can get it in time for 2.9.0.