Get a value from RichPipe - scala

I have a RichPipe with 3 fields: name: String, time: Long and value: Int. I need to get the value for a specific (name, time) pair. How can I do it? I can't figure it out from the Scalding documentation, which is very cryptic, and I can't find any examples that do this.

Well, a RichPipe is not a key-value store, which is why there is no documentation on using it as one :) A RichPipe should be thought of as a pipe - so you can't get at data in the middle without first going in at one end and traversing the pipe until you find the element you're looking for. Furthermore, this is a little painful in Scalding because you have to write your results to disk (since it's built on top of Hadoop) and then read the result back from disk in order to use it in your application. So the code will be something like:
myPipe
  .filter[(String, Long)](('name, 'time)) { case (name, time) =>
    name == specificName && time == specificTime
  }
  .write(Tsv("tmp/location"))
Then you'll need some higher-level code to run the job and read the data back into memory to get at the result. Rather than write out all the code to do this (it's pretty straightforward), why don't you give some more context about your use case and what you are trying to do - maybe you can solve your problem within the Map-Reduce programming model.
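If you do go down this road, the read-back step might look roughly like the following - a sketch only, assuming the job runs in local mode so that tmp/location ends up as a plain TSV file on the local filesystem (on a real cluster you would read the part files from HDFS instead), and that the fields come out in name, time, value order:
import scala.io.Source

val Array(_, _, value) =
  Source.fromFile("tmp/location").getLines().next().split("\t")
val result = value.toInt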
Alternatively, use Spark. You'll have the same problem of having to traverse a distributed dataset, but you don't have the faff of writing to disk and reading back again. Furthermore, you can use a custom partitioner in Spark, which could give you near key-value-store-like behaviour. Naively, though, the code would be:
val theValueYouWant =
  myRDD.filter {
    case (`specificName`, `specificTime`, _) => true
    case _ => false
  }
  .collect()
  .head._3
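To sketch the custom-partitioner idea mentioned above: if you key the RDD by (name, time) and give it a partitioner, Spark's lookup only scans the partition that owns the key, which is about as close to key-value-store behaviour as an RDD gets. Names below are illustrative, and on older Spark versions you may need import org.apache.spark.SparkContext._ for the pair-RDD methods:
import org.apache.spark.HashPartitioner

val byKey = myRDD
  .map { case (name, time, value) => ((name, time), value) }
  .partitionBy(new HashPartitioner(64))
  .cache()

// lookup computes only the partition that owns the key
val values: Seq[Int] = byKey.lookup((specificName, specificTime))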

Related

Is it Scala style to use a for loop in Scala/Spark?

I have heard that it is good practice in Scala to eliminate for loops and do things "the Scala way". I even found a Scala style checker at http://www.scalastyle.org. Are for loops a no-no in Scala? In a course at https://www.udemy.com/course/apache-spark-with-scala-hands-on-with-big-data/learn/lecture/5363798#overview I found this example, which makes me think that for loops are okay to use - in the Scala format and syntax, of course, on a single line and not like traditional Java for loops spread over multiple lines of code. See this example I found in that Udemy course:
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
That for loop prints this result, as expected:
Enterprise
Defiant
Voyager
Deep Space Nine
I was wondering if using for as in the example above is acceptable Scala style code, or it if is a no-no and why. Thank you!
There is no problem with this for loop, but you can use the functions on List to do the same work in a more functional way.
e.g. instead of using
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
You can use
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
shipList.foreach(element => println(element))
or
shipList.foreach(println)
You can use for loops in Scala; there is no problem with that. The difference is that this for loop does not return a useful value (it evaluates to Unit), so if you need a result out of it you have to fall back on mutating a variable. Scala gives preference to working with immutable values.
In your example you print messages to the console: you rely on a side effect, which breaks referential transparency. That is, you either depend on an IO operation to extract a value, or you mutate a variable in the enclosing scope - a variable that might be accessed by another thread or another concurrent task - so there is no guarantee that the value you collect will be what you expect. All of these concerns belong to concurrent/parallel programming, and that is exactly where Scala and the immutable style help.
To show the elements of a collection you can use a for loop, but if you want to, say, count the total number of characters, you would do that in Scala with an expression like:
val chars = shipList.foldLeft(0)((a, b) => a + b.length)
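Unlike the plain for loop above, a for comprehension with yield is an expression and does return a value, so the same count could also be written (a small sketch reusing shipList) as:
val charsViaFor = (for (ship <- shipList) yield ship.length).sum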
To sum up, most of the time the Scala code you will read uses an immutable style of programming, although not always, since Scala supports the other way of coding too; but it is unusual to find Scala written in a classic Java OOP style, mutating object instances and using getters and setters.

Losing path dependent type when extracting value from Try in scala

I'm working with scalax to generate a graph of my Spark operations, using a custom library that builds the graph for me. Let me show a sample:
val DAGWithoutGet = createGraphFromOps(ops)
val DAGWithGet = createGraphFromOps(ops).get
The return type of DAGWithoutGet is
scala.util.Try[scalax.collection.Graph[typeA, scalax.collection.GraphEdge.DiEdge]],
and, for DAGWithGet, it is
scalax.collection.Graph[typeA, scalax.collection.GraphEdge.DiEdge].
Here, typeA is a project-related class representing a single Spark operation, not relevant in the context of this question. (For context only: what my custom library does is, essentially, generate a map of dependencies between those operations, creating a big Map object and calling Graph(myBigMap: _*) to generate the graph.)
As far as I know, calling the .get command on this point of my code or later should not make any difference, but that is not what I'm seeing.
Calling DAGWithoutGet.get.nodes has a return type of scalax.collection.Graph[typeA,DiEdge]#NodeSetT,
while calling DAGWithGet.nodes returns DAGWithGet.NodeSetT.
When I extract one of those nodes (using the .find method), I receive scalax.collection.Graph[typeA,DiEdge]#NodeT and DAGWithGet.NodeT types, respectively. Much to my dismay, even the methods available in each case are different - I cannot use pathTo (which happens to be what I want) or withSubgraph on the former, only on the latter.
My question, then, after this relatively complex example, is: what is going on here? Why does extracting the value from the Try at different moments lead to different types, one path dependent and the other not - or, if that isn't the right way to put it, what am I missing here?
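For readers who want the general mechanism, here is a minimal sketch (plain Scala, no scalax, hypothetical names): a member type is path dependent only when its prefix is a stable identifier such as a val, whereas an arbitrary expression like someTry.get only gives you the type projection.
import scala.util.Try

class Container { class Item; def item = new Item }

val tried: Try[Container] = Try(new Container)

val stable = tried.get                  // a val is a stable path
val a: stable.Item = stable.item        // path-dependent type: stable.Item

val b: Container#Item = tried.get.item  // `tried.get` is not a stable path,
                                        // so only the projection Container#Item remains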

Scala spark: Efficient check if condition is matched anywhere?

What I want is roughly equivalent to
df.where(<condition>).count() != 0
But I'm pretty sure it's not smart enough to stop as soon as it finds such a violation. I would expect some sort of aggregator to be able to do this, but I haven't found one. I could do it with a max and some sort of conversion, but again I don't think it would necessarily know to quit early (max not being specific to booleans, I'm not sure it understands that no value is larger than true).
More specifically, I want to check whether a column contains only a single element. Right now my best idea is to grab the first value and compare everything else against it.
I would try this option; it should be much faster:
df.where(<condition>).head(1).isEmpty
You can also try to define your condition on a row together with Scala's exists (which stops at the first occurrence of true):
df.mapPartitions(rows => if(rows.exists(row => <condition>)) Iterator(1) else Iterator.empty).isEmpty
In the end, you should benchmark the alternatives.
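For the original use case (does the column contain only a single distinct value?), a rough sketch combining the first value with the head(1).isEmpty trick could look like this - the column name "col" is a placeholder, =!= assumes Spark 2.x, and nulls are ignored:
val first = df.select("col").head().get(0)
val hasOtherValues = !df.where(df("col") =!= first).head(1).isEmpty
// the column holds a single value exactly when hasOtherValues is false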

Graphx: I've got NullPointerException inside mapVertices

I want to use GraphX. For now I just launch it locally.
I get a NullPointerException in these few lines. The first println works fine, but the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
  42 // doesn't mean anything
}).vertices.collect
It does not matter which method of the graph object I call there - they all fail. But graph itself is not null inside mapVertices.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX (Scala 2.10 build) on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
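Here the shared value is just a Long, so capturing it in the closure is enough; if it were something bigger, an explicit broadcast variable would be the usual tool (a sketch, assuming a SparkContext named sc):
val inDegreeCount = sc.broadcast(graph.inDegrees.count) // computed once, on the driver
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + inDegreeCount.value)   // executors read the broadcast value
  42
}).vertices.collect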
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only available on the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.

reduce array efficiently in coffeescript

If I have an array of objects in a var. I want to reduce this so that they are grouped by particular property. This is my code
array = tracks.reduce (x, y, i) ->
  x[y.album] = []
  x
, {}

albums = tracks.reduce (x, y, i) ->
  array[y.album].push {'name': y.name, 'mp3': y.mp3}
  array
, {}
console.log(albums)
It outputs what I want; however, I want to know if there is a better way to write this, without having to do the first loop just to create the empty arrays for the groups.
Thanks.
Yes, you can use the or= or ?= operator to assign array[y.album] only if it hasn't been initialized, and therefore use only one loop. BTW, I think it's a bit confusing that the array variable is an object. Another way of coding this, using a CoffeeScript loop instead of reduce, is:
albums = {}
for {album, name, mp3} in tracks
  (albums[album] or= []).push {name, mp3}
Notice that I'm using destructuring to get the track properties all at once.
Or, if you want to use a reduce:
albums = tracks.reduce (albums, {album, name, mp3}) ->
  (albums[album] or= []).push {name, mp3}
  albums
, {}
But I think the CS-loop version reads a bit better :)
Bonus track (pun intended): if you happen to have Underscore.js, I'd strongly recommend using groupBy, which does exactly this kind of grouping job:
albums = _.groupBy tracks, (track) -> track.album
Notice that the tracks for each album name in albums will be "complete" tracks (not just the name and mp3 properties).
Update: a comment about performance: when asked to do something "efficiently", I interpret it as doing it in the most direct and clean way possible (I'm thinking about the programmer's efficiency when reading the code), but many people will relate efficiency strictly to runtime performance.
Regarding performance, all three solutions are O(n) in the number of tracks, so none of them is dramatically worse than the others.
That said, raw for loops tend to run faster on modern JS engines than their higher-order equivalents - forEach, reduce, etc. (which is quite saddening IMO :(...) - so the first version should run faster than the second one.
In the case of the Underscore version I won't make any prediction: Underscore is known for using higher-order functions a lot instead of raw for loops, but at the same time that version does not create a new object for each track.
In any case, you should always profile your code before changing your solution to one that might be more performant but is less readable. If you notice that that particular loop is a bottleneck, and you have a good set of data to benchmark it, jsPerf can be really useful :)