I'm trying to perform a series of transformations on log data with Scala, and I'm having difficulty matching tuples. I have a DataFrame with user ids, urls, and dates. I can map the DataFrame to an RDD and reduce by key with this map:
val countsRDD = usersUrlsDays.map { case Row(date:java.sql.Date, user_id:Long, url:String) => Tuple2(Tuple2(user_id, url), 1) }.rdd.reduceByKey(_+_)
This gives me an RDD of ((user_id, url), count):
scala> countsRDD.take(1)
res9: Array[((Long, String), Int)]
scala> countsRDD.take(1)(0)
res10: ((Long, String), Int)
Now I want to invert that by url to yield:
(url, [(user_id, count), ...])
I have tried this:
val urlIndex = countsRDD.map{ case Row(((user_id:Long, url:String), count:Int)) => Tuple2(url, List(Tuple2(user_id, count))) }.reduceByKey(_++_)
This produces match errors, however:
scala.MatchError: ... (of class scala.Tuple2)
I've tried many, many different permutations of these two map calls with explicitly and implicit types and this seems to have gotten me the farthest. I'm hoping that someone here can help point me in the right direction.
Something like this should work:
countsRDD
  .map { case ((user_id, url), count) => (url, (user_id, count)) }
  .groupByKey
countsRDD is RDD[((Long, String), Int)], not RDD[Row].
There is no need to use TupleN. Tuple literals will work just fine.
Since countsRDD is statically typed (unlike RDD[Row]) you don't have to specify types.
Don't use reduceByKey for list concatenation. It is the worst possible approach you can take: it ignores computational complexity, the garbage collector, and common sense. If you really need grouped data, use an operation designed for it.
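If you specifically want a List[(user_id, count)] per url rather than the Iterable that groupByKey gives you, here is a sketch of one alternative (my addition, not part of the answer above, assuming countsRDD: RDD[((Long, String), Int)]) using aggregateByKey:
// Build the per-url list with aggregateByKey instead of reduceByKey(_ ++ _)
val urlIndex = countsRDD
  .map { case ((user_id, url), count) => (url, (user_id, count)) }
  .aggregateByKey(List.empty[(Long, Int)])(
    (acc, pair) => pair :: acc,     // add one (user_id, count) within a partition
    (left, right) => left ::: right // merge per-partition lists
  )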
Related
How to sort by two or more columns using the takeOrdered(4)(Ordering[Int]) approach in Spark-Scala.
I can achieve this using sortBy like this:
lines.sortBy(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).map(p => println(p)).take(50)
But when I try to sort using the takeOrdered approach, it fails.
tl;dr Do something like this (but consider rewriting your code to call split only once):
lines.map(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).takeOrdered(50)
Here is the explanation.
When you call takeOrdered directly on lines, the implicit Ordering that takes effect is Ordering[String] because lines is an RDD[String]. You need to transform lines into a new RDD[(Int, Int)]. Because there is an implicit Ordering[(Int, Int)] available, it takes effect on your transformed RDD.
Meanwhile, sortBy works a little differently. Here is the signature:
sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
I know that is an intimidating signature, but if you cut through the noise, you can see that sortBy takes a function that maps your original type to a new type just for sorting purposes and applies the Ordering for that return type if one is in implicit scope.
In your case, you are applying a function to the Strings in your RDD to transform them into a "view" of how Spark should treat them merely for sorting purposes, i.e. as an (Int, Int), and then relying on the fact that the implicit Ordering[(Int, Int)] is available as mentioned.
The sortBy approach allows you to keep lines intact as an RDD[String] and use the mapping just to sort while the takeOrdered approach operates on a brand new RDD containing (Int, Int) derived from the original lines. Whichever approach is more suitable for your needs depends on what you wish to accomplish.
On another note, you probably want to rewrite your code to only split your text once.
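If you also want the sorted output to still be the original lines, here is a sketch of a split-once variant (my own addition, assuming columns 1 and 4 always parse as Int):
// Split each line once; order by column 1 ascending, then column 4 descending; keep the line.
lines
  .map { line =>
    val cols = line.split(",")
    ((cols(1).toInt, -cols(4).toInt), line)
  }
  .takeOrdered(50)(Ordering.by[((Int, Int), String), (Int, Int)](_._1))
  .map(_._2)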
You could implement your custom Ordering:
lines.takeOrdered(4)(new Ordering[String] {
  override def compare(x: String, y: String): Int = {
    val xs = x.split(",")
    val ys = y.split(",")
    val d1 = xs(1).toInt - ys(1).toInt
    if (d1 != 0) d1 else ys(4).toInt - xs(4).toInt
  }
})
I'm new to Scala, and I'm learning how to process Twitter streams with it.
I've been playing with the sample code below and trying to modify it to do some other things.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala#L60
I have a tuple of tuples (maybe tuple is not the exact type name in Spark Streaming, but...) that summarizes each tweet like this: (username, (tuple of hashtags), (tuple of users mentioned in this tweet))
And below is the code I used to make this.
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(duration.toInt))
val stream = TwitterUtils.createStream(ssc, None)
// record username, hashtags, and mentioned user
val distilled = stream.map(status => (status.getUser.getName, status.getText.split(" ").filter(_.startsWith("#")), status.getText.split(" ").filter(_.startsWith("#"))))
What I want to do is melt this tuple into (tag, user, (mentioned users)).
For example, if the original tuple was
(Tom, (#apple, #banana), (#Chris, #Bob))
I want the result to be
((#apple, Tom, (#Chris, #Bob)), (#banana, Tom, (#Chris, #Bob)))
My goal is to run reduceByKey on this result using the hashtag as the key to get
(#apple, (list of users who tweeted this hashtag), (list of users who were mentioned in tweets with this hashtag))
I'm not sure 'melt' is the right term to use for this purpose, but just think of it as similar to the melt function in R. Is there a way to get this done using .map{ case ... } or .flatMap{ case ... }? Or do I have to define a function to do this job?
ADDED reduce question:
As I said I want to reduce the result with reduceByKeyAndWindow so I wrote the following code:
// record username, hashtags, and mentioned user
val distilled = stream.map(
status => (status.getUser.getName,
status.getText.split(" ").filter(_.startsWith("#")),
status.getText.split(" ").filter(_.startsWith("#")))
)
val byTags = distilled.flatMap {
  case (user, tag, mentioned) => tag.map((_ -> List(1, List(user), mentioned)))
}.reduceByKeyAndWindow({
  case (a, b) => List(a._1 + b._1, a._2 ++ b._2, a._3 ++ b._3)
}, Seconds(60), Seconds(duration.toInt))
val sorted = byTags.map(_.flatten).map{
case (tag, count, users, mentioned) => (count, tag, users, mentioned)
}.transform(_.sortByKey(false))
// Print popular hashtags
sorted.foreachRDD(rdd => {
  val topList = rdd.take(num.toInt)
  println("\n%d Popular tags in last %d seconds:".format(num.toInt, duration.toInt))
  topList.foreach { case (count, tag, users, mentioned) => println("%s (%s tweets), authors: %s, mentioned: %s".format(tag, count, users, mentioned)) }
})
However, it says
missing parameter type for expanded function
[error] The argument types of an anonymous function must be fully known. (SLS 8.5)
[error] Expected type was: ?
[error] }.reduceByKeyAndWindow({
I've tried deleting the brackets and cases, and writing (a: List, b: List) =>, but all of them gave me errors related to types. What is the correct way to reduce it so that users and mentioned are concatenated every 'duration' seconds over a 60-second window?
hashTags.flatMap { case (user, tags, mentions) => tags.map((_, user, mentions)) }
The most troublesome thing in your question is the misuse of the term tuple.
In Python, a tuple is an immutable type that can have any size.
In Scala, TupleN means an immutable type with N type parameters, containing exactly N members of the corresponding types. So a Tuple2 is not the same as a Tuple3.
In Scala, which is full of immutable types, any immutable collection like List, Vector or Stream could be considered an analogue of Python's tuple. But the closest match is probably a subtype of immutable.IndexedSeq, e.g. Vector.
So a method like String.split can never return a Tuple in the Scala sense, simply because the element count cannot be known at compile time.
In this concrete case the result is simply an Array, and that is the assumption I used in my answer.
But if you really do have a collection (i.e. an RDD or DStream) of the type (String, (String, String), (String, String)), you can use this almost equivalent piece of code:
hashTags.flatMap {
  case (user, (tag1, tag2), mentions) => Seq(tag1, tag2).map((_, user, mentions))
}
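Regarding the added reduce question: the "missing parameter type" error appears because List(1, List(user), mentioned) is a List[Any], so the types of the reduce function's arguments cannot be inferred. Here is a sketch (untested, keeping the per-tag value as a typed (count, authors, mentions) tuple) of one way to write it:
val byTags = distilled.flatMap {
  case (user, tags, mentioned) =>
    tags.map(tag => (tag, (1, List(user), mentioned.toList)))
}.reduceByKeyAndWindow(
  (a: (Int, List[String], List[String]), b: (Int, List[String], List[String])) =>
    (a._1 + b._1, a._2 ++ b._2, a._3 ++ b._3),
  Seconds(60), Seconds(duration.toInt)
)
Downstream, the sorting step would then pattern-match on (tag, (count, users, mentioned)) instead of calling flatten.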
Looking for some assistance with how to do something in Scala using Spark.
I have:
type DistanceMap = HashMap[(VertexId,String), Int]
this forms part of my data in the form of an RDD of:
org.apache.spark.rdd.RDD[(DistanceMap, String)]
in short my dataset looks like this:
({(101,S)=3},piece_of_data_1)
({(101,S)=3},piece_of_data_2)
({(101,S)=1, (100,9)=2},piece_of_data_3)
What I want to do is flatMap my distance map (which I can do), but at the same time, for each flatMapped DistanceMap entry, retain the associated string. So my resulting data would look like this:
({(101,S)=3},piece_of_data_1)
({(101,S)=3},piece_of_data_2)
({(101,S)=1},piece_of_data_3)
({(109,S)=2},piece_of_data_3)
As mentioned I can flatMap the first part using:
x.flatMap(x => x._1).collect.foreach(println)
but am stuck on how I can retain the string from the second part of my original data.
This might work for you:
x.flatMap(x => x._1.map(y => (y,x._2)))
The idea is to convert from (Seq(a,b,c),Value) to Seq( (a,Value), (b, Value), (c, Value)).
This works the same way on plain Scala collections, so here is a standalone, simplified example you can paste into the Scala REPL:
Seq((Seq("a","b","c"), 34), (Seq("r","t"), 2)).flatMap( x => x._1.map(y => (y,x._2)))
This results in:
res0: Seq[(String, Int)] = List((a,34), (b,34), (c,34), (r,2), (t,2))
update
I have an alternative solution: flip key and value, use the flatMapValues transformation, and then flip key and value again. In outline:
x.map(x => (x._2, x._1)).flatMapValues(x => x).map(x => (x._2, x._1))
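Spelled out with pattern matches and comments (a sketch; x here is the RDD[(DistanceMap, String)] from the question):
val result = x
  .map { case (distances, data) => (data, distances) }  // (String, DistanceMap)
  .flatMapValues(m => m)                                // one ((VertexId, String), Int) entry per map element
  .map { case (data, entry) => (entry, data) }          // (((VertexId, String), Int), String)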
previous version
I propose adding one preprocessing step (sorry, I have no computer with a Scala interpreter in front of me until tomorrow to come up with working code):
transform the pair RDD from (DistanceMap, String) into an RDD of lists of Tuple4: List((VertexId, String, Int, String), ... ())
then apply flatMap on the result
Pseudocode:
rdd.map( (DistanceMap, String) => List((VertexId,String, Int, String), ... ()))
.flatMap(x=>x)
The end goal is to connect two query parameters that are being passed to a Play! web service request. It looks like:
WS
.url(requestUri)
.withQueryString(finalQueries)
I attempted to use a couple of operators, but it failed like so:
val finalQueries = queryParams match {
  case Some(queries) =>
    tokenParam ++ queries
  case None =>
    tokenParam
}
Error: value ++ is not a member of (String, String)
The API documentation shows that withQueryString accepts a (String, String)*
I'm a little confused by Play!'s withQueryString method, since it appears to completely replace the entire query string every time I access it. Is there any way to decently combine query strings?
Edit: A sample query string is below (the type syntax and its final appearance are a little confusing...):
val queryString = ("timeMin" -> "2012-08-20T01%3A11%3A06.000Z")
From your code, it seems to me that queryParams should be Option[(String, String)], and from the error message, tokenParam must be (String, String).
I think you can try this:
val finalQueries = Seq(tokenParam) ++ queryParams
WS
.url(requestUri)
.withQueryString(finalQueries:_*)
It works because an Option can be treated as a Seq, e.g. Seq(1, 2) ++ Some(3) becomes Seq(1, 2, 3) and Seq(1, 2) ++ None is just Seq(1, 2).
And .withQueryString accepting a (String, String)* means you can call it like .withQueryString(param1, param2, andMore),
or you can call it with a Seq and tell the compiler to expand it as repeated parameters by writing : _* at the end of the Seq, like .withQueryString(Seq(param1, param2, andMore): _*).
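A minimal, self-contained sketch of those two mechanics (the token name and value here are placeholders, not from the question):
val tokenParam: (String, String) = "access_token" -> "abc123"
val queryParams: Option[(String, String)] = Some("timeMin" -> "2012-08-20T01%3A11%3A06.000Z")
val finalQueries: Seq[(String, String)] = Seq(tokenParam) ++ queryParams
// finalQueries: List((access_token,abc123), (timeMin,2012-08-20T01%3A11%3A06.000Z))
// which can then be expanded into the repeated parameter:
// WS.url(requestUri).withQueryString(finalQueries: _*)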
I have a map that I need to map to a different type, and the result needs to be a List. I have two ways (seemingly) to accomplish what I want, since calling map on a map seems to always result in a map. Assuming I have some map that looks like:
val input = Map[String, List[Int]]("rk1" -> List(1,2,3), "rk2" -> List(4,5,6))
I can either do:
val output = input.map{ case(k,v) => (k.getBytes, v) } toList
Or:
val output = input.foldRight(List[Pair[Array[Byte], List[Int]]]()) { (el, res) =>
  (el._1.getBytes, el._2) :: res
}
In the first example I convert the type, and then call toList. I assume the runtime is something like O(n*2) and the space required is n*2. In the second example, I convert the type and generate the list in one go. I assume the runtime is O(n) and the space required is n.
My question is, are these essentially identical or does the second conversion cut down on memory/time/etc? Additionally, where can I find information on storage and runtime costs of various scala conversions?
Thanks in advance.
My favorite way to do this kind of things is like this:
input.map { case (k,v) => (k.getBytes, v) }(collection.breakOut): List[(Array[Byte], List[Int])]
With this syntax, you are passing map the builder it needs to construct the resulting collection. (Actually, not a builder, but a builder factory. Read more about Scala's CanBuildFroms if you are interested.) collection.breakOut is exactly for when you want to change from one collection type to another while doing a map, flatMap, etc. The only bad part is that you have to use the full type annotation for it to be effective (here, I used a type ascription after the expression). Then there's no intermediate collection being built, and the list is constructed while mapping.
Mapping over a view in the first example could cut down on the space requirement for a large map:
val output = input.view.map{ case(k,v) => (k.getBytes, v) } toList
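For a quick REPL check of all three shapes discussed above, here is a sketch using the input map from the question (note that collection.breakOut exists only up to Scala 2.12):
val input = Map[String, List[Int]]("rk1" -> List(1,2,3), "rk2" -> List(4,5,6))

// map then toList: builds an intermediate Map before the List
val viaToList: List[(Array[Byte], List[Int])] =
  input.map { case (k, v) => (k.getBytes, v) }.toList

// breakOut: the List is built directly while mapping
val viaBreakOut: List[(Array[Byte], List[Int])] =
  input.map { case (k, v) => (k.getBytes, v) }(collection.breakOut)

// view then toList: no intermediate strict collection either
val viaView: List[(Array[Byte], List[Int])] =
  input.view.map { case (k, v) => (k.getBytes, v) }.toList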