Scala Twitter streaming: melting a tuple of tuples

I'm new to Scala, and I'm learning how to process Twitter streams with it.
I've been playing with the sample code below and trying to modify it to do some other things.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala#L60
I have a tuple of tuples (maybe tuple is not the exact type name in Scala streaming, but...) that summarizes each tweet like this: (username, (tuple of hashtags), (tuple of users mentioned in this tweet))
And below is the code I used to make this.
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(duration.toInt))
val stream = TwitterUtils.createStream(ssc, None)
// record username, hashtags, and mentioned user
val distilled = stream.map(status => (status.getUser.getName, status.getText.split(" ").filter(_.startsWith("#")), status.getText.split(" ").filter(_.startsWith("@"))))
What I want to do is melt this tuple into (tag, user, (mentioned users)).
For example, if the original tuple was
(Tom, (#apple, #banana), (@Chris, @Bob))
I want the result to be
((#apple, Tom, (@Chris, @Bob)), (#banana, Tom, (@Chris, @Bob)))
My goal is to run reduceByKey on this result using the hashtag as the key to get
(#apple, (list of users who tweeted this hashtag), (list of users who were mentioned in tweets with this hashtag))
I'm not sure 'melt' is the right term to use here, but just think of it as similar to the melt function in R. Is there a way to get this done using .map{case ... } or .flatMap{case ... }? Or do I have to define a function to do this job?
ADDED: reduce question
As I said, I want to reduce the result with reduceByKeyAndWindow, so I wrote the following code:
// record username, hashtags, and mentioned user
val distilled = stream.map(
status => (status.getUser.getName,
status.getText.split(" ").filter(_.startsWith("#")),
status.getText.split(" ").filter(_.startsWith("#")))
)
val byTags = distilled.flatMap{
case (user, tag, mentioned) => tag.map((_ -> List(1, List(user), mentioned)))
}.reduceByKeyAndWindow({
case (a, b) => List(a._1+b._1, a._2++b._2, a._3++b._3)}, Seconds(60), Seconds(duration.toInt)
)
val sorted = byTags.map(_.flatten).map{
case (tag, count, users, mentioned) => (count, tag, users, mentioned)
}.transform(_.sortByKey(false))
// Print popular hashtags
sorted.foreachRDD(rdd => {
  val topList = rdd.take(num.toInt)
  println("\n%d Popular tags in last %d seconds:".format(num.toInt, duration.toInt))
  topList.foreach{case (count, tag, users, mentioned) => println("%s (%s tweets), authors: %s, mentioned: %s".format(tag, count, users, mentioned))}
})
However, it says
missing parameter type for expanded function
[error] The argument types of an anonymous function must be fully known. (SLS 8.5)
[error] Expected type was: ?
[error] }.reduceByKeyAndWindow({
I've tried deleting the brackets and the cases, and writing (a:List, b:List) =>, but all of them gave me errors related to types. What is the correct way to reduce it so that users and mentioned are concatenated every 'duration' seconds over a 60-second window?

hashTags.flatMap{ case (user, tags, mentions) => tags.map((_, user,mentions))}
The most troublesome thing in your question is the misuse of the term tuple.
In Python, tuple means an immutable type that can have any size.
In Scala, TupleN means an immutable type with N type parameters that contains exactly N members of the corresponding types. So Tuple2 is not the same type as Tuple3.
In Scala, which is full of immutable types, any immutable collection like List, Vector or Stream can be considered an analogue of Python's tuple. But the most precise analogues are probably subtypes of immutable.IndexedSeq, e.g. Vector.
So a method like String.split can never return a Tuple in the Scala sense, simply because the element count cannot be known at compile time.
In this concrete case the result will simply be an Array, and that is the assumption I used in my answer.
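For illustration, a tiny standalone snippet (not part of the original answer) showing that split yields an Array[String] whose length is only known at runtime:
val words: Array[String] = "look at #apple and #banana".split(" ")
val tags: Array[String] = words.filter(_.startsWith("#")) // Array(#apple, #banana)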
But if you really do have a collection (i.e. an RDD) of elements of type (String, (String, String), (String, String)), you can use this almost equivalent piece of code:
hashTags.flatMap {
case (user, (tag1, tag2), mentions) => Seq(tag1, tag2).map((_, user, mentions))
}
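Regarding the added reduce question: the "missing parameter type" error appears because the compiler cannot infer the argument type of the anonymous reduce function, and the List(1, List(user), mentioned) value only has the unhelpful type List[Any]. A minimal, untested sketch (assuming distilled has the (user, tags, mentions) shape above) that keeps the value as a typed (count, users, mentioned) tuple and gives the reduce function explicit parameter types:
val byTags = distilled.flatMap {
  case (user, tags, mentions) =>
    tags.toSeq.map(tag => (tag, (1, List(user), mentions.toList)))
}.reduceByKeyAndWindow(
  (a: (Int, List[String], List[String]), b: (Int, List[String], List[String])) =>
    (a._1 + b._1, a._2 ++ b._2, a._3 ++ b._3),
  Seconds(60), Seconds(duration.toInt)
)
With a fully typed value there is also no need for the later map(_.flatten) step; each value is already a (count, users, mentioned) tuple.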

Related

Iterating over two Source and filter using a property in Scala

I am trying to filter out common elements, keeping the latest version of each object, and return another Source. My object looks like:
case class Record(id: String, version: Long)
My method's inputs are two Sources:
val sourceA: Source[Record, _] = <>
val sourceB: Source[Record, _] = <>
sourceA and sourceB have Records with common ids, but their versions may differ. I want to create a method which returns a Source[Record, _] that contains the latest version for each id. I tried:
val latestCombinedSource: Source[Record, _] = sourceA map {each => {
sourceB.map(eachB => eachB.version > each.version? eachB: each)
.....
}
}
You did not mention what type of Source / what streaming library you are asking about (please update your question to clarify that). From the signatures in the code, I assume that this is about akka-stream. If that is correct, then you probably want to use zipLatestWith:
val latestCombinedSource: Source[Record, _] =
sourceA.zipLatestWith(sourceB) { (a, b) =>
if (a.version > b.version) a else b
}
Note that there is also zipWith and I'm not 100% sure which one you'd want to use. The difference (quoted from the API docs) is: zipLatestWith "Emits when all of the inputs have at least an element available, and then each time an element becomes available on either of the inputs", while zipWith "Emits when all of the inputs have an element available".
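For context, a small self-contained sketch (assuming Akka Streams 2.6+; the sample records and actor system name are made up) of running the combined source:
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

case class Record(id: String, version: Long)

implicit val system: ActorSystem = ActorSystem("zip-latest-demo")

val sourceA: Source[Record, _] = Source(List(Record("a", 1L), Record("b", 3L)))
val sourceB: Source[Record, _] = Source(List(Record("a", 2L), Record("b", 1L)))

// For each emitted pair, keep the record with the higher version
val latestCombinedSource: Source[Record, _] =
  sourceA.zipLatestWith(sourceB)((a, b) => if (a.version > b.version) a else b)

latestCombinedSource.runWith(Sink.foreach(println))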

How to cast or access element in an Array[Product with Serializable]?

I am loading company records into Spark:
case class Company(id:Integer, name:String, kind: String, location : String, stage:String)
It is a large data file, so I want to figure out which company records load correctly. I wrap the parsing in a safe function with this signature:
safe: [S, T](f: S => T)S => Either[T,(S, Exception)]
I load the data with:
val companiesText = sc.textFile("../companies.txt");
val safeParse = safe(parse)
val companyRecords = companiesText.map(line => line.split(";")).map(line => safeParse(line))
Then I access the company records that were correctly loaded:
val goodCompaniesRecords = companyRecords.collect({
case t if t.isLeft => t.left.get
})
This gives an Array[Product with Serializable] and I cannot access the elements:
goodCompaniesRecords.map(x => new Company(x._1, x._2, x._3, x._4, x._5))
gives
error: value _1 is not a member of Product with Serializable
goodCompaniesRecords.map(x => new Company(x._1, x._2, x._3, x._4, x._5))
How can I access these elements, or how can I cast from an Array[Product with Serializable] to an Array[Company], without modifying the safe function?
The appearance of Product with Serializable basically always means you have an earlier problem, so a better question would be "how to avoid getting Array[Product with Serializable]". Specifically, it means you probably have an expression returning unrelated case classes in different branches somewhere (e.g. None and a Company instead of Some(company), or tuples of different sizes). My guess (but only a guess, since you don't give enough code) is that this happens in parse.
To localize the problem, you can start giving explicit types to the variables and methods involved and check whether they still compile. In this case you should probably have def parse(fields: Array[String]): Company and val companyRecords: RDD[Either[Company, (Array[String], Exception)]].
On a side note, the pattern-match in goodCompaniesRecords is much better written as case Left(company) => company.
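A hedged sketch of the explicitly typed version (the real parse and safe are not shown in the question, so the bodies below are assumptions made only to keep the snippet self-contained; sc is the SparkContext, e.g. in spark-shell):
import org.apache.spark.rdd.RDD

case class Company(id: Integer, name: String, kind: String, location: String, stage: String)

// Assumed shape: parse takes the already-split fields of one line
def parse(fields: Array[String]): Company =
  Company(fields(0).toInt, fields(1), fields(2), fields(3), fields(4))

// Assumed implementation matching the signature safe: [S, T](f: S => T)S => Either[T, (S, Exception)]
def safe[S, T](f: S => T): S => Either[T, (S, Exception)] =
  s => try Left(f(s)) catch { case e: Exception => Right((s, e)) }

val safeParse: Array[String] => Either[Company, (Array[String], Exception)] = safe(parse)

val companyRecords: RDD[Either[Company, (Array[String], Exception)]] =
  sc.textFile("../companies.txt").map(_.split(";")).map(safeParse)

// With the element type fully known, the pattern match yields RDD[Company] directly
val goodCompaniesRecords: RDD[Company] = companyRecords.collect { case Left(company) => company }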

Scala Spark map type matching issue

I'm trying to perform a series of transformations on log data with Scala, and I'm having difficulties with matching tuples. I have a data frame with user ids, urls and dates. I can map the data frame to an RDD and reduce by key with this map:
val countsRDD = usersUrlsDays.map { case Row(date:java.sql.Date, user_id:Long, url:String) => Tuple2(Tuple2(user_id, url), 1) }.rdd.reduceByKey(_+_)
This gives me an RDD of ((user_id, url), count):
scala> countsRDD.take(1)
res9: Array[((Long, String), Int)]
scala> countsRDD.take(1)(0)
res10: ((Long, String), Int)
Now I want to invert that by url to yield:
(url, [(user_id, count), ...])
I have tried this:
val urlIndex = countsRDD.map{ case Row(((user_id:Long, url:String), count:Int)) => Tuple2(url, List(Tuple2(user_id, count))) }.reduceByKey(_++_)
This produces match errors, however:
scala.MatchError: ... (of class scala.Tuple2)
I've tried many, many different permutations of these two map calls with explicitly and implicit types and this seems to have gotten me the farthest. I'm hoping that someone here can help point me in the right direction.
Something like this should work:
countsRDD
.map{ case ((user_id, url), count) => (url, (user_id, count)) }
.groupByKey
countsRDD is RDD[((Long, String), Int)], not RDD[Row].
There is no need to use TupleN. Tuple literals will work just fine.
Since countsRDD is statically typed (unlike RDD[Row]) you don't have to specify types.
Don't use reduceByKey for list concatenation. It is one of the worst possible approaches you can take: it ignores computational complexity, the garbage collector, and common sense. If you really need grouped data, use an operation designed for it.
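As an end-to-end sketch of what that looks like (assuming countsRDD: RDD[((Long, String), Int)] as shown earlier; aggregateByKey is offered here as an alternative, not as part of the original answer):
import org.apache.spark.rdd.RDD

// groupByKey: one entry per url with all of its (user_id, count) pairs
val urlIndex: RDD[(String, Iterable[(Long, Int)])] =
  countsRDD
    .map { case ((userId, url), count) => (url, (userId, count)) }
    .groupByKey()

// aggregateByKey makes the per-partition combining explicit if List values are required
val urlIndexLists: RDD[(String, List[(Long, Int)])] =
  countsRDD
    .map { case ((userId, url), count) => (url, (userId, count)) }
    .aggregateByKey(List.empty[(Long, Int)])((acc, v) => v :: acc, _ ++ _)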

Misunderstanding of some parts of an example in Spark MLlib

I follow this example to create a simple personalized demo recommender using Spark MLLib.
I don't quite understand the meaning of _._2.user and _._2.product in these lines of code:
val numUsers = ratings.map(_._2.user).distinct.count
val numMovies = ratings.map(_._2.product).distinct.count
What is the 2 indicating? Also, it looks like user and product appear for the first time in this line. So how are they linked to userId and movieId?
_1, _2, ..., _N are methods used to extract elements of tuples in Scala. They have no Spark-specific meaning here. user and product are fields of a Rating. And since ratings is RDD[(Long, Rating)], created as follows:
val ratings = sc.textFile(...).map { line =>
...
(fields(3).toLong % 10, // Long
Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)) // Rating
}
you should have a complete picture.
ratings has type RDD[(Long, Rating)]. So ratings.map takes a function with a (Long, Rating) argument, and _ in _.something stands for this argument. _2 returns the second field of the tuple (the Rating), and user and product are declared in the definition of Rating.
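A tiny illustration of those accessors (the sample values are made up; Rating comes from org.apache.spark.mllib.recommendation):
import org.apache.spark.mllib.recommendation.Rating

val pair: (Long, Rating) = (3L, Rating(user = 42, product = 7, rating = 4.5))
pair._1          // 3L -- the key derived from the timestamp field
pair._2.user     // 42 -- userId, parsed from fields(0)
pair._2.product  // 7  -- movieId, parsed from fields(1)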

Dealing with Type Erasure with foldLeft [duplicate]

I'm trying to sort a list by a future boolean.
I have a list of IDs and I need to query an external service to find out if there's contextual information behind them. The method I use to do this returns an optional future.
By using the partition method I hoped to create two lists of IDs, one with contextual information and one without.
The following answer on here provided a lot of help for this: Scala - sort based on Future result predicate
I now have a rough, untested method that looks like this:
val futureMatch = listOfIds.map( b => b.map{ j =>
getContext(j).map{ k =>
Map( j -> k)
}
}).map(Future.sequence(_)).flatMap(identity)
val partitionedList = futureMatch.map(_.partition{
case (k, Some(v)) => true
case _ => false
})
So as advised in the other question, I'm attempting to get all my answers at the same level, and then using Future.sequence and flatMap(identity) to flatten nested layers of futures.
The problem is this doesn't feel very efficient.
Ideally the successful list would have a signature of List[Map[String, String]], not List[Map[String, Option[String]]], and the failed list would just be a list of Strings, so it would only need to be one-dimensional. As it currently stands I have two lists with identical signatures, which has some redundancy. For example, in the successful list I know the value is going to exist, so it doesn't need to be an Option.
Any ideas how I could achieve this structure and produce two lists with different signatures, or whether this is even the most efficient way?
Thanks.
EDIT: Looking at the signature for partition it looks like I can only produce two lists of the same signature, so a different signature is probably too much to ask. I guess I can just flatten the list afterwards.
I found a suitable solution in the comments of the question I linked to.
val (matched, unmatched) =
  finalMatch.foldLeft((List.empty[Map[String, String]], List.empty[String])) {
    case ((matched, unmatched), p) => p match {
      case m: Map[String, String] => (m :: matched, unmatched)
      case s: String => (matched, s :: unmatched)
    }
  }
The only issue with this is that the Map[String, String] pattern match is subject to type erasure. I've opened another question to discuss this issue:
Dealing with Type Erasure with foldLeft
Thanks all.
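A small, self-contained sketch (with illustrative values, not taken from the original question) of an Either-based variant that sidesteps the erased Map[String, String] match:
// Right = id with contextual information, Left = id without it
val classified: List[Either[String, Map[String, String]]] = List(
  Right(Map("id1" -> "context1")),
  Left("id2"),
  Right(Map("id3" -> "context3"))
)

val (matched, unmatched) =
  classified.foldLeft((List.empty[Map[String, String]], List.empty[String])) {
    case ((withCtx, noCtx), Right(ctx)) => (ctx :: withCtx, noCtx)
    case ((withCtx, noCtx), Left(id))   => (withCtx, id :: noCtx)
  }
// matched:   List(Map(id3 -> context3), Map(id1 -> context1))
// unmatched: List(id2)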