I followed this example to create a simple personalized demo recommender using Spark MLlib.
I don't quite understand the meaning of _._2.user and _._2.product in these lines of code:
val numUsers = ratings.map(_._2.user).distinct.count
val numMovies = ratings.map(_._2.product).distinct.count
What does the 2 indicate? Also, it looks like user and product appear for the first time on this line, so how are they linked to userId and movieId?
_1, _2, ... are methods used to extract the elements of tuples in Scala; they have no Spark-specific meaning here. user and product are fields of a Rating. And since ratings is an RDD[(Long, Rating)] created as follows:
val ratings = sc.textFile(...).map { line =>
...
(fields(3).toLong % 10, // Long
Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)) // Rating
}
you should have a complete picture.
ratings has type RDD[(Long, Rating)]. So ratings.map takes a function with a (Long, Rating) argument, and the _ in _.something stands for this argument. _2 returns the second field of the tuple (the Rating), and user and product are fields declared in the definition of Rating.
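To make that concrete, a tiny illustrative snippet (the sample values are made up):

import org.apache.spark.mllib.recommendation.Rating

val pair: (Long, Rating) = (3L, Rating(196, 242, 3.0)) // (timestamp % 10, Rating(user, product, rating))
pair._1          // 3L  -- first element of the tuple
pair._2.user     // 196 -- the user field of the Rating (i.e. the userId)
pair._2.product  // 242 -- the product field of the Rating (i.e. the movieId)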
import java.time.LocalDate
object Main extends App {
  val scores: Seq[Score] = Seq(score1, score2, score3, score4)

  println(getDate(scores)(LocalDate.of(2020, 1, 30))("Alice"))

  def getDate(scoreSeq: Seq[Score]): Map[LocalDate, Map[String, Int]] =
    scoreSeq.groupMap(score => score.date)(score => Map(score.name -> (score.english + score.math + score.science)))
}
I would like to implement a function that maps each examination date to a map of student names and the total score of the three subjects on that date; if there are multiple scores for the same student on the same date, the function should keep the one with the highest total score. However, this function gives a compile error:
found   : scala.collection.immutable.Map[java.time.LocalDate,Seq[scala.collection.immutable.Map[String,Int]]]
required: Map[java.time.LocalDate,Map[String,Int]]
How can I resolve this?
The error means what it says: The type of the variable and the type of the value being assigned to the variable don't match. It even tells you what the two types are!
The actual problem is that groupMap isn't returning the type you think it should. The values in the resulting Map are a Seq of Maps rather than a single Map with all the results.
You can use groupMapReduce to combine the inner Seq into a single Map by concatenating the Maps with ++:
scores.groupMapReduce(_.date)(
  score => Map(score.name -> (score.english + score.math + score.science))
)(_ ++ _)
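A self-contained sketch (the Score definition and the sample values are my guesses, since they are not shown in the question):

import java.time.LocalDate

case class Score(name: String, date: LocalDate, english: Int, math: Int, science: Int)

val d = LocalDate.of(2020, 1, 30)
val scores = Seq(Score("Alice", d, 80, 90, 70), Score("Bob", d, 60, 50, 40))

val byDate: Map[LocalDate, Map[String, Int]] =
  scores.groupMapReduce(_.date)(s => Map(s.name -> (s.english + s.math + s.science)))(_ ++ _)
// Map(2020-01-30 -> Map(Alice -> 240, Bob -> 150))

Note that ++ keeps the right-hand entry when the same student appears twice on the same date, so to keep the highest total instead you would have to merge the two maps by taking the maximum per name.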
I'm starting to work with RDDs and I have some doubts. In my case, I have an RDD and I want to classify its data. My RDD contains the following:
Array[(String, String)] = Array((data: BD=bd_users,BD_classified,contains_people, rbd: BD=bd_users,BD_classified,contains_people),
(data: BD=bd_users,BD_classified,contains_people,contains_users, user: id=8282bd, BD_USERS,bdd),
(data: BD=bd_experts,BD_exp,contains_exp,contains_adm, rbd: BD=bd_experts,BD_ea,contains_exp,contains_adm),
(data: BD=bd_test,BD_test,contains_acc,contains_tst, rbd: BD=bd_test,BD_test,contains_tst,contains_t))
As you can see, each element of the RDD contains two strings: the first one starts with data and the second one starts with rbd. What I want to do is classify every instance of this RDD as you can see here:
If the instance contains bd_users & BD_classified -> users
bd_experts & BD_exp -> experts
BD_test -> tests
The output would be something like this for this RDD:
1. Users
2. Users
3. Experts
4. Test
To do this I would like to use a map that calls a function for every instance in this RDD, but I don't know how to approach it:
val rdd_groups = rdd_1.map(x=>x(0).toString).map(x => getGroups(x))
def getGroups(input: String): (String) = {
// here I should use, for example, a match/case to classify these strings?
}
If you need anything more, or more examples, just tell me. Thanks in advance!
Well, assuming you have an RDD of strings and a classifier already defined:
val rdd: RDD[String] =
???
def classify(input: String): String =
???
rdd.map(input => classify(input))
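For example, a rough sketch of such a classifier based on the rules in the question (the returned labels and the fallback value are just placeholders):

def classify(input: String): String =
  if (input.contains("bd_users") && input.contains("BD_classified")) "Users"
  else if (input.contains("bd_experts") && input.contains("BD_exp")) "Experts"
  else if (input.contains("BD_test")) "Test"
  else "Unknown"

// applied to the (String, String) pairs in the question's RDD, classifying by the first string:
val rdd_groups = rdd_1.map { case (data, _) => classify(data) }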
Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, but I need to specify a particular column (i.e. the column of the "value" in a key-value grouped dataset). Also, when you use typed functions from the typed object, you also do not need to specify the column name, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As far as I know, it's not possible with the agg transformation, which expects a TypedColumn that is constructed from a Column using the as method, so you need to start from a non-type-safe expression. If somebody knows a solution, I would be interested to see it...
If you need type-safe aggregation you can use one of the approaches below:
mapGroups - where you can implement a Scala function responsible for aggregating the Iterator
implement your custom Aggregator as suggested above
The first approach needs less code, so below I'm showing a quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSeq.distinct.size

ds
  .groupByKey(s => Key(s.x, s.y))
  .mapGroups((k, vs) => (k, countDistinct(vs)(_.id)))
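For the second approach, a rough, untested sketch of a custom Aggregator along the same lines (it buffers the distinct values in a Set, which assumes the number of distinct values per group is moderate):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// counts distinct values of a chosen String field
class DistinctCount[T](chooseDistinguisher: T => String) extends Aggregator[T, Set[String], Long] {
  def zero: Set[String] = Set.empty[String]
  def reduce(buffer: Set[String], value: T): Set[String] = buffer + chooseDistinguisher(value)
  def merge(b1: Set[String], b2: Set[String]): Set[String] = b1 ++ b2
  def finish(reduction: Set[String]): Long = reduction.size.toLong
  def bufferEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// used with the grouped dataset and chooseDistinguisher from the question:
val result: Dataset[(Key, Long)] = grouped.agg(new DistinctCount[SomeType](chooseDistinguisher).toColumn)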
In my opinion the Spark Dataset type-safe API is still much less mature than the untyped DataFrame API. Some time ago I was thinking that it could be a good idea to implement a simple-to-use type-safe aggregation API for Spark Dataset.
Currently, this use case is better handled with DataFrame, which you can later convert back into a Dataset[A].
// Code assumes SQLContext implicits are present
import org.apache.spark.sql.{functions => f}

val colName = "id"

ds.toDF
  .withColumn("key", f.concat('x, f.lit(":"), 'y))
  .groupBy('key)
  .agg(f.countDistinct(f.col(colName)).as("cntd"))
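As a minimal follow-up sketch (assuming a SparkSession named spark with its implicits imported), one way to get back to a typed Dataset afterwards:

import spark.implicits._

// the aggregated frame has columns "key": String and "cntd": Long
val typed: Dataset[(String, Long)] =
  ds.toDF
    .withColumn("key", f.concat('x, f.lit(":"), 'y))
    .groupBy('key)
    .agg(f.countDistinct(f.col(colName)).as("cntd"))
    .as[(String, Long)]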
I have a Key/Value RDD. I want to "iterate over" the Key/Value entities in it and create, or map, another RDD which could have more or fewer entries than the first RDD.
Example:
I have records in accumulo that represent observations of colors in paintings.
An observation entity/object holds data on the painting name and the colors in the painting.
Observation
public String getPaintingName() { return paintingName; }
public List<String> getObservedColors() { return colorList; }
I pull the observations from accumulo into my code as an RDD.
val observationRDD: RDD[(Text, Observation)] = getObservationsFromAccumulo();
I want to take this RDD and create an RDD of the form of (Color, paintingName) where the key is the color observed and the value is the painting name which the color was observed in.
val colorToPaintingRDD: RDD[(String, String)] = observationRDD.somefunction { case (_, observation) =>
  for (String color : observation.getObservedColors()) {
    // somehow output an entry (color, observation.getPaintingName) into the new RDD
  }
}
I know map won't work, because it's 1-to-1. I thought maybe observationRDD.flatMap(someFunction), but I can't seem to find any examples of how to do that to create a new, larger or smaller, RDD.
Could someone help me out and tell me if flatMap is correct, and if so give me an example using the example I provided, or tell me if I'm way off base?
Please understand this is just a simple example; it's not the content I'm asking about, but how one would transform an RDD into an RDD with more or fewer entries.
You should use flatMap and return a List[(String, String)] for each element in the RDD. flatMap will flatten the result and you'll get an RDD[(String, String)].
I didn't try the code, but it would be something like this:
val colorToPaintingRDD: RDD[(String, String)] = observationRDD.flatMap { case (_, observation) =>
  observation.getObservedColors().map(color => (color, observation.getPaintingName))
}
Probably, if the getObservedColors method is in Java, you have to import JavaConversions and convert to a Scala list:
import scala.collection.JavaConversions._
observation.getObservedColors().toList
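As a plain-collections illustration of why flatMap is the right shape here (made-up data, the same idea applies to an RDD):

val observations = Seq(
  ("Starry Night", List("blue", "yellow")),
  ("Guernica", List("grey"))
)

val colorToPainting = observations.flatMap { case (painting, colors) =>
  colors.map(color => (color, painting))
}
// List((blue,Starry Night), (yellow,Starry Night), (grey,Guernica)) -- 3 output entries from 2 inputs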
I'm new to Scala, and learning how to process Twitter streams with Scala.
I've been playing with the sample code below and trying to modify it to do some other stuff.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala#L60
I have a tuple of tuples (maybe tuple is not the exact type name in Scala streaming, but..) that summarizes each tweet like this: (username, (tuple of hashtags), (tuple of users mentioned in this tweet))
And below is the code I used to make this.
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(duration.toInt))
val stream = TwitterUtils.createStream(ssc, None)
// record username, hashtags, and mentioned user
val distilled = stream.map(status => (status.getUser.getName, status.getText.split(" ").filter(_.startsWith("#")), status.getText.split(" ").filter(_.startsWith("#"))))
What I want to do is melt this tuple into (tag, user, (mentioned users)).
For example, if the original tuple was
(Tom, (#apple, #banana), (#Chris, #Bob))
I want the result to be
((#apple, Tom, (#Chris, #Bob)), (#banana, Tom, (#Chris, #Bob)))
My goal is to run reduceByKey on this result using the hashtag as the key to get
(#apple, (list of users who tweeted this hashtag), (list of users who were mentioned in tweets with this hashtag))
I'm not sure 'melt' is the right term to use for this purpose, but just think of it as similar to the melt function in R. Is there a way to get this done using .map{case ... } or .flatMap{case ... }? Or do I have to define a function to do this job?
ADDED reduce question:
As I said I want to reduce the result with reduceByKeyAndWindow so I wrote the following code:
// record username, hashtags, and mentioned user
val distilled = stream.map(
status => (status.getUser.getName,
status.getText.split(" ").filter(_.startsWith("#")),
status.getText.split(" ").filter(_.startsWith("#")))
)
val byTags = distilled.flatMap{
case (user, tag, mentioned) => tag.map((_ -> List(1, List(user), mentioned)))
}.reduceByKeyAndWindow({
case (a, b) => List(a._1+b._1, a._2++b._2, a._3++b._3)}, Seconds(60), Seconds(duration.toInt)
)
val sorted = byTags.map(_.flatten).map{
case (tag, count, users, mentioned) => (count, tag, users, mentioned)
}.transform(_.sortByKey(false))
// Print popular hashtags
sorted.foreachRDD(rdd => {
val topList = rdd.take(num.toInt)
println("\n%d Popular tags in last %d seconds:".format(num.toInt, duration.toInt))
topList.foreach{case (count, tag, users, mentioned) => println("%s (%s tweets), authors: %s, mentioned: %s".format(tag, count, users, mentioned))}
})
However, it says
missing parameter type for expanded function
[error] The argument types of an anonymous function must be fully known. (SLS 8.5)
[error] Expected type was: ?
[error] }.reduceByKeyAndWindow({
I've tried deleting the brackets and the cases, and writing (a: List, b: List) =>, but all of them gave me errors related to types. What is the correct way to reduce it so that users and mentioned will be concatenated every 'duration' seconds for 60 secs?
hashTags.flatMap { case (user, tags, mentions) => tags.map((_, user, mentions)) }
The most troublesome thing in your question is the misuse of the term tuple.
In Python, tuple means an immutable type which can have any size.
In Scala, TupleN means an immutable type with N type parameters that contains exactly N members of the corresponding types. So Tuple2 is not the same as Tuple3.
In Scala, which is full of immutable types, any immutable collection like List, Vector or Stream could be considered an analogue of Python's tuple, but the most precise analogues are probably the subtypes of immutable.IndexedSeq, e.g. Vector.
So a method like String.split could never return a Tuple in the Scala sense, simply because the element count cannot be known at compile time.
In this concrete case the result will simply be an Array, and that is the assumption I used in my answer.
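A tiny illustration of the difference (made-up values):

val parts: Array[String] = "#apple #banana".split(" ")            // size only known at runtime
val fixed: (String, String, String) = ("Tom", "#apple", "#banana") // Tuple3: exactly three elements, typed individually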
But in case you really do have a collection (i.e. an RDD) of elements of type (String, (String, String), (String, String)), you can use this almost equivalent piece of code:
hashTags.flatMap {
  case (user, (tag1, tag2), mentions) => Seq(tag1, tag2).map((_, user, mentions))
}
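Regarding the added reduce question: the "missing parameter type" error means the compiler cannot work out the argument types of the anonymous function in that position. A possible direction (untested, reusing distilled, duration and Seconds from the question) is to keep a concretely typed tuple as the value and annotate the reduce function's parameters:

// value is a typed tuple: (count, authors, mentioned users)
val byTags = distilled.flatMap { case (user, tags, mentioned) =>
  tags.map(tag => (tag, (1, List(user), mentioned.toList)))
}.reduceByKeyAndWindow(
  (a: (Int, List[String], List[String]), b: (Int, List[String], List[String])) =>
    (a._1 + b._1, a._2 ++ b._2, a._3 ++ b._3),
  Seconds(60),
  Seconds(duration.toInt)
)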