What is the DataFrame equivalent of reduceByKey in Scala?

Given the following code:
case class Contact(name: String, phone: String)
case class Person(name: String, ts:Long, contacts: Seq[Contact])
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val people = sqlContext.read.format("orc").load("people")
What is the best way to dedupe users by timestamp, so that only the user with the max ts stays in the collection?
In Spark, using an RDD, I would run something like this:
rdd.reduceByKey(_ maxTS _)
and would add the maxTS method to Person (or add it through implicits):
def maxTS(that: Person):Person =
that.ts > ts match {
case true => that
case false => this
}
Is it possible to do the same with DataFrames, and would the performance be similar?
We are using Spark 1.6.

You can use window functions. I'm assuming that the key is name:
// row_number replaces rowNumber, which is deprecated as of Spark 1.6
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
val df = people // the DataFrame loaded in the question
val win = Window.partitionBy('name).orderBy('ts.desc)
df.withColumn("personRank", row_number().over(win))
  .where('personRank === 1).drop("personRank")
This adds a personRank column: within each name, every row gets a unique number, and the row with the latest ts gets the lowest rank, 1. You keep only the rows with rank 1 and then drop the temporary rank column.
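To check what the result should look like, here is a minimal sketch of the same "latest row per name" logic on a plain Scala collection, with no Spark involved. The object name DedupeSketch and the helper latestPerName are made up for illustration; Contact and Person mirror the case classes from the question.

```scala
// Plain-Scala model of "keep the row with the largest ts per name".
// No Spark required; this only illustrates the expected result.
case class Contact(name: String, phone: String)
case class Person(name: String, ts: Long, contacts: Seq[Contact])

object DedupeSketch {
  // Hypothetical helper: same outcome as rdd.reduceByKey(_ maxTS _) keyed by name.
  def latestPerName(people: Seq[Person]): Map[String, Person] =
    people.groupBy(_.name).map { case (name, group) =>
      name -> group.maxBy(_.ts) // the person with the highest ts wins
    }
}
```

For two rows named "a" with ts 1 and 5, only the ts 5 row survives, which is what the rank-1 filter keeps as well.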

You can do a groupBy and use an aggregation function such as sum or max. Note that this returns only the name and the maximum ts, not the whole row:
df.groupBy($"name").agg(max($"ts").alias("maxTS"))

Related

Aggregating all Column values within a Map after groupBy in Apache Spark

I've been trying this all day long with a DataFrame but no luck so far. I already did it with an RDD, but that isn't really readable, so a DataFrame approach would be much better for code readability.
Take these two DataFrames: the starting DF and the result I would like to obtain after performing .groupBy().
case class SampleRow(name:String, surname:String, age:Int, city:String)
case class ResultRow(name: String, surnamesAndAges: Map[String, (Int, String)])
val df = List(
SampleRow("Rick", "Fake", 17, "NY"),
SampleRow("Rick", "Jordan", 18, "NY"),
SampleRow("Sandy", "Sample", 19, "NY")
).toDF()
val resultDf = List(
ResultRow("Rick", Map("Fake" -> (17, "NY"), "Jordan" -> (18, "NY"))),
ResultRow("Sandy", Map("Sample" -> (19, "NY")))
).toDF()
What I've tried so far is performing the following .groupBy...
val resultDf = df
.groupBy(
Name
)
.agg(
functions.map(
selectColumn(Surname),
functions.array(
selectColumn(Age),
selectColumn(City)
)
)
)
However, the following is printed to the console:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression '`surname`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
However, doing that would result in a single entry per surname and I would like to accumulate those in a single Map as you can see in resultDf. Is there an easy way to achieve this using DFs?
You can achieve it with a single UDF that converts your data to a map:
val toMap = udf((keys: Seq[String], values1: Seq[String], values2: Seq[String]) =>
  keys.zip(values1.zip(values2)).toMap)
val myResultDF = df.groupBy("name")
  .agg(collect_list("surname") as "surname", collect_list("age") as "age", collect_list("city") as "city")
  .withColumn("surnamesAndAges", toMap($"surname", $"age", $"city"))
  .drop("age", "city", "surname")
// show returns Unit, so call it separately instead of assigning its result
myResultDF.show(false)
+-----+--------------------------------------+
|name |surnamesAndAges |
+-----+--------------------------------------+
|Sandy|[Sample -> [19, NY]] |
|Rick |[Fake -> [17, NY], Jordan -> [18, NY]]|
+-----+--------------------------------------+
If you are not concerned about converting the DataFrame to a Dataset (in this case of ResultRow), you could do something like this:
val grouped =df.withColumn("surnameAndAge",struct($"surname",$"age"))
.groupBy($"name")
.agg(collect_list("surnameAndAge").alias("surnamesAndAges"))
Then you could create a User defined function which would look like
import org.apache.spark.sql._
val arrayToMap = udf[Map[String, String], Seq[Row]] {
  array => array.map {
    // age is an Int in SampleRow, so match it as Int and convert it to String
    case Row(key: String, value: Int) => (key, value.toString) }.toMap
}
Now you could use a .withColumn and call this udf
val finalData = grouped.withColumn("surnamesAndAges",arrayToMap($"surnamesAndAges"))
The Dataframe would look something like this
finalData: org.apache.spark.sql.DataFrame = [name: string, surnamesAndAges: map<string,string>]
Since Spark 2.4, you don't need to use a Spark user-defined function:
import org.apache.spark.sql.functions.{col, collect_set, map_from_entries, struct}
df.withColumn("mapEntry", struct(col("surname"), struct(col("age"), col("city"))))
.groupBy("name")
.agg(map_from_entries(collect_set("mapEntry")).as("surnamesAndAges"))
Explanation
You first add a column containing a map entry built from the desired columns. A map entry is merely a struct containing two columns: the first column is the key and the second column is the value. You can put another struct as the value. So here your map entry uses column surname as the key, and a struct of columns age and city as the value:
struct(col("surname"), struct(col("age"), col("city")))
Then you collect all the map entries for each groupBy key (column name) using the function collect_set, and you convert this list of map entries to a map using the function map_from_entries.
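As a sanity check on the expected shape, the same grouping can be sketched on a plain Scala collection. This is only an illustration of the target result, not Spark code; the object name MapAggSketch and the helper surnamesAndAges are made up here, and SampleRow mirrors the case class from the question.

```scala
// Plain-Scala model of "group by name, then build a map surname -> (age, city)".
case class SampleRow(name: String, surname: String, age: Int, city: String)

object MapAggSketch {
  // Hypothetical helper: one map per name, keyed by surname.
  def surnamesAndAges(rows: Seq[SampleRow]): Map[String, Map[String, (Int, String)]] =
    rows.groupBy(_.name).map { case (name, group) =>
      // each row becomes one map entry: key = surname, value = (age, city)
      name -> group.map(r => r.surname -> ((r.age, r.city))).toMap
    }
}
```

As with any map, two rows with the same name and surname collapse into one entry, so this only fits data where surname is unique within a name.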

Using a case class to rename split columns with Spark Dataframe

I am splitting 'split_column' into five more columns with the following code. However, I want these new columns to be renamed so that they have meaningful names (let's say "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5" in this example).
val df1 = df.withColumn("new_column", split(col("split_column"), "\\|")).select(col("*") +: (0 until 5).map(i => col("new_column").getItem(i).as(s"newcol$i")): _*).drop("split_column","new_column")
val new_columns_renamed = Seq("....., "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5")
val df2 = df1.toDF(new_columns_renamed: _*)
However, the issue with this approach is that some of my splits might produce more than fifty new columns. With this renaming approach, a small typo (like an extra comma or a missing double quote) would be painful to detect.
Is there a way to rename the columns with a case class like the one below?
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
Please note that in the actual scenario names would not look like new_renamed1, new_renamed2, ......, new_renamed5 , they would be totally different.
You could try something like this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.Encoders
val names = Encoders.product[SplittedRecord].schema.fieldNames
names.zipWithIndex
.foldLeft(df.withColumn("new_column", split(col("split_column"), "\\|")))
{ case (df, (c, i)) => df.withColumn(c, $"new_column"(i)) }
One of the ways to use the case class
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
is through udf function as
import org.apache.spark.sql.functions._
def splitUdf = udf((array: Seq[String])=> SplittedRecord(array(0), array(1), array(2), array(3), array(4)))
df.withColumn("test", splitUdf(split(col("split_column"), "\\|"))).drop("split_column")
.select(col("*"), col("test.*")).drop("test")

Scala Spark Filter RDD using Cassandra

I am new to spark-cassandra and Scala. I have an existing RDD, let's say:
((url_hash, url, created_timestamp)).
I want to filter this RDD based on url_hash. If a url_hash already exists in the Cassandra table, I want to filter it out of the RDD so that I only process the new URLs.
Cassandra Table looks like following:
url_hash| url | created_timestamp | updated_timestamp
Any pointers will be great.
I tried something like this:
case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
def timestamp = new java.util.Date() // the package is java.util, not java.utils
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256,(row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
I am getting cassandra error
java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info.If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper
There are no null values in the Cassandra table.
Thanks The Archetypal Paul!
I hope somebody finds this useful. Had to add Option to case class.
Looking forward to better solutions
case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])
def timestamp = new java.util.Date() // the package is java.util, not java.utils
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256,(row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
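The calcSHA256 helper is not shown in the question, so here is one hypothetical way to write it with java.security.MessageDigest, together with a plain-Scala model of what subtractByKey does: keep only the entries whose key is absent from the other side. The object name NewUrlsSketch and the helper newUrls are assumptions for illustration; no Spark or Cassandra is involved.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object NewUrlsSketch {
  // Assumed implementation of the question's calcSHA256: lowercase hex SHA-256 of the UTF-8 bytes.
  def calcSHA256(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Local model of rdd1.subtractByKey(rdd3): drop every URL whose hash is already known.
  def newUrls(urls: Seq[String], knownHashes: Set[String]): Seq[(String, String)] =
    urls.map(u => calcSHA256(u) -> u)
      .filterNot { case (hash, _) => knownHashes.contains(hash) }
}
```

Only the keys matter for the subtraction, so on the Cassandra side it should be enough to read just the url_sha256 column.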

Spark 2.0 ALS Recommendation how to recommend to a user

I have followed the guide at this link:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
But it is outdated, as it uses the Spark MLlib RDD-based approach. Spark 2.0 has a DataFrame-based approach.
My problem is this. I have the updated code:
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt")
.map(parseRating)
.toDF()
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
// Build the recommendation model using ALS on the training data
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training)
// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(test)
Now here is the problem: in the old code the model obtained was a MatrixFactorizationModel, whereas now it is an ALSModel.
With MatrixFactorizationModel you could directly do:
val recommendations = bestModel.get
.predict(userID)
which gives the list of products the user is most likely to like.
But now there is no .predict method. Any idea how to recommend a list of products given a user ID?
Use the transform method on the model:
import spark.implicits._
val dataFrameToPredict = sparkContext.parallelize(Seq((111, 222)))
.toDF("userId", "productId")
val predictionsOfProducts = model.transform (dataFrameToPredict)
There's a JIRA ticket to implement recommend(User|Product) methods, but it's not on the default branch yet.
Now you have a DataFrame with a score for the user.
You can simply use orderBy and limit to show the N recommended products:
// the where clause is for the case when we have a big DataFrame with many users
model.transform (dataFrameToPredict.where('userId === givenUserId))
.select ('productId, 'prediction)
.orderBy('prediction.desc)
.limit(N)
.map { case Row (productId: Int, prediction: Double) => (productId, prediction) }
.collect()
dataFrameToPredict can be some large user-product DataFrame, for example all users x all products.
The ALS Model in Spark contains the following helpful methods:
recommendForAllItems(int numUsers)
Returns top numUsers users recommended for each item, for all items.
recommendForAllUsers(int numItems)
Returns top numItems items recommended for each user, for all users.
recommendForItemSubset(Dataset<?> dataset, int numUsers)
Returns top numUsers users recommended for each item id in the input data set.
recommendForUserSubset(Dataset<?> dataset, int numItems)
Returns top numItems items recommended for each user id in the input data set.
e.g. Python
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import explode
alsEstimator = ALS()
(alsEstimator.setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop"))
alsModel = alsEstimator.fit(productRatings)
recommendForSubsetDF = alsModel.recommendForUserSubset(TargetUsers, 40)
recommendationsDF = (recommendForSubsetDF
.select("user_id", explode("recommendations")
.alias("recommendation"))
.select("user_id", "recommendation.*")
)
display(recommendationsDF)
e.g. Scala:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.functions.explode
val alsEstimator = new ALS().setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop")
val alsModel = alsEstimator.fit(productRatings)
val recommendForSubsetDF = alsModel.recommendForUserSubset(sampleTargetUsers, 40)
val recommendationsDF = recommendForSubsetDF
.select($"user_id", explode($"recommendations").alias("recommendation"))
.select($"user_id", $"recommendation.*")
display(recommendationsDF)
Here is what I did to get recommendations for a specific user with spark.ml:
import com.github.fommil.netlib.BLAS.{getInstance => blas}
userFactors.lookup(userId).headOption.fold(Map.empty[String, Float]) { user =>
val ratings = itemFactors.map { case (id, features) =>
val rating = blas.sdot(features.length, user, 1, features, 1)
(id, rating)
}
ratings.sortBy(_._2, ascending = false).take(numResults).toMap // highest scores first
}
Both userFactors and itemFactors in my case are RDD[(String, Array[Float])] but you should be able to do something similar with DataFrames.
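The scoring step is a dot product between the user's factor vector and each item's factor vector, followed by taking the N items with the highest scores. A minimal local sketch of that idea, using plain arrays instead of RDDs and a hand-written dot product instead of BLAS (the names TopNSketch, dot and recommend are made up for illustration):

```scala
object TopNSketch {
  // Dot product of two factor vectors; assumes both have the same length.
  def dot(a: Array[Float], b: Array[Float]): Float =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Score every item against the user vector and keep the n highest-scoring items.
  def recommend(user: Array[Float],
                items: Seq[(String, Array[Float])],
                n: Int): Seq[(String, Float)] =
    items.map { case (id, features) => id -> dot(user, features) }
      .sortBy { case (_, score) => -score } // descending by score
      .take(n)
}
```

This scans every item for each request, so it is only reasonable when the item factors fit comfortably on one machine.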

How do I filter rows based on whether a column value is in a Set of Strings in a Spark DataFrame

Is there a more elegant way of filtering based on values in a Set of String?
def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
val containsAction = udf((action: String) => {
actions.contains(action)
})
myDF.filter(containsAction('action))
}
In SQL you can do
select * from myTable where action in ('action1', 'action2', 'action3')
How about this:
myDF.filter("action in (1,2)")
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(1,2).map(lit(_)):_*))
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(lit(1),lit(2)):_*))
Additional support will be added to make this cleaner in 1.5
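Since Spark 1.5, Column has an isin method that takes the values as varargs, so the Set from the question can be passed directly without a UDF or lit. A minimal sketch of the question's myFilter rewritten that way (the wrapper object IsinSketch is made up here, and the column is assumed to be called action, as in the question):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object IsinSketch {
  // Keep only the rows whose "action" value is one of the given strings.
  def myFilter(actions: Set[String], myDF: DataFrame): DataFrame =
    myDF.filter(col("action").isin(actions.toSeq: _*))
}
```

Unlike the UDF version, this is a built-in expression, so it does not go through a Scala closure for each row.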