Comparing values from different keys in Scala / Spark

I am trying to find the difference between values for keys that are related (but not the same). For example, lets say that I have the following map:
("John_1", ["a","b","c"])
("John_2", ["a","b"])
("John_3", ["b","c"])
("Mary_5", ["a","d"])
("John_5", ["c","d","e"])
I want to compare the contents of Name_# to Name_(#-1) and get the difference. So, for the example above, I would like to get:
("John_1", ["a","b","c"]) // Since there is no John_0, all of the contents are new, so I keep them all
("John_2", []) // Since all of the contents of John_2 appear in John_1, the resulting list is empty (for now, I don't care about what happened to "c")
("John_3", ["c"]) // In this case, "c" is a new item, because I don't care whether it existed prior to John_2. Again, I don't care what happened to "a".
("Mary_5", ["a","d"]) // There is no Mary_4, so all the items are kept
("John_5", ["c","d","e"]) // There is no John_4, so all the items are kept.
I was thinking of doing some kind of aggregateByKey and then just finding the difference between the lists, but I do not know how to match the keys that I care about, namely Name_# with Name_(#-1).

Split "id":
import org.apache.spark.sql.functions._
// in a standalone application you also need: import spark.implicits._

val df = Seq(
  ("John_1", Seq("a","b","c")), ("John_2", Seq("a","b")),
  ("John_3", Seq("b","c")), ("Mary_5", Seq("a","d")),
  ("John_5", Seq("c","d","e"))
).toDF("key", "values")
  .withColumn("user", split($"key", "_")(0))
  .withColumn("id", split($"key", "_")(1).cast("long"))
Add window:
val w = org.apache.spark.sql.expressions.Window
  .partitionBy($"user").orderBy($"id")
and a UDF:
val diff = udf((x: Seq[String], y: Seq[String]) => y.diff(x))
and compute:
df
  .withColumn("is_previous", coalesce($"id" - lag($"id", 1).over(w) === 1, lit(false)))
  .withColumn("diff", when($"is_previous", diff(lag($"values", 1).over(w), $"values")).otherwise($"values"))
  .show
// +------+---------+----+---+-----------+---------+
// | key| values|user| id|is_previous| diff|
// +------+---------+----+---+-----------+---------+
// |Mary_5| [a, d]|Mary| 5| false| [a, d]|
// |John_1|[a, b, c]|John| 1| false|[a, b, c]|
// |John_2| [a, b]|John| 2| true| []|
// |John_3| [b, c]|John| 3| true| [c]|
// |John_5|[c, d, e]|John| 5| false|[c, d, e]|
// +------+---------+----+---+-----------+---------+

I managed to solve my issue as follows:
First, create a function that computes the previous key from the current key:
def getPrevKey(k: String): String = {
  val Array(n, h) = k.split("_")
  val i = h.toInt
  val sb = new StringBuilder
  sb.append(n).append("_").append(i - 1)
  sb.toString
}
Then, create a copy of my RDD with the shifted key:
val copyRdd = myRdd.map(row => {
  val k1 = row._1
  val v1 = row._2
  val k2 = getPrevKey(k1)
  (k2, v1)
})
And finally, I union both RDDs and reduce by key by taking the difference between the lists:
val result = myRdd.union(copyRdd)
  .reduceByKey(_.diff(_))
This gets me the exact result I need, but it has the problem that it requires a lot of memory due to the union. The final result is not that large, but the partial results really weigh down the process.
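A join-based variant could avoid shuffling a full unioned copy of the data set, since each record is only ever paired with the one predecessor list it needs. This is just a sketch, assuming myRdd is an RDD of (String, Seq[String]) pairs as above and using a hypothetical getNextKey helper (the mirror of getPrevKey):
// getNextKey is hypothetical: the mirror of getPrevKey, adding 1 to the suffix
def getNextKey(k: String): String = {
  val Array(n, h) = k.split("_")
  n + "_" + (h.toInt + 1)
}

// Re-key every record under its successor's key, so the join pairs each
// (key, values) record with the values of its predecessor, if any
val shiftedRdd = myRdd.map { case (k, v) => (getNextKey(k), v) }

val result2 = myRdd
  .leftOuterJoin(shiftedRdd)
  .mapValues {
    case (values, Some(prev)) => values.diff(prev) // keep only the new items
    case (values, None)       => values            // no predecessor: keep everything
  }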

Related

Iterate over elements of columns Scala

I have a dataframe composed of two Arrays of Doubles. I would like to create a new column that is the result of applying a euclidean distance function to the first two columns, i.e if I had:
A B
(1,2) (1,3)
(2,3) (3,4)
Create:
A B C
(1,2) (1,3) 1
(2,3) (3,4) 1.4
My data schema is:
df.schema.foreach(println)
StructField(col1,ArrayType(DoubleType,false),false)
StructField(col2,ArrayType(DoubleType,false),true)
Whenever I call this distance function:
def distance(xs: Array[Double], ys: Array[Double]) = {
sqrt((xs zip ys).map { case (x,y) => pow(y - x, 2) }.sum)
}
I get a type error:
df.withColumn("distances" , distance($"col1",$"col2"))
<console>:68: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Array[Double]
ids_with_predictions_centroids3.withColumn("distances" , distance($"col1",$"col2"))
I understand I have to iterate over the elements of each column, but I cannot find an explanation of how to do this anywhere. I am very new to Scala programming.
To use a custom function on a DataFrame you need to define it as a UDF. This can be done, for example, as follows:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val distance = udf((xs: WrappedArray[Double], ys: WrappedArray[Double]) => {
  math.sqrt((xs zip ys).map { case (x, y) => math.pow(y - x, 2) }.sum)
})

df.withColumn("C", distance($"A", $"B")).show()
Note that WrappedArray (or Seq) needs to be used here rather than Array, since that is how Spark passes array columns to a UDF.
Resulting dataframe:
+----------+----------+------------------+
| A| B| C|
+----------+----------+------------------+
|[1.0, 2.0]|[1.0, 3.0]| 1.0|
|[2.0, 3.0]|[3.0, 4.0]|1.4142135623730951|
+----------+----------+------------------+
Spark SQL functions work on columns, and your only mistake is that you are mixing columns and primitives in the function.
The error message is clear enough: it says that you are passing columns to the distance function, i.e. $"col1" and $"col2" are columns, but the distance function is defined as distance(xs: Array[Double], ys: Array[Double]), taking primitive types.
The solution is to make the distance function fully column-based:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def distance(xs: Column, ys: Column) = {
sqrt(pow(ys(0)-xs(0), 2) + pow(ys(1)-xs(1), 2))
}
df.withColumn("distances" , distance($"col1",$"col2")).show(false)
which should give you the correct result without errors (note that this column-based version assumes the arrays always have exactly two elements):
+------+------+------------------+
|col1 |col2 |distances |
+------+------+------------------+
|[1, 2]|[1, 3]|1.0 |
|[2, 3]|[3, 4]|1.4142135623730951|
+------+------+------------------+
I hope the answer is helpful

How to pass a map in Spark Udf?

I have a question. I have a spark dataframe with several columns looking like:
id Color
1 Red, Blue, Black
2 Red, Green
3 Blue, Yellow, Green
...
I also have a map file looking like :
Red,0
Blue,1
Green,2
Black,3
Yellow,4
What I need to do is to map the color names into different ids, such as mapping "Red, Blue, Black" into the array [1,1,0,1,0].
I wrote code like this:
def mapColor(label_string: String): Array[Int] = {
  var labels = label_string.split(",")
  var index_array = new Array[Int](COLOR_LENGTH)
  for (label <- labels) {
    if (COLOR_MAP.contains(label)) {
      index_array(COLOR_MAP(label)) = 1
    } else {
      // dictionary does not contain the label, the last index is set to one
      index_array(COLOR_LENGTH - 1) = 1
    }
  }
  index_array
}
The COLOR_LENGTH is the length of the dictionary, and COLOR_MAP is the dictionary that contains the string->id relationship.
I call this function like this:
val color_function = udf(mapColor:(String)=>Array[Int])
sql.withColumn("color_idx",color_function(col("Color")))
I have multiple columns that need this operation, but different columns need different dictionaries. Currently, I duplicate this function for each column (just changing the dictionary and length information), but the code looks tedious. Is there any way I can pass the length and dictionary into the mapping function, such as
def map(label_string:String,map:Map[String,Integer],len:Int):Array[Int]
But how should I call this function on the Spark dataframe, since there is no way for me to pass the parameters in the declaration?
val color_function = udf(mapColor:(String)=>Array[Int])
You can use a curried function that takes the color Map as its first argument and returns a UDF, like in the following example:
val df = Seq(
(1, "Red, Blue, Black"),
(2, "Red, Green"),
(3, "Blue, Yellow, Green")
).toDF("id", "color")
val colorMap = Map("Red"-> 0, "Blue"->1, "Green"->2, "Black"->3, "Yellow"->4)
def mapColorCode(m: Map[String, Int]) = udf( (s: String) =>
s.split("""\s*,\s*""").map(c => m.getOrElse(c, -99))
)
df.select($"id", mapColorCode(colorMap)($"color").as("colorcode")).show
// +---+----------+
// | id| colorcode|
// +---+----------+
// | 1| [0, 1, 3]|
// | 2| [0, 2]|
// | 3| [1, 4, 2]|
// +---+----------+
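Because the dictionary is an argument of the outer function, the same definition can be reused for any other column with any other map. Here is a small sketch of that reuse; sizeMap, df2 and the size column are hypothetical and only serve as an illustration:
// Hypothetical second dictionary and column, reusing the curried UDF above
val sizeMap = Map("S" -> 0, "M" -> 1, "L" -> 2)
val df2 = Seq((1, "S, M"), (2, "L")).toDF("id", "size")

df2.select($"id", mapColorCode(sizeMap)($"size").as("sizecode")).show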
Here is the full code of another approach:
val colrMapList = List("Red" -> 0, "Blue" -> 1, "Green" -> 2).toMap
def getColor = udf((colors: Seq[String]) => {
  if (!colors.isEmpty) colors.map(color => colrMapList.getOrElse(color, 0)).mkString(",") else "0"
})
val colors = List((1, Array("Red", "Blue", "Black")), (2, Array("Red", "Green")))
val colrDF = sc.parallelize(colors).toDF("id", "colors")
colrDF.withColumn("colorMap", getColor($"colors")).show
Explanation
Create a map for the color-to-integer mapping.
The getColor function pulls the corresponding integers for the given colors.
Finally, you apply the function to colrDF to get the output.

How to use RDD.flatMap?

I have a text file with lines that contain userid and rid separated by | (pipe). rid values correspond to many labels on another file.
How can I use flatMap to implement a method as follows:
xRdd = sc.textFile("file.txt").flatMap { line =>
val (userid,rid) = line.split("\\|")
val labelsArr = getLabels(rid)
labelsArr.foreach{ i =>
((userid, i), 1)
}
}
At compile time, I get an error:
type mismatch; found : Unit required: TraversableOnce[?]
Piecing together the information provided, it seems you will have to replace your foreach operation with a map operation.
val xRdd = sc.textFile("file.txt") flatMap { line =>
  val Array(userid, rid) = line.split("\\|")
  val labelsArr = getLabels(rid)
  labelsArr.map(i => ((userid, i), 1))
}
This is exactly the reason why I said here and here that Scala's for-comprehension could make things easier, and it should help you out too.
When you see a series of flatMap and map that's the moment where the nesting should trigger some thinking about solutions to cut the "noise". That begs for simpler solutions, doesn't it?
See the following and appreciate Scala (and its for-comprehension) yourself!
val lines = sc.textFile("file.txt")
val pairs = for {
  line <- lines
  Array(userid, rid) = line.split("\\|")
  label <- getLabels(rid)
} yield ((userid, label), 1)
If you throw in Spark SQL to the mix, things would get even simpler. Just to whet your appetite:
scala> pairs.toDF.show
+-----------------+---+
| _1| _2|
+-----------------+---+
| [jacek,1]| 1|
|[jacek,getLabels]| 1|
| [agata,2]| 1|
|[agata,getLabels]| 1|
+-----------------+---+
I'm sure you can guess what was inside my file.txt file, can't you?
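Purely as a guess (the answer leaves it as a puzzle), the output above is consistent with a file.txt containing the two pipe-separated lines jacek|1 and agata|2, together with a getLabels stub along these lines; both are hypothetical reconstructions, not part of the original answer:
// Hypothetical stub, reconstructed from the show() output above
def getLabels(rid: String): Seq[String] = Seq(rid, "getLabels")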

UDF to randomly assign values based on different probabilities

I would like to create a UDF to randomly assign values based on different probabilities.
In the following example depending of the value returned by rand:
0 to 0.5: the value should be A (50% probability)
0.8 to 1: the value should be B (20% probability)
anything else: the value should be C (30% probability)
val names = Array("A", "B", "C")
val allocate = udf((p: Double) => {
if(p < 0.5) names(0)
else if (p > 0.8) names(1)
else names(2)})
val test = sqlContext.range(0, 100).select(($"id"),(round(abs(rand),2)).alias("val"), allocate(abs(rand)).alias("name"))
However, when I print the result, the names are not assigned based on the rules defined in the UDF.
+---+----+----+
| id| val|name|
+---+----+----+
| 0|0.17| C| => should be A
| 1|0.12| A|
| 2|0.36| A|
| 3|0.56| B|
| 4|0.82| A|=> should be C
There is nothing unexpected going on here. You call the rand function twice, so you get two different random values.
Either provide the same seed for both calls:
sqlContext.range(0, 100)
.select(
$"id",
abs(rand(1)).alias("val"),
allocate(abs(rand(1))).alias("name")
)
or reuse the value:
sqlContext.range(0, 100)
.withColumn("val", abs(rand))
.withColumn("name", allocate($"val"))

Stratified sampling in Spark

I have a data set which contains user and purchase data. Here is an example, where the first element is userId, the second is productId, and the third indicates a boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to make sure I only take 80% of each user's data and build an RDD, while taking the remaining 20% and building another RDD. Let's call them train and test. I would like to stay away from using groupBy, since it can create memory problems given that the data set is large. What's the best way to do this?
I could do the following, but this will not give 80% of each user's data:
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
One possible solution is in Holden's answer, and here are some other solutions:
Using RDDs:
You can use the sampleByKeyExact transformation from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it:
Considering the following list:
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create a pair RDD, mapping all the users as keys:
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up the fractions for each key as follows, since sampleByKeyExact takes a Map with a fraction for each key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct keys and then associate each one with a fraction equal to 0.8. I collect the whole thing as a Map.
Now, to sample:
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions,seed = 2L)
You can check the counts on your data and your data sample:
scala > data.count
// [...]
// res10: Long = 12
scala > sampleData.count
// [...]
// res11: Long = 10
Using DataFrames:
Let's consider the same data (seq) from the previous section.
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD. We create tuples of the elements of this RDD by defining our key to be the first column:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: scala.collection.Map[Int, Double] = data.map(_._1)
  .distinct
  .map(x => (x, 0.8))
  .collectAsMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 2L)
  .values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // use sqlContext.createDataFrame(...) instead for Spark 1.6
You can now check the counts on your df and your data sample:
scala > df.count
// [...]
// res9: Long = 12
scala > sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can use the DataFrameStatFunctions.sampleBy method:
df.stat.sampleBy("keyColumn", fractions, seed)
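A minimal, self-contained sketch of that call, reusing the df and the 0.8-per-key fractions from this example (the seed value is arbitrary):
// Build a fraction per distinct key, then stratified-sample roughly 80% of each key's rows
val keyFractions: Map[Int, Double] = df.select("keyColumn")
  .distinct
  .collect
  .map(r => r.getInt(0) -> 0.8)
  .toMap

val sampledDF = df.stat.sampleBy("keyColumn", keyFractions, seed = 7L)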
Something like this may be well suited to something like "Blink DB", but let's look at the question. There are two ways to interpret what you've asked. One is:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each user's data.
For #1 you could do a map to get the user ids, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils or BernoulliCellSampler). You can then filter your input data to just the set of IDs you want.
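Here is a rough sketch of interpretation #1, assuming data is the RDD of (userId, productId, flag) triples from the question and that the set of distinct user ids fits in driver memory:
// Sample 80% of the distinct user ids (approximate, not exact),
// then keep only the rows belonging to those users
val sampledIds = data.map(_._1).distinct.sample(withReplacement = false, fraction = 0.8, seed = 42L)
val idSet = sc.broadcast(sampledIds.collect.toSet)

val train = data.filter { case (userId, _, _) => idSet.value.contains(userId) }
val test  = data.filter { case (userId, _, _) => !idSet.value.contains(userId) }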
For #2 you could look at BernoulliCellSampler and simply apply it directly.