UDF to randomly assign values based on different probabilities - scala

I would like to create a UDF to randomly assign values based on different probabilities.
In the following example, depending on the value returned by rand:
0 to 0.5 the value should be A (50% probability)
0.8 to 1 the value should be B (20% probability)
anything else the value should be C (30% probability)
val names = Array("A", "B", "C")

val allocate = udf((p: Double) => {
  if (p < 0.5) names(0)
  else if (p > 0.8) names(1)
  else names(2)
})

val test = sqlContext.range(0, 100).select(
  $"id",
  round(abs(rand), 2).alias("val"),
  allocate(abs(rand)).alias("name")
)
However, when I print the result, the names are not assigned based on the rules defined in the UDF.
+---+----+----+
| id| val|name|
+---+----+----+
|  0|0.17|   C| => should be A
|  1|0.12|   A|
|  2|0.36|   A|
|  3|0.56|   B|
|  4|0.82|   A| => should be C
+---+----+----+

There is nothing unexpected going on here. You call the rand function twice, so you get two different random values.
Either provide the same seed for both calls:
sqlContext.range(0, 100)
  .select(
    $"id",
    abs(rand(1)).alias("val"),
    allocate(abs(rand(1))).alias("name")
  )
or reuse the value:
sqlContext.range(0, 100)
  .withColumn("val", abs(rand))
  .withColumn("name", allocate($"val"))

Related

Comparing values from different keys in scala / spark

I am trying to find the difference between values for keys that are related (but not the same). For example, let's say that I have the following map:
("John_1",["a","b","c"])
("John_2",["a","b"])
("John_3",["b","c"])
("Mary_5",["a","d"])
("John_5",["c","d","e"])
I want to compare the contents of Name_# to Name_(#-1) and get the difference. So, for the example above, I would like to get:
("John_1",["a","b","c"]) //Since there is no John_0, all of the contents are new, so I keep them all
("John_2",[]) //Since all of the contents of John_2 appear in John_1, the resulting list is empty (for now, I don't care about what happened to "c")
("John_3",["c"]) //In this case, "c" is a new item (because I don't care whether it existed prior to John_2). Again, I don't care what happened to "a".
("Mary_5",["a","d"]) //There is no Mary_4 so all the items are kept
("John_5",["c","d","e"]) //There is no John_4 so all the items are kept.
I was thinking of doing some kind of aggregateByKey and then just finding the difference between the lists, but I do not know how to match up the keys that I care about, namely Name_# with Name_(#-1).
Split "id":
import org.apache.spark.sql.functions._

val df = Seq(
  ("John_1", Seq("a", "b", "c")), ("John_2", Seq("a", "b")),
  ("John_3", Seq("b", "c")), ("Mary_5", Seq("a", "d")),
  ("John_5", Seq("c", "d", "e"))
).toDF("key", "values")
  .withColumn("user", split($"key", "_")(0))
  .withColumn("id", split($"key", "_")(1).cast("long"))
Add window:
val w = org.apache.spark.sql.expressions.Window
.partitionBy($"user").orderBy($"id")
and a udf:
val diff = udf((x: Seq[String], y: Seq[String]) => y.diff(x))
and compute:
df
  .withColumn("is_previous", coalesce($"id" - lag($"id", 1).over(w) === 1, lit(false)))
  .withColumn("diff", when($"is_previous", diff(lag($"values", 1).over(w), $"values")).otherwise($"values"))
  .show
// +------+---------+----+---+-----------+---------+
// | key| values|user| id|is_previous| diff|
// +------+---------+----+---+-----------+---------+
// |Mary_5| [a, d]|Mary| 5| false| [a, d]|
// |John_1|[a, b, c]|John| 1| false|[a, b, c]|
// |John_2| [a, b]|John| 2| true| []|
// |John_3| [b, c]|John| 3| true| [c]|
// |John_5|[c, d, e]|John| 5| false|[c, d, e]|
// +------+---------+----+---+-----------+---------+
I managed to solve my issue as follows:
First, create a function that computes the previous key from the current key:
def getPrevKey(k: String): String = {
  val Array(n, h) = k.split("_")
  val i = h.toInt
  val sb = new StringBuilder
  sb.append(n).append("_").append(i - 1)
  sb.toString
}
Then, create a copy of my RDD with the shifted key:
val copyRdd = myRdd.map(row => {
  val k1 = row._1
  val v1 = row._2
  val k2 = getPrevKey(k1)
  (k2, v1)
})
And finally, I union both RDDs and reduce by key by taking the difference between the lists:
val result = myRdd.union(copyRdd)
.reduceByKey(_.diff(_))
This gets me the exact result I need, but has the problem that it requires a lot of memory due to the union. The final result is not that large, but the partial results really weigh down the process.

scala: Remove columns where column value below median value for all columns

In the data reduction phase of my analysis, I want to remove all columns whose column total is below the median value of all the column totals.
So with dataset:
v1,v2,v3
1 3 5
3 4 3
I sum the columns:
v1,v2,v3
4 7 8
The median is 7, so I drop v1:
v2,v3
3 5
4 3
I thought I could do this with a streaming function on Row. But this does not seem possible.
The code I have come up with works, but it seems very verbose and looks a lot like Java code (which I take as a sign that I am doing it wrong).
Are there any more efficient ways of performing this operation?
val dfv2 = DataFrameUtils.openFile(spark, "C:\\Users\\jake\\__workspace\\R\\datafiles\\ikodaDataManipulation\\VERB2.csv")
//return a single row dataframe with sum of each column
val dfv2summed:DataFrame=dfv2.groupBy().sum()
logger.info(s"dfv2summed col count is ${dfv2summed.schema.fieldNames.length}")
//get the rowValues
val rowValues:Array[Long]=dfv2summed.head().getValuesMap(dfv2summed.schema.fieldNames).values.toArray
//sort the summed values
scala.util.Sorting.quickSort(rowValues)
//calculate medians (simplistically)
val median:Long = rowValues(rowValues.length/2)
//ArrayBuffer to hold the names of columns that need removing
var columnArray: ArrayBuffer[String] = ArrayBuffer[String]()
//get tuple key value pairs of columnName/value
val entries: Map[String, Long]=dfv2summed.head().getValuesMap(dfv2summed.schema.fieldNames)
entries.foreach
{
//find all columns where total value below median value
kv =>
if(kv._2.<(median))
{
columnArray+=kv._1
}
}
//drop columns
val dropColumns:Seq[String]=columnArray.map(s => s.substring(s.indexOf("sum(")+4,s.length-1)).toSeq
logger.info(s"todrop ${dropColumns.size} : ${dropColumns}")
val reducedDf=dfv2.drop(dropColumns: _*)
logger.info(s"reducedDf col count is ${reducedDf.schema.fieldNames.length}")
After calculating the sum of each column in Spark, we can get the median value in plain Scala and then select only the columns whose totals are greater than or equal to this value, using their column indices.
Let's start by defining a function for calculating the median; it is a slight modification of this example:
def median(seq: Seq[Long]): Long = {
  // sort first, in case 'seq' is not already sorted
  val sortedSeq = seq.sortWith(_ < _)
  if (seq.size % 2 == 1) sortedSeq(sortedSeq.size / 2)
  else {
    val (up, down) = sortedSeq.splitAt(seq.size / 2)
    (up.last + down.head) / 2
  }
}
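For the toy data above the column sums are 4, 7 and 8, so, for example:
median(Seq(4L, 7L, 8L))   // returns 7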
We first calculate the sums of all the columns and convert them to a Seq[Long]:
import org.apache.spark.sql.functions._
val sums = df.select(df.columns.map(c => sum(col(c)).alias(c)): _*)
.first.toSeq.asInstanceOf[Seq[Long]]
Then we calculate the median:
val med = median(sums)
And use it as a threshold to generate the column indices to keep:
val cols_keep = sums.zipWithIndex.filter(_._1 >= med).map(_._2)
Finally, we map these indices inside a select() statement:
df.select(cols_keep map df.columns map col: _*).show()
+---+---+
| v2| v3|
+---+---+
| 3| 5|
| 4| 3|
+---+---+

Add a column to DataFrame with value of 1 where prediction greater than a custom threshold

I am trying to add a column to a DataFrame that should have the value 1 when the output class probability is high. Something like this:
val output = predictions
.withColumn(
"easy",
when( $"label" === $"prediction" &&
$"probability" > 0.95, 1).otherwise(0)
)
The problem is, probability is a Vector, and 0.95 is a Double, so the above doesn't work. What I really need is more like max($"probability") > 0.95 but of course that doesn't work either.
What is the right way of accomplishing this?
Here is a simple example of how to implement this.
Create a udf that takes the probability column and returns 0 or 1 for the newly added column. Note that inside a Row the values come back as a WrappedArray rather than an Array or Vector.
import scala.collection.mutable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

def label = udf((prob: mutable.WrappedArray[Double]) => {
  if (prob.max >= 0.95) 1 else 0
})

val data = spark.sparkContext.parallelize(Seq(
  (Vector(0.78, 0.98, 0.97), 1), (Vector(0.78, 0.96), 2), (Vector(0.78, 0.50), 3)
)).toDF("probability", "id")

data.withColumn("label", label($"probability")).show()
Output:
+------------------+---+-----+
| probability| id|label|
+------------------+---+-----+
|[0.78, 0.98, 0.97]| 1| 1|
| [0.78, 0.96]| 2| 1|
| [0.78, 0.5]| 3| 0|
+------------------+---+-----+
Define a UDF (fill in the parameter types to match your DataFrame):
val findP = udf((label: <type>, prediction: <type>, probability: <type>) => {
  if (label == prediction && probability.toArray.max > 0.95) 1 else 0
})
Use the UDF in withColumn():
val output = predictions.withColumn("easy", findP($"label", $"prediction", $"probability"))
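A concrete version of that sketch, assuming label and prediction are Doubles and probability is an org.apache.spark.ml.linalg.Vector (the usual output types of a Spark ML classifier); adjust the types to whatever your predictions DataFrame actually contains:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// 1 when the prediction matches the label and the top class probability exceeds 0.95
val findP = udf((label: Double, prediction: Double, probability: Vector) => {
  if (label == prediction && probability.toArray.max > 0.95) 1 else 0
})

val output = predictions.withColumn("easy", findP($"label", $"prediction", $"probability"))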
Use a udf, something like:
import org.apache.spark.ml.linalg.Vector

val func = udf((label: String, prediction: String, vector: Vector) => {
  if (label == prediction && vector.toArray.max > 0.95) 1 else 0
})

val output = predictions
  .select($"label", func($"label", $"prediction", $"probability").as("easy"))

Spark GraphX Aggregation Summation

I'm trying to compute the sum of node values in a Spark GraphX graph. In short, the graph is a tree and the top node (root) should sum all children and their children. My graph is actually a tree that looks like the following, and the expected summed value should be 1850:
VertexId 20 (root)   Value: sum of 11 & 911
 +-- VertexId 11     Value: sum of 14 & 24
 |    +-- VertexId 14    Value: 1000
 |    +-- VertexId 24    Value: 550
 +-- VertexId 911    Value: 300
The first stab at this looks like this:
val vertices: RDD[(VertexId, Int)] =
sc.parallelize(Array((20L, 0)
, (11L, 0)
, (14L, 1000)
, (24L, 550)
, (911L, 300)
))
//note that the last value in the edge is for factor (positive or negative)
val edges: RDD[Edge[Int]] =
sc.parallelize(Array(
Edge(14L, 11L, 1),
Edge(24L, 11L, 1),
Edge(11L, 20L, 1),
Edge(911L, 20L, 1)
))
val dataItemGraph = Graph(vertices, edges)
val sum: VertexRDD[(Int, BigDecimal, Int)] = dataItemGraph.aggregateMessages[(Int, BigDecimal, Int)](
sendMsg = { triplet => triplet.sendToDst(1, triplet.srcAttr, 1) },
mergeMsg = { (a, b) => (a._1, a._2 * a._3 + b._2 * b._3, 1) }
)
sum.collect.foreach(println)
This returns the following:
(20,(1,300,1))
(11,(1,1550,1))
It's doing the sum for vertex 11, but it's not rolling up to the root node (vertex 20). What am I missing, or is there a better way of doing this? Of course, the tree can be of arbitrary size and each vertex can have an arbitrary number of child edges.
Given that the graph is directed (as it seems to be in your example), it should be possible to write a Pregel program that does what you're asking for:
val result =
dataItemGraph.pregel(0, activeDirection = EdgeDirection.Out)(
(_, vd, msg) => msg + vd,
t => Iterator((t.dstId, t.srcAttr)),
(x, y) => x + y
)
result.vertices.collect().foreach(println)
// Output is:
// (24,550)
// (20,1850)
// (14,1000)
// (11,1550)
// (911,300)
I'm using EdgeDirection.Out so that the messages are only sent from the bottom up (otherwise we would get into an endless loop).

Stratified sampling in Spark

I have a data set which contains user and purchase data. Here is an example, where the first element is the userId, the second is the productId, and the third indicates a boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to make sure I only take 80% of each user's data to build one RDD, and take the remaining 20% to build another RDD. Let's call them train and test. I would like to stay away from using groupBy, since it can create memory problems given that the data set is large. What's the best way to do this?
I could do the following, but this will not give 80% of each user's data.
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
One possible solution is in Holden's answer, and here are some other solutions:
Using RDDs:
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it:
Considering the following list:
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create a pair RDD, mapping all the users as keys:
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up a fraction for each key as follows, since sampleByKeyExact takes a Map of fractions by key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct keys and then associate each one with a fraction equal to 0.8. I collect the whole thing as a Map.
Now, to sample:
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions,seed = 2L)
You can check the counts on your original data and the sampled data:
scala> data.count
// [...]
// res10: Long = 12
scala> sampleData.count
// [...]
// res11: Long = 10
Using DataFrames:
Let's consider the same data (seq) from the previous section.
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD to do that. On it, we create tuples of the elements by defining our key to be the first column:
val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: Map[Int, Double] = data.map(_._1)
.distinct
.map(x => (x, 0.8))
.collectAsMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions, 2L)
.values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // you can use sqlContext.createDataFrame(...) instead for Spark 1.6
You can now check the counts on df and the sampled data:
scala> df.count
// [...]
// res9: Long = 12
scala> sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can use the DataFrameStatFunctions.sampleBy method:
df.stat.sampleBy("keyColumn", fractions, seed)
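A minimal sketch of that route on the same df, assuming the set of distinct keys fits in driver memory; note that unlike sampleByKeyExact, sampleBy only approximates the requested fraction per key:
// build an immutable Map: each distinct key -> fraction 0.8
val sampleFractions: Map[Int, Double] = df
  .select("keyColumn")
  .distinct
  .collect
  .map(r => r.getInt(0) -> 0.8)
  .toMap

val trainDF = df.stat.sampleBy("keyColumn", sampleFractions, seed = 2L)

// the remaining rows (assuming df has no duplicate rows)
val testDF = df.except(trainDF)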
Something like this may be well suited to something like "Blink DB", but let's look at the question. There are two ways to interpret what you've asked:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each user's data
For #1 you could do a map to get the user ids, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils or BernoulliCellSampler). You can then filter your input data to just the set of IDs you want (a rough sketch of this is given after #2 below).
For #2 you could look at BernoulliCellSampler and simply apply it directly.
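A rough sketch of #1 on the (userId, productId, flag) tuples from the question; the names here (userIds, keepUsers, and so on) are made up for illustration, and collecting the sampled ids to the driver assumes the number of distinct users is manageable:
// data: RDD[(Int, Int, Int)] of (userId, productId, flag)
val userIds = data.map(_._1).distinct

// sample 80% of the distinct users, not 80% of the rows
val keepUsers = userIds.sample(withReplacement = false, fraction = 0.8, seed = 42L)
  .collect
  .toSet

val train = data.filter(x => keepUsers.contains(x._1))
val test  = data.filter(x => !keepUsers.contains(x._1))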