I have a dataframe composed of two Arrays of Doubles. I would like to create a new column that is the result of applying a euclidean distance function to the first two columns, i.e if I had:
A B
(1,2) (1,3)
(2,3) (3,4)
Create:
A B C
(1,2) (1,3) 1
(2,3) (3,4) 1.4
My data schema is:
df.schema.foreach(println)
StructField(col1,ArrayType(DoubleType,false),false)
StructField(col2,ArrayType(DoubleType,false),true)
Whenever I call this distance function:
def distance(xs: Array[Double], ys: Array[Double]) = {
sqrt((xs zip ys).map { case (x,y) => pow(y - x, 2) }.sum)
}
I get a type error:
df.withColumn("distances" , distance($"col1",$"col2"))
<console>:68: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Array[Double]
ids_with_predictions_centroids3.withColumn("distances" , distance($"col1",$"col2"))
I understand I have to iterate over the elements of each column, but I cannot find an explanation of how to do this anywhere. I am very new to Scala programming.
To use a custom function on a dataframe you need to define it as an UDF. This can be done, for example, as follows:
val distance = udf((xs: WrappedArray[Double], ys: WrappedArray[Double]) => {
math.sqrt((xs zip ys).map { case (x,y) => math.pow(y - x, 2) }.sum)
})
df.withColumn("C", distance($"A", $"B")).show()
Note that WrappedArray (or Seq) need to be used here.
Resulting dataframe:
+----------+----------+------------------+
| A| B| C|
+----------+----------+------------------+
|[1.0, 2.0]|[1.0, 3.0]| 1.0|
|[2.0, 3.0]|[3.0, 4.0]|1.4142135623730951|
+----------+----------+------------------+
Spark functions work on column based and your only mistake is that you are mixing column and primitives in the function
And the error message is clear enough which says that you are passing a column in the distance function i.e. $"col1" and $"col2" are columns but the distance function is defined as distance(xs: Array[Double], ys: Array[Double]) taking primitive types.
The solution is to make the distance function fully column based as
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def distance(xs: Column, ys: Column) = {
sqrt(pow(ys(0)-xs(0), 2) + pow(ys(1)-xs(1), 2))
}
df.withColumn("distances" , distance($"col1",$"col2")).show(false)
which should give you the correct result without errors
+------+------+------------------+
|col1 |col2 |distances |
+------+------+------------------+
|[1, 2]|[1, 3]|1.0 |
|[2, 3]|[3, 4]|1.4142135623730951|
+------+------+------------------+
I hope the answer is helpful
Related
So, I have the following functions that work perfectly when I use them on arrays:
def magnitude(x: Array[Int]): Double = {
math.sqrt(x map(i => i*i) sum)
}
def dotProduct(x: Array[Int], y: Array[Int]): Int = {
(for((a, b) <- x zip y) yield a * b) sum
}
def cosineSimilarity(x: Array[Int], y: Array[Int]): Double = {
require(x.size == y.size)
dotProduct(x, y)/(magnitude(x) * magnitude(y))
}
But, I don't know how to run it on an array that I have in a spark dataframe column.
I know the problem is that the function expects an array, but I am giving a column to it. But, I don't know how to solve the problem.
One way is to wrap your functions within UDFs. Yet UDFs are known to be suboptimal most of the time. You could therefore rewrite your functions with spark primitives. To ease the reuse of the expression you write, you can write functions that take Column objects as parameters.
import org.apache.spark.sql.Column
def magnitude(x : Column) = {
aggregate(transform(x, _ * _), lit(0), _ + _)
}
def dotProduct(x : Column, y : Column) = {
val products = transform(arrays_zip(x, y), s => s(x.toString) * s(y.toString))
aggregate(products, lit(0), _ + _)
}
def cosineSimilarity(x : Column, y : Column) = {
dotProduct(x, y) / (magnitude(x) * magnitude(y))
}
Let's test this:
val df = spark.range(1).select(
array(lit(1), lit(2), lit(3)) as "x",
array(lit(1), lit(3), lit(5)) as "y"
)
df.select(
'x, 'y,
magnitude('x) as "magnitude_x",
dotProduct('x, 'y) as "dot_prod_x_y",
cosineSimilarity('x, 'y) as "cosine_x_y"
).show()
which yields:
+---------+---------+-----------+------------+--------------------+
| x| y|magnitude_x|dot_prod_x_y| cosine_x_y|
+---------+---------+-----------+------------+--------------------+
|[1, 2, 3]|[1, 3, 5]| 14| 22|0.044897959183673466|
+---------+---------+-----------+------------+--------------------+
To use your own functions within sparkSQL, you need to wrap them inside of a UDF (user defined function).
val df = spark.range(1)
.withColumn("x", array(lit(1), lit(2), lit(3)))
// defining the user defined functions from the scala functions.
val magnitude_udf = udf(magnitude _)
val dot_product_udf = udf(dotProduct(_,_))
df
.withColumn("magnitude", magnitude_udf('x))
.withColumn("dot_product", dot_product_udf('x, 'x))
.show
+---+---------+------------------+-----------+
| id| x| magnitude|dot_product|
+---+---------+------------------+-----------+
| 0|[1, 2, 3]|3.7416573867739413| 14|
+---+---------+------------------+-----------+
I have a following piece of code, where I see the result, but do not understand how exactly it is made:
val Df = Seq(Seq(4,7,9)).toDf("x")
val Ds = Df.withColumn("t", $"x").as[(Seq[Int], Seq[Int])]
ds.flatMap{
case(x1,x2) => x2.map((x1,_))
}.toDf("v1","v2")
Result looks like this:
+---------+---+
|v1 |v2 |
+---------+---+
|[4, 7, 9]|4 |
|[4, 7, 9]|7 |
|[4, 7, 9]|9 |
+---------+---+
My questions are:
1) How come this:
Df.withColumn("t", $"x").as[(Seq[Int], Seq[Int])]
enters same content to both columns, even though this specific Seq does not have a name to refer to? Why doesn't it create empty sequences?
2) result of the flatmap should be list/array, why it becomes a dataset with 2 columns?
3) what does mean case (x1,x2) in this particular situation? Why is it in brackets?
4) x2.map((x1,_)) which exactly operations does map function perform here? I see, that it takes x2 (second column), I understand that "_" means an element of a Seq, but I totally miss the whole coherent picture.
1) thas the same content as x, so have a a dataframe with two columns (x,t), both array-type with same contents
2) map in DataFrame API maps over rows, not over the element of one row. x2.map((x1,_)) becomes a Seq of tuples, the first being x1 (i.e. your x column), the second is one element of your t column array
3) this is pattern-matching (unapply) on a tuple2 (i.e. Seq[Int], Seq[Int])), x1 und x2 are both Seqs/arrays
4) this is the same as select($"x",explode($"t")) in DataFrame API. For every element in t, a new row is created (thus you get 3 rows)
In data reduction phase of analysis, I want to remove all columns where column total is below the median value of all column totals.
So with dataset:
v1,v2,v3
1 3 5
3 4 3
I sum columns
v1,v2,v3
4 7 8
The median is 7 so I drop v1
v2,v3
3 5
4 3
I thought I could do this with a streaming function on Row. But this does not seem possible.
The code I have come up with works, but it seems very verbose and looks a lot like Java code (which I take as a sign that I am doing it wrong).
Are there any more efficient ways of performing this operation?
val val dfv2=DataFrameUtils.openFile(spark,"C:\\Users\\jake\\__workspace\\R\\datafiles\\ikodaDataManipulation\\VERB2.csv")
//return a single row dataframe with sum of each column
val dfv2summed:DataFrame=dfv2.groupBy().sum()
logger.info(s"dfv2summed col count is ${dfv2summed.schema.fieldNames.length}")
//get the rowValues
val rowValues:Array[Long]=dfv2summed.head().getValuesMap(dfv2summed.schema.fieldNames).values.toArray
//sort the rows
scala.util.Sorting.quickSort(rowValues)
//calculate medians (simplistically)
val median:Long = rowValues(rowValues.length/2)
//ArrayBuffer to hold column needs that need removing
var columnArray: ArrayBuffer[String] = ArrayBuffer[String]()
//get tuple key value pairs of columnName/value
val entries: Map[String, Long]=dfv2summed.head().getValuesMap(dfv2summed.schema.fieldNames)
entries.foreach
{
//find all columns where total value below median value
kv =>
if(kv._2.<(median))
{
columnArray+=kv._1
}
}
//drop columns
val dropColumns:Seq[String]=columnArray.map(s => s.substring(s.indexOf("sum(")+4,s.length-1)).toSeq
logger.info(s"todrop ${dropColumns.size} : ${dropColumns}")
val reducedDf=dfv2.drop(dropColumns: _*)
logger.info(s"reducedDf col count is ${reducedDf.schema.fieldNames.length}")
After calculating the sum of each column in Spark, we can get the median value in plain Scala and then select only the columns greater than or equal to this value by column indices.
Let's start with defining a function for calculating the median, it is a slight modification of this example:
def median(seq: Seq[Long]): Long = {
//In order if you are not sure that 'seq' is sorted
val sortedSeq = seq.sortWith(_ < _)
if (seq.size % 2 == 1) sortedSeq(sortedSeq.size / 2)
else {
val (up, down) = sortedSeq.splitAt(seq.size / 2)
(up.last + down.head) / 2
}
}
We first calculate the sums for all columns and convert it to Seq[Long]:
import org.apache.spark.sql.functions._
val sums = df.select(df.columns.map(c => sum(col(c)).alias(c)): _*)
.first.toSeq.asInstanceOf[Seq[Long]]
Then we calculate the median,
val med = median(sums)
And use it as a threshold to generate the column indices to keep:
val cols_keep = sums.zipWithIndex.filter(_._1 >= med).map(_._2)
Finally, we map these indices inside a select() statement:
df.select(cols_keep map df.columns map col: _*).show()
+---+---+
| v2| v3|
+---+---+
| 3| 5|
| 4| 3|
+---+---+
I have data set which contains user and purchase data. Here is an example, where first element is userId, second is productId, and third indicate boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to make sure I only take 80% of each users data and build an RDD while take the rest of the 20% and build a another RDD. Lets call train and test. I would like to stay away from using groupBy to start with since it can create memory problem since data set is large. Whats the best way to do this?
I could do following but this will not give 80% of each user.
val percentData = data.map(x => ((math.random * 100).toInt, x._1. x._2, x._3)
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
One possible solution is in Holden's answer, and here is some other solutions :
Using RDDs :
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do :
Considering the following list :
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create an RDD Pair, mapping all the users as keys :
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up fractions for each key as following, since sampleByKeyExact takes a Map of fraction for each key :
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is mapping on the keys to find distinct keys and then associate each to a fraction equals to 0.8. I collect the whole as a Map.
To sample now :
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions,seed = 2L)
You can check the count on your keys or data or data sample :
scala > data.count
// [...]
// res10: Long = 12
scala > sampleData.count
// [...]
// res11: Long = 10
Using DataFrames :
Let's consider the same data (seq) from the previous section.
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD to do that on which we creates tuples of the elements in this RDD by defining our key to be the first column :
val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: Map[Int, Double] = data.map(_._1)
.distinct
.map(x => (x, 0.8))
.collectAsMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions, 2L)
.values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // you can use sqlContext.createDataFrame(...) instead for spark 1.6)
You can now check the count on your keys or df or data sample :
scala > df.count
// [...]
// res9: Long = 12
scala > sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can use DataFrameStatFunctions.sampleBy method:
df.stat.sampleBy("keyColumn", fractions, seed)
Something like this is may be well suited to something like "Blink DB", but lets look at the question. There are two ways to interpret what you've asked one is:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each users data
For #1 you could do a map to get the user ids, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils or BernoulliCellSampler). You can then filter your input data to just the set of IDs you want.
For #2 you could look at BernoulliCellSampler and simply apply it directly.
If I have a RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77 which tells me my targeted dependent variable is at column number 77. But I don't have enough knowledge on how to select desired (partial) columns as features (say I want columns from 23 to 59, 111 to 357, 399 to 489). I am wondering if I can apply such:
val data = rdd.map(col => new LabeledPoint(
col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray))
Any suggestions or guidance will be much appreciated.
Maybe I messed up RDD with DataFrame, I can convert the RDD to DataFrame with .toDF() or it is easier to accomplish the goal with DataFrame than RDD.
I assume your data looks more or less like this:
import scala.util.Random.{setSeed, nextDouble}
setSeed(1)
case class Record(
foo: Double, target: Double, x1: Double, x2: Double, x3: Double)
val rows = sc.parallelize(
(1 to 10).map(_ => Record(
nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
))
)
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")
sqlContext.sql("""
SELECT ROUND(foo, 2) foo,
ROUND(target, 2) target,
ROUND(x1, 2) x1,
ROUND(x2, 2) x2,
ROUND(x2, 2) x3
FROM df""").show
So we have data as below:
+----+------+----+----+----+
| foo|target| x1| x2| x3|
+----+------+----+----+----+
|0.73| 0.41|0.21|0.33|0.33|
|0.01| 0.96|0.94|0.95|0.95|
| 0.4| 0.35|0.29|0.51|0.51|
|0.77| 0.66|0.16|0.38|0.38|
|0.69| 0.81|0.01|0.52|0.52|
|0.14| 0.48|0.54|0.58|0.58|
|0.62| 0.18|0.01|0.16|0.16|
|0.54| 0.97|0.25|0.39|0.39|
|0.43| 0.23|0.89|0.04|0.04|
|0.66| 0.12|0.65|0.98|0.98|
+----+------+----+----+----+
and we want to ignore foo and x2 and extract LabeledPoint(target, Array(x1, x3)):
// Map feature names to indices
val featInd = List("x1", "x3").map(df.columns.indexOf(_))
// Or if you want to exclude columns
val ignored = List("foo", "target", "x2")
val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))
// Get index of target
val targetInd = df.columns.indexOf("target")
df.rdd.map(r => LabeledPoint(
r.getDouble(targetInd), // Get target value
// Map feature indices to values
Vectors.dense(featInd.map(r.getDouble(_)).toArray)
))