I have a Spark DataFrame with several columns that looks like this:
id  Color
1   Red, Blue, Black
2   Red, Green
3   Blue, Yellow, Green
...
I also have a map file that looks like:
Red,0
Blue,1
Green,2
Black,3
Yellow,4
What I need to do is map the color names to an indicator array, e.g. mapping "Red, Blue, Black" to [1,1,0,1,0].
I wrote a function like this:
def mapColor(label_string: String): Array[Int] = {
  val labels = label_string.split(",")
  val index_array = new Array[Int](COLOR_LENGTH)
  for (label <- labels) {
    if (COLOR_MAP.contains(label)) {
      index_array(COLOR_MAP(label)) = 1
    } else {
      // the dictionary does not contain the label, so set the last index to one
      index_array(COLOR_LENGTH - 1) = 1
    }
  }
  index_array
}
COLOR_LENGTH is the size of the dictionary, and COLOR_MAP is the dictionary that holds the string -> id mapping.
I call this function like this:
val color_function = udf(mapColor:(String)=>Array[Int])
sql.withColumn("color_idx",color_function(col("Color")))
I have multiple columns that need this operation, but different columns need different dictionaries. Currently I duplicate this function for each column (changing only the dictionary and length information), which is tedious. Is there a way to pass the length and dictionary into the mapping function, such as
def map(label_string:String,map:Map[String,Integer],len:Int):Array[Int]
But how should I call such a function on the Spark DataFrame? There is no way for me to pass the extra parameters in the declaration
val color_function = udf(mapColor:(String)=>Array[Int])
You can use a UDF factory that takes the color Map as its argument and returns the UDF, as in the following example:
val df = Seq(
(1, "Red, Blue, Black"),
(2, "Red, Green"),
(3, "Blue, Yellow, Green")
).toDF("id", "color")
val colorMap = Map("Red"-> 0, "Blue"->1, "Green"->2, "Black"->3, "Yellow"->4)
def mapColorCode(m: Map[String, Int]) = udf( (s: String) =>
s.split("""\s*,\s*""").map(c => m.getOrElse(c, -99))
)
df.select($"id", mapColorCode(colorMap)($"color").as("colorcode")).show
// +---+----------+
// | id| colorcode|
// +---+----------+
// | 1| [0, 1, 3]|
// | 2| [0, 2]|
// | 3| [1, 4, 2]|
// +---+----------+
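If you need the indicator array from the question (length COLOR_LENGTH, with unknown labels mapped to the last slot) rather than an array of indices, the same factory pattern applies. Here is a minimal sketch, reusing the question's COLOR_MAP and COLOR_LENGTH names and trimming the ", "-separated labels:
import org.apache.spark.sql.functions.udf
// Sketch: pass the dictionary and its size once and get back a reusable UDF
def indicatorUdf(dict: Map[String, Int], len: Int) = udf { (s: String) =>
  val arr = Array.fill(len)(0)
  s.split(",").map(_.trim).foreach { label =>
    // unknown labels set the last position, mirroring the original function
    arr(dict.getOrElse(label, len - 1)) = 1
  }
  arr
}
// One factory, a different dictionary (and length) per column, e.g.:
// df.withColumn("color_idx", indicatorUdf(COLOR_MAP, COLOR_LENGTH)($"Color"))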
Here is the full code:
import org.apache.spark.sql.functions.udf
val colrMapList = List("Red" -> 0, "Blue" -> 1, "Green" -> 2).toMap
def getColor = udf((colors: Seq[String]) =>
  if (colors.nonEmpty) colors.map(color => colrMapList.getOrElse(color, 0)).mkString(",") else "0"
)
val colors = List((1, Array("Red", "Blue", "Black")), (2, Array("Red", "Green")))
val colrDF = sc.parallelize(colors).toDF("id", "colors")
colrDF.withColumn("colorMap", getColor($"colors")).show
Explanation
Create a map for the color to integer mapping.
The getColor UDF looks up the corresponding integer for each color.
Finally, apply the function to colrDF to get the output.
I have a DataFrame composed of two columns of Array[Double]. I would like to create a new column that is the result of applying a Euclidean distance function to the first two columns, i.e. if I had:
A B
(1,2) (1,3)
(2,3) (3,4)
Create:
A B C
(1,2) (1,3) 1
(2,3) (3,4) 1.4
My data schema is:
df.schema.foreach(println)
StructField(col1,ArrayType(DoubleType,false),false)
StructField(col2,ArrayType(DoubleType,false),true)
Whenever I call this distance function:
def distance(xs: Array[Double], ys: Array[Double]) = {
sqrt((xs zip ys).map { case (x,y) => pow(y - x, 2) }.sum)
}
I get a type error:
df.withColumn("distances" , distance($"col1",$"col2"))
<console>:68: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Array[Double]
ids_with_predictions_centroids3.withColumn("distances" , distance($"col1",$"col2"))
I understand I have to iterate over the elements of each column, but I cannot find an explanation of how to do this anywhere. I am very new to Scala programming.
To use a custom function on a DataFrame you need to define it as a UDF. This can be done, for example, as follows:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf
val distance = udf((xs: WrappedArray[Double], ys: WrappedArray[Double]) => {
  math.sqrt((xs zip ys).map { case (x, y) => math.pow(y - x, 2) }.sum)
})
df.withColumn("C", distance($"A", $"B")).show()
Note that WrappedArray (or Seq) needs to be used here for the array-typed columns.
Resulting dataframe:
+----------+----------+------------------+
| A| B| C|
+----------+----------+------------------+
|[1.0, 2.0]|[1.0, 3.0]| 1.0|
|[2.0, 3.0]|[3.0, 4.0]|1.4142135623730951|
+----------+----------+------------------+
Spark SQL functions operate on Columns, and your mistake is that you are mixing Columns and primitive types in the function.
The error message says exactly that: you are passing Columns to the distance function, i.e. $"col1" and $"col2" are Columns, but the function is defined as distance(xs: Array[Double], ys: Array[Double]), taking primitive types.
The solution is to make the distance function fully column-based (note that this version assumes the arrays always have exactly two elements), as
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def distance(xs: Column, ys: Column) = {
sqrt(pow(ys(0)-xs(0), 2) + pow(ys(1)-xs(1), 2))
}
df.withColumn("distances" , distance($"col1",$"col2")).show(false)
which should give you the correct result without errors
+------+------+------------------+
|col1 |col2 |distances |
+------+------+------------------+
|[1, 2]|[1, 3]|1.0 |
|[2, 3]|[3, 4]|1.4142135623730951|
+------+------+------------------+
I hope the answer is helpful
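If the arrays can hold more than two elements, a column-only version is still possible on Spark 2.4+ using the SQL higher-order functions zip_with and aggregate; here is a sketch assuming the same col1/col2 schema as above:
import org.apache.spark.sql.functions.{expr, sqrt}
// zip_with pairs the elements, aggregate sums the squared differences,
// and sqrt finishes the Euclidean distance, with no UDF and no fixed length
val squaredSum = expr("aggregate(zip_with(col1, col2, (x, y) -> pow(y - x, 2)), 0D, (acc, v) -> acc + v)")
df.withColumn("distances", sqrt(squaredSum)).show(false)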
I am trying to find the difference between values for keys that are related (but not the same). For example, let's say that I have the following key-value pairs:
("John_1", ["a","b","c"])
("John_2", ["a","b"])
("John_3", ["b","c"])
("Mary_5", ["a","d"])
("John_5", ["c","d","e"])
I want to compare the contents of Name_# to Name_(#-1) and get the difference. So, for the example above, I would like to get:
("John_1", ["a","b","c"]) // Since there is no John_0, all of the contents are new, so I keep them all
("John_2", []) // Since all of the contents of John_2 appear in John_1, the resulting list is empty (for now, I don't care what happened to "c")
("John_3", ["c"]) // In this case, "c" is a new item (because I don't care whether it existed prior to John_2). Again, I don't care what happened to "a".
("Mary_5", ["a","d"]) // There is no Mary_4, so all the items are kept
("John_5", ["c","d","e"]) // There is no John_4, so all the items are kept.
I was thinking of doing some kind of aggregateByKey and then just finding the difference between the lists, but I do not know how to match the keys I care about, namely Name_# with Name_(#-1).
Split "id":
import org.apache.spark.sql.functions._
val df = Seq(
("John_1", Seq("a","b","c")), ("John_2", Seq("a","b")),
("John_3", Seq("b","c")), ("Mary_5", Seq("a","d")),
("John_5", Seq("c","d","e"))
).toDF("key", "values").withColumn(
"user", split($"key", "_")(0)
).withColumn("id", split($"key", "_")(1).cast("long"))
Add window:
val w = org.apache.spark.sql.expressions.Window
.partitionBy($"user").orderBy($"id")
and a udf:
val diff = udf((x: Seq[String], y: Seq[String]) => y.diff(x))
and compute:
df
.withColumn("is_previous", coalesce($"id" - lag($"id", 1).over(w) === 1, lit(false)))
.withColumn("diff", when($"is_previous", diff( lag($"values", 1).over(w), $"values")).otherwise($"values"))
.show
// +------+---------+----+---+-----------+---------+
// | key| values|user| id|is_previous| diff|
// +------+---------+----+---+-----------+---------+
// |Mary_5| [a, d]|Mary| 5| false| [a, d]|
// |John_1|[a, b, c]|John| 1| false|[a, b, c]|
// |John_2| [a, b]|John| 2| true| []|
// |John_3| [b, c]|John| 3| true| [c]|
// |John_5|[c, d, e]|John| 5| false|[c, d, e]|
// +------+---------+----+---+-----------+---------+
I managed to solve my issue as follows:
First, create a function that computes the previous key from the current key:
def getPrevKey(k: String): String = {
  val Array(n, h) = k.split("_")
  val i = h.toInt
  val sb = new StringBuilder
  sb.append(n).append("_").append(i - 1)
  sb.toString
}
Then, create a copy of my RDD with the shifted key:
val copyRdd = myRdd.map(row => {
  val k1 = row._1
  val v1 = row._2
  val k2 = getPrevKey(k1)
  (k2, v1)
})
And finally, I union both RDDs and reduce by key by taking the difference between the lists:
val result = myRdd.union(copyRdd)
.reduceByKey(_.diff(_))
This gets me the exact result I need, but has the problem that it requires a lot of memory due to the union. The final result is not that large, but the partial results really weigh down the process.
I need to convert an Array[Array[Double]] to an RDD, e.g. turn [[1.1, 1.2 ...], [2.1, 2.2 ...], [3.1, 3.2 ...], ...] into
+-----+-----+-----+
| 1.1 | 1.2 | ... |
| 2.1 | 2.2 | ... |
| 3.1 | 3.2 | ... |
| ... | ... | ... |
+-----+-----+-----+
val testDensities: Array[Array[Double]] = Array(Array(1.1, 1.2), Array(2.1, 2.2), Array(3.1, 3.2))
val testData = spark.sparkContext
.parallelize(Seq(testDensities
.map { x => x.toArray }
.map { x => x.toString() } ))
This code even feels incorrect; the second map call is supposed to map over each element of the array, converting each Double to a String. This is what I get when I save it as a text file:
[Ljava.lang.String;@773d7a60
Can anybody please comment on what I should do, and where I am making a horrendous mistake?
Thanks.
If you want to convert an Array[Double] to a String you can use the mkString method, which joins each item of the array with a delimiter (in my example ","):
scala> val testDensities: Array[Array[Double]] = Array(Array(1.1, 1.2), Array(2.1, 2.2), Array(3.1, 3.2))
scala> val rdd = spark.sparkContext.parallelize(testDensities)
scala> val rddStr = rdd.map(_.mkString(","))
rddStr: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at map at
scala> rddStr.collect.foreach(println)
1.1,1.2
2.1,2.2
3.1,3.2
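Since the original goal was to save the result as a text file, the same RDD can then be written out directly (the output path below is only a placeholder):
scala> rddStr.saveAsTextFile("/tmp/densities")  // writes one comma-separated line per inner array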
Maybe something like this:
scala> val testDensities: Array[Array[Double]] = Array(Array(1.1, 1.2), Array(2.1, 2.2), Array(3.1, 3.2))
scala> val strRdd = sc.parallelize(testDensities).map(_.mkString("[",",","]"))
strRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[16] at map at <console>:26
scala> strRdd.collect
res7: Array[String] = Array([1.1,1.2], [2.1,2.2], [3.1,3.2])
But I have two questions:
Why do you want to do this? I understand it is probably just because you are learning and playing with Spark.
Why do you use "Array"? It is not the first time I have seen people trying to transform everything into arrays. Keep the RDD until the end and use more generic collection types.
Why your code is wrong:
Because you apply the maps to your local array (in the driver) and only then create an RDD from the already-transformed result.
So:
You are not parallelizing the execution of the maps. In fact, you are parallelizing nothing.
You create an RDD whose elements are whole Array[String]s, not individual Strings.
If you execute your code in the console:
scala> val testData = sc.parallelize(Seq(testDensities.map { x => x.toArray }.map { x => x.toString() } ))
testData: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[14] at parallelize at <console>:26
the response is clear: RDD[Array[String]]
I have a data set which contains user and purchase data. Here is an example, where the first element is the userId, the second is the productId, and the third indicates a boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to take 80% of each user's data and build one RDD, and take the remaining 20% and build another RDD. Let's call them train and test. I would like to stay away from using groupBy, since it can create memory problems on a data set this large. What's the best way to do this?
I could do the following, but it will not give 80% of each user:
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
One possible solution is in Holden's answer, and here are some other solutions:
Using RDDs:
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it.
Considering the following list:
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create a pair RDD, mapping all the users as keys:
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up the fractions for each key as follows, since sampleByKeyExact takes a Map of fractions per key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct ones, and then associate each with a fraction equal to 0.8. I collect the whole thing as a Map.
Now to sample:
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions,seed = 2L)
You can check the counts on your data and your data sample:
scala > data.count
// [...]
// res10: Long = 12
scala > sampleData.count
// [...]
// res11: Long = 10
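To also get the remaining ~20% as a test RDD, one option is to subtract the sample from the original pair RDD. This is a sketch; note that RDD.subtract compares whole (key, value) records, so it assumes there are no exact duplicate records:
val trainData = sampleData
val testData = data.subtract(sampleData)
// in this example testData should hold the remaining 2 records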
Using DataFrames:
Let's consider the same data (seq) from the previous section.
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD, on which we create tuples of the elements by defining the key to be the first column:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: Map[Int, Double] = data.map(_._1)
  .distinct
  .map(x => (x, 0.8))
  .collectAsMap
  .toMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions, 2L)
  .values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // use sqlContext.createDataFrame(...) instead for Spark 1.6
You can now check the counts on your df and your data sample:
scala > df.count
// [...]
// res9: Long = 12
scala > sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can use the DataFrameStatFunctions.sampleBy method:
df.stat.sampleBy("keyColumn", fractions, seed)
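A sketch of that DataFrame route end to end, assuming the df and fractions built above (sampleBy samples approximately, unlike sampleByKeyExact, and except assumes no duplicate rows):
val trainDF = df.stat.sampleBy("keyColumn", fractions, 2L) // ~80% per key
val testDF = df.except(trainDF)                            // the remaining rows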
Something like this may be well suited to something like "Blink DB", but let's look at the question. There are two ways to interpret what you've asked:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each user's data
For #1 you could do a map to get the user ids, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils or BernoulliCellSampler). You can then filter your input data down to just the set of IDs you want; a sketch follows below.
For #2 you could look at BernoulliCellSampler and simply apply it directly.
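For example, here is a rough sketch of interpretation #1 on the question's (userId, productId, flag) tuples; the RDD name rows and the seed are illustrative, and the 0.8 sample of ids is approximate:
// rows: RDD[(Int, Int, Int)] holding the raw (userId, productId, flag) records
val userIds = rows.map(_._1).distinct()
val keptIds = userIds.sample(withReplacement = false, fraction = 0.8, seed = 42L).collect().toSet
val keptIdsB = sc.broadcast(keptIds)
val train = rows.filter { case (user, _, _) => keptIdsB.value.contains(user) }
val test = rows.filter { case (user, _, _) => !keptIdsB.value.contains(user) }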
I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77, which tells me that my target dependent variable is at column number 77. But I don't have enough knowledge on how to select the desired (partial) columns as features (say I want columns 23 to 59, 111 to 357, and 399 to 489). I am wondering if I can apply something like:
val data = rdd.map(col => new LabeledPoint(
col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray))
Any suggestions or guidance will be much appreciated.
Maybe I mixed up RDD and DataFrame; I can convert the RDD to a DataFrame with .toDF(), or it may be easier to accomplish the goal with a DataFrame than with an RDD.
I assume your data looks more or less like this:
import scala.util.Random.{setSeed, nextDouble}
setSeed(1)
case class Record(
foo: Double, target: Double, x1: Double, x2: Double, x3: Double)
val rows = sc.parallelize(
(1 to 10).map(_ => Record(
nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
))
)
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")
sqlContext.sql("""
SELECT ROUND(foo, 2) foo,
ROUND(target, 2) target,
ROUND(x1, 2) x1,
ROUND(x2, 2) x2,
ROUND(x2, 2) x3
FROM df""").show
So we have data as below:
+----+------+----+----+----+
| foo|target| x1| x2| x3|
+----+------+----+----+----+
|0.73| 0.41|0.21|0.33|0.33|
|0.01| 0.96|0.94|0.95|0.95|
| 0.4| 0.35|0.29|0.51|0.51|
|0.77| 0.66|0.16|0.38|0.38|
|0.69| 0.81|0.01|0.52|0.52|
|0.14| 0.48|0.54|0.58|0.58|
|0.62| 0.18|0.01|0.16|0.16|
|0.54| 0.97|0.25|0.39|0.39|
|0.43| 0.23|0.89|0.04|0.04|
|0.66| 0.12|0.65|0.98|0.98|
+----+------+----+----+----+
and we want to ignore foo and x2 and extract LabeledPoint(target, Array(x1, x3)):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Map feature names to indices
val featInd = List("x1", "x3").map(df.columns.indexOf(_))
// Or if you want to exclude columns
val ignored = List("foo", "target", "x2")
val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))
// Get index of target
val targetInd = df.columns.indexOf("target")
df.rdd.map(r => LabeledPoint(
  r.getDouble(targetInd), // Get target value
  // Map feature indices to values
  Vectors.dense(featInd.map(r.getDouble(_)).toArray)
))
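For the index-range case in the question (features at columns 23 to 59, 111 to 357 and 399 to 489, target at column 77), here is a sketch that builds the feature indices directly from the positions, reusing the imports above and assuming all of those columns are DoubleType:
// Concatenate the desired column-index ranges
val featIdx: Seq[Int] = (23 to 59) ++ (111 to 357) ++ (399 to 489)
val targetIdx = 77
df.rdd.map(r => LabeledPoint(
  r.getDouble(targetIdx),
  Vectors.dense(featIdx.map(i => r.getDouble(i)).toArray)
))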