How to convert an MLlib matrix to a Spark DataFrame? - scala

I want to pretty print the result of a correlation in a zeppelin notebook:
val Row(coeff: Matrix) = Correlation.corr(data, "features").head
One of the ways to achieve this is to convert the result into a DataFrame with each value in a separate column and call z.show().
However, looking at the Matrix API I don't see any way to do this.
Is there another straight forward way to achieve this?
Edit:
The dataframe has 50 columns. Just converting it to a string would not help, as the output gets truncated.

Using the toString method should be the easiest and fastest way if you simply want to print the matrix. You can control the output by passing in the maximum number of lines to print as well as the maximum line width, and you can adjust the formatting by splitting on newlines and whitespace. For example:
// or org.apache.spark.mllib.linalg.Matrices, depending on which API you are using
import org.apache.spark.ml.linalg.Matrices

val matrix = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
matrix.toString
  .split("\n")
  .map(_.trim.split(" ").filter(_ != "").mkString("[", ",", "]"))
  .mkString("\n")
which will give the following:
[1.0,3.0,5.0]
[2.0,4.0,6.0]
However, if you want to convert the matrix to a DataFrame, the easiest way is to first create an RDD and then use toDF().
import spark.implicits._

val matrixRows = matrix.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Row")
Then, to put each value in a separate column, you can do the following:
val numOfCols = matrixRows.head.length
val df2 = (0 until numOfCols).foldLeft(df)((df, num) =>
    df.withColumn("Col" + num, $"Row".getItem(num)))
  .drop("Row")
df2.show(false)
Result using the example data:
+----+----+----+
|Col0|Col1|Col2|
+----+----+----+
|1.0 |3.0 |5.0 |
|2.0 |4.0 |6.0 |
+----+----+----+
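Tying this back to the correlation matrix from the question, here is a minimal sketch, assuming data, the spark session, and Zeppelin's z context from the question are in scope:
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._

// apply the same rowIter-based conversion to the correlation result
val Row(coeff: Matrix) = Correlation.corr(data, "features").head
val corrDF = spark.sparkContext.parallelize(coeff.rowIter.toSeq.map(_.toArray)).toDF("Row")
val wideDF = (0 until coeff.numCols).foldLeft(corrDF)((acc, i) =>
    acc.withColumn("Col" + i, $"Row".getItem(i)))
  .drop("Row")
z.show(wideDF)  // display the full table in the Zeppelin notebook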

Related

How to make elements of a list lower case?

I have a df in which one of the columns is a set of words. How can I make them lower case in an efficient way?
The df has many columns, but the column that I am trying to make lower case looks like this:
B
['Summer','Air Bus','Got']
['Parmin','Home']
Note:
In pandas I do df['B'].str.lower()
If I understood you correctly, you have a column that is an array of strings.
To lowercase the strings, you can use the lower function like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
data = [
{"B": ["Summer", "Air Bus", "Got"]},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df = df.withColumn("result", F.expr("transform(B, x -> lower(x))"))
df.show(truncate=False)
Result:
+----------------------+----------------------+
|B |result |
+----------------------+----------------------+
|[Summer, Air Bus, Got]|[summer, air bus, got]|
+----------------------+----------------------+
A slight variation on #vladsiv's answer, which tries to answer a question in the comments about passing a dynamic column name.
# set column name
m = "B"
# use F.transform directly, rather than inside an F.expr
df = df.withColumn("result", F.transform(F.col(m), lambda x: F.lower(x)))

Spark Scala - Extract elements of an array into new row

I have the following piece of code; I see the result, but I do not understand how exactly it is produced:
val df = Seq(Seq(4, 7, 9)).toDF("x")
val ds = df.withColumn("t", $"x").as[(Seq[Int], Seq[Int])]
ds.flatMap {
  case (x1, x2) => x2.map((x1, _))
}.toDF("v1", "v2")
Result looks like this:
+---------+---+
|v1 |v2 |
+---------+---+
|[4, 7, 9]|4 |
|[4, 7, 9]|7 |
|[4, 7, 9]|9 |
+---------+---+
My questions are:
1) How come this:
df.withColumn("t", $"x").as[(Seq[Int], Seq[Int])]
puts the same content into both columns, even though this specific Seq does not have a name to refer to? Why doesn't it create empty sequences?
2) The result of the flatMap should be a list/array, so why does it become a dataset with 2 columns?
3) What does case (x1, x2) mean in this particular situation? Why is it in parentheses?
4) In x2.map((x1, _)), which operations exactly does the map function perform here? I see that it takes x2 (the second column), and I understand that "_" means an element of the Seq, but I am missing the whole coherent picture.
1) t has the same content as x, so you have a dataframe with two columns (x, t), both array-typed with the same contents.
2) map/flatMap on a Dataset operates over rows, not over the elements of one row. x2.map((x1, _)) becomes a Seq of tuples, the first element being x1 (i.e. your x column), the second being one element of your t column array.
3) This is pattern matching (unapply) on a Tuple2 (i.e. (Seq[Int], Seq[Int])); x1 and x2 are both Seqs/arrays.
4) This is the same as select($"x", explode($"t")) in the DataFrame API. For every element in t, a new row is created (thus you get 3 rows), as in the sketch below.
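For reference, here is a minimal sketch of that equivalent DataFrame-API version, assuming a SparkSession named spark is in scope (the column names v1/v2 are only chosen to match the flatMap result):
import org.apache.spark.sql.functions.explode
import spark.implicits._

val ds = Seq(Seq(4, 7, 9)).toDF("x").withColumn("t", $"x")
// explode produces one output row per element of the t array, keeping x unchanged,
// which matches the flatMap version above
ds.select($"x".as("v1"), explode($"t").as("v2")).show(false)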

How to round decimal in Scala Spark

I have a large (~1 million rows) Scala Spark DataFrame with the following data:
id,score
1,0.956
2,0.977
3,0.855
4,0.866
...
How do I discretise/round the scores to the nearest 0.05 decimal place?
Expected result:
id,score
1,0.95
2,1.00
3,0.85
4,0.85
...
I would like to avoid using a UDF to maximise performance.
The answer can be simpler:
dataframe.withColumn("rounded_score", round(col("score"), 2))
There is a method
def round(e: Column, scale: Int)
which rounds the value of e to scale decimal places with HALF_UP round mode.
You can do it using Spark built-in functions, like so:
dataframe.withColumn("rounded_score", round(col("score") * 100 / 5) * 5 / 100)
Multiply the score so that the precision you want becomes a whole number.
Then divide that number by 5, and round.
Now the number corresponds to a multiple of 5, so multiply it by 5 to get back the whole number.
Divide by 100 to get the precision correct again.
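For example, tracing the score 0.977 from the sample data through those steps (math.round stands in here for Spark's round, which also uses HALF_UP and gives the same result for these values):
0.977 * 100 / 5      // ≈ 19.54
math.round(19.54)    // 20
20.0 * 5 / 100       // 1.0, matching the rounded_score shown below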
Result:
+---+-----+-------------+
| id|score|rounded_score|
+---+-----+-------------+
| 1|0.956| 0.95|
| 2|0.977| 1.0|
| 3|0.855| 0.85|
| 4|0.866| 0.85|
+---+-----+-------------+
You can specify your schema when converting into a dataframe.
Example:
Use DecimalType(10, 2) for the score column in your customSchema when loading the data.
id,score
1,0.956
2,0.977
3,0.855
4,0.866
...
import org.apache.spark.sql.types._

val mySchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("score", DecimalType(10, 2), true)
))

spark.read.format("csv").schema(mySchema).
  option("header", "true").option("nullValue", "?").
  load("/path/to/csvfile").show

Iterate over elements of columns Scala

I have a dataframe composed of two columns of Array[Double]. I would like to create a new column that is the result of applying a Euclidean distance function to the first two columns, i.e. if I had:
A B
(1,2) (1,3)
(2,3) (3,4)
Create:
A B C
(1,2) (1,3) 1
(2,3) (3,4) 1.4
My data schema is:
df.schema.foreach(println)
StructField(col1,ArrayType(DoubleType,false),false)
StructField(col2,ArrayType(DoubleType,false),true)
Whenever I call this distance function:
import scala.math.{pow, sqrt}

def distance(xs: Array[Double], ys: Array[Double]) = {
  sqrt((xs zip ys).map { case (x, y) => pow(y - x, 2) }.sum)
}
I get a type error:
df.withColumn("distances" , distance($"col1",$"col2"))
<console>:68: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Array[Double]
df.withColumn("distances" , distance($"col1",$"col2"))
I understand I have to iterate over the elements of each column, but I cannot find an explanation of how to do this anywhere. I am very new to Scala programming.
To use a custom function on a dataframe, you need to define it as a UDF. This can be done, for example, as follows:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val distance = udf((xs: WrappedArray[Double], ys: WrappedArray[Double]) => {
  math.sqrt((xs zip ys).map { case (x, y) => math.pow(y - x, 2) }.sum)
})
df.withColumn("C", distance($"A", $"B")).show()
Note that WrappedArray (or Seq) needs to be used here.
Resulting dataframe:
+----------+----------+------------------+
| A| B| C|
+----------+----------+------------------+
|[1.0, 2.0]|[1.0, 3.0]| 1.0|
|[2.0, 3.0]|[3.0, 4.0]|1.4142135623730951|
+----------+----------+------------------+
Spark functions work on columns, and your only mistake is that you are mixing columns and primitives in the function.
The error message is clear enough: it says that you are passing columns into the distance function, i.e. $"col1" and $"col2" are columns, but the distance function is defined as distance(xs: Array[Double], ys: Array[Double]), which takes primitive types.
The solution is to make the distance function fully column-based:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def distance(xs: Column, ys: Column) = {
  sqrt(pow(ys(0) - xs(0), 2) + pow(ys(1) - xs(1), 2))
}
df.withColumn("distances" , distance($"col1",$"col2")).show(false)
which should give you the correct result without errors
+------+------+------------------+
|col1 |col2 |distances |
+------+------+------------------+
|[1, 2]|[1, 3]|1.0 |
|[2, 3]|[3, 4]|1.4142135623730951|
+------+------+------------------+
I hope the answer is helpful

Stratified sampling in Spark

I have a data set which contains user and purchase data. Here is an example, where the first element is the userId, the second is the productId, and the third indicates a boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to take only 80% of each user's data and build an RDD from it, while taking the remaining 20% to build another RDD. Let's call them train and test. I would like to stay away from using groupBy, since it can create memory problems given that the data set is large. What's the best way to do this?
I could do the following, but it will not give 80% of each user:
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
One possible solution is in Holden's answer, and here are some other solutions:
Using RDDs:
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it:
Considering the following list:
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create a pair RDD, mapping all the users as keys:
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up fractions for each key as follows, since sampleByKeyExact takes a Map of fractions per key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct keys, and then associate each with a fraction equal to 0.8. I collect the whole thing as a Map.
Now, to sample:
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions,seed = 2L)
You can check the counts on your data and on the data sample:
scala> data.count
// [...]
// res10: Long = 12
scala> sampleData.count
// [...]
// res11: Long = 10
Using DataFrames:
Let's consider the same data (seq) from the previous section.
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD to do that, on which we create tuples of the elements by defining our key to be the first column:
val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: Map[Int, Double] = data.map(_._1)
.distinct
.map(x => (x, 0.8))
.collectAsMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions, 2L)
.values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // you can use sqlContext.createDataFrame(...) instead for Spark 1.6
You can now check the counts on your df and on the data sample:
scala> df.count
// [...]
// res9: Long = 12
scala> sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can use the DataFrameStatFunctions.sampleBy method:
df.stat.sampleBy("keyColumn", fractions, seed)
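For example, a minimal sketch on the df from the previous section; note that sampleBy is approximate, unlike sampleByKeyExact, so the sampled count can vary slightly:
val fractionsByKey = Map(2147481832 -> 0.8, 214748183 -> 0.8)
val approxSampleDF = df.stat.sampleBy("keyColumn", fractionsByKey, 2L)
approxSampleDF.count  // roughly 10, but not guaranteed to be exact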
Something like this may be well suited to something like "BlinkDB", but let's look at the question. There are two ways to interpret what you've asked. One is:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each user's data.
For #1 you could do a map to get the user IDs, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils or BernoulliCellSampler); a sketch of this approach follows below. You can then filter your input data down to just the set of IDs you want.
For #2 you could look at BernoulliCellSampler and simply apply it directly.
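A minimal sketch of interpretation #1, assuming data is the RDD of (userId, productId, flag) triples from the question and that the set of distinct user IDs fits in driver memory (the 42L seed is arbitrary):
val sampledUserIds = data.map(_._1).distinct()
  .sample(withReplacement = false, fraction = 0.8, seed = 42L)
  .collect().toSet
val userIdsBc = data.sparkContext.broadcast(sampledUserIds)
// keep every row for the sampled 80% of users; the remaining users form the test set
val train = data.filter(x => userIdsBc.value.contains(x._1))
val test = data.filter(x => !userIdsBc.value.contains(x._1))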