Convert Array into dataframe with columns and index in Scala

Initially, I have a matrix:
0.0 0.4 0.4 0.0
0.1 0.0 0.0 0.7
0.0 0.2 0.0 0.3
0.3 0.0 0.0 0.0
The matrix is converted into normal_array by
`val normal_array = matrix.toArray`
and I have an array of strings:
inputCols : Array[String] = Array(p1, p2, p3, p4)
I need to convert this matrix into the following data frame. (Note: the number of rows and columns in the matrix will be the same as the length of inputCols.)
index p1 p2 p3 p4
p1 0.0 0.4 0.4 0.0
p2 0.1 0.0 0.0 0.7
p3 0.0 0.2 0.0 0.3
p4 0.3 0.0 0.0 0.0
In Python, this can easily be achieved with the pandas library:
arrayToDataframe = pandas.DataFrame(normal_array, columns=inputCols, index=inputCols)
But how can I do this in Scala?

You can do something like the below (this assumes a SparkSession named sparkSession is in scope):
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, udf}
//convert your data to a Scala Seq/List/Array
val list = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
//define your Array of desired column names
val inputCols : Array[String] = Array("p1", "p2", "p3", "p4")
//create a DataFrame from the given data; it gets default column names like _1, _2, etc.
val df = sparkSession.createDataFrame(list)
//get the list of column names from the dataframe
val dfColumns = df.columns
//build a query that renames the default columns to the desired names
val query = dfColumns.zip(inputCols).map { case (from, to) => s"$from as $to" }
//fire the above query
val newDf = df.selectExpr(query: _*)
//create a udf which gets a row index (0,1,2,3) as input and returns the corresponding column name from your given array of columns
val getIndexUDF = udf((row_no: Long) => inputCols(row_no.toInt))
//add a temporary row_no column holding the row index, derive the index column from it, then drop it.
//Note: monotonically_increasing_id yields 0,1,2,... only while the data sits in a single partition (coalesce(1) first if needed).
val dfWithRow = newDf
  .withColumn("row_no", monotonically_increasing_id())
  .withColumn("index", getIndexUDF(col("row_no")))
  .drop("row_no")
dfWithRow.show
Sample Output:
+---+---+---+---+-----+
| p1| p2| p3| p4|index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|   p1|
|0.1|0.0|0.0|0.7|   p2|
|0.0|0.2|0.0|0.3|   p3|
|0.3|0.0|0.0|0.0|   p4|
+---+---+---+---+-----+

Here is another way:
val data = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
val cols = Array("p1", "p2", "p3", "p4","index")
Zip the collection with the column names and convert it into a DataFrame. Note that cols does double duty here: zip truncates to the shorter collection, so the four data rows are paired with "p1".."p4" (which become the index values), while toDF(cols: _*) uses all five names as column names.
//assumes import spark.implicits._ is in scope for toDF
data.zip(cols).map {
  case (row, index) => (row._1, row._2, row._3, row._4, index)
}.toDF(cols: _*)
Output:
+---+---+---+---+-----+
|p1 |p2 |p3 |p4 |index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|p1   |
|0.1|0.0|0.0|0.7|p2   |
|0.0|0.2|0.0|0.3|p3   |
|0.3|0.0|0.0|0.0|p4   |
+---+---+---+---+-----+

A newer and shorter version, for Spark version > 2.4.5, looks like the following.
Please find the inline description of the statements in the comments.
import org.apache.spark.sql.{functions, SparkSession}
val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()
import spark.implicits._
val cols = (1 to 4).map(i => s"p$i")
val listDf = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
  .toDF(cols: _*) // Map the data to the new column names
  .withColumn("index", // Create the index column from an auto-increasing id
    // +1 so the labels run p1..p4 like the column names; note that
    // monotonically_increasing_id is 0,1,2,... only while the data sits in a single partition
    functions.concat(functions.lit("p"), functions.monotonically_increasing_id() + 1))
listDf.show()
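All of the answers above hard-code the sample data as a Seq of tuples. For reference, here is a sketch that builds the frame directly from the original matrix and inputCols (assumptions: matrix is an org.apache.spark.mllib.linalg.Matrix on Spark 2.0+, so rowIter is available, its dimensions equal inputCols.length, and a SparkSession named spark is in scope):
// Generalized sketch: build (index, values) pairs straight from the matrix,
// then split the values array into one column per name in inputCols.
import org.apache.spark.sql.functions.col
import spark.implicits._

val rows = matrix.rowIter.toSeq.zip(inputCols).map {
  case (row, name) => (name, row.toArray)
}
val base = rows.toDF("index", "values")
val result = inputCols.zipWithIndex
  .foldLeft(base) { case (df, (c, i)) => df.withColumn(c, col("values")(i)) }
  .drop("values")
result.show()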

Related

Pyspark calculate average of non-zero elements for each column

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.2, -1.3), (0.0, 0.0, 0.0), (-17.2, 20.3, 15.2), (23.4, 1.4, 0.0)],
    ['col1', 'col2', 'col3'])
df1 = df.agg(F.avg('col1'))
df2 = df.agg(F.avg('col2'))
df3 = df.agg(F.avg('col3'))
If I have a dataframe,
ID COL1 COL2 COL3
1 0.0 1.2 -1.3
2 0.0 0.0 0.0
3 -17.2 20.3 15.2
4 23.4 1.4 0.0
I want to calculate the mean of the non-zero elements for each column:
avg1 avg2 avg3
1 3.1 7.6 6.9
The result of the above code is 1.55, 5.725, 3.475, which includes the zero elements in the averages.
How can I do it?
Null values do not affect the average, so if you turn the zero values into nulls you get the average of the non-zero values:
(
    df
    .agg(
        F.avg(F.when(F.col('col1') == 0, None).otherwise(F.col('col1'))).alias('avg(col1)'),
        F.avg(F.when(F.col('col2') == 0, None).otherwise(F.col('col2'))).alias('avg(col2)'),
        F.avg(F.when(F.col('col3') == 0, None).otherwise(F.col('col3'))).alias('avg(col3)'))
).show()
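If there are many columns, the same avg(when(...)) expression can be generated for every column in df.columns and passed to a single agg call, rather than writing one line per column.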

Convert org.apache.spark.mllib.linalg.Matrix to spark dataframe in Scala

I have an input dataframe input_df as:
+---------------+--------------------+
|Main_CustomerID| Vector|
+---------------+--------------------+
| 725153|[3.0,2.0,6.0,0.0,9.0|
| 873008|[4.0,1.0,0.0,1.0,...|
| 625109|[1.0,0.0,6.0,1.0,...|
| 817171|[0.0,4.0,0.0,7.0,...|
| 611498|[1.0,0.0,4.0,5.0,...|
+---------------+--------------------+
The schema of input_df is:
root
|-- Main_CustomerID: integer (nullable = true)
|-- Vector: vector (nullable = true)
Referring to Calculate Cosine Similarity Spark Dataframe, I have created the indexed row matrix, and then I do
val lm = irm.toIndexedRowMatrix.toBlockMatrix.toLocalMatrix
to find the cosine similarity between columns. Now I have the resulting mllib matrix:
cosineSimilarity: org.apache.spark.mllib.linalg.Matrix =
0.0 0.4199605255658081 0.5744269579035528 0.22075539284417395 0.561434614044346
0.0 0.0 0.2791452631195413 0.7259079527665503 0.6206918387272496
0.0 0.0 0.0 0.31792539222893695 0.6997167152675132
0.0 0.0 0.0 0.0 0.6776404124278828
0.0 0.0 0.0 0.0 0.0
Now I need to convert this lm, which is of type org.apache.spark.mllib.linalg.Matrix, into a dataframe. I expect my output dataframe to look as follows:
+---+------------------+------------------+-------------------+------------------+
| _1| _2| _3| _4| _5|
+---+------------------+------------------+-------------------+------------------+
|0.0|0.4199605255658081|0.5744269579035528|0.22075539284417395| 0.561434614044346|
|0.0| 0.0|0.2791452631195413| 0.7259079527665503|0.6206918387272496|
|0.0| 0.0| 0.0|0.31792539222893695|0.6997167152675132|
|0.0| 0.0| 0.0| 0.0|0.6776404124278828|
|0.0| 0.0| 0.0| 0.0| 0.0|
+---+------------------+------------------+-------------------+------------------+
How can I do this in Scala?
To convert the Matrix to a dataframe as specified, do the following. It first converts the matrix to a dataframe containing a single column with an array. Then foldLeft is used to break the array into separate columns.
import spark.implicits._

val cols = (0 until lm.numCols).toSeq
// One row of the matrix per dataframe row, held in a single array column
val df = lm.transpose
  .colIter.toSeq
  .map(_.toArray)
  .toDF("arr")
// Break the array out into the columns _1 .. _5
val df2 = cols.foldLeft(df)((df, i) => df.withColumn("_" + (i + 1), $"arr"(i)))
  .drop("arr")
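As a side note, on Spark 2.0+ Matrix also exposes rowIter, so the explicit transpose can be skipped; a sketch under that assumption:
// Equivalent starting point to df above, iterating the rows directly
val dfRows = lm.rowIter.toSeq.map(_.toArray).toDF("arr")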

How to evaluate binary classifier evaluation metrics per group (in scala)?

I have a dataframe which stores the scores and labels for the various binary classification problems that I have. For example:
| problem | score | label |
|:--------|:------|-------|
| a | 0.8 | true |
| a | 0.7 | true |
| a | 0.2 | false |
| b | 0.9 | false |
| b | 0.3 | true |
| b | 0.1 | false |
| ... | ... | ... |
Now my goal is to get binary evaluation metrics (take AreaUnderROC for example, see https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification) for each problem, with end result being something like:
| problem | areaUnderROC |
|:--------|:-------------|
| a       | 0.83         |
| b       | 0.68         |
| ...     | ...          |
I thought about doing something like:
df.groupBy("problem").agg(getMetrics)
but then I am not sure how to write getMetrics in terms of Aggregators (see https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html). Any suggestions?
There's a module built just for binary metrics - see it in the Python docs.
This code should work:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# BinaryClassificationMetrics expects an RDD of (score, label) pairs with numeric labels
score_and_labels_a = df.filter("problem = 'a'") \
    .select("score", "label") \
    .rdd.map(lambda r: (r["score"], 1.0 if r["label"] else 0.0))
metrics_a = BinaryClassificationMetrics(score_and_labels_a)
print(metrics_a.areaUnderROC)
print(metrics_a.areaUnderPR)

score_and_labels_b = df.filter("problem = 'b'") \
    .select("score", "label") \
    .rdd.map(lambda r: (r["score"], 1.0 if r["label"] else 0.0))
metrics_b = BinaryClassificationMetrics(score_and_labels_b)
print(metrics_b.areaUnderROC)
print(metrics_b.areaUnderPR)
... and so on for the other problems
This seems to me to be the easiest way :)
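Since the question asks for Scala, a rough translation of the same per-problem filtering approach might look like this (a sketch; it assumes a SparkSession named spark is in scope and that label is a boolean column):
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.functions.when
import spark.implicits._

// BinaryClassificationMetrics wants an RDD of (score, label) with numeric labels
val scoreAndLabelsA = df
  .filter($"problem" === "a")
  .select($"score", when($"label", 1.0).otherwise(0.0).as("label"))
  .as[(Double, Double)]
  .rdd

val metricsA = new BinaryClassificationMetrics(scoreAndLabelsA)
println(metricsA.areaUnderROC)
println(metricsA.areaUnderPR)
// ... and likewise for the other problems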
Spark has very useful classes to get metrics from binary or multiclass classification, but they are only available for the RDD-based API. So, with a little bit of code to move between dataframes and RDDs, it is possible. A full example could look like the following:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{BooleanType, DoubleType, StringType, StructField, StructType}

object TestMetrics {

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("Example")
        .master("local[1]")
        .getOrCreate()

    import spark.implicits._
    val sc = spark.sparkContext

    // Test data with your schema (labels as given in the question)
    val someData = Seq(
      Row("a", 0.8, true),
      Row("a", 0.7, true),
      Row("a", 0.2, false),
      Row("b", 0.9, false),
      Row("b", 0.3, true),
      Row("b", 0.1, false)
    )

    // Set your threshold to get a positive or a negative
    val threshold: Double = 0.5

    import org.apache.spark.sql.functions._

    // First udf to convert the probability into a positive or a negative
    def _thresholdUdf(threshold: Double): Double => Double = prob => if (prob > threshold) 1.0 else 0.0
    val thresholdUdf = udf { _thresholdUdf(threshold) }

    // Cast boolean to double
    val castToDouUdf = udf { (label: Boolean) => if (label) 1.0 else 0.0 }

    // Schema to build the dataframe
    val schema = List(StructField("problem", StringType), StructField("score", DoubleType), StructField("label", BooleanType))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))

    // Apply the first transformations to get the double representation of all fields
    val df0 = df.withColumn("binarypredict", thresholdUdf('score)).withColumn("labelDouble", castToDouUdf('label))

    // First loop to get the 'problems' list. Maybe it would be possible to do it all in one cycle
    val pbl = df0.select("problem").distinct().as[String].collect()

    // Get the RDD from the dataframe and build the Array[(String, BinaryClassificationMetrics)]
    val dfList = pbl.map(a => (a, new BinaryClassificationMetrics(df0.select("problem", "binarypredict", "labelDouble").as[(String, Double, Double)]
      .filter(el => el._1 == a).map { case (_, predict, label) => (predict, label) }.rdd)))

    // And the metrics for each 'problem' are available
    val results = dfList.toMap.mapValues(metrics =>
      Seq(metrics.areaUnderROC(),
        metrics.areaUnderPR()))

    val moreMetrics = dfList.toMap.map(metrics => (metrics._1, metrics._2.scoreAndLabels))

    // Get the metrics by key, in your case the 'problem'
    results.foreach(element => println(element))
    moreMetrics.foreach(element => element._2.foreach { pr => println(s"${element._1} ${pr}") })
    // Score and labels
  }
}

Convert arbitrary number of columns to Vector

How do you convert a group of arbitrary columns to an MLlib Vector?
Basically, the first column of my DataFrame has a fixed name, and then there are a number of arbitrarily named columns, each containing Double values,
like so:
name | a | b | c |
val1 | 0.0 | 1.0 | 1.0 |
val2 | 2.0 | 1.0 | 5.0 |
There could be any number of columns. I need to get a Dataset of the following type:
final case class ValuesRow(name: String, values: Vector)
This can be done in a simple way using VectorAssembler. The columns that are to be merged into a Vector are used as input, in this case all columns except the first.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector // the Vector type used in ValuesRow
import spark.implicits._

val df = spark.createDataFrame(Seq(("val1", 0, 1, 1), ("val2", 2, 1, 5)))
  .toDF("name", "a", "b", "c")

val columnNames = df.columns.drop(1) // drop the name column
val assembler = new VectorAssembler()
  .setInputCols(columnNames)
  .setOutputCol("values")
val df2 = assembler.transform(df).select("name", "values").as[ValuesRow]
The result will be a dataset containing the name and values columns:
+----+-------------+
|name| values|
+----+-------------+
|val1|[0.0,1.0,1.0]|
|val2|[2.0,1.0,5.0]|
+----+-------------+
Here's one way to do it:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
import spark.implicits._

val ds = Seq(
  ("val1", 0.0, 1.0, 1.0),
  ("val2", 2.0, 1.0, 5.0)
).toDF("name", "a", "b", "c").
  as[(String, Double, Double, Double)]

val colList = ds.columns
val keyCol = colList(0)
val valCols = colList.drop(1)

// Wrap the collected Doubles into an mllib DenseVector
def arrToVec = udf(
  (s: Seq[Double]) => new DenseVector(s.toArray)
)

ds.select(
  col(keyCol), arrToVec(array(valCols.map(x => col(x)): _*)).as("values")
).show
// +----+-------------+
// |name| values|
// +----+-------------+
// |val1|[0.0,1.0,1.0]|
// |val2|[2.0,1.0,5.0]|
// +----+-------------+

Scala - Spark: in a DataFrame, retrieve, for each row, the column name that has the max value

I have a DataFrame:
name column1 column2 column3 column4
first 2 1 2.1 5.4
test 1.5 0.5 0.9 3.7
choose 7 2.9 9.1 2.5
I want a new dataframe with a column containing, for each row, the name of the column that has the max value:
| name | max_column |
|--------|------------|
| first | column4 |
| test | column4 |
| choose | column3 |
Thank you very much for your support.
There might be a better way of writing the UDF, but this could be a working solution:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
//implicits for magic functions like .toDF
import spark.implicits._

//We have to hard-code the number of parameters, as a UDF doesn't support a variable number of args
val maxval = udf((c1: Double, c2: Double, c3: Double, c4: Double) =>
  if (c1 >= c2 && c1 >= c3 && c1 >= c4)
    "column1"
  else if (c2 >= c1 && c2 >= c3 && c2 >= c4)
    "column2"
  else if (c3 >= c1 && c3 >= c2 && c3 >= c4)
    "column3"
  else
    "column4"
)

//create the schema class
case class Record(name: String,
                  column1: Double,
                  column2: Double,
                  column3: Double,
                  column4: Double)

val df = Seq(
  Record("first", 2.0, 1, 2.1, 5.4),
  Record("test", 1.5, 0.5, 0.9, 3.7),
  Record("choose", 7, 2.9, 9.1, 2.5)
).toDF()

df.withColumn("max_column", maxval($"column1", $"column2", $"column3", $"column4"))
  .select("name", "max_column").show
Output
+------+----------+
| name|max_column|
+------+----------+
| first| column4|
| test| column4|
|choose| column3|
+------+----------+
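Not from the answers here, but for reference: on Spark 2.4+ the same result can also be had without a UDF or an RDD detour, by packing each value together with its column name into a struct and taking the array maximum (structs are ordered by their first field, i.e. the value, with ties broken by the name). A sketch:
// Alternative sketch (assumes Spark 2.4+ for array_max and a DataFrame df as above)
import org.apache.spark.sql.functions._

val valueCols = df.columns.filter(_ != "name")
val withMax = df
  .withColumn(
    "max_column",
    array_max(array(valueCols.map(c => struct(col(c).as("value"), lit(c).as("colName"))): _*))
      .getField("colName"))
  .select("name", "max_column")
withMax.show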
You can get the job done by making a detour to an RDD and using getValuesMap.
val dfIn = Seq(
  ("first", 2.0, 1.0, 2.1, 5.4),
  ("test", 1.5, 0.5, 0.9, 3.7),
  ("choose", 7.0, 2.9, 9.1, 2.5)
).toDF("name", "column1", "column2", "column3", "column4")
The simple solution is
val dfOut = dfIn.rdd
  .map(r => (
    r.getString(0),
    r.getValuesMap[Double](r.schema.fieldNames.filter(_ != "name"))
  ))
  .map { case (n, m) => (n, m.maxBy(_._2)._1) }
  .toDF("name", "max_column")
But if you want to keep all the columns from the original dataframe (as in Scala/Spark dataframes: find the column name corresponding to the max), you have to play a bit with merging rows and extending the schema:
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

val dfOut = sqlContext.createDataFrame(
  dfIn.rdd
    .map(r => (r, r.getValuesMap[Double](r.schema.fieldNames.drop(1))))
    .map { case (r, m) => Row.merge(r, Row(m.maxBy(_._2)._1)) },
  dfIn.schema.add(StructField("max_column", StringType))
)
I want to post my final solution:
val maxValAsMap = udf((keys: Seq[String], values: Seq[Any]) => {
  val valueMap: Map[String, Double] = (keys zip values).filter(_._2.isInstanceOf[Double]).map {
    case (x, y) => (x, y.asInstanceOf[Double])
  }.toMap
  if (valueMap.isEmpty) "not computed" else valueMap.maxBy(_._2)._1
})
val finalDf = originalDf.withColumn("max_column", maxValAsMap(keys, values)).select("cookie_id", "max_column")
It works very fast.
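The keys and values columns that this snippet assumes are not shown; one hypothetical way they might have been built (an assumption, not part of the original solution) is as array columns of the column names and of the corresponding row values:
// Hypothetical construction of the keys/values array columns used above
// (assumes originalDf has a cookie_id column plus numeric value columns)
import org.apache.spark.sql.functions.{array, col, lit}

val valueCols = originalDf.columns.filter(_ != "cookie_id")
val keys = array(valueCols.map(c => lit(c)): _*)
val values = array(valueCols.map(c => col(c)): _*)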