I have a dataframe that stores the scores and labels for various binary classification problems. For example:
| problem | score | label |
| a | 0.8 | true |
| a | 0.7 | true |
| a | 0.2 | false |
| b | 0.9 | false |
| b | 0.3 | true |
| b | 0.1 | false |
| ... | ... | ... |
Now my goal is to get binary evaluation metrics (take AreaUnderROC for example, see https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification) for each problem, with the end result being something like:
| problem | areaUnderROC |
| a | 0.83 |
| b | 0.68 |
| ... | ... |
I thought about doing something like:
but then I am not sure how to write getMetrics in terms of Aggregators (see https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html). Any suggestions?
There's a module built just for binary metrics - see it in the python docs
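This code should work:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# BinaryClassificationMetrics expects an RDD of (score, label) pairs of floats
to_pair = lambda r: (float(r["score"]), float(r["label"]))
score_and_labels_a = df.filter("problem = 'a'").select("score", "label").rdd.map(to_pair)
metrics_a = BinaryClassificationMetrics(score_and_labels_a)
score_and_labels_b = df.filter("problem = 'b'").select("score", "label").rdd.map(to_pair)
metrics_b = BinaryClassificationMetrics(score_and_labels_b)
# ... and so on for the other problems; read the value with e.g. metrics_a.areaUnderROC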
This seems to me to be the easiest way :)
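To sanity-check per-problem values on a small sample without a cluster, here is a minimal plain-Python sketch. It is not the Spark API: the helper names `auc` and `auc_by_problem` are made up for illustration, and the AUC is computed with the pairwise (rank-based) definition, which I would expect to agree with areaUnderROC on small unbinned data, though that is not verified here.

```python
from collections import defaultdict

def auc(pairs):
    """AUC for (score, label) pairs: the fraction of (positive, negative)
    pairs in which the positive has the higher score; ties count as half.
    Returns None when one of the two classes is missing."""
    pos = [s for s, y in pairs if y]
    neg = [s for s, y in pairs if not y]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_problem(rows):
    """Group (problem, score, label) rows by problem and compute one AUC each."""
    groups = defaultdict(list)
    for problem, score, label in rows:
        groups[problem].append((score, label))
    return {problem: auc(pairs) for problem, pairs in groups.items()}

# The six rows shown in the question
rows = [
    ("a", 0.8, True), ("a", 0.7, True), ("a", 0.2, False),
    ("b", 0.9, False), ("b", 0.3, True), ("b", 0.1, False),
]
result = auc_by_problem(rows)
print(result)  # {'a': 1.0, 'b': 0.5}
```

The grouping step is the part that corresponds to "one metric per problem"; in Spark the same grouping is what the per-problem filter above does one problem at a time.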
Spark has very useful classes for computing metrics for binary or multiclass classification, but they are only available in the RDD-based API. With a little code that moves between DataFrames and RDDs it is still possible. A full example could look like the following:
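import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
object TestMetrics {
def main(args: Array[String]) : Unit = {
// The session builder was cut off in the original; a local session is assumed here
implicit val spark: SparkSession = SparkSession.builder().master("local[*]").appName("TestMetrics").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext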
// Test data with your schema
val someData = Seq(
Row("a",0.8, true),
Row("a",0.7, true),
Row("a",0.2, true),
Row("b",0.9, true),
Row("b",0.3, true),
Row("b",0.1, true)
)
// Set your threshold to get a positive or negative
val threshold : Double = 0.5
import org.apache.spark.sql.functions._
// UDF factory: converts a probability into a positive (1.0) or negative (0.0) prediction
def _thresholdUdf(threshold: Double) : Double => Double = prob => if(prob > threshold) 1.0 else 0.0
val thresholdUdf = udf { _thresholdUdf(threshold)}
// Cast the boolean label to double
val castToDouUdf = udf { (label: Boolean) => if(label) 1.0 else 0.0 }
// Schema to build the dataframe
val schema = List(StructField("problem", StringType), StructField("score", DoubleType), StructField("label", BooleanType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))
// Apply first trans to get the double representation of all fields
val df0 = df.withColumn("binarypredict", thresholdUdf('score)).withColumn("labelDouble", castToDouUdf('label))
// First loop to get the 'problems list'. Maybe it would be possible to do all in one cycle
val pbl = df0.select("problem").distinct().as[String].collect()
// Get the RDD from dataframe and build the Array[(string, BinaryClassificationMetrics)]
val dfList = pbl.map(a => (a, new BinaryClassificationMetrics(df0.select("problem", "binarypredict", "labelDouble").as[(String, Double, Double)]
.filter(el => el._1 == a).map{ case (_, predict, label) => (predict, label)}.rdd)))
// And the metrics for each 'problem' are available
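// The body of this mapValues call was cut off in the original; areaUnderROC is assumed
// here because it is the metric the question asks about
val results = dfList.toMap.mapValues(metrics => metrics.areaUnderROC())
// Score and labels
val moreMetrics = dfList.toMap.map((metrics) => (metrics._1, metrics._2.scoreAndLabels))
// Get Metrics by key, in your case the 'problem'
results.foreach(element => println(element))
moreMetrics.foreach(element => element._2.foreach { pr => println(s"${element._1} ${pr}") })
}
}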
I'm using Scala on Spark and I have dense matrix like this:
res63: org.apache.spark.ml.linalg.DenseMatrix =
-0.26035262239241047 -0.9349256695883184
0.08719326360909431 -0.06393629243008418
0.006698866707269257 0.04124873027993731
0.011979122705128064 -0.005430767154896149
0.049075485175059275 0.04810618828561001
0.001605411530424612 0.015016736357799364
0.9587151228619724 -0.2534046936998956
-0.04668498310146597 0.06015550772431999
-0.022360873382091598 -0.22147143481166376
-0.014153052584280682 -0.025947327705852636
I want to use VectorAssembler to create a feature column, so I transform vp:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val c = vp.toArray.toSeq
val vp_df = c.toDF("number")
val vp_list = vp_df.collect.map(_.toSeq).flatten
val vp_string = vp_list.map(_.toString)
res64: Array[String] = Array(-0.26035262239241047, 0.08719326360909431, 0.006698866707269257, 0.011979122705128064, 0.049075485175059275, 0.001605411530424612, 0.9587151228619724, -0.04668498310146597, -0.022360873382091598, -0.014153052584280682, -0.9349256695883184, -0.06393629243008418, 0.04124873027993731, -0.005430767154896149, 0.04810618828561001, 0.015016736357799364, -0.2534046936998956, 0.06015550772431999, -0.22147143481166376, -0.025947327705852636)
Then I use VectorAssembler:
val assembler = new VectorAssembler().setInputCols(vp_string).setOutputCol("features")
val output = assembler.transform(vp_df)
But I get an error and I don't understand why:
IllegalArgumentException: -0.26035262239241047 does not exist. Available: number
I don't know how this is possible. I've used VectorAssembler several times and this is the first time I've seen this.
Your "number" column is already an array of doubles, so all you need is to convert this column into a dense vector.
val arrayToVectorUDF = udf((array : Seq[Double]) => {
Vectors.dense(array.toArray)
})
vp_df.withColumn("vector", arrayToVectorUDF(col("number")))
Update: I misunderstood your code.
The number column is a DoubleType column, so all you need to do is pass the column name to the vector assembler.
import org.apache.spark.ml.linalg.DenseMatrix
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val data = (1 to 20).map(_.toDouble).toArray
val dm = new DenseMatrix(2, 10, data)
val vp_df = dm.toArray.toSeq.toDF("number")
val assembler = new VectorAssembler().setInputCols(Array("number")).setOutputCol("features")
val output = assembler.transform(vp_df)
|[1.0] |
|[2.0] |
|[3.0] |
|[4.0] |
|[5.0] |
|[6.0] |
|[7.0] |
|[8.0] |
|[9.0] |
|[10.0] |
|[11.0] |
|[12.0] |
|[13.0] |
|[14.0] |
|[15.0] |
|[16.0] |
|[17.0] |
|[18.0] |
|[19.0] |
|[20.0] |
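The flat array in the question lists the ten values of the first matrix column before those of the second, which is consistent with DenseMatrix.toArray returning values in column-major order. If the goal is one feature vector per matrix row (an assumption; the question does not say so), the flat values need regrouping first. A minimal plain-Python sketch of that regrouping, with a hypothetical helper name and a small hypothetical 3 x 2 matrix:

```python
def rows_from_column_major(values, num_rows, num_cols):
    """Regroup a flat column-major array into a list of rows."""
    if len(values) != num_rows * num_cols:
        raise ValueError("values length does not match the given shape")
    # Element (r, c) of the matrix sits at index c * num_rows + r
    return [[values[c * num_rows + r] for c in range(num_cols)]
            for r in range(num_rows)]

# A 3 x 2 matrix stored as [column 0 values..., column 1 values...]
flat = [float(i) for i in range(1, 7)]
rows = rows_from_column_major(flat, num_rows=3, num_cols=2)
print(rows)  # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
```

Each inner list would then become one row with two numeric columns, and those two column names are what setInputCols expects.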
Let's say one has a number of files in a directory, each file being:
Adjacent lines with the same date and value correspond to the same event.
Two lines in two separate files can't be adjacent.
Expected result:
where eN is an arbitrary value (e1 <> e2 <> e3 ...) used to clarify the sample.
Does the following code provide a unique event id for all lines of all files?
case class Event(
LineNumber: Long, var EventId: Long,
Date: String, Value: String //,..
)
val lines = sc.textFile("theDirectory")
val rows = lines.filter(l => !l.startsWith("someString")).zipWithUniqueId
.map(l => l._2.toString +: l._1.split("""\|""", -1));
// Value is a String in Event, so the last seen value is tracked as a String too
var lastValue: String = "";
var lastDate: String = "00010101";
var eventId: Long = 0;
var rowDF = rows
.map(c => {
var e = Event(
c(0).toLong, 0, c(1), c(2) //,...
)
if ( e.Date != lastDate || e.Value != lastValue) {
lastDate = e.Date
lastValue = e.Value
eventId = e.LineNumber
}
e.EventId = eventId
e
})
Basically, I use the unique line number given by zipWithUniqueId as a key for a sequence of adjacent lines.
I think my underlying question is: is there a possibility that the second map operation splits the content of the files across multiple processes?
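For reference, the intended semantics can be written down sequentially, independent of how Spark partitions the input. This is a minimal plain-Python sketch with a hypothetical helper name and hypothetical sample lines; it only pins down what "same event" means. In Spark, variables captured by a map closure are copied into each task, so the lastDate/lastValue/eventId state above is not shared between partitions.

```python
def assign_event_ids(files):
    """files maps a file name to its (date, value) lines in file order.
    Returns (file, date, value, event_id) tuples with ids unique across files."""
    out = []
    next_id = 0
    for name in sorted(files):
        last_key = None  # reset per file: lines in different files are never adjacent
        for date, value in files[name]:
            if (date, value) != last_key:
                # A change in date or value starts a new event
                next_id += 1
                last_key = (date, value)
            out.append((name, date, value, next_id))
    return out

files = {
    "file1.txt": [("20100101", "14.00"), ("20100101", "14.00"), ("20100102", "36.00")],
    "file2.txt": [("20100102", "36.00"), ("20100101", "14.00")],
}
events = assign_event_ids(files)
print([e[3] for e in events])  # [1, 1, 2, 3, 4]
```

Note that the first line of file2.txt gets a new id even though it matches the last line of file1.txt, which is the "two lines in two separate files can't be adjacent" rule.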
Here is an idiomatic solution. Hope this helps. I have used filenames to distinguish files. A groupBy involving file name, zipindex and then join back to original input dataframe resulted in desired output.
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
scala> val lines = spark.read.textFile("file:///home/fsdjob/theDir").withColumn("filename", input_file_name())
scala> lines.show(false)
|value |filename |
scala> val linesGrpWithUid = lines.groupBy("value", "filename").count.drop("count").rdd.zipWithUniqueId
linesGrpWithUid: org.apache.spark.rdd.RDD[(org.apache.spark.sql.Row, Long)] = MapPartitionsRDD[135] at zipWithUniqueId at <console>:31
scala> val linesGrpWithIdRdd = linesGrpWithUid.map( x => { org.apache.spark.sql.Row(x._1.get(0),x._1.get(1), x._2) })
linesGrpWithIdRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[136] at map at <console>:31
scala> val schema =
| StructType(
| StructField("value", StringType, false) ::
| StructField("filename", StringType, false) ::
| StructField("id", LongType, false) ::
| Nil)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(value,StringType,false), StructField(filename,StringType,false), StructField(id,LongType,false))
scala> val linesGrpWithIdDF = spark.createDataFrame(linesGrpWithIdRdd, schema)
linesGrpWithIdDF: org.apache.spark.sql.DataFrame = [value: string, filename: string ... 1 more field]
scala> linesGrpWithIdDF.show(false)
|value |filename |id |
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|36.00|file:///home/fsdjob/theDir/file2.txt|6 |
|20100102|36.00|file:///home/fsdjob/theDir/file2.txt|20 |
|20100102|36.00|file:///home/fsdjob/theDir/file1.txt|30 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
scala> val output = lines.join(linesGrpWithIdDF, Seq("value", "filename"))
output: org.apache.spark.sql.DataFrame = [value: string, filename: string ... 1 more field]
scala> output.show(false)
|value |filename |id |
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|36.00|file:///home/fsdjob/theDir/file2.txt|6 |
|20100102|36.00|file:///home/fsdjob/theDir/file2.txt|20 |
|20100102|36.00|file:///home/fsdjob/theDir/file1.txt|30 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For instance,
id | name | likes
1 | Luke | [baseball, soccer]
should become
id | name | likes
1 | Luke | baseball
1 | Luke | soccer
This is my code
private DataFrame explodeDataFrame(DataFrame df) {
DataFrame resultDf = df;
for (StructField field : df.schema().fields()) {
if (field.dataType() instanceof ArrayType) {
resultDf = resultDf.withColumn(field.name(), org.apache.spark.sql.functions.explode(resultDf.col(field.name())));
}
}
return resultDf;
}
The problem is that in my data, some of the array columns have nulls. In that case, the entire row is deleted. So this dataframe:
id | name | likes
1 | Luke | [baseball, soccer]
2 | Lucy | null
becomes
id | name | likes
1 | Luke | baseball
1 | Luke | soccer
instead of
id | name | likes
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
How can I explode my arrays so that I don't lose the null rows?
I am using Spark 1.5.2 and Java 8
Spark 2.2+
You can use explode_outer function:
import org.apache.spark.sql.functions.explode_outer
df.withColumn("likes", explode_outer($"likes")).show
// +---+----+--------+
// | id|name| likes|
// +---+----+--------+
// | 1|Luke|baseball|
// | 1|Luke| soccer|
// | 2|Lucy| null|
// +---+----+--------+
Spark <= 2.1
In Scala but Java equivalent should be almost identical (to import individual functions use import static).
import org.apache.spark.sql.functions.{array, col, explode, lit, when}
val df = Seq(
(1, "Luke", Some(Array("baseball", "soccer"))),
(2, "Lucy", None)
).toDF("id", "name", "likes")
df.withColumn("likes", explode(
when(col("likes").isNotNull, col("likes"))
// If null explode an array<string> with a single null
.otherwise(array(lit(null).cast("string")))))
The idea here is basically to replace NULL with an array(NULL) of a desired type. For complex type (a.k.a structs) you have to provide full schema:
import org.apache.spark.sql.types._
val dfStruct = Seq((1L, Some(Array((1, "a")))), (2L, None)).toDF("x", "y")
val st = StructType(Seq(
StructField("_1", IntegerType, false), StructField("_2", StringType, true)
))
dfStruct.withColumn("y", explode(
when(col("y").isNotNull, col("y"))
.otherwise(array(lit(null).cast(st)))))
or
dfStruct.withColumn("y", explode(
when(col("y").isNotNull, col("y"))
.otherwise(array(lit(null).cast("struct<_1:int,_2:string>")))))
If array Column has been created with containsNull set to false you should change this first (tested with Spark 2.1):
df.withColumn("array_column", $"array_column".cast(ArrayType(SomeType, true)))
You can use explode_outer() function.
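To make the difference between the two functions concrete, here is a minimal plain-Python sketch of their row-level behaviour on a list of dicts. It is not Spark code: the helper functions below only imitate the semantics, and the rows are the two from the question.

```python
def explode(rows, column):
    """One output row per array element; rows with a null or empty array vanish."""
    out = []
    for row in rows:
        for value in row[column] or []:
            out.append({**row, column: value})
    return out

def explode_outer(rows, column):
    """Like explode, but a null or empty array yields one row holding None."""
    out = []
    for row in rows:
        values = row[column]
        if not values:
            out.append({**row, column: None})
        else:
            out.extend({**row, column: value} for value in values)
    return out

rows = [
    {"id": 1, "name": "Luke", "likes": ["baseball", "soccer"]},
    {"id": 2, "name": "Lucy", "likes": None},
]
print(len(explode(rows, "likes")))        # 2
print(len(explode_outer(rows, "likes")))  # 3
print(explode_outer(rows, "likes")[-1])   # {'id': 2, 'name': 'Lucy', 'likes': None}
```

The when/otherwise workaround for older Spark versions gets the second behaviour out of the first by swapping the null for a one-element array before exploding.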
Following up on the accepted answer, when the array elements are a complex type it can be difficult to define it by hand (e.g with large structs).
To do it automatically I wrote the following helper method:
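def explodeOuter(df: Dataset[Row], columnsToExplode: List[String]) = {
// Map each array column name to its ArrayType ("type" is a reserved word in Scala, hence "dataType")
val arrayFields = df.schema.fields
.map(field => field.name -> field.dataType)
.collect { case (name: String, dataType: ArrayType) => (name, dataType) }
.toMap
columnsToExplode.foldLeft(df) { (dataFrame, arrayCol) =>
// size(...) > 0 is not true for empty or null arrays, so both fall through to otherwise
dataFrame.withColumn(arrayCol, explode(when(size(col(arrayCol)) > 0, col(arrayCol))
.otherwise(array(lit(null).cast(arrayFields(arrayCol).elementType)))))
}
}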
Edit: it seems that spark 2.2 and newer have this built in.
To handle an empty map type column (for Spark <= 2.1):
val df = List((1, Array(2, 3, 4), Map(1 -> "a")),
(2, Array(5, 6, 7), Map(2 -> "b")),
(3, Array[Int](), Map[Int, String]())).toDF("col1", "col2", "col3")
df.show()
df.select('col1, explode(when(size(map_keys('col3)) === 0, map(lit("null"), lit("null"))).
otherwise('col3))).show()
from pyspark.sql.functions import array, arrays_zip, col, explode, lit, when
def flatten_df(nested_df):
    # Promote every struct field to a top-level column named parent_child
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    flat_df = nested_df.select(flat_cols +
                               [col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    print("flatten_df_count :", flat_df.count())
    return flat_df
def explode_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct' and c[1][:5] != 'array']
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for array_col in array_cols:
        # Use the DataFrame passed in (the original read the global new_df here)
        schema = nested_df.select(array_col).dtypes[0][1]
        # Replace a null array with a one-element array holding null so explode keeps the row
        nested_df = nested_df.withColumn(array_col, when(col(array_col).isNotNull(), col(array_col)).otherwise(array(lit(None)).cast(schema)))
    nested_df = nested_df.withColumn("tmp", arrays_zip(*array_cols)).withColumn("tmp", explode("tmp")).select([col("tmp."+c).alias(c) for c in array_cols] + flat_cols)
    print("explode_dfs_count :", nested_df.count())
    return nested_df
new_df = flatten_df(myDf)
while True:
    array_cols = [c[0] for c in new_df.dtypes if c[1][:5] == 'array']
    if len(array_cols):
        new_df = flatten_df(explode_df(new_df))
    else:
        # No array columns left, so the DataFrame is fully flattened
        break
Used arrays_zip and explode to do it faster and address the null issue.
I have data in one RDD and the data is as follows:
scala> c_data
res31: org.apache.spark.rdd.RDD[String] = /home/t_csv MapPartitionsRDD[26] at textFile at <console>:25
scala> c_data.count()
res29: Long = 45212
scala> c_data.take(2).foreach(println)
I want to split the data into another rdd and I am using:
scala> val csv_data = c_data.map{x=>
| val w = x.split(";")
| val age = w(0)
| val job = w(1)
| val marital_stat = w(2)
| val education = w(3)
| val default = w(4)
| val balance = w(5)
| val housing = w(6)
| val loan = w(7)
| val contact = w(8)
| val day = w(9)
| val month = w(10)
| val duration = w(11)
| val campaign = w(12)
| val pdays = w(13)
| val previous = w(14)
| val poutcome = w(15)
| val Y = w(16)
| }
that returns :
csv_data: org.apache.spark.rdd.RDD[Unit] = MapPartitionsRDD[28] at map at <console>:27
When I query csv_data it returns Array((),....).
How can I get the data with the first row as header and the rest as data?
What am I doing wrong?
Thanks in advance.
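Your mapping function returns Unit, so you map to an RDD[Unit]: the value of a Scala block is its last expression, and a block that ends in a val definition evaluates to Unit. You can get a tuple of your values by changing your code to
val csv_data = c_data.map{x=>
val w = x.split(";")
// ... the remaining val definitions from the question (age through poutcome) ...
val Y = w(16)
(w, age, job, marital_stat, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome, Y)
}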
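For the header part of the question, the idea is to split off the first line, use it as the column names, and map every remaining line against it. A minimal plain-Python sketch of that idea follows; the helper names and the three-column sample lines are hypothetical and much shorter than the real file. The second function mirrors the original mistake: it binds values but returns nothing.

```python
def parse_with_header(lines, sep=";"):
    """Split the first line into column names and map every later line to a dict."""
    it = iter(lines)
    header = next(it).split(sep)
    records = [dict(zip(header, line.split(sep))) for line in it]
    return header, records

def no_return(line):
    # Mirrors the question's map body: values are bound but nothing is returned
    w = line.split(";")
    age = w[0]

sample = ["age;job;marital", "58;management;married", "44;technician;single"]
header, records = parse_with_header(sample)
print(header)                # ['age', 'job', 'marital']
print(records[0])            # {'age': '58', 'job': 'management', 'marital': 'married'}
print(no_return(sample[1]))  # None
```

The None printed at the end is the Python counterpart of the () values seen in the RDD[Unit].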
I am new to Apache Spark and Scala. I have a data set like this, which I read from a CSV file and convert into an RDD using Scala.
| recent | Freq | Monitor |
| 1 | 1234 | 199090|
| 4 | 2553| 198613|
| 6 | 3232 | 199090|
| 1 | 8823 | 498831|
| 7 | 2902 | 890000|
| 8 | 7991 | 081097|
| 9 | 7391 | 432370|
| 12 | 6138 | 864981|
| 7 | 6812 | 749821|
I want to calculate the z-score, i.e. to standardize the data. So I am calculating the z-score for each column and then trying to combine them to get a standard scale.
Here is my code for calculating the z-score for the first column:
// Parse the first column as Double once and reuse it below
val scores = sorted.map(_.split(",")(0).toDouble).cache
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
// Population standard deviation (divides by count)
val stddev = Math.sqrt(devs.sum / count)
val zscore = scores.map(x => math.round((x - mean) / stddev))
How do I calculate this for each column? Or is there another way to normalize or standardize the data?
My requirement is to assign the rank (or scale).
If you want to standardize the columns, you can use the StandardScaler class from Spark MLlib. The data should be in the form of an RDD[Vector], where Vector comes from the MLlib linalg package. You can choose to center with the mean, scale by the standard deviation, or both.
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
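val data = sc.parallelize(Array(
// The rows were cut off in the original; these are the first rows of the table in the question
Array(1.0, 1234.0, 199090.0),
Array(4.0, 2553.0, 198613.0),
Array(6.0, 3232.0, 199090.0)))
// Converting RDD[Array] to RDD[Vectors]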
val features = data.map(a => Vectors.dense(a))
// Creating a Scaler model that standardizes with both mean and SD
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
// Scale features using the scaler model
val scaledFeatures = scaler.transform(features)
This scaledFeatures RDD contains the Z-score of all columns.
Hope this answer helps. Check the Documentation for more info.
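To see what a per-column z-score looks like on a tiny input, here is a minimal plain-Python sketch. The function name, the `sample` flag and the 3 x 2 data are hypothetical. As far as I know StandardScaler scales by the corrected sample standard deviation (dividing by n - 1), while the code in the question divides by n, so the sketch exposes both.

```python
import math

def zscores(rows, sample=True):
    """Standardize every column of a list of numeric rows.
    sample=True divides the squared deviations by n - 1, sample=False by n."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    denom = n - 1 if sample else n
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / denom)
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = zscores(data)
print(scaled)  # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```

With sample=False the first entry becomes roughly -1.22 instead of -1.0, which is the kind of difference to expect when comparing hand-computed values against the scaler output.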
You may want to use the code below to perform standard scaling on the required columns. VectorAssembler is used to select the columns that need to be transformed. StandardScaler also lets you choose, through setWithMean and setWithStd, whether to center by the mean and scale by the standard deviation.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature
import org.apache.spark.ml.feature.StandardScaler
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/hadoop/data/your_dataset.csv")
val assembler = new VectorAssembler().setInputCols(Array("recent","Freq","Monitor")).setOutputCol("features")
val transformVector = assembler.transform(df)
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false)
val scalerModel = scaler.fit(transformVector)
val scaledData = scalerModel.transform(transformVector)
scaledData.show(20, false)