I am using spark-sql 2.4.1 with Java 8.
I have data like below:
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979),
("Indus_2","Indus_2_Name","Country1", "State2",21789933),
("Indus_3","Indus_3_Name","Country1", "State3",21789978),
("Indus_4","Indus_4_Name","Country2", "State1",41789978),
("Indus_5","Indus_5_Name","Country3", "State3",27789978),
("Indus_6","Indus_6_Name","Country1", "State1",27899790),
("Indus_7","Indus_7_Name","Country3", "State1",27899790),
("Indus_8","Indus_8_Name","Country1", "State2",27899790),
("Indus_9","Indus_9_Name","Country4", "State1",27899790)
).toDF("industry_id","industry_name","country","state","revenue");
Given the below input lists:
val countryList = Seq("Country1","Country2");
val stateMap = Map("Country1" -> Seq("State1","State2"), "Country2" -> Seq("State2","State3"));
In the Spark job, for each country and each state I need to calculate the total revenue of a few industries.
In other languages we would do this with a for loop, i.e.:
for( country <- countryList ){
  for( state <- stateMap.getOrElse(country, Seq.empty) ){
    // do some calculation for each state's industries
  }
}
From what I understand, if I do it this way in Spark, not all executors are utilized.
So what is the correct way to handle this?
I have added a few extra rows to your sample data to differentiate the aggregation. I have used a Scala parallel collection: for each country it gets the states, uses those values to filter the given DataFrame, performs the aggregation, and at the end unions all the results back together.
scala> val df = Seq(
| ("Indus_1","Indus_1_Name","Country1", "State1",12789979),
| ("Indus_2","Indus_2_Name","Country1", "State2",21789933),
| ("Indus_2","Indus_2_Name","Country1", "State2",31789933),
| ("Indus_3","Indus_3_Name","Country1", "State3",21789978),
| ("Indus_4","Indus_4_Name","Country2", "State1",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State2",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State2",81789978),
| ("Indus_4","Indus_4_Name","Country2", "State3",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State3",51789978),
| ("Indus_5","Indus_5_Name","Country3", "State3",27789978),
| ("Indus_6","Indus_6_Name","Country1", "State1",27899790),
| ("Indus_7","Indus_7_Name","Country3", "State1",27899790),
| ("Indus_8","Indus_8_Name","Country1", "State2",27899790),
| ("Indus_9","Indus_9_Name","Country4", "State1",27899790)
| ).toDF("industry_id","industry_name","country","state","revenue")
df: org.apache.spark.sql.DataFrame = [industry_id: string, industry_name: string ... 3 more fields]
scala> val countryList = Seq("Country1","Country2","Country4","Country5");
countryList: Seq[String] = List(Country1, Country2, Country4, Country5)
scala> val stateMap = Map("Country1" -> ("State1","State2"), "Country2" -> ("State2","State3"),"Country3" -> ("State31","State32"));
stateMap: scala.collection.immutable.Map[String,(String, String)] = Map(Country1 -> (State1,State2), Country2 -> (State2,State3), Country3 -> (State31,State32))
scala>
scala> :paste
// Entering paste mode (ctrl-D to finish)
countryList
.par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map{data =>
df.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
}.reduce(_ union _).show(false)
// Exiting paste mode, now interpreting.
+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790 |
|Country1|State1|Indus_6_Name |27899790 |
|Country1|State2|Indus_2_Name |53579866 |
|Country1|State1|Indus_1_Name |12789979 |
|Country2|State3|Indus_4_Name |93579956 |
|Country2|State2|Indus_4_Name |123579956 |
+--------+------+-------------+-------------+
scala>
Edit 1: Separated the aggregation code into a separate function.
scala> def processDF(data:(String,(String,String)),adf:DataFrame) = adf.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
processDF: (data: (String, (String, String)), adf: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
scala> :paste
// Entering paste mode (ctrl-D to finish)
countryList.
par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map(data => processDF(data,df))
.reduce(_ union _)
.show(false)
// Exiting paste mode, now interpreting.
+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790 |
|Country1|State1|Indus_6_Name |27899790 |
|Country1|State2|Indus_2_Name |53579866 |
|Country1|State1|Indus_1_Name |12789979 |
|Country2|State3|Indus_4_Name |93579956 |
|Country2|State2|Indus_4_Name |123579956 |
+--------+------+-------------+-------------+
scala>
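If the states were kept as a Seq per country (as in the question) instead of a pair, a sketch of a processDF variant using isin could look like this (assuming org.apache.spark.sql.functions._ and spark.implicits._ are in scope, as for processDF above; not tested against your data):

def processDF2(country: String, states: Seq[String], adf: DataFrame): DataFrame =
  adf.filter($"country" === country && $"state".isin(states: _*))
    .groupBy("country", "state", "industry_name")
    .agg(sum("revenue").as("total_revenue"))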
It really depends on what you want to do. If you don't need to share state between (country, state) pairs, then you should create a DataFrame where each row is (country, state); then you can control how many rows are processed in parallel (via the number of partitions and the number of cores).
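A minimal sketch of that DataFrame idea, reusing df, countryList and stateMap (as Map[String, Seq[String]]) from the question, and assuming spark.implicits._ and org.apache.spark.sql.functions._ are in scope:

val pairsDF = countryList
  .flatMap(c => stateMap.getOrElse(c, Seq.empty).map(s => (c, s)))
  .toDF("country", "state")

// one join + one aggregation, executed as a single job across all executors
df.join(broadcast(pairsDF), Seq("country", "state"))
  .groupBy("country", "state", "industry_name")
  .agg(sum("revenue").as("total_revenue"))
  .show(false)

The parallelism then comes from the partitioning of df (e.g. via repartition) and the number of executor cores.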
Alternatively, you can use flatMapValues to create the key-value pairs and then do your calculations in the .map step.
scala> val data = Seq(("country1",Seq("state1","state2","state3")),("country2",Seq("state1","state2")))
scala> val rdd = sc.parallelize(data)
scala> val rdd2 = rdd.flatMapValues(s=>s)
scala> rdd2.foreach(println(_))
(country1,state1)
(country2,state1)
(country1,state2)
(country2,state2)
(country1,state3)
Here you can perform operations; I've added a # to each state:
scala> rdd2.map(s=>(s._1,s._2+"#")).foreach(println(_))
(country1,state1#)
(country1,state2#)
(country1,state3#)
(country2,state1#)
(country2,state2#)
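To control how many of those pairs are processed in parallel, the partition count can be set explicitly (4 below is just an example value):

// choose the number of partitions when parallelizing, or repartition later
val rdd = sc.parallelize(data, 4)
val rdd2 = rdd.flatMapValues(s => s)
println(rdd2.getNumPartitions)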
I want to achieve the below for a Spark DataFrame: I want to keep appending new rows to a DataFrame, as shown in the example below.
for(a<- value)
{
val num = a
val count = a+10
//creating a df with the above values//
val data = Seq((num.asInstanceOf[Double], count.asInstanceOf[Double]))
val row = spark.sparkContext.parallelize(data).toDF("Number","count")
val data2 = data1.union(row)
val data1 = data2 --> currently this assignment is not possible.
}
I have also tried
for(a<- value)
{
val num = a
val count = a+10
//creating a df with the above values//
val data = Seq((num.asInstanceOf[Double], count.asInstanceOf[Double]))
val row = spark.sparkContext.parallelize(data).toDF("Number","count")
val data1 = data1.union(row) --> Union with self is not possible
}
How can I achieve this in Spark?
DataFrames are immutable, so you will need to use a mutable structure. Here is a solution that might help you.
scala> val value = Array(1.0, 2.0, 55.0)
value: Array[Double] = Array(1.0, 2.0, 55.0)
scala> import scala.collection.mutable.ListBuffer
import scala.collection.mutable.ListBuffer
scala> var data = new ListBuffer[(Double, Double)]
data: scala.collection.mutable.ListBuffer[(Double, Double)] = ListBuffer()
scala> for(a <- value)
| {
| val num = a
| val count = a+10
| data += ((num.asInstanceOf[Double], count.asInstanceOf[Double]))
| println(data)
| }
ListBuffer((1.0,11.0))
ListBuffer((1.0,11.0), (2.0,12.0))
ListBuffer((1.0,11.0), (2.0,12.0), (55.0,65.0))
scala> val DF = spark.sparkContext.parallelize(data).toDF("Number","count")
DF: org.apache.spark.sql.DataFrame = [Number: double, count: double]
scala> DF.show()
+------+-----+
|Number|count|
+------+-----+
| 1.0| 11.0|
| 2.0| 12.0|
| 55.0| 65.0|
+------+-----+
scala>
Just create one DataFrame using the for-loop and then union with data1 like this:
val df = ( for(a <- values) yield (a, a+10) ).toDF("Number", "count")
val result = data1.union(df)
This would be much more efficient than doing unions inside the for-loop.
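For reference, a minimal usage sketch of this approach (the sample values are made up; spark.implicits._ is assumed to be in scope and data1 to already exist):

import spark.implicits._

val values = Seq(1.0, 2.0, 55.0)   // example input
val df = ( for(a <- values) yield (a, a + 10) ).toDF("Number", "count")
val result = data1.union(df)       // single union with the existing data1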
Your data1 must be declared as a var:
var data1:DataFrame = ???
for(a<- value)
{
val num = a
val count = a+10
//creating a df with the above values//
val data = Seq((num.toDouble, count.toDouble))
val row = spark.sparkContext.parallelize(data).toDF("Number","count")
val data2 = data1.union(row)
data1 = data2
}
But I would not suggest doing this; better to convert your entire value (it must be a Seq?) to a DataFrame, then union once. Many unions tend to be inefficient:
val newDF = value.toDF("Number")
.withColumn("count",$"Number" + 10)
val result= data1.union(newDF)
Let's say one has a number of files in a directory, each file looking like:
File1
20100101|12.34|...
20100101|12.34|...
20100101|36.00|...
20100102|36.00|...
20100101|14.00|...
20100101|14.00|...
File2
20100101|12.34|...
20100101|12.34|...
20100101|36.00|...
20100102|36.00|...
20100101|14.00|...
20100101|14.00|...
Adjacent lines with the same date and value correspond to the same event.
Two lines in two separate files can't be adjacent.
Expected result:
e1|20100101|12.34|...
e1|20100101|12.34|...
e2|20100101|36.00|...
e3|20100102|36.00|...
e4|20100101|14.00|...
e4|20100101|14.00|...
e5|20100101|12.34|...
e5|20100101|12.34|...
e6|20100101|36.00|...
e7|20100102|36.00|...
e8|20100101|14.00|...
e8|20100101|14.00|...
where eN is an arbitrary value (e1 <> e2 <> e3 ...) used to clarify the sample.
Does the following code assign a unique event id to all lines of all files?
case class Event(
  LineNumber: Long, var EventId: Long,
  Date: String, Value: String //,..
)

val lines = sc.textFile("theDirectory")

val rows = lines.filter(l => !l.startsWith("someString")).zipWithUniqueId
  .map(l => l._2.toString +: l._1.split("""\|""", -1));

var lastValue: String = "";
var lastDate: String = "00010101";
var eventId: Long = 0;

var rowDF = rows
  .map(c => {
    var e = Event(
      c(0).toLong, 0, c(1), c(2) //,...
    );
    if (e.Date != lastDate || e.Value != lastValue) {
      lastDate = e.Date
      lastValue = e.Value
      eventId = e.LineNumber
    }
    e.EventId = eventId
    e
  }).toDF();
Basically I use the unique line number given by zipWithUniqueId as the key for a sequence of adjacent lines.
I think my underlying question is: is there a probability that the second map operation splits the content of the files across multiple processes?
Here is an idiomatic solution; hope this helps. I have used file names to distinguish the files. A groupBy involving the file name and a zip index, followed by a join back to the original input DataFrame, produces the desired output.
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
scala> val lines = spark.read.textFile("file:///home/fsdjob/theDir").withColumn("filename", input_file_name())
scala> lines.show(false)
+--------------+------------------------------------+
|value |filename |
+--------------+------------------------------------+
|20100101|12.34|file:///home/fsdjob/theDir/file1.txt|
|20100101|12.34|file:///home/fsdjob/theDir/file1.txt|
|20100101|36.00|file:///home/fsdjob/theDir/file1.txt|
|20100102|36.00|file:///home/fsdjob/theDir/file1.txt|
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|
|20100101|36.00|file:///home/fsdjob/theDir/file2.txt|
|20100102|36.00|file:///home/fsdjob/theDir/file2.txt|
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|
+--------------+------------------------------------+
scala> val linesGrpWithUid = lines.groupBy("value", "filename").count.drop("count").rdd.zipWithUniqueId
linesGrpWithUid: org.apache.spark.rdd.RDD[(org.apache.spark.sql.Row, Long)] = MapPartitionsRDD[135] at zipWithUniqueId at <console>:31
scala> val linesGrpWithIdRdd = linesGrpWithUid.map( x => { org.apache.spark.sql.Row(x._1.get(0),x._1.get(1), x._2) })
linesGrpWithIdRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[136] at map at <console>:31
scala> val schema =
| StructType(
| StructField("value", StringType, false) ::
| StructField("filename", StringType, false) ::
| StructField("id", LongType, false) ::
| Nil)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(value,StringType,false), StructField(filename,StringType,false), StructField(id,LongType,false))
scala> val linesGrpWithIdDF = spark.createDataFrame(linesGrpWithIdRdd, schema)
linesGrpWithIdDF: org.apache.spark.sql.DataFrame = [value: string, filename: string ... 1 more field]
scala> linesGrpWithIdDF.show(false)
+--------------+------------------------------------+---+
|value |filename |id |
+--------------+------------------------------------+---+
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|36.00|file:///home/fsdjob/theDir/file2.txt|6 |
|20100102|36.00|file:///home/fsdjob/theDir/file2.txt|20 |
|20100102|36.00|file:///home/fsdjob/theDir/file1.txt|30 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
|20100101|36.00|file:///home/fsdjob/theDir/file1.txt|146|
|20100101|12.34|file:///home/fsdjob/theDir/file1.txt|165|
+--------------+------------------------------------+---+
scala> val output = lines.join(linesGrpWithIdDF, Seq("value", "filename"))
output: org.apache.spark.sql.DataFrame = [value: string, filename: string ... 1 more field]
scala> output.show(false)
+--------------+------------------------------------+---+
|value |filename |id |
+--------------+------------------------------------+---+
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|12.34|file:///home/fsdjob/theDir/file2.txt|3 |
|20100101|36.00|file:///home/fsdjob/theDir/file2.txt|6 |
|20100102|36.00|file:///home/fsdjob/theDir/file2.txt|20 |
|20100102|36.00|file:///home/fsdjob/theDir/file1.txt|30 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file1.txt|36 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
|20100101|14.00|file:///home/fsdjob/theDir/file2.txt|56 |
|20100101|36.00|file:///home/fsdjob/theDir/file1.txt|146|
|20100101|12.34|file:///home/fsdjob/theDir/file1.txt|165|
|20100101|12.34|file:///home/fsdjob/theDir/file1.txt|165|
+--------------+------------------------------------+---+
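If strict adjacency matters (identical lines that are not next to each other should become different events), a window-based sketch is another option. It assumes a DataFrame named parsed with columns filename, lineNo and value, where lineNo is a per-file line number (e.g. built with zipWithIndex); neither column is produced by the code above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("filename").orderBy("lineNo")

val withEvents = parsed
  // flag rows whose value differs from the previous line of the same file
  .withColumn("changed",
    when(lag("value", 1).over(w).isNull || lag("value", 1).over(w) =!= col("value"), 1).otherwise(0))
  // a running sum of the flags gives a per-file event id; (filename, eventId) is unique overall
  .withColumn("eventId", sum("changed").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))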
I have a DataFrame which stores the scores and labels for various binary classification problems that I have. For example:
| problem | score | label |
|:--------|:------|-------|
| a | 0.8 | true |
| a | 0.7 | true |
| a | 0.2 | false |
| b | 0.9 | false |
| b | 0.3 | true |
| b | 0.1 | false |
| ... | ... | ... |
Now my goal is to get binary evaluation metrics (take areaUnderROC for example, see https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification) for each problem, with the end result being something like:
| problem | areaUnderROC |
|:--------|:-------------|
| a       | 0.83         |
| b       | 0.68         |
| ...     | ...          |
I thought about doing something like:
df.groupBy("problem").agg(getMetrics)
but then I am not sure how to write getMetrics in terms of Aggregators (see https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html). Any suggestions?
There's a module built just for binary metrics; see it in the Python docs.
This code should work:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# BinaryClassificationMetrics expects an RDD of (score, label) pairs with numeric labels
score_and_labels_a = df.filter("problem = 'a'").select("score", "label") \
    .rdd.map(lambda r: (float(r.score), float(r.label)))
metrics_a = BinaryClassificationMetrics(score_and_labels_a)
print(metrics_a.areaUnderROC)
print(metrics_a.areaUnderPR)

score_and_labels_b = df.filter("problem = 'b'").select("score", "label") \
    .rdd.map(lambda r: (float(r.score), float(r.label)))
metrics_b = BinaryClassificationMetrics(score_and_labels_b)
print(metrics_b.areaUnderROC)
print(metrics_b.areaUnderPR)
... and so on for the other problems
This seems to me to be the easiest way :)
Spark has very useful classes for getting metrics from binary or multiclass classification, but they are only available in the RDD-based API. So, with a little bit of code and some moving back and forth between DataFrames and RDDs, it is possible. A full example could look like the following:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{BooleanType, DoubleType, StringType, StructField, StructType}

object TestMetrics {

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("Example")
        .master("local[1]")
        .getOrCreate()

    import spark.implicits._
    val sc = spark.sparkContext

    // Test data with your schema (labels taken from the question's example)
    val someData = Seq(
      Row("a", 0.8, true),
      Row("a", 0.7, true),
      Row("a", 0.2, false),
      Row("b", 0.9, false),
      Row("b", 0.3, true),
      Row("b", 0.1, false)
    )

    // Set your threshold to get a positive or negative
    val threshold: Double = 0.5

    import org.apache.spark.sql.functions._

    // First udf to convert a probability into a positive or negative prediction
    def _thresholdUdf(threshold: Double): Double => Double = prob => if (prob > threshold) 1.0 else 0.0

    // Cast boolean to double
    val thresholdUdf = udf { _thresholdUdf(threshold) }
    val castToDouUdf = udf { (label: Boolean) => if (label) 1.0 else 0.0 }

    // Schema to build the dataframe
    val schema = List(StructField("problem", StringType), StructField("score", DoubleType), StructField("label", BooleanType))

    val df = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))

    // Apply the first transformation to get the double representation of all fields
    val df0 = df.withColumn("binarypredict", thresholdUdf('score)).withColumn("labelDouble", castToDouUdf('label))

    // First pass to get the list of 'problems'. Maybe it would be possible to do it all in one cycle
    val pbl = df0.select("problem").distinct().as[String].collect()

    // Get the RDD from the dataframe and build the Array[(String, BinaryClassificationMetrics)]
    val dfList = pbl.map(a => (a, new BinaryClassificationMetrics(
      df0.select("problem", "binarypredict", "labelDouble").as[(String, Double, Double)]
        .filter(el => el._1 == a).map { case (_, predict, label) => (predict, label) }.rdd)))

    // And the metrics for each 'problem' are available
    val results = dfList.toMap.mapValues(metrics =>
      Seq(metrics.areaUnderROC(),
          metrics.areaUnderPR()))

    val moreMetrics = dfList.toMap.map(metrics => (metrics._1, metrics._2.scoreAndLabels))

    // Get the metrics by key, in your case the 'problem'
    results.foreach(element => println(element))

    // Score and labels
    moreMetrics.foreach(element => element._2.foreach { pr => println(s"${element._1} ${pr}") })
  }
}
I have this input DataFrame
input_df:
| C1 | C2 | C3         |
|----|----|------------|
| A  | 1  | 12/06/2012 |
| A  | 2  | 13/06/2012 |
| B  | 3  | 12/06/2012 |
| B  | 4  | 17/06/2012 |
| C  | 5  | 14/06/2012 |
and after transformations, I want to get this kind of DataFrame, grouping by C1 and creating a C4 column which is formed by a list of (C2, C3) pairs:
output_df:
| C1 | C4                               |
|----|----------------------------------|
| A  | (1, 12/06/2012), (2, 13/06/2012) |
| B  | (3, 12/06/2012), (4, 17/06/2012) |
| C  | (5, 14/06/2012)                  |
I get close to the result when I try this:
val output_df = input_df.map(x => (x(0), (x(1), x(2))) ).groupByKey()
I obtain this result
(A,CompactBuffer((1, 12/06/2012), (2, 13/06/2012)))
(B,CompactBuffer((3, 12/06/2012), (4, 17/06/2012)))
(C,CompactBuffer((5, 14/06/2012)))
But I don't know how to convert this into a DataFrame, or whether this is the right way to do it.
Any advice is welcome, even with another approach.
//please, try this
val conf = new SparkConf().setAppName("groupBy").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val rdd = sc.parallelize(
Seq(("A",1,"12/06/2012"),("A",2,"13/06/2012"),("B",3,"12/06/2012"),("B",4,"17/06/2012"),("C",5,"14/06/2012")) )
val v1 = rdd.map(x => (x._1, x ))
val v2 = v1.groupByKey()
val v3 = v2.mapValues(v => v.toArray)
val df2 = v3.toDF("aKey","theValues")
df2.printSchema()
val first = df2.first
println (first)
println (first.getString(0))
val values = first.getSeq[Row](1)
val firstArray = values(0)
println (firstArray.getString(0)) //B
println (firstArray.getInt(1)) //3
println (firstArray.getString(2)) //12/06/2012
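For the DataFrame-only route asked about in the question, a short sketch using groupBy with collect_list and struct (column names assumed to match input_df):

import org.apache.spark.sql.functions.{col, collect_list, struct}

val output_df = input_df
  .groupBy("C1")
  .agg(collect_list(struct(col("C2"), col("C3"))).as("C4"))

output_df.show(false)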