Spark: get years as an array to compare - Scala

I have data which contains:
+------------+----------+
|BaseFromYear|BaseToYear|
+------------+----------+
| 2013| 2013|
+------------+----------+
I need to take the difference of the two years and then check, against another dataframe, whether the required year exists in the base years, so I created this query:
val df = DF_WE.filter($"id" === 3 && $"status" === 1).select("BaseFromYear", "BaseToYear").withColumn("diff_YY", $"BaseToYear" - $"BaseFromYear".cast(IntegerType)).withColumn("Baseyears", when($"diff_YY" === 0, $"BaseToYear"))
+------------+----------+-------+---------+
|BaseFromYear|BaseToYear|diff_YY|Baseyears|
+------------+----------+-------+---------+
| 2013| 2013| 0| 2013|
+------------+----------+-------+---------+
So I get the above output. But if BaseFromYear is 2014 and BaseToYear is 2017, the difference will be 3, and I need to get [2014,2015,2016,2017] as Baseyears, so that in the next step I can compare a required year, say 2016, against the base years. I see there is an isin function - will it work?

I have added comments in the code; let me know if you need further explanation.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// This is a user defined function(udf) which will populate an array of Int from BaseFromYear to BaseToYear
val generateRange: (Int, Int) => Array[Int] = (baseFromYear: Int, baseToYear: Int) => (baseFromYear to baseToYear).toArray
val sqlfunc = udf(generateRange) // Registering the UDF with spark
val df = DF_WE.filter($"id" === 3 && $"status" === 1)
.select("BaseFromYear", "BaseToYear")
.withColumn("diff_YY", $"BaseToYear" - $"BaseFromYear".cast(IntegerType))
.withColumn("Baseyears", sqlfunc($"BaseFromYear", $"BaseToYear")) // using the UDF to populate new columns
df.show()
// Now lets say we are selecting records which has 2016 in the Baseyears
val filteredDf = df.where(array_contains(df("Baseyears"), 2016))
filteredDf.show()
// Comparing the array column against another integer column. The array arrives in the UDF
// as a Seq; the element type is not checked at compile time, so be careful that it really holds Ints.
val isIn: (Int, Seq[Int]) => Boolean = (num: Int, years: Seq[Int]) => years.contains(num)
val sqlIsIn = udf(isIn)
// "YY" stands for the column holding the required year in your dataframe
val filteredDfBasedOnAnotherCol = df.filter(sqlIsIn(df("YY"), df("Baseyears")))
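As an aside, if you are on Spark 2.4 or later (an assumption about your environment), the built-in sequence function can generate the year range without a UDF; a minimal sketch:
// Sketch assuming Spark 2.4+: sequence(start, stop) builds the array of years directly
import org.apache.spark.sql.functions.sequence
val dfSeq = DF_WE.filter($"id" === 3 && $"status" === 1)
.select("BaseFromYear", "BaseToYear")
.withColumn("Baseyears", sequence($"BaseFromYear", $"BaseToYear"))
dfSeq.where(array_contains($"Baseyears", 2016)).show()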


Spark scala dataframe get value for each row and assign to variables

I have a dataframe like below:
val df=spark.sql("select * from table")
row1|row2|row3
A1,B1,C1
A2,B2,C2
A3,B3,C3
I want to iterate with a for loop to get the values like this:
val value1="A1"
val value2="B1"
val value3="C1"
function(value1,value2,value3)
Please help me.
You have 2 options:
Solution 1 - Your data is big: then you must stick with dataframes, so to apply a function on every row we must define a UDF.
Solution 2 - Your data is small: then you can collect the data to the driver machine and iterate with a map.
Example:
val df = Seq((1,2,3), (4,5,6)).toDF("a", "b", "c")
def sum(a: Int, b: Int, c: Int) = a+b+c
// Solution 1
import org.apache.spark.sql.Row
val myUDF = udf((r: Row) => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
df.select(myUDF(struct($"a", $"b", $"c")).as("sum")).show
//Solution 2
df.collect.map(r=> sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
Output for Solution 1 (Solution 2 returns Array(6, 15) on the driver):
+---+
|sum|
+---+
| 6|
| 15|
+---+
EDIT:
val myUDF = udf((r: Row) => {
val value1 = r.getAs[Int](0)
val value2 = r.getAs[Int](1)
val value3 = r.getAs[Int](2)
myFunction(value1, value2, value3)
})
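If your data is small (Solution 2), the same per-row call from the question can be written by collecting and looping. A sketch, where yourDf stands for the question's dataframe with columns row1/row2/row3 and function is the asker's own routine:
// Sketch: collect the (small) dataframe to the driver and call function once per row
yourDf.collect().foreach { r =>
  val value1 = r.getAs[String]("row1")
  val value2 = r.getAs[String]("row2")
  val value3 = r.getAs[String]("row3")
  function(value1, value2, value3)
}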

Convert spark dataframe to sequence of sequences and vice versa in Scala [duplicate]

This question already has an answer here:
How to get Array[Seq[String]] from DataFrame?
(1 answer)
Closed 3 years ago.
I have a DataFrame and I want to convert it into a sequence of sequences and vice versa.
Now the thing is, I want to do it dynamically, and write something which runs for DataFrame with any number/type of columns.
In summary, these are the questions:
How to convert Seq[Seq[String]] to a DataFrame?
How to convert a DataFrame to Seq[Seq[String]]?
How to perform 2 but also make the DataFrame infer the schema and decide column types by itself?
UPDATE 1
This is not a duplicate of that question, because the solution provided in its answer is not dynamic: it works for two columns, or for however many columns are hardcoded. I am trying to find a dynamic solution.
This is how you can dynamically create a dataframe from Seq[Seq[String]]:
scala> val seqOfSeq = Seq(Seq("a","b", "c"),Seq("3","4", "5"))
seqOfSeq: Seq[Seq[String]] = List(List(a, b, c), List(3, 4, 5))
scala> val lengthOfRow = seqOfSeq(0).size
lengthOfRow: Int = 3
scala> val tempDf = sc.parallelize(seqOfSeq).toDF
tempDf: org.apache.spark.sql.DataFrame = [value: array<string>]
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias(s"col$i")): _*)
requiredDf: org.apache.spark.sql.DataFrame = [col0: string, col1: string ... 1 more field]
scala> requiredDf.show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| 3| 4| 5|
+----+----+----+
How to convert a DataFrame to Seq[Seq[String]]:
val newSeqOfSeq = requiredDf.collect().map(row => row.toSeq.map(_.toString).toSeq).toSeq
To use custom column names:
scala> val myCols = Seq("myColA", "myColB", "myColC")
myCols: Seq[String] = List(myColA, myColB, myColC)
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias( myCols(i) )): _*)
requiredDf: org.apache.spark.sql.DataFrame = [myColA: string, myColB: string ... 1 more field]
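Question 3 (building the DataFrame while letting Spark infer the column types) is not covered above. One possible sketch, assuming Spark 2.2+ where spark.read.csv accepts a Dataset[String], is to render each inner Seq as a CSV line and let the CSV reader infer the schema (the numeric sample data below is hypothetical):
// Sketch: infer column types via the CSV reader (assumes values contain no commas/quotes)
import spark.implicits._
val numericSeqOfSeq = Seq(Seq("1", "2", "3"), Seq("4", "5", "6"))
val csvLines = numericSeqOfSeq.map(_.mkString(",")).toDS() // one CSV line per row
val inferredDf = spark.read.option("inferSchema", "true").csv(csvLines)
inferredDf.printSchema() // columns come back as int rather than string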

How to extract efficiently multiple columns from a single string column RDD?

I have a file with 20+ columns, of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smarter way to do it, but I'm not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))
The following solution provides an easy and scalable way to manage your column names and indices. It is based on a map which defines the column name/index relation; the map also helps us handle both the index of each extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(dictColums.values.toSeq.map{cols(_)})
}
val df = spark.createDataFrame(mappedRDD, schema)
df.show()
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums; in this example we will extract the columns "c0" -> 0 and "c2" -> 2.
Next we create the schema from the keys of the map.
The first map (which you already have) splits the line by |, and the second one creates a Row containing the values that correspond to each item of dictColums.values.
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]) : DataFrame = {
val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(colsMapping.values.toSeq.map{cols(_)})
}
spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
As shown below, if you don't want to write repeated x(i), you can process it in a loop. Example 1:
import scala.collection.mutable.ArrayBuffer
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
for (i <- Array(0,1,5,6...)){
xbuffer.append(x(i))
}
xbuffer
})
If you only want to define the index list with a start & end plus the numbers to be excluded, see Example 2 below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})

Split Spark dataframe and calculate average based on one column value

I have two dataframes, the first dataframe classRecord has 10 different entries like the following:
Class, Calculation
first, Average
Second, Sum
Third, Average
Second dataframe studentRecord has around 50K entries like the following:
Name, height, Camp, Class
Shae, 152, yellow, first
Joe, 140, yellow, first
Mike, 149, white, first
Anne, 142, red, first
Tim, 154, red, Second
Jake, 153, white, Second
Sherley, 153, white, Second
From the second dataframe, based on the class type, I would like to perform a calculation on height (for class first: average, for class Second: sum, etc.), separately for each camp (if the class is first, the average of yellow, white and so on separately).
I tried the following code:
//function to calculate average
def averageOnName(splitFrame : org.apache.spark.sql.DataFrame ) : Array[(String, Double)] = {
val pairedRDD: RDD[(String, Double)] = splitFrame.select($"Name",$"height".cast("double")).as[(String, Double)].rdd
var avg_by_key = pairedRDD.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(y => 1.0 * y._1 / y._2).collect
return avg_by_key
}
//required schema for further modifications
val schema = StructType(
StructField("name", StringType, false) ::
StructField("avg", DoubleType, false) :: Nil)
// for each loop on each class type
classRecord.rdd.foreach{
//filter students based on camps
var campYellow =studentRecord.filter($"Camp" === "yellow")
var campWhite =studentRecord.filter($"Camp" === "white")
var campRed =studentRecord.filter($"Camp" === "red")
// since I know that calculation for first class is average, so representing calculation only for class first
val avgcampYellow = averageOnName(campYellow)
val avgcampWhite = averageOnName(campWhite)
val avgcampRed = averageOnName(campRed)
// union of all
val rddYellow = sc.parallelize (avgcampYellow).map (x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
//conversion of rdd to frame
var dfYellow = sqlContext.createDataFrame(rddYellow, schema)
//union with yellow camp data
val rddWhite = sc.parallelize (avgcampWhite).map (x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
//conversion of rdd to frame
var dfWhite = sqlContext.createDataFrame(rddWhite, schema)
var dfYellWhite = dfYellow.union(dfWhite)
//union with yellow,white camp data
val rddRed = sc.parallelize (avgcampRed).map (x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
//conversion of rdd to frame
var dfRed = sqlContext.createDataFrame(rddRed, schema)
var dfYellWhiteRed = dfYellWhite .union(dfRed)
// other modifications and final result to hive
}
Here I am struggling with:
Hardcoding yellow, red and white, there may be additional camp types as well.
The dataframe is currently being filtered many times which could be improved.
I'm not able to figure out how to calculate differently according to the class calculation type (i.e. use sum/average depending on the class type).
Any help is appreciated.
You could simply do the average and sum calculations for all combinations of Class/Camp and then parse the classRecord dataframe separately and extract what you need. You can do this easily in Spark by using the groupBy() method and aggregating the values.
Using your example dataframe:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
studentRecord.show()
+-------+------+------+------+
| Name|height| Camp| Class|
+-------+------+------+------+
| Shae| 152|yellow| first|
| Joe| 140|yellow| first|
| Mike| 149| white| first|
| Anne| 142| red| first|
| Tim| 154| red|Second|
| Jake| 153| white|Second|
|Sherley| 153| white|Second|
+-------+------+------+------+
val df = studentRecord.groupBy("Class", "Camp")
.agg(
sum($"height").as("Sum"),
avg($"height").as("Average"),
collect_list($"Name").as("Names")
)
df.show()
+------+------+---+-------+---------------+
| Class| Camp|Sum|Average| Names|
+------+------+---+-------+---------------+
| first| white|149| 149.0| [Mike]|
| first| red|142| 142.0| [Anne]|
|Second| red|154| 154.0| [Tim]|
|Second| white|306| 153.0|[Jake, Sherley]|
| first|yellow|292| 146.0| [Shae, Joe]|
+------+------+---+-------+---------------+
After doing this, you can simply check your classRecord dataframe for which rows you need. Here is an example of what it can look like; it can be adapted to your actual needs:
// Collects the dataframe as an Array[(String, String)]
val classRecs = classRecord.collect().map{case Row(clas: String, calc: String) => (clas, calc)}
for (classRec <- classRecs){
val clas = classRec._1
val calc = classRec._2
// Matches which calculation you want to do
val df2 = calc match {
case "Average" => df.filter($"Class" === clas).select("Class", "Camp", "Average")
case "Sum" => df.filter($"Class" === clas).select("Class", "Camp", "Sum")
}
// Do something with df2
}
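For reference, a join-based variant (a sketch, not part of the original answer) avoids the explicit loop entirely: join the aggregated df back to classRecord and pick the requested metric with when/otherwise. It assumes classRecord has exactly the Class and Calculation columns shown in the question:
// Sketch: choose Sum or Average per class via a join instead of a loop
// (assumes import org.apache.spark.sql.functions._ as in the code above)
val result = df.join(classRecord, Seq("Class"))
.withColumn("Value", when($"Calculation" === "Average", $"Average").otherwise($"Sum"))
.select("Class", "Camp", "Calculation", "Value")
result.show()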
Hope it helps!

Spark Dataset API - join

I am trying to use the Spark Dataset API but I am having some issues doing a simple join.
Let's say I have two datasets with the fields date | value; in the DataFrame case my join would look like:
val dfA : DataFrame
val dfB : DataFrame
dfA.join(dfB, dfB("date") === dfA("date") )
However for Dataset there is the .joinWith method, but the same approach does not work:
val dfA : Dataset
val dfB : Dataset
dfA.joinWith(dfB, ? )
What is the argument required by .joinWith ?
To use joinWith you first have to create a Dataset, and most likely two of them. To create a Dataset, you need to create a case class that matches your schema and call DataFrame.as[T], where T is your case class. So:
case class KeyValue(key: Int, value: String)
val df = Seq((1,"asdf"),(2,"34234")).toDF("key", "value")
val ds = df.as[KeyValue]
// org.apache.spark.sql.Dataset[KeyValue] = [key: int, value: string]
You could also skip the case class and use a tuple:
val tupDs = df.as[(Int,String)]
// org.apache.spark.sql.Dataset[(Int, String)] = [_1: int, _2: string]
Then if you had another case class / DF, like this say:
case class Nums(key: Int, num1: Double, num2: Long)
val df2 = Seq((1,7.7,101L),(2,1.2,10L)).toDF("key","num1","num2")
val ds2 = df2.as[Nums]
// org.apache.spark.sql.Dataset[Nums] = [key: int, num1: double, num2: bigint]
Then, while the syntax of join and joinWith are similar, the results are different:
df.join(df2, df.col("key") === df2.col("key")).show
// +---+-----+---+----+----+
// |key|value|key|num1|num2|
// +---+-----+---+----+----+
// | 1| asdf| 1| 7.7| 101|
// | 2|34234| 2| 1.2| 10|
// +---+-----+---+----+----+
ds.joinWith(ds2, df.col("key") === df2.col("key")).show
// +---------+-----------+
// | _1| _2|
// +---------+-----------+
// | [1,asdf]|[1,7.7,101]|
// |[2,34234]| [2,1.2,10]|
// +---------+-----------+
As you can see, joinWith leaves the objects intact as parts of a tuple, while join flattens out the columns into a single namespace. (Which will cause problems in the above case because the column name "key" is repeated.)
Curiously enough, I have to use df.col("key") and df2.col("key") to create the conditions for joining ds and ds2 -- if you use just col("key") on either side it does not work, and ds.col(...) doesn't exist. Using the original df.col("key") does the trick, however.
From https://docs.cloud.databricks.com/docs/latest/databricks_guide/05%20Spark/1%20Intro%20Datasets.html
it looks like you could just do
dfA.as("A").joinWith(dfB.as("B"), $"A.date" === $"B.date" )
For the above example, you can try the below:
Define a case class for your output
case class JoinOutput(key:Int, value:String, num1:Double, num2:Long)
Join the two Datasets with Seq("key"); this helps you avoid two duplicate key columns in the output, which also makes it easier to apply the case class or fetch the data in the next step.
val joined = ds.join(ds2, Seq("key")).as[JoinOutput]
// res27: org.apache.spark.sql.Dataset[JoinOutput] = [key: int, value: string ... 2 more fields]
The result will be flat instead:
joined.show
+---+-----+----+----+
|key|value|num1|num2|
+---+-----+----+----+
| 1| asdf| 7.7| 101|
| 2|34234| 1.2| 10|
+---+-----+----+----+
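If you prefer to stay with joinWith but still want the flat shape, you can also map the resulting tuple into the same JoinOutput case class; a sketch reusing the objects defined in the first answer:
// Sketch: flatten the joinWith tuple into JoinOutput
val joinedWith = ds.joinWith(ds2, df.col("key") === df2.col("key"))
.map { case (kv, nums) => JoinOutput(kv.key, kv.value, nums.num1, nums.num2) }
joinedWith.show()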