Aggregate data in scala - scala

I have data in a file like :
2005, 08, 20, 50
2005, 08, 21, 52
2005, 08, 22, 38
2005, 08, 23, 70
Data is : Year, Month, Date, temperature.
I want to read this data and output data year and month wise temperatures.
example : 2015-08: 38, 50, 52, 70.
temperature is sorted in ascending order.
What should be the spark scala code for the same? Answer in RDD transformations would appreciate a lot.
Until now I have done this so far :
val conf= new SparkConf().setAppName("demo").setMaster("local[*]")
val spark = new SparkContext(conf)
val input = spark.textFile("src/main/resources/someFile.txt")
val fields = input.flatMap(_.split(","))
What I am thinking is, to have year-month as a key and then list of temperatures as values. But I am not able to get this into the code.

val myData = sc.parallelize(Array((2005, 8, 20, 50), (2005, 8, 21, 52), (2005, 8, 22, 38), (2005, 8, 23, 70)))
myData.sortBy(_._4).collect
returns:
res1: Array[(Int, Int, Int, Int)] = Array((2005,8,22,38), (2005,8,20,50), (2005,8,21,52), (2005,8,23,70))
Leave you to do the concat function

From file
val filesRDD = sc.textFile("/FileStore/tables/Weather2.txt",1)
val linesRDD = filesRDD.map(line => (line.trim.split(","))).map(entries=>(entries(0).toInt,entries(1).toInt,entries(2).toInt,entries(3).toInt))
linesRDD.sortBy(_._4).collect
returns:
res13: Array[(Int, Int, Int, Int)] = Array((2005,7,22,7), (2005,7,15,10), (2005,8,22,38), (2005,8,20,50), (2005,7,19,50), (2005,8,21,52), (2005,7,21,52), (2005,8,23,70))
You can think of the concat yourself, and, what if sort values are common? Multiple sorts, but this I think answers your first less well-formed question.

Related

Calculate date difference for a specific column ID Scala

I need to calculate a date difference for a column, considering a specific ID shown in a different column and the first date for that specific ID, using Scala.
I have the following dataset:
The column ID shows the specific ID previously mentioned, the column date shows the date of the event and the column rank shows the chronological positioning of the different event dates for each specific ID.
I need to calculate for ID 1, the date difference for ranks 2 and 3 compared to rank 1 for that same ID, the same for ID 2 and so forth.
The expected result is the following:
Does somebody know how to do it?
Thanks!!!
Outside of using a library like Spark to reason about your data in SQL-esque terms, this can be accomplished using the Collections API by first finding the minimum date for each ID and then comparing the dates in the original collection:
# import java.time.temporal.ChronoUnit.DAYS
import java.time.temporal.ChronoUnit.DAYS
# import java.time.LocalDate
import java.time.LocalDate
# case class Input(id : Int, date : LocalDate, rank : Int)
defined class Input
# case class Output(id : Int, date : LocalDate, rank : Int, diff : Long)
defined class Output
# val inData = Seq(Input(1, LocalDate.of(2020, 12, 10), 1),
Input(1, LocalDate.of(2020, 12, 12), 2),
Input(1, LocalDate.of(2020, 12, 16), 3),
Input(2, LocalDate.of(2020, 12, 11), 1),
Input(2, LocalDate.of(2020, 12, 13), 2),
Input(2, LocalDate.of(2020, 12, 14), 3))
inData: Seq[Input] = List(
Input(1, 2020-12-10, 1),
Input(1, 2020-12-12, 2),
Input(1, 2020-12-16, 3),
Input(2, 2020-12-11, 1),
Input(2, 2020-12-13, 2),
Input(2, 2020-12-14, 3)
# val minDates = inData.groupMapReduce(_.id)(identity){(a, b) =>
a.date.isBefore(b.date) match {
case true => a
case false => b
}}
minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))
# val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date)))
outData: Seq[Output] = List(
Output(1, 2020-12-10, 1, 0L),
Output(1, 2020-12-12, 2, 2L),
Output(1, 2020-12-16, 3, 6L),
Output(2, 2020-12-11, 1, 0L),
Output(2, 2020-12-13, 2, 2L),
Output(2, 2020-12-14, 3, 3L)
You can get the required output by performing the steps as done below :
//Creating the Sample data
import org.apache.spark.sql.types._
val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),(2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
.toDF("ID","Date","Rank").withColumn("Date",$"Date".cast("Date"))
//adding column with just the value for the rank = 1 column
import org.apache.spark.sql.functions._
val df1 = sampledf.withColumn("Basedate",when($"Rank" === 1 ,$"Date"))
//Doing GroupBy based on ID and basedate column and filtering the records with null basedate
val groupedDF = df1.groupBy("ID","basedate").min("Rank").filter($"min(Rank)" === 1)
//joining the two dataframes and selecting the required columns.
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"),"left").select("ID","Date","Rank","t.basedate")
//Applying the inbuilt datediff function to get the required output.
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date",$"basedate"))
finalDF.show(false)
//If using databricks you can use display method.
display(finalDF)

spark dataframe : finding employees who is having salary more than the average salary of the organization

I am trying to run a test spark/scala code to find employees who is having salary more than the avarage salary with a test data using below spark dataframe . But this is failing while executing :
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(input[4, double, false])
What might be the correct syntax to achieve this ?
val dataDF20 = spark.createDataFrame(Seq(
(11, "emp1", 2, 45, 1000.0),
(12, "emp2", 1, 34, 2000.0),
(13, "emp3", 1, 33, 3245.0),
(14, "emp4", 1, 54, 4356.0),
(15, "emp5", 2, 76, 56789.0)
)).toDF("empid", "name", "deptid", "age", "sal")
val condition1 : Column = col("sal") > avg(col("sal"))
val d0 = dataDF20.filter(condition1)
println("------ d0.show()----", d0.show())
You can get this done in two steps:
val avgVal = dataDF20.select(avg($"sal")).take(1)(0)(0)
dataDF20.filter($"sal" > avgVal).show()
+-----+----+------+---+-------+
|empid|name|deptid|age| sal|
+-----+----+------+---+-------+
| 15|emp5| 2| 76|56789.0|
+-----+----+------+---+-------+

Why is Key always 0 when creating map

My code is supposed to extract a Map from a dataframe. The map will be used later for some calculations (mapping Credit to best matching original Billing). However the first step is failing already - the TransactionId is always retrieved as 0.
Simplified version of the code:
case class SalesTransaction(
CustomerId : Int,
Score : Int,
Revenue : Double,
Type : String,
Credited : Double = 0.0,
LinkedTransactionId : Int = 0,
IsProcessed : Boolean = false
)
val df = Seq(
(1, 1, 123, "Sales", 100),
(1, 2, 122, "Credit", 100),
(1, 3, 99, "Sales", 70),
(1, 4, 101, "Sales", 77),
(1, 5, 102, "Credit", 75),
(1, 6, 98, "Sales", 71),
(2, 7, 200, "Sales", 55),
(2, 8, 220, "Sales", 55),
(2, 9, 200, "Credit", 50),
(2, 10, 205, "Sales", 50)
).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
.withColumn("Revenue", $"Revenue".cast(DoubleType))
.repartition($"CustomerId")
//map generation:
val m2 : Map[Int, SalesTransaction] =
df.map(row => (
row.getAs("TransactionId")
, new SalesTransaction(row.getAs("CustomerId")
, row.getAs("TransactionAttributesScore")
, row.getAs("Revenue")
, row.getAs("TransactionType")
)
)
).collect.toMap
m2.foreach(m => println("key: " + m._1 +" Value: "+ m._2))
The output has only the very last record, because all values captured by row.getAs("TransactionId") is null (i.e. translates as 0 in the m2 Map) thus tuple created in each iteration is (null, [current row SalesTransaction]).
Could you please advice me what might be wrong with my code? I'm quite new to Scala and must be missing some syntactical nuance here.
You can also use row.getAs[Int]("TransactionId") as shown below :
val m2 : Map[Int, SalesTransaction] =
df.map(row => (
row.getAs[Int]("TransactionId"),
new SalesTransaction(row.getAs("CustomerId"),
row.getAs("TransactionAttributesScore"),
row.getAs("Revenue"),
row.getAs("TransactionType"))
)
).collect.toMap
It is always better to use the casted version of getAs to avoid errors like this.
The issue is related to data type obtained from row.getAs("TransactionId"). Despite underlying $"TransactionId" being integer. Converting the input explicitly resolved the issue:
//… code above unchanged
val m2 : Map[Int, SlTransaction] =
df.map(row => {
val mKey : Int = row.getAs("TransactionId") //forcing into Int variable
val mValue : SlTransaction = new SlTransaction(row.getAs("CustomerId")
, row.getAs("TransactionAttributesScore")
, row.getAs("Revenue")
, row.getAs("TransactionType")
)
(mKey, mValue)
}
).collect.toMap

pyspark substring and aggregation

I am new to Spark and I've got a csv file with such data:
date, accidents, injured
2015/20/03 18:00 15, 5
2015/20/03 18:30 25, 4
2015/20/03 21:10 14, 7
2015/20/02 21:00 15, 6
I would like to aggregate this data by a specific hour of when it has happened. My idea is to Substring date to 'year/month/day hh' with no minutes so I can make it a key. I wanted to give average of accidents and injured by each hour. Maybe there is a different, smarter way with pyspark?
Thanks guys!
Well, it depends on what you're going to do afterwards, I guess.
The simplest way would be to do as you suggest: substring the date string and then aggregate:
data = [('2015/20/03 18:00', 15, 5),
('2015/20/03 18:30', 25, 4),
('2015/20/03 21:10', 14, 7),
('2015/20/02 21:00', 15, 6)]
df = spark.createDataFrame(data, ['date', 'accidents', 'injured'])
df.withColumn('date_hr',
df['date'].substr(1, 13)
).groupby('date_hr')\
.agg({'accidents': 'avg', 'injured': 'avg'})\
.show()
If you, however, want to do some more computation later on, you can parse the data to a TimestampType() and then extract the date and hour from that.
import pyspark.sql.types as typ
from pyspark.sql.functions import col, udf
from datetime import datetime
parseString = udf(lambda x: datetime.strptime(x, '%Y/%d/%m %H:%M'), typ.TimestampType())
getDate = udf(lambda x: x.date(), typ.DateType())
getHour = udf(lambda x: int(x.hour), typ.IntegerType())
df.withColumn('date_parsed', parseString(col('date'))) \
.withColumn('date_only', getDate(col('date_parsed'))) \
.withColumn('hour', getHour(col('date_parsed'))) \
.groupby('date_only', 'hour') \
.agg({'accidents': 'avg', 'injured': 'avg'})\
.show()

Spark SQL - Generate array of arrays from the sql function

I want to create an array of arrays. This is my data table:
// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)
// For SQL usage we need to register the table
df.registerTempTable("df")
I want to create an array of integer column "age". For that I use "collect_list":
sqlContext.sql("SELECT collect_list(age) as age from df").show
But now I want to generate an array containing multiple arrays as created above:
sqlContext.sql("SELECT collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show
But this does not work , or use the function org.apache.spark.sql.functions.array. Any ideas?
Ok, things can't get more simple. Let's consider the same data you are working on and go step by step from there
// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)
// For SQL usage we need to register the table
df.registerTempTable("df")
sqlContext.sql("select collect_list(age) as age from df").show
// +----------------+
// | age|
// +----------------+
// |[21, 26, 52, 31]|
// +----------------+
sqlContext.sql("select collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show
As the error message says :
org.apache.spark.sql.AnalysisException: No handler for Hive udf class
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Exactly one argument is expected..; line 1 pos 52 [...]
collest_list takes just one argument. Let's check the documentation here.
It actually takes one argument ! But let's go further in the documentation of the functions object. You seem to have noticed that the array function allows you to create a new array column out of a Column or a repeated Column parameter. So let's use that :
sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df").show(false)
The array function create indeed a column from the column list create before-hand by collect_list on both age and salary :
// +-------------------------------------------------------------------+
// |arrayInt |
// +-------------------------------------------------------------------+
// |[WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450)]|
// +-------------------------------------------------------------------+
Where do we go from here ?
You have to remember that a Row from a DataFrame is just another collection wrapped by a Row.
The first thing I'll do is work on that collection. So How do we flatten a WrappedArray[WrappedArray[Int]] ?
Scala is kind of magical you just need to use .flatten
import scala.collection.mutable.WrappedArray
val firstRow: mutable.WrappedArray[mutable.WrappedArray[Int]] =
sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df")
.first.get(0).asInstanceOf[WrappedArray[WrappedArray[Int]]]
// res26: scala.collection.mutable.WrappedArray[scala.collection.mutable.WrappedArray[Int]] =
// WrappedArray(WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450))
firstRow.flatten
// res27: scala.collection.mutable.IndexedSeq[Int] = ArrayBuffer(21, 26, 52, 31, 905, 1130, 1890, 1450)
Now let's wrap it in a UDF so we can use it on the DataFrame :
def flatten(array: WrappedArray[WrappedArray[Int]]) = array.flatten
sqlContext.udf.register("flatten", flatten(_: WrappedArray[WrappedArray[Int]]))
Since we registered the UDF, we can now use it inside the sqlContext :
sqlContext.sql("select flatten(array(collect_list(age), collect_list(salary))) as arrayInt from df").show(false)
// +---------------------------------------+
// |arrayInt |
// +---------------------------------------+
// |[21, 26, 52, 31, 905, 1130, 1890, 1450]|
// +---------------------------------------+
I hope this helps !
Let's create the DataFrame the way have created above.
// A case class for our sample table
import org.apache.spark.sql.functions._
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = spark.createDataFrame(x)
Here we can use array_union function to achieve the desired result. array_unionfunction will return the union of all elements from the input arrays. This function is available since spark 2.4.0
// Scala Ref : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
// Pyspark Ref : https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_union
df.select(collect_list("age").as("age"), collect_list("salary").as("salary"))
.withColumn("new_col", array_union($"age", $"salary")).show(truncate=false)
// Output
+----------------+-----------------------+---------------------------------------+
|age |salary |new_col |
+----------------+-----------------------+---------------------------------------+
|[21, 26, 52, 31]|[905, 1130, 1890, 1450]|[21, 26, 52, 31, 905, 1130, 1890, 1450]|
+----------------+-----------------------+---------------------------------------+
I hope this helps.