Get sum of a column in a DF1 based on date range from DF2 in Spark - scala

I have two dataframes, and I want to get the sum of value in dataframe 1 for each date range (startDate to endDate) in dataframe 2, and sort the results from maximum to minimum, in Spark.
import org.apache.spark.sql.functions.to_date
val df = sc.parallelize(Seq(
("2019-01-01", 100), ("2019-01-02", 150),
("2019-01-03", 120), ("2019-01-04", 38),
("2019-01-05", 200), ("2019-01-06", 381),
("2019-01-07", 220), ("2019-01-08", 183),
("2019-01-09", 160), ("2019-01-10", 109),
("2019-01-11", 130), ("2019-01-12", 282),
("2019-01-13", 10), ("2019-01-14", 348),
("2019-01-15", 20), ("2019-01-16", 190)
)).toDF("date", "value").withColumn("date", to_date($"date"))
val df_dates = sc.parallelize(Seq(
("2019-01-01", "2019-01-04"),
("2019-01-05", "2019-01-08"),
("2019-01-09", "2019-01-12"),
("2019-01-13", "2019-01-16")
)).toDF("startDate", "endDate").withColumn("startDate", to_date($"startDate")).withColumn("endDate", to_date($"endDate"))
The resulting output should add a sum_value column to the df_dates dataframe. I really do not know where to start. I searched the web and couldn't find a solution.

You first have to join the date values to the date ranges, then aggregate:
import org.apache.spark.sql.functions.sum

df_dates
  .join(df, $"date".between($"startDate", $"endDate"), "left")
  .groupBy($"startDate", $"endDate")
  .agg(sum($"value").as("sum_value"))
  .orderBy($"sum_value".desc)
  .show()
+----------+----------+---------+
| startDate| endDate|sum_value|
+----------+----------+---------+
|2019-01-05|2019-01-08| 984|
|2019-01-09|2019-01-12| 681|
|2019-01-13|2019-01-16| 568|
|2019-01-01|2019-01-04| 408|
+----------+----------+---------+
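Note that because of the "left" join, a date range that matches no rows in df still appears in the result, but with a null sum_value. If you would rather get 0 for such ranges, a small sketch using the built-in coalesce and lit functions makes the aggregate null-safe:
import org.apache.spark.sql.functions.{coalesce, lit, sum}

df_dates
  .join(df, $"date".between($"startDate", $"endDate"), "left")
  .groupBy($"startDate", $"endDate")
  .agg(coalesce(sum($"value"), lit(0)).as("sum_value")) // sum is null for unmatched ranges, so fall back to 0
  .orderBy($"sum_value".desc)
  .show()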

Related

'list' object has no attribute 'map' in pyspark error

""" df = sc.textFile("/content/Shakespeare.txt")
llist = df.collect()
for line in llist:
t= simple_tokenize(line)
rdd2 = t.map(lambda word: (word,1)) # error on this line
rdd3 = rdd2.reduceByKey(lambda a,b: a+b)
"""
I am facing an error on rdd2. Can someone please help?
The error happens because llist is a plain Python list returned by collect(), so simple_tokenize(line) gives you an ordinary list of words, and Python lists have no .map method; map is an RDD transformation. I think you want a simple word count using RDDs, which you can do as follows:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark import SparkFiles
# read file from url
url="https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt"
spark.sparkContext.addFile(url)
df=spark.read.csv(SparkFiles.get("shakespeare.txt"), header=True)
df.show(4)
+------------------------------------------+
|THE SONNETS |
+------------------------------------------+
|by William Shakespeare |
|From fairest creatures we desire increase |
|That thereby beauty's rose might never die|
|But as the riper should by time decease |
+------------------------------------------+
only showing top 4 rows
# convert to rdd taking only the strings of the row
rdd=df.rdd.map(lambda x: x["THE SONNETS"])
rdd.take(4)
['by William Shakespeare',
'From fairest creatures we desire increase',
"That thereby beauty's rose might never die",
'But as the riper should by time decease']
# you can also parallelize a Python list of strings
data=["From fairest creatures we desire increase",
"That thereby beauty's rose might never die",
"But as the riper should by time decease",
"His tender heir might bear his memory",
]
rdd=spark.sparkContext.parallelize(data)
Now run the basic three steps:
1. split each line into words
2. map each word to a (word, 1) pair
3. reduce by key to count the words
rdd1=rdd.flatMap(lambda x: x.split(" "))
rdd2=rdd1.map(lambda word: (word,1))
rdd3=rdd2.reduceByKey(lambda a,b: a+b)
rdd3.take(20)
[('by', 66),
('William', 1),
('Shakespeare', 1),
('From', 14),
('fairest', 5),
('creatures', 2),
('we', 11),
('desire', 6),
('increase', 3),
('That', 83),
('thereby', 1),
("beauty's", 16),
('rose', 5),
('might', 19),
('never', 10),
('die', 5),
('But', 89),
('as', 66),
('the', 311),
('riper', 2)]

Calculate date difference for a specific column ID Scala

I need to calculate a date difference for a column, considering a specific ID shown in a different column and the first date for that specific ID, using Scala.
I have the following dataset:
The column ID shows the specific ID previously mentioned, the column date shows the date of the event and the column rank shows the chronological positioning of the different event dates for each specific ID.
For ID 1, I need to calculate the date difference of ranks 2 and 3 relative to rank 1 for that same ID, then the same for ID 2, and so forth.
The expected result is the following:
Does somebody know how to do it?
Thanks!!!
Outside of using a library like Spark to reason about your data in SQL-esque terms, this can be accomplished using the Collections API by first finding the minimum date for each ID and then comparing the dates in the original collection:
# import java.time.temporal.ChronoUnit.DAYS
import java.time.temporal.ChronoUnit.DAYS
# import java.time.LocalDate
import java.time.LocalDate
# case class Input(id : Int, date : LocalDate, rank : Int)
defined class Input
# case class Output(id : Int, date : LocalDate, rank : Int, diff : Long)
defined class Output
# val inData = Seq(Input(1, LocalDate.of(2020, 12, 10), 1),
Input(1, LocalDate.of(2020, 12, 12), 2),
Input(1, LocalDate.of(2020, 12, 16), 3),
Input(2, LocalDate.of(2020, 12, 11), 1),
Input(2, LocalDate.of(2020, 12, 13), 2),
Input(2, LocalDate.of(2020, 12, 14), 3))
inData: Seq[Input] = List(
Input(1, 2020-12-10, 1),
Input(1, 2020-12-12, 2),
Input(1, 2020-12-16, 3),
Input(2, 2020-12-11, 1),
Input(2, 2020-12-13, 2),
Input(2, 2020-12-14, 3)
)
# val minDates = inData.groupMapReduce(_.id)(identity) { (a, b) =>
    a.date.isBefore(b.date) match {
      case true  => a
      case false => b
    }
  }
minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))
# val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date)))
outData: Seq[Output] = List(
Output(1, 2020-12-10, 1, 0L),
Output(1, 2020-12-12, 2, 2L),
Output(1, 2020-12-16, 3, 6L),
Output(2, 2020-12-11, 1, 0L),
Output(2, 2020-12-13, 2, 2L),
Output(2, 2020-12-14, 3, 3L)
)
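Note that groupMapReduce is only available from Scala 2.13 onwards. On 2.12 or earlier, a rough equivalent (a sketch of the same minimum-date lookup) is to group first and then pick the earliest date per group:
val minDates: Map[Int, Input] =
  inData.groupBy(_.id).map { case (id, rows) =>
    id -> rows.minBy(_.date.toEpochDay) // earliest date for this ID
  }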
You can get the required output with Spark by performing the steps below:
// Creating the sample data
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),
                   (2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
  .toDF("ID","Date","Rank").withColumn("Date", $"Date".cast("date"))

// Adding a column that holds the Date value only for the Rank = 1 rows
val df1 = sampledf.withColumn("Basedate", when($"Rank" === 1, $"Date"))

// Grouping by ID and Basedate and keeping only the records whose minimum Rank is 1
val groupedDF = df1.groupBy("ID", "Basedate").min("Rank").filter($"min(Rank)" === 1)

// Joining the two dataframes and selecting the required columns
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"), "left").select("ID", "Date", "Rank", "t.Basedate")

// Applying the built-in datediff function to get the required output
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date", $"Basedate"))
finalDF.show(false)

// If using Databricks you can use the display method
display(finalDF)
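A more compact alternative (a sketch over the same sampledf) is to use a window partitioned by ID and pull the Rank = 1 date directly, which avoids the separate groupBy and join:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("ID")
val windowedDF = sampledf
  .withColumn("Basedate", min(when($"Rank" === 1, $"Date")).over(w)) // the Rank = 1 date for each ID
  .withColumn("DateDifference", datediff($"Date", $"Basedate"))
windowedDF.show(false)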

Spark Dataframe: Calculate variance between groups

With R's dplyr I would calculate variance between groups like so:
df %>% group_by(group) %>% summarise(total = sum(value)) %>% summarise(variance_between_groups = var(total))
Trying to perform the same action with Sparks DataFrame API:
df.groupBy(group).agg(sum(value).alias("total")).agg(var_samp(total).alias("variance_between_groups"))
I receive an error in the second agg saying that it can't find total. I am clearly misunderstanding something so any help would be appreciated.
var_samp() takes a String-type column name, so you need to refer to the aggregated column as the String "total" (a bare total identifier is not defined), as follows:
import org.apache.spark.sql.functions._
val df = Seq(
("a", 1.0),
("a", 2.5),
("a", 1.5),
("b", 2.0),
("b", 1.6)
).toDF("group", "value")
df.groupBy("group").
agg(sum("value").alias("total")).
agg(var_samp("total").alias("variance_between_groups")).
show
// +-----------------------+
// |variance_between_groups|
// +-----------------------+
// | 0.9799999999999999|
// +-----------------------+
It can also take a column (of Column type), e.g. var_samp($"total"). See Spark's API doc for more details.
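As an aside, dplyr's var() computes the sample variance, so var_samp is the matching function here; if you ever need the population variance instead, var_pop is used the same way:
// var_pop gives the population variance (divides by n rather than n - 1)
df.groupBy("group").
  agg(sum("value").alias("total")).
  agg(var_pop("total").alias("variance_between_groups")).
  show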

pyspark substring and aggregation

I am new to Spark and I've got a csv file with data like this:
date, accidents, injured
2015/20/03 18:00, 15, 5
2015/20/03 18:30, 25, 4
2015/20/03 21:10, 14, 7
2015/20/02 21:00, 15, 6
I would like to aggregate this data by the specific hour when it happened. My idea is to substring the date to 'year/month/day hh' with no minutes so I can use it as a key. I want to get the average of accidents and injured for each hour. Maybe there is a different, smarter way to do this with pyspark?
Thanks guys!
Well, it depends on what you're going to do afterwards, I guess.
The simplest way would be to do as you suggest: substring the date string and then aggregate:
data = [('2015/20/03 18:00', 15, 5),
('2015/20/03 18:30', 25, 4),
('2015/20/03 21:10', 14, 7),
('2015/20/02 21:00', 15, 6)]
df = spark.createDataFrame(data, ['date', 'accidents', 'injured'])
df.withColumn('date_hr',
df['date'].substr(1, 13)
).groupby('date_hr')\
.agg({'accidents': 'avg', 'injured': 'avg'})\
.show()
If you, however, want to do some more computation later on, you can parse the data to a TimestampType() and then extract the date and hour from that.
import pyspark.sql.types as typ
from pyspark.sql.functions import col, udf
from datetime import datetime
parseString = udf(lambda x: datetime.strptime(x, '%Y/%d/%m %H:%M'), typ.TimestampType())
getDate = udf(lambda x: x.date(), typ.DateType())
getHour = udf(lambda x: int(x.hour), typ.IntegerType())
df.withColumn('date_parsed', parseString(col('date'))) \
.withColumn('date_only', getDate(col('date_parsed'))) \
.withColumn('hour', getHour(col('date_parsed'))) \
.groupby('date_only', 'hour') \
.agg({'accidents': 'avg', 'injured': 'avg'})\
.show()

How to convert a SQL query output (dataframe) into an array list of key value pairs in Spark Scala?

I created a dataframe in spark scala shell for SFPD incidents. I queried the data for Category count and the result is a datafame. I want to plot this data into a graph using Wisp. Here is my dataframe,
+--------------+--------+
| Category|catcount|
+--------------+--------+
| LARCENY/THEFT| 362266|
|OTHER OFFENSES| 257197|
| NON-CRIMINAL| 189857|
| ASSAULT| 157529|
| VEHICLE THEFT| 109733|
| DRUG/NARCOTIC| 108712|
| VANDALISM| 91782|
| WARRANTS| 85837|
| BURGLARY| 75398|
|SUSPICIOUS OCC| 64452|
+--------------+--------+
I want to convert this dataframe into an ArrayList of key-value pairs, so I want a result like this, with (String, Int) type:
(LARCENY/THEFT,362266)
(OTHER OFFENSES,257197)
(NON-CRIMINAL,189857)
(ASSAULT,157529)
(VEHICLE THEFT,109733)
(DRUG/NARCOTIC,108712)
(VANDALISM,91782)
(WARRANTS,85837)
(BURGLARY,75398)
(SUSPICIOUS OCC,64452)
I tried converting this dataframe (t) into an RDD with val rddt = t.rdd, and then used flatMapValues:
rddt.flatMapValues(x=>x).collect()
but I still couldn't get the required result.
Or is there a way to directly give the dataframe output into Wisp?
In pyspark it'd be as below. Scala will be quite similar.
Creating test data
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,1), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
Mapping the test data, reformatting from a RDD of Rows to an RDD of tuples. Then, using collect to extract all the tuples as a list.
df.rdd.map(lambda x: (x[0], x[1])).collect()
[(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]
The Scala Spark Row documentation should help you convert this to Scala Spark code.
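For reference, a minimal Scala sketch of the same idea, assuming Category is a string and catcount is a long (as it would be after a count aggregation); it collects the dataframe t into an Array of (String, Int) pairs:
val pairs: Array[(String, Int)] = t.rdd
  .map(row => (row.getString(0), row.getLong(1).toInt)) // (Category, catcount); assumes catcount is a long
  .collect()

pairs.foreach(println)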