PySpark: date interval in PySpark's sequence function?

I want to generate a DataFrame with dates using PySpark's sequence() function (not looking for work-arounds using other methods). I got this working with the default step of 1. But how do I generate a sequence with dates, say, 1 week apart? I can't figure out what type/value to feed into the step parameter of the function.
df = (spark.createDataFrame([{'date':1}])
.select(explode(sequence(to_date(lit('2021-01-01')),to_date(lit(date.today())))).alias('calendar_date')))
df.show()

You have to pass an INTERVAL literal as the step. Adapting your code:
from datetime import date
from pyspark.sql.functions import explode, expr, lit, sequence, to_date

df = (
    spark
    .createDataFrame([{'date': 1}])
    .select(
        explode(sequence(
            to_date(lit('2021-01-01')),  # start
            to_date(lit(date.today())),  # stop
            expr("INTERVAL 1 WEEK")      # step
        )).alias('calendar_date')
    )
)
df.show()
https://spark.apache.org/docs/latest/sql-ref-literals.html#interval-literal
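
To sanity-check what a one-week step should return, the same inclusive sequence can be reproduced without Spark. This is a minimal plain-Python sketch, not Spark code: `weekly_dates` is a hypothetical helper name, and it assumes (as `sequence` does for numeric ranges) that the stop value is included only when a step lands exactly on it.

```python
from datetime import date, timedelta

def weekly_dates(start, stop, weeks=1):
    """Return dates from start to stop (inclusive), `weeks` weeks apart."""
    step = timedelta(weeks=weeks)
    out = []
    current = start
    while current <= stop:
        out.append(current)
        current += step
    return out

dates = weekly_dates(date(2021, 1, 1), date(2021, 1, 31))
print([d.isoformat() for d in dates])
# ['2021-01-01', '2021-01-08', '2021-01-15', '2021-01-22', '2021-01-29']
```

Comparing a short range like this against `df.show()` is a quick way to confirm the step was interpreted as one week.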

Related

How would I create bins of date ranges in spark scala?

Hi how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date range bins, and count the frequency of occurrences in each bin (histogram).
My input dataframe looks something like this
My bin edges are this (in Python):
bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
and the output dataframe I'm looking for is (counts of how many values in original dataframe per bin):
Is there anyone who can guide me on how to do this in Spark Scala? I'm a bit lost. Thank you.
Are you looking for a result like the following?
+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
| 3| null|
| null| 2|
+------------------------+------------------------+
It can be achieved with a bit of Spark SQL and the pivot function, as shown below.
Note the left join condition.
val binRangeData = Seq(("01-01-1990","12-31-1999"),
("01-01-2000","12-31-2009"))
val binRangeDf = binRangeData.toDF("start_date","end_date")
// binRangeDf.show
val inputDf = Seq((0,"10-12-1992"),
(1,"11-11-1994"),
(2,"07-15-1999"),
(3,"01-20-2001"),
(4,"02-01-2005")).toDF("id","input_date")
// inputDf.show
binRangeDf.createOrReplaceTempView("bin_range")
inputDf.createOrReplaceTempView("input_table")
val countSql = """
SELECT concat(date_format(c.st_dt,'MM-dd-yyyy'),' -- ',date_format(c.end_dt,'MM-dd-yyyy')) as header, c.bin_count
FROM (
(SELECT
b.st_dt, b.end_dt, count(1) as bin_count
FROM
(select to_date(input_date,'MM-dd-yyyy') as date_input , * from input_table) a
left join
(select to_date(start_date,'MM-dd-yyyy') as st_dt, to_date(end_date,'MM-dd-yyyy') as end_dt from bin_range ) b
on
a.date_input >= b.st_dt and a.date_input <= b.end_dt
group by 1,2) ) c"""
val countDf = spark.sql(countSql)
countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count").show
Note that because there are 2 bin ranges, 2 rows are generated.
We can achieve this by looking at the date column and determining within which range each record falls.
// First we set up the problem
// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")
// Get the current local date
val now = java.time.LocalDate.now
// Create a range of 1-10000 and map each to minusDays
// so we can have range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))
// Create a DataFrame we can work with.
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they are like your format (MM-dd-yyyy).
Next up, we need a function which returns 1 if the date falls within range, and 0 if not. We create a UserDefinedFunction (UDF) from this function so we can apply it to all rows simultaneously across Spark executors.
// We will process each range one at a time, so we'll take it as a string
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat
def isWithinRange(date: String, binRange: String): Int = {
val format = new SimpleDateFormat("MM-dd-yyyy")
val startDate = format.parse(binRange.split(" - ").head)
val endDate = format.parse(binRange.split(" - ").last)
val testDate = format.parse(date)
if (!(testDate.before(startDate) || testDate.after(endDate))) 1
else 0
}
// We create a udf which uses an anonymous function taking two args and
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def isWithinRangeUdf: UserDefinedFunction =
udf((date: String, binRange: String) => isWithinRange(date, binRange))
Now that the UDF is set up, we add one column per bin to the DataFrame and then sum each column (which is why the function returns an Int).
// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
"01-01-2000 - 12-31-2009",
"01-01-2010 - 12-31-2020")
// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit}
val withBinsDf = bins.foldLeft(df){(changingDf, bin) =>
changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}
withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//| date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020| 0| 0| 1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row
Finally we select our bin columns and groupBy them and sum.
import org.apache.spark.sql.functions.sum
val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col name as is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)
summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//| 2450| 3653| 3897|
//+-----------------------+-----------------------+-----------------------+
2450 + 3653 + 3897 = 10000, so it seems our work was correct.
Perhaps I overdid it and there is a simpler solution, please let me know if you know a better way (especially to handle MM-dd-yyyy dates).
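
Since the asker comes from Python, the binning rule itself can be written down in plain Python first and used as a reference when checking the Spark output. This is only a sketch of the logic, not a Spark solution: `parse_bin` and `count_per_bin` are hypothetical helper names, and both bin edges are assumed to be inclusive, as in the UDF above.

```python
from datetime import datetime

DATE_FORMAT = "%m-%d-%Y"  # Python equivalent of the MM-dd-yyyy pattern

def parse_bin(bin_range):
    """Split 'start - end' into a pair of date objects."""
    start, end = bin_range.split(" - ")
    return (datetime.strptime(start, DATE_FORMAT).date(),
            datetime.strptime(end, DATE_FORMAT).date())

def count_per_bin(dates, bins):
    """Count how many dates fall inside each bin, both edges inclusive."""
    counts = {b: 0 for b in bins}
    parsed = {b: parse_bin(b) for b in bins}
    for d in dates:
        day = datetime.strptime(d, DATE_FORMAT).date()
        for b, (start, end) in parsed.items():
            if start <= day <= end:
                counts[b] += 1
    return counts

bins = ["01-01-1990 - 12-31-1999", "01-01-2000 - 12-31-2009"]
dates = ["10-12-1992", "11-11-1994", "07-15-1999", "01-20-2001", "02-01-2005"]
print(count_per_bin(dates, bins))
# {'01-01-1990 - 12-31-1999': 3, '01-01-2000 - 12-31-2009': 2}
```

The sample dates are the ones from the first answer, so the counts of 3 and 2 match the pivoted table shown there.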

Pyspark sql add letter in datetype value

I have epoch time values in Spark dataframe like 1569872588019 and I'm using pyspark sql in jupyter notebook.
I'm using the from_unixtime method to convert it to date.
Here is my code:
SELECT from_unixtime(dataepochvalues/1000,'yyyy-MM-dd%%HH:MM:ss') AS date FROM testdata
The result is like: 2019-04-30%%11:09:11
But what I want is like: 2019-04-30T11:04:48.366Z
I tried to add T and Z instead of %% in date but failed.
How can I insert T and Z letter?
You can specify those letters using single quotes. For your desired output, use the following date and time pattern:
"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
Using your example:
spark.sql(
"""SELECT from_unixtime(1569872588019/1000,"yyyy-MM-dd'T'HH:mm:ss'Z'") AS date"""
).show()
#+--------------------+
#| date|
#+--------------------+
#|2019-09-30T14:43:08Z|
#+--------------------+
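
As a cross-check outside Spark, the same epoch value can be formatted in plain Python. This sketch assumes the trailing Z is meant to denote UTC, so it converts explicitly to UTC rather than to a session time zone; `epoch_ms_to_iso` is a hypothetical helper name. It splits off the milliseconds with integer arithmetic so they survive in the output.

```python
from datetime import datetime, timezone

def epoch_ms_to_iso(epoch_ms):
    """Format epoch milliseconds as yyyy-MM-dd'T'HH:mm:ss.SSS'Z' in UTC."""
    seconds, millis = divmod(epoch_ms, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z"

print(epoch_ms_to_iso(1569872588019))  # 2019-09-30T19:43:08.019Z
```

The Spark output above shows hour 14 rather than 19, which suggests that session's time zone was five hours behind UTC. If the Z suffix matters to consumers of the data, it may be worth checking the session time zone setting.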

Spark multiple dynamic aggregate functions, countDistinct not working

Aggregation on Spark dataframe with multiple dynamic aggregation operations.
I want to do aggregation on a Spark dataframe using Scala with multiple dynamic aggregation operations (passed by user in JSON). I'm converting the JSON to a Map.
Below is some sample data:
colA colB colC colD
1 2 3 4
5 6 7 8
9 10 11 12
The Spark aggregation code which I am using:
var cols = Seq("colA","colB")
var aggFuncMap = Map("colC"-> "sum", "colD"-> "countDistinct")
var aggregatedDF = currentDF.groupBy(cols.head, cols.tail: _*).agg(aggFuncMap)
I have to pass aggFuncMap as Map only, so that user can pass any number of aggregations through the JSON configuration.
The above code is working fine for some aggregations, including sum, min, max, avg and count.
However, unfortunately this code is not working for countDistinct (maybe because it is camel case?).
When running the above code, I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Undefined function: 'countdistinct'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'
Any help will be appreciated!
It's currently not possible to use agg with countDistinct inside a Map. From the documentation we see:
The available aggregate methods are avg, max, min, sum, count.
A possible fix would be to change the Map to a Seq[Column],
val cols = Seq("colA", "colB")
val aggFuncs = Seq(sum("colC"), countDistinct("colD"))
val df2 = df.groupBy(cols.head, cols.tail: _*).agg(aggFuncs.head, aggFuncs.tail: _*)
but that won't help much if users are to specify the aggregations in a configuration file.
Another approach is to use expr, which evaluates a string and returns a Column. However, expr won't accept "countDistinct"; "count(distinct(...))" needs to be used instead.
This could be coded as follows:
val aggFuncs = Seq("sum(colC)", "count(distinct(colD))").map(e => expr(e))
val df2 = df.groupBy(cols.head, cols.tail: _*).agg(aggFuncs.head, aggFuncs.tail: _*)
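
If the user-facing configuration must keep names such as countDistinct, one option is to translate the map into SQL expression strings before handing them to expr. The translation is plain string handling, sketched here in Python for brevity. `SQL_TEMPLATES` and `to_sql_exprs` are hypothetical names, and the set of supported functions is an assumption taken from the ones mentioned in the question.

```python
# Hypothetical mapping from user-facing names to Spark SQL templates.
SQL_TEMPLATES = {
    "sum": "sum({col})",
    "min": "min({col})",
    "max": "max({col})",
    "avg": "avg({col})",
    "count": "count({col})",
    "countDistinct": "count(distinct {col})",
}

def to_sql_exprs(agg_map):
    """Turn {column: function} into a list of SQL expression strings."""
    exprs = []
    for col, func in agg_map.items():
        if func not in SQL_TEMPLATES:
            raise ValueError(f"unsupported aggregation: {func}")
        exprs.append(SQL_TEMPLATES[func].format(col=col))
    return exprs

print(to_sql_exprs({"colC": "sum", "colD": "countDistinct"}))
# ['sum(colC)', 'count(distinct colD)']
```

Rejecting unknown function names up front gives a clearer error than the AnalysisException raised at query time.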

get date difference from the columns in dataframe and get seconds -Spark scala

I have a DataFrame with two date columns. I need the difference between them, in seconds:
UNIX_TIMESTAMP(SUBSTR(date1, 1, 19)) - UNIX_TIMESTAMP(SUBSTR(date2, 1, 19)) AS delta
I am trying to convert that Hive query into a DataFrame query using Scala:
df.select(col("date").substr(1,19)-col("poll_date").substr(1,19))
From here I am not able to convert the result into seconds. Can anybody help with this? Thanks in advance.
Using DataFrame API, you can calculate the date difference in seconds simply by subtracting one column from the other in unix_timestamp:
val df = Seq(
("2018-03-05 09:00:00", "2018-03-05 09:01:30"),
("2018-03-06 08:30:00", "2018-03-08 15:00:15")
).toDF("date1", "date2")
df.withColumn("tsdiff", unix_timestamp($"date2") - unix_timestamp($"date1")).
show
// +-------------------+-------------------+------+
// | date1| date2|tsdiff|
// +-------------------+-------------------+------+
// |2018-03-05 09:00:00|2018-03-05 09:01:30| 90|
// |2018-03-06 08:30:00|2018-03-08 15:00:15|196215|
// +-------------------+-------------------+------+
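
The tsdiff values in the table above can be verified by hand or with a few lines of plain Python. This is a sketch for checking the arithmetic only: `diff_seconds` is a hypothetical helper name, and it works on naive timestamps, so it ignores any time zone or daylight-saving effects that a Spark session might apply.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def diff_seconds(later, earlier):
    """Seconds between two 'yyyy-MM-dd HH:mm:ss' strings (naive, no time zone)."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())

print(diff_seconds("2018-03-05 09:01:30", "2018-03-05 09:00:00"))  # 90
print(diff_seconds("2018-03-08 15:00:15", "2018-03-06 08:30:00"))  # 196215
```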
You could perform the calculation in Spark SQL as well, if necessary:
df.createOrReplaceTempView("dfview")
spark.sql("""
select date1, date2, (unix_timestamp(date2) - unix_timestamp(date1)) as tsdiff
from dfview
""")

How to divide the value of current row with the following one?

In Spark SQL 1.6, using DataFrames, is there a way to calculate, for a specific column, the ratio of the current row to the next one, for every row?
For example, if I have a table with one column, like so
Age
100
50
20
4
I'd like the following output
Fraction
2
2.5
5
The last row is dropped because it has no "next row" to be divided by.
Right now I am doing it by ranking the table and joining it with itself, where the rank equals rank+1.
Is there a better way to do this?
Can this be done with a Window function?
A Window function does part of the job; the rest can be done with a UDF:
def div = udf((age: Double, lag: Double) => lag/age)
First we find the lag using a Window function, then pass the lag and the age to the UDF to compute div:
import sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val dataframe = Seq(
("A",100),
("A",50),
("A",20),
("A",4)
).toDF("person", "Age")
val windowSpec = Window.partitionBy("person").orderBy(col("Age").desc)
val newDF = dataframe.withColumn("lag", lag(dataframe("Age"), 1) over(windowSpec))
And finally call the UDF:
newDF.filter(newDF("lag").isNotNull).withColumn("div", div(newDF("Age"), newDF("lag"))).drop("Age", "lag").show
The final output is:
+------+---+
|person|div|
+------+---+
| A|2.0|
| A|2.5|
| A|5.0|
+------+---+
Edited
As @Jacek suggested, a better solution is to use .na.drop instead of .filter(newDF("lag").isNotNull) and the / operator, so we don't need the UDF at all:
newDF.na.drop.withColumn("div", newDF("lag")/newDF("Age")).drop("Age", "lag").show
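
The expected ratios are easy to confirm on a plain list before running the window query. A minimal Python sketch of the same "current divided by next" rule follows; `ratios_to_next` is a hypothetical helper name, and the input is assumed to be already sorted in the desired order.

```python
def ratios_to_next(values):
    """Divide each value by the one that follows it; the last value is dropped."""
    return [current / following for current, following in zip(values, values[1:])]

print(ratios_to_next([100, 50, 20, 4]))  # [2.0, 2.5, 5.0]
```

Pairing the list with a copy shifted by one position plays the same role as lag over an ordered window in the Spark version.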